How AI Hears Accents

An Audible Visualization of Accent Clusters

Today, we’re going to go on a tour of the world's accents in English. Users of BoldVoice, the American accent training app, speak more than 200 different languages, and it is our mission to help them speak English clearly and confidently. While building the accent strength metric we covered in the previous blog post, we needed to understand how our models clustered accents, dialects, native languages, and language families. In this post, we share some of our findings using a 3D latent visualization.

Technical Approach

To begin, we fine-tuned HuBERT, a pretrained audio-only foundation model, for the task of accent identification, using our in-house dataset of non-native English speech and self-reported accent labels. BoldVoice’s own dataset of accented speech is one of the largest of its kind in the world.

HuBERT + classification head architecture:
Model: boldvoice/hubert-accent-identifier
Total Parameters: 94.6M (all trainable)

ARCHITECTURE:
═════════════

                ┌─────────────┐      ┌─────────────┐      ┌─────────────┐      ┌───────────────┐
Raw Audio  →    │  Feature    │  →   │  Feature    │  →   │ Transformer │  →   │ Classification│ 
(16kHz)         │  Extractor  │      │  Projection │      │   Encoder   │      │      Head     │
                └─────────────┘      └─────────────┘      └─────────────┘      └───────────────┘
                7 CNN layers         LayerNorm→Linear     12 layers            768→256→50
                1→512, 320x ↓        512→768, Dropout     12 heads, dim=768
                                                           (89.8M params)


KEY DETAILS:
• Input: Raw waveform (no spectrograms)
• Downsampling: 320x (5×2×2×2×2×2×2)
• Transformer: 12 layers
            

This model receives only the raw input audio and the associated accent label; it gets neither a text prompt nor a transcript. For this "fine-tuning", we sampled 30 million speech recordings comprising 25,000 hours of English speech - a small fraction of our total accent dataset. Unlike a traditional fine-tune, we unfroze all layers of the pretrained base model due to the large size of our dataset. We trained the model for roughly a week on a cluster of A100 GPUs.
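
To make this concrete, here is a minimal sketch of the same architecture and fine-tuning setup using the Hugging Face transformers library. The base checkpoint and hyperparameters below are illustrative assumptions rather than our exact internal configuration; the key points are the 768→256→50 classification head and the fact that no layers are frozen.

import torch
from transformers import HubertForSequenceClassification, Wav2Vec2FeatureExtractor

NUM_ACCENTS = 50  # matches the 768 -> 256 -> 50 head in the diagram above

# Assumed public HuBERT-base checkpoint, used here purely for illustration.
model = HubertForSequenceClassification.from_pretrained(
    "facebook/hubert-base-ls960",
    num_labels=NUM_ACCENTS,
    classifier_proj_size=256,
)

# Nothing is frozen: every layer of the pretrained base model is updated
# during training (this is the default after loading; shown to make the
# choice explicit).
for param in model.parameters():
    param.requires_grad = True

# Raw 16 kHz waveform in, no spectrograms. The CNN feature extractor
# downsamples by 320x (strides 5*2*2*2*2*2*2), i.e. about 50 frames per second.
feature_extractor = Wav2Vec2FeatureExtractor(sampling_rate=16_000)
waveform = torch.randn(16_000)  # one second of dummy audio
inputs = feature_extractor(waveform.numpy(), sampling_rate=16_000, return_tensors="pt")
logits = model(**inputs).logits  # shape: (1, NUM_ACCENTS)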

While the accent identifier performs quite well across the top hundred or so accents (play with it yourself at accentoracle.com), for today we are less interested in its raw performance and more interested in how accents cluster in its latent space.

The Visualization

To observe how accents cluster, we've provided an audible latent space visualization for a small subset of recordings. Hover over the points on the graph to see the language labels.

The visualization is created by applying the UMAP dimensionality reduction technique to reduce the 768-dimensional latent space to just 3 dimensions.

  FROM AUDIO TO LATENT VISUALIZATION
  ══════════════════════════════════

     ╱│     ╱│     ╱│     ╱│                                z ↑
    ╱ │    ╱ │    ╱ │    ╱ │       ┌─────────────┐            │ ●  
   ╱  │   ╱  │   ╱  │   ╱  │       │ ●●●●●○●●●●● │            │  ●  ●
 ─────│──╱───│──╱───│──╱───│──     │ ●●●●●○●●●●● │            │   ○
      │ ╱    │ ╱    │ ╱    │       │ ●●●●●○●●●●● │            └────────→ y
      │╱     │╱     │╱     │       └─────────────┘           ╱  ●  ●
       Speech Audio(16kHz)        768-dim embedding         ╱ ●    ●
                                    (mean pooled)        x ╱
                                                               3D [x,y,z]
            │                              │                Interactive Plot
            │                              │  
            │      ┌───────────────┐       │      ┌───────────┐    │
            └─────→│     Model     │──────→└─────→│ UMAP(n=3) │───→┘
                   │   Inference   │              └───────────┘
                   └───────────────┘
            

Note that UMAP destroys much of the information in the full-dimensional latent space, but it roughly preserves the global structure, including the relative distances between clusters. Each point represents a single recording run through the fine-tuned model, and the color corresponds to the true accent label.
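
For the curious, the diagram above maps onto standard tooling fairly directly: mean-pool the fine-tuned model's last transformer hidden state into a single 768-dimensional embedding per recording, then reduce to 3 dimensions with UMAP (via the umap-learn package). The sketch below assumes the model and feature extractor from the earlier snippet, plus an assumed list of 16 kHz waveforms; it is illustrative, not our production pipeline.

import numpy as np
import torch
import umap  # from the umap-learn package

@torch.no_grad()
def embed(model, feature_extractor, waveform_16k):
    """Mean-pool the last transformer hidden state into one 768-dim vector."""
    inputs = feature_extractor(waveform_16k, sampling_rate=16_000, return_tensors="pt")
    outputs = model(**inputs, output_hidden_states=True)
    return outputs.hidden_states[-1].mean(dim=1).squeeze(0).numpy()  # (768,)

# `waveforms` is an assumed list of 16 kHz numpy arrays, one per recording.
embeddings = np.stack([embed(model, feature_extractor, w) for w in waveforms])  # (n, 768)

reducer = umap.UMAP(n_components=3)            # 768 dims -> 3 dims
coords_3d = reducer.fit_transform(embeddings)  # (n, 3), one point per recording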

Finally, in order to denoise the clusters, we cherry-pick only those points for which the predicted and target accents match. Remember, the purpose of this visualization is not to help us assess the performance of the model, but to understand where it has placed accents relative to one another.
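
In code, the denoising filter and the interactive plot might look like the sketch below, where pred_labels and true_labels are assumed arrays of accent names aligned with coords_3d from the previous snippet.

import plotly.express as px

# `pred_labels` and `true_labels` are assumed numpy arrays of accent names,
# aligned with `coords_3d` from the previous sketch.
mask = pred_labels == true_labels  # keep only recordings the model classified correctly

fig = px.scatter_3d(
    x=coords_3d[mask, 0],
    y=coords_3d[mask, 1],
    z=coords_3d[mask, 2],
    color=true_labels[mask],       # color by the true accent label
    hover_name=true_labels[mask],  # hover to see the language label
)
fig.show()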

Innovative Privacy Protection

By clicking or tapping on a point, you will hear a standardized version of the corresponding recording. The reason for voice standardization is twofold: first, it anonymizes the speaker in the original recording in order to protect their privacy. Second, it allows us to hear each accent projected onto a neutral voice, making it easier to hear the accent differences and ignore extraneous differences like gender, recording quality, and background noise. However, there is no free lunch: the conversion does not perfectly preserve the source accent and introduces some audible phonetic artifacts.

This voice standardization model is an in-house accent-preserving voice conversion model.

Exploration

Please explore the latent space visualization. You can click, drag, zoom, and scroll to navigate. You can also isolate accents by double-clicking them in the legend to the right (desktop only) – double-clicking again will undo the filter.

Meanwhile, think about the following questions: which accents would you expect to be clustered together? Do you expect them to follow the taxonomy of language families or to cluster in other ways?

Highlights

Our team was most surprised to see that geographic proximity, immigration, and colonialism seem to affect this model's learned accent groupings more than language taxonomy. Click the button below to explore our first grouping.

For example, the Australian cluster is right next to the Vietnamese cluster despite the fact that English and Vietnamese are not related taxonomically. If you listen to the 10 points that make up a bridge between the two clusters, you hear what sounds like native Vietnamese speakers who speak English with an Australian accent. Perhaps these hybrid accents could explain the overall proximity of these clusters.

We see something similar for the French/Nigerian/Ghanaian grouping.

It's important to remember that the distances on this map are not an objective measure of the phonetic similarity between accents. They are a byproduct of a model that has successfully learned to distinguish a variety of accents in L2 English speech from audio alone, with no knowledge of language or linguistics.

Next, take a look at the Indian subcontinent accent cluster. Note that the Telugu, Tamil, and Malayalam accents are grouped together at one end of the cluster, and the Nepali and Bengali accents are at the other. This roughly mirrors geography: Telugu, Tamil, and Malayalam are widely spoken in southern India, while Bengali and Nepali are widely spoken to the north and east, in eastern India, Bangladesh, and Nepal.

Finally, let's scroll to the Mongolian cluster, whose nearest neighbor is actually the Korean cluster.

Experts and non-experts alike have observed phonetic similarities between Mongolian and Korean. A now largely discredited hypothesis, the "Altaic language family", once grouped them together.

It is interesting that this model, with no concept of language families, has also picked up on the phonetic similarities even as filtered through a second language (English).

What do you think? Is this a meaningless artifact of latent space visualization or evidence of real phonetic features diffusing between Korean and Mongolian?

Conclusion

This exploration highlights how a large-scale speech model captures the shared phonetic landscape of global English. By studying how different accents organize in the model’s latent space, we can design pronunciation tools that are not only more accurate but also more effective, reflecting BoldVoice’s mission to help every English learner be understood and confident.

If you are an audio ML engineer, linguist, or just an interested reader, feel free to reach out to us at [email protected] - we'd like to hear what you make of this visualization or any of the relationships therein.

Suggestions for what we should cover in the future? Don’t hesitate to share with us.

Special thanks to our in-house dialect coach Ron Carlos for his expertise in interpreting this visualization.
