Today, we’re going to go on a tour of the world's accents in English. Users of BoldVoice, the American accent training app, speak more than 200 different languages, and it is our mission to help them speak English clearly and confidently. While building the accent strength metric we covered in the previous blog post, we needed to understand how our models clustered accents, dialects, native languages, and language families. Today, we will share some of our findings using a 3D latent visualization.
To begin, we finetuned HuBERT, a pretrained audio-only foundation model, for the task of accent identification, using our in-house dataset of non-native English speech and self-reported accents. BoldVoice’s own dataset of accented speech is one of the largest of its kind in the world.
Model: boldvoice/hubert-accent-identifier
Total Parameters: 94.6M (all trainable)
ARCHITECTURE:
═════════════
            ┌─────────────┐   ┌─────────────┐   ┌─────────────┐   ┌────────────────┐
Raw Audio → │   Feature   │ → │   Feature   │ → │ Transformer │ → │ Classification │
  (16kHz)   │  Extractor  │   │ Projection  │   │   Encoder   │   │      Head      │
            └─────────────┘   └─────────────┘   └─────────────┘   └────────────────┘
             7 CNN layers      LayerNorm→Linear  12 layers         768→256→50
             1→512, 320x ↓     512→768, Dropout  12 heads, dim=768
                                                 (89.8M params)
KEY DETAILS:
• Input: Raw waveform (no spectrograms)
• Downsampling: 320x (5×2×2×2×2×2×2)
• Transformer: 12 layers
This model receives only the raw input audio and the associated accent label; it gets neither a text prompt nor a transcript. For this "finetuning", we sampled 30 million speech recordings comprising 25,000 hours of English speech, a small fraction of our total accent dataset. Unlike in a traditional finetune, we unfroze all layers of the pretrained base model, which the large size of our dataset made feasible. We trained the model for roughly a week on a cluster of A100 GPUs.
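For readers who want to see what this setup looks like in practice, here is a minimal sketch of an equivalent full-parameter finetune using the Hugging Face transformers API. The base checkpoint, label count, batch, and optimizer settings below are illustrative assumptions, not our production configuration.

# Minimal sketch only; checkpoint name, label count, and hyperparameters are assumptions.
import torch
from transformers import HubertForSequenceClassification, Wav2Vec2FeatureExtractor

NUM_ACCENTS = 50  # assumed size of the accent label set (matches the 768→256→50 head above)

# Start from a pretrained audio-only HuBERT base model and attach a classification head.
model = HubertForSequenceClassification.from_pretrained(
    "facebook/hubert-base-ls960",
    num_labels=NUM_ACCENTS,
    classifier_proj_size=256,
)

# Unlike a typical finetune, nothing is frozen: every parameter,
# including the CNN feature extractor, receives gradient updates.
for param in model.parameters():
    param.requires_grad = True

# Raw 16 kHz waveforms go straight in; no spectrograms and no transcripts.
feature_extractor = Wav2Vec2FeatureExtractor(feature_size=1, sampling_rate=16_000, padding_value=0.0)
waveforms = [torch.randn(16_000 * 3).numpy() for _ in range(4)]  # dummy 3-second clips
batch = feature_extractor(waveforms, sampling_rate=16_000, return_tensors="pt", padding=True)
labels = torch.randint(0, NUM_ACCENTS, (4,))  # dummy accent labels

# One training step.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
loss = model(input_values=batch.input_values, labels=labels).loss
loss.backward()
optimizer.step()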
While the accent identifier performs quite well across the top hundred or so accents (play with it yourself at accentoracle.com), today we are less interested in its raw performance than in how accents cluster in its latent space.
To observe how accents cluster, we've provided an audible latent space visualization for a small subset of recordings. Hover over the points on the graph to see the language labels.
The visualization is created by applying the UMAP dimensionality reduction technique to reduce the 768-dimensional latent space to just 3 dimensions.
FROM AUDIO TO LATENT VISUALIZATION
══════════════════════════════════
Speech Audio      ┌─────────────────┐    768-dim embedding    ┌───────────┐     Interactive Plot
  (16kHz)    ──→  │ Model Inference │ ──→  (mean pooled)  ──→ │ UMAP(n=3) │ ──→ 3D [x,y,z]
                  └─────────────────┘                         └───────────┘
Note that UMAP discards much of the information in the full-dimensional latent space, but it roughly preserves the global structure, including the relative distances between clusters. Each point represents a single recording passed through the fine-tuned model, and its color corresponds to the true accent label.
Finally, in order to denoise the clusters, we keep only those points for which the predicted and target accents match. Remember, the purpose of this visualization is not to assess the model's performance, but to understand where it has placed accents relative to one another.
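To make the pipeline above concrete, here is a minimal sketch of how such points could be computed, reusing the model from the finetuning sketch earlier. The umap-learn library and the dummy recordings and labels are assumptions for illustration, not our production pipeline.

# Minimal sketch; reuses `model` from the finetuning sketch above.
import numpy as np
import torch
import umap  # pip install umap-learn

model.eval()

def embed_and_predict(waveform_16khz: torch.Tensor):
    """Return a mean-pooled 768-dim embedding and the predicted accent id for one clip."""
    with torch.no_grad():
        out = model(input_values=waveform_16khz.unsqueeze(0), output_hidden_states=True)
    pooled = out.hidden_states[-1].mean(dim=1).squeeze(0).numpy()  # (768,)
    return pooled, out.logits.argmax(dim=-1).item()

recordings = [torch.randn(16_000 * 3) for _ in range(200)]     # dummy 3-second clips
true_labels = np.random.randint(0, 50, size=len(recordings))   # dummy self-reported accents

pairs = [embed_and_predict(w) for w in recordings]
embeddings = np.stack([emb for emb, _ in pairs])               # (N, 768)
predictions = np.array([pred for _, pred in pairs])            # (N,)

# Project 768 dimensions down to 3, then keep only the points whose
# predicted accent matches the true label to denoise the clusters.
points_3d = umap.UMAP(n_components=3).fit_transform(embeddings)
keep = predictions == true_labels
kept_points, kept_labels = points_3d[keep], true_labels[keep]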
By clicking or tapping on a point, you will hear a standardized version of the corresponding recording. The reason for voice standardization is twofold: first, it anonymizes the speakers in the original recordings in order to protect their privacy. Second, it allows us to hear each accent projected onto a neutral voice, making it easier to hear the accent differences and ignore extraneous differences like gender, recording quality, and background noise. However, there is no free lunch: the standardization does not perfectly preserve the source accent and introduces some audible phonetic artifacts.
The standardization is performed by an in-house accent-preserving voice conversion model.
Please explore the latent space visualization. You can click, drag, zoom, and scroll to navigate. You can also isolate an accent by double-clicking it in the legend to the right (desktop only); double-clicking again will undo the filter.
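If you would like to build a similar interactive view for your own embeddings, a Plotly 3D scatter gives the same click-drag-zoom navigation and legend double-click isolation described above. This is a generic sketch rather than our actual front end; kept_points and kept_labels come from the UMAP sketch earlier.

# Generic sketch; not BoldVoice's actual front end.
import plotly.express as px

fig = px.scatter_3d(
    x=kept_points[:, 0], y=kept_points[:, 1], z=kept_points[:, 2],
    color=[str(label) for label in kept_labels],  # one legend entry per accent
    opacity=0.7,
)
fig.update_traces(marker=dict(size=3))
# Plotly's default legend behavior matches the interactions described above:
# double-click an accent in the legend to isolate it, double-click again to reset.
fig.show()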
Meanwhile, think about the following questions: which accents would you expect to be clustered together? Do you expect them to follow the taxonomy of language families or to cluster in other ways?
Our team was most surprised to see that geographic proximity, immigration, and colonialism seem to affect this model's learned accent groupings more than language taxonomy. Click the button below to explore our first grouping.
For example, the Australian cluster is right next to the Vietnamese cluster despite the fact that English and Vietnamese are not related taxonomically. If you listen to the 10 points that make up a bridge between the two clusters, you hear what sounds like native Vietnamese speakers who speak English with an Australian accent. Perhaps these hybrid accents could explain the overall proximity of these clusters.
We see something similar for the French/Nigerian/Ghanaian grouping.
It's important to remember that the distances on this map are not an objective measure of the phonetic similarity between accents. They are a byproduct of a model which has successfully learned to distinguish a variety of accents in L2 English speech from audio alone with no knowledge of language or linguistics.
Next, take a look at the Indian subcontinent accent cluster. Note that the Telugu, Tamil, and Malayalam accents are grouped together at one end of the cluster, and the Nepali and Bengali accents are at the other. This roughly mirrors geography: Telugu, Tamil, and Malayalam are widely spoken languages in southern India, while Bengali and Nepali are widely spoken in northeastern India and Nepal.
Finally, let's scroll to the Mongolian cluster, where the nearest cluster is actually Korean.
Experts and non-experts alike have observed phonetic similarities between Mongolian and Korean. A now-discredited hypothesis, the "Altaic" language family, once grouped them together.
It is interesting that this model, which has no concept of language families, has picked up on these phonetic similarities even when they are filtered through a second language (English).
What do you think? Is this a meaningless artifact of latent space visualization or evidence of real phonetic features diffusing between Korean and Mongolian?