TTS: Text to Speech
G2P: Grapheme to Phoneme
GPT-SoVITS: SOTA
F5-TTS: F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching. Diffusion Transformer with ConvNeXt V2. No G2P.
CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models. Focuses on real-time, LLM-based synthesis without G2P. A demo is available.
BERT-VITS2: uses g2p_en, very similar to GPT-SoVITS.
g2p_en: A Simple Python Module for English Grapheme To Phoneme Conversion. It spells out numbers and currency (e.g. $200 -> two hundred dollars), attempts to retrieve the correct pronunciation for heteronyms based on their POS, and has an ML model for words that are not in the dictionary.
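To make the two g2p_en behaviours above concrete, here is a minimal standard-library sketch of currency expansion and POS-based heteronym lookup. It is not g2p_en's code: the function names, the 0..999 limit, the tiny `HETERONYMS` table, and the `NOUN`/`VERB` tag names are all assumptions made for illustration.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]

# Hypothetical two-entry table: (word, coarse POS tag) -> ARPAbet pronunciation.
HETERONYMS = {
    ("record", "NOUN"): "R EH1 K ER0 D",
    ("record", "VERB"): "R IH0 K AO1 R D",
}


def number_to_words(n: int) -> str:
    """Spell out an integer in the range 0..999 (toy range only)."""
    if not 0 <= n <= 999:
        raise ValueError("this sketch only covers 0..999")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    words = ONES[hundreds] + " hundred"
    return words + (" " + number_to_words(rest) if rest else "")


def expand_currency(text: str) -> str:
    """Replace '$N' (N up to three digits) with its spoken form."""
    def repl(match: re.Match) -> str:
        amount = int(match.group(1))
        unit = "dollar" if amount == 1 else "dollars"
        return f"{number_to_words(amount)} {unit}"
    return re.sub(r"\$(\d{1,3})\b", repl, text)


def lookup_heteronym(word: str, pos: str):
    """Return the pronunciation for (word, pos), or None if unknown."""
    return HETERONYMS.get((word.lower(), pos))
```

For example, `expand_currency("$200")` gives `two hundred dollars`, and `lookup_heteronym("record", "VERB")` picks the verb pronunciation. A real system would get the POS tag from a tagger and fall back to a dictionary or model for everything else.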
There are several systems to encode pronunciations:
IPA: syllable -> sˈɪləbə͡l, s|ˈɪ|l|ə|b|əl, or s|'I|l|@|b|@L
CMU (ARPAbet/ARPABET): the ASCII phoneme set used by cmu_dict, a dictionary containing over 130k English word pronunciations.
Kirshenbaum: an ASCII representation of IPA, used by eSpeak NG.
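As a rough illustration of how the encodings above relate, the sketch below maps a small subset of IPA symbols to their Kirshenbaum ASCII equivalents. The table is deliberately incomplete and the function name is an assumption. eSpeak NG's own output can differ in details (for instance, the `@L` at the end of the example above), so treat this as a character-level approximation only.

```python
# Partial IPA -> Kirshenbaum table (illustrative subset, not the full standard).
IPA_TO_KIRSHENBAUM = {
    "ˈ": "'",   # primary stress
    "ˌ": ",",   # secondary stress
    "ː": ":",   # length mark
    "ɪ": "I",
    "ə": "@",
    "ʌ": "V",
    "æ": "&",
    "ɛ": "E",
    "ʊ": "U",
    "ɔ": "O",
    "ɑ": "A",
    "ʃ": "S",
    "ʒ": "Z",
    "θ": "T",
    "ð": "D",
    "ŋ": "N",
}

TIE_BAR = "\u0361"  # combining double inverted breve, as in "ə͡l"


def ipa_to_kirshenbaum(ipa: str) -> str:
    """Convert an IPA string character by character.

    Symbols missing from the table (plain letters such as 's' or 'l')
    pass through unchanged; tie bars are dropped.
    """
    out = []
    for ch in ipa:
        if ch == TIE_BAR:
            continue
        out.append(IPA_TO_KIRSHENBAUM.get(ch, ch))
    return "".join(out)


print(ipa_to_kirshenbaum("sˈɪləbəl"))
```

Passing unknown symbols through unchanged keeps the sketch short. A stricter converter would raise an error on anything it cannot map.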
cmu_dict: as mentioned above
WikiPronunciationDict: a multilingual pronunciation dictionary for English, French, German, and Italian in IPA format, with 1362k French, 636k German, 91k Italian, and 69k English entries. The data comes from wiktionary.org; in my opinion it is the biggest dataset for G2P.
ipa-dict: multilingual, broadly phonemic, and meant to represent what one might expect to find in a dictionary or other popular reference work. Compared to WikiPronunciationDict, it is also quite large, and it comes in txt format. It is an overall collection from many datasets.
pron_dictionaries: multilingual in IPA format
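Most of the dictionaries above ship as plain text with one pronunciation per line. The sketch below parses the classic cmu_dict-style layout, assuming `WORD  PH PH PH` lines, `WORD(1)` for alternative pronunciations, and `;;;` or `#` comment lines. The function name is hypothetical, and other datasets (such as ipa-dict) use different layouts that would need their own parser.

```python
import re
from collections import defaultdict

# Matches the "(1)", "(2)", ... suffix that marks alternative pronunciations.
VARIANT_SUFFIX = re.compile(r"\(\d+\)$")


def parse_pron_dict(lines):
    """Parse dictionary lines into {word: [phoneme list, ...]}.

    Assumed format: a word, whitespace, then space-separated phonemes.
    Comment lines, blank lines, and lines without phonemes are skipped.
    """
    entries = defaultdict(list)
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith(";;;") or line.startswith("#"):
            continue
        word, *phones = line.split()
        if not phones:
            continue
        word = VARIANT_SUFFIX.sub("", word).lower()
        entries[word].append(phones)
    return dict(entries)


sample = [
    ";;; comment line",
    "SYLLABLE  S IH1 L AH0 B AH0 L",
    "READ  R EH1 D",
    "READ(1)  R IY1 D",
]
pron = parse_pron_dict(sample)
print(pron["read"])
```

Keeping every variant in a list, rather than overwriting, preserves heteronyms such as "read" for a later disambiguation step.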
eSpeak NG: the SOTA option, and one that is updated frequently, is TextToPhonemes() in eSpeak NG. It supports more than 100 languages and accents.
DeepPhonemizer: multilingual, outputting IPA format. Trained with cmu_dict, cmu_dict_ipa, and wikipron.
https://github.com/NVIDIA/NeMo
mini-bart-g2p: unpopular BART model, trained with cmu_dict and LibriSpeech Alignments
T5G2P: English and Czech, 128,532 English and 442,029 Czech unique sentences.
T5TTS: English only model trained by Nvidia. Can handle mixed capitalization.
NeMo: a library by Nvidia that includes models like T5TTS and has multilingual G2P dictionaries and models built in. I suggest searching for G2P in their issue tab. # TODO
https://github.com/tarling/arpabet-and-ipa-convertor-ts (it has stress, forked from chdzq/ARPAbetAndIPAConvertor)
https://github.com/chdzq/ARPAbetAndIPAConvertor/blob/master/arpabetandipaconvertor/model/syllable.py (by a Chinese author)
https://github.com/pettarin/ipapy (far more complicated than it needs to be)
https://github.com/pettarin/ipapy/blob/master/ipapy/data/kirshenbaum.dat
https://github.com/pettarin/ipapy/blob/master/ipapy/kirshenbaummapper.py
https://github.com/espeak-ng/espeak-ng/blob/master/docs/phonemes/kirshenbaum.md (guideline)
https://github.com/rossellhayes/ipa/blob/main/tests/testthat/test-ipa.R
https://github.com/dr-ni/ipa2arpabet (German only)
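For orientation, the core of an ARPAbet to IPA converter with stress handling can be sketched in a few lines. This is not taken from any of the repositories above. The symbol choices (for example `ɹ` for R, `ə` for unstressed AH) are one common convention among several, and the stress mark is placed directly before the vowel rather than at the syllable onset, which avoids needing a syllabifier.

```python
# ARPAbet base symbol -> IPA (one common convention; others exist).
ARPABET_TO_IPA = {
    "AA": "ɑ", "AE": "æ", "AH": "ʌ", "AO": "ɔ", "AW": "aʊ", "AY": "aɪ",
    "EH": "ɛ", "ER": "ɝ", "EY": "eɪ", "IH": "ɪ", "IY": "i", "OW": "oʊ",
    "OY": "ɔɪ", "UH": "ʊ", "UW": "u",
    "B": "b", "CH": "tʃ", "D": "d", "DH": "ð", "F": "f", "G": "ɡ",
    "HH": "h", "JH": "dʒ", "K": "k", "L": "l", "M": "m", "N": "n",
    "NG": "ŋ", "P": "p", "R": "ɹ", "S": "s", "SH": "ʃ", "T": "t",
    "TH": "θ", "V": "v", "W": "w", "Y": "j", "Z": "z", "ZH": "ʒ",
}

# Unstressed variants that are usually written with a different IPA symbol.
UNSTRESSED = {"AH": "ə", "ER": "ɚ"}

STRESS_MARK = {"1": "ˈ", "2": "ˌ"}  # "0" (no stress) gets no mark


def arpabet_to_ipa(phones: str) -> str:
    """Convert a space-separated ARPAbet string such as 'S IH1 L AH0 B AH0 L'.

    Raises KeyError for symbols that are not in the table.
    """
    out = []
    for token in phones.split():
        stress = token[-1] if token[-1].isdigit() else ""
        base = token[:-1] if stress else token
        if stress == "0" and base in UNSTRESSED:
            ipa = UNSTRESSED[base]
        else:
            ipa = ARPABET_TO_IPA[base]
        out.append(STRESS_MARK.get(stress, "") + ipa)
    return "".join(out)


print(arpabet_to_ipa("S IH1 L AH0 B AH0 L"))
```

Going the other way (IPA to ARPAbet) is harder because diphthongs and affricates span several IPA characters, which is part of why the linked projects are larger than this sketch.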