TTS

TTS: Text to Speech

G2P: Grapheme to Phoneme

Models

F5-TTS: F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching. Diffusion Transformer with ConvNeXt V2. No G2P.

CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models. Focuses on real-time, LLM-based synthesis without G2P. Demo available.

BERT-VITS2: uses g2p_en, very similar to GPT-SoVITS.

g2p_en: A Simple Python Module for English Grapheme To Phoneme Conversion. It spells out numbers and currency (e.g. $200 -> two hundred dollars), attempts to retrieve the correct pronunciation for heteronyms based on their POS, and has an ML model for words that are not in the dictionary.
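
The POS-based heteronym handling mentioned for g2p_en can be sketched roughly as below. This is a toy illustration, not g2p_en's actual code: the `HETERONYMS` and `LEXICON` tables and the `lookup` function are hypothetical names, the POS tags are assumed to come from an external tagger, and the ARPAbet strings follow CMUdict-style notation.

```python
# Toy sketch of POS-based heteronym lookup (NOT g2p_en's implementation).
# All names and the tiny tables below are hypothetical, for illustration only.

# Heteronyms: (word, coarse POS tag) -> ARPAbet phonemes.
HETERONYMS = {
    ("refuse", "VERB"): ["R", "IH0", "F", "Y", "UW1", "Z"],
    ("refuse", "NOUN"): ["R", "EH1", "F", "Y", "UW2", "Z"],
}

# Ordinary words: a single pronunciation each.
LEXICON = {
    "i": ["AY1"],
    "to": ["T", "UW1"],
    "the": ["DH", "AH0"],
}


def lookup(word, pos):
    """Return ARPAbet phonemes for a POS-tagged word, or None if unknown."""
    key = word.lower()
    if (key, pos) in HETERONYMS:
        return HETERONYMS[(key, pos)]
    # A real system would fall back to a trained model for unknown words.
    return LEXICON.get(key)


# POS tags are assumed to be supplied by an external tagger.
tagged = [("I", "PRON"), ("refuse", "VERB"), ("the", "DET"), ("refuse", "NOUN")]
phonemes = [lookup(word, pos) for word, pos in tagged]
```

The same spelling maps to two different phoneme sequences depending on the tag, which is the whole point of consulting the POS before the dictionary.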

Phonemes

Systems

There are several systems to encode pronunciations:

IPA: syllable -> sˈɪləbə͡l, s|ˈɪ|l|ə|b|əl, or s|'I|l|@|b|@L
CMU (ARPAbet/ARPABET): cmu_dict is a dictionary containing ~100k English word pronunciations.
Kirshenbaum: an ASCII version of IPA, used by eSpeak

Libraries

Dictionary

cmu_dict: as mentioned above

amepd: fixes of CMU dictionary

WikiPronunciationDict: a multilingual pronunciation dictionary for English, French, German, and Italian. 1362k French, 636k German, 91k Italian, 69k English in IPA format. The entries come from wiktionary.org; in my opinion it is the biggest dataset for G2P.

There is another attempt to extract pronunciations from Wiktionary.

ipa-dict: multilingual, broadly phonemic, and should represent what one might expect to find in a dictionary or other popular reference work. It is quite large as well, and in txt format, compared to WikiPronunciationDict. It is an overall collection drawn from many datasets.

pron_dictionaries: multilingual in IPA format
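
Plain-text dictionaries like these are easy to load by hand. Below is a minimal sketch of a parser for the classic CMUdict layout (headword, then space-separated ARPAbet phonemes, with alternative pronunciations marked as `WORD(1)`). The `SAMPLE` lines and the `parse_cmudict` helper are illustrative assumptions; the files in the repositories above may differ in casing, comment syntax, or separators.

```python
# Minimal sketch of a CMUdict-style parser. The sample lines are illustrative;
# real files may use different casing or comment markers.
import re
from collections import defaultdict

SAMPLE = """\
;;; comment lines start with three semicolons
SYLLABLE  S IH1 L AH0 B AH0 L
REFUSE  R IH0 F Y UW1 Z
REFUSE(1)  R EH1 F Y UW2 Z
"""

# Matches a variant marker such as "(1)" at the end of a headword.
VARIANT = re.compile(r"\(\d+\)$")


def parse_cmudict(text):
    """Map each lowercase headword to a list of pronunciations."""
    entries = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(";;;"):
            continue  # skip blank lines and comments
        head, *phones = line.split()
        word = VARIANT.sub("", head).lower()
        entries[word].append(phones)
    return dict(entries)


pron = parse_cmudict(SAMPLE)
```

Keeping a list of pronunciations per word, rather than a single one, preserves heteronyms and regional variants for later disambiguation.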

Rule-Based

eSpeak: The SOTA option, which is updated frequently, is TextToPhonemes() in eSpeak. It supports more than 100 languages and accents.

ML-Based

DeepPhonemizer: multilingual, outputs IPA format. Trained with cmu_dict, cmu_dict_ipa, and wikipron.

https://github.com/NVIDIA/NeMo

mini-bart-g2p: an unpopular BART model, trained with cmu_dict and LibriSpeech Alignments.

T5G2P: English and Czech, trained on 128,532 English and 442,029 Czech unique sentences.

T5TTS: English-only model trained by NVIDIA. Can handle mixed capitalization.

NeMo: a library by NVIDIA that includes models like T5TTS and has multilingual G2P dictionaries and models built in. I suggest searching for G2P in their issue tab. # TODO

Converter

https://github.com/tarling/arpabet-and-ipa-convertor-ts (it has stress, forked from chdzq/ARPAbetAndIPAConvertor)
https://github.com/chdzq/ARPAbetAndIPAConvertor/blob/master/arpabetandipaconvertor/model/syllable.py (by a Chinese author)
https://github.com/pettarin/ipapy (way more complicated than it needs to be)
https://github.com/pettarin/ipapy/blob/master/ipapy/data/kirshenbaum.dat
https://github.com/pettarin/ipapy/blob/master/ipapy/kirshenbaummapper.py
https://github.com/espeak-ng/espeak-ng/blob/master/docs/phonemes/kirshenbaum.md (guideline)
https://github.com/rossellhayes/ipa/blob/main/tests/testthat/test-ipa.R
https://github.com/dr-ni/ipa2arpabet (German only)
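
What these converters do at their core can be sketched in a few lines. The following is a minimal, hypothetical ARPAbet -> IPA mapper covering only a handful of phonemes; `ARPABET_TO_IPA`, `STRESS_MARKS`, and `arpabet_to_ipa` are made-up names, not the API of any project listed above. It assumes that the stress mark is written directly before the vowel (standard IPA puts it before the whole syllable) and that unstressed AH becomes schwa.

```python
# Minimal ARPAbet -> IPA sketch. The table is a hypothetical subset;
# real converters cover the full phoneme inventory.
ARPABET_TO_IPA = {
    "S": "s", "L": "l", "B": "b", "F": "f", "Z": "z",
    "IH": "ɪ", "AH": "ʌ",
}

# ARPAbet stress digits: 1 = primary, 2 = secondary, 0 = unstressed.
STRESS_MARKS = {"1": "ˈ", "2": "ˌ"}


def arpabet_to_ipa(phones):
    """Convert a list of ARPAbet phonemes to a list of IPA segments."""
    out = []
    for phone in phones:
        stress = phone[-1] if phone[-1].isdigit() else ""
        base = phone[:-1] if stress else phone
        if base == "AH" and stress == "0":
            symbol = "ə"  # assumption: unstressed AH is rendered as schwa
        else:
            symbol = ARPABET_TO_IPA[base]  # KeyError for phonemes not in the subset
        # Assumption: the stress mark goes directly before the vowel.
        out.append(STRESS_MARKS.get(stress, "") + symbol)
    return out


ipa = arpabet_to_ipa("S IH1 L AH0 B AH0 L".split())
print("|".join(ipa))  # s|ˈɪ|l|ə|b|ə|l
```

The output is close to the segmented IPA form of "syllable" shown under Systems, except that the final schwa and l stay separate segments here instead of being merged.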
