Scripted conversion of pinyin with tone marks to tone numbers / pinyin segmentation

January 10, 2024 at 09:53 AM

Does anyone know a good programming library / script to convert pinyin with tone marks to tone numbers? I need to convert multi-syllable words without spaces.

The biggest challenge is the segmentation. For example, "bàngōngshì" should be converted to "ban4gong1shi4" and not something messed up like "bang4o1ngshi4". So suggestions for libraries / scripts that do just the segmentation are also welcome. It will be easy to convert to tone numbers once the word/sentence is properly segmented.
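For the segmentation step on its own, here is a minimal Python sketch, not a tested library. It assumes standard Pinyin orthography, where a syllable starting with a, o or e inside a word is preceded by an apostrophe, so a trailing n, ng or r can only end a syllable when no vowel follows it. The names `segment`, `SYLLABLE`, `VOWEL` and `MARKS` are made up for illustration, and the pattern does not check syllables against a table of valid Pinyin (erhua forms such as "wánr" and vowel-less syllables such as "ń" are not handled).

```python
import re
import unicodedata

# Combining marks that can follow a vowel once the text is in NFD form:
# grave, acute, macron, diaeresis (for ü), caron.
MARKS = "\u0300\u0301\u0304\u0308\u030c"
VOWEL = f"[aeiou][{MARKS}]*"

# Assumed syllable shape: optional initial, vowel cluster, optional coda.
# A trailing n, ng or r is only taken when no vowel follows; otherwise the
# consonant is left to start the next syllable.
SYLLABLE = re.compile(
    r"(?:zh|ch|sh|[bpmfdtnlgkhjqxrzcsyw])?"
    rf"(?:{VOWEL})+"
    r"(?:ng(?![aeiou])|n(?![aeiou])|r(?![aeiou]))?",
    re.IGNORECASE,
)

def segment(word):
    """Split a Pinyin word (tone marks optional) into syllables."""
    decomposed = unicodedata.normalize("NFD", word)
    return [
        unicodedata.normalize("NFC", m.group())
        for m in SYLLABLE.finditer(decomposed)
    ]

print(segment("bàngōngshì"))  # ['bàn', 'gōng', 'shì']
```

With "bàngōngshì", the "ng" after "bà" is followed by a vowel, so only "n" is taken as the coda and "g" starts the next syllable.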

PHP is preferred as the project I need this for is in PHP, but other programming languages are also welcome.

Thanks!

January 10, 2024 at 06:21 PM

I'd approach it something like this in JavaScript: [snip]

Seems to give decent results: https://observablehq.com/@lionel-rowe/pinyin-tone-marks-to-numbers

Quote

Input: Bùjiǔ, Liú lǎoshī yòu huílai le, hòumian gēnzhe Shùyùn pàngpàng de wàipó. Wàipó jǔzhe làzhú, yīlù dàshēng de dūnangzhe shénme. Wǒ gēn Shùyùn xiàng liǎng ge mù'ǒu, bù gǎn chū yī shēng. Shùyùn de wàipó yòng Guǎngxīhuà duì wǒmen shuō, “Nǐmen yào sǐ a! Dàshuǐ bǎ shé chōng chūlai; nǐmen bù pà shé lái yǎosǐ nǐmen a?”

Output: Bu4jiu3, Liu2 lao3shi1 you4 hui2lai le, hou4mian gen1zhe Shu4yun4 pang4pang4 de wai4po2. Wai4po2 ju3zhe la4zhu2, yi1lu4 da4sheng1 de du1nangzhe shen2me. Wo3 gen1 Shu4yun4 xiang4 liang3 ge mu4'ou3, bu4 gan3 chu1 yi1 sheng1. Shu4yun4 de wai4po2 yong4 Guang3xi1hua4 dui4 wo3men shuo1, “Ni3men yao4 si3 a! Da4shui3 ba3 she2 chong1 chu1lai; ni3men bu4 pa4 she2 lai2 yao3si3 ni3men a?”

Should be easy enough to port to PHP as it looks like PHP also supports Unicode normalization forms, Unicode character properties in regexes, and regex splitting with capture groups (with PREG_SPLIT_DELIM_CAPTURE).
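For comparison, the same building blocks look roughly like this in Python. This is only a sketch with made-up helper names (`combining_marks`, `split_keep_delimiters`). Python's built-in `re` module does not support `\p{M}`, so `unicodedata.combining` stands in for the character-property test here.

```python
import re
import unicodedata

def combining_marks(text):
    """List the code points of the combining marks in the NFD form of text."""
    decomposed = unicodedata.normalize("NFD", text)
    return [f"U+{ord(ch):04X}" for ch in decomposed if unicodedata.combining(ch)]

def split_keep_delimiters(text):
    """Split on whitespace; the capture group keeps the separators in the
    result, similar in spirit to PREG_SPLIT_DELIM_CAPTURE in PHP."""
    return re.split(r"(\s+)", text)

print(combining_marks("ǚ"))                 # ['U+0308', 'U+030C']
print(split_keep_delimiters("Liú lǎoshī"))  # ['Liú', ' ', 'lǎoshī']
```

Keeping the separators means the converted pieces can be joined back together without losing the original spacing and punctuation.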

January 11, 2024 at 04:10 AM

Nice! Thanks!

It does seem to have one omission, it does not support the u-umlaut:

shěnglüè

lǚyóu

I managed to fix it by changing these lines:

Quote

// grab the last diacritic in the syllable
const [mark] = syllable.match(/\p{M}(?!.*\p{M})/u) ?? []

The added negative look-ahead makes it match the last diacritic instead of the first. This seems to work for the test cases I used.

I also need the 5th tone mark for neutral tones. I added this by modifying this line:

Quote

if (!mark) return syllable+'5'

I will try to throw more test cases at it later on. If all works well, I will try to convert it to PHP and post the result here.

January 11, 2024 at 04:31 AM

Here's my attempt at a PHP port: https://replit.com/@lionel_rowe/convert-to-tone-numbers#index.php

January 12, 2024 at 01:49 AM

On 1/11/2024 at 12:10 PM, phj said:

It does seem to have one omission, it does not support the u-umlaut:

shěnglüè

lǚyóu

I managed to fix it by changing these lines:

Quote

// grab the last diacritic in the syllable
const [mark] = syllable.match(/\p{M}(?!.*\p{M})/u) ?? []

The added negative look-ahead makes it match the last diacritic instead of the first. This seems to work for the test cases I used.

Good catch. I think a more robust method would be matching all diacritics until one of the relevant 4 is found. If NFD always spits out the umlaut before any of the 4 tone marks, that's just an implementation detail.
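A rough Python sketch of that idea, for illustration only (it is not the code from the linked JS or PHP versions, and `to_tone_number` and `TONE_MARKS` are hypothetical names): walk through the decomposed syllable, pull out the first of the four tone marks wherever it appears, and leave every other mark, such as the umlaut, in place.

```python
import unicodedata

# The four combining marks that carry tone: macron, acute, caron, grave.
TONE_MARKS = {"\u0304": "1", "\u0301": "2", "\u030c": "3", "\u0300": "4"}

def to_tone_number(syllable, neutral=""):
    """Replace the tone mark in one syllable with a trailing tone number.

    Other diacritics (e.g. the diaeresis on ü) are kept, regardless of
    where they sit relative to the tone mark in the NFD form.
    """
    tone = None
    kept = []
    for ch in unicodedata.normalize("NFD", syllable):
        if tone is None and ch in TONE_MARKS:
            tone = TONE_MARKS[ch]   # remember the tone, drop the mark
        else:
            kept.append(ch)
    base = unicodedata.normalize("NFC", "".join(kept))
    return base + (tone if tone is not None else neutral)

print(to_tone_number("lǚ"))   # lü3
print(to_tone_number("lüè"))  # lüe4
```

Passing `neutral="5"` appends a 5 to syllables that carry no tone mark.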

On 1/11/2024 at 12:10 PM, phj said:

I also need the 5th tone mark for neutral tones. I added this by modifying this line:

Quote

if (!mark) return syllable+'5'

Should work fine as long as the input is guaranteed to only be Pinyin syllables (rather than e.g. Pinyin mixed with English).
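If the input might contain other text, one possible guard is to append the 5 only when an unmarked token at least has the shape of Pinyin. The Python sketch below is a heuristic under that assumption, with hypothetical names (`looks_like_pinyin`, `add_neutral_tone`). It checks the shape of the token, not a table of valid syllables, so English words that happen to look like Pinyin (e.g. "he", "men") still get a 5.

```python
import re

# Toneless syllable shape: optional initial, vowel cluster, optional coda.
# This is a rough heuristic, not a list of valid Pinyin syllables.
_SYLLABLE = r"(?:zh|ch|sh|[bpmfdtnlgkhjqxrzcsyw])?[aeiouü]+(?:ng|n|r)?"
_PINYIN_WORD = re.compile(rf"(?:{_SYLLABLE})+", re.IGNORECASE)

def looks_like_pinyin(token):
    """True if the whole token can be read as a run of Pinyin-shaped syllables."""
    return _PINYIN_WORD.fullmatch(token) is not None

def add_neutral_tone(token):
    """Append '5' to an unmarked token only when it looks like Pinyin."""
    return token + "5" if looks_like_pinyin(token) else token

print(add_neutral_tone("le"))    # le5
print(add_neutral_tone("Test"))  # Test
```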

Edit: updated both JS and PHP versions with this logic

January 16, 2024 at 07:50 AM

Here's one in Python.

It uses zhon, which provides a quite impressive (1697 characters) regex for extracting accented Pinyin syllables, and unidecode, which transliterates accented characters to plain ASCII. Since that also strips the dots from the u-umlaut (ü), we have to handle those separately.

All non-pinyin substrings are kept unchanged.

Quote

# -*- coding: utf-8 -*-
import re

import zhon.pinyin
from unidecode import unidecode

# Map each tone-marked letter to its tone number.
acc2tone = {'ā': '1', 'ē': '1', 'ī': '1', 'ō': '1', 'ū': '1', 'ǖ': '1',
            'á': '2', 'é': '2', 'í': '2', 'ó': '2', 'ú': '2', 'ǘ': '2', 'ḿ': '2', 'ń': '2',
            'ǎ': '3', 'ě': '3', 'ǐ': '3', 'ǒ': '3', 'ǔ': '3', 'ǚ': '3', 'ň': '3',
            'à': '4', 'è': '4', 'ì': '4', 'ò': '4', 'ù': '4', 'ǜ': '4', 'ǹ': '4'}
accented = "".join(acc2tone)

def accented2tones(sentence):
    lastposition = 0
    retval = ""
    for match in re.finditer(zhon.pinyin.acc_syl, sentence, re.IGNORECASE):
        # Copy any non-pinyin text before this syllable unchanged.
        retval += sentence[lastposition:match.start()]
        pinyin = sentence[match.start():match.end()]
        # The first tone-marked letter determines the tone number.
        m = re.findall('[%s]' % accented, pinyin)
        tone = acc2tone.get(m[0], "") if len(m) > 0 else ""
        # unidecode would turn ü into u, so handle umlaut syllables separately.
        umlaut = re.findall("[üǖǘǚǜ]", pinyin)
        pinyin = pinyin.replace(umlaut[0], "ü") + tone if len(umlaut) > 0 else unidecode(pinyin) + tone
        retval += pinyin
        lastposition = match.end()
    retval += sentence[lastposition:]
    return retval

teststring = 'Wǒ gēn Shùyùn xiàng liǎng ge mù\'ǒu, bù gǎn chū yī shēng. 我买了一辆车。 lǚyóu . Shùyùn de wàipó yòng Guǎngxīhuà duì wǒmen shuō, “Test Nǐmen yào sǐ a! Dàshuǐ bǎ shé chōng chūlai; nǐmen bù pà shé lái yǎosǐ nǐmen a?”'
print(teststring)
print(accented2tones(teststring))

returns

Quote

Wo3 gen1 Shu4yun4 xiang4 liang3 ge mu4'ou3, bu4 gan3 chu1 yi1 sheng1. 我买了一辆车。 lü3you2 . Shu4yun4 de wai4po2 yong4 Guang3xi1hua4 dui4 wo3men shuo1, “Test Ni3men yao4 si3 a! Da4shui3 ba3 she2 chong1 chu1lai; ni3men bu4 pa4 she2 lai2 yao3si3 ni3men a?”


Posters, in order of the posts above: phj, Demonic_Duck, phj, Demonic_Duck, Demonic_Duck, calculatrix