Chinese-Forums
Scripted conversion of pinyin with tone marks to tone numbers / pinyin segmentation


phj

Does anyone know a good programming library / script to convert pinyin with tone marks to tone numbers? I need to convert multi-syllable words without spaces. 

 

The biggest challenge is the segmentation. For example, "bàngōngshì" should be converted to "ban4gong1shi4" and not something messed up like "bang4o1ngshi4". So suggestions for libraries / scripts that do just the segmentation are also welcome; it will be easy to convert the tone marks to numbers once the word/sentence is properly segmented.
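To make the segmentation problem concrete, here is a minimal Python sketch of one possible approach: try the longest known syllable first and backtrack when the remainder cannot be split. It assumes the tone marks have already been stripped, and `SYLLABLES` is a hypothetical, deliberately tiny inventory used only for illustration (a real one has roughly 400 entries). It is not taken from any library mentioned in this thread.

```python
# Hypothetical, deliberately tiny syllable inventory for illustration only.
SYLLABLES = {"ban", "bang", "gong", "shi", "o", "l\u00fc", "you"}


def segment(word, syllables=SYLLABLES):
    """Split a toneless pinyin word into known syllables.

    Longer candidates are tried first; if the rest of the word cannot be
    segmented, the function backtracks to a shorter candidate.
    Returns a list of syllables, or None if no full split exists.
    """
    if not word:
        return []
    for end in range(len(word), 0, -1):
        head = word[:end]
        if head in syllables:
            rest = segment(word[end:], syllables)
            if rest is not None:
                return [head] + rest
    return None


print(segment("bangongshi"))
```

With this inventory the greedy first guess "bang" fails because "ongshi" cannot be split, so the search falls back to "ban" + "gong" + "shi".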

 

PHP is preferred as the project I need this for is in PHP, but other programming languages are also welcome.

 

Thanks!


I'd approach it something like this in JavaScript: [snip]

 

Seems to give decent results: https://observablehq.com/@lionel-rowe/pinyin-tone-marks-to-numbers

 

Quote

Input: Bùjiǔ, Liú lǎoshī yòu huílai le, hòumian gēnzhe Shùyùn pàngpàng de wàipó. Wàipó jǔzhe làzhú, yīlù dàshēng de dūnangzhe shénme. Wǒ gēn Shùyùn xiàng liǎng ge mù'ǒu, bù gǎn chū yī shēng. Shùyùn de wàipó yòng Guǎngxīhuà duì wǒmen shuō, “Nǐmen yào sǐ a! Dàshuǐ bǎ shé chōng chūlai; nǐmen bù pà shé lái yǎosǐ nǐmen a?”

 

Output: Bu4jiu3, Liu2 lao3shi1 you4 hui2lai le, hou4mian gen1zhe Shu4yun4 pang4pang4 de wai4po2. Wai4po2 ju3zhe la4zhu2, yi1lu4 da4sheng1 de du1nangzhe shen2me. Wo3 gen1 Shu4yun4 xiang4 liang3 ge mu4'ou3, bu4 gan3 chu1 yi1 sheng1. Shu4yun4 de wai4po2 yong4 Guang3xi1hua4 dui4 wo3men shuo1, “Ni3men yao4 si3 a! Da4shui3 ba3 she2 chong1 chu1lai; ni3men bu4 pa4 she2 lai2 yao3si3 ni3men a?”


Should be easy enough to port to PHP as it looks like PHP also supports Unicode normalization forms, Unicode character properties in regexes, and regex splitting with capture groups (with PREG_SPLIT_DELIM_CAPTURE).
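For comparison, the "split with capture groups" idea looks roughly like this in Python, where `re.split` keeps captured delimiters in the result (similar in spirit to `PREG_SPLIT_DELIM_CAPTURE`). This is only a sketch, not the code from the linked notebook; the `tokenize` name is made up here, and it assumes NFC input, because Python's `\w` does not match standalone combining marks.

```python
import re

# Runs of letters only (no digits or underscore). The capture group makes
# re.split keep the matched words in its output.
WORD = re.compile(r"([^\W\d_]+)")


def tokenize(text):
    """Split NFC text into alternating non-word / word tokens.

    Words end up at odd indices, everything else (spaces, punctuation,
    apostrophes) at even indices, so joining the tokens restores the input.
    """
    return WORD.split(text)


tokens = tokenize("B\u00f9ji\u01d4, m\u00f9'\u01d2u!")
print(tokens)
```

Because punctuation and spacing survive as their own tokens, only the word tokens need to be converted and the rest can be passed through untouched.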


Nice! Thanks!

 

It does seem to have one omission, it does not support the u-umlaut:

shěnglüè

lǚyóu

 

I managed to fix it by changing these lines:

Quote

      // grab the last diacritic in the syllable
      const [mark] = syllable.match(/\p{M}(?!.*\p{M})/u) ?? []

The added negative look-ahead makes it match the last diacritic instead of the first. This seems to work for the test cases I used.

 

I also need the 5th tone mark for neutral tones. I added this by modifying this line:

Quote

      if (!mark) return syllable+'5'

 

I will try to throw more test cases at it later on. If all works well, I will try to convert it to PHP and post the result here.


On 1/11/2024 at 12:10 PM, phj said:

It does seem to have one omission, it does not support the u-umlaut:

shěnglüè

lǚyóu

 

I managed to fix it by changing these lines:

Quote

      // grab the last diacritic in the syllable
      const [mark] = syllable.match(/\p{M}(?!.*\p{M})/u) ?? []

The added negative look-ahead makes it match the last diacritic instead of the first. This seems to work for the test cases I used.

 

Good catch. I think a more robust method would be to match all diacritics until one of the 4 relevant tone marks is found; if NFD always spits out the umlaut before the tone mark, that's just an implementation detail.
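A rough Python sketch of that idea, for a single syllable: decompose with NFD, look for the first combining mark that is one of the four tone marks (wherever it sits relative to the diaeresis), drop it, and recompose with NFC so that ü survives. The function name and the `TONE_MARKS` table are made up for this example.

```python
import unicodedata

# Combining macron, acute, caron and grave mapped to tone numbers 1-4.
TONE_MARKS = {"\u0304": "1", "\u0301": "2", "\u030c": "3", "\u0300": "4"}


def syllable_to_numbered(syllable):
    """Replace the tone diacritic of one pinyin syllable with a tone number.

    Other diacritics (such as the diaeresis on ü) are kept. A syllable
    without a tone mark is returned unchanged.
    """
    decomposed = unicodedata.normalize("NFD", syllable)
    tone = ""
    kept = []
    for ch in decomposed:
        if not tone and ch in TONE_MARKS:
            tone = TONE_MARKS[ch]  # remember the tone, drop the mark
        else:
            kept.append(ch)
    return unicodedata.normalize("NFC", "".join(kept)) + tone


print(syllable_to_numbered("l\u01da"), syllable_to_numbered("l\u00fc\u00e8"))
```

Since the lookup is by code point rather than by position, it does not matter whether the tone mark comes before or after the diaeresis in the decomposed string.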

 

On 1/11/2024 at 12:10 PM, phj said:

I also need the 5th tone mark for neutral tones. I added this by modifying this line:

Quote

      if (!mark) return syllable+'5'

 

Should work fine as long as the input is guaranteed to only be Pinyin syllables (rather than e.g. Pinyin mixed with English).
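If the input might contain non-Pinyin words, one possible guard is to append the 5 only when the unmarked token is a known toneless syllable. This is a sketch under assumptions: `KNOWN_TONELESS` is a hypothetical, incomplete list, and `add_neutral_tone` is a made-up helper, not part of the JS or PHP versions.

```python
# Hypothetical, incomplete list of syllables that commonly appear toneless.
KNOWN_TONELESS = {"a", "de", "ge", "le", "me", "men", "zhe", "lai"}


def add_neutral_tone(token, known=KNOWN_TONELESS):
    """Append '5' only when an unmarked token is a known pinyin syllable."""
    return token + "5" if token.lower() in known else token


print(add_neutral_tone("le"), add_neutral_tone("Test"))
```

This still misfires on English words that happen to look like pinyin ("a", "me"), so it reduces the problem rather than solving it.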

 

Edit: updated both JS and PHP versions with this logic


Here's one in Python.

It uses zhon, which provides a quite impressive regex (1697 characters) for extracting accented pinyin syllables, and unidecode, which transliterates accented characters to plain ASCII. Since that also strips the diaeresis from u-umlaut (ü), we have to handle those characters separately.

All non-pinyin substrings are kept unchanged.

Quote

 

# -*- coding: utf-8 -*-
import re

import zhon.pinyin
from unidecode import unidecode

# Map each tone-marked letter to its tone number.
acc2tone = {'ā': '1', 'ē': '1', 'ī': '1', 'ō': '1', 'ū': '1', 'ǖ': '1',
    'á': '2', 'é': '2', 'í': '2', 'ó': '2', 'ú': '2', 'ǘ': '2', 'ḿ': '2', 'ń': '2',
    'ǎ': '3', 'ě': '3', 'ǐ': '3', 'ǒ': '3', 'ǔ': '3', 'ǚ': '3', 'ň': '3',
    'à': '4', 'è': '4', 'ì': '4', 'ò': '4', 'ù': '4', 'ǜ': '4', 'ǹ': '4'}

accented = "".join(acc2tone)

def accented2tones(sentence):
    lastposition = 0
    retval = ""
    for match in re.finditer(zhon.pinyin.acc_syl, sentence, re.IGNORECASE):
        # Copy any non-pinyin text before this syllable unchanged.
        retval += sentence[lastposition:match.start()]
        pinyin = match.group()
        # The first tone-marked letter (if any) determines the tone number.
        m = re.findall('[%s]' % accented, pinyin)
        tone = acc2tone[m[0]] if m else ""
        # unidecode would turn ü into u, so strip accents from the parts
        # around each ü and put a plain ü back in between (handles lüè too).
        parts = re.split("[üǖǘǚǜ]", pinyin)
        retval += "ü".join(unidecode(part) for part in parts) + tone
        lastposition = match.end()
    # Copy whatever follows the last syllable.
    retval += sentence[lastposition:]
    return retval

teststring = 'Wǒ gēn Shùyùn xiàng liǎng ge mù\'ǒu, bù gǎn chū yī shēng. 我买了一辆车。 lǚyóu . Shùyùn de wàipó yòng Guǎngxīhuà duì wǒmen shuō, “Test Nǐmen yào sǐ a! Dàshuǐ bǎ shé chōng chūlai; nǐmen bù pà shé lái yǎosǐ nǐmen a?”'
print(teststring)
print(accented2tones(teststring))

 

 

returns

Quote

Wo3 gen1 Shu4yun4 xiang4 liang3 ge mu4'ou3, bu4 gan3 chu1 yi1 sheng1. 我买了一辆车。 lü3you2 . Shu4yun4 de wai4po2 yong4 Guang3xi1hua4 dui4 wo3men shuo1, “Test Ni3men yao4 si3 a! Da4shui3 ba3 she2 chong1 chu1lai; ni3men bu4 pa4 she2 lai2 yao3si3 ni3men a?”

 
