Chinese-forums.com

Extracting Chinese hardsubs from a video

wulfgar

I have heard that this is possible, but haven't been able to find anything that works. Have any of you succeeded at doing this?

tysond

I've looked into it for many hours but with no success.

 

Apparently there's some software out there that can OCR softsubs (the kind stored as graphics, as on a DVD), but I never managed to get it working. It's very complex, all the documentation is in Chinese, and you need old versions of the software.

 

Some of the non-language-specific software can sort of OCR Chinese, but the characters are frequently split in half, and as far as I can see there's no way of saving your character set, so you end up having to "teach" it thousands of characters that you can never re-use.

 

As for what you actually asked for: hardsubs are even harder. Recognizing subtitles with no background is already very hard, and it's going to be harder still if there is background imagery.

 

The best solution I have is to avoid the problem by looking for SRT files on sites like shooter.cn. The problem is that they are not available for most native Chinese programs, because Chinese people don't need them. Sometimes movies have them (especially movies dubbed from English), but it's almost impossible to find them for TV shows unless someone has gone to the trouble of transcribing them and uploading them somewhere, and there's no central site for them.

 

It's a bit sad because almost all Chinese TV has subtitles, which means someone had to create them, but you can't find them anywhere.  

 

If anyone has any success with soft or hard subs I'd love to hear about it.

hedwards

Softsubs really ought to be trivial as they're an independent stream that should be extracted without any issues.
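
For the softsub case, a minimal sketch of what that extraction can look like, assuming ffmpeg is installed and the subtitle stream is text-based (image-based subtitles such as those on a DVD would still need OCR). The file names and the helper names are made up for illustration.

```python
import subprocess


def build_extract_command(video_path, srt_path, stream_index=0):
    """Build an ffmpeg command that writes one subtitle stream out as SRT.

    Assumes a text-based subtitle stream; ffmpeg chooses the SRT writer
    from the ".srt" extension of the output file.
    """
    return [
        "ffmpeg",
        "-i", video_path,
        "-map", f"0:s:{stream_index}",  # n-th subtitle stream of the first input
        srt_path,
    ]


def extract_softsubs(video_path, srt_path, stream_index=0):
    """Run the command; raises CalledProcessError if ffmpeg reports failure."""
    subprocess.run(
        build_extract_command(video_path, srt_path, stream_index),
        check=True,
    )
```

Calling `extract_softsubs("episode01.mkv", "episode01.srt")` would run the command on a hypothetical file. None of this helps with hardsubs, which are part of the picture itself.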

 

Hard subs are basically impossible at this point. You have to OCR those and I don't know of any OCR software that can do it reliably and automatically. Probably the best would be to take screen shots and then run those screenshots through Pleco or similar. I don't think that Google Translate is good enough to do the job at this point.

wulfgar

I've read posts in some forums that make it seem quite possible, but upon further investigation I always fail. To compound matters, I have a Mac, so software is limited.

 

I'm glad that Chinese shows usually come with Chinese subtitles, and I tend to watch them on Viki, which has English subtitles too. This is a good thing, but I want to go one step further and use a mouse-over dictionary. I'm spoiled that way. I want the transcripts for 3 reasons. 1) English subtitles are great for checking my understanding, but I also like being able to instantly look up the meanings of individual words, which often get lost in translation. 2) The audio of the video is usually sufficient to figure out correct pronunciation, but it's also nice to be able to instantly look up pinyin when it's unclear. 3) Because I don't have scripts for my shows, and I like to read with a mouse-over dictionary for the reasons mentioned, I have been reading additional material. Unfortunately it's pretty hard to find long, interesting transcripts right at my level (or i+1) that also have audio, and I have milked most of my sources almost dry.

 

If I can't figure out how to do this with software, I might hire a cheap Chinese typist to type them out. A friend is in the middle of doing this with a Russian show, and he just loves the results. Very expensive, but for a one-time deal, it might be a way to pay back the language learning community for all the help it has given me over the years.

 

Anyway, I'm still hopeful that someone will know a way. I know many of you can't see YouTube videos, but here is the first episode of the show I want to extract from:

盛夏晚晴天杨幂超清版01 HD第1集
c_redman

I had tried this years ago, with limited success. There are a few software tools to attempt this. What they do is to look for contiguous patches of white (or other color) surrounded with black (or other color for contrast). With average quality (non-HD) videos from TV, the anti-aliasing and the bleed-through of the video behind the text would result in about a 60% success rate at best. Correcting the numerous OCR errors made it not much better than simply typing in the subtitles by hand. If you just want bitmap images for each subtitle instead of OCR results, that's slightly less work but still requires a significant amount of cleanup.
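
As a rough illustration of that "bright text with a dark outline" idea, here is a minimal sketch in plain Python. It is not how any of the tools below are implemented. The frame is assumed to be a grayscale image given as a list of rows of 0-255 values, and the thresholds and function names are arbitrary choices for the example.

```python
def count_outlined_pixels(frame, bright=200, dark=60):
    """Count bright pixels that touch at least one dark pixel.

    frame: list of rows, each a list of grayscale values (0-255).
    A pixel counts when it is >= bright and one of its four direct
    neighbours is <= dark, i.e. it sits on the edge of an outlined stroke.
    """
    height = len(frame)
    width = len(frame[0]) if height else 0
    count = 0
    for y in range(height):
        for x in range(width):
            if frame[y][x] < bright:
                continue
            # Check the four direct neighbours that lie inside the frame.
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                if 0 <= ny < height and 0 <= nx < width and frame[ny][nx] <= dark:
                    count += 1
                    break
    return count


def has_subtitle(frame, min_pixels=3, bright=200, dark=60):
    """Guess whether the region contains outlined text."""
    return count_outlined_pixels(frame, bright, dark) >= min_pixels
```

With anti-aliasing or video bleeding through, the pixels next to a stroke are grey rather than dark, so they fail the `dark` test. That is one way to picture why the success rate drops on average-quality video.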

I haven't kept up with newer versions of these tools, so I'll just mention the names and you can try them out. It sounds like you'll also need a way to save the FLV files from YouTube.

AVISubDetector - can only read AVI format directly, but can also read an AVISynth stream. Creator's website vanished from the web. Features are incredibly complex

esrXP - haven't tried this one. Can't find creator's original site

SubRip - Not too bad for beginners; you can quickly get started by dragging the subtitle area then clicking for the text and outline color. It can detect the beginning and end of title changes to create a timing (SRT) file. It has built-in OCR, but you either need to do a lot of manual training for Chinese characters, or else manually type in the text

AVISynth - these tools seem to be picky about what files they can read. AVISubDetector, for example, only works with AVI format. But AVISynth is a frame server that can serve different formats, and AVISubDetector and SubRip can read these streams directly without extra conversion

sub2srs - I haven't used this myself. If you somehow obtained the timings of a video but not the OCR text, you might be able to use this program to get screenshots of the video frames containing the subtitles. Not as good as a text file of the transcribed text, but a lot less work

Good luck!

c_redman

I just found a previous post on using AVISubDetector. However, I never had as much luck detecting subtitles as that poster apparently had. The problem I have is that TV shows often jump from one subtitle straight to the next, with no text-free gap between the two frames. The way that AVISubDetector works, it's good at detecting the presence or absence of a subtitle, but it can't tell when two adjacent frames show different subtitles, especially if the two subtitles are the same length.
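
To make the back-to-back subtitle problem concrete, here is a minimal sketch of one way to compare the text pixels of two adjacent frames instead of only checking whether text is present. This is an illustration under assumptions, not a description of how AVISubDetector or SubRip work internally. The subtitle regions are assumed to be equal-sized grayscale crops (lists of rows of 0-255 values), and the thresholds are arbitrary.

```python
def binarize(region, bright=200):
    """Turn a grayscale region into a mask of 'text' (bright) pixels."""
    return [[pixel >= bright for pixel in row] for row in region]


def change_ratio(prev_region, next_region, bright=200):
    """Fraction of text pixels that differ between two equal-sized regions.

    Only pixels that are bright in at least one of the two frames are
    considered, so a large dark background does not dilute the result.
    """
    prev_mask = binarize(prev_region, bright)
    next_mask = binarize(next_region, bright)
    differing = 0
    text_pixels = 0
    for prev_row, next_row in zip(prev_mask, next_mask):
        for was_text, is_text in zip(prev_row, next_row):
            if was_text or is_text:
                text_pixels += 1
                if was_text != is_text:
                    differing += 1
    return differing / text_pixels if text_pixels else 0.0


def is_new_subtitle(prev_region, next_region, threshold=0.3, bright=200):
    """Treat the frame as a new subtitle when enough text pixels changed."""
    return change_ratio(prev_region, next_region, bright) > threshold
```

Two subtitles of the same length cover roughly the same area but rarely the same pixels, so the ratio rises even though "text present" stays true. Setting `threshold` too low would flag compression noise as a change, which resembles the over-sensitivity described for SubRip.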

SubRip is more sensitive to text changes, but often too sensitive. However, once you learn the shortcuts, it's easy to tell it to merge a false positive into the previously detected title. It's easier to do it that way, than to manually insert a bunch of missed titles with the correct timings in a later step.

eslang

  • esrXP software

(Embed Subtitle Ripper) is a program to help rip the subtitle embedded in the video.

https://sites.google.com/site/cphktool/esrxp

 

  • OCR software

CAJviewer or 尚书7号 (Shangshu No. 7) are recommended; however, correction and proofreading are still required.

 

使用EsrXP提取__文件中内嵌字幕(硬字幕)的方法 (How to use EsrXP to extract embedded subtitles (hardsubs) from __ files)

http://www.360doc.com/content/12/0911/23/2793098_235625294.shtml

 

手把手教_如何从RMVB__中提取出外挂字幕文件(轉自HDC) (Step-by-step guide: how to extract an external subtitle file from RMVB __; reposted from HDC)

http://www.360doc.com/content/11/0426/11/6373468_112401863.shtml

 

 

wulfgar

Thanks, eslang. Have you ever succeeded in doing this for an entire episode? I wonder how long it takes.

eslang

For a roughly 45-minute episode of a family drama (e.g. 夫妻那些事) without dense back-to-back dialogue or technical jargon (medical, military, historical, etc.), maybe around 3 hours.

Bear in mind that the time will depend on the quality of the hard subtitles, the color separation, your familiarity with the software, and your own editing and proofreading ability, since no OCR software converts with 100% accuracy. And typists can make mistakes or typos too.

 

If the drama is based on a novel and a copy of it is available online, it is easier, because some of the terms or words can be copied and pasted into the subtitle transcript.

c_redman

I was curious how long it would take to do the entire process and end up with a transcript, so I went ahead and did it using your YouTube episode. It took about 5 hours in total, with the majority of the time spent correcting the OCR output. And this is with a good-quality video, so this is probably the lower end of effort.

* 5 minutes: Using Freecorder to download the video and convert it to mpg format.

* 1.5 hours: Using esrXP to detect subtitles and convert them to framegrab images. It took longer than 1 hour because I spent time getting the detection filters to work reliably.

* 3 hours: Using ABBYY FineReader 10 on the bitmap files generated by esrXP, doing OCR processing to get Chinese text. The OCR was fast, but individually checking the characters it flagged as questionable was time-consuming. After pasting the text back into esrXP and matching it with the corresponding bitmap frames, I saved the subtitle file.

* 0.5 hours: Using Aegisub to run through the subtitles and check for errors. I set the subtitle font to be the same size as the hardsub, and quickly went through each line to make sure they matched.
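
For anyone unfamiliar with what the saved subtitle file looks like, here is a minimal sketch of writing timed lines in SRT format: a running number, a `HH:MM:SS,mmm --> HH:MM:SS,mmm` timing line, the text, and a blank line between entries. The function names are made up, and the timings in the usage note are invented for illustration, not taken from the attached file.

```python
def format_timestamp(seconds):
    """Convert a time in seconds to the SRT form HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(entries):
    """Build SRT text from (start_seconds, end_seconds, text) tuples."""
    blocks = []
    for number, (start, end, text) in enumerate(entries, start=1):
        timing = f"{format_timestamp(start)} --> {format_timestamp(end)}"
        blocks.append(f"{number}\n{timing}\n{text}\n")
    # Entries are separated by one blank line.
    return "\n".join(blocks)
```

For example, `to_srt([(131.0, 133.5, "是恋爱中爱神的庇护了")])` produces a single entry running from `00:02:11,000` to `00:02:13,500`. Written to a UTF-8 file, that is the kind of output Aegisub can open for the final check.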

So there you go. It's possible but a painfully time-consuming process. I will write a blog post on how to do the steps in detail at some point, since I learned a few tricks along the way that would help in the future.

盛夏晚晴天杨幂超清版01 HD第1集 - YouTube.srt

Kobo-Daishi

I tried typing it out when I came across this at the 2:11 mark.

[Screenshot: 2mo33f4.png]

Instead of c_redman's 是恋爱中爱神的庇护了 it had 是恋爱中爱神的疪护了

I was stumped. I wasn't sure if 疪 was a variant for 庇, so, I entered 庇 into the Dictionary of Chinese Character Variants put out by Taiwan's Ministry of Education and saw that it has no variants.

It wasn't until c_redman's post that I thought to look under 疪. Here they also say no variants for 疪. But 疪 itself is a variant for 痹.

So, definitely a typo. Unless there's a dictionary that says that it's commonly mistaken for...
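
A one-character mismatch like this is easy to miss by eye. As a minimal sketch, two versions of a line (say, a typed transcript and an OCR result) can be compared character by character, printing code points so that look-alike characters stand out. The function names are made up, and the sketch assumes both lines have the same length.

```python
def differing_characters(expected, actual):
    """Return (position, expected_char, actual_char) for every mismatch.

    Assumes both lines have the same number of characters; lines with
    insertions or deletions would need aligning first.
    """
    if len(expected) != len(actual):
        raise ValueError("lines differ in length; align them first")
    return [
        (position, e, a)
        for position, (e, a) in enumerate(zip(expected, actual))
        if e != a
    ]


def describe(mismatches):
    """Format mismatches with Unicode code points for look-alike characters."""
    return [
        f"position {position}: {e} (U+{ord(e):04X}) vs {a} (U+{ord(a):04X})"
        for position, e, a in mismatches
    ]
```

Running `differing_characters("是恋爱中爱神的庇护了", "是恋爱中爱神的疪护了")` reports a single mismatch at position 7 (counting from 0), which is the 庇 / 疪 pair above.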

In my Taishanese dialect, when our leg is numb or has fallen asleep we say GEUHK BEIH.

I entered 脚痹 and 腳痹 (the two common variants for foot/leg) into Google search just to see if they use it in Mandarin and HK Cantonese and got another variant for "numb/paralysis", 痺.

ArggggghhhhhhH!!!!!!!

Kobo, sitting in a corner pulling his hair out by the roots, mind going numb.  :)

wulfgar

Wow - 5 hrs. Not the quick fix I was hoping for. Thanks very much for the transcript though!

hedwards

Now that I think about it, probably the best thing to do would be to crowdsource it. If a group of folks chose the same video, I'm sure that collectively the extraction could be done in a few minutes.

 

OCR helps, but I've found that even quality packages like Pleco seem to have serious issues at times, especially if there's a weird font or the character is underlined.

wulfgar

第1集.docx

I hired a typist. What do you think of this format? I find it easier to read in some respects, but I'm used to a new line when the character changes.

imron

Out of curiosity, what was the total cost and where did you find the typist?

wulfgar

I PM'd you on that, imron. I got lucky and it was cheaper than I expected, so I have budget for one more. I'm going to post links to the transcripts here; they will be hosted on my friend's new site, all free, of course. We made a deal: he would get transcripts for a Russian series, and I would get transcripts for a Chinese series. But I also promised to match expenditures, so I need to choose another Chinese one, and maybe you guys can help me. Here are my requirements:

 

1) must have full English subtitles available on Viki

2) must be modern real life (no magic cell phones, kung-fu, palace dramas, exploding varmints, etc)

 

Any suggestions? 

imron
wulfgar wrote: Any suggestions?

Many. 《奋斗》 and 《我的青春谁做主》 might fit the bill for modern, although 'real life' is always going to be questionable :) . If comedy is allowed, maybe consider a series of 《爱情公寓》, and finally, if there is some leeway on modern times, you might also consider 《潜伏》 or 《黎明之前》, both of which are good shows.

 

Edit: Ah sorry, just realised your requirement about English subs.  I don't know how the above shows match up with that.

eslang

The TV mini-series “Le Jun Kai” (乐俊凯), based on a short story by Fei Wo Si Cun (匪我思存), is available on YouTube. It has CC (closed captions) in English, Russian and some other languages, which can be easily extracted using a Google tool.

 

匪我思存

http://baike.baidu.com/view/937918.htm
