
Instantly Extract Chinese Subtitles Physically Embedded from Videos to Text File


PandaEye


I think there might already be a fair number of Chinese videos with soft Chinese subtitles on YouTube. You can search and filter by "cc"; however, you cannot filter by CC language.

What do you guys think about a Chrome extension, e.g. "Chinese Videos with Soft Chinese Subtitles"? When searching on YouTube, it would indicate beside each result whether there is a mainland/Taiwan/Cantonese subtitle, and give you the ability to change the subtitle from traditional to simplified and vice versa (and generate a Chinese+English subtitle, etc.).

 

And couple that with the ability to add the video (by clicking a button) to "ChineseVideosWithSoftChineseSubtitles.com", where you can browse videos added by other users and rate videos.

 

This could potentially build up a customer base and a video marketplace, where we could then offer subtitles generated by speech recognition AI as a service (from 科大讯飞; I've already tested it with a Chinese engineer who's also interested in this idea) and even potentially develop video OCR.

 

What do you guys think?


Your ideas sound good, PandaEye. In addition to my Mandarin Experiment, which I occasionally write about here, I have a small language institute in Brazil and I have begun investing seriously in technology. I currently have 4 programmers working in Python / Django and plan to move into data analysis, machine learning, and deep learning applications, including offering the type of services that you mention, as early as next year. Currently, however, we're working on more simple and straightforward apps that we need at the language institute.

 

In any case, keep us updated on your project and let me know if you might be interested in collaborating in the future (you're a few steps ahead of me in your thinking right now).


On 10/4/2018 at 9:36 PM, XiaoXi said:

How do you look up the words?

 

I use Subs2SRS combined with Imron's Chinese Text Analyser and Anki to have decks of cards with audio, a snapshot, the sentence corresponding to the audio, and the definitions of words I don't know. 

 

On 10/5/2018 at 12:22 AM, XiaoXi said:

Isn't that what viki.com does? I haven't investigated it fully.

 

If you can find them using viki I recommend http://downsub.com/  to download the subtitles. It lets you pick which languages to download.

 


On 10/5/2018 at 2:41 PM, PandaEye said:

I think there might already be a fair number of Chinese videos with soft Chinese subtitles on YouTube. You can search and filter by "cc"; however, you cannot filter by CC language.

Are you sure they exist? I've watched a lot of TV shows on YouTube and never come across any with independent subtitles. There must be only a very few that don't have the subtitles burnt in. Maybe you could link to a TV show you found with non-burnt-in subtitles? Otherwise the tool may be useless.

 

Unless you're referring to just short youtube videos with subtitles rather than actual tv shows.


On 10/7/2018 at 2:16 PM, Yadang said:

I use Subs2SRS combined with Imron's Chinese Text Analyser and Anki to have decks of cards with audio, a snapshot, the sentence corresponding to the audio, and the definitions of words I don't know. 

Ok but I was asking 艾墨本 specifically how he looked up the words, how he was using the srt file. Does Imron's tool have audio? Otherwise my own tool is probably better.


7 hours ago, Flickserve said:

now you got me very curious.

 

what do you use and how?

Sorry I was just kind of thinking out loud, they are just tools I made myself for my own study.  I find if I use commercial tools they normally don't do what I want them to do the way I want them to, or they don't have a feature I want etc...or in many cases the tool doesn't exist commercially at all. I'm sure Imron's tool is great, I just meant for my own use I must have audio since I never use pinyin, I just use pure audio and characters to learn like Chinese people do.


Interesting thread, it sounds like a winner actually.

 

I think if you are watching on a tablet or phone, the PLECO Screen Reader/OCR is enough, although it is fiddly to move the box around to recognize the characters, and for some reason PLECO Screen Reader/OCR seems to have taken a step backwards. It struggles to recognize a lot of WeChat chats.

 

As for subs2srs, I looked at the website and it seems great. However, I wonder if I would use something like that in practice. If I need to convert all the subs (words, I presume) to flashcards, then the Chinese content is too high for my level. I assume people would be best served by movies and videos where they know most of the words anyway and only need to check a few. This can be done simply by pausing and writing it into PLECO.  Also if you knew 90% of the words does that mean you have to start deleting the 90% of the generated flash cards as you already will have them in a flash card format from previous study??

 

 

 

OK, edit: I just downloaded subs2srs and it seems to create flashcards out of sentences rather than words (correct?), so to create flashcards of individual words I would need something like Imron's program?


15 hours ago, DavyJonesLocker said:

This can be done simply by pausing and writing it into PLECO.

This is not simple, nor is it quick. We need srt files for everything. Please somebody figure out the way to get the srt files for all tv shows!


16 hours ago, DavyJonesLocker said:

 

OK, edit: I just downloaded subs2srs and it seems to create flashcards out of sentences rather than words (correct?), so to create flashcards of individual words I would need something like Imron's program?

 

That I don't know because I don't have flashcards of individual words. I do flashcards of sentences. 

 

It would be great to get srt files of other media in an easy fashion. 

 

Popular movies seem to keep to a limited vocabulary. Probably for the mass market sales. 


5 hours ago, Flickserve said:

Popular movies seem to keep to a limited vocabulary. Probably for the mass market sales. 

While I agree it would be great to get srt files for TV shows as well, popular movies certainly aren't all simple. Hero was popular but its language is incredibly difficult. Well, it's not complex as such, it's all old Chinese, but that still isn't limited vocabulary.


9 hours ago, XiaoXi said:

This is not simple, nor is it quick. We need srt files for everything. Please somebody figure out the way to get the srt files for all tv shows!

 

 

8 hours ago, Flickserve said:

That I don't know because I don't have flashcards of individual words. I do flashcards of sentences. 

 

It would be great to get srt files of other media in an easy fashion. 

 

So how often would you need to turn lines from movies / TV shows into flashcards? I kind of struggle with this aspect too, that is, how "hard" is too hard. If I need to stop at every sentence because I see a new word then it's pretty time consuming, but if it's just an occasional one then I just pause and use PLECO screen reader to add it into a deck. This assumes you're watching a show on an iPad, tablet or phone etc.

 

However, if I have a deck of 1000 sentences taken from an srt file using subs2srs, I would probably find I don't need 90% of them and would thus have to spend time deleting them.

 


8 hours ago, DavyJonesLocker said:

So how often would you need to turn lines from movies / TV shows into flashcards? I kind of struggle with this aspect too, that is, how "hard" is too hard.

 

It also depends on how you want to use the flashcard. 

 

 

Do you use it for vocabulary learning ? 

(know most of the vocab, missing a word to complete the comprehension) 

 

Do you use it for listening skills? 

(know the vocab, didn't understand it) 

 

Do you use it for training accent?

(know all the words, want to imitate local accent) 

 

The way to do it is have a big library of flashcards and then select out the flashcards you really want. I found I am not so interested in films because there is a lot going on that I can't understand because of two issues - vocabulary and listening skills. I really prefer short conversations to learn and make my own flashcards. Perhaps later, my preferences will change if I get to a better level. 


On 10/9/2018 at 10:11 PM, XiaoXi said:

Does Imron's tool have audio? Otherwise my own tool is probably better.

No, but I pull the audio from the movie with Subs2SRS. Chinese Text Analyser is used only to get definitions (and pinyin, if you want) for words that I don't know.

 

On 10/10/2018 at 6:51 AM, DavyJonesLocker said:

Also if you knew 90% of the words does that mean you have to start deleting the 90% of the generated flash cards as you already will have them in a flash card format from previous study??

 

14 hours ago, DavyJonesLocker said:

 

However, if I have a deck of 1000 sentences taken from an srt file using subs2srs, I would probably find I don't need 90% of them and would thus have to spend time deleting them.

 

CTA lets you export ONLY unknown words. In addition, CTA will let you export just the words needed to reach a given percentage of the total words. For example, I usually export only the words that will get me to 95%. What Imron's program actually does is export enough words that you know 95% of the words in the text; in reality my comprehension usually ends up higher than 95% due to context, guessing, etc.
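To make the "export up to X%" idea concrete, here is a minimal Python sketch. It is not CTA's actual algorithm, just one plausible way to do it: it assumes the text has already been segmented into words, counts how often each word occurs, and adds unknown words from most to least frequent until the share of known word occurrences reaches the target. The function name `words_to_reach_coverage` and its parameters are made up for illustration.

```python
from collections import Counter


def words_to_reach_coverage(tokens, known, target=0.95):
    """Pick the unknown words to learn so that known tokens reach `target`.

    tokens: list of already-segmented words (segmentation is assumed done)
    known:  set of words the learner already knows
    target: desired fraction of word occurrences that are known
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return []

    # Occurrences already covered by known words
    covered = sum(n for word, n in counts.items() if word in known)

    # Unknown words, most frequent first (ties broken by word for determinism)
    unknown = sorted(
        ((word, n) for word, n in counts.items() if word not in known),
        key=lambda item: (-item[1], item[0]),
    )

    to_learn = []
    for word, n in unknown:
        if covered / total >= target:
            break
        to_learn.append(word)
        covered += n
    return to_learn
```

With a low target you only get the handful of high-frequency words; with `target=1.0` you get every unknown word, which is the "too many flashcards" case.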

 

Thus, that part of Imron's program takes care of both problems: it handles not having enough comprehension (it will let you export only the words needed to reach a certain % comprehension), and it saves you from having too many flashcards (as you would if you learned every word you didn't know instead of only the ones that get you to 95%). Also, if you're using it with Subs2SRS, it will tag the cards that have unknown words, so you can then batch delete any cards that don't have any unknown words on them (see link below).

 

On 10/10/2018 at 6:51 AM, DavyJonesLocker said:

OK, edit: I just downloaded subs2srs and it seems to create flashcards out of sentences rather than words (correct?), so to create flashcards of individual words I would need something like Imron's program?

 

Imron's program will allow you to export all of the definitions and pinyin (if you want) for individual words. So from Subs2SRS you'll end up with sentence flashcards that have a snapshot, audio and the sentence in Chinese (and English if you have both SRT files). Then you'd use Imron's program to add the definitions of words in those sentences such that you'll understand X% of all the words. You can then import all of this into Anki, and using tags, filter out and delete any of the cards that don't have any unknown words on them. You'll end up with a customized deck consisting only of cards that have at least one unknown word on them, and every unknown word will have a definition and pinyin (if you want). See this thread for details. 
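For anyone who wants to see the tagging and filtering step spelled out, here is a rough Python sketch. It is not how CTA or Anki do it internally. It assumes you have the sentence cards as tab-separated text with the sentence in the first column, plus a dict of unknown words and their definitions (for instance from a CTA export). Matching is a plain substring check, which is cruder than real word segmentation. The function name, the column layout and the `has_unknown` tag are all made up for the example.

```python
import csv
import io


def tag_cards(tsv_text, unknown_words, sentence_col=0, tag="has_unknown"):
    """Split sentence cards into (kept, dropped).

    tsv_text:      tab-separated card data, one card per line (assumed layout)
    unknown_words: dict mapping each unknown word to its definition
    Kept cards get two extra fields: a glossary of the unknown words found
    in the sentence, and a tag that could be used for filtering in Anki.
    """
    reader = csv.reader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    kept, dropped = [], []
    for row in reader:
        if not row:
            continue  # skip blank lines
        sentence = row[sentence_col]
        # Crude substring match; a real tool would segment the sentence first
        hits = [word for word in unknown_words if word in sentence]
        if hits:
            glossary = "; ".join(f"{w}: {unknown_words[w]}" for w in hits)
            kept.append(row + [glossary, tag])
        else:
            dropped.append(row)
    return kept, dropped
```

The `kept` rows are the customized deck described above; the `dropped` rows are the cards you would otherwise batch delete by tag.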


On 10/11/2018 at 10:34 PM, DavyJonesLocker said:

If I need to stop at every sentence because I see a new word then it's pretty time consuming, but if it's just an occasional one then I just pause and use PLECO screen reader to add it into a deck.

Yes it's very time consuming, that's why we need srt files of everything so words can be looked up on the fly. 

 

22 hours ago, Yadang said:

No, but I pull the audio from the movie with Subs2SRS. Chinese Text Analyser is used only to get definitions (and pinyin, if you want) for words that I don't know.

Ok I see, and how does that audio get into the Chinese Text Analyser from there? Since you're reading it inside the tool right? Or you mean you use the two tools simultaneously?


1 hour ago, XiaoXi said:

Yes it's very time consuming, that's why we need srt files of everything so words can be looked up on the fly. 

 

 

thanks XiaoXi, yeah i understand now, fully agree. It would indeed be nice to have them for TV shows!


5 hours ago, DavyJonesLocker said:

thanks XiaoXi, yeah i understand now, fully agree. It would indeed be nice to have them for TV shows!

It seems what we're trying to do, quite specifically, is OCR hardcoded Chinese subtitles. After a quick search this seems theoretically possible, and the success rate is higher if the font used in the video file is identified.
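The OCR itself needs an image library and a recognition engine, so it isn't shown here. But the step after it, turning per-frame OCR results into an srt file, can be sketched in plain Python. This is only an illustration under assumptions: it takes a list of (time in seconds, recognized text) pairs sampled at a fixed interval, treats empty text as "no subtitle on screen", and merges consecutive frames with identical text into one cue. Real OCR output is noisy, so exact equality would probably need to be replaced with some fuzzy comparison. The function names are made up.

```python
def format_timestamp(seconds):
    """Format seconds as an SRT timestamp, e.g. 00:01:02,500."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def frames_to_srt(frames, frame_interval):
    """Collapse per-frame OCR text into SRT cues.

    frames:         list of (time_in_seconds, text), sorted by time
    frame_interval: seconds between sampled frames
    """
    cues = []  # each cue is [start, end, text]
    for time, text in frames:
        text = text.strip()
        if not text:
            continue  # no subtitle detected in this frame
        same_text = bool(cues) and cues[-1][2] == text
        # Contiguous means this frame starts where the previous cue ended
        contiguous = bool(cues) and time - cues[-1][1] < frame_interval / 2
        if same_text and contiguous:
            cues[-1][1] = time + frame_interval  # extend the current cue
        else:
            cues.append([time, time + frame_interval, text])

    blocks = []
    for index, (start, end, text) in enumerate(cues, start=1):
        blocks.append(
            f"{index}\n{format_timestamp(start)} --> {format_timestamp(end)}\n{text}\n"
        )
    return "\n".join(blocks)
```

Sampling a couple of frames per second would likely be enough, since a subtitle line normally stays on screen for longer than that.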


On 10/13/2018 at 3:22 PM, XiaoXi said:

Or you mean you use the two tools simultaneously?

3 tools if you throw in Anki.

 

See here and here for more details.

 

If you export the Subs2srs stuff from Anki, it produces a csv/tsv file, and you can process this with the Lua scripting support of CTA to do various different things such as only including words from sentences that you mostly know (one of the provided example scripts included with the program).

 

Although I don't use this myself, I believe the process is: run subs2srs, then export that from Anki, then run CTA and process the export with a Lua script, and then import the result back into Anki.
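To give an idea of what "only including words from sentences that you mostly know" means, here is a small sketch. It is written in Python rather than Lua and does not use CTA's scripting API at all; it only illustrates the selection logic under the assumption that each exported sentence has already been segmented into words. The function name and the threshold parameter are made up.

```python
def unknown_from_mostly_known(segmented_sentences, known, threshold=0.8):
    """Collect unknown words, but only from sentences that are mostly known.

    segmented_sentences: list of sentences, each a list of words
    known:               set of words the learner already knows
    threshold:           minimum fraction of known words for a sentence
                         to count as "mostly known"
    """
    result = []
    seen = set()
    for words in segmented_sentences:
        if not words:
            continue
        known_ratio = sum(word in known for word in words) / len(words)
        if known_ratio < threshold:
            continue  # too hard for now, skip the whole sentence
        for word in words:
            if word not in known and word not in seen:
                seen.add(word)
                result.append(word)
    return result
```

The returned words are the ones you would attach definitions to before importing the result back into Anki.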


On 10/12/2018 at 10:22 PM, XiaoXi said:

Ok I see, and how does that audio get into the Chinese Text Analyser from there? Since you're reading it inside the tool right? Or you mean you use the two tools simultaneously?

 

No I don't read it inside the tool - I only use it to generate pinyin and definitions and to tag the sentences that have unknown words. So yes, I use Subs2SRS, Anki and CTA simultaneously, as Imron described above.

