How to extract and convert closed caption files the hard way.

I will be using the terms “closed captions” and “subtitles” interchangeably in this post because it isn’t always possible to know whether the image-based SUP file you have contains closed captions, which include both descriptive text and dialog, or subtitles, which contain only dialog.

I’ve been watching a few TV shows on Hulu and believe that at least Cloak & Dagger as well as Stitchers had their subtitles ripped from an m2ts file using either HdBr Stream Extractor v9 or MeGUI, likely from Blu-ray, and converted using Subtitle Edit. Why do I believe that this is the case?

More often than not I will see an italic sentence with two or more words touching each other near the middle, because the gap between italic letters, in pixels, is much smaller than the gap between normal letters. Why it doesn’t happen as often across the entire sentence is beyond me at this time; then again, I have a massive replace list. A space width of ten or eleven pixels works well for most Blu-ray content. Subtitle Edit likes DVD subtitles to be around 6-8 pixels apart because the letters are lower resolution. Your mileage will vary.

You can adjust Subtitle Edit to look for letters closer together or further apart based on the number of pixels you tell it are in a space, but this is a global setting for each input file. It cannot be adjusted just for italics because everything is in an image-based format, specifically a SUP file. For example, if you set it to look for letters/blocks closer together, you will likely end up with a lot of individual characters instead of words. If you set it to look for letters/blocks further apart, you will merge a lot of words together.

Thiscanbea badthing. I t c a n a l s o b e a b a d t h in g.
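To make the trade-off concrete, here is a minimal sketch of the idea in Python. It is not Subtitle Edit's code; the function name and the pixel measurements are made up for illustration. It only shows why a single global space width cannot suit both tight italic spacing and normal spacing.

```python
def group_glyphs(glyphs, gaps, space_width):
    """Join glyphs into text, inserting a space wherever the pixel gap
    between two neighbouring glyphs is at least space_width.

    glyphs: recognised characters, left to right
    gaps:   pixel distance between each neighbouring pair of glyphs
    """
    if not glyphs:
        return ""
    if len(gaps) != len(glyphs) - 1:
        raise ValueError("need exactly one gap between each pair of glyphs")
    out = [glyphs[0]]
    for glyph, gap in zip(glyphs[1:], gaps):
        if gap >= space_width:
            out.append(" ")
        out.append(glyph)
    return "".join(out)


# Hypothetical italic phrase: letters sit 2 px apart, the word gap is 7 px.
glyphs = list("canbe")
gaps = [2, 2, 7, 2]

print(group_glyphs(glyphs, gaps, 6))   # threshold fits this font
print(group_glyphs(glyphs, gaps, 10))  # threshold too wide: words merge
print(group_glyphs(glyphs, gaps, 2))   # threshold too narrow: letters split
```

A threshold of 10 that suits normal Blu-ray text swallows the 7 px italic word gap, which is the merged-words effect described above.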

Subtitle Edit has two methods for OCR.
1) Tesseract. This does a decent job but I no longer use it as it has problems with some fonts and italics.

2) Binary Image Compare. Blu-ray (MPEG-TS) and DVD (MPEG-PS) containers store closed captions as images that are overlaid on the video during playback. This is the method I use, and I think Hulu uses it as well. It is also the recommended option for Subtitle Edit.

Binary Image Compare has to be trained to look at the many different fonts that you can come across right down to the letter, number, punctuation, and symbol level. The process to teach it what each letter looks like, typically multiple times for the same exact letter early on, is onerous and will crush your soul. When it comes across a “block” of information it asks you what it is and if it is italic or not. You can expand the block to fit quotes and the like, but you cannot shrink it from what it originally detects. Sometimes it will detect “rt” as a single block so you have to add “rt” as a letter. This is most common with italics, but depending on the font it can also affect normal letters and numbers.

When Subtitle Edit comes across a word it doesn’t know it will ask you to do one of a few things.

A) Add to names/noise list (case sensitive)
This is for things like Hogwarts or WebRTC.

B) Add to user dictionary
This is for adding case-insensitive words that are not in its default dictionary.

C) Add pair to OCR replace list
A lower case L looks the same as an upper case “I” in most sans-serif fonts. You will need to use this to fix things like:
Iower (with uppercase i's)
to
lower (with lower case L's)

This can also fix words that are too close together.
Thiscanbea
to
This can be a

D) Google it.
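The replace list in option C boils down to a case-sensitive, whole-word lookup. The sketch below is my own illustration in Python, not Subtitle Edit's file format or code; the dictionary entries and the function name are hypothetical.

```python
import re

# Hypothetical entries; a real list grows with every show you process.
REPLACE_PAIRS = {
    "Iower": "lower",               # uppercase i misread for lowercase L
    "Thiscanbea": "This can be a",  # words that were too close together
}


def apply_replace_list(line, pairs=REPLACE_PAIRS):
    """Replace whole words only, case sensitive, so 'Iower' is fixed
    but 'Iowering' is left alone until it gets its own entry."""
    def fix(match):
        word = match.group(0)
        return pairs.get(word, word)
    # [^\W\d_]+ matches a run of letters, including accented ones.
    return re.sub(r"[^\W\d_]+", fix, line)


print(apply_replace_list("Iower the volume"))
print(apply_replace_list("Thiscanbea badthing."))
```

Matching whole words is the reason to fix “Iower” as a pair rather than retraining the single character: the pair never touches a legitimate capital I elsewhere.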

Best practice is to have Subtitle Edit just rack up the words it doesn’t know so you can bang the majority of the duplicates out after the first full run. Also set the Max. error% value to 1.0 percent for higher accuracy. Run again to catch some more and then run it until there is nothing left to fix. Don’t be surprised if a few more words pop up on the second or third pass.

I’ve added a lot of characters and words to the database from multiple TV series and movies, as some of them use unique fonts and have both unique words and names in them. I use the website copypastecharacter.com frequently because what Subtitle Edit provides in its interface is very limited.

In some cases Subtitle Edit will fail to show a letter or will detect it incorrectly. If you see this, you can simply click on the line of text that has the problem in the main window, navigate to the specific character, and update it accordingly. I do not recommend updating an I to be an l or an l to be an I. If you do that you will be playing whack-a-mole forever. Set it, forget it, and fix each failing word with “Add pair to OCR replace list” on the fly as the failures come up.

Do not be surprised if your subtitles are not properly aligned. I currently use either Easy Subtitles Synchronizer or Subtitle Edit to fix this problem depending upon my mood. In a few rare cases I had to tear down the text based subtitle that Subtitle Edit created, remove the portions that didn’t line up at all, and then add them back by hand. Always make a backup before editing. Your mileage may vary.

And last but not least, don’t forget punctuation, specifically when the source is an MPEG-PS VOB file from a DVD. A buddy of mine who gives lectures on advanced sed usage helped me with this sed filter, which is almost indecipherable (at least to me), because the regex support in Subtitle Edit is insufficient for my needs.

s/\([[:alpha:]]\) ,/\1,/g
s/\([[:alpha:]]\),\([[:alpha:]]\)/\1'\2/g
s/\([[:alpha:]]\) ,\([[:alpha:]]\)/\1'\2/g
s/\([[:alpha:]]\)' \([[:alpha:]]\)/\1'\2/g
s/\([[:digit:]]\)'\([[:digit:]]\)/\1,\2/g
s/\([[:alnum:]]\) \./\1./g
s/ , /, /g
s/ \. /. /g
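If sed isn’t available (on Windows, say), the same rules can be sketched in Python. This is my own port, not something from my buddy; the names are made up, and the character classes only approximate sed’s [[:alpha:]] and [[:alnum:]]. The last rule is written with an escaped period so that it only matches a literal period.

```python
import re

ALPHA = r"[^\W\d_]"  # any letter, roughly sed's [[:alpha:]]
ALNUM = r"[^\W_]"    # letter or digit, roughly [[:alnum:]]
DIGIT = r"[0-9]"

# Same order as the sed filter; each pair is (pattern, replacement).
RULES = [
    (rf"({ALPHA}) ,", r"\1,"),               # "word ," -> "word,"
    (rf"({ALPHA}),({ALPHA})", r"\1'\2"),     # "don,t"  -> "don't"
    (rf"({ALPHA}) ,({ALPHA})", r"\1'\2"),    # "don ,t" -> "don't"
    (rf"({ALPHA})' ({ALPHA})", r"\1'\2"),    # "it' s"  -> "it's"
    (rf"({DIGIT})'({DIGIT})", r"\1,\2"),     # "1'000"  -> "1,000"
    (rf"({ALNUM}) \.", r"\1."),              # "end ."  -> "end."
    (r" , ", ", "),
    (r" \. ", ". "),
]


def fix_punctuation(text):
    """Apply every rule in order, globally, like sed's s///g."""
    for pattern, replacement in RULES:
        text = re.sub(pattern, replacement, text)
    return text


print(fix_punctuation("Hello , world ."))
print(fix_punctuation("I don,t have 1'000 of them"))
```

As with the sed version, run it on a copy of the subtitle file first, since the comma-to-apostrophe rules assume a comma wedged between two letters is always a misread apostrophe.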

Subtitle Edit, in my experience, is not suitable for automation and requires a lot of hands-on work to get things right. If Hulu is using Subtitle Edit then I feel that they either don’t know better, assume that the subtitles they receive are perfect, or don’t give a shit. I’m not sure which one is worse.

Please don’t take my word about subtitle automation being sub-optimal. Give the following a whirl and compare it against what you created via Subtitle Edit’s GUI.

"C:\Program Files\Subtitle Edit\SubtitleEdit" /convert "inputfile.sup" SubRip
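One way to do the comparison is a plain text diff of the two SubRip files. Here is a minimal sketch using Python’s standard difflib; the function name and the file names gui.srt and cli.srt are placeholders I picked, not anything Subtitle Edit produces by default.

```python
import difflib


def diff_subtitles(gui_text, cli_text):
    """Return unified-diff lines between the hand-corrected GUI output
    and the unattended command line output."""
    return list(difflib.unified_diff(
        gui_text.splitlines(),
        cli_text.splitlines(),
        fromfile="gui.srt",
        tofile="cli.srt",
        lineterm="",
    ))


# Tiny made-up sample; in practice read both files from disk, e.g.
# pathlib.Path("gui.srt").read_text(encoding="utf-8")
gui = "1\n00:00:01,000 --> 00:00:02,000\nlower the volume\n"
cli = "1\n00:00:01,000 --> 00:00:02,000\nIower the volume\n"

for line in diff_subtitles(gui, cli):
    print(line)
```

Every line starting with - or + in the output is a spot where the automated run disagreed with your corrected version.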

 
