Language Hopper

Tuesday, July 9, 2013

Using subtitle files

I'm not a big fan of using subtitles in my target language when watching a film in English. I'd much rather watch a film without subtitles in any language, but I think that subtitles can be useful as a reading tool.

In this post I'll go over how I put subtitle files (.srt files) to good use. I'll use the film "Contagion" with Turkish subtitles as my example throughout.

There are many, many websites that house all sorts of subtitles for most popular films - too many to list here, but I'll mention that I got the subtitle file from a website called Subtitlesbank.

Once the .srt file is downloaded, it'll look something like this:

A couple of things to notice about the file: First, it's got timecode in it, which isn't useful for my purposes, so I'll strip the timecode out of the file. Since I use Linux, bash shell commands come in handy, but this can be done on Windows and Mac, too. On Windows, a set of tools called Cygwin needs to be installed, while Mac users already have the tools needed.

To strip the timecode from the file, simply type in:

awk '/-->/{for(i=1;i<d;i++){print a[i]};delete a;d=0;next}{a[++d]=$0}END{for(i=1;i<=d;i++)print a[i]}' filename.srt > newfilename.txt

with "filename.srt" being the original .srt file and "newfilename.txt" being the new file without timecode.
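For anyone curious about what that one-liner is doing, here is a spelled-out sketch of the same idea as a shell function. The name strip_timecode is made up for illustration, and the sketch assumes the usual .srt layout: a counter line, then a timecode line containing "-->", then the subtitle text. As a small extra, it also drops the carriage returns that files saved on Windows tend to carry.

```shell
# Sketch: remove the counter and timecode lines from an .srt file.
# strip_timecode is a hypothetical helper name, not a standard tool.
# $1 = path to the .srt file; the cleaned text goes to standard output.
strip_timecode() {
    awk '
        { sub(/\r$/, "") }          # drop a trailing carriage return, if any
        /-->/ {
            # A timecode line: everything buffered so far is subtitle text,
            # except the last buffered line, which is the counter.
            for (i = 1; i < n; i++) print buf[i]
            n = 0
            next
        }
        { buf[++n] = $0 }           # hold the line until we know what it is
        END {
            # Whatever is left belongs to the final subtitle.
            for (i = 1; i <= n; i++) print buf[i]
        }
    ' "$1"
}
```

It would be used the same way as the one-liner: strip_timecode filename.srt > newfilename.txt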

The second problem with the file is the character encoding. Looking at the above screenshot, I've highlighted a line that has some funky letters that need to be changed throughout. That's easy enough to do with any decent text editor with a Find/Replace All command. I also got rid of the HTML italic tags ("<i>" and "</i>") using the same method. Once that's done, the resulting file looks like this:
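The Find/Replace cleanup can also be scripted. The sketch below is only an illustration: both helper names are made up, and the source encoding has to be supplied by you, because it depends on the file you downloaded. WINDOWS-1254 is a common encoding for Turkish text, but treat it as a guess to verify rather than a given.

```shell
# Sketch: convert a subtitle file to UTF-8 and strip <i>...</i> markup.
# Both helper names are hypothetical. The source encoding is an assumption
# you pass in yourself; check your own file before converting.
to_utf8() {
    # $1 = assumed source encoding (for example WINDOWS-1254), $2 = input file
    iconv -f "$1" -t UTF-8 "$2"
}

strip_italics() {
    # Reads text on standard input and removes <i> and </i> tags.
    sed -e 's/<i>//g' -e 's/<\/i>//g'
}
```

The two can be chained, for example: to_utf8 WINDOWS-1254 newfilename.txt | strip_italics > cleaned.txt (where cleaned.txt is just a placeholder name). If the funky letters are still there afterwards, the encoding guess was wrong.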

Much easier to read, and, more importantly, this new file can now be imported into Learning With Texts or another language-learning program.

Truthfully, I don't use Learning With Texts all that much for Turkish any more. I'd prefer to just read a regular text file and not worry about what words I've learned or need to learn, and just look up words as needed with a dictionary. This is where GoldenDict comes in handy.

Here's a screenshot:

This particular screenshot is just a simple text editor with GoldenDict, but any other text reader will do fine with GoldenDict, too, whether it's for Epubs or PDFs.

Subtitle files are a great way to do some light reading. Typical subtitle files for TV shows have around 500 or so sentences, while feature film subtitle files contain 1000 or more for moderate dialog.

Since I have Stardict (GoldenDict compatible) on my mobile device, it's also a good alternative to firing up Anki in my wasted minutes throughout the day.

Sunday, June 30, 2013

How to make your own Stardict/Goldendict compatible dictionary

I recently found a PDF online of a fairly decent Ojibwe<>English dictionary that I wanted to incorporate into my list of dictionaries that I use on my system. I currently use Goldendict, which is compatible with Stardict, because it easily incorporates itself in my system and is usable with any application. Stardict and Goldendict are both currently available for Windows and Linux. I primarily use Ubuntu, where both packages are available in the standard repository, but there are also installation packages located at the Stardict Project Google Code page. In any case, if you've already installed either Stardict or Goldendict, you'll want to grab stardict-tools (for Linux users) or stardict-editor (for Windows users) and install it.

I'll go through the steps to convert the PDF file to something that can be used within Stardict and Goldendict.

First, you want to create a simple text file with the dictionary. I just copied and pasted all the text I wanted to include into a new text file:

You'll notice that the delimiter between the two languages is a dash. I needed to change that to something that the conversion program could understand. I chose a [TAB] as the delimiter. When replacing, I made sure to search for the dash with a space before and after it, because Ojibwe uses dashes with some affixes, and those dashes need to stay.
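If the file is too long to fix comfortably in a text editor, the same replacement can be sketched in the shell. The helper name dash_to_tab is made up, and the sketch assumes that the first dash with a space on each side is the delimiter on every line.

```shell
# Sketch: replace the first " - " (dash with a space on each side) on every
# line with a TAB. Dashes without surrounding spaces, such as those on
# affixes, are left alone. dash_to_tab is a hypothetical helper name.
# $1 = path to the text file; the result goes to standard output.
dash_to_tab() {
    awk '{
        i = index($0, " - ")
        if (i > 0)
            print substr($0, 1, i - 1) "\t" substr($0, i + 3)
        else
            print
    }' "$1"
}
```

Lines without a spaced dash pass through unchanged, so they are easy to spot and fix by hand afterwards.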

I then saved the file as plain text and called up stardict-editor. This is a simple, single-window application that will do the conversion to a compatible format for use in the Stardict and Goldendict applications.

Click on the "Browse" button to load your newly edited text file, then click "Compile". If all goes well, you'll get the following dialog:

I had hundreds of duplicate entries, because the particular dictionary I'm using includes other dialects, and some of the entries were the same for the various dialects. If there are duplicates, an error is shown with a line number. Simply go back and fix/delete the entry, then try again until you get the above dialog.
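With hundreds of duplicates, it can save time to list them all up front instead of fixing them one error at a time. Here's a small sketch, assuming the file is already TAB-delimited and that "duplicate" means the same headword in the first column; find_duplicates is a made-up name.

```shell
# Sketch: list headwords that appear more than once in a TAB-delimited file,
# with the line number of each repeat. find_duplicates is a hypothetical
# helper name. $1 = path to the TAB-delimited text file.
find_duplicates() {
    awk -F '\t' 'seen[$1]++ { print NR ": " $1 }' "$1"
}
```

Each line of output gives the line number of a repeated headword (the first occurrence is not listed), which is the same information the compile error gives, just all at once.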

Once compiled, three files will be created: a .dict.dz, an .idx, and an .ifo file:

Next we want to open the .ifo file in a text editor and change the name of the dictionary to what we want it to be:

This name is what will be visible in the dictionary application.
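The name lives on the line of the .ifo file that starts with "bookname=". For those who'd rather not open an editor, here's a sketch of the same change from the shell. The helper name set_bookname and the file names in the usage note are placeholders.

```shell
# Sketch: print a copy of an .ifo file with the bookname line replaced.
# set_bookname is a hypothetical helper name.
# $1 = path to the .ifo file, $2 = the new dictionary name
set_bookname() {
    awk -v name="$2" '
        /^bookname=/ { print "bookname=" name; next }
        { print }
    ' "$1"
}
```

Since it prints to standard output, write to a new file and then move it into place, for example: set_bookname mydict.ifo "Ojibwe > English" > mydict.ifo.new && mv mydict.ifo.new mydict.ifo. Avoid backslashes in the name, because awk treats them as escape characters.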

Save the file and then start Stardict or Goldendict. Make sure that all three of these newly created files are easily accessible to the dictionary program. On my system there is a global user location, and I've also created my own dictionary directory and place all my own user-created dictionaries there.

Now we want to let the application know where the dictionaries are located. Start Goldendict (what I use), go to "Edit... Dictionaries". The following dialog box will appear:

Click on "Rescan". Now click on the "Dictionaries" tab in the same dialog box, and you should see your new dictionary recognized.

That's it. You're done! You can now use your new dictionary.

The above screenshot is a simple dictionary lookup, but what makes Stardict and Goldendict so useful is that they can be used with any text application. While you're reading along in an epub, PDF, or text program, you can just click on any word and you'll get the definition for it, provided it's in the dictionary:

Keep in mind that this process needs to be done for each language direction. The screenshots I've included here only show the process for an Ojibwe > English dictionary. The same thing must be done if you want a dictionary for the other direction (English > Ojibwe in my case).
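If your source only has one direction, a very rough starting point for the other one is to swap the two columns of the TAB-delimited file. This is just a sketch with a made-up helper name, and it is not a substitute for a real reverse dictionary: definitions are often phrases or lists of words, so the swapped file will need hand editing (and another duplicate check) before it compiles into something useful.

```shell
# Sketch: swap the two TAB-separated columns to get a rough starting point
# for the reverse direction. swap_columns is a hypothetical helper name.
# $1 = path to the TAB-delimited text file. Lines without a TAB are skipped.
swap_columns() {
    awk -F '\t' 'NF >= 2 { print $2 "\t" $1 }' "$1"
}
```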

I don't know of any direct way to do this for a Mac, but I know that there is something that will convert an already created Stardict/Goldendict dictionary to Mac Dictionary format. It's called the Mac Dictionary Kit and includes DictUnifier. It can be found here.

Thursday, April 4, 2013

Spring is finally on its way

It's been a while since my last update, so this post is overdue.

March, unfortunately, was a really bad month, both personally and professionally for me. As a result, I just didn't have it in me to write anything. Thank God March is over, and I can get on with April and everything renewed.

I really should have at least posted an update for my Turkish B2 test. I passed, so that's a positive. The test took two days, with the oral portion on the second day. I've taken CEFR tests in the past, so there were really no surprises, just an exhausting couple of days. So what now for my Turkish? Well, I continue to watch anywhere from two to three hours of TV a day, so listening maintenance won't be a problem. I've also continued with my conversation partner - now two years strong. I've mentioned this before, but I'm a pretty strong believer that to get beyond a B2 in a second language, living where the language is spoken is a must. At some point, I want to spend at least a year in Turkey, which should up my level. But for now, I'm quite happy with my level and how long it's taken me to get here.

Starting at the beginning of this year, I decided to take another look at Ojibwe. I've completed the Pimsleur course, and have, in fact, added quite a bit of my own material to better round out the course. My progress with the course and language in general can be seen here. I still have to update the blog with the last couple of lessons, but they're complete.

To complement my Ojibwe studies, I also enrolled in and completed a course on Aboriginal Worldviews and Education through Coursera. The course was offered by the University of Toronto, and the focus was pretty heavy on Canadian issues. I liked the course, overall. I do have some complaints about Coursera, though. During the last week of my class, the Coursera website just collapsed, and I was unable to get to my course at all. As a result, I missed the final test for the course, and took a hit on my grade. Frankly, it's turned me off enough that I don't know that I'd consider taking another course through Coursera. I've found plenty of courses through MIT OpenCourseWare, so I might try them next. I'm not after another degree, but I do like being able to study different courses (and the structured nature of these courses is nice), so the recent surge of MOOCs is nice to see.

So what's on tap for the rest of the year? Well, as I said, I'm certainly going to continue with my Turkish, and I'm also going to continue with Ojibwe. I purchased Anton Treuer's book "Living Our Language: Ojibwe Tales & Oral Histories", which is some pretty fantastic Ojibwe text, also with English translation. That should definitely keep me busy for a long while.

I think I also mentioned in a previous post that I might like to return to Polish. I will probably take that up again in the second half of the year, but I occasionally pull out some material and review it so I don't lose what I've already learned.

Here's to a much brighter April (and beyond)!