Simple Terminology Check

I was translating a large software project the other day and noticed at one point that I had mixed up the translations for words like file, folder and directory. Don’t ask me how that happened, but by the time I noticed, the incorrect translations were all over the place. Locating them individually would have been time-consuming since these terms appeared in almost every other segment, so I decided to use the QA Checker to find them. This was easy to do with the Regular Expressions function, and the good news is that you don’t need to know or use any regular expressions to do this.

To set up the check:

1. Go to Project Settings and select Verification > QA Checker > Regular Expressions.
2. Select the Search regular expressions check box, if not already selected.
3. Type a brief description or a name in the Description field. This is just for your own information. In this example we are trying to locate all segments where the word “file” is in the source but the Finnish translation does not include the matching term “tiedosto”, so as a description we can just use the word “File”.
4. Type the source language word (“file”) in the RegEx source field and the target language word (“tiedosto”) in the RegEx target field.
5. For the Condition, select Report if source matches but not the target from the pull-down menu.
6. To save the search settings, click Action and select Add item.
7. Create similar searches for other terms, as needed.

That’s it. You can then close the dialog box by clicking OK.

Figure 1. Settings for a search for segments where the source text includes “file” but the target doesn’t include the matching translation “tiedosto”. Note the other similar searches for term pairs “database/tietokan” and “directory/hakemisto” below the “file/tiedosto” search.

When you run the verification (F8), all segments where the source includes the term “file” but the target doesn’t include “tiedosto” will be flagged in the verification results. It worked beautifully in my case, and I had the problems fixed in less than 5 minutes. Another nice thing about this method is that it works well even with Finnish, because you can use just the Finnish word stem without having to worry about the various endings the word might take in the text.
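For those who like to see the logic spelled out, here is a minimal Python sketch of what the “Report if source matches but not the target” condition does. This is only an illustration, not how Studio implements the check. The function name `flag_segments`, the sample segments and the case-insensitive matching are my own assumptions.

```python
import re

def flag_segments(segments, source_pattern, target_pattern):
    """Return the (source, target) pairs where the source matches
    source_pattern but the target does not match target_pattern."""
    src = re.compile(source_pattern, re.IGNORECASE)  # ignoring case is an assumption
    tgt = re.compile(target_pattern, re.IGNORECASE)
    return [
        (source, target)
        for source, target in segments
        if src.search(source) and not tgt.search(target)
    ]

# Made-up sample segments (English source, Finnish target).
segments = [
    ("Open the file.", "Avaa tiedosto."),
    ("The file name is too long.", "Tiedoston nimi on liian pitkä."),  # inflected form
    ("Delete the files.", "Poista kansiot."),         # wrong term: kansio = folder
    ("Edit your profile.", "Muokkaa profiiliasi."),   # false positive ("profile")
]

flagged = flag_segments(segments, "file", "tiedosto")
for source, target in flagged:
    print(source, "->", target)
```

Because the target pattern is just the stem “tiedosto”, the inflected form “Tiedoston” still counts as a match, so the second segment is not flagged. The last segment is flagged only because “profile” happens to contain “file”.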

There were a few false positives caused by words like profile (the matching translation would be profiili). These were easy to skip while going through the verification results since there weren’t many of them. However, it’s also possible to fine-tune the search with the help of “real” regular expressions to look for exact matches only, if needed. You can also run the check in the opposite direction for extra security by using the Report if target matches but not the source option.
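As a rough illustration of what such fine-tuning could look like, the Python sketch below compares the plain search term with a pattern that uses `\b` word boundaries. The pattern `\bfiles?\b` and the sample sentences are my own examples. Word boundaries exist in most regex flavors, but test the pattern in Studio before relying on it.

```python
import re

loose = re.compile(r"file")          # matches "file" anywhere, even inside "profile"
strict = re.compile(r"\bfiles?\b")   # matches only the whole words "file" or "files"

# Made-up sample source segments.
samples = [
    "Open the file.",
    "Delete the files.",
    "Edit your profile.",
    "Select a filename.",
]

# For each sample, record whether the loose and the strict pattern match.
results = [
    (text, bool(loose.search(text)), bool(strict.search(text)))
    for text in samples
]

for text, loose_hit, strict_hit in results:
    print(f"{text:22} loose={loose_hit} strict={strict_hit}")
```

The strict pattern skips “profile”, but it also skips compounds like “filename”, so whether you want it depends on how the term is used in your source text.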

Purple Haze – Overdose of Tags

I think tags are one of the most disliked things in Studio. This is particularly true of users who previously translated using Workbench in Word, because there you never saw tags. Tags can be annoying, but it really helps if one understands how they function and how they can be handled in Studio. Paul Filkin has written a good blog article about handling tags, and the Studio Help also has some good info on the topic.

What’s really annoying are files that have a huge number of tags that don’t have any real meaning for the document. These are often tags that apply a different formatting to spaces between words or turn the same formatting on and off constantly. If there are only a few of them, it’s relatively easy to see that there’s no need to include them in the translation. However, dealing with a large quantity of this purple haze makes it difficult to perceive the actual text and it slows down the translation process. It’s also easier to miss the real tags and the tag verification feature becomes practically useless when there are hundreds of unnecessary warning messages.

These types of tags are common in files converted or copied from PDF format but they can also be easily produced in Word by applying and changing formatting incorrectly, for example by leaving a different formatting in spaces between words. This is very easy to do without realizing it because you don’t see the tags in Word.

A friend of mine asked me recently if there’s anything she could do to reduce the number of unnecessary tags in her files, so I thought I’d expand my original reply and share it here as well. I took one of her DOC files (about 1,200 words) and tested various ways to lower the tag count. When I opened the file directly in Studio, there were well over 1,000 formatting tags (see Figure 1). I think this was the worst file I’ve ever seen: in most segments there were two pairs of tags between every word! These were mostly font color and spacing tags that applied a different formatting to spaces or turned the same formatting off and on, and they were obviously completely unnecessary.

Figure 1. The DOC file opened directly in Trados Studio without any prepping. (Note that the original French source text has been replaced with a Finnish children’s poem to protect the confidentiality of the original text. You didn’t miss anything. It was a really boring text, at least compared to Anni and her trip across the lawn to the cellar to fetch butter, milk and potatoes.)

I tried the following three methods:

1. Save the source file as DOCX and select the “Skip advanced font formatting” option in the File Types settings (Tools > Options > Microsoft Word 2007-2010 > Common). This option is not available for DOC or RTF files, so this works only with DOCX files (and PPTX and PDF files). When I opened the file in Studio, there were 118 formatting tags (<cf>) left. About half of them seemed to be unnecessary but they were easy to see and skip.

Figure 2. The same file saved as a DOCX file and opened directly in Trados Studio.


2. Clean the file (DOCX, DOC or RTF) in Word using CodeZapper. CodeZapper is a Word add-in that includes several cleaning functions. When processing my test file, I used the PDFTidy, PDFFix and CZL functions as a combination and did not test them separately or with any of the other functions. CodeZapper turned out to be clearly the most effective method for this file. There were only 62 formatting tags (<cs>) left in the file and they all seemed to be necessary.

Figure 3. The DOC file opened directly in Trados Studio after it was prepped with CodeZapper. The process removed all tags from the sample sentences.


3. Clean the file (DOCX, DOC or RTF) in Word using TransTools Document Cleaner. TransTools is another Word add-in that includes a tag cleaning function. This left 156 formatting tags in the file, and most of them seemed to be unnecessary: as the previous example shows, only about 60 formatting tags are needed in this file.

Figure 4. The DOC file opened directly in Trados Studio after it was prepped with TransTools.

 
Of course, one would hope that clients include a “tag clearance” step in their file prep procedure before sending files to translators. That would not only make translators’ lives easier and improve the quality of the translation and the resulting translation memory, but it would also increase fuzzy match leverage because the unnecessary tags wouldn’t be there screwing up the analysis results and fuzzy matching.
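To give a feel for why stray tags hurt fuzzy matching, here is a toy Python sketch that compares a translation memory segment with a clean and a tag-laden version of a new segment. It uses the token-based similarity from Python’s standard `difflib` module, which is my stand-in and not the scoring method Studio uses. The sentences, the `<cf>` tag placement and the `similarity` helper are all made up for illustration.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Token-based similarity between two segments, from 0.0 to 1.0.
    A toy stand-in for a fuzzy match score, not Studio's algorithm."""
    return SequenceMatcher(None, a.split(), b.split()).ratio()

# Segment already stored in the translation memory (made up).
tm_segment = "Anni went to the cellar to fetch butter"

# The same new sentence, once clean and once littered with formatting tags.
clean_new = "Anni went to the cellar to fetch milk"
tagged_new = "Anni <cf> </cf> went <cf> </cf> to the cellar <cf> </cf> to fetch milk"

clean_score = similarity(tm_segment, clean_new)
tagged_score = similarity(tm_segment, tagged_new)

print(f"clean:  {clean_score:.0%}")
print(f"tagged: {tagged_score:.0%}")
```

The text differs by only one word in both cases, but the extra tags count as differences and pull the score of the tagged version down noticeably.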

Quality Assurance and Translation Memory Maintenance

You might be interested in this workshop that I will be teaching next week in San Francisco. It’s not a Trados workshop but will cover some Trados Studio QA and TM maintenance issues and many other topics (such as regular expressions) that we all should know.

The workshop will give an overview of QA and TM maintenance functions and tools, and illustrate how they can improve translation productivity and quality when used properly. The main topics covered are:

1. Translation QA

  • Built-in QA functions in CAT tools (such as Trados Studio, memoQ and Wordfast Pro): features, setup and use
  • Stand-alone QA tools, such as QA Distiller, ErrorSpy, CheckMate and Xbench

2. Translation memory maintenance and QA

  • Built-in functions in CAT tools (such as Trados Studio, memoQ and Wordfast Pro) for editing, searching, filtering, importing/exporting TMs
  • Stand-alone TM maintenance/QA tools, such as QA Distiller, ErrorSpy, CheckMate and Xbench
  • Editing translation memories in text editors, such as UltraEdit

3. Use of regular expressions in QA functions/tools

  • How to create your own regular expressions

NOTE: Even though Trados Studio, memoQ and Wordfast Pro are used for many of the examples and demonstrations during the workshop, most of the workshop content is not tool-specific and can be applied to any modern CAT tool.

For more information or to register, visit: http://www.ncta.org/displayconvention.cfm?conventionnbr=11323