Purple Haze – Overdose of Tags

I think tags are one of the most disliked things in Studio. This is particularly true for users who previously translated with Workbench in Word, because there you never saw tags. Tags can be annoying, but it really helps to understand how they function and how they can be handled in Studio. There’s a good blog article by Paul Filkin about handling tags here and the Studio Help also has some good info on the topic.

What’s really annoying are files that have a huge number of tags that don’t have any real meaning for the document. These are often tags that apply a different formatting to spaces between words or turn the same formatting on and off constantly. If there are only a few of them, it’s relatively easy to see that there’s no need to include them in the translation. However, dealing with a large quantity of this purple haze makes it difficult to perceive the actual text and it slows down the translation process. It’s also easier to miss the real tags and the tag verification feature becomes practically useless when there are hundreds of unnecessary warning messages.

These types of tags are common in files converted or copied from PDF format but they can also be easily produced in Word by applying and changing formatting incorrectly, for example by leaving a different formatting in spaces between words. This is very easy to do without realizing it because you don’t see the tags in Word.
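To make the idea concrete, here is a minimal Python sketch of what a tag cleaner conceptually does. It is not how CodeZapper, TransTools or Studio work internally. The `Run` type and the `merge_redundant_runs` function are hypothetical names made up for this illustration, and the sketch assumes that formatting on whitespace is invisible (no underline or highlighting).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """A stretch of text with one formatting (hypothetical model of a Word run)."""
    text: str
    fmt: tuple  # e.g. (("color", "red"),)


def merge_redundant_runs(runs):
    """Merge runs whose formatting change would not be visible to a reader."""
    merged = []
    for run in runs:
        if not run.text:
            continue  # empty runs carry no content at all
        if merged:
            prev = merged[-1]
            # Assumption: formatting on pure whitespace cannot be seen, so a
            # whitespace run simply takes over the formatting of the run before it.
            # Runs with identical formatting are merged as well.
            if run.text.isspace() or run.fmt == prev.fmt:
                merged[-1] = Run(prev.text + run.text, prev.fmt)
                continue
        merged.append(run)
    return merged


red = (("color", "red"),)
black = (("color", "black"),)

# Two words in red with a "black" space between them: three runs in the file.
runs = [Run("Anni", red), Run(" ", black), Run("tanni", red)]
cleaned = merge_redundant_runs(runs)
print(len(runs), "runs before,", len(cleaned), "after")
```

Every boundary between two runs with different formatting shows up as a tag in Studio, so fewer runs means fewer tags in the segment.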

A friend of mine recently asked me if there’s anything she could do to reduce the number of unnecessary tags in her files, so I decided to expand my original reply and share it here as well. I took one of her DOC files (about 1,200 words) and tested various ways to lower the tag count. When I opened the file directly in Studio, there were well over 1,000 formatting tags (see Figure 1). I think this was the worst file I’ve ever seen – in most segments there were two pairs of tags between every word! These were mostly font color and spacing tags that applied a different formatting to spaces or turned the same formatting off and on, and they were obviously completely unnecessary.

Figure 1. The DOC file opened directly in Trados Studio without any prepping. (Note that the original French source text has been replaced with a Finnish children’s poem to protect the confidentiality of the original text. You didn’t miss anything. It was a really boring text, at least compared to Anni and her trip across the lawn to the cellar to fetch butter, milk and potatoes.)

I tried the following three methods:

1. Save the source file as DOCX and select the “Skip advanced font formatting” option in the File Types settings (Tools > Options > Microsoft Word 2007-2010 > Common). This option is not available for DOC or RTF files, so this works only with DOCX files (and PPTX and PDF files). When I opened the file in Studio, there were 118 formatting tags (<cf>) left. About half of them seemed to be unnecessary but they were easy to see and skip.

Figure 2. The same file saved as a DOCX file and opened directly in Trados Studio.

2. Clean the file (DOCX, DOC or RTF) in Word using CodeZapper. CodeZapper is a Word add-in that includes several cleaning functions. When processing my test file, I used the PDFTidy, PDFFix and CZL functions as a combination and did not test them separately or with any of the other functions. CodeZapper turned out to be clearly the most effective method for this file. There were only 62 formatting tags (<cs>) left in the file and they all seemed to be necessary.

Figure 3. The DOC file opened directly in Trados Studio after it was prepped with CodeZapper. The process removed all tags from the sample sentences.

3. Clean the file (DOCX, DOC or RTF) in Word using TransTools Document Cleaner. TransTools is another Word add-in that includes a tag cleaning function. This left 156 formatting tags in the file, and most of them seemed to be unnecessary. As the CodeZapper result above shows, only about 60 formatting tags are needed in this file.

Figure 4. The DOC file opened directly in Trados Studio after it was prepped with TransTools.

Of course, one hopes that clients would include a “tag-clearance” as part of their file prep procedure before sending files to translators. That would not only make translators’ lives easier and improve the quality of the translation and the resulting translation memory, but it would also increase fuzzy match leverage because the unnecessary tags wouldn’t be there screwing up the analysis results and fuzzy matching.
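The effect on fuzzy matching can be illustrated with a toy example. The Python sketch below uses `difflib.SequenceMatcher` from the standard library as a stand-in similarity measure. This is an assumption for illustration only; it is not the algorithm Studio uses, and the sample sentences, the `tokens` and `similarity` helpers and the resulting numbers are made up.

```python
import re
from difflib import SequenceMatcher

# A token is either a tag such as <cf> or </cf>, or a word.
TOKEN = re.compile(r"</?\w+[^>]*>|\w+")


def tokens(segment):
    """Split a segment into tag and word tokens."""
    return TOKEN.findall(segment)


def similarity(segment, tm_segment):
    """Toy similarity score between 0 and 1, computed on tokens."""
    return SequenceMatcher(None, tokens(segment), tokens(tm_segment)).ratio()


# Hypothetical translation memory entry and two versions of a new segment.
tm_source = "Anni fetched butter from the cellar"
clean = "Anni fetched milk from the cellar"
tagged = ("<cf>Anni</cf> <cf>fetched</cf> <cf>milk</cf> "
          "<cf>from</cf> <cf>the</cf> <cf>cellar</cf>")

print("clean segment vs. TM: ", round(similarity(clean, tm_source), 2))
print("tagged segment vs. TM:", round(similarity(tagged, tm_source), 2))
```

The words are the same in both versions of the new segment, but in this toy measure the surplus tags count as differences and pull the score down, which is the same kind of penalty that hurts the analysis results.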



10 Responses to “Purple Haze – Overdose of Tags”

  1. paulfilkin Says:

    Good article Tuomas. Another couple of ideas, one from Jerzy Czopik in a number of forum posts.

    “Press CTRL+A, then CTRL+D, choose Arial and press OK.
    Again press CTRL+A and then CTRL+D. Go to the spacing tab.
    Set the scale of characters to 100% and the spacing to normal, deselect kerning.
    Press OK. Save the document and open in Studio. It will now contain only necessary tags.”

    And another idea, if you don’t mind losing all the formatting, is to just copy and paste the text into a text editor as plain text.

    • Tuomas Says:

      Thanks for pointing out those options, Paul. Those are certainly easy ways to solve the problem in many cases. However, in this case there were a lot of unnecessary font color tags that Jerzy’s method didn’t delete (there were about 160 formatting tags left). Of course, one could also use the same method to set all the text to black (Ctrl+A > Ctrl+D > Font color > Black), but in this case there were colors that needed to be preserved, so it wasn’t an option. However, it can sometimes be faster to delete all the formatting, or even to use the plain text route, and then just apply the formatting afterwards.

  2. Kasia Landsberg-Połubok Says:

    Another option could be to use the format painter and copy the format from some plain text. But this also destroys the formatting. Which might still be better than having to cope with all those Trados tags.

  3. Stefano KaliFire Says:

    Another good set of Word-macros to clean up such files is called FormatFixer.

    • Tuomas Says:

      Thanks Stefano. If I remember correctly, the developer (David Turner) combined the functions of FormatFixer into the new CodeZapper.

      • Stefano KaliFire Says:

        Hi Tuomas, I installed both sets in Word, but I think they are different. FormatFixer includes these macros:

        • Del Lead Tabs and Spaces
        • Del Eccess Spaces
        • Del Para Marks
        • Fix Punctuation
        • Add Space Between Num and Letter
        • PDFTidy

        which I do not see in CodeZapper. My CZ version is 2.6.1… Does the latest CodeZapper version include these FormatFixer macros?

        My method: select all the text in the DOC file, press Ctrl-Spacebar (paulfilkin-method), run the FormatFixer macros, save the file and open it in TS2011.

      • Tuomas Says:

        PDFTidy is in the new one (2.9.1) but I don’t see the others. I’ll check with David.

        Here’s what David had to say:

        “FormatFixer is for fixing formatting and not removing unnecessary tags. CodeZapper and FormatFixer do, however, share the PDFTidy macro which tries to remove hard returns intelligently from line ends.
        It’s true that cleaning up a document often involves fixing formatting and removing rogue tags so I had planned to combine the two macro sets into one, at some stage.”

  4. Stanislav Okhvat Says:

    Good afternoon, Tuomas,

    I have updated Document Cleaner. The new version has much better processing of tags. It also has new options for levelling font face and font size if several fonts or font sizes occur within the same paragraph.

    Information about this new version is summarized on this page: http://www.translatortools.net/news/features_v2.2.html

    Best regards,
