Translating PDF Files in Studio

This turned out to be a much longer posting than I had planned, so if you don’t have time to read it all, you might want to read the summary at the end.

I think this is an example of a feature that sounds much better than it actually is, particularly if you don’t know the “deeper essence” of PDF files and expect more from them than they can deliver. A PDF file is usually meant to be the end product and is not really made to be edited. Unfortunately, we are sometimes stuck with a PDF file as the only source file available, and in order to translate it with Trados or any other CAT tool, one needs to convert it to a Word file. There are various tools for that purpose, and they work better or worse depending on the tool and the PDF file in question. The new PDF file translation feature in Studio is not some new miracle that all of a sudden makes PDF files translatable; it’s just one of those PDF-to-DOC conversion tools, built into Studio as a filter (for details, go to Tools > Options > File Types > PDF > About). Yes, it does provide the convenience of not having to use another conversion tool, but you also lose the flexibility of being able to edit the resulting document (for example, to delete unnecessary hard returns). Of course, you could save the converted file in Studio as a Word file (File > Save Source As…) and then open and edit it in Word before starting the actual translation. However, there goes the simplicity and convenience of not having to use several programs for the conversion…

Anyhow, that much I already knew, and for that reason I had not even tried opening PDF files in Studio. I rarely get my source files only in PDF format, and if I do, I convert them to Word format with ABBYY FineReader or PDF Transformer and go from there. However, I was recently working on my workshop handouts and thought that I really should see how the built-in PDF conversion filter works in real life. And the short answer is… not very well. I tried several different types of PDF files, from simple documents (originally Word documents) to more complex ones, and there seemed to be two main problems that often made the converted files practically untranslatable.

The first and most obvious problem was the overabundance of extra tags, as you can see in the screenshot below. These were mostly character spacing or font color tags. Of course, one could leave these out of the translation, but with this many tags, reading the source text becomes somewhat nightmarish, not to mention how the QA results would look with all the “missing tag” warnings. Studio’s PDF filter does allow some adjusting of the conversion process (Tools > Options > File Types > PDF > Settings), but none of the settings seemed to help with this issue. Interestingly, this seemed to be a problem particularly with Word files that had been converted to PDF format using Adobe’s “PDF printer” option in Word (Office 2007). If the same file was converted using the “Save as Adobe PDF” option in Word, there were very few of these tags. So, the original (pre-PDF) source file format and the conversion method both affect the suitability of the PDF file for translation in Studio. Unfortunately, knowing this is not much of a consolation to the translator who’s stuck with the crappy PDF file.

However, when I converted these Word-based PDF files using ABBYY FineReader (see the screenshot below), I did not get any of these extra tags, and there also seemed to be far fewer erroneous hard returns, which are the second problem. ABBYY and other conversion tools of similar quality usually do a very good job and don’t insert these highly annoying hard returns in the middle of sentences at line breaks, and if they occasionally do, these are easy to detect and delete in the editing window. What makes this hard return annoyance even more annoying in Studio is the fact that you can’t edit the source segments in Studio’s Editor. In Trados 2007, you could always use the Restore Source command, delete the hard return and continue translating. Can’t do that in Studio. It’s certainly not a happy moment when you find yourself on the last pages of a long document and realize that the segmentation on those last pages is totally off because of the improperly placed hard returns.
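For readers who would rather script the hard return cleanup than do it by hand, here is a minimal Python sketch. It is not something Studio or ABBYY provides. It assumes the converted text has been saved as plain text, and the function name join_broken_lines and the list of sentence-ending characters are made up for illustration. The heuristic is crude: a hard return is treated as spurious whenever the line before it does not end in sentence-ending punctuation, so headings and list items without final punctuation would get glued to the next line and still need a manual check.

```python
# Characters assumed to end a sentence or heading; adjust for your language pair.
SENTENCE_END = (".", "!", "?", ":", ";")


def join_broken_lines(text: str) -> str:
    """Merge lines that were split by a hard return in mid-sentence.

    Returns the text with one paragraph or sentence group per line.
    """
    paragraphs = []
    current = ""
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            # Blank line: treat it as a genuine paragraph break.
            if current:
                paragraphs.append(current)
                current = ""
            continue
        if not current:
            current = line
        elif current.endswith(SENTENCE_END):
            # The previous line ended a sentence, so keep the hard return.
            paragraphs.append(current)
            current = line
        else:
            # The previous line stopped in mid-sentence: glue the lines together.
            current = current + " " + line
    if current:
        paragraphs.append(current)
    return "\n".join(paragraphs)


if __name__ == "__main__":
    sample = "The purpose of a PDF file is usually\nto be the end product.\n"
    print(join_broken_lines(sample))
```

The cleaned text would still have to be brought back into a Word file (or translated as a text file), so this only replaces the manual deleting of hard returns, not the rest of the workflow.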

So, what’s the bottom line?

1. Don’t think that you can translate PDF files just like Word files even if you can open them in Studio.

2. If you open a PDF file in Studio, review the converted text in the Editor to see if there are problems with tags or hard returns. You can try to fix the problems by adjusting the PDF filter settings in Studio (and then opening/converting the file again) or by first saving the converted source file in Word format in Studio (File > Save Source As…), fixing the problems in Word and then finally opening the fixed Word file for translation in Studio. Remember that you can’t edit the source segments in Studio, so the errors need to be fixed before you start translating the file.

3. If you need to convert PDF files frequently, consider buying a good conversion program, such as ABBYY FineReader (or PDF Transformer) or Nuance OmniPage (or PDF Converter). They do the job much better, offer many more conversion options, allow editing and can be used in many other situations as well.

UPDATE (8/22/11): See this newer article about PDF files in Trados Studio 2011.


6 Responses to “Translating PDF Files in Studio”

  1. david Says:

    quick-pdf pdf to word – also good tool for converting pdf into word http://www.quick-pdf.com/

    • Tuomas Says:

      Thanks for the suggestion, David. It looks a bit too simple though… very limited settings available for fine-tuning the outcome. Also, the Help does not explain the settings at all. As far as the outcome goes, I can’t compare because the program keeps on crashing. So, no stars from me.

  2. David S Says:

    Useful article Tuomas!
    There is one thing that we should be aware of, though, and that is that there are two types of PDF files. One is the file that is created directly from a document, and which can be saved as a .txt file. These files can be worked on, and re-saved to Word.
    The other PDFs are those that result from a scanned document, which only OCR software can turn into an editable document, and which are a nightmare to use CAT tools on, thanks to the superabundance of tags.
    In my experience, Trados 2009 can have difficulties in re-exporting and in any event does not provide a very useful TM.

    • Tuomas Says:

      Yes, scanned (graphics-based) PDF files can be very problematic even if they are relatively clear and the text recognition works. However, you can reduce the tag problem by using OCR conversion settings that retain as little of the formatting as possible. In ABBYY FineReader, that would be the “Plain Text” option.

  3. Ricardo Fonseca Says:

    Thanks a million, Tuomas – you just saved me a lot of hours placing text “around” tags (word by word!)

  4. Lionel Says:

    In other words, ask clients to convert pdf to whatever format they prefer and do not take any responsibility for the formatting of the finished target language file. There’s no 100% solution to the pdf problem, no perfect converter. Able2Extract7 is the best one I know of but that’s not perfect either.

