Translating PDF Files in Studio

This turned out to be a much longer posting than I had planned, so if you don’t have time to read it all, you might want to read the summary at the end.

I think this is an example of a feature that sounds much better than it actually is, particularly if you don’t know the “deeper essence” of PDF files and expect more from them than they can deliver. A PDF file is usually meant to be the end product; it is not really made to be edited. Unfortunately, we are sometimes stuck with a PDF file as the only source file available, and in order to translate it with Trados or any other CAT tool, one needs to convert it to a Word file. There are various tools for that purpose, and they work better or worse depending on the tool and the PDF file in question. The new PDF file translation feature in Studio is not some new miracle that all of a sudden makes PDF files translatable. It’s just one of those PDF-to-DOC conversion tools, built into Studio as a filter (for details, go to Tools > Options > File Types > PDF > About). Yes, it does provide the convenience of not having to use another conversion tool, but you also lose the flexibility of being able to edit the resulting document (for example, to delete unnecessary hard returns). Of course, you could save the converted file in Studio as a Word file (File > Save Source As…) and then open and edit it in Word before starting the actual translation. However, there goes the simplicity and convenience of not having to use several programs for the conversion…

Anyhow, that much I already knew and for that reason I had not even tried opening PDF files in Studio. I rarely get my source files only in PDF format, and if I do, I convert them with ABBYY Fine Reader or PDF Transformer to Word format and go from there. However, I was recently working on my workshop handouts and thought that I really should try to see how the built-in PDF conversion filter works in real life. And the short answer is… not very well. I tried several different types of PDF files from simple documents (originally Word documents) to more complex ones, and there seemed to be two main problems that often made the converted files practically untranslatable.

The first and most obvious problem was the overabundance of extra tags, as you can see in the screen shot below. These were mostly character spacing or font color tags. Of course, one could leave these out of the translation, but with this many tags, reading the source text becomes somewhat nightmarish, not to mention how the QA results would look with all the “missing tag” warnings. Studio’s PDF filter does allow some adjusting of the conversion process (Tools > Options > File Types > PDF > Settings), but none of the settings seemed to help with this issue. Interestingly, this seemed to be a problem particularly with Word files that had been converted to PDF format using Adobe’s “PDF printer” option in Word (Office 2007). If the same file was converted using the “Save as Adobe PDF” option in Word, there were very few of these tags. So, the original (pre-PDF) source file format and the conversion method both affect the suitability of the PDF file for translation in Studio. Unfortunately, knowing this is not much of a consolation to the translator who’s stuck with the crappy PDF file.

However, when I converted these Word-based PDF files using ABBYY Fine Reader (see the screen shot below), I did not get any of these extra tags, and there seemed to be far fewer problems with erroneous hard returns as well. Those hard returns are the second problem. ABBYY and other conversion tools of similar quality usually do a very good job and don’t insert these highly annoying hard returns in the middle of sentences at line breaks, and if they occasionally do, these are easy to detect and delete in the editing window. What makes the hard returns even more annoying in Studio is the fact that you can’t edit the source segments in Studio’s Editor. In Trados 2007, you could always use the Restore Source command, delete the hard return and continue translating. Can’t do that in Studio. It’s certainly not a happy moment when you find yourself on the last pages of a long document and realize that the segmentation on those last pages is totally off because of the improperly placed hard returns.
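If you end up with a converted text full of these mid-sentence hard returns, part of the cleanup can be automated before the file goes into Studio. The following is only a rough sketch in Python, not something Studio or ABBYY provides. It assumes you already have the converted text as a list of paragraph strings (how you get them out of the Word file is left open). The function name merge_broken_paragraphs and the rule it applies (the previous paragraph does not end in sentence-final punctuation and the next one starts with a lowercase letter) are assumptions made for illustration.

```python
import re

# Sentence-final punctuation, optionally followed by closing quotes or brackets.
# This list is an assumption; adjust it for your source language.
SENTENCE_END = re.compile(r"[.!?:;…][\"'”’)\]]*$")


def merge_broken_paragraphs(paragraphs):
    """Join paragraphs that look like they were split mid-sentence.

    Heuristic (hypothetical, for illustration): a paragraph is glued to the
    previous one when the previous one does not end like a sentence and the
    current one starts with a lowercase letter.
    """
    merged = []
    for text in paragraphs:
        text = text.strip()
        if not text:
            continue  # skip empty paragraphs left over from the conversion
        if merged and not SENTENCE_END.search(merged[-1]) and text[0].islower():
            merged[-1] = merged[-1] + " " + text
        else:
            merged.append(text)
    return merged


if __name__ == "__main__":
    sample = [
        "The purpose of a PDF file is usually",
        "to be the end product.",
        "Summary",
        "Next sentence.",
    ]
    for paragraph in merge_broken_paragraphs(sample):
        print(paragraph)
```

Headings and other lines that start with a capital letter are left alone, so the result still needs a quick check by eye before you start translating.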

So, what’s the bottom line?

1. Don’t think that you can translate PDF files just like Word files even if you can open them in Studio.

2. If you open a PDF file in Studio, review the converted text in the Editor to see if there are problems with tags or hard returns. You can try to fix the problems by adjusting the PDF filter settings in Studio (and then opening/converting the file again) or by first saving the converted source file in Word format in Studio (File > Save Source As…), fixing the problems in Word and then finally opening the fixed Word file for translation in Studio. Remember that you can’t edit the source segments in Studio, so the errors need to be fixed before you start translating the file.

3. If you need to convert PDF files frequently, consider buying a good conversion program, such as ABBYY Fine Reader (or PDF Transformer) or Nuance OmniPage (or PDF Converter). They do the job much better, offer many more conversion options, allow editing and can be used in many other situations as well.

UPDATE (8/22/11): See this newer article about PDF files in Trados Studio 2011.

Workshop Information

Since I seem to have a good audience here, I wanted to advertise a couple of seminars/workshops I will be teaching within the next few weeks. These are not Trados workshops but might still be of interest to you.

As far as Trados workshops go, I will be teaching workshops in Dallas in September and in San Francisco in November. More locations might be added later. I will publish the details here when everything is finalized. If you can’t wait that long or these locations are too far out of the way for you, there’s always the online training option. For details, see my website.

QuickInsert-related Error Message

After installing the most recent Studio update (ver. 9.1.1264.0), I started getting this error message every time I opened an RTF file for translation:

Error thrown by a dependency of object 'QuickTags' defined in 'transformed file [RTF.sdlfiletype]' : Initialization of object failed : Cannot find matching factory method 'Create on Type [Sdl.FileTypeSupport.Framework.IntegrationApi.IconDescriptor].
 while resolving 'constructor argument[0]' to '(inner object)' defined in 'transformed file [RTF.sdlfiletype]'
 while resolving 'Icon' to '(inner object)' defined in 'transformed file [RTF.sdlfiletype]'

The file opened and I was able to translate it, but most of the buttons in the QuickInsert toolbar were inactive and the Preview function didn’t work. Somewhat annoying, because I wanted to use my QuickInsert buttons to enter special characters and character pairs that I had defined. I didn’t care that much about the missing Preview, since I have to run the spell check in Word anyhow because Studio is still missing support for Finnish spell checking (don’t get me started with that again). Anyhow, the good news is that I was able to fix the problem by deleting all the QuickInsert entries that I had created for RTF files. Then I just recreated them and everything was back to normal. I haven’t seen the error message since.

I thought that maybe this was just an anomaly in my system, since SDL Support didn’t have an immediate solution for this. However, just a couple of days later a colleague of mine told me that he had the same problem but with DOC files. I told him about my experience and he was able to fix the problem with the same method.