Sunday, March 18, 2012

Batch Converting Many Microsoft Word (.doc) Files to PDF - First Try

I had recently figured out how to batch convert many text files to PDF.  Now I was on a roll.  I wanted to know how to do the same thing with word processing documents produced in Microsoft Word 2003.

The approach used for the text files didn't seem likely to produce good results for Word files.  The text approach used Notepad on the command line.  Notepad would lose all the formatting.  It might actually create a mess.  It seemed there would probably be a better way.

In that previous approach, I had configured my default PDF printer, Bullzip, to shut up and stop asking questions.  So it sailed right through the printing task.  When producing complete garbage, I prefer not to be interrupted -- although, in that case, the output actually seemed OK.  In similar spirit, I wished for a Word conversion process that would just follow orders.

I thought of setting Bullzip in minimal-interruption mode, as in the text file approach, and just selecting a gaggle of Word docs in Windows Explorer, right-clicking, and choosing Print.  Sad to say, Windows 7 was not interested in giving me a print option when I selected more than 15 items.  So I would have to repeat the process with groups of 15 files at a time.  This did not fall within my definition of hassle-free.

Seeking some alternate approach, I did a search.  I was thinking, first, that maybe Word had command line options like Notepad.  Microsoft did not seem to offer any such option.  Others concurred that I would probably need some kind of macro, script, or other third-party solution.

Another search led to some relatively less desirable solutions, such as buying Fineprint or A-PDF or easyPDF SDK (seemingly complicated) or using a combination of VBScript and Automation or uploading Word docs to OCR Convert or using AnyToPDF, which admirably used OpenOffice but would require the system to restart OO for each document being printed.  I found a thread that yielded other possibilities, including an apparent Word command-line possibility after all.  It seemed to require something called Quiet PDF Printer, which I could not locate.

As I was browsing Wikipedia's list of PDF software, not seeing much of relevance, I realized I would much prefer a solution that would use Word, as distinct from some other program, so as to have the greatest likelihood of preserving formatting.  After all, I was not planning to inspect the resulting PDFs closely.  I didn't want to find out, a year down the line, long after I had discarded the original Word docs, that the PDFs were missing the bottom two lines of text, or that important characters were being misprinted or something.  No doubt this approach of opening Word was going to be slow, though, as in the OpenOffice alternative disparaged above.

I saw that same command-line suggestion repeated in another thread, but without any mention of Quiet PDF Printer.  Maybe the first person who mentioned it meant that I should just have a no-hassle PDF printer, like Bullzip with the desired settings.  Anyway, the suggestion was to run this command:

"C:\Program Files\Microsoft Office\Office\winword.exe" "C:\My Documents\doc1.doc" /mFilePrintDefault
Of course, the path to winword.exe would have to be adjusted on some systems, and doc1.doc was just an example.  But the point is, it worked.  One problem:  it left Word running, and another iteration of the command opened another instance of Word.  So unless I wished to have a couple hundred unused Word sessions lounging around, consuming system resources, I would need to kill Word after printing the PDF.  Further reading in that same thread led to a refinement:
"C:\Program Files\Microsoft Office\Office\winword.exe" test.rtf /q /n /mFilePrintDefault /mFileExit
The description seemed to say that (1) those last two items were actually Word's way of calling a macro on the command line; (2) the selection of commands available for such use was visible in this menu pick in Word 2003:  Tools > Macro > Macros > Macros in Word Commands (in ribbon versions of Word, try this key sequence:  Alt-T, M, M); (3) FilePrintDefault and FileExit were two such commands; and (4) if I went into Tools > Options > Print tab and unchecked Background Printing, I would not have Word exiting before the PDF was done printing.
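To make the pieces of that command line easier to see, here is a minimal sketch in Python (not the batch language used in this post) that assembles the same arguments for one document.  The function name and the WINWORD path are assumptions for illustration; the path in particular will differ from one system to another.

```python
from pathlib import Path

# Assumed install location for Word 2003 on 64-bit Windows; adjust as needed.
WINWORD = r"C:\Program Files (x86)\Microsoft Office\OFFICE11\WINWORD.EXE"


def build_print_command(doc, winword=WINWORD):
    """Return the argument list that prints one document to the default printer."""
    return [
        winword,
        str(doc),
        "/q",                  # start Word without the splash screen
        "/n",                  # start a new instance with no blank document open
        "/mFilePrintDefault",  # /m runs a Word command: print with default settings
        "/mFileExit",          # then run FileExit so that Word closes itself
    ]


if __name__ == "__main__":
    # Show the command for the example file named in the thread.
    print(build_print_command(Path("test.rtf")))
```

Keeping the arguments in a list, rather than one long string, means that a file name containing spaces needs no extra quoting when the list is later handed to something like subprocess.run.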

I decided to try that last command line approach.  I made the stated settings changes in Word, and set Bullzip to stun.  Now it was a question of working up the list of commands, for all these Word documents that I wanted to PDF.  Ordinarily, I would have used a combination of DIR and Excel for that purpose, with one command per file, producing a batch file containing many commands.  But spring had arrived and, you know, in spring a man begins to feel powerful urges.  My social life being what it was, this translated into some recent experimentation with looping batch files.  That is, I believed I might be able to devise a batch program that would provide a simpler (or at least more direct) way to run this printing process.  So, from a command prompt in the folder containing my Word docs, I ran a batch file that I called Printit.bat.  That batch file contained just one line, though it wraps over several lines here.  The line was:
FOR /F %%g IN ('dir /b *.doc') DO "C:\Program Files (x86)\Microsoft Office\OFFICE11\WINWORD.EXE" test.rtf /q /n /mFilePrintDefault /mFileExit
Word immediately gave me a message indicating that it had encountered an error.  I wondered if that was because I had a session of Word open before running the batch file.  But that didn't seem to be the answer.  Well, maybe it was because I already had a PDF printout of the first file in the folder.  I had created that PDF during the process of testing this stuff.  Apparently my batch file and/or Word were not going to dilly-dally to ask me about overwriting.  So now I deleted that preexisting PDF and tried again.  No, that wasn't it; I still got the error.  This time, instead of guessing, I clicked its Show Help button and got an explanation:
The file you tried to open was not found. . . . [If the file exists but] does not open, it is either corrupt, locked by another application, or is protected by file permissions.
So, silly me, I looked again at my batch command.  Test.rtf?  WTF was Test.rtf?  I had copied the foolish thing verbatim, without pausing to reflect.  When your professors try to tell you how important it is to master critical thinking, believe them.  They're right.  As it turned out, there were multiple problems with that first try at a batch command.  One of those problems was that, contrary to initial hopes, Word was actually not postponing the next doc until it had closed the previous doc; therefore, it was stumbling over itself.  The solution was a batch file containing this one long line:
FOR /F "usebackq delims=" %%g IN (`dir /b "*.doc"`) DO "C:\Program Files (x86)\Microsoft Office\OFFICE11\WINWORD.EXE" "%%g" /q /n /mFilePrintDefault /mFileExit && TASKKILL /f /im winword.exe
The changes were mainly to add USEBACKQ and to change quotation marks (and use backquotes) accordingly, and also to add the "&& TASKKILL" part.  The && said that the next part (the taskkill) should proceed only after the previous command on the same line (i.e., printing) ran successfully.  From this point, the process ran pretty smoothly.  I found that it did not seem to matter if I already had a Word session active when this ran.  (If there was such a session, I would get a dialog; maybe I should have added another instance of TASKKILL before starting the FOR loop.)  Also, I found that Word would prompt me before overwriting.  I also had an interruption for a problem encountered when the batch file tried to convert a file created in an earlier version of Word.
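For readers who would rather not wrestle with FOR /F quoting, the same loop can be sketched in Python.  This is only an illustration under assumptions:  the function name print_all, the injectable run parameter, and the WINWORD path are mine, not part of the batch file above.  It mirrors the "&&" logic by calling TASKKILL only when Word returns a zero exit code.

```python
import subprocess
from pathlib import Path

# Assumed install location for Word 2003; adjust as needed.
WINWORD = r"C:\Program Files (x86)\Microsoft Office\OFFICE11\WINWORD.EXE"


def print_all(folder, run=subprocess.run, winword=WINWORD):
    """Print every .doc in `folder`, one at a time; return names that failed."""
    failed = []
    for doc in sorted(Path(folder).glob("*.doc")):
        # run() blocks until Word exits, so documents are handled in sequence.
        result = run([winword, str(doc), "/q", "/n",
                      "/mFilePrintDefault", "/mFileExit"])
        if result.returncode == 0:
            # Mirrors "&& TASKKILL": force-close any Word left behind,
            # but only after a successful run.
            run(["taskkill", "/f", "/im", "winword.exe"])
        else:
            failed.append(doc.name)
    return failed
```

As in the batch version, the TASKKILL step closes every running copy of Word, including any session opened by hand, so unsaved work elsewhere would be lost.  The run parameter exists only so that the loop can be tried out with a stand-in function before pointing it at real documents.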

There was another problem.  I got a dialog saying, "There is insufficient memory."  A search led to a Microsoft webpage that said this could result from a cramped paging file, or from some antivirus software or from using floppy disks.  None of these seemed to apply in my case.  Another discussion said that maybe this problem came from a corrupted Normal.dot.  That was a possibility in my case; I had occasional error messages involving Normal.dot.  Another potential cause:  abnormal termination of Word (such as I was doing myself, in this batch file, with TASKKILL), leaving junk in the %Temp% folder (located via Start > Run > %Temp% -- in my case, C:\Users\Ray\AppData\Local\Temp).  Cleaning out the %Temp% folder seemed to help:  there were hardly any memory error messages during the rest of the process.  The process seemed suitably restrained, whether by the "&&" device or otherwise, to the point that (judging from system tray icons) there were usually no more than one or two Bullzip processes underway at once.
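A cautious way to approach that cleanup is to list the likely leftovers before deleting anything.  The following Python sketch does only the listing.  The "~*.tmp" pattern and the 24-hour age threshold are assumptions on my part (I simply emptied the folder by hand), and the function name is hypothetical.

```python
import os
import time
from pathlib import Path


def stale_word_temp_files(temp_dir=None, min_age_hours=24, pattern="~*.tmp"):
    """List files in the temp folder matching `pattern` and not modified recently."""
    # Fall back to the TEMP environment variable, as Start > Run > %Temp% would.
    temp_dir = Path(temp_dir or os.environ.get("TEMP", "."))
    cutoff = time.time() - min_age_hours * 3600
    return sorted(p for p in temp_dir.glob(pattern)
                  if p.is_file() and p.stat().st_mtime < cutoff)
```

Printing the returned paths and reviewing them first seems safer than deleting blindly; each one could then be removed with its unlink() method once Word is closed.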

When the process was done, most but not all of the DOCs had been converted.  I looked at the ones that had not.  (For that, I used an Excel comparison, with VLOOKUP, of filelists obtained by DIR from the input and output folders.)  All gave me an "insufficient memory" error when I tried to open them in Word.  Some seemed to be corrupted to various degrees.  I used Notepad and wReplace to slightly clean up the ones whose corruption prevented them from printing to PDF in a more or less normal fashion.  (In wReplace, the option I used was Replace Many > Open (arrow) > Diacritic to ASCII.)  Several others were printable but had not been printed.  That is, when the batch file was running, Word kept asking me if I wanted to save changes to (or to print; can't remember for sure) a document with a weird name.  The same name, over and over again.  I thought it was some kind of error, since that name wasn't in my file list.  Possibly this problem had something to do with the fact that these documents were originally created on a Mac and then converted.  So I had to PDF those manually.
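The DIR-plus-VLOOKUP comparison can also be sketched in a few lines of Python.  This version assumes that each PDF carries the same base name as its source document, which may not hold if the PDF printer is set to name its output differently; the function name is hypothetical.

```python
from pathlib import Path


def unconverted(doc_dir, pdf_dir):
    """Return names of .doc files that have no PDF with the same base name."""
    # Compare base names case-insensitively, as Windows file names would be.
    pdf_stems = {p.stem.lower() for p in Path(pdf_dir).iterdir()
                 if p.suffix.lower() == ".pdf"}
    return sorted(d.name for d in Path(doc_dir).iterdir()
                  if d.suffix.lower() == ".doc"
                  and d.stem.lower() not in pdf_stems)
```

Calling unconverted with the input and output folders returns the list of documents still needing attention, which is the same information the spreadsheet lookup produced.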

Next, I wanted to take a quick look, to see whether any of the resulting PDFs were actually junk -- whether, for any reason, some of them made it through the process in garbled form.  For that, I took the approach of converting just the first page of each PDF to JPG, and then flipping through them in a photo viewer (e.g., IrfanView).  This process did turn up a few corrupted documents.  I was able to verify that they had been corrupted before I started this process; it did not appear that the steps described here had any effect.
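One possible way to produce those first-page images is sketched below in Python.  It assumes the pdftoppm utility from Poppler is installed and on the PATH; that is not necessarily the tool I used, and the function names and the 72 dpi resolution are assumptions chosen for illustration.

```python
import subprocess
from pathlib import Path


def first_page_command(pdf, out_dir):
    """Build a pdftoppm command that renders page 1 of `pdf` as <out_dir>/<name>.jpg."""
    pdf = Path(pdf)
    return ["pdftoppm",
            "-f", "1", "-l", "1",   # first and last page to render: page 1 only
            "-jpeg",                # write JPEG output
            "-singlefile",          # no page-number suffix on the output name
            "-r", "72",             # low resolution is enough for a quick look
            str(pdf), str(Path(out_dir) / pdf.stem)]


def render_first_pages(pdf_dir, out_dir):
    """Render the first page of every PDF in `pdf_dir` into `out_dir`."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        # check=False: one unreadable PDF should not stop the rest.
        subprocess.run(first_page_command(pdf, out_dir), check=False)
```

A PDF that yields no JPG at all is itself a useful signal, since it probably deserves a closer look along with any whose image appears garbled.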

I wasn't extremely concerned about these documents.  If I had been, I think a modified strategy would have been advisable:  take a quick look through all of the documents, as just mentioned, and then take a closer look at any that seemed important.  It would have been handy, for that purpose (and others), to have image- (and audio-) viewing (or listening) software that would not only display the item in question, but would also let me shove it into various categories with the touch of a key.  In this case, the categories would have been OK and Not OK and Examine More Closely.

3 comments:

raywood

The TASKKILL part of the command turned out to be redundant with the /mFileExit option, in most if not all cases. I left it in there, first, because I overlooked it and, second, when I did return to it, because I thought it couldn't hurt and there might be some situations where it would provide a helpful return to the starting point. Note also that, on one or two occasions when the process hung up because of some glitch in Word or in the file being converted, killing Word manually would allow the process to resume with the next file, and I could then catch at least an obvious failure during the ensuing output check phase.

raywood

A later post provides a more refined solution to the problem addressed here.

M. Faramawy

Many thanks for your solution for converting Word files to PDF files in this easy way.
But the command does not work well.