The trials and tribulations of PDF to ePub conversion

by Tom Gorham on December 6, 2010


I think last week must have been my lucky week. It was only a month ago, thanks to an update to Pages, that I was discovering for the first time the joys of exporting documents to the ePub format. Talk about timely. Just the other day, I was asked for the first time to create an ePub document for an organisation. At least I knew what they were talking about.

The only complication was that I wasn’t creating the ePub from scratch. Instead, I would have to convert it from PDF as the Word document it was based on had been lost.

I suspect that, like many of its kind, the organisation had standardised on PDF some years ago as an ‘accessible’ format for documents on the web, and was now beginning to regret the idea, as a future of mobile devices with smaller screens makes PDF an incomplete solution. Presumably, having seen ePub documents on the iPhone or iPad, it wanted to try out ePub to see if there was demand for documents in this format.

Still, its reasons for choosing PDF rather than HTML for its text-heavy documents were solid: the documents contained lots of footnotes, which aren’t easy to handle in HTML. I offered assurances (a little rashly as it turned out) that ePubs could handle footnotes with ease.

But, still, the task of converting from PDF looked at first a little daunting, until I got lucky again. On the same day I was asked to do the conversion, I was reviewing PDF Converter for this magazine. And there, sitting in the middle of its feature list, was a promise to convert PDFs to ePubs directly. It was all I could do to stop myself going out and buying a lottery ticket – things were definitely going my way.

Perhaps it’s as well I didn’t, as my good fortune started to ebb. The reality of PDF Converter didn’t live up to its promise. It grabbed text from the PDF fine, but ignored images and tables, and its inflexible conversion meant there was no easy way of editing the resulting ePub to add back the missing content, re-link footnotes or add something as simple as a table of contents.

Re-enter Pages, allied to some other built-in Mac OS X features. The first stage in conversion from PDF to ePub was to extract the text from the PDF. While you could always open the PDF in Preview and cut and paste the text, that leaves a lot of unwanted extras, such as page numbers, with it. Handily, Automator includes an action that does a better job in converting PDF to plain or rich text. The latter is generally better, as it maintains more of the PDF’s page structure.

To create a workflow to automatically convert PDFs to text files when they’re dropped into a folder, I opened an Automator Folder Action template and dragged the ‘Extract PDF text’ action to the Action window. This action includes options such as adding page headers and output file names, but I kept it as simple as possible and chose the rich text output, saving to the same folder as the original. The saved action was automatically added to my Library/Workflows/Applications/Folder Actions folder, so that whenever a PDF is added to the folder I specified in the workflow, the conversion will be triggered.

That was the easy part. While I had to do a little bit of tidying up on the converted text, it was in fairly good shape. With ePub, you don’t need to worry about sections or page breaks, or even the table of contents: the styles you apply to the text will handle this for you – just copy them from Apple’s ePub template document (images.apple.com/support/pages/docs/ePub_Best_Practices_EN.zip).
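As an illustration of the sort of tidying up involved, here is a minimal Python sketch that drops stray page numbers from extracted plain text. It rests on an assumption of mine rather than anything Automator guarantees: that a page number appears on a line of its own, either as bare digits or as ‘Page 12’. The function name and the sample text are invented for the example, and a footnote number sitting alone on a line would be removed too, so check the result by eye.

```python
import re

# Assumption: a page number is a line holding nothing but a short number,
# optionally preceded by the word "Page". Real PDFs may need another pattern.
PAGE_NUMBER = re.compile(r"^\s*(?:page\s+)?\d{1,4}\s*$", re.IGNORECASE)


def strip_page_numbers(text: str) -> str:
    """Return the text with page-number-only lines removed."""
    kept = [line for line in text.splitlines() if not PAGE_NUMBER.match(line)]
    return "\n".join(kept)


# Invented sample standing in for text extracted from a PDF.
sample = (
    "First paragraph.\n"
    "12\n"
    "Second paragraph.\n"
    "Page 13\n"
    "In 2010 there were 120 footnotes."
)
cleaned = strip_page_numbers(sample)
print(cleaned)
```

Numbers inside a sentence are left alone, because the pattern only matches when the number is the whole line.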

All I had to do was add the images and reorganise the footnotes. Automator doesn’t have a routine that extracts images, so while there are excellent utilities for doing this, such as File Juicer (echoone.com/filejuicer/formats/pdf), which grabs both images and text from the PDF, I confess I cheated. As I worked through the file in Pages, comparing it with the original PDF in Preview, I simply used Preview’s Select tool to draw a selection around the image to transfer and then copied and pasted it into the relevant spot in Pages as an inline image. It’s a rough-and-ready solution, admittedly, which loses all sorts of detail from the original file. However, while in most situations you’d want to add higher-resolution images if you could source them, trust me, the images I was copying weren’t exactly press quality.

The biggest hassle was translating the footnotes. On conversion, these appeared in the body of the text, and it turned out to be an exhausting task to recreate the footnotes properly in Pages, with the relationships between the text and the footnote having to be manually rebuilt.

There were a few other frustrations along the way. Page breaks are difficult to get right. While they’re automatically applied to the ePub in some circumstances – if you use the Chapter Heading style on text, it will start a new page in the ePub file, for example – Chapter Headings also appear in the table of contents, so it’s a struggle to find a reliable way of simply forcing a page break without adding the item to the table of contents.

This could be a problem with more complicated documents, and the inability to easily fiddle with Pages’ ePub export could prove frustrating. But fiddling is possible. The ePub file that Pages generates is simply a compressed collection of the XML, CSS and image files that make it up.
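Because an ePub is an ordinary zip archive, you can peek inside one without unpacking it. The following is a small Python sketch using the standard zipfile module. The function name is my own, and the file names in the stand-in archive are invented for illustration; they are not necessarily what Pages produces.

```python
import io
import zipfile


def list_epub_contents(epub):
    """Return the names of the files stored inside an ePub (a zip archive)."""
    with zipfile.ZipFile(epub) as archive:
        return archive.namelist()


# Build a tiny stand-in archive in memory so the example runs on its own.
# The file names below are hypothetical.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as demo:
    demo.writestr("mimetype", "application/epub+zip")
    demo.writestr("OPS/chapter-1.xhtml", "<html></html>")
    demo.writestr("OPS/css/book.css", "p { margin: 0; }")
buffer.seek(0)

names = list_epub_contents(buffer)
print(names)
```

With a real file you would pass its path instead of the in-memory buffer, for example list_epub_contents("report.epub"), where report.epub is a placeholder name.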

To edit these files, you first need to change the extension of the ePub file in the Finder from ‘.epub’ to ‘.zip’ and then unzip the file. I couldn’t find an easy way to do this in the Finder, but in the Terminal application you can type unzip followed by a space at the command line, drag the ePub file from the Finder onto the Terminal window and press Return. This unpacks the ePub’s constituent files, by default into your user folder. You can edit the files in an HTML editor, and when you’ve finished you can re-compress the files and folders, remembering to change the file extension back to ‘.epub’.
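The same round trip can be sketched in Python with the standard zipfile module. One detail deserves care when re-compressing: the ePub container format expects the file called mimetype to be the first entry in the archive and to be stored uncompressed, and some readers reject files that break this rule. The function names, file names and the CSS edit below are my own inventions for illustration, and I haven’t checked how Pages lays out its export, so treat this as a rough sketch rather than a recipe.

```python
import os
import tempfile
import zipfile


def unpack_epub(epub_path, dest_dir):
    """Extract every file in the ePub into dest_dir."""
    with zipfile.ZipFile(epub_path) as archive:
        archive.extractall(dest_dir)


def repack_epub(src_dir, epub_path):
    """Zip src_dir back into an ePub, writing mimetype first and uncompressed."""
    with zipfile.ZipFile(epub_path, "w") as archive:
        archive.write(os.path.join(src_dir, "mimetype"), "mimetype",
                      compress_type=zipfile.ZIP_STORED)
        for folder, _subdirs, files in os.walk(src_dir):
            for name in sorted(files):
                full = os.path.join(folder, name)
                rel = os.path.relpath(full, src_dir).replace(os.sep, "/")
                if rel != "mimetype":  # already written above
                    archive.write(full, rel, compress_type=zipfile.ZIP_DEFLATED)


with tempfile.TemporaryDirectory() as work:
    # Hypothetical stand-in for a real export, so the example runs on its own.
    original = os.path.join(work, "original.epub")
    with zipfile.ZipFile(original, "w") as demo:
        demo.writestr("mimetype", "application/epub+zip")
        demo.writestr("OPS/css/book.css", "p { margin: 0; }")

    unpacked = os.path.join(work, "unpacked")
    unpack_epub(original, unpacked)

    # An illustrative edit: append a rule to the (invented) stylesheet.
    css_path = os.path.join(unpacked, "OPS", "css", "book.css")
    with open(css_path, "a", encoding="utf-8") as css:
        css.write("\nh1 { page-break-before: always; }")

    edited = os.path.join(work, "edited.epub")
    repack_epub(unpacked, edited)

    # Check the result: mimetype should come first and be stored, not deflated.
    with zipfile.ZipFile(edited) as check:
        first = check.infolist()[0]
        first_name = first.filename
        first_is_stored = first.compress_type == zipfile.ZIP_STORED
        edited_css = check.read("OPS/css/book.css").decode("utf-8")

print(first_name, first_is_stored)
```

Writing the archive directly with the ‘.epub’ extension also sidesteps the renaming step, since the extension is only a label on what is still a zip file.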

On the plus side, footnotes are handled very well by Pages when it creates ePub documents. In a standard word processing document, footnotes are added to the bottom of the page where the relevant text occurs. In the ePub, footnotes are gathered at the end of the chapter – a sensible approach, given the size of the screen. Even better, the resulting ePub creates a link between text and footnotes, so you can navigate quickly between them.

I’d love to say the job was unequivocally worth it. The results looked great on an iPhone, but it took a good morning’s worth of cutting, pasting and adjusting to convert a 60-page document with 120 footnotes, so it’s not something I’d like to do for a library of documents.

The ePub is now available for download on the web and I’ve kept a keen eye on its popularity. Although it’s too early to properly judge, only three days after uploading, it’s fair to say that there hasn’t exactly been a digital stampede for the file: it has been downloaded a modest six times so far.

So the message is clear: while I’m sold on an ePub future, the rest of the world will take some convincing.

  • actess

    Dear Tom,
    This was a great article. I have used Pages to convert documents into ePubs, but have not yet figured out how to avoid the extra blank pages it inserts into my final document. Any help would be appreciated.
    Thanks,
    AC

  • tomgorham

    Thanks Actess – and sorry for the delay in replying – I didn’t spot your comment. Could you give some more details on what’s happening? My guess is that it’s something to do with sections, but I’d be happy to have a look at the file if it would help.
