Problem: Convert PDF to reflowable text, preferably HTML.
Why: Text that reflows based on the size of the screen, the size of the font, or the length of a page (or, indeed, that dispenses with the concept of a page entirely) is what suits mobile ebook readers, with their smaller screens, best.
Apologies for two Geekery posts in a row. The rest of the discussion is under the fold.
Hard things: In PDF, everything is in a fixed position. Pages have fixed lengths and widths; fonts have fixed sizes. The most a viewer can do is magnify the entire page; but that means scrolling left to right because the text has wandered off the screen, unless the PDF is sized for a roughly 2.5″ x 4″ screen.
The only structural unit that PDF provides is the line of text (usually). Processing PDF is thus a line-by-line affair at best. If you could only read ahead line by line, how would you interpret
a series of
very short lines
that don’t wrap
like the dialogue of many works
of literature? Is this simply a fancy
paragraph around an image, or something
else? Or how would you interpret a
page break? Is it another paragraph or not? What if the author was in the middle of poetry? And then there’s the dreaded
——————————–
and we’ve not yet brought in the issue of margins, images, dividers, and such.
In other words, PDF expresses display more than structure, whereas other ebook formats emphasize structure over display.
Approaches: You can do a simple conversion of PDF and just have line breaks all over the place, with headers, footers, margin notes, footnotes, chapters, and paragraphs all munged together. This is generally not ideal for reading.
Or you can attempt to divine which lines belong to which paragraphs based on specific rules, but these rules differ from PDF to PDF.
Simply put, you can’t properly translate PDF to a reflowing format without understanding the positional relationship that serves to express text structure in place of actual distinct elements.
Thus, if you aren’t actually writing a PDF parser, you want to work with a format that records the position, height, width, etc., of each line of text, and what text is on what pages. And then you can deduce certain relationships about the lines in relation to one another.
The most useful deductions:
- indentation
When you know the x coordinate of where a line begins, you can tell if something’s been indented from the left, because its x coordinate is larger than that of most other lines.
Discovering the indentation amount relative to the page’s left margin will let you determine paragraph breaks reliably in almost every case, from short lines of dialogue to paragraphs that cross page breaks.
The x value may vary within a small range, due to left/right page alternation (a right page and a left page in a book will have different left-margin widths) or to character widths (an opening double quote often results in a one-pixel difference).
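As a quick sketch of this deduction (the 72/108 values come from the sample discussed below, and the tolerance is a guess on my part):

require 'rexml/document'

LEFT_MARGIN = 72   # x of a normal, unindented line in the sample PDF
INDENT_X    = 108  # x of an indented, paragraph-opening line
TOLERANCE   = 2    # absorbs quote-glyph and left/right-page jitter

# Is this pdf2xml <text> element indented at all?
def indented?(text_element)
  text_element.attributes['x'].to_i > LEFT_MARGIN + TOLERANCE
end

# Does this pdf2xml <text> element open a new paragraph?
def paragraph_start?(text_element)
  (text_element.attributes['x'].to_i - INDENT_X).abs <= TOLERANCE
end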
- spacing
If you know the y coordinate of every line of text, you can determine the amount of spacing between each line.
Larger amounts of spacing usually mean scene breaks without a scene-division element (e.g., “* * *”), although you have to take the bottom and top margins of pages into account in your calculations. And if your text has no indentation at all, paragraphs have larger spacing between one another than lines within a paragraph do.
Smaller amounts of spacing can help you detect poetry, which is a single paragraph with line breaks (and thus each line may be spaced closer to the next than paragraphs are), but not always.
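A rough sketch of this deduction (the line height and both thresholds are illustrative guesses; a real PDF needs its own measured values):

# page is one pdf2xml <page> element, as in the script below.
# Classify the vertical gap between consecutive lines on the page.
line_height = 15  # assumed; in practice, measure the most common gap
prev_y = nil
page.elements.each('font/text') do |text|
  y = text.attributes['y'].to_i
  if prev_y
    gap = y - prev_y
    if gap > line_height * 1.8
      puts '<p style="text-align:center;">* * *</p>'  # likely a scene break
    elsif gap < line_height * 0.9
      # unusually tight: a poetry candidate; consider <br/> line breaks
    end
  end
  prev_y = y
end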
- fonts
pdf2xml preserves PDF font information (family, weight, variant, size) along with the text, creating a separate text-block structure around specially formatted text, much as HTML separates out text in a different font with various tags. This can be used to detect chapter headings.
Indeed, a strange symbol font often designates decorative characters used for scene breaks and the like.
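A hedged sketch of this deduction (whether pdf2xml exposes the size as a size attribute is an assumption on my part; the face attribute is the one my script below actually uses):

# page is one pdf2xml <page> element, as in the script below.
# Flag font blocks that look like headings or decorative dividers.
body_size = 12  # assume the body text size was measured beforehand
page.elements.each('font') do |font_block|
  face = font_block.attributes['face'].to_s
  size = font_block.attributes['size'].to_i
  if size >= body_size * 1.5
    # much larger than body text: likely a chapter heading
  elsif face =~ /symbol|dingbat/i
    # symbol font: likely decorative scene-break characters
  end
end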
A truly smart PDF to HTML converter would read the text in and use heuristics to determine the particular rules a particular PDF is using.
More practically speaking, a simpler yet still smart PDF converter would allow the user to input parameters and judge the result.
Really hard things: Footnotes, margin notes, flush-left scene dividers that use text, and flush-left headers and footers. Those definitely need to be pruned out by specific parameters (“if a line of text looks like ‘War and Peace’ followed by a page number in italics, ignore…”).
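A sketch of what such a parameter could look like (the pattern is just the “War and Peace” example above; a real converter would take these from the user):

# User-supplied patterns for running headers and footers to discard.
prune_patterns = [
  /\AWar and Peace\s+\d+\z/,  # book title followed by a page number
]

def prune?(line_text, patterns)
  patterns.any? { |pattern| line_text.strip =~ pattern }
end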
Sample pdf2xml output:
Mobipocket Creator’s PDF import function uses pdf2xml, and will leave both an HTML file in the project directory (aside from the images it extracts from the PDF) and an XML file. The XML looks like this:
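(Schematically; the wording and exact numbers here are placeholders, but the structure is what my script below reads: page elements holding font blocks with a face attribute, which in turn hold text elements with x and y coordinates.)

<pdf2xml>
  <page number="1">
    <font face="Times New Roman">
      <text x="72" y="115">An opening line, flush with the left margin,</text>
      <text x="108" y="130">and an indented line starting a new paragraph,</text>
      <text x="72" y="145">which then continues back at the margin.</text>
    </font>
    <font face="Times New Roman,Italic">
      <text x="131" y="160">an italicized run with its own coordinates</text>
    </font>
  </page>
</pdf2xml>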
Compared to the PDF original:
We can see that the normal left-margin width seems to be 72 pixels (this is just page 1; page 2 may show a small variation if it’s a right-side page). The first paragraph isn’t indented, which is slightly problematic, but the second and further paragraphs are indented to around 108 pixels (36 pixels in from the left margin).
We can also see that italicized text carries the x and y coordinates of where the italicized run begins, which means we’ll sometimes see x values larger than 108 or 72, attached to italicized or otherwise specially formatted text.
From the XML, we see that all the lines are organized into pages. So what does a page break look like?
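(Schematically again, with placeholder wording:)

    <font face="Times New Roman">
      <text x="72" y="650">a paragraph that runs right up to the bottom edge,</text>
      <text x="303" y="672">4</text>
    </font>
  </page>
  <page number="5">
    <font face="Times New Roman,Italic">
      <text x="250" y="37">The Stepsister Scheme</text>
    </font>
    <font face="Times New Roman">
      <text x="72" y="60">and continues at the top of the next page.</text>
    </font>
  </page>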
That “4” is a page number in the footer, and we can see the header in the next page. Because the header and footer are fixed from page to page, we can rule them out based on their y coordinate on the page (672 and 37 respectively).
We also see that the x coordinates of the actual text lines continue to follow the 72/108 rule, and thus we can distinguish paragraphs that cross pages from paragraphs that truly end at the bottom of a page.
A template Ruby program:
This is similar to the Ruby code I used to turn the Stepsister Scheme preview PDF into HTML; after pdf2xml, I trimmed away the chapter heading and title (inserting my own versions by hand afterwards), and used this script on the leftover XML:
#!/usr/bin/ruby
require 'rexml/document'

doc = REXML::Document.new STDIN

# I find it easiest to start off with a beginning open paragraph,
# because then I can treat all text as if it were all in the middle of
# processing rather than worrying about beginning and end conditions.
puts "<p>"

doc.elements.each('pdf2xml/page') do |page|
  page.elements.each('font') do |font_block|
    # Determine if we're in an italic or bold block of text, which may
    # include multiple lines, but is a more or less cohesive unit on a page.
    prefix = suffix = ''
    face = font_block.attributes['face']
    if face =~ /italic/i
      prefix = '<em>'
      suffix = '</em>'
    end
    if face =~ /bold/i
      prefix = '<strong>' + prefix
      suffix += '</strong>'
    end
    puts prefix

    # Process the lines (or partial lines, if this is an italic or bold
    # block of text that begins within a paragraph).
    font_block.elements.each('text') do |text|
      output = ''

      # The y coordinate....
      case text.attributes['y'].to_i
      when 37, 672
        # The header and footer lines discussed above; skip them.
        next
      end

      # The text content, looking for scene breaks (they're fortunately
      # delimited in this PDF; in others, I would be comparing this line's
      # y value with the previous line's to determine scene breaking).
      #
      # Scene breaks very definitely break paragraphs as well, so we end
      # the possible em/strong tag(s), and begin a new paragraph and open
      # with the possible em/strong tag(s) as well.
      case text.text
      when /^\s*#\s*$/
        puts <<-END
#{suffix}</p>
<p style="text-align:center;margin:1em 0;">* * *</p>
<p>#{prefix}
        END
        next
      end

      # The x coordinate...
      case text.attributes['x'].to_i
      when 108
        # The indentation as discussed above, indicating a paragraph break.
        output = "#{suffix}</p><p>#{prefix}#{text.text}"
      else
        output = text.text
      end
      puts output
    end
    puts suffix
  end
end

# And close anything left open; we'll have closed every possible suffix
# by the time we reach the end.
puts "</p>"
This left quite a few empty tags, which I subsequently removed through the regular-expression features of Vim (Vim’s regular expressions can actually match across newlines, represented by \n, which is exceedingly helpful). Alternatively, all the text could be collected in a string instead of spat immediately to standard output, and then run through a multi-line global substitution or three.
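For instance (lines standing in for the collected output; the substitutions are illustrative):

# Collect everything into one string, then clean up in a pass or two:
# drop empty <p>/<em>/<strong> pairs and collapse runs of blank lines.
html = lines.join("\n")
html = html.gsub(%r{<(p|em|strong)>\s*</\1>}m, '')
html = html.gsub(/\n{3,}/, "\n\n")
puts html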
And that’s how I get pleasantly readable HTML (converted to Mobipocket for my Kindle) out of PDFs.
The example used throughout this article was Jim C. Hines’ The Stepsister Scheme PDF sample of chapter 1, which you can download from his site for free.
Of course you could also use one of the commercial programs to convert PDF into readable text – I happen to have Iceni’s Gemini, which does quite a good job here – but I admit it’s quite expensive, and I wouldn’t invest in it if it weren’t needed for my job.
Keep up the good work – and maybe you could release a Windows executable one day? I’m bound to that platform by the software I have to use, and lately I’ve become too lazy to boot my secret Linux partition.
BTW, how does your script handle non-English encodings, like UTF-8 or others?
Hi Wasaty,
I was thinking more of reflowable HTML text. It’s really easy to convert PDF straight to a presentation-style HTML that has no regard for text structure—see the free pdftohtml, which does an incredibly good job of this.
Ruby has a Windows implementation you can download and play around with. There’s a one-click installer and everything.
For UTF/Unicode, Ruby handles such encodings out of the box; in fact, it was initially developed in Japan, so it’s one of the few scripting languages that was prepared for them from the start. :)
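(A quick illustration, assuming Ruby 1.9 or later, where every string carries an encoding:)

text = File.read('book.xml', encoding: 'UTF-8')
puts text.encoding         # => UTF-8
puts text.valid_encoding?  # => true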
There was a time when I fought many battles against a powerful enemy – the PDF files. (Sorry, too much Stargate and my favorite Teal’c.)
I think I’ve worked with every free PDF-to-text extraction utility, and while there are some useful ones, nothing really beats commercial tools if you want to retain both paragraph structure and some formatting, like italics and bold text – sometimes my clients expect me to translate the contents of PDF files and deliver the translation with the exact formatting of the original.
And thanks for the ruby link.
Wasaty,
I can believe that. I’ve tangled with PDF for years, and back when I was in college (or even just a few years ago), free tools weren’t very good.
Though life is easier these days, with pdf2xml out there and scripting languages that can parse XML; the chances of a free tool producing commercial-quality text are much better now. It’s a matter of brutally applying heuristics, after all….