To Digital in a Day: Act I

Sat 1:40 PM

I receive a Word and RTF document from a source I won’t disclose just yet. I asked for both because sometimes Word documents don’t open at all on my Mac.

It is 127,000 words long. The length itself doesn’t matter much to me, except that it affects how many CSS classes my various HTML generators might create and, more to the point, how much time this takes.

Sat 1:45 PM

Open Word document successfully in both OpenOffice and TextEdit, save both as different HTML files. I try the Word document first because it preserves smart quotes. This will be the first time I’ve tried an HTML save from OpenOffice.

On to decipheration and conversion.

Sat 1:50 PM

Dear OpenOffice,

This is not how you win my love.

Not using the OO-generated HTML.

Sat 1:51 PM

Dear TextEdit,

You could use some work, but your otherwise near-compliant HTML 4, coupled with distinct, if overly thorough, CSS directives, will assist greatly when I further process this HTML into text for EPub that will pass EPubcheck and Adobe Digital Editions.

Proceeding with TextEdit-generated HTML.

To work!

Sat 1:55 PM

Using Ruby-Epub’s epub tool to create a work directory that will become the Epub book.
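
For readers without Ruby-Epub, here is a minimal Python sketch of what an EPUB work directory has to contain before any content goes in. It is not a description of what the epub tool itself generates: the function name make_epub_skeleton and the OEBPS folder name are assumptions for illustration. Only the mimetype string and the container.xml format come from the EPUB container rules.

```python
from pathlib import Path

# META-INF/container.xml tells a reader where the package file lives.
# The path "OEBPS/content.opf" is a common convention, assumed here.
CONTAINER_XML = """<?xml version="1.0"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""


def make_epub_skeleton(root: Path) -> None:
    """Create the fixed parts of an EPUB work directory under root."""
    (root / "META-INF").mkdir(parents=True, exist_ok=True)
    (root / "OEBPS").mkdir(exist_ok=True)  # hypothetical content folder
    # The mimetype file holds exactly this string, with no trailing newline.
    (root / "mimetype").write_text("application/epub+zip", encoding="ascii")
    (root / "META-INF" / "container.xml").write_text(CONTAINER_XML, encoding="utf-8")
```

The HTML, stylesheet, and package file would then go into the content folder.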

Sat 2:00 PM

Things I proceed to do with the powers of MacVim:

Kill the <meta> lines.

Replace the <title> with the actual name of the book.

Remove the generated ToC; we’ll be creating a new linked one later.

Examine HTML/CSS for anything repeated, redundant, or otherwise not useful. Often includes extraneous/repeated CSS classes, extra linebreaks between paragraphs, overcompensating HTML, empty bold/italic/etc tags.

Important note: this is all different from document to document, even by the same author. Generators are thorough, but not all that smart.
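
The empty-tag and extra-linebreak cleanup can also be scripted. This is a rough Python sketch rather than the Vim workflow used here, and the function names and sample markup are made up for illustration. It only removes tag pairs with nothing at all between them, so a span holding a tab or a space survives.

```python
import re

# An inline tag pair with nothing inside, e.g. <b></b> or <span class="s1"></span>.
# The \b keeps <br> and <body> from being mistaken for <b>.
EMPTY_INLINE = re.compile(r"<(b|i|u|span)\b[^>]*></\1>")


def strip_empty_inline(html: str) -> str:
    """Remove empty inline pairs, repeating so nested empties also disappear."""
    previous = None
    while previous != html:
        previous = html
        html = EMPTY_INLINE.sub("", html)
    return html


def collapse_blank_lines(html: str) -> str:
    """Squeeze runs of blank lines down to a single newline."""
    return re.sub(r"\n\s*\n+", "\n", html)


sample = '<p class="p9">Hello<b></b> <i><b></b></i>world</p>\n\n\n<p class="p9">Next</p>'
cleaned = collapse_blank_lines(strip_empty_inline(sample))
```

Which tags count as junk still has to be decided per document; the tuple of tag names above is only a starting point.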

Serious text search and replace follows with regular expressions. Note to those not familiar with regular expressions: what follows in this section will make no sense to you. But here’s what it means:

  • I take a lot of care to convert things so they mean what they’re supposed to mean (like distinguishing scenebreaks from letters from typeset characters from normal text). Many of my fellow hand-converters and all of my fellow generators do not do this.

  • But I also have tools in my hands that allow me to take care of these in seconds when I find them and can determine the patterns.

  • Really, I spend most of my time investigating and understanding the structure and style of the text, although it doesn’t mean I have to read all of the text—just enough.

  • If you’re feeling guilty, it’s not your fault. Like I said, generators are stupid. There is sometimes nothing that convinces them that surrounding black text with more black text is redundant. (We’re a long way from The Singularity.)

We now descend into geekery. You can skip over this if you like.

Vim commands:


\n// :%s/

\n// :%s/[^<]*//g :%s///g :%s///g :%s///g

An interesting case to mention: there are a few places where a break/tab, instead of a paragraph tag, is used. These must be replaced appropriately.



Manual replacement is needed in some cases.

Removing  because black is still black.
Removing  because Lucida Grande is still Lucida Grande.

Meanwhile, I make note of which CSS classes really matter. They often need to be replaced by appropriate HTML tags for structuring (often they’re chapter headings, for instance), but sometimes they’re needed for special fiction formatting.

If I run into a CSS class with a semantic difference that matters in this way, I rename it to an indication of what it means (such as changing “p15” to “scenebreak”).
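
A rename like that has to hit both the stylesheet selector and every class attribute. Here is a small Python sketch of the idea; the helper name rename_class is hypothetical, and it assumes each paragraph carries a single class in double quotes, which is how the generated HTML quoted above looks.

```python
import re


def rename_class(document: str, old: str, new: str) -> str:
    """Rename a CSS class in the stylesheet selectors and the class attributes."""
    # Selector: "p.p15 {" becomes "p.scenebreak {". The \b stops p1 matching p15.
    document = re.sub(rf"\.{re.escape(old)}\b", f".{new}", document)
    # Attribute: class="p15" becomes class="scenebreak".
    document = re.sub(rf'class="{re.escape(old)}"', f'class="{new}"', document)
    return document


page = (
    "<style>p.p15 {text-align: center}\np.p1 {margin: 0}</style>\n"
    '<p class="p15">* * *</p><p class="p1">Text</p>'
)
renamed = rename_class(page, "p15", "scenebreak")
```

The selector pattern would also touch body text that happens to contain ".p15", so it is worth eyeballing the result.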

Note: this is where I also find out where paragraph classes no longer occur because they had surrounded empty bold/italic/whatever tags. I delete them from the stylesheet.

Now, paragraphs:

(change p.p2 in stylesheet to p.title)
(change p.p3 in stylesheet to p.byline)
(change p.p7 in stylesheet to p.chapter)
(change p.p10 in stylesheet to
(change p.p15 in stylesheet to p.scenebreak)
(p.p19, p.p22, p.p25 mean the same thing as p.p15, remove)
(change p.p21 in stylesheet to p.monospace)
(change p.p27 in stylesheet to p.end-text)
(change p.p33, p.p35, p.p37, p.p38 to p.centered)
(p.p39 has larger fonts, but since this matters less on mobile readers, I'll keep the centering and make the font size normal.  This can simply be done by merging the class with the "centered" class.)
(Redundant paragraph classes that just mean normal text, strip)
:%s/^<p class="p9"/<p/
:%s/^<p class="p11"/<p/
:%s/^<p class="p12"/<p/
:%s/^<p class="p17"/<p/
:%s/^<p class="p18"/<p/
:%s/^<p class="p20"/<p/
[20 more, not covering here]
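
With this many near-identical substitutions, one pass over a list of class names does the same job. The Python sketch below is an alternative to typing each command, not what was actually run; the list holds only the six classes named above, and the function name is made up.

```python
import re

# Classes that turned out to mean "normal text" in this particular document.
REDUNDANT = ["p9", "p11", "p12", "p17", "p18", "p20"]


def strip_redundant_classes(html: str, classes: list[str]) -> str:
    """Turn <p class="pN" into <p at line starts, for every class listed."""
    alternation = "|".join(re.escape(name) for name in classes)
    pattern = re.compile(rf'^<p class="(?:{alternation})"', flags=re.MULTILINE)
    return pattern.sub("<p", html)


sample = '<p class="p9">One</p>\n<p class="p15">* * *</p>\n<p class="p20">Two</p>'
stripped = strip_redundant_classes(sample, REDUNDANT)
```

Like the Vim commands, this only matches a paragraph tag at the start of a line.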

Many spans will be eaten in the belly of the Slorg, because they are often redundant once their surrounding paragraph becomes an h2 or something. (Amusing alternative: or it becomes a link, and therefore underlining it and marking it in blue is not necessary…. and often unreadable on grayscale readers.) And some are just redundant, and were removed in one of the previous steps.

Another interesting case comes up: Apple-tab-spans that create a list. This is a little troubling, because there are plenty of mobile readers that can’t deal with HTML lists, so I need to be creative. In the end I keep the bullets as explicit text and shift-right the paragraphs with another CSS class. I remove the tabs as well in this instance.
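
Here is a hedged Python sketch of that list workaround. It assumes each list item is one paragraph on its own line, with the bullet character wrapped in Apple-tab-span elements; the class name listitem and the function name are invented for the example.

```python
import re

TAB_SPAN = re.compile(r'<span class="Apple-tab-span">\s*</span>')
BULLET_PARAGRAPH = re.compile(r'<p(?: class="[^"]*")?>\s*•\s*')


def flatten_list(html: str) -> str:
    """Keep bullets as text, drop their tab spans, and tag the paragraph."""
    lines = []
    for line in html.split("\n"):
        without_tabs = TAB_SPAN.sub("", line)
        # Only bullet paragraphs change; other lines keep their tab spans.
        if BULLET_PARAGRAPH.match(without_tabs):
            line = BULLET_PARAGRAPH.sub('<p class="listitem">• ', without_tabs, count=1)
        lines.append(line)
    return "\n".join(lines)


sample = (
    '<p class="p30"><span class="Apple-tab-span">\t</span>•'
    '<span class="Apple-tab-span">\t</span>First item</p>\n'
    '<p class="p1"><span class="Apple-tab-span">\t</span>Indented text</p>'
)
flattened = flatten_list(sample)
```

A stylesheet rule such as p.listitem with a left margin would then supply the shift to the right.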

Other ways I could have gone: replaced the tabs with multiple non-breaking spaces, risked using HTML lists, or used floating divs with set widths.

It’s not perfect, but few things dealing with lists are.

Sat 3:21 PM

Now I clean up the stylesheet itself to remove extraneous CSS directives. Like, for instance, setting margins to 0, or resetting the font to the same one in every class. Or, um, setting left/right margins and indentation on a piece of text that’s going to be dead-centered anyway. Stupid generators.
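
Pruning one rule at a time can be sketched in Python like this. The declaration format (four-value px margins, a font shorthand) is an assumption about what the generator emits, and prune_rule is a hypothetical helper. Which declarations count as redundant remains a per-document judgment call.

```python
import re

ALL_ZERO = re.compile(r"(0(\.0+)?(px)?\s*)+")


def prune_rule(rule_body: str, default_font: str) -> str:
    """Drop declarations that restate defaults: all-zero margins, the usual font."""
    kept = []
    for declaration in rule_body.split(";"):
        declaration = declaration.strip()
        if not declaration:
            continue
        name, _, value = declaration.partition(":")
        name, value = name.strip(), value.strip()
        if name == "margin" and ALL_ZERO.fullmatch(value):
            continue  # every margin value is zero
        if name == "font" and value == default_font:
            continue  # same font as everywhere else
        kept.append(f"{name}: {value}")
    return "; ".join(kept)


body = "margin: 0.0px 0.0px 0.0px 0.0px; font: 12.0px Lucida Grande; text-align: center"
pruned = prune_rule(body, "12.0px Lucida Grande")
```

A rule that comes back empty is a candidate for deletion, along with its class.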

Sat 3:30 PM

Now I start replacing things like p.chapter with their structural elements. I also add some style of my own to distinguish structural elements of different types.

p.title becomes h1.title (and I strip out the bold tags).

p.byline stays that way, but I increase the font size and make it bold.

p.chapter becomes h2.chapter. Or,




I bold the p.end-text.
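
The paragraph-to-heading swaps can be sketched in Python as well. The helper name promote and its strip_bold flag are assumptions for illustration, and the pattern expects the class attribute to be the only attribute on the paragraph.

```python
import re


def promote(html: str, css_class: str, tag: str, strip_bold: bool = False) -> str:
    """Replace <p class="X">...</p> with <tag class="X">...</tag>."""
    pattern = re.compile(
        rf'<p class="{re.escape(css_class)}">(.*?)</p>', flags=re.DOTALL
    )

    def to_heading(match: re.Match) -> str:
        text = match.group(1)
        if strip_bold:
            # Drop <b> wrappers; the heading style carries the weight instead.
            text = re.sub(r"</?b>", "", text)
        return f'<{tag} class="{css_class}">{text}</{tag}>'

    return pattern.sub(to_heading, html)


page = '<p class="title"><b>My Book</b></p>\n<p class="chapter">Chapter 1</p>\n<p>Text</p>'
page = promote(page, "title", "h1", strip_bold=True)
page = promote(page, "chapter", "h2")
```

The matching stylesheet selectors (p.title, p.chapter) need the same change by hand or with a second substitution.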

Any <br> left over must be replaced by the XML-compliant <br />.

Hyperlinks have been changed by the RTF filters to explicitly list the URL alongside the anchor text and remove the anchor tags, so I change all that back to the way it was.

I scan for missing images. The more images authors use, the harder life becomes for me, but fortunately there’s just the one, the Book View Cafe logo. (To get at it, I needed the OpenOffice conversion, because it extracts the images to files.) I add it back, centered.

I add the proper UTF-8 encoding declaration at the top. (Sometimes I get ISO-encoded files; I have to watch out for that, and use the right one.)

I finish up by adding the proper namespace for the outermost tag.
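
For the last two steps, here is a minimal Python sketch. It assumes the encoding declaration meant here is the XML prolog on the first line, and the function name is hypothetical. The namespace URI is the standard XHTML one.

```python
import re

XHTML_NS = "http://www.w3.org/1999/xhtml"


def add_prolog_and_namespace(html: str, encoding: str = "UTF-8") -> str:
    """Add the XML declaration and the XHTML namespace if they are missing."""
    # Give the outermost <html> tag the namespace.
    if "xmlns=" not in html:
        html = re.sub(r"<html\b", f'<html xmlns="{XHTML_NS}"', html, count=1)
    # Put the XML declaration on the very first line.
    if not html.lstrip().startswith("<?xml"):
        html = f'<?xml version="1.0" encoding="{encoding}"?>\n' + html.lstrip()
    return html


raw = "<html>\n<head><title>T</title></head>\n<body></body>\n</html>"
finished = add_prolog_and_namespace(raw)
```

The declared encoding has to match how the file is actually saved, which is why an ISO-encoded source needs a different value or a re-save.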

I check the final HTML in Firefox.

Sat 3:45 PM

What do we have so far?

  1. I reduced a 100-line embedded stylesheet to 9 lines.
  2. I reduced the number of CSS classes from 100 to 9.
  3. I reduced the number of CSS directives from over 400 to just over 20.
  4. I replaced pseudo-structural elements with real structural elements.

But it’s not ready for prime time just yet.

I copy the entire working directory to my encrypted remote file share because I’m paranoid like that. I verify the copy.
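
One way to copy and verify, sketched in Python: copy the tree, then compare a SHA-256 digest of every file on both sides. The function names are made up, and the destination is assumed to be a mounted path that does not exist yet.

```python
import hashlib
import shutil
from pathlib import Path


def digest_tree(root: Path) -> dict[str, str]:
    """Map each file's path, relative to root, to the SHA-256 of its contents."""
    return {
        path.relative_to(root).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def backup_and_verify(source: Path, destination: Path) -> bool:
    """Copy source to destination and report whether every file matches."""
    shutil.copytree(source, destination)  # raises if destination already exists
    return digest_tree(source) == digest_tree(destination)
```

A call might look like backup_and_verify(Path("book"), Path("/Volumes/share/book")), where both paths are placeholders.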

I’m going to take a small break now.

Sat 4:00 PM

I post this to my blog. Then the showering, food-eating, other stuff.

ETA: Break might be until tomorrow. Friend and I are contemplating Watchmen again. Yes, I thought it was that good.