To Digital in a Day: Act II

Sat 8:27 PM

Rested off some dizziness and decided to pick this up again so that there’s at least a book for my Kindle.

The HTML looks pretty good. It’s at the state where someone could email it to their Kindle’s email account and have it converted fairly well. It lacks a table of contents, though.

But first, let’s do the Epub.

Sat 8:30 PM

The annoying thing about Epub: most Epub readers cope badly when the source contains a very large HTML file. Other mobile formats are internally broken down into separate records, which are loaded in small chunks so as not to stress memory. But many Epub readers, Adobe Digital Editions included, try to load the entire file in one go.

So the first task is to break this file up into multiple files. I usually do one file per chapter. My concern is that the chapters here are large (about 20k words each), but we’ll have to see what happens.

First, breaking out the stylesheet into its own style.css file, which can then be linked to by each chapter html file.
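
For anyone who would rather script this step than cut and paste, here is a minimal sketch. It assumes the source HTML has a single inline <style> block; extract_stylesheet is a hypothetical helper name, not part of any tool used in this post.

```ruby
# Sketch only: assumes exactly one inline <style>...</style> block.
# extract_stylesheet is a hypothetical helper, not from any library.
#
# Returns [html_with_link, css]. css is nil when no <style> block is found,
# in which case the HTML is returned unchanged.
def extract_stylesheet(html, css_href = "style.css")
  css = nil
  link = %(<link rel="stylesheet" type="text/css" href="#{css_href}" />)
  # Replace the whole <style> element with a <link>, keeping its contents.
  stripped = html.sub(%r{<style[^>]*>(.*?)</style>}mi) do
    css = $1.strip
    link
  end
  [stripped, css]
end
```

Reading the source with File.read, passing it through extract_stylesheet, and writing the second return value to style.css would give the same result as doing it by hand.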

Now I write a ruby script to chop the file up, because I’m just that way. You could do this in perl or python as well, but I prefer ruby.


#!/usr/bin/ruby

#
# Opens a new file handle to a file with the given
# filename (no suffix), writing the initial header for
# the file, also using the given title in the <head> section.
#
def open_section(name, title)
    section = File.new("#{name}.html", "w")
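    # Assumed header: an XHTML preamble, the title, and a link to style.css.
    section.puts <<-END
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
  <title>#{title}</title>
  <link rel="stylesheet" type="text/css" href="style.css" />
</head>
<body>
    END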

    return section
end

#
# Closes the given file handle after writing the HTML footer.
#
def close_section(section)
    section.puts <<-END
</body>
</html>
    END
    section.close
end

File.open("TextilePlanet-te.html") do |file|

    current_section = nil
    section_num = 1
    state = :in_head

    file.each do |line|

        case state
        when :in_head
            if line =~ %r|<body|
                state = :in_body
                current_section = open_section('section-00', 'Title')
            end
        when :in_body
            # Assumption: each chapter heading is an <h1> element on a single line.
            if line =~ %r|<h1>([^<]+)</h1>|
                title = $1
                if current_section
                    close_section(current_section)
                end
                current_section = open_section("section-%02d" % section_num, title)
                section_num += 1
            end
            current_section.puts line
        end
    end
    close_section(current_section)
end

The end result is 11 files, from section-00.html to section-10.html.

I check the various HTML files in Firefox to make sure we’ve got everything. Everything is quite nicely split; section-00.html has the title page, section-{01-08}.html each contain an entire chapter, section-09.html contains the author bio, and section-10.html the copyright page.
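
Given the earlier worry about chapter size, a rough per-file word count is a useful complement to eyeballing the files in a browser. This is only a sketch: word_count and size_report are hypothetical helper names, and stripping tags with a regex gives an estimate, not an exact count.

```ruby
# Rough word count for an HTML string: replace tags with spaces,
# then split on whitespace. Entities are not decoded, so treat the
# result as an estimate.
def word_count(html)
  html.gsub(/<[^>]*>/, " ").split.size
end

# Takes a hash of { filename => html_string } and returns one
# formatted line per file, sorted by filename.
def size_report(sections)
  sections.sort.map do |name, html|
    format("%-18s %6d words", name, word_count(html))
  end
end
```

Feeding it the contents of each section-*.html file (for example, a hash built from Dir["section-*.html"] and File.read) shows at a glance whether any one chapter is far larger than the rest.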

We need an explicit ToC file for Mobipocket. (Epub doesn’t need it, but I’m working from the same source for all the formats.)

Sat 8:55 PM

I could write the ToC file by hand… or generate it with a script… or just use MacVim and shell commands and some writing by hand.

I do that.
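
For the record, the "generate it with a script" option could look something like the sketch below. It is not what I actually ran. toc_html is a hypothetical helper, the markup is an assumption about what a simple ToC page needs, and the titles are assumed to be HTML-safe already because they come straight from the HTML source.

```ruby
# Sketch only: builds a simple table-of-contents page as a string.
# entries is an array of [href, title] pairs in reading order.
# Titles are inserted as-is, so they must already be HTML-safe.
def toc_html(entries, heading = "Table of Contents")
  items = entries.map do |href, title|
    %(    <li><a href="#{href}">#{title}</a></li>)
  end
  <<~HTML
    <html>
    <head>
      <title>#{heading}</title>
      <link rel="stylesheet" type="text/css" href="style.css" />
    </head>
    <body>
      <h1>#{heading}</h1>
      <ul>
    #{items.join("\n")}
      </ul>
    </body>
    </html>
  HTML
end
```

Calling toc_html with pairs such as ["section-01.html", "Chapter title"] and writing the result to toc.html would produce a page every section links from in order.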

Sat 9:00 PM

Moving the non-toc, non-section, non-stylesheet, non-image files to another directory.

Now using Ruby Epub tools in the root of the work directory. Also, I’ve discovered that an old friend of mine, HTML Tidy, still exists. Big help for finding illegal character sequences and replacing them appropriately.


% epub add-to-opf . content/*
% vim metadata.opf   # (reorder spine)
% epub add-guide . content/toc.html \
      --type "toc" \
      --title "Table of Contents"
% epub add-guide . content/section-01.html \
      --type "text" \
      --title "Start Reading"
% epub add-to-ncx . content/section*
% epub add-to-ncx . content/toc.html
% epub compile .
  adding: mimetype (stored 0%)
  adding: META-INF/container.xml (deflated 35%)
  adding: metadata.opf (deflated 77%)
  adding: content/style.css (deflated 59%)
  adding: content/section-10.html (deflated 55%)
  adding: content/section-09.html (deflated 33%)
  adding: content/section-02.html (deflated 63%)
  adding: content/section-03.html (deflated 62%)
  adding: content/section-01.html (deflated 62%)
  adding: content/section-06.html (deflated 63%)
  adding: content/section-07.html (deflated 53%)
  adding: content/section-04.html (deflated 62%)
  adding: content/bookview-logo.png (stored 0%)
  adding: content/section-05.html (deflated 63%)
  adding: toc.ncx (deflated 79%)
  adding: content/toc.html (deflated 59%)
  adding: content/section-00.html (deflated 32%)
  adding: content/section-08.html (deflated 33%)
% ~/Software/ebooks/epub/epubcheck *epub
No errors or warnings detected

Sat 9:18 PM

Test reading it successfully in Adobe Digital Editions, with a table of contents and a proper NCX.

Ladies and gentlemen, as of 9:18 PM we have a valid Epub. And a blog entry shortly.