I have some screenshots lying around of the eBook I did for Mary Robinette Kowal’s free fiction sampler, and I thought it would be interesting to look at the process I used to convert her .rtf files.
I have a MacBook, which comes equipped with a lot of text-munging capabilities out of the box, thanks to pre-installed Perl and Python. With Crossover installed so I can run the Windows-only Mobipocket Creator and Reader, I have everything I need to create eBooks from text, HTML, RTF, PDF (as long as it’s not scanned images), or even Word documents.
There’ll be plenty of images, so the rest of this post is under the cut.
The first step (after having read and enjoyed MRK’s stories) was to convert the RTF files to HTML. Mobipocket Creator doesn’t import RTF, and without an installation of Microsoft Word for Windows (via Crossover), it can’t import Word documents either. Instead, I used iWork’s Pages to export the RTF to HTML. (There are a couple of routes to victory here; I could have used TextEdit, which already comes on a standard Mac install, to Save As HTML.)
I then had to process files that looked like this:
into something that Mobipocket’s more limited HTML renderer could understand. The blue arrows on the left are my editor’s way of telling me that it’s wrapping a very long line.
The resulting files were much cleaner:
As you can see, the main goal of the first pass is to get the body text into properly opened/closed paragraph tags (<p></p>). This tells the Mobipocket renderer where a paragraph begins and ends, so that it can indent each paragraph and space it from the others appropriately.
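A minimal sketch of what such a first pass could look like in Python. This is not the script used for this book; it assumes the story has already been reduced to plain text with a blank line between paragraphs, and the function name `wrap_paragraphs` is made up for illustration.

```python
import html
import re


def wrap_paragraphs(text: str) -> str:
    """Wrap each blank-line-separated block of plain text in <p>...</p>.

    Assumption: paragraphs are separated by one or more blank lines.
    """
    blocks = re.split(r"\n\s*\n", text.strip())
    paragraphs = []
    for block in blocks:
        # Collapse hard line wraps inside a paragraph into single spaces.
        body = " ".join(block.split())
        if body:
            # Escape &, < and > so stray characters cannot break the markup.
            paragraphs.append(f"<p>{html.escape(body, quote=False)}</p>")
    return "\n".join(paragraphs)


if __name__ == "__main__":
    print(wrap_paragraphs("First paragraph,\nwrapped.\n\nSecond paragraph."))
```

Every paragraph comes out with an explicit opening and closing tag, which is the property the renderer needs.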
Each story is in its own file, which simplifies some things and makes others more complex. One important thing that’s simplified: separate files will be rendered by Mobipocket into one Mobipocket file, with page breaks between each file. I don’t have to worry about inserting <p style="page-break-before: always;"> in the right places of a single monster file. It’s also easier to create a table of contents for stories or chapters broken into separate files, but more on that later.
I now use header tags to set off the title, author, and in this case, original publication information of each story. Due to the Creative Commons license that Kowal’s work is under, and also out of respect for the author, I’m not changing the text—e.g., dropping her name or the publication information.
I typically use <h3> for chapter titles, and either <h2> or <h1> for story titles. The title, author, and publication information was centered in the original RTF, so I also centered it here.
The Bound Man
by Mary Robinette Kowal
Originally published in the anthology, Twenty Epics
Screenshot from the Kindle:
Two line breaks (<br>); I don’t normally do this either, but the publication information is the same size as the body text, so I used spacing to set it off. It is a nice effect, though, so maybe I’ll do that from now on.
Now we go through the body of the text and make note of the following:
- What notation is being used to denote scene breaks?
- Are there any characters that require a specific encoding (such as the accented o in “Halldór” in “The Bound Man”)?
- What notation is being used to denote italicized text, or bolded text?
- Notation for em-dashes?
Then we adapt the main body formatting for Mobipocket.
Because each of these files is in standard manuscript format, scene breaks are # and underlining means italics. I replace scene breaks with
which is just a stylistic thing of mine, and the underlining is replaced with italics (<em></em>). This is where knowledge of CSS comes in handy, because the RTF converters use CSS to denote underlining and centering, not the usual <i></i> or such. More flexible, but harder to root out. Special characters, like -- denoting an em-dash (—), are replaced with the appropriate HTML entity (&mdash;).
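A hedged sketch of how these replacements could be scripted in Python. It rests on several assumptions that are not from the files themselves: that the converter marks underlining with a `<span>` whose `style` attribute contains `text-decoration: underline`, that a scene break is a paragraph containing only `#`, and that there are no nested spans or HTML comments in the body. The scene-break markup is passed in by the caller as `scene_break_html`; the function and parameter names are hypothetical.

```python
import re

# Assumed shape of the converter's underline markup (check your own export).
UNDERLINE_SPAN = re.compile(
    r'<span\s+style="[^"]*text-decoration:\s*underline[^"]*"\s*>(.*?)</span>',
    re.IGNORECASE | re.DOTALL,
)

# A paragraph whose only content is "#", the manuscript scene-break marker.
SCENE_BREAK = re.compile(r"<p(?:\s[^>]*)?>\s*#\s*</p>", re.IGNORECASE)


def adapt_body(html_text: str, scene_break_html: str) -> str:
    """Apply the manuscript-to-eBook substitutions to one story file."""
    # Underlined spans become emphasis.
    out = UNDERLINE_SPAN.sub(r"<em>\1</em>", html_text)
    # Scene-break paragraphs become whatever markup the caller prefers.
    # A function is used so backslashes in the markup are taken literally.
    out = SCENE_BREAK.sub(lambda _match: scene_break_html, out)
    # Double hyphens become the em-dash entity. This would also mangle
    # HTML comments (<!-- -->), so strip those first if any are present.
    out = out.replace("--", "&mdash;")
    return out
```

Running the underline pass before the hyphen pass keeps the two patterns from interfering with each other; it is still worth eyeballing the output, since converters vary in how they write their inline styles.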
Because of the presence of characters that require UTF-8 encoding, I make sure to preserve some indicator at the top of each file that the encoding is UTF-8, such as
at the very top of each file. Otherwise, accented characters and even smart quotes (if present) won’t render correctly at all.
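One way to double-check this before building the book is a small script. The sketch below only looks for a `charset=utf-8` declaration near the top of a file and lists any non-ASCII characters that depend on it; the function names and the 1024-character window are arbitrary choices for illustration, not anything Mobipocket requires.

```python
import re

# Matches charset=utf-8, charset="utf-8", charset='UTF-8', and so on.
CHARSET_UTF8 = re.compile(r"charset\s*=\s*[\"']?utf-8", re.IGNORECASE)


def declares_utf8(html_text: str, head_chars: int = 1024) -> bool:
    """Return True if a UTF-8 charset declaration appears near the top."""
    return bool(CHARSET_UTF8.search(html_text[:head_chars]))


def non_ascii_chars(html_text: str) -> list:
    """Return the distinct non-ASCII characters in the text, sorted."""
    return sorted({ch for ch in html_text if ord(ch) > 127})


if __name__ == "__main__":
    sample = "<html><body><p>Halldór</p></body></html>"
    if non_ascii_chars(sample) and not declares_utf8(sample):
        print("Needs a UTF-8 declaration:", non_ascii_chars(sample))
```

A file that reports non-ASCII characters but no declaration is one where accents and smart quotes are likely to render incorrectly.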
This is a more specialized pass, to pick out semantic clues as to how text should be formatted in certain cases. For instance, Death Comes But Twice is a special case, as a letter:
The initial address at the start of the body of a letter isn’t indented; we have to tell Mobipocket that it shouldn’t be. We do this via a width=0 attribute to the paragraph tag:
<p width="0">My dearest Lily,</p>
(Naturally, this is where having read the text is helpful. I am fussy, so I do this sort of thing.)
Also note that the Creative Commons license at the end of each file needs to be set off from the text (easy to do with a rule, <hr>). It also looks ugly when justified—a concern since all Mobipocket readers tend to default to justifying text in paragraphs. Instead, we explicitly set the justification to be left (aka ragged-right), and also knock off the implicit indent:
This work is licensed under the Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/3.0/us/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.
Note: Publishers should let readers decide, except in special cases like above, whether or not to justify text or allow indentation for the main body of the work. Yes, this is a change from traditional book publishing, where all formatting decisions are made by the publisher, but eBooks need to be a bit more flexible with respect to font family, font size, indentation, and justification. Whoever eBook-ified Dust and other such books, I’m looking at you.
Each story file is now processed, so we turn to structuring the eBook as a whole: putting together all the stories, with a table of contents that allows a reader to jump to each one rather than paging through. eBooks do not, as a rule, page as well as real books do. No, they really don’t. Especially since the concept of a page as a fixed-size element is not present for eBook formats like Mobipocket or ePub, which depend on reflowable text so that users who change font sizes won’t get nasty formatting when they do so.
We’ll also add any other guide elements the book should have. A title page is usually nice; if any dedications are present, a dedication page is nice. Similarly for preface, colophon, acknowledgments, and so on. Each such guide element is usually a separate file in itself.
A note on cover images: these are very nice when you can get them, but the image rights belong to the artist, not the author. You’d have to ask. Signs typically point to “no” if you’re not an actual publisher.
My title pages are very simple: <h1> for the title, <h2> for the byline. Here’s a token screenshot (not very exciting):
The table of contents page is only slightly more complicated. Here you need to have a good working knowledge of <a href... and <a name.... We’d need both if we were working with a super-linked work (many links within a chapter point to other parts of the work) or with a single big file for the stories. Here, because the stories are separate files, we only need the former, and href points to the filenames of the stories.
The links in a table of contents are ones you don’t want justified or indented (most story or chapter titles don’t reach all the way across a screen nicely).
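As a rough illustration, a table of contents page like this could be generated in Python from a list of story files. The filenames below are made up, the `<h2>` heading is an arbitrary choice, and `align="left"` is an assumption about how to turn off justification; `width="0"` is the no-indent attribute described earlier.

```python
import html


def build_toc(entries) -> str:
    """Build TOC markup from (filename, title) pairs, in reading order."""
    lines = ["<h2>Table of Contents</h2>"]
    for filename, title in entries:
        href = html.escape(filename, quote=True)
        label = html.escape(title, quote=False)
        # width="0" removes the indent; align="left" (assumed) avoids justifying.
        lines.append(
            f'<p width="0" align="left"><a href="{href}">{label}</a></p>'
        )
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical filenames for two of the stories.
    stories = [
        ("bound_man.html", "The Bound Man"),
        ("death_comes_but_twice.html", "Death Comes But Twice"),
    ]
    print(build_toc(stories))
```

Because each story is its own file, every link only needs an `href` to a filename; no `<a name>` anchors are involved.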
Renders on the Kindle as:
Now you need to order all the files in the Mobipocket file list in the order that you’d like a reader to page through them, if the reader were to go through the book from start to finish without any shortcuts. Some of them do that.
Metadata and Guide Data
And here is where I see a lot of publishers, professional and non-professional, make mistakes. Which isn’t great, considering how important metadata is to catalogs and, in particular, content organization on a Kindle or other Mobipocket device:
What’s wrong with this picture?
The list is ordered by author last name. Unfortunately, “After the Coup” has incorrect author metadata, indicating that “John Scalzi” is the last name, rather than just “Scalzi”.
- In the Garden of Iden by Kage Baker,
- Spirit Gate by Kate Elliott,
- Flash by L. E. Modesitt.
The Birthday of the World has its title correct for a library card, incorrect for a Mobipocket reader (which can library-alphabetize without people re-ordering beginning articles to the end).
Also, The Birthday of the World has incorrect author metadata; a space was missing after the comma between last name, first name, so it’s displayed incorrectly here.
You can’t see it, but Farthing on page 10 is missing author metadata entirely.
Important elements of metadata include:
Title. This should be without the author name, and in normal title order. You do not need to put the beginning “The” or “A” at the end with a comma, because Mobipocket readers are smart enough to alphabetize library-style.
Author. Do not just paste the author’s name in; you must put it in last-name, first-name middle-name order. And a space after that comma, folks. This is so the Mobipocket reader can alphabetize by author last name, library-style. Multiple authors are separated by semicolons.
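A small Python sketch of the author rule, for anyone preparing metadata in bulk. It is deliberately naive: it treats the last word of a name as the surname, so multi-word surnames need to be entered by hand in "Last, First" form. The function names, and the space after the semicolon, are my own choices rather than anything Mobipocket Creator dictates.

```python
def library_name(name: str) -> str:
    """Return a name as 'Last, First Middle' with a space after the comma."""
    name = name.strip()
    if "," in name:
        # Already last-name-first; just normalize the spacing after the comma.
        last, rest = name.split(",", 1)
        return f"{last.strip()}, {rest.strip()}"
    parts = name.split()
    if len(parts) < 2:
        return name
    # Naive assumption: the final word is the surname.
    given = " ".join(parts[:-1])
    return f"{parts[-1]}, {given}"


def author_field(names) -> str:
    """Join several authors with semicolons."""
    return "; ".join(library_name(n) for n in names)


if __name__ == "__main__":
    print(author_field(["Mary Robinette Kowal", "John Scalzi"]))
```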
Important guide data:
- Marking which file is the title page.
- Marking which file is the table of contents.
- Marking which file is the [name of guide, like preface/introduction/etc].
It’s especially silly not to mark which page is the table of contents, since then the reader can’t easily access it when they’re deep in the text.
And remember: scroll all the way down and press the Update button, because the “Save” icon doesn’t save metadata or guide data in Mobipocket Creator!
… gaah, like an essay. But anyways, these were my considerations while I was creating the eBook for MRK’s fiction sampler.
At some future point in time I may cover the considerations that went into Psmith in the City. That was a bit more complicated, since I did not get to draw on the electronic copy of an author’s manuscript, and really did require perl scripts for major text massaging and processing.