eBookifying the Scifiction Archives

A long time ago (in Internet time, anyways), Scifi.com had a section called Scifiction, where they published science fiction stories online—both “classics”, from writers hoary with age (well… maybe not that hoary; Robert Silverberg, Avram Davidson, Barry N. Malzberg, etc), and “originals”, from newer writers (you know, like Elizabeth Bear, Lucius Shepard, M. Rickert, etc).

Then, for whatever reason, Scifi killed Scifiction. All links to the stories were evaporated.

But the Scifiction archives live on. Horribly slow, badly formatted, aging and uncared for, and nearly unreadable in mobile readers like the Kindle, but still there. (Of course, I’m inserting extra drama here. Cue timpanis.)

I got tired of this, so I created eBooks of them, one per year. This was actually my second serious endeavor in the world of eBooks. It was amusing, because the archives are huge; some 325+ stories reside there, spread out over five years. I can’t distribute them, of course, because the stories are all under copyright—and tracking down over 50 writers, some dead so I’d have to contact their estate, is not something I’m about to do. Nor would they wish me to, I think. So I don’t distribute them, and never will.

But the knowledge of how to do it, for people who wish to make personal eBooks, is distributable.

So here’s a description of how I did it, after the cut. It’s not complete in every detail, because some of the process was manual—there are multiple pitfalls in how the archives work, a lot of it because the archives are spread out over five years, and templates and presentation change enough to cause unwary scripts to die with gurgles halfway through the work. And even so, you end up needing to massage things by hand anyways.

Note: this is quite a bit of effort, but it was still less effort than doing it all by hand. I’m rather proud of this. And, of course, it’s a very tl;dr, mid-level technical discussion. I think it’s mostly a geeky thing.

More Notes

In all these things, I’m assisted by my Mac, running Mac OS X. This is not a small advantage, because by default I have access to vim, wget, and perl; after all, OS X is a modified BSD distribution. It’s like working on a Linux box with a decent window manager, but less decent process control tools (BSD ps is miserable compared to the procps version, but I digress).

This is not something you get for free on Windows machines, and not even all Linux distributions (although it’s easier to get these tools on a Linux distribution).

These instructions are a bit complex, but informative, I hope.

Legal Note

DO NOT distribute any files you produce. Just don’t. Let people make their own.

DO NOT assume that the writers will not come after you with sticks, stones, and lawyers. I know people like to assume this and all, but dude, just don’t.

DO NOT assume that copyright has expired for the dead guys, or for the “really old” stories. None of them have as of this writing, and none of them will for a very long time.

Downloadation

I use wget for this business. Some people prefer curl, and there are GUI applications and Firefox plugins out there that will give you a nice interface to do what I did—which was not a simple mirroring operation, although it’s not that hard to accomplish.

Your target URL is http://www.scifi.com/scifiction/archive.html, which links to all the stories.

First requirement: recursive downloading. This means that the program will follow links, usually on the same website, and download them along with your target URL.

Second requirement: recursive downloading restriction. By this, I don’t mean simply restricting the number of levels your recursion descends, but also restricting the “directory” your recursion descends into. In this case, we want items under the “scifiction” directory, even though there are links in the target files that point to “movies”, “schedulebot”, “scifiwire”, etc. You really don’t want to download those.

Third requirement: link rewriting. This means localizing all links to be relative links where possible; oftentimes websites will refer to their own links by the full URL, e.g., “http://www.scifi.com/scifiction/classics/classics_archive/sladek/index.html”. This works against any script that’s trying to locate the local version of the files in your downloaded menagerie; a good downloading program will rewrite the links into relative links, so instead you’d get (for instance) references to “classics_archives/sladek/index.html”.

wget has an extensive man page for the various options to satisfy these requirements; so does curl.

Massaging archive.html

If you scan through archive.html in your downloaded files, you’ll notice that while the listing of stories and authors seems regular, there are abnormalities along the way that will trip up a naive script. Examples:

Long titles that contain breaks:
Multiple authors
The 9/11 special, which is completely vaporized from the site
The Periodic Table of Science Fiction by Michael Swanwick, which needs extra special handling
Entries commented out when authors wanted stories pulled.

Either your script needs to be intelligent, or you massage archive.html through a text editor to smooth out the discrepancies that can be smoothed, and remove the ones that can’t (the missing tribute and the difficult to deal with periodic table are the two main ones that can’t be mitigated). It’s easier just to massage archive.html; you’ll have enough problems dealing with other issues than to worry about something that only needs to be done once, rather than 328 times.

Massaging specific story files

In your downloaded directory, you’ll find two subdirectories: classics/classics_archive and originals/originals_archive. Both contain the story HTML files, nestled in subdirectories based on inconsistent author short-naming and a possible number indicating this is a story they have in the archive beyond the first one (like sheckley, sheckley2, sheckley3, etc).

The fun part is some of these HTML files have been corrupted: the line-endings aren’t recognizable by Windows, Mac, or Linux. You’ll recognize these by searching for files where everything has ended up on one line, with a bunch of control characters—specifically, ^M—tainting the file. You’ll want to search-and-replace every “^M” with a newline. It’s important to note that this is not a caret ^ followed by an M; this is actually a single character that’s just represented by ^M, because otherwise it would be invisible to you.

(To nit-pickers out there: yes, I’m being very general and not completely technical and specific, because it’s not necessary right now.)

I detected these files by grepping for “^M” (using a in the terminal so that the ^M is produced); there are under 10 such overall, but they’ll throw most scripts under the bus when processed (this is because processing “line by line” is the best way to deal with transforming the HTML files; huge single lines will mess that up). Correcting them manually is the way to go, especially if you have the vim editor on hand. I simply loaded up the 10 files, and executed the following keystrokes:

:%s///g

Voila. Despite the fact that the will look like a ^M, vim will do the right thing, turning all the bad ^M into newlines, changing huge one-liners into the multiple lines they should be.

Processing the story HTML

This has multiple parts. Multiple non-fun parts.

Any matching you do here must be case-insentive, by the way; sometimes tags are capitalized (like

) and sometimes they’re lower-cased (

) and sometimes the case is… mixed (

Stripping out extraneous HTML

This is extremely necessary, because table code is used to generate those side quotes that will interrupt the actual story text. We’ll need to locate the table cell with the story text, and extract only that part; fortunately, this is always given away by a line that contains ; match against "bodytext" (quotes included), since that appears in no relevant story text (confirmed via a “grep” search that counts the number of times “bodytext” appears; it should appear only three times, twice in the embedded stylesheet, once in the actual HTML).

We also need a way to find where the story text has ended; not all the story texts end with some capitalization/phrasing variations of “end” (END, the end, THE END, THEEND, the

end, or nothing at all). You’d think you could simply figure out where the table cell that contains the main text ends, but this is fraught with danger since some of the stories themselves contain relevant table code inside them for spacing effects.

The key is (by looking through the various files) that the ending is, in almost all cases, immediately succeeded by this sequence, all on one line, sometimes preceded or succeeded by whitespace, sometimes not:

</td></tr>

and the inner tables relevant to the stories don’t follow this pattern.

There are about 3-5 files that don’t follow this pattern (and don’t follow it in all different ways). Unfortunately, you’ll only be able to find them by generating the final HTML files and searching forwards for abnormalities in generation, much later.

Matching authors, stories, and dates

This is fun, because of course no story file is actually named after its own name, and is placed in a somewhat inconsistent directory structure, in a changing manner as the years passed.

So you need to parse archive.html and match up dates with authors, titles, and the specific story file (aldiss1, etc) the story (well… see later) resides in.

On the other hand, you may not care about matching up authors/stories/dates, in which case you’re free to just process all the HTML you can find in the archive directories.

You’ll also want to Europenize the dates or something. Otherwise 11.05.05 is a bit ambiguous and difficult to sort on.

Dealing with the stories themselves

This is extra fun. There are several variations:

index.html contains the story.
index.html contains some introduction and a big graphic but no story.
some other named file contains the story.
the story is split into multiple, inconsistently named files, although not inconsistently within a single directory.
some combination of the above.

I found it easier to just sort the files in each story directory, strip out extraneous HTML (which stripped the bogus index.html’s down to nothing), and concat the results together. The alphanum ordering takes care of everything else.

The Perl

And here’s the code I used. It assumes the massaging done in previous steps above.

And yes, this is heathen magic, in the programming atmosphere today. Not even I worship the Camel anymore. It could have been ruby. I just know perl—and perl is blazing fast with text of this size.


#!/usr/bin/perl
#
# Yes, I'm a bad person for not using strict and warn
# but... I really did not care for a one-off script.
#

open(ARCHIVE, "archive.html") || die;

# Accumulates the story data from archive.html
%STORIES = ();
# Because the date precedes the title on a separate
# line, we'll capture the date separately.  "last"
# refers to "the last date before we reached this line".
$last_date = '';

while() {

    # Dates are "00.00.00" format, with the order
    # being "month, day, year".  We'll Europeanize
    # the order to "year, month, day" and expand the
    # year to be 20xx, to reduce confusion.
    if (/(dd).(dd).(dd)/) {
        $month = $1;
        $day = $2;
        $year = "20$3";

        $last_date = "$year.$month.$day";
    }
    # We found the title of the story. It's a link
    # to the specific story HTML file, too.
    elsif (/"archivetitle"/) {
        /^([^<]+)/;
        $url = $1;
        $title = $2;

        # skip over empties; sometimes there IS no URL.
        # Can't do anything about that.
        next if ($url =~ /^$/);

        # Boldly advance one line into the file, not
        # checking for EOF because we know that will
        # not happen for this case, to get at the author
        # by-line.
        $next = ;
        $next =~ /by ([^<]+) $url,
                             author => $author,
                             title => $title,
                             date => $last_date };
    }
    else { next; }
}

close(ARCHIVE);

# The sort by title only matters because I wanted to
# print to standard output as the process rolled along,
# and print to standard output in the same order each
# time, as hashes do not guarantee order for their keys
# without an explicit sort.
foreach $title (sort(keys %STORIES)) {

    $info = $STORIES{$title};
    $author = $STORIES{$title}->{author};
    $url = $STORIES{$title}->{url};

    # We'll store the output of stripping the
    # HTML and concatenating files into a separate
    # output directory that we pre-made.
    open(OUT, "> output/$title.html") || die;

    # The URL points to a specific HTML file; we
    # want the whole directory because it'll have
    # other goodies inside.
    $dir = $STORIES{$title}->{url};
    $dir =~ s/[^/]+.html$//;

    # Print to stdout what we're currently processing.
    print "$dirn";

    opendir(DIR, $dir) || die "$!";
    @files = readdir(DIR);
    closedir(DIR);

    foreach $file (sort @files) {
        next if ($file eq '.' || $file eq '..');
        open(FILE, "$dir/$file") || die;

        $in_body = 0;
        while() {
            if ($in_body) {
                if (//) {
                    $in_body = 0;
                }
                else {
                    print OUT;
                }
            }
            else {
                if (/"bodytext"/) {
                    $in_body = 1;
                    # sometimes words are on the same
                    # line as the bodytext marker; so
                    # strip out the marker and print
                    # just in case.
                    s/]*">//;
                    print OUT;
               }
               else {
                   next;
               }
            }
        }
        close(FILE);
    }

    close(OUT);
}

# Needed so we can create a reliable anchor name, unique
# to each of our 300+ stories
$storynum = 1;
# We're splitting into separate files by year, because
# otherwise the file is a bit unmanageable at 16MB of
# *just text*.
$curr_year = undef;
# Sort by our special Europeanized year, which allows us
# to easy sort automatically by year, month, day.
foreach $info (sort { $a->{date} cmp $b->{date} } (values %STORIES)) {

    # Right now we're just creating the initial linked
    # ("clickable") table of contents per year;
    # no story texts yet.
    $title = $info->{title};
    $author = $info->{author};
    $url = $info->{url};
    $date = $info->{date};

    $date =~ /(dddd)/;
    $year = $1;

    if ($year ne $curr_year) {
        close(MASTER_HTML) if $curr_year;
        open(MASTER_HTML, ">SciFiction-Archives-$year.html") || die;
        print MASTER_HTML "SciFiction Archives: $yearn";
        print MASTER_HTML "Contentsn";
        $curr_year = $year;
    }

    # Creating the link to that unique story ID
    $aname = "story$storynum";
    print MASTER_HTML "$date: $title by $authorn";
    $STORIES{$title}->{aname} = $aname;

    $storynum++;
}
close(MASTER_HTML);

# Reset the current year.
# NOW we'll append the story texts, with appropriate
# anchor names () using the
# the unique story id.  Because we're just using an
# increasing ID, we don't have to add an extra
# association in %STORY_INFO referring to the id used.
$curr_year = undef;
foreach $info (sort { $a->{date} cmp $b->{date} } (values %STORIES)) {
    $title = $info->{title};
    $author = $info->{author};
    $date = $info->{date};
    $aname = $info->{aname};

    $date =~ /(dddd)/;
    $year = $1;

    if ($curr_year ne $year) {
        if ($curr_year) {
            print MASTER_HTML "n";
            close(MASTER_HTML);
        }
        # *Append*, not overwrite.
        open(MASTER_HTML, ">> SciFiction-Archives-$year.html") || die $!;
        print MASTER_HTML "n";
        print MASTER_HTML " n";

        $curr_year = $year;
    }

    print MASTER_HTML "$daten";
    print MASTER_HTML "$titlen";
    print MASTER_HTML "by $author
n";

    open(IN, "output/$title.html") || die;
    while() { print MASTER_HTML $_ };
    close(IN);

    print MASTER_HTML "n";
    print MASTER_HTML " n";
}
print MASTER_HTML "n";
close(MASTER_HTML);

Run from the directory just above the classics and originals directories, this script takes less than a second to run through all 320+ stories, minus the Swanwick periodic table. I haven’t yet handled the Periodic Table stories, of which there are many (Unca Mike, you crazy crazy man).

This produces six files: SciFiction-Archives-200{0-5}.html.

Locating Abnormalities

I discussed earlier the abnormalities about the ending

; you’ll need to search through each file for

and judge whether or not this is a case of a legitimate table inside the story text and thus relevant, or a case of the

rule not being followed. If the latter, you’ll need to change the offending original story file (not the output files, and not the SciFiction-Archives-200x.html file) so that this is now true.

After the massaging, re-run the script above.

Converting the Big HTML files

Each is 2MB to 3MB. They are suitable, actually, to send off to the Kindle free conversion service. this had the disadvantage of not including some of the story images. So I’ve been building them in the Mobipocket Creator.

HTML is practically a lingua franca for generating eBook formats, so you can generate ePub, LIT, eReader, etc.

Final Thoughts

If Scifi ever approached me for reals and said, “We can distribute these files on the up-and-up,” I’d say, “Yes, sure.”

But that’s never going to happen. I suspect the Scifiction Archives are not actually supposed to exist. Yet there they are.

It’s like X-files! But not.

4 thoughts on “eBookifying the Scifiction Archives”

tikitu says:

September 22, 2008 at 5:04 am

Thanks for the details! I’ve done something similar in the past (converting Project Gutenberg texts to LaTeX for making nicely-typeset pdfs, but the principle is the same). I know just how messy those scripts get… seeing yours all neatly commented is pretty impressive.
Arachne Jericho says:

September 22, 2008 at 5:35 am

Yeah, I really needed to comment the thing so later on I could look at it and mostly understand in a short period of time. Perl is, at times, a “write-only” language, especially without comments—it can lose its read capabilities over time. :)

Processing text is fun. You know what’s really fun? Processing PDF files into HTML. Especially fancy ones. The heuristic scripts I came up with don’t always work, so you get a lot of hand work afterward. Le sigh.
tikitu says:

September 22, 2008 at 6:02 am

I found Conway’s “Perl Best Practises” wonderful for encouraging readable Perl habits. But then those conversion scripts are exactly where I tend to think “It’s just a one-shot, right?”…

I can only imagine the torments that pdf-to-whatever must involve. At least html is designed to make getting at the content easy.
Arachne Jericho says:

September 22, 2008 at 1:55 pm

Yes, same here. “Oh sure, I’ll never use it again….” and then you do.

PDF to whatever is very interesting. I’ll have to write that up sometime. There are three levels: somewhat easy, somewhat hard, and devilish.

Comments are closed.

Spontaneous ∂erivation

Writing is hard, let's go gaming

Menu

eBookifying the Scifiction Archives

More Notes

Legal Note

Downloadation

Massaging archive.html

Massaging specific story files

Processing the story HTML

Stripping out extraneous HTML

Matching authors, stories, and dates

Dealing with the stories themselves

The Perl

Contents

$title

Locating Abnormalities

Converting the Big HTML files

Final Thoughts

4 thoughts on “eBookifying the Scifiction Archives”

Menu

More Notes

Legal Note

Downloadation

Massaging archive.html

Massaging specific story files

Processing the story HTML

Stripping out extraneous HTML

Matching authors, stories, and dates

Dealing with the stories themselves

The Perl

Contents

$title

Locating Abnormalities

Converting the Big HTML files

Final Thoughts

Share this:

Related

4 thoughts on “eBookifying the Scifiction Archives”