Hosting a Digitized Newspaper Collection
At CPL we host the full run of Canton Observer newspapers. You can browse the issues by date or search the full text.
In the past we hosted a few recent years as individual files, but the current incarnation of the collection is far more data: over 3,800 issues totaling almost 180 gigabytes. Nearly every aspect of this project was a learning experience, so for other people looking to host tons of big files, here are some useful tips and tricks.
We worked with the vendor to optimize the files so they were easy to read and OCRed more accurately. This is mostly a matter of tweaking the contrast before running OCR. Since changing the contrast can irrevocably degrade the visual information in an image, it's best to save the original scan as a preservation copy; this can likely be arranged with the vendor.
Contrast is extremely important. Cranking it up to a really high level reduces errors for the OCR engine and makes the text more legible, but tends to make any pictures useless. We found that a level that generally worked gave the pages a light grey appearance (the original scans were medium grey) while only marginally degrading on-page graphics. The text isn't ideal, but it's good enough for the OCR to be mostly accurate.
With digitization projects you have to worry about both preservation and access. In order to ensure that we don't lose our data, we use a multiply-redundant backup scheme.
New scans arrive from the vendor on a flash drive. We load them onto two different internal servers, and each of those servers is mirrored at a satellite location. If our building were to burn down, we'd be able to restore from those backups (though not if a tornado hit, since the satellite site is just across the street). If one server began corrupting the data, we could restore from the other. We also keep the files on a USB hard drive stored in a safe. That isn't so much for planned contingencies; we had the drive, and it could come in handy.
Geographically-remote offsite storage can often be prohibitively expensive, depending on the size of your storage needs. It's important, though, to have a backup strategy to ensure preservation, so storage planning should be a big part of a digitization project.
Fast storage is expensive. Our web server has 15,000-RPM hard drives that can serve up the data stored on them extremely quickly. This, combined with a generous amount of memory reserved for caching, lets us fulfill most website requests in under 2 seconds.
Super-fast infrastructure isn't always practical, especially for a big collection of 40-megabyte files. Instead, we mount a regular-grade file server with lots of storage to the web server and symbolically link the newspaper directory into the web-hosted filesystem.
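The linking step looks roughly like this. The paths below are illustrative stand-ins (in production the storage directory would be the mounted file server, and the docroot would be your real web root):

```shell
# Stand-in for the mounted file server and the web docroot.
STORAGE=/tmp/demo-fileserver/observer
DOCROOT=/tmp/demo-docroot/sites/default/files
mkdir -p "$STORAGE" "$DOCROOT"

# Symbolically link the newspaper directory into the web-hosted tree.
# -s symlink, -f replace if present, -n don't follow an existing link.
ln -sfn "$STORAGE" "$DOCROOT/observer"
readlink "$DOCROOT/observer"
```

The web server just needs to be configured to follow symlinks (in Apache, `Options FollowSymLinks` on the docroot).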
Before we built out the interface, we made the files navigable using Directory Indexes. However, we immediately disabled this ability once the interface was live, as directory indexes can be a real security concern.
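In Apache, that toggle is the `Indexes` option. A sketch of the relevant config (the directory path is illustrative):

```apache
# While building the interface, allow auto-generated listings:
<Directory "/var/www/sites/default/files/observer">
    Options +Indexes
</Directory>

# Once the real interface is live, flip it off:
#     Options -Indexes
```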
Indexing the files for building the browse and search interfaces in Drupal was a daunting task.
We tried attaching the files to the native search index using the Search Files module. The test server responded by slowing to a crawl. The culprit was the stock Drupal search index; it bloated the database with huge numbers of terms.
The next attempt was a PDF-aware crawler for adding the files to an Apache Solr index. This would eventually have been the way to build an in-house solution. It was straightforward to install Solr on a dedicated server and get it talking to our test Drupal site to build a site-search index. Adding the newspaper files to the index using Tika (or Aperture) proved to be a big drain on development time, though, although someone with a background in Solr could probably pull it off much more quickly.
Since it was the file format causing grief with indexing, we also tried importing the plain OCRed text into Drupal, disabling the regular search index, and letting Solr crawl the site in the usual way. This tested the limits of Drupal, as the OCR text for a given file was larger than the core database functions could handle.
Luckily, a slowdown on our live server led us to a good solution: because we had linked to the directory indexes, Google had started crawling all the issues. That put a tremendous load on the server, but it left us with a complete index.
With the search index problem out of the way, attaching the 3800+ files to Drupal was a much easier proposition. We just wrote a little module that can recursively scan a directory and attach files to new nodes. It checks Drupal's `files` table to see if the file has been attached before, and adds a node creation operation to the cron job queue if it's not in the database yet.
The module also parses the filename to set the node's creation date and format the node title nicely. This left us with 3,800 nodes whose creation dates match the published dates of the newspaper issues and whose titles can be used by Views to build out a nice browsing interface.
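The parsing itself is the easy part. As a sketch, assuming a hypothetical filename convention of `canton_observer_YYYY-MM-DD.pdf` (the real convention may differ), the date and a human-readable title can be pulled out like this:

```shell
# Hypothetical filename convention: canton_observer_YYYY-MM-DD.pdf
f="canton_observer_1985-06-13.pdf"

# Strip the prefix and extension to isolate the ISO date.
d="${f#canton_observer_}"
d="${d%.pdf}"                      # 1985-06-13

# Reformat the date into a display title (GNU date).
title="Canton Observer, $(date -d "$d" '+%B %d, %Y')"
echo "$title"
```

The same two values feed the node's `created` timestamp and its title field.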
When new issues get delivered (once every 6 months), we can just load the files onto the server and re-scan the directory to add them to the site.
This solution is dynamic, which we've found useful for adding new files and tweaking the interface. However, it's pretty straightforward to create a static listing page. For instance, you could use:
ls -R > staticlisting.html
From there, just run a regex find/replace to build out the HTML for a nice list.
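Those two steps can also be collapsed into one. A sketch, using a sed substitution in place of the manual find/replace (`observer` is an illustrative directory, and the sample file is created just for the demo):

```shell
# Demo setup: a directory of PDFs to list.
mkdir -p observer
touch observer/canton_observer_1985-06-13.pdf

# Wrap each file path in a list-item link and emit one static page.
{
  echo "<ul>"
  find observer -name '*.pdf' | sort \
    | sed 's|.*|<li><a href="/&">&</a></li>|'
  echo "</ul>"
} > staticlisting.html
```

Re-run the script whenever a new batch of issues lands and the listing stays current.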
With the files in Google, we have two search scenarios:
- People who want to search the Observer specifically
- People who do a regular Google search and get results from the newspaper
For our patrons who want to search only the newspaper, we set up a Google Custom Search that essentially performs "query" site:www.cantonpl.org/sites/default/files/observer and put a search box on the archive's home page. It works really well.
We also see plenty of non-patron searchers arrive via regular-ole Google with interesting query strings. The archive may be genuinely helpful to people who didn't even know it existed, though more often those results are probably just bad hits. Either way, it makes Google Analytics and Google Webmaster Tools more fun to explore.
A project like this is inadvisable unless your server is connected to the internet via a high-bandwidth connection.
If more performance is needed to accommodate future usage, we have a few options available. For one, a faster file server for the access copies: a 256 GB SSD would do the trick. On top of that, we could switch from Apache to Nginx and run PHP as a FastCGI process, so that mod_php doesn't load just to serve out files. In a similar vein, we could push the files onto a dedicated server and use redirects to keep the file attachments live.
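A minimal sketch of that Nginx setup, assuming a php-fpm process listening on a local port (paths and the port are illustrative):

```nginx
server {
    root /var/www;

    # Static PDFs are streamed by nginx directly; no PHP involved.
    location /sites/default/files/observer/ {
        try_files $uri =404;
    }

    # Only actual PHP requests hit the FastCGI backend.
    location ~ \.php$ {
        include fastcgi_params;
        fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
        fastcgi_pass 127.0.0.1:9000;   # php-fpm listener
    }
}
```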
It took lots of experimentation, but we ended up with a useful tool by using simple Drupal hooks and the good fortune of Google's crawler.