Oldskooler Ramblings

the unlikely child born of the home computer wars

How to reasonably archive color magazines to PDF

Posted by Trixter on July 14, 2020

During a conversation with one of my archival collectives, the topic of archiving color magazines came up. Our goal was to distribute scans of the material as PDF, primarily because PDF viewing software is ubiquitous, but also because OCR’d text can accompany the images, making the magazine searchable without requiring the user to perform OCR. However, most of us haven’t started archiving our magazines, because it’s an extremely daunting task. Color magazines are notoriously annoying and difficult to scan to digital form because:

  • Most were printed using screened printing, whose tiny high-contrast dots hurt compression ratios, and produce moiré patterns when scanning at, or resizing to, lower resolutions
  • The high number of pages in color magazines (300, 400, or even 500 pages per publication) makes using a flatbed scanner a tedious process, as well as resulting in a very large set of data per magazine (if preserving quality is a concern)
  • Some magazines print almost all the way into the binding, leaving only a few millimeters of margin at the gutter, which prevents traditional book scanners, both flatbed and camera-based, from capturing the inner 1 cm of printed material

However, we’re in possession of several magazines that the original publisher hasn’t archived and aren’t available in the wild, so we decided to experiment with various scanners, software, and methods to see what was possible, while staying within the limits of what is practical.

While everyone has their own views on what’s important (size vs. quality, speed vs. accuracy, effort vs. volume, etc.), I came up with a set of rules and processes for myself that I’ll be following, and would like to share them. I held myself to the following goals:

  • PDF file sizes should not exceed 1MB per page on average. In 2020, and for the next 5 years at current broadband capacities and growth, a file size of 500MB for giant magazines, or 100MB for modest ones, is appropriate. This isn’t because of total size — storage is cheap — but rather because of transfer rates. I could easily scan a 500-page magazine to 30 GB of TIFF files (which I’ve done many times), but it’s not practical to share 30GB per magazine with online repositories. And besides, I’m not made of money, and some online repos may balk at an attempted upload of 7 TB (approx. 20 years of a large magazine’s print run).
  • Pages should be scanned at 600 DPI. This preserves the screening which can be dealt with later if necessary. It also ensures that very fine print will not only be legible, but able to be OCR’d. (Even if 300 DPI material is eventually needed for extremely large publications to stay under 1GB, the 300 DPI material can be obtained by resizing the 600 DPI material, instead of re-scanning the entire document.)
  • No matter the amount of processing, text should never dip below 300 DPI. This is less of a preference and more of a way to ensure that very fine print, such as a magazine’s masthead/impressum, is legible.
  • All screened material should be de-screened. If the scanning system has a proper de-screening option (a real one that asks for the LPI of the source material, not just a dumb blur filter), it will be turned on during scanning (and the results checked afterwards). If no such option exists, all 600 DPI (and better) scans will be run through a proper de-screening process. I have had excellent results with the Sattva Descreen plugin and endorse it for this. Descreening screened material not only improves the quality of screened images by removing the screening pattern, but results in smaller files (no matter the compression method) due to what is effectively noise reduction.
  • Mild degradation of images is appropriate as long as the text legibility itself is preserved. (Acrobat and DjVu can both do this, although some repositories aren’t accepting DjVu any more.)
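
As a rough sanity check on the size figures in these goals, the arithmetic can be sketched in a few lines of Python. This is only a back-of-the-envelope illustration: the function names are made up for this example, decimal units (1 GB = 1000 MB) are assumed, and a monthly publication schedule is assumed for the 20-year estimate.

```python
# Back-of-the-envelope size budget using the figures quoted in the goals above.
# Assumptions (not from the article): decimal units, 12 issues per year.

MB_PER_PAGE_TARGET = 1.0  # goal: at most 1 MB per page on average


def pdf_budget_mb(pages: int) -> float:
    """Largest PDF size (in MB) that still meets the per-page target."""
    return pages * MB_PER_PAGE_TARGET


def raw_archive_tb(raw_gb_per_issue: float, issues_per_year: int, years: int) -> float:
    """Total raw scan size (in TB) for a run of equally large issues."""
    return issues_per_year * years * raw_gb_per_issue / 1000.0


def required_reduction(raw_gb_per_issue: float, pages: int) -> float:
    """How many times smaller than the raw scan the final PDF must be."""
    return raw_gb_per_issue * 1000.0 / pdf_budget_mb(pages)


if __name__ == "__main__":
    print(pdf_budget_mb(500))              # 500.0 MB for a 500-page issue
    print(raw_archive_tb(30.0, 12, 20))    # 7.2 TB of raw TIFFs over 20 years
    print(required_reduction(30.0, 500))   # 60.0x smaller than the raw scan
```

In other words, meeting the 1MB-per-page goal for a 500-page issue means ending up roughly 60 times smaller than the raw 600 DPI TIFFs, which is why de-screening and aggressive image compression matter so much.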

To achieve these goals at the highest legibility but the smallest file size, I follow these practices:

  • Destroy the magazines. If you cut the binding off, you have flat sheets that you can run through an ADF or sheet-fed scanner. You can cut very close to the binder glue, giving the inner printing a chance to be scanned. It’s a sacrifice, but I feel preserving information printed on paper is more important than preserving the paper. I bought a guillotine paper cutter for $120 specifically for this purpose.
  • Use a high-quality sheet-fed duplex scanner with a configurable TWAIN driver. Usually people think of the Fujitsu ScanSnap series for this, and that was what I first purchased, but the ScanSnap series’ software is not configurable, and it’s only 9 inches wide which prevents scanning some material. I was lucky enough to acquire a Fujitsu fi-series scanner second-hand. This line of professional office scanners has an extremely configurable TWAIN driver that allows groups of settings to be saved into profiles appropriate for various kinds of material. And while it’s not a photo scanner, it does a more than acceptable job of scanning color magazines (better than the ScanSnap, which always has washed-out colors). Would I use it for scanning photos or artwork? No, but it’s my first choice for scanning entire books or magazines. This can be a case of spending some real money, but you do get what you pay for.
  • Pay for Acrobat. Real, commercial Acrobat supports JPEG2000 compression, which outperforms JPEG in both size and quality. But more importantly, it has a feature that can drastically reduce large PDFs called Adaptive Compression. It works by separating text and line drawings on a page into their own monochrome layer that is compressed losslessly. Then, the image that remains after the text has been lifted is downsampled and recompressed. This results in much smaller files without compromising the legibility of text and the sharpness of line drawings. (This feature may have been inspired by DjVu, whose early claim to fame was doing exactly this.) Finally, commercial Acrobat can perform OCR without requiring additional software.
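
The layer-separation idea behind that kind of compression can be illustrated with a toy Python sketch. To be clear, this is not Acrobat's or DjVu's actual algorithm: the fixed threshold, the function names, and the tiny list-of-rows "page" are all assumptions made up for illustration. It only shows the principle: keep a 1-bit text mask at full resolution, and downsample whatever is left.

```python
# Toy layer separation on a grayscale page stored as a list of rows
# (0 = black ink, 255 = white paper). Illustrative only; real tools use
# far more sophisticated segmentation than a fixed threshold.

TEXT_THRESHOLD = 64  # assumption: pixels darker than this count as text


def split_layers(page):
    """Return (mask, background): a 1-bit text mask at full resolution, and
    the page with text pixels replaced by the mean of the non-text pixels."""
    mask = [[1 if px < TEXT_THRESHOLD else 0 for px in row] for row in page]
    rest = [px for row in page for px in row if px >= TEXT_THRESHOLD]
    fill = sum(rest) // len(rest) if rest else 255
    background = [
        [fill if m else px for px, m in zip(row, mrow)]
        for row, mrow in zip(page, mask)
    ]
    return mask, background


def downsample_2x(image):
    """Halve resolution by averaging each 2x2 block (even dimensions assumed)."""
    return [
        [
            (image[y][x] + image[y][x + 1] + image[y + 1][x] + image[y + 1][x + 1]) // 4
            for x in range(0, len(image[0]), 2)
        ]
        for y in range(0, len(image), 2)
    ]


def recombine(mask, small_background):
    """Upsample the background by pixel doubling, then stamp the text back on
    as pure black (the mask is monochrome, so ink shade is not preserved)."""
    return [
        [0 if m else small_background[y // 2][x // 2] for x, m in enumerate(mrow)]
        for y, mrow in enumerate(mask)
    ]


if __name__ == "__main__":
    page = [
        [200, 200, 200, 200],
        [200,   0,   0, 200],
        [200, 200, 200, 200],
        [200, 200, 200, 200],
    ]
    mask, background = split_layers(page)
    small = downsample_2x(background)
    print(mask[1])                  # [0, 1, 1, 0]
    print(small)                    # [[200, 200], [200, 200]]
    print(recombine(mask, small))   # text edges intact, background halved
```

The point of the split is that the mask compresses extremely well with a lossless bilevel codec while keeping text edges crisp, and the background, now free of hard edges, tolerates downsampling and lossy compression much better than the original page would.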

With those rules and methods set, I performed many tests with a lot of material, and came up with a set of best practices that met my criteria. I compiled those practices into a handy flowchart:

I’ve continued to put this flowchart into practice with a lot of material, including mixed-content manuals (color, grayscale, and B&W material in the same manual), 500-page color screened magazines, 8.5×11″ photocopied material, dot-matrix printouts, and printed books. In all cases, I follow the flowchart until the size is reasonable for the material, and I’ve never been disappointed or felt like I was giving up too much quality for the file size. (What is “reasonable” is different for everyone according to personal preference, goals, and motivation, so it’s up to you to determine what that size eventually is.)

I hope that this information will help you finally tackle your own stacks of magazines that, like mine, have been leering at you ominously for years from the various corners of your abode.

Posted in Lifehacks, Technology