Oldskooler Ramblings

the unlikely child born of the home computer wars

How to reasonably archive color magazines to PDF

Posted by Trixter on July 14, 2020

During a conversation with one of my archival collectives, the topic of archiving color magazines came up. Our goal was to distribute scans of the material as PDF, primarily because of the ubiquity of PDF viewing software, but also because OCR’d text could accompany the images, making the magazine searchable without requiring the user to perform OCR. However, most of us haven’t started archiving our magazines, because it’s an extremely daunting task. Color magazines are notoriously annoying and difficult to scan to digital form because:

  • Most were printed using screened printing, whose tiny high-contrast dots hurt compression ratios, and produce moiré patterns when scanning at, or resizing to, lower resolutions
  • The high number of pages in color magazines (300, 400, or even 500 pages per publication) makes flatbed scanning tedious, and produces a very large set of data per magazine (if preserving quality is a concern)
  • Some magazines print almost all the way into the binding, leaving only a few millimeters of margin at the gutter, which prevents traditional book scanners, both flatbed and camera-based, from capturing the inner 1 cm of printed material

However, we’re in possession of several magazines that the original publisher hasn’t archived and that aren’t available in the wild, so we decided to experiment with various scanners, software, and methods to see what was possible, while staying within the limits of what is practical.

While everyone has their own views on what’s important (size vs. quality, speed vs. accuracy, effort vs. volume, etc.), I came up with a set of rules and processes for myself that I’ll be following, and would like to share them. I held myself to the following goals:

  • PDF file sizes should not exceed 1MB per page on average. In 2020, and for the next 5 years at current broadband capacities and growth, a file size of 500MB for giant magazines, or 100MB for modest ones, is appropriate. This isn’t because of total size — storage is cheap — but rather because of transfer rates. I could easily scan a 500-page magazine to 30 GB of TIFF files (which I’ve done many times), but it’s not practical to share 30GB per magazine with online repositories. And besides, I’m not made of money, and some online repos may balk at an attempted upload of 7 TB (approx. 20 years of a large magazine’s print run).
  • Pages should be scanned at 600 DPI. This preserves the screening which can be dealt with later if necessary. It also ensures that very fine print will not only be legible, but able to be OCR’d. (Even if 300 DPI material is eventually needed for extremely large publications to stay under 1GB, the 300 DPI material can be obtained by resizing the 600 DPI material, instead of re-scanning the entire document.)
  • No matter the amount of processing, text should never dip below 300 DPI. This is less of a preference and more of a way to ensure that very fine print, such as a magazine’s masthead/impressum, is legible.
  • All screened material should be de-screened. If the scanning system has a proper de-screening option (a real one that asks for the LPI of the source material, not just a dumb blur filter), it will be turned on during scanning (and the results checked afterwards). If no such option exists, all 600 DPI (and better) scans will be run through a proper de-screening process. I have had excellent results with the Sattva Descreen plugin and endorse it for this. Descreening screened material not only improves the quality of screened images by removing the screening pattern, but results in smaller files (no matter the compression method) due to what is effectively noise reduction.
  • Mild degradation of images is appropriate as long as the text legibility itself is preserved. (Acrobat and DjVu can both do this, although some repositories aren’t accepting DjVu any more.)
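
To put numbers on the size goals above, here is a rough back-of-the-envelope sketch. The US Letter page size and 24-bit color depth are assumptions chosen for illustration (real magazines vary), and the monthly schedule is likewise assumed; only the 1 MB per page target, the 600 and 300 DPI figures, and the 30 GB per issue figure come from the text.

```python
# Back-of-the-envelope size budget for a scanned magazine.
# Page dimensions and bit depth below are illustrative assumptions.

def raw_page_bytes(width_in, height_in, dpi, bytes_per_pixel=3):
    """Uncompressed size of one scanned page (24-bit color by default)."""
    return round(width_in * dpi) * round(height_in * dpi) * bytes_per_pixel

def required_ratio(raw_bytes, budget_bytes):
    """Compression ratio needed to fit a raw page into the budget."""
    return raw_bytes / budget_bytes

MB = 1_000_000
raw_600 = raw_page_bytes(8.5, 11, 600)   # assumed US Letter page at 600 DPI
raw_300 = raw_page_bytes(8.5, 11, 300)   # same page resized to 300 DPI

print(f"600 DPI raw page: {raw_600 / MB:.1f} MB")
print(f"300 DPI raw page: {raw_300 / MB:.1f} MB")
print(f"ratio needed for 1 MB/page: {required_ratio(raw_600, 1 * MB):.0f}:1")

# Upload volume for a long print run, using the 30 GB per issue figure.
issues = 12 * 20                         # assumes a monthly magazine over 20 years
print(f"print run as TIFF: {issues * 30 / 1000:.1f} TB")
```

Under these assumptions a 600 DPI page needs roughly 100:1 compression to hit the 1 MB target, which is why de-screening and mild image degradation matter so much; halving the resolution cuts the raw size to a quarter.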

To achieve these goals at the highest legibility but the smallest file size, I follow these practices:

  • Destroy the magazines. If you cut the binding off, you have flat sheets that you can run through an ADF or sheet-fed scanner. You can cut very close to the binder glue, giving the inner printing a chance to be scanned. It’s a sacrifice, but I feel preserving information printed on paper is more important than preserving the paper. I bought a guillotine paper cutter for $120 specifically for this purpose.
  • Use a high-quality sheet-fed duplex scanner with a configurable TWAIN driver. Usually people think of the Fujitsu ScanSnap series for this, and that was what I first purchased, but the ScanSnap series’ software is not configurable, and it’s only 9 inches wide which prevents scanning some material. I was lucky enough to acquire a Fujitsu fi-series scanner second-hand. This line of professional office scanners has an extremely configurable TWAIN driver that allows groups of settings to be saved into profiles appropriate for various kinds of material. And while it’s not a photo scanner, it does a more than acceptable job of scanning color magazines (better than the ScanSnap, which always has washed-out colors). Would I use it for scanning photos or artwork? No, but it’s my first choice for scanning entire books or magazines. This can be a case of spending some real money, but you do get what you pay for.
  • Pay for Acrobat. Real, commercial Acrobat supports JPEG2000 compression, which outperforms JPEG in both size and quality. But more importantly, it has a feature that can drastically reduce large PDFs called Adaptive Compression. It works by separating text and line drawings on a page into their own monochrome layer that is compressed losslessly. Then, the image that remains after the text has been lifted is downsampled and recompressed. This results in much smaller files without compromising the legibility of text and the sharpness of line drawings. (This feature may have been inspired by DjVu, whose early claim to fame was doing exactly this.) Finally, commercial Acrobat can perform OCR without requiring additional software.
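
The layer-splitting idea described in the last item can be illustrated with a toy example. This is not Acrobat's algorithm (which is not documented here); the fixed darkness threshold, the row-average fill, and the 2x downsampling factor are all assumptions made only to show the principle on a tiny grayscale grid.

```python
# Toy illustration of layered ("mixed raster content") compression.
# Not Acrobat's algorithm: the threshold, fill rule, and 2x factor are assumptions.

def split_layers(page, text_threshold=64):
    """Split a grayscale page (rows of 0-255 ints) into a 1-bit text mask
    and a background with the text pixels painted over."""
    mask = [[1 if px < text_threshold else 0 for px in row] for row in page]
    background = []
    for row, mrow in zip(page, mask):
        kept = [px for px, m in zip(row, mrow) if not m]
        # Fill text pixels with the row average of the non-text pixels.
        fill = sum(kept) // len(kept) if kept else 255
        background.append([fill if m else px for px, m in zip(row, mrow)])
    return mask, background

def downsample_2x(img):
    """Halve resolution by averaging each 2x2 block (even dimensions assumed)."""
    return [
        [(img[y][x] + img[y][x + 1] + img[y + 1][x] + img[y + 1][x + 1]) // 4
         for x in range(0, len(img[0]), 2)]
        for y in range(0, len(img), 2)
    ]

page = [
    [200, 200, 200, 200],
    [200,  10,  10, 200],
    [200,  10,  10, 200],
    [200, 200, 200, 200],
]
mask, background = split_layers(page)
small_bg = downsample_2x(background)

print(mask)       # full-resolution 1-bit layer, a candidate for lossless storage
print(small_bg)   # image layer with a quarter of the pixels, a candidate for lossy storage
```

The text mask keeps its full resolution, so edges stay sharp, while the background carries a quarter of the pixels and no hard text edges, which is what makes it compress well.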

With those rules and methods set, I performed many tests with a lot of material, and came up with a set of best practices that met my criteria. I compiled those practices into a handy flowchart:

I’ve continued to put this flowchart into practice with a lot of material, including mixed-content manuals (color, grayscale, and B&W material in the same manual), 500-page color screened magazines, 8.5×11″ photocopied material, dot-matrix printouts, and printed books. In all cases, I follow the flowchart until the size is reasonable for the material, and I’ve never been disappointed or felt like I was giving up too much quality for the file size. (What is “reasonable” is different for everyone according to personal preference, goals, and motivation, so it’s up to you to determine what that size eventually is.)

I hope that this information will help you finally tackle your own stacks of magazines that, like mine, have been leering at you ominously for years from the various corners of your abode.

14 Responses to “How to reasonably archive color magazines to PDF”

  1. brassicgamer said

    This is the kind of article I like – tried & tested methods, clarified and shared so that archiving efforts can be coordinated better. If there’s anything worse than scanning colour magazines, it’s re-scanning colour magazines.

    • Trixter said

      Especially if the originals are no longer available. Some of us discard the paper after 600 DPI TIFFs are archived; others hold onto them in file folders in mountains of banker boxes. But nobody wants to re-scan magazines.

  2. […] How to reasonably archive color magazines to PDF […]

  3. newmikeman said

    It’s all too late! No I’m kidding. I have scanned a few years’ worth of my favourite magazine “Amateur Photographer”, to gain physical space and to make the resulting PDFs searchable for future use by me. This is a weekly magazine of usually 68 pages so you’ll understand how they pile up. I have long since given away the years’ worth of National Geographic that I can no longer read comfortably owing to type size.
    I use a ScanSnap and I have a proper licenced copy of Acrobat X with which I process the PDFs for OCR. I wasn’t happy with the ABBYY OCR that came with the scanner and I don’t get on well with the ScanSnap Organizer and its insistence on only working with the ScanSnap scanner’s output.

    What I should like to ask you is: what software can you suggest or recommend for browsing, viewing and searching a huge pile of “Searchable” PDFs each containing one magazine issue. I might want to make reasonably complex searches and then maybe search the search results. Each result would take me to a specific magazine issue and page. If it could also display the stored page and allow browsing back and forth that would save me from having to follow the Open dialog in SumatraPDF.

    My own searches on ‘tinternet only seem to throw up software for academic searches and the creation of bibliographies. I’ve tried the search built into Acrobat but it doesn’t seem to deal with multiple PDF files and its single file search is very simplistic.

    Any ideas please?

    Oh and BTW I have never come across a magazine of 500 pages! Or is that a year’s worth of issues or something else that I’m misunderstanding :-)

    Kind regards
    Mike Newman

    • Trixter said

      Using Acrobat to OCR the PDFs before saving is how I solve this. Acrobat Pro can search for text across one or more directories and multiple PDF files; it’s under Edit -> Advanced Search.

      As for browsing, simply organizing them (i.e. consistent names, one folder per magazine, etc.) works best for me. For example:

      L:\Media\Bookshelf\Magazines\Computing\PCjr Magazine>dir /b
      PCjr Magazine – 198402 – Volume 1 Number 1.pdf
      PCjr Magazine – 198403 – Volume 1 Number 2.pdf
      PCjr Magazine – 198404 – Volume 1 Number 3.pdf
      PCjr Magazine – 198405 – Volume 1 Number 4.pdf
      PCjr Magazine – 198406 – Volume 1 Number 5.pdf
      PCjr Magazine – 198407 – Volume 1 Number 6.pdf
      PCjr Magazine – 198408 – Volume 1 Number 7.pdf
      PCjr Magazine – 198409 – Volume 1 Number 8.pdf
      PCjr Magazine – 198410 – Volume 1 Number 9.pdf
      PCjr Magazine – 198411 – Volume 1 Number 10.pdf

      Hope this helps.
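
A consistent naming scheme like the one in the listing above can also be read back by a script. The following is a minimal sketch, assuming names always follow the "Title – YYYYMM – Volume V Number N.pdf" pattern; the pattern name `NAME_RE` and the function `parse_issue` are hypothetical helpers, not part of any tool mentioned in this thread.

```python
import re
from datetime import date

# Pattern for names like "PCjr Magazine – 198402 – Volume 1 Number 1.pdf".
# Accepts either a hyphen or an en dash as the separator.
NAME_RE = re.compile(
    r"^(?P<title>.+?)\s+[-–]\s+(?P<ym>\d{6})\s+[-–]\s+"
    r"Volume\s+(?P<volume>\d+)\s+Number\s+(?P<number>\d+)\.pdf$",
    re.IGNORECASE,
)

def parse_issue(filename):
    """Return issue metadata as a dict, or None if the name does not match."""
    m = NAME_RE.match(filename)
    if m is None:
        return None
    ym = m.group("ym")
    return {
        "title": m.group("title"),
        "published": date(int(ym[:4]), int(ym[4:]), 1),
        "volume": int(m.group("volume")),
        "number": int(m.group("number")),
    }

names = [
    "PCjr Magazine – 198411 – Volume 1 Number 10.pdf",
    "PCjr Magazine – 198402 – Volume 1 Number 1.pdf",
]
issues = sorted((parse_issue(n) for n in names),
                key=lambda i: (i["volume"], i["number"]))
for issue in issues:
    print(issue["published"].isoformat(), issue["title"], issue["number"])
```

Because the YYYYMM field comes before the volume and number, a plain alphabetical directory listing is already chronological, so "Number 10" does not sort ahead of "Number 2".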

  4. rdrg33 said

    My mother is trying to claim back some space and I have been looking for options to archive her magazine collection. I had no idea how to go about it, so I am glad I found this post. It is truly an invaluable resource.

    Since I am not well-versed yet in anything that pertains to scanners for this task, I was wondering if you could give me some advice. There is a Fujitsu Fi-7160 locally available to me, is this model adequate for the job? Incidentally, considering a few years have passed since this post was written, is there a newer model I should rather look into?

    Thank you.

    • Trixter said

      The fi-7160 is an excellent choice if you have access to one. Just remember to install and use the “PaperStream” driver on their website, as the regular TWAIN driver is terrible.

  5. Rialtoroadtrip said

    I haven’t used the Fujitsu Fi-7160 but it looks to me like a “grown-up” version of the Fujitsu ScanSnap S-1500 that I have been using for several years and continue to use on a daily basis. As Trixter said, the Fi-7160 should be an excellent choice.

  6. rdrg33 said

    Thank you for the replies @Trixter and @Rialtoroadtrip.

    I notice the fi-7160 has only 8.5 inch wide capacity and I am thinking I will probably need something with bigger capacity. You (@Trixter) mentioned in the post that one of the reasons for opting for a fi series over the ScanSnap was because the latter was only 9 inch wide, which is unsuitable for scanning some material. What model did you get that is able to scan bigger pages?

    In regards to the paper cutter, do you have any brand or type suggestions or any sharp one should suffice?

    • Trixter said

      The one I use in particular is the fi-5530C2, which is 11 in. wide. As for a cutter, any bulk paper cutter should suffice.

      • rdrg33 said

        Understood. Thanks again for all the help.

        • I’m OK with the ScanSnap because my Amateur Photographer magazine as well as all the documents I receive and need to file all fit on A4.
          Best of luck @rdrg33

          • rdrg33 said

            @Rialtoroadtrip I wish I did not need bigger capacity. The availability of used scanners with this characteristic is not great here. I found a Fujitsu fi-6670 for approximately $250 USD (is this price reasonable?). The problem is that the seller is from another state, and buying without being able to check it personally makes me a little nervous.

          • Trixter said

            I should warn anyone reading this comment that the Fujitsu scanners are tuned for document scanning. High quality photography magazines, or photographs, are not handled very well by the scanner. They are acceptable, but I would hardly call them archival grade.
