The Art of Scanning

Paul Royster
University of Nebraska-Lincoln, proyster@unl.edu
January 5, 2011

Digital Commons - Information and Tools. Digital Commons / Institutional Repository Information, University of Nebraska - Lincoln. This paper is posted at DigitalCommons@University of Nebraska - Lincoln. http://digitalcommons.unl.edu/ir information/67

Yes, it is presumptuous to call scanning an “art,” when it is really more of a craft, but “The Craft of Scanning” doesn’t sound as sexy, so we will consider it for the time being as one of the fine arts, like music, or painting, or dance.

This short treatise derives from work done in the process of scanning published and original materials to create PDF files for online publication or deposit in our institutional repository. This approach assumes you have a scanner and software to drive it, and also three software programs from Adobe (sold together as their Creative Suite): Adobe Acrobat, Photoshop, and InDesign.

Contents
1. The Scanner
2. Scanner Software
3. What is a “Good Scan”?
4. Simple Scanning: Text + Line Art
5. Scanning Documents with Grayscale or Color Images
6. Various Issues You May Encounter
7. Adjusting Scanned Images
8. When You Didn’t Do the Scanning, but …
9. OCR Scanning

1. The Scanner

Scanning requires a scanner. No surprise there, and an adequate scanner is within the reach of most aspirants to the art. Scanners now often come as part of a multi-purpose copy-scan-fax machine, and these are adequate for most uses, and priced under $100.

There are many models of scanners on the market; almost all will do photographs, color, line artwork, and text, with resolution up to 600 dpi, which will meet most needs. Two additional features that are welcome are “descreening” and “optical character recognition.”

Optical character recognition, often abbreviated OCR, is software that will recognize and read scanned letters and convert them to live text. Thus instead of a picture of an “A” the resultant file includes an actual “A” that can be manipulated as text in a word processor. Most scanners do include this function, though sometimes it needs to be installed separately from the CD or software that is included. OCR can also be run “after the fact” in Adobe Acrobat, but this is sometimes not as accurate as OCR done as part of the original scan. Note that when you scan a page of text using OCR, what you will get will be a string of words and letters and numbers, not a 2-dimensional copy of the original scanned page.

Descreening (or re-screening) is used for photographs that have been printed as halftones. Halftone printing of photographs uses ink on paper to represent the continuous range of grays (or colors) in a photographic print. Since ink is only one color, the darkness is modified by varying the size of the dot that is printed: a 100% dot produces a solid ink color (usually black); a 10% dot produces the effect of a 10% gray (or color) by reducing the dot to 10% of its full size. These are now produced by computer programs, but before computers they were produced by laying a white gauze net or “screen” over the photograph, thus dividing it into many small squares. The fineness of the screen is given in “lines per inch”; thus a (standard) 133 line screen had 133 lines per inch (each way). If you look at a printed photograph (as opposed to a photographic print, which is produced by a chemical process), you will see it is composed of tiny dots in regular rows and columns. Note that newspapers will use fewer and larger dots than fine-printed photography or art books.

When a scanner scans these dots it tends to hit each dot at the same relative point, and this will create a regular checkerboard pattern, called moire (pronounced “more-ray”), that is an unsightly distortion. Descreening is a computer subroutine in the scanner that compensates for this and avoids the appearance of moire. Descreening works best when it is set for the same value as the original screening. For most printed materials, the standard is 133 line screen; for fine art and photography books, it will be higher, usually 150 or 200 lines per inch. If your scanner does not have a “descreen” function (or if it was not used on the file you have to work with), there are filters in Photoshop that will also eliminate the moire patterns; see “Adjusting Scanned Images,” below.

There are far too many models of scanners to attempt to describe them all. Probably the most popular line is manufactured and sold by Hewlett-Packard. These are reasonably good scanners, but the HP software that operates them will often drive you to distraction with its many frustrations. Good scanners, bad software.

One other feature that is nice to have on occasion is a transparency adapter, which allows you to scan slides, transparencies, negative film, and other see-through media that need to be lit from the back to show the image.

I mainly use a Microtek Scanmaker i700 desktop scanner that retailed for under $300. I have no reservations about recommending that model, with the proviso that (as reviewers have pointed out) it is a veritable “swamp turtle” when it comes to speed.

Figure 1. Top: Example of moire patterns. Printed photograph scanned at 200 dpi without descreening. The diagonal and checkerboard patterns are especially noticeable in the gray background. Bottom: Same printed photo, scanned at 200 dpi, with descreening. Note the absence of diagonals and checkerboarding, and also the smoother transition from lights to darks. (Photo of Father Pierre-Jean De Smet, by Matthew Brady [detail]).

2. Scanner Software

The software that operates the scanner varies for almost every scanner model on the market, so this section will be relatively non-specific. The software should allow you to set (and not continually need to re-set):
1. original media
2. scan area
3. type of scan
4. resolution
5. threshold
6. output file type
7. output file destination and naming

Original media: Is the object to be scanned opaque (such as a printed book or sheet of paper) or transparent (like a slide)? Is it positive (normal) or negative (reversed)?

Scan area: Most scanners will allow you to do a prescan and then select the area you wish to have scanned. (Some “bad” scanner software will do this automatically and then perversely “select” for you.) There is no need to scan unnecessary areas of the scanner bed, but sometimes, as when scanning multiple pages from a book, you may wish to set an area larger than needed for a single page to avoid re-setting the scan area for each page.

Type of scan: The 3 choices are grayscale, color, and line art (also called bitmap or black & white).

Grayscale scans are used for black & white photographs; they produce rows and columns of pixels in 256 levels of gray, from 0 = black to 255 = white.

Color scans do the same, but in 3 or 4 layers. RGB color (for monitors, screens, and online presentation) does 3 layers: one each for red, green, & blue. CMYK color (for color process printing) does 4 layers: one each for cyan, magenta, yellow, and black. Logically, RGB color files will be 3 times as large as grayscale files (for the same image size and resolution), and CMYK color files will be 4 times as large.

Line art (also called bitmap or black & white) scans recognize only 2 colors: black or white. These are appropriate for text and for artwork that has only black lines on white background. There is no gray. Line art files are much smaller than grayscale or color files, since they only have 2 levels, not 256 (or 2⁸) for each dot. However, line art scans cannot be compressed, while color and grayscale scans can be.

Resolution: This is how many dots-per-inch vertically as well as horizontally will be recorded and reproduced in the file. Recommended resolutions for various scan types are:
• grayscale: 300 dpi
• color (RGB & CMYK): 300 dpi
• line art/text: 300 to 600 dpi
• grayscale or color artwork with text: 400 dpi

Resolution is directly related to file size, so overscanning can lead to unmanageably large files that are hard to manipulate and take too long to download. Rough typewriter text is adequately reproduced at 300 dpi; smaller type does better at 600 dpi. Sometimes charts and graphs require grayscale or color scanning but also have (usually small) type on them; for these 400 dpi is a good compromise.

Threshold: This applies only to line art scans, and it is the point of darkness at which white changes over to black. Normally, 50% is the cut-off, but sometimes the original requires the threshold be adjusted between 30% and 75% to get an appropriate image. If your letters are broken and the thin parts are becoming invisible, lower the threshold. If the letters are too thick and every dirt spot is reproducing or the gutter shadow is creeping into the text, then raise the threshold. Works printed on coated paper tend to have finer and lighter type and may need to have the threshold adjusted.

Output file type: What kind of file shall the scanner save the scan as? The choices are text (for OCR scans), TIFF, and JPG. TIFF is required for line art. JPG is appropriate for grayscale and color scans. JPG is a compression formula that reduces file size with some cost in image quality, depending on the JPG setting: “maximum” quality hardly reduces the file size; “high” quality reduces the file size by half with almost no perceptible decline in image resolution; “medium” and “low” quality are not recommended.

Some scanners may offer PDF as an output type.
You can try it and see what you get, but oftentimes this produces an OCR’ed file with the type all converted to Times Roman (regardless of the original typeface) and any word or letter not recognized is reproduced as a picture, so the resulting page looks somewhat like a ransom note, and is a mixture of text and images.

Some OCR programs offer MS Word or RTF word processing format as an output type, and this is tempting in that it may capture italics, bold, and other text characteristics. However, some programs attempt to reproduce the page layout by placing everything in separate text boxes, which are awkward and unwieldy to handle. I have found it best to go with unformatted Unicode text, producing a single text stream, and to manually restore italics, boldface, etc. as necessary.

Output file destination and naming: Where shall the scanner store the file? Some scanner software automatically places the file in the “My pictures” folder or in some other out-of-the-way or hard-to-find location and names it “Myscan00001,” etc. Good scanner software will allow you to choose the location and will let you set up a naming structure so that you can easily find and manipulate the files further as needed.

Some OCR scanning programs allow scanning multiple pages into one file, and this is a nice feature. Be careful, though, not to overload the system: to have it crash on the 38th page of a 40-page document will force you to start all over again.

Figure 2. Examples of grayscale scans; note there are different shades of gray.

Figure 3. Examples of line art (bitmap) scans; note there are only 2 colors: black or white.

Figure 4. Text (scanned as line art/bitmap): Top: 300 dpi; Bottom: 600 dpi (enlarged 300%).

Figure 5. Text (scanned as line art/bitmap) with improper “threshold” settings: Top: too high; Bottom: too low (enlarged 300%).

3. What is a “Good Scan”?
A good scan is one where:
• The type is black on a white background; i.e. scanned as line art, not as grayscale, which gives dark gray type on a light gray background.
• The text is searchable and can be copied and pasted.
• The artwork is grayscale or RGB color when appropriate, without moire effects.
• The pages are in the correct order.
• The pages are straight and right-side-up.
• The pages are all the same size.
• The final file size is reasonable: usually 100 kilobytes per page or less, so that a 20-page document is about 2.0 Mbytes. Sometimes, if there is a lot of art, this is not possible. But, in general, file sizes over 100 Mb should be avoided if possible.

Figure 6. Top: text scanned as grayscale at 600 dpi (shown at 100%). Bottom: as line art.

Figure 7. Printed (halftone) photograph scanned as line art/bitmap (top) and as grayscale (bottom) (scanned at 300 dpi; shown at 100%).

4. Simple Scanning: Text + Line Art

The most straightforward type of scanning to create a PDF file involves documents with only text and/or line art. These can be scanned together; one scan per page is all that is required. We scan these as line art, crop them in Adobe Photoshop, assemble them in Adobe Acrobat, run OCR, and standardize the page sizes. This method is appropriate for a book chapter, journal article, or entire book that has no photos, color, or grayscale artwork, just text and charts or graphs with only black lines or areas.

With a little practice, and depending on the speed of the scanner, my work-study students have achieved output rates approaching 100 pages per hour. More realistically, I think an estimate of half of that speed (say 50 pages/hour) would be appropriate for most budgetary or planning purposes.

Phase 1: Scanning
• Place the book on the scanner bed and use the top edge and fore-edge to align the book page as straight as possible.
• Do an “overview” or “preview” scan.
• Select a scan area that is the full height of the book and the full width of the scanner bed (even though this may give you part of the previous or following page; this is so you do not have to reset the scan area for each page).
• Select the settings for “line art” (or “bitmap” or “black & white”).
• Set the resolution for 600 dpi; if there are more than 40 or 50 pages, you may wish to reduce this to 400 dpi, to keep the final file size manageable. 600 × 600 dots makes 360,000 dots per square inch; 400 × 400 dots makes 160,000 dots per square inch, or less than half the size.
• Scan each page and save the files as TIFF. These files will be fairly large but will reduce significantly when converted to PDF.
• You should review the first few pages (by opening the files in Adobe Photoshop) before scanning them all to make sure that the threshold cutoff is correct.

Figure 8. Proper scan area for book scanning. Using the entire width allows you to do both left-hand and right-hand pages without resetting the scan area.

Phase 2: Adobe Photoshop
• Open the TIFF files with Adobe Photoshop.
• Crop the image to the type block (see illustration). You can use the selection rectangle and the keyboard shortcut “Alt + O P” to crop (this is faster than using the pull-down menus for Image > Crop, etc., or the “crop” tool). It is not necessary to crop to the exact edge of the type; just get reasonably close and be sure to crop out all shadow, edges, and other unwanted things. If a page is short, such as the last page of a chapter that has only a few lines, crop to where the full page would end, not to where the type stops. If a page contains part of another (unwanted) article, crop to the normal page size, and then delete (i.e. erase) the unwanted section, leaving the page its full size.
• Using the “Save As” function (keyboard shortcut = “Ctrl+Shift S”), save the file as a Photoshop PDF file. You may notice that the file size will go from several Mb down to less than 100 kb.
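The size figures quoted in this section can be checked with a little arithmetic. The following is only a rough sketch in Python: the 6 × 9 inch scan area is an assumed example, the function name is made up for illustration, and real TIFF files will differ somewhat because of headers and any compression the scanner software applies.

```python
# Bits stored per pixel for each scan type described in Section 2.
BITS_PER_PIXEL = {"line art": 1, "grayscale": 8, "rgb": 24, "cmyk": 32}


def uncompressed_bytes(width_in, height_in, dpi, scan_type):
    """Approximate raw size of a scan, ignoring headers and compression."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * BITS_PER_PIXEL[scan_type] // 8


# Assumed example: a 6 x 9 inch scan area at the two resolutions suggested above.
for dpi in (600, 400):
    for scan_type in BITS_PER_PIXEL:
        size = uncompressed_bytes(6, 9, dpi, scan_type)
        print(f"{dpi} dpi {scan_type:>9}: {size / 1_000_000:6.2f} MB")
```

Under these assumptions a 600 dpi line art page comes to roughly 2.4 MB before conversion, which is in line with the “several Mb” mentioned above, and dropping to 400 dpi cuts that to a little under half.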
Interestingly, the size of the PDF file is a function of how much black there is on the page.

Figure 9. Cropping to the type block. Include areas where type would normally appear, even if it does not appear on that particular page.

Figure 10. Cropping a short page. Include the full height of the page so it will show properly in the ensuing PDF file.

Phase 3: Adobe Acrobat
• Use Acrobat to “Combine” the PDF files. This can be done by opening the first one and then using “Document > Insert > Pages” to insert the following pages at the end of the document, one at a time. Acrobat also has a “Combine files” feature that will allow you to drag all the PDF files into a new PDF file at once; this is useful for longer articles. This function is under the “File” menu.*
• Page through to verify that all the pages are there and in the correct order.

Figure 11. Adobe Acrobat’s Combine function is found under the File > Combine > Merge Files into a Single PDF menu.

* The newer releases of Adobe Acrobat will convert TIFF files to PDF as part of the “Combine” function. That approach is not recommended here for two reasons: 1) the TIFF files have not been cropped, so they will show the gutter, edges, etc., and 2) PDF files created by the Acrobat “Combine” function are (for reasons unknown to this author) larger, i.e., more bytes, than the equivalent files created by the Photoshop “Save as ... Photoshop PDF” function.

• Run OCR (optical character recognition) on the entire document. This command is normally under the “Document” menu. If the language is not English, “Edit” the settings to select the proper language. Acrobat’s OCR program will also straighten any pages that were not completely square before. Acrobat produces a hidden text string that sits behind the image of the scanned page.
The typeface remains whatever it was before, and if something is not recognized or misrecognized, only the hidden text is incorrect, and this is not observable unless the text is copied and pasted into another venue.
• Use the “Crop pages” command to set all the pages to the same size. Choose (or set) a size that is at least 1 inch larger than the cropped pages you created in Photoshop (Acrobat will show you this as the current page size). A 1-inch excess will create a ½-inch margin all around (skimpy); a 1.5-inch excess will produce a ¾-inch margin (better); and a 2-inch excess will give a 1-inch margin (luxurious). Odds are that the original document was one of the following sizes: 6 × 9 inches, 6.13 × 9.25 inches, 7 × 10 inches, 8 × 10 inches, or 8.5 × 11 inches (“Letter” size); although if it was a British document, it’s harder to say. Acrobat places the old page right smack dab in the middle of the new page size (so if you had cropped that short page to be only 4 inches tall, those 4 inches will show in the middle of the new page, not towards the top as they ought to). Be sure to select the box that applies this new size to “All pages,” not just the “Current page.”
• Save the file. You are done. You can delete all those single-page TIFF and PDF files.

Figure 12. Adobe Acrobat’s Crop screen is found under the Document > Crop pages menu.

Figure 13. Result of improper cropping of a short type page. Acrobat’s Crop (Page Resize) function puts it in the middle, not at the top.

5. Scanning Documents with Grayscale or Color Images

If all documents were just text and line art, life (and the professional literature) would be simpler but less interesting. Making PDF files from documents with grayscale or color illustrations involves some additional steps and the use of an additional program, Adobe InDesign.
In this method we scan the pages as line art and delete the illustrations, then scan the illustrations separately as grayscale or color, re-combine the two parts in a page-layout program, and generate PDF pages. Adobe InDesign comes as part of the Creative Suite package (along with Acrobat and Photoshop); Quark Xpress is a similar page-layout program that could be used instead. Following is a step-by-step guide to this method:

Phase 1: Scanning
• Follow the instructions in #4 Phase 1, above, to scan all the pages as line art and save as TIFF files.
• Go back and rescan the grayscale or color images as grayscale or color at 300 dpi. Use the scanner’s “descreen” setting to prevent moire patterns. Save these files in JPG format. Don’t worry about tight cropping of these images at the scanning stage; just be sure you get the whole image.

Phase 2: Adobe Photoshop
• Open the TIFF files with Adobe Photoshop and crop to the type block.
• If a page had no grayscale or color artwork, then “Save as” a Photoshop PDF.
• If a page has grayscale or color artwork, delete (erase) this artwork, leaving a blank white space where it used to be. Save these pages in their original TIFF format.
• If an illustrated page is noticeably crooked, you should straighten it; recombining the type with the illustration is easier if they are both straight. To straighten one of these pages, first convert it from bitmap/line art to grayscale (Image > Mode > Grayscale; or Alt + I M G), then rotate it clockwise or counterclockwise until a line of type all sits on the same horizontal guideline (which you can pull down from the top ruler bar). Rotation goes by degrees: 1 degree is a lot; less than .1 degree is probably not worth doing. Re-convert the image back to bitmap/line art (Image > Mode > Bitmap; or Alt + I M B) before saving as TIFF.
• Open the illustration JPG files and straighten and crop them as needed. Sometimes you will discover that a picture was not straight or not square to start with. When in doubt, straighten it so the top is horizontal.
• If moire patterns remain, use the Gaussian blur filter (Filter > Blur > Gaussian blur; range = 1 or 2 pixels) to fix that.
• If adjustments are needed for darkness, contrast, etc., see the section below on “Adjusting Scanned Images” for suggestions and guidance.
• Save the cropped, straightened, and adjusted images in their original JPG file format.

Figure 14. Formerly illustrated page with illustration deleted in Photoshop. Save this page in its original TIFF format. It will later be placed in an InDesign file.

Figure 15. Straightening text pages in Photoshop. The light blue guideline is placed by clicking on the ruler at the top and dragging down to the bottom of a line of letters on the left. The file has been converted from bitmap/line art to grayscale so that Rotation can occur; this is why the “Doc” size (lower left of screen) is so large. This page needs to rotate approximately .5 degrees clockwise to make the type horizontal. Convert back to bitmap/line art before saving.

Figure 16. Straightening an image in Photoshop. The light blue guideline is placed by clicking on the ruler at the top and dragging down to the top of the image on the left. This image needs to rotate about .25 degrees clockwise to be straight.

Phase 3: Adobe InDesign
• Start a new InDesign file (New > Document > ...). The page size should be larger than the type block you have cropped to and smaller than the final page size you intend to wind up with. InDesign by default measures space in picas (6 picas = 1 inch); if you wish, this can be changed to inches by the “Edit > Preferences > Units and Measures” command.
• Don’t worry about the margins, since you won’t be setting any type. The default margins of 3 picas (½ inch) on all sides are fine. The number of pages should be the number of pages with illustrations that you wish to combine with the type.
If you mis-estimate, more pages can always be added with the “Layout > Pages > Insert pages” command.
• For each page with illustrations, File > Place (Ctrl + D) the TIFF file containing the type on the InDesign page. Then place (again Ctrl + D) the JPG illustration file(s) and move each to the appropriate position.
• “Export” the combined pages to an Adobe PDF file. These pages will show the type as black-on-white line art and the illustrations as either grayscale or color. Alternatively, you may also generate the PDF file via the “Print” function or via the “Adobe PDF Presets ...” function (which I prefer).

Figure 17. New Document dialogue box for Adobe InDesign. Enter the “Number of Pages,” un-select “Facing Pages” and select a page size that is larger than the cropped type pages but smaller than the final intended page size. Don’t worry about the margins.

Figure 18. Place the (straightened if necessary) text page TIFF file into the InDesign document (File > Place or Ctrl+D). Use the red margin lines to help situate the type on the page: either as centered or (better) with slightly larger outside than inside and bottom than top margins.

Figure 19. Place the (straightened if necessary) image JPG file into the InDesign document (File > Place or Ctrl+D). Situate the image in the hole you created by deleting it from the line art/bitmap version of the page.

Phase 4: Adobe Acrobat
• Combine the PDF files, either by inserting the single-page type-only files one-by-one into the PDF file with the illustrated pages, or by extracting all the pages in the PDF file with the illustrated pages into single pages (Document > Pages > Extract > as single pages) and then using the “Combine” function to assemble them in proper order.
• Page through to verify that all the pages are there and in the correct order.
• Run OCR (optical character recognition) on the entire document. Make sure to select the proper language.
This will also straighten any pages that were not completely square before.
• Use the “Crop pages” command to set all the pages to the same size.
• Save the file. You are done. You can delete all those single-page TIFF and PDF files and the JPG files of the illustrations and the InDesign file.

6. Various Issues You May Encounter

Scanning Issues

The shadow from the gutter creeps into the type block

Some books are bound so tightly and the type is so close to the spine that the volume will not lie flat enough to properly scan the type that is closest to the gutter. There are two possible remedies for this:

1. Photocopy the page and scan the photocopy. Copy machines splash so much light on the page that the gutter shadow is usually reduced or eliminated. In especially problematic cases, be sure the spine of the book is oriented at a right angle to the copier’s light source.

2. Scan the page as grayscale (at 600 dpi) and save as TIFF. Open in Photoshop and convert to bitmap/line art (Image > Mode > Bitmap). You may need to adjust the “levels” to reduce the darkness of the gutter shadow first, or use any of Photoshop’s image adjustment techniques to push the gutter shadow below the white/black threshold. Whatever you do, be sure that on the conversion to bitmap you select the “50% threshold” option and not one of the “dither” methods (see below, “The type has holes in it!”).

Figure 20. Left: Page scan showing gutter shadow. Right: Same page scanned as grayscale, with levels adjusted, highlights “burned,” and shadows “dodged” in Photoshop, then converted to bitmap/line art. See Section 7 for “Adjusting Scanned Images.”

The type closest to the gutter bends downward at the top and upward at the bottom.

This is a result of the book not lying flat, and the distortion is created by the perspective. This can be cured using Photoshop. Convert the bitmap/line art image to grayscale.
Use the selection rectangle to select the type that is so affected (usually the first several letters of all the lines). Go “Edit > Transform > Skew” and boxes will appear at the corners of the selection rectangle. Pull the upper and lower corners to distort the rectangle (into a trapezoid) so that the lines of type run straight. Click on the selection rectangle tool, and accept the transformation. Convert the file back to bitmap (being sure to use the 50% threshold option).

Figure 21. Left: Page scan showing bent type due to gutter distortion. Right: Same page in Photoshop, converted to grayscale, with affected area selected and transformed with “Skew” function. Convert back to bitmap/line art before saving.

The type has holes in it!

The image was converted from grayscale to bitmap/line art using one of the “dither” options instead of the “50% threshold.” In the Image > Mode > Bitmap command sequence, there is an “Edit” button at the last stage. Click this and select “50% threshold” instead of one of the “halftone” or “dither” options (which try to fake a grayscale image by removing dots from black to give representation of shades of gray, but actually wind up making your type look as though it was shot with a shotgun). Once you change this setting, it should remain (unless someone changes it or Photoshop gets re-installed).

Figure 22. “Shotgun” or “Swiss cheese” type effect produced by converting gray type to bitmap/line art using a “halftone screen” or “dither” method rather than the 50% threshold filter in Photoshop.

The image has swirling or checkerboard patterns.

It was a printed halftone that was not “descreened” in scanning and so is showing these patterns called moire (“mo-ray”). You can rescan it with the “descreen” function turned on, or you can use the Photoshop Gaussian blur (Filter > Blur > Gaussian blur) to eliminate the patterns. Usually a radius setting of 1 or 2 pixels is enough to eliminate the moire without damaging the sharpness of the image.

The thin parts of the type are disappearing.

You need to lower the threshold.

The type is filling in.

You need to raise the threshold.

The image has horizontal lines in it from the type on the reverse page showing through the paper.

The best way to fix this problem is to re-scan the image with a black sheet of paper behind the page.

Figure 23. Type and figure from the following page are showing through the paper and are visible in the image. The best way to correct this problem is to rescan the image with a black sheet of paper behind it.

Acrobat Issues

Some pages do not resize when I try to make all pages the same size.

Acrobat will not change page sizes if the original page is larger than the new target size. Sometimes a page was not cropped down enough to fit the new size. Try a slightly larger size.

Sometimes this happens when the OCR program turns a sideways page right-side-up (according to the type orientation, but not the book orientation; i.e. it was originally a “turn page” in the book). To fix this, go to that page and make only that page square by the larger measure (e.g., make it 9 × 9 if you are shooting for 6 × 9). Then use the cropping functions to chop off from the top and bottom so the page is the right size, though a different orientation. If you want to avoid having Acrobat rotate these “turn” pages, then do not OCR those pages. Select the range of pages before and after for OCR.

There are pages out of sequence.

Go to that page and extract it: Document > Pages > Extract > delete page after extracting. Save the extracted page, then insert it back into the PDF file in the proper place, using Document > Pages > Insert pages ...

Figure 24. Acrobat cropping dialogue screen used to re-crop a “turn” page. Set the new page size as the square of the longer side, then come back and use the upper Margin Controls to crop down to the desired size.
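The page-size arithmetic behind the “Crop pages” step and the “turn page” fix can be sketched in a few lines of Python. This is only an illustration: the function names are invented, sizes are in inches, and the equal split between top and bottom is an assumption based on Acrobat centering the old page on the new one.

```python
def uniform_margin(cropped, target):
    """Margins (side, top/bottom) when a cropped page is centered on a target page."""
    (cropped_w, cropped_h), (target_w, target_h) = cropped, target
    if target_w < cropped_w or target_h < cropped_h:
        # Mirrors the symptom described above: the page is too large to resize.
        raise ValueError("cropped page is larger than the target size")
    return ((target_w - cropped_w) / 2, (target_h - cropped_h) / 2)


def turn_page_crop(target):
    """Square size to set first for a sideways page, then the trim per edge."""
    short_side, long_side = sorted(target)
    square = (long_side, long_side)
    trim_each = (long_side - short_side) / 2  # assumed equal at top and bottom
    return square, trim_each


print(uniform_margin((5.0, 8.0), (6.0, 9.0)))
print(turn_page_crop((6, 9)))
```

For a 6 × 9 target, the second call gives a 9 × 9 square with 1.5 inches to remove from the top and from the bottom, leaving a page 9 inches wide and 6 inches tall.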
The file is too gigantic (i.e., over 100 Mbytes).

Acrobat has an “Optimizer” function that will decrease file sizes by eliminating unnecessary hidden elements and reducing the resolution of image components. It is currently under the “Advanced” menu. There are six different sets of settings (for images, fonts, transparency, objects, user data, and clean up); the settings for images (shown in Figure 25) will usually have the largest effect. I normally save the optimized file under a new name, so that the original is retained, just in case ...

Figure 25. One of Acrobat’s optimizer dialogue screens. This one (for Images) usually has the largest effect on file size.

7. Adjusting Scanned Images

Many books have been written on using Photoshop to adjust and manipulate images, and some of them are very good. This section will only treat some common and elementary methods of improving the appearance of images scanned for publication in PDF documents.

The moire issue has already been addressed, twice, so we won’t repeat what has already been presented, except to say “use the descreen function of the scanner or the Gaussian blur function in Photoshop to cure the problem of swirling or checkerboard patterns.”

Straightening and cropping should be done in that order: straighten first, crop second. To straighten, enlarge the upper left corner of the image and pull down a horizontal guideline from the top ruler bar to the top left corner of the image. Then go over to the top right corner and see how far off the guideline is from the top of the image. Rotate the image clockwise or counterclockwise in small increments (usually .5 or .25 degrees) until the guideline and top right corner of the image are aligned. Crop by using the selection rectangle and with the image enlarged. Start at one corner and go to the opposite corner (such as upper left to lower right); then hit Alt + O P.
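Instead of rotating by trial and error, the rotation can be estimated from how far the top right corner sits off the guideline. The sketch below is plain trigonometry in Python, not a Photoshop feature; the function name and the pixel measurements are hypothetical, and the direction of rotation still has to be judged by eye.

```python
import math


def rotation_degrees(offset_px, span_px):
    """Tilt of an edge that drifts offset_px away from a guideline over span_px."""
    return math.degrees(math.atan2(offset_px, span_px))


# Hypothetical measurement: the corner is 16 pixels off over an 1800-pixel width.
angle = rotation_degrees(16, 1800)
print(f"rotate by about {angle:.2f} degrees")
```

With these example numbers the tilt works out to about half a degree, which matches the size of the increments suggested above.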
The next most common problem with grayscale or color images scanned from printed materials is that they are too dark and murky. This is usually best addressed by adjusting the “Levels” in Photoshop. Think of a grayscale image as a collection of pixels with darkness ranging from 0% (white) to 100% (black). A scan, however, will usually produce an image with the pixels ranging from about 10% to about 70%, so the full 100% range is compressed to a spread of about 60%. Usually such an image needs to be “stretched” so that the darkest areas become 95-100% black and the lightest 0-5%.

In Photoshop, select Image > Adjust > Levels and you will see a box pop up with what is called a histogram of the image. This is a graph (called “Input Levels”) showing the distribution of pixels along a continuum from 0% to 100% or level 0 to level 255. Note that most of the graph falls toward the middle.

Figure 26. Image and histogram in Adobe Photoshop; Commands = Image > Adjust > Levels (or Ctrl+L). Note that most of the pixels are concentrated in the area between 50% and 75% black.

Figure 27. Image and adjusted histogram in Adobe Photoshop; the Input Levels have been adjusted to “stretch” the distribution of most pixels across the range from approximately 60% to 95% (instead of from 50% to 75%). Note that the “Preview” box is checked so that the effects on the image are seen as they are made.

Below the graph you will see 3 sliding triangles, representing the 100%, 50% and 0% levels. Grab the leftmost triangle (the one for 100% black) and move it to the right until it just begins to touch the area where the pixels are graphed. Normally the right-most (0%) slider does not need to be moved. Areas that show 0% black look burned out, so you don’t want to create more of those. Now grab the middle slider (the one for 50% black) and move it gradually to the left to lighten up the image. Stop when the image is good and contrasty.
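In arithmetic terms, the slider moves described here remap each pixel value. The following is a minimal sketch, not Photoshop's actual implementation: the function name `adjust_levels`, its parameters, and the use of a simple power curve for the middle slider are assumptions made for illustration. Values use the 0 to 255 level scale, where level 0 is 100% black and level 255 is 0% black.

```python
def adjust_levels(pixels, black_point=0, white_point=255, gamma=1.0):
    """Remap grayscale levels (0 = black, 255 = white).

    black_point: input level that becomes pure black (the left slider).
    white_point: input level that becomes pure white (the right slider).
    gamma:       midtone control (the middle slider); values above 1.0
                 lighten the midtones, values below 1.0 darken them.
    """
    if not 0 <= black_point < white_point <= 255:
        raise ValueError("need 0 <= black_point < white_point <= 255")
    if gamma <= 0:
        raise ValueError("gamma must be positive")
    span = white_point - black_point
    result = []
    for value in pixels:
        # Stretch the input range so black_point..white_point fills 0..1.
        t = (value - black_point) / span
        t = min(1.0, max(0.0, t))  # clip anything outside the new range
        # Apply the midtone curve, then scale back to 0..255.
        result.append(round(255 * t ** (1.0 / gamma)))
    return result

if __name__ == "__main__":
    # A murky scan: nothing darker than level 77 (about 70% black).
    murky = [77, 110, 150, 190, 230]
    print(adjust_levels(murky, black_point=77))             # stretch the shadows to black
    print(adjust_levels(murky, black_point=77, gamma=1.3))  # then lighten the midtones
```

Moving only the black point deepens the shadows; raising the gamma value afterwards lightens the midtones, which corresponds to dragging the middle slider to the left on the image being adjusted.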
The adjusted image should “pop” in a way that it did not before. Be sure the “Preview” box is checked, so you can see the results before “accepting” them. A bit of practice will help build confidence in working on images: don’t be afraid to try to improve them. After you click “OK” to accept the new image, save it. If you don’t like the transformation, use Ctrl + Z to reverse the previous step. (Ctrl + Z is an invaluable tool: it always undoes the previous operation, whatever that may have been.)

A similar effect can be achieved with the Image > Adjust > Curves command, but that is somewhat more complicated and more appropriate for use on images that are to be printed rather than displayed onscreen.

Photoshop also has an automatic levels adjusting command (Image > Adjust > Auto Levels) that applies a preset filter to the image, based on its reading of the histogram. Try it if you like; if you don’t like the results, just hit Ctrl + Z.

Photoshop also has a Brightness/Contrast control that can be used to adjust images. Again, play with it if you like; you can always Ctrl + Z if you don’t like the results.

The Levels, Curves, and Brightness/Contrast controls can also be applied to color images. However, often the problems with these involve the color being skewed in one color direction or another: too yellow, or too blue, etc.

Figure 28. The “Variations” box in Adobe Photoshop. The colors of this old photo print had migrated over time, so the scan needs to be adjusted back to more natural-looking tones, in this case by selecting the “More Cyan” option and then probably the “Lighter” option as well.

The easiest way to address these is through the Image > Adjust > Variations command. This brings up a screen that shows the current image plus variations for darker, lighter, more red, more magenta, more cyan, more blue, more green, and more yellow. Pick the one that looks the best, and accept the transformation.
Note that the variations can be selected to apply to the darkest areas, lightest areas, or mid-tones. Selecting the mid-tones has the greatest effect.

Sometimes only parts of images need to be fixed, and here the “Dodge” and “Burn” tools are useful. The dodge tool looks like a lollipop or a circle on a stick; rub it over an area to reduce the darkness. The burn tool (which alternates in the same tool square as the dodge tool) looks like a hand making an “OK” sign with the thumb and forefinger. Rub it over an area to increase the darkness. Both of these tools have 2 types of settings. One setting applies the effect to either the shadows, midtones, or highlights only (usually the midtones are what you want to affect). The other setting controls how strong the effect is; usually a setting of about 20% is appropriate, and higher ones can quickly go too far.

Figure 29. Photoshop’s “Dodge” and “Burn” tools and their effects. Top: Unretouched scan, showing location of Dodge and Burn tools. Bottom: Highlights of the forehead, cheeks, neck and lace have been “Dodged” (made lighter); shadows of the nose, lips, eyes, and lace have been “Burned” (made darker).

8. When You Didn’t Do the Scanning, but …

Sometimes you need to work with files or images that you did not scan. Perhaps they came from Interlibrary Loan or from a well-meaning author or from an online source that didn’t do a particularly good job of scanning or did not make PDF files that are up to your standards of quality. With Acrobat and Photoshop you can usually improve them, or at least solve some of the worst issues.

If the images are okay, and it is just a matter of running OCR and cropping or standardizing the page sizes, these operations can be done in Acrobat. Sometimes, however, a header or watermark has been placed on the page images and then OCR cannot be run, because Acrobat finds there is already “live” text on the page. Sometimes you can use Acrobat’s Object Touch-Up Tool to delete the header or watermark or text box, so that OCR can then be run on the pages.

A PDF file can always be converted to a set of TIFF files using Acrobat’s File > Save As command and selecting TIFF as the format. Each page of the PDF is saved as a separate file. You get to select the output type (grayscale, color, or monochrome, i.e. bitmap/line art) and the resolution. You can then work on these pages in Photoshop and recombine them into a new PDF file. (Note: Some encrypted files do not allow saving as TIFF; but they do allow printing to MS Office Document Writer format, which can be exported to a non-encrypted PDF file that can be saved as a set of TIFF files.) Pages saved as TIFF files can be opened in Photoshop and re-saved as PDF files and then re-combined with Acrobat into a single PDF that will run OCR.

One common problem with “found” PDF files is that the type is too light and faint or is full of holes (the Swiss cheese or shotgun effect). In the first case, it was probably bad scanning, and in the second, scanning as grayscale and converting to bitmap/line art with one of the “dither” options.

Figure 30. Photoshop’s selection area for fixing faint or “Swiss cheese” type. In a 600 dpi grayscale file, use Magic wand to select one (black) letter; then Select > Similar; then Select > Modify > Expand > by 1 pixel. Next fill the area (Edit > Fill) with black, and convert back to bitmap/line art.

The solution for these two problems is similar: Working in Photoshop from a grayscale 600 dpi version of the page image, use the “magic wand” tool to select any black area, such as a letter. Do Select > Similar to select all black areas (i.e., all the type). Then do Select > Modify > Expand and expand the selection by 1 pixel. Then do Edit > Fill > Black to fill the selected area.
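In image-processing terms, selecting the black areas, expanding the selection by 1 pixel, and filling it with black amounts to a one-pixel dilation of the type. The sketch below illustrates the idea on a tiny bitmap; it is not what Photoshop runs internally, the name `dilate` is hypothetical, and the choice to grow each black pixel into all eight neighbours is an assumption.

```python
def dilate(bitmap):
    """Grow every black pixel (1) by one pixel in all eight directions.

    bitmap is a list of rows, each a list of 0 (white) or 1 (black).
    Returns a new bitmap of the same size; the input is not modified.
    """
    height = len(bitmap)
    width = len(bitmap[0]) if height else 0
    grown = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if not bitmap[y][x]:
                continue
            # Paint the pixel itself and its neighbours, staying inside the image.
            for ny in range(max(0, y - 1), min(height, y + 2)):
                for nx in range(max(0, x - 1), min(width, x + 2)):
                    grown[ny][nx] = 1
    return grown

if __name__ == "__main__":
    # A stroke with a single-pixel hole in it (the "Swiss cheese" effect).
    stroke = [
        [0, 0, 0, 0, 0],
        [0, 1, 1, 1, 0],
        [0, 1, 0, 1, 0],
        [0, 1, 1, 1, 0],
        [0, 0, 0, 0, 0],
    ]
    for row in dilate(stroke):
        print("".join("#" if pixel else "." for pixel in row))
```

In the example the hole in the middle of the stroke is patched, and the stroke becomes one pixel heavier on every side.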
You have just expanded all the type by 1/600th of an inch; any single pixel holes should have been patched and the type should look stronger and blacker without filling in. Convert to bitmap/line art and save as Photoshop PDF.

Figure 31. Here the selection area has been filled with black, and the image converted back to bitmap/line art. The type is not perfect, but it is much stronger and more readable.

◘ ◘ ◘

Sometimes a PDF file of an article will contain the last lines of the previous (unwanted) article or the first part of the following (also unwanted) article. This is especially common in the Book Reviews sections of journals. The solution here is to cover the unwanted portions with a white text box in Acrobat. In current releases, this is found in the Markup menus; select “no border” and make the background color white.

Figure 32. Adobe Acrobat’s Text Box tool used to “white” out unwanted portions of the PDF page. Left: Location of the Text Box tool (Tools > Comment & Markup > Text Box Tool). Right: Page with unwanted portion obscured; the blue outline of the box shows only because it has been selected with the TouchUp Object tool to show its position for this illustration; normally there would be no border and the white box would blend with the page background.

◘ ◘ ◘

Some publications fail to put their name, citation, and copyright information on the first page of an article. In such cases, we try to add this information in an Acrobat header or footer. You cannot style the type for italics, so we put the publication name in all capitals. Be sure to select the Page Range Options and declare page 1 only, or it will repeat on every page.

Figure 33. Adobe Acrobat’s Header and Footer dialogue box and setting. You can control the typeface and size, position, and margins. This can also be used for adding pagination if needed.
The “Page Range Options” allows you to put the header (or footer) on the first page only, or on all pages.

9. OCR Scanning

Sometimes you do not wish to reproduce the image of the page but you want to capture the text for purposes of revision, typesetting, etc. In such cases, OCR (optical character recognition) at the scanning stage is the most useful approach. You will need a scanner that has the OCR function; if your scanner does not have this, then scan the pages as line art, compile them into a PDF file, and run Adobe Acrobat’s OCR function (as outlined in Section 4). You can then copy the text and paste it into a word processing file.

OCR scanning has improved greatly over the past 15 years: where it used to be about 95% accurate (50 errors per 1000 characters), it is now closer to 99% (10 errors per 1000 characters). Given that an average page has about 2000 characters, that’s a reduction from 100 errors to 20 errors per page.

Most OCR software will work reasonably well on text in columns, except when a page starts out as 1 column and then switches to 2 columns, or when there are large vertical breaks in the columns. In such cases, it may help to photocopy the pages and cut or fold them into single columns; the scanning may take longer but the improvement in accuracy makes the effort worthwhile. Alternatively, one can use the scanner’s selection function to scan each area on the scanner bed separately, but this requires previewing each page.

OCR has difficulty with odd or unusual letter forms: roman characters do better than italic ones; oldstyle figures are often problematic; and Greek characters usually don’t work at all. OCR programs use a dictionary to help recognize words; they verify the reading against the word list to improve the accuracy.
It is important for the program to know what language it is reading, and unfortunately text with multiple languages (such as English with occasional French or Spanish terms) will show lots of errors on the unrecognized language. Thus French “thé” (tea) will be recognized as English “the.” If you change the language to French, then all the instances of English “the” will be recognized as “thé”. Spanish “ó” is usually recognized as “6” (unless Spanish is selected as the base language). Greek “α” (alpha) is recognized as “a”, and “β” (beta) as “ß” (German double-s) or as “13”; and so forth. So clearly, spell-check and proofreading are required on all text that has been OCR scanned.

Normally, OCR scanning works best when the image type is set for line art, but if type is especially faint, or is printed on a background, or is reversed out of a dark color, or the gutter shadow is a problem, then using the grayscale setting may improve the results. Usually, a resolution setting between 300 and 600 dpi works best. Very small type requires the higher setting, but a setting that is too high will increase the misrecognition of every dirt spot and fly dropping on the page.

Old fashioned typewriter text sometimes shows a unique problem: the letters are spaced so far apart that the OCR engine thinks each one is a separate word. In such cases, I find it is easier to search-and-delete all the spaces and then divide the words by hand than it is to delete each unwanted space one by one.

At the risk of repetitiveness, I will re-state two comments from the Scanner Software section:

• Some OCR scanning programs allow scanning multiple pages into one file, and this is a nice feature. Be careful, though, not to overload the system: a crash on the 38th page of a 40-page document will force you to start all over again.

• Some OCR programs offer MS Word or RTF word processing format as an output type, and this is tempting in that it may capture italics, bold, and other text characteristics. However, some programs attempt to reproduce the page layout by placing everything in separate text boxes, which are awkward and unwieldy to handle. I have found it best to go with unformatted Unicode text, producing a single text stream, and to manually restore italics, boldface, etc. as necessary.
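For the widely spaced typewriter text mentioned earlier in this section, the search-and-delete step can be limited to the lines that actually show the problem. The following is a minimal sketch that assumes plain text output from the OCR program; the function names and the 80% threshold are illustrative guesses rather than tested values, and the words still have to be divided by hand afterwards.

```python
def looks_letter_spaced(line: str, threshold: float = 0.8) -> bool:
    """Guess whether OCR has split a line into one-letter 'words'.

    A line is flagged when it has at least four tokens and the share of
    single-character tokens reaches the threshold (an assumed cut-off).
    """
    tokens = line.split()
    if len(tokens) < 4:
        return False
    singles = sum(1 for token in tokens if len(token) == 1)
    return singles / len(tokens) >= threshold

def collapse_letter_spacing(line: str) -> str:
    """Delete all spaces from a flagged line; leave normal lines untouched."""
    if looks_letter_spaced(line):
        return "".join(line.split())
    return line

if __name__ == "__main__":
    sample = [
        "T h e q u i c k b r o w n f o x",
        "This line was recognized normally.",
    ]
    for line in sample:
        print(collapse_letter_spacing(line))
```

Running the check line by line leaves correctly recognized text alone, so only the collapsed lines need to be re-divided into words during proofreading.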