Prepping Materials

The only requirement for the method outlined in this work is that you are working with digital versions of your texts and that these documents have a layer of OCR, or Optical Character Recognition.

Visualization of a digital book's OCR layer

OCR involves scanning pages, segmenting different areas of text (for instance, recognizing each of the columns in a newspaper), extracting features, and finally recognizing individual characters. This interpretation of the text is then overlaid onto the image or document, making it keyword searchable and usable with text-to-speech assistive technology. If you are using downloadable ebooks or request a digitization of physical books through our interlibrary loan, these items will come with a layer of OCR.
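If you are unsure whether a file already carries an OCR layer, one rough check is to see whether any text can be extracted from each page. The sketch below is only an illustration and is not part of the tools described in this work: the function names and the 20-character threshold are my own assumptions, and the optional extraction helper assumes the third-party pypdf package is installed.

```python
def pages_missing_text(page_texts, min_chars=20):
    """Return 1-based page numbers whose extracted text is shorter than min_chars.

    min_chars is an arbitrary threshold (an assumption); blank pages and
    full-page illustrations will also be flagged.
    """
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if len((text or "").strip()) < min_chars
    ]


def extract_page_texts(pdf_path):
    """Extract text page by page using pypdf (assumed installed: pip install pypdf)."""
    from pypdf import PdfReader  # imported lazily so the check above works without it

    reader = PdfReader(pdf_path)
    return [page.extract_text() or "" for page in reader.pages]


# Demonstration with made-up page text rather than a real file.
sample_pages = [
    "Chapter One. It was a bright cold day in April.",
    "",
    "   ",
    "Page four has plenty of searchable text.",
]
print(pages_missing_text(sample_pages))  # [2, 3]
```

With a real document you would pass the result of extract_page_texts("your_file.pdf") to pages_missing_text. Pages that come back flagged are candidates for running OCR.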

If you need to generate a layer of OCR for any other reason, you can do this with Adobe Acrobat, which you have a subscription to through your enrollment at the university. For alternative approaches, see the Notes section. While Acrobat's OCR function is adequate, you may find, with book scans in particular, that the OCR sometimes reads a two-page spread all the way across instead of one page at a time.

Demo of Book Splitter Python Tool

With this in mind, I built a Book Splitter Python tool that splits each two-page spread into separate left and right pages. Instructions for setting up the tool are included in the Appendix and are very similar to the setup for the Annotation Extraction tool we will walk through in a moment.
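To give a sense of the idea behind splitting a spread, here is a minimal sketch of the geometry involved. It is not the Book Splitter tool itself: the function name split_spread and the optional gutter parameter are hypothetical, and it assumes the spread is divided exactly down the middle of the page.

```python
def split_spread(width, height, gutter=0.0):
    """Return (left_box, right_box) for a two-page spread.

    Each box is an (x0, y0, x1, y1) tuple in the same units as width and
    height. gutter is an optional strip centered on the fold that is
    dropped from both halves (an assumption, useful for dark binding shadows).
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    if not 0 <= gutter < width:
        raise ValueError("gutter must be non-negative and smaller than width")

    middle = width / 2
    half_gutter = gutter / 2
    left_box = (0.0, 0.0, middle - half_gutter, float(height))
    right_box = (middle + half_gutter, 0.0, float(width), float(height))
    return left_box, right_box


left, right = split_spread(1000, 700)
print(left)   # (0.0, 0.0, 500.0, 700.0)
print(right)  # (500.0, 0.0, 1000.0, 700.0)
```

Because each box spans the full page height, the same coordinates could be passed to the cropping function of whichever image or PDF library you use. Once the halves are saved as separate pages, OCR reads each page on its own rather than across the fold.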

Now that you have an OCR layer, you will need to read and highlight your material. This can be done either in Adobe Acrobat or in a completely open source option that I like called Okular, where you can easily build out an unlimited number of custom highlight sets.