Annotation Extraction Tool

Visualization of two book pages with highlights being extracted and categorized — Identifying and Categorizing Highlighted Text

This Python tool detects a series of six highlight colors which all have designations and then extracts the text with those highlights into two different markdown files where these excerpts are displayed either categorically or chronologically, from the beginning to the end of the document you annotated.

✺

Here are the colors and their corresponding categories as they are currently set up in the tool but both of these values are customizable depending on your area of study. The numbers after each of the color names is the percent of red, green and blue in the RGB color value.

   light_blue = (0.659, 0.929, 1.000)    
    yellow = (1.0, 1.0, 0.039)            
    orange = (0.992, 0.502, 0.031)        
    red = (1.0, 0.255, 0.494)             
    purple = (0.902, 0.522, 1.0)          
    gray = (0.902, 0.902, 0.902)          

color_map = 
        "General Notes": light_blue,
        "Definitions, Locations, People, Organizations": yellow,
        "Author Thesis and Methodology": orange,
        "Important": red,
        "Stats": purple,
        "Quotes": gray

Both the colors and the designations can be customized to your research but note that the colors need to be distinct enough that the tool can differentiate between them.

◒

This tool works by importing the fitz module of PyMuPDF, a library used for working with PDF files, including extracting text and handling annotations. The header material also includes importing Python’s os module, which allows you to interact with the operating system, reading directories and file paths, and the re module, which provides regular expression operations as well as searching and modifying the text we will be extracting.

Here are all of the steps of the Python code:

Cleaning OCR text:
- Removing hyphenated line breaks
- Removing line breaks within text and replacing them with a space (with varying degrees of success… see the Future Refinement section)
- Removing corrupted characters and extra whitespace as it occurs in the text
Extracting highlighted text, mapping colors to their respective categories and using a tolerance to determine if the detected color is close enough to the reference color
Generating these extracted passages into two different files in the B folder within the Python tool
- One arranged categorically (important, definitions, quotes, etc.)
- One arranged chronologically (page 1, page 2, page 3, etc.)