A good PDF parser should emit a relational set of DataFrames, not flat text. The model includes tables for lines, pages, TOC, images, cross-references, captions, spans, and a parsing summary. Retrieval, generation, and highlighting all read these tables, never the raw PDF.
Tap to vote and see what everyone thinks.