Testing Claude 3.5 Sonnet for document text extraction
Kirk Marple
August 2, 2024
There are many approaches for performing text extraction from documents, like PDFs and Word documents.
Classically, Optical Character Recognition (OCR) has been used to identify blocks of text, tables, figures and headings from documents. There are many options for OCR text extraction such as Azure AI Document Intelligence and Amazon Textract, which provide robust capabilities.
With the advent of vision-capable models, such as OpenAI GPT-4o, there has been more interest in using Large Multimodal Models (LMMs) for visual text extraction.
There also can be cost benefit for using multimodal models for text extraction, compared to existing cloud services.
We have been doing some testing with the latest multimodal models, including Anthropic Claude 3.5 Sonnet, to see how accurate the text extraction can be without requiring OCR.
Anthropic Claude 3.5 Sonnet
In this case I used a customized prompt which instructed the model on how to analyze the provided image, and how to return a formatted JSON response.
The JSON schema is based on the internal mezzanine JSON format used by Graphlit, as a canonical storage mechanism for extracted text.
Multimodal Prompt
Here is the prompt I provided to Sonnet along with the PNG image:
<instructions>
Follow these steps:
1. Carefully analyze the provided image and identify ALL elements, including text, tables, radio buttons, checkboxes, figures, and charts. Pay special attention to tables and ensure they are not overlooked.
2. Treat the image as one section. Partition into a hierarchy of chunks, where Tables are made up of a child array of chunks of rows and columns with role assigned.
3. For each section, extract everything you found into the JSON schema structure provided below. For tables, ensure that you create a separate chunk with the role "Table" and include all table contents as child chunks with appropriate row and column indices. For tables, assign 't' as Markdown formatted version of the table chunks. For checkboxes or radio buttons in a table cell, mark only the selected item in the Markdown text as '[x]'. Don't add HTML tags in Markdown text.
4. Double-check that all identified elements, especially tables, have been properly extracted and represented in the JSON structure.
5. Return the final extracted text in JSON.
Make sure to escape any JSON strings.
For each radio button, have a closer look at the image and make sure you extract all the available options in the same order as you found them. Mark the one option which appeared to be selected. Treat checkbox as a list with a single item.
Before finalizing your response, review the extracted content to ensure no tables or other significant elements have been missed.
</instructions>
Answer only in JSON and conform to this schema:
{"ss":{"type":"array","description":"Array of document sections","items":{"type":"object","properties":{"i":{"type":"integer","description":"Index of the section"},"cc":{"type":"array","description":"Array of text chunks within the section","items":{"type":"object","properties":{"i":{"type":"integer","description":"Index of the text chunk"},"t":{"type":"string","description":"Text content of the chunk"},"r":{"type":"string","description":"Role of the text chunk, defaults to Paragraph","enum":["Paragraph","PageHeader","PageFooter","PageNumber","Title","SectionHeading","Heading1","Heading2","Heading3","Heading4","Heading5","Table"]},"cc":{"type":"array","description":"Array of table chunks, which are nested within the parent table chunk","items":{"type":"object","properties":{"i":{"type":"integer","description":"Index of the table chunk"},"ri":{"type":"integer","description":"Row index in table"},"ci":{"type":"integer","description":"Column index in table"},"t":{"type":"string","description":"Text content of the table chunk"},"s":{"type":"boolean","description":"Selected state of radio button or checkbox within table chunk"},"r":{"type":"string","description":"Role of the table chunk, defaults to TableCell","enum":["TableColumnHeader","TableRowHeader","TableCornerHeader","TableCell","TableCaption","RadioButton","Checkbox"]}}}}}}}}}}}
```json
Structured JSON Output
Here is the output from the Sonnet model:
{
"ss": [
{
"i": 1,
"cc": [
{
"i": 1,
"t": "EcoTouch® PINK® Fiberglas™ Insulation - Foil Faced by Owens Corning",
"r": "Title"
},
{
"i": 2,
"t": "Health Product Declaration v2.1.1",
"r": "SectionHeading"
},
{
"i": 3,
"t": "created with: HPDC Online Builder",
"r": "Paragraph"
},
{
"i": 4,
"t": "CLASSIFICATION: 07 21 16 Blanket Insulation\nPRODUCT DESCRIPTION: EcoTouch® Insulation can be used in a wide range of exterior wall, floor, roofing and ceiling applications for residential and commercial installations. EcoTouch® Insulation is provided with a wide range of R-values and thicknesses with excellent thermal control. The R30C, and R38C for use in the USA and R28 and R31 as used in Canada, provide excellent thermal performance in limited space of cathedral ceilings. EcoTouch® Insulation enhances interior noise control by improving Sound Transmission Class (STC) of walls and flooring assemblies.",
"r": "Paragraph"
},
{
"i": 5,
"t": "Section 1: Summary",
"r": "SectionHeading"
},
{
"i": 6,
"t": "Basic Method / Product Threshold",
"r": "SectionHeading"
},
{
"i": 7,
"t": "CONTENT INVENTORY",
"r": "Heading1"
},
{
"i": 8,
"r": "Table",
"cc": [
{
"i": 1,
"ri": 0,
"ci": 0,
"t": "Inventory Reporting Format",
"r": "TableColumnHeader"
},
{
"i": 2,
"ri": 0,
"ci": 1,
"t": "Threshold level",
"r": "TableColumnHeader"
},
{
"i": 3,
"ri": 0,
"ci": 2,
"t": "Residuals/Impurities",
"r": "TableColumnHeader"
},
{
"i": 4,
"ri": 1,
"ci": 0,
"t": "[ ] Nested Materials Method\n[x] Basic Method",
"r": "TableCell"
},
{
"i": 5,
"ri": 1,
"ci": 1,
"t": "[x] 100 ppm\n[ ] 1,000 ppm\n[ ] Per GHS SDS\n[ ] Per OSHA MSDS\n[ ] Other",
"r": "TableCell"
},
{
"i": 6,
"ri": 1,
"ci": 2,
"t": "[ ] Considered\n[x] Partially Considered\n[ ] Not Considered\n\nExplanation(s) provided\nfor Residuals/Impurities?\n[ ] Yes [x] No",
"r": "TableCell"
}
]
},
{
"i": 9,
"t": "Threshold Disclosed Per",
"r": "Heading2"
},
{
"i": 10,
"t": "[ ] Material\n[x] Product",
"r": "Paragraph"
},
{
"i": 11,
"t": "All Substances Above the Threshold Indicated Are:",
"r": "Heading2"
},
{
"i": 12,
"r": "Table",
"cc": [
{
"i": 1,
"ri": 0,
"ci": 0,
"t": "Characterized",
"r": "TableColumnHeader"
},
{
"i": 2,
"ri": 0,
"ci": 1,
"t": "[ ] Yes Ex/SC [x] Yes [ ] No",
"r": "TableCell"
},
{
"i": 3,
"ri": 1,
"ci": 0,
"t": "% weight and role provided for all substances.",
"r": "TableCell"
},
{
"i": 4,
"ri": 2,
"ci": 0,
"t": "Screened",
"r": "TableColumnHeader"
},
{
"i": 5,
"ri": 2,
"ci": 1,
"t": "[ ] Yes Ex/SC [x] Yes [ ] No",
"r": "TableCell"
},
{
"i": 6,
"ri": 3,
"ci": 0,
"t": "All substances screened using Priority Hazard Lists with results disclosed.",
"r": "TableCell"
},
{
"i": 7,
"ri": 4,
"ci": 0,
"t": "Identified",
"r": "TableColumnHeader"
},
{
"i": 8,
"ri": 4,
"ci": 1,
"t": "[ ] Yes Ex/SC [x] Yes [ ] No",
"r": "TableCell"
},
{
"i": 9,
"ri": 5,
"ci": 0,
"t": "One or more substances not disclosed by Name (Specific or Generic) and Identifier and/ or one or more Special Condition did not follow guidance.",
"r": "TableCell"
}
]
},
{
"i": 13,
"t": "CONTENT IN DESCENDING ORDER OF QUANTITY",
"r": "Heading1"
},
{
"i": 14,
"t": "Summary of product contents and results from screening individual chemical substances against HPD Priority Hazard Lists and the GreenScreen for Safer Chemicals®. The HPD does not assess whether using or handling this product will expose individuals to its chemical substances or any health risk. Refer to Section 2 for further details.",
"r": "Paragraph"
},
{
"i": 15,
"t": "Number of Greenscreen BM-4/BM3 contents ... 0\nContents highest concern GreenScreen\nBenchmark or List translator Score ... LT-1\nNanomaterial ... No\nINVENTORY AND SCREENING NOTES:\nSubstances representing 100% of the product weight meet the 100 ppm\nThreshold and are Screened.",
"r": "Paragraph"
},
{
"i": 16,
"t": "MATERIAL | SUBSTANCE | RESIDUAL OR IMPURITY\nGREENSCREEN SCORE | HAZARD TYPE",
"r": "Heading2"
},
{
"i": 17,
"t": "ECOTOUCH® PINK® FIBERGLAS™ INSULATION - FOIL FACED [ FIBER GLASS, BIOSOL UBL BINDER WITH ADDED UREA, EXTENDED WITH EARTH OXIDE CONTENT <1% BY WEIGHT LT-UNK | STARCH LT-UNK | CELLULOSE, MICROCRYSTALLINE NoGS | ASPHALT, OXIDIZED LT-1 | CAN | ALUMINUM POWDER (PYROPHORIC) LT-P1 | RES | PHY | END | VEGETABLE OIL NoGS | SODIUM LIGNOSULFONATE LT-UNK | AMINOPROPYLTRIETHOXYSILANE LT-UNK | SKI OILS, SUNFLOWER SEED NoGS | UNDISCLOSED NoGS | PETROLEUM RESINS LT-1 | CAN ]",
"r": "Paragraph"
},
{
"i": 18,
"t": "VOLATILE ORGANIC COMPOUND (VOC) CONTENT",
"r": "Heading1"
},
{
"i": 19,
"t": "VOC Content data is not applicable for this product category.",
"r": "Paragraph"
},
{
"i": 20,
"t": "CERTIFICATIONS AND COMPLIANCE See Section 3 for additional listings.",
"r": "Heading1"
},
{
"i": 21,
"t": "VOC emissions: UL/GreenGuard Gold Certified\nVOC emissions: UL/GreenGuard Certified\nVOC emissions: UL-Formaldehyde Free Claim Validation\nOther: LFT1 Declare - LBC Compliant - Third Party Verified",
"r": "Paragraph"
},
{
"i": 22,
"t": "CONSISTENCY WITH OTHER PROGRAMS",
"r": "Heading1"
},
{
"i": 23,
"t": "No pre-checks completed or disclosed.",
"r": "Paragraph"
},
{
"i": 24,
"r": "Table",
"cc": [
{
"i": 1,
"ri": 0,
"ci": 0,
"t": "Third Party Verified?",
"r": "TableColumnHeader"
},
{
"i": 2,
"ri": 0,
"ci": 1,
"t": "PREPARER: Self-Prepared\nVERIFIER:\nVERIFICATION #:",
"r": "TableCell"
},
{
"i": 3,
"ri": 1,
"ci": 0,
"t": "[ ] Yes\n[x] No",
"r": "TableCell"
},
{
"i": 4,
"ri": 0,
"ci": 2,
"t": "SCREENING DATE: 2019-08-26\nPUBLISHED DATE: 2019-08-30\nEXPIRY DATE: 2022-08-26",
"r": "TableCell"
}
]
},
{
"i": 25,
"t": "EcoTouch PINK Fiberglas Insulation - Foil Faced\nhpdrepository.hpd-collaborative.org",
"r": "PageFooter"
},
{
"i": 26,
"t": "HPD v2.1.1 created via HPDC Builder Page 1 of 9",
"r": "PageFooter"
}
]
}
]
}
Issues
Sonnet 3.5 has a little trouble with the accuracy of radio button extraction. For example, the "Identifier" radio button has 'No' selected, but Sonnet thought 'Yes' was selected. However it correctly extracted the selected state on many of the other radio buttons.
Comparison to OpenAI GPT-4o
From initial testing, OpenAI GPT-4o does a much worse job with the same image and prompt, and doesn't provide an accurate representation of the document structure.
Sonnet 3.5 shines for its potential in document text extraction.
{
"ss": [
{
"i": 0,
"cc": [
{
"i": 0,
"t": "EcoTouch PINK Fiberglas Insulation - Foil Faced by Owens Corning",
"r": "Title"
},
{
"i": 1,
"t": "CLASSIFICATION: 07 21 16 Blanket Insulation",
"r": "Paragraph"
},
{
"i": 2,
"t": "PRODUCT DESCRIPTION: EcoTouch Insulation can be used in a wide range of exterior wall, floor, roofing and ceiling applications for residential and commercial installations. EcoTouch Insulation is provided with a wide range of R-values and thicknesses with excellent thermal control. The R30C, and R38C for use in the USA and R28 and R31 as used in Canada, provide excellent thermal performance in limited space of cathedral ceilings. EcoTouch Insulation enhances interior noise control by improving Sound Transmission Class (STC) of walls and floor/ceiling assemblies.",
"r": "Paragraph"
},
{
"i": 3,
"t": "Health Product Declaration v2.1.1",
"r": "PageFooter"
},
{
"i": 4,
"t": "Section 1: Summary",
"r": "Heading1"
},
{
"i": 5,
"t": "Basic Method / Product Threshold",
"r": "Heading2"
},
{
"i": 6,
"t": "Table 1: Content Inventory",
"r": "Table",
"cc": [
{
"i": 0,
"ri": 0,
"ci": 0,
"t": "Inventory Reporting Format",
"r": "TableColumnHeader"
},
{
"i": 1,
"ri": 0,
"ci": 1,
"t": "Threshold level",
"r": "TableColumnHeader"
},
{
"i": 2,
"ri": 0,
"ci": 2,
"t": "Residuals/Impurities",
"r": "TableColumnHeader"
},
{
"i": 3,
"ri": 0,
"ci": 3,
"t": "All Substances Above the Threshold Indicated Are:",
"r": "TableColumnHeader"
},
{
"i": 4,
"ri": 1,
"ci": 0,
"t": "Nested Materials Method",
"r": "TableCell"
},
{
"i": 5,
"ri": 1,
"ci": 1,
"t": "< 100 ppm",
"r": "TableCell"
},
{
"i": 6,
"ri": 1,
"ci": 2,
"t": "Partially Considered",
"r": "TableCell"
},
{
"i": 7,
"ri": 1,
"ci": 3,
"t": "Characterized \\[x\\] Yes Ex/SC \\[ \\] No",
"r": "TableCell"
},
{
"i": 8,
"ri": 2,
"ci": 0,
"t": "Basic Method",
"r": "TableCell"
},
{
"i": 9,
"ri": 2,
"ci": 1,
"t": "1,000 ppm",
"r": "TableCell"
},
{
"i": 10,
"ri": 2,
"ci": 2,
"t": "Considered",
"r": "TableCell"
},
{
"i": 11,
"ri": 2,
"ci": 3,
"t": "Screened \\[x\\] Yes Ex/SC \\[ \\] No",
"r": "TableCell"
}
]
},
{
"i": 7,
"t": "Volatile Organic Compound (VOC) Content: VOC Content data is not applicable for this product category.",
"r": "Paragraph"
},
{
"i": 8,
"t": "CERTIFICATIONS AND COMPLIANCE: VOC emissions: UL/GreenGuard Gold Certified, VOC emissions: UL/GreenGuard Certified, VOC emissions: UL-Formaldehyde Free Claim Validation, Other: ILFI Declare - LBC Compliant - Third Party Verified",
"r": "Paragraph"
},
{
"i": 9,
"t": "CONSISTENCY WITH OTHER PROGRAMS: No pre-checks completed or disclosed.",
"r": "Paragraph"
},
{
"i": 10,
"t": "Third Party Verified? \\[ \\] Yes \\[x\\] No",
"r": "Checkbox"
}
]
}
]
}
We will continue to do more investigation on using multimodal models for text extraction, and look to add these as an option in the Graphlit preparation workflow.
Summary
Please email any questions on this article or the Graphlit Platform to questions@graphlit.com.
For more information, you can read our Graphlit Documentation, visit our marketing site, or join our Discord community.