30 Days of Graphlit (Day 1): Extract Markdown from PDF
Kirk Marple
September 1, 2024
Welcome to '30 Days of Graphlit', where every day during September 2024 we will show a new Python notebook example of how to use a feature (or features) of the Graphlit Platform.
Extract Markdown from PDF
In our first example, we'll show you how to use Graphlit for ingesting a PDF and extracting Markdown text from the document.
If you have used tools like Unstructured.IO or LlamaParse before, you'll find that Graphlit takes an end-to-end approach: in addition to PDF document extraction, it handles text chunking and the creation of vector embeddings. Files are not only ingested and extracted, but also made RAG-ready and searchable.
When you ingest the PDF, we will leverage Azure AI Document Intelligence to OCR the document, and do smart partitioning of the document layout, prior to extracting the text (and the role of each text chunk) into our internal JSON structure.
When using one of our SDKs, you can access a Markdown-formatted version of the final extracted text through the content.markdown property.
In these examples, we're using our Python SDK, but we also have Node.js and .NET SDKs, which work similarly.
You can open the Python notebook example in Google Colab to follow along.
Initialization
First, we need to install the Graphlit Python SDK.
Then you can initialize the Graphlit client. Here we are assigning the required environment variables from Google Colab secrets.
If you haven't already signed up for Graphlit, you can learn more about signup here. You can learn more about creating a project here.
By clicking on the 'key' icon in the Google Colab side menu, you can add these secrets and the values you get from your project in the Graphlit Developer Portal.
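As a rough sketch of the setup step, the helper below checks that the required settings are present before you construct the client. The three variable names are assumptions based on common Graphlit examples, not something this post spells out, so substitute the names your own project uses. The function itself is hypothetical and is not part of the Graphlit SDK.

```python
import os
from typing import Mapping, Optional

# Assumed variable names; use whatever names your notebook or project defines.
REQUIRED_SETTINGS = (
    "GRAPHLIT_ORGANIZATION_ID",
    "GRAPHLIT_ENVIRONMENT_ID",
    "GRAPHLIT_JWT_SECRET",
)


def load_graphlit_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Return the required settings, raising if any is missing or empty.

    Hypothetical helper for illustration only.
    """
    source = os.environ if env is None else env
    missing = [name for name in REQUIRED_SETTINGS if not source.get(name)]
    if missing:
        raise KeyError("Missing Graphlit settings: " + ", ".join(missing))
    return {name: source[name] for name in REQUIRED_SETTINGS}
```

In Google Colab, you could first copy each secret into os.environ using google.colab.userdata.get(name), then call load_graphlit_settings() so that a missing secret fails early with a clear message rather than later during a client call.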
Get started
Now that we have the Graphlit client initialized, let's start writing some code.
The first step in extracting text from a PDF is to ingest it into Graphlit.
Let's write a function to ingest the file from a URI. Graphlit supports a wide variety of document types, such as PDF, DOCX, PPTX, XLSX, HTML, MD and more, so you can replace this URI with any hosted file; it doesn't have to be a PDF. It even works with audio or video files, which it will automatically transcribe. We'll show more examples of media formats later this month.
Here we are calling the ingest_uri function of the Graphlit client with the URI of the PDF (or any file that Graphlit supports).
We are setting is_synchronous to True, since Graphlit will ingest asynchronously by default. With this flag set, the function call will wait until the ingestion has completed on the server side before returning the ID of the new content object that is created.
In Graphlit, we wrap the metadata and physical storage of files, web pages or other unstructured data as a content object.
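The ingestion step described above might look roughly like the following. This is a minimal sketch, not the notebook's exact code: it assumes the client's ingest_uri is awaitable and that the response exposes the new ID as response.ingest_uri.id. The client is passed in as a parameter so the sketch does not depend on how you constructed it.

```python
from typing import Any, Optional


async def ingest_uri(client: Any, uri: str) -> Optional[str]:
    """Ingest a hosted file and return the ID of the new content object."""
    # is_synchronous=True asks the call to wait until server-side ingestion
    # has finished, instead of returning as soon as the request is accepted.
    response = await client.ingest_uri(uri=uri, is_synchronous=True)

    # Assumed response shape: response.ingest_uri.id
    result = getattr(response, "ingest_uri", None)
    return getattr(result, "id", None)
```

In a Colab cell you could call it with top-level await, for example content_id = await ingest_uri(client, uri), where client is whatever client object your initialization cell produced.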
Next, let's write a function to return all the properties of the ingested content.
You could just call get_content directly, but our examples show how to handle any GraphQLClientError exceptions that are thrown by the Graphlit SDK.
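One way to wrap the lookup with error handling is sketched below. It assumes get_content is awaitable, takes the content ID as its argument, and returns a response with a content attribute. Because the import path of GraphQLClientError is not shown in this post, the sketch takes the exception class as a hypothetical error_type parameter; in the notebook you would pass the SDK's GraphQLClientError.

```python
from typing import Any, Optional, Type


async def get_content(
    client: Any,
    content_id: str,
    error_type: Type[BaseException] = Exception,
) -> Optional[Any]:
    """Return the content object for content_id, or None if the call fails."""
    try:
        response = await client.get_content(content_id)
    except error_type as error:
        # Report the failure and let the caller decide what to do with None.
        print(f"Failed to get content [{content_id}]: {error}")
        return None

    # Assumed response shape: response.content
    return getattr(response, "content", None)
```

Returning None on failure keeps the notebook cell from stopping on a stack trace; if you would rather fail fast, re-raise the exception after logging it.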
Run the example
That's all we need to test this out.
We call ingest_uri with a sample PDF, and the notebook will wait for that to be ingested into your Graphlit project.
Once the function completes successfully, we get all the properties of the content object, including content.markdown.
We then display the entire Markdown text of the PDF in the notebook. You also get access to all the content metadata properties, such as content.document.title, not just the Markdown text.
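To illustrate reading both the Markdown text and a metadata property from the returned object, here is a small hypothetical helper. It relies only on the content.markdown and content.document.title properties mentioned above, and it tolerates either one being absent; the output layout is an illustrative choice, not what the notebook prints.

```python
from typing import Any


def content_to_markdown(content: Any) -> str:
    """Build displayable Markdown from a content object.

    Hypothetical helper: prefixes the extracted Markdown with the document
    title when one is available.
    """
    document = getattr(content, "document", None)
    title = getattr(document, "title", None)
    markdown = getattr(content, "markdown", None) or ""

    if title:
        return f"**Title:** {title}\n\n{markdown}"
    return markdown
```

In a notebook, you could render the result with IPython's display(Markdown(content_to_markdown(content))), after importing display and Markdown from IPython.display.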
Let's run it and see the results.
You can see how it has extracted all the text from the "Attention Is All You Need" PDF, and then rendered it in the notebook.
No need to use other packages or APIs like Unstructured.IO or LlamaParse - Graphlit does it all for you!
Summary
Please email any questions on this tutorial or the Graphlit Platform to questions@graphlit.com.
For more information, you can read our Graphlit Documentation, visit our marketing site, or join our Discord community.