30 Days of Graphlit (Day 1): Extract Markdown from PDF

Kirk Marple

September 1, 2024

Welcome to '30 Days of Graphlit', where every day during September 2024 we will show a new Python notebook example of how to use a feature (or features) of the Graphlit Platform.

Extract Markdown from PDF

In our first example, we'll show you how to use Graphlit for ingesting a PDF and extracting Markdown text from the document.

If you have used tools like Unstructured.IO or LlamaParse before, you will find that Graphlit takes an end-to-end approach: it handles PDF document extraction as well as text chunking and the creation of vector embeddings. Files are not only ingested and extracted, they are also made RAG-ready and searchable.

When you ingest the PDF, Graphlit uses Azure AI Document Intelligence to OCR the document and do smart partitioning of the document layout, then extracts the text (and the role of each text chunk) into our internal JSON structure.

When using one of our SDKs, you can access a Markdown-formatted version of the final extracted text through the content.markdown property.

In these examples, we're using our Python SDK, but we also have Node.js and .NET SDKs, which work similarly.

You can open the Python notebook example in Google Colab to follow along.


Initialization

First, we need to install the Graphlit Python SDK.

!pip install --upgrade graphlit-client

Then you can initialize the Graphlit client. Here we are assigning the required environment variables from Google Colab secrets.

If you haven't already signed up for Graphlit, you can learn more about signup here. You can learn more about creating a project here.

import os
from google.colab import userdata
from graphlit import Graphlit
from graphlit_api import input_types, enums, exceptions

os.environ['GRAPHLIT_ORGANIZATION_ID'] = userdata.get('GRAPHLIT_ORGANIZATION_ID')
os.environ['GRAPHLIT_ENVIRONMENT_ID'] = userdata.get('GRAPHLIT_ENVIRONMENT_ID')
os.environ['GRAPHLIT_JWT_SECRET'] = userdata.get('GRAPHLIT_JWT_SECRET')

graphlit = Graphlit()

By clicking on the 'key' icon in the Google Colab side menu, you can add these secrets and the values you get from your project in the Graphlit Developer Portal.
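
If you run the notebook somewhere other than Colab and set the environment variables yourself, a quick check can catch a missing value before the client is created. The following is a minimal sketch; missing_settings is a hypothetical helper written for this post, not part of the Graphlit SDK, and it only looks at the three variable names used above.

```python
import os

# The three variable names used in the initialization code above.
REQUIRED_SETTINGS = (
    "GRAPHLIT_ORGANIZATION_ID",
    "GRAPHLIT_ENVIRONMENT_ID",
    "GRAPHLIT_JWT_SECRET",
)

def missing_settings(env=None):
    """Return the names of required settings that are unset or empty.

    Hypothetical helper for illustration; not part of the Graphlit SDK.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_SETTINGS if not env.get(name)]

missing = missing_settings()
if missing:
    print("Missing Graphlit settings: " + ", ".join(missing))
```

Running this before calling Graphlit() gives a readable message instead of a failure later in the notebook.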


Get started

Now that we have the Graphlit client initialized, let's start writing some code.

The first step in extracting text from a PDF is to ingest it into Graphlit.

Let's write a function to ingest the file from a URI. Graphlit supports a wide variety of document types, such as PDF, DOCX, PPTX, XLSX, HTML, MD and more, so you can replace this URI with any hosted file, and it doesn't have to be a PDF. It even works with audio or video files, which it will automatically transcribe. We'll show more examples of media formats later this month.

Here we are calling the ingest_uri function of the Graphlit client with the URI of the PDF (or any file that Graphlit supports).

Graphlit ingests asynchronously by default, so we set is_synchronous to True. The function call then waits until the ingestion has completed on the server side before returning the ID of the new content object that is created.

In Graphlit, we wrap the metadata and physical storage of files, web pages or other unstructured data as a content object.

async def ingest_uri(uri: str):
    # Nothing to do if the client failed to initialize
    if graphlit.client is None:
        return None

    try:
        # Using synchronous mode, so the notebook waits for the content to be ingested
        response = await graphlit.client.ingest_uri(uri=uri, is_synchronous=True)

        return response.ingest_uri.id if response.ingest_uri is not None else None
    except exceptions.GraphQLClientError as e:
        print(str(e))
        return None
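
For comparison, here is a rough sketch of what waiting looks like when you do it yourself instead of setting is_synchronous to True. It is a generic polling helper, not part of the Graphlit SDK: wait_until_done and the check_done callback are hypothetical names, and you would have to supply your own check for whether ingestion has finished.

```python
import asyncio

async def wait_until_done(check_done, interval_seconds=1.0, max_attempts=30):
    """Poll an async callback until it returns True or attempts run out.

    check_done is a hypothetical coroutine function supplied by the caller;
    it should return True once the work being waited on has finished.
    Returns True on completion, False if max_attempts was exhausted.
    """
    for _ in range(max_attempts):
        if await check_done():
            return True
        # Pause between checks so we don't hammer the server
        await asyncio.sleep(interval_seconds)
    return False
```

With synchronous mode, none of this is needed in the notebook, which is why the examples here use it.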

Next, let's write a function to return all the properties of the ingested content.

async def get_content(content_id: str):
    # Nothing to do if the client failed to initialize
    if graphlit.client is None:
        return None

    try:
        response = await graphlit.client.get_content(content_id)

        return response.content
    except exceptions.GraphQLClientError as e:
        print(str(e))
        return None

You could just call graphlit.client.get_content directly, but our examples show how to handle any GraphQLClientError exceptions that are raised by the Graphlit SDK.


Run the example

That's all we need to test this out.

We call ingest_uri with a sample PDF, and the notebook will wait for that to be ingested into your Graphlit project.

Once the function completes successfully, we get all the properties of the content object, including content.markdown.

We then display the entire Markdown text of the PDF in the notebook. Beyond the Markdown text, you also get access to all the content metadata properties, such as content.document.title.

from IPython.display import display, Markdown

content_id = await ingest_uri(uri="https://graphlitplatform.blob.core.windows.net/samples/Attention%20Is%20All%20You%20Need.1706.03762.pdf")

if content_id is not None:
    print(f'Ingested content [{content_id}]:')

    content = await get_content(content_id)

    if content is not None:
        display(Markdown(content.markdown))


Let's run it and see the results.

You can see how it has extracted all the text from the "Attention Is All You Need" PDF, and then rendered it in the notebook.
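
If you want a quick summary rather than the full rendering, you can read the metadata alongside the Markdown. The sketch below assumes only the two properties mentioned in this post, content.document.title and content.markdown, and guards against either being missing. summarize_content is a hypothetical helper, and the sample object is a made-up stand-in so the snippet runs on its own; in the notebook you would pass the result of get_content instead.

```python
from types import SimpleNamespace

def summarize_content(content):
    """Return the document title and Markdown length, tolerating missing fields.

    Hypothetical helper for illustration; not part of the Graphlit SDK.
    """
    document = getattr(content, "document", None)
    title = getattr(document, "title", None) or "(untitled)"
    markdown = getattr(content, "markdown", None) or ""
    return {"title": title, "markdown_chars": len(markdown)}

# Made-up stand-in for a content object, used only to make this runnable.
sample = SimpleNamespace(
    document=SimpleNamespace(title="Sample title"),
    markdown="# Sample title\n\nSome extracted text.",
)

print(summarize_content(sample))
```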

No need to use other packages or APIs like Unstructured.IO or LlamaParse - Graphlit does it all for you!


Summary

Please email any questions on this tutorial or the Graphlit Platform to questions@graphlit.com.

For more information, you can read our Graphlit Documentation, visit our marketing site, or join our Discord community.