30 Days of Graphlit (Day 2): Scrape Website

Kirk Marple

September 2, 2024

Welcome to '30 Days of Graphlit', where every day during September 2024 we will show a new Python notebook example of how to use a feature (or features) of the Graphlit Platform.

Scrape Website

In this example, we'll show you how to use Graphlit for scraping a website and extracting text from the webpages.

If you have used tools like Firecrawl before, the workflow will feel familiar, but Graphlit provides an end-to-end approach: web scraping and text extraction, plus text chunking and the creation of vector embeddings. Webpages aren't just scraped; they are also made RAG-ready and searchable.

You can provide the URL of a website, and Graphlit will read the sitemap, if it has been defined, and then crawl the pages of the website. You can also filter the webpages to be ingested by the allowedPaths and excludedPaths regex patterns.
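
For example, path patterns like these (purely illustrative values, not from the notebook) would restrict a crawl to blog posts while skipping an archive section:

# Illustrative path filter patterns (regex), matched against webpage paths
allowed_paths = ["^/blog/.*$"]           # only crawl pages under /blog/
excluded_paths = ["^/blog/archive/.*$"]  # skip anything under /blog/archive/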

In these examples, we're using our Python SDK, but we also have Node.js and .NET SDKs, which work similarly.

You can open the Python notebook example in Google Colab to follow along.


Initialization

First, we need to install the Graphlit Python SDK.

!pip install --upgrade graphlit-client

Then you can initialize the Graphlit client. Here we are assigning the required environment variables from Google Colab secrets.

If you haven't already signed up for Graphlit, you can learn more about signup here. You can learn more about creating a project here.

import os
from typing import List, Optional  # used for type hints in the helper functions below
from google.colab import userdata
from graphlit import Graphlit
from graphlit_api import input_types, enums, exceptions

os.environ['GRAPHLIT_ORGANIZATION_ID'] = userdata.get('GRAPHLIT_ORGANIZATION_ID')
os.environ['GRAPHLIT_ENVIRONMENT_ID'] = userdata.get('GRAPHLIT_ENVIRONMENT_ID')
os.environ['GRAPHLIT_JWT_SECRET'] = userdata.get('GRAPHLIT_JWT_SECRET')

graphlit = Graphlit()

By clicking on the 'key' icon in the Google Colab side menu, you can add these secrets and the values you get from your project in the Graphlit Developer Portal.
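
If you're not running in Google Colab, you can set the same environment variables in your shell, or pass the credentials directly to the client constructor. Here's a minimal sketch with placeholder values; the keyword parameter names follow the Python SDK README, so verify them against your SDK version:

# Minimal sketch for use outside of Colab, with placeholder credential values
graphlit = Graphlit(
    organization_id="YOUR_ORGANIZATION_ID",
    environment_id="YOUR_ENVIRONMENT_ID",
    jwt_secret="YOUR_JWT_SECRET"
)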


Get started

Now that we have the Graphlit client initialized, let's start writing some code.

First, we need to create a Web feed. Feeds are used to ingest from data sources like Google Drive, Notion, Slack or websites. They support 'one-shot' ingestion, as well as 'recurring' ingestion, where Graphlit will poll for changes on a periodic schedule.

Let's write a function called create_feed which takes the URL to the website, and an optional list of allowed URL paths (regex patterns, which we show below).

We are asking the feed to read just 5 webpages from the website, but you can pick any number of pages. It defaults to reading 100 pages (sorted alphabetically by the sitemap).

async def create_feed(uri: str, allowed_paths: Optional[List[str]] = None):
    if graphlit.client is None:
        return None

    input = input_types.FeedInput(
        name=uri,
        type=enums.FeedTypes.WEB,
        web=input_types.WebFeedPropertiesInput(
            uri=uri,
            allowedPaths=allowed_paths,
            readLimit=5 # limiting to 5 pages from website
        )
    )

    try:
        response = await graphlit.client.create_feed(input)

        return response.create_feed.id if response.create_feed is not None else None
    except exceptions.GraphQLClientError as e:
        print(str(e))
        return None
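
As an aside, create_feed above creates a 'one-shot' feed. For the 'recurring' ingestion mentioned earlier, the FeedInput can also carry a schedule policy. The sketch below is an assumption based on the Graphlit GraphQL schema (the field and enum names, and the interval format, may differ in your SDK version), so check the Feeds documentation for the exact shape.

# Hedged sketch of a recurring Web feed; schedulePolicy, recurrenceType and repeatInterval
# are assumptions from the GraphQL schema, not verified against the Python SDK.
input = input_types.FeedInput(
    name="Graphlit blog (recurring)",
    type=enums.FeedTypes.WEB,
    web=input_types.WebFeedPropertiesInput(
        uri="https://www.graphlit.com/blog",
        readLimit=5
    ),
    schedulePolicy=input_types.FeedSchedulePolicyInput(
        recurrenceType=enums.TimedPolicyRecurrenceTypes.REPEAT,
        repeatInterval="PT15M" # assumed ISO 8601 duration; check the docs for the expected format
    )
)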

Feeds run asynchronously, and we will need to poll for completion. Since feeds can read 100s or 1000s of files from a data source, you can decide how often to poll, depending on the volume of data being ingested.

We can write a small helper function is_feed_done which safely checks if the feed has been completed, given the feed_id.

async def is_feed_done(feed_id: str):
    if graphlit.client is None:
        return None

    response = await graphlit.client.is_feed_done(feed_id)

    return response.is_feed_done.result if response.is_feed_done is not None else None

Any data that a feed ingests into Graphlit is wrapped by a content object. Content can be in a variety of content types, such as FILE, PAGE, MESSAGE or ISSUE. (These are defined by the ContentTypes enum, and assigned to the property content.type.)
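
Webpages ingested by a Web feed come back as PAGE content. Once we have a content object (as in the loop at the end of this post), a type check might look like this minimal sketch:

# Minimal sketch: webpages ingested by a Web feed are PAGE content
if content.type == enums.ContentTypes.PAGE:
    print(f'Webpage: {content.uri}')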

Here we will write a query_contents function, which queries for all contents which were ingested by the web feed.

You'll notice that the feeds property in the content filter accepts an array of feed references. You can filter by more than one feed at a time.

async def query_contents(feed_id: str):
    if graphlit.client is None:
        return None

    try:
        response = await graphlit.client.query_contents(
            filter=input_types.ContentFilter(
                feeds=[
                    input_types.EntityReferenceFilter(
                        id=feed_id
                    )
                ]
            )
        )

        return response.contents.results if response.contents is not None else None
    except exceptions.GraphQLClientError as e:
        print(str(e))
        return None
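
As noted above, the feeds filter accepts more than one feed reference. Here's a minimal sketch of filtering across two feeds at once (the feed IDs are placeholders):

# Minimal sketch: filter contents across two feeds at once (placeholder feed IDs)
filter = input_types.ContentFilter(
    feeds=[
        input_types.EntityReferenceFilter(id=first_feed_id),
        input_types.EntityReferenceFilter(id=second_feed_id)
    ]
)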


Run the example

That's all we need to test this out.

We can call create_feed with the URL of the Graphlit blog, and add a regex pattern so that only URLs containing the term graphlit are ingested.

We poll for completion with is_feed_done, and then when the feed has completed, we query the ingested webpages (contents) by the feed_id.

For each webpage content we get back, we display the URL of the webpage, and the extracted Markdown text.


from IPython.display import display, Markdown
import time

# Find URLs with the word 'graphlit' in them.
feed_id = await create_feed(uri="https://www.graphlit.com/blog", allowed_paths=["^/blog/.*graphlit.*$"])

if feed_id is not None:
    print(f'Created feed [{feed_id}].')

    # Wait for feed to complete, since ingestion happens asynchronously
    done = False
    time.sleep(5)
    while not done:
        done = await is_feed_done(feed_id)

        if not done:
            time.sleep(2)

    print(f'Completed feed [{feed_id}].')

    # Query contents by feed
    contents = await query_contents(feed_id)

    if contents is not None:
        for content in contents:
            if content is not None:
                display(Markdown(f'# Webpage: {content.uri}:\n{content.markdown}'))


Let's run it and see the results.

As you can see, it correctly found only the pages on the Graphlit blog with 'graphlit' in the URL, and then displayed the extracted Markdown text.

No need to use other web scraping APIs or build this yourself - it's all integrated natively into Graphlit!


Summary

Please email any questions on this tutorial or the Graphlit Platform to questions@graphlit.com.

For more information, you can read our Graphlit Documentation, visit our marketing site, or join our Discord community.