30 Days of Graphlit (Day 2): Scrape Website
Kirk Marple
September 2, 2024
Welcome to '30 Days of Graphlit', where each day during September 2024 we will show a new Python notebook example of how to use a feature (or features) of the Graphlit Platform.
Scrape Website
In this example, we'll show you how to use Graphlit for scraping a website and extracting text from the webpages.
If you have used tools like Firecrawl before, Graphlit provides an end-to-end approach for web scraping and text extraction as well as text chunking and the creation of vector embeddings. Webpages are scraped, but they are also made RAG-ready and searchable.
You can provide the URL of a website, and Graphlit will read the sitemap, if it has been defined, and then crawl the pages of the website. You can also filter the webpages to be ingested by the allowedPaths and excludedPaths regex patterns.
In these examples, we're using our Python SDK, but we also have Node.js and .NET SDKs, which work similarly.
You can open the Python notebook example in Google Colab to follow along.
Initialization
First, we need to install the Graphlit Python SDK.
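In a notebook, that's a single cell. The SDK is published on PyPI as graphlit-client:

```
!pip install --upgrade graphlit-client
```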
Then you can initialize the Graphlit client. Here we are assigning the required environment variables from Google Colab secrets.
If you haven't already signed up for Graphlit, you can learn more about signup here. You can learn more about creating a project here.
By clicking on the 'key' icon in the Google Colab side menu, you can add these secrets and the values you get from your project in the Graphlit Developer Portal.
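Here is a minimal initialization cell, as a sketch. The secret names shown below are assumptions for this example; match them to whatever names you gave your Colab secrets.

```python
import os
from google.colab import userdata
from graphlit import Graphlit

# Assign the required environment variables from Google Colab secrets.
# The secret names are illustrative; use the names you created in the Colab side menu.
os.environ['GRAPHLIT_ORGANIZATION_ID'] = userdata.get('GRAPHLIT_ORGANIZATION_ID')
os.environ['GRAPHLIT_ENVIRONMENT_ID'] = userdata.get('GRAPHLIT_ENVIRONMENT_ID')
os.environ['GRAPHLIT_JWT_SECRET'] = userdata.get('GRAPHLIT_JWT_SECRET')

# Initialize the Graphlit client with your project credentials
graphlit = Graphlit(
    organization_id=os.environ['GRAPHLIT_ORGANIZATION_ID'],
    environment_id=os.environ['GRAPHLIT_ENVIRONMENT_ID'],
    jwt_secret=os.environ['GRAPHLIT_JWT_SECRET']
)
```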
Get started
Now that we have the Graphlit client initialized, let's start writing some code.
First, we need to create a Web feed. Feeds are used to ingest from data sources like Google Drive, Notion, Slack or websites. They support 'one-shot' ingestion, as well as 'recurring' ingestion, where Graphlit will poll for changes on a periodic schedule.
Let's write a function called create_feed which takes the URL to the website, and an optional list of allowed URL paths (regex patterns, which we show below).
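Here is a sketch of create_feed, assuming the input_types, enums and exceptions modules from the graphlit-client SDK; the exact property names on the web feed input (for example allowedPaths and readLimit) may vary slightly between SDK versions.

```python
from typing import List, Optional

from graphlit_api import exceptions, input_types, enums

async def create_feed(uri: str, allowed_paths: Optional[List[str]] = None):
    # Build a Web feed that crawls the website's sitemap, optionally
    # restricted to URL paths matching the given regex patterns.
    input = input_types.FeedInput(
        name=uri,
        type=enums.FeedTypes.WEB,
        web=input_types.WebFeedPropertiesInput(
            uri=uri,
            allowedPaths=allowed_paths,
            readLimit=5  # read just 5 webpages; defaults to 100 if omitted
        )
    )

    try:
        # Uses the `graphlit` client initialized above
        response = await graphlit.client.create_feed(input)

        return response.create_feed.id if response.create_feed is not None else None
    except exceptions.GraphQLClientError as e:
        print(str(e))
        return None
```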
We are asking the feed to read just 5 webpages from the website, but you can pick any number of pages. It defaults to reading 100 pages (sorted alphabetically by the sitemap).
Feeds run asynchronously, and we will need to poll for completion. Since feeds can read 100s or 1000s of files from a data source, you can decide how often to poll, depending on the volume of data being ingested.
We can write a small helper function is_feed_done which safely checks if the feed has been completed, given the feed_id.
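A minimal sketch, assuming the SDK exposes the platform's isFeedDone query as is_feed_done on the client:

```python
async def is_feed_done(feed_id: str):
    # Guard against a null response before reading the result
    response = await graphlit.client.is_feed_done(feed_id)

    return response.is_feed_done.result if response.is_feed_done is not None else None
```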
Any data that a feed ingests into Graphlit is wrapped by a content object. Content can be in a variety of content types, such as FILE, PAGE, MESSAGE or ISSUE. (These are defined by the ContentTypes enum, and assigned to the property content.type.)
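For instance, webpages ingested by a web feed come back typed as PAGE. Here's an illustrative helper (not part of the example itself) that maps the ContentTypes enum to a readable label:

```python
from graphlit_api import enums

def describe_content_type(content) -> str:
    # Illustrative only: map the ContentTypes enum to a readable label
    labels = {
        enums.ContentTypes.FILE: 'file',
        enums.ContentTypes.PAGE: 'webpage',
        enums.ContentTypes.MESSAGE: 'message',
        enums.ContentTypes.ISSUE: 'issue',
    }

    return labels.get(content.type, 'other')
```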
Here we will write a query_contents function, which queries for all contents which were ingested by the web feed.
You'll notice that the feeds property in the content filter accepts an array of feed references. You can filter by more than one feed at a time.
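A sketch of query_contents, assuming the ContentFilter and EntityReferenceFilter input types from the SDK:

```python
from graphlit_api import exceptions, input_types

async def query_contents(feed_id: str):
    try:
        # Filter contents by the feed that ingested them; `feeds` accepts an
        # array of feed references, so you can filter by more than one feed.
        response = await graphlit.client.query_contents(
            filter=input_types.ContentFilter(
                feeds=[
                    input_types.EntityReferenceFilter(id=feed_id)
                ]
            )
        )

        return response.contents.results if response.contents is not None else None
    except exceptions.GraphQLClientError as e:
        print(str(e))
        return None
```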
Run the example
That's all we need to test this out.
We can call create_feed with the URL to the Graphlit blog, and also add a regex pattern so that only URLs with the term graphlit in them are ingested.
We poll for completion with is_feed_done, and then, when the feed has completed, we query the ingested webpages (contents) by the feed_id.
For each webpage content we get back, we display the URL of the webpage, and the extracted Markdown text.
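Putting it together, here's a sketch of the end-to-end run; the blog URL and the regex pattern below are illustrative. (Top-level await works in a Colab or Jupyter cell.)

```python
import time
from IPython.display import display, Markdown

uri = 'https://www.graphlit.com/blog'  # illustrative URL for the Graphlit blog

# Only ingest webpages whose URL path matches the regex, i.e. contains 'graphlit'
feed_id = await create_feed(uri, allowed_paths=['graphlit'])

if feed_id is not None:
    print(f'Created feed [{feed_id}].')

    # Poll every 5 seconds until the feed has completed
    while not await is_feed_done(feed_id):
        time.sleep(5)

    # Query the ingested webpages (contents) by the feed_id
    contents = await query_contents(feed_id)

    if contents is not None:
        for content in contents:
            # Display the URL of the webpage and the extracted Markdown text
            print(content.uri)
            display(Markdown(content.markdown))
```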
Let's run it and see the results.
As you can see, it correctly found only the pages in the Graphlit blog with 'graphlit' in the URL, and then displayed the extracted Markdown text.
No need to use other web scraping APIs or build this yourself - it's all integrated natively into Graphlit!
Summary
Please email any questions on this tutorial or the Graphlit Platform to questions@graphlit.com.
For more information, you can read our Graphlit Documentation, visit our marketing site, or join our Discord community.