Comparison of Web to Markdown Conversion APIs
Kirk Marple
February 26, 2025
For many RAG applications and AI agents, ingesting web content and converting to Markdown is the starting point for the unstructured data pipeline.
There are many approaches for HTML to Markdown conversion, and web crawling, but creating high-quality Markdown output can be challenging.
At Graphlit, we support web scraping (single page) and web crawling (via sitemap), and converting to Markdown output.
However, the raw Markdown output can often not be optimal for using as RAG context, because it contains the page header, or navigation buttons, or extraneous text that is repeated on every page.
Via Graphlit preparation workflows using the MODEL_DOCUMENT
service type, we now support LLM-enhanced conversion from HTML to Markdown.
By taking a screenshot of the original web page, and providing the unfiltered Markdown conversion to an LLM, we prompt the LLM to clean, reformat, and optimize the Markdown output.
Getting Started
You can compare Graphlit for yourself with this Google Colab notebook.
In this example, we are scraping a page from Anthropic's documentation.
data:image/s3,"s3://crabby-images/f6904/f69041e11c8bf00ae520c09b90e563435ebb07c3" alt=""
Anthropic Sonnet 3.7
Using Graphlit, and Anthropic Sonnet 3.7, we can provide the highest-quality web to Markdown conversion available today.
data:image/s3,"s3://crabby-images/7d45a/7d45acf076b37af156deb4e46474dd4109974dc5" alt=""
OpenAI GPT-4o
Graphlit supports any vision-enabled LLM, and here we can compare to using OpenAI GPT-4o.
data:image/s3,"s3://crabby-images/394d6/394d6744ccb0143c07954db480af3efbcddd5fc7" alt=""
Firecrawl
Here we compare to Firecrawl, one of the leading web crawling APIs, and you can see how they don't filter the Markdown output and the quality doesn't compare to Graphlit with Sonnet 3.7.
data:image/s3,"s3://crabby-images/be17c/be17caf0e41f57ffd7e0243e908fd1bfb8bfab56" alt=""
Jina Reader
Here we compare to the Jina Reader API, and you can see that Graphlit extracted Markdown which is not shown in the Jina output.
data:image/s3,"s3://crabby-images/fdb9a/fdb9ac0b443577d0cd4e9ffbdf11c6b9324d0a27" alt=""
Cost will obviously be a factor in the solution you choose, and with Graphlit, you can bring your own LLM API keys to keep cost to a minimum.
SUMMARY
Please email any questions on this article or the Graphlit Platform to questions@graphlit.com.
For more information, you can read our Graphlit Documentation, visit our marketing site, or join our Discord community.