Web Scraper

About

Web Scraper makes it effortless to scrape websites. Just provide a URL and a CSS selector, and it will return JSON containing the text contents of the matching elements.

How is Web Scraper built with Workers?

Web Scraper uses Cloudflare Workers in a few distinct ways:

The scraping itself

The scraping functionality is built using the HTMLRewriter API within Cloudflare Workers. Using this API, the script is able to leverage a fast and powerful HTML parser to quickly scan a document for the given selector. This is all accomplished with fewer than 100 lines of code.
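As a rough illustration of the approach described above (not the project's actual source), the sketch below collects the text of each element that matches a selector. The names TextCollector and scrape are hypothetical. HTMLRewriter is a global in the Workers runtime, so scrape only runs there; the collector is plain JavaScript. For simplicity it assumes matched elements are not nested inside one another.

```javascript
// Hypothetical handler: gathers the text of every element matched by one selector.
class TextCollector {
  constructor() {
    this.matches = [];
  }

  // Called once for each matching element: start a new, empty entry.
  element() {
    this.matches.push("");
  }

  // Called for each text chunk inside a matching element; chunks may be split
  // arbitrarily by the streaming parser, so they are appended to the current entry.
  text(chunk) {
    this.matches[this.matches.length - 1] += chunk.text;
  }
}

// Hypothetical helper: fetches a page and returns the trimmed text of all matches.
async function scrape(url, selector) {
  const collector = new TextCollector();
  const response = await fetch(url);
  // The handlers only fire while the transformed body is being read,
  // so the body is consumed here even though its bytes are discarded.
  await new HTMLRewriter().on(selector, collector).transform(response).arrayBuffer();
  return collector.matches.map((text) => text.trim());
}
```

Inside a Worker, a call such as scrape("https://example.com", "h1") would resolve to an array with one string per matching element.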

Serving the site and its API

The actual response you get when you visit the site is returned by the Worker script itself. This is done by storing the source of the page in a JS template literal, importing it, and allowing Wrangler, the Workers CLI, to bundle up the site into a single script using Webpack.
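A minimal sketch of that idea, assuming a much smaller page than the real one: the markup lives in a template literal and is returned with an HTML content type. In the real project the string would sit in its own module and be imported; it is inlined here so the example stands alone. The names pageSource and servePage are hypothetical.

```javascript
// Stand-in for the page source; the real site's markup is not reproduced here.
const pageSource = `<!DOCTYPE html>
<html>
  <head><title>Web Scraper</title></head>
  <body><h1>Web Scraper</h1></body>
</html>`;

// Hypothetical helper: wraps the stored markup in an HTML response.
function servePage() {
  return new Response(pageSource, {
    headers: { "content-type": "text/html;charset=UTF-8" },
  });
}
```

Because the page is just a string inside the bundle, the Worker can answer a visit without fetching anything from an origin server.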

The API is implemented similarly. When the required query params are added to the requested URL, the Worker script returns an application/json response instead of an HTML document.
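One way this branching could look is sketched below. The parameter names url and selector are assumptions based on the description of the tool, and parseScrapeRequest and jsonResponse are hypothetical helpers rather than the project's own functions.

```javascript
// Hypothetical helper: returns the scrape parameters when both are present,
// or null when the request should receive the HTML page instead.
function parseScrapeRequest(requestUrl) {
  const params = new URL(requestUrl).searchParams;
  const url = params.get("url"); // assumed parameter name
  const selector = params.get("selector"); // assumed parameter name
  return url && selector ? { url, selector } : null;
}

// Hypothetical helper: serializes any value as a JSON response.
function jsonResponse(data) {
  return new Response(JSON.stringify(data), {
    headers: { "content-type": "application/json;charset=UTF-8" },
  });
}
```

A fetch handler could call parseScrapeRequest(request.url) first, return jsonResponse with the scraped results when it yields parameters, and fall back to the HTML page when it returns null.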

The site’s design

The site’s CSS is served by another Worker script, ui.adam.workers.dev. This is a UI API which can return the CSS for a specified list of components, on the fly. This is done by passing query parameters to the script:

https://ui.adam.workers.dev/?
  components=
    link,
    button,
    formField,
    input,
    checkbox,
    stack,
    row,
    dialog

The resulting CSS is generated at the edge for each request, meaning the response contains nothing more than what’s needed for the components on the page. And the result is cached at the edge so it’s as fast as if it were a static CSS file.
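The stylesheet URL shown above can also be assembled in code. The sketch below is one possible way to do it; buildStylesheetUrl is a hypothetical helper. Note that URLSearchParams percent-encodes the commas as %2C, so this assumes the UI service decodes its query string in the usual way.

```javascript
// Hypothetical helper: builds the stylesheet URL for a list of component names.
function buildStylesheetUrl(components) {
  const url = new URL("https://ui.adam.workers.dev/");
  // Component names are sent as a single comma-separated parameter.
  url.searchParams.set("components", components.join(","));
  return url.toString();
}
```

The returned string can be used as the href of a stylesheet link element in the page.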

The domain name

By registering the free subdomain “scraper”, then naming the script “web”, the site can be published at the fun and descriptive domain web.scraper.workers.dev.