Python SEO Crawler: Automate Technical Audits Without Screaming Frog

/01 What it is & why I built it

SEO Audit Crawler is a small, domain-agnostic Python crawler that extracts the signals you actually need for a technical SEO audit. You point it at a site, or feed it a URL list or an XML sitemap, pick the User-Agent you want to crawl as, and it returns a per-page CSV + JSON report plus a console summary of the most common issues. Nothing is hardcoded to one site; it works on any domain.

I built it because the technical-audit market is dominated by a handful of excellent but paid, GUI-bound tools. Screaming Frog is the standard, and it's great. But the moment you want to run an audit on a schedule, diff two crawls in version control, plug the output into a data pipeline, or simply hand a teammate a command they can run for free, a desktop app with a licence and a render budget gets in the way.

The problem this solves is the gap between "I need this signal right now, in a script" and "let me open a paid GUI and export a sheet." It's a free, scriptable, Python-native alternative for the practitioner who already lives in the terminal: the output is plain CSV and JSON you can diff, commit, and pipe; the whole thing is a single command driven by flags; and there are no crawl credits, seats, or SaaS dependencies to manage. When you need rendered SPA pages or Core Web Vitals you'll still reach for heavier tooling, but for the bread-and-butter technical checks, this is the no-friction option.

Who this is for

SEO practitioners, developers, and analysts who want repeatable, automatable technical audits without a per-seat licence, and who are comfortable running a Python command. If you've ever wished Screaming Frog had a clean CLI you could cron, this is that.

/02 Why manual crawls still matter

It's tempting to think Google Search Console and analytics tell you everything. They don't, and the gap is exactly where a crawler earns its keep.

GSC and analytics are rear-view mirrors. They only show you pages Google has already visited, indexed, and served impressions for. A crawler shows you what Google will see on its next visit: the pages, redirects, and tags as they exist right now, before they ever make it into a report.
Catch issues before they cost you rankings. Broken internal links, missing or conflicting canonical tags, duplicate titles, and missing meta descriptions rarely announce themselves. By the time a ranking drops and it surfaces in GSC, the damage is done. A crawl flags all of these up front, while they're still cheap to fix.
Audit hreflang and structured data at scale. International tagging and rich-result markup are easy to get subtly wrong: a missing x-default, a non-reciprocal return link, an invalid language code, a JSON-LD type missing a required property. Checking those by hand across hundreds of pages is infeasible; a crawler does it in one pass.
Pre-launch QA for new pages and migrations. Before a new section ships or a site moves, you want to confirm canonicals point where they should, redirects chain correctly, and nothing is accidentally noindex. A crawl against staging (or the new URLs) is the cheapest insurance you'll buy.
Scheduled crawls catch SEO regressions over time. Because the output is plain CSV/JSON, you can run the same crawl on a cron and diff today's report against last week's. A title that silently changed, a canonical that flipped, a page that started returning 404: regressions show up as a diff, not a mystery.
No third-party SaaS or crawl credits. The tool runs entirely on your machine (or your CI). There are no per-URL credits to ration, no seat limits, and no external service that has to be up for your audit to run.
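
The scheduled-diff idea above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not part of the crawler: the column names (url, status, title, canonical) are assumptions about the report layout, so match them to the header row of your own CSV.

```python
import csv
import io


def load_report(csv_text, key="url"):
    """Index a crawl report by URL. Column names here are assumptions."""
    return {row[key]: row for row in csv.DictReader(io.StringIO(csv_text))}


def diff_reports(old, new, fields=("status", "title", "canonical")):
    """Return (added, removed, changed) between two indexed reports."""
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    changed = {}
    for url in sorted(set(old) & set(new)):
        delta = {
            field: (old[url].get(field), new[url].get(field))
            for field in fields
            if old[url].get(field) != new[url].get(field)
        }
        if delta:
            changed[url] = delta
    return added, removed, changed


# Two tiny made-up reports standing in for last week's and today's CSV.
last_week = """url,status,title,canonical
https://www.example.com/,200,Home,https://www.example.com/
https://www.example.com/pricing,200,Pricing,https://www.example.com/pricing
"""
today = """url,status,title,canonical
https://www.example.com/,200,Home,https://www.example.com/
https://www.example.com/pricing,404,Pricing,https://www.example.com/pricing
https://www.example.com/de/pricing,200,Preise,https://www.example.com/de/pricing
"""

added, removed, changed = diff_reports(load_report(last_week), load_report(today))
for url, delta in changed.items():
    print(url, delta)
```

In a real run you would read the two files from disk (for example with open(path, newline="", encoding="utf-8")) instead of the inline strings.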

Crawl as the bot that matters

Some sites serve different markup, redirects, or blocks to Googlebot than to a desktop browser. This crawler lets you pick the User-Agent, and it applies that agent to every request, including robots.txt, so robots rules are evaluated for that specific agent. Crawling as the bot you care about surfaces exactly the differences a browser would hide from you.
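
To see why the agent matters for robots.txt, here is a small sketch using Python's standard urllib.robotparser. It is not the crawler's own code, and the robots.txt content and the ExampleBrowser token are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: Googlebot gets in, everyone else is blocked.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow: /
"""


def allowed(robots_txt, agent_token, url):
    """Evaluate robots rules for one agent token and one URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent_token, url)


for agent in ("Googlebot", "ExampleBrowser"):
    print(agent, allowed(ROBOTS_TXT, agent, "https://www.example.com/pricing"))
```

Note that urllib.robotparser matches on the product token before the first slash, so pass a token such as Googlebot rather than a full User-Agent string.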

/03 How to install & run

You'll need Python and Git installed locally. Clone the repository, then install the dependencies. There are only three required packages.

bash / terminal

# 1. Clone the repository
git clone https://github.com/kiranbabuthatha/SEO-crawler.git
cd SEO-crawler

# 2. (Recommended) create and activate a virtual environment
python -m venv .venv
source .venv/bin/activate          # macOS / Linux
# .venv\Scripts\activate           # Windows (PowerShell)

# 3. Install dependencies
pip install -r requirements.txt

requests, beautifulsoup4, and lxml are all that's required. To let the crawler accept Brotli-compressed responses, optionally install brotli. It advertises only the compression it can actually decode, so this is safe to skip:

bash / terminal

pip install brotli   # optional, adds Brotli-compressed response support
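
The "advertise only what you can decode" behaviour described above can be approximated like this. It is a sketch under assumptions, not the crawler's actual implementation: the function name is hypothetical, and it simply checks whether a Brotli decoder package is importable before adding br to the Accept-Encoding header.

```python
import importlib.util


def accept_encoding():
    """Build an Accept-Encoding value from the decoders that are installed."""
    encodings = ["gzip", "deflate"]  # handled without extra packages
    # Only advertise Brotli if a decoder package can actually be imported.
    if importlib.util.find_spec("brotli") or importlib.util.find_spec("brotlicffi"):
        encodings.append("br")
    return ", ".join(encodings)


headers = {"Accept-Encoding": accept_encoding()}
print(headers)
```

Because zstd is never advertised here, a well-behaved server should not send it.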

Verify it works offline first

Before pointing it at a live site, run the bundled offline test suite. No network required. You should see ALL PASSED:

bash / terminal

python test_crawler.py        # expect: ALL PASSED

Run your first crawl

The crawler is a single command driven by flags. The simplest run crawls a site as Googlebot mobile, following internal links up to depth 3, capped at 100 pages:

bash / terminal

python seo_crawler.py https://www.example.com --ua googlebot-mobile --max-pages 100

When it finishes you'll have seo_audit.csv and seo_audit.json in the working directory, plus a console summary of status codes and the most common issues. Override the output prefix with --out.

Three ways to seed a crawl

You can provide a positional start URL, a --urls-file (one URL per line), or --from-sitemap (auto-discovered from robots.txt or a URL you supply), in any combination. All sources are merged and deduplicated before the crawl begins.
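
As a rough illustration of the merge-and-deduplicate step, the sketch below combines a start URL, a URL list, and a sitemap using only the standard library. The function names are hypothetical, and deduplicating on the exact URL string is an assumption; the crawler may normalise URLs differently.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def read_url_lines(text):
    """One URL per line; skip blank lines and # comments."""
    lines = (line.strip() for line in text.splitlines())
    return [line for line in lines if line and not line.startswith("#")]


def sitemap_urls(xml_text):
    """Pull every <loc> value out of a urlset sitemap."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall(".//sm:loc", SITEMAP_NS) if loc.text]


def merge_seeds(*sources):
    """Merge seed lists, keeping first-seen order and dropping exact duplicates."""
    return list(dict.fromkeys(url for source in sources for url in source))


url_file = """# examples/sample_urls.txt
https://www.example.com/

https://www.example.com/pricing
"""
sitemap = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/pricing</loc></url>
  <url><loc>https://www.example.com/de/pricing</loc></url>
</urlset>"""

seeds = merge_seeds(
    ["https://www.example.com/"],  # positional start URL
    read_url_lines(url_file),
    sitemap_urls(sitemap),
)
print(seeds)
```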

/04 Features & what it checks

For every page it fetches, the crawler records a deep set of signals. Here's what each one is and why it matters for SEO.

Response & indexability

Response details: final URL, status code, redirect chain, response time, and content type. Why it matters: redirect chains waste crawl budget and dilute link equity; the final URL tells you where a request actually lands.
Indexability verdict: a single verdict that combines meta robots, the X-Robots-Tag header, the HTTP status, and the canonical, plus the reason a page is non-indexable. Why it matters: a page can be silently excluded from Google by any one of these; one verdict tells you whether it can rank at all, and why not.
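
To make the idea of a single verdict concrete, here is a deliberately simplified sketch. The function, its parameter names, the order of the checks, and the reason strings are all assumptions for illustration; the crawler's real logic may weigh these signals differently.

```python
def indexability(status, url, meta_robots="", x_robots_tag="", canonical=""):
    """Return (is_indexable, reason) from the four signals named above."""
    if status != 200:
        return False, f"status {status}"
    # Treat both sources as one comma-separated directive list.
    # Simplification: agent-scoped headers such as "googlebot: noindex" are not handled.
    directives = {
        part.strip().lower()
        for part in f"{meta_robots},{x_robots_tag}".split(",")
        if part.strip()
    }
    if directives & {"noindex", "none"}:
        return False, "noindex directive"
    if canonical and canonical != url:
        return False, "canonicalised to another URL"
    return True, "indexable"


page = "https://www.example.com/pricing"
print(indexability(200, page, canonical=page))
print(indexability(200, page, x_robots_tag="noindex, nofollow"))
print(indexability(200, page, canonical="https://www.example.com/"))
```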

On-page & content

Title & meta description: both their character length and estimated SERP pixel width. Why it matters: Google truncates on pixels, not characters, so a "55-character" title can still get cut off. Pixel width is the honest measure.
Word count & text-to-HTML ratio: visible word count and how much of the page is content vs. markup. Why it matters: a fast way to spot thin or boilerplate-heavy pages.
Heading structure: full H1 to H6 counts plus a document-order check that flags skipped levels (e.g. H2 to H4) and a first heading that isn't H1. Why it matters: a clean heading outline helps both accessibility and how search engines parse page structure.
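
The document-order heading check is easy to picture in code. The sketch below uses the standard-library html.parser rather than the crawler's own parser, and the issue wording is made up; it only shows the two rules described above.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Record heading levels (1 to 6) in document order."""

    def __init__(self):
        super().__init__()
        self.levels = []

    def handle_starttag(self, tag, attrs):
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self.levels.append(int(tag[1]))


def heading_issues(html):
    """Flag a first heading that isn't H1 and any skipped level."""
    collector = HeadingCollector()
    collector.feed(html)
    levels = collector.levels
    issues = []
    if levels and levels[0] != 1:
        issues.append(f"first heading is H{levels[0]}, not H1")
    for previous, current in zip(levels, levels[1:]):
        if current > previous + 1:
            issues.append(f"skipped level: H{previous} to H{current}")
    return issues


print(heading_issues("<h1>Pricing</h1><h2>Plans</h2><h4>Fine print</h4>"))
```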

Canonical, international & links

Canonical detection: read from both the <link> tag and the HTTP Link header, with self-referencing detection and conflicting/multiple-canonical flagging. With --check-links the canonical target's HTTP status is validated too. Why it matters: conflicting canonicals are a classic, hard-to-spot cause of the wrong URL being indexed.
Hreflang (international): every annotation (language and URL) plus validation: x-default presence, self-reference, malformed language codes, and reciprocal return-links between crawled pages. Why it matters: hreflang only works when return links are reciprocal and codes are valid; one broken link breaks the cluster.
Link analysis: internal, external, nofollow, sponsored, and ugc counts, plus empty and generic anchor-text ("click here") detection. With --check-links, every link is HEAD-checked for broken (4xx/5xx) and redirecting targets. Why it matters: broken internal links leak crawl budget and frustrate users; descriptive anchors pass clearer relevance signals.
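
Hreflang reciprocity is the check that benefits most from seeing the data structure. A minimal sketch, assuming the annotations have already been extracted into a dict of page URL to {language: target URL}; that shape and the function names are hypothetical, not the crawler's report format.

```python
def missing_return_links(hreflang):
    """List (page, target) pairs where the target never links back to the page."""
    missing = []
    for page, annotations in hreflang.items():
        for target in annotations.values():
            if target == page or target not in hreflang:
                continue  # self-reference, or the target was not crawled
            if page not in hreflang[target].values():
                missing.append((page, target))
    return missing


def pages_without_x_default(hreflang):
    """Pages that declare hreflang annotations but no x-default."""
    return [page for page, annotations in hreflang.items()
            if annotations and "x-default" not in annotations]


EN = "https://www.example.com/pricing"
DE = "https://www.example.com/de/pricing"
crawl = {
    EN: {"en": EN, "de": DE, "x-default": EN},
    DE: {"de": DE},  # forgot to point back at the English page
}

print(missing_return_links(crawl))
print(pages_without_x_default(crawl))
```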

Media, social & structured data

Images: totals, missing alt, missing width/height (a CLS risk), and lazy-loaded counts. Why it matters: missing alt text hurts accessibility and image search; missing dimensions cause layout shift that drags down Core Web Vitals.
Social tags: Open Graph and Twitter Card tags with completeness validation (missing og:image, og:url, twitter:card, etc.). Why it matters: these control how your pages look when shared, which affects click-through from social.
Structured data: JSON-LD @type values parsed recursively, with required-property validation for common rich-result types, plus Microdata (itemtype) and RDFa detection. Why it matters: invalid or incomplete markup means missed rich results; this catches it before Google does.
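
"Parsed recursively" matters because JSON-LD nests types inside @graph arrays and inside properties. The sketch below shows one way to walk a parsed document and collect every @type; it is an illustration with a made-up payload, not the crawler's parser, and it leaves out required-property validation.

```python
import json


def collect_types(node):
    """Walk parsed JSON-LD and return every @type value in document order."""
    types = []
    if isinstance(node, dict):
        declared = node.get("@type")
        if isinstance(declared, str):
            types.append(declared)
        elif isinstance(declared, list):
            types.extend(item for item in declared if isinstance(item, str))
        for value in node.values():
            types.extend(collect_types(value))
    elif isinstance(node, list):
        for item in node:
            types.extend(collect_types(item))
    return types


# Stand-in for the contents of a <script type="application/ld+json"> tag.
payload = json.loads("""
{
  "@context": "https://schema.org",
  "@graph": [
    {"@type": "WebPage", "breadcrumb": {"@type": "BreadcrumbList"}},
    {"@type": ["Product", "Thing"], "offers": {"@type": "Offer"}}
  ]
}
""")

print(collect_types(payload))
```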

Hygiene, headers & site-level

Page hygiene: <base href>, favicon, rel="next"/"prev" pagination, a non-responsive-viewport check, and mixed-content (insecure HTTP subresources on an HTTPS page) detection. Why it matters: mixed content breaks the secure padlock and can block resources; a missing responsive viewport hurts mobile ranking.
Security & performance headers: HTTPS, HSTS, content-encoding, and cache-control. Why it matters: these are baseline trust and performance signals.
Site-level duplicate clustering: duplicate title, meta description, H1, and self-canonical clusters across the entire crawl. Why it matters: these are the fingerprints of keyword cannibalisation and index bloat, which you can only see by looking across the whole site at once.
Per-page issues list: a consolidated list of flagged problems (missing title, thin content, multiple H1s, canonical conflicts, and everything above). Why it matters: it's your prioritised to-do list, page by page.
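
Site-level duplicate clustering boils down to grouping URLs by a shared value. A minimal sketch, assuming each page is a dict with a url key plus the field being compared; the case-insensitive, whitespace-trimmed comparison is my assumption, not a documented behaviour of the crawler.

```python
from collections import defaultdict


def duplicate_clusters(pages, field):
    """Group URLs that share the same value for one field (e.g. title)."""
    groups = defaultdict(list)
    for page in pages:
        value = (page.get(field) or "").strip().lower()
        if value:  # empty values are a "missing" issue, not a duplicate
            groups[value].append(page["url"])
    return {value: urls for value, urls in groups.items() if len(urls) > 1}


pages = [
    {"url": "https://www.example.com/", "title": "Example"},
    {"url": "https://www.example.com/pricing", "title": "Pricing | Example"},
    {"url": "https://www.example.com/pricing?ref=nav", "title": "Pricing | Example "},
    {"url": "https://www.example.com/blank", "title": ""},
]

print(duplicate_clusters(pages, "title"))
```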

User-Agent flexibility

The chosen User-Agent is applied to every request, including robots.txt. Pick the profile that matches the bot you're auditing for:

text / --ua profiles

googlebot          Googlebot (desktop), the default
googlebot-mobile   Googlebot (smartphone)
bingbot            Bingbot
chrome-desktop     Chrome on Windows
chrome-mobile      Chrome on Android
custom             Your own string via --ua-string

Known limitations

The crawler reads raw HTML and does not execute JavaScript. Server-rendered sites are fine, but client-rendered SPAs need a headless-browser layer (e.g. Playwright). It also doesn't measure Core Web Vitals (LCP/INP/CLS), which need a real browser; pair it with the PageSpeed Insights API for field data.

/05 Usage examples

A few real runs from the README, with notes on what the output means and how to act on it.

Audit exactly a list of URLs

When you want to audit a specific set of pages, not crawl outward from them, point --urls-file at a text file and add --list-only to disable link discovery. Here, 8 workers fetch in parallel:

bash / terminal

python seo_crawler.py --urls-file examples/sample_urls.txt --list-only --workers 8

The input file is plain text, one URL per line. Blank lines and # comments are ignored:

text / sample_urls.txt

# examples/sample_urls.txt
https://www.example.com/
https://www.example.com/pricing
https://www.example.com/de/pricing

What to do with the output: open seo_audit.csv and sort by the parse_ok column first. Any false rows are fetch/decode failures to fix before you trust the rest. Then scan the indexability verdict and issues columns for the pages you care about.
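
If you would rather do that first triage step in a script than in a spreadsheet, something like the following works. The parse_ok column comes from the report; the url column name and the exact spelling of the false value are assumptions, so check them against your CSV header.

```python
import csv
import io


def failed_urls(csv_text):
    """Return URLs whose parse_ok column is false (fetch/decode failures)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row["url"]
        for row in reader
        if row["parse_ok"].strip().lower() in ("false", "0")
    ]


# Inline stand-in for seo_audit.csv; a real run would read the file from disk.
report = """url,parse_ok,issues
https://www.example.com/,True,
https://www.example.com/pricing,False,decode error
https://www.example.com/de/pricing,True,missing meta description
"""

print(failed_urls(report))
```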

Audit a whole sitemap with broken-link checks

Seed from the site's sitemap (auto-discovered via robots.txt), crawl every URL (--max-pages 0 = unlimited), and HEAD-check every link and canonical target:

bash / terminal

python seo_crawler.py https://www.example.com --from-sitemap --list-only --max-pages 0 --check-links

What to do with the output: --check-links attributes every broken (4xx/5xx) and redirecting target back to the pages that contain it, so the report tells you not just that a link is broken but where to fix it. Triage 5xx and 404s first; then clean up redirecting internal links so they point straight at the final URL.
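
The attribution described above is essentially an inverted index from broken target to the pages that link to it. The sketch below shows the idea on made-up data; the input shapes (links per page, status per target) and the function name are hypothetical and do not mirror the crawler's JSON layout.

```python
def attribute_broken(links_by_page, status_by_target):
    """Map each broken target (status >= 400) to the pages that link to it."""
    broken = {}
    for page, targets in links_by_page.items():
        for target in targets:
            status = status_by_target.get(target)
            if status is not None and status >= 400:
                entry = broken.setdefault(target, {"status": status, "found_on": []})
                entry["found_on"].append(page)
    return broken


links_by_page = {
    "https://www.example.com/": ["https://www.example.com/pricing", "https://www.example.com/old"],
    "https://www.example.com/pricing": ["https://www.example.com/old"],
}
status_by_target = {
    "https://www.example.com/pricing": 200,
    "https://www.example.com/old": 404,
}

broken = attribute_broken(links_by_page, status_by_target)
# Highest status codes first, so 5xx errors surface ahead of 404s.
for target, info in sorted(broken.items(), key=lambda item: -item[1]["status"]):
    print(info["status"], target, "found on", info["found_on"])
```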

Crawl as a bot, following links to depth 3

bash / terminal

python seo_crawler.py https://www.example.com --ua googlebot-mobile --max-pages 100 --depth 3

What the output means: the console summary tallies the status-code distribution, how many pages are indexable, and the most common issues across the crawl. That's your headline health check. Drop into seo_audit.json for the full nested detail on any single page, or seo_audit.csv for a spreadsheet-friendly overview.

Common gotcha: "everything reports as missing"

That's almost always a compression-decode problem: the server sent Brotli/zstd your environment can't decompress. The crawler avoids it by advertising only gzip/deflate unless a decoder is installed, and flags a single clear warning (with parse_ok = false) instead of a wall of false "missing tag" issues. Install brotli if you want Brotli support.

Fast, without getting blocked

Speed comes from concurrent workers (--workers, default 5); politeness keeps you from getting blocked. Per-domain rate limiting spaces out requests by --delay (default 0.3s, jittered ±30%), automatic retry honours Retry-After and backs off on 429/5xx, and robots.txt Crawl-delay is respected for the chosen User-Agent. The defaults are tuned to be safe on sites you don't control; raise --workers freely on your own.
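
For a feel of what those politeness numbers mean, here is a small sketch of a jittered delay and a retry wait. The ±30% jitter and the 0.3s default come from the description above; the exponential backoff, the 60-second cap, and the function names are my assumptions, since the exact backoff strategy isn't specified here.

```python
import random


def jittered_delay(base=0.3, jitter=0.3):
    """Base delay spread by +/- jitter (0.3 means +/-30%)."""
    return base * (1 + random.uniform(-jitter, jitter))


def retry_wait(attempt, retry_after=None, base=1.0, cap=60.0):
    """Seconds to wait before retrying after a 429 or 5xx response."""
    if retry_after is not None:
        try:
            # Honour a Retry-After header given in seconds.
            return min(float(retry_after), cap)
        except ValueError:
            pass  # the HTTP-date form is not handled in this sketch
    # Assumed fallback: exponential backoff (1s, 2s, 4s, ...) up to the cap.
    return min(base * 2 ** attempt, cap)


print(round(jittered_delay(), 3))
print(retry_wait(attempt=2))
print(retry_wait(attempt=0, retry_after="5"))
```

With the 0.3s default, each wait lands somewhere between roughly 0.21s and 0.39s, which avoids hitting a server at a perfectly regular rhythm.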

/06 When to use this tool

This crawler is built for recurring, scriptable technical checks: the work you'd otherwise do by hand or ration crawl credits for. Reach for it when you're doing:

Routine site audits. Run a crawl across the site to surface broken links, canonical conflicts, duplicate titles, missing meta descriptions, and thin content, then work the per-page issues list as a prioritised backlog.
Pre-launch checks. Before new pages or a new section ships, crawl the new URLs (or staging) to confirm canonicals, indexability, redirects, and structured data are correct before they're live and indexed.
Migration QA. During a site move or redesign, crawl the URL set with --check-links to verify redirects chain correctly, nothing is accidentally noindex, and canonicals point at the new structure. A migration is the highest-risk moment in a site's SEO life.
Ongoing monitoring. Because the output is plain CSV/JSON, schedule the same crawl on a cron and diff successive runs to catch regressions (a flipped canonical, a title that changed, a page that started 404-ing) as a diff rather than a ranking mystery.
International & structured-data audits at scale. When you need to validate hreflang reciprocity or JSON-LD across hundreds of pages, this does in one pass what's infeasible by hand.

When you instead need rendered SPA content, Core Web Vitals field data, or a point-and-click GUI, pair it with (or fall back to) heavier tooling. Those are deliberate non-goals here.

/07 Links & attribution

The SEO Audit Crawler is open source under the MIT licence. Clone it, run it against your own site, and star the repo if it saves you time.

Grab the code and run your own audits

Free, scriptable, and Python-native. Point it at your site, pick a User-Agent, and get a clean CSV + JSON report in one command.

View on GitHub: https://github.com/kiranbabuthatha/SEO-crawler

Built by Kiran Babu Thatha. Technical SEO and automation, Berlin. More at kiranbabuthatha.com. Want a crawl or data extract built around your site and reporting, with JavaScript rendering, scheduled runs, custom fields, or log-file and Search Console joins? See the Python SEO automation services or get in touch.
