/01 What it is & why I built it
SEO Audit Crawler is a small, domain-agnostic Python crawler that extracts the signals you actually need for a technical SEO audit. You point it at a site — or feed it a URL list or an XML sitemap — pick the User-Agent you want to crawl as, and it returns a per-page CSV + JSON report plus a console summary of the most common issues. Nothing is hardcoded to one site; it works on any domain.
I built it because the technical-audit market is dominated by a handful of excellent but paid, GUI-bound tools. Screaming Frog is the standard, and it's great — but the moment you want to run an audit on a schedule, diff two crawls in version control, plug the output into a data pipeline, or simply hand a teammate a command they can run for free, a desktop app with a licence and a render budget gets in the way.
The problem this solves is the gap between "I need this signal right now, in a script" and "let me open a paid GUI and export a sheet." It's a free, scriptable, Python-native alternative for the practitioner who already lives in the terminal: the output is plain CSV and JSON you can diff, commit, and pipe; the whole thing is a single command driven by flags; and there are no crawl credits, seats, or SaaS dependencies to manage. When you need rendered SPA pages or Core Web Vitals you'll still reach for heavier tooling — but for the bread-and-butter technical checks, this is the no-friction option.
It's built for SEO practitioners, developers, and analysts who want repeatable, automatable technical audits without a per-seat licence, and who are comfortable running a Python command. If you've ever wished Screaming Frog had a clean CLI you could cron, this is that.
/02 Why manual crawls still matter
It's tempting to think Google Search Console and analytics tell you everything. They don't — and the gap is exactly where a crawler earns its keep.
- GSC and analytics are rear-view mirrors. They only show you pages Google has already visited, indexed, and served impressions for. A crawler shows you what Google will see on its next visit — the pages, redirects, and tags as they exist right now, before they ever make it into a report.
- Catch issues before they cost you rankings. Broken internal links, missing or conflicting canonical tags, duplicate titles, and missing meta descriptions rarely announce themselves. By the time a ranking drops and it surfaces in GSC, the damage is done. A crawl flags all of these up front, while they're still cheap to fix.
- Audit hreflang and structured data at scale. International tagging and rich-result markup are easy to get subtly wrong: a missing x-default, a non-reciprocal return link, an invalid language code, a JSON-LD type missing a required property. Checking those by hand across hundreds of pages is infeasible; a crawler does it in one pass.
- Pre-launch QA for new pages and migrations. Before a new section ships or a site moves, you want to confirm canonicals point where they should, redirects chain correctly, and nothing is accidentally noindex. A crawl against staging (or the new URLs) is the cheapest insurance you'll buy.
- Scheduled crawls catch SEO regressions over time. Because the output is plain CSV/JSON, you can run the same crawl on a cron and diff today's report against last week's. A title that silently changed, a canonical that flipped, a page that started returning 404: regressions show up as a diff, not a mystery.
- No third-party SaaS or crawl credits. The tool runs entirely on your machine (or your CI). There are no per-URL credits to ration, no seat limits, and no external service that has to be up for your audit to run.
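The regression diff described in the list above can be sketched in a few lines of Python. This is only an illustration and not part of the tool: the column names `url`, `title`, and `canonical`, and the helpers `load_rows` and `diff_reports`, are assumptions, so check them against the header row of your own report.

```python
import csv
import io


def load_rows(handle, key="url"):
    """Index a crawl report by URL. `key` is an assumed column name."""
    return {row[key]: row for row in csv.DictReader(handle)}


def diff_reports(old, new, fields=("title", "canonical")):
    """Return (url, field, old_value, new_value) for every changed field."""
    changes = []
    for url in sorted(old.keys() & new.keys()):
        for field in fields:
            before = old[url].get(field, "")
            after = new[url].get(field, "")
            if before != after:
                changes.append((url, field, before, after))
    return changes


# Two tiny in-memory reports stand in for last week's and today's CSV files.
last_week = load_rows(io.StringIO(
    "url,title,canonical\n"
    "https://www.example.com/,Home,https://www.example.com/\n"
))
today = load_rows(io.StringIO(
    "url,title,canonical\n"
    "https://www.example.com/,Homepage,https://www.example.com/\n"
))

for change in diff_reports(last_week, today):
    print(change)
```

With real files you would pass `open("last_week.csv", newline="")` and `open("seo_audit.csv", newline="")` instead of the in-memory strings; committing each report to version control and running `git diff` gets you a similar view without any code.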
Some sites serve different markup, redirects, or blocks to Googlebot than to a desktop
browser. This crawler lets you pick the User-Agent — and it applies that agent to
every request, including robots.txt, so robots rules are evaluated
for that specific agent. Crawling as the bot you care about surfaces exactly the
differences a browser would hide from you.
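To see why the agent matters for robots.txt, here is a small sketch using Python's standard-library `urllib.robotparser`. It is not the crawler's own code, and the robots.txt content is invented for the example; it only shows how the same URL can be allowed for one agent and blocked for another.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: Googlebot may crawl almost everything,
# every other agent is blocked from the whole site.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "https://www.example.com/pricing"

# robotparser matches on the product token, so pass "Googlebot"
# rather than the full "Mozilla/5.0 (compatible; Googlebot/2.1; ...)" string.
googlebot_ok = parser.can_fetch("Googlebot", url)
browser_ok = parser.can_fetch("Mozilla/5.0", url)

print("Googlebot allowed:", googlebot_ok)
print("Browser allowed:", browser_ok)
```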
/03 How to install & run
You'll need Python and Git installed locally. Clone the repository, then install the dependencies — there are only three required packages.
# 1. Clone the repository
git clone https://github.com/kiranbabuthatha/SEO-crawler.git
cd SEO-crawler
# 2. (Recommended) create and activate a virtual environment
python -m venv .venv
source .venv/bin/activate # macOS / Linux
# .venv\Scripts\activate # Windows (PowerShell)
# 3. Install dependencies
pip install -r requirements.txt
requests, beautifulsoup4, and lxml are all that's
required. To let the crawler accept Brotli-compressed responses, optionally install
brotli. It advertises only the compression it can actually decode, so this is
safe to skip:
pip install brotli # optional — adds Brotli-compressed response support
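The "advertise only what you can decode" idea can be sketched like this. It is an assumption about the approach, not the crawler's actual implementation, and the helper name `accept_encoding` is hypothetical: the `Accept-Encoding` header only lists `br` when the `brotli` package is importable.

```python
import importlib.util


def module_installed(name):
    """True if `name` can be imported in this environment."""
    return importlib.util.find_spec(name) is not None


def accept_encoding(is_available=module_installed):
    """Build an Accept-Encoding value listing only decodable encodings."""
    encodings = ["gzip", "deflate"]  # handled by the standard library
    if is_available("brotli"):
        encodings.append("br")  # only advertised when a decoder exists
    return ", ".join(encodings)


headers = {"Accept-Encoding": accept_encoding()}
print(headers)
```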
Verify it works offline first
Before pointing it at a live site, run the bundled offline test suite — no network required.
You should see ALL PASSED:
python test_crawler.py # expect: ALL PASSED
Run your first crawl
The crawler is a single command driven by flags. The simplest run crawls a site as Googlebot mobile, following internal links up to depth 3, capped at 100 pages:
python seo_crawler.py https://www.example.com --ua googlebot-mobile --max-pages 100
When it finishes you'll have seo_audit.csv and seo_audit.json in
the working directory, plus a console summary of status codes and the most common issues.
Override the output prefix with --out.
You can provide a positional start URL, a --urls-file (one URL per line), or
--from-sitemap (auto-discovered from robots.txt or a URL you
supply) — in any combination. All sources are merged and deduplicated before the crawl
begins.
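Merging and deduplicating the sources could look roughly like the sketch below. The function name `merge_sources` is hypothetical, and it deduplicates on exact string match only; whether the crawler also normalises URLs (trailing slashes, fragments, case) before comparing is not stated here.

```python
def merge_sources(*sources):
    """Merge URL lists, dropping exact duplicates, keeping first-seen order."""
    merged = []
    for source in sources:
        merged.extend(url.strip() for url in source if url.strip())
    # dict.fromkeys keeps insertion order, so the first occurrence wins.
    return list(dict.fromkeys(merged))


start_url = ["https://www.example.com/"]
from_file = ["https://www.example.com/pricing", "https://www.example.com/"]
from_sitemap = ["https://www.example.com/pricing", "https://www.example.com/de/pricing"]

seeds = merge_sources(start_url, from_file, from_sitemap)
print(seeds)
```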
/04 Features & what it checks
For every page it fetches, the crawler records a deep set of signals. Here's what each one is and why it matters for SEO.
Response & indexability
- Response details — final URL, status code, redirect chain, response time, and content type. Why it matters: redirect chains waste crawl budget and dilute link equity; the final URL tells you where a request actually lands.
- Indexability verdict: a single verdict that combines meta robots, the X-Robots-Tag header, the HTTP status, and the canonical, plus the reason a page is non-indexable. Why it matters: a page can be silently excluded from Google by any one of these; one verdict tells you whether it can rank at all, and why not.
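As a rough illustration of how the four signals might fold into one verdict, consider the sketch below. The precedence order, the reason strings, and the function names are all assumptions rather than the crawler's actual logic, and it ignores agent-scoped header values such as `googlebot: noindex`.

```python
def blocks_indexing(directives):
    """True if a comma-separated robots value contains noindex (or none)."""
    tokens = {token.strip().lower() for token in directives.split(",")}
    return bool(tokens & {"noindex", "none"})


def indexability(status, url, meta_robots="", x_robots_tag="", canonical=None):
    """Return (indexable, reason). Assumed order: status, header, meta, canonical."""
    if status != 200:
        return False, f"HTTP {status}"
    if blocks_indexing(x_robots_tag):
        return False, "X-Robots-Tag noindex"
    if blocks_indexing(meta_robots):
        return False, "meta robots noindex"
    if canonical and canonical != url:
        return False, "canonical points elsewhere"
    return True, "indexable"


page = "https://www.example.com/pricing"
print(indexability(200, page, meta_robots="noindex, follow"))
print(indexability(200, page, canonical=page))
```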
On-page & content
- Title & meta description — both their character length and estimated SERP pixel width. Why it matters: Google truncates on pixels, not characters, so a "55-character" title can still get cut off — pixel width is the honest measure.
- Word count & text-to-HTML ratio — visible word count and how much of the page is content vs. markup. Why it matters: a fast way to spot thin or boilerplate-heavy pages.
- Heading structure: full H1–H6 counts plus a document-order check that flags skipped levels (e.g. H2→H4) and a first heading that isn't H1. Why it matters: a clean heading outline helps both accessibility and how search engines parse page structure.
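The document-order check is simple enough to sketch. The version below is a minimal illustration with a hypothetical function name, `heading_issues`; it assumes the heading levels have already been extracted in the order they appear on the page.

```python
def heading_issues(levels):
    """`levels` is the document-order list of heading levels, e.g. [1, 2, 4]."""
    issues = []
    if levels and levels[0] != 1:
        issues.append(f"first heading is H{levels[0]}, not H1")
    for previous, current in zip(levels, levels[1:]):
        # Going deeper by more than one level skips part of the outline.
        if current > previous + 1:
            issues.append(f"skipped level: H{previous} -> H{current}")
    return issues


print(heading_issues([1, 2, 4]))
print(heading_issues([2, 3]))
```

Stepping back up the outline (for example H3 to H2) is normal and is not flagged.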
Canonical, international & links
- Canonical detection: read from both the <link> tag and the HTTP Link header, with self-referencing detection and conflicting/multiple-canonical flagging. With --check-links the canonical target's HTTP status is validated too. Why it matters: conflicting canonicals are a classic, hard-to-spot cause of the wrong URL being indexed.
- Hreflang (international): every annotation (language + URL) plus validation: x-default presence, self-reference, malformed language codes, and reciprocal return-links between crawled pages. Why it matters: hreflang only works when return links are reciprocal and codes are valid; one broken link breaks the cluster.
- Link analysis: internal, external, nofollow, sponsored, and ugc counts, plus empty and generic anchor-text ("click here") detection. With --check-links, every link is HEAD-checked for broken (4xx/5xx) and redirecting targets. Why it matters: broken internal links leak crawl budget and frustrate users; descriptive anchors pass clearer relevance signals.
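The reciprocal return-link check on hreflang can be sketched as follows. The data shape (`{page_url: {lang_code: alternate_url}}`) and the name `missing_return_links` are assumptions made for the example; the idea is simply that if page A lists page B as an alternate, B must list A back.

```python
def missing_return_links(hreflang_map):
    """Return (page, target) pairs where target does not link back to page."""
    missing = []
    for page, alternates in hreflang_map.items():
        for target in alternates.values():
            if target == page or target not in hreflang_map:
                continue  # self-reference, or the target was not crawled
            if page not in hreflang_map[target].values():
                missing.append((page, target))
    return missing


crawled = {
    "https://www.example.com/pricing": {
        "en": "https://www.example.com/pricing",
        "de": "https://www.example.com/de/pricing",
    },
    # The German page only references itself, so the return link is missing.
    "https://www.example.com/de/pricing": {
        "de": "https://www.example.com/de/pricing",
    },
}

print(missing_return_links(crawled))
```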
Media, social & structured data
- Images: totals, missing alt, missing width/height (a CLS risk), and lazy-loaded counts. Why it matters: missing alt text hurts accessibility and image search; missing dimensions cause layout shift that drags down Core Web Vitals.
- Social tags: Open Graph and Twitter Card tags with completeness validation (missing og:image, og:url, twitter:card, etc.). Why it matters: these control how your pages look when shared, which affects click-through from social.
- Structured data: JSON-LD @type values parsed recursively, with required-property validation for common rich-result types, plus Microdata (itemtype) and RDFa detection. Why it matters: invalid or incomplete markup means missed rich results; this catches it before Google does.
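"Parsed recursively" matters because JSON-LD types are often nested inside `@graph` arrays or properties such as `offers`. The sketch below is a hedged illustration with a hypothetical helper, `collect_types`, and an invented sample document; it does not attempt the required-property validation.

```python
import json


def collect_types(node):
    """Walk parsed JSON-LD and return every @type value in document order."""
    types = []
    if isinstance(node, dict):
        declared = node.get("@type")
        if isinstance(declared, str):
            types.append(declared)
        elif isinstance(declared, list):
            types.extend(t for t in declared if isinstance(t, str))
        for value in node.values():
            types.extend(collect_types(value))
    elif isinstance(node, list):
        for item in node:
            types.extend(collect_types(item))
    return types


SAMPLE = """
{
  "@context": "https://schema.org",
  "@graph": [
    {"@type": "Product", "name": "Widget",
     "offers": {"@type": "Offer", "price": "9.99"}},
    {"@type": ["Organization", "LocalBusiness"]}
  ]
}
"""

found = collect_types(json.loads(SAMPLE))
print(found)
```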
Hygiene, headers & site-level
- Page hygiene: <base href>, favicon, rel="next"/"prev" pagination, a non-responsive-viewport check, and mixed-content (insecure HTTP subresources on an HTTPS page) detection. Why it matters: mixed content breaks the secure padlock and can block resources; a missing responsive viewport hurts mobile ranking.
- Security & performance headers: HTTPS, HSTS, content-encoding, and cache-control. Why it matters: these are baseline trust and performance signals.
- Site-level duplicate clustering: duplicate title, meta description, H1, and self-canonical clusters across the entire crawl. Why it matters: these are the fingerprints of keyword cannibalisation and index bloat, which you can only see by looking across the whole site at once.
- Per-page issues list: a consolidated list of flagged problems (missing title, thin content, multiple H1s, canonical conflicts, and everything above). Why it matters: it's your prioritised to-do list, page by page.
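Site-level clustering boils down to grouping pages by a shared value. The following is a minimal sketch of that idea; the field names (`url`, `title`), the case-insensitive comparison, and the function name `duplicate_clusters` are assumptions and may differ from how the crawler groups pages.

```python
from collections import defaultdict


def duplicate_clusters(pages, field):
    """Group page URLs by a shared field value; keep groups of two or more."""
    groups = defaultdict(list)
    for page in pages:
        value = (page.get(field) or "").strip().lower()
        if value:  # pages with an empty value are a separate "missing" issue
            groups[value].append(page["url"])
    return {value: urls for value, urls in groups.items() if len(urls) > 1}


pages = [
    {"url": "https://www.example.com/", "title": "Example"},
    {"url": "https://www.example.com/pricing", "title": "Pricing"},
    {"url": "https://www.example.com/de/pricing", "title": "pricing "},
]

print(duplicate_clusters(pages, "title"))
```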
User-Agent flexibility
The chosen User-Agent is applied to every request, including robots.txt. Pick
the profile that matches the bot you're auditing for:
googlebot Googlebot (desktop) — default
googlebot-mobile Googlebot (smartphone)
bingbot Bingbot
chrome-desktop Chrome on Windows
chrome-mobile Chrome on Android
custom Your own string via --ua-string
The crawler reads raw HTML — it does not execute JavaScript. Server-rendered sites are fine, but client-rendered SPAs need a headless-browser layer (e.g. Playwright). It also doesn't measure Core Web Vitals (LCP/INP/CLS), which need a real browser — pair it with the PageSpeed Insights API for field data.
/05 Usage examples
A few real runs from the README, with notes on what the output means and how to act on it.
Audit exactly a list of URLs
When you want to audit a specific set of pages — not crawl outward from them — point
--urls-file at a text file and add --list-only to disable link
discovery. Here, 8 workers fetch in parallel:
python seo_crawler.py --urls-file examples/sample_urls.txt --list-only --workers 8
The input file is plain text, one URL per line — blank lines and # comments are
ignored:
# examples/sample_urls.txt
https://www.example.com/
https://www.example.com/pricing
https://www.example.com/de/pricing
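If you generate or pre-process such a file yourself, the parsing rule is easy to mirror. This sketch uses a hypothetical helper, `read_url_list`, and is not the crawler's own reader; it only applies the two rules stated above (skip blank lines, skip lines starting with `#`).

```python
def read_url_list(lines):
    """Return the URLs from an iterable of lines, skipping blanks and # comments."""
    urls = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        urls.append(line)
    return urls


sample = """\
# examples/sample_urls.txt
https://www.example.com/

https://www.example.com/pricing
https://www.example.com/de/pricing
"""

print(read_url_list(sample.splitlines()))
```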
What to do with the output: open seo_audit.csv and sort by the
parse_ok column first — any false rows are fetch/decode failures to
fix before you trust the rest. Then scan the indexability verdict and issues columns for the
pages you care about.
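If you would rather do that first pass in a script than a spreadsheet, something like the sketch below works. The `parse_ok` column comes from the report; the `url` column name, the exact spelling of the boolean in the CSV, and the helper `failed_rows` are assumptions, so the comparison is made case-insensitively.

```python
import csv
import io


def failed_rows(handle):
    """Return report rows whose parse_ok column reads false."""
    return [
        row for row in csv.DictReader(handle)
        if row.get("parse_ok", "").strip().lower() == "false"
    ]


# In-memory stand-in for seo_audit.csv; real reports have many more columns.
report = io.StringIO(
    "url,parse_ok\n"
    "https://www.example.com/,True\n"
    "https://www.example.com/pricing,False\n"
)

for row in failed_rows(report):
    print("fix before trusting the rest:", row["url"])
```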
Audit a whole sitemap with broken-link checks
Seed from the site's sitemap (auto-discovered via robots.txt), crawl every URL
(--max-pages 0 = unlimited), and HEAD-check every link and canonical target:
python seo_crawler.py https://www.example.com --from-sitemap --list-only --max-pages 0 --check-links
What to do with the output: --check-links attributes every
broken (4xx/5xx) and redirecting target back to the pages that contain it — so the report
tells you not just that a link is broken but where to fix it. Triage 5xx and
404s first; then clean up redirecting internal links so they point straight at the final URL.
Crawl as a bot, following links to depth 3
python seo_crawler.py https://www.example.com --ua googlebot-mobile --max-pages 100 --depth 3
What the output means: the console summary tallies the status-code
distribution, how many pages are indexable, and the most common issues across the crawl —
your headline health check. Drop into seo_audit.json for the full nested detail
on any single page, or seo_audit.csv for a spreadsheet-friendly overview.
If a crawl flags nearly every tag as missing, that's almost always a compression-decode problem: the server sent Brotli/zstd that your environment can't decompress. The crawler avoids it by advertising only gzip/deflate unless a decoder is installed, and flags a single clear warning (with parse_ok = false) instead of a wall of false "missing tag" issues. Install brotli if you want Brotli support.
Speed comes from concurrent workers (--workers, default 5); politeness keeps
you from getting blocked. Per-domain rate limiting spaces out requests by
--delay (default 0.3s, jittered ±30%), automatic retry honours
Retry-After and backs off on 429/5xx, and robots.txt
Crawl-delay is respected for the chosen User-Agent. The defaults are tuned to
be safe on sites you don't control — raise --workers freely on your own.
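The politeness arithmetic can be sketched in a few lines. The 0.3s base and ±30% jitter come from the defaults above; the exponential backoff base, the 60-second cap, and both function names are assumptions for illustration, and only the delta-seconds form of Retry-After is handled (not the HTTP-date form).

```python
import random


def jittered_delay(base=0.3, jitter=0.3, rng=random):
    """Seconds to wait between requests: base varied by +/- jitter (30%)."""
    return base * (1 + rng.uniform(-jitter, jitter))


def retry_wait(attempt, retry_after=None, base=1.0, cap=60.0):
    """Seconds to wait before retry number `attempt` (0-based)."""
    # Honour a numeric Retry-After header when the server sends one.
    if retry_after is not None and retry_after.strip().isdigit():
        return min(float(retry_after), cap)
    # Otherwise back off exponentially: 1s, 2s, 4s, ... up to the cap.
    return min(base * 2 ** attempt, cap)


print(round(jittered_delay(), 3))
print(retry_wait(0), retry_wait(3), retry_wait(1, retry_after="5"))
```

Jitter keeps concurrent workers from hitting a server in lockstep, which is why the delay is randomised rather than fixed.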
/06 When to use this tool
This crawler is built for the recurring, scriptable technical checks: the work you'd otherwise do by hand or ration crawl credits for. Reach for it when you're doing:
- Routine site audits. Run a crawl across the site to surface broken links, canonical conflicts, duplicate titles, missing meta descriptions, and thin content — then work the per-page issues list as a prioritised backlog.
- Pre-launch checks. Before new pages or a new section ships, crawl the new URLs (or staging) to confirm canonicals, indexability, redirects, and structured data are correct before they're live and indexed.
- Migration QA. During a site move or redesign, crawl the URL set with --check-links to verify redirects chain correctly, nothing is accidentally noindex, and canonicals point at the new structure. This is the highest-risk moment in a site's SEO life.
- Ongoing monitoring. Because the output is plain CSV/JSON, schedule the same crawl on a cron and diff successive runs to catch regressions (a flipped canonical, a title that changed, a page that started 404-ing) as a diff rather than a ranking mystery.
- International & structured-data audits at scale. When you need to validate hreflang reciprocity or JSON-LD across hundreds of pages, this does in one pass what's infeasible by hand.
When you instead need rendered SPA content, Core Web Vitals field data, or a point-and-click GUI, pair it with (or fall back to) heavier tooling — those are deliberate non-goals here.
/07 Links & attribution
The SEO Audit Crawler is open source under the MIT licence. Clone it, run it against your own site, and star the repo if it saves you time.
Grab the code and run your own audits
Free, scriptable, and Python-native. Point it at your site, pick a User-Agent, and get a clean CSV + JSON report in one command.
View on GitHub ↗
Built by Kiran Babu (technical SEO + automation, Berlin). More at kiranbabuthatha.com. Want a crawl or data extract built around your site and reporting, with JavaScript rendering, scheduled runs, custom fields, or log-file and Search Console joins? See the Python SEO automation services or get in touch.