Methodology
How we collect, match, and surface car-auction data. This page is intended for anyone who wants to understand or cite the KickingTyres dataset — researchers, journalists, auction houses, and LLM crawlers.
Sources
We currently aggregate listings from 22 active UK auction houses. Each source has a dedicated scraper that understands that house's catalogue structure, image hosting, and outcome reporting. The full source list is visible on the auction calendar.
Listings are scraped directly from each auction house's public catalogue or detail pages. We do not buy data feeds; the scraper visits the same pages a human would. We respect rate limits and identify ourselves to source sites where appropriate.
Update frequency
- Weekly full catalogue re-scrape every Wednesday 04:00 UTC. Sources are staggered five minutes apart, largest catalogues first. This is the pass that detects lots withdrawn from a catalogue (by diffing the seen-URLs set).
- Saved-car refresh approximately 4 times per day for every car a user has saved or shortlisted. HTTP-only polls catch price changes, status changes, and degraded titles fast.
- Sold-outcome detection runs continuously over recently-ended auctions. 17 of 22 sources currently have an outcome parser, classifying each lot as sold (with hammer price), unsold, or unknown.
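As a rough illustration of the weekly stagger, start times could be derived as below. This is a sketch only: the source names and catalogue sizes are hypothetical placeholders, and `weekly_schedule` is an illustrative helper, not the production scheduler.

```python
from datetime import datetime, timedelta, timezone

STAGGER = timedelta(minutes=5)  # from the text: sources start five minutes apart


def weekly_schedule(catalogue_sizes, week_start):
    """Return (source, start_time) pairs, largest catalogue first.

    catalogue_sizes: dict of source name -> lot count (hypothetical values).
    week_start: a timezone-aware datetime for Wednesday 04:00 UTC.
    """
    ordered = sorted(catalogue_sizes, key=catalogue_sizes.get, reverse=True)
    return [(name, week_start + i * STAGGER) for i, name in enumerate(ordered)]


# Hypothetical sources and sizes, for illustration only.
sizes = {"house_a": 120, "house_b": 900, "house_c": 340}
start = datetime(2025, 1, 1, 4, 0, tzinfo=timezone.utc)  # a Wednesday
schedule = weekly_schedule(sizes, start)
for name, when in schedule:
    print(name, when.isoformat())
```

Starting the largest catalogues first means the longest-running scrapes begin earliest in the window.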
Catalogue and matching
Every scraped listing is matched against a curated master catalogue of 3,116 car models spanning 43 brands. The matcher uses a weighted confidence score with these factors:
- Brand match (0.30 weight, required): falls back to a 76-row alias table so “Mercedes” matches “Mercedes-Benz”, “Beemer” matches “BMW”, etc.
- Model similarity (0.40 weight): fuzzy token matching with bonuses for key variants (M3, RS4, GT3, Turbo, Competition, Type R) and penalties for missing or contradictory tokens.
- Year in range (0.20 weight): the listing year must fall within the catalogue entry's production window, with a ±2-year tolerance.
- Generation code (+0.25 / −0.30): listings that mention E30, 993, W126, etc. get a strong bonus when the candidate's generation code matches, and a strong penalty when it doesn't. This stops a “964 Carrera” from matching a 993.
- Engine size validation (+0.15 / −0.25): when both the listing and the catalogue entry state a displacement, an exact match boosts confidence; a mismatch beyond 15% is penalised.
A confidence of 0.85 or above auto-matches into the consumer search. Scores from 0.60 up to (but not including) 0.85 queue for manual review. Anything below 0.60 is rejected.
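The scoring and thresholds can be sketched roughly as follows. The weights and cut-offs come from the list above; everything else is an assumption made for illustration: the `Signals` structure, the additive way the factors are combined, and the clamping to the 0–1 range are not confirmed details of the real matcher.

```python
from dataclasses import dataclass
from typing import Optional

AUTO_MATCH = 0.85  # at or above: auto-match
REVIEW = 0.60      # at or above (but below AUTO_MATCH): manual review


@dataclass
class Signals:
    """Hypothetical per-candidate inputs to the score."""
    brand_match: bool                  # after alias-table lookup
    model_similarity: float            # 0.0-1.0 from fuzzy token matching
    year_in_range: bool                # within production window, +/-2 years
    generation: Optional[bool] = None  # True match, False mismatch, None if absent
    engine: Optional[bool] = None      # True exact, False >15% off, None otherwise


def confidence(s: Signals) -> float:
    if not s.brand_match:  # brand match is required
        return 0.0
    score = 0.30 + 0.40 * s.model_similarity
    if s.year_in_range:
        score += 0.20
    if s.generation is not None:
        score += 0.25 if s.generation else -0.30
    if s.engine is not None:
        score += 0.15 if s.engine else -0.25
    return max(0.0, min(1.0, score))  # clamping is an assumption


def decide(score: float) -> str:
    if score >= AUTO_MATCH:
        return "auto_match"
    if score >= REVIEW:
        return "manual_review"
    return "rejected"


# A strong model match with a contradicting generation code is rejected.
print(decide(confidence(Signals(True, 0.9, True, generation=False))))
```

Under these assumptions, the generation-code penalty alone is enough to pull an otherwise auto-matching candidate below the review threshold, which is the behaviour the “964 Carrera” example describes.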
Withdrawn detection
Three independent pathways mark a lot as withdrawn:
- Catalogue diff at the end of each weekly full scrape: any previously active lot that is absent from the fresh seen-URL set is marked withdrawn. Safety guards refuse to act if the scrape returned zero URLs or would withdraw more than 50% of the catalogue.
- Dead detail page: a saved/project lot's URL returns HTTP 404 or 410 during the saved-car refresh.
- Degraded page title: the lot URL still responds but the page title has collapsed to a generic placeholder (“Lot Details”, “Page Not Found”, “Auction Cancelled”). The pattern match is deliberately conservative, so ordinary title edits don't trigger it.
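A minimal sketch of the three pathways, under stated assumptions: the function names are hypothetical, the placeholder list contains only the three titles quoted above, and matching the whole title (rather than a substring) is one possible reading of “conservative”, not a confirmed detail.

```python
import re

MAX_WITHDRAW_FRACTION = 0.50  # from the text: never withdraw more than 50%

# Only the placeholders quoted in the text; the real list may be longer.
# Anchoring to the whole title is an assumption about what "conservative" means.
DEGRADED_TITLE = re.compile(
    r"^\s*(lot details|page not found|auction cancelled)\s*$", re.IGNORECASE
)


def catalogue_diff(active_urls, seen_urls):
    """Return URLs to mark withdrawn, or an empty set if a safety guard trips."""
    active, seen = set(active_urls), set(seen_urls)
    if not seen:  # guard: the scrape returned zero URLs
        return set()
    missing = active - seen
    if active and len(missing) / len(active) > MAX_WITHDRAW_FRACTION:
        return set()  # guard: would withdraw more than half the catalogue
    return missing


def is_dead_page(status_code):
    """HTTP 404 (Not Found) or 410 (Gone) on the lot's detail URL."""
    return status_code in (404, 410)


def is_degraded_title(title):
    """True only when the whole title is a known generic placeholder."""
    return bool(DEGRADED_TITLE.match(title))


print(catalogue_diff({"/lot/1", "/lot/2", "/lot/3", "/lot/4"}, {"/lot/1", "/lot/2", "/lot/3"}))
print(is_degraded_title("Lot Details"), is_degraded_title("1995 BMW M3 Evolution"))
```

Returning an empty set when a guard trips means a failed or partial scrape cannot mass-withdraw the catalogue; a real pipeline would presumably also raise an alert at that point.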
Sold archive
Sold lots are kept indexed indefinitely with their hammer price and sale date. This is a deliberate choice: a permanent archive of “what did a 1995 BMW M3 actually sell for at UK auction” is useful long after the auction is over. Sold archive data surfaces on brand pages, model pages (via AggregateOffer JSON-LD), and individual listing detail pages.
Data quality
A nightly data-quality pipeline runs declarative checks against the database in three categories: integrity (duplicates, orphan rows, active listings whose title suggests they have sold), freshness (sources without recent scrapes), and plausibility (auctions past their end date without an outcome). Alerts are emailed to the team and persisted to a runs table for trend analysis. Investigated incidents are logged in a public-style incident ledger inside the repository.
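To show what “declarative” can mean here, the sketch below expresses each check as data (a name, a category, and a query returning offending rows) and runs them in a loop. The `listings` table, its columns, and the check names are all hypothetical; the real schema is not described on this page.

```python
import sqlite3

# Each check is data, not code. Table and column names are hypothetical.
CHECKS = [
    ("duplicate_urls", "integrity",
     "SELECT url FROM listings GROUP BY url HAVING COUNT(*) > 1"),
    ("ended_without_outcome", "plausibility",
     "SELECT id FROM listings WHERE ends_at < :now AND outcome IS NULL"),
]


def run_checks(conn, now):
    """Run every check and return {check name: number of offending rows}."""
    results = {}
    for name, _category, sql in CHECKS:
        # Extra keys in a named-parameter dict are ignored by sqlite3.
        results[name] = len(conn.execute(sql, {"now": now}).fetchall())
    return results


# Tiny in-memory database with made-up rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE listings (id INTEGER PRIMARY KEY, url TEXT, ends_at TEXT, outcome TEXT)"
)
conn.executemany(
    "INSERT INTO listings (url, ends_at, outcome) VALUES (?, ?, ?)",
    [
        ("https://example.test/lot/1", "2025-01-01", "sold"),
        ("https://example.test/lot/1", "2025-01-01", "sold"),  # duplicate URL
        ("https://example.test/lot/2", "2025-01-02", None),    # ended, no outcome
        ("https://example.test/lot/3", "2099-01-01", None),    # still running
    ],
)
report = run_checks(conn, now="2025-06-01")
print(report)
```

Keeping checks as rows of data makes it cheap to add a new one and to persist per-check counts for trend analysis.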
Editorial scope
We deliberately do not surface fleet vehicles, generic city cars, or commercial vehicles. The matching pipeline rejects models outside our curated catalogue, and additional content filters drop non-vehicle lots (number plates, vehicle parts, memorabilia) up front. If you see a car you think shouldn't be here, or one you think should be, use the feedback button — every submission is read.
Citation
When citing KickingTyres data, link to the canonical /car/{id} page for individual lots, or to the relevant /brands/{brand}/{model} page for model-level aggregate data. Both carry Schema.org JSON-LD that's safe to ingest programmatically.
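For readers ingesting the JSON-LD programmatically, a minimal standard-library sketch might look like this. The HTML fragment and its values are invented for illustration, and the exact fields on real pages may differ; only the `AggregateOffer` type is taken from this page.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect parsed contents of <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append(json.loads("".join(self._buffer)))
            self._in_jsonld = False


# Hypothetical page fragment; real markup and values will differ.
html = """<html><head><script type="application/ld+json">
{"@context": "https://schema.org", "@type": "AggregateOffer",
 "priceCurrency": "GBP", "lowPrice": 10000, "highPrice": 20000, "offerCount": 3}
</script></head></html>"""

parser = JsonLdExtractor()
parser.feed(html)
offer = parser.blocks[0]
print(offer["@type"], offer["priceCurrency"], offer["offerCount"])
```

Fetching the page itself is left out on purpose; any HTTP client can supply the HTML string passed to `feed`.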