Database Downloads
| File Name / Mirror Link | Format | Record Count | Description |
|---|---|---|---|
| aoc_tenders.db | SQLite 3 | ~4.9M records | Awards of Contract database. Contains winner metadata, contract values, dates. |
| tenders_vps.db | SQLite 3 | ~3.9M records | Active/Archived Tenders database. Contains published notices, EMDs, fee information. |
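Both files are plain SQLite 3 databases, so any SQLite client can open them. As a minimal sketch, assuming the files sit in the working directory under the names listed above, Python's standard sqlite3 module can open a file read-only and report the row count of every table. The helper names open_readonly and table_row_counts are hypothetical and not part of the dataset.

```python
import sqlite3


def open_readonly(path: str) -> sqlite3.Connection:
    """Open an SQLite file read-only so analysis cannot modify the archive."""
    # The file: URI form with mode=ro makes SQLite refuse writes and fail
    # if the file does not exist, instead of silently creating an empty one.
    return sqlite3.connect(f"file:{path}?mode=ro", uri=True)


def table_row_counts(conn: sqlite3.Connection) -> dict[str, int]:
    """Return {table_name: row_count} for every user table in the database."""
    names = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name")]
    counts = {}
    for name in names:
        # Table names cannot be bound as parameters, so quote them instead.
        quoted = '"' + name.replace('"', '""') + '"'
        counts[name] = conn.execute(
            f"SELECT COUNT(*) FROM {quoted}").fetchone()[0]
    return counts
```

Calling table_row_counts(open_readonly("aoc_tenders.db")) is a quick way to confirm a download is complete; counting several million rows may take a moment.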
================================================================================
AN OPEN INVITATION TO DATA SCIENTISTS, RESEARCHERS, JOURNALISTS,
AND ANYONE WHO BELIEVES PUBLIC MONEY DESERVES PUBLIC SCRUTINY:
Every rupee the government spends goes through a tender. A notice is published,
bids are invited, a winner is selected, a contract is signed. This process
is, in theory, the immune system of public finance. In practice, it is where
most large-scale procurement corruption quietly occurs.
I spent two weeks building a high-throughput scraper that has systematically
crawled and archived the Central Public Procurement Portal (CPPP) of India
-- the government's own public-facing procurement database. The result is two
flat SQLite databases totalling over 8.8 million records, covering both the
initial tender notices and the final award outcomes, with full structured detail.
No paywalls. No throttled APIs. No pagination traps. No obfuscated HTML.
Just the data, as it was published.
The database is now yours. Here is what I ask in return:
1. Download it.
2. Look at who keeps winning.
3. Look at how short the bid windows are.
4. Look at how many tenders receive exactly one bid.
5. Run the numbers. Write the queries. Publish the findings.
Corruption is not invisible. It is just tedious to find. This dataset removes
the tedium. The analysis, and the courage to share it, is still on us.
COORDINATION GROUP (Discuss, share queries, and post findings):
Discord: https://discord.gg/7Zsgyg86Mq
================================================================================
Scope of Data
SOURCE
Portal : Central Public Procurement Portal (CPPP) -- eprocure.gov.in
Coverage : National and state-level procurement across all sectors
(Works, Goods, Services, Consultancies)
Portal Types Covered:
- Central Government Ministries & Departments
- State Government Portals (via CPPP aggregation)
- Defence Procurement (defproc.gov.in mirror entries)
- State-specific portals (e.g., wbtenders.gov.in, etenders.kerala.gov.in)
CORPUS SIZE
aoc_tenders.db : 4,921,960 award of contract listing records
aoc_details : 4,540,739 fully parsed detail pages (JSON)
tenders_vps.db : 3,952,191 published tender notice records
tender_details : 3,178,485 fully parsed tender detail pages (JSON)
-------------------------
Total : 16,593,375 structured records across both databases
DATA COLLECTION METHOD
- Concurrent HTTP crawlers with session rotation and backoff
- HTML table extraction into normalized JSON key-value payloads
- MD5-hashed internal IDs derived from source URLs (stable deduplication keys; accidental collisions are negligible at this scale)
- Partition sharding for parallel ingestion
- All timestamps preserved from source in IST
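The ID and partitioning steps above can be sketched as follows. This is an illustration, not the scraper's actual code: it assumes the URL is hashed as UTF-8 without any normalisation, that the partition index is the hash modulo a partition count, and that 16 partitions are used. None of these details are stated in this README, and all function names are hypothetical.

```python
import hashlib


def internal_id_for(detail_url: str) -> str:
    """MD5 hex digest of the URL (assumes UTF-8, no URL normalisation)."""
    return hashlib.md5(detail_url.encode("utf-8")).hexdigest()


def partition_for(internal_id: str, num_partitions: int = 16) -> int:
    """Map a hex ID to a shard index; 16 is a made-up default."""
    return int(internal_id, 16) % num_partitions


def dedupe(urls):
    """Keep the first URL seen for each internal ID."""
    seen = {}
    for url in urls:
        seen.setdefault(internal_id_for(url), url)
    return seen
```

Because the ID depends only on the URL, re-crawling the same page yields the same key, which is what makes an insert-or-ignore style of deduplication possible.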
KEY AUDITABLE FIELDS & RESEARCH ANGLES
Bid Competition Analysis:
* Number of bids received per award -- identify single-bid contracts at scale
* Bid submission windows (e_published_date vs bid_submission_closing_date)
-- artificially short windows are a known indicator of pre-selected winners
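A minimal sketch of the single-bid count, run against aoc_tenders.db. It reads the "Number of bids received" key from aoc_details.details_json and parses the JSON in Python rather than in SQL, so it does not depend on SQLite's JSON functions being available. The function name is hypothetical, and rows whose bid count is missing or non-numeric are skipped, which is an assumption about how such values should be treated.

```python
import json
import sqlite3

BIDS_KEY = "Number of bids received"


def single_bid_share(conn: sqlite3.Connection) -> tuple[int, int]:
    """Return (single_bid_awards, awards_with_a_usable_bid_count)."""
    single = usable = 0
    for (raw,) in conn.execute("SELECT details_json FROM aoc_details"):
        try:
            value = json.loads(raw).get(BIDS_KEY, "")
            bids = int(str(value).strip())
        except (TypeError, ValueError, AttributeError):
            continue  # NULL, malformed JSON, non-object, or non-numeric
        usable += 1
        if bids == 1:
            single += 1
    return single, usable
```

Dividing the first number by the second gives the share of awards decided on a single bid; grouping the same loop by "Organisation Name" would show where that share is highest.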
Vendor Concentration:
* Name of the selected bidder(s) -- aggregate winner frequency by vendor name
* Address of the selected bidder(s) -- cluster by geography or address overlap
* Cross-reference multiple contracts won by same entity across departments
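Winner frequency can be sketched in a few lines. The normalisation here (case-folding, dropping dots and commas, collapsing whitespace) is a deliberately crude assumption; real vendor names will need fuzzier matching, and the helper names are hypothetical.

```python
import json
import sqlite3
from collections import Counter

WINNER_KEY = "Name of the selected bidder(s)"


def normalise_name(name: str) -> str:
    """Crude normalisation: lower-case, drop dots/commas, squeeze spaces."""
    cleaned = name.casefold().replace(".", " ").replace(",", " ")
    return " ".join(cleaned.split())


def top_winners(conn: sqlite3.Connection, n: int = 20):
    """Return the n most frequent winner names as (name, award_count)."""
    counts = Counter()
    for (raw,) in conn.execute("SELECT details_json FROM aoc_details"):
        try:
            name = json.loads(raw).get(WINNER_KEY)
        except (TypeError, ValueError, AttributeError):
            continue  # skip NULL, malformed, or non-object payloads
        if isinstance(name, str) and name.strip():
            counts[normalise_name(name)] += 1
    return counts.most_common(n)
```

The same loop keyed on "Address of the selected bidder(s)" gives a first pass at address overlap between differently named vendors.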
Financial Anomaly Detection:
* Contract Value vs. EMD ratio -- unusually high EMDs can deter legitimate bidders
* Tender Fee structures -- high document fees as gatekeeping mechanism
* Contract Value outliers per category and per organisation
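Monetary fields are stored as strings, so any financial analysis starts with parsing them. The sketch below assumes amounts use comma grouping (including Indian-style grouping such as 1,23,456.00) with an optional currency prefix; the exact formatting in the data is not documented here and should be checked on a sample. It then flags contract values far above the median for the same organisation. The factor of 10 and the minimum of 5 awards are arbitrary illustrative thresholds, and the function names are hypothetical.

```python
import json
import re
import sqlite3
from collections import defaultdict
from statistics import median

_NUMBER = re.compile(r"\d[\d,]*(?:\.\d+)?")


def parse_amount(text):
    """Extract the first number from a currency string, or None."""
    if not isinstance(text, str):
        return None
    match = _NUMBER.search(text)
    if match is None:
        return None
    return float(match.group().replace(",", ""))


def contract_value_outliers(conn: sqlite3.Connection,
                            factor: float = 10.0, min_awards: int = 5):
    """Return (org, value, org_median) where value > factor * org_median."""
    by_org = defaultdict(list)
    for (raw,) in conn.execute("SELECT details_json FROM aoc_details"):
        try:
            payload = json.loads(raw)
            org = payload.get("Organisation Name")
            value = parse_amount(payload.get("Contract Value"))
        except (TypeError, ValueError, AttributeError):
            continue  # skip NULL, malformed, or non-object payloads
        if org and value is not None:
            by_org[org].append(value)
    flagged = []
    for org, values in by_org.items():
        if len(values) < min_awards:
            continue  # too few awards for a meaningful median
        mid = median(values)
        flagged.extend((org, v, mid) for v in values
                       if mid > 0 and v > factor * mid)
    return flagged
```

parse_amount should apply equally to the "EMD" and "Tender Fee" strings in tenders_vps.db, provided they follow the same formatting.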
Timeline & Process Integrity:
* Corrigendum frequency -- repeated amendments can signal process manipulation
* Bid opening vs. closing gap -- very short gaps reduce competitive legitimacy
* Document download window vs. submission window asymmetries
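Window lengths can be computed directly from the listing columns in tenders_vps.db. The date columns are TEXT and their format is not documented in this README, so the format string below (for values like "05-Jan-2024 06:00 PM") is an assumption to verify against a few rows before trusting any result. The 7-day threshold and the function names are likewise illustrative.

```python
import sqlite3
from datetime import datetime

# ASSUMPTION: dates look like "05-Jan-2024 06:00 PM"; adjust if they do not.
DATE_FORMAT = "%d-%b-%Y %I:%M %p"


def window_days(start, end, fmt: str = DATE_FORMAT):
    """Days between two date strings, or None if either fails to parse."""
    try:
        delta = (datetime.strptime(end.strip(), fmt)
                 - datetime.strptime(start.strip(), fmt))
    except (AttributeError, ValueError):
        return None  # NULL value or unexpected format
    return delta.total_seconds() / 86400


def short_bid_windows(conn: sqlite3.Connection, max_days: float = 7.0,
                      fmt: str = DATE_FORMAT):
    """Return (tender_id, days) for tenders open for fewer than max_days."""
    rows = conn.execute(
        "SELECT tender_id, e_published_date, bid_submission_closing_date "
        "FROM tenders")
    result = []
    for tender_id, published, closing in rows:
        days = window_days(published, closing, fmt)
        if days is not None and 0 <= days < max_days:
            result.append((tender_id, days))
    return result
```

Since all timestamps are preserved in IST, subtracting two naive datetimes is safe here; no timezone conversion is needed for a difference.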
Sector & Department Mapping:
* Organisation Type (Central / State / Defence / PSU)
* Product Category & Sub-category breakdowns for sector-level analysis
* Department-level award concentration over time
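Department-level award counts over time need only the listing table in aoc_tenders.db, so they can be computed in plain SQL without touching the JSON payloads. A small sketch, with a hypothetical function name and an optional minimum-count filter:

```python
import sqlite3


def awards_per_org_year(conn: sqlite3.Connection, min_awards: int = 1):
    """Return (year, org_name, awards) rows, largest counts first per year."""
    sql = """
        SELECT year, org_name, COUNT(*) AS awards
        FROM aoc_tenders
        WHERE org_name IS NOT NULL AND year IS NOT NULL
        GROUP BY year, org_name
        HAVING COUNT(*) >= ?
        ORDER BY year, awards DESC, org_name
    """
    return conn.execute(sql, (min_awards,)).fetchall()
```

Note that org_name is free text from the source portal, so the same department may appear under several spellings and should be normalised before drawing conclusions.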
Database Schemas
1. aoc_tenders.db Schema Overview
-- TABLE: aoc_tenders (~4,921,960 rows)
-- Preliminary listing metadata for award notifications.
CREATE TABLE aoc_tenders (
internal_id TEXT PRIMARY KEY, -- MD5 hash of detail_url (used as unique key)
portal_type TEXT, -- Source portal classifier
year INTEGER, -- Calendar year of listing
sl_no TEXT, -- List serial number
aoc_date TEXT, -- Contract award date timestamp
closing_date TEXT, -- Original bid closing date
title TEXT, -- Subject line/title of the award
ref_no TEXT, -- Tender reference number
tender_id TEXT, -- Public tender ID key
org_name TEXT, -- Purchasing department or state agency
detail_url TEXT, -- CPPP details page source URL
partition_id INTEGER -- Hash partition index
);
-- TABLE: aoc_details (~4,540,739 rows)
-- Deep crawled values corresponding to award details.
CREATE TABLE aoc_details (
internal_id TEXT PRIMARY KEY, -- FK mapping to aoc_tenders.internal_id
tender_id TEXT, -- Public tender ID key
scraped_at TEXT, -- Timestamp of crawler execution
details_json TEXT -- JSON representation of raw HTML table key-values
);
-- JSON Schema keys inside aoc_details.details_json:
{
"Tender Type": string,
"Contract Date": string,
"Contract Value": string (currency value),
"Published Date": string,
"Tender Document": string (URL),
"Tender Ref. No.": string,
"Organisation Name": string,
"Tender Description": string,
"Number of bids received": string,
"Name of the selected bidder(s)": string,
"Address of the selected bidder(s)": string,
"Date of Completion/Completion Period in Days": string
}
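The two tables join on internal_id. The row counts above imply that roughly 381,221 listings (4,921,960 minus 4,540,739) have no parsed detail page, assuming every aoc_details row matches a listing. The sketch below counts those gaps and iterates over awards that do have details, decoding the JSON in Python. Function names are hypothetical.

```python
import json
import sqlite3


def awards_missing_details(conn: sqlite3.Connection) -> int:
    """Count aoc_tenders rows with no matching aoc_details row."""
    sql = """
        SELECT COUNT(*)
        FROM aoc_tenders AS t
        LEFT JOIN aoc_details AS d ON d.internal_id = t.internal_id
        WHERE d.internal_id IS NULL
    """
    return conn.execute(sql).fetchone()[0]


def iter_awards(conn: sqlite3.Connection):
    """Yield (listing_dict, details_dict) for awards with a detail page."""
    sql = """
        SELECT t.tender_id, t.org_name, t.aoc_date, d.details_json
        FROM aoc_tenders AS t
        JOIN aoc_details AS d ON d.internal_id = t.internal_id
    """
    for tender_id, org_name, aoc_date, raw in conn.execute(sql):
        try:
            details = json.loads(raw)
        except (TypeError, ValueError):
            continue  # NULL or malformed payload
        if isinstance(details, dict):
            listing = {"tender_id": tender_id, "org_name": org_name,
                       "aoc_date": aoc_date}
            yield listing, details
```

All values inside details_json are strings, so numeric fields such as "Contract Value" and "Number of bids received" still need parsing after this step.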
2. tenders_vps.db Schema Overview
-- TABLE: tenders (~3,952,191 rows)
-- Listing metadata for active/archived tenders.
CREATE TABLE tenders (
internal_id TEXT PRIMARY KEY, -- Base64 decoded internal identifier
tender_id TEXT, -- Public tender ID key
detail_url TEXT, -- URL link to detail view
status TEXT, -- Scraper classification ('active' / 'archived')
organisation_name TEXT, -- Organisation or state department name
title TEXT, -- Tender title or description
reference_number TEXT, -- Department tender reference code
portal_type TEXT, -- Source category (org / state)
serial_number TEXT, -- List serial number
e_published_date TEXT, -- Published date timestamp
bid_submission_closing_date TEXT, -- Closing date timestamp
tender_opening_date TEXT, -- Bid opening date timestamp
corrigendum_url TEXT, -- Link to corrigendum updates page (if any)
scraped_at TEXT, -- Crawl execution timestamp
partition_id INTEGER -- Hash partition index
);
-- TABLE: tender_details (~3,178,485 rows)
-- Deep metadata and full parsed html values.
CREATE TABLE tender_details (
internal_id TEXT PRIMARY KEY, -- FK mapping to tenders.internal_id
tender_id TEXT, -- Public tender ID key
details_json TEXT, -- JSON representation of raw HTML details table
scraped_at TEXT -- Timestamp of deep crawl
);
-- JSON Schema keys inside tender_details.details_json:
{
"Tender Reference Number": string,
"Tender Title": string,
"Organisation Name": string,
"Organisation Type": string,
"Tender Category": string,
"Tender Type": string,
"Product Category": string,
"Product Sub-Category": string,
"ePublished Date": string,
"Bid Opening Date": string,
"Bid Submission Start Date": string,
"Bid Submission End Date": string,
"Document Download Start Date": string,
"Document Download End Date": string,
"EMD": string (Earnest Money Deposit),
"Tender Fee": string,
"Location": string,
"Address": string,
"Name": string (Contact officer name),
"Work Description": string,
"Tender Document": string (URL)
}
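To follow a tender from notice to award, both files can be queried through one connection using SQLite's ATTACH DATABASE. The sketch below assumes that tender_id identifies the same tender in both databases; the schemas describe it identically ("Public tender ID key"), but this README does not confirm the match, so spot-check a few IDs first. The schema alias vps and the function names are hypothetical.

```python
import sqlite3


def open_both(aoc_path: str, vps_path: str) -> sqlite3.Connection:
    """Open aoc_tenders.db as the main schema and attach tenders_vps.db."""
    conn = sqlite3.connect(aoc_path)
    conn.execute("ATTACH DATABASE ? AS vps", (vps_path,))
    return conn


def awards_with_notice_dates(conn: sqlite3.Connection, limit: int = 10):
    """Return (tender_id, e_published_date, aoc_date) for matched tenders."""
    # ASSUMPTION: tender_id values are comparable across the two files.
    sql = """
        SELECT a.tender_id, t.e_published_date, a.aoc_date
        FROM aoc_tenders AS a
        JOIN vps.tenders AS t ON t.tender_id = a.tender_id
        WHERE a.tender_id IS NOT NULL
        LIMIT ?
    """
    return conn.execute(sql, (limit,)).fetchall()
```

Typical use would be awards_with_notice_dates(open_both("aoc_tenders.db", "tenders_vps.db")). Neither schema declares an index on tender_id, so a join over millions of rows may be slow; adding an index on a working copy of the files is one way around that.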