09 — MongoDB masters export (A14 unblocked)
Status: A14 (V2 access method) closed 2026-05-06. Direct read access to the Swasti mForm V2 production database. All masters needed to seed staging Frappe are extracted to local JSONL and ready to load.
Connection
- Host:
mform-v2.0vgd0.mongodb.net(MongoDB Atlas, ap-south-1, MongoDB 8.0.21) - Database:
mformv2_swastihealthprod— single prod database, 79 collections - Auth: read-only user
abhijit_nair, credentials at~/Dhwani/keys/swasti-mform-v2-mongo.txt(mode 600) - Quirk: first connect is slow (~8–15s, cold replica set discovery). Use
serverSelectionTimeoutMS=20000in any client.
Re-running the export
~/Dhwani/swasti-mform-migration/.venv/bin/python \
~/Dhwani/swasti-mform-migration/scripts/export_masters.py
Idempotent overwrite (~30s end-to-end). Output at ~/Dhwani/swasti-mform-migration/raw/mongo-export/ (mode 700, files mode 600 — contains user PII).
What was exported
20 master collections as JSONL (BSON-faithful via bson.json_util):
| Collection | Docs | Notes |
|---|---|---|
forms | 13 | Header-only — no questions array |
tempforms | 14 | Carries the questions arrays (598 questions across all forms) |
states | 16 | Geography level 1 |
districts | 31 | |
blocks | 131 | |
grampanchayats | 1,963 | |
villages | 3,967 | |
roles | 8 | |
users | 481 | PII: name + mobile |
useraccesses | 489 | Geography + role assignment per user |
partners | 1 | Only “CMS” |
projects | 1 | Only “Social Protection” (PM-01) |
dashboards | 5 | |
dashboardindicators | 13 | |
contentflows | 10 | |
contentnavigations | 238 | |
phonedetails | 203 | Device telemetry, not strictly master |
questionbanks | 1 | Test entry |
questioncategories | 1 | Test entry |
sdgs | 0 | Empty in prod |
4 derived files:
derived/questions_index.jsonl— 598 questions indexed as(publishedFormId, order) → {title, shortKey, input_type, validation, answer_option_count, ...}. This is the canonical source for the A16.1(formId, order) → V3 fieldnameregistry.derived/formdynamicoptions__shape.jsonl— 14 rows (one per(formId, order)group with dynamic options), with total rows + active rows + distinct counts ofoptionId,optionName,parentOption. Diagnostic file: tells which groups are masters vs per-response.derived/formdynamicoptions__masters_active.jsonl— DistinctoptionNameper(formId, order), isActive=true. Cleaner master, drops deprecated/test entries. Use this for seeding.derived/formdynamicoptions__masters_all.jsonl— Same but all rows (active + inactive). Diff against_activeto find deprecated entries.
Three architectural surprises worth remembering
1. forms is header-only — questions live in tempforms
The forms collection has 13 docs but no questions array. All question definitions live in tempforms.questions, keyed by tempforms.publishedFormId (1000–1012). This matters when reading the V2 schema programmatically — don’t look at forms.questions, it doesn’t exist.
2. formdynamicoptions has dual semantics
The collection has 2.1M rows but only 14 distinct (formId, order) groups. Each group has one of two semantics:
- Master-style (low distinct names): the actual reference data — schemes, documents, statuses, etc.
- Per-response-generated (huge distinct names): rows created at submission time, keyed to specific responses. Three groups have 309k+ distinct names — these are member-id pickers, not masters.
The 5,000-distinct-names threshold cleanly separates the two. The __shape.jsonl file makes it explicit:
| (formId, order) | total rows | distinct names | role |
|---|---|---|---|
| (1000, “36”) | 2,122 | 11 | Member status master |
| (1002, “20”) | 197 | 185 active | Scheme master (sa_a_eligible) |
| (1002, “44”) | 105 | 6 | Historical (order 44 doesn’t exist in current tempforms — likely from older forms.version) |
| (1002, “6”) | 494,071 | 308,678 | Per-response member ref — skip |
| (1003, “8”) | 536,554 | 161 | Scheme reference in followup |
| (1003, “11”) | 501,220 | 154 | Followup status |
| (1004, “2”) | 494,997 | 309,142 | Per-response member ref — skip |
| (1004, “41”) | 32 | 31 active | Document master |
| (1005, “2”) | 60,734 | 34 | Document followup status |
| (1005, “16”) | 13,614 | 32 | |
| (1010, “4”) | 3,266 | 3,243 | Member-id picker for HS — per-response |
| (1011, “141”) | 449 | 446 | Per-response member ref |
| (1011, “149”) | 56 | 13 active | Health followup status master |
| (1012, “36”) | 1 | 1 | Test form |
3. The 107 schemes are actually 185
PM said “107 schemes” in the kickoff. Production has 185 active distinct scheme names at (formId=1002, order='20'), shortKey sa_a_eligible. Real schemes (sample): “Aam Aadmi Bima Yojana”, “Agristack(Farmer Id) Maharasthra”, “Ambedkar Hastshilp Vikas Yojana”, “Annadatha Sukhibhava”, “Antyodaya Anna Yojana”, “Ayushmaan Bharat Health Account” (39,223 fills — most-used), “Bhraman-Darshan Yojana”, etc.
Could be: PM’s count was rough; or 185 includes some closed/regional variants. Either way the seed list is what’s in formdynamicoptions__masters_active.jsonl for that group, not 107 hand-picked entries.
The document master at (formId=1004, order='41') is 31 active types (not 32) — Aadhar, PAN, Ration Card, Voter ID, Bank Account, Income Certificate, etc. Aadhar is by far the most-filled (54,439 fills).
What was not exported (deferred to A16.5)
Response data — out of scope for masters/seeds, will be imported as part of the response migration:
form_1000…form_1012(1.7M+ docs total — actual form responses)clientformdatas(1.6M — union view of all responses)rawclientformdatas(226k — pre-processed inputs)- All
*_copy_*snapshots from Oct/Nov/Dec/May (point-in-time backups) errorlogs(641k),csvlogs,mediaconversions,apihistories,emaillogs,dataversions,migrationerrorlogs(operational data, not masters)
Where to seed Frappe from
- Static masters (states, districts, blocks, GPs, villages, roles, projects, partners): load from
collections/<name>.jsonl. ObjectId references resolve via the_idfield. - Dynamic-option masters (Scheme Master, Document Master, status taxonomies): filter
derived/formdynamicoptions__masters_active.jsonlby the right(formId, order). Mapping(formId, order) → V3 Doctypecomes fromnotes/05-pm-design-doc.md. - Question/field metadata (for A16.1 registry): join
derived/questions_index.jsonlagainst the V3 Doctype field plan.
What this unblocks
Action items: A14, A14b, A15 all complete. A18 can proceed (rare input_types are now sample-able directly from the dump). Pipeline jobs A16.1 (registry) and A16.4 (master-data hydration) move to in-progress — both have foundations.
Remaining external blockers:
- A14a — Kannada/Telugu translations source (open with mForm team — V2 dump appears English-only)
- A22 — DRIS comments on HS form (open with PM)
- A24 — Civic IDs “List of 9” enumeration (open with PM)
Internal-only work that can start now: A20 (Livelihood Doctype from PM sheet), A21 (BMI/HB/BP/RBS compute helpers), A23 (WorkflowStatus Doctype from PM status taxonomy), A25 (HS Test Cases as regression fixtures).