Swasti · mForm V2→V3

09 — MongoDB masters export (A14 unblocked)

Status: A14 (V2 access method) closed 2026-05-06. Direct read access to the Swasti mForm V2 production database. All masters needed to seed staging Frappe are extracted to local JSONL and ready to load.

Connection

  • Host: mform-v2.0vgd0.mongodb.net (MongoDB Atlas, ap-south-1, MongoDB 8.0.21)
  • Database: mformv2_swastihealthprod — single prod database, 79 collections
  • Auth: read-only user abhijit_nair, credentials at ~/Dhwani/keys/swasti-mform-v2-mongo.txt (mode 600)
  • Quirk: first connect is slow (~8–15s, cold replica set discovery). Use serverSelectionTimeoutMS=20000 in any client.

Re-running the export

~/Dhwani/swasti-mform-migration/.venv/bin/python \
  ~/Dhwani/swasti-mform-migration/scripts/export_masters.py

Idempotent overwrite (~30s end-to-end). Output at ~/Dhwani/swasti-mform-migration/raw/mongo-export/ (mode 700, files mode 600 — contains user PII).

What was exported

20 master collections as JSONL (BSON-faithful via bson.json_util):

CollectionDocsNotes
forms13Header-only — no questions array
tempforms14Carries the questions arrays (598 questions across all forms)
states16Geography level 1
districts31
blocks131
grampanchayats1,963
villages3,967
roles8
users481PII: name + mobile
useraccesses489Geography + role assignment per user
partners1Only “CMS”
projects1Only “Social Protection” (PM-01)
dashboards5
dashboardindicators13
contentflows10
contentnavigations238
phonedetails203Device telemetry, not strictly master
questionbanks1Test entry
questioncategories1Test entry
sdgs0Empty in prod

4 derived files:

  • derived/questions_index.jsonl598 questions indexed as (publishedFormId, order) → {title, shortKey, input_type, validation, answer_option_count, ...}. This is the canonical source for the A16.1 (formId, order) → V3 fieldname registry.
  • derived/formdynamicoptions__shape.jsonl — 14 rows (one per (formId, order) group with dynamic options), with total rows + active rows + distinct counts of optionId, optionName, parentOption. Diagnostic file: tells which groups are masters vs per-response.
  • derived/formdynamicoptions__masters_active.jsonl — Distinct optionName per (formId, order), isActive=true. Cleaner master, drops deprecated/test entries. Use this for seeding.
  • derived/formdynamicoptions__masters_all.jsonl — Same but all rows (active + inactive). Diff against _active to find deprecated entries.

Three architectural surprises worth remembering

1. forms is header-only — questions live in tempforms

The forms collection has 13 docs but no questions array. All question definitions live in tempforms.questions, keyed by tempforms.publishedFormId (1000–1012). This matters when reading the V2 schema programmatically — don’t look at forms.questions, it doesn’t exist.

2. formdynamicoptions has dual semantics

The collection has 2.1M rows but only 14 distinct (formId, order) groups. Each group has one of two semantics:

  • Master-style (low distinct names): the actual reference data — schemes, documents, statuses, etc.
  • Per-response-generated (huge distinct names): rows created at submission time, keyed to specific responses. Three groups have 309k+ distinct names — these are member-id pickers, not masters.

The 5,000-distinct-names threshold cleanly separates the two. The __shape.jsonl file makes it explicit:

(formId, order)total rowsdistinct namesrole
(1000, “36”)2,12211Member status master
(1002, “20”)197185 activeScheme master (sa_a_eligible)
(1002, “44”)1056Historical (order 44 doesn’t exist in current tempforms — likely from older forms.version)
(1002, “6”)494,071308,678Per-response member ref — skip
(1003, “8”)536,554161Scheme reference in followup
(1003, “11”)501,220154Followup status
(1004, “2”)494,997309,142Per-response member ref — skip
(1004, “41”)3231 activeDocument master
(1005, “2”)60,73434Document followup status
(1005, “16”)13,61432
(1010, “4”)3,2663,243Member-id picker for HS — per-response
(1011, “141”)449446Per-response member ref
(1011, “149”)5613 activeHealth followup status master
(1012, “36”)11Test form

3. The 107 schemes are actually 185

PM said “107 schemes” in the kickoff. Production has 185 active distinct scheme names at (formId=1002, order='20'), shortKey sa_a_eligible. Real schemes (sample): “Aam Aadmi Bima Yojana”, “Agristack(Farmer Id) Maharasthra”, “Ambedkar Hastshilp Vikas Yojana”, “Annadatha Sukhibhava”, “Antyodaya Anna Yojana”, “Ayushmaan Bharat Health Account” (39,223 fills — most-used), “Bhraman-Darshan Yojana”, etc.

Could be: PM’s count was rough; or 185 includes some closed/regional variants. Either way the seed list is what’s in formdynamicoptions__masters_active.jsonl for that group, not 107 hand-picked entries.

The document master at (formId=1004, order='41') is 31 active types (not 32) — Aadhar, PAN, Ration Card, Voter ID, Bank Account, Income Certificate, etc. Aadhar is by far the most-filled (54,439 fills).

What was not exported (deferred to A16.5)

Response data — out of scope for masters/seeds, will be imported as part of the response migration:

  • form_1000form_1012 (1.7M+ docs total — actual form responses)
  • clientformdatas (1.6M — union view of all responses)
  • rawclientformdatas (226k — pre-processed inputs)
  • All *_copy_* snapshots from Oct/Nov/Dec/May (point-in-time backups)
  • errorlogs (641k), csvlogs, mediaconversions, apihistories, emaillogs, dataversions, migrationerrorlogs (operational data, not masters)

Where to seed Frappe from

  • Static masters (states, districts, blocks, GPs, villages, roles, projects, partners): load from collections/<name>.jsonl. ObjectId references resolve via the _id field.
  • Dynamic-option masters (Scheme Master, Document Master, status taxonomies): filter derived/formdynamicoptions__masters_active.jsonl by the right (formId, order). Mapping (formId, order) → V3 Doctype comes from notes/05-pm-design-doc.md.
  • Question/field metadata (for A16.1 registry): join derived/questions_index.jsonl against the V3 Doctype field plan.

What this unblocks

Action items: A14, A14b, A15 all complete. A18 can proceed (rare input_types are now sample-able directly from the dump). Pipeline jobs A16.1 (registry) and A16.4 (master-data hydration) move to in-progress — both have foundations.

Remaining external blockers:

  • A14a — Kannada/Telugu translations source (open with mForm team — V2 dump appears English-only)
  • A22 — DRIS comments on HS form (open with PM)
  • A24 — Civic IDs “List of 9” enumeration (open with PM)

Internal-only work that can start now: A20 (Livelihood Doctype from PM sheet), A21 (BMI/HB/BP/RBS compute helpers), A23 (WorkflowStatus Doctype from PM status taxonomy), A25 (HS Test Cases as regression fixtures).


Last updated 2026-05-04