Data Pipeline Specs
MVP Pipeline (What Actually Works Day 1)
Document Processing Flow
1. Policy → Markdown Conversion
- Input: PDF, DOCX, XLSX files
- Tool: pdf-parse (PDF), mammoth.js (DOCX); no XLSX extractor is specified yet
- Output: Extracted text as Markdown
- Storage: Supabase jsonb column
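As a minimal sketch of the routing step implied by the list above, the helper below picks an extractor from the file extension. The name `pickExtractor` and the `"unsupported"` fallback are assumptions for illustration, not part of the spec; the sketch only names the tool and does not call pdf-parse or mammoth.js.

```typescript
// Extractor names mirror the tools listed above. "unsupported" is an
// assumed fallback for formats (such as XLSX) with no tool named here.
type Extractor = "pdf-parse" | "mammoth" | "unsupported";

// Hypothetical helper: choose an extractor from the file extension.
function pickExtractor(filename: string): Extractor {
  const dot = filename.lastIndexOf(".");
  const ext = dot === -1 ? "" : filename.slice(dot + 1).toLowerCase();
  switch (ext) {
    case "pdf":
      return "pdf-parse";
    case "docx":
      return "mammoth";
    default:
      return "unsupported";
  }
}
```

Returning an explicit `"unsupported"` value rather than throwing lets the upload UI report which files were skipped.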
2. Variable Extraction (Post-MVP)
[Placeholder - Waiting for policy templates]
Future: Extract {{company_name}}, {{review_date}}, etc.
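Once templates arrive, the extraction step could look roughly like the sketch below. It assumes variables are written as `{{name}}` with identifier-style names, as in the two examples above; the function name `extractVariables` is hypothetical, and the real templates may need a different pattern.

```typescript
// Collect the distinct {{variable}} names found in a Markdown string,
// in order of first appearance. Whitespace inside the braces is tolerated.
function extractVariables(markdown: string): string[] {
  const pattern = /\{\{\s*([A-Za-z_][A-Za-z0-9_]*)\s*\}\}/g;
  const names: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(markdown)) !== null) {
    const name = match[1];
    if (name !== undefined && names.indexOf(name) === -1) {
      names.push(name);
    }
  }
  return names;
}
```

The returned names could seed the `variables` object on a stored policy, which stays empty in the MVP.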
3. Question Bank Structure
CREATE TABLE questions (
    id UUID PRIMARY KEY,
    question_text TEXT,
    source_type TEXT,       -- 'insurance', 'audit', 'custom'
    source_name TEXT,       -- 'Allianz Cyber Form v2.1'
    framework_tags TEXT[],  -- ['E8', 'ISO27001']
    risk_pattern TEXT,      -- 'access_control', 'incident_response'
    answer_type TEXT,       -- 'boolean', 'text', 'file'
    persona TEXT            -- 'director', 'executive', 'it_manager'
);
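On the client side, a row from this table might be typed and filtered as in the sketch below. The union types are taken from the column comments above and may not be exhaustive; the sketch also assumes every column is populated, although the table as written allows NULLs. `questionsFor` is a hypothetical helper name.

```typescript
// Shape of one row in the questions table, as an assumption based on
// the column comments; real data may contain other values or NULLs.
interface Question {
  id: string;
  question_text: string;
  source_type: "insurance" | "audit" | "custom";
  source_name: string;
  framework_tags: string[];
  risk_pattern: string;
  answer_type: "boolean" | "text" | "file";
  persona: "director" | "executive" | "it_manager";
}

// Select the questions for one persona that carry a given framework tag.
function questionsFor(
  questions: Question[],
  persona: Question["persona"],
  frameworkTag: string
): Question[] {
  return questions.filter(
    (q) => q.persona === persona && q.framework_tags.indexOf(frameworkTag) !== -1
  );
}
```

A filter like this is one way the later "Question bank → Dynamic forms" step could decide which questions to render for a given user.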
Storage Schemas
Policies Storage
{
  "id": "uuid",
  "title": "Acceptable Use Policy",
  "content_md": "# Policy content...",
  "variables": {}, // Empty in MVP
  "framework_tags": ["E8_A"],
  "status": "draft|active|archived"
}
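The `status` field above lists three alternatives separated by `|`, which suggests each stored policy holds exactly one of them. A sketch of that reading as a type plus a runtime check follows; the names `PolicyStatus`, `Policy`, and `isPolicyStatus` are illustrative, and typing `variables` as string-to-string is an assumption.

```typescript
// One of the three status values shown in the schema above.
type PolicyStatus = "draft" | "active" | "archived";

// Assumed TypeScript shape of a stored policy document.
interface Policy {
  id: string;
  title: string;
  content_md: string;
  variables: Record<string, string>; // empty object in the MVP
  framework_tags: string[];
  status: PolicyStatus;
}

const POLICY_STATUSES: PolicyStatus[] = ["draft", "active", "archived"];

// Runtime guard for values read back from the jsonb column.
function isPolicyStatus(value: string): value is PolicyStatus {
  return POLICY_STATUSES.indexOf(value as PolicyStatus) !== -1;
}
```

Because a jsonb column does not enforce this shape by itself, a guard like this is one place to catch a malformed status before display.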
Assessment Responses
{
  "question_id": "uuid",
  "response": "Yes, we have MFA",
  "evidence_urls": ["storage/evidence1.pdf"],
  "responded_by": "user_id",
  "timestamp": "2024-01-15T10:00:00Z"
}
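A response record in this shape could be assembled as in the sketch below. The helper name `makeResponse` and its parameters are assumptions; passing the clock in as an argument keeps the function deterministic. Note that `Date.prototype.toISOString` includes milliseconds, so the output reads `...T10:00:00.000Z` rather than `...T10:00:00Z`; both are valid ISO 8601 UTC timestamps.

```typescript
// Assumed TypeScript shape of one stored assessment response.
interface AssessmentResponse {
  question_id: string;
  response: string;
  evidence_urls: string[];
  responded_by: string;
  timestamp: string; // ISO 8601, UTC
}

// Hypothetical helper: build a response record stamped with the given time.
function makeResponse(
  questionId: string,
  response: string,
  respondedBy: string,
  evidenceUrls: string[],
  now: Date
): AssessmentResponse {
  return {
    question_id: questionId,
    response: response,
    evidence_urls: evidenceUrls.slice(), // copy so later edits do not leak in
    responded_by: respondedBy,
    timestamp: now.toISOString(),
  };
}
```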
Placeholders for Human Work
- /placeholder/policy-extraction.md - Waiting for 14 policy Word docs
- /placeholder/insurance-forms.md - Need Allianz/Chubb samples
- /placeholder/question-dump.md - Waiting for CSV of all questions
Data Flow Priorities
- Now: Manual upload → Text storage → Basic display
- Next: Question bank → Dynamic forms
- Later: AI parsing → Variable extraction → Smart forms
Integration Points
- Upload UI → Supabase Storage
- Text extraction → Background job
- Question forms → Frontend components
- Reports → Data aggregation queries
Remember: start with "it works", not "it's perfect".