📊 Data Pipeline Specs

MVP Pipeline (What Actually Works Day 1)

Upload → Extract → Store → Display
 PDF     Text      JSON    Dashboard

Document Processing Flow

1. Policy → Markdown Conversion

  • Input: PDF, DOCX, XLSX files
  • Tools: pdf-parse (PDF), mammoth.js (DOCX); no XLSX extractor is listed yet
  • Output: Extracted text formatted as Markdown
  • Storage: Supabase jsonb column
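
The conversion step could be wired up roughly as follows. This is a minimal sketch, not the project's implementation: the `Extractor` type, the registry, and the function names are all hypothetical. In a real build the "pdf" entry would presumably wrap pdf-parse and the "docx" entry mammoth.js; they are injected here so the routing logic stays independent of either library.

```typescript
// Hypothetical extractor signature: raw file bytes in, plain text out.
type Extractor = (file: Uint8Array) => Promise<string>;

// Registry keyed by lowercase file extension (assumed convention).
type ExtractorRegistry = Map<string, Extractor>;

// Returns the lowercase extension without the dot, or "" when there is none.
function fileExtension(filename: string): string {
  const dot = filename.lastIndexOf(".");
  return dot === -1 ? "" : filename.slice(dot + 1).toLowerCase();
}

// Looks up the extractor for a file and fails loudly on unsupported types.
function pickExtractor(registry: ExtractorRegistry, filename: string): Extractor {
  const extractor = registry.get(fileExtension(filename));
  if (extractor === undefined) {
    throw new Error(`No extractor registered for "${filename}"`);
  }
  return extractor;
}

// Runs the matching extractor and trims the result before it is stored.
async function extractText(
  registry: ExtractorRegistry,
  filename: string,
  file: Uint8Array,
): Promise<string> {
  const text = await pickExtractor(registry, filename)(file);
  return text.trim();
}
```

Failing on an unregistered extension makes the XLSX gap visible at upload time instead of silently storing empty text.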

2. Variable Extraction (Post-MVP)

[Placeholder - Waiting for policy templates]
Future: Extract {{company_name}}, {{review_date}}, etc.
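
Until the policy templates arrive, the placeholder syntax above is the only thing known about this step. A minimal sketch of what extraction and substitution might look like, assuming variables are always written as `{{name}}` with identifier-style names (the function names and the whitespace tolerance are assumptions):

```typescript
// Matches {{name}} placeholders; optional whitespace inside the braces is allowed.
const PLACEHOLDER = /\{\{\s*([A-Za-z_][A-Za-z0-9_]*)\s*\}\}/g;

// Returns the unique variable names in order of first appearance.
function findVariables(markdown: string): string[] {
  const names: string[] = [];
  // Fresh regex per call so a shared lastIndex never carries over.
  const pattern = new RegExp(PLACEHOLDER.source, "g");
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(markdown)) !== null) {
    if (!names.includes(match[1])) {
      names.push(match[1]);
    }
  }
  return names;
}

// Substitutes known values; placeholders without a value are left untouched.
function fillVariables(markdown: string, values: Record<string, string>): string {
  return markdown.replace(PLACEHOLDER, (whole: string, name: string) =>
    Object.prototype.hasOwnProperty.call(values, name) ? values[name] : whole,
  );
}
```

Leaving unknown placeholders in place keeps a half-filled policy readable and makes missing values easy to spot.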

3. Question Bank Structure

CREATE TABLE questions (
  id UUID PRIMARY KEY,
  question_text TEXT,
  source_type TEXT, -- 'insurance', 'audit', 'custom'
  source_name TEXT, -- 'Allianz Cyber Form v2.1'
  framework_tags TEXT[], -- ['E8', 'ISO27001']
  risk_pattern TEXT, -- 'access_control', 'incident_response'
  answer_type TEXT, -- 'boolean', 'text', 'file'
  persona TEXT -- 'director', 'executive', 'it_manager'
);
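
On the frontend, a row from this table could be typed along these lines. This is a sketch under assumptions: the union types simply restate the values from the column comments (the table itself does not enforce them), every column is treated as non-null even though the DDL does not say so, and `questionsFor` is a hypothetical helper.

```typescript
// Allowed values are taken from the column comments, not from DB constraints.
type SourceType = "insurance" | "audit" | "custom";
type AnswerType = "boolean" | "text" | "file";
type Persona = "director" | "executive" | "it_manager";

// One row of the questions table (all columns assumed non-null).
interface Question {
  id: string;
  question_text: string;
  source_type: SourceType;
  source_name: string;
  framework_tags: string[];
  risk_pattern: string;
  answer_type: AnswerType;
  persona: Persona;
}

// Hypothetical helper: questions for one persona that carry a given framework tag.
function questionsFor(all: Question[], persona: Persona, frameworkTag: string): Question[] {
  return all.filter(
    (q) => q.persona === persona && q.framework_tags.includes(frameworkTag),
  );
}
```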

Storage Schemas

Policies Storage

{
  "id": "uuid",
  "title": "Acceptable Use Policy",
  "content_md": "# Policy content...",
  "variables": {}, // Empty in MVP
  "framework_tags": ["E8_A"],
  "status": "draft|active|archived"
}
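
Because the policy lives in a jsonb column, nothing checks its shape on the way in or out. A small runtime guard is one way to catch malformed records; the sketch below assumes `status` holds exactly one of the three listed values and that `variables` maps names to strings (the schema above does not specify the value type).

```typescript
type PolicyStatus = "draft" | "active" | "archived";

interface Policy {
  id: string;
  title: string;
  content_md: string;
  variables: Record<string, string>; // empty in the MVP; value type is an assumption
  framework_tags: string[];
  status: PolicyStatus;
}

const POLICY_STATUSES: readonly string[] = ["draft", "active", "archived"];

// True when every value of a plain object is a string.
function isStringMap(value: unknown): boolean {
  if (typeof value !== "object" || value === null || Array.isArray(value)) return false;
  const map = value as Record<string, unknown>;
  return Object.keys(map).every((key) => typeof map[key] === "string");
}

// Type guard for records read back from storage.
function isPolicy(value: unknown): value is Policy {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  const tags = v["framework_tags"];
  const status = v["status"];
  return (
    typeof v["id"] === "string" &&
    typeof v["title"] === "string" &&
    typeof v["content_md"] === "string" &&
    isStringMap(v["variables"]) &&
    Array.isArray(tags) &&
    tags.every((tag) => typeof tag === "string") &&
    typeof status === "string" &&
    POLICY_STATUSES.includes(status)
  );
}
```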

Assessment Responses

{
  "question_id": "uuid",
  "response": "Yes, we have MFA",
  "evidence_urls": ["storage/evidence1.pdf"],
  "responded_by": "user_id",
  "timestamp": "2024-01-15T10:00:00Z"
}
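
A response record of this shape could be assembled with a small factory. The sketch below is illustrative only: `buildResponse` and its parameter order are hypothetical, and rejecting empty answers is an assumption rather than a stated requirement.

```typescript
interface AssessmentResponse {
  question_id: string;
  response: string;
  evidence_urls: string[];
  responded_by: string;
  timestamp: string; // ISO 8601 in UTC
}

// Builds a response record; `now` is injectable so the timestamp is testable.
function buildResponse(
  questionId: string,
  response: string,
  respondedBy: string,
  evidenceUrls: string[] = [],
  now: Date = new Date(),
): AssessmentResponse {
  const trimmed = response.trim();
  if (trimmed === "") {
    throw new Error("response must not be empty");
  }
  return {
    question_id: questionId,
    response: trimmed,
    evidence_urls: evidenceUrls.slice(), // copy so later edits by the caller don't leak in
    responded_by: respondedBy,
    timestamp: now.toISOString(),
  };
}
```

Note that `Date.prototype.toISOString` always includes milliseconds, so it produces `2024-01-15T10:00:00.000Z` rather than the shorter form shown in the sample record.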

Placeholders for Human Work

  • /placeholder/policy-extraction.md - Waiting for 14 policy Word docs
  • /placeholder/insurance-forms.md - Need Allianz/Chubb samples
  • /placeholder/question-dump.md - Waiting for CSV of all questions

Data Flow Priorities

  1. Now: Manual upload → Text storage → Basic display
  2. Next: Question bank → Dynamic forms
  3. Later: AI parsing → Variable extraction → Smart forms

Integration Points

  • Upload UI → Supabase Storage
  • Text extraction → Background job
  • Question forms → Frontend components
  • Reports → Data aggregation queries
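
As an illustration of the kind of aggregation a report might need, the sketch below counts answered questions per framework tag. It works in memory on already-fetched rows; in practice this would more likely be a SQL query. The type names, `coverageByTag`, and the choice of metric are all assumptions.

```typescript
// Minimal views of the question and response records used by this report.
interface TaggedQuestion {
  id: string;
  framework_tags: string[];
}
interface AnswerRef {
  question_id: string;
}
interface TagCoverage {
  tag: string;
  answered: number;
  total: number;
}

// Counts, per framework tag, how many questions exist and how many have an answer.
function coverageByTag(questions: TaggedQuestion[], answers: AnswerRef[]): TagCoverage[] {
  // A Set makes repeated answers to the same question count once.
  const answeredIds = new Set(answers.map((a) => a.question_id));
  const byTag = new Map<string, TagCoverage>();
  for (const q of questions) {
    for (const tag of q.framework_tags) {
      const row = byTag.get(tag) ?? { tag, answered: 0, total: 0 };
      row.total += 1;
      if (answeredIds.has(q.id)) {
        row.answered += 1;
      }
      byTag.set(tag, row);
    }
  }
  // Sort by tag so the report order is stable.
  return Array.from(byTag.values()).sort((a, b) => a.tag.localeCompare(b.tag));
}
```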

Remember: start with "it works", not "it's perfect".