Background
As the sole engineer at Northside, I built an end-to-end AI system to automate founder evaluation. It scores startup founders against a proprietary investment rubric using fine-tuned language models, syncs with the firm's CRM and deal pipeline, and acts as a conversational AI assistant for the investment team, replacing a manual screening process that previously took hours per candidate.
One of the biggest bottlenecks in deal flow is founder screening, which involves manually reviewing LinkedIn profiles to evaluate whether a startup founder fits the firm's investment thesis. This was eating hours of analyst time every week, and evaluations were inconsistent across the team.
I wanted to build a system that could do this autonomously. The core requirements:
- AI Scoring: score any founder profile against a proprietary rubric in seconds, not hours.
- Human-in-the-Loop: let analysts override AI scores, and feed those corrections back into model training.
- CRM Integration: cross-reference thousands of existing contacts to avoid duplicate outreach and track pipeline status.
- Conversational Assistant: give the team a natural language interface to query founders, companies, and deals across all data sources.
Architecture
The system has 4 main layers: the analyst dashboard (no-code UI), the API backend, the data/AI layer, and external integrations.
- Analyst Dashboard: search founders, score profiles, override scores, push to CRM
- FastAPI Backend: exposes the scoring endpoints and webhook flows that connect the dashboard to everything below
- OpenAI: scoring + fine-tuning
- PostgreSQL: profiles, scores
- CRM API: synced records
- Scraping API: LinkedIn profiles
It's a Python monolith deployed on a PaaS. The backend is built with FastAPI and exposes both synchronous scoring endpoints and async webhook-based flows. The database is PostgreSQL via Supabase. External integrations handle LinkedIn scraping, CRM sync, and LLM inference + fine-tuning.
Scoring Engine
The scoring engine is the core of the system. Given a founder's professional profile, it produces a decimal score from 1.0 to 9.0 with reasoning.
Rubric Engineering
The rubric went through 4 iterations to calibrate the AI's judgment against the investment team's. The final version uses a structured format with explicit score band definitions, hard floors, and hard caps:
```text
# Score bands (simplified)
# 9 – elite: unicorn-level founder or massive exit, now founding again
# 8 – very strong: significant exit or top-operator at leading company
# 7 – strong: recent founder with credible signals (accelerator, top logos)
# 6 – promising but below bar
# 5 – average: solid IC, little founder/leadership signal
# 1-4 – declining quality / red flags

# Hard Floors: force minimum scores for exceptional signals
# e.g., significant exit history → min 8.0

# Hard Caps: prevent over-scoring weak profiles
# e.g., service/agency-heavy background → max 3.0
```
Hard floors and caps were critical. They enforce consistency by preventing the model from under-scoring elite profiles or over-scoring weak ones, regardless of how persuasive the LinkedIn copy is. This was the single most impactful design decision in the system.
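In the system described, these rules live in the rubric prompt itself. A deterministic backstop in code is a natural complement; here is a minimal sketch of one, assuming hypothetical predicates `has_significant_exit` and `is_agency_heavy` (my names, not from the source):

```python
# Hypothetical post-hoc guard: re-applies rubric floors/caps after parsing
# the model's JSON, in case the model drifts despite the prompt rules.
def clamp_score(score: float, profile: dict) -> float:
    if has_significant_exit(profile):   # assumed helper predicate
        score = max(score, 8.0)         # hard floor for exceptional signals
    if is_agency_heavy(profile):        # assumed helper predicate
        score = min(score, 3.0)         # hard cap for weak profiles
    return max(1.0, min(9.0, score))    # keep within the 1.0-9.0 band
```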
Prompt Design
The scoring prompt follows a system + user pattern. The system prompt sets the investor persona and output format. The user prompt injects the rubric and a canonicalized profile:
```python
import json  # client, MODEL, and RUBRIC are defined at module level

def score_profile(profile, experiences):
    canon = canonicalize(profile, experiences)

    messages = [
        {
            "role": "system",
            "content": "You are an investor. Score 1-9 using the rubric. "
                       "Return JSON only: {\"score\": x, \"why\": \"<=60 words\"}."
        },
        {
            "role": "user",
            "content": f"Rubric ->\n{RUBRIC}\n\nProfile ->\n{canon}\n\n"
                       "Return JSON only.",
        },
    ]

    resp = client.chat.completions.create(
        model=MODEL,  # base or fine-tuned
        temperature=0,
        messages=messages,
    )
    return json.loads(resp.choices[0].message.content)
```
Canonicalization is important because it reduces messy scraper JSON into a consistent text format (name, headline, about, top skills, 5 most recent experiences sorted by recency) so the model always sees the same structure regardless of input source.
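A minimal sketch of such a canonicalizer, with the exact field names assumed rather than taken from the source:

```python
def canonicalize(profile: dict, experiences: list[dict]) -> str:
    # Sort newest-first and keep the 5 most recent roles
    recent = sorted(experiences,
                    key=lambda e: e.get("start_date") or "",
                    reverse=True)[:5]
    lines = [
        f"Name: {profile.get('name', '')}",
        f"Headline: {profile.get('headline', '')}",
        f"About: {profile.get('about', '')}",
        f"Top skills: {', '.join(profile.get('skills', [])[:10])}",
        "Experience:",
    ]
    for e in recent:
        lines.append(f"- {e.get('title', '?')} @ {e.get('company', '?')} "
                     f"({e.get('start_date', '?')} - {e.get('end_date', 'present')})")
    return "\n".join(lines)
```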
Autonomous Pipeline
The flagship endpoint handles the full lifecycle in a single API call (a sketch follows the list):
- Cache check: if the profile exists and has a score, return instantly
- Scrape: trigger the scraping service to pull the LinkedIn profile
- Normalize: parse raw JSON, generate a canonical profile ID (MD5 of URL), deduplicate via SHA256
- Score: run the profile through the rubric with the current model
- Persist: store profile, experiences, and score in the database
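Here is that sketch; the route name and the `db`, `scraper`, and `normalize` helpers are assumptions, not names from the source:

```python
import hashlib

@app.post("/score")  # hypothetical route name
def score_endpoint(linkedin_url: str):
    profile_id = hashlib.md5(canonicalize_url(linkedin_url).encode()).hexdigest()

    # 1. Cache check: return an existing score instantly
    cached = db.get_latest_score(profile_id)
    if cached:
        return cached

    # 2-3. Scrape, then normalize raw JSON into profile + experiences
    raw = scraper.fetch_profile(linkedin_url)
    profile, experiences = normalize(raw)

    # 4. Score against the rubric with the current model
    result = score_profile(profile, experiences)

    # 5. Persist profile, experiences, and score
    db.save(profile, experiences, result)
    return result
```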
For production reliability, there's also a webhook-based flow that decouples scraping from scoring. The scraping service calls back to a webhook endpoint when complete, triggering normalization and scoring asynchronously. This prevents timeout issues and allows retry logic.
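The callback half of that flow, sketched with an assumed payload shape:

```python
@app.post("/webhooks/scrape-complete")  # hypothetical route name
async def scrape_complete(payload: dict):
    # Scraping service calls back here when a profile is ready;
    # normalization and scoring run asynchronously, off the request path
    profile, experiences = normalize(payload["data"])
    result = score_profile(profile, experiences)
    db.save(profile, experiences, result)
    return {"ok": True}
```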
Fine-Tuning
The human-in-the-loop training loop is what makes the system improve over time. Analysts review AI scores in the dashboard and can override them with their own judgment and reasoning.
Dataset Construction
Training data is built from the labels table, where every analyst override becomes a supervised training example. The dataset uses the exact same prompt format as inference so the fine-tuned model is a drop-in replacement:
```python
for profile_id, label in labeled_profiles.items():
    profile = profiles_by_id[profile_id]
    experiences = all_experiences[profile_id]
    canon = canonicalize(profile, experiences)

    example = {
        "messages": [
            {"role": "system", "content": SCORING_SYSTEM_PROMPT},
            {"role": "user",
             "content": f"Rubric ->\n{RUBRIC}\n\nProfile ->\n{canon}"},
            {"role": "assistant",
             "content": json.dumps({"score": label["human_score"]})},
        ]
    }
    dataset.append(example)

# Write to train.jsonl
```
A design decision: early on, I trained on score only (not reasoning), letting the model develop its own justifications from the base model's capabilities. This worked better than training on sparse or template-generated human reasoning.
Training Loop
The fine-tuning orchestrator includes idempotency to avoid wasted training runs:
- Hash check: SHA256 of train.jsonl compared to last run; skip if unchanged
- Upload: push dataset to OpenAI's file API
- Train: fine-tune on the base model (GPT-4o-mini)
- Poll: wait for completion
- Hot-swap: on success, atomically update the model pointer (both a config file and the environment variable) so the API immediately serves the new model
```python
if job.status == "succeeded" and job.fine_tuned_model:
    ft_id = job.fine_tuned_model
    PTR.write_text(ft_id)                          # pointer file
    set_key(str(ENV_PATH), "OPENAI_MODEL", ft_id)  # .env
    STATE.write_text(json.dumps({
        "last_ft_id": ft_id,
        "dataset_sha": dataset_sha
    }))
```
The suffix includes the training date for audit trails. This makes it trivial to trace any score back to the exact model version that produced it.
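A sketch of the orchestrator's happy path leading up to that hot-swap, using the OpenAI fine-tuning API; the file paths and suffix format are assumptions:

```python
import hashlib, json, time
from datetime import date

# TRAIN_PATH and STATE are pathlib.Path objects; client is the OpenAI client
dataset_sha = hashlib.sha256(TRAIN_PATH.read_bytes()).hexdigest()
if STATE.exists() and json.loads(STATE.read_text()).get("dataset_sha") == dataset_sha:
    raise SystemExit("dataset unchanged; skipping training run")

# Upload the dataset, then launch the job with a dated suffix for audit trails
f = client.files.create(file=TRAIN_PATH.open("rb"), purpose="fine-tune")
job = client.fine_tuning.jobs.create(
    training_file=f.id,
    model="gpt-4o-mini",
    suffix=f"ft-{date.today():%Y%m%d}",  # hypothetical naming scheme
)

# Poll until the job reaches a terminal state
while job.status not in ("succeeded", "failed", "cancelled"):
    time.sleep(60)
    job = client.fine_tuning.jobs.retrieve(job.id)
```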
Conversational Assistant
Beyond scoring, the system includes a conversational AI layer that gives the investment team a natural language interface across all data sources. It works across email, messaging, API, and the dashboard.
Intent Classification
Inbound messages are classified into intents using a structured JSON-mode call:
```python
INTENTS = {
    "lookup": "Wants to know if we've seen a founder or company",
    "reminder": "Wants to set a reminder or schedule a task",
    "memo": "Wants to generate a deal memo",
    "draft": "Wants to draft an outbound email or reply",
    "unknown": "Cannot determine; escalate to human",
}
```
The classifier also extracts entities (company name, founder name, LinkedIn URL, domain) in the same call using JSON mode, so we can immediately route to the correct lookup strategy without a second LLM call.
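A minimal sketch of that single call using OpenAI's JSON mode; the prompt wording and entity field names are mine:

```python
def classify(message: str) -> dict:
    resp = client.chat.completions.create(
        model=MODEL,
        temperature=0,
        response_format={"type": "json_object"},  # forces valid JSON output
        messages=[
            {"role": "system",
             "content": "Classify the message into one of: "
                        + ", ".join(INTENTS)
                        + ". Also extract entities. Return JSON: "
                          '{"intent": ..., "company": ..., "founder": ..., '
                          '"linkedin_url": ..., "domain": ...}'},
            {"role": "user", "content": message},
        ],
    )
    return json.loads(resp.choices[0].message.content)
```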
CRM Lookup
The lookup system searches across multiple data sources in priority order:
- LinkedIn URL (highest confidence): exact match in CRM or profiles table
- Company name: fuzzy search across CRM organizations
- Founder name: search across scored profiles, then cross-reference with CRM
- Domain: fallback to domain-based lookup
Each result includes a confidence level (high, medium, low) so the assistant can flag ambiguous matches instead of guessing. This was important because in a VC context, confidently returning the wrong company is worse than admitting uncertainty.
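Sketched as a priority chain (the `crm` and `db` helper names are assumptions), with each strategy tagging its own confidence:

```python
def lookup(entities: dict) -> dict:
    # Strategies in priority order: exact URL beats fuzzy name beats domain
    if url := entities.get("linkedin_url"):
        if hit := crm.find_by_linkedin_url(url) or db.find_profile_by_url(url):
            return {"match": hit, "confidence": "high"}
    if company := entities.get("company"):
        if hit := crm.fuzzy_search_orgs(company):
            return {"match": hit, "confidence": "medium"}
    if founder := entities.get("founder"):
        if hit := db.search_profiles_by_name(founder):
            return {"match": hit, "confidence": "medium"}
    if domain := entities.get("domain"):
        if hit := crm.find_by_domain(domain):
            return {"match": hit, "confidence": "low"}
    return {"match": None, "confidence": "low"}
```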
Reply Generation
The reply generator uses few-shot examples to maintain a consistent voice: concise, factual, never speculative. It injects live CRM and database context before generating:
```python
def generate_reply(user_message, lookup_result, intent, channel):
    # SYSTEM_PROMPT, FEW_SHOT_EXAMPLES, and the per-channel tone note
    # (channel_note) are defined elsewhere in the module
    context = build_context_string(lookup_result)

    messages = [
        {"role": "system", "content": SYSTEM_PROMPT + channel_note},
        *FEW_SHOT_EXAMPLES,
        {"role": "system",
         "content": f"Relevant data:\n\n{context}"},
        {"role": "user", "content": user_message},
    ]

    # Completion call elided in the original; sketched here for completeness
    resp = client.chat.completions.create(model=MODEL, messages=messages)
    return resp.choices[0].message.content
```
The system adapts tone by channel: shorter for messaging, more structured for email. The assistant never fabricates information; if no match is found, it says so clearly and offers to add the entry.
Data Pipeline
Ingestion
LinkedIn profiles are ingested through a scraping API. To manage rate limits, the system routes requests across multiple scraping agents assigned to different team members, distributing load:
```python
# Round-robin across scraping agents
AGENTS = {
    "default": os.getenv("SCRAPER_AGENT_DEFAULT"),
    "sourcer_1": os.getenv("SCRAPER_AGENT_1"),
    "sourcer_2": os.getenv("SCRAPER_AGENT_2"),
}
```
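One simple way to realize the rotation, sketched with `itertools.cycle` (my sketch, not necessarily the source's mechanism):

```python
import itertools

_agent_cycle = itertools.cycle(AGENTS.values())

def next_agent() -> str:
    # Rotate through agents so no single one absorbs all requests
    return next(_agent_cycle)
```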
The webhook-based flow decouples scraping from processing. The scraping service calls back when complete, triggering normalization and scoring automatically.
Normalization
The normalizer converts messy scraper JSON into clean, canonical records:
- Profile ID: MD5 hash of the canonical LinkedIn URL for deduplication
- Change detection: SHA256 of the raw JSON to detect profile updates
- Experience parsing: extracts company, title, dates, and duration from nested structures
- URL canonicalization: strips tracking params, normalizes paths
This layer is critical because scraper output is unreliable. Field names change, formats vary, and the same profile can appear multiple times. The SHA256 dedup means we never re-process unchanged profiles.
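The ID and change-detection pieces, sketched with the payload handling assumed:

```python
import hashlib, json
from urllib.parse import urlsplit

def canonicalize_url(url: str) -> str:
    # Drop query/tracking params and trailing slashes, lowercase the host
    parts = urlsplit(url)
    return f"https://{parts.netloc.lower()}{parts.path.rstrip('/')}"

def make_ids(raw: dict, url: str) -> tuple[str, str]:
    # Stable profile ID from the canonical URL
    profile_id = hashlib.md5(canonicalize_url(url).encode()).hexdigest()
    # Content hash: unchanged payloads hash identically -> skip re-processing
    content_sha = hashlib.sha256(
        json.dumps(raw, sort_keys=True).encode()
    ).hexdigest()
    return profile_id, content_sha
```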
CRM Integration
Bi-Directional Sync
The system maintains a bi-directional sync with the firm's CRM, keeping thousands of organization records in sync. Reads are always enabled; writes are gated behind a config flag to prevent accidental data pollution during development:
```python
_profile_locks: dict[str, threading.Lock] = {}
_profile_locks_guard = threading.Lock()

def _get_profile_lock(key: str) -> threading.Lock:
    # Guarded creation of one lock per profile key
    with _profile_locks_guard:
        if key not in _profile_locks:
            _profile_locks[key] = threading.Lock()
        return _profile_locks[key]
```
Per-profile threading locks prevent concurrent duplicate organization creation when multiple scoring requests target the same founder simultaneously. The sync handles CRM rate limiting, pagination across large record sets, and field mapping between the CRM's nested API responses and the system's flat schema.
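The write gate itself can be a single check in front of every mutating call; a sketch, with the flag name assumed:

```python
import os, logging

logger = logging.getLogger(__name__)
CRM_WRITES_ENABLED = os.getenv("CRM_WRITES_ENABLED", "false") == "true"

def crm_write(fn, *args, **kwargs):
    # Reads always work; writes are a no-op unless explicitly enabled
    if not CRM_WRITES_ENABLED:
        logger.info("CRM writes disabled; skipping %s", fn.__name__)
        return None
    return fn(*args, **kwargs)
```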
Weekly Digest
An automated weekly outreach digest pulls live data from the CRM, filters by the upcoming week's scheduled outreach dates, groups entries by pipeline status, and emails a categorized HTML list with a CSV attachment.
The key optimization: the digest uses only ~5-10 API calls total (instead of thousands) by leveraging the CRM's v2 API with inline field selection rather than per-record lookups. It also enriches entries with LinkedIn URLs extracted from CRM notes as a fallback when the structured field is empty.
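In outline (all helper names here are assumptions), the digest is one bulk fetch, a group-by, and an email:

```python
from collections import defaultdict
from datetime import date, timedelta

def build_digest():
    week_end = date.today() + timedelta(days=7)
    # One bulk v2 query with inline field selection, not per-record lookups
    entries = crm.query_outreach(due_before=week_end)  # hypothetical helper

    by_status = defaultdict(list)
    for e in entries:
        by_status[e["pipeline_status"]].append(e)

    html = render_digest_html(by_status)    # categorized HTML list
    csv_bytes = render_digest_csv(entries)  # CSV attachment
    send_email(to=TEAM_LIST, html=html, attachment=("digest.csv", csv_bytes))
```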
Database
The system uses PostgreSQL via Supabase, with a schema designed around the scoring lifecycle:
Tables
- profiles: canonical profile data (MD5 ID, URL, name, headline, skills, SHA256 for change detection)
- experiences: work history (1:many with profiles, deduped by a unique constraint on profile + company + title + start date)
- scores: AI scoring history by model version (supports multiple scores per profile for A/B testing models)
- labels: human overrides that feed the training pipeline
Views
- latest_scores: resolves the display score using COALESCE(human_score, ai_score) with window functions to implement human-wins priority
- profiles_for_ui: unified view powering the dashboard, joining profiles with scores and experience counts
The latest_scores view is the key abstraction. It means every consumer (dashboard, API, assistant) always sees the most trusted score without any application-level logic.
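A reconstruction of what such a view can look like in PostgreSQL; the `created_at` column and the exact SQL are my assumptions, guided by the description above:

```sql
-- Latest score per profile, with human overrides winning via COALESCE
CREATE VIEW latest_scores AS
SELECT profile_id, display_score
FROM (
    SELECT s.profile_id,
           COALESCE(l.human_score, s.ai_score) AS display_score,
           ROW_NUMBER() OVER (
               PARTITION BY s.profile_id ORDER BY s.created_at DESC
           ) AS rn
    FROM scores s
    LEFT JOIN labels l ON l.profile_id = s.profile_id
) ranked
WHERE rn = 1;
```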
Deployment
The backend is deployed on a PaaS as a single FastAPI service. This simplicity was intentional. Scoring requests are short-lived (under 15 seconds), and the webhook flow handles longer operations, so there's no need for complex orchestration, queues, or workers.
Environment configuration uses the platform's secrets management for all API keys. The service auto-restarts on deploys with zero downtime via health checks against /healthz.
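The health endpoint is the standard FastAPI one-liner:

```python
@app.get("/healthz")
def healthz():
    # Hit by the platform's health checks during rolling deploys
    return {"status": "ok"}
```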
Results
The system went from concept to production in about 3 months, built and maintained as a solo engineer.
- Hours → seconds: screening time per candidate
- Consistent: every founder scored against the same rubric
- Weeks: time to full team adoption
- Self-improving: analyst corrections feed back into training
Takeaways
“The quality ceiling of an LLM system is set by its instructions, not its model.”
Rubric engineering was the hardest part. It took 4 iterations to balance specificity (hard floors and caps) with flexibility (the model's ability to reason about edge cases).
“Keep training format identical to inference.”
The fine-tuning feedback loop created a real flywheel. Watching the model improve as analysts corrected its mistakes was deeply satisfying. Making the fine-tuned model a true drop-in replacement was the key insight.
“A write gate is a simple mechanism that prevents real incidents.”
CRM integration forced careful thinking about data ownership, deduplication, and read-vs-write boundaries. That single config flag prevented multiple production incidents during development.
“For internal tools, speed-to-ship beats flexibility.”
Using a no-code platform for the analyst dashboard let me focus engineering time on the AI and data layers while still delivering a polished experience.
Next Steps
- Agentic Screening (→ multi-step research): move from single-shot scoring to multi-step research, letting the system autonomously gather funding history, co-founder backgrounds, and company traction before producing a score.
- RAG over Deal History (→ contextual memory): build a retrieval layer over historical deal memos and IC notes to give the assistant deeper context about why the firm passed or invested in similar companies.
- Email Automation (→ zero-touch inbound): extend the assistant to fully handle inbound deal flow emails: classify, lookup, draft responses, and route to the right team member automatically.
- Multi-Firm Generalization (→ platform play): abstract the rubric and scoring system to support multiple firms with different investment theses, turning it into a platform.