Background
As the sole engineer at Northside, I built an end-to-end AI system to automate founder evaluation. It scores startup founders against a proprietary investment rubric using fine-tuned language models, syncs with the firm's CRM and deal pipeline, and acts as a conversational AI assistant for the investment team, replacing a manual screening process that previously took hours per candidate.
One of the biggest bottlenecks in deal flow is founder screening, which involves manually reviewing LinkedIn profiles to evaluate whether a startup founder fits the firm's investment thesis. This was eating hours of analyst time every week, and evaluations were inconsistent across the team.
I wanted to build a system that could do this autonomously. The core requirements:
- AI Scoring: score any founder profile against a proprietary rubric in seconds, not hours.
- Human-in-the-Loop: let analysts override AI scores, and feed those corrections back into model training.
- CRM Integration: cross-reference thousands of existing contacts to avoid duplicate outreach and track pipeline status.
- Conversational Assistant: give the team a natural language interface to query founders, companies, and deals across all data sources.
Architecture
The system has 4 main layers: the analyst dashboard (no-code UI), the API backend, the data/AI layer, and external integrations.
- Analyst Dashboard: search founders, score profiles, override scores, push to CRM
- FastAPI Backend: exposes the scoring endpoints and webhook flows that connect the dashboard to everything below
- OpenAI: scoring + fine-tuning
- PostgreSQL: profiles, scores
- CRM API: synced records
- Scraping API: LinkedIn profiles
It's a Python monolith deployed on a PaaS. The backend is built with FastAPI and exposes both synchronous scoring endpoints and async webhook-based flows. The database is PostgreSQL via Supabase. External integrations handle LinkedIn scraping, CRM sync, and LLM inference + fine-tuning.
Scoring Engine
The scoring engine is the core of the system. Given a founder's professional profile, it produces a decimal score from 1.0 to 9.0 with reasoning.
Rubric Engineering
The rubric went through 4 iterations to calibrate the AI's judgment against the investment team's. The final version uses a structured format with explicit score band definitions, hard floors, and hard caps:
```text
# Score bands (simplified)
# 9 – elite: unicorn-level founder or massive exit, now founding again
# 8 – very strong: significant exit or top-operator at leading company
# 7 – strong: recent founder with credible signals (accelerator, top logos)
# 6 – promising but below bar
# 5 – average: solid IC, little founder/leadership signal
# 1-4 – declining quality / red flags

# Hard Floors: force minimum scores for exceptional signals
# e.g., significant exit history → min 8.0

# Hard Caps: prevent over-scoring weak profiles
# e.g., service/agency-heavy background → max 3.0
```
Hard floors and caps were critical. They enforce consistency by preventing the model from under-scoring elite profiles or over-scoring weak ones, regardless of how persuasive the LinkedIn copy is. This was the single most impactful design decision in the system.
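In the system described, these rules live in the rubric prompt itself. A deterministic backstop in code is a natural complement; here is a minimal sketch of one, assuming hypothetical predicates `has_significant_exit` and `is_agency_heavy` (my names, not from the source):

```python
# Hypothetical post-hoc guard: re-applies rubric floors/caps after parsing
# the model's JSON, in case the model drifts despite the prompt rules.
def clamp_score(score: float, profile: dict) -> float:
    if has_significant_exit(profile):   # assumed helper predicate
        score = max(score, 8.0)         # hard floor for exceptional signals
    if is_agency_heavy(profile):        # assumed helper predicate
        score = min(score, 3.0)         # hard cap for weak profiles
    return max(1.0, min(9.0, score))    # keep within the 1.0-9.0 band
```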
Prompt Design
The scoring prompt follows a system + user pattern. The system prompt sets the investor persona and output format. The user prompt injects the rubric and a canonicalized profile:
```python
import json  # client, MODEL, and RUBRIC are defined at module level

def score_profile(profile, experiences):
    canon = canonicalize(profile, experiences)

    messages = [
        {
            "role": "system",
            "content": "You are an investor. Score 1-9 using the rubric. "
                       "Return JSON only: {\"score\": x, \"why\": \"<=60 words\"}."
        },
        {
            "role": "user",
            "content": f"Rubric ->\n{RUBRIC}\n\nProfile ->\n{canon}\n\n"
                       "Return JSON only.",
        },
    ]

    resp = client.chat.completions.create(
        model=MODEL,  # base or fine-tuned
        temperature=0,
        messages=messages,
    )
    return json.loads(resp.choices[0].message.content)
```
Canonicalization is important because it reduces messy scraper JSON into a consistent text format (name, headline, about, top skills, 5 most recent experiences sorted by recency) so the model always sees the same structure regardless of input source.
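A minimal sketch of such a canonicalizer, with the exact field names assumed rather than taken from the source:

```python
def canonicalize(profile: dict, experiences: list[dict]) -> str:
    # Sort newest-first and keep the 5 most recent roles
    recent = sorted(experiences,
                    key=lambda e: e.get("start_date") or "",
                    reverse=True)[:5]
    lines = [
        f"Name: {profile.get('name', '')}",
        f"Headline: {profile.get('headline', '')}",
        f"About: {profile.get('about', '')}",
        f"Top skills: {', '.join(profile.get('skills', [])[:10])}",
        "Experience:",
    ]
    for e in recent:
        lines.append(f"- {e.get('title', '?')} @ {e.get('company', '?')} "
                     f"({e.get('start_date', '?')} - {e.get('end_date', 'present')})")
    return "\n".join(lines)
```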
Autonomous Pipeline
The flagship endpoint handles the full lifecycle in a single API call (a sketch follows the list):
- Cache check: if the profile exists and has a score, return instantly
- Scrape: trigger the scraping service to pull the LinkedIn profile
- Normalize: parse raw JSON, generate a canonical profile ID (MD5 of URL), deduplicate via SHA256
- Score: run the profile through the rubric with the current model
- Persist: store profile, experiences, and score in the database
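Here is that sketch; the route name and the `db`, `scraper`, and `normalize` helpers are assumptions, not names from the source:

```python
import hashlib

@app.post("/score")  # hypothetical route name
def score_endpoint(linkedin_url: str):
    profile_id = hashlib.md5(canonicalize_url(linkedin_url).encode()).hexdigest()

    # 1. Cache check: return an existing score instantly
    cached = db.get_latest_score(profile_id)
    if cached:
        return cached

    # 2-3. Scrape, then normalize raw JSON into profile + experiences
    raw = scraper.fetch_profile(linkedin_url)
    profile, experiences = normalize(raw)

    # 4. Score against the rubric with the current model
    result = score_profile(profile, experiences)

    # 5. Persist profile, experiences, and score
    db.save(profile, experiences, result)
    return result
```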
For production reliability, there's also a webhook-based flow that decouples scraping from scoring. The scraping service calls back to a webhook endpoint when complete, triggering normalization and scoring asynchronously. This prevents timeout issues and allows retry logic.
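The callback half of that flow, sketched with an assumed payload shape:

```python
@app.post("/webhooks/scrape-complete")  # hypothetical route name
async def scrape_complete(payload: dict):
    # Scraping service calls back here when a profile is ready;
    # normalization and scoring run asynchronously, off the request path
    profile, experiences = normalize(payload["data"])
    result = score_profile(profile, experiences)
    db.save(profile, experiences, result)
    return {"ok": True}
```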
Fine-Tuning
The human-in-the-loop training loop is what makes the system improve over time. Analysts review AI scores in the dashboard and can override them with their own judgment and reasoning.
Dataset Construction
Training data is built from the labels table, where every analyst override becomes a supervised training example. The dataset uses the exact same prompt format as inference so the fine-tuned model is a drop-in replacement:
```python
for profile_id, label in labeled_profiles.items():
    profile = profiles_by_id[profile_id]
    experiences = all_experiences[profile_id]
    canon = canonicalize(profile, experiences)

    example = {
        "messages": [
            {"role": "system", "content": SCORING_SYSTEM_PROMPT},
            {"role": "user",
             "content": f"Rubric ->\n{RUBRIC}\n\nProfile ->\n{canon}"},
            {"role": "assistant",
             "content": json.dumps({"score": label["human_score"]})},
        ]
    }
    dataset.append(example)

# Write to train.jsonl
```
A design decision: early on, I trained on score only (not reasoning), letting the model develop its own justifications from the base model's capabilities. This worked better than training on sparse or template-generated human reasoning.
Training Loop
The fine-tuning orchestrator includes idempotency to avoid wasted training runs:
- Hash check: SHA256 of train.jsonl compared to last run; skip if unchanged
- Upload: push dataset to OpenAI's file API
- Train: fine-tune on the base model (GPT-4o-mini)
- Poll: wait for completion
- Hot-swap: on success, atomically update the model pointer (both a config file and the environment variable) so the API immediately serves the new model
```python
if job.status == "succeeded" and job.fine_tuned_model:
    ft_id = job.fine_tuned_model
    PTR.write_text(ft_id)                          # pointer file
    set_key(str(ENV_PATH), "OPENAI_MODEL", ft_id)  # .env
    STATE.write_text(json.dumps({
        "last_ft_id": ft_id,
        "dataset_sha": dataset_sha
    }))
```
The suffix includes the training date for audit trails. This makes it trivial to trace any score back to the exact model version that produced it.
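A sketch of the orchestrator's happy path leading up to that hot-swap, using the OpenAI fine-tuning API; the file paths and suffix format are assumptions:

```python
import hashlib, json, time
from datetime import date

# TRAIN_PATH and STATE are pathlib.Path objects; client is the OpenAI client
dataset_sha = hashlib.sha256(TRAIN_PATH.read_bytes()).hexdigest()
if STATE.exists() and json.loads(STATE.read_text()).get("dataset_sha") == dataset_sha:
    raise SystemExit("dataset unchanged; skipping training run")

# Upload the dataset, then launch the job with a dated suffix for audit trails
f = client.files.create(file=TRAIN_PATH.open("rb"), purpose="fine-tune")
job = client.fine_tuning.jobs.create(
    training_file=f.id,
    model="gpt-4o-mini",
    suffix=f"ft-{date.today():%Y%m%d}",  # hypothetical naming scheme
)

# Poll until the job reaches a terminal state
while job.status not in ("succeeded", "failed", "cancelled"):
    time.sleep(60)
    job = client.fine_tuning.jobs.retrieve(job.id)
```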
Conversational Assistant
Beyond scoring, the system includes a conversational AI layer that gives the investment team a natural language interface across all data sources. It works across email, messaging, API, and the dashboard.
Intent Classification
Inbound messages are classified into intents using a structured JSON-mode call:
```python
INTENTS = {
    "lookup": "Wants to know if we've seen a founder or company",
    "reminder": "Wants to set a reminder or schedule a task",
    "memo": "Wants to generate a deal memo",
    "draft": "Wants to draft an outbound email or reply",
    "unknown": "Cannot determine; escalate to human",
}
```
The classifier also extracts entities (company name, founder name, LinkedIn URL, domain) in the same call using JSON mode, so we can immediately route to the correct lookup strategy without a second LLM call.
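A minimal sketch of that single call using OpenAI's JSON mode; the prompt wording and entity field names are mine:

```python
def classify(message: str) -> dict:
    resp = client.chat.completions.create(
        model=MODEL,
        temperature=0,
        response_format={"type": "json_object"},  # forces valid JSON output
        messages=[
            {"role": "system",
             "content": "Classify the message into one of: "
                        + ", ".join(INTENTS)
                        + ". Also extract entities. Return JSON: "
                          '{"intent": ..., "company": ..., "founder": ..., '
                          '"linkedin_url": ..., "domain": ...}'},
            {"role": "user", "content": message},
        ],
    )
    return json.loads(resp.choices[0].message.content)
```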
CRM Lookup
The lookup system searches across multiple data sources in priority order:
- LinkedIn URL (highest confidence): exact match in CRM or profiles table
- Company name: fuzzy search across CRM organizations
- Founder name: search across scored profiles, then cross-reference with CRM
- Domain: fallback to domain-based lookup
Each result includes a confidence level (high, medium, low) so the assistant can flag ambiguous matches instead of guessing. This was important because in a VC context, confidently returning the wrong company is worse than admitting uncertainty.
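Sketched as a priority chain (the `crm` and `db` helper names are assumptions), with each strategy tagging its own confidence:

```python
def lookup(entities: dict) -> dict:
    # Strategies in priority order: exact URL beats fuzzy name beats domain
    if url := entities.get("linkedin_url"):
        if hit := crm.find_by_linkedin_url(url) or db.find_profile_by_url(url):
            return {"match": hit, "confidence": "high"}
    if company := entities.get("company"):
        if hit := crm.fuzzy_search_orgs(company):
            return {"match": hit, "confidence": "medium"}
    if founder := entities.get("founder"):
        if hit := db.search_profiles_by_name(founder):
            return {"match": hit, "confidence": "medium"}
    if domain := entities.get("domain"):
        if hit := crm.find_by_domain(domain):
            return {"match": hit, "confidence": "low"}
    return {"match": None, "confidence": "low"}
```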
Reply Generation
The reply generator uses few-shot examples to maintain a consistent voice: concise, factual, never speculative. It injects live CRM and database context before generating:
```python
def generate_reply(user_message, lookup_result, intent, channel):
    # SYSTEM_PROMPT, FEW_SHOT_EXAMPLES, and the per-channel tone note
    # (channel_note) are defined elsewhere in the module
    context = build_context_string(lookup_result)

    messages = [
        {"role": "system", "content": SYSTEM_PROMPT + channel_note},
        *FEW_SHOT_EXAMPLES,
        {"role": "system",
         "content": f"Relevant data:\n\n{context}"},
        {"role": "user", "content": user_message},
    ]

    # Completion call elided in the original; sketched here for completeness
    resp = client.chat.completions.create(model=MODEL, messages=messages)
    return resp.choices[0].message.content
```
The system adapts tone by channel: shorter for messaging, more structured for email. The assistant never fabricates information; if no match is found, it says so clearly and offers to add the entry.
Data Pipeline
Ingestion
LinkedIn profiles are ingested through a scraping API. To manage rate limits, the system routes requests across multiple scraping agents assigned to different team members, distributing load:
```python
# Round-robin across scraping agents
AGENTS = {
    "default": os.getenv("SCRAPER_AGENT_DEFAULT"),
    "sourcer_1": os.getenv("SCRAPER_AGENT_1"),
    "sourcer_2": os.getenv("SCRAPER_AGENT_2"),
}
```
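One simple way to realize the rotation, sketched with `itertools.cycle` (my sketch, not necessarily the source's mechanism):

```python
import itertools

_agent_cycle = itertools.cycle(AGENTS.values())

def next_agent() -> str:
    # Rotate through agents so no single one absorbs all requests
    return next(_agent_cycle)
```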
The webhook-based flow decouples scraping from processing. The scraping service calls back when complete, triggering normalization and scoring automatically.
Normalization
The normalizer converts messy scraper JSON into clean, canonical records:
- Profile ID: MD5 hash of the canonical LinkedIn URL for deduplication
- Change detection: SHA256 of the raw JSON to detect profile updates
- Experience parsing: extracts company, title, dates, and duration from nested structures
- URL canonicalization: strips tracking params, normalizes paths
This layer is critical because scraper output is unreliable. Field names change, formats vary, and the same profile can appear multiple times. The SHA256 dedup means we never re-process unchanged profiles.
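The ID and change-detection pieces, sketched with the payload handling assumed:

```python
import hashlib, json
from urllib.parse import urlsplit

def canonicalize_url(url: str) -> str:
    # Drop query/tracking params and trailing slashes, lowercase the host
    parts = urlsplit(url)
    return f"https://{parts.netloc.lower()}{parts.path.rstrip('/')}"

def make_ids(raw: dict, url: str) -> tuple[str, str]:
    # Stable profile ID from the canonical URL
    profile_id = hashlib.md5(canonicalize_url(url).encode()).hexdigest()
    # Content hash: unchanged payloads hash identically -> skip re-processing
    content_sha = hashlib.sha256(
        json.dumps(raw, sort_keys=True).encode()
    ).hexdigest()
    return profile_id, content_sha
```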
CRM Integration
Bi-Directional Sync
The system maintains a bi-directional sync with the firm's CRM, keeping thousands of organization records in sync. Reads are always enabled; writes are gated behind a config flag to prevent accidental data pollution during development:
```python
_profile_locks: dict[str, threading.Lock] = {}
_profile_locks_guard = threading.Lock()

def _get_profile_lock(key: str) -> threading.Lock:
    # Guarded creation of one lock per profile key
    with _profile_locks_guard:
        if key not in _profile_locks:
            _profile_locks[key] = threading.Lock()
        return _profile_locks[key]
```
Per-profile threading locks prevent concurrent duplicate organization creation when multiple scoring requests target the same founder simultaneously. The sync handles CRM rate limiting, pagination across large record sets, and field mapping between the CRM's nested API responses and the system's flat schema.
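The write gate itself can be a single check in front of every mutating call; a sketch, with the flag name assumed:

```python
import os, logging

logger = logging.getLogger(__name__)
CRM_WRITES_ENABLED = os.getenv("CRM_WRITES_ENABLED", "false") == "true"

def crm_write(fn, *args, **kwargs):
    # Reads always work; writes are a no-op unless explicitly enabled
    if not CRM_WRITES_ENABLED:
        logger.info("CRM writes disabled; skipping %s", fn.__name__)
        return None
    return fn(*args, **kwargs)
```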
Weekly Digest
An automated weekly outreach digest pulls live data from the CRM, filters by the upcoming week's scheduled outreach dates, groups entries by pipeline status, and emails a categorized HTML list with a CSV attachment.
The key optimization: the digest uses only ~5-10 API calls total (instead of thousands) by leveraging the CRM's v2 API with inline field selection rather than per-record lookups. It also enriches entries with LinkedIn URLs extracted from CRM notes as a fallback when the structured field is empty.
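In outline (all helper names here are assumptions), the digest is one bulk fetch, a group-by, and an email:

```python
from collections import defaultdict
from datetime import date, timedelta

def build_digest():
    week_end = date.today() + timedelta(days=7)
    # One bulk v2 query with inline field selection, not per-record lookups
    entries = crm.query_outreach(due_before=week_end)  # hypothetical helper

    by_status = defaultdict(list)
    for e in entries:
        by_status[e["pipeline_status"]].append(e)

    html = render_digest_html(by_status)    # categorized HTML list
    csv_bytes = render_digest_csv(entries)  # CSV attachment
    send_email(to=TEAM_LIST, html=html, attachment=("digest.csv", csv_bytes))
```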
Database
The system uses PostgreSQL via Supabase, with a schema designed around the scoring lifecycle:
Tables
- profiles: canonical profile data (MD5 ID, URL, name, headline, skills, SHA256 for change detection)
- experiences: work history (1:many with profiles, deduped by a unique constraint on profile + company + title + start date)
- scores: AI scoring history by model version (supports multiple scores per profile for A/B testing models)
- labels: human overrides that feed the training pipeline
Views
- latest_scores: resolves the display score using COALESCE(human_score, ai_score) with window functions to implement human-wins priority
- profiles_for_ui: unified view powering the dashboard, joining profiles with scores and experience counts
The latest_scores view is the key abstraction. It means every consumer (dashboard, API, assistant) always sees the most trusted score without any application-level logic.
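A reconstruction of what such a view can look like in PostgreSQL; the `created_at` column and the exact SQL are my assumptions, guided by the description above:

```sql
-- Latest score per profile, with human overrides winning via COALESCE
CREATE VIEW latest_scores AS
SELECT profile_id, display_score
FROM (
    SELECT s.profile_id,
           COALESCE(l.human_score, s.ai_score) AS display_score,
           ROW_NUMBER() OVER (
               PARTITION BY s.profile_id ORDER BY s.created_at DESC
           ) AS rn
    FROM scores s
    LEFT JOIN labels l ON l.profile_id = s.profile_id
) ranked
WHERE rn = 1;
```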
Deployment
The backend is deployed on a PaaS as a single FastAPI service. This simplicity was intentional. Scoring requests are short-lived (under 15 seconds), and the webhook flow handles longer operations, so there's no need for complex orchestration, queues, or workers.
Environment configuration uses the platform's secrets management for all API keys. The service auto-restarts on deploys with zero downtime via health checks against /healthz.
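The health endpoint is the standard FastAPI one-liner:

```python
@app.get("/healthz")
def healthz():
    # Hit by the platform's health checks during rolling deploys
    return {"status": "ok"}
```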
Results
The system went from concept to production in about 3 months, built and maintained as a solo engineer.
- Hours → seconds: screening time per candidate
- Consistent: every founder scored against the same rubric
- Weeks: time to full team adoption
- Self-improving: analyst corrections feed back into training
Takeaways
“The quality ceiling of an LLM system is set by its instructions, not its model.”
Rubric engineering was the hardest part. It took 4 iterations to balance specificity (hard floors and caps) with flexibility (the model's ability to reason about edge cases).
“Keep training format identical to inference.”
The fine-tuning feedback loop created a real flywheel. Watching the model improve as analysts corrected its mistakes was deeply satisfying. Making the fine-tuned model a true drop-in replacement was the key insight.
“A write gate is a simple mechanism that prevents real incidents.”
CRM integration forced careful thinking about data ownership, deduplication, and read-vs-write boundaries. That single config flag prevented multiple production incidents during development.
“For internal tools, speed-to-ship beats flexibility.”
Using a no-code platform for the analyst dashboard let me focus engineering time on the AI and data layers while still delivering a polished experience.
Next Steps
- Agentic Screening (→ multi-step research): move from single-shot scoring to multi-step research, letting the system autonomously gather funding history, co-founder backgrounds, and company traction before producing a score.
- RAG over Deal History (→ contextual memory): build a retrieval layer over historical deal memos and IC notes to give the assistant deeper context about why the firm passed or invested in similar companies.
- Email Automation (→ zero-touch inbound): extend the assistant to fully handle inbound deal flow emails: classify, lookup, draft responses, and route to the right team member automatically.
- Multi-Firm Generalization (→ platform play): abstract the rubric and scoring system to support multiple firms with different investment theses, turning it into a platform.