Product Engineering · AI/ML Systems · Compliance · Business & Technology
This portfolio documents original systems I designed, built, and shipped at Ontop — a LATAM payroll and employer-of-record platform. The work spans production AI systems, data engineering pipelines, full-stack product development, and academic research. Each project began with a real business problem and ended with software running in production.
| Project | Domain | Scale / Outcome |
|---|---|---|
| Gandalf — KYB Compliance Agent | AI · Compliance · FinTech | Production LLM system; 133 jobs evaluated; ICAIL 2026 paper |
| AI Triage & Sentiment Pipeline | Data Engineering · ML Ops · CX | 1,205 clients scored weekly; full serverless AWS pipeline |
| HireDesk | Full-Stack · AI · People Ops | End-to-end hiring platform shipped and used in production |
| Synthetic Data Generator | Data Engineering · ML Tooling | Reusable OSS-grade dataset generator for AI experiments |
A production-grade multi-agent LLM system for automating Know Your Business (KYB) due diligence at a global fintech and payroll platform, with a peer-reviewed academic paper submitted to ICAIL 2026.
Global payroll and employer-of-record platforms must verify the legal entities they onboard before processing cross-border payroll. This Know Your Business process — checking corporate registry data, beneficial ownership chains, sanctions exposure, and document authenticity — was performed entirely by manual compliance analysts. The process took days to weeks per entity, couldn't scale with growth, introduced inconsistency across analysts, and produced decisions without traceable evidence trails.
Regulatory frameworks (FATF, FinCEN AML/CTF guidelines) required that any automation maintain full decision traceability and remain defensible under audit. This made naive LLM automation dangerous: a hallucinated sanctions check or fabricated ownership detail would create legal liability. The challenge was to automate aggressively while maintaining regulatory defensibility.
Solution & Architecture
Gandalf is a reliability-oriented multi-agent system that decomposes KYB into discrete stages, each handled by the most appropriate mechanism: deterministic logic for clear-cut cases, specialist LLM agents for domains requiring judgment, and external RegTech data sources for objective ground truth. There is no single "do everything" prompt: every component has a narrow, well-defined responsibility.
topLevel.py — Deterministic country risk check + industry risk check before any LLM invocation. Handles ~15–20% of cases with zero token cost. Output: rejected or manual_review with 1.0 confidence.
Company Research Agent — Web research via Firecrawl; corporate registry lookups; business legitimacy signals.
Representative Research Agent — PEP screening, global sanctions checks, fraud scoring via RegTech API providers.
Shareholder Analysis Agent — Beneficial ownership chain traversal; UBO identification; circular ownership detection.
Document Reviewer Agent — OCR + R1–R5 rule evaluation on uploaded certificates, IDs, and ownership documents.
topLevel.py orchestration — Aggregates specialist outputs, resolves conflicts, applies confidence weighting. Each agent output is schema-validated before aggregation; validation failure is treated as a hard error, not silently ignored.
Schema-constrained final decision — accept | reject | manual_review with confidence, evidence list, rule violations, and structured justification. Analyst override pathway (R3) logs the override rationale for audit trail.
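To make the Shareholder Analysis stage concrete, here is a minimal sketch of circular ownership detection and effective ownership calculation over a graph of shareholdings. The edge format, the function names, and the sample entities are assumptions made for illustration; the production agent's data model is not shown in this document.

```python
from collections import defaultdict
from typing import Optional

# Hypothetical input format: (owner, owned_entity, ownership_fraction)
Edge = tuple[str, str, float]


def find_cycle(edges: list[Edge]) -> Optional[list[str]]:
    """Return one ownership cycle as a list of entity names, or None if acyclic."""
    graph: dict[str, list[str]] = defaultdict(list)
    for owner, owned, _ in edges:
        graph[owner].append(owned)

    WHITE, GREY, BLACK = 0, 1, 2          # unvisited, on current path, finished
    color: dict[str, int] = defaultdict(int)
    path: list[str] = []

    def visit(node: str) -> Optional[list[str]]:
        color[node] = GREY
        path.append(node)
        for nxt in graph.get(node, []):
            if color[nxt] == GREY:        # back edge: nxt is already on the path
                return path[path.index(nxt):] + [nxt]
            if color[nxt] == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for start in list(graph):
        if color[start] == WHITE:
            found = visit(start)
            if found:
                return found
    return None


def effective_ownership(edges: list[Edge], holder: str, target: str,
                        _seen: frozenset = frozenset()) -> float:
    """Sum, over all simple paths from holder to target, the product of fractions."""
    if holder == target:
        return 1.0
    total = 0.0
    for owner, owned, fraction in edges:
        if owner == holder and owned not in _seen:
            total += fraction * effective_ownership(
                edges, owned, target, _seen | {holder}
            )
    return total
```

For example, if a person holds 50% of a holding company that in turn holds 60% of the onboarded entity, `effective_ownership` returns 0.3. Adding an edge from the entity back to the holding company makes `find_cycle` report the loop, which a reviewer could treat as a flag for manual review.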
These six patterns form the core contribution of the ICAIL 2026 paper. Each addresses a specific failure mode common to production LLM compliance systems:
# Deterministic early-exit gates: no LLM cost for clear cases
def evaluate_entity(entity_data: dict) -> KYBDecision:
    # Gate 1: Jurisdiction risk, no LLM needed
    if entity_data['country'] in PROHIBITED_COUNTRIES:
        return KYBDecision(
            status='rejected',
            reason='Prohibited jurisdiction under AML policy',
            confidence=1.0,
            llm_used=False,
            audit_code='GATE_COUNTRY'
        )
    # Gate 2: Industry risk
    if entity_data['industry'] in HIGH_RISK_INDUSTRIES:
        return KYBDecision(
            status='manual_review',
            reason='High-risk industry classification',
            confidence=1.0,
            llm_used=False,
            audit_code='GATE_INDUSTRY'
        )
    # No gate fired: hand off to the specialist agents
    return run_agent_ensemble(entity_data)
# Schema-constrained agent output: hallucinations fail validation instead of silently passing
from typing import Literal

from pydantic import BaseModel, Field


class AgentOutput(BaseModel):
    decision: Literal['accept', 'reject', 'manual_review']
    confidence: float = Field(ge=0.0, le=1.0)
    evidence: list[str] = Field(min_length=1)  # must be non-empty (Pydantic v2; v1 uses min_items)
    rule_violations: list[str]                 # R1-R5 violations found
    regtech_flags: list[str]                   # sanctions / fraud flags from external APIs
    requires_analyst_review: bool

# Any LLM response that fails this schema → manual_review (never silently passes)
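The fallback rule in the last comment can be sketched with the standard library alone. This is an illustration, not the production validator: the `Decision` dataclass, the `parse_agent_output` name, and the choice of 0.0 confidence for fallback decisions are all assumptions.

```python
import json
from dataclasses import dataclass, field

ALLOWED_DECISIONS = {"accept", "reject", "manual_review"}


@dataclass
class Decision:
    decision: str
    confidence: float
    evidence: list = field(default_factory=list)
    audit_note: str = ""


def parse_agent_output(raw: str) -> Decision:
    """Parse a raw LLM response; anything malformed is routed to manual review."""
    try:
        data = json.loads(raw)
        decision = data["decision"]
        confidence = float(data["confidence"])
        evidence = data["evidence"]
        if decision not in ALLOWED_DECISIONS:
            raise ValueError(f"unknown decision: {decision!r}")
        if not 0.0 <= confidence <= 1.0:
            raise ValueError("confidence out of range")
        if not isinstance(evidence, list) or not evidence:
            raise ValueError("evidence must be a non-empty list")
    except (ValueError, KeyError, TypeError) as exc:
        # Assumed fallback: zero confidence plus a note for the audit trail
        return Decision("manual_review", 0.0, [], f"schema validation failed: {exc}")
    return Decision(decision, confidence, evidence)
```

The point of the design is that a malformed or incomplete response can never become an `accept`: the only path to an automated decision runs through every check.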
| Metric | Value | Notes |
|---|---|---|
| Total cases processed | 133 KYB jobs | 61 corporate clients + 72 business contractors |
| Auto-accepted | 58 (43.6%) | Fully automated — no analyst touch required |
| Routed to manual review | 75 (56.4%) | Agent provided evidence summary to analyst |
| Avg. risk score | 2.76 / 5.0 | Confidence-weighted scoring across all agents |
| Top rejection signal | Document rules not satisfied | R1–R5 rule evaluation failures (Document Reviewer Agent) |
| Model | GPT-4 (OpenAI) | All agents used same model; specialist prompts differ |
"Gandalf: Architecting Multi-Agent Systems for Know Your Business Compliance in Global Financial Services"
Joshua Dazas & Felipe García — Submitted to ICAIL 2026 (International Conference on Artificial Intelligence and Law).
The paper introduces six reliability-oriented design patterns for production LLM compliance systems, grounded entirely in real implementation and production evaluation data. The patterns are derived from working code, not hypothetical architectures.
The KYB problem was identified by observing the compliance analyst workflow directly. Key signals: average entity review took 3–5 business days; analysts repeatedly checked the same sources (corporate registry, sanctions list, Google for reputation) in the same sequence; rejections were almost always explainable by a small set of rule violations; the reasoning was formulaic but the volume was not.
An end-to-end, fully serverless ML pipeline that turns raw customer support tickets into ranked churn-risk recommendations, delivered weekly to account managers via Slack.
Ontop serves 1,200+ active corporate clients across Latin America, each generating support tickets in Zendesk and an internal messaging platform (DIIO). With this volume, identifying which clients are at genuine risk of churn or operational escalation before a situation becomes a crisis required either a large account management team or an automated system. No existing tooling could surface the right clients at the right time.
The specific failure mode the business experienced: account managers would only become aware of a deteriorating client relationship when the client threatened to churn or escalated to senior leadership — at which point it was often too late for meaningful intervention. A weekly automated digest of the highest-risk, newest cases would give account managers actionable intelligence when there was still time to act.
Pipeline Architecture
AWS Glue Python jobs — Extract Zendesk tickets and DIIO conversations via API. Filter to external tickets only (is_external = 'true'). Land raw data into Redshift: external.zendesk__tickets_sentiment_analysis and external.diio__sentiment_analysis.
XLM-RoBERTa via SageMaker — Three Lambdas: warm (keep endpoint alive), fetch (query Redshift for unscored tickets), score (invoke SageMaker endpoint per batch, parallel via Step Functions MaxConcurrency=10). Outputs to process_data.zendesk_sentiment and process_data.diio_sentiment.
GPT-4o Mini + Bedrock Embeddings — Two Lambdas (batch-query + extract) via Step Functions. Extracts structured issue types from ticket text. Outputs to process_data.extracted_issues. Parallel execution across ticket batches (MaxConcurrency=10).
Salesforce + Redshift + Aura API — Aggregates 1,205 active client records with transaction health metrics, conversation volume (Aura, 4-week window), and L1 signals (sentiment + issues, 30-day window). Outputs one row per client to process_data.client_context_rules. Uses execute_values(page_size=200) for bulk upsert within Lambda timeout.
AWS Bedrock — Claude 3.5 Haiku (cross-region inference profile) — Reads top 70 clients by risk score from Redshift. For each client, invokes Claude with a structured prompt containing sentiment trends, issue categories, transaction health, and churn signals. Outputs schema-constrained JSON: urgency, reason_summary, recommended_action, confidence. Writes to process_data.triage_recommendations.
n8n webhook → Slack — Queries top 10 new clients from triage_recommendations (7-day dedup via slack_digest_log table). POSTs structured JSON payload to n8n webhook. n8n formats and delivers to account management Slack channel. Decoupled from core pipeline — Slack formatting changes don't require Lambda redeployment.
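The final stage's selection rule (top 10 clients, skipping any already sent in the last 7 days) can be sketched as follows. The function name and the in-memory shapes of `recommendations` and `digest_log` are assumptions; in the pipeline these would come from triage_recommendations and slack_digest_log in Redshift.

```python
from datetime import datetime, timedelta


def select_digest_clients(recommendations, digest_log, now, limit=10, window_days=7):
    """Pick the highest-risk clients not already sent within the dedup window.

    recommendations: list of dicts with 'client_id' and 'risk_score'
    digest_log: dict mapping client_id -> datetime of the last digest including it
    """
    cutoff = now - timedelta(days=window_days)
    fresh = [
        rec for rec in recommendations
        if digest_log.get(rec["client_id"]) is None
        or digest_log[rec["client_id"]] < cutoff
    ]
    fresh.sort(key=lambda rec: rec["risk_score"], reverse=True)
    return fresh[:limit]


now = datetime(2025, 1, 13, 6, 0)
recs = [
    {"client_id": "a", "risk_score": 55},
    {"client_id": "b", "risk_score": 40},
    {"client_id": "c", "risk_score": 35},
]
log = {"a": now - timedelta(days=2), "c": now - timedelta(days=10)}
picked = [rec["client_id"] for rec in select_digest_clients(recs, log, now)]
# picked == ["b", "c"]: client "a" was sent 2 days ago and is skipped
```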
| CloudFormation Stack | Key Functions | Schedule (EventBridge) |
|---|---|---|
| sentiment-classifier-v2 | sentiment-warm-v2, sentiment-fetch-v2, sentiment-score-v2 | Mondays 01:00 UTC |
| issue-extractor-v1 | issue-batch-query-v1, issue-extract-v1 | Mondays 01:00 UTC |
| context-signals-etl | context-signals-etl | Mondays 03:00 UTC |
| triage-agent | triage-agent-v1 | Mondays 04:30 UTC |
| slack-digest | slack-digest-lambda | Mondays 06:00 UTC |
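For reference, the schedule above maps onto EventBridge cron expressions, which use six fields: minutes, hours, day-of-month, month, day-of-week, year. The dictionary below is a reconstruction from the table, not a copy of the deployed templates, and `start_minute` is a hypothetical helper for checking that stages start in pipeline order.

```python
# EventBridge format: cron(minutes hours day-of-month month day-of-week year)
SCHEDULES = {
    "sentiment-classifier-v2": "cron(0 1 ? * MON *)",
    "issue-extractor-v1": "cron(0 1 ? * MON *)",
    "context-signals-etl": "cron(0 3 ? * MON *)",
    "triage-agent": "cron(30 4 ? * MON *)",
    "slack-digest": "cron(0 6 ? * MON *)",
}


def start_minute(expr: str) -> int:
    """Minutes after midnight UTC for a fixed-time cron expression."""
    minutes, hours = expr[len("cron("):-1].split()[:2]
    return int(hours) * 60 + int(minutes)


# Each stage should start no earlier than the one before it
starts = [start_minute(expr) for expr in SCHEDULES.values()]
assert starts == sorted(starts)
```

The gaps between start times (two hours, then ninety minutes each) act as the only ordering guarantee between stacks, so a stage that overruns its window would feed stale data to the next one.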
| Tier | Score Range | Client Count | % of Base | Recommended Response |
|---|---|---|---|---|
| Critical | 50+ | 6 | 0.5% | Immediate escalation to senior AM |
| High | 30–49 | 337 | 28% | Proactive outreach within 48 hours |
| Medium | 20–29 | 595 | 49% | Include in weekly digest, monitor |
| Low | < 20 | 267 | 22% | Routine check, no action needed |
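The tier boundaries translate directly into a small classification function. This sketch assumes half-open ranges (a score of exactly 30 is High, exactly 50 is Critical); the function name is illustrative. The counts are copied from the table to show how the percentage column is derived.

```python
def risk_tier(score: float) -> str:
    """Map a client risk score to its tier, using the table's boundaries."""
    if score >= 50:
        return "Critical"
    if score >= 30:
        return "High"
    if score >= 20:
        return "Medium"
    return "Low"


# Client counts from the tier table
TIER_COUNTS = {"Critical": 6, "High": 337, "Medium": 595, "Low": 267}
total = sum(TIER_COUNTS.values())  # 1205 active clients
shares = {tier: round(100 * n / total, 1) for tier, n in TIER_COUNTS.items()}
# shares == {"Critical": 0.5, "High": 28.0, "Medium": 49.4, "Low": 22.2}
```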
# Cross-region inference profile required for all newer Claude models
# Direct model IDs are blocked by Bedrock; inference profiles are mandatory
BEDROCK_MODEL_ID = "us.anthropic.claude-3-5-haiku-20241022-v1:0"
RISK_THRESHOLD = 20
MAX_CLIENTS = 70  # Top 70 clients by risk score, weekly


def build_triage_prompt(client: dict) -> str:
    return f"""You are a customer success triage agent for a payroll platform.
Analyze this client and respond ONLY with valid JSON matching the schema below.
Client: {client['company_name']}
Risk Score: {client['risk_score']} / 100
Sentiment Trend (30d): {client['sentiment_trend']}
Top Issues: {client['top_issues']}
Transaction Volume: {client['transaction_count']} transactions (4-week window)
Active Conversations: {client['conversation_count']} (Aura, 4 weeks)
Churn Signals: {client['churn_signals']}
Required JSON schema:
{{
  "urgency": "critical | high | medium | low",
  "reason_summary": "1-2 sentences explaining root cause",
  "recommended_action": "specific next step for account manager",
  "confidence": 0.0 - 1.0
}}"""


# Fetch top at-risk clients: ordered by risk, deduped for new entries only
FETCH_QUERY = """
    SELECT c.client_id, c.company_name, c.top_issues,
           t.risk_score, t.sentiment_trend, t.transaction_count,
           t.conversation_count, t.churn_signals
    FROM process_data.client_context_rules c
    JOIN process_data.triage_recommendations t USING (client_id)
    WHERE t.risk_score >= :threshold
    ORDER BY t.risk_score DESC
    LIMIT :max_clients
"""
The original problem statement was "we need to know which clients are unhappy." The discovery process refined this considerably over several conversations with the account management team.
DefinitionSubstitutions: no hardcoded ARNs. MaxConcurrency=10 for ticket processing. Design decision: Step Functions over Airflow or cron because it integrates natively with Lambda and provides built-in retry/error handling.
is_external filter (lowercase string). Glue ETL was filtering on is_external = 'True' (Python boolean string) instead of 'true' (database string). Result: zero Zendesk tickets were processed. Fix: explicit lowercase string comparison. Lesson: verify filter values against actual database column values before declaring a pipeline working.
executemany() upsert for 1,205 rows was timing out at the 900s Lambda limit. Fix: replaced with psycopg2.extras.execute_values(page_size=200). Lesson: batch insert patterns matter at scale; test with production-volume data.
… us.* prefix) for all newer Claude models. IAM policy required two separate resource ARNs: one for the inference profile (with account ID) and one for the foundation model (without). Lesson: AWS Bedrock model access patterns change; always verify against current documentation.
… triage_recommendations. Fixed by querying the actual table schema and updating the Lambda JOIN logic to derive missing fields (risk_category via CASE statement; client_name via JOIN).
… sam build && sam deploy) for all five stacks. All schedules deployed in DISABLED state, then enabled post-integration test. This allows safe incremental rollout without accidental cron execution during deployment.
A purpose-built hiring platform for Ontop's People Ops team, with AI candidate ranking, automated email workflows, video screening, and bulk candidate management, shipped and used in production.
Ontop's People Ops team managed hiring across multiple open roles using a combination of spreadsheets, email threads, and manual Calendly coordination. The specific pain points:
Structured hiring briefs with AI-generated application form schemas. Required fields include job title, description, salary band, and a mandatory Calendly booking link (validated at form submission — no requisition can be published without one, ensuring interview emails always have a valid scheduling link). Requisition lifecycle: pending → form_generated → published → closed.
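The requisition lifecycle and the mandatory booking link can be read as a small state machine. The sketch below is written in Python for consistency with the other examples (the production app is TypeScript). It assumes strictly linear transitions and a simple host check on the link; neither detail is confirmed by the source, and `advance` is a hypothetical name.

```python
from urllib.parse import urlparse

# Assumed linear lifecycle: pending → form_generated → published → closed
TRANSITIONS = {
    "pending": {"form_generated"},
    "form_generated": {"published"},
    "published": {"closed"},
    "closed": set(),
}


def is_calendly_link(link) -> bool:
    """Accept only https URLs hosted on calendly.com (assumed validation rule)."""
    parsed = urlparse(link or "")
    host = parsed.hostname or ""
    return parsed.scheme == "https" and (
        host == "calendly.com" or host.endswith(".calendly.com")
    )


def advance(requisition: dict, new_status: str) -> dict:
    """Return a copy of the requisition in its new status, or raise ValueError."""
    current = requisition["status"]
    if new_status not in TRANSITIONS[current]:
        raise ValueError(f"cannot move from {current} to {new_status}")
    if new_status == "published" and not is_calendly_link(requisition.get("booking_link")):
        raise ValueError("a Calendly booking link is required before publishing")
    return {**requisition, "status": new_status}
```

Enforcing the link at the transition means every downstream interview email can rely on it being present.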
On each application submission, GPT-4o evaluates the applicant's responses against the job description and outputs a structured ranking: Very High Fit / High Fit / Average / Low Fit, with a justification paragraph. Auto-rejection logic: only Low Fit candidates are auto-rejected (not Average — a deliberate product decision to give borderline candidates a chance at video screening).
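The auto-rejection rule is simple enough to state as code. This is a Python sketch of the decision only (the production logic is TypeScript); the function name is hypothetical, and the assumption is that non-rejected candidates keep the application_received status.

```python
FIT_LEVELS = ("Very High Fit", "High Fit", "Average", "Low Fit")


def status_after_ranking(fit: str) -> str:
    """Only Low Fit is auto-rejected; Average candidates stay in the pipeline."""
    if fit not in FIT_LEVELS:
        raise ValueError(f"unexpected ranking: {fit!r}")
    return "rejected" if fit == "Low Fit" else "application_received"
```

Rejecting unknown labels outright, rather than defaulting them to either outcome, keeps a malformed model response from silently rejecting or advancing a candidate.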
| Status Change | Email Triggered | Requirement / Condition |
|---|---|---|
| → hm_interview | Interview scheduling email with Calendly link | Requires booking_link on the requisition |
| → chro_interview | CHRO interview invitation | Requires CHRO_BOOKING_LINK environment variable |
| → rejected | Branded rejection email (Ontop copy, warm tone) | Fires on every status change to rejected, including bulk |
| video_requested | Video submission request with token URL | Via Vercel cron, 24 hours after application received |
A token-based video upload URL (/video/[token]) is generated on application creation and dispatched 24 hours later via cron. Upload-only (no browser recording) — MP4, WebM, MOV, AVI; max 50MB; 2–3 minutes; English only. Videos stored in Supabase Storage with signed URLs. The cron job checks idempotency before dispatch: skips the video request if the application status has changed from application_received.
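The idempotency check described above can be sketched as a single predicate the cron job evaluates per application. The `video_request_sent_at` field is a hypothetical marker for "already dispatched"; the source only states that the job skips applications whose status has changed. Python is used here for consistency with the other examples.

```python
from datetime import datetime, timedelta


def should_send_video_request(application: dict, now: datetime) -> bool:
    """True only if the request is due and has not been sent or superseded."""
    if application["status"] != "application_received":
        return False  # candidate already moved on (or was rejected)
    if application.get("video_request_sent_at") is not None:
        return False  # hypothetical marker: never send twice
    return now - application["created_at"] >= timedelta(hours=24)
```

Because the predicate depends only on stored state, the cron job can run any number of times without producing duplicate emails.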
Multi-select checkboxes on the applications dashboard allow bulk status changes across any combination of candidates. Bulk rejection fires individual rejection emails for each selected candidate via Promise.allSettled() — non-blocking, with per-email error tracking. Requisition close flow: a "Close Requisition" button triggers a two-step confirmation modal that auto-rejects all non-hired candidates and sets requisition status to closed.
All status changes are synced to a connected Google Sheet in real time, giving hiring stakeholders a read-only pipeline view without requiring platform access.
Tech Stack
| Layer | Technology | Purpose |
|---|---|---|
| Frontend | Next.js 16 App Router + React 19 + TypeScript | Server components, file-based routing, type safety |
| UI Library | shadcn/ui + Tailwind CSS + Radix | Accessible, composable component system |
| Database | Supabase (PostgreSQL) | Applications, requisitions, form_schemas, scheduled_jobs tables |
| Auth | Supabase Auth | Session-based auth for People Ops admins |
| AI | OpenAI GPT-4o | Candidate ranking + structured JSON output |
| Email | SendGrid | Interview scheduling, rejection, video request emails |
| File Storage | Supabase Storage + signed URLs | Video upload with expiring token-based access |
| Deployment | Vercel (serverless functions + cron) | Edge deployment, automatic scaling, cron job execution |
| Integrations | Google Sheets API | Real-time status sync for stakeholder reporting |
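import { NextResponse } from 'next/server'
// `supabase` (server client) and `sendRejectionEmail` come from project modules not shown here.

export async function PATCH(request: Request) {
  const { ids, status } = await request.json()

  // Update all selected applications in a single statement
  const { data: apps, error } = await supabase
    .from('applications')
    .update({ status })
    .in('id', ids)
    .select('id, applicant_name, applicant_email, form_schema_id')

  if (error || !apps) {
    return NextResponse.json({ error: error?.message ?? 'Update failed' }, { status: 500 })
  }

  // Emails are only sent for rejections; other status changes skip this step
  const toEmail = status === 'rejected' ? apps : []

  // Fire emails in parallel; allSettled records failures without throwing
  const results = await Promise.allSettled(
    toEmail.map(async (app) => {
      const { data: schema } = await supabase
        .from('form_schemas')
        .select('job_title')
        .eq('id', app.form_schema_id)
        .single()
      if (!schema) throw new Error(`No form schema for application ${app.id}`)
      return sendRejectionEmail(
        { applicant_name: app.applicant_name, applicant_email: app.applicant_email },
        schema.job_title
      )
    })
  )

  // Here 'rejected' is the promise state (a failed send), not the application status
  const failed = results.filter((r) => r.status === 'rejected').length
  return NextResponse.json({
    updated: apps.length,
    emails_sent: results.length - failed,
    emails_failed: failed,
  })
}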
import sgMail from '@sendgrid/mail'

// Assumes sgMail.setApiKey(...) has already been called during module setup.
export async function sendRejectionEmail(
  applicant: { applicant_name: string; applicant_email: string },
  jobTitle: string
) {
  // Note: applicant_name is interpolated into HTML unescaped; escape it if
  // names are not sanitized upstream.
  const msg = {
    to: applicant.applicant_email,
    from: process.env.SENDGRID_FROM_EMAIL || 'noreply@hiredesk.com',
    subject: `Update on your application for ${jobTitle}`,
    html: `
      <p>Hi ${applicant.applicant_name},</p>
      <p>Thank you for taking the time to go through our process and for the
      energy you put into your application. We truly appreciate the effort
      and the thoughtfulness you showed along the way.</p>
      <p>It was great getting to know you, and we're grateful you considered
      being part of Ontop.</p>
      <p>Wishing you the best in what's ahead.</p>
      <p><strong>The Ontop Team</strong></p>
      <hr/>
      <p style="color:#888;font-size:12px;">
        Questions? Contact us at hr@getontop.com
      </p>
    `,
  }
  await sgMail.send(msg)
}
Discovery started from a simple request: "we need a way to handle applications." Several conversations with the People Ops team and hiring managers surfaced a more complete picture.
… applications table, embedded in the URL. No auth required to upload; token expiry not implemented (deliberate simplicity for MVP).
… hm_interview then back to application_received would retrigger the video request. Idempotency prevents this.
… job_requisitions → form_schemas → applications → scheduled_jobs. Deleting a requisition automatically cleans up the entire hierarchy. No orphaned records, no manual cleanup.
… CHRO_BOOKING_LINK) rather than a database field. Rationale: it changes infrequently, applies globally (not per-requisition), and doesn't need user-editable UI.
SENDGRID_SANDBOX_MODE=true for staging (disables actual delivery without changing code). From address configured via SENDGRID_FROM_EMAIL env var. Sender verification required in SendGrid dashboard before production use.
… /api/cron/process-scheduled-jobs. Cron logs available in Vercel dashboard. Idempotency checks ensure safe re-execution if cron fires unexpectedly.
Supplementary work that enabled or validated the larger projects above.
Built to generate realistic synthetic customer support ticket datasets for training, evaluation, and baseline benchmarking of the triage pipeline. Fully configurable — schema, volume, category distributions, and sentiment patterns are all parameterized.
Output files: clients.csv (50 clients with Faker-generated profiles), tickets.csv (200 tickets with realistic category/severity distributions), conversations.csv (3–8 messages per ticket, LLM-augmented content), ticket_history.csv (~300 historical tickets for recurrence pattern simulation).
Why it matters: Without realistic synthetic data, testing the triage pipeline required waiting for real production data cycles. The generator enabled rapid iteration on the scoring model and issue extractor without touching production systems. It also served as a reusable artifact for the AI triage business case presentation.
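A parameterized generator of this kind can be sketched with the standard library alone. The category and severity weights below are invented for illustration, and the function and field names are hypothetical; the actual generator also uses Faker for client profiles and LLM-augmented conversation content, which this sketch omits.

```python
import random

# Hypothetical distributions; the real generator's weights are configurable
CATEGORY_WEIGHTS = {"payments": 0.40, "onboarding": 0.25, "compliance": 0.20, "platform": 0.15}
SEVERITY_WEIGHTS = {"low": 0.50, "medium": 0.35, "high": 0.15}


def weighted_pick(rng: random.Random, weights: dict) -> str:
    """Draw one key from a {value: weight} mapping."""
    return rng.choices(list(weights), weights=list(weights.values()))[0]


def generate_tickets(n_clients: int = 50, n_tickets: int = 200, seed: int = 42) -> list:
    """Generate ticket rows with seeded, reproducible randomness."""
    rng = random.Random(seed)
    client_ids = [f"client_{i:03d}" for i in range(1, n_clients + 1)]
    return [
        {
            "ticket_id": f"T{i:04d}",
            "client_id": rng.choice(client_ids),
            "category": weighted_pick(rng, CATEGORY_WEIGHTS),
            "severity": weighted_pick(rng, SEVERITY_WEIGHTS),
            "n_messages": rng.randint(3, 8),  # 3-8 messages per ticket
        }
        for i in range(1, n_tickets + 1)
    ]
```

Seeding a local `random.Random` instance keeps runs reproducible, which matters when the same dataset is used as a baseline across scoring-model iterations.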
A structured business case document quantifying the ROI of deploying the triage automation vs. expanding the account management headcount. Covered: cost modeling (Lambda + Bedrock + SageMaker vs. additional FTE salary), time-savings analysis (manual weekly review estimated at 8–12 hours per AM), risk assessment, and phased implementation roadmap. Designed for executive review and used as the basis for the build decision.
Business insight demonstrated: The ability to build the business case and the system represents a pattern that runs through all four projects — identifying business value, designing the technical solution, and shipping it. None of these were handed to me as fully-specified engineering tickets. Each began as an ambiguous problem and was shaped into a buildable, measurable system.
These projects represent the full product engineering lifecycle — discovery, design, implementation, and deployment — across AI/ML systems, regulatory technology, full-stack development, and data engineering. All shipped to production, not prototypes.
| Capability | How Demonstrated | Projects |
|---|---|---|
| Production AI/LLM Systems | Multi-agent KYB pipeline, client triage agent, GPT-4o candidate ranking — all running in production, not demos | Gandalf, Triage, HireDesk |
| Compliance & RegTech | FATF/AML KYB automation with schema-constrained outputs, audit trails, and RegTech API integration | Gandalf |
| AWS Serverless Architecture | Lambda, Step Functions, EventBridge, Bedrock, SageMaker, Secrets Manager, Glue ETL — all in production CloudFormation stacks | Triage Pipeline |
| Full-Stack Product Engineering | Next.js 16 App Router + Supabase + OpenAI + SendGrid, from DB schema to deployed UI, iteratively shipped | HireDesk |
| Data Engineering & ML Ops | End-to-end ML pipeline (ETL → model inference → LLM → notification), multilingual NLP, production data quality fixes | Triage Pipeline |
| Business × Technology Translation | Converted ambiguous business problems (KYB manual review, client churn, hiring chaos) into buildable, measurable systems | All projects |
| Product Thinking | Identified failure modes (alert fatigue, auto-rejection scope, mandatory booking link) through stakeholder conversations before writing code | HireDesk, Triage |
| Academic Research | Extracted publishable design patterns from a production system; ICAIL 2026 paper grounded entirely in real code and production data | Gandalf |

| Category | Technologies |
|---|---|
| AI & LLM | OpenAI GPT-4o · GPT-4o Mini · Claude 3.5 Haiku (AWS Bedrock) · XLM-RoBERTa (SageMaker) · Bedrock Embeddings |
| Cloud & Infra | AWS Lambda · AWS SAM · CloudFormation · AWS Glue · Amazon Redshift · Step Functions · EventBridge · Secrets Manager · Bedrock · Vercel |
| Backend | Python · TypeScript · Next.js 16 API Routes · psycopg2 · Supabase · PostgreSQL · SendGrid · n8n webhooks |
| Frontend | Next.js 16 App Router · React 19 · TypeScript · shadcn/ui · Tailwind CSS · Radix UI |
| Data & ML | pandas · AWS Glue ETL · Amazon Redshift DWH · Faker · Synthetic dataset generation |
| Compliance & Integrations | RegTech APIs (sanctions lists, corporate registries, fraud scoring) · Salesforce API · Google Sheets API · Aura API · Slack · n8n |
| Research & Writing | ICAIL 2026 academic paper · Business case / ROI documentation · Product requirements documentation |