Yash Raj Pandey — AI Agents Architect

AI Agents Architect at the University of Florida (IFAS), building local-first AI infrastructure: self-hosted open-weight LLMs, retrieval-augmented generation (RAG), vector search, and AI agents running in production. Joined UF as a Software Engineer (Mar 2025), promoted to Lead (Oct 2025), then Architect (Apr 2026).

Profile

Mail — yashpn62@gmail.com

Selected Work

Selected Works — Production systems and open-source work. Click any card to open the case study.

Stack: Python, Django, React, TypeScript, PostgreSQL, RAG, Qdrant, vLLM, Ollama, Docker, Kubernetes, Terraform

Blue Omics — Full-stack research data platform

A Django, React, and PostgreSQL platform that grew from zero to 5M+ live records and became the primary system for an entire research lab.

Problem: A research lab ran its data on a sprawl of spreadsheets and manual workflows. Submitting, searching, and cross-referencing records was slow, error-prone, and impossible to scale across 30+ researchers and 5 labs.

Approach:

  • Designed and built Blue Omics from scratch: a React frontend on a Django REST backend with PostgreSQL, structured across 32 data models and 58 API endpoints.
  • Built 7 ingestion pipelines for heterogeneous formats (PDF, Excel, CSV, Word, PowerPoint), cutting manual data prep from hours to minutes.
  • Tuned PostgreSQL with 35 explicit indexes and caching to hold low-millisecond latency under concurrent access by 30+ users.
  • Deployed on GCP with Kubernetes and Terraform, Docker multi-stage builds, and CI/CD. Optimized the frontend from 8s to 3s load time.

Stack: Django REST, React, TypeScript, PostgreSQL, GCP + Kubernetes, Terraform, Docker

Impact: Live records: 0 → 5M+; Trait-lookup latency: spreadsheet → ~25 ms; Frontend load time: 8 s → 3 s; Daily active users: baseline → +40%

Trade-offs: Chose a well-indexed PostgreSQL core over premature service-splitting to keep one clear backup and monitoring story. The platform replaced manual workflows entirely and became the system of record, which is what earned the promotion path from Software Engineer to Lead.

TurboQuant on Apple Silicon — CPU-only LLM quantization study

Independent evaluation of TurboQuant (arXiv 2504.19874) ported to run on Apple Silicon. Open source and reproducible.

Problem: TurboQuant is a near-optimal LLM weight and activation quantization method, but the reference path assumed dedicated GPU hardware. The open question: can it run, and hold long-context accuracy, on consumer Apple Silicon with no GPU?

Approach:

  • Worked from a CPU-only fork on an M1 Pro (16GB) and fixed five implementation bugs that were blocking correct inference.
  • Ran a two-round study: an MLX path and a separate llama.cpp Metal path, each benchmarked on long-context needle-in-a-haystack retrieval.
  • Published the full evaluation, the bug fixes, and reproducible results as an open-source repository, with writeups on LinkedIn and X.

Stack: MLX, llama.cpp (Metal), Apple Silicon (M1 Pro), Python

Impact: Needle retrieval @ 16K: 0% → 100%; KV cache memory: baseline → significantly reduced; Bugs fixed in fork: 5 blocking → 0

Trade-offs: A CPU-only target trades raw throughput for accessibility: the point was proving strong quantization and long-context accuracy are reachable on hardware anyone has on their desk, not winning a latency benchmark. Reflects how I approach AI infrastructure: take a research-grade method, get it actually running on accessible hardware, measure it honestly, and share it.

https://github.com/devYRPauli/turboquant-m1pro-evaluation

ApplyScore — AI resume gap-analysis extension

A published Chrome extension that scores how well a resume matches any job posting on the web, with evidence-linked gaps and no fluff.

Problem: Most AI resume tools hallucinate skills and rewrite bullets with confident fluff that recruiters see through instantly. The honest question, how well does this resume actually match this job, went unanswered.

Approach:

  • Built a universal scraper that reads job postings across LinkedIn, Greenhouse, Ashby, Lever, Workday and more, piercing Shadow DOM to work on virtually any board.
  • Runs a strict, evidence-based gap analysis: a confidence-weighted 0-100 fit score, requirement-by-requirement matches linked to the exact resume bullets that prove them, and a prioritized list of what is missing.
  • Privacy-first by design: the resume is cached locally and the user brings their own API key (OpenAI, Anthropic, or Google), so data and model choice stay fully in their control.

Stack: JavaScript, Chrome Extension APIs, Shadow DOM scraping, LLM APIs (BYO-key)

Trade-offs: Deliberately a gap analyzer, not a rewriter. Suggesting only 1-2 targeted, non-hallucinated bullets keeps it honest; the BYO-key model trades one-click convenience for the user keeping full control of their data and cost.

https://chromewebstore.google.com/detail/applyscore/ibecekikdjelajpnjnmapejhahgcplim

About

5M+ — Records in production

3 — Roles in 14 months

The Journey

  • 2019 — BTech begins: Computer Science at Jaypee University of Engineering and Technology.
  • 2022 — First production app: SWE intern at Hackdev: shipped a Flutter legal-tech app to production.
  • 2023 — Exchange to UF: Final undergrad semester at UF as an exchange student, which led to MS admission.
  • 2025 — Blue Omics: Joined UF IFAS, built a 5M+ record platform, promoted to Lead.
  • 2026 — AI Agents Architect: Proposed and now lead a local-first AI systems function.

Off the clock — Football Hub. A live football stats app I built for myself because I love the game. (https://football-hub-six.vercel.app/)

Builder Tools (free, client-side)

Builder Tools — Free, client-side. Your data never leaves the browser.

Token Counter — Cost across frontier models, side-by-side (runs entirely in your browser, no signup)

Prompt Formatter — Restructure raw prompts into blocks (runs entirely in your browser, no signup)

JSON to Schema — Generate Pydantic / Zod / TypeScript (runs entirely in your browser, no signup)

Regex Playground — Test, explain, match in real-time (runs entirely in your browser, no signup)

cURL Converter — cURL to fetch / Python requests / httpx (runs entirely in your browser, no signup)

Contrast Checker — WCAG AA/AAA with live preview (runs entirely in your browser, no signup)

Playbooks

3 — Battle-tested plays

Self-Hosting Open-Weight LLMs — Run capable models locally without sending data to a cloud API

There is a whole class of work where you cannot send the data to a cloud API: confidential records, regulated environments, anything air-gapped. The good news is that open-weight models have gotten good enough that you do not have to. Here is how I think about running them locally.

  • Pick the model to fit the hardware, not the other way around
  • Choose a serving layer on purpose
  • Watch the context window, it is where performance goes to die
  • Keep the application layer hardware-agnostic
  • Measure before you trust

RAG That Holds Up in Production — Retrieval, reranking, and the evals that keep it honest

Most RAG demos look great and most RAG systems quietly disappoint, because the demo never stressed retrieval. The model is rarely the bottleneck. The retrieval and the chunking are.

  • Garbage chunks, garbage answers
  • Hybrid retrieval beats pure vector
  • Rerank, but watch the dilution
  • Cite or it did not happen
  • Build the eval before you optimize

Evaluation-Gated Releases for LLM Systems — Stop shipping regressions you cannot see

LLM systems fail differently from normal software. A change can improve five cases and silently break three, and nothing throws an error. The only defense is a gate: no change ships unless it clears a measured bar.

  • Freeze a benchmark
  • Freeze the judge too
  • Know your noise floor
  • Set tiers before you look at results
  • A regression is a reason to stop

Contact

LinkedIn — /in/yashrajpandeyy

GitHub — devYRPauli

Gainesville, FL — Eastern Time / UTC-5. University of Florida / IFAS.

Open to Conversations — AI infrastructure / Local-first LLMs. Always up for a good conversation on building AI that runs in production.