Posts - Perivitta Rajendran

Mar 27, 2026

Multi-Agent Systems: Orchestration, Communication, and Collaborative AI

A single agent hits context and capability limits fast. Multi-agent systems distribute work across specialized roles with structured communication protocols. Orchestration patterns...

multi-agent· 15 mins read

Mar 25, 2026

inference

LLM Inference Optimization: Quantization, KV Cache, and Serving at Scale

Serving a 70B model cheaply requires quantization, KV cache tuning, continuous batching, and the right serving stack. A systems-level breakdown of vLLM,...

inference· 23 mins read

Mar 22, 2026

embeddings

Embedding Models: Training, Fine-Tuning, and Optimization for Retrieval

Embedding quality determines what your retrieval system can find. How contrastive training works, when to fine-tune versus use off-the-shelf models, and what...

embeddings· 22 mins read

Mar 21, 2026

prompt-engineering

Advanced Prompt Engineering: Chain-of-Thought, ReAct, and Tree-of-Thoughts Explained

Chain-of-thought improves multi-step reasoning. ReAct adds tool use. Tree-of-thoughts explores multiple solution paths. When each technique earns its token cost — and...

prompt-engineering· 21 mins read

Mar 19, 2026

structured-output

Structured Outputs in LLMs: JSON Mode, Function Calling, and Schema Validation

Free-form LLM output breaks parsing pipelines. JSON mode, function calling, grammar-constrained decoding, and Pydantic validation are the layers that make structured output...

structured-output· 20 mins read

Mar 14, 2026

security

Prompt Injection Attacks: How LLMs Get Exploited and How to Defend Your Application

Prompt injection turns user input into an instruction override. Indirect injection, jailbreaks, and data exfiltration vectors are all in scope — and...

security· 21 mins read

Mar 12, 2026

observability

LLM Observability: Tracing, Logging, and Debugging AI Applications

You can't debug what you can't trace. Setting up prompt logging, span tracing, cost tracking, and latency monitoring for production LLM apps...

observability· 15 mins read

Mar 5, 2026

hybrid-search

Hybrid Search: Combining Keyword and Vector Search for Better Retrieval

Pure vector search misses exact matches. BM25 misses semantic intent. Reciprocal rank fusion combines both without the tuning overhead of learned fusion...

hybrid-search· 16 mins read

Feb 27, 2026

llm

PEFT Methods Explained: LoRA, QLoRA, and Adapter-Based Fine-Tuning

Full fine-tuning a 7B model costs thousands in GPU hours. LoRA and QLoRA achieve comparable quality by training a fraction of the...

llm· 16 mins read

Feb 22, 2026

llm

Why Your LLM Application Feels Slow

LLM latency usually isn't the model's fault. Synchronous retrieval, sequential tool calls, missing streaming, and cold-start overhead are the architectural decisions that...

llm· 12 mins read

Feb 20, 2026

LLM

A Beginner’s Guide to Cost Optimization in LLM Applications

LLM API costs compound fast at scale. Token budgeting, model routing, prompt caching, and batching are the four levers that cut costs...

LLM· 12 mins read

Feb 19, 2026

RAG

How Retrieval-Augmented Generation (RAG) Works

RAG grounds LLM responses in retrieved documents rather than model weights. Walk through the full pipeline — indexing, retrieval, augmentation, and generation...

RAG· 14 mins read

Feb 18, 2026

llm-agents

What is an AI Agent?

An LLM becomes an agent when it can reason about which tool to call, execute that call, and update its plan based...

llm-agents· 18 mins read

Feb 17, 2026

multimodal-ai

Navigating the 3 Critical Hurdles of Multimodal AI Agent Deployment

Multimodal agents hit three hard walls in production: image token cost, latency from vision encoding, and grounding errors that compound across reasoning...

multimodal-ai· 21 mins read

Feb 16, 2026

multimodal-ai

Multimodal AI and Grounding Challenges

Vision-language models can describe an image without understanding what's in it. Spatial reasoning failures, hallucinated objects, and weak grounding are architectural constraints...

multimodal-ai· 19 mins read

Feb 13, 2026

llm

Context Window Limits: Why Your LLM Still Hallucinates

A 128K context window doesn't mean the model attends equally across all of it. Token budget pressure, retrieval gaps, and the lost-in-the-middle...

llm· 21 mins read

Feb 12, 2026

embeddings

How to Generate Better Embeddings for Vector Search

Bad embeddings kill retrieval before the LLM even sees the query. Preprocessing strategies, model selection, chunking decisions, and fine-tuning approaches that move...

embeddings· 17 mins read

Feb 12, 2026

chatbot

Building Real-Time Chatbot Memory with Vector Databases + LLMs

Stateless LLMs forget everything between turns. Combining short-term context buffers with long-term vector memory gives chatbots the persistence that real-world use cases...

chatbot· 24 mins read

Feb 10, 2026

RAG

Why Most RAG Systems Fail in Production

Poor chunking, weak embedding models, and retrieval that returns irrelevant context are why RAG fails — not the generator. A production-focused breakdown...

RAG· 12 mins read

Feb 10, 2026

artificial-intelligence

A Beginner’s Guide to Building AI Safety Filters

Input classifiers, output filters, and safe-completion layers don't stop all attacks — but they raise the cost of abuse significantly. How to...

Airflow vs Prefect for ML Pipelines

How OpenAI Builds and Maintains ChatGPT

Vector DB Comparison: Pinecone vs Weaviate vs Qdrant

A Beginner's Guide to CI/CD for ML Models (GitHub Actions + Docker + Kubernetes)

Best Open-Source LLMs in 2026

How Netflix Builds Recommender Systems

How to Monitor ML Drift in Real Deployments

Feature Engineering: Making Data Understandable for Machines

Metrics Beyond Accuracy: Measuring What Actually Matters

Why Overfitting Is the Real Enemy of Machine Learning

Why AI Models Fail in the Real World

Agentic AI: From Passive Models to Autonomous Systems

A Beginner's Guide to Agentic AI

Medical AI: Models, Data, and Evaluation in High-Risk Systems

Why Governments Care About AI: Compute, Data, and Talent

K-Nearest Neighbors (KNN) — Part 1: Classification

A Beginner’s Guide to Elastic Net Regression (L1 + L2 Regularization)

A Beginner's Guide to Lasso Regression (L1 Regularization)

A Beginner's Guide to Ridge Regression (L2 Regularization)

A Beginner's Guide to Residual Sum of Squares (RSS)

A Beginner's Guide to Mean Absolute Error (MAE)

A Beginner's Guide to R-Squared (R²)

A Beginner's Guide to Root Mean Squared Error (RMSE)

Multiple Linear Regression Model

Simple Linear Regression Model