LLMs: Architecture, Training, RAG & Interview Guide
LLMs are large language models trained on massive text and code corpora to predict, generate, transform, and reason over language. They matter because the same model can power a UPI support chatbot, summarise legal documents, write code, and retrieve policy answers when connected to enterprise data. After reading, you can explain how LLMs work and choose the right implementation path.
LLMs sit at the intersection of deep learning, natural language processing, distributed systems, and product design. Research teams, SaaS companies, banks, hospitals, ed-tech platforms, and government-tech vendors use them for search, automation, copilots, analytics, and content workflows where natural language becomes the interface.
You will be able to compare LLM architectures, explain tokens and transformers, distinguish pretraining from fine-tuning, design RAG pipelines, evaluate outputs, handle safety risks, and answer interview questions with concrete examples.
Who This Guide Is For
This guide is specifically designed for:
Core Concepts
LLMs are not one idea; they are a stack of modelling, data, optimisation, retrieval, inference, and safety decisions. The evolution from early generative systems to transformer-based models is covered in The Evolution of Generative AI: From Early Algorithms to Modern LLMs, which is useful context before comparing modern architectures.
Tokens and Context
A token is the unit of text an LLM reads and writes. A token may be a word, part of a word, punctuation mark, whitespace pattern, byte sequence, or special marker depending on the tokenizer. The model never sees raw sentences directly; it sees token IDs mapped to vectors called embeddings.
The context window is the maximum number of tokens the model can attend to in one request. If a support bot receives a long IRCTC refund conversation, earlier messages may need to be summarised or retrieved selectively because the full history may not fit. In an industry setting, a hospital discharge summariser must fit patient notes, lab results, and instructions without truncating critical medication details.
The main tokenizer families are word-level, character-level, subword (BPE, WordPiece, and the unigram language model, which is often applied through SentencePiece), and byte-level tokenizers. Subword and byte-level approaches handle Indian names, product codes, Hinglish, and rare medical terms better than plain word tokenizers because unknown words can be split into known pieces.
Code Example
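As a minimal sketch of the token and context-window ideas above, the snippet below uses a toy byte-level tokenizer (one token per UTF-8 byte) and a naive "keep the most recent tokens" truncation. The function names `byte_tokenize`, `byte_detokenize`, and `fit_to_context` are hypothetical and written for illustration; production tokenizers use learned subword vocabularies rather than raw bytes.

```python
def byte_tokenize(text: str) -> list[int]:
    """Toy byte-level tokenizer: one token ID (0-255) per UTF-8 byte."""
    return list(text.encode("utf-8"))


def byte_detokenize(token_ids: list[int]) -> str:
    """Inverse mapping; invalid byte runs become the replacement character."""
    return bytes(token_ids).decode("utf-8", errors="replace")


def fit_to_context(token_ids: list[int], max_tokens: int) -> list[int]:
    """Keep only the most recent max_tokens tokens (the oldest are dropped)."""
    if max_tokens <= 0:
        return []
    return token_ids[-max_tokens:]


message = "Refund status for Hinglish query: paisa kab milega?"
ids = byte_tokenize(message)
window = fit_to_context(ids, max_tokens=16)

# Non-ASCII text needs several bytes per character, so it costs more tokens.
hindi_word = "धन्यवाद"
hindi_ids = byte_tokenize(hindi_word)

print(len(message), "characters ->", len(ids), "tokens")
print(len(hindi_word), "characters ->", len(hindi_ids), "tokens")
print("Truncated window:", byte_detokenize(window))
```

Because no byte is ever "unknown", this scheme never fails on rare names or product codes, but it spends more tokens per character. Blind truncation can also cut a multi-byte character or drop important earlier content, which is why summarisation or retrieval is usually preferred for long conversations.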
Transformer Architecture
The transformer is the dominant architecture behind modern LLMs. Its central operation is self-attention: each token computes how strongly it should attend to other tokens in the same context. This lets the model connect “it” to the correct noun, link a PAN verification error to a KYC flow, or trace a variable across a code function.
A standard transformer combines token embeddings with positional information at the input, then applies a stack of blocks, each containing multi-head self-attention, a feed-forward layer, residual connections, and layer normalisation. Multi-head attention allows different heads to learn different relationships, such as syntax, factual association, entity tracking, or code structure.
Architecture variants include decoder-only models for generation, encoder-only models for understanding and classification, encoder-decoder models for sequence-to-sequence tasks, multimodal LLMs for text with images or audio, sparse mixture-of-experts models that increase parameter count while activating only a few experts per token, and smaller domain-specific LLMs for controlled enterprise deployments. A familiar example is a travel assistant generating an itinerary. An industry-specific example is a bank using an encoder model for complaint classification and a decoder model for drafting compliant responses.
Code Example
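The self-attention step described above can be sketched in plain Python. This is a deliberately simplified single-head version: it assumes the queries, keys, and values are the token vectors themselves (no learned projection matrices), and the `causal` flag imitates the masking used by decoder-only models. The function names are illustrative, not taken from any library.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x, causal=False):
    """Scaled dot-product self-attention with Q = K = V = x (a simplification).

    x is a list of token vectors. Returns (outputs, attention_weights).
    With causal=True, token i may only attend to tokens 0..i.
    """
    d = len(x[0])
    outputs, weights = [], []
    for i, query in enumerate(x):
        scores = []
        for j, key in enumerate(x):
            if causal and j > i:
                scores.append(float("-inf"))  # masked: future token
            else:
                dot = sum(a * b for a, b in zip(query, key))
                scores.append(dot / math.sqrt(d))
        row = softmax(scores)
        weights.append(row)
        # Each output is a weighted average of all value vectors.
        outputs.append(
            [sum(row[j] * x[j][c] for j in range(len(x))) for c in range(d)]
        )
    return outputs, weights


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens, causal=True)
for i, row in enumerate(weights):
    print(f"token {i} attends with weights", [round(w, 3) for w in row])
```

Each row of `weights` sums to 1, and with the causal mask the first token can only attend to itself. Real models add learned query, key, and value projections and run several such heads in parallel.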
Training Stages
LLM training usually has multiple stages. Pretraining teaches a model broad language, code, and world-pattern knowledge through objectives such as next-token prediction or masked language modelling. Supervised fine-tuning then teaches instruction following using labelled examples, such as question-answer pairs, summaries, or coding tasks.
Preference tuning aligns outputs with human expectations. RLHF, RLAIF, direct preference optimisation, rejection sampling, and constitutional-style approaches are common variants. They do not magically make a model truthful; they bias the model toward responses rated as helpful, harmless, and aligned under a specific training process.
A familiar example is an ed-tech tutor first learning English and mathematics patterns from books, then learning to answer CBSE-style doubts politely. An industry example is a healthcare assistant pretrained generally, fine-tuned on de-identified clinical guidelines, and preference-tuned to refuse unsafe dosage recommendations without doctor oversight.
Code Example
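To make the pretraining and supervised fine-tuning objectives above concrete, here is a small sketch. It shows how next-token prediction turns one sequence into many (prefix, target) examples, how cross-entropy scores a prediction, and how a loss mask can restrict the loss to response tokens during fine-tuning. The token IDs, the uniform "model", and the helper names are all assumptions made for illustration, and masking prompt tokens is one common convention rather than the only one.

```python
import math


def next_token_examples(token_ids):
    """Pretraining view: each prefix is an input, the next token is the target."""
    return [(token_ids[:i], token_ids[i]) for i in range(1, len(token_ids))]


def cross_entropy(probs, target_id):
    """Negative log-probability assigned to the correct next token."""
    return -math.log(probs[target_id])


def masked_mean_loss(losses, loss_mask):
    """Fine-tuning view: average the loss only where the mask is 1."""
    kept = [loss for loss, keep in zip(losses, loss_mask) if keep]
    return sum(kept) / len(kept)


sequence = [5, 2, 7, 1]  # hypothetical token IDs
examples = next_token_examples(sequence)

# Stand-in for an untrained model: uniform probability over the vocabulary.
vocab_size = 8
uniform = [1.0 / vocab_size] * vocab_size
losses = [cross_entropy(uniform, target) for _, target in examples]

# Assume the first two tokens are the prompt and the last two the response.
# The mask is aligned with the targets (positions 1, 2, 3 of the sequence).
loss_mask = [0, 1, 1]

print("examples:", examples)
print("per-token loss:", [round(l, 4) for l in losses])
print("response-only loss:", round(masked_mean_loss(losses, loss_mask), 4))
```

An untrained uniform model scores ln(vocab_size) on every token; training lowers this by moving probability onto the observed next token.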
Adaptation Methods
Adaptation means making a general model useful for a specific task without necessarily training a new foundation model. The main methods are zero-shot prompting, few-shot prompting, chain-of-thought-style reasoning prompts, tool use or function calling, RAG, full fine-tuning, parameter-efficient fine-tuning such as LoRA and adapters, prompt tuning, and domain-specific small model training.
Use prompting when the task is simple and the base model already knows the skill. Use few-shot examples when output style or edge cases matter. Use RAG when facts live in private or changing documents. Use fine-tuning when you need consistent behaviour, specialised format, or domain language. Use LoRA or adapters when full fine-tuning is too expensive.
A familiar example is a Zomato-style order assistant using few-shot prompts to classify refund reasons. An industry-specific example is a SaaS company fine-tuning a support assistant to follow its escalation policy while using RAG for current product documentation.
Code Example
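Two of the adaptation methods above lend themselves to a short sketch: assembling a few-shot prompt, and the parameter arithmetic behind LoRA. The prompt layout, the example refund messages, and the function names are hypothetical; the LoRA count follows from replacing a d_out x d_in weight update with two low-rank factors of shapes d_out x r and r x d_in.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble an instruction, labelled examples, and the new input."""
    lines = [instruction, ""]
    for text, label in examples:
        lines.append(f"Message: {text}")
        lines.append(f"Reason: {label}")
        lines.append("")
    lines.append(f"Message: {query}")
    lines.append("Reason:")  # the model is expected to continue from here
    return "\n".join(lines)


def lora_trainable_params(d_out, d_in, rank):
    """Trainable parameters in the low-rank factors B (d_out x r) and A (r x d_in)."""
    return rank * (d_out + d_in)


examples = [
    ("Food arrived cold after 90 minutes", "late_delivery"),
    ("I was charged twice for one order", "duplicate_payment"),
]
prompt = build_few_shot_prompt(
    "Classify the refund reason for each message.",
    examples,
    "Two items were missing from my order",
)
print(prompt)

full = 4096 * 4096  # full update for one 4096 x 4096 weight matrix
lora = lora_trainable_params(4096, 4096, rank=8)
print(f"full: {full:,} params, LoRA r=8: {lora:,} params ({full // lora}x fewer)")
```

The few-shot examples steer label names and edge cases without any training, while the LoRA arithmetic shows why parameter-efficient fine-tuning is attractive when full fine-tuning is too expensive.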
Retrieval-Augmented Generation
RAG connects an LLM to an external knowledge source. The usual pipeline is ingestion, cleaning, chunking, embedding, indexing, retrieval, reranking, prompt construction, generation, citation, and monitoring. This is the preferred pattern when answers must reflect private policies, product manuals, legal contracts, or changing information.
RAG reduces hallucination risk but does not eliminate it. Poor chunking, irrelevant retrieval, missing metadata, stale indexes, and weak prompts can still produce wrong answers. A familiar example is a customer asking for the latest Aadhaar update process where retrieval should use current official text. An industry-specific example is an insurance claims assistant retrieving policy clauses before explaining coverage.
Vector databases, hybrid search, keyword filters, metadata filters, and rerankers often work together. For production systems, teams also log retrieved documents, scores, generated answers, user feedback, and fallback decisions so errors can be diagnosed.
Code Example
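The chunking, embedding, retrieval, and prompt-construction steps above can be sketched end to end with the standard library. This is only an illustration: the bag-of-words `embed` function stands in for a real embedding model, the in-memory list stands in for a vector database, and the policy sentences are invented sample data.

```python
import math
import re
from collections import Counter


def chunk_words(text, size=50, overlap=10):
    """Split text into overlapping word windows."""
    assert size > overlap >= 0
    words = text.split()
    chunks = []
    for start in range(0, len(words), size - overlap):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break
    return chunks


def embed(text):
    """Stand-in embedding: lowercase bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, chunks, k=2):
    """Return the top-k (score, chunk) pairs by cosine similarity."""
    q = embed(query)
    scored = [(cosine(q, embed(c)), c) for c in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


def build_prompt(query, retrieved):
    context = "\n".join(f"[{i + 1}] {text}" for i, (_, text) in enumerate(retrieved))
    return (
        "Answer using only the numbered context and cite the source number.\n"
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )


policy_chunks = [  # invented sample data
    "Refunds are processed within 7 working days of cancellation.",
    "KYC verification requires a PAN card and address proof.",
    "Premium plans renew automatically every month.",
]
question = "How many days until cancellation money comes back?"
top = retrieve(question, policy_chunks, k=2)
print(build_prompt(question, top))
```

Note that word-count matching misses paraphrases ("refund" does not match "refunds" here), which is one reason production systems use learned embeddings, hybrid search, and rerankers. Logging the scores returned by `retrieve` is what makes later diagnosis of wrong answers possible.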
Inference and Decoding
Inference is the runtime process where the model generates tokens. Decoding controls how the next token is selected from model probabilities. Greedy decoding picks the highest-probability token, beam search keeps multiple candidate sequences, temperature rescales the logits to sharpen or flatten the distribution, top-k samples from the k most likely tokens, and nucleus (top-p) sampling samples from the smallest set of tokens whose cumulative probability reaches p.
Low temperature is useful for structured extraction, compliance answers, and SQL generation. Higher temperature is useful for brainstorming, marketing variants, or creative writing. A familiar example is generating alternate WhatsApp notification text for a food delivery app. An industry example is a banking chatbot using temperature near zero for regulatory FAQs to avoid creative but unsafe wording.
Production inference also involves max token limits, streaming, stop sequences, batching, KV cache, quantisation, speculative decoding, routing, retries, and cost monitoring. These choices often decide whether an LLM app feels fast and reliable.
Code Example
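The decoding strategies above differ only in how they reshape the next-token distribution before a token is picked. The sketch below implements temperature scaling, top-k filtering, and top-p (nucleus) filtering over a toy four-token distribution. The logits and probabilities are made-up numbers and the function names are illustrative.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; temperature < 1 sharpens, > 1 flattens."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def _renormalise(probs, keep):
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]


def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalise."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return _renormalise(probs, set(order[:k]))


def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return _renormalise(probs, keep)


def greedy(probs):
    return max(range(len(probs)), key=lambda i: probs[i])


def sample(probs, rng):
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.0, -1.0]  # made-up scores for four candidate tokens
probs = [0.5, 0.3, 0.15, 0.05]  # made-up distribution for the filters
rng = random.Random(0)          # seeded so runs are repeatable

print("T=1.0:", [round(p, 3) for p in softmax(logits, 1.0)])
print("T=0.2:", [round(p, 3) for p in softmax(logits, 0.2)])
print("top-k=2:", [round(p, 3) for p in top_k_filter(probs, 2)])
print("top-p=0.9:", [round(p, 3) for p in top_p_filter(probs, 0.9)])
print("greedy pick:", greedy(probs), "sampled pick:", sample(top_k_filter(probs, 2), rng))
```

Lowering the temperature concentrates probability on the top token, which is why near-zero temperature behaves almost like greedy decoding.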
Evaluation and Safety
LLM evaluation checks whether the system is correct, useful, safe, and robust for the target task. There is no single universal metric. Common methods include exact match, F1, BLEU, ROUGE, semantic similarity, faithfulness checks, toxicity checks, human preference review, pairwise ranking, red-teaming, adversarial tests, and production feedback loops.
Safety covers hallucination, bias, privacy leakage, prompt injection, jailbreaks, unsafe advice, copyright risk, over-refusal, and insecure tool calls. A familiar example is a chatbot refusing to expose someone else's Aadhaar-linked data. An industry example is a clinical assistant refusing diagnosis from incomplete symptoms while suggesting consultation with a qualified professional.
Evaluation must be tied to risk. A movie recommendation bot can tolerate some subjectivity. A loan underwriting assistant, healthcare triage tool, or legal drafting system needs stricter tests, audit logs, and human review because mistakes can harm users.
Code Example
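Two of the metrics listed above, exact match and token-level F1, are simple enough to sketch directly, along with a tiny loop that scores a system against reference answers. The test cases and the `stub_answer` function are hypothetical placeholders for a real evaluation set and a real LLM call, and the normalisation shown is one reasonable choice among many.

```python
import re
from collections import Counter


def normalise(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def exact_match(prediction, reference):
    return normalise(prediction) == normalise(reference)


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall against the reference."""
    pred, ref = normalise(prediction), normalise(reference)
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def evaluate(cases, answer_fn):
    """Average exact match and F1 over (question, reference) pairs."""
    em = f1 = 0.0
    for question, reference in cases:
        prediction = answer_fn(question)
        em += exact_match(prediction, reference)
        f1 += token_f1(prediction, reference)
    return {"exact_match": em / len(cases), "f1": f1 / len(cases)}


# Hypothetical evaluation set and a canned stand-in for the system under test.
cases = [
    ("How long do refunds take?", "within 7 working days"),
    ("Which ID is needed for KYC?", "PAN card"),
]
canned = {
    "How long do refunds take?": "7 working days",
    "Which ID is needed for KYC?": "PAN card",
}


def stub_answer(question):
    return canned[question]


print(evaluate(cases, stub_answer))
```

Overlap metrics like these reward wording rather than correctness, so high-risk systems still need faithfulness checks and human review on top of them.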
Deployment Patterns
LLM deployment is the engineering layer that turns model capability into a reliable product. Common patterns include hosted API usage, self-hosted open-weight models, model gateways, RAG services, agentic workflows, async job queues, human-in-the-loop review, observability dashboards, caching layers, and fallback models.
Hosted APIs are faster to start and reduce infrastructure burden. Self-hosting gives more control over latency, data locality, customisation, and cost at scale. Open-weight ecosystems are growing quickly; a useful next comparison is Best DeepSeek Course to Learn Open-Source LLMs for Development, Research, and Automation for learners exploring open-source LLM workflows.
A familiar example is a college helpdesk bot deployed behind an HTTP API with cached answers for admission deadlines. An industry-specific example is a fintech lender using a gateway that routes low-risk FAQs to a small model and escalates high-risk credit explanations to a reviewed RAG workflow.
For implementation details, the Hugging Face Transformers documentation is a reputable source for model loading, tokenizers, pipelines, and deployment tooling.
Code Example
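The gateway, caching, and fallback patterns above can be sketched as a small class. Everything here is an assumption for illustration: `small_model` and `reviewed_workflow` are stubs standing in for real model calls, and the keyword-based risk check is a placeholder for whatever risk classifier a real deployment would use.

```python
import re


class Gateway:
    """Routes questions by risk, caches answers, and falls back on errors."""

    FALLBACK = "Sorry, I could not answer that. Connecting you to a human agent."

    def __init__(self, small_model, reviewed_workflow, high_risk_terms):
        self.small_model = small_model
        self.reviewed_workflow = reviewed_workflow
        self.high_risk_terms = set(high_risk_terms)
        self.cache = {}
        self.stats = {"cache_hits": 0, "small": 0, "reviewed": 0, "fallback": 0}

    def route(self, question):
        """Placeholder risk check: any high-risk keyword triggers review."""
        words = set(re.findall(r"[a-z]+", question.lower()))
        return "reviewed" if words & self.high_risk_terms else "small"

    def answer(self, question):
        key = " ".join(question.lower().split())
        if key in self.cache:
            self.stats["cache_hits"] += 1
            return self.cache[key]
        target = self.route(question)
        handler = self.small_model if target == "small" else self.reviewed_workflow
        try:
            reply = handler(question)
        except Exception:
            self.stats["fallback"] += 1
            return self.FALLBACK  # failures are not cached
        self.stats[target] += 1
        self.cache[key] = reply
        return reply


# Stubs standing in for real model or workflow calls.
def small_model(question):
    return f"[small-model] {question}"


def reviewed_workflow(question):
    return f"[reviewed-rag] {question}"


gateway = Gateway(small_model, reviewed_workflow, {"loan", "credit", "rejected"})
print(gateway.answer("When is the admission deadline?"))
print(gateway.answer("Why was my loan rejected?"))
print(gateway.answer("When is the admission deadline?"))  # served from cache
print(gateway.stats)
```

Keeping routing, caching, and fallback in one place also gives a single point for logging and cost monitoring, which is the main reason teams introduce a gateway layer.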
Learning Path
Use this path to move from conceptual fluency to production-level LLM application design. Each phase has a clear output: explain, implement, evaluate, and defend your design choices in interviews.
Frequently Asked Questions
What is an LLM?
An LLM is a large neural language model, usually transformer-based, trained to predict and generate text tokens. In practice, it can answer questions, summarise documents, write code, classify text, extract fields, translate language, and act as a natural-language interface for software systems.
How is an LLM different from traditional NLP?
Traditional NLP systems often used task-specific models and handcrafted pipelines for classification, tagging, parsing, or translation. LLMs are more general-purpose because pretraining gives them broad language capability, and the same model can be adapted through prompting, retrieval, or fine-tuning.
What is the difference between pretraining and fine-tuning?
Pretraining teaches broad statistical language patterns from massive datasets, commonly through next-token prediction. Fine-tuning adapts the pretrained model to a narrower task, format, tone, or domain using labelled examples or parameter-efficient methods.
When should I use RAG instead of fine-tuning?
Use RAG when answers depend on private, changing, auditable, or source-backed knowledge. Use fine-tuning when the model needs consistent behaviour, domain style, specialised output format, or repeated task patterns that cannot be solved reliably with prompting alone.
Why do LLMs hallucinate?
LLMs generate likely token sequences, not guaranteed facts. Hallucinations happen when the model lacks the right context, retrieves weak evidence, overgeneralises from training patterns, or is asked for information it cannot verify.
What is temperature in an LLM?
Temperature controls randomness during token sampling. Lower values produce more deterministic answers, while higher values make outputs more varied and creative but can increase inconsistency.
Are open-source LLMs always cheaper?
Not always. Open-weight models can reduce vendor dependency and improve control, but self-hosting introduces GPU, engineering, monitoring, security, scaling, and maintenance costs that must be compared with hosted APIs.
What is the biggest misconception about LLMs?
The biggest misconception is that a larger model automatically solves accuracy, safety, and business reliability. System design, retrieval quality, evaluation, prompt constraints, monitoring, and human review often matter more than raw model size.
Interview Preparation
LLM interview questions test whether you understand both model fundamentals and production trade-offs. Strong answers define the concept, explain why it matters, give an example, and mention failure modes or evaluation.
Conceptual Questions
- What problem does self-attention solve? Self-attention lets each token weigh other tokens in the context, so the model can capture long-range dependencies and relationships. This is why a transformer can connect a pronoun to an earlier entity or link an error message to a later resolution step.
- Why are decoder-only models common for chatbots? Decoder-only models are trained to predict the next token from previous context, which directly matches text generation. Chatbots, code assistants, and summarisation systems need controlled continuation, so decoder-only transformers are a natural fit.
- What is the role of embeddings in LLMs? Embeddings convert token IDs into dense vectors that capture learned relationships between tokens. In RAG systems, embeddings also represent documents and queries so semantically similar content can be retrieved even when exact words differ.
- How does instruction tuning improve a model? Instruction tuning trains the model on examples of user instructions and desired responses. It makes the model more likely to follow tasks such as summarise, classify, extract, explain, or refuse unsafe requests.
Applied / Problem-Solving Questions
- Design an LLM system for a banking FAQ chatbot. Use RAG over approved policy documents, low-temperature decoding, PII filters, source citations, escalation for account-specific issues, and audit logs. Evaluate with policy-grounded test cases, hallucination checks, latency targets, and human review for high-risk answers.
- A chatbot gives outdated refund policy answers. What would you fix? First inspect retrieval logs to verify whether the latest policy document was indexed and retrieved. Then update ingestion, metadata filtering, chunking, reranking, and prompts before considering fine-tuning.
- How would you reduce LLM latency in production? Use streaming, caching, shorter prompts, better chunk selection, batching, smaller routed models, quantisation, and KV cache where applicable. Also separate synchronous chat from long-running document workflows using queues.
- How would you evaluate a healthcare summarisation assistant? Use clinician-reviewed test cases, factual consistency checks, missing-critical-information checks, privacy checks, and refusal tests for diagnosis or dosage advice. Generic fluency scores are not enough because medical risk depends on correctness and omissions.
- How would you protect an LLM app from prompt injection? Treat retrieved and user-provided text as untrusted data, isolate system instructions, validate tool calls, restrict permissions, and log suspicious requests. Add adversarial tests such as documents saying “ignore previous instructions” to ensure the system does not obey injected content.
Key Takeaways
LLMs work by converting text into tokens, representing those tokens as embeddings, and using transformer attention to predict useful continuations. The practical stack includes architecture choice, training stages, adaptation method, retrieval design, decoding settings, evaluation, safety, and deployment engineering.
For GATE-style and interview preparation, focus on tokens versus context windows, self-attention, decoder-only versus encoder-only transformers, pretraining versus fine-tuning, RAG versus fine-tuning, hallucination versus bias, and temperature versus top-p decoding. These are the highest-yield comparison points because they reveal both theory and system-design understanding.
The natural next step is to build a small RAG assistant with evaluation cases, then extend it with model routing, guardrails, and monitoring. If you want hands-on agent deployment practice, the Coursera specialization listed above fits well after these foundations.