AI and LLMs in Programming

Large Language Models (LLMs) have rapidly changed how software is written. Tools like GitHub Copilot, ChatGPT, Claude, and Cursor can generate code, explain errors, write tests, and review pull requests. Understanding what these tools are, how they work, and how to use them responsibly is increasingly important for any programmer.

Tip: If you're using LLMs on a VERSO project, also read the VERSO AI Usage Standards page, which covers the project's specific policies.

What Is a Large Language Model?

A Large Language Model is a type of AI trained on enormous amounts of text (code, books, websites, academic papers) to predict what text comes next. When you ask it a question, it generates a response token by token, each time choosing a plausible next token given everything before it.

"Large" refers to the number of parameters (weights) in the model: modern LLMs have billions to trillions of parameters. More parameters generally mean a more capable model, but also more compute and memory to train and run it.

LLMs are a type of neural network, specifically built on a design called the Transformer architecture, introduced in 2017.
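
The token-by-token generation described above can be illustrated with a deliberately tiny stand-in. The sketch below is not a Transformer and not how any real LLM is implemented: it is a bigram counter that only looks at the previous token. The corpus and the function names `train_bigram` and `generate` are made up for illustration.

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count, for each token, which tokens follow it in the training text."""
    follows = defaultdict(Counter)
    for current, nxt in zip(tokens, tokens[1:]):
        follows[current][nxt] += 1
    return follows

def generate(follows, start, max_tokens=5):
    """Greedy generation: repeatedly append the most frequent next token."""
    output = [start]
    for _ in range(max_tokens):
        candidates = follows.get(output[-1])
        if not candidates:
            break  # nothing ever followed this token in the training text
        output.append(candidates.most_common(1)[0][0])
    return output

# Toy "training data" (an assumption for this example).
corpus = "the cat sat on the mat and the cat sat down".split()
model = train_bigram(corpus)
print(generate(model, "the", max_tokens=2))  # ['the', 'cat', 'sat']
```

A real LLM replaces the lookup table with a neural network that conditions on the whole preceding context, and it usually samples from the predicted distribution rather than always taking the top choice.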

How LLMs Work (High Level)

Understanding the basics helps you use LLMs more effectively and recognize their limitations.

Tokenization

LLMs don't read words — they read tokens, which are chunks of text (roughly 3–4 characters on average). The sentence "Hello, world!" might become tokens like ["Hello", ",", " world", "!"]. This matters because:

  • LLMs have a context window — a limit on how many tokens they can process at once
  • Code with long files may need to be split or summarized before sending
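
To make the idea of tokens concrete, here is a minimal sketch of a toy tokenizer and a rough token estimator. The regular expression is an assumption chosen to reproduce the "Hello, world!" split above; real tokenizers learn their vocabulary from data and will split text differently. The helper names `toy_tokenize` and `estimate_tokens` are hypothetical.

```python
import math
import re

# Toy pattern: an optional leading space plus a run of word characters,
# or a single punctuation mark, or a run of whitespace.
TOKEN_PATTERN = re.compile(r" ?\w+|[^\w\s]|\s+")

def toy_tokenize(text):
    """Split text into token-like chunks (illustration only)."""
    return TOKEN_PATTERN.findall(text)

def estimate_tokens(text, chars_per_token=4):
    """Rough size estimate using the ~4 characters per token rule of thumb."""
    return math.ceil(len(text) / chars_per_token)

print(toy_tokenize("Hello, world!"))    # ['Hello', ',', ' world', '!']
print(estimate_tokens("Hello, world!"))  # 4
```

An estimate like this is only useful for budgeting; for exact counts, use the tokenizer that ships with the model you are calling.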

Training

LLMs are trained in stages:

  1. Pre-training — the model reads massive amounts of text and learns to predict the next token. This is where general knowledge comes from.
  2. Fine-tuning — the model is further trained on specific data (e.g., code, instructions, conversations) to specialize its behavior.
  3. RLHF (Reinforcement Learning from Human Feedback) — human raters evaluate model outputs, and the model is trained to produce outputs humans prefer. This is how models learn to be helpful, harmless, and honest.

Context Window

The context window is the total amount of text (input + output) a model can handle at once. Early models had windows of a few thousand tokens. Modern models handle 100,000 to over 1,000,000 tokens, which is enough to fit a small or medium-sized codebase in a single request.
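
When a file is too large for the window, one simple approach is to split it into consecutive chunks that each fit a token budget. The following is a minimal sketch under stated assumptions: the 4-characters-per-token estimate is a rule of thumb, and `chunk_lines` is a hypothetical helper, not part of any tool mentioned here.

```python
import math

def estimate_tokens(text, chars_per_token=4):
    """Rough size estimate; real token counts depend on the model's tokenizer."""
    return math.ceil(len(text) / chars_per_token)

def chunk_lines(lines, max_tokens):
    """Group consecutive lines into chunks whose estimated size fits max_tokens.

    A single line larger than the budget still becomes its own chunk.
    """
    chunks, current, used = [], [], 0
    for line in lines:
        cost = estimate_tokens(line + "\n")
        if current and used + cost > max_tokens:
            chunks.append("\n".join(current))
            current, used = [], 0
        current.append(line)
        used += cost
    if current:
        chunks.append("\n".join(current))
    return chunks

# Ten short lines standing in for a long source file.
source = "\n".join(f"line_{i} = {i}" for i in range(10))
chunks = chunk_lines(source.splitlines(), max_tokens=12)
print(len(chunks))  # 3
```

Splitting on line boundaries keeps each chunk readable, but it ignores code structure; splitting on function or class boundaries usually gives the model better context.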

What LLMs Are Not

  • They are not databases — they don't look things up; they generate text based on patterns learned during training
  • They are not always correct — they can confidently produce wrong answers ("hallucinations")
  • They are not reasoning engines — they generate statistically likely text, which often resembles reasoning but can fail on novel problems
  • They have a training cutoff — they don't know about events after their training data was collected

Timeline of Major Models

The field has moved extremely fast. Here are the most significant milestones.

2017 — The Transformer

"Attention Is All You Need" — a paper from Google researchers introduced the Transformer architecture, which became the foundation for virtually every major LLM since. Key insight: an "attention" mechanism allows the model to relate any word to any other word in a sequence, regardless of distance.

2018 — BERT

BERT (Bidirectional Encoder Representations from Transformers) by Google. It was pre-trained to use the context on both sides of each word at once, which dramatically improved tasks like question answering and text classification. It became widely used in search engines.

2019 — GPT-2

GPT-2 (OpenAI, 1.5B parameters). Generated coherent paragraphs and sparked early discussion about AI-generated text. OpenAI initially withheld the full model out of misuse concerns — an early moment in the AI safety conversation.

2020 — GPT-3

GPT-3 (OpenAI, 175B parameters). A qualitative leap in capability — it could write essays, answer questions, translate, and generate code with just a few examples in the prompt ("few-shot learning"), without any fine-tuning. Access was via API only.

2021 — Codex and GitHub Copilot

Codex (OpenAI) — GPT-3 fine-tuned specifically on code from GitHub. Powered the first version of GitHub Copilot, which launched as a technical preview for developers. Copilot could autocomplete entire functions from a comment or function signature.

2022 — ChatGPT and the Public AI Moment

InstructGPT — OpenAI showed that RLHF made models dramatically more useful and safer for conversation.

ChatGPT launched on November 30, 2022, built on GPT-3.5. It reached 1 million users in 5 days and an estimated 100 million within two months, making it the fastest-growing consumer application on record at the time. It made LLMs accessible to the general public.

GitHub Copilot became generally available (no longer in preview).

2023 — The Model Explosion

  • February: Llama (Meta): a family of open-weights models released for research. Being open-weights meant anyone could run them locally or fine-tune them.
  • February: Bing Chat (Microsoft, powered by GPT-4): conversational search.
  • March: GPT-4 (OpenAI): multimodal (text + images), significantly more capable and reliable than GPT-3.5. Powers ChatGPT Plus.
  • March: Claude 1 (Anthropic): designed with a focus on safety and long context windows.
  • May: Google Bard (now Gemini): Google's conversational AI, powered by PaLM 2.
  • July: Llama 2 (Meta): improved open-weights model, released for commercial use.
  • August: Code Llama (Meta): Llama 2 fine-tuned specifically for code generation.
  • September: Mistral 7B: a highly efficient open-weights model from French startup Mistral AI, competitive with much larger models.
  • GitHub Copilot Chat added conversational AI assistance inside VS Code and other IDEs.

2024 — Capabilities Leap and New Paradigms

  • February: Gemini 1.5 Pro (Google): introduced a 1 million token context window, enough to process hours of video or an entire large codebase.
  • March: Claude 3 family (Anthropic): three tiers: Haiku (fast/cheap), Sonnet (balanced), Opus (most capable). Set new benchmarks on many tasks.
  • April: Llama 3 (Meta): major quality jump in open-weights models; 8B and 70B parameter versions.
  • May: GPT-4o (OpenAI, "omni"): multimodal across text, image, and audio in a single model; faster and cheaper than GPT-4.
  • June: Claude 3.5 Sonnet (Anthropic): surpassed Claude 3 Opus on most benchmarks while being faster. Introduced Artifacts (Claude can generate interactive web pages, documents, code).
  • September: o1 (OpenAI): a new paradigm in which the model "thinks" before answering by generating an internal chain-of-thought. Dramatically better at math, science, and complex reasoning.
  • October: Claude 3.5 Haiku (Anthropic): fastest Claude model, competitive with larger models.
  • December: Gemini 2.0 Flash (Google): fast, capable, multimodal model with tool use.
  • Cursor emerged as a leading AI-powered code editor, built on VS Code with deep model integration.
  • Amazon Q Developer (formerly CodeWhisperer): AI coding assistant integrated into AWS and IDEs.

2025 — Reasoning Models and Open Source Convergence

  • January: DeepSeek R1 (DeepSeek, China): open-weights reasoning model competitive with o1, released for free. Caused significant discussion about the cost of training frontier models.
  • February: Claude 3.7 Sonnet (Anthropic): introduced extended thinking, in which Claude can show its reasoning process and spend more time on hard problems.
  • February: GPT-4.5 (OpenAI): improved emotional intelligence and instruction following.
  • March: Gemini 2.5 Pro (Google): strong reasoning capabilities, competitive on coding benchmarks.
  • April: o3 and o4-mini (OpenAI): more powerful reasoning models with tool use (web search, code execution, image analysis).
  • April: Llama 4 (Meta): multimodal open-weights models with a mixture-of-experts architecture.
  • May: Claude 4 series (Anthropic): the current generation at the time of writing, with Sonnet 4 and Opus 4, focused on agentic tasks, where models take sequences of actions to complete complex goals.

AI Tools for Programmers

Code Completion and Chat

| Tool | Made By | Description |
| --- | --- | --- |
| GitHub Copilot | GitHub/OpenAI | In-editor code completion and chat; integrates with VS Code, JetBrains, and more |
| ChatGPT | OpenAI | General-purpose chat; excellent at explaining code, debugging, and generating examples |
| Claude | Anthropic | Strong at reasoning through code, large context (can read entire files), nuanced explanations |
| Gemini | Google | Integrated into Google products; strong on Python and data science tasks |
| Codeium | Codeium | Free AI code completion for many IDEs |
| Amazon Q Developer | Amazon | AWS-integrated coding assistant; good for cloud infrastructure code |
| Tabnine | Tabnine | AI completion focused on privacy; can run locally |

AI-Powered IDEs

| Tool | Description |
| --- | --- |
| Cursor | Fork of VS Code with deep AI integration; the model can edit code, run commands, and understand your full project |
| Windsurf (by Codeium) | AI-native IDE similar to Cursor |
| VS Code + Copilot | Official GitHub Copilot extension adds chat and inline suggestions to VS Code |

CLI and Agentic Tools

| Tool | Description |
| --- | --- |
| Claude Code | Anthropic's CLI agent; understands and modifies entire codebases, runs commands |
| OpenAI Codex CLI | OpenAI's terminal agent |
| Aider | Open source CLI tool that works with any LLM to edit code via chat |

Using LLMs Effectively for Coding

What LLMs Are Good At

  • Boilerplate code — writing repetitive setup code, config files, and standard patterns
  • Explaining code — "what does this function do?" is often answered well
  • Debugging — paste an error message and stack trace; LLMs often identify the cause immediately
  • Writing tests — generating unit tests for a given function
  • Translating code — converting between languages (Python to JavaScript, etc.)
  • Refactoring — suggesting cleaner ways to write something
  • Learning a new API — "show me how to use the GitHub REST API to list pull requests"

What LLMs Are Not Good At

  • Novel algorithms — if the answer isn't in training data, performance drops significantly
  • Very long, complex tasks — quality degrades over long interactions; break tasks into pieces
  • Up-to-date information — they have a training cutoff and may not know about recent library versions
  • Knowing your codebase — unless you provide context, the model doesn't know how your project is structured
  • Always being right — they can produce plausible-looking but incorrect code with confidence

Prompting for Code

Better prompts produce better results. Be specific:

# Vague (produces generic result)
"write a function to parse data"

# Specific (produces useful result)
"Write a Python function that reads a CSV file using pandas,
filters rows where the 'status' column equals 'active',
and returns a list of dictionaries. The CSV has columns:
id, name, email, status."

Useful prompting strategies:

  • Provide context — paste the relevant function, class, or error message
  • Specify constraints — language version, libraries to use or avoid, style conventions
  • Ask for explanation — "explain each step" helps you understand and verify the output
  • Iterate — if the first answer isn't right, describe what's wrong and ask for a revision
  • Ask for alternatives — "what are two other ways to do this?"
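
If you send prompts from a script, the strategies above can be applied consistently with a small template. The following is a sketch only: `build_prompt` and its section labels ("Task", "Context", "Constraints") are hypothetical conventions, not a format any particular model requires.

```python
def build_prompt(task, context="", constraints=(), ask_for_explanation=False):
    """Assemble a structured prompt from a task, optional context, and constraints."""
    parts = [f"Task: {task}"]
    if context:
        parts.append("Context:\n" + context)
    if constraints:
        parts.append("Constraints:\n" + "\n".join(f"- {c}" for c in constraints))
    if ask_for_explanation:
        parts.append("Explain each step of your solution.")
    return "\n\n".join(parts)

prompt = build_prompt(
    task="Write a Python function that reads a CSV file and returns active rows.",
    context="The CSV has columns: id, name, email, status.",
    constraints=[
        "Use pandas",
        "Filter rows where 'status' equals 'active'",
        "Return a list of dictionaries",
    ],
    ask_for_explanation=True,
)
print(prompt)
```

Keeping the task, context, and constraints in separate sections makes it easy to iterate: when a result is wrong, you change one section and resend instead of rewriting the whole prompt.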

The Golden Rule: Understand What You Commit

Never commit code you don't understand. If an LLM generates a solution, read it carefully before using it. Ask the model to explain any part you don't understand. You are responsible for every line of code with your name on it — in code review, debugging, and maintenance.

See the VERSO AI Usage Standards for more guidance on this.

Limitations and Risks

Hallucinations

LLMs sometimes generate confident, convincing, and completely wrong answers. This is called a hallucination. Common forms in programming:

  • Fabricated function names or API methods that don't exist
  • Outdated syntax or deprecated APIs
  • Logically flawed algorithms that look correct
  • Security vulnerabilities in generated code

Always test generated code. Don't trust output you haven't verified.
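
As a minimal sketch of what verification can look like, suppose a model produced the `median` helper below (a hypothetical example, not output from any specific tool). A few assertions covering a typical case and the edge cases that generated code often gets wrong are cheap to write and catch many flawed versions.

```python
def median(values):
    """Hypothetical LLM-generated helper: median of a non-empty list of numbers."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def check_median():
    """Run quick checks; raises AssertionError if the helper misbehaves."""
    assert median([3, 1, 2]) == 2        # odd length, unsorted input
    assert median([4, 1, 3, 2]) == 2.5   # even length averages the middle pair
    assert median([7]) == 7              # single element
    data = [5, 1, 4]
    median(data)
    assert data == [5, 1, 4]             # the caller's list must not be mutated
    return True

print(check_median())  # True
```

In a real project these checks would live in your test suite (for example as pytest or unittest cases) so they keep running after the generated code is merged.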

Security Risks

LLM-generated code has been found to contain security vulnerabilities at higher rates than human-written code in some studies. Common issues include:

  • SQL injection vulnerabilities
  • Hardcoded secrets
  • Insufficient input validation
  • Use of deprecated or insecure cryptographic functions

Run generated code through a linter and security scanner. See Security Basics.
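
The first item, SQL injection, is worth recognizing on sight. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; the `users` table and the function names are made up for illustration. It contrasts a query built by string formatting with a parameterized query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])

def find_user_unsafe(name):
    # Vulnerable: user input is pasted directly into the SQL string.
    query = f"SELECT id, name FROM users WHERE name = '{name}'"
    return conn.execute(query).fetchall()

def find_user_safe(name):
    # Safer: the value is passed separately from the SQL text.
    query = "SELECT id, name FROM users WHERE name = ?"
    return conn.execute(query, (name,)).fetchall()

malicious = "nobody' OR '1'='1"
print(find_user_unsafe(malicious))  # [(1, 'alice'), (2, 'bob')]: every row leaks
print(find_user_safe(malicious))    # []: the input is treated as a plain string
```

If generated code builds SQL with f-strings, `%` formatting, or concatenation of user input, ask for a parameterized version or rewrite it yourself before committing.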

Models trained on code from GitHub may reproduce code from that training data. The legal status of this is actively debated. If you're working on a project with strict license requirements, be aware of this possibility.

Privacy

Do not paste sensitive data, passwords, proprietary code, or personal information into public LLM tools like ChatGPT or Claude.ai. Inputs may be used for training or logged. For sensitive work, use tools that offer data privacy guarantees or run models locally.
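
One partial safeguard is to scrub obvious secrets from text before pasting it into a hosted tool. The sketch below is illustrative only: the two patterns are assumptions that catch email addresses and simple `key = value` style credentials, and they will miss many real secret formats. It does not replace reviewing what you share.

```python
import re

# Illustrative patterns only; real secret formats vary widely.
PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), "<EMAIL>"),
    (
        re.compile(
            r"\b(password|api[_-]?key|token|secret)\b(\s*[:=]\s*)\S+",
            re.IGNORECASE,
        ),
        r"\1\2<REDACTED>",
    ),
]

def redact(text):
    """Replace email addresses and simple credential assignments with placeholders."""
    for pattern, replacement in PATTERNS:
        text = pattern.sub(replacement, text)
    return text

snippet = 'API_KEY = "sk-12345"\ncontact: jane.doe@example.com'
print(redact(snippet))
# API_KEY = <REDACTED>
# contact: <EMAIL>
```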

Running Models Locally

If privacy or cost is a concern, you can run open-weights models on your own hardware:

  • Ollama — the easiest way to run models like Llama, Mistral, and Gemma locally
  • LM Studio — graphical app for downloading and running models locally

# Download Llama 3 on first use, then run it locally (assumes Ollama is already installed)
ollama run llama3

Local models are smaller and less capable than frontier models, but they're free, private, and work offline.
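
Once a model is running, you can also call it from code. The sketch below assumes Ollama is serving its local HTTP API on the default port 11434 and that the `llama3` model has already been downloaded; check the Ollama documentation for your version, since endpoints and fields may change. The function names are hypothetical.

```python
import json
import urllib.request

# Assumption: Ollama's default local address and its generate endpoint.
OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(prompt, model="llama3"):
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def ask_local_model(prompt, model="llama3"):
    """Send the prompt and return the generated text (requires Ollama to be running)."""
    with urllib.request.urlopen(build_request(prompt, model)) as response:
        return json.loads(response.read())["response"]

# Usage, once the Ollama server is running locally:
# print(ask_local_model("Explain what a context window is in one sentence."))
```

Because the request never leaves your machine, this pattern suits the sensitive work described in the Privacy section.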

Important Resources

Learning and Understanding

Coding Tools

Benchmarks and Comparisons

  • LMSYS Chatbot Arena — crowdsourced model rankings based on human preference
  • HumanEval — OpenAI's benchmark for code generation
  • SWE-bench — benchmark for AI agents solving real GitHub issues

Staying Current

The field moves fast. Good ways to keep up: