I Don’t Know Machine Learning as Well as I Thought I Did

Out of curiosity, I picked up ML and AI as the next tech stack I wanted to learn. After a month, I’m realising it’s probably going to take a year or more before I feel even minimally confident enough to put it on my resume.

A little background: I did Andrew Ng’s Machine Learning Specialisation in college around 5–6 years ago. I built a fraud detection project, an image classification project, and even wrote a research paper titled “YOLOv3 Remote Sensing SAR Ship Image Detection.” Then I got a job as a Mobile Engineer (with a bit of backend work too) and slowly got busy with life.
Back then, ML mostly felt like choosing a good-enough algorithm based on the dataset and the desired outcome. I didn’t see much beyond that at the time. But with the rise of LLMs, I got curious again and started reading AI Engineering by Chip Huyen, along with random blogs, Andrej Karpathy videos, research papers, and other internet resources.

I realised I had a lot of misconceptions, knowledge gaps, and even more curiosity.
The rest of this blog is about those things.

One thing I started doing was drawing rough mental maps of how concepts connect together.
This was one of the first sketches I made while trying to understand what actually goes into a “foundation model.”

Figure: One of my early attempts at connecting concepts like data, architectures, and foundation models.

Misconceptions:

Context: When someone says a foundation model needs context tokens to generate the next output token, I thought “context” only meant the input tokens we provide.
But that’s not true. That’s only the starting point. The input tokens are processed, extra context may be added by external tools (for example, through MCP servers), and the output tokens generated so far also become context for the next token.
Stateless: When we send a prompt to a foundation model like the ones behind ChatGPT, the information is not stored inside the model itself, i.e., the model does not permanently learn from it. It processes the prompt, generates an output, and moves on. In that sense, the model is stateless.
Self-Supervision: The pre-training of a foundation model does not rely heavily on human labellers. The model learns through self-supervision on massive amounts of internet data: the text itself provides the target, for example by predicting the next token.
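
The context and stateless points above can be seen in a toy generation loop. This is only a sketch: `toy_next_token` and the `BIGRAMS` table are hypothetical stand-ins for a real model, which would look at the whole context and not just the last token.

```python
# Toy lookup table standing in for a trained model (purely illustrative).
BIGRAMS = {"the": "cat", "cat": "sat", "sat": "down", "down": "<eos>"}

def toy_next_token(context: list[str]) -> str:
    """Hypothetical 'model': picks the next token from the current context."""
    return BIGRAMS.get(context[-1], "<eos>")

def generate(prompt_tokens: list[str], max_new_tokens: int = 10) -> list[str]:
    context = list(prompt_tokens)      # the prompt is only the starting context
    for _ in range(max_new_tokens):
        token = toy_next_token(context)
        if token == "<eos>":           # stop when the 'model' signals the end
            break
        context.append(token)          # each generated token becomes context too
    return context

print(generate(["the"]))
```

Nothing is kept between calls to `generate`: the second call starts from whatever prompt it is given, which is the stateless behaviour described above.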

Knowledge Gaps

Intuitions:
1. Derivatives: I forgot the intuition for derivatives. A derivative is the "rate of change at a particular point" of a function, i.e., the slope at that point. The slope can be steep or shallow, which is the magnitude of the change. The sign tells whether the function is increasing or decreasing at that point.
2. Dot Product: I have to look this up again.
3. Attention Mechanism: Why does the transformer use the dot product between query and key vectors to score attention between tokens?
4. Parameters and Model Size: I still struggle to intuitively understand why models with more parameters generally perform better. Since parameters are tied to the function the model learns, how does increasing parameter count improve learning capacity and reasoning ability?
  
  I even started drawing out tiny toy examples to rebuild intuition around parameters, functions, and what models are actually learning.
  
  Figure: Trying to rebuild intuition around parameters and functions.
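
Two of the intuitions above can be checked with a few lines of code. The sketch below estimates a derivative numerically, then scores one query against a few keys using the scaled dot product from transformer attention. The `square` function, the vectors, and all helper names are made up for illustration; a real model learns its query and key vectors and works with matrices, not Python lists.

```python
import math

def derivative(f, x, h=1e-6):
    """Central-difference estimate of the slope of f at x."""
    return (f(x + h) - f(x - h)) / (2 * h)

def square(x):
    return x * x

# The slope of x^2 is 2x: positive (increasing) at x=3, negative (decreasing) at x=-3.
print(derivative(square, 3.0))
print(derivative(square, -3.0))

def dot(a, b):
    """Sum of element-wise products of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def attention_weights(query, keys):
    """Scaled dot-product scores for one query, turned into weights by softmax."""
    scale = math.sqrt(len(query))
    scores = [dot(query, key) / scale for key in keys]
    top = max(scores)                          # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]   # same direction, perpendicular, opposite
print(attention_weights(query, keys))
```

The weight is largest for the key that points the same way as the query and smallest for the one pointing the opposite way, which is one common way to read the dot product as a similarity score.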
ML Concepts: I need to revisit these and understand where they fit into the bigger picture of machine learning.
1. Activation Functions
2. Feedforward
3. RAG (Retrieval-Augmented Generation)
4. Post-Training
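
As a starting point for the first two entries, here is a minimal sketch of a feedforward network with one hidden layer and a ReLU activation, in plain Python. The layer sizes and weights are invented for illustration, and `count_parameters` is a hypothetical helper, not a library function.

```python
def relu(x):
    """Activation function: passes positive values through, zeroes out the rest."""
    return max(0.0, x)

def linear(inputs, weights, biases):
    """One dense layer: each output is a weighted sum of the inputs plus a bias."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def feedforward(inputs, w1, b1, w2, b2):
    """Linear layer, then ReLU, then a second linear layer."""
    hidden = [relu(h) for h in linear(inputs, w1, b1)]
    return linear(hidden, w2, b2)

def count_parameters(tensors):
    """Count every individual weight and bias value."""
    total = 0
    for t in tensors:
        if isinstance(t[0], list):
            total += sum(len(row) for row in t)   # weight matrix
        else:
            total += len(t)                       # bias vector
    return total

# 2 inputs -> 3 hidden units -> 1 output, with made-up numbers.
w1 = [[1.0, -1.0], [0.5, 0.5], [-1.0, 1.0]]
b1 = [0.0, 0.0, 0.0]
w2 = [[1.0, 1.0, 1.0]]
b2 = [0.1]

print(feedforward([2.0, 1.0], w1, b1, w2, b2))
print(count_parameters([w1, b1, w2, b2]))
```

Without the activation between them, the two linear layers would collapse into a single linear layer, so the non-linearity is what lets extra layers and parameters represent more complicated functions.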

Things That Started Making Sense

Things I was curious about and finally got answers to:

SSMs (state space models): A newer family of architectures aimed at handling long contexts more efficiently. Transformers (used in most modern LLMs) become expensive and memory-heavy as the context grows, because attention compares every token with every other token.
Hybrid Architectures: Models can mix layers and ideas from different architectures and mechanisms, for example by combining attention layers with RNN-like layers.
Dataset: The quality of the training data matters, not just the quantity.
Naming Convention: Model names often include the number of parameters. In Llama3-70B, 70B means roughly 70 billion parameters.
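
To make the parameter count concrete, here is a back-of-the-envelope sketch of how much memory the weights alone would take at different numeric precisions. It assumes every parameter is stored at the same precision and ignores activations, caches, and any framework overhead; `weight_memory_gb` is a hypothetical helper and GB means 10^9 bytes.

```python
# Bytes needed to store one parameter in each format.
BYTES_PER_PARAM = {"FP32": 4, "BF16": 2, "FP16": 2, "FP8": 1}

def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

for dtype in ("FP32", "BF16", "FP8"):
    print(dtype, weight_memory_gb(70e9, dtype), "GB")
```

Under these assumptions, a 70B-parameter model needs about 280 GB for its weights in FP32, 140 GB in BF16, and 70 GB in FP8.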

The more concepts started making sense, the more new questions started appearing.

Still Curious

What are companies actually using for chatbots, customer support, and AI assistants?
Are they building on top of existing foundation models through APIs, fine-tuning open-source models, or training things in-house?
And what does “in-house AI” actually mean inside startups or enterprises? What kind of work do those teams actually do day-to-day?

How does post-training work for base models?
Do companies train the model again on additional data, or is the process more nuanced than that?
I tried drawing the pipeline based on my current understanding, but I still feel like I’m missing a lot of details around what actually happens during post-training.

Figure: My current mental model of pre-training → supervised fine-tuning → reinforcement learning.

How does RAG actually work internally?
Since context windows are limited, how does the system retrieve only the relevant pieces of information from huge datasets?

When using tools like Claude or Cursor, is every chat prompt an API call happening in real time?
And how do they stream responses word-by-word? Is it done through WebSockets or something else?

Why are models like Claude so good at coding tasks?
Is it because of post-training on coding datasets, reinforcement learning, tool usage, or something else?
Are there papers or engineering blogs explaining this in detail?

What does running a model at lower precision, like FP8 or BF16, actually mean?
How does reducing precision still preserve model quality while improving performance?
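
As a partial answer to the precision question, the sketch below shows what BF16 does to individual numbers. BF16 keeps the sign bit, the 8 exponent bits, and the top 7 mantissa bits of a 32-bit float, so the range stays the same while precision drops to roughly two or three significant decimal digits. The `to_bf16` helper is hypothetical and simply truncates the low 16 bits; real hardware typically rounds to nearest instead.

```python
import struct

def to_bf16(x: float) -> float:
    """Approximate BF16 by keeping only the top 16 bits of the float32 encoding."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]      # float32 bit pattern
    truncated = bits & 0xFFFF0000                            # drop the low 16 mantissa bits
    return struct.unpack(">f", struct.pack(">I", truncated))[0]

for value in (1.0, 3.14159, 0.001, 12345.678):
    approx = to_bf16(value)
    rel_err = abs(approx - value) / abs(value)
    print(f"{value} -> {approx} (relative error {rel_err:.5f})")
```

With truncation, each value changes by less than 1%. Whether that much noise per weight is acceptable for a given model is an empirical question that this sketch does not answer.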

Conclusion

Starting ML again after 5–6 years has been humbling. Back then, I thought I had understood it — courses, projects, even a research paper. But coming back to it now made me realise how much I had forgotten, and how much deeper the field actually goes.
This time, I’m trying to understand what is actually happening underneath instead of just making the models work.
