<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Sorry Engineering by Rafal Makara: AI Learning Notes]]></title><description><![CDATA[I've been using AI for a while—ended up running AI tasks by talking to my watch. However, after an experiment with TensorFlow in 2015, I skipped the basics needed for real engineering. These short learning notes are my way of catching up, focusing on theory rather than usage. Watch out! Some of them might even be wrong—that’s part of learning!]]></description><link>https://www.sorryengineering.com/s/ai-learning-notes</link><image><url>https://www.sorryengineering.com/img/substack.png</url><title>Sorry Engineering by Rafal Makara: AI Learning Notes</title><link>https://www.sorryengineering.com/s/ai-learning-notes</link></image><generator>Substack</generator><lastBuildDate>Wed, 06 May 2026 23:15:14 GMT</lastBuildDate><atom:link href="https://www.sorryengineering.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Rafal Makara]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[rafalmakara@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[rafalmakara@substack.com]]></itunes:email><itunes:name><![CDATA[Rafal Makara]]></itunes:name></itunes:owner><itunes:author><![CDATA[Rafal Makara]]></itunes:author><googleplay:owner><![CDATA[rafalmakara@substack.com]]></googleplay:owner><googleplay:email><![CDATA[rafalmakara@substack.com]]></googleplay:email><googleplay:author><![CDATA[Rafal Makara]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Model Architecture: seq2seq]]></title><description><![CDATA[A model architecture defines how a machine learning model is 
structured&#8212;how data flows through it, how different components interact, and how it makes predictions.]]></description><link>https://www.sorryengineering.com/p/model-architecture-seq2seq</link><guid isPermaLink="false">https://www.sorryengineering.com/p/model-architecture-seq2seq</guid><dc:creator><![CDATA[Rafal Makara]]></dc:creator><pubDate>Wed, 12 Feb 2025 23:58:09 GMT</pubDate><content:encoded><![CDATA[<p>A model architecture defines how a machine learning model is structured&#8212;how data flows through it, how different components interact, and how it makes predictions.</p><p>There have been quite a few model architectures. Right now, the Transformer architecture dominates the field.</p><p>But before the Transformer, Seq2Seq (Sequence-to-Sequence) was the big thing.</p><h2><strong>How does Seq2Seq work?</strong></h2><p>Seq2Seq is built with two main components:</p><p>&#8226; <strong>Encoder</strong>: Processes the input.</p><p>&#8226; <strong>Decoder</strong>: Generates the output.</p><p>Both work with sequences of tokens and, in the classic approach, use Recurrent Neural Networks (RNNs) or their more powerful versions&#8212;LSTMs and GRUs.</p><ol><li><p>The <strong>encoder</strong> reads the input sequence step by step, updating its hidden state at each time step.</p></li><li><p>The <strong>final hidden state</strong> after processing the last input token represents the entire input sequence.</p></li><li><p>The <strong>decoder</strong> receives this final hidden state as its initial state and starts generating the output sequence, token by token.</p></li></ol><p>The metaphor used in the "AI Engineering" book is: working with the final hidden state is like answering questions about a book based on just the summary. The final hidden state tries to capture everything from the input, but some details may get lost.</p><p>Since RNNs work sequentially, we must process the entire input before generating even the first output token. 
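</p><p>The three steps above can be sketched as a toy Python loop. This is only an illustration of the data flow: the function names and the arithmetic inside are made up for this sketch, while a real Seq2Seq model uses trained LSTM/GRU weights.</p>

```python
# Toy sketch of the Seq2Seq data flow. Not a real neural network: the arithmetic
# below just mimics the *shape* of the computation.

def encode(tokens, state_size=4):
    """Encoder: fold the input, step by step, into one fixed-size hidden state."""
    state = [0.0] * state_size
    for token in tokens:
        x = sum(ord(c) for c in token)  # stand-in for the token's embedding
        for i in range(state_size):
            # stand-in for the RNN update h_t = f(h_{t-1}, x_t)
            state[i] = 0.5 * state[i] + ((x * (i + 1)) % 97) / 97.0
    return state  # the final hidden state: everything the decoder will ever see

def decode(state, length=3, vocab=("le", "chat", "noir", "dort")):
    """Decoder: generate output tokens from the final hidden state alone."""
    out = []
    for step in range(length):
        # stand-in for "pick the most likely next token given the state"
        idx = int(sum(state) * 31 * (step + 1)) % len(vocab)
        out.append(vocab[idx])
    return out

hidden = encode(["the", "black", "cat"])
print(decode(hidden))  # the decoder never sees the input tokens, only `hidden`
```

<p>Notice the bottleneck: <code>decode</code> receives only the fixed-size <code>hidden</code> list, no matter how long the input was.</p><p>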
The longer the input, the longer we wait before we get anything in return. It doesn&#8217;t create the best UX for the chatbots.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NlO_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41e9dff0-89bd-422d-bfb5-8eaa8bfa502d_2404x328.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NlO_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41e9dff0-89bd-422d-bfb5-8eaa8bfa502d_2404x328.png 424w, https://substackcdn.com/image/fetch/$s_!NlO_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41e9dff0-89bd-422d-bfb5-8eaa8bfa502d_2404x328.png 848w, https://substackcdn.com/image/fetch/$s_!NlO_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41e9dff0-89bd-422d-bfb5-8eaa8bfa502d_2404x328.png 1272w, https://substackcdn.com/image/fetch/$s_!NlO_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41e9dff0-89bd-422d-bfb5-8eaa8bfa502d_2404x328.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!NlO_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41e9dff0-89bd-422d-bfb5-8eaa8bfa502d_2404x328.png" width="1456" height="199" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/41e9dff0-89bd-422d-bfb5-8eaa8bfa502d_2404x328.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:199,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:70798,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!NlO_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41e9dff0-89bd-422d-bfb5-8eaa8bfa502d_2404x328.png 424w, https://substackcdn.com/image/fetch/$s_!NlO_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41e9dff0-89bd-422d-bfb5-8eaa8bfa502d_2404x328.png 848w, https://substackcdn.com/image/fetch/$s_!NlO_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41e9dff0-89bd-422d-bfb5-8eaa8bfa502d_2404x328.png 1272w, https://substackcdn.com/image/fetch/$s_!NlO_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41e9dff0-89bd-422d-bfb5-8eaa8bfa502d_2404x328.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>After some time, the problem of not looking back at original input tokens (but only at the Final Hidden State) got solved by the Attention Mechanism. But this is a story for another time. </p><div><hr></div><p>Sources:</p><ul><li><p><em>AI Engineering by Chip Huyen (O&#8217;Reilly). 
Copyright 2025 Developer Experience Advisory LLC, 978-1-098-16630-4</em></p></li><li><p><em>https://d2l.ai/chapter_recurrent-modern/seq2seq.html</em></p></li><li><p><em>https://arxiv.org/abs/1409.3215v3</em></p></li></ul>]]></content:encoded></item><item><title><![CDATA[What is Inference Optimization?]]></title><description><![CDATA[It sounds like a smart term.]]></description><link>https://www.sorryengineering.com/p/what-is-inference-optimization</link><guid isPermaLink="false">https://www.sorryengineering.com/p/what-is-inference-optimization</guid><dc:creator><![CDATA[Rafal Makara]]></dc:creator><pubDate>Sat, 08 Feb 2025 22:53:41 GMT</pubDate><content:encoded><![CDATA[<p>It sounds like a smart term. In simple words, it is about making models faster and cheaper. It involves techniques to reduce computational costs, latency, and memory usage while maintaining or improving model accuracy.</p><p>If a chatbot generates a response of 300 tokens, with each token taking 10 milliseconds to produce, the user experience will be significantly better than if each token took 100 milliseconds to generate. </p><p>At this moment, I do not know much about inference optimization techniques. Perplexity lists techniques such as: pruning, quantization, knowledge distillation, weight sharing, low-rank factorization, early exit mechanisms, deployment strategy, caching and memoization, parallelism and batching. And probably more.</p><div><hr></div><p>Sources:</p><ul><li><p><em>AI Engineering by Chip Huyen (O&#8217;Reilly). 
Copyright 2025 Developer Experience Advisory LLC, 978-1-098-16630-4</em></p></li><li><p><em>Perplexity</em></p></li></ul>]]></content:encoded></item><item><title><![CDATA[Dataset Engineering]]></title><description><![CDATA[Dataset engineering refers to designing, collecting, curating, generating, annotating, and optimizing the data needed for training and adapting AI models.]]></description><link>https://www.sorryengineering.com/p/dataset-engineering</link><guid isPermaLink="false">https://www.sorryengineering.com/p/dataset-engineering</guid><dc:creator><![CDATA[Rafal Makara]]></dc:creator><pubDate>Sat, 08 Feb 2025 22:40:49 GMT</pubDate><content:encoded><![CDATA[<p>Dataset engineering refers to designing, collecting, curating, generating, annotating, and optimizing the data needed for training and adapting AI models.</p><p>Imagine two problem types: classification and open chat response.</p><p>In <strong>closed-ended models</strong>, such as traditional classification models, dataset engineering is straightforward. For example, labeling an image as &#8220;cat&#8221; or &#8220;not a cat&#8221; is a well-defined task with clear ground truth.</p><p>However, in <strong>open-ended models</strong>, such as foundation models, dataset engineering becomes more complex. Since these models (through, e.g., a chatbot UI) can generate responses in an almost unlimited number of ways, it extends beyond simple labeling. Instead, dataset engineering focuses on tasks like deduplication, tokenization, context retrieval, quality control, and removal of sensitive information.</p><p>How do you prepare datasets so a specific model can train effectively on them?</p><div><hr></div><p>Sources:</p><ul><li><p><em>AI Engineering by Chip Huyen (O&#8217;Reilly). 
Copyright 2025 Developer Experience Advisory LLC, 978-1-098-16630-4</em></p></li><li><p><em>Additional research</em></p></li></ul>]]></content:encoded></item><item><title><![CDATA[Pre-training, Finetuning, Post-training]]></title><description><![CDATA[Training is deeply associated with the process of adjusting the model weights.]]></description><link>https://www.sorryengineering.com/p/pre-training-finetuning-post-training</link><guid isPermaLink="false">https://www.sorryengineering.com/p/pre-training-finetuning-post-training</guid><dc:creator><![CDATA[Rafal Makara]]></dc:creator><pubDate>Sat, 08 Feb 2025 22:14:20 GMT</pubDate><content:encoded><![CDATA[<p>Training is deeply associated with the process of adjusting the model weights. While prompt engineering influences the output by modifying the given input/context, it doesn&#8217;t change the model weights.</p><p><strong>Pre-training</strong>: training a model from scratch. Model weights are initialised, then a huge amount of training data is processed to adjust them. This is the most resource-intensive phase. </p><p><strong>Finetuning</strong>: continuation of training with the weights obtained from previous training sessions. This process usually uses a much smaller or more specialised dataset.</p><p><strong>Post-training</strong>: from one perspective, finetuning and post-training are the same, as both happen after the model is pre-trained and both aim to improve the model. </p><h2>Finetuning vs Post-training</h2><p>Then, what&#8217;s the difference between finetuning and post-training? </p><p>I am not sure if this is a widely accepted definition, but in the context of foundation models, according to the book, finetuning is done by users of foundation models, while post-training is done by the foundation model engineers. There might also be a difference in goal. 
</p><p>If you&#8217;re building your end-user facing application on top of OpenAI, and you decide to adjust the weights of the existing model, then you do finetuning. At this point, your finetuning will probably target your specific use cases (e.g. domain) to make the model more accurate and knowledgeable in your context. </p><p>If you&#8217;re building your end-user facing application on top of OpenAI, and they decide to adjust the weights of the existing model, then they do post-training. For example, they can apply Reinforcement Learning from Human Feedback (RLHF) to align the model with human values, ethical principles, and intended use cases.</p><p>The last phase, post-training, which may use RLHF, doesn&#8217;t necessarily adjust the weights; it might apply some output filtering techniques instead. So, training is not always about adjusting the weights?</p><div><hr></div><p>Sources:</p><ul><li><p><em>AI Engineering by Chip Huyen (O&#8217;Reilly). Copyright 2025 Developer Experience Advisory LLC, 978-1-098-16630-4</em></p></li><li><p><em>Additional research</em></p></li></ul>]]></content:encoded></item><item><title><![CDATA[Basic Layers of AI Systems]]></title><description><![CDATA[We&#8217;ve had a lot of time to get used to lots of architectural approaches for building applications.]]></description><link>https://www.sorryengineering.com/p/basic-layers-of-ai-systems</link><guid isPermaLink="false">https://www.sorryengineering.com/p/basic-layers-of-ai-systems</guid><dc:creator><![CDATA[Rafal Makara]]></dc:creator><pubDate>Sat, 08 Feb 2025 21:43:49 GMT</pubDate><content:encoded><![CDATA[<p>We&#8217;ve had a lot of time to get used to lots of architectural approaches for building applications. In the AI world, there are a few as well. 
The most basic one seems to be:</p><ul><li><p>Application Development Layer</p><ul><li><p>AI Interface</p></li><li><p>Prompt engineering</p></li><li><p>Context construction</p></li><li><p>Evaluation</p></li></ul></li><li><p>Model Development Layer</p><ul><li><p>Inference optimization</p></li><li><p>Dataset engineering</p></li><li><p>Modeling &amp; training</p></li><li><p>Evaluation</p></li></ul></li><li><p>Infrastructure Layer</p><ul><li><p>Compute management</p></li><li><p>Data management</p></li><li><p>Serving</p></li><li><p>Monitoring</p></li></ul></li></ul><div><hr></div><p>Sources:</p><ul><li><p><em>AI Engineering by Chip Huyen (O&#8217;Reilly). Copyright 2025 Developer Experience Advisory LLC, 978-1-098-16630-4</em></p></li></ul>]]></content:encoded></item><item><title><![CDATA[First thoughts on AI Metrics]]></title><description><![CDATA[While the non-AI software engineering world is still getting used to metrics, metrics are an important part of the AI world as well.]]></description><link>https://www.sorryengineering.com/p/first-thought-on-ai-metrics</link><guid isPermaLink="false">https://www.sorryengineering.com/p/first-thought-on-ai-metrics</guid><dc:creator><![CDATA[Rafal Makara]]></dc:creator><pubDate>Sat, 08 Feb 2025 21:38:56 GMT</pubDate><content:encoded><![CDATA[<p>While the non-AI software engineering world is still getting used to metrics, metrics are an important part of the AI world as well.</p><p>These might include: </p><ul><li><p>Quality of the chatbot responses. Thumbs up. Thumbs down.</p></li><li><p>Chat message follow-up rate. </p></li><li><p>Accuracy. Precision. Recall. F1 Score. Hallucination rate.</p></li><li><p>Inference Latency. Time to first token. Time per full answer. Time per token.</p></li><li><p>P70, P95 tokens per response. Tokens per prompt.</p></li><li><p>Memory usage. GPU/CPU Utilization.</p></li><li><p>Cost per job.</p></li><li><p>Median, P75 of agents involved in a job.</p></li><li><p>Energy consumption.</p></li><li><p>Labeling accuracy. 
</p></li><li><p>Error rates.</p></li></ul><p>There are more. Will be more. Some will be valuable in a given context and time. The other ones will be pointless. </p><div><hr></div><p>Sources:</p><ul><li><p><em>AI Engineering by Chip Huyen (O&#8217;Reilly). Copyright 2025 Developer Experience Advisory LLC, 978-1-098-16630-4</em></p></li><li><p><em>Additional research</em></p></li></ul>]]></content:encoded></item><item><title><![CDATA[Agents, learnings from Anthropic]]></title><description><![CDATA[Agents, Agentic Workflows and Workflow Patterns]]></description><link>https://www.sorryengineering.com/p/agents</link><guid isPermaLink="false">https://www.sorryengineering.com/p/agents</guid><dc:creator><![CDATA[Rafal Makara]]></dc:creator><pubDate>Wed, 05 Feb 2025 23:16:59 GMT</pubDate><content:encoded><![CDATA[<h2>What&#8217;s an Agent?</h2><p>Agents function like workflows, but with AI deciding what process or tool to use next. </p><p>Most agents are just LLMs using external tools&#8212;taking in data, making decisions, and acting accordingly. They are LLMs with enhancements like retrieval, tool integration, and memory.</p><p>There&#8217;s a trade-off: higher costs, more compute, and sometimes slower execution. But in return, you gain flexibility&#8212;fewer hardcoded conditions and more dynamic problem-solving.</p><h2>Agentic Workflows</h2><p>Different patterns shape how agentic workflows operate. 
A few examples:</p><ul><li><p><strong>Prompt Chaining:</strong> Breaks tasks into steps for better results.<br><em>Example: Generate marketing copy &#8594; Translate &#8594; Format.</em></p></li><li><p><strong>Routing:</strong> Directs tasks to the right process.<br><em>Example: FAQ bot for general queries, automation for refunds, AI for tech support.</em></p></li><li><p><strong>Sectioning:</strong> Splits work across multiple models.<br><em>Example: One model generates responses, another moderates content.</em></p></li><li><p><strong>Voting:</strong> Runs multiple times for accuracy, enabling models to challenge each other.<br><em>Example: Content moderation using three models to balance false positives.</em></p></li><li><p><strong>Orchestrator-Workers:</strong> A central LLM assigns tasks to worker models&#8212;similar to map-reduce for LLMs.<br><em>Example: Separate research workers doing the research, which gets aggregated later.</em></p></li><li><p><strong>Evaluator-Optimizer:</strong> One LLM generates, another evaluates and refines, creating a self-improvement loop.<br><em>Example: One model prepares feedback, the other improves the output, execution by execution.</em></p></li></ul><p>Agentic workflows have best practices, design patterns, and bad smells&#8212;just like coding.</p><div><hr></div><p>Sources:</p><ul><li><p>https://www.anthropic.com/research/building-effective-agents</p></li></ul>]]></content:encoded></item><item><title><![CDATA[Workflows: Learnings from Gumloop]]></title><description><![CDATA[Subflows, simple UIs, Chrome Extension, Custom Nodes]]></description><link>https://www.sorryengineering.com/p/learnings-from-gumloop</link><guid isPermaLink="false">https://www.sorryengineering.com/p/learnings-from-gumloop</guid><dc:creator><![CDATA[Rafal Makara]]></dc:creator><pubDate>Tue, 04 Feb 2025 22:16:14 GMT</pubDate><content:encoded><![CDATA[<p>I watched <a href="https://www.youtube.com/watch?v=QFc7jXZ2pdE">this Gumloop presentation</a>. 
I personally use <strong>n8n</strong>, but I like checking out alternatives to get inspired by them. For sure, Gumloop is a product worth looking at.</p><h3><strong>Subflows Are Like Functions</strong></h3><p>Workflows often repeat the same steps. Gumloop has subflows, which let you reuse parts of a workflow instead of rebuilding the same logic again. It&#8217;s like writing functions in code.</p><p>If you have 20 different workflows, but all of them end with sending a Slack message written in a way aligned with your writing style, it can be a subflow.</p><h3><strong>Simple UI for Execution</strong></h3><p>Gumloop lets you create a basic website/HTML form with a few text fields for input parameters. You fill them in, click a button, and the workflow runs. Such forms can be shared with non-technical users.</p><p>Of course, without this functionality, you could set up such a website on your own and implement a button to call a webhook executing the workflow. However, it&#8217;s cool to see that they have a feature for it.</p><h3><strong>Chrome Extension for Instant Automation</strong></h3><p>Gumloop comes with a Chrome extension that lets you grab content from a currently opened website and send it as input to a workflow. One click, and automation kicks in, already populated with the content from the website. Faster than copy&amp;paste. Much easier than URL scraping, especially for pages behind authorization. </p><h3><strong>Building Custom Nodes in an AI-Powered Way</strong></h3><p>Gumloop supports custom nodes, letting you connect to third-party data sources that aren&#8217;t natively integrated. But what stood out in the video? The process. You copy &amp; paste the API documentation of a service that isn&#8217;t yet supported, and Gumloop&#8217;s AI generates the custom node for you. 
No manual coding, no complex setup&#8212;just an instant, AI-assisted connector, ready to be used in your workflows.</p><div><hr></div><p>Sources:</p><ul><li><p><a href="https://www.youtube.com/watch?v=QFc7jXZ2pdE">https://www.youtube.com/watch?v=QFc7jXZ2pdE</a></p></li></ul>]]></content:encoded></item><item><title><![CDATA[Prompts: Zero-shot, Few-shot, Chain-of-Thought]]></title><description><![CDATA[How to prompt?]]></description><link>https://www.sorryengineering.com/p/prompts-zero-shot-few-shot-chain</link><guid isPermaLink="false">https://www.sorryengineering.com/p/prompts-zero-shot-few-shot-chain</guid><dc:creator><![CDATA[Rafal Makara]]></dc:creator><pubDate>Tue, 04 Feb 2025 21:28:34 GMT</pubDate><content:encoded><![CDATA[<h2><strong>Zero-shot prompting</strong></h2><p>Zero-shot prompting means that the prompt used to interact with the model does not contain examples. </p><p>Prompt:</p><blockquote><p>Classify the text into neutral, negative or positive. </p><p>Text: I think the vacation is okay.</p><p>Sentiment:</p></blockquote><h2><strong>Few-shot prompting</strong></h2><p>Few-shot prompting provides examples in the prompt to steer the model to better performance.</p><p>Prompt:</p><blockquote><p>This is awesome! // Positive</p><p>This is bad! // Negative</p><p>Wow that movie was rad! // Positive</p><p>What a horrible show! //</p></blockquote><h2><strong>Chain-of-Thought</strong></h2><p>Chain-of-Thought prompting provides an example of a reasoning process in the prompt that the model can learn from. The sentence <em>&#8220;let&#8217;s take it step by step&#8221;</em> works like magic.</p><p>Prompt:</p><blockquote><p>Q: Bulb of garlic has 9 cloves. I ate 5 of them. I bought another bulb of garlic. How many cloves do I have?</p><p>A: Let's take it step-by-step. You had 9 cloves in a single bulb. You ate 5 cloves, so you had 4 cloves left. You bought another bulb, assuming it also has 9 cloves. 
So you have 9+4 cloves in total, which is 13 cloves.</p><p>Q: My plant gives 3 new flowers every week. One of them dies every 2 weeks. If I buy one more plant in the second week, how many flowers am I going to have after 4 weeks?</p><p>A: </p></blockquote><p>I haven&#8217;t shared the answers that the model gave me. Try it yourself. </p><div><hr></div><p>Sources:</p><ul><li><p>https://www.promptingguide.ai/</p></li></ul>]]></content:encoded></item><item><title><![CDATA[Tokens and Vocabulary]]></title><description><![CDATA[["I", "would", "n", "'", "t", "like", "to", "go", "to", "the", "gym", "today", "because", "I", "am", "sick"]]]></description><link>https://www.sorryengineering.com/p/tokens-and-vocabulary</link><guid isPermaLink="false">https://www.sorryengineering.com/p/tokens-and-vocabulary</guid><dc:creator><![CDATA[Rafal Makara]]></dc:creator><pubDate>Sun, 02 Feb 2025 10:08:12 GMT</pubDate><content:encoded><![CDATA[<p>Phrase:</p><ul><li><p>&#8220;I wouldn&#8217;t like to go to the gym today because I am sick&#8221;</p></li></ul><p>might be split (depending on the model) into tokens like:</p><ul><li><p>["I", "would", "n", "'", "t", "like", "to", "go", "to", "the", "gym", "today", "because", "I", "am", "sick"]</p></li></ul><h2>Why tokenization?</h2><ul><li><p>Tokens are more meaningful than single characters. </p></li><li><p>There are fewer unique tokens than words. Also, the <em>&#8220;ing&#8221;</em> token is quite common in English, making the model more efficient.</p></li><li><p>Tokens help with unknown words, like &#8220;bananing&#8221;, which is made of &#8220;banana&#8221; and &#8220;ing&#8221;.</p></li></ul><h2>Vocabulary</h2><p>The set of all tokens a model can work with is the model&#8217;s vocabulary. 
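</p><p>A toy greedy longest-match tokenizer makes both ideas concrete. The vocabulary below is invented for this sketch; real models learn theirs with algorithms such as Byte-Pair Encoding.</p>

```python
# Toy tokenizer: split a word into the longest vocabulary entries, left to right.
# The vocabulary is hand-picked for this example; real vocabularies are learned.

VOCAB = {"i", "would", "n", "'", "t", "like", "to", "go", "the", "gym", "ing", "banana", "a", "b"}

def tokenize(word, vocab=VOCAB):
    tokens = []
    i = 0
    while i < len(word):
        # try the longest possible slice starting at position i first
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])  # unknown character: fall back to a single char
            i += 1
    return tokens

print(tokenize("wouldn't"))  # ['would', 'n', "'", 't']
print(tokenize("going"))     # ['go', 'ing']
```

<p>On &#8220;going&#8221; it finds <code>go</code> + <code>ing</code>, the same trick that tames unknown words.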
</p><div><hr></div><p>Sources:</p><ul><li><p><em>AI Engineering by Chip Huyen (O&#8217;Reilly)</em></p></li></ul>]]></content:encoded></item><item><title><![CDATA[Supervision and Self-supervision]]></title><description><![CDATA[Is labeled training data necessary?]]></description><link>https://www.sorryengineering.com/p/supervision-and-self-supervision</link><guid isPermaLink="false">https://www.sorryengineering.com/p/supervision-and-self-supervision</guid><dc:creator><![CDATA[Rafal Makara]]></dc:creator><pubDate>Sun, 02 Feb 2025 10:06:22 GMT</pubDate><content:encoded><![CDATA[<p><strong>Supervision</strong> means training a model using labeled data. You show it two pictures&#8212;one labeled as &#8220;cat&#8221; and the other not labeled as &#8220;cat&#8221;. The model learns to recognize pictures with cats.</p><p><strong>Self-supervision</strong> means learning directly from the input data itself. Take the phrase: <em>&#8220;Sky is blue and beautiful.&#8221;</em> The model treats it as a training sequence: 1. Sky; 2. Sky is; 3. Sky is blue; 4. Sky is blue and; 5. Sky is blue and beautiful. It learns patterns and context without explicit labels; the labels come from the input itself.</p><div><hr></div><p>Sources:</p><ul><li><p><em>AI Engineering by Chip Huyen (O&#8217;Reilly). Copyright 2025 Developer Experience Advisory LLC, 978-1-098-16630-4</em></p></li></ul>]]></content:encoded></item><item><title><![CDATA[Key AI Model Types and Concepts]]></title><description><![CDATA[LM, LLM, MMM, GPM, FM, MLEng, AIEng]]></description><link>https://www.sorryengineering.com/p/key-ai-model-types-and-concepts</link><guid isPermaLink="false">https://www.sorryengineering.com/p/key-ai-model-types-and-concepts</guid><dc:creator><![CDATA[Rafal Makara]]></dc:creator><pubDate>Sun, 02 Feb 2025 10:05:03 GMT</pubDate><content:encoded><![CDATA[<p><strong>Language Models (LMs)</strong></p><p>Language Models are based on statistical patterns learned from one or more languages. 
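</p><p>That &#8220;statistical patterns&#8221; idea can be shown with a tiny bigram counter, a deliberately primitive language model. Real LMs are vastly more capable, but the core loop of learning patterns from text and predicting the next token is the same. The corpus below is made up for illustration.</p>

```python
# A bigram "language model": count which word follows which in a tiny corpus,
# then predict the most frequent follower.
from collections import Counter, defaultdict

corpus = "the sky is blue the sea is blue the grass is green".split()

follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def predict_next(word):
    """Most frequent word observed right after `word` in the corpus."""
    return follows[word].most_common(1)[0][0]

print(predict_next("is"))  # 'blue' (seen twice, vs 'green' once)
```

<p>Swap the tiny corpus for a huge text dataset and the counter for a neural network, and you are directionally at modern language models.</p><p>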
If they complete sentences like <strong>&#8220;My favorite color is _&#8221;</strong>, then it is an <strong>autoregressive language model</strong> predicting the next word. Alternatively, when they fill in the blanks in sentences like <strong>&#8220;My favorite _ is blue&#8221;</strong>, then this is a <strong>masked language model</strong>.</p><p><strong>Large Language Models (LLMs)</strong></p><p>The key difference between a standard language model and a large language model is <strong>scale</strong>&#8212;LLMs are trained on <strong>larger datasets, have more parameters, and require greater computational power</strong>, making them significantly more capable. Size matters.</p><p><strong>Multimodal Models</strong></p><p>Unlike traditional language models that process only text, <strong>multimodal models</strong> can handle multiple types of input, such as <strong>text, images, audio, and speech</strong>, enabling more complex interactions and understanding.</p><p><strong>Task-Specific Models</strong></p><p>These models are <strong>optimized for a single function</strong>&#8212;for example, a translation model can convert text between languages but <strong>cannot</strong> perform sentiment analysis. They are highly efficient but limited in scope.</p><p><strong>General-Purpose Models</strong></p><p>These models are <strong>versatile</strong> and can handle multiple tasks, such as <strong>translation, sentiment analysis, and more</strong>, without requiring significant modifications.</p><p><strong>Foundation Models</strong></p><p>A <strong>subcategory of general-purpose models</strong>, serving as a <strong>base for building AI applications</strong>. Thanks to them, gazillions of startups can call themselves &#8220;AI Startup&#8221;.</p><p><strong>ML Engineering</strong></p><p>Machine Learning Engineering involves not only developing end-user applications but also designing, training, and optimizing machine learning models. 
It is sometimes referred to as <strong>MLOps, AIOps, or LLMOps</strong>. Before we had Foundation Models, that was the way to go.</p><p><strong>AI Engineering</strong></p><p>AI Engineering is the <strong>process of building applications on top of Foundation Models</strong>, making AI accessible and easy to use for most of us.</p><div><hr></div><p>Sources:</p><ul><li><p><em>AI Engineering by Chip Huyen (O&#8217;Reilly)</em></p></li></ul>]]></content:encoded></item></channel></rss>