
*AI Engineering*, Chapter 2 – Understanding Foundation Models

“In general, however, differences in foundation models can be traced back to decisions about training data, model architecture and size, and how they are post-trained to align with human preferences” (excerpt)

💬 Comment: The key decisions to look at in a foundation model: its training data, the model architecture and size, and how it is post-trained to align with human preferences.

“Since models learn from data, their training data reveals a great deal about their capabilities and limitations. This chapter begins with how model developers curate training data, focusing on the distribution of training data. Chapter 8 explores dataset engineering techniques in detail, including data quality evaluation and data synthesis.” (excerpt)

💬 Comment: Key concepts around model data: training data curation, the distribution of training data, and dataset engineering (data quality evaluation, data synthesis).

Training data curation (focus: the distribution of training data)

“Some might wonder, why not just train a model on all data available, both general data and specialized data, so that the model can do everything? This is what many people do. However, training on more data often requires more compute resources and doesn’t always lead to better performance. For example, a model trained with a smaller amount of high-quality data might outperform a model trained with a large amount of low-quality data. Using 7B tokens of high-quality coding data, Gunasekar et al. (2023) were able to train a 1.3B-parameter model that outperforms much larger models on several important coding benchmarks. The impact of data quality is discussed more in Chapter 8.” (excerpt)

💬 Comment: When curating training data, more is not always better: more data means higher training cost, and the result is not necessarily better performance. Much of the data scraped from the public internet mixes true and false information and varies widely in quality. Both the quantity and the quality of training data shape the outcome.

💡 Thoughts: Beyond data quality and quantity, multilingual coverage also affects generation quality and token consumption. There are also domain-specific models: a general-purpose model can answer everyday questions, but its accuracy drops in specialized domains where little information is publicly available on the internet.

Properties of the transformer architecture, and what architectures could replace it

“As of this writing, the most dominant architecture for language-based foundation models is the transformer architecture (Vaswani et al., 2017), which is based on the attention mechanism. It addresses many limitations of the previous architectures, which contributed to its popularity.” (excerpt)

💬 Comment: The transformer architecture is built on the attention mechanism, which is a major reason for its popularity.

💡 Thoughts: 1. What is the attention mechanism? 2. What limitations did earlier architectures have?

🤖 From GPT: This figure shows the core difference between the traditional Seq2Seq (RNN-based) architecture and the Transformer architecture on a translation task (e.g., translating the English “How are you?” into the Spanish “¿Cómo estás?”).

Let’s go through it part by part:

🔲 Top half: Seq2Seq (RNN-based architecture)

✅ How it works:

  1. The input sequence "How are you ?" is processed by an encoder RNN, which produces a single final hidden state.
  2. The decoder RNN starts from that hidden state and generates the target-language words step by step: first ¿, then Cómo, then estás, and finally ?.
  3. Every translation decision depends on the “final hidden state”, a single compressed information vector.

❗ Problems:
  • Information bottleneck (compression): for long sentences, a single hidden state struggles to retain enough information.
  • Severe context loss: early inputs (such as “How”) can be diluted or ignored by the time the final state is produced.

🔲 Bottom half: Transformer (attention-based)

✅ How it works:

  1. The input is again "How are you ?", but instead of compressing everything into one hidden state, the Transformer, for each output word:
     • “attends” to different words in the input sequence (attention);
     • as the figure shows, generating ¿ mainly draws on “How”, while generating estás draws on “you”.
  2. The colored arrows represent attention:
     • each output (Spanish) word uses attention to decide which input words matter;
     • for example: ¿ mainly attends to “How”; Cómo draws on both “How” and “are”; estás attends more to “are” and “you”; the final ? attends to the whole input sentence.

✅ Advantages:
  • No compression bottleneck: every output word can access the entire input sentence.
  • Parallel computation: the Transformer is not bound to step-by-step sequential processing, so training can be parallelized.
  • Better context modeling.

📌 Summary comparison

| Feature | Seq2Seq (RNN) | Transformer |
| --- | --- | --- |
| Access to input information | Only the final hidden state | Global attention |
| Long-sentence modeling | Weak (information bottleneck) | Strong |
| Training efficiency | Slow, sequential | Fast, parallelizable |
| Interpretability | Low | High (attention can be visualized) |

💡 Takeaway:

This figure illustrates well why the Transformer architecture far outperforms traditional RNN models on tasks like machine translation and in applications such as ChatGPT.

If you are learning NLP or the transformer architecture, this figure is a classic example for understanding how attention works.

“There are two problems with seq2seq that Vaswani et al. (2017) addresses. First, the vanilla seq2seq decoder generates output tokens using only the final hidden state of the input. Intuitively, this is like generating answers about a book using the book summary. This limits the quality of the generated outputs. Second, the RNN encoder and decoder mean that both input processing and output generation are done sequentially, making it slow for long sequences. If an input is 200 tokens long, seq2seq has to wait for each input token to finish processing before moving on to the” (excerpt)

💬 Comment: The two problems with seq2seq (sequence-to-sequence):

  1. The vanilla seq2seq decoder generates output tokens using only the final hidden state of the input. Intuitively, this is like answering questions about a book from its summary alone, which limits the quality of the generated outputs.
  2. The RNN encoder and decoder mean that both input processing and output generation happen sequentially, which makes long sequences slow. If an input is 200 tokens long, the seq2seq model must wait for each input token to finish processing before moving on to the next.
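As a minimal sketch (not from the book; all names and sizes are illustrative), the sequential bottleneck in point 2 is visible in a vanilla RNN encoder: every step depends on the previous one, and the entire input is squeezed into a single vector that the decoder then has to work from:

```python
import numpy as np

def rnn_encode(tokens, W_h, W_x):
    """A vanilla RNN encoder: processes tokens strictly one at a time
    and compresses the whole input into a single final hidden state."""
    h = np.zeros(W_h.shape[0])
    for x in tokens:                     # sequential: step t depends on step t-1
        h = np.tanh(W_h @ h + W_x @ x)
    return h                             # the decoder sees only this one vector

rng = np.random.default_rng(0)
d_hidden, d_emb, seq_len = 16, 8, 200
W_h = rng.normal(scale=0.1, size=(d_hidden, d_hidden))
W_x = rng.normal(scale=0.1, size=(d_hidden, d_emb))
tokens = rng.normal(size=(seq_len, d_emb))   # 200 input token embeddings

h_final = rnn_encode(tokens, W_h, W_x)
print(h_final.shape)  # (16,): 200 tokens compressed into one 16-dim vector
```

Whatever the sequence length, the decoder's only view of the input is that fixed-size vector, which is exactly the "book summary" bottleneck described above.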

“The transformer architecture addresses both problems with the attention mechanism. The attention mechanism allows the model to weigh the importance of different input tokens when generating each output token. This is like generating answers by referencing any page in the book. A simplified visualization of the transformer architecture is shown in the bottom half of” (excerpt)

💬 Comment: The transformer architecture addresses both problems with the attention mechanism, which lets the model weigh the importance of different input tokens when generating each output token. This is like answering by referencing any page in the book. The bottom half of the figure above shows a simplified view of the transformer architecture.
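A minimal NumPy sketch of the attention weighting described above, using scaled dot-product self-attention (illustrative only; real transformers add learned Q/K/V projection matrices, multiple heads, and positional information):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention (Vaswani et al., 2017).

    Each output row is a weighted average of the rows of V, with
    weights given by how well the query matches each key."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n_q, n_k) similarity scores
    # softmax over keys: each output token "attends" to all input tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

# Toy example: 4 input tokens ("How", "are", "you", "?"), embedding dim 8
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out, w = scaled_dot_product_attention(x, x, x)  # self-attention: Q = K = V = x
print(w.shape)         # (4, 4): one weight per (output, input) token pair
print(w.sum(axis=-1))  # each row sums to 1: a distribution over input tokens
```

The weight matrix `w` is exactly what the colored arrows in the figure depict: for each generated token, a distribution over which input tokens to "look at".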

“While the attention mechanism is often associated with the transformer model, it was introduced three years before the transformer paper. The attention mechanism can also be used with other architectures. Google used the attention mechanism with their seq2seq architecture in 2016 for their GNMT (Google Neural Machine Translation) model. However, it wasn’t until the transformer paper showed that the attention mechanism could be used without RNNs” (excerpt)

💬 Comment: The attention mechanism is not tied exclusively to the transformer architecture.

Model size

Post-training decisions

Model sampling

“While most people understand the impact of training on a model’s performance, the impact of sampling is often overlooked. Sampling is how a model chooses an output from all possible options. It is perhaps one of the most underrated concepts in AI. Not only does sampling explain many seemingly baffling AI behaviors, including hallucinations and inconsistencies, but choosing the right sampling strategy can also significantly boost a model’s performance with relatively little effort. For this reason, sampling is the section that I was the most excited to write about in this chapter.” (excerpt)

💬 Comment: Sampling is how a model chooses an output from all possible options. It is perhaps one of the most underrated concepts in AI: not only does it explain many seemingly baffling behaviors, including hallucinations and inconsistencies, but choosing the right sampling strategy can also significantly boost a model’s performance with relatively little effort.
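To make sampling concrete, here is a minimal sketch (illustrative, not from the book) of two common strategies applied to a model's raw logits: temperature scaling and top-k filtering. The logits and vocabulary here are made up:

```python
import numpy as np

def sample_token(logits, temperature=1.0, top_k=None, rng=None):
    """Sample a token id from raw logits.

    temperature < 1 sharpens the distribution (more deterministic);
    temperature > 1 flattens it (more diverse, more error-prone).
    top_k keeps only the k highest-scoring tokens before sampling
    (ties at the cutoff may keep a few more; fine for a sketch)."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=float) / temperature
    if top_k is not None:
        cutoff = np.sort(logits)[-top_k]
        logits = np.where(logits >= cutoff, logits, -np.inf)
    probs = np.exp(logits - logits.max())   # numerically stable softmax
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

logits = [2.0, 1.0, 0.5, -1.0]     # hypothetical 4-token vocabulary
rng = np.random.default_rng(0)

# Very low temperature is near-greedy: it almost always picks token 0
picks = [sample_token(logits, temperature=0.1, rng=rng) for _ in range(100)]
print(sum(p == 0 for p in picks))  # close to 100

# top_k=2 restricts sampling to the two most likely tokens
pick = sample_token(logits, top_k=2, rng=rng)
```

This is why the same prompt can yield different answers run to run, and why tuning temperature or top-k can noticeably change a model's perceived behavior without any retraining.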