AI Engineering, Chapter 2: Understanding Foundation Models

“In general, however, differences in foundation models can be traced back to decisions about training data, model architecture and size, and how they are post-trained to align with human preferences” (excerpt from the original)

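💬 Comment: The factors to examine in a foundation model are the decisions made about its training data, its architecture and size, and how it is post-trained to align with human preferences.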

“Since models learn from data, their training data reveals a great deal about their capabilities and limitations. This chapter begins with how model developers curate training data, focusing on the distribution of training data. Chapter 8 explores dataset engineering techniques in detail, including data quality evaluation and data synthesis.” (excerpt from the original)

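💬 Comment: Key concepts around model data: how training data is selected, how it is distributed, and dataset engineering (data quality evaluation and data synthesis).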

Training data selection (focus: the distribution of training data)

“Some might wonder, why not just train a model on all data available, both general data and specialized data, so that the model can do everything? This is what many people do. However, training on more data often requires more compute resources and doesn’t always lead to better performance. For example, a model trained with a smaller amount of high-quality data might outperform a model trained with a large amount of low-quality data. Using 7B tokens of high-quality coding data, Gunasekar et al. (2023) were able to train a 1.3B-parameter model that outperforms much larger models on several important coding benchmarks. The impact of data quality is discussed more in Chapter 8.” (excerpt from the original)

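💬 Comment: More training data is not always better. More data means higher training cost, yet the result does not necessarily perform better, and much of the data collected from the public internet mixes accurate and inaccurate information and varies widely in quality. Both the quantity and the quality of training data therefore affect the outcome.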

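💡 Thoughts: Beyond the quality and quantity of training data, the languages the model has been trained on also affect the quality of its output and the number of tokens it consumes. Domain matters as well: a general-purpose model can answer everyday questions, but its accuracy drops in domains where relatively little information has been published on the internet.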

Transformer architecture: its characteristics and which architectures could replace it

Model size

Post-training decisions

Model sampling

“While most people understand the impact of training on a model’s performance, the impact of sampling is often overlooked. Sampling is how a model chooses an output from all possible options. It is perhaps one of the most underrated concepts in AI. Not only does sampling explain many seemingly baffling AI behaviors, including hallucinations and inconsistencies, but choosing the right sampling strategy can also significantly boost a model’s performance with relatively little effort. For this reason, sampling is the section that I was the most excited to write about in this chapter.” (excerpt from the original)

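💬 Comment: Sampling is how a model chooses one output from all possible options. It is perhaps one of the most underrated concepts in AI. Sampling explains many seemingly baffling AI behaviors, including hallucinations and inconsistencies, and choosing the right sampling strategy can significantly boost a model's performance with relatively little effort.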
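
To make "choosing an output from all possible options" concrete, here is a minimal Python sketch, assuming the model produces one raw score (logit) per candidate token. The three-word vocabulary, the logit values, and the helper names softmax, greedy_token, and sample_token are hypothetical and not taken from the book. Temperature scaling is used here only as one common example of a sampling setting.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; temperature rescales the scores first."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_token(logits):
    """Always pick the highest-scoring option (deterministic)."""
    return max(range(len(logits)), key=lambda i: logits[i])


def sample_token(logits, temperature=1.0, rng=None):
    """Draw one option at random, weighted by its probability."""
    rng = rng or random.Random()
    probs = softmax(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Hypothetical vocabulary and scores, for illustration only.
vocab = ["red", "green", "blue"]
logits = [2.0, 1.0, 0.1]
rng = random.Random(0)

print("greedy:", vocab[greedy_token(logits)])
print("sampled:", [vocab[sample_token(logits, 1.0, rng)] for _ in range(5)])
print("probs at T=0.5:", softmax(logits, 0.5))
print("probs at T=5.0:", softmax(logits, 5.0))
```

Because sample_token draws at random, the same input can yield different outputs across calls, which is one way to see where the inconsistencies mentioned in the excerpt can come from. A lower temperature concentrates probability on the top option, while a higher temperature spreads it more evenly across the options.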