《AI Engineering》 Chapter 1 – Building AI Applications with Foundation Models
“A language model encodes statistical information about one or more languages. Intuitively, this information tells us how likely a word is to appear in a given context. For example, given the context “My favorite color is __”, a language model that encodes English should predict “blue” more often than “car”.” (excerpt)
💬 Comment: Language models are the precursor of large language models, and it is the development of LLMs that has driven the recent progress in AI.
“The basic unit of a language model is token. A token can be a character, a word, or a part of a word (like -tion), depending on the model. For example, GPT-4, a model behind ChatGPT, breaks the phrase “I can’t wait to build AI applications” into nine tokens, as shown in Figure 1-1. Note that in this example, the word “can’t” is broken into two tokens, can and ’t. You can see how different OpenAI models tokenize text on the OpenAI website.” (excerpt)
💬 Comment: A token is the basic unit of a language model; different models tokenize text in different ways.
“The set of all tokens a model can work with is the model’s vocabulary. You can use a small number of tokens to construct a large number of distinct words, similar to how you can use a few letters in the alphabet to construct many words. The Mixtral 8x7B model has a vocabulary size of 32,000. GPT-4’s vocabulary size is 100,256. The tokenization method and vocabulary size are decided by model developers.” (excerpt)
💬 Comment: A model's vocabulary is the set of all tokens it can process; both the tokenization method and the vocabulary size are decided by the model's developers.
“Why do language models use token as their unit instead of word or character? There are three main reasons:
- Compared to characters, tokens allow the model to break words into meaningful components. For example, “cooking” can be broken into “cook” and “ing”, with both components carrying some meaning of the original word.
- Because there are fewer unique tokens than unique words, this reduces the model’s vocabulary size, making the model more efficient (as discussed in Chapter 2).
- Tokens also help the model process unknown words. For instance, a made-up word like “chatgpting” could be split into “chatgpt” and “ing”, helping the model understand its structure. Tokens balance having fewer units than words while retaining more meaning than individual characters.” (excerpt)
💬 Comment:
- Tokens break words into more meaningful components.
- There are fewer unique tokens than unique words, which keeps the vocabulary smaller.
- Tokens help the model handle unknown words: fewer units than words, yet more meaning than individual characters.
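The three points above can be illustrated with a minimal greedy longest-match subword tokenizer. This is a toy sketch, not the BPE algorithm GPT-4 actually uses, and the vocabulary below is invented purely for illustration:

```python
# Toy subword vocabulary (hypothetical; real vocabularies hold tens of
# thousands of entries learned from data).
TOY_VOCAB = {"chatgpt", "cook", "ing", "chat", "gpt",
             "c", "o", "k", "i", "n", "g", "h", "a", "t", "p"}

def tokenize(word: str, vocab=TOY_VOCAB) -> list[str]:
    """Split a word into the longest matching vocabulary entries, left to right."""
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest possible match first, shrinking one character at a time.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            # Nothing matched (not even a single character): emit an unknown marker.
            tokens.append("<unk>")
            i += 1
    return tokens

print(tokenize("cooking"))     # ['cook', 'ing']
print(tokenize("chatgpting"))  # ['chatgpt', 'ing'] -- a made-up word still splits
                               # into meaningful pieces
```

Because single characters are also in the vocabulary, any word can be encoded, which is how subword tokenization avoids out-of-vocabulary failures.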
“Masked language model” (excerpt)
💬 Comment: A masked language model is trained to predict missing tokens anywhere in a sequence, using the context on both sides of the gap.
“Autoregressive language model” (excerpt)
💬 Comment: An autoregressive language model is trained to predict the next token using only the preceding tokens; today's generative text models are autoregressive.
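Autoregressive prediction can be sketched with a tiny bigram model: it predicts the next token from counts of what followed the previous token. The corpus below is invented for illustration; real models condition on long contexts with neural networks, not counts.

```python
from collections import Counter, defaultdict

# Toy corpus (hypothetical), echoing the book's "My favorite color is __" example.
corpus = ("my favorite color is blue . my favorite color is blue . "
          "my favorite car is red .").split()

# Count how often each token follows each preceding token.
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def predict_next(token: str) -> str:
    """Return the token most frequently observed after `token`."""
    return counts[token].most_common(1)[0][0]

print(predict_next("is"))  # 'blue' -- seen after 'is' more often than 'red'
```

The model only ever looks backward in the sequence, which is exactly the autoregressive constraint; a masked model would instead use tokens on both sides of the gap.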
“Self-supervision: The answer is that language models can be trained using self-supervision, while many other models require supervision. Supervision refers to the process of training ML algorithms using labeled data, which can be expensive and slow to obtain. Self-supervision helps overcome this data labeling bottleneck to create larger datasets for models to learn from, effectively allowing models to scale up. Here’s how.” (excerpt)
💬 Comment:
Self-supervision: supervision means training ML algorithms on labeled data, which is expensive and slow to obtain. With self-supervision, the labels are derived from the raw data itself (for example, the next token in a text sequence), which removes the labeling bottleneck and lets models train on far larger datasets.
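The idea that labels come from the data itself can be shown in a few lines: for next-token prediction, every prefix of a token sequence is an input and the token that follows is its label, with no human annotation. The sentence is invented for illustration.

```python
# Self-supervision sketch: build (context, label) training pairs directly
# from raw text. Each prefix is the input; the next token is the label.
tokens = ["my", "favorite", "color", "is", "blue"]

training_pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

for context, label in training_pairs:
    print(context, "->", label)
# ['my'] -> favorite
# ['my', 'favorite'] -> color
# ...
# ['my', 'favorite', 'color', 'is'] -> blue
```

One sentence yields several training examples for free, which is why self-supervised pretraining scales to web-sized corpora.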
“Planning AI Applications” (excerpt)
💬 Comment: How to plan and build an AI application:
- Use case evaluation: decide what to build; let risk drive the project and prioritize the risks.
- The role of AI in the application:
  - Primary role or supporting role?
  - Reactive or proactive? (Reactive responds to user requests; proactive steps in at the right moment to help.)
  - Dynamic or static? (Dynamic updates continuously as the user interacts; static updates periodically.)
- Defensibility: how hard it is for competitors to replicate the product.
- Setting expectations: what does success look like at each stage?
- Milestones: markers for product iterations and releases.
- Maintenance: how to keep the product usable and maintain its quality over time.
“Three Layers of the AI Stack” (excerpt)
💬 Comment: The three layers where we can work in AI today:
- Application development: building applications on top of foundation models.
- Model development: modeling and training tooling, dataset engineering, fine-tuning, and inference optimization frameworks.
- Infrastructure: model serving, data management, and monitoring tools.