*AI Engineering*, Chapter 1 – Building AI Applications with Foundation Models
“A language model encodes statistical information about one or more languages. Intuitively, this information tells us how likely a word is to appear in a given context. For example, given the context “My favorite color is __”, a language model that encodes English should predict “blue” more often than “car”.” (excerpt)
💬 Comment: The language model is the precursor of the large language model, and it is the development of LLMs that has driven the recent advances in AI.
“The basic unit of a language model is token. A token can be a character, a word, or a part of a word (like -tion), depending on the model. For example, GPT-4, a model behind ChatGPT, breaks the phrase “I can’t wait to build AI applications” into nine tokens, as shown in Figure 1-1. Note that in this example, the word “can’t” is broken into two tokens, can and ’t. You can see how different OpenAI models tokenize text on the OpenAI website.” (excerpt)
💬 Comment: The token is the most basic unit of a language model; different models tokenize text in different ways.
“The set of all tokens a model can work with is the model’s vocabulary. You can use a small number of tokens to construct a large number of distinct words, similar to how you can use a few letters in the alphabet to construct many words. The Mixtral 8x7B model has a vocabulary size of 32,000. GPT-4’s vocabulary size is 100,256. The tokenization method and vocabulary size are decided by model developers.” (excerpt)
💬 Comment: A model’s vocabulary is the set of all tokens it can work with, and it is decided by the model developers.
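The claim that a small vocabulary can compose many distinct words can be sketched with a toy example. The subword lists below are invented for illustration; real vocabularies are learned from data and contain tens of thousands of entries:

```python
from itertools import product

# A tiny, made-up subword inventory (purely illustrative -- real model
# vocabularies are learned, e.g., via byte-pair encoding).
prefixes = ["re", "un", "pre"]
stems = ["do", "load", "view", "pack"]
suffixes = ["", "ing", "ed", "s"]

# 11 subword pieces combine into 3 * 4 * 4 = 48 distinct surface forms,
# e.g., "reloading", "unpacked", "previews".
words = {p + s + suf for p, s, suf in product(prefixes, stems, suffixes)}
print(len(words))  # 48
```

The same combinatorial effect, scaled up to a vocabulary of ~100k tokens, lets a model cover essentially unbounded text with a fixed unit set.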
“Why do language models use token as their unit instead of word or character? There are three main reasons:
- Compared to characters, tokens allow the model to break words into meaningful components. For example, “cooking” can be broken into “cook” and “ing”, with both components carrying some meaning of the original word.
- Because there are fewer unique tokens than unique words, this reduces the model’s vocabulary size, making the model more efficient (as discussed in Chapter 2).
- Tokens also help the model process unknown words. For instance, a made-up word like “chatgpting” could be split into “chatgpt” and “ing”, helping the model understand its structure. Tokens balance having fewer units than words while retaining more meaning than individual characters.” (excerpt)
💬 Comment:
Tokens break words down into more meaningful components
There are fewer unique tokens than unique words, which keeps the vocabulary smaller
Tokens help the model handle unknown words: fewer units than words, yet more meaning than individual characters
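The three points above can be sketched with a minimal greedy longest-match subword tokenizer. The vocabulary here is hypothetical, and production models learn their segmentation (e.g., via byte-pair encoding) rather than using this exact scheme, but the behavior on “cooking” and “chatgpting” mirrors the book’s examples:

```python
# Hypothetical vocabulary for illustration only.
VOCAB = {"cook", "ing", "chat", "gpt", "chatgpt"}

def tokenize(word, vocab):
    """Split `word` into the longest vocabulary entries, left to right,
    falling back to single characters for anything not in the vocab."""
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest possible match starting at position i.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            # Unknown character: emit it as its own token.
            tokens.append(word[i])
            i += 1
    return tokens

print(tokenize("cooking", VOCAB))     # ['cook', 'ing']
print(tokenize("chatgpting", VOCAB))  # ['chatgpt', 'ing']
```

Even the made-up word “chatgpting”, which a word-level vocabulary could not represent, decomposes into known pieces that carry meaning.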
🧩 Related concepts: [[Prompt 工程]] [[模型评估方法]]
📌 Suggestion: Distill 2–3 core concepts and capture them as new concept notes.