How can we measure the efficacy of a language model? Language model researchers use the term "perplexity" to measure how a model performs on standard datasets. Here a "task" means quizzing the model to complete a sentence, hold a Q&A, or generate an essay. GPT-3 scored well, in fact very well, on perplexity on standard benchmarks like the Penn Treebank. Overall, though, the results were mediocre.

Perplexity measures the "surprise" factor in generation: roughly, how many branches the model has to choose between when predicting the next word. A perplexity of 20 means that, given a few words, the model is effectively picking among 20 choices for the next word. If that number were 2, the model would have an easier job, but that is most likely because we overfit the model to a specific task. Without such task-specific training, the GPT-3 folks claim the model is a few-shot learner: show it a handful of examples in the prompt, and it homes in on the task's context without any further training. There are variants of transformer models, like BERT, that are trained on specific objectives like "complete the last word" or "fill in the blank" and perform much better at those specific tasks than GPT-3.

As we build LLMs to perform tasks like translation, generation, and completion, should we overfit a model to a specific task, or leave it generic and provide in-context training via prompting? With GPT-3, prompting is the chosen route to get the model to home in. And what about model size? Does it make sense to overfit a model with hundreds of billions of parameters to a single task (like translation), or leave it as a few-shot performer, i.e. mediocre? These are the tradeoffs one has to make when building a new LLM. Larger, generic models are mediocre at best on any single task.
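To make that branching-factor intuition concrete, here is a minimal sketch in Python (the probabilities are invented purely for illustration, not taken from any benchmark): perplexity is the exponential of the average negative log-likelihood the model assigns to the words that actually came next, so a perplexity of 20 means the model is, on average, as unsure as if it were choosing uniformly among 20 words.

import math

# Hypothetical probabilities the model assigned to each actual next word;
# these numbers are made up for illustration.
token_probs = [0.10, 0.025, 0.05, 0.05]

# Perplexity = exp(average negative log-likelihood over the predicted words).
avg_neg_log_likelihood = -sum(math.log(p) for p in token_probs) / len(token_probs)
perplexity = math.exp(avg_neg_log_likelihood)

print(round(perplexity, 2))  # 20.0 -- as uncertain as picking uniformly among 20 words

A model overfit to one narrow task would assign much higher probabilities to the right words, and its perplexity would drop toward 2; the question raised above is whether that narrowness is worth it.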
posted by Vikas Deolaliker at 7:54 AM on Jun 10, 2023
"Perplexity, Entropy: How to measure LLMs?"