Neural Networks have a bad reputation of being very confident when they are wrong. This is the result of a bad probability estimates being calculated (i.e. learned). They also suffer from adversarial attacks. Training is the activity that takes the most time in arriving at a functional LLM. Besides collecting, curating and integrating data sets, we have to also navigate around pot holes by employing optimization techniques on the objective function. Objective function or goal seeking function is a function that takes data and model parameters as arguments and outputs a number. The goal is to find values for these parameters which either maximize or minimize this number. Maximum likelihood (MLE) is one of the most often used function that fits this task of finding the set of parameters that best fit the observed data.LLMs have three model architectures (a) encoder only (BERT) (b) decoder only (GPT) (c) encoder-decoder (T5). Looking at (b), which is a probability distribution over a word given the prompt, which is arrived by taking a smoothed exponential (softmax) of scores calculated using scaled dot products between each new prediction word with prompt, we use MLE to find the best distribution that fits the observed data.Stochastic gradient descent (SGD) and ADAM (adaptive moment estimation) are two common methods used to optimize the objective function. The latter is memory intensive. There are many knobs like size of a floating point, calculating moments (more value per parameter), changing learning rates among others that can be used to learn a model. Sometimes the knob settings result in generic learning and other times overfitting. More often than not we just don’t converge i.e. no learning. ADAM is popular optimizer (I use it on the open source transformers from hugging), it keeps 3X more values per parameter than vanilla SGD. AdaFactor is an optimization on ADAM to reduce memory consumption but has been known to not always work. A rule of thumb in ML is gather more data as a dumb model with lots of data beats a smart model with limited data. But training on large amount of data is costly. It used computational resources and as the whole process is iterative, we need fast processors to crunch through the data so the model can be adapted and iterated upon. More data does not guarantee convergence i.e learning. The whole exercise in learning a model looks and feels like art bordering on black magic than anything analytic or scientific. If modeling felt like a recipe than this is like cooking the recipe. The end result has a lot of variance.
posted by Vikas Deolaliker at 11:22 AM on Jul 9, 2023
"Learning a Model"
No comments yet. -