Machine Learning is Like Floating-Point Arithmetic on the Real World
I wrote in this old post that:
A trained model can be treated just like a regular computer program.
I still think that view is correct. But now, with the benefit of additional knowledge, I am finetuning how I reason about machine learning models; I’ll now think:
A trained model can be treated like a function in a regular computer program.
And it’s not just making predictions with a trained model: training a model is also done by a function in a computer program. What do the training and prediction (and possibly also the hyperparameter tuning) functions share? Some global variable(s), of course. The variables are the parameters/weights of the model in the computer’s memory. The training function updates those weights, and the prediction/inference function uses those weights to make predictions.
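To make that ‘shared global state’ view concrete, here is a minimal sketch in Kotlin. The names (weights, train, predict) and the toy one-parameter linear model are made up for illustration; real training loops and frameworks are of course far more involved.

import kotlin.random.Random

// The 'global variable' shared by training and inference: the model weights.
var weights = doubleArrayOf(0.0)

// Training: update the shared weights from data (one gradient-descent step
// for the toy model y = w * x with squared error).
fun train(xs: DoubleArray, ys: DoubleArray, learningRate: Double = 0.01) {
    var gradient = 0.0
    for (i in xs.indices) {
        gradient += 2 * (weights[0] * xs[i] - ys[i]) * xs[i]
    }
    weights[0] -= learningRate * gradient / xs.size
}

// Inference: read the same shared weights to make a prediction.
fun predict(x: Double): Double = weights[0] * x

fun main() {
    val xs = doubleArrayOf(1.0, 2.0, 3.0)
    val ys = doubleArrayOf(2.0, 4.0, 6.0) // the 'real world' here is y = 2x
    repeat(200) { train(xs, ys) }
    println(predict(4.0)) // close to 8.0, but not exactly 8.0
}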
So I got to thinking about how the weights are stored in memory, and what the weights in memory actually represent. The weights represent the model/function’s understanding of the real world. Weights are usually tensors (multidimensional arrays) containing floats (real numbers). Imagine that you are training a deep neural network on some data from the real world. Because you are using physics data, say, the network might learn about Pi, the gravitational constant, and Planck’s constant. And so after training there might be three values residing somewhere in your model weights: 3.14159..., 0.0000000000667430..., and 0.000000000000000000000000000000000662607015... for each of those physical constants respectively.
But the thing is that when computers store real numbers, like the three examples above, they are commonly stored as floating-point approximate representations (aka floats). This approximation is necessary because there are infinitely many real numbers; if you wanted to store all real numbers exactly, you would need infinite memory (and this is true even if you limit it to, say, real numbers between 0 and 1). Anyway, how a real number is stored in computer memory depends on the floating-point format used by the computer. The physical constants I have used as an example might be represented as ‘pairs’ of numbers where the first element is the significand and the second is the exponent: [3.14159, 0], [6.67430, -11], and [6.62607, -34] respectively. This is assuming a base of 10 and a precision of 6. (But of course, computers use a base of 2; there are also lots of other things to talk about regarding the actual floating-point formats used in practice, like the sign of the number. See What Every Computer Scientist Should Know About Floating-Point Arithmetic for further reading.)
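For a concrete look at the base-2 version, here is a small Kotlin sketch (describe is just a made-up helper) that pulls apart a Double, which uses the IEEE 754 double-precision format: 1 sign bit, 11 exponent bits, and 52 significand bits.

// Decompose a Double into the sign, exponent and significand bits that
// actually sit in memory.
fun describe(x: Double) {
    val bits = x.toRawBits()                           // the raw 64-bit pattern
    val sign = (bits ushr 63) and 1L                   // 1 bit
    val exponent = ((bits ushr 52) and 0x7FFL) - 1023  // 11 bits, stored with a bias of 1023
    val significand = bits and 0xFFFFFFFFFFFFFL        // 52 bits of fraction
    println("$x -> sign=$sign exponent=$exponent significand=0x${significand.toString(16)}")
}

fun main() {
    describe(3.14159)        // ~Pi
    describe(6.67430e-11)    // ~gravitational constant
    describe(6.62607015e-34) // ~Planck's constant
}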
So we can see that this approximation is already one source of noise/error for the model, especially for a regression model, where the predictions are continuous real numbers.
Now, what if the weights learned by the model are more complex than that? What if, for instance, the model learned some floating-point representation of its own for Pi, the gravitational constant, and Planck’s constant? So instead of trying to store those three constants directly in three parameters, the model has assigned 6 parameters to store a significand and an exponent for each of the three constants. There might also be additional learned parameters that represent the ‘floating-point format’ that the model has learned. And this is not just about floating-point representations: it could be some approximation of some law/truth/function that the model has learned. In this case, in addition to the noise introduced by the way the actual weights are stored, there is also the noise introduced by the approximation that the model is doing with the weights.
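A purely hypothetical illustration of that idea in Kotlin (the reconstruct helper and the parameter values are made up, and a real network would not store things this cleanly):

import kotlin.math.pow

// Two 'learned' parameters play the role of a significand and an exponent,
// and the model combines them to reconstruct a constant.
fun reconstruct(significand: Double, exponent: Double): Double =
    significand * 10.0.pow(exponent)

fun main() {
    val learnedSignificand = 6.6741 // true value: 6.67430
    val learnedExponent = -11.002   // true value: -11
    // Close to, but not exactly, the gravitational constant: the error comes
    // both from the learned parameters and from the floats that store them.
    println(reconstruct(learnedSignificand, learnedExponent))
}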
To put it another way: computers have to approximate real numbers because real numbers are unlimited and computer memory is limited; machine learning models have to approximate the real world because machine learning models can only be so large, but the real-world functions that they predict are usually ‘infinite’ (like the decimals of Pi, or the sine function, which has an infinite number of terms when you do a Taylor series expansion).
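The sine example can be made concrete with a quick Kotlin sketch (sinTaylor is a made-up helper): the Taylor series has infinitely many terms, but any program, or model, can only keep a finite number of them, so the result is always an approximation.

import kotlin.math.sin

// Sum the first 'terms' terms of the Taylor series sin(x) = x - x^3/3! + x^5/5! - ...
fun sinTaylor(x: Double, terms: Int): Double {
    var term = x // first term of the series
    var sum = x
    for (n in 1 until terms) {
        // Each term is the previous one times -x^2 / ((2n)(2n+1)).
        term *= -x * x / ((2 * n) * (2 * n + 1))
        sum += term
    }
    return sum
}

fun main() {
    println(sin(2.0))          // library value: 0.9092974268256817
    println(sinTaylor(2.0, 3)) // 3 terms: noticeably off
    println(sinTaylor(2.0, 8)) // 8 terms: much closer, but still an approximation
}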
Many computer programmers already know to be careful with calculations involving floats, because they ‘hallucinate’. A common example is that this piece of code (in Kotlin)
fun main() {
    val a: Double = 0.1
    val b: Double = 0.2
    println(a + b)
}
prints 0.30000000000000004 (or 0.30000001192092896 for Kotlin/Wasm) instead of 0.3, because of how the Double (a 64-bit float, with high precision and a wide exponent range) is represented in memory. Maybe, in the same way, we just have to live with machine learning models occasionally being wrong.
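The way programmers already ‘live with’ this is instructive: the usual advice is to never compare floats for exact equality, and to compare within a tolerance instead. A minimal Kotlin sketch (the tolerance value here is arbitrary):

import kotlin.math.abs

fun main() {
    val sum = 0.1 + 0.2
    println(sum == 0.3)            // false, because of the representation error
    println(abs(sum - 0.3) < 1e-9) // true: 'equal enough' for most purposes
}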
When it comes to modelling language rules, like in Large Language Models (LLMs), this kind of model is a great fit, because the grammar rules of any human language are finite, and so can be modelled exhaustively. Also, at least in the English language, there are usually multiple acceptable ways to say the same thing, so there is flexibility. The LLM can output consistently good grammar. But when it tries to explain real-world facts based on an understanding modelled with floats, mistakes are still a risk, because the space of ideas is infinite.