The equation for entropy is very often presented in textbooks without much explanation, other than to say it has the desired properties. Here, I attempt an informal derivation of the equation starting from uniform probability distributions.
A good way to think about information is in terms of sending messages. In the video, we send messages about the colour of items drawn from various probability distributions. Each message is coded as a binary number.
The amount of information in a probability distribution is the average number of bits (1s and 0s) needed to send messages about values drawn from this distribution, using the most efficient code possible.
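For example, with four equally likely colours (a uniform distribution), the most efficient code assigns each colour its own 2-bit message (say 00, 01, 10 and 11), so every message is 2 bits long and the average is 2 bits, which is exactly \(-\log_2(1/4) = 2\).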
We’ll see that the log term in the entropy equation is related to the number of bits needed to send a message – that is, the message length. The total entropy is just a weighted sum of message lengths: it’s the average message length.
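Written out, this is the standard Shannon entropy, where \(p_i\) is the probability of value \(i\):

\[ H = -\sum_i p_i \log_2(p_i) = \sum_i p_i \left(-\log_2(p_i)\right) \]

The second form makes the “weighted average of message lengths” reading explicit. Breaking the pieces down: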
- \(\Sigma\) just means “sum”
- \(-\log_2(p_i)\) is the number of bits needed to encode value \(i\) (log base 2, because we are counting bits) – the interpretation is that more frequent values (larger \(p_i\)) should be sent using shorter messages
- \(p_i\) is the weight – we are averaging across all message lengths, weighted by how often that particular message is sent, as the sketch after this list illustrates
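To make the averaging concrete, here is a minimal Python sketch (not code from the video; the colour probabilities are made-up example values) that computes the message length for each colour and then the entropy as their weighted average:

```python
import math

# Example colour distribution (made-up values for illustration):
# half the items are red, a quarter blue, and an eighth each green and yellow.
probabilities = {"red": 0.5, "blue": 0.25, "green": 0.125, "yellow": 0.125}

# Message length for each colour: -log2(p) bits.
# Rarer colours (smaller p) get longer messages.
message_lengths = {colour: -math.log2(p) for colour, p in probabilities.items()}

# Entropy is the average message length,
# weighted by how often each message is sent.
entropy = sum(p * message_lengths[colour] for colour, p in probabilities.items())

for colour, length in message_lengths.items():
    print(f"{colour}: p = {probabilities[colour]}, message length = {length:g} bits")
print(f"entropy = {entropy:g} bits")
```

For this particular distribution the entropy works out to 1.75 bits: red is sent with a 1-bit message half the time, blue with a 2-bit message a quarter of the time, and green and yellow with 3-bit messages an eighth of the time each.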
Related forum topics
To discuss this post, start by searching the forums for “Entropy: understanding the equation”.