
blue 1,0,0,0
black 0,1,0,0
white 0,0,1,0
red 0,0,0,1
Table 4.1: One Hot Encoding of colors
Figure 4.1: Thermometer 8.5/10
memory & wastes computation when used in a dense "way". Furthermore,
every one-hot vector is equally far from every other one, so similar categories
(e.g. similar colors, words, users) are not close together.
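The distance claim can be checked numerically. The following is a minimal sketch, assuming the four colors of Table 4.1 and Euclidean distance as the measure; the variable names are illustrative only.

```python
import numpy as np

# One-hot vectors for blue, black, white, red, as in Table 4.1
one_hot = np.eye(4)

# Pairwise Euclidean distances between all four vectors
diff = one_hot[:, None, :] - one_hot[None, :, :]
dists = np.sqrt((diff ** 2).sum(axis=-1))

# Compare every off-diagonal distance with sqrt(2)
off_diagonal = dists[~np.eye(4, dtype=bool)]
print(np.allclose(off_diagonal, np.sqrt(2)))
```

Under this measure no color ends up closer to any particular other color, whatever their actual similarity.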
An efficient way of encoding is to use an embedding: a lookup table with
an n-dimensional vector for each category. This is efficient because we only
need to retrieve the vectors of the category or categories used in a given
training example. Embeddings can also be pre-trained on data other than that
of the training task, using transfer learning.
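A lookup of this kind can be sketched in a few lines of NumPy. This is an illustration under assumptions, not a trained embedding: the table is filled with random values, and the names `embedding_table` and `lookup` as well as the dimension of 3 are chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

categories = ["blue", "black", "white", "red"]
index = {name: i for i, name in enumerate(categories)}

# One n-dimensional row per category (n = 3 here); in practice these
# values would be learned or pre-trained rather than random
embedding_table = rng.normal(size=(len(categories), 3))

def lookup(names):
    # Retrieve only the rows of the categories present in this example
    rows = [index[name] for name in names]
    return embedding_table[rows]

vectors = lookup(["red", "blue"])
print(vectors.shape)
```

Only the requested rows are read, so the cost of a lookup does not grow with the number of categories in the table.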
4.4 Thermometer Encoding
When we have a feature bounded between two numbers, a model such as linear
regression will only fit a straight line to it and cannot find any other
relation by itself. However, we can apply a non-linear function to the input
data and so still find non-linear relationships in the data.
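To illustrate, the sketch below fits ordinary least squares twice on synthetic data with a quadratic relation: once on the raw input and once with the squared input added as an extra feature. The data, the choice of squaring as the non-linear function, and the helper name `fit_mse` are assumptions made for the example.

```python
import numpy as np

x = np.linspace(0.0, 10.0, 50)
y = (x - 5.0) ** 2  # synthetic target with a non-linear relation to x

def fit_mse(features, target):
    # Ordinary least squares with an intercept column;
    # returns the mean squared error of the fit
    design = np.column_stack([np.ones(len(target)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return np.mean((design @ coef - target) ** 2)

mse_raw = fit_mse(x, y)                                  # straight line in x
mse_squared = fit_mse(np.column_stack([x, x ** 2]), y)   # linear in (x, x^2)
print(mse_raw > mse_squared)
```

The model stays linear in its parameters in both cases; only the input representation changes.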
The idea of thermometer encoding (also called unary encoding) is to transform
one feature into n features, where each feature becomes active once the value
passes a certain threshold. The thresholds are chosen so that each feature
holds roughly the same number of data points.
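One possible way to obtain thresholds that split the data into roughly equal parts is to take quantiles of the observed values. The sketch below assumes this quantile-based choice, which the text does not prescribe; `quantile_thresholds` is a hypothetical helper name and the sample data is synthetic.

```python
import numpy as np

def quantile_thresholds(values, n):
    # n thresholds at the 0/n, 1/n, ..., (n-1)/n quantiles, so that each
    # bucket between consecutive thresholds holds about the same number of points
    return np.quantile(values, np.arange(n) / n)

rng = np.random.default_rng(0)
values = rng.exponential(scale=2.0, size=1000)  # skewed synthetic feature

thresholds = quantile_thresholds(values, 5)

# Count the points in each bucket; the maximum closes the last bucket
counts = np.histogram(values, bins=np.append(thresholds, values.max()))[0]
print(counts)
```

For uniformly distributed data these thresholds coincide with evenly spaced ones; for skewed data they are denser where the data is denser.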
For example, when we have a variable from 0 to 10 and we want to transform
it to a thermometer using a step size of 1, the thresholds are at the values
0, 1, 2, 3, 4, 5, 6, 7, 8, 9.
Between two thresholds, we can interpolate the value so that no information
is removed.
A value of 8.5 can be visualized as a "thermometer" as in Figure 4.1.
An implementation in NumPy:
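    import numpy as np

    def thermometer(x, start, end):
        # x has shape (n, 1); thresholds are one unit apart in [start, end)
        thresholds = np.arange(start, end)
        # 1.0 for every threshold the value has passed, 0.0 otherwise
        thermo = (x > thresholds).astype(float)
        # Bucket of each value; np.minimum keeps x == end in the last bucket
        idx = np.minimum(np.floor(x - start), len(thresholds) - 1)
        idx = idx.astype(int).reshape(len(x))
        # Interpolate inside that bucket (fractional part relative to start)
        thermo[np.arange(len(x)), idx] = x.reshape(len(x)) - start - idx
        return thermo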
Our thermometer function gives the desired result:
>>> thermometer(np.array([[8.5]]), 0, 10)