One Hot Encoding: Encoding Categorical Variables

Joy Tech
Feb 12, 2023
Updated: Feb 13, 2023

Definition:

One-hot encoding is a data pre-processing technique that converts categorical data into a numerical form that machine learning algorithms can process. It creates a new binary column for each unique category. For each row, the column matching that row's category is set to 1 and all other columns are set to 0.

The result is a binary matrix where each row represents an instance in the original data, and each column represents a binary feature.
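
As a minimal sketch of this definition, the function below encodes a single value against a fixed list of categories. The function name one_hot and the sample categories are illustrative assumptions, not part of any library.

```python
def one_hot(value, categories):
    """Return a list with 1 at the position of `value` and 0 elsewhere."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if category == value else 0 for category in categories]


# Hypothetical category list, chosen only for illustration.
sizes = ["small", "medium", "large"]
print(one_hot("medium", sizes))  # [0, 1, 0]
```

Exactly one position in the returned list is 1, which is where the name "one-hot" comes from.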

Simple Example:

Consider a dataset with the following categorical column:

Row	Color
row1	Red
row2	Green
row3	Blue
row4	Red

Using one-hot encoding, we can convert this column into a binary matrix like this:

Row	Color_Red	Color_Green	Color_Blue
row1	1	0	0
row2	0	1	0
row3	0	0	1
row4	1	0	0

In this example, the original data has 4 instances and 1 categorical column, Color. After one-hot encoding, the transformed data is a 4x3 matrix (not counting the Row label column), with one column for each of the 3 unique categories in the original data.
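
The table above can be reproduced with a few lines of plain Python. This is a sketch rather than a definitive implementation: it assumes columns are ordered by first appearance of each category, which happens to match the table. Library encoders such as pandas.get_dummies may order the columns differently.

```python
colors = ["Red", "Green", "Blue", "Red"]  # the Color column from the example

# Collect the unique categories in order of first appearance.
categories = list(dict.fromkeys(colors))

# One row per instance, one 0/1 column per category.
matrix = [[int(color == category) for category in categories] for color in colors]

# Print a tab-separated table with Color_<category> column names.
header = [f"Color_{category}" for category in categories]
print("\t".join(header))
for row in matrix:
    print("\t".join(str(cell) for cell in row))
```

Each row of matrix contains a single 1, and the number of columns equals the number of unique categories.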

For row1, the value Red is represented as [1, 0, 0], following the column order Color_Red, Color_Green, Color_Blue.

For row2, the value Green is represented as [0, 1, 0].

Your turn: What are the representations for row3 (Blue) and row4 (Red)?

This numerical representation lets machine learning algorithms take categorical data as input, so they can identify patterns and make predictions based on the presence or absence of each category.
