One Hot Encoding: Encoding Categorical Variables
- Joy Tech

- Feb 12, 2023
- 1 min read
Updated: Feb 13, 2023
Definition:
One-hot encoding is a process used in data pre-processing to convert categorical data into numerical data that can be easily processed and analyzed by machine learning algorithms. This technique creates a new binary column for each unique category, with the column corresponding to a specific category being marked with a '1' and all other columns marked with a '0'.
The result is a binary matrix where each row represents an instance in the original data, and each column represents a binary feature.
Simple Example:
Consider a dataset with the following categorical column:
| Color |
row1 | Red |
row2 | Green |
row3 | Blue |
row4 | Red |
Using one-hot encoding, we can convert this column into a binary matrix like this:
Row | Color_Red | Color_Green | Color_Blue |
row1 | 1 | 0 | 0 |
row2 | 0 | 1 | 0 |
row3 | 0 | 0 | 1 |
row4 | 1 | 0 | 0 |
In this example, the original data has 4 instances and 1 categorical column of color, and after the one-hot encoding, the transformed data is a 4x3 matrix excluding the row column, with 3 columns representing the 3 unique categories in the original data.
For row1, the value, red, would be represented as [1,0,0].
For row2, the value, green, would be represented as [0,1,0].
Your turn: For row3 and row4, what would be the representation for blue and red?
This transformed, numerical representation of the categorical data can be easily processed and analyzed by machine learning algorithms, allowing them to identify patterns and relationships in the data and make predictions based on the presence or absence of certain categories.

Comments