top of page

One Hot Encoding: Encoding Categorical Variables

  • Writer: Joy Tech
    Joy Tech
  • Feb 12, 2023
  • 1 min read

Updated: Feb 13, 2023

Definition:

One-hot encoding is a process used in data pre-processing to convert categorical data into numerical data that can be easily processed and analyzed by machine learning algorithms. This technique creates a new binary column for each unique category, with the column corresponding to a specific category being marked with a '1' and all other columns marked with a '0'.


The result is a binary matrix where each row represents an instance in the original data, and each column represents a binary feature.


Simple Example:

Consider a dataset with the following categorical column:

Color

row1

Red

row2

Green

row3

Blue

row4

Red


Using one-hot encoding, we can convert this column into a binary matrix like this:

Row

Color_Red

Color_Green

Color_Blue

row1

1

0

0

row2

0

1

0

row3

0

0

1

row4

1

0

0


In this example, the original data has 4 instances and 1 categorical column of color, and after the one-hot encoding, the transformed data is a 4x3 matrix excluding the row column, with 3 columns representing the 3 unique categories in the original data.


For row1, the value, red, would be represented as [1,0,0].

For row2, the value, green, would be represented as [0,1,0].


Your turn: For row3 and row4, what would be the representation for blue and red?

This transformed, numerical representation of the categorical data can be easily processed and analyzed by machine learning algorithms, allowing them to identify patterns and relationships in the data and make predictions based on the presence or absence of certain categories.



ree






Comments


Post: Blog2_Post
  • LinkedIn

©2022 by Joy Tech

bottom of page