Python Tutorial: Dealing with categorical features

  • Published 9 Mar 2020
  • Want to learn more? Take the full course at learn.datacamp.com/courses/fe... at your own pace. More than a video, you'll learn hands-on coding & quickly apply skills to your daily work.
    ---
    Categorical variables represent groups that are qualitative in nature. Some examples are colors, such as blue, red, or black, or country of birth, such as Ireland, England, or the USA. While these are easily understood by a human, you will need to encode categorical features as numeric values to use them in your machine learning models.
    As an example, here is a table with the country of residence of different respondents in the Stack Overflow survey. To get from qualitative inputs to quantitative features, one might naively think that assigning every category in a column a number would suffice, for example India could be 1, USA 2, and so on. But these categories are unordered, so imposing this order can greatly hurt the effectiveness of your model.
    Thus, you cannot allocate arbitrary numbers to each category as that would imply some form of ordering in the categories.
    Instead, values can be encoded by creating additional binary features corresponding to whether each value was picked or not as shown in the table on the right.
    In doing so your model can leverage the information of what country is given, without inferring any order between the different options.
    There are two main approaches to representing categorical columns in this way: one-hot encoding and dummy encoding. These are very similar and often confused. In fact, by default, pandas performs one-hot encoding when you use the get_dummies() function.
    One-hot encoding converts n categories into n features, as shown here. You can use the get_dummies() function to one-hot encode columns. The function takes a DataFrame and a list of categorical columns you want converted into one-hot encoded columns, and returns an updated DataFrame with these columns included.
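    The one-hot encoding step can be sketched as follows. The toy DataFrame and its column name are illustrative, not the actual survey data:

    ```python
    import pandas as pd

    # Hypothetical slice of survey responses (illustrative data)
    so_survey_df = pd.DataFrame({'Country': ['India', 'USA', 'France', 'USA']})

    # One-hot encode: n categories become n binary columns,
    # each prefixed with 'C' for readability
    one_hot_df = pd.get_dummies(so_survey_df, columns=['Country'], prefix='C')

    print(one_hot_df.columns.tolist())
    # ['C_France', 'C_India', 'C_USA']
    ```

    Note that get_dummies() sorts the category names alphabetically when creating the new columns.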
    Specifying a prefix with the prefix argument can improve readability, as the letter C has been used for country here.
    On the other hand, dummy encoding creates n-1 features for n categories, omitting the first category. Notice that this time there is no feature for France, the first category. In dummy encoding, the base value, France in this case, is encoded by the absence of all other countries as you can see on the last row here and its value is represented by the intercept. For dummy encoding, you can use the same get_dummies() function with an additional argument, drop_first set to True as shown here.
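    Dummy encoding with the same toy data (again illustrative, not the real survey) only needs the extra drop_first argument:

    ```python
    import pandas as pd

    # Hypothetical slice of survey responses (illustrative data)
    countries = pd.DataFrame({'Country': ['France', 'India', 'USA', 'France']})

    # Dummy encoding drops the first (alphabetically sorted) category
    dummy_df = pd.get_dummies(countries, columns=['Country'],
                              prefix='C', drop_first=True)

    print(dummy_df.columns.tolist())  # ['C_India', 'C_USA']

    # The base category, France, is encoded by a row of all zeros
    print(dummy_df.iloc[0].sum())  # 0
    ```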
    Both of these methods have different advantages. One-hot encoding generally creates more explainable features, as each country has its own weight that can be observed after training. But be aware that one-hot encoding may create features that are entirely collinear, because the same information is represented multiple times.
    Take, for example, a simpler categorical column recording the sex of the survey takers. By recording a 1 for male, whether the person is female is already known whenever the male column is 0. This double representation can lead to instability in your models, and dummy encoding would be more appropriate.
    However, both one-hot encoding and dummy encoding may result in a huge number of columns being created if there are too many different categories in a column. In these cases, you may want to only create columns for the most common values. You can check the number of occurrences of different features in a column using the value_counts() method on a specific column.
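    Counting occurrences looks like this; the values here are made up for illustration:

    ```python
    import pandas as pd

    # Illustrative data with one rare category
    so_survey_df = pd.DataFrame(
        {'Country': ['USA', 'USA', 'India', 'India', 'India', 'Fiji']}
    )

    # value_counts() returns the categories sorted by frequency
    counts = so_survey_df['Country'].value_counts()
    print(counts)
    # India    3
    # USA      2
    # Fiji     1
    ```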
    Once you have your counts of occurrences, you can use them to limit which values you will include, by first creating a mask of the values that occur less than n times. A mask is a list of booleans outlining which values in a column should be affected. First, find the categories that occur less than n times using the index attribute, and wrap this inside the isin() method.
    After you create the mask, you can use it to replace these categories that occur less than n times with a value of your choice as shown here.
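    Putting the mask and the replacement together, a minimal sketch (toy data and the threshold n=2 are assumptions for illustration):

    ```python
    import pandas as pd

    # Illustrative data with one rare category
    so_survey_df = pd.DataFrame(
        {'Country': ['USA', 'USA', 'India', 'India', 'India', 'Fiji']}
    )

    counts = so_survey_df['Country'].value_counts()

    # Mask: True where the row's category occurs fewer than n times (n=2 here)
    mask = so_survey_df['Country'].isin(counts[counts < 2].index)

    # Replace the rare categories with a catch-all label
    so_survey_df.loc[mask, 'Country'] = 'Other'

    print(so_survey_df['Country'].tolist())
    # ['USA', 'USA', 'India', 'India', 'India', 'Other']
    ```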
    Let's put what we've learned into practice and work with some categorical variables.
    #PythonTutorial #Python #DataCamp #Engineering #MachineLearning #categorical #features
  • Science & Technology

COMMENTS • 4

  • @Aldotronix 3 months ago

    the replacement with “other” is clever ngl

  • @marcus_dempsey 2 years ago +1

    Perfect explaining! Thank you very much

  • @amirabouamrane7151 4 years ago +3

    thanks i was think that OHE is similar to dummy encoding truly thanks

  • @yoyo-gv8zs 8 months ago

    Thanks bro mans been on dis for a whole day. Wasted a lot of time today trying to figure dis out still.