Here is a general data science snafu I have seen on multiple occasions. You have some categorical variable with a very high cardinality, say 1,000 categories. Well, we generally represent categorical values as dummy variables. Also, for high-cardinality data we often use PCA to reduce the number of dimensions. So why not one-hot encode the data and then use PCA to solve the problem? As the title of the post says, this does not make sense (as I will show in a bit). Here on the Data Science Stack Exchange we have this advice, and I have gotten this response at a few data science interviews so far. So I figured a blog post on why this does not make sense is in order.

First, here are the libraries we will be using, and some helper snippets to work with sklearn's PCA models.

# Libraries we need and some helper snippets to work with sklearn PCA models
import pandas as pd
from sklearn.decomposition import PCA

# These snippets assume a fitted PCA object, mod, its transformed output np_dat,
# its variance ratios var_rat, and a list of PC names list_PC
Loadings = pd.DataFrame(mod.components_.T, columns=list_PC, index=data.columns)  # loadings with named PC columns
Ve = pd.DataFrame(var_rat, index=list_PC, columns=['VarExplained'])  # nice table of variance explained
Pd_dat = pd.DataFrame(np_dat, columns=list_PC, index=data.index)  # nicer dataframe after transform with named PC columns

Now let's simulate some simple data, where we have 5 categories and each category has the same overall proportion in the data.
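To make that concrete, here is a minimal sketch of the kind of simulation I mean. The sample size, random seed, and category labels are arbitrary choices for illustration, and the Loadings/Ve/Pd_dat frames just reuse the snippet from above.

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Simulate one categorical variable with 5 equally likely categories
rng = np.random.default_rng(10)  # seed and sample size are arbitrary
n = 10000
cat = pd.Series(rng.choice(['a', 'b', 'c', 'd', 'e'], size=n), name='cat')

# One-hot encode and fit PCA on the dummy columns
data = pd.get_dummies(cat, dtype=float)
mod = PCA()
mod.fit(data)

# Pieces used by the helper snippet above
list_PC = ['PC' + str(i + 1) for i in range(mod.n_components_)]
var_rat = mod.explained_variance_ratio_
np_dat = mod.transform(data)

Loadings = pd.DataFrame(mod.components_.T, columns=list_PC, index=data.columns)
Ve = pd.DataFrame(var_rat, index=list_PC, columns=['VarExplained'])
Pd_dat = pd.DataFrame(np_dat, columns=list_PC, index=data.index)

# With equal proportions, the first four components each explain roughly 25% of the
# variance and the last explains ~0 (the dummies sum to 1 in every row)
print(Ve)

Because the dummy columns always sum to one, one component is exactly redundant, and with equal category proportions the remaining components split the variance evenly, which is a first hint at why this approach does not buy you any real dimension reduction.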