I have a dataset with 150+ features, and I want to separate them into text, categorical and numeric features. The categorical and text variables both have the object data type. How do we distinguish between a categorical and a text variable? Is there a threshold value for a categorical variable?
There is no clear distinction between categories and text. However, if you want to understand if a particular feature is categorical you can do a simple test.
For example, if you are using pandas, you can call value_counts() or unique() on a feature. If the number of distinct values is comparable to the size of the dataset, the feature is not categorical.
The same test applies to numeric columns. A numeric feature with few distinct values may also be ordinal, meaning its values have a clear ordering, e.g. T-shirt sizes.
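The test described above can be sketched in pandas as follows. The function name and both thresholds (max_unique and max_unique_ratio) are assumptions chosen for illustration, since there is no standard cut-off; tune them to your data.

```python
import pandas as pd
from pandas.api.types import is_object_dtype, is_string_dtype


def split_object_columns(df, max_unique=50, max_unique_ratio=0.05):
    """Split object/string columns into likely-categorical and likely-text.

    Both thresholds are illustrative assumptions, not established values.
    """
    categorical, text = [], []
    n_rows = max(len(df), 1)
    for col in df.columns:
        # Only look at object/string columns; numeric columns are skipped.
        if not (is_object_dtype(df[col]) or is_string_dtype(df[col])):
            continue
        n_unique = df[col].nunique(dropna=True)
        if n_unique <= max_unique and n_unique / n_rows <= max_unique_ratio:
            categorical.append(col)
        else:
            text.append(col)
    return categorical, text


# Hypothetical data: one low-cardinality column, one free-text column.
example = pd.DataFrame({
    "colour": ["red", "green", "blue", "red"] * 25,
    "comment": [f"free text number {i}" for i in range(100)],
    "price": range(100),
})
categorical_cols, text_cols = split_object_columns(example)
```

Here colour has 3 distinct values in 100 rows and lands in the categorical list, while comment has 100 distinct values and lands in the text list.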
VectorIndexer has the following purpose as I understand it:
In VectorUDT typed columns it converts the values it deems categorical to numerical mappings
However, it operates only on VectorUDT types, and these are necessarily already built from numerical types. We convert categorical features to numerical features because many ML algorithms do not work with non-numerical types. What is the point of mapping numerical values to numerical values, which is what VectorIndexer seems to do?
I'm using this crop agriculture dataset. In order to use it for training a neural network, I preprocessed the data using MinMaxScaler, which scales the data to between 0 and 1. But my dataset also contains categorical columns, because of which I got an error during preprocessing. So I tried encoding the categorical columns using OneHotEncoder and LabelEncoder, but I don't understand what to do with the result.
My aim is to predict "Crop_Damage".
How do I proceed?
Link to the dataset:
https://www.kaggle.com/aniketng21600/crop-damage-information-in-india
You have several options.
You may use one-hot encoding and pass your categorical variable to the network as a one-hot vector.
You may take inspiration from NLP. One-hot vectors are sparse and can be very large (depending on the number of unique values of your categorical variable). Look at techniques such as Word2vec (cat2vec) or GloVe. Both aim to turn a categorical element into a meaningful, non-sparse numeric vector.
Besides these two, Keras offers another way to obtain this numeric vector: the Embedding layer. For example, suppose you have a variable Crop damage with these values:
Huge
Medium
Little
First you assign a unique integer to every unique value of your categorical variable.
Huge = 0
Medium = 1
Little = 2
Then you pass the translated categorical values (unique integers) to the Embedding layer. The Embedding layer takes a sequence of unique integers as input and produces a sequence of dense vectors. The values of these vectors are random at first, but during training they are optimized like the regular weights of a neural network. So we can say that during training the neural network builds a vector representation of the categories according to the loss function.
For me, the Embedding layer is the easiest way to obtain a good enough vector representation of categorical variables. But you can try one-hot encoding first and see whether the accuracy satisfies you.
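To make the lookup concrete, here is a minimal NumPy sketch of what an embedding layer does in its forward pass only. It is not Keras code and it does no training; in Keras, an Embedding(input_dim=3, output_dim=4) layer would hold a matrix like this one and update it during training. The embedding size of 4 and the observed values are assumptions for illustration.

```python
import numpy as np

# Integer mapping from the text above: Huge = 0, Medium = 1, Little = 2.
categories = ["Huge", "Medium", "Little"]
category_to_index = {name: i for i, name in enumerate(categories)}

embedding_dim = 4  # assumed size of each dense vector
rng = np.random.default_rng(0)

# One row per category, randomly initialised; training would adjust these.
embedding_matrix = rng.normal(size=(len(categories), embedding_dim))

# Hypothetical column of observed category values.
observed = ["Medium", "Huge", "Medium", "Little"]
indices = np.array([category_to_index[value] for value in observed])

# The "layer" is just a row lookup: one dense vector per input integer.
dense_vectors = embedding_matrix[indices]
```

Rows that share a category (the two Medium entries) get the same vector, which is what lets the network learn one representation per category.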
Here is a one-hot encoder. df is the data frame you are working with, and column is the name of the column you want to encode. prefix is a string that is prepended to the names of the dummy columns that pandas creates. The new dummy columns are appended to the data frame, and the original column is then dropped.
There is an excellent series of videos on encoding data frames and other topics on YouTube here.
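import pandas as pd

def onehot_encode(df, column, prefix):
    # Work on a copy so the caller's data frame is not modified.
    df = df.copy()
    # One dummy column per unique value, named "<prefix>_<value>".
    dummies = pd.get_dummies(df[column], prefix=prefix)
    df = pd.concat([df, dummies], axis=1)
    # Drop the original column now that it is encoded.
    df = df.drop(column, axis=1)
    return df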
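As an aside, pd.get_dummies can also encode several columns in one call through its columns parameter, dropping the originals as it goes. The sketch below uses a made-up data frame; the column names and prefixes are hypothetical and not taken from the crop dataset.

```python
import pandas as pd

# Hypothetical data frame with two categorical columns and one numeric column.
df = pd.DataFrame({
    "crop": ["rice", "wheat", "rice"],
    "soil": ["clay", "sand", "clay"],
    "doses": [0, 20, 10],
})

# Encode both categorical columns at once; the originals are dropped
# and the numeric column is left untouched.
encoded = pd.get_dummies(df, columns=["crop", "soil"], prefix=["c", "s"])
```

The result keeps doses and adds c_rice, c_wheat, s_clay and s_sand.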
In my situation, I would like to encode around 5 different columns in my dataset but the issue is that these 5 columns have many unique values.
If I encode them using a label encoder, I add an unnecessary order that is not right, whereas if I use OHE or pd.get_dummies, I end up with a lot of features that add too much sparseness to the data.
I am currently dealing with a supervised learning problem and the following are the unique values per column:
Job_Role : Unique categorical values = 29
Country : Unique categorical values = 12
State : Unique categorical values = 14
Segment : Unique categorical values = 12
Unit : Unique categorical values = 10
I have already looked into multiple references but am not sure about the best approach. What should I do in this situation to have the fewest features with the maximum positive impact on my model?
As far as I know, OneHotEncoder is the usual choice for these cases, but as you said, your data has many unique values. I looked for a solution for a project before and found the following approaches:
OneHotEncoder + PCA: I think this approach is not quite right, because PCA is designed for continuous variables.[*]
Entity Embeddings: I don't know this approach very well, but you can check it via the link in the title.
BinaryEncoder: I think this is useful when you have a large number of categories, where one-hot encoding would increase the number of dimensions and, in turn, model complexity. Binary encoding is a good way to encode categorical variables with fewer dimensions.
There are some other solutions in the category_encoders library.
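To show the idea behind binary encoding, here is a hand-rolled pandas sketch. It is not the category_encoders implementation, whose details (such as the starting index and handling of missing values) may differ; the function name binary_encode is hypothetical, and the sketch assumes the column has no missing values.

```python
import pandas as pd


def binary_encode(series, prefix):
    """Encode a categorical Series as the binary digits of its integer code.

    Assumes no missing values (pd.factorize would give them the code -1).
    """
    codes, uniques = pd.factorize(series)  # codes run from 0 to n_unique - 1
    # Number of bits needed to represent the largest code.
    n_bits = max(1, (len(uniques) - 1).bit_length())
    # Most significant bit first: <prefix>_4, <prefix>_3, ..., <prefix>_0.
    bits = {
        f"{prefix}_{bit}": (codes >> bit) & 1
        for bit in reversed(range(n_bits))
    }
    return pd.DataFrame(bits, index=series.index)


# Hypothetical column with 29 distinct values, like Job_Role in the question.
job_roles = pd.Series([f"role_{i}" for i in range(29)])
encoded = binary_encode(job_roles, "Job_Role")
```

With 29 categories this gives 5 columns instead of 29. Applied to all five columns in the question (29, 12, 14, 12 and 10 unique values), it would give 5 + 4 + 4 + 4 + 4 = 21 columns, compared with 77 for one-hot encoding. The trade-off is that categories share bits, so the individual columns are not as easy to interpret.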
What is the best method to get the simple descriptive statistics of any column in a dataframe (or list or array), be it nested or not, a sort of advanced df.describe() that also includes nested structures with numerical values.
In my case, I have a dataframe with many columns. Some columns have a numerical list in each row (in my case a time series structure), which is a nested structure.
By nested structures I mean:
list of arrays,
array of arrays,
series of lists,
dataframe with nested lists of numerical values in some columns (my case)
How to get the simple descriptive statistics from any level of the nested structure in one go?
Asking for
df.describe()
will give me just the statistics of the numerical columns, but not those of the columns that include a list with numerical values.
I cannot get the statistics just by applying
from scipy import stats
stats.describe(arr)
either, which is the solution given in "How can I get descriptive statistics of a NumPy array?" for a non-nested array.
My first approach would be to get the statistics of each numerical list first, and then take the statistics of that again, e.g. the mean of the mean or the mean of the variance would then give me some information as well.
In my first approach here, I convert a specific column that has a nested list of numerical values to a series of nested lists first. Nested arrays or lists might need a small adjustment, not tested.
NESTEDSTRUCTURE = df['nestedColumn']
[stats.describe([d[i] for d in [stats.describe(row) for row in NESTEDSTRUCTURE]]) for i in range(6)]
gives you the stats of the stats for a nested structure column. If you want the mean of all means of a column, you can use
stats.describe([a[2] for a in [stats.describe(x) for x in NESTEDSTRUCTURE]])
as position 2 stands for "mean" in
DescribeResult(nobs=, minmax=(, ), mean=, variance=, skewness=, kurtosis=)
I expect there is a better descriptive statistics approach that automatically understands nested structures with numerical values; this is just a workaround.
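One possible alternative, sketched here with pandas only, is to run describe() on each list and then on the resulting table, or to flatten the lists with explode() and describe the pooled values. The small data frame is made up for illustration and reuses the nestedColumn name from above; this is a sketch for a single level of nesting, not a general solution for arbitrary depth.

```python
import pandas as pd

# Hypothetical data frame: each row holds a list of numbers (e.g. a time series).
df = pd.DataFrame({
    "id": [1, 2, 3],
    "nestedColumn": [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0, 7.0, 8.0, 9.0]],
})

# Statistics per row: one line of describe() output for each list.
per_row = df["nestedColumn"].apply(lambda values: pd.Series(values).describe())

# Statistics of the per-row statistics, e.g. the mean of the means.
stats_of_stats = per_row.describe()

# Pooled statistics: flatten all lists into one long Series first.
pooled = df["nestedColumn"].explode().astype(float).describe()
```

stats_of_stats.loc["mean", "mean"] is the mean of the row means, while pooled["mean"] is the mean over all values; the two differ whenever the lists have different lengths.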
Say I have a bunch of categorical string columns in my dataframe. Then I apply the following transforms:
1. Run StringIndexer on the columns.
2. Use VectorAssembler to assemble all the transformed columns into one vector feature column.
3. Run VectorIndexer on the new vector feature column.
Question: does step 3 make sense, or is it duplicated effort? I think step 1 already did the indexing.
Yes, it makes sense if you are going to use a Spark tree-based algorithm (RandomForestClassifier or GBTClassifier) and you have high-cardinality features.
E.g. for the Criteo dataset, StringIndexer would convert the values in a categorical column to roughly 65,000 distinct integer indices. It saves this in the column metadata as a NominalAttribute. RandomForestClassifier then reads this metadata and treats the column as a categorical feature.
For tree-based algorithms you have to specify the maxBins parameter, which "must be >= 2 and >= number of categories in any categorical feature".
Too high a maxBins value leads to slow performance. To solve this, use VectorIndexer with, for example, .setMaxCategories(64). It then treats as categorical only those features that have at most 64 distinct values; the remaining features are treated as continuous.
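The decision rule can be illustrated outside Spark. The following is a plain NumPy imitation, not Spark code: the function name is hypothetical, and assigning indices in ascending order of value is an assumption of this sketch. It only shows the idea that a column with at most max_categories distinct values is re-indexed as categorical, while any other column is left as continuous.

```python
import numpy as np


def index_categorical_columns(X, max_categories):
    """Imitate the VectorIndexer rule on a 2-D array (illustration only)."""
    X = np.asarray(X, dtype=float)
    out = X.copy()
    category_maps = {}
    for j in range(X.shape[1]):
        distinct = np.unique(X[:, j])  # sorted distinct values
        if len(distinct) <= max_categories:
            # Few distinct values: treat as categorical, map to 0-based indices.
            mapping = {value: idx for idx, value in enumerate(distinct)}
            category_maps[j] = mapping
            out[:, j] = [mapping[value] for value in X[:, j]]
        # Otherwise the column is left unchanged and treated as continuous.
    return out, category_maps


# Column 0 has 3 distinct values, column 1 has 4.
X = np.array([
    [10.0, 0.5],
    [30.0, 1.7],
    [10.0, 2.9],
    [20.0, 3.1],
])
indexed, maps = index_categorical_columns(X, max_categories=3)
```

With max_categories=3, column 0 is mapped to the indices 0, 2, 0, 1 and recorded as categorical, while column 1 is left alone. A column that StringIndexer turned into tens of thousands of indices would likewise fall above the threshold and be treated as continuous, which keeps the required maxBins small.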