I am working with a large tabular dataset that consists of many categorical columns. I want to train a regression model (XGBoost) in this data while using as many regressors as possible.
Because of the size of data, I am using incremental training - where following sklearn API - .fit(X, y) I am not able to fit the entire matrix X into memory and therefore I am training the model in a couple of rows at the time. The problem is that in every batch, the model is expecting the same number of columns in X.
This is where it gets tricky because some variables are categorical it may be that one-hot encoding on a batch of data will same some shape (e.g. 20 columns). However, the next batch will have (26 columns) simply because in the previous batch not every unique level of the categorical feature was present. Sklearn allows for accounting for this and costume function can also be used: To keep some number of columns in matrix X.
import seaborn as sns
import numpy as np
from sklearn.preprocessing import OneHotEncoder
def one_hot_known(dataf, list_levels, col):
"""Creates a dummy coded matrix with as many columns as unique levels"""
return np.array(
[np.eye(len(list_levels))[list_levels.index(i)] for i in dataf[col]])
# Load Some Dataset with categorical variable
df_orig = sns.load_dataset('tips')
# List of unique levels - known apriori
day_level = list(df_orig['day'].unique())
# Image, we have a batch of data (subset of original data) and one categorical level (DAY) is not present here
df = df_orig.loc[lambda d: d['day'] != 'Sun']
# Missing category is filled with 0 and in next batch, if present its columns will have 1.
OneHotEncoder(categories = [day_level], sparse=False).fit_transform(np.array(df['day']).reshape(-1, 1))
#Costum function, can be used in incremental(data batches chunk fashion)
one_hot_known(df, day_level, 'day')
What I would like to do not is to utilize the TargerEncoding approach, so that we do not have matrix X with a huge number of columns. However, it still needs to be done in an Incremental fashion, just like the OneHot Encoding above.
I am writing this as a post because I know this is very useful to many people and would like to know how to utilize the same strategy for TargetEncoding.
I am aware that Deep Learning allows for Embedding layers, which represent categorical features in continuous space but I would like to apply TargetEncoding.
Related
I have a dataset with 10000 samples, where the classes are present in an ordered manner. First I loaded the data into an ImageFolder, then into a DataLoader, and I want to split this dataset into a train-val-test set. I know the DataLoader class has a shuffle parameter, but thats not good for me, because it only shuffles the data when enumeration happens on it. I know about the RandomSampler function, but with it, i can only take n amount of data randomly from the dataset, and i have no control of what is being taken out, so one sample might be present in the train,test and val set at the same time.
Is there a way to shuffle the data in a DataLoader? The only thing i need is the shuffle, after that i can subset the data.
The Subset dataset class takes indices (https://pytorch.org/docs/stable/data.html#torch.utils.data.Subset). You can probably exploit that to get this functionality as below. Essentially, you can get away by shuffling the indices and then picking the subset of the dataset.
# suppose dataset is the variable pointing to whole datasets
N = len(dataset)
# generate & shuffle indices
indices = numpy.arange(N)
indices = numpy.random.permutation(indices)
# there are many ways to do the above two operation. (Example, using np.random.choice can be used here too
# select train/test/val, for demo I am using 70,15,15
train_indices = indices [:int(0.7*N)]
val_indices = indices[int(0.7*N):int(0.85*N)]
test_indices = indices[int(0.85*N):]
train_dataset = Subset(dataset, train_indices)
val_dataset = Subset(dataset, val_indices)
test_dataset = Subset(dataset, test_indices)
I have many csv files that has multiple rows and columns which are mostly floating point numbers (some are categorical but one-hot encoded).
Each csv file is the representation of one training example.It contains dependent and independent variables in the same file.
(for example, its not like machine learning problem where each row contains all the information and predicts y1, y2,y3 of that row, its like all the rows combined of x1 to x8
will predict all rows combined of y1 to y3. Hence each csv becomes one training example.
representation of one such csv
** The above image is the representation of one of such csv files
Please note that the length/size of each csv varies.
I want to build a simple ann or any other neural net model. I have problem in processing input data. As each csv is one single training example, in which format should i have to store data to pass to a neural net.
Thanks in advance,
skw
Let's say you have some .csv file all with same data format stored in a folder data.
First you can use glob to read the filenames and use pandas to read the csv and convert to numpy array.
import glob
import pandas as pd
csv = [] # read as numpy array
for f in glob.glob('path/*.csv'):
csv.append(pd.read_csv(f).to_numpy)
print(csv[0].shape)
# it should print (num_rows_csv, 11) # as, 11 columns
# now, first 8 columns are features, and last 3 columns are response
X = []
y = []
for arr in csv:
X.append(arr[0:8])
y.append(arr[8:])
X = np.array(X)
y = np.array(y)
Now, it's easy to train this with CNN, LSTM, any model you want.
I have a data points in a csr numpy matrix and labels in a pandas series.
I want to do down sampling of the dataset.
I tried re-sampling the data points(matrix) and labels(pandas series) separately using same random state.
X4_train_undersampled = resample(X4_train,replace=False, n_samples=41615, random_state=123)
y_train_undersampled = resample(y_train, replace=False , n_samples=41615, random_state=123)
I want to whether this is the right method to do it.
if yes, how can i test if the same rows are sampled in data points and labels.
if No, please provide another way to do down-sampling.
I have a DataFrame like:
text_data worker_dicts outcomes
0 "Some string" {"Sector":"Finance", 0
"State: NJ"}
1 "Another string" {"Sector":"Programming", 1
"State: NY"}
It has both text information, and a column that is a dictionary. (The real worker_dicts has many more fields). I'm interested in the binary outcome column.
What I initially tried doing was to combine both text_data and worker_dict, crudely concatenating both columns, and then running Multinomial NB on that:
df['stacked_features']=df['text_data'].astype(str)+'_'+df['worker_dicts']
stacked_features = np.array(df['stacked_features'])
outcomes = np.array(df['outcomes'])
text_clf = Pipeline([('vect', TfidfVectorizer(stop_words='english'), ngram_range = (1,3)),
('clf', MultinomialNB())])
text_clf = text_clf.fit(stacked_features, outcomes)
But I got very bad accuracy, and I think that fitting two independent models would be a better use of data than fitting one model on both types of features (as I am doing with stacking).
How would I go about utilizing Feature Union? worker_dicts is a little weird because it's a dictionary, so I'm very confused as to how I'd go about parsing that.
If your dictionary entries are categorical as they appear to be in your example, then I would create different columns from the dictionary entries before doing additional processing.
new_features = pd.DataFrame(df['worker_dicts'].values.tolist())
Then new_features will be its own dataframe with columns Sector and State and you can one hot encode those as needed in addition to TFIDF or other feature extraction for your text_data column. In order to use that in a pipeline, you would need to create a new transformer class, so I might suggest just applying the dictionary parsing and the TFIDF separately, then stacking the results, and adding OneHotEncoding to your pipeline as that allows you to specify columns to apply the transformer to. (As the categories you want to encode are strings you may want to use LabelBinarizer class instead of OneHotEncoder class for the encoding transformation.)
If you want to just use TFIDF on all of the columns individually with a pipeline, you would need to use a nested Pipeline and FeatureUnion set up to extract columns as described here.
If you have your one hot encoded features in dataframes X1 and X2 as described below and your text features in X3, you could do something like the following to create a pipeline. (There are many other options, this is just one way)
X = pd.concat([X1, X2, X3], axis=1)
def select_text_data(X):
return X['text_data']
def select_remaining_data(X):
return X.drop('text_data', axis=1)
# pipeline to get all tfidf and word count for first column
text_pipeline = Pipeline([
('column_selection', FunctionTransformer(select_text_data, validate=False)),
('tfidf', TfidfVectorizer())
])
final_pipeline = Pipeline([('feature-union', FeatureUnion([('text-features', text_pipeline),
('other-features', FunctionTransformer(select_remaining_data))
])),
('clf', LogisticRegression())
])
(MultinomialNB won't work in the pipeline because it doesn't have fit and fit_transform methods)
This is a very common process in Machine Learning.
I have a dataset and I split it into training set and test set.
Since I apply some normalizing and standardization to the training set,
I would like to use the same info of the training set (mean/std/min/max
values of each feature), to apply the normalizing and standardization
to the test set too. Do you know any optimal way to do that?
I am aware of the functions of MinMaxScaler, StandardScaler etc..
You can achieve this via a few lines of code on both the training and test set.
On the training side there are two approaches:
MultivariateStatisticalSummary
http://spark.apache.org/docs/latest/mllib-statistics.html
val summary: MultivariateStatisticalSummary = Statistics.colStats(observations)
println(summary.mean) // a dense vector containing the mean value for each column
println(summary.variance) // column-wise variance
println(summary.numNonzeros) // number of nonzeros in each
Using SQL
from pyspark.sql.functions import mean, min, max
In [6]: df.select([mean('uniform'), min('uniform'), max('uniform')]).show()
+------------------+-------------------+------------------+
| AVG(uniform)| MIN(uniform)| MAX(uniform)|
+------------------+-------------------+------------------+
|0.5215336029384192|0.19657711634539565|0.9970412477032209|
+------------------+-------------------+------------------+
On the testing data - you can then manually "normalize the data using the statistics obtained above from the training data. You can decide in which sense you wish to normalize: e.g.
Student's T
val normalized = testData.map{ m =>
(m - trainMean) / trainingSampleStddev
}
Feature Scaling
val normalized = testData.map{ m =>
(m - trainMean) / (trainMax - trainMin)
}
There are others: take a look at https://en.wikipedia.org/wiki/Normalization_(statistics)