SkLearn: Feature Union with a dictionary and text data - python-3.x

I have a DataFrame like:
   text_data         worker_dicts                            outcomes
0  "Some string"     {"Sector":"Finance", "State: NJ"}       0
1  "Another string"  {"Sector":"Programming", "State: NY"}   1
It has both text information, and a column that is a dictionary. (The real worker_dicts has many more fields). I'm interested in the binary outcome column.
What I initially tried doing was to combine both text_data and worker_dict, crudely concatenating both columns, and then running Multinomial NB on that:
df['stacked_features'] = df['text_data'].astype(str) + '_' + df['worker_dicts'].astype(str)
stacked_features = np.array(df['stacked_features'])
outcomes = np.array(df['outcomes'])
# ngram_range is an argument of TfidfVectorizer, so it belongs inside its parentheses
text_clf = Pipeline([('vect', TfidfVectorizer(stop_words='english', ngram_range=(1, 3))),
                     ('clf', MultinomialNB())])
text_clf = text_clf.fit(stacked_features, outcomes)
But I got very bad accuracy, and I think that fitting two independent models would be a better use of data than fitting one model on both types of features (as I am doing with stacking).
How would I go about utilizing Feature Union? worker_dicts is a little weird because it's a dictionary, so I'm very confused as to how I'd go about parsing that.

If your dictionary entries are categorical as they appear to be in your example, then I would create different columns from the dictionary entries before doing additional processing.
new_features = pd.DataFrame(df['worker_dicts'].values.tolist())
Then new_features will be its own DataFrame with columns Sector and State, and you can one-hot encode those as needed, in addition to TF-IDF or other feature extraction for your text_data column. To do the dictionary parsing inside a pipeline you would need to create a new transformer class, so I would suggest applying the dictionary parsing and the TF-IDF separately, stacking the results, and then adding one-hot encoding to your pipeline, as that allows you to specify the columns to apply the transformer to. (As the categories you want to encode are strings, you may want to use the LabelBinarizer class instead of the OneHotEncoder class for the encoding transformation.)
If you want to just use TFIDF on all of the columns individually with a pipeline, you would need to use a nested Pipeline and FeatureUnion set up to extract columns as described here.
If you have your one hot encoded features in dataframes X1 and X2 as described below and your text features in X3, you could do something like the following to create a pipeline. (There are many other options, this is just one way)
X = pd.concat([X1, X2, X3], axis=1)

def select_text_data(X):
    return X['text_data']

def select_remaining_data(X):
    return X.drop('text_data', axis=1)

# pipeline that selects the text column and computes its TF-IDF features
text_pipeline = Pipeline([
    ('column_selection', FunctionTransformer(select_text_data, validate=False)),
    ('tfidf', TfidfVectorizer())
])

# validate=False keeps X a DataFrame, so X.drop works inside the selector
final_pipeline = Pipeline([
    ('feature-union', FeatureUnion([
        ('text-features', text_pipeline),
        ('other-features', FunctionTransformer(select_remaining_data, validate=False))
    ])),
    ('clf', LogisticRegression())
])
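(MultinomialNB can only be used as the final estimator of the pipeline, not as an intermediate step, because it has no transform or fit_transform method. As the final step it also requires every feature to be non-negative, which holds for TF-IDF values and one-hot indicators.)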
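As a rough sketch of the FeatureUnion idea applied directly to a dictionary column, the example below uses DictVectorizer, which one-hot encodes string values as "key=value" columns. The toy rows, the helper names select_text and select_dicts, and the assumption that worker_dicts holds well-formed key/value dictionaries are all made up for illustration; this is one possible arrangement, not necessarily what the answer had in mind.

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer

# Toy data shaped like the question's DataFrame (values are invented).
df = pd.DataFrame({
    "text_data": ["quarterly earnings report", "python code review",
                  "bank loan audit", "software bug fix"],
    "worker_dicts": [{"Sector": "Finance", "State": "NJ"},
                     {"Sector": "Programming", "State": "NY"},
                     {"Sector": "Finance", "State": "NY"},
                     {"Sector": "Programming", "State": "NJ"}],
    "outcomes": [0, 1, 0, 1],
})

def select_text(frame):
    # Hypothetical helper: hand the text column to TfidfVectorizer as a 1-D Series.
    return frame["text_data"]

def select_dicts(frame):
    # Hypothetical helper: hand the dictionary column to DictVectorizer.
    return frame["worker_dicts"]

features = FeatureUnion([
    ("text", Pipeline([("select", FunctionTransformer(select_text, validate=False)),
                       ("tfidf", TfidfVectorizer())])),
    ("dicts", Pipeline([("select", FunctionTransformer(select_dicts, validate=False)),
                        ("onehot", DictVectorizer())])),
])

model = Pipeline([("features", features), ("clf", LogisticRegression())])

X = df[["text_data", "worker_dicts"]]
model.fit(X, df["outcomes"])
predictions = model.predict(X)
combined = model.named_steps["features"].transform(X)  # TF-IDF and one-hot columns side by side
```

Both branches return sparse matrices, so the union stacks them column-wise without densifying the TF-IDF output.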

Related

In PyTorch, how can I shuffle a DataLoader?

I have a dataset with 10000 samples, where the classes are present in an ordered manner. First I loaded the data into an ImageFolder, then into a DataLoader, and I want to split this dataset into train, validation, and test sets. I know the DataLoader class has a shuffle parameter, but that's no good for me, because it only shuffles the data when the loader is iterated. I know about RandomSampler, but with it I can only take n samples randomly from the dataset, and I have no control over what is taken out, so one sample might be present in the train, test, and val sets at the same time.
Is there a way to shuffle the data in a DataLoader? The only thing I need is the shuffle; after that I can subset the data.
The Subset dataset class takes indices (https://pytorch.org/docs/stable/data.html#torch.utils.data.Subset). You can probably exploit that to get this functionality as below. Essentially, you can get away by shuffling the indices and then picking the subset of the dataset.
import numpy
from torch.utils.data import Subset

# suppose dataset is the variable pointing to the whole dataset
N = len(dataset)
# generate & shuffle indices
indices = numpy.arange(N)
indices = numpy.random.permutation(indices)
# there are many ways to do the above two operations (numpy.random.choice can be used here too, for example)
# select train/val/test; for the demo I am using 70/15/15
train_indices = indices[:int(0.7*N)]
val_indices = indices[int(0.7*N):int(0.85*N)]
test_indices = indices[int(0.85*N):]
train_dataset = Subset(dataset, train_indices)
val_dataset = Subset(dataset, val_indices)
test_dataset = Subset(dataset, test_indices)
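To make the "no sample in two sets" property explicit, here is a small standard-library sketch of the same shuffle-then-slice idea with a fixed seed. The function name split_indices and its parameters are hypothetical; it only produces index lists and does not touch PyTorch.

```python
import random

def split_indices(n, train_frac=0.7, val_frac=0.15, seed=0):
    """Shuffle range(n) once and cut it into disjoint train/val/test index lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # seeded, so the split is reproducible
    train_end = int(train_frac * n)
    val_end = int((train_frac + val_frac) * n)
    return indices[:train_end], indices[train_end:val_end], indices[val_end:]

train_idx, val_idx, test_idx = split_indices(10000)

# Every index lands in exactly one of the three lists.
all_idx = train_idx + val_idx + test_idx
```

Because the three lists are slices of one permutation, they cannot overlap, and each can be passed to Subset as in the answer above.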

Incremental OneHotEncoding and Target Encoding

I am working with a large tabular dataset that consists of many categorical columns. I want to train a regression model (XGBoost) in this data while using as many regressors as possible.
Because of the size of the data, I am using incremental training. Following the sklearn API, .fit(X, y), I am not able to fit the entire matrix X into memory, so I am training the model a few rows at a time. The problem is that in every batch, the model expects the same number of columns in X.
This is where it gets tricky: because some variables are categorical, one-hot encoding one batch of data may produce some shape (e.g. 20 columns), while the next batch produces 26 columns, simply because not every unique level of the categorical feature was present in the previous batch. Sklearn allows for accounting for this, and a custom function can also be used, to keep the same number of columns in matrix X:
import seaborn as sns
import numpy as np
from sklearn.preprocessing import OneHotEncoder

def one_hot_known(dataf, list_levels, col):
    """Creates a dummy coded matrix with as many columns as unique levels"""
    return np.array(
        [np.eye(len(list_levels))[list_levels.index(i)] for i in dataf[col]])

# Load some dataset with a categorical variable
df_orig = sns.load_dataset('tips')
# List of unique levels - known a priori
day_level = list(df_orig['day'].unique())
# Imagine we have a batch of data (a subset of the original data) and one categorical level (day) is not present here
df = df_orig.loc[lambda d: d['day'] != 'Sun']
# The missing category is filled with 0, and in the next batch, if present, its column will have 1.
# (scikit-learn 1.2 renamed the sparse argument to sparse_output; use sparse_output=False there)
OneHotEncoder(categories=[day_level], sparse=False).fit_transform(np.array(df['day']).reshape(-1, 1))
# Custom function, can be used incrementally (on batches/chunks of data)
one_hot_known(df, day_level, 'day')
What I would like to do now is to use the target encoding approach, so that we do not end up with a matrix X with a huge number of columns. However, it still needs to be done in an incremental fashion, just like the one-hot encoding above.
I am writing this as a post because I know this is very useful to many people, and I would like to know how to apply the same strategy to target encoding.
I am aware that Deep Learning allows for Embedding layers, which represent categorical features in continuous space but I would like to apply TargetEncoding.
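One way this could be approached, sketched under assumptions: a mean target encoding only needs a running sum and count per level, and both can be updated batch by batch. The class IncrementalTargetEncoder, its partial_fit/transform method names, and the smoothing pseudo-count are all hypothetical choices for illustration, not an existing library API.

```python
from collections import defaultdict

class IncrementalTargetEncoder:
    """Mean target encoding from running per-level sums and counts (hypothetical helper)."""

    def __init__(self, smoothing=10.0):
        self.smoothing = smoothing       # pseudo-count pulling rare levels toward the global mean
        self.sums = defaultdict(float)   # level -> running sum of the target
        self.counts = defaultdict(int)   # level -> number of rows seen so far
        self.total_sum = 0.0
        self.total_count = 0

    def partial_fit(self, levels, targets):
        # Update the statistics with one batch; earlier batches are never revisited.
        for level, y in zip(levels, targets):
            self.sums[level] += y
            self.counts[level] += 1
            self.total_sum += y
            self.total_count += 1
        return self

    def transform(self, levels):
        # Smoothed mean per level; unseen levels fall back to the global mean.
        global_mean = self.total_sum / self.total_count
        encoded = []
        for level in levels:
            n = self.counts.get(level, 0)
            s = self.sums.get(level, 0.0)
            encoded.append((s + self.smoothing * global_mean) / (n + self.smoothing))
        return encoded

encoder = IncrementalTargetEncoder(smoothing=1.0)
encoder.partial_fit(["Thur", "Fri", "Thur"], [2.0, 4.0, 4.0])  # first batch
encoder.partial_fit(["Sun", "Fri"], [6.0, 4.0])                # second batch adds a new level
encoded = encoder.transform(["Thur", "Fri", "Sun", "Sat"])     # "Sat" was never seen
```

Each categorical column becomes a single numeric column, so the width of X no longer depends on which levels appear in a batch. Encoding a row with statistics that include its own target leaks information, so a first pass that only accumulates statistics, followed by a second pass that encodes and trains, may be the safer arrangement.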

SparkML: Pipeline predictions have fewer records than the input

How can I find out -- inside a pipeline -- which records are skipped or dropped from the transformation?
I have a pipeline which is like the following:
StringIndexer
OneHotEncoderEstimator
(repeat above for all categorical cols)
VectorAssembler (collecting all encoded and raw numeric cols)
LogisticRegression
Then:
model = pipeline.fit(train)
predicted = model.transform(test)
test.count()
8092
predicted.count()
8091
One record is missing and I'd like to find out which one.
thanks
The handleInvalid option of your StringIndexer is likely set to skip.
You can change this option to error, and the transform will then fail on labels that were not seen during fitting. As of Spark 2.2 you can also use the option keep, which keeps the rows with unknown labels and puts them in a separate bucket:
string_indexer = StringIndexer(inputCol="label", outputCol="indexed", handleInvalid='keep')
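The three handleInvalid behaviours can be illustrated with a toy, pure-Python model of the indexer. This is not Spark code: index_labels is a hypothetical function, and the alphabetical tie-break between equally frequent labels is an assumption made to keep the example deterministic.

```python
def index_labels(train_values, test_values, handle_invalid="error"):
    """Toy model of StringIndexer: fit on train_values, then index test_values."""
    counts = {}
    for v in train_values:
        counts[v] = counts.get(v, 0) + 1
    # Most frequent label gets index 0; ties broken alphabetically (assumption).
    labels = sorted(counts, key=lambda v: (-counts[v], v))
    index = {label: float(i) for i, label in enumerate(labels)}

    out = []
    for v in test_values:
        if v in index:
            out.append((v, index[v]))
        elif handle_invalid == "skip":
            continue                               # the row silently disappears
        elif handle_invalid == "keep":
            out.append((v, float(len(labels))))    # extra bucket for unseen labels
        else:
            raise ValueError(f"Unseen label: {v}")
    return out

train = ["a", "b", "a", "c"]
test = ["a", "d", "c"]

skipped = index_labels(train, test, "skip")   # one row fewer than the input
kept = index_labels(train, test, "keep")      # same row count as the input
unseen = sorted(set(test) - set(train))       # the values responsible for dropped rows
```

The last line carries over to the original problem: comparing the distinct values of each categorical column in test against those in train shows which values, and therefore which rows, were skipped. In Spark itself, a left anti join of test against predicted on a key column should surface the same rows.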

What is StringIndexer, VectorIndexer, and how to use them?

Dataset<Row> dataFrame = ... ;
StringIndexerModel labelIndexer = new StringIndexer()
    .setInputCol("label")
    .setOutputCol("indexedLabel")
    .fit(dataFrame);
VectorIndexerModel featureIndexer = new VectorIndexer()
    .setInputCol("s")
    .setOutputCol("indexedFeatures")
    .setMaxCategories(4)
    .fit(dataFrame);
IndexToString labelConverter = new IndexToString()
    .setInputCol("prediction")
    .setOutputCol("predictedLabel")
    .setLabels(labelIndexer.labels());
What is StringIndexer, VectorIndexer, IndexToString and what is the difference between them? How and When should I use them?
StringIndexer: use it if you want the machine learning algorithm to treat a column as a categorical variable, or if you want to convert textual data to numeric data while keeping the categorical context.
e.g. converting days (Monday, Tuesday, ...) to a numeric representation.
VectorIndexer: use this if you do not know the types of the incoming data, so that the logic of differentiating between categorical and non-categorical data is left to the algorithm.
e.g. data coming from a 3rd-party API, where the data is hidden and is ingested directly into the training model.
IndexToString: just the opposite of StringIndexer; use this if the final output column was indexed using StringIndexer and you now want to convert its numeric representation back to text so that the result can be understood better.
I know only about those two:
StringIndexer and VectorIndexer
StringIndexer:
converts a single column to an index column (similar to a factor column in R)
VectorIndexer:
is used to index categorical predictors in a featuresCol column. Remember that featuresCol is a single column consisting of vectors (refer to featuresCol and labelCol). Each row is a vector which contains the values of each predictor.
If you have string-type predictors, you will first need to index those columns with StringIndexer. featuresCol contains vectors, and vectors do not contain string values.
Take a look here for example: https://mingchen0919.github.io/learning-apache-spark/StringIndexer-and-VectorIndexer.html
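As a rough mental model of the difference, the pure-Python sketch below mimics what VectorIndexer decides with maxCategories and what IndexToString undoes. It is not Spark code; the function names are hypothetical, and mapping the distinct values in sorted order is an assumption made for simplicity.

```python
def fit_vector_indexer(rows, max_categories):
    """Toy VectorIndexer: decide per feature position whether it is categorical."""
    category_maps = {}
    for j in range(len(rows[0])):
        distinct = sorted({row[j] for row in rows})
        if len(distinct) <= max_categories:
            # Few distinct values: treat as categorical and re-index to 0..k-1.
            category_maps[j] = {value: float(i) for i, value in enumerate(distinct)}
        # Otherwise the feature is left untouched as a continuous value.
    return category_maps

def transform_vectors(rows, category_maps):
    return [[category_maps[j][x] if j in category_maps else x
             for j, x in enumerate(row)] for row in rows]

rows = [[1.0, 0.5], [3.0, 1.7], [1.0, 2.9], [5.0, 4.1], [3.0, 5.3]]
maps = fit_vector_indexer(rows, max_categories=4)
indexed = transform_vectors(rows, maps)

# StringIndexer / IndexToString round trip on a single string column.
labels = ["Monday", "Tuesday", "Wednesday"]
to_index = {label: float(i) for i, label in enumerate(labels)}
round_trip = [labels[int(to_index[day])] for day in ["Tuesday", "Monday"]]
```

Here the first feature has three distinct values, so it is re-indexed as categorical, while the second has five and stays continuous.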

Does Spark.ml LogisticRegression assumes numerical features only?

I was looking at the Spark 1.5 dataframe/row api and the implementation for the logistic regression. As I understand, the train method therein first converts the dataframe to RDD[LabeledPoint] as,
override protected def train(dataset: DataFrame): LogisticRegressionModel = {
  // Extract columns from data. If dataset is persisted, do not persist oldDataset.
  val instances = extractLabeledPoints(dataset).map {
    case LabeledPoint(label: Double, features: Vector) => (label, features)
  }
  ...
And then it proceeds to feature standardization, etc.
What I am confused with is, the DataFrame is of type RDD[Row] and Row is allowed to have any valueTypes, for e.g. (1, true, "a string", null) seems a valid row of a dataframe. If that is so, what does the extractLabeledPoints above mean? It seems it is selecting only Array[Double] as the feature values in Vector. What happens if a column in the data-frame was strings? Also, what happens to the integer categorical values?
Thanks in advance,
Nikhil
Let's ignore Spark for a moment. Generally speaking, linear models, including logistic regression, expect numeric independent variables. This is not in any way specific to Spark / MLlib. If the input contains categorical or ordinal variables, these have to be encoded first. Some languages, like R, handle this in a transparent manner:
> df <- data.frame(x1 = c("a", "b", "c", "d"), y=c("aa", "aa", "bb", "bb"))
> glm(y ~ x1, df, family="binomial")
Call: glm(formula = y ~ x1, family = "binomial", data = df)
Coefficients:
(Intercept)          x1b          x1c          x1d
 -2.357e+01   -4.974e-15    4.713e+01    4.713e+01
...
but what is really used behind the scenes is a so-called design matrix:
> model.matrix( ~ x1, df)
  (Intercept) x1b x1c x1d
1           1   0   0   0
2           1   1   0   0
3           1   0   1   0
4           1   0   0   1
...
Skipping over the details it is the same type of transformation as the one performed by the OneHotEncoder in Spark.
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}
val df = sqlContext.createDataFrame(Seq(
  Tuple1("a"), Tuple1("b"), Tuple1("c"), Tuple1("d")
)).toDF("x").repartition(1)
val indexer = new StringIndexer()
  .setInputCol("x")
  .setOutputCol("xIdx")
  .fit(df)
val indexed = indexer.transform(df)
val encoder = new OneHotEncoder()
  .setInputCol("xIdx")
  .setOutputCol("xVec")
val encoded = encoder.transform(indexed)
encoded
  .select($"xVec")
  .map(_.getAs[Vector]("xVec").toDense)
  .foreach(println)
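The relationship between R's design matrix and a one-hot encoding can be sketched in plain Python. The helper dummy_code is hypothetical, and the levels are taken in alphabetical order for simplicity, whereas Spark orders categories by the index that StringIndexer assigned.

```python
def dummy_code(values, levels, drop):
    """One-hot encode values against levels, omitting the column for the level named drop."""
    kept = [level for level in levels if level != drop]
    return [[1.0 if v == level else 0.0 for level in kept] for v in values]

x = ["a", "b", "c", "d"]
levels = sorted(set(x))

# R-style treatment coding: drop the first level and prepend an intercept column.
r_style = [[1.0] + row for row in dummy_code(x, levels, drop=levels[0])]

# Spark-style: the last category is dropped by default (dropLast), and no intercept is added.
spark_style = dummy_code(x, levels, drop=levels[-1])
```

In both cases the dropped level is represented by a row of zeros in the indicator columns, which keeps the columns linearly independent once an intercept is included.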
Spark goes one step further: all features, even if the algorithm allows nominal/ordinal independent variables, have to be stored as Double in a spark.mllib.linalg.Vector. In the case of spark.ml it is a DataFrame column; in spark.mllib it is a field in spark.mllib.regression.LabeledPoint.
Depending on the model, the interpretation of the feature vector can be different, though. As mentioned above, for a linear model these will be interpreted as numerical variables. For Naive Bayes these are considered nominal. If a model accepts both numerical and nominal variables and treats each group in a different way, like decision / regression trees, you can provide the categoricalFeaturesInfo parameter.
It is worth pointing out that dependent variables should be encoded as Double as well but, unlike independent variables, may require additional metadata to be handled properly. If you take a look at the indexed DataFrame you'll see that StringIndexer not only transforms x, but also adds attributes:
scala> org.apache.spark.ml.attribute.Attribute.fromStructField(indexed.schema(1))
res12: org.apache.spark.ml.attribute.Attribute = {"vals":["d","a","b","c"],"type":"nominal","name":"xIdx"}
Finally some Transformers from ML, like VectorIndexer, can automatically detect and encode categorical variables based on the number of distinct values.
Copying clarification from zero323 in the comments:
Categorical values have to be encoded as Double before being passed to MLlib / ML estimators. There are quite a few built-in transformers, like StringIndexer or OneHotEncoder, which can be helpful here. If an algorithm treats categorical features differently from numerical ones, as DecisionTree does for example, you identify which variables are categorical using categoricalFeaturesInfo.
Finally some transformers use special attributes on columns to distinguish between different types of attributes.

Resources