What are the dtypes of new features? - featuretools

Why are the new features created using transform primitives like WEEKDAY, DayOfMonth, YEAR, and MonthOfYear created as integers, i.e., continuous features? Aren't they supposed to be categorical features? I mean, when creating these features, isn't the dtype of these columns supposed to be 'object' and not 'int'?

Categorical or ordinal features are best stored as integer values, because it is more efficient to represent data as an integer than as a string. For example, [1, 4, 3, 1] requires a lot less memory than ["January", "April", "March", "January"]. You can determine the data type of a feature using the list of feature definitions that is returned by ft.dfs:
import featuretools as ft

es = ft.demo.load_mock_customer(return_entityset=True)
feature_matrix, feature_defs = ft.dfs(entityset=es,
                                      target_entity="customers",
                                      agg_primitives=[],
                                      trans_primitives=["month"])
feature_defs is a list of feature definitions:
[<Feature: zip_code>, <Feature: MONTH(join_date)>]
We can get the variable type like this:
feature_defs[1].variable_type
which returns:
featuretools.variable_types.variable.Ordinal
For encoding discrete features into numeric features for machine learning, look at the documentation for ft.encode_features.
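A minimal sketch of that encoding step, reusing feature_matrix and feature_defs from above:
feature_matrix_enc, features_enc = ft.encode_features(feature_matrix, feature_defs)
# categorical features such as MONTH(join_date) are one-hot encoded into numeric columns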

Related

How to predict the cluster label of a new observation using hierarchical clustering?

I want to study a population of 47532 individuals with 16230 features. Thus I created a matrix with 16230 rows and 47532 columns.
>>> import numpy as np
>>> import scipy.cluster.hierarchy as hcluster
>>> from scipy.spatial import distance
>>> from sklearn.cluster import AgglomerativeClustering
>>> matrix.shape
(16230, 47532)
# remove all duplicate vectors in order to not waste computation time
>>> uniq_vectors, row_index = np.unique(matrix, return_index=True, axis=0)
>>> uniq_vectors.shape
(22957, 16230)
# compute distance between each observations
>>> distance_matrix = distance.pdist(uniq_vectors, metric='jaccard')
>>> distance_matrix_2d = distance.squareform(distance_matrix, force='tomatrix')
>>> distance_matrix_2d.shape
(22957, 22957)
# Perform linkage
>>> linkage = hcluster.linkage(distance_matrix, method='complete')
So now I can use scikit-learn to perform a clustering
>>> model = AgglomerativeClustering(n_clusters=40, affinity='precomputed', linkage='complete')
>>> cluster_label = model.fit_predict(distance_matrix_2d)
How can I predict future observations using this model?
Indeed, AgglomerativeClustering does not have a predict method, and it would take too long to recompute the distances for 16230 x (47532 + 1).
Is it possible to compute a distance between new observations and all the pre-computed clusters?
Indeed, pdist from scipy computes the n x n distances; in my case I would like to compute the distances from one observation to the n samples, i.e., o x n.
Thanks for your help.
The answer is simple: you cannot. Hierarchical clustering is not designed to predict cluster labels for new observations; it just links data points according to their distances and does not define "regions" for each cluster.
I believe there are two options for you at this stage:
For new data points, find the nearest observation in your data set (using the same distance function as during training) and assign it the same cluster label. This requires a bit more coding and is obviously a bit of a hack, and keep in mind that the results may not make much sense, since you would be extrapolating cluster labels with a different methodology than the training procedure; see the sketch after these options.
Use another clustering algorithm! It seems you are using hierarchical clustering when your use case does not match the model. KMeans could be a good choice, as it can explicitly assign new data points to the closest cluster.
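A minimal sketch of the first option, assuming new_obs is a single new binary observation with the same 16230 features as the rows of uniq_vectors, and cluster_label is the array returned by fit_predict above:
from scipy.spatial.distance import cdist

# distances from the single new observation to every training vector (1 x n, not n x n)
dists = cdist(new_obs.reshape(1, -1), uniq_vectors, metric='jaccard')
# assign the cluster label of the closest training observation
new_label = cluster_label[dists.argmin()]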

SkLearn: Feature Union with a dictionary and text data

I have a DataFrame like:
   text_data          worker_dicts                               outcomes
0  "Some string"      {"Sector": "Finance", "State": "NJ"}       0
1  "Another string"   {"Sector": "Programming", "State": "NY"}   1
It has both text information and a column that is a dictionary (the real worker_dicts has many more fields). I'm interested in predicting the binary outcomes column.
What I initially tried was to combine text_data and worker_dicts by crudely concatenating both columns, and then running Multinomial NB on that:
df['stacked_features'] = df['text_data'].astype(str) + '_' + df['worker_dicts'].astype(str)
stacked_features = np.array(df['stacked_features'])
outcomes = np.array(df['outcomes'])
text_clf = Pipeline([('vect', TfidfVectorizer(stop_words='english', ngram_range=(1, 3))),
                     ('clf', MultinomialNB())])
text_clf = text_clf.fit(stacked_features, outcomes)
But I got very bad accuracy, and I think that fitting two independent models would be a better use of data than fitting one model on both types of features (as I am doing with stacking).
How would I go about utilizing Feature Union? worker_dicts is a little weird because it's a dictionary, so I'm very confused as to how I'd go about parsing that.
If your dictionary entries are categorical as they appear to be in your example, then I would create different columns from the dictionary entries before doing additional processing.
new_features = pd.DataFrame(df['worker_dicts'].values.tolist())
Then new_features will be its own DataFrame with columns Sector and State, and you can one-hot encode those as needed, in addition to TF-IDF or other feature extraction on your text_data column. To use that inside a pipeline you would need to create a new transformer class, so I would suggest applying the dictionary parsing and the TF-IDF separately, then stacking the results, and adding OneHotEncoder to your pipeline, since it allows you to specify which columns to apply the transformer to. (As the categories you want to encode are strings, you may want to use the LabelBinarizer class instead of the OneHotEncoder class for the encoding transformation.)
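For example, a quick one-hot encoding sketch for those parsed columns (assuming they hold plain category strings):
worker_features = pd.get_dummies(new_features[['Sector', 'State']])
# produces binary columns such as Sector_Finance and State_NJ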
If you want to just use TFIDF on all of the columns individually with a pipeline, you would need to use a nested Pipeline and FeatureUnion set up to extract columns as described here.
If you have your one-hot encoded features in DataFrames X1 and X2 and your text features in X3, you could do something like the following to create a pipeline. (There are many other options; this is just one way.)
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer

X = pd.concat([X1, X2, X3], axis=1)

def select_text_data(X):
    return X['text_data']

def select_remaining_data(X):
    return X.drop('text_data', axis=1)

# pipeline to compute tfidf features for the text_data column
text_pipeline = Pipeline([
    ('column_selection', FunctionTransformer(select_text_data, validate=False)),
    ('tfidf', TfidfVectorizer())
])

final_pipeline = Pipeline([
    ('feature-union', FeatureUnion([
        ('text-features', text_pipeline),
        ('other-features', FunctionTransformer(select_remaining_data))
    ])),
    ('clf', LogisticRegression())
])
(MultinomialNB won't work as an intermediate step in the pipeline, because intermediate steps need both fit and transform methods.)
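Usage is then a single fit on the combined frame (a sketch, using the outcomes column from the question):
final_pipeline.fit(X, df['outcomes'])
predictions = final_pipeline.predict(X)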

How to classify frequency-domain data in Python 3? (could not convert string to float)

I've converted a DataFrame from the time domain to the frequency domain using:
df = np.fft.fft(df)
Now I need to classify the data using several machine learning algorithms, such as Random Forest and Gaussian Naive Bayes. The problem is that I keep getting this error:
could not convert string to float: '(2.9510193818016135-0.47803712350473193j)'
I tried to convert the strings to floats in the DataFrame, but it still gives me the same error.
How can I solve this problem in order to get my classification results?
Assuming your results are strings of the following form, you first need to cast them to a complex type:
In[84]:
# data setup
import pandas as pd
df = pd.DataFrame({'fft': ['(2.9510193818016135-0.47803712350473193j)']})
df
Out[84]:
fft
0 (2.9510193818016135-0.47803712350473193j)
Now cast to complex type:
In[85]:
df['complex'] = df['fft'].apply(complex)
df
Out[85]:
fft complex
0 (2.9510193818016135-0.47803712350473193j) (2.9510193818-0.478037123505j)
Now you can extract as polar coords using apply with cmath.polar:
In[86]:
import cmath
df['polar_x'] = df['complex'].apply(lambda x: cmath.polar(x)[0])
df['polar_y'] = df['complex'].apply(lambda x: cmath.polar(x)[1])
df
Out[86]:
fft complex \
0 (2.9510193818016135-0.47803712350473193j) (2.9510193818-0.478037123505j)
polar_x polar_y
0 2.989487 -0.160595
Now the dtypes are compatible, so you can pass the float columns to your classifier:
In[87]:
df.dtypes
Out[87]:
fft object
complex complex128
polar_x float64
polar_y float64
dtype: object
You can also use cmath.rect if desired.
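From there, a sketch of fitting one of the classifiers mentioned in the question on the float columns (y is assumed to hold the class labels):
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=100)
clf.fit(df[['polar_x', 'polar_y']], y)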

What is StringIndexer, VectorIndexer, and how to use them?

Dataset<Row> dataFrame = ... ;

StringIndexerModel labelIndexer = new StringIndexer()
    .setInputCol("label")
    .setOutputCol("indexedLabel")
    .fit(dataFrame);

VectorIndexerModel featureIndexer = new VectorIndexer()
    .setInputCol("s")
    .setOutputCol("indexedFeatures")
    .setMaxCategories(4)
    .fit(dataFrame);

IndexToString labelConverter = new IndexToString()
    .setInputCol("prediction")
    .setOutputCol("predictedLabel")
    .setLabels(labelIndexer.labels());
What are StringIndexer, VectorIndexer, and IndexToString, and what is the difference between them? How and when should I use them?
StringIndexer - use it if you want the machine learning algorithm to identify a column as a categorical variable, or if you want to convert textual data to numeric data while keeping the categorical context.
e.g., converting days (Monday, Tuesday, ...) to a numeric representation.
VectorIndexer - use this if we do not know the types of the incoming data, and we want to leave the logic of differentiating between categorical and non-categorical data to the algorithm via VectorIndexer.
e.g., data coming from a third-party API, where the data is opaque and is ingested directly into the training model.
IndexToString - just the opposite of StringIndexer; use this if the final output column was indexed using StringIndexer and we now want to convert its numeric representation back to text so that the result is easier to understand.
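For instance, a minimal PySpark sketch of StringIndexer and IndexToString (the day column and the spark session are hypothetical; the question above uses the equivalent Java API):
from pyspark.ml.feature import IndexToString, StringIndexer

df = spark.createDataFrame([("Monday",), ("Tuesday",), ("Monday",), ("Friday",)], ["day"])
indexer_model = StringIndexer(inputCol="day", outputCol="dayIndex").fit(df)
indexed = indexer_model.transform(df)  # the most frequent label gets index 0.0

# IndexToString reverses the mapping using the labels stored in the fitted model
converter = IndexToString(inputCol="dayIndex", outputCol="dayLabel",
                          labels=indexer_model.labels)
converter.transform(indexed).show()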
I know only about those two:
StringIndexer and VectorIndexer
StringIndexer:
converts a single column to an index column (similar to a factor column in R)
VectorIndexer:
is used to index categorical predictors in a featuresCol column. Remember that featuresCol is a single column consisting of vectors (refer to featuresCol and labelCol). Each row is a vector which contains values from each predictor.
If you have string-type predictors, you will first need to index those columns with StringIndexer, because featuresCol contains vectors and vectors cannot contain string values.
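A minimal PySpark sketch of VectorIndexer on a hypothetical features column (again assuming an existing spark session):
from pyspark.ml.feature import VectorIndexer
from pyspark.ml.linalg import Vectors

data = spark.createDataFrame([(Vectors.dense([0.0, 10.5]),),
                              (Vectors.dense([1.0, 20.1]),),
                              (Vectors.dense([0.0, 30.7]),)], ["features"])

# vector slots with at most maxCategories distinct values are treated as categorical
indexer = VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4)
indexed = indexer.fit(data).transform(data)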
Take a look here for example: https://mingchen0919.github.io/learning-apache-spark/StringIndexer-and-VectorIndexer.html

Does Spark.ml LogisticRegression assume numerical features only?

I was looking at the Spark 1.5 DataFrame/Row API and the implementation of logistic regression. As I understand it, the train method therein first converts the DataFrame to RDD[LabeledPoint] as:
override protected def train(dataset: DataFrame): LogisticRegressionModel = {
  // Extract columns from data. If dataset is persisted, do not persist oldDataset.
  val instances = extractLabeledPoints(dataset).map {
    case LabeledPoint(label: Double, features: Vector) => (label, features)
  }
  ...
And then it proceeds to feature standardization, etc.
What I am confused by is that the DataFrame is of type RDD[Row], and Row is allowed to have any value types; for example, (1, true, "a string", null) seems to be a valid row of a DataFrame. If that is so, what does extractLabeledPoints above do? It seems it is selecting only Array[Double] as the feature values in a Vector. What happens if a column in the DataFrame is strings? Also, what happens to integer categorical values?
Thanks in advance,
Nikhil
Let's ignore Spark for a moment. Generally speaking, linear models, including logistic regression, expect numeric independent variables. This is not in any way specific to Spark / MLlib. If the input contains categorical or ordinal variables, these have to be encoded first. Some languages, like R, handle this in a transparent manner:
> df <- data.frame(x1 = c("a", "b", "c", "d"), y=c("aa", "aa", "bb", "bb"))
> glm(y ~ x1, df, family="binomial")
Call: glm(formula = y ~ x1, family = "binomial", data = df)
Coefficients:
(Intercept) x1b x1c x1d
-2.357e+01 -4.974e-15 4.713e+01 4.713e+01
...
but what is really used behind the scenes is the so-called design matrix:
> model.matrix( ~ x1, df)
(Intercept) x1b x1c x1d
1 1 0 0 0
2 1 1 0 0
3 1 0 1 0
4 1 0 0 1
...
Skipping over the details, it is the same type of transformation as the one performed by the OneHotEncoder in Spark:
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

val df = sqlContext.createDataFrame(Seq(
  Tuple1("a"), Tuple1("b"), Tuple1("c"), Tuple1("d")
)).toDF("x").repartition(1)

val indexer = new StringIndexer()
  .setInputCol("x")
  .setOutputCol("xIdx")
  .fit(df)

val indexed = indexer.transform(df)

val encoder = new OneHotEncoder()
  .setInputCol("xIdx")
  .setOutputCol("xVec")

val encoded = encoder.transform(indexed)

encoded
  .select($"xVec")
  .map(_.getAs[Vector]("xVec").toDense)
  .foreach(println)
Spark goes one step further: all features, even if the algorithm allows nominal/ordinal independent variables, have to be stored as Double using a spark.mllib.linalg.Vector. In the case of spark.ml it is a DataFrame column; in spark.mllib it is a field of spark.mllib.regression.LabeledPoint.
Depending on the model, the interpretation of the feature vector can be different though. As mentioned above, for a linear model these will be interpreted as numerical variables. For Naive Bayes these are considered nominal. If a model accepts both numerical and nominal variables and treats each group in a different way, like decision / regression trees, you can provide the categoricalFeaturesInfo parameter.
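For example, a sketch of categoricalFeaturesInfo with a decision tree in PySpark (sc and the toy data are assumptions; feature 0 is treated as categorical with 3 levels):
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import DecisionTree

data = sc.parallelize([LabeledPoint(0.0, [0.0, 1.2]),
                       LabeledPoint(1.0, [2.0, 0.3]),
                       LabeledPoint(1.0, [1.0, 0.5])])

# feature 0 is categorical with 3 distinct values (0.0, 1.0, 2.0); feature 1 stays continuous
model = DecisionTree.trainClassifier(data, numClasses=2,
                                     categoricalFeaturesInfo={0: 3},
                                     impurity='gini', maxDepth=3, maxBins=32)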
It is worth pointing out that dependent variables should be encoded as Double as well but, unlike independent variables, may require additional metadata to be handled properly. If you take a look at the indexed DataFrame you'll see that StringIndexer not only transforms x, but also adds attributes:
scala> org.apache.spark.ml.attribute.Attribute.fromStructField(indexed.schema(1))
res12: org.apache.spark.ml.attribute.Attribute = {"vals":["d","a","b","c"],"type":"nominal","name":"xIdx"}
Finally some Transformers from ML, like VectorIndexer, can automatically detect and encode categorical variables based on the number of distinct values.
Copying clarification from zero323 in the comments:
Categorical values, before being passed to MLlib / ML estimators, have to be encoded as Double. There are quite a few built-in transformers, like StringIndexer or OneHotEncoder, which can be helpful here. If an algorithm treats categorical features in a different manner than numerical ones, like for example DecisionTree, you identify which variables are categorical using categoricalFeaturesInfo.
Finally, some transformers use special attributes on columns to distinguish between different types of attributes.
