Using pipeline to perform one hot encoding - scikit-learn

I used Pipeline and ColumnTransformer to perform preprocessing.
Let's say I use a one-hot encoder on column D and it generates some dummy columns. How can I get those column names and build a DataFrame from them?
preprocess_pipeline = ColumnTransformer([('A', MinMaxScaler(), ['A']),
('B', B_pipeline, ['B']),
('D', D_pipeline, ['D']),
('C', OneHotEncoder(), ['C'])
],remainder='drop')
processed_train_data = preprocess_pipeline.fit_transform(X_train)
processed_train_df = pd.DataFrame(processed_train_data, columns=['A', 'B', 'C', 'D'])
Column D is supposed to be replaced by those dummy columns. I have searched and read many similar cases, but none of them fixed the problem. The order of the column names was a mess, and preprocess_pipeline.get_feature_names_out is hard to understand. What kind of list should I pass to it? No matter what I pass, I get ValueError: input_features is not equal to feature_names_in_.

Related

Calculate similarity of 1-row dataframe and a large dataframe with the same columns in Python?

I have a very large dataframe (millions of rows) and every time I am getting a 1-row dataframe with the same columns.
For example:
df = pd.DataFrame({'a': [1,2,3], 'b': [2,3,-1], 'c': [-1,0.4,31]})
input = pd.DataFrame([[11, -0.44, 4]], columns=list('abc'))
I would like to calculate cosine similarity between the input and the whole df.
I am using the following:
from scipy.spatial.distance import cosine
df.apply(lambda row: 1 - cosine(row, input), axis=1)
But it's a bit slow. Tried with swifter package, and it seems to run faster.
Please advise what is the best practice for such a task, do it like this or change to another method?
I usually do matrix manipulation on numpy arrays rather than on DataFrames, so I will first convert them:
import numpy as np
df_npy = df.values
input_npy = input.values
And then I don't want to use scipy.spatial.distance.cosine so I will take care of the calculation myself, which is to first normalize each of the vectors
df_npy = df_npy / np.linalg.norm(df_npy, axis=1, keepdims=True)
input_npy = input_npy / np.linalg.norm(input_npy, axis=1, keepdims=True)
And then matrix multiply them together
df_npy @ input_npy.T
which will give you
array([[0.213],
[0.524],
[0.431]])
The reason I don't want to use scipy.spatial.distance.cosine is that it only handles one pair of vectors at a time, whereas the approach shown here handles all rows at once.
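The steps above can be collected into one small function. This is only a sketch: the helper name cosine_to_all is made up for illustration, the query frame is renamed from input to avoid shadowing the Python builtin, and rows with zero norm are not handled.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [2, 3, -1], 'c': [-1, 0.4, 31]})
query = pd.DataFrame([[11, -0.44, 4]], columns=list('abc'))


def cosine_to_all(frame, row):
    """Cosine similarity between a 1-row frame and every row of `frame`."""
    m = frame.to_numpy(dtype=float)
    v = row.to_numpy(dtype=float).ravel()
    dots = m @ v  # one dot product per row of `frame`
    # Assumes no all-zero rows; those would divide by zero.
    norms = np.linalg.norm(m, axis=1) * np.linalg.norm(v)
    return pd.Series(dots / norms, index=frame.index)


similarities = cosine_to_all(df, query)
```

Dividing the dot products by the norms avoids building a normalized copy of the large matrix, which may matter when the frame has millions of rows.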

Can Pandas DataFrame to_dict and from_dict lose column order?

When I call to_dict it returns a normal dictionary. However, normal dictionaries do not preserve order. The key for the dictionary is the column. Therefore, if I had called to_dict on a dataframe and later called from_dict to reconstruct the dataframe, would that not suggest that I could potentially lose column order?
In Python 3.7 and later, dictionaries are guaranteed to preserve the order in which keys are inserted (CPython 3.6 already did so as an implementation detail), so your assertion isn't true:
In [7]: pd.DataFrame.from_dict(pd.DataFrame({'c': [5], 'a': [2], 'b': [1]}).to_dict())
Out[7]:
c a b
0 5 2 1
Additionally, the pandas.DataFrame.to_dict docs provide a number of additional options for data structures such as OrderedDict:
>>> from collections import OrderedDict, defaultdict
>>> df.to_dict(into=OrderedDict)
OrderedDict([('col1', OrderedDict([('row1', 1), ('row2', 2)])),
('col2', OrderedDict([('row1', 0.5), ('row2', 0.75)]))])
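A short round-trip check along the same lines, written as a plain script rather than an interactive session. It assumes Python 3.7 or later; the variable names are only illustrative.

```python
import pandas as pd

original = pd.DataFrame({'c': [5], 'a': [2], 'b': [1]})

# The default orient gives {column: {index: value}}, keyed in column order.
as_dict = original.to_dict()
restored = pd.DataFrame.from_dict(as_dict)

column_order_kept = list(restored.columns) == list(original.columns)
```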

PySpark split DataFrame into multiple frames based on a column key and train an ML lib model on each

I have a PySpark dataframe with a column "group". I also have feature columns and a label column. I want to split the dataframe for each group and then train a model and end up with a dictionary where the keys are the "group" names and the values are the trained models.
This question essentially gives an answer to this problem, but the method is inefficient.
The obvious problem here is that it requires a full data scan for each level, so it is an expensive operation.
The answer is old and I am hoping there have been improvements in PySpark since then. For my use case I have 10k groups, with heavy skew in the data sizes. The largest group can have 1 Billion records and the smallest group can have 1 record.
Edit: As suggested here is a small reproducible example.
df = sc.createDataFrame(
[
('A', 1, 0, True),
('A', 3, 0, False),
('B', 2, 2, True),
('B', 3, 3, True),
('B', 5, 2, False)
],
('group', 'feature_1', 'feature_2', 'label')
)
I can split the data as suggested in the above link:
from itertools import chain
from pyspark.sql.functions import col
groups = chain(*df.select("group").distinct().collect())
df_by_group = {group:
train_model(df.where(col("group").eqNullSafe(group))) for group in groups}
Where train_model is a function that takes a dataframe with columns=[feature_1, feature_2, label] and returns a trained model on that dataframe.

How to encode multiple categorical columns for test data efficiently?

I have multiple categorical columns (nearly 50). I am using a custom-made frequency encoding fitted on the training data and saving it as a nested dictionary. For the test data I use the map function to encode, and unseen labels are replaced with 0. Is there a more efficient way?
I have already tried the pandas replace method, but it does not handle unseen labels and leaves them as they are. I am mainly concerned about time: I want, say, 80 columns and 1 row to be encoded within 60 ms. I just need the most efficient way to do it. I have taken my example from here.
import pandas
from sklearn import preprocessing
df = pandas.DataFrame({'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'meo'],
'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'],
'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego', 'San_Diego',
'New_York']})
My dict looks something like this :
enc = {'pets': {'cat': 0, 'dog': 1, 'monkey': 2},
'owner': {'Brick': 0, 'Champ': 1, 'Ron': 2, 'Veronica': 3},
'location': {'New_York': 0, 'San_Diego': 1}}
# input_df is the test DataFrame
for col in enc:
    if col in input_df.columns:
        input_df[col] = input_df[col].map(enc[col]).fillna(0)
Further, I want multiple columns to be encoded at once, without a loop over every column. I guess that can't be done with map. replace would be a good choice, but as noted it does not handle unseen labels.
EDIT:
This is the code I am using for now. Note that there is only 1 row in the test DataFrame (I am not sure whether I should handle it as a numpy array to reduce time). I need to get this under 60 ms; the current time is 331.74 ms. I only have a dictionary for the mapping (I can't use one-hot encoding because of my use case). Any idea how to do this more efficiently? I am not sure multiprocessing would help. I also ran into several issues with the replace method: 1. It does not handle unseen labels and leaves them as they are (a problem for strings). 2. It has problems when keys and values overlap.
from string import ascii_lowercase
import itertools
import pandas as pd
import numpy as np
import time
def iter_all_strings():
    for size in itertools.count(1):
        for s in itertools.product(ascii_lowercase, repeat=size):
            yield "".join(s)
l = []
for s in iter_all_strings():
    l.append(s)
    if s == 'gr':
        break
columns = l
df = pd.DataFrame(columns=columns)
for col in df.columns:
    df[col] = np.random.randint(1, 4000, 3000)
transform_dict = {}
for col in df.columns:
    cats = pd.Categorical(df[col]).categories
    d = {}
    for i, cat in enumerate(cats):
        d[cat] = i
    transform_dict[col] = d
print(f"The length of the dictionary is {len(transform_dict)}")
# Creating another test data frame
df2 = pd.DataFrame(columns=columns)
for col in df2.columns:
    df2[col] = np.random.randint(1, 4000, 1)
print(f"The shape of the 2nd data frame is {df2.shape}")
t1 = time.time()
for col in df2.columns:
    df2[col] = df2[col].map(transform_dict[col]).fillna(0)
print(f"Time taken is {time.time() - t1}")
# print(df)
First, when you encode a categorical variable that is not ordinal (meaning there is no inherent ordering between the values of the column, e.g. cat, dog), you should generally use one-hot encoding.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
df = pd.DataFrame({'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'meo'],
'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'],
'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego', 'San_Diego',
'New_York']})
enc = [['cat','dog','monkey'],
['Brick', 'Champ', 'Ron', 'Veronica'],
['New_York', 'San_Diego']]
# sparse_output replaces the older sparse argument (scikit-learn >= 1.2)
ohe = OneHotEncoder(categories=enc, handle_unknown='ignore', sparse_output=False)
Here, I have modified your enc in a way that can be fed into the OneHotEncoder.
Now, how are we going to handle the unseen labels?
When you set handle_unknown to 'ignore', unseen values get zeros in all the dummy variables, which in a way helps the model understand that it is an unknown value.
# Fit first: ohe.categories_ only exists after fitting.
encoded = ohe.fit_transform(df)
colnames = ['{}_{}'.format(col, val) for col, unique_values in zip(df.columns, ohe.categories_)
            for val in unique_values]
pd.DataFrame(encoded, columns=colnames)
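As a quick check of the handle_unknown='ignore' behaviour described above, here is a small sketch on a single column. It calls .toarray() on the default sparse result so that it does not depend on the sparse / sparse_output argument name.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

train = np.array([['cat'], ['dog'], ['monkey']], dtype=object)
test = np.array([['dog'], ['meo']], dtype=object)  # 'meo' was never seen

ohe_demo = OneHotEncoder(categories=[['cat', 'dog', 'monkey']],
                         handle_unknown='ignore')
ohe_demo.fit(train)

# The unseen label becomes a row of zeros instead of raising an error.
encoded_demo = ohe_demo.transform(test).toarray()
```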
Update:
If you are fine with ordinal encoding, the following change could help.
# row.items() yields (column name, value) pairs
df2.apply(lambda row: [transform_dict[col].get(val, 0)
                       for col, val in row.items()],
          axis=1,
          result_type='expand')
#1000 loops, best of 3: 1.17 ms per loop
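Since the test data is a single row, the same lookup can also be sketched with plain dictionaries and no DataFrame at all. The encode_row helper below is hypothetical, not part of pandas, and whether it meets the 60 ms target would need to be measured on the real data.

```python
transform_dict = {
    'pets': {'cat': 0, 'dog': 1, 'monkey': 2},
    'owner': {'Brick': 0, 'Champ': 1, 'Ron': 2, 'Veronica': 3},
    'location': {'New_York': 0, 'San_Diego': 1},
}


def encode_row(row, mapping, default=0):
    """Encode one record (a dict of column -> label) with nested dict lookups.

    Unseen labels fall back to `default`; columns without a mapping are skipped.
    """
    return {col: mapping[col].get(val, default)
            for col, val in row.items() if col in mapping}


row = {'pets': 'meo', 'owner': 'Ron', 'location': 'San_Diego'}
encoded_row = encode_row(row, transform_dict)
```

As in the question, the default of 0 collides with the code of the first known label; passing default=-1 would keep unseen labels distinguishable.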

Insert field into structured array at a specific column index

I'm currently using np.loadtxt to load some mixed data into a structured numpy array. I do some calculations on a few of the columns to output later. For compatibility reasons I need to maintain a specific output format so I'd like to insert those columns at specific points and use np.savetxt to export the array in one shot.
A simple setup:
import numpy as np
x = np.zeros((2,),dtype=('i4,f4,a10'))
x[:] = [(1,2.,'Hello'),(2,3.,'World')]
newcol = ['abc','def']
For this example I'd like to make newcol the 2nd column. I'm very new to Python (coming from MATLAB). From my searching all I've been able to find so far are ways to append newcol to the end of x to make it the last column, or x to newcol to make it the first column. I also turned up np.insert but it doesn't seem to work on a structured array because it's technically a 1D array (from my understanding).
What's the most efficient way to accomplish this?
EDIT1:
I investigated np.savetxt a little further and it seems like it can't be used with a structured array, so I'm assuming I would need to loop through and write each row with f.write. I could specify each column explicitly (by field name) with that approach and not have to worry about the order in my structured array, but that doesn't seem like a very generic solution.
For the above example my desired output would be:
1, abc, 2.0, Hello
2, def, 3.0, World
This is a way to add a field to the array, at the position you require:
from numpy import zeros, empty
def insert_dtype(x, position, new_dtype, new_column):
    if x.dtype.fields is None:
        raise ValueError("`x' must be a structured numpy array")
    # Build the new field list with the extra field at the requested position
    new_desc = x.dtype.descr
    new_desc.insert(position, new_dtype)
    y = empty(x.shape, dtype=new_desc)
    # Copy the existing fields, then fill in the new one
    for name in x.dtype.names:
        y[name] = x[name]
    y[new_dtype[0]] = new_column
    return y
# 'S10' is the same bytes dtype as the older 'a10' alias
x = zeros((2,), dtype='i4,f4,S10')
x[:] = [(1, 2., 'Hello'), (2, 3., 'World')]
new_dt = ('my_alphabet', '|S3')
new_col = ['abc', 'def']
x = insert_dtype(x, 1, new_dt, new_col)
Now x looks like this (Python 2 output; on Python 3 the string fields are shown as bytes, e.g. b'abc'):
array([(1, 'abc', 2.0, 'Hello'), (2, 'def', 3.0, 'World')],
dtype=[('f0', '<i4'), ('my_alphabet', 'S3'), ('f1', '<f4'), ('f2', 'S10')])
The solution is adapted from here.
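To print the recarray to file, you could use something like the following (note that rec2csv has since been removed from Matplotlib, so this only works with older releases):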
from matplotlib.mlab import rec2csv
rec2csv(x,'foo.txt')
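On current NumPy, np.savetxt appears to accept a one-dimensional structured array directly, provided fmt supplies one format specifier per field. The sketch below assumes unicode ('U') string fields instead of bytes so that %s prints abc rather than b'abc', and it writes to an in-memory buffer; a file name such as 'foo.txt' could be passed instead.

```python
import io

import numpy as np

# Same layout as the result above, but with unicode string fields (an assumption).
x = np.zeros(2, dtype=[('f0', 'i4'), ('my_alphabet', 'U3'),
                       ('f1', 'f4'), ('f2', 'U10')])
x[:] = [(1, 'abc', 2.0, 'Hello'), (2, 'def', 3.0, 'World')]

buffer = io.StringIO()
# One specifier per field, in field order.
np.savetxt(buffer, x, fmt='%d, %s, %.1f, %s')
text = buffer.getvalue()
```

This produces the layout requested in the question, one record per line with the new field in the second position.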
