Question on ColumnTransformer OneHotEncoder vs mode_onehot_pipe - scikit-learn

I would like to ask: what is the difference between OneHotEncoder and mode_onehot_pipe?
mode_onehot_pipe = Pipeline([
    ('encoder', SimpleImputer(strategy='most_frequent')),
    ('one hot encoder', OneHotEncoder(handle_unknown='ignore'))])
transformer = ColumnTransformer([
    ('one hot', OneHotEncoder(handle_unknown='ignore'), ['Gender', 'Age', 'Working_Status', 'Annual_Income', 'Visit_Duration', 'Spending_Time', 'Outlet_Location', 'Member_Card', 'Average_Spending']),
    ('mode_onehot_pipe', mode_onehot_pipe, ['Visit_Plan'])], remainder='passthrough')
Thanks a lot!

The main difference between the two is the way they handle NaN values.
mode_onehot_pipe replaces NaN with the most frequent value of the column, per the SimpleImputer configuration, while OneHotEncoder creates a separate category for NaN values.
If you pass the same feature to both, the OneHotEncoder output will end up with one extra column, the one representing the NaN values.
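A minimal sketch of that difference on a made-up toy column (note that OneHotEncoder only accepts NaN inputs in recent scikit-learn versions, roughly 0.24 and later):
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# toy column with one missing value
df = pd.DataFrame({'Visit_Plan': ['weekly', 'monthly', np.nan, 'weekly']})

# plain OneHotEncoder: NaN becomes its own category -> 3 output columns
ohe = OneHotEncoder(handle_unknown='ignore')
print(ohe.fit_transform(df).shape)               # (4, 3)

# impute first: NaN is replaced by the mode 'weekly' -> 2 output columns
mode_onehot_pipe = Pipeline([
    ('encoder', SimpleImputer(strategy='most_frequent')),
    ('one hot encoder', OneHotEncoder(handle_unknown='ignore'))])
print(mode_onehot_pipe.fit_transform(df).shape)  # (4, 2)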

Related

TfidfVectorizer to DataFrame shape error

I have some training data that I am trying to calculate the tf-idf values for:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
file_name = '../../data/spam.csv'
spam_data_df = pd.read_csv(file_name)
spam_data_df['target'] = np.where(spam_data_df['target']=='spam',1,0)
X_train, X_test, y_train, y_test = train_test_split(spam_data_df['text'],
                                                    spam_data_df['target'],
                                                    test_size=0.3,
                                                    random_state=0)
X_train_list = X_train.tolist()
tfidf_vectorizer = TfidfVectorizer()
tfidf_vectorizer_fit = tfidf_vectorizer.fit(X_train_list)
tfidf_vectorizer_vectors = tfidf_vectorizer.transform(X_train_list)
feature_names = tfidf_vectorizer.get_feature_names_out()
tfidf_vectorizer_dense = tfidf_vectorizer_vectors.todense()
tfidf_dense_list = tfidf_vectorizer_dense.tolist()
df = pd.DataFrame(tfidf_vectorizer_dense,
                  index=feature_names,
                  columns=["tfidf"]).reset_index()
What I am looking for is to construct a table that looks like the following:
token     tfidf
Mathews   0.99343
tait      0.02342
edwards   0.45453
anderson  0.21216
Here is an excerpt of the data:
text,target
"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...",ham
Ok lar... Joking wif u oni...,ham
Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's,spam
U dun say so early hor... U c already then say...,ham
"Nah I don't think he goes to usf, he lives around here though",ham
The error I am seeing is:
ValueError: Shape of passed values is (3900, 7098), indices imply (7098, 1)
Please help
You can do
df = pd.DataFrame(tfidf_vectorizer_dense.T,
                  index=feature_names).reset_index()
# columns=["tfidf"])
It will return something like
token 0 1 2 ... 3899
Mathews 0.99343 0.12421 0.00000 ... 0.48674
tait 0.02342 0.00000 0.00000 ... 0.12421
edwards 0.45453 0.40727 0.09323 ... 0.00000
anderson 0.21216 0.30638 0.44592 ... 0.32154
...
Explanation
You have 3900 texts with 7098 tfidf features.
ValueError: Shape of passed values is (3900, 7098), indices imply (7098, 1)
The error implies that there is a mismatch between
shape of tfidf_vectorizer_dense - (3900, 7098)
shape set by index=feature_names (7098) and columns=["tfidf"] (1) - (7098, 1).
Your goal is to match them so both are (7098, 3900).
You can do a transpose, tfidf_vectorizer_dense.T. After the transposition, it will have a shape of (7098, 3900). This aligns with the length of index.
For columns, you can just remove columns=.
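As an aside, if the goal really is a single tfidf value per token, as in the table sketched in the question, one option (an assumption about the intent, not part of the original answer) is to aggregate the scores across documents yourself, e.g. the mean per token, reusing the names from the question's code:
import numpy as np
# one score per token: the mean tf-idf across all 3900 training documents
mean_tfidf = np.asarray(tfidf_vectorizer_vectors.mean(axis=0)).ravel()
df = pd.DataFrame({'token': feature_names, 'tfidf': mean_tfidf})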

OneHotEncoder ValueError: Found unknown categories

I am building the OneHotEncoder using the full file.
def buildOneHotEncoder(training_file_name, categoricalCols):
    one_hot_encoder = OneHotEncoder(sparse=False)
    df = pd.read_csv(training_file_name, skiprows=0, header=0)
    df = df[categoricalCols]
    df = removeNaN(df, categoricalCols)
    logging.info(str(df.columns))
    one_hot_encoder.fit(df)
    return one_hot_encoder

def removeNaN(df, categoricalCols):
    # Replace any NaN values
    for col in categoricalCols:
        df[[col]] = df[[col]].fillna(value=CONSTANT_FILLER)
    return df
Now I am using this same encoder when processing the same file in chunks:
for chunk in pd.read_csv(training_file_name, chunksize=CHUNKSIZE):
    ....
    INPUT = chunk[categoricalCols]
    INPUT = removeNaN(INPUT, categoricalCols)
    one_hot_encoded = one_hot_encoder.transform(INPUT)
    ....
It's giving me the error 'ValueError: Found unknown categories ['missing'] in column 2 during transform'.
I can't process the full file at once, because during training iterations the memory is required to use all cores.
One workaround is to initialize OneHotEncoder with the handle_unknown parameter:
one_hot_encoder = OneHotEncoder(sparse=False, handle_unknown='ignore')
The issue was with applying
df_merged_set_test = chunk.where(chunk['weblab']=="missing")
I was filtering the dataset on a field, which fills every non-matching row with NaN; I was later replacing those NaNs with a missing flag, creating a category the encoder had never seen.
The correct way:
Clean the dataset first, i.e. fill the NaN values in all columns.
Then filter and drop the all-NaN rows, i.e. .where(chunk['weblab']=="missing").dropna()
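A minimal sketch of that order of operations, reusing names from the code above (CONSTANT_FILLER, removeNaN and the 'weblab' column all come from the question):
for chunk in pd.read_csv(training_file_name, chunksize=CHUNKSIZE):
    chunk = removeNaN(chunk, categoricalCols)                    # 1. fill NaNs first
    subset = chunk.where(chunk['weblab'] == "missing").dropna()  # 2. then filter
    one_hot_encoded = one_hot_encoder.transform(subset[categoricalCols])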
Clean the data of any NaN values.
The following code shows you the count of NaN values for each column:
total_missing_data = data.isnull().sum().sort_values(ascending=False)
percent_of_missing_data = (data.isnull().sum()/data.isnull().count()*100).sort_values(ascending=False)
missing_data = pd.concat(
    [
        total_missing_data,
        percent_of_missing_data
    ],
    axis=1,
    keys=['Total', 'Percent']
)
print(missing_data.head(10))
output such as:
     Total   Percent
age      2  0.284091
To get its location:
df.loc[(data['age'].isnull())]
Then fill the NaN cells using the mean or median:
df.age[62] = data.age.median()
Or drop all NaN rows:
df.dropna(inplace=True)

How do I make my algo work with KNN text classification?

I am trying to make my classification accept a text (string) and not just a number. Working with data carrying a load of pulled articles, I want the classification algo to show which ones to proceed with and which ones to drop. Applying a number, things work just fine, yet this is not very intuitive, although I know that the number represents a relationship to one of the two classes I am working with.
How do I change the logic in the algo to make it accept a text as search criterion and not just an anonymous number picked from the 'Unique_id' column? The columns are, btw, 'Title', 'Abstract', 'Relevant', 'Label', 'Unique_id'. The reason for concatenating dfs at the end of the algo is that I want to compare results. Finally, it should be noted that the col 'Label' consists of a list of keywords, so basically I want the algo to read from that col.
I did try, when reading from the data sources, changing index_col='Unique_id' to index_col='Label', but that did not work out either.
An example of what I want:
print("\nPrint KNN1")
print(get_closest_neighs1('search word'), "\n")
print("\nPrint KNN2")
print(get_closest_neighs2('search word'), "\n")
print("\nPrint KNN3")
print(get_closest_neighs3('search word'), "\n")
This is the full code (view end of algo to see above example as it runs today, using a number to identify nearest neighbor):
import pandas as pd
print("\nPerforming Analysis using Text Classification")
data = pd.read_csv('File_1_coltest_demo.csv', sep=';', encoding="ISO-8859-1").dropna()
data['Unique_id'] = data.groupby(['Title', 'Abstract', 'Relevant']).ngroup()
data.to_csv('File_2_coltest_demo_KNN.csv', sep=';', encoding="ISO-8859-1", index=False)
data1 = pd.read_csv('File_2_coltest_demo_KNN.csv', sep=';', encoding="ISO-8859-1", index_col='Unique_id')
data2 = pd.DataFrame(data1, columns=['Abstract', 'Relevant'])
data2.to_csv('File_3_coltest_demo_KNN_reduced.csv', sep=';', encoding="ISO-8859-1", index=False)
print("\nData top 25 items")
print(data2.head(25))
print("\nData info")
print(data2.info())
print("\nData columns")
print(data2.columns)
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import RegexpTokenizer
token = RegexpTokenizer(r'[a-zA-Z0-9]+')
cv = CountVectorizer(lowercase=True, stop_words='english', ngram_range=(1, 1), tokenizer=token.tokenize)
text_counts = cv.fit_transform(data2['Abstract'])
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    text_counts, data2['Abstract'], test_size=0.5, random_state=1)
print("\nTF IDF")
from sklearn.feature_extraction.text import TfidfVectorizer
tf = TfidfVectorizer()
text_tf = tf.fit_transform(data2['Abstract'])
print(text_tf)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    text_tf, data2['Abstract'], test_size=0.3, random_state=123)
from sklearn.neighbors import NearestNeighbors
import pandas as pd
nbrs = NearestNeighbors(n_neighbors=20, metric='euclidean').fit(text_tf)
def get_closest_neighs1(Abstract):
    row = data2.index.get_loc(Abstract)
    distances, indices = nbrs.kneighbors(text_tf.getrow(row))
    names_similar = pd.Series(indices.flatten()).map(data2.reset_index()['Abstract'])
    result = pd.DataFrame({'distance1': distances.flatten(), 'Abstract': names_similar})
    return result

def get_closest_neighs2(Unique_id):
    row = data2.index.get_loc(Unique_id)
    distances, indices = nbrs.kneighbors(text_tf.getrow(row))
    names_similar = pd.Series(indices.flatten()).map(data2.reset_index()['Unique_id'])
    result1 = pd.DataFrame({'Distance': distances.flatten() / 10, 'Unique_id': names_similar})
    return result1

def get_closest_neighs3(Relevant):
    row = data2.index.get_loc(Relevant)
    distances, indices = nbrs.kneighbors(text_tf.getrow(row))
    names_similar = pd.Series(indices.flatten()).map(data2.reset_index()['Relevant'])
    result2 = pd.DataFrame({'distance2': distances.flatten(), 'Relevant': names_similar})
    return result2
print("\nPrint KNN1")
print(get_closest_neighs1(114), "\n")
print("\nPrint KNN2")
print(get_closest_neighs2(114), "\n")
print("\nPrint KNN3")
print(get_closest_neighs3(114), "\n")
data3 = pd.DataFrame(get_closest_neighs1(114))
data4 = pd.DataFrame(get_closest_neighs2(114))
data5 = pd.DataFrame(get_closest_neighs3(114))
del data5['distance2']
data6 = pd.concat([data3, data4, data5], axis=1).reindex(data3.index)
del data6['distance1']
data6.to_csv('File_4_coltest_demo_KNN_results.csv', sep=';', encoding="ISO-8859-1", index=False)
If I understand you right, you are trying to do this:
You have vectorised all your documents by their "Abstract" field. Therefore documents with abstracts with similar word distributions should be nearby in TFIDF space.
You want to find the nearest neighbours to a document which has the search keyword.
Therefore you'd need to search the original corpus for the first (or all) documents which have that keyword,
then find the index of that/those document(s), and then find their neighbours.
If there are multiple documents with that keyword, you would need to sort the index list and merge the overall results somehow, with some weightings.
If this is true, then the keyword search/lookup isn't really "inside" the model, it's just preselecting a document from the corpus. Once you have the document index, you can perform the KNN (repeatedly).
I'm not hugely familiar with Pandas, but I've done this kind of thing "manually" before, e.g. by keeping the document titles in a separate array, with a map to the indexes.
I would imagine you would need to replace your data2.index.get_loc() calls with an iteration over the column values for "Label" and do a simple string search on each. Or does Pandas provide search functions within the corpus?
e.g. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.query.html#pandas.DataFrame.query
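Building on that, a minimal sketch of such a keyword lookup (not part of the original answer): it reads 'Label' from data1, since data2 drops that column, and assumes the rows of data1 line up with the rows of text_tf because both come from the same CSV; nbrs, data2 and text_tf are reused from the question's code.
def get_closest_neighs_by_keyword(keyword):
    # preselect documents whose 'Label' contains the keyword (simple substring match)
    labels = data1['Label'].reset_index(drop=True)
    rows = labels.index[labels.str.contains(keyword, case=False, na=False)]
    frames = []
    for row in rows:  # one KNN query per matching document
        distances, indices = nbrs.kneighbors(text_tf.getrow(row))
        names_similar = pd.Series(indices.flatten()).map(data2.reset_index()['Abstract'])
        frames.append(pd.DataFrame({'distance': distances.flatten(),
                                    'Abstract': names_similar}))
    return pd.concat(frames, ignore_index=True)

print(get_closest_neighs_by_keyword('search word'))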

How to implement dynamic parameter estimation with missing data in Gekko?

Going back and forth through the documentation, I was able to set up a dynamic parameter estimation in Gekko.
Here's the code, with measurement values shown below (the file is named MeasuredAlgebrProductionRate_30min_18h.csv on my system and uses ; as separator):
import numpy as np
import matplotlib.pyplot as plt
from gekko import GEKKO
#%% Read measurement data from CSV file
t_x_q_obs = np.genfromtxt('MeasuredAlgebrProductionRate_30min_18h.csv', delimiter=';')
#t_obs, x_obs, q_obs = t_xq_obs[:,0:3]
#%% Initialize Model
m = GEKKO(remote=False)
m.time = t_x_q_obs[:,0] #np.arange(0, 18/24+1e-6, 1/2*1/24)
# Declare parameter
V_liq = m.Param(value = 159.0)
# Declare FVs
k_1 = m.FV(value = 0.80)
k_1.STATUS = 1
f_1 = m.FV(value = 10.0)
f_1.STATUS = 1
# Diff. Variables
X = m.Var(value = 80.0) # at t=0
Y = m.Var(value = 80.0*0.2)
rho_1 = m.Intermediate(k_1*X)
#q_prod = m.Intermediate(0.52*f_1*X/24)
#X = m.CV(value = t_x_q_obs[:,1])
q_prod = m.CV(value = t_x_q_obs[:,2])
#%% Equations
m.Equations([X.dt() == -rho_1, Y.dt() == 0, q_prod == 0.52*f_1*X/24])
m.options.IMODE = 5
m.solve(disp=False)
#%% Plot some results
plt.plot(m.time, np.array(X.value)/10, label='X')
plt.plot(t_x_q_obs[:,0], t_x_q_obs[:,2], label='q_prod Meas.')
plt.plot(m.time, q_prod.value, label='q_prod Sim.')
plt.xlabel('time')
plt.ylabel('X / q_prod')
plt.grid()
plt.legend(loc='best')
plt.show()
0.0208333333 NaN 30.8306036
0.0416666667 NaN 29.1200832
0.0625 74.866 28.7700549
0.0833333333 NaN 29.2318865
0.104166667 NaN 30.7727362
0.125 NaN 29.8743804
0.145833333 NaN 29.9923447
0.166666667 NaN 30.9169679
0.1875 NaN 28.5956184
0.208333333 NaN 27.7361632
0.229166667 NaN 26.6669496
0.25 NaN 27.17477
0.270833333 75.751 23.6270346
0.291666667 NaN 23.0646928
0.3125 NaN 23.6442113
0.333333333 NaN 23.089118
0.354166667 NaN 22.9101616
0.375 NaN 22.7453854
0.395833333 NaN 23.2182759
0.416666667 NaN 21.4901903
0.4375 NaN 21.1449899
0.458333333 NaN 20.7093537
0.479166667 NaN 20.3109086
0.5 NaN 20.6825141
0.520833333 NaN 19.199583
0.541666667 NaN 19.6173416
0.5625 NaN 19.5543139
0.583333333 NaN 20.4501879
0.604166667 NaN 18.7678061
0.625 NaN 18.4629262
0.645833333 NaN 18.3730322
0.666666667 NaN 19.5375442
0.6875 NaN 18.1975297
0.708333333 NaN 18.0370627
0.729166667 NaN 17.5734727
0.75 NaN 18.8632046
So far, so good. Suppose I also have measurements of X (second column) at some time points (first column); the rest is not available (therefore NaN).
I would like to adjust k_1 and f_1, so that simulated and observed variables X and q_prod match as closely as possible.
Is this feasible with Gekko? If so, how?
Another question: Gekko throws an error if m.time has more elements than there are time points of observed variables. However, my initial values of X and Y refer to t=0, not t=0.0208333333; hence the commented-out part after m.time = (see above), since measurements at t=0 are not available. Do initial conditions in Gekko refer to the first element of m.time, as they do in Matlab, or to t=0?
If you have a missing measurement then you can include a non-numeric value such as NaN and Gekko ignores that entry in the objective function. Here is a test case with one NaN value in ym:
Nonlinear Regression with NaN Data Value
from gekko import GEKKO
import numpy as np
import matplotlib.pyplot as plt
xm = np.array([0,1,2,3,4,5])
ym = np.array([0.1,0.2,np.nan,0.5,0.8,2.0])
m = GEKKO(remote=False)
x = m.Param(value=xm,name='x')
a = m.FV()
a.STATUS=1
y = m.CV(value=ym,name='y')
y.FSTATUS=1
m.Equation(y==0.1*m.exp(a*x))
m.options.IMODE = 2
m.options.SOLVER = 1
m.solve(disp=True)
print('Optimized, a = ' + str(a.value[0]))
plt.plot(xm,ym,'bo')
plt.plot(xm,y.value,'r-')
m.open_folder()
plt.show()
When you open the run folder with m.open_folder() and look at the data file gk_model0.csv, there is the NaN in the y value column.
y,x
0.1,0
0.2,1
nan,2
0.5,3
0.8,4
2.0,5
This is IMODE=2, so it is a steady-state regression problem, but it shows the same behavior as dynamic estimation problems. There is more information on the estimation objective function with m.options.EV_TYPE=1 (default) or m.options.EV_TYPE=2, and on how bad values in a data file are handled. When a measurement is a non-numeric value, that bad value is dropped from the objective function summation. Here is a version with a dynamic model:
Dynamic Regression with Fixed Initial Condition
from gekko import GEKKO
import numpy as np
import matplotlib.pyplot as plt
xm = np.array([0,1,2,3,4,5])
ym = np.array([2.0,1.5,np.nan,2.2,3.0,5.0])
m = GEKKO(remote=False)
m.time = xm
a = m.FV(lb=0.1,ub=2.0)
a.STATUS=1
y = m.CV(value=ym,name='y')
y.FSTATUS=1
m.Equation(y.dt()==a*y)
m.options.IMODE = 5
m.options.SOLVER = 1
m.solve(disp=True)
print('Optimized, a = ' + str(a.value[0]))
plt.figure(figsize=(6,2))
plt.plot(xm,ym,'bo',label='Meas')
plt.plot(xm,y.value,'r-',label='Pred')
plt.ylabel('y')
plt.ylim([0,6])
plt.legend()
plt.show()
As you observed, you need m.time to have the same length as your measurement values. If you are missing values, you can prepend a np.nan to the beginning of the data horizon. By default, Gekko uses the first value specified in the value property to set the initial condition. If you don't want Gekko to use that value, set fixed_initial=False for your CV.
Dynamic Regression with Free Initial Condition
y = m.CV(value=ym,name='y',fixed_initial=False)
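Applied to the question's data, a small sketch of prepending the missing t=0 point with a NaN measurement so that m.time starts at 0 (array names taken from the question's code, not from the original answer):
import numpy as np
# prepend t=0 with a NaN measurement so the initial condition lives at t=0
t = np.insert(t_x_q_obs[:, 0], 0, 0.0)          # time grid, now starting at 0
q_meas = np.insert(t_x_q_obs[:, 2], 0, np.nan)  # q_prod measurements, NaN at t=0
m.time = t
q_prod = m.CV(value=q_meas)
q_prod.FSTATUS = 1  # use the measurements in the objective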

Label encoder gives "too many indices for array" error for a training set with only a single row

I have a dataset where the variable is a string; it is a single row containing 5959 columns, and I want to encode the categorical data.
from sklearn.preprocessing import LabelEncoder , OneHotEncoder
label_encoder = LabelEncoder()
y_train[:,0] = label_encoder.fit_transform(y_train[:,0])
onehot_encoder = OneHotEncoder(categorical_features = [0])
y_train = onehot_encoder.fit_transform(y_train).toarray()
The data should be categorized for further processing and analysis.
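This error typically means y_train is one-dimensional, so y_train[:, 0] has one index too many. A hedged sketch of a fix with current scikit-learn, where the categorical_features argument no longer exists and OneHotEncoder encodes strings directly (toy data, made up for illustration):
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# toy stand-in for y_train: a 1-D array of string labels
y_train = np.array(['cat', 'dog', 'cat', 'bird'])

# reshape the 1-D array into a single column before encoding
y_2d = y_train.reshape(-1, 1)

onehot_encoder = OneHotEncoder()
y_train_encoded = onehot_encoder.fit_transform(y_2d).toarray()
print(y_train_encoded)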
