OneHotEncoder ValueError: Found unknown categories - scikit-learn

I am building the OneHotEncoder using the full file.
def buildOneHotEncoder(training_file_name, categoricalCols):
    one_hot_encoder = OneHotEncoder(sparse=False)
    df = pd.read_csv(training_file_name, skiprows=0, header=0)
    df = df[categoricalCols]
    df = removeNaN(df, categoricalCols)
    logging.info(str(df.columns))
    one_hot_encoder.fit(df)
    return one_hot_encoder

def removeNaN(df, categoricalCols):
    # Replace any NaN values with the constant filler
    for col in categoricalCols:
        df[[col]] = df[[col]].fillna(value=CONSTANT_FILLER)
    return df
Now I am using this same encoder when processing the same file in chunks:
for chunk in pd.read_csv(training_file_name, chunksize=CHUNKSIZE):
    ....
    INPUT = chunk[categoricalCols]
    INPUT = removeNaN(INPUT, categoricalCols)
    one_hot_encoded = one_hot_encoder.transform(INPUT)
    ....
It's giving me the error ValueError: Found unknown categories ['missing'] in column 2 during transform.
I can't process the full file at once, because the memory is needed during training iterations to use all cores.

One workaround is to initialize the OneHotEncoder with the handle_unknown parameter, so that categories not seen during fit are encoded as all zeros instead of raising an error:
one_hot_encoder = OneHotEncoder(sparse=False, handle_unknown='ignore')
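As a quick illustration of that behaviour (a minimal sketch on toy data, not the asker's file):

import numpy as np
from sklearn.preprocessing import OneHotEncoder

# fit on two known categories only
encoder = OneHotEncoder(sparse=False, handle_unknown='ignore')
encoder.fit(np.array([['red'], ['blue']]))

# 'green' was never seen during fit: instead of raising ValueError,
# handle_unknown='ignore' encodes it as an all-zero row
print(encoder.transform(np.array([['red'], ['green']])))
# [[0. 1.]
#  [0. 0.]]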

The issue was with applying
df_merged_set_test = chunk.where(chunk['weblab']=="missing")
I was filtering the dataset on a single field, and where() fills every non-matching row with NaN. I was later replacing those NaNs with the missing flag, which introduced a category the encoder had never seen.
The correct way (sketched below):
Clean the dataset first, i.e. fill all NaN values in all columns.
Only then filter, and drop the all-NaN rows: .where(chunk['weblab']=="missing").dropna()
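A minimal sketch of that order of operations, reusing removeNaN and categoricalCols from the question (the exact placement inside the chunk loop is my assumption):

for chunk in pd.read_csv(training_file_name, chunksize=CHUNKSIZE):
    # fill NaNs first, so the 'missing' flag is already a category the encoder saw
    chunk = removeNaN(chunk, categoricalCols)
    # where() turns non-matching rows into all-NaN rows;
    # dropna() then removes exactly those rows
    df_merged_set_test = chunk.where(chunk['weblab'] == "missing").dropna()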

Clean the data of any NaN values first.
The following code shows the count of NaN values in each column:
total_missing_data = data.isnull().sum().sort_values(ascending=False)
percent_of_missing_data = (data.isnull().sum()/data.isnull().count()*100).sort_values(ascending=False)
missing_data = pd.concat(
    [
        total_missing_data,
        percent_of_missing_data
    ],
    axis=1,
    keys=['Total', 'Percent']
)
print(missing_data.head(10))
output such as:
     Total   Percent
age      2  0.284091
To get its location:
data.loc[data['age'].isnull()]
Then fill the NaN entries using the mean or median:
data.loc[62, 'age'] = data['age'].median()
or drop all rows containing NaN:
data.dropna(inplace=True)

Related

How to extract 4-dimensional data from a list of pandas dataframes?

I have a list of 500 dataframes (in the form of .csv files); 500 = 20 (time) × 25 (energy) bins. In other words, each dataframe is a measurement of flux at a single time and energy, represented as a 150x150 mesh grid corresponding to x and y spatial coordinates. I would like to transform these data into 4-d coordinates of the form Flux(x, y, t, E), so that I have a new set of dataframes with columns E and rows t for any given (x, y) position.
I am not sure how to approach the problem. I would appreciate your help in giving me some sort of roadmap for doing this procedure.
Note:
The time and energy of each dataframe are encoded in the name of the corresponding .csv file, in the form time-5e+35-energy0.00023-position.csv, where t = -5×10^35 and E = 0.00023.
What I know:
The 500 dataframes (spanning 20 t × 25 E) must be converted into 22,500 dataframes, one per (x, y) coordinate of the 150x150 grid. However, this is very time consuming and I am not sure whether another package in python3 could do the job more easily.
Code that combines your files into one big Pandas dataframe of size 11,250,000 or 25 × 20 × 150 × 150:
import pandas as pd
from glob import glob
import re
from datetime import datetime

pattern_file_name = re.compile(r'time-(.*)-energy(.*)-position.csv')
start_time = datetime.now()
result_df = None
for file_name in glob('time-*.csv'):
    # extract time and energy values from file name
    if not pattern_file_name.match(file_name):
        raise ValueError(f'file name {file_name} failed pattern match.')
    time_s, energy_s = pattern_file_name.findall(file_name)[0]
    time, energy = float(time_s), float(energy_s)
    print(f'Processing | {time_s} | {energy_s} |...')
    df = pd.read_csv(file_name, header=None)
    # assuming the CSV (i) has no headers (ii) is an array of 150x150
    # floats with no missing or problematic values (iii) each row
    # represents a fixed y-coordinate; adjust to your needs
    df.index.name = 'y'
    df = df.stack()
    df.index.rename('x', level=-1, inplace=True)
    df = df.swaplevel().sort_index().reset_index().rename(columns={0: 'flux'})
    # df is now (x, y, flux); x and y each vary from 0 to 149
    df.insert(0, 't', time)
    df.insert(0, 'E', energy)
    result_df = df if result_df is None else pd.concat([result_df, df])
result_df = result_df.set_index(['E', 't', 'x', 'y']).sort_index()
# result_df is now (E, t, x, y) -> flux
result_df.to_csv('output.csv', index=True)
final_time = datetime.now()
delta_time = final_time - start_time
print(f'Completed in {delta_time}')
The main steps are as follows:
Loop over file names
Extract t and E values from file name
Read square matrix of flux values from file
Transform 150 × 150 square matrix to Pandas dataframe of length 22,500
Add columns to keep track of E and t
Append local result to a global, ever-increasing result vector
Finally, leave the loop and save results to disk as CSV
The resulting CSV file will have 5 columns. The first four would represent (E,t,x,y) and the last column would be the value of the flux field at those co-ordinates.
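To recover the E × t table for one (x, y) position from the saved file (a sketch assuming the output.csv written above; the coordinates are arbitrary examples):

import pandas as pd

# read back with the four index columns restored
result_df = pd.read_csv('output.csv', index_col=['E', 't', 'x', 'y'])

# flux as a function of (E, t) at a fixed position, e.g. x=75, y=75
position = result_df.xs((75, 75), level=('x', 'y'))

# pivot so rows are t and columns are E, as the question asks
table = position['flux'].unstack(level='E')
print(table.head())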

Is there another way to specify exactly which columns or rows are selected for preprocessing (cleaning a data frame)?

Firstly, I have a dataset like the following:
Date Time Open ... Adj Close Volume
0 1/3/2017 0:00 225.039993 ... 208.895447 NaN
1 1/4/2017 0:00 225.619995 ... 210.138245 78744400.0
...
I wonder how to pass a column's label to the imputer variable, but I do not want to use iloc or loc.
When I run the code, I get an error that Volume cannot be iterated.
For reference, the (now deprecated) Imputer documentation: https://scikit-learn.org/0.18/modules/generated/sklearn.preprocessing.Imputer.html
df = pd.read_csv(r"C:\Users\v\Desktop\sp500.csv")
df1 = df.replace(np.nan, 1)
# retrieve the numpy array
values = df1.values
# define the imputer (column= is what I am trying to achieve)
imputer = SimpleImputer(missing_values=1, strategy='mean', column= Volume)
SimpleImputer does not actually accept a column argument, so the column cannot be named in the constructor. Instead, select the column from the DataFrame (as df1.Volume or df1['Volume']) and pass only that selection to the imputer; since the imputer expects 2-D input, use the double-bracket form df1[['Volume']] when fitting.
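A minimal sketch of that approach (imputing the NaNs directly with missing_values=np.nan, rather than first replacing them with 1 as in the question):

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df1 = pd.read_csv(r"C:\Users\v\Desktop\sp500.csv")

# impute only the Volume column; the double brackets keep it 2-D
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
df1[['Volume']] = imputer.fit_transform(df1[['Volume']])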

Concat dataframe to multi index dataframe with gradient values

I have a multi-index dataframe with multiple test result values.
For further data analysis I want to add the derivative to the dataframe.
I tried to calculate it via a lambda function directly after grouping the dataframe; grouping (mean values) is required due to the noise in the sampling.
Later I want to delete the rows from my dataframes where the derivative is <= 0.
The simplified Multi-index dataframe looks like this:
arrays = [['LS13', 'LS13', 'LS13', 'LS13','LS14','LS14','LS14','LS14','LS14','LS14','LS14','LS14'],[0, 2, 2.5, 3,0,2,5,5.5,6,6.5,7,7.5]]
index = pd.MultiIndex.from_arrays(arrays, names=('File', 'Flow Rate Setpoint [l/s]'))
df = pd.DataFrame({('Flow Rate [l/s]','mean') : [-0.057,2.089,2.496,3.011,0.056,2.070,4.995,5.519,6.011,6.511,7.030,7.499],('Time [s]','mean') : [42.225,104.909,165.676,226.446,42.225,104.918,469.560,530.328,591.100,651.864,712.660,773.034],('Shear Stress [Pa]','mean') : [-0.698,5.621,7.946,11.278,-0.774,6.557,40.610,48.370,54.685,58.414,58.356,56.254]},index=index)
If I run my code:
import numpy as np
xls = ['LS13', 'LS14']
gradient = [pd.Series(np.gradient(df.loc[(i),('Shear Stress [Pa]','mean')],df.loc[(i),('Time [s]','mean')])) for i in xls]
Now I want to concat gradient to df on axis=1; the title could be df[('Gradient','values')].
So my pd.Series looks like:
Gradient
values
0 0.100808
1 0.069048
2 0.04654
3 0.054801
0 0.116941
1 0.087431
2 0.149521
3 0.115805
4 0.082639
5 0.030213
6 -0.017938
7 -0.034806
The next step would be to remove/drop the rows where ('Gradient','values') <= 0, in my example ['LS14', 7:7.5].
When I tried to concatenate the DataFrame df and the Series gradient (I'm aware that the indexes are different):
merged = pd.concat([pd.DataFrame(df),pd.Series(gradient)], axis=1 , ignore_index = True)
Errors are usually one of the following:
ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long long'
TypeError: cannot concatenate object of type "<class 'list'>"; only pd.Series, pd.DataFrame, and pd.Panel (deprecated) objs are valid
I would also assume there is an easier way to get this done with a lambda function and just apply it in place.
merged = pd.concat([df, pd.Series([gradient], name=('Gradient','value'))], axis=1)
I would have expected that to work, but I also get a mismatch error:
ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long long'
when I try:
df[("Gradient","value")] =pd.Series([pd.Series(np.gradient(df.loc[(i),('Shear Stress [Pa]','mean')],df.loc[(i),('Time [s]','mean')])) for i in xls])
The 'Gradient','value' column gets correctly added to the dataframe but the values are again NaN.
You can try groupby().apply():
def get_gradients(x):
    gradients = np.gradient(x[('Shear Stress [Pa]', 'mean')], x[('Time [s]', 'mean')])
    return pd.Series(gradients, index=x.index)

df[('Gradient','Value')] = (df.groupby('File', group_keys=False)
                              .apply(get_gradients))
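From there, dropping the rows where the derivative is not positive (the LS14 rows at 7 and 7.5 in the example) is a plain boolean filter:

# keep only the rows with a strictly positive gradient
df = df[df[('Gradient', 'Value')] > 0]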

How to plot multi-index, categorical data?

Given the following data:
DC,Mode,Mod,Ven,TY1,TY2,TY3,TY4,TY5,TY6,TY7,TY8
Intra,S,Dir,C1,False,False,False,False,False,True,True,False
Intra,S,Co,C1,False,False,False,False,False,False,False,False
Intra,M,Dir,C1,False,False,False,False,False,False,True,False
Inter,S,Co,C1,False,False,False,False,False,False,False,False
Intra,S,Dir,C2,False,True,True,True,True,True,True,False
Intra,S,Co,C2,False,False,False,False,False,False,False,False
Intra,M,Dir,C2,False,False,False,False,False,False,False,False
Inter,S,Co,C2,False,False,False,False,False,False,False,False
Intra,S,Dir,C3,False,False,False,False,True,True,False,False
Intra,S,Co,C3,False,False,False,False,False,False,False,False
Intra,M,Dir,C3,False,False,False,False,False,False,False,False
Inter,S,Co,C3,False,False,False,False,False,False,False,False
Intra,S,Dir,C4,False,False,False,False,False,True,False,True
Intra,S,Co,C4,True,True,True,True,False,True,False,True
Intra,M,Dir,C4,False,False,False,False,False,True,False,True
Inter,S,Co,C4,True,True,True,False,False,True,False,True
Intra,S,Dir,C5,True,True,False,False,False,False,False,False
Intra,S,Co,C5,False,False,False,False,False,False,False,False
Intra,M,Dir,C5,True,True,False,False,False,False,False,False
Inter,S,Co,C5,False,False,False,False,False,False,False,False
Imports:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
To reproduce my DataFrame, copy the data then use:
df = pd.read_clipboard(sep=',')
I'd like to create a plot conveying the same information as my example, but not necessarily with the same shape (I'm open to suggestions). I'd also like to hover over the color and have the appropriate Ven displayed (e.g. C1, not 1):
Edit 2018-10-17:
The two solutions provided so far are helpful, and each accomplishes a different aspect of what I'm looking for. However, the key issue I'd like to resolve, which wasn't explicitly stated prior to this edit, is the following:
I would like to perform the plotting without converting Ven to an int; this numeric transformation isn't practical with the real data. So the actual scope of the question is to plot all categorical data with two categorical axes.
The issue I'm experiencing is the data is categorical and the y-axis is multi-indexed.
I've done the following to transform the DataFrame:
# replace False with nan
df = df.replace(False, np.nan)

# replace True with a number representing Ven (e.g. C1 = 1)
def rep_ven(row):
    return row.iloc[4:].replace(True, int(row.Ven[1]))

df.iloc[:, 4:] = df.apply(rep_ven, axis=1)

# drop the Ven column
df = df.drop(columns=['Ven'])

# set multi-index
df_m = df.set_index(['DC', 'Mode', 'Mod'])
Plotting the transformed DataFrame produces:
plt.figure(figsize=(20,10))
heatmap = plt.imshow(df_m)
plt.xticks(range(len(df_m.columns.values)), df_m.columns.values)
plt.yticks(range(len(df_m.index)), df_m.index)
plt.show()
This plot isn't very streamlined; there are four axis values for each Ven. This is a subset of the data, so the graph would be very long with all of it.
Here's my solution. Instead of plotting I just apply a style to the DataFrame, see https://pandas.pydata.org/pandas-docs/stable/style.html
# Transform Ven values from "C1", "C2", ... to 1, 2, ...
df['Ven'] = df['Ven'].str[1]

# Given a specific combination of DC, Mode, Mod, Ven,
# do we have any True cells?
g = df.groupby(['DC', 'Mode', 'Mod', 'Ven']).any()

# Drop any rows with only False values
g = g[g.any(axis=1)]

# Convert True, False to 1, 0
g = g.astype(int)

# Get the values of the Ven index as an int array.
# Note: we don't want to drop the Ven index,
# otherwise styling won't work
ven = g.index.get_level_values('Ven').values.astype(int)

# Multiply the 1s and 0s by the Ven value
g = g.mul(ven, axis=0)

# Sort the index
g.sort_index(ascending=False, inplace=True)

# Now display the dataframe with styling;
# first we get a color map
import matplotlib
cmap = matplotlib.cm.get_cmap('tab10')

def apply_color_map(val):
    # hide the 0 values
    if val == 0:
        return 'color: white; background-color: white'
    else:
        # for non-zero: get color from cmap, convert to hex code for CSS
        return "color: white; background-color: " + matplotlib.colors.rgb2hex(cmap(val))

g.style.applymap(apply_color_map)
The available matplotlib colormaps can be seen here: Colormap reference, with some additional explanation here: Choosing a colormap
Explanation: Remove rows where TY1-TY8 are all nan to create your plot. Refer to this answer as a starting point for creating interactive annotations to display Ven.
The below code should work:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

df = pd.read_clipboard(sep=',')

# replace False with nan
df = df.replace(False, np.nan)

# replace True with a number representing Ven (e.g. C1 = 1)
def rep_ven(row):
    return row.iloc[4:].replace(True, int(row.Ven[1]))

df.iloc[:, 4:] = df.apply(rep_ven, axis=1)

# drop the Ven column
df = df.drop(columns=['Ven'])

# drop rows where TY1-TY8 are all NaN, then sort
idx = df[['TY1','TY2','TY3','TY4','TY5','TY6','TY7','TY8']].dropna(thresh=1).index.values
df = df.loc[idx,:].sort_values(by=['DC', 'Mode','Mod'], ascending=False)

# set multi-index
df_m = df.set_index(['DC', 'Mode', 'Mod'])

plt.figure(figsize=(20,10))
heatmap = plt.imshow(df_m)
plt.xticks(range(len(df_m.columns.values)), df_m.columns.values)
plt.yticks(range(len(df_m.index)), df_m.index)
plt.show()

Find SARIMAX AIC and pdq values, statsmodels

I'm trying to find the values of p, d, q and the seasonal values of P, D, Q using statsmodels (imported as sm) in Python.
The data set I'm using is a CSV file containing three years of time series data recording energy consumption. The file was split into a smaller data frame in order to work with it. Here is what df_test.head() looks like:
time total_consumption
122400 2015-05-01 00:01:00 106.391
122401 2015-05-01 00:11:00 120.371
122402 2015-05-01 00:21:00 109.292
122403 2015-05-01 00:31:00 99.838
122404 2015-05-01 00:41:00 97.387
Here is my code so far.
#Importing the time series data set from a local file
df = pd.read_csv(r"C:\Users\path\Name of the file.csv")

#Rename the columns, put time as index and assign datetime to the column time
df.columns = ["time","total_consumption"]
df['time'] = pd.to_datetime(df.time)
df.set_index('time')

#Select test df (there is data from 2015-05-01 to 2015-06-01)
df_test = df.loc[(df['time'] >= '2015-05-01') & (df['time'] <= '2015-05-14')]

#Find the minimal AIC value for the ARIMA model integers
p = range(0,2)
d = range(0,2)
q = range(0,2)
pdq = list(itertools.product(p,d,q))
seasonal_pdq = [(x[0], x[1], x[2], 12) for x in list(itertools.product(p,d,q))]
warnings.filterwarnings("ignore")
for param in pdq:
    for param_seasonal in seasonal_pdq:
        try:
            mod = sm.tsa.statespace.SARIMAX(df_test,
                                            order=param,
                                            seasonal_order=param_seasonal,
                                            enforce_stationarity=False,
                                            enforce_invertibility=False)
            results = mod.fit()
            print('ARIMA{}x{}12 - AIC:{}'.format(param, param_seasonal, results.aic))
        except:
            continue
When I try to run the code as is, the program doesn't even seem to enter the for loop: nothing is printed. But when I take out the try/except/continue,
the program gives me this error message
ValueError: Pandas data cast to numpy dtype of object. Check input data with np.asarray(data).
How could I remedy that, and is there a way to automate the process and directly output the parameters with the lowest AIC value (without having to look for it through all the possibilities)?
Thanks !
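A sketch of one possible remedy (an assumption on my part, untested against this data): the ValueError usually means SARIMAX received the whole two-column frame, including the non-numeric time column; note also that df.set_index('time') returns a new frame that is never assigned in the code above. Passing a single numeric series and tracking the minimum AIC inside the loop would look roughly like this:

# pass a single numeric series to SARIMAX instead of the two-column frame
endog = df_test.set_index('time')['total_consumption']

best_aic, best_params = float('inf'), None
for param in pdq:
    for param_seasonal in seasonal_pdq:
        try:
            results = sm.tsa.statespace.SARIMAX(
                endog,
                order=param,
                seasonal_order=param_seasonal,
                enforce_stationarity=False,
                enforce_invertibility=False).fit()
        except Exception:
            continue  # catch Exception rather than everything, so Ctrl-C still works
        if results.aic < best_aic:
            best_aic, best_params = results.aic, (param, param_seasonal)
print('Lowest AIC: {:.2f} for ARIMA{}x{}12'.format(best_aic, *best_params))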
