While creating dummy variables getting memory error - python-3.x

I am working on a project and While getting the dummies value i was getting memory exception
I have tried using .astype(np.int8) and i have also tried writing exception handling code by importing psutil
i am using below code
dummy_cols = ['emp_title','grade','home_ownership','verification_status','addr_state','pub_rec','application_type']
df_dummies = pd.get_dummies(df[dummy_cols], drop_first = True)
It's not working and throwing an error

pandas.get_dummies creates a dense representation of dummy variables, which may request lots of memory depending on the number of levels in the categorical features.
I would prefer scikit-learn.preprocessing.OneHotEncoder that outputs sparse matrices:
The code would look like this :
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
# Create a fake dataframe
df = pd.DataFrame(
{
"df1": np.random.choice(["a", "b"], 100),
"df2": np.random.choice(["c", "d"], 100)
}
)
dummy_cols = ["df1", "df2"]
# LabelEncode categoricals
for f in dummy_cols:
df[f] = LabelEncoder().fit_transform(df[f])
# Transform to dummies in sparse representation (csr_matrix)
df_dummies = OneHotEncoder().fit_transform(df[dummy_cols])

Related

A function to insert data in dataset using python

I create a program that predict digits from in a dataset. I want when it predict data their should be two cases if it predict right then data should added automatically in dataset otherwise it takes right answer throw user and insert to dataset.
code
import numpy as np
import pandas as pd
import matplotlib.pyplot as pt
from sklearn.tree import DecisionTreeClassifier
data = pd.read_csv("train.csv").values
clf = DecisionTreeClassifier()
xtrain = data[0:21000,1:]
train_label=data[0:21000,0]
clf.fit(xtrain,train_label)
xtest = data[21000: ,1:]
actual_label=data[21000:,0]
d = xtest[9]
d.shape = (28,28)
pt.imshow(d,cmap='gray')
print(clf.predict([xtest[9]]))
pt.show()
I'm not sure I'm following your question, but if you want to distinguish between good and wrong predictions and take different ways, you should specific do that.
predictions = clf.predict(xtest)
good_predictions = xtest[pd.Series(predictions == actual_label)]
bad_predictions = xtest[pd.Series(predictions != actual_label)]
So, in good_predictions will be all the rows in xtest that where predicted right.

Recover feature names from pipeline [duplicate]

This seems like a very important issue for this library, and so far I don't see a decisive answer, although it seems like for the most part, the answer is 'No.'
Right now, any method that uses the transformer api in sklearn returns a numpy array as its results. Usually this is fine, but if you're chaining together a multi-step process that expands or reduces the number of columns, not having a clean way to track how they relate to the original column labels makes it difficult to use this section of the library to its fullest.
As an example, here's a snippet that I just recently used, where the inability to map new columns to the ones originally in the dataset was a big drawback:
numeric_columns = train.select_dtypes(include=np.number).columns.tolist()
cat_columns = train.select_dtypes(include=np.object).columns.tolist()
numeric_pipeline = make_pipeline(SimpleImputer(strategy='median'), StandardScaler())
cat_pipeline = make_pipeline(SimpleImputer(strategy='most_frequent'), OneHotEncoder())
transformers = [
('num', numeric_pipeline, numeric_columns),
('cat', cat_pipeline, cat_columns)
]
combined_pipe = ColumnTransformer(transformers)
train_clean = combined_pipe.fit_transform(train)
test_clean = combined_pipe.transform(test)
In this example I split up my dataset using the ColumnTransformer and then added additional columns using the OneHotEncoder, so my arrangement of columns is not the same as what I started out with.
I could easily have different arrangements if I used different modules that use the same API. OrdinalEncoer, select_k_best, etc.
If you're doing multi-step transformations, is there a way to consistently see how your new columns relate to your original dataset?
There's an extensive discussion about it here, but I don't think anything has been finalized yet.
Yes, you are right that there isn't a complete support for tracking the feature_names in sklearn as of now. Initially, it was decide to keep it as generic at the level of numpy array. Latest progress on the feature names addition to sklearn estimators can be tracked here.
Anyhow, we can create wrappers to get the feature names of the ColumnTransformer. I am not sure whether it can capture all the possible types of ColumnTransformers. But at-least, it can solve your problem.
From Documentation of ColumnTransformer:
Notes
The order of the columns in the transformed feature matrix follows the order of how the columns are specified in the transformers list. Columns of the original feature matrix that are not specified are dropped from the resulting transformed feature matrix, unless specified in the passthrough keyword. Those columns specified with passthrough are added at the right to the output of the transformers.
Try this!
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, MinMaxScaler
from sklearn.feature_extraction.text import _VectorizerMixin
from sklearn.feature_selection._base import SelectorMixin
from sklearn.feature_selection import SelectKBest
from sklearn.feature_extraction.text import CountVectorizer
train = pd.DataFrame({'age': [23,12, 12, np.nan],
'Gender': ['M','F', np.nan, 'F'],
'income': ['high','low','low','medium'],
'sales': [10000, 100020, 110000, 100],
'foo' : [1,0,0,1],
'text': ['I will test this',
'need to write more sentence',
'want to keep it simple',
'hope you got that these sentences are junk'],
'y': [0,1,1,1]})
numeric_columns = ['age']
cat_columns = ['Gender','income']
numeric_pipeline = make_pipeline(SimpleImputer(strategy='median'), StandardScaler())
cat_pipeline = make_pipeline(SimpleImputer(strategy='most_frequent'), OneHotEncoder())
text_pipeline = make_pipeline(CountVectorizer(), SelectKBest(k=5))
transformers = [
('num', numeric_pipeline, numeric_columns),
('cat', cat_pipeline, cat_columns),
('text', text_pipeline, 'text'),
('simple_transformer', MinMaxScaler(), ['sales']),
]
combined_pipe = ColumnTransformer(
transformers, remainder='passthrough')
transformed_data = combined_pipe.fit_transform(
train.drop('y',1), train['y'])
def get_feature_out(estimator, feature_in):
if hasattr(estimator,'get_feature_names'):
if isinstance(estimator, _VectorizerMixin):
# handling all vectorizers
return [f'vec_{f}' \
for f in estimator.get_feature_names()]
else:
return estimator.get_feature_names(feature_in)
elif isinstance(estimator, SelectorMixin):
return np.array(feature_in)[estimator.get_support()]
else:
return feature_in
def get_ct_feature_names(ct):
# handles all estimators, pipelines inside ColumnTransfomer
# doesn't work when remainder =='passthrough'
# which requires the input column names.
output_features = []
for name, estimator, features in ct.transformers_:
if name!='remainder':
if isinstance(estimator, Pipeline):
current_features = features
for step in estimator:
current_features = get_feature_out(step, current_features)
features_out = current_features
else:
features_out = get_feature_out(estimator, features)
output_features.extend(features_out)
elif estimator=='passthrough':
output_features.extend(ct._feature_names_in[features])
return output_features
pd.DataFrame(transformed_data,
columns=get_ct_feature_names(combined_pipe))

How do i fix "If using all scalar values, you must pass an index" error?

I am manually trying to build a linear regression model for understanding purpose without using the builtin function. I am getting the error while plotting the regression line. Kindly help me fix it.
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sb
data = {'X': list(np.arange(0,10,1)), 'Y': [1,3,2,5,7,8,8,9,10,12]}
df = pd.DataFrame(data)
df2 = pd.DataFrame(np.ones(10), columns = ['ones'])
df_new = pd.concat([df2,df], axis = 1)
X = df_new.loc[:, ['ones', 'X']].values
Y = df_new['Y'].values.reshape(-1,1)
theta = np.array([0.5, 0.2]).reshape(-1,1)
Y_pred = X.dot(theta)
sb.lineplot(df['X'].values.reshape(-1,1),Y_pred)
plt.show()
Error message:
If using all scalar values, you must pass an index
You are passing a 2d array, while seaborn's lineplot expects a 1d array (or a pandas column which is basically same). So change it to
sb.lineplot(df['X'].values,Y_pred.reshape(-1))

Unable to use "from sklearn.preprocessing import Imputer" , it shows the exception " Data must be 1-dimensional"

I have made a model for the artificial neural network(ANN). I want to preprocess the data before train the model.
I have tried the code given below.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
dataset = pd.read_csv('Update-Detaset with hacking1.csv')
y=[]
X = dataset.iloc[:,2:7]
y = dataset.iloc[:,8]
from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values = 'NaN', strategy = 'mean', axis = 0)
Y = np.reshape(y,(-1,1))
imputer = imputer.fit(Y)
Y= imputer.transform(Y)
Exception: Data must be 1-dimensional
Here, Update-Detaset with hacking1.csv is the .csv file. The dataset is lookig like:
Please click the link to see the demo of the csv file
It shows the following errors.
How can I solve this?
This has nothing to do with Imputer. You should have been able to tell this from the line number that threw the Exception. The error is from you trying to reshape a pandas DataFrame. Change
y = dataset.iloc[:,8]
to
y = dataset.iloc[:,8].values
and it should work.

Dask: How would I parallelize my code with dask delayed?

This is my first venture into parallel processing and I have been looking into Dask but I am having trouble actually coding it.
I have had a look at their examples and documentation and I think dask.delayed will work best. I attempted to wrap my functions with the delayed(function_name), or add an #delayed decorator, but I can't seem to get it working properly. I preferred Dask over other methods since it is made in python and for its (supposed) simplicity. I know dask doesn't work on the for loop, but they say it can work inside a loop.
My code passes files through a function that contains inputs to other functions and looks like this:
from dask import delayed
filenames = ['1.csv', '2.csv', '3.csv', etc. etc. ]
for count, name in enumerate(filenames)"
name = name.split('.')[0]
....
then do some pre-processing ex:
preprocess1, preprocess2 = delayed(read_files_and_do_some_stuff)(name)
then I call a constructor and pass the pre_results in to the function calls:
fc = FunctionCalls()
Daily = delayed(fc.function_runs)(filename=name, stringinput='Daily',
input_data=pre_result1, model1=pre_result2)
What i do here is I pass the file into the for loop, do some pre-processing and then pass the file into two models.
Thoughts or tips on how to do parallelize this? I began getting odd errors and I had no idea how to fix the code. The code does work as is. I use a bunch of pandas dataframes, series, and numpy arrays, and I would prefer not to go back and change everything to work with dask.dataframes etc.
The code in my comment may be difficult to read. Here it is in a more formatted way.
In the code below, when I type print(mean_squared_error) I just get: Delayed('mean_squared_error-3009ec00-7ff5-4865-8338-1fec3f9ed138')
from dask import delayed
import pandas as pd
from sklearn.metrics import mean_squared_error as mse
filenames = ['file1.csv']
for count, name in enumerate(filenames):
file1 = pd.read_csv(name)
df = pd.DataFrame(file1)
prediction = df['Close'][:-1]
observed = df['Close'][1:]
mean_squared_error = delayed(mse)(observed, prediction)
You need to call dask.compute to eventually compute the result. See dask.delayed documentation.
Sequential code
import pandas as pd
from sklearn.metrics import mean_squared_error as mse
filenames = [...]
results = []
for count, name in enumerate(filenames):
file1 = pd.read_csv(name)
df = pd.DataFrame(file1) # isn't this already a dataframe?
prediction = df['Close'][:-1]
observed = df['Close'][1:]
mean_squared_error = mse(observed, prediction)
results.append(mean_squared_error)
Parallel code
import dask
import pandas as pd
from sklearn.metrics import mean_squared_error as mse
filenames = [...]
delayed_results = []
for count, name in enumerate(filenames):
df = dask.delayed(pd.read_csv)(name)
prediction = df['Close'][:-1]
observed = df['Close'][1:]
mean_squared_error = dask.delayed(mse)(observed, prediction)
delayed_results.append(mean_squared_error)
results = dask.compute(*delayed_results)
A much clearer solution, IMO, than the accepted answer is this snippet.
from dask import compute, delayed
import pandas as pd
from sklearn.metrics import mean_squared_error as mse
filenames = [...]
def compute_mse(file_name):
df = pd.read_csv(file_name)
prediction = df['Close'][:-1]
observed = df['Close'][1:]
return mse(observed, prediction)
delayed_results = [delayed(compute_mse)(file_name) for file_name in filenames]
mean_squared_errors = compute(*delayed_results, scheduler="processes")

Resources