A function to insert data in dataset using python - python-3.x

I created a program that predicts digits from a dataset. When it makes a prediction, there should be two cases: if the prediction is right, the data should be added to the dataset automatically; otherwise, the program should take the right answer from the user and insert that into the dataset.
code
import numpy as np
import pandas as pd
import matplotlib.pyplot as pt
from sklearn.tree import DecisionTreeClassifier
data = pd.read_csv("train.csv").values
clf = DecisionTreeClassifier()
xtrain = data[0:21000,1:]
train_label=data[0:21000,0]
clf.fit(xtrain,train_label)
xtest = data[21000: ,1:]
actual_label=data[21000:,0]
d = xtest[9]
d.shape = (28,28)
pt.imshow(d,cmap='gray')
print(clf.predict([xtest[9]]))
pt.show()

I'm not sure I follow your question, but if you want to distinguish between correct and wrong predictions and handle them differently, you have to do that explicitly.
predictions = clf.predict(xtest)
good_predictions = xtest[pd.Series(predictions == actual_label)]
bad_predictions = xtest[pd.Series(predictions != actual_label)]
So good_predictions will contain all the rows in xtest that were predicted correctly, and bad_predictions the rows that were not.
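The two cases from the question could be sketched roughly as follows, assuming the dataset is a NumPy array whose first column is the label and whose remaining columns are the pixel values (as in train.csv above). The function name append_labeled_sample and its ask parameter are hypothetical, introduced only for illustration; whether a prediction is right is decided here by asking the user.

```python
import numpy as np

def append_labeled_sample(dataset, features, predicted_label, ask=input):
    """Return a copy of dataset with one new row [label, *features] appended.

    `ask` is a callable like input(); it is a parameter so the function
    can be exercised without a keyboard (a hypothetical design choice).
    """
    answer = ask(f"Predicted {predicted_label}. Is that correct? (y/n) ")
    if answer.strip().lower().startswith("y"):
        label = int(predicted_label)                   # prediction confirmed: keep it
    else:
        label = int(ask("Enter the correct digit: "))  # take the label from the user
    new_row = np.concatenate(([label], np.ravel(features)))
    return np.vstack([dataset, new_row])

# Tiny stand-in for the real data: a label column followed by two "pixels".
data = np.array([[1, 0, 0],
                 [2, 5, 5]])

# Simulate a user who confirms the prediction.
confirmed = append_labeled_sample(data, [7, 7], 3, ask=lambda prompt: "y")

# Simulate a user who rejects the prediction and supplies the digit 4.
replies = iter(["n", "4"])
corrected = append_labeled_sample(data, [7, 7], 3, ask=lambda prompt: next(replies))
```

With the variables from the question, a call might look like data = append_labeled_sample(data, xtest[9], clf.predict([xtest[9]])[0]). Writing the enlarged array back to the CSV file is left out of this sketch.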

Related

How do i fix "If using all scalar values, you must pass an index" error?

I am building a linear regression model by hand, without the built-in functions, to understand how it works. I get an error while plotting the regression line. Kindly help me fix it.
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sb
data = {'X': list(np.arange(0,10,1)), 'Y': [1,3,2,5,7,8,8,9,10,12]}
df = pd.DataFrame(data)
df2 = pd.DataFrame(np.ones(10), columns = ['ones'])
df_new = pd.concat([df2,df], axis = 1)
X = df_new.loc[:, ['ones', 'X']].values
Y = df_new['Y'].values.reshape(-1,1)
theta = np.array([0.5, 0.2]).reshape(-1,1)
Y_pred = X.dot(theta)
sb.lineplot(df['X'].values.reshape(-1,1),Y_pred)
plt.show()
Error message:
If using all scalar values, you must pass an index
You are passing a 2D array, while seaborn's lineplot expects a 1D array (or a pandas column, which is basically the same). So change it to
sb.lineplot(x=df['X'].values, y=Y_pred.reshape(-1))
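The shape difference behind this fix can be checked without plotting anything. A minimal sketch that rebuilds the question's matrices with NumPy only (the variable name y_flat is an assumption, not from the original code):

```python
import numpy as np

x = np.arange(0, 10, 1)
X = np.column_stack([np.ones(10), x])        # columns: ones, X
theta = np.array([0.5, 0.2]).reshape(-1, 1)

Y_pred = X.dot(theta)          # shape (10, 1): a 2D column vector
y_flat = Y_pred.reshape(-1)    # shape (10,): the 1D form lineplot expects

print(Y_pred.shape, y_flat.shape)
```

Printing the shapes of the arrays passed to a plotting call is usually the quickest way to spot this kind of mismatch.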

Unable to use "from sklearn.preprocessing import Imputer" , it shows the exception " Data must be 1-dimensional"

I have built an artificial neural network (ANN) model. I want to preprocess the data before training the model.
I have tried the code given below.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
dataset = pd.read_csv('Update-Detaset with hacking1.csv')
y=[]
X = dataset.iloc[:,2:7]
y = dataset.iloc[:,8]
from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values = 'NaN', strategy = 'mean', axis = 0)
Y = np.reshape(y,(-1,1))
imputer = imputer.fit(Y)
Y= imputer.transform(Y)
Exception: Data must be 1-dimensional
Here, Update-Detaset with hacking1.csv is the .csv file. The dataset looks like this:
Please click the link to see the demo of the csv file
It shows the following errors.
How can I solve this?
This has nothing to do with Imputer. You should have been able to tell this from the line number that threw the exception. The error comes from trying to reshape a pandas Series with np.reshape. Change
y = dataset.iloc[:,8]
to
y = dataset.iloc[:,8].values
and it should work.
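Note that in scikit-learn 0.22 and later, Imputer no longer exists in sklearn.preprocessing; SimpleImputer from sklearn.impute is its replacement. A minimal sketch of the same fix with the newer class, using a small made-up column in place of the CSV file (the column name target and its values are assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up stand-in for the real CSV: one column with a missing value.
dataset = pd.DataFrame({"target": [1.0, np.nan, 3.0]})

# Convert the Series to a NumPy array first, then reshape to a single column.
y = dataset.iloc[:, 0].to_numpy().reshape(-1, 1)

# SimpleImputer works column-wise, so it takes no axis argument.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
y_filled = imputer.fit_transform(y)

print(y_filled)
```

The missing entry is replaced by the mean of the remaining values in its column.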

Why am I getting a score of 0.0 when finding the score of test data using Gaussian NB classifier?

I have two different data sets, one for training my classifier and the other for testing. Both data sets are text files with two columns separated by a ",". The first column (numbers) is the independent variable and the second column (group) is the dependent variable.
Training data set
(just a few lines as an example; there are no empty lines between rows):
EMI3776438,1
EMI3776438,1
EMI3669492,1
EMI3752004,1
Testing data set
(as you can see, I have picked data from the training data to be sure that the score can't be zero)
EMI3776438,1
Code in Python 3.6:
# #all the import statements have been ignored to keep the code short
# #loading the training data set
training_file_path=r'C:\Users\yyy\Desktop\my files\python\Machine learning\Carepack\modified_columns.txt'
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
training_file_data = pandas.read_table(training_file_path,
header=None,
names=['numbers','group'],
sep=',')
training_file_data = training_file_data.apply(le.fit_transform)
features = ['numbers']
x = training_file_data[features]
y = training_file_data["group"]
from sklearn.model_selection import train_test_split
training_x,testing_x, training_y, testing_y = train_test_split(x, y,
random_state=0,
test_size=0.1)
from sklearn.naive_bayes import GaussianNB
gnb= GaussianNB()
gnb.fit(training_x, training_y)
# #loading the testing data
testing_final_path=r"C:\Users\yyy\Desktop\my files\python\Machine learning\Carepack\testing_final.txt"
testing_sample_data=pandas.read_table(testing_final_path,
sep=',',
header=None,
names=['numbers','group'])
testing_sample_data = testing_sample_data.apply(le.fit_transform)
category = ["numbers"]
testing_sample_data_x = testing_sample_data[category]
# #finding the score of the test data
print(gnb.score(testing_sample_data_x, testing_sample_data["group"]))
First, the data samples above don't show how many classes there are. You need to describe that in more detail.
Secondly, you are calling le.fit_transform again on the test data, which discards the mapping from strings to numbers learned on the training samples. The LabelEncoder le will start encoding the test data from scratch, and the result will not match how it mapped the training data. So the input to GaussianNB is now incorrect, and hence so are the results.
Change that to:
testing_sample_data = testing_sample_data.apply(le.transform)
UPDATE:
I'm sorry, I overlooked the fact that you have two columns in your data. LabelEncoder only works on a single column of data. To make it work on multiple pandas columns at once, look at the answers to the following question:
Label encoding across multiple columns in scikit-learn
If you are using scikit-learn 0.20 or later, or can update to it, you do not need any such hacks and can use OrdinalEncoder directly:
from sklearn.preprocessing import OrdinalEncoder
enc = OrdinalEncoder()
training_file_data = enc.fit_transform(training_file_data)
And during testing:
testing_sample_data = enc.transform(testing_sample_data)
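The effect of refitting the encoder on the test data can be seen on the IDs from the question. A minimal sketch for a single column (the variable names train_ids, test_ids, consistent and refit are assumptions made for illustration):

```python
from sklearn.preprocessing import LabelEncoder

train_ids = ["EMI3776438", "EMI3776438", "EMI3669492", "EMI3752004"]
test_ids = ["EMI3776438"]

le = LabelEncoder()
train_codes = le.fit_transform(train_ids)   # learn the string -> integer mapping once

consistent = le.transform(test_ids)               # reuses the training mapping
refit = LabelEncoder().fit_transform(test_ids)    # starts from scratch on the test data

print(train_codes, consistent, refit)
```

The same ID gets code 2 under the training mapping but code 0 after refitting, so a model trained on the first encoding sees a different feature value for an identical row.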

Dask: How would I parallelize my code with dask delayed?

This is my first venture into parallel processing and I have been looking into Dask but I am having trouble actually coding it.
I have had a look at their examples and documentation, and I think dask.delayed will work best. I attempted to wrap my functions with delayed(function_name), or to add a @delayed decorator, but I can't seem to get it working properly. I preferred Dask over other methods since it is written in Python and for its (supposed) simplicity. I know dask doesn't work on the for loop itself, but they say it can work inside a loop.
My code passes files through a function that contains inputs to other functions and looks like this:
from dask import delayed
filenames = ['1.csv', '2.csv', '3.csv', etc. etc. ]
for count, name in enumerate(filenames):
    name = name.split('.')[0]
    ....
then do some pre-processing, e.g.:
pre_result1, pre_result2 = delayed(read_files_and_do_some_stuff)(name)
then I call a constructor and pass the pre-processing results into the function calls:
fc = FunctionCalls()
Daily = delayed(fc.function_runs)(filename=name, stringinput='Daily',
input_data=pre_result1, model1=pre_result2)
What I do here is pass each file into the for loop, do some pre-processing, and then pass the file into two models.
Any thoughts or tips on how to parallelize this? I began getting odd errors and I had no idea how to fix the code. The code does work as is. I use a bunch of pandas dataframes, series, and numpy arrays, and I would prefer not to go back and change everything to work with dask.dataframes etc.
The code in my comment may be difficult to read. Here it is in a more formatted way.
In the code below, when I type print(mean_squared_error) I just get: Delayed('mean_squared_error-3009ec00-7ff5-4865-8338-1fec3f9ed138')
from dask import delayed
import pandas as pd
from sklearn.metrics import mean_squared_error as mse
filenames = ['file1.csv']
for count, name in enumerate(filenames):
    file1 = pd.read_csv(name)
    df = pd.DataFrame(file1)
    prediction = df['Close'][:-1]
    observed = df['Close'][1:]
    mean_squared_error = delayed(mse)(observed, prediction)
You need to call dask.compute to eventually compute the result. See dask.delayed documentation.
Sequential code
import pandas as pd
from sklearn.metrics import mean_squared_error as mse
filenames = [...]
results = []
for count, name in enumerate(filenames):
    file1 = pd.read_csv(name)
    df = pd.DataFrame(file1) # isn't this already a dataframe?
    prediction = df['Close'][:-1]
    observed = df['Close'][1:]
    mean_squared_error = mse(observed, prediction)
    results.append(mean_squared_error)
Parallel code
import dask
import pandas as pd
from sklearn.metrics import mean_squared_error as mse
filenames = [...]
delayed_results = []
for count, name in enumerate(filenames):
    df = dask.delayed(pd.read_csv)(name)
    prediction = df['Close'][:-1]
    observed = df['Close'][1:]
    mean_squared_error = dask.delayed(mse)(observed, prediction)
    delayed_results.append(mean_squared_error)
results = dask.compute(*delayed_results)
A much clearer solution, IMO, than the accepted answer is this snippet.
from dask import compute, delayed
import pandas as pd
from sklearn.metrics import mean_squared_error as mse
filenames = [...]
def compute_mse(file_name):
    df = pd.read_csv(file_name)
    prediction = df['Close'][:-1]
    observed = df['Close'][1:]
    return mse(observed, prediction)
delayed_results = [delayed(compute_mse)(file_name) for file_name in filenames]
mean_squared_errors = compute(*delayed_results, scheduler="processes")

Missing data Prediction

I have the Jester data set. It has 100 movies and their ratings, given by 24983 users, and a lot of the ratings are missing. My job is to predict them.
I want to start with a decision tree.
My plan: first, select the first column of the data (the ratings for the first movie) and delete it from the data. Then fit a model on the remaining columns, and finally find the prediction probabilities for the first column (the one deleted from the data).
I'm working in Python.
import numpy as np
import pandas as pd
from sklearn.preprocessing import Imputer
from sklearn.preprocessing import PolynomialFeatures
from sklearn.ensemble import RandomForestClassifier
df = pd.read_excel(input_file, header=None)
matrix = df.as_matrix()
imp = Imputer(missing_values=99, strategy='mean', axis=0)
imp.fit(matrix)
matrix= imp.transform(matrix)
train_data = matrix[:,:90] # train data (first 90 columns)
test_data = matrix[:,90:] # test data (last 10 columns, 10%)
array2 = train_data.copy()
column = array2[:,0] # select column 0 as the target
array2 = np.delete(array2,0,axis=1) # delete column 0 from the features
clf = RandomForestClassifier(n_estimators=25)
clf.fit(array2.astype(int), column.astype(int))
clf_probs = clf.predict_proba(column)
The last line gives this error: ValueError: Number of features of the model must match the input. Model n_features is 89 and input n_features is 24983
I have to predict the column as I described above the code.
What should I do? I really need help.
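One reading of the traceback, offered as a sketch rather than a definitive fix: predict_proba expects a feature matrix with the same 89 columns the model was fitted on, but the code passes the target column itself. Passing the feature matrix (array2 in the code above) matches what was used in fit. The example below uses small synthetic ratings instead of the real file, and the names ratings, features and target are assumptions:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the ratings matrix: 200 users, 5 items.
rng = np.random.default_rng(0)
ratings = rng.integers(-10, 11, size=(200, 5))

target = ratings[:, 0]                     # column to predict
features = np.delete(ratings, 0, axis=1)   # remaining columns are the inputs

clf = RandomForestClassifier(n_estimators=25, random_state=0)
clf.fit(features, target)

# Pass the feature matrix, not the target column, to predict_proba.
probs = clf.predict_proba(features)
print(probs.shape)   # (number of rows, number of distinct classes seen in fit)
```

Each row of probs sums to 1, with one probability per class in clf.classes_. To predict ratings that are actually missing, the model would need to be fitted only on rows where the target is known and then applied to the rows where it is not.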
