Replace a bunch of if-else conditions with scikit-learn - python-3.x

I'm trying to wrap my head around ML with scikit-learn.
Here is what I'm trying to do:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
df = pd.DataFrame({
"f1": [1, 1],
"f2": [0, 0],
"c": [1, 0]
})
#df
f1 f2 c # f1, f2 - features / c - class/ classifier
1 1 1 # for f1 = 1 and f2 = 1 > expected c = 1
0 0 0 # for f1 = 0 and f2 = 0 > expected c = 0
dtc_clf = DecisionTreeClassifier()
features = df[["f1", "f2"]]
labels = df[["c"]]
dtc_clf.fit(features, labels)
test_features = pd.DataFrame({"ft1": [1, 1],
"ft2": [0, 0]})
#test_features
ft1 ft2 #I added for test exactly the training data
1 1
0 0
dtc_clf.predict(test_features)
#I'm getting this result:
#array([0, 0])
#I expected this result
#array([1, 0])
If '1, 1 => 1', then '0, 0 => 0'.
It should be 'array([1, 0])', right?
Each column is a condition: the value is 1 if the condition is met and 0 if not.
Basically, I'm trying to replace a lot of if-else conditions with ML.

It works with DecisionTreeRegressor:
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
# "beer": 1
# "wine": 2
df = pd.DataFrame({
"boy": [1, 0],
"hetero": [1, 1],
"drink": [1, 2]
})
X = df[["boy", "hetero"]]
y = df[["drink"]]
regr = DecisionTreeRegressor(random_state=0)
model = regr.fit(X, y)
# Make new observation
observation = [[1, 1]]
# Predict observation's value
model.predict(observation)
Result :
array([ 1.])
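A possible explanation for the classifier result in the question, offered as a sketch rather than a definitive diagnosis: pd.DataFrame({"f1": [1, 1], "f2": [0, 0], "c": [1, 0]}) builds columns, not rows, so both training rows are (f1=1, f2=0) with different labels and the tree has nothing to split on. Assuming the intended rows are (1, 1) -> 1 and (0, 0) -> 0, the data can be written like this (the test frame reuses the training column names, since recent scikit-learn versions may warn or raise on a mismatch):

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Each dict key is a column, so the lists run down the rows:
# row 0 is (f1=1, f2=1, c=1) and row 1 is (f1=0, f2=0, c=0).
df = pd.DataFrame({
    "f1": [1, 0],
    "f2": [1, 0],
    "c": [1, 0],
})

clf = DecisionTreeClassifier(random_state=0)
clf.fit(df[["f1", "f2"]], df["c"])

# Same column names as the training features.
test_features = pd.DataFrame({"f1": [1, 0], "f2": [1, 0]})
predictions = clf.predict(test_features)
print(predictions)  # [1 0]
```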

Related

List [0-1] to binary representation fast

I am trying to convert the rows [0-1] of a matrix to representation in number (binary equivalent), the code I have is the following:
import numpy as np
def generate_binary_matrix(matrix):
    result = []
    for i in matrix:
        val = '0b' + ''.join([str(x) for x in i])
        result.append(int(val, 2))
    result = np.array(result)
    return result
initial_matrix = np.array([[0, 1, 0], [1, 0, 0], [0, 0, 1]])
result = generate_binary_matrix(initial_matrix )
print(result)
This code works, but it is very slow. Does anyone know a faster way to do it?
You can convert a 0/1 list to binary using just arithmetic, which should be faster:
from functools import reduce
b = reduce(lambda r, x: 2*r + x, i)
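In the snippet above, i is one row of the matrix, as in the loop from the question. A minimal sketch of applying it to every row, where rows_to_ints is a hypothetical helper name chosen for illustration:

```python
from functools import reduce

import numpy as np


def rows_to_ints(matrix):
    """Fold each 0/1 row into an integer, most significant bit first."""
    return [int(reduce(lambda r, x: 2 * r + x, row)) for row in matrix]


initial_matrix = np.array([[0, 1, 0], [1, 0, 0], [0, 0, 1]])
values = rows_to_ints(initial_matrix)
print(values)  # [2, 4, 1]
```

This still loops over the rows in Python, so it mainly removes the string building and parsing.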
Suppose your matrix is a NumPy array A with m rows and n columns.
Create a vector b with n elements by:
b = np.power(2, np.arange(n))[::-1]
Then your answer is A @ b
Example:
import numpy as np
A = np.array([[0, 0, 1], [1, 0, 1]])
n = A.shape[1]
b = np.power(2, np.arange(n))[::-1]
print(A @ b)  # --> [1 5]
Update: I reversed b because the MSB (2^(n-1)) corresponds to A[:, 0], fixed the arguments to power, which were mistakenly flipped, and added an example.

How can I change the color of an image by using cv2?

I have an image and I want to change its color,
then show the before and after.
This is what I wrote:
import numpy as np
import cv2
Original_img = cv2.imread('img.jpg')
New_img = Original_img
print(Original_img[0 , 20] , New_img[0 , 20])
New_img[0 , 20] = 0 #change the color of new
print( Original_img[0 , 20] , New_img[0 , 20])
But it turns out that both images change.
I only want the new one to change.
Output:
[55 69 75] [55 69 75]
[0 0 0] [0 0 0]
This is a tricky one. It turns out that your Original_img and New_img both refer to the same underlying object in Python. You need to make a copy to create a new object:
New_img = Original_img.copy() # copy() is a method of the NumPy array
Python lists behave this way too. Here is a simple annotated example using an interactive Python session:
>>> a = [1,2,3]
>>> b = a
>>> b
[1, 2, 3]
>>> b[1] = 3.1415927 # we think we are only changing b
>>> b
[1, 3.1415927, 3] # b is changed
>>> a
[1, 3.1415927, 3] # a is also changed
Same example, using copy()
>>> from copy import copy
>>> a = [1,2,3]
>>> b = copy(a) # now we copy a
>>> b
[1, 2, 3]
>>> b[1] = 3.1415927
>>> b
[1, 3.1415927, 3] # b is changed
>>> a
[1, 2, 3] # a is unchanged!
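The same behaviour can be checked on a NumPy array directly. The sketch below uses a small synthetic array as a stand-in for the result of cv2.imread (an assumption, so that it runs without an image file) and np.shares_memory to show which names share a buffer:

```python
import numpy as np

# Synthetic 2x3 three-channel "image" standing in for cv2.imread('img.jpg').
original = np.full((2, 3, 3), 55, dtype=np.uint8)

alias = original               # second name for the same array
independent = original.copy()  # separate buffer

alias[0, 1] = 0          # also visible through `original`
independent[0, 2] = 255  # does not touch `original`

print(original[0, 1], original[0, 2])           # [0 0 0] [55 55 55]
print(np.shares_memory(original, alias))        # True
print(np.shares_memory(original, independent))  # False
```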

Scikit-learn ColumnTransformer + OneHotEncoder

from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
import numpy as np
X = [['male', 0, 3], ['male', 1, 0], ['female', 2, 1], ['female', 0, 2]]
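# Encode strings as integers (np.int was removed from recent NumPy releases; the built-in int is equivalent)
sex_enc = OrdinalEncoder(dtype=int)
# One-hot encoding (scikit-learn 1.2+ renames the sparse parameter to sparse_output)
one_hot_enc = OneHotEncoder(sparse=False, handle_unknown='ignore', dtype=int)
# Convert the strings in column 0 to integers, then one-hot encode all columns
col_transformer = ColumnTransformer(transformers = [('sex_enc', sex_enc, [0]), ('one_hot_enc', one_hot_enc, [0])])
# Fit the encoders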
col_transformer.fit(X)
X_trans = col_transformer.transform(X)
print(X_trans)
[[1 0 1]
 [1 0 1]
 [0 1 0]
 [0 1 0]]
Feature 0 has only two values, male and female, so why does the one-hot encoding output three columns with ColumnTransformer?
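One way to read this output, offered as a sketch: ColumnTransformer applies each listed transformer to the original input columns independently and concatenates the results side by side, so it does not chain the ordinal encoder into the one-hot encoder. Under that reading, the three columns are one ordinal column plus two one-hot columns. The example below rebuilds the pieces by hand; it uses .toarray() and sparse_threshold=0 to get dense output, which is a choice made here to avoid the sparse/sparse_output parameter:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

X = [['male', 0, 3], ['male', 1, 0], ['female', 2, 1], ['female', 0, 2]]
sex_column = [[row[0]] for row in X]  # column 0 only, kept two-dimensional

# Each encoder sees the raw strings, not the other encoder's output.
ordinal_part = OrdinalEncoder(dtype=int).fit_transform(sex_column)
one_hot_part = OneHotEncoder(handle_unknown='ignore', dtype=int) \
    .fit_transform(sex_column).toarray()

# Side-by-side concatenation: 1 ordinal column + 2 one-hot columns.
combined = np.hstack([ordinal_part, one_hot_part])

col_transformer = ColumnTransformer(
    transformers=[
        ('sex_enc', OrdinalEncoder(dtype=int), [0]),
        ('one_hot_enc', OneHotEncoder(handle_unknown='ignore', dtype=int), [0]),
    ],
    sparse_threshold=0,  # always return a dense array
)
X_trans = col_transformer.fit_transform(X)

print(combined)
print(np.array_equal(combined, X_trans))
```

If only the one-hot columns are wanted, listing the OneHotEncoder alone for column 0 should be enough, since it accepts string categories directly.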

Concatenate two 1 column DataFrames doesn't return both columns

I'm using Python 3.6 and I'm a newbie so thanks in advance for your patience.
I have a function that sums the difference between 3 points. It should then take the 'differences' and concatenate them with another DataFrame called labels. k and length are integers. I expected the resulting DataFrame to have two columns but it only has one.
Sample Code:
def distance(df1,df2,labels,k,length):
    total_dist = 0
    for i in range(length):
        dist_dif = df1.iloc[:,i] - df2.iloc[:,i]
        sq_dist = dist_dif ** 2
        root_dist = sq_dist ** 0.5
        total_dist = total_dist + root_dist
    return total_dist
    distance_df = pd.concat([total_dist, labels], axis=1)
    distance_df.sort(ascending=False, axis=1, inplace=True)
    top_knn = distance_df[:k]
    return top_knn.value_counts().index.values[0]
Sample Data:
d1 = {'Z_Norm_Age': [1.20, 2.58,2.54], 'Pclass': [3, 3, 2], 'Conv_Sex': [0, 1, 0]}
d2 = {'Z_Norm_Age': [-0.51, 0.24,0.67], 'Pclass': [3, 1, 3], 'Conv_Sex': [0, 1, 1]}
lbl = {'Survived': [0, 1,1]}
df1 = pd.DataFrame(data=d1)
df2 = pd.DataFrame(data=d2)
labels = pd.DataFrame(data=lbl)
I expected the data to look something like this:
total_dist labels
0 1.715349 0
1 2.872991 1
2 4.344087 1
but instead it looks like this:
0 1.715349
1 4.344087
2 2.872991
dtype: float64
The output doesn't do the following:
1. Return the labels column data
2. Sort the data in descending order
If someone could point me in the right direction, I'd truly appreciate it.
Given two DataFrames, df1-df2 performs the subtraction element-wise. Use abs() to take the absolute value of that difference, and finally sum each row. That explains the first line of the following function; the other lines are similar to your code.
import numpy as np
import pandas as pd
def calc_abs_distance_between_rows_then_add_labels_and_sort(df1, df2, labels):
    diff = np.sum(np.abs(df1-df2), axis=1) # np.sum(..., axis=1) sums the rows
    diff.name = 'total_abs_distance' # Not really necessary, but just to refer to it later
    diff = pd.concat([diff, labels], axis=1)
    diff.sort_values(by='total_abs_distance', axis=0, ascending=True, inplace=True)
    return diff
So for your example data:
d1 = {'Z_Norm_Age': [1.20, 2.58,2.54], 'Pclass': [3, 3, 2], 'Conv_Sex': [0, 1, 0]}
d2 = {'Z_Norm_Age': [-0.51, 0.24,0.67], 'Pclass': [3, 1, 3], 'Conv_Sex': [0, 1, 1]}
lbl = {'Survived': ['a', 'b', 'c']}
df1 = pd.DataFrame(data=d1)
df2 = pd.DataFrame(data=d2)
labels = pd.DataFrame(data=lbl)
calc_abs_distance_between_rows_then_add_labels_and_sort(df1, df2, labels)
We get hopefully what you wanted:
   total_abs_distance Survived
0                1.71        a
2                3.87        c
1                4.34        b
A few notes:
Did you really want the L1-norm? If you wanted the L2-norm (Euclidean distance), then replace the first command in that function above by np.sqrt(np.sum(np.square(df1-df2),axis=1)).
What's the purpose of those labels? Consider using the index of the DataFrames instead. Maybe it will fit your purposes better? For example:
# lbl_series = pd.Series(['a','b','c'], name='Survived') # Try this later instead of lbl_list, to further explore the wonders of Pandas indexes :)
lbl_list = ['a', 'b', 'c']
df1.index = lbl_list
df2.index = lbl_list
# Then the L1-norm is simply this:
np.sum(np.abs(df1 - df2), axis=1).sort_values()
# Whose output is the Series: (with the labels as its index)
a 1.71
c 3.87
b 4.34
dtype: float64
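For the L2-norm mentioned in the first note, a minimal sketch on the sample data from the question might look like this. The column name total_l2_distance is made up for illustration, and the numeric Survived labels are taken from the question rather than the letters used above:

```python
import numpy as np
import pandas as pd

d1 = {'Z_Norm_Age': [1.20, 2.58, 2.54], 'Pclass': [3, 3, 2], 'Conv_Sex': [0, 1, 0]}
d2 = {'Z_Norm_Age': [-0.51, 0.24, 0.67], 'Pclass': [3, 1, 3], 'Conv_Sex': [0, 1, 1]}
df1 = pd.DataFrame(data=d1)
df2 = pd.DataFrame(data=d2)
labels = pd.DataFrame(data={'Survived': [0, 1, 1]})

# Euclidean distance per row: square the differences, sum each row, take the root.
l2 = np.sqrt(np.sum(np.square(df1 - df2), axis=1))
l2.name = 'total_l2_distance'

result = pd.concat([l2, labels], axis=1).sort_values(by='total_l2_distance')
print(result)
```

Because sort_values returns a new DataFrame here, inplace=True is not needed.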

xarray equivalent to pandas subtract/add

I'm looking for a concise way to do arithmetic on a single dimension of a DataArray, and then have the result returned as a new DataArray (both the changed and unchanged parts). In pandas, I would do this using df.subtract(), but I haven't found a way to do this with xarray.
Here's how I would subtract the value 2 from the x dimension in pandas:
data = np.arange(0,6).reshape(2,3)
xc = np.arange(0, data.shape[0])
yc = np.arange(0, data.shape[1])
df1 = pd.DataFrame(data, index=xc, columns=yc)
df2 = df1.subtract(2, axis='columns')
For xarray though I don't know:
da1 = xr.DataArray(data, coords={'x': xc, 'y': yc}, dims=['x' , 'y'])
da2 = ?
In xarray, you can subtract from the rows or columns of an array by using broadcasting by dimension name.
For example:
>>> foo = xarray.DataArray([[1, 2, 3], [4, 5, 6]], dims=['x', 'y'])
>>> bar = xarray.DataArray([1, 4], dims='x')
# subtract along 'x'
>>> foo - bar
<xarray.DataArray (x: 2, y: 3)>
array([[0, 1, 2],
[0, 1, 2]])
Dimensions without coordinates: x, y
>>> baz = xarray.DataArray([1, 2, 3], dims='y')
# subtract along 'y'
>>> foo - baz
<xarray.DataArray (x: 2, y: 3)>
array([[0, 0, 0],
[3, 3, 3]])
Dimensions without coordinates: x, y
This works similarly to the axis='columns' vs axis='index' options that pandas provides, except that the desired dimension is referenced by name.
When you do:
df1 = pd.DataFrame(data, index=xc, columns=yc)
df2 = df1.subtract(2, axis='columns')
You really are just subtracting 2 from the entire dataset...
Here is your output from above:
In [15]: df1
Out[15]:
0 1 2
0 0 1 2
1 3 4 5
In [16]: df2
Out[16]:
0 1 2
0 -2 -1 0
1 1 2 3
Which is equivalent to:
df3 = df1.subtract(2)
In [20]: df3
Out[20]:
0 1 2
0 -2 -1 0
1 1 2 3
And equivalent to:
df4 = df1 -2
In [22]: df4
Out[22]:
0 1 2
0 -2 -1 0
1 1 2 3
Therefore, for an xarray data array:
da1 = xr.DataArray(data, coords={'x': xc, 'y': yc}, dims=['x' , 'y'])
da2 = da1-2
In [24]: da1
Out[24]:
<xarray.DataArray (x: 2, y: 3)>
array([[0, 1, 2],
[3, 4, 5]])
Coordinates:
* y (y) int64 0 1 2
* x (x) int64 0 1
In [25]: da2
Out[25]:
<xarray.DataArray (x: 2, y: 3)>
array([[-2, -1, 0],
[ 1, 2, 3]])
Coordinates:
* y (y) int64 0 1 2
* x (x) int64 0 1
Now, if you would like to subtract from a specific column, that's a different problem, which I believe would require assignment indexing.
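One possible approach for the single-column case that avoids assignment, sketched here under the assumption that "a specific column" means one position along y: build an offset DataArray along y that is zero everywhere except at that position, and let xarray broadcast it by dimension name. The name offset and the choice of y == 1 are illustrative only:

```python
import numpy as np
import xarray as xr

data = np.arange(0, 6).reshape(2, 3)
xc = np.arange(0, data.shape[0])
yc = np.arange(0, data.shape[1])
da1 = xr.DataArray(data, coords={'x': xc, 'y': yc}, dims=['x', 'y'])

# One offset per 'y' position; only the column at y == 1 gets a nonzero value.
offset = xr.DataArray([0, 2, 0], coords={'y': yc}, dims='y')

# Broadcasting by name subtracts the offset from every row along 'x'.
da2 = da1 - offset
print(da2.values)
```

The result is a new DataArray, so da1 keeps its original values.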
