How does LabelEncoder() encode values? - scikit-learn

I want to know how LabelEncoder() functions.
This is part of my code:
for att in all_features_test:
    if str(test_home_data[att].dtypes) == 'object':
        test_home_data[att].fillna('Nothing', inplace=True)
        train_home_data[att].fillna('Nothing', inplace=True)
        train_home_data[att] = LabelEncoder().fit_transform(train_home_data[att])
        test_home_data[att] = LabelEncoder().fit_transform(test_home_data[att])
    else:
        test_home_data[att].fillna(0, inplace=True)
        train_home_data[att].fillna(0, inplace=True)
Both the train and test data sets have an attribute 'Condition', which can hold the values Bad, Average, and Good.
Let's say LabelEncoder() encodes Bad as 0, Average as 2, and Good as 1 in train_home_data. Would the encoding be the same for test_home_data?
If not, then what should I do?

You should not label-encode after the train/test split, but before.
The unique labels (= classes) are ordered alphabetically; see uniques = sorted(set(values)) in this source code snippet from sklearn.preprocessing.LabelEncoder, which links to the [source] on the upper right of the page.
python method:
def _encode_python(values, uniques=None, encode=False):
    # only used in _encode below, see docstring there for details
    if uniques is None:
        uniques = sorted(set(values))
        uniques = np.array(uniques, dtype=values.dtype)
    if encode:
        table = {val: i for i, val in enumerate(uniques)}
        try:
            encoded = np.array([table[v] for v in values])
        except KeyError as e:
            raise ValueError("y contains previously unseen labels: %s"
                             % str(e))
        return uniques, encoded
    else:
        return uniques
The same holds for numpy arrays of classes; see return np.unique(values), because unique() sorts by default:
numpy method:
def _encode_numpy(values, uniques=None, encode=False, check_unknown=True):
    # only used in _encode below, see docstring there for details
    if uniques is None:
        if encode:
            uniques, encoded = np.unique(values, return_inverse=True)
            return uniques, encoded
        else:
            # unique sorts
            return np.unique(values)
    if encode:
        if check_unknown:
            diff = _encode_check_unknown(values, uniques)
            if diff:
                raise ValueError("y contains previously unseen labels: %s"
                                 % str(diff))
        encoded = np.searchsorted(uniques, values)
        return uniques, encoded
    else:
        return uniques
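To see the resulting ordering concretely, here is a minimal check with the 'Condition' values from the question (my own illustration, not part of the source):
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(['Bad', 'Average', 'Good'])
print(list(le.classes_))                         # ['Average', 'Bad', 'Good'] -- sorted alphabetically
print(le.transform(['Bad', 'Average', 'Good']))  # [1 0 2]
So Average is encoded to 0, Bad to 1, and Good to 2, simply because that is their alphabetical order.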
You can never be sure that the test set and the training set have exactly the same classes. Either set might simply lack one of the three classes of the label column 'Condition'.
If you desperately want to encode after the train/test split, you need to check that the classes are the same in both sets before encoding.
Quoting the script:
Uses pure python method for object dtype, and numpy method for all
other dtypes.
python method (object type):
assert sorted(set(train_home_data[att])) == sorted(set(test_home_data[att]))
numpy method (all other types):
assert np.array_equal(np.unique(train_home_data[att]), np.unique(test_home_data[att]))
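As a sketch of the recommended order of operations (encode first, then split), assuming both frames share the 'Condition' column; the frame names follow the question:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

full = pd.concat([train_home_data, test_home_data], keys=['train', 'test'])
full['Condition'] = LabelEncoder().fit_transform(full['Condition'])
train_home_data = full.loc['train']
test_home_data = full.loc['test']
Because the encoder sees the union of both sets, every class gets one consistent code, and no "previously unseen labels" error can occur at transform time.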

I got the answer for this I guess.
Code
data1 = [('A', 1), ('B', 2), ('C', 3), ('D', 4)]
data2 = [('D', 1), ('A', 2), ('A', 3), ('B', 4)]
df1 = pd.DataFrame(data1, columns=['col1', 'col2'])
df2 = pd.DataFrame(data2, columns=['col1', 'col2'])
print(df1['col1'])
print(df2['col1'])
df1['col1'] = LabelEncoder().fit_transform(df1['col1'])
df2['col1'] = LabelEncoder().fit_transform(df2['col1'])
print(df1['col1'])
print(df2['col1'])
Output
0 A
1 B
2 C
3 D
Name: col1, dtype: object # df1
0 D
1 A
2 A
3 B
Name: col1, dtype: object # df2
0 0
1 1
2 2
3 3
Name: col1, dtype: int64 #df1 encoded
0 2
1 0
2 0
3 1
Name: col1, dtype: int64 #df2 encoded
B of df1 is encoded to 1,
and
B of df2 is encoded to 1 as well.
Note, however, that D is encoded to 3 in df1 but to 2 in df2, because df2 lacks the class C. So if I encode the training and testing data sets separately, the encoded values only match when both sets contain exactly the same classes.

I would suggest fitting the label encoder on one dataset and transforming both:
data1 = [('A', 1), ('B', 2), ('C', 3), ('D', 4)]
data2 = [('D', 1), ('A', 2), ('A', 3), ('B', 4)]
df1 = pd.DataFrame(data1, columns=['col1', 'col2'])
df2 = pd.DataFrame(data2, columns=['col1', 'col2'])
# here comes the new code:
le = LabelEncoder()
df1['col1'] = le.fit_transform(df1['col1'])
df2['col1'] = le.transform(df2['col1'])
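Note that le.transform(df2['col1']) raises a ValueError if df2 contains a label that df1 lacks. One defensive variation (my own sketch, not part of the answer above) is to fit on the union of both columns first:
le = LabelEncoder()
le.fit(pd.concat([df1['col1'], df2['col1']]))
df1['col1'] = le.transform(df1['col1'])
df2['col1'] = le.transform(df2['col1'])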

Related

Python: How to use one dictionary to decode the other?

Say I had two dictionaries:
d1 = {'a': 1, 'b': 2}
d2 = {'a': 'b', 'b': 'b', 'a': 'a'}
How can I use dictionary d1 as the rules to decode d2, such as:
def decode(dict_rules, dict_script):
    # do something
    return dict_result
decode(d1, d2)
>> {1:2, 2:2, 1:1}
Of course it can be written much shorter, but here is a version that shows the principle:
result_list = list()
result_dict = dict()
for d2_key in d2.keys():
    d2_key_decoded = d1[d2_key]
    d2_value = d2[d2_key]
    d2_value_decoded = d1[d2_value]
    result_dict[d2_key_decoded] = d2_value_decoded
    # add a tuple to the result list
    result_list.append((d2_key_decoded, d2_value_decoded))
The result might be unexpected, because the resulting dict would have entries with the same key, which is not possible, so the key 1 is overwritten:
>>> # equals to :
>>> result_dict[1] = 2
>>> result_dict[2] = 2
>>> result_dict[1] = 1
>>> # Result : {1:1, 2:2}
>>> # therefore I added a list of Tuples as result :
>>> # [(1, 2), (2, 2), (1, 1)]
But as @Patrik Artner pointed out, that is not even possible, because the input dictionary itself cannot have duplicate keys!
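For completeness, here is the promised shorter version as a dict comprehension (my sketch; the same overwriting caveat applies, and the d2 literal has already collapsed to {'a': 'a', 'b': 'b'} before the function ever sees it):
def decode(dict_rules, dict_script):
    # decode both key and value through the rules dictionary
    return {dict_rules[k]: dict_rules[v] for k, v in dict_script.items()}

print(decode(d1, d2))  # {1: 1, 2: 2}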

How to skip over np.nan while iterating through a dataframe for sentiment analysis

I have a data frame with 201279 entries; the last column, labeled "text", contains customer reviews. The problem is that most of them are missing values and come up as NaN.
I read some interesting information from this question:
Python numpy.nan and logical functions: wrong results
and I tried applying it to my problem:
df1.columns
Index(['id', 'sku', 'title', 'reviewCount', 'commentCount', 'averageRating',
'date', 'time', 'ProductName', 'CountOfBigTransactions', 'ClassID',
'Weight', 'Width', 'Depth', 'Height', 'LifeCycleName', 'FinishName',
'Color', 'Season', 'SizeOrUtility', 'Material', 'CountryOfOrigin',
'Quartile', 'display-name', 'online-flag', 'long-description', 'text'],
dtype='object')
I tried experimenting by doing this:
df['firstName'][202360] == np.nan
which returns False, but that index indeed contains an np.nan.
So I looked for an answer, read through the question I linked, and saw that
np.bool(df1['text'][201279])==True
is a true statement. I thought, okay, I can run with this.
So, here's my code so far:
from textblob import TextBlob
import string

def remove_num_punct(aText):
    p = string.punctuation
    d = string.digits
    j = p + d
    table = str.maketrans(j, len(j) * ' ')
    return aText.translate(table)

# Process text
aList = []
for text in df1['text']:
    if np.bool(df1['text']) == True:
        aList.append(np.nan)
    else:
        b = remove_num_punct(text)
        pol = TextBlob(b).sentiment.polarity
        aList.append(pol)
Then I would just convert aList with the sentiment to pd.DataFrame and join it to df1, then impute the missing values with K-nearest neighbors.
My problem is that the little routine I made throws a value error
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
So I'm not really sure what else to try. Thanks in advance!
EDIT: I have tried this:
i = 0
aList = []
for txt in df1['text'].isnull():
    i += 1
    if txt == True:
        aList.append(np.nan)
which correctly populates the list with NaN.
But this gives me a different error:
i = 0
aList = []
for txt in df1['text'].isnull():
    if txt == True:
        aList.append(np.nan)
    else:
        b = remove_num_punct(df1['text'][i])
        pol = TextBlob(b).sentiment.polarity
        aList.append(pol)
        i += 1
AttributeError: 'float' object has no attribute 'translate'
Which doesn't make sense, since if it is not NaN, then it contains text, right?
import pandas as pd
import numpy as np

df = pd.DataFrame({'age': [5, 6, np.NaN],
                   'born': [pd.NaT, pd.Timestamp('1939-05-27'), pd.Timestamp('1940-04-25')],
                   'name': ['Alfred', 'Batman', ''],
                   'toy': [None, 'Batmobile', 'Joker']})
df1 = df['toy']
for i in range(len(df1)):
    if not df1[i]:
        df2 = df1.drop(i)
df2
You can try this way to deal with the text that is null.
I fixed it: I had to move the i += 1 back from the else indentation to the for indentation. (With i += 1 inside the else, i stops advancing on NaN rows, so df1['text'][i] eventually points at a NaN float, which explains the 'float' object has no attribute 'translate' error.)
i = 0
aList = []
for txt in df1['text'].isnull():
    if txt == True:
        aList.append(np.nan)
    else:
        b = remove_num_punct(df1['text'][i])
        pol = TextBlob(b).sentiment.polarity
        aList.append(pol)
    i += 1
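For reference, the same logic can also be written without the manual counter by mapping a helper over the column; this is a sketch that reuses remove_num_punct, df1, and the imports from above (the helper name polarity_or_nan is mine):
def polarity_or_nan(text):
    # pd.isna is a safe per-element NaN test; NaN rows pass through untouched
    if pd.isna(text):
        return np.nan
    return TextBlob(remove_num_punct(text)).sentiment.polarity

aList = df1['text'].apply(polarity_or_nan).tolist()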

Using non-zero values from columns in function - pandas

I have the dataframe below and would like to calculate, within a function, the difference between columns 'animal1' and 'animal2' over their sum, while only taking into consideration the values that are greater than 0 in each of the columns 'animal1' and 'animal2'.
How could I do this?
import pandas as pd

animal1 = pd.Series({'Cat': 4, 'Dog': 0, 'Mouse': 2, 'Cow': 0, 'Chicken': 3})
animal2 = pd.Series({'Cat': 2, 'Dog': 3, 'Mouse': 0, 'Cow': 1, 'Chicken': 2})
data = pd.DataFrame({'animal1': animal1, 'animal2': animal2})

def animals():
    data['anim_diff'] = (data['animal1'] - data['animal2']) / (data['animal1'] + data['animal2'])
    return data['anim_diff'].abs().idxmax()

print(data)
I believe you need to check that all values in a row are greater than 0, using DataFrame.gt with DataFrame.all, and then filter by boolean indexing:
def animals(data):
    data['anim_diff'] = (data['animal1'] - data['animal2']) / (data['animal1'] + data['animal2'])
    return data['anim_diff'].abs().idxmax()

df = data[data.gt(0).all(axis=1)].copy()
# alternative for not equal 0
# df = data[data.ne(0).all(axis=1)].copy()
print(df)
         animal1  animal2
Cat            4        2
Chicken        3        2
print(animals(df))
Cat
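To see why the filter works, it may help to print the intermediate mask (illustrative only, using the data frame from the question):
print(data.gt(0))
#          animal1  animal2
# Cat         True     True
# Dog        False     True
# Mouse       True    False
# Cow        False     True
# Chicken     True     True
print(data.gt(0).all(axis=1))  # True only for Cat and Chicken, where both values are > 0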

How to iterate over dfs and append data with combined names

I have this problem to solve. It is a continuation of a previous question, How to iterate over pandas df with a def function variable function, and the given answer worked perfectly, but now I have to append all the data into a 2-column dataframe (Adduct_name and mass).
This is from the previous question:
My goal: I have to calculate the "adducts" for a given "Compound"; both represent numbers, but for each "Compound" there are 46 different "Adducts".
Each adduct is calculated as follows:
Adduct 1 = [Exact_mass*M/Charge + Adduct_mass]
where Exact_mass = number, M and Charge = number (1, 2, 3, etc.) according to each type of adduct, and Adduct_mass = number (positive or negative) according to each adduct.
My data: 2 data frames. One with the Adduct names, M, Charge, Adduct_mass. The other one corresponds to the Compound_name and Exact_mass of the compounds I want to iterate over (I just put in a small data set).
Adducts: df_al
import pandas as pd

data = [["M+3H", 3, 1, 1.007276], ["M+3Na", 3, 1, 22.989], ["M+H", 1, 1, 1.007276],
        ["2M+H", 1, 2, 1.007276], ["M-3H", 3, 1, -1.007276]]
df_al = pd.DataFrame(data, columns=["Ion_name", "Charge", "M", "Adduct_mass"])
Compounds: df
import pandas as pd

data1 = [[1, "C3H64O7", 596.465179], [2, "C30H42O7", 514.293038], [4, "C44H56O8", 712.397498],
         [4, "C24H32O6S", 448.191949], [5, "C20H28O3", 316.203834]]
df = pd.DataFrame(data1, columns=["CdId", "Formula", "exact_mass"])
The solution to this problem was:
df_name = df_al["Ion_name"]
df_mass = df_al["Adduct_mass"]
df_div = df_al["Charge"]
df_M = df_al["M"]

# Defining general function
def Adduct(x, i):
    return x * df_M[i] / df_div[i] + df_mass[i]

# Applying general function in a range from 0 to 5.
for i in range(5):
    df[df_name.loc[i]] = df['exact_mass'].map(lambda x: Adduct(x, i))
Output
Name exact_mass M+3H M+3Na M+H 2M+H M-3H
0 a 596.465179 199.829002 221.810726 597.472455 1193.937634 197.814450
1 b 514.293038 172.438289 194.420013 515.300314 1029.593352 170.423737
2 c 712.397498 238.473109 260.454833 713.404774 1425.802272 236.458557
3 d 448.191949 150.404592 172.386316 449.199225 897.391174 148.390040
4 e 316.203834 106.408554 128.390278 317.211110 633.414944 104.39400
Now those are the right calculations, but I now need a file where:
- only 2 columns exist (Name and mass)
- all the different adducts are appended one after another
Desired output:
Name Mass
a_M+3H 199.82902
a_M+3Na 221.810726
a_M+H 597.472455
a_2M+H 1193.937634
a_M-3H 197.814450
b_M+3H 514.293038
.
.
.
c_M+3H
and so on.
Also I need to combine the name of the respective compound with the ion form (M+3H, M+H, etc.).
At this point I have no code for that.
I would appreciate any advice and a better approach from the beginning.
This part is an update of the question above:
Is it possible to obtain an output like this one:
Name Mass RT
a_M+3H 199.82902 1
a_M+3Na 221.810726 1
a_M+H 597.472455 1
a_2M+H 1193.937634 1
a_M-3H 197.814450 1
b_M+3H 514.293038 3
.
.
.
c_M+3H 2
The RT is the same value for all forms of a compound; in this example the RT for a = 1, b = 3, c = 2, etc.
Is it possible to incorporate (keep) this column from the data set df (which I update here below)? As you can see, df has more columns, like "Formula" and "RT", which disappear after the calculations.
import pandas as pd

data1 = [["a", "C3H64O7", 596.465179, 1], ["b", "C30H42O7", 514.293038, 3],
         ["c", "C44H56O8", 712.397498, 2], ["d", "C24H32O6S", 448.191949, 4],
         ["e", "C20H28O3", 316.203834, 1.5]]
df = pd.DataFrame(data1, columns=["Name", "Formula", "exact_mass", "RT"])
Part three! (Sorry, and thank you.)
This is a trial I did on a small data set (df) using the code below, with the same df_al as above.
Code
# Defining variables for calculation
df_name = df_al["Ion_name"]
df_mass = df_al["Adduct_mass"]
df_div = df_al["Charge"]
df_M = df_al["M"]
df_ID = df["Name"]

# Defining the RT dictionary
RT = dict(zip(df["Name"], df["RT"]))

# Removing RT column
df = df.drop(columns=["RT"])

# Defining general function
def Adduct(x, i):
    return x * df_M[i] / df_div[i] + df_mass[i]

# Applying general function in a range from 0 to 46.
for i in range(47):
    df[df_name.loc[i]] = df['exact_mass'].map(lambda x: Adduct(x, i))
df
output (shown as an image in the original post)
# Melting
df = pd.melt(df, id_vars=['Name'], var_name="Adduct", value_name="Exact_mass",
             value_vars=[x for x in df.columns if 'Name' not in x and 'exact' not in x])
df['name'] = df.apply(lambda x: x[0] + "_" + x[1], axis=1)
df['RT'] = df.Name.apply(lambda x: RT[x[0]] if x[0] in RT else np.nan)
del df['Name']
del df['Adduct']
df['RT'] = df.name.apply(lambda x: RT[x[0]] if x[0] in RT else np.nan)
df
output (shown as an image in the original post)
Why NaN?
Here is how I would go about it; pandas.melt comes to the rescue:
import pandas as pd
import numpy as np
from io import StringIO
s = StringIO('''
Name exact_mass M+3H M+3Na M+H 2M+H M-3H
0 a 596.465179 199.829002 221.810726 597.472455 1193.937634 197.814450
1 b 514.293038 172.438289 194.420013 515.300314 1029.593352 170.423737
2 c 712.397498 238.473109 260.454833 713.404774 1425.802272 236.458557
3 d 448.191949 150.404592 172.386316 449.199225 897.391174 148.390040
4 e 316.203834 106.408554 128.390278 317.211110 633.414944 104.39400
''')
df = pd.read_csv(s, sep="\s+")
df = pd.melt(df, id_vars=['Name'], value_vars=[x for x in df.columns if 'Name' not in x and 'exact' not in x])
df['name'] = df.apply(lambda x:x[0] + "_" + x[1], axis=1)
del df['Name']
del df['variable']
RT = {'a':1, 'b':2, 'c':3, 'd':5, 'e':1.5}
df['RT'] = df.name.apply(lambda x: RT[x[0]] if x[0] in RT else np.nan)
df
Here is the output (shown as an image in the original post).

Pandas -- .add() resulting in TypeError: 'int' object is not iterable

I've run into a bit of a problem adding Pandas dataframes using the .add() method. I have a data generator I'm using to generate synthetic data along a normal distribution:
import pandas as pd
import numpy as np

def DataSynthNormal(data, sel, column, fracFull, TotalRows, SelRows, mean, std, abst=False):
    fraction = data.loc[data['A'] == sel, column].sample(frac=fracFull).index
    if abst:
        data1 = pd.DataFrame(np.absolute(np.random.normal(mean, std, round(SelRows * fracFull)).astype('int64')),
                             index=fraction).reindex(range(TotalRows))
    else:
        data1 = pd.DataFrame(np.random.normal(mean, std, round(SelRows * fracFull)).astype('int64'),
                             index=fraction).reindex(range(TotalRows))
    data[column] = data[column].add(data1, fill_value=0)
Using this dataframe as an example:
empty = pd.DataFrame(columns=['A','B'], index=range(0,10))
empty.A[0:4] = "C"; empty.A[4:7] = "D"; empty.A[7:10] = "E"
print(empty)
A B
0 C NaN
1 C NaN
2 C NaN
3 C NaN
4 D NaN
5 D NaN
6 D NaN
7 E NaN
8 E NaN
9 E NaN
And running the data generator:
DataSynthNormal(empty, 'C', 'B', 0.8, 10, 4, 0, 1)
I get the following error:
Traceback (most recent call last):
  File "", line 1, in <module>
    DataSynthNormal2(empty, 'C', 'B', 0.8, 10, 4, 0, 1)
  File "", line 7, in DataSynthNormal2
    data[column] = data[column].add(data1, fill_value=0)
  File "C:\Users\User\Anaconda3\lib\site-packages\pandas\core\ops.py", line 1358, in flex_wrapper
    self.index).__finalize__(self)
  File "C:\Users\User\Anaconda3\lib\site-packages\pandas\core\series.py", line 274, in __init__
    raise_cast_failure=True)
  File "C:\Users\User\Anaconda3\lib\site-packages\pandas\core\series.py", line 4163, in _sanitize_array
    subarr = com._asarray_tuplesafe(data, dtype=dtype)
  File "C:\Users\User\Anaconda3\lib\site-packages\pandas\core\common.py", line 317, in _asarray_tuplesafe
    values = [tuple(x) for x in values]
  File "C:\Users\User\Anaconda3\lib\site-packages\pandas\core\common.py", line 317, in <listcomp>
    values = [tuple(x) for x in values]
TypeError: 'int' object is not iterable
I'm trying to use .add() here because it preserves NaN when two dataframes are added, as opposed to .fillna(0) (which has been outputting n x n matrices, for some reason). I want this because the real data this is trying to emulate has both blanks and 0's throughout. I also can't use data[column] = data1, because I need to use other conditionals (== 'D', == 'E') at different times and with different mean and std.
Does anyone know how to solve this problem?
Came up with a solution, which involved creating a second function:
def DataSynthNormal(data, sel, column, fracFull, TotalRows, selRows, mean, std, abst=False):
    fraction = data.loc[data['A'] == sel, column].sample(frac=fracFull).index
    if abst:
        data1 = pd.DataFrame(np.absolute(np.random.normal(mean, std, round(selRows * fracFull)).astype('int64')),
                             index=fraction).reindex(range(TotalRows))
    else:
        data1 = pd.DataFrame(np.random.normal(mean, std, round(selRows * fracFull)).astype('int64'),
                             index=fraction).reindex(range(TotalRows))
    data[column] = data1
Here's the first one; it works as you'd expect.
def DataSynthNormal2x(data, sel1, sel2, column, fracFull1, fracFull2, TotalRows,
                      selRows1, selRows2, mean1, std1, mean2, std2, abst=False):
    fraction1 = data.loc[data['A'] == sel1, column].sample(frac=fracFull1).index
    fraction2 = data.loc[data['A'] == sel2, column].sample(frac=fracFull2).index
    if abst:
        data1 = pd.DataFrame(np.absolute(np.random.normal(mean1, std1, round(selRows1 * fracFull1)).astype('int64')),
                             index=fraction1).reindex(range(TotalRows))
        data2 = pd.DataFrame(np.absolute(np.random.normal(mean2, std2, round(selRows2 * fracFull2)).astype('int64')),
                             index=fraction2).reindex(range(TotalRows))
    else:
        data1 = pd.DataFrame(np.random.normal(mean1, std1, round(selRows1 * fracFull1)).astype('int64'),
                             index=fraction1).reindex(range(TotalRows))
        data2 = pd.DataFrame(np.random.normal(mean2, std2, round(selRows2 * fracFull2)).astype('int64'),
                             index=fraction2).reindex(range(TotalRows))
    data12 = data1.add(data2, fill_value=0)
    data[column] = data12
And the second, which takes double the inputs and combines it all in-function. These seem to work.
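For what it's worth, one plausible root cause of the original TypeError is that data1 was built as a one-column DataFrame while data[column] is a Series, and Series.add() expects a Series-like other. A minimal sketch of that fix (my guess, not verified against the asker's pandas version; same signature as the original function):
def DataSynthNormal(data, sel, column, fracFull, TotalRows, SelRows, mean, std, abst=False):
    fraction = data.loc[data['A'] == sel, column].sample(frac=fracFull).index
    values = np.random.normal(mean, std, round(SelRows * fracFull)).astype('int64')
    if abst:
        values = np.absolute(values)
    # building a Series instead of a one-column DataFrame keeps Series.add() element-wise
    data1 = pd.Series(values, index=fraction).reindex(range(TotalRows))
    data[column] = data[column].add(data1, fill_value=0)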
