What is the meaning of 'NK' in pandas int64? - python-3.x

I have a column pathsize (int64). However, some of its values are defined as 'NK'. I've tried to convert this value into an integer, but it doesn't seem to have any effect.
NK 687
15 180
12 172
14 166
...
3 123
Name: pathsize, Length: 92, dtype: int64
The script I used to convert NK into 0:
def pathsize(row):
    if row["pathsize"] != 'NK':
        return row["pathsize"]
    return 0

df['pathsize'] = df.apply(pathsize, axis=1)
The script runs without errors, but when I try to process the data (converting it to float), I get the following error:
ValueError: could not convert string to float: ' NK'
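The leading space in ' NK' suggests the column holds padded strings, so the row["pathsize"] != 'NK' comparison never matches. A minimal sketch of one way around that (my suggestion, not from the thread; the fill value 0 follows the question):
# strip whitespace, turn anything non-numeric (e.g. 'NK') into NaN, then fill with 0
df['pathsize'] = pd.to_numeric(df['pathsize'].astype(str).str.strip(),
                               errors='coerce').fillna(0)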

Related

Pandas: MID & FIND Function

I have a column in my dataframe that shows different combinations of the values below. I know that I could use the .str[:3] function and then convert this to a value, but the differing string lengths are throwing me off. How would I do a MID(x,FIND(",",x,1)+1,10)-style function on this column to find the sentiment and subjectivity values?
String samples:
df['Output'] =
Sentiment(polarity=0.0, subjectivity=0.0)
Sentiment(polarity=-0.03958333333333333, subjectivity=0.5020833333333334)
Sentiment(polarity=0.16472802559759075, subjectivity=0.4024750611707134)
Error:
from textblob import TextBlob

def senti(x):
    return TextBlob(x).sentiment

df['Output'] = df['stop'].apply(senti)
df.Output.str.split(',|=', expand=True).iloc[:, [1, 3]]
IndexError: positional indexers are out-of-bounds
Outputs:
0 (0.0, 0.0)
1 (0.0028273809523809493, 0.48586309523809534)
2 (0.153726035868893, 0.5354359925788496)
3 (0.04357142857142857, 0.5319047619047619)
4 (0.07575757575757575, 0.28446969696969693)
...
92 (0.225, 0.39642857142857146)
93 (0.0, 0.0)
94 (0.5428571428571429, 0.6428571428571428)
95 (0.14393939393939395, 0.39999999999999997)
96 (0.35833333333333334, 0.5777777777777778)
Name: Output, Length: 97, dtype: object
df[['polarity', 'subjectivity']] = df.Output.str.split(',|=|\)',expand=True).iloc[:,[1,3]]
Result:
Output polarity subjectivity
0 Sentiment(polarity=0.0, subjectivity=0.0) 0.0 0.0
1 Sentiment(polarity=-0.03958333333333333, subje... -0.03958333333333333 0.5020833333333334
2 Sentiment(polarity=0.16472802559759075, subjec... 0.16472802559759075 0.4024750611707134
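A likely reason for the IndexError (my reading of the outputs above, not stated in the thread): df['Output'] holds Sentiment namedtuples rather than strings, so .str.split returns a single all-NaN column and position 3 does not exist. Casting to str first sidesteps that:
df[['polarity', 'subjectivity']] = (df['Output'].astype(str)
                                    .str.split(r',|=|\)', expand=True)
                                    .iloc[:, [1, 3]])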
Try:
df['polarity']=df['Output'].str.extract(r"polarity=([-\.\d]+)")
df['subjectivity']=df['Output'].str.extract(r"subjectivity=([-\.\d]+)")
Outputs:
>>> df.iloc[:, -2:]
polarity subjectivity
0 0.0 0.0
1 -0.03958333333333333 0.5020833333333334
2 0.16472802559759075 0.4024750611707134
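Alternatively (my addition; it relies on Sentiment being a plain namedtuple of two floats), you can skip string parsing entirely and keep the values numeric:
df[['polarity', 'subjectivity']] = pd.DataFrame(df['Output'].tolist(), index=df.index)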

Drop similar text rows of one column in Python

import pandas as pd
from difflib import SequenceMatcher

df = pd.DataFrame({"id": [9, 12, 13, 14],
                   "text": ["Error number 609 at line 10", "Error number 609 at line 22",
                            "Error string 'foo' at line 11", "Error string 'bar' at line 14"]})
Output:
id text
0 9 Error number 609 at line 10
1 12 Error number 609 at line 22
2 13 Error string 'foo' at line 11
3 14 Error string 'bar' at line 14
I want to use difflib.SequenceMatcher to remove rows whose similarity score with another row is above 80, keeping only one of each similar group.
a = "Error number 609 at line 10"
b = "Error number 609 at line 22"
c = "Error string 'foo' at line 11"
d = "Error string 'bar' at line 14"
print(SequenceMatcher(None, a, b).ratio()*100) #92.5925925925926
print(SequenceMatcher(None, b, c).ratio()*100) #60.71428571428571
print(SequenceMatcher(None, c, d).ratio()*100) #86.20689655172413
print(SequenceMatcher(None, a, c).ratio()*100) #64.28571428571429
How can I get the expected result below in Python? You can use difflib or other Python packages. Thank you.
id text
0 9 Error number 609 at line 10
2 13 Error string 'foo' at line 11
You can use:
import numpy as np

# cross join with only the text column
df = df.assign(a=1).merge(df[['text']].assign(a=1), on='a')
# filter out rows paired with themselves
df = df[df['text_x'] != df['text_y']]
# sort the two text values within each row
df[['text_x','text_y']] = pd.DataFrame(np.sort(df[['text_x','text_y']], axis=1), index=df.index)
# remove duplicate pairs
df = df.drop_duplicates(subset=['text_x','text_y'])
# get similarity
df['r'] = df.apply(lambda x: SequenceMatcher(None, x.text_x, x.text_y).ratio(), axis=1)
# filtering
df = df[df['r'] > 0.8].drop(['a','r'], axis=1)
print(df)
id text_x text_y
1 9 Error number 609 at line 10 Error number 609 at line 22
11 13 Error string 'bar' at line 14 Error string 'foo' at line 11
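The cross join above lists the similar pairs. To reduce the frame to the expected two rows, a minimal greedy sketch (my addition; df_orig is a hypothetical copy of the original four-row frame, e.g. df_orig = df.copy() taken before the merge above reassigns df):
kept = []
for t in df_orig['text']:
    # keep a row only if it is less than 80% similar to everything kept so far
    if all(SequenceMatcher(None, t, k).ratio() < 0.8 for k in kept):
        kept.append(t)

print(df_orig[df_orig['text'].isin(kept)])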

I want to get/print df by range instead of head or tail

I can't find or understand how to get the data I want by range.
I want to know how to get df['Close'] from x to y and then call .mean() on it.
I have tried costomclose = df['Close'],range(dagartot,val)
but it gives me something else, like the head and tail of df.
if len(df) >= 34:
    dagartot = len(df)
    valdagar = 5
    val = dagartot - valdagar
    costomclose = df['Close'], range(dagartot, val)
    print(costomclose)
edit:
<bound method NDFrame.tail of High Low ... Volume Adj Close
Date ...
2005-09-29 24.083300 23.583300 ... 74400.0 4.038682
2005-09-30 23.833300 23.500000 ... 148200.0 4.081495
2005-10-03 24.000000 23.333300 ... 27600.0 3.995869
2005-10-04 23.500000 23.416700 ... 132000.0 4.024417
2005-10-05 23.750000 23.500000 ... 15600.0 4.067230
... ... ... ... ... ...
2019-07-25 196.000000 193.050003 ... 355952.0 194.000000
2019-07-26 196.350006 194.000000 ... 320752.0 195.199997
2019-07-29 196.350006 193.550003 ... 301389.0 195.250000
2019-07-30 197.949997 194.850006 ... 233989.0 197.100006
2019-07-31 198.550003 195.600006 ... 323473.0 197.899994
[3479 rows x 6 columns]>
Here is an example of slicing out the middle of something based on the encounter index:
>>> s = pd.Series(list('abcdefghijklmnop'))
>>> s
Out[135]:
0 a
1 b
...
12 m
13 n
14 o
15 p
dtype: object
>>> s.iloc[6:9]
Out[136]:
6 g
7 h
8 i
dtype: object
This also works for DataFrames, e.g. df.iloc[0] returns the first row and df.iloc[5:8] returns rows 5 through 7, end not included.
You can also slice by the actual index of the DataFrame, which is not necessarily a serially-counting sequence of integers, by substituting loc for iloc.
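For example, with the date-indexed frame from the question (a minimal sketch; loc slices by label and includes both endpoints):
df.loc['2019-07-25':'2019-07-31']   # all rows between these two dates, inclusive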
Here is an example of slicing out the middle of a dataframe that stores the alphabet:
>>> df = pd.DataFrame([dict(num=i + 65, char=chr(i + 65)) for i in range(26)])
>>> df[(76 <= df.num) & (df.num < 81)]
num char
11 76 L
12 77 M
13 78 N
14 79 O
15 80 P
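Applied to the question itself, a minimal sketch (note the comma in costomclose = df['Close'],range(dagartot,val) builds a (Series, range) tuple instead of slicing):
valdagar = 5
costomclose = df['Close'].iloc[-valdagar:]   # the last 5 closes
print(costomclose)
print(costomclose.mean())                    # their average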

How to impute values in a column and overwrite existing values

I'm trying to learn machine learning and I need to fill in the missing values for the cleaning stage of the workflow. I have 13 columns and need to impute the values for 8 of them. One column is called Dependents and I want to fill in the blanks with the word missing and change the cells that do contain data as follows: 1 to one, 2 to two, 3 to three and 3+ to threePlus.
I'm running the program in Anaconda and the name of the dataframe is train.
train.columns
this gives me
Index(['Loan_ID', 'Gender', 'Married', 'Dependents', 'Education',
'Self_Employed', 'ApplicantIncome', 'CoapplicantIncome', 'LoanAmount',
'Loan_Amount_Term', 'Credit_History', 'Property_Area', 'Loan_Status'],
dtype='object')
next
print("Dependents")
print(train['Dependents'].unique())
this gives me
Dependents
['0' '1' '2' '3+' nan]
Now I try imputing values as stated:
def impute_dependent():
    my_dict = {'1': 'one', '2': 'two', '3': 'three', '3+': 'threePlus'}
    return train.Dependents.map(my_dict).fillna('missing')

def convert_data(dataset):
    temp_data = dataset.copy()
    temp_data['Dependents'] = temp_data[['Dependents']].apply(impute_dependent, axis=1)
    return temp_data
this gives the error
TypeError Traceback (most recent call last)
<ipython-input-46-ccb1a5ea7edd> in <module>()
4 return temp_data
5
----> 6 train_dataset = convert_data(train)
7 #test_dataset = convert_data(test)
<ipython-input-46-ccb1a5ea7edd> in convert_data(dataset)
1 def convert_data(dataset):
2 temp_data = dataset.copy()
----> 3 temp_data['Dependents'] = temp_data[['Dependents']].apply(impute_dependent,axis=1)
4 return temp_data
5
D:\Anaconda2\lib\site-packages\pandas\core\frame.py in apply(self, func, axis, broadcast, raw, reduce, result_type, args, **kwds)
6002 args=args,
6003 kwds=kwds)
-> 6004 return op.get_result()
6005
6006 def applymap(self, func):
D:\Anaconda2\lib\site-packages\pandas\core\apply.py in get_result(self)
140 return self.apply_raw()
141
--> 142 return self.apply_standard()
143
144 def apply_empty_result(self):
D:\Anaconda2\lib\site-packages\pandas\core\apply.py in apply_standard(self)
246
247 # compute the result using the series generator
--> 248 self.apply_series_generator()
249
250 # wrap results
D:\Anaconda2\lib\site-packages\pandas\core\apply.py in apply_series_generator(self)
275 try:
276 for i, v in enumerate(series_gen):
--> 277 results[i] = self.f(v)
278 keys.append(v.name)
279 except Exception as e:
TypeError: ('impute_dependent() takes 0 positional arguments but 1 was given', 'occurred at index 0')
I expected one, two, three and threePlus to replace the existing values, and missing to fill in the blanks.
Would this do?
import numpy as np

my_dict = {'1': 'one', '2': 'two', '3': 'three', '3+': 'threePlus', np.nan: 'missing'}

def convert_data(dataset):
    temp_data = dataset.copy()
    temp_data.Dependents = temp_data.Dependents.map(my_dict)
    return temp_data
As a side note, part of your problem might be the use of apply: essentially apply passes data through a function and puts in what comes out. I might be wrong, but I think your function needs to take the input given by apply, e.g.:
def impute_dependent(dep):
    my_dict = {'1': 'one', '2': 'two', '3': 'three', '3+': 'threePlus', np.nan: 'missing'}
    return my_dict[dep]

df.dependents = df.dependents.apply(impute_dependent)
This way, for every value in df.dependents, apply will take that value and give it to impute_dependent as an argument, then take the returned value as output. As is, when I run your code I get an error because impute_dependent takes no arguments.
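One more caveat (my observation from the unique values printed earlier, not raised in the thread): '0' appears in the data but not in my_dict, so .map() would turn it into NaN as well. A minimal sketch that covers it, where 'zero' is my placeholder since the question does not say what '0' should become:
my_dict = {'0': 'zero', '1': 'one', '2': 'two', '3': 'three', '3+': 'threePlus'}

# map the known values, then label anything unmapped (including the blanks) as 'missing'
train['Dependents'] = train['Dependents'].map(my_dict).fillna('missing')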

ValueError: could not convert string to float: left_column_pixel

I can't read pixel values from pandas into img[] with OpenCV. Here are my code and the reported error.
import cv2
import numpy as np
import csv
import os
import pandas as pd

path_csv = '/home/'
npa = pd.read_csv(path_csv + "char.csv", usecols=[2, 3, 4, 5], header=None)
nb_charac = npa.shape[0] - 1

# stock the actual letters of your csv in an array
characs = []
cpt = 0
# take characters
f = open(path_csv + "char.csv", 'rt')
reader = csv.reader(f)
for row in reader:
    if cpt >= 1:  # skip header
        characs.append(str(row[1]))
    cpt += 1

# open your image
path_image = '/home/'
img = cv2.imread(os.path.join(path_image, 'image1.png'))
path_save = '/home/2/'
i = 0

# for every line of your csv
for i in range(nb_charac):
    # get coordinates
    #coords=npa[i,:]
    coords = npa.iloc[[i]]
    charac = characs[i]
    # actual cropping of the image (easy with numpy)
    img_charac = img[int(coords[2]):int(coords[4]), int(coords[3]):int(coords[5])]
    img_charac = cv2.resize(img_charac, (32, 32), interpolation=cv2.INTER_NEAREST)
    i += 1
    #charac=charac.strip('"\'')
    #x=switch(charac)
    # saving the image
    cv2.imwrite(path_save + str(charac) + "_" + str(i) + "_" + str(img_charac.shape) + ".png", img_charac)
    img_charac2 = 255 - img_charac
    cv2.imwrite(path_save + str(charac) + "_switched" + str(i) + "_" + str(img_charac2.shape) + ".png", img_charac2)
    print(i)
I got the following error:
img_charac=img[int(coords[2]):int(coords[3]),int(coords[0]):int(coords[1])]
File "/usr/lib/python2.7/dist-packages/pandas/core/series.py", line 79, in wrapper
return converter(self.iloc[0])
ValueError: invalid literal for int() with base 10: 'left_column_pixel'
The error is related to this line of code:
img_charac=img[int(coords[2]):int(coords[4]),int(coords[3]):int(coords[5])]
My variable coords is as follows:
>>> coords=npa.iloc[[1]]
>>> coords
2 3 4 5
1 38 104 2456 2492
and the different values of the column 2,3,4,5 needed in image_char are :
>>> coords[2]
1 38
Name: 2, dtype: object
>>> coords[3]
1 104
Name: 3, dtype: object
>>> coords[4]
1 2456
Name: 4, dtype: object
>>> coords[5]
1 2492
Name: 5, dtype: object
I updated the img_charac line as follows:
img_charac = img[int(float(coords[2].values[0])):int(float(coords[4].values[0])), int(float(coords[3].values[0])):int(float(coords[5].values[0]))]
I no longer get
ValueError: invalid literal for int() with base 10: 'left_column_pixel'
but I get the following error:
ValueError: could not convert string to float: left_column_pixel
I noticed that img_charac works outside the loop.
I think the ValueError occurs because you are reading the header row of your csv file within the first iteration of your for-loop. The header contains string labels which can't be converted to integers:
for i in range(nb_charac) will start with i having 0 as the first value.
Then, coords=npa.iloc[[i]] will return the first row (0th row) of your csv-file.
Since you've set header=None in npa=pd.read_csv(path_csv+"char.csv", usecols=[2,3,4,5], header=None), you iterate over strings within your header row.
So either set header=0, or start the loop at the second row with for i in range(1, nb_charac).
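A minimal sketch of the header=0 variant (my assumption: the first CSV row holds labels such as left_column_pixel and the remaining rows are plain integers):
npa = pd.read_csv(path_csv + "char.csv", usecols=[2, 3, 4, 5], header=0)
nb_charac = npa.shape[0]  # every remaining row is data now

for i in range(nb_charac):
    coords = npa.iloc[i]  # a Series of four numbers for this row
    img_charac = img[int(coords.iloc[0]):int(coords.iloc[2]),
                     int(coords.iloc[1]):int(coords.iloc[3])]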
