How convert "object" to numeric values including Nan - python-3.x

I have this data for many years, and need to plot error graph for different years, 1993 was selected with
fm93 = fmm[(fmm.Year == 1993)]
then the fm93 data frame is
Year moB m1 std1 co1 min1 max1 m2S std2S co2S min2S max2S
1993 1 296.42 18.91 31 262.4 336 -- -- -- -- --
1993 2 280.76 24.59 28 239.4 329.3 -- -- -- -- --
1993 3 271.41 19.16 31 236.4 304.8 285.80 20.09 20 251.6 319.7
1993 4 287.98 22.52 30 245.9 341 296.75 21.77 27 261.1 345.7
1993 5 287.05 30.79 30 229.2 335.7 300.06 27.64 24 249.5 351.8
1993 6 288.65 11.29 4 275.9 301.9 263.70 73.40 7 156.5 361
1993 7 280.11 36.01 12 237 363 302.67 26.39 22 262.9 377.1
1993 8 296.51 34.55 31 234.8 372.9 305.85 39.95 28 234.1 417.9
1993 9 321.31 34.54 25 263.8 396 309.01 42.52 29 205.9 403.2
1993 10 315.80 8.63 2 309.7 321.9 288.65 35.86 31 230.9 345.4
1993 11 288.26 24.07 30 231.4 322.8 297.99 23.81 28 238 336.5
1993 12 296.87 18.31 31 257.6 331.5 303.02 20.02 29 265.7 340.7
When I try to plot moB,m1 with err std1 appear the error
ValueError: err must be [ scalar | N, Nx1 or 2xN array-like ]
That is because the values are "object"..
array([[1993, 1, '296.42', '18.91', '31', '262.4', '336', '--', '--', '--',
'--', '--'],
[1993, 2, '280.76', '24.59', '28', '239.4', '329.3', '--', '--',
'--', '--', '--'],
[1993, 3, '271.41', '19.16', '31', '236.4', '304.8', '285.80',
'20.09', '20', '251.6', '319.7'],
[1993, 4, '287.98', '22.52', '30', '245.9', '341', '296.75',
'21.77', '27', '261.1', '345.7'],
[1993, 5, '287.05', '30.79', '30', '229.2', '335.7', '300.06',
'27.64', '24', '249.5', '351.8'],
[1993, 6, '288.65', '11.29', '4', '275.9', '301.9', '263.70',
'73.40', '7', '156.5', '361'],
[1993, 7, '280.11', '36.01', '12', '237', '363', '302.67', '26.39',
'22', '262.9', '377.1'],
[1993, 8, '296.51', '34.55', '31', '234.8', '372.9', '305.85',
'39.95', '28', '234.1', '417.9'],
[1993, 9, '321.31', '34.54', '25', '263.8', '396', '309.01',
'42.52', '29', '205.9', '403.2'],
[1993, 10, '315.80', '8.63', '2', '309.7', '321.9', '288.65',
'35.86', '31', '230.9', '345.4'],
[1993, 11, '288.26', '24.07', '30', '231.4', '322.8', '297.99',
'23.81', '28', '238', '336.5'],
[1993, 12, '296.87', '18.31', '31', '257.6', '331.5', '303.02',
'20.02', '29', '265.7', '340.7']], dtype=object)
I try convert this data with
fm93_1 = fm93.astype('float64', raise_on_error = False)
But the problem remain.... How can convert Nan values ('--') or ignore to plot my results?
thanks in advance

First you should try to plot a sample of data after the '--' to see if a plot can be generated. You can try:
# This should plot from row 3 onwards, omitting row 1 and 2
df.iloc[3:].plot()
Assuming '--' is the problem, you can replace with np.NaN. NaN values are not plotted.
df.replace('--', np.NaN, inplace=True)
df.plot()
Another way is to select df that does not contain '--':
mask = df[df == '--'].any(axis=1) # Check if row contains '--'
valid_indexes = df[mask == False].index # Return index for rows w/o '--'
df.iloc[valid_indexes].plot()

Related

Spread dataframe issue in Python pandas?

I am trying to reformat / spread my dataframe from key, value columns to wide format:
test = pd.DataFrame({
'id': [0,1,2,3,4,5,6,7],
'a':['aa','bb', 'a', 'k', 'aa','bb', 'a', 'k'],
'value': ['zzuz', 44, 'DE', 55, 'zdfdz', 454, 'SE', 155]})
test.pivot(columns='a', values='value',index='id')
I want the outcome to be:
aa bb a k
zzuz 44 DE 55
zdfdz 454 SE 155
I am trying to do this with .pivot without luck, please guide me what I am missing here?
test.pivot(columns='a', values='value')
Try via groupby()+cumcount() to tracking position and that will act as index of pivot() and rename_axis() to renaming axis(a bit of cleanup/if needed):
test['key']=test.groupby('a').cumcount()
out=test.pivot(columns='a', values='value',index='key')
out=out.rename_axis(columns=None,index=None)
OR(in 1 step)
out=(test.assign(key=test.groupby('a').cumcount())
.pivot(columns='a', values='value',index='key')
.rename_axis(columns=None,index=None))
output of out:
a aa bb k
0 DE zzuz 44 55
1 SE zdfdz 454 155

drop duplicated and concat pandas

I have a dataframe that looks like this:
'id': ["1", "2", "1", "3", "3", "4"],
'date': ["2017", "2011", "2019", "2013", "2017", "2018"],
'code': ["CB25", "CD15", "CZ10", None, None, "AZ51"],
'col_example': ["22", None, "22", "55", "55", "121"],
'comments': ["bonjour", "bonjour", "bonjour", "hola", "Hello", None]
Result:
id date code col_example .... comments
0 1 2019 CB25/CZ10 22 .... bonjour (and not bonjour // bonjour)
1 2 2011 CD15 None .... bonjour
2 3 2017 None 55 .... hola // Hello
3 4 2018 AZ51 121 .... None
I want to keep a single id
If two ids are the same, I would like:
If comments = None and = str: Keep only the comments which are not None (example: id = 1, keep the comments "hello")
If two comments = str: Concaten the two comments with a "//" (example id = 3, comments = "hola // hello")
For the moment I tried with sort_value, and drop_duplicate without success
thank you
I believe you need DataFrame.dropna by column comments and then GroupBy.agg with join and GroupBy.last, last add DataFrame.mask for replace empty strings to None rows:
df1 = (df.groupby('id')
.agg({'date': 'last',
'comments': lambda x: ' // '.join(x.dropna())})
.replace({'comments': {'': None}})
.reset_index())
print (df1)
id date comments
0 1 2019 bonjour
1 2 2011 bonjour
2 3 2017 hola // Hello
3 4 2018 None
EDIT: For avoid removed all columns is necessary aggregate all of them, you can create dictionary for aggregation dynamic like:
df = pd.DataFrame({'id': ["1", "2", "1", "3", "3", "4"],
'date': ["2017", "2011", "2019", "2013", "2017", "2018"],
'code': ["CB25", "CD15", "CB25", None, None, "AZ51"],
'col_example': ["22", None, "22", "55", "55", "121"],
'comments': [None, "bonjour", "bonjour", "hola", "Hello", None]})
print (df)
id date code col_example comments
0 1 2017 CB25 22 None
1 2 2011 CD15 None bonjour
2 1 2019 CB25 22 bonjour
3 3 2013 None 55 hola
4 3 2017 None 55 Hello
5 4 2018 AZ51 121 None
d = dict.fromkeys(df.columns.difference(['id','comments']), 'last')
d['comments'] = lambda x: ' // '.join(x.dropna())
print (d)
{'code': 'last', 'col_example': 'last', 'date': 'last',
'comments': <function <lambda> at 0x000000000ECA99D8>}
df1 = (df.groupby('id')
.agg(d)
.replace({'comments': {'': None}})
.reset_index())
print (df1)
id code col_example date comments
0 1 CB25 22 2019 bonjour
1 2 CD15 None 2011 bonjour
2 3 None 55 2017 hola // Hello
3 4 AZ51 121 2018 None

Handle the string

I have a problem with Python str which I've tried multiple variations, but none of them seem to work.
Here is my problem:
string = '18.0 8 307.0 130.0 3504. 12.0 70 1\t"chevrolet chevelle malibu"'
I want to handle this string, and get the return like this:
['18.0','8','307.0','130.0','3504.','12.0','70','1','"chevrolet chevelle malibu"']
or like this:
['18.0','8','307.0','130.0','3504.','12.0','70','1','chevrolet chevelle malibu']
I have tried to use re.complie(), but I failed to build a rule.
Please help!
If the last value is always delimited by '\t' you can use this:
s = '18.0 8 307.0 130.0 3504. 12.0 70 1\t"chevrolet chevelle malibu"'
lst = [*s.split('\t')[0].split(), s.split('\t')[-1]]
print(lst)
Prints:
['18.0', '8', '307.0', '130.0', '3504.', '12.0', '70', '1', '"chevrolet chevelle malibu"']
It can be achieved with the following piece of code:
>>> string = '18.0 8 307.0 130.0 3504. 12.0 70 1\t"chevrolet chevelle malibu"'
>>> [y for (i, x) in enumerate(string.split('"')) for y in ([y.strip() for y in x.split()] if i % 2 == 0 else [x])]
['18.0', '8', '307.0', '130.0', '3504.', '12.0', '70', '1', 'chevrolet chevelle malibu']

Writing dict of dicts of dicts into Excel in certain format using Python

I have got some data which I have read into dictionary of dictionaries
EDIT : Posting original data format
Original Data is one excel files for each user every month
Alpha - Jan 2018 .. following format
Score
English 70
Social Science 80
Maths 90
History 45
Science 50
I read all these excels into python and get them into dictionaries as mentioned below. Some students may miss some exams hence for those months their data will be missing. Hence variation will be complete month data missing for few students.
{alpha: {u'Jan-2018': {'Eng': '70', 'math': '90', 'sci': '50'}, u'feb-2018': {'Eng': '75', 'math': '85', 'sci': '60'}, u'mar-2018': {'Eng': '60', 'math': '92', 'sci': '40'}}
{beta : {u'Jan-2018': {'Eng': '30', 'math': '50', 'sci': '40'}, u'feb-2018': {'Eng': '55', 'math': '45', 'sci': '70'}, u'may-2018': {'Eng': '50', 'math': '52', 'sci': '45'}}
{gamma : {u'Jan-2018': {'Eng': '50', 'math': '50', 'sci': '40'}, u'feb-2018': {'Eng': '55', 'math': '75', 'sci': '40'}, u'may-2018': {'Eng': '56', 'math': '59', 'sci': '35'}}
I want to get these on Excel in following format. On sheet 1 it should publish only Eng data for different months and on second sheet math data and third sci data. For whichever month for someone data is missing, that should be left blank or maybe 0
Sheet1(Eng):
Jan-2018 Feb-2018 Mar-2018 May-2018
alpha 70 75 60 0
beta 30 55 0 50
gamma 50 55 0 56
similarly for other two sheets.
I have tried following code, however there are two issues with it:
It doesn't consider the missing months, and prints sequentially
It doesn't print the month name on top of the every column
List1 contains dict of dicts mentioned above
alleng = {}
allmath = {}
allsci = {}
for i in list1:
englist = []
mathlist = []
scilist = []
for m in list1[i]:
for h in list1[i][m]:
value1 = list1[i][m][h]
if h == 'Eng':
englist.append(value1)
if h == 'Math':
mathlist.append(value1)
if h == 'Sci':
scilist.append(value1)
alleng[i] = englist
allmath[i] = mathlist
allsci[i] = scilist
writer = ExcelWriter('final-sheet.xlsx')
frame = pd.DataFrame.from_dict(allsci, orient='index')
frame = frame.transpose()
frame = frame.transpose()
frame.to_excel(writer , sheet_name = 'Sci')
frame1 = pd.DataFrame.from_dict(alleng, orient='index')
frame1 = frame1.transpose()
frame1 = frame1.transpose()
frame1.to_excel(writer , sheet_name = 'Eng')
frame2 = pd.DataFrame.from_dict(allmath, orient='index')
frame2 = frame2.transpose()
frame2 = frame2.transpose()
frame2.to_excel(writer , sheet_name = 'Math')
I also tried using following solution, however it didn't help:
Dict of dicts of dicts to DataFrame
I tried following code to convert dicts to dataframe and it helps
df1=pd.DataFrame(list1).stack().apply(pd.Series).unstack()
It will give the data on a single sheet itself in following format:
Jan-2018 feb-2018 mar-2018 april-2018
Eng Alpha 70 75 60 0
Beta 30 55 0 50
Gamma 50 55 0 56
Math Alpha 90 85 92 0
Beta 50 45 0 52
Gamma 50 75 0 59

Pandas - Fastest way indexing with 2 dataframes

I am developing a software in Python 3 with Pandas library.
Time is very important but memory not so much.
For better visualization I am using the names a and b with few values, although there are many more:
a -> 50000 rows
b -> 5000 rows
I need to select from dataframe a and b (using multiples conditions)
a = pd.DataFrame({
'a1': ['x', 'y', 'z'] ,
'a2': [1, 2, 3],
'a3': [3.14, 2.73, -23.00],
'a4': [pd.np.nan, pd.np.nan, pd.np.nan]
})
a
a1 a2 a3 a4
0 x 1 3.14 NaN
1 y 2 2.73 NaN
2 z 3 -23.00 NaN
b = pd.DataFrame({
'b1': ['x', 'y', 'z', 'k', 'l'],
'b2': [2018, 2019, 2020, 2015, 2012]
})
b
b1 b2
0 x 2018
1 y 2019
2 z 2020
3 k 2015
4 l 2012
So far my code is like this:
for index, row in a.iterrows():
try:
# create a key
a1 = row["a1"]
mask = b.loc[(b['b1'] == a1) & (b['b2'] != 2019)]
# check if exists
if (len(mask.index) != 0): #not empty
a.loc[[index], ['a4']] = mask.iloc[0]['b2']
except KeyError: #not found
pass
But as you can see, I'm using for iterrows that is very slow compared to other methods and I'm changing the value of the DataFrame I'm iterating, that is not recommended.
Could you help me find a better way? The results should be like this:
a
a1 a2 a3 a4
0 x 1 3.14 2018
1 y 2 2.73 NaN
2 z 3 -23.00 2020
I tried things like this below, but I didnt made it work.
a.loc[ (a['a1'] == b['b1']) , 'a4'] = b.loc[b['b2'] != 2019]
*the real code has more conditions
Thanks!
EDIT
I benchmark using: iterrows, merge, set_index/loc. Here is the code:
import timeit
import pandas as pd
def f_iterrows():
for index, row in a.iterrows():
try:
# create a key
a1 = row["a1"]
a3 = row["a3"]
mask = b.loc[(b['b1'] == a1) & (b['b2'] != 2019)]
# check if exists
if len(mask.index) != 0: # not empty
a.loc[[index], ['a4']] = mask.iloc[0]['b2']
except: # not found
pass
def f_merge():
a.merge(b[b.b2 != 2019], left_on='a1', right_on='b1', how='left').drop(['a4', 'b1'], 1).rename(columns={'b2': 'a4'})
def f_lock():
df1 = a.set_index('a1')
df2 = b.set_index('b1')
df1.loc[:, 'a4'] = df2.b2[df2.b2 != 2019]
#variables for testing
number_rows = 100
number_iter = 100
a = pd.DataFrame({
'a1': ['x', 'y', 'z'] * number_rows,
'a2': [1, 2, 3] * number_rows,
'a3': [3.14, 2.73, -23.00] * number_rows,
'a4': [pd.np.nan, pd.np.nan, pd.np.nan] * number_rows
})
b = pd.DataFrame({
'b1': ['x', 'y', 'z', 'k', 'l'] * number_rows,
'b2': [2018, 2019, 2020, 2015, 2012] * number_rows
})
print('For: %s s' % str(timeit.timeit(f_iterrows, number=number_iter)))
print('Merge: %s s' % str(timeit.timeit(f_merge, number=number_iter)))
print('Loc: %s s' % str(timeit.timeit(f_iterrows, number=number_iter)))
They all worked :) and the time to run is:
For: 277.9994369489998 s
Loc: 274.04929955067564 s
Merge: 2.195712725706926 s
So far Merge is the fastest.
If another option appears I will update here, thanks again.
IIUC
a.merge(b[b.b2!=2019],left_on='a1',right_on='b1',how='left').drop(['a4','b1'],1).rename(columns={'b2':'a4'})
Out[263]:
a1 a2 a3 a4
0 x 1 3.14 2018.0
1 y 2 2.73 NaN
2 z 3 -23.00 2020.0

Resources