Replace apply logic with something else - python-3.x
I have a small df (173, 21).
I wrote a function that works, however I am using apply() and I would like to, if possible,
do it another way only because of apply()'s reputation for being slow.
On this particular data set it doesn't matter at all as it is so small, but I am trying
to avoid apply() if possible.
The function takes in a row, checks each of five columns (see code below), and, if the value
in a given cell is 'YES', increments a counter. Possible cell values are 'YES', 'NO' or NaN.
Here is the working code:
def clean_deaths(row):
    num_deaths = 0
    columns = ['Death1', 'Death2', 'Death3', 'Death4', 'Death5']
    for c in columns:
        death = row[c]
        if death == 'YES':
            num_deaths += 1
    return num_deaths
true_avengers['Deaths'] = true_avengers.apply(clean_deaths, axis=1)
total = true_avengers['Deaths'].sum()
print(total, '\n') # 88
You are right: you should avoid apply(..., axis=1).
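Try this, which compares the five columns to 'YES' in one step and counts the matches per row (NaN never equals 'YES', so missing cells count as 0):
true_avengers['Deaths'] = (true_avengers[['Death1', 'Death2', 'Death3', 'Death4', 'Death5']] == 'YES').sum(axis=1)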
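To check that the vectorized version gives the same counts as the row-wise function, here is a small sketch. The `sample` frame is made up for illustration (the real `true_avengers` data is not shown in the question), and `clean_deaths` is condensed from the question's version.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for true_avengers: five columns holding 'YES', 'NO' or NaN.
cols = ['Death1', 'Death2', 'Death3', 'Death4', 'Death5']
sample = pd.DataFrame(
    [
        ['YES', 'NO', np.nan, np.nan, np.nan],
        ['NO', np.nan, np.nan, np.nan, np.nan],
        ['YES', 'YES', 'YES', 'NO', np.nan],
    ],
    columns=cols,
)

def clean_deaths(row):
    # Row-wise count, condensed from the question's function.
    return sum(row[c] == 'YES' for c in cols)

# Row-by-row with apply (the approach to avoid on large frames).
by_apply = sample.apply(clean_deaths, axis=1)

# Vectorized: boolean frame of matches, then count True values per row.
vectorized = (sample[cols] == 'YES').sum(axis=1)
```

Both results hold one count per row, so either can be assigned to a new 'Deaths' column.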
Related
replace row values in a single column pandas
I keep getting warning "A value is trying to be set on a copy of a slice from a DataFrame". How could I fix it? Any alternative?

# check for NAN
# capitalise first letter
# assign 'Male' for 'm',
# assign 'Female' for 'f'
myDataFrame.to_csv('new_H.csv')
genderList = myDataFrame.loc[:,"Gender"]  # extract Gender column
for i in range(0, len(genderList)):
    if type(genderList[i]) == float:  # check for empty spaces
        genderList[i] = 'NAN'
    elif genderList[i].startswith('f'):
        genderList[i] = 'Female'
    elif genderList[i].startswith('m'):
        genderList[i] = 'Male'
for row in myDataFrame.itertuples():
    if type(row.Gender) == float:
        myDataFrame.at[row.Index, "Gender"] = 'NAN'
    elif row.Gender.startswith('f'):
        myDataFrame.at[row.Index, "Gender"] = 'Female'
    elif row.Gender.startswith('m'):
        myDataFrame.at[row.Index, "Gender"] = 'Male'

The line genderList = myDataFrame.loc[:,"Gender"] causes the warning because you are assigning a piece of your data frame to a new variable. That piece may be a copy, so updates may not be applied to the original dataframe. In the code above I used the itertuples method, which is a more "correct" way to iterate through rows in pandas. The tuples it yields are read-only and are accessed by attribute (row.Gender), so the new value is written back to the dataframe with .at and the row's index. If you want to perform an action on a specific row, you do not need to create a slice of it first; you just update the value of this column in every row.

From what I see, your goal is to replace values in Gender based on the previous values. In that case I recommend checking pandas's replace method, which is made for exactly that, together with a filter. But since your filter is quite simple, you can do the following:

myDataFrame.loc[myDataFrame["Gender"].str.contains('^f', na=False), "Gender"] = "Female"

to update all the female entries. I used boolean indexing with .loc; the condition is myDataFrame["Gender"].str.contains('^f', na=False), where na=False treats missing values as non-matches, and naming the "Gender" column keeps the assignment from overwriting the other columns of the matching rows.
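For completeness, the whole clean-up can probably be done without any loop. The following is only a sketch on made-up data (the real `myDataFrame` is not shown in the question); it assumes the column holds strings starting with 'f' or 'm', or NaN, and it reuses the question's 'NAN' marker string.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; the real myDataFrame is not shown in the question.
myDataFrame = pd.DataFrame({"Gender": ["f", "male", np.nan, "female", "m"]})

gender = myDataFrame["Gender"]
# na=False makes missing values count as "no match" instead of propagating NaN.
is_female = gender.str.startswith("f", na=False)
is_male = gender.str.startswith("m", na=False)

# Assign through .loc on the original frame, so no intermediate slice is modified.
myDataFrame.loc[is_female, "Gender"] = "Female"
myDataFrame.loc[is_male, "Gender"] = "Male"
# Same 'NAN' marker string the question uses for missing entries.
myDataFrame["Gender"] = myDataFrame["Gender"].fillna("NAN")
```

Because every assignment goes through `myDataFrame.loc[...]` or a whole-column assignment, nothing is written to a temporary slice, which is the situation the warning is about.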
How do you replace characters in a column of a dataframe using pandas?
From a dataframe, one column has int64 values and also some '?' where the data is not present. The task is to replace the '?' with the mean of the integers in the column. The column looks something like this:

30.82
26.67
17.56
?
34.99
?
...

Until now I have tried using a for loop to calculate the mean while skipping the indexes where s[i] == '?'. But once I try to replace the characters with the mean value, it gives me an error.

def fillreal(column):
    s = pd.Series(column)
    count = 0
    summ = 0
    for i in range(s.size):
        if s[i] == '?':
            continue
        else:
            summ += pd.to_numeric(s[i])
            count = count+1
    av = round(summ/count,2)
    column.replace('?', str(av))
    return column

The function call is:

dataR = fillreal(df['col2'])

How should I correct the code so that it works, and which functions can be used to optimise it? TIA
df.replace('?', np.mean(pd.to_numeric(df['30.82'], errors='coerce')))

30.82 here is the name of the column. errors='coerce' turns each '?' into NaN, and the mean of the Series ignores NaN. You can assign the statement above to a new variable (e.g. new_df) and you will get a new dataframe with the '?' replaced, while the original remains as it is. If you want the dataframe itself modified, make sure you pass inplace=True, as shown below:

df.replace('?', np.mean(pd.to_numeric(df['30.82'], errors='coerce')), inplace=True)
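If the column should also end up numeric rather than a mix of strings and a number, one possible variant is to convert first and fill afterwards. This is a sketch on made-up data; the column name `col2` comes from the question's function call, and the values are the ones shown in the question.

```python
import pandas as pd

# Hypothetical column mirroring the question: numbers stored as text plus '?'.
df = pd.DataFrame({"col2": ["30.82", "26.67", "17.56", "?", "34.99", "?"]})

# errors='coerce' converts anything non-numeric (here '?') to NaN.
numeric = pd.to_numeric(df["col2"], errors="coerce")

# Series.mean() skips NaN by default, so only the real numbers are averaged.
av = round(numeric.mean(), 2)

# Replace the NaN placeholders with the mean; the column is now float.
df["col2"] = numeric.fillna(av)
```

Compared with `df.replace('?', ...)`, this only touches one column and leaves it with a float dtype, which is usually what later calculations need.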
Using while and if function together with a condition change
I am trying to use python to conduct a calculation which will sum the values in a column only for the time period that a certain condition is met. However, the summation should begin when the conditions are met (runstat == 0 and oil >1). The summation should then stop at the point when oil == 0. I am new to python so I am not sure how to do this. I connected the code to a spreadsheet for testing purposes but the intent is to connect to live data. I figured a while loop in combination with an if function might work but I am not winning.

Basically I want to have the code start when runstat is zero and oil is higher than 0. It should stop summing the values of oil when the oil row becomes zero and then it should write the data to a SQL database (this I will figure out later - for now I just want to see if it can work). This is what code I have tried so far.

import numpy as np
import pandas as pd

data = pd.read_excel('TagValues.xlsx')
df = pd.DataFrame(data)
df['oiltag'] = df['oiltag'].astype(float)
df['runstattag'] = df['runstattag'].astype(float)
oil = df['oiltag']
runstat = df['runstattag']

def startup(oil,runstat):
    while oil.all() > 0:
        if oil > 0 and runstat == 0:
            totaloil = sum(oil.all())
            print(totaloil)
        else:
            return None
    return

print(startup(oil.all(), runstat.all()))

It should sum the values in the column but it is returning: None
OK, so I think that what you want to do is get the subset of rows between the two conditions, then get a sum of those.

Method: slice the dataframe to get the relevant rows and then sum.

import numpy as np
import pandas as pd

data = pd.read_excel('TagValues.xlsx')
df = pd.DataFrame(data)
df['oiltag'] = df['oiltag'].astype(float)
df['runstattag'] = df['runstattag'].astype(float)

def startup(dframe):
    start_row = dframe[(dframe.oiltag > 0) & (dframe.runstattag == 0)].index[0]
    end_row = dframe[(dframe.oiltag == 0) & (dframe.index > start_row)].index[0]
    subset = dframe[start_row:end_row+1]  # +1 because the end slice is non-inclusive
    totaloil = subset.oiltag.sum()
    return totaloil

print(startup(df))

This code will raise an error if it can't find a subset of rows which match your criteria. If you need to handle that case, then we could add some exception handling.

EDIT: Please note this assumes that your criteria are expected to be met only once per Excel file. If you have multiple “chunks” that you will want to sum, then this will need tweaking.
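As a rough illustration of the slicing idea, and of one way the "no matching rows" case could be handled without exceptions, here is a sketch on made-up readings. The data, the choice to return None when the start condition never occurs, and the choice to sum through the last row when oil never returns to zero are all assumptions, not part of the original answer.

```python
import pandas as pd

# Hypothetical readings standing in for TagValues.xlsx (default 0..n-1 index).
df = pd.DataFrame({
    "oiltag":     [0.0, 0.0, 2.0, 3.0, 4.0, 0.0, 5.0],
    "runstattag": [1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0],
})

def startup(dframe):
    # Index labels of all rows where the start condition holds.
    starts = dframe[(dframe.oiltag > 0) & (dframe.runstattag == 0)].index
    if len(starts) == 0:
        return None  # assumption: no start condition means nothing to sum
    start_row = starts[0]
    # First row after the start where oil is back at zero.
    ends = dframe[(dframe.oiltag == 0) & (dframe.index > start_row)].index
    # Assumption: if oil never returns to zero, sum through the last row.
    end_row = ends[0] if len(ends) else dframe.index[-1]
    # .loc slices by label and includes both endpoints, so no +1 is needed.
    return dframe.loc[start_row:end_row, "oiltag"].sum()

total = startup(df)
empty = startup(pd.DataFrame({"oiltag": [0.0], "runstattag": [1.0]}))
```

Here the run starts at row 2 and stops at row 5, so only the readings 2, 3, 4 and 0 are summed; the later reading of 5 belongs to a second "chunk" and is ignored, as the EDIT note above describes.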
multidimensional array (nested loop) does not function properly and returns duplicate incorrect results
I have a two-dimensional array (nested loop) that compares two lists, correct format and wrong format, using the Hamming distance: if the difference between a string in the correct format and a string in the wrong format is 1, it adds the correct-format string to a third list, "corrected", and removes the wrongly formatted string from the wrong-format list. The data inside the lists looks like this:

['BWI0520BG6,ATT7791R,AMS,DEN,1420564394,1001\n', 'BWI0520BG6,BER7172M,KUL,LAS,1420565167,1848\n',]

so I have to slice each string to get each part to compare between the correct and wrong lists. The code used to compare them:

corrected = []
correct_format = [
    'EZC9678QI6,VYW5940P,LAS,SIN,1420565203,1843\n',
    'EZC9678QI6,RUM0422W,MUC,MAD,1420563539,194\n',
    'CKZ3132BR4,XXQ4064B,JFK,FRA,1420563917,802\n',
    'HCA3158QA6,GMO5938W,LHR,PEK,1420564317,1057\n',
    'JBE2302VO4,VDC9164W,FCO,LAS,1420564698,1276\n',
    'XFG5747ZT9,PME8178S,DEN,PEK,1420564409,1322\n',
    'CDC0302NN5,QHU1140O,CDG,LAS,1420564498,1133\n',
    'CYJ0225CH1,YZO4444S,BKK,MIA,1420565330,2027\n',
    'PIT2755XC1,VYW5940P,LAS,SIN,1420565203,1843\n',
    'IEG9308EA5,SQU6245R,DEN,FRA,1420564460,1049\n',
    'LLZ3798PE3,ULZ8130D,CAN,DFW,1420564983,1683\n',
    'LLZ3798PE3,MBA8071P,KUL,PEK,1420563856,572\n',
    'PIT2755XC1,SOH3431A,ORD,MIA,1420563649,250\n',
    'XFG5747ZT9,XXQ4064B,JFK,FRA,1420563917,802\n',
    'HCA3158QA6,SQU6245R,DEN,FRA,1420564460,1049\n',
    'JBE2302VO4,HZT2506M,IAH,AMS,1420564324,1044\n',
    'VZY2993ME1,WSK1289Z,CLT,DEN,1420563542,278\n',
    'SJD8775RZ4,TMV7633W,UGK,DXB,1420563958,849\n',
    'EDV2089LK5,ATT7791R,AMS,DEN,1420564394,1001\n',
    'SPR4484HA6,VDC9164W,FCO,LAS,1420564698,1276\n',
    'UES9151GS5,DAU2617A,CGK,SFO,1420564986,1811\n',
    'WBE6935NU3,KJR6646J,IAH,BKK,1420565203,1928\n',
    'CDC0302NN5,XIL3623J,PEK,LAX,1420564414,1302\n',
    'WYU2010YH8,JVY9791G,PVG,FCO,1420564561,1189\n',
]
wrong_format = [
    'BWI0520BG6,VYW5940P,LAS,SI|,1420565203,1843\n',
    'CKZ3132BR4,RUM0422W,MUC,;AD,1420563539,194\n',
    'CKZ313\\BR4,QHU1140O,CDG,LAS,1420564498,1133\n',
    'CXN7304ER2,GMO593[W,LHR,PEK,1420564317,1057\n',
    'CXN7304ER2,VDCP164W,FCO,LAS,1420564698,1276\n',
    'DAZ3029XA0,WPW9201U,DFW,yEK,1420564869,1452\n',
    'HGO4350KK1,QHU1140O,CDG,vAS,1420564498,1133\n',
    'JJM4724RF7,YZO4444S,BKK,MI^,1420565330,2027\n',
    'KKP5277HZ7,VYW5940P,LAS,:IN,1420565203,1843\n',
    'MXU9187YC7,MOO1786A,MAD,]RA,1420563408,184\n',
    'ONL0812DH1,BER7172M,KUL,[AS,1420565167,1848\n',
    'PAJ3974RK1,EWH6301Y,~AN,DFW,1420564967,1683\n',
    'POP2875LH3,MBw8071P,KUL,PEK,1420563856,572\n',
    'PUD8209OG3,SOH3431A,OR8,MIA,1420563649,250\n',
    'PUD8209OG3,XXQ4064%,JFK,FRA,1420563917,802\n',
    'SJD8775RZ4,4QU6245R,DEN,FRA,1420564460,1049\n',
    'SJD8775RZ4,HZT2506M,IAH,#MS,1420564324,1044\n',
    'SJD8775RZ4,WSK1289Z,CLT,vEN,1420563542,278\n',
    'SJD8|75RZ4,ULZ8130D,CAN,DFW,1420564983,1683\n',
    'SPR4484HA6,ATT7791R,AM%,DEN,1420564394,1001\n',
    'SPRb484HA6,VYW5940P,LAS,SIN,1420565203,1843\n',
    'UES915*GS5,SQU6245R,DEN,FRA,1420564460,1049\n',
    'WBE6935NU$,XOY7948U,ATL,LHR,1420564038,877\n',
    'WTC9125IE5,XIL3623J,PEK,L}X,1420564414,1302\n',
    'WYU2010YH8,XIL3623J,PEe,LAX,1420564414,1302\n',
    'WYu2010YH8,FYL5866L,ATL,HKG,1420565140,1751\n',
    'YMH6360YP0,ATT7791R,A;S,DEN,1420564394,1001\n',
]

def hamming_distance(s1, s2):
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

for r in correct_format:
    for i in wrong_format:
        if hamming_distance(r[0:10], i[0:10]) == 1:
            corrected.append(r)
            wrong_format.remove(i)
        elif hamming_distance(r[11:19], i[11:19]) == 1:
            corrected.append(r)
            wrong_format.remove(i)
        elif hamming_distance(r[20:23], i[20:23]) == 1:
            corrected.append(r)
            wrong_format.remove(i)
        elif hamming_distance(r[24:27], i[24:27]) == 1:
            corrected.append(r)
            wrong_format.remove(i)

So the corrected list is expected to be filled with the correct format, and the wrong-format list should be empty:

corrected = [
    'POP2875LH3,MBW8071P,KUL,PEK,1420563856,572\n',
    'UES9157GS5,SQU6245R,DEN,FRA,1420564460,1049\n',
    'CXN7304ER2,GMO5937W,LHR,PEK,1420564317,1057\n',
    'SJD8775RZ4,BQU6245R,DEN,FRA,1420564460,1049\n',
    'PUD8209OG3,XXQ4064X,JFK,FRA,1420563917,802\n',
    'SJD8775RZ4,ULZ8130D,CAN,DFW,1420564983,1683\n',
    'ONL0812DH1,BER7172M,KUL,LAS,1420565167,1848\n',
    'YMH6360YP0,ATT7791R,AMS,DEN,1420564394,1001\n',
]

However, this is what I get:

corrected = [
    'POP2875LH3,MBW8071P,KUL,PEK,1420563856,572\n',
    'UES9157GS5,SQU6245R,DEN,FRA,1420564460,1049\n',
    'UES9157GS5,SQU6245R,DEN,FRA,1420564460,1049\n',
    'CXN7304ER2,GMO5937W,LHR,PEK,1420564317,1057\n',
    'SJD8775RZ4,4QU6245R,DEN,FRA,1420564460,1049\n',
    'PUD8209OG3,XXQ4064X,JFK,FRA,1420563917,802\n',
    'ONL0812DH1,BER7172M,KUL,LAS,1420565167,1848\n',
    'YMH6360YP0,ATT7791R,AMS,DEN,1420564394,1001\n',
]

With this small sample data I get duplicate data, but with a large one I get duplicate and incorrect results too. Any thoughts on what is causing this issue? Thank you very much for your help.
Some changes and suggestions to the code:

You would want to compare the items at the same indexes in both lists, but right now you have two for loops which compare each element of one list with every element of the other.
You can avoid removing data from wrong_format.
You can split both strings on ',' instead of using explicit indexes, then use a generator over the resulting fields to compare both items and append accordingly.

So the updated code will look like this:

corrected = []
# correct_format and wrong_format are the same lists as in the question

def hamming_distance(s1, s2):
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

# Iterate over both lists together
for r, i in zip(correct_format, wrong_format):
    # Split string into fields by comma
    li_r = r.split(',')
    li_i = i.split(',')
    # Append if any pair of fields differs by exactly one character
    if any(hamming_distance(item_r, item_i) == 1 for item_r, item_i in zip(li_r, li_i)):
        corrected.append(r)

print(corrected)
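One thing worth knowing about the original loop: calling wrong_format.remove(i) while a for loop is iterating over wrong_format makes the loop skip the element that follows each removed one, which can contribute to odd results. If the two lists are not aligned index by index, a possible alternative to zip is to search correct_format for each wrong record and collect results in new lists instead of removing anything. The sketch below uses short made-up records, the helper names `fields_off_by_one` and `match_records` are hypothetical, and it keeps the question's rule that a match means some field differs by exactly one character.

```python
def hamming_distance(s1, s2):
    # Number of positions at which the two strings differ.
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

def fields_off_by_one(good, bad):
    # Hypothetical helper: True if any comma-separated field differs by exactly one character.
    good_fields = good.strip().split(',')
    bad_fields = bad.strip().split(',')
    return any(hamming_distance(g, b) == 1 for g, b in zip(good_fields, bad_fields))

def match_records(correct, wrong):
    # Hypothetical helper: builds new lists and never mutates its inputs.
    corrected, unmatched = [], []
    for bad in wrong:
        # First correct record that satisfies the rule, or None if there is none.
        match = next((good for good in correct if fields_off_by_one(good, bad)), None)
        if match is None:
            unmatched.append(bad)
        else:
            corrected.append(match)
    return corrected, unmatched

# Made-up records in the same comma-separated shape as the question's data.
correct = ['AAA1,LAS,SIN\n', 'BBB2,MUC,MAD\n']
wrong = ['AAA1,LAS,SI|\n', 'ZZZ9,XXX,YYY\n']
corrected, unmatched = match_records(correct, wrong)
```

Each wrong record yields at most one entry in `corrected`, so the duplicates caused by matching the same record several times do not occur. Note that the "any field off by one" rule is quite loose: two unrelated records can match if, say, their airport codes differ by a single letter, so a stricter rule may be needed on larger data.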
How to properly set an increment in a while loop?
I have a dataframe (DF). I need to loop over each row and check whether some conditions are met in that row; if they are, I flag that row (say I add another column labeled "flag" and set it to 1). In the same loop I check whether there are other rows that have similar conditions, and if there are, I flag them as well. On the next pass I look at the same DF but exclude the flagged rows. The size of the DF will go from NxM to (N-n) x M, where n is the number of rows flagged. The loop will go on until len(DF) is <= 1 (meaning until all rows are flagged as 1).

A for loop does not work because the size of DF shrinks as the loop goes on, so I can only use a while loop with an increment. However, how can I set the increment (it should be dynamic)? I am really not sure how to tackle this problem. Here is a failed attempt.

a=len(DF.loc[DF['flag'] != 1]) #should be (NxM) initially
i = 0
# at every loop we redefine size of DF in variable a
while a >= 1:
    print(i)
    # select first row
    row = DF.loc[DF['flag'] != 1].iloc[[i]]
    # flag row if conditions are met
    DF['flag'].values[i] = np.where(if conditions met, 1, '')
    #there is another piece of code that looks for rows with similar
    #conditions but won't add it here
    # the following variable a redefines length of DF
    a=len(allHoldingsLookUp.loc[allHoldingsLookUp['flag'] != 1])
    i+=1

I have a problem here: the increment I use does not work. Say "i" reaches 100 and the length of DF shrinks to 70; then the code fails. The increment needs to be set differently, but I am not sure how. Any comments or suggestions are more than welcome.
Please check whether this change works. There is no counter any more: each pass takes the first row that is still unflagged and writes the flag back through that row's index label.

a=len(DF.loc[DF['flag'] != 1]) #should be (NxM) initially
# at every loop we redefine size of DF in variable a
while a >= 1:
    # select first unflagged row and keep its index label
    row = DF.loc[DF['flag'] != 1].iloc[[0]]
    label = row.index[0]
    # flag row if conditions are met
    DF.loc[label, 'flag'] = np.where(if conditions met, 1, '')
    #there is another piece of code that looks for rows with similar
    #conditions but won't add it here
    # the following variable a redefines length of DF
    try:
        a=len(DF.loc[DF['flag'] != 1])
    except:
        break
Can you please try this recursive version? I hope it works.

def recur(DF):
    # select the first unflagged row
    row = DF.loc[DF['flag'] != 1].iloc[[0]]
    DF['flag'].values[0] = np.where(if conditions met, 1, '')
    #there is another piece of code that looks for rows with similar
    #conditions but won't add it here
    # the following variable a redefines length of DF
    a=len(DF.loc[DF['flag'] != 1])
    if a >= 1:
        recur(DF.loc[DF['flag'] != 1])
    return None
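Another way to look at the increment problem is that no counter is needed at all: if the set of unflagged rows is recomputed on every pass and the first of them is taken, the index can never run past the end. Below is a minimal sketch under some loud assumptions: the `group` column and the "same group" similarity rule are hypothetical stand-ins for the asker's undisclosed conditions, and every row that is visited gets flagged, which is what guarantees the loop terminates.

```python
import pandas as pd

# Hypothetical data: 'group' stands in for whatever makes two rows "similar".
DF = pd.DataFrame({"group": ["a", "b", "a", "c", "b"], "flag": 0})

passes = 0
while (DF["flag"] != 1).any():
    # Recompute the unflagged rows on every pass; no counter to keep in sync.
    remaining = DF.loc[DF["flag"] != 1].index
    first = remaining[0]
    # Hypothetical similarity rule: same 'group' value as the first unflagged row.
    similar = DF["group"] == DF.loc[first, "group"]
    # Write flags through .loc on DF itself, using labels rather than positions.
    DF.loc[similar, "flag"] = 1
    passes += 1
```

With this sample the loop runs three times, once per distinct group. If the real conditions can leave the first unflagged row unflagged, the loop needs a separate way to skip such rows (for example a second marker value), otherwise it would keep selecting the same row forever.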