replace row values in a single column pandas - python-3.x

I keep getting the warning "A value is trying to be set on a copy of a slice from a DataFrame".
How can I fix it? Is there an alternative?
# check for NAN
# capitalise first letter
# assign 'Male' for 'm',
# assign 'Female' for 'f'
myDataFrame.to_csv('new_H.csv')
genderList = myDataFrame.loc[:,"Gender"]  # extract Gender column
for i in range(0, len(genderList)):
    if type(genderList[i]) == float:  # check for empty spaces
        genderList[i] = 'NAN'
    elif genderList[i].startswith('f'):
        genderList[i] = 'Female'
    elif genderList[i].startswith('m'):
        genderList[i] = 'Male'

for row in myDataFrame.itertuples():
    if type(row.Gender) == float:  # NaN is stored as a float
        myDataFrame.at[row.Index, "Gender"] = 'NAN'
    elif row.Gender.startswith('f'):
        myDataFrame.at[row.Index, "Gender"] = 'Female'
    elif row.Gender.startswith('m'):
        myDataFrame.at[row.Index, "Gender"] = 'Male'
The line genderList = myDataFrame.loc[:,"Gender"] causes the warning because you are assigning to a piece of your data frame, which may be a copy, so the update may not be applied to the original dataframe. In the code above, I used the itertuples method, which is a more "correct" way to iterate through rows in pandas. The tuples it yields are read-only, so each new value is written back to the data frame with .at. If you want to perform an action on a specific row, you do not need to create a slice of it first; you just update the value of this column in that row.
From what I see, your goal is to replace values in Gender based on their previous values. In that case I recommend checking pandas's replace method, which is made for exactly that purpose, together with a filter. But since your filter is quite simple, you can do the following:
myDataFrame.loc[myDataFrame["Gender"].str.contains('^f', na=False), "Gender"] = "Female"
This updates all female rows. The condition myDataFrame["Gender"].str.contains('^f', na=False) selects the rows, and passing "Gender" to .loc restricts the assignment to that column; without it, every column of the matching rows would be overwritten.
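The whole clean-up can also be done without a Python loop. The following is a minimal sketch, assuming a small made-up data frame in place of myDataFrame; the sample values and the mask variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data standing in for myDataFrame.
df = pd.DataFrame({"Gender": ["f", "m", np.nan, "female", "male"]})

# Build all boolean masks first, then assign through .loc on the
# original frame so that no intermediate copy is modified.
is_missing = df["Gender"].isna()
is_female = df["Gender"].str.startswith("f", na=False)
is_male = df["Gender"].str.startswith("m", na=False)

df.loc[is_missing, "Gender"] = "NAN"
df.loc[is_female, "Gender"] = "Female"
df.loc[is_male, "Gender"] = "Male"

print(df)
```

Because every assignment goes through df.loc[mask, "Gender"], pandas writes to the original frame and the copy warning should not appear.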

Related

I wrote code using pandas with python. I want to convert the code into a new dataframe with the output separated into two columns

I went from one data frame to another and performed calculations on the column next to the name for each unique person. Now I have an output of the names with the calculated values next to them, and I want to split it into two columns, put it in a data frame, and print it. I'm thinking I should collect the results of the for loop in a dictionary and then build a data frame from it, but I'm not sure how to do that. I am a beginner at this and would really appreciate people's help. See the for loop below:
names = df['Participant Name, Number'].unique()
for name in names:
    unique_name_df = df[df['Participant Name, Number'] == name]
    badge_types = unique_name_df['Dosimeter Location'].unique()
    if 'Collar' in badge_types:
        collar = unique_name_df[unique_name_df['Dosimeter Location'] == 'Collar']['Total DDE'].astype(float).sum()
    if 'Chest' in badge_types:
        chest = unique_name_df[unique_name_df['Dosimeter Location'] == 'Chest']['Total DDE'].astype(float).sum()
    if len(badge_types) == 1:
        if 'Collar' in badge_types:
            value = collar
        elif 'Chest' in badge_types:
            value = chest
    print(name, value)
If you expect len(badge_types) == 1 in all cases, try:
pd.DataFrame( df.groupby(['Participant Name, Number'])['Total DDE'].sum() )
Otherwise, to get the sum per Dosimeter Location, add it to the groupby:
pd.DataFrame( df.groupby(['Participant Name, Number', 'Dosimeter Location'])['Total DDE'].sum() )
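To end up with exactly two columns (name and total), the group key can be kept as a regular column. This is a small sketch with invented participants and doses; it assumes, as in the question, that 'Total DDE' may be stored as strings and therefore casts it to float first.

```python
import pandas as pd

# Hypothetical sample data; names and values are made up for illustration.
df = pd.DataFrame({
    "Participant Name, Number": ["Ann, 1", "Ann, 1", "Bob, 2", "Bob, 2"],
    "Dosimeter Location": ["Collar", "Collar", "Chest", "Chest"],
    "Total DDE": ["1.5", "2.0", "0.5", "1.0"],
})

# Cast to float so that sum() adds numbers instead of concatenating strings.
df["Total DDE"] = df["Total DDE"].astype(float)

# as_index=False keeps the participant name as a column, giving a
# two-column data frame: name and summed dose.
totals = df.groupby("Participant Name, Number", as_index=False)["Total DDE"].sum()
print(totals)
```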

Replace apply logic with something else

I have a small df (173, 21).
I wrote a function that works, however I am using apply() and I would like to, if possible,
do it another way only because of apply()'s reputation for being slow.
On this particular data set it doesn't matter at all as it is so small, but I am trying
to avoid apply() if possible.
The function takes in a row, checks each of five columns (see code below), and increments a counter
whenever the value in a cell is 'YES'. Possible cell values are 'YES', 'NO' or 'NaN'.
Here is the working code:
def clean_deaths(row):
    num_deaths = 0
    columns = ['Death1', 'Death2', 'Death3', 'Death4', 'Death5']
    for c in columns:
        death = row[c]
        if death == 'YES':
            num_deaths += 1
    return num_deaths
true_avengers['Deaths'] = true_avengers.apply(clean_deaths, axis=1)
total = true_avengers['Deaths'].sum()
print(total, '\n') # 88
You are right: you should avoid apply(..., axis=1).
Try this:
true_avengers['Deaths'] = (true_avengers[['Death1', 'Death2', 'Death3', 'Death4', 'Death5']] =='YES').sum(axis=1)
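As a quick check that the vectorized expression counts the same thing as the row-wise function, here is a minimal sketch on a made-up three-row frame; the sample values are assumptions, only the column names come from the question.

```python
import numpy as np
import pandas as pd

death_cols = ["Death1", "Death2", "Death3", "Death4", "Death5"]

# Hypothetical sample rows using the three possible cell values.
true_avengers = pd.DataFrame(
    [
        ["YES", "NO", np.nan, np.nan, np.nan],
        ["YES", "YES", "NO", np.nan, np.nan],
        ["NO", np.nan, np.nan, np.nan, np.nan],
    ],
    columns=death_cols,
)

# Compare every cell to 'YES' at once, then count True values per row.
vectorized = (true_avengers[death_cols] == "YES").sum(axis=1)

# Row-wise version, equivalent to clean_deaths in the question.
row_wise = true_avengers.apply(
    lambda row: sum(row[c] == "YES" for c in death_cols), axis=1
)

print(vectorized.tolist(), row_wise.tolist())
```

NaN cells compare as not equal to 'YES', so they contribute nothing to the count in either version.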

How do you replace characters in a column of a dataframe using pandas?

From a dataframe, one column has int64 values and also some '?' where the data is not present.
The task is to replace the '?' with the mean of the integers in the column.
The column looks something like this:
30.82
26.67
17.56
?
34.99
?
.
.
.
So far I have tried using a for loop to calculate the mean while skipping the indexes where s[i] == '?'.
But once I try to replace the characters with the mean value, it gives me an error.
def fillreal(column):
    s = pd.Series(column)
    count = 0
    summ = 0
    for i in range(s.size):
        if s[i] == '?':
            continue
        else:
            summ += pd.to_numeric(s[i])
            count = count+1
    av = round(summ/count,2)
    column.replace('?', str(av))
    return column
The function call is:
dataR = fillreal(df['col2'])
How should I correct the code so that it works, and which functions can be used to optimise it?
TIA
df.replace('?', np.mean(pd.to_numeric(df['30.82'], errors='coerce')))
Here 30.82 is the name of the column.
Pass inplace=True, as shown below, if you want the dataframe itself modified. Otherwise, assign the result of the statement above to a new variable (e.g. new_df) and you will get a new dataframe with the ? values replaced, while the original remains as it is.
df.replace('?', np.mean(pd.to_numeric(df['30.82'], errors='coerce')),inplace=True)
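Note that after replace the column may still hold strings. An alternative sketch that also leaves the column numeric is shown below; it assumes the column is named col2 (as in the question's function call) and uses the sample values from the question.

```python
import pandas as pd

# Sample values from the question; 'col2' is the assumed column name.
df = pd.DataFrame({"col2": ["30.82", "26.67", "17.56", "?", "34.99", "?"]})

# errors='coerce' turns anything non-numeric (here '?') into NaN.
numeric = pd.to_numeric(df["col2"], errors="coerce")

# mean() skips NaN by default, so only the real numbers are averaged.
mean_value = round(numeric.mean(), 2)

# Fill the gaps and store the result back as a float column.
df["col2"] = numeric.fillna(mean_value)
print(df)
```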

multidimensional array (nested loop) does not function properly and returns duplicate incorrect results

I have a nested loop that compares two lists, correct format and wrong format, using the Hamming distance. If the distance between a string in the correct format and a string in the wrong format is 1, it adds the correctly formatted string to a third list, "corrected", and removes the wrongly formatted string from the wrong format list.
The data inside the lists looks like this:
['BWI0520BG6,ATT7791R,AMS,DEN,1420564394,1001\n',
'BWI0520BG6,BER7172M,KUL,LAS,1420565167,1848\n',]
So I have to slice each string to get the parts to compare between the correct and wrong lists.
This is the code used to compare them:
corrected = []
correct_format = ['EZC9678QI6,VYW5940P,LAS,SIN,1420565203,1843\n',
'EZC9678QI6,RUM0422W,MUC,MAD,1420563539,194\n',
'CKZ3132BR4,XXQ4064B,JFK,FRA,1420563917,802\n',
'HCA3158QA6,GMO5938W,LHR,PEK,1420564317,1057\n',
'JBE2302VO4,VDC9164W,FCO,LAS,1420564698,1276\n',
'XFG5747ZT9,PME8178S,DEN,PEK,1420564409,1322\n',
'CDC0302NN5,QHU1140O,CDG,LAS,1420564498,1133\n',
'CYJ0225CH1,YZO4444S,BKK,MIA,1420565330,2027\n',
'PIT2755XC1,VYW5940P,LAS,SIN,1420565203,1843\n',
'IEG9308EA5,SQU6245R,DEN,FRA,1420564460,1049\n',
'LLZ3798PE3,ULZ8130D,CAN,DFW,1420564983,1683\n',
'LLZ3798PE3,MBA8071P,KUL,PEK,1420563856,572\n',
'PIT2755XC1,SOH3431A,ORD,MIA,1420563649,250\n',
'XFG5747ZT9,XXQ4064B,JFK,FRA,1420563917,802\n',
'HCA3158QA6,SQU6245R,DEN,FRA,1420564460,1049\n',
'JBE2302VO4,HZT2506M,IAH,AMS,1420564324,1044\n',
'VZY2993ME1,WSK1289Z,CLT,DEN,1420563542,278\n',
'SJD8775RZ4,TMV7633W,UGK,DXB,1420563958,849\n',
'EDV2089LK5,ATT7791R,AMS,DEN,1420564394,1001\n',
'SPR4484HA6,VDC9164W,FCO,LAS,1420564698,1276\n',
'UES9151GS5,DAU2617A,CGK,SFO,1420564986,1811\n',
'WBE6935NU3,KJR6646J,IAH,BKK,1420565203,1928\n',
'CDC0302NN5,XIL3623J,PEK,LAX,1420564414,1302\n',
'WYU2010YH8,JVY9791G,PVG,FCO,1420564561,1189\n']
wrong_format = ['BWI0520BG6,VYW5940P,LAS,SI|,1420565203,1843\n',
'CKZ3132BR4,RUM0422W,MUC,;AD,1420563539,194\n',
'CKZ313\\BR4,QHU1140O,CDG,LAS,1420564498,1133\n',
'CXN7304ER2,GMO593[W,LHR,PEK,1420564317,1057\n',
'CXN7304ER2,VDCP164W,FCO,LAS,1420564698,1276\n',
'DAZ3029XA0,WPW9201U,DFW,yEK,1420564869,1452\n',
'HGO4350KK1,QHU1140O,CDG,vAS,1420564498,1133\n',
'JJM4724RF7,YZO4444S,BKK,MI^,1420565330,2027\n',
'KKP5277HZ7,VYW5940P,LAS,:IN,1420565203,1843\n',
'MXU9187YC7,MOO1786A,MAD,]RA,1420563408,184\n',
'ONL0812DH1,BER7172M,KUL,[AS,1420565167,1848\n',
'PAJ3974RK1,EWH6301Y,~AN,DFW,1420564967,1683\n',
'POP2875LH3,MBw8071P,KUL,PEK,1420563856,572\n',
'PUD8209OG3,SOH3431A,OR8,MIA,1420563649,250\n',
'PUD8209OG3,XXQ4064%,JFK,FRA,1420563917,802\n',
'SJD8775RZ4,4QU6245R,DEN,FRA,1420564460,1049\n',
'SJD8775RZ4,HZT2506M,IAH,#MS,1420564324,1044\n',
'SJD8775RZ4,WSK1289Z,CLT,vEN,1420563542,278\n',
'SJD8|75RZ4,ULZ8130D,CAN,DFW,1420564983,1683\n',
'SPR4484HA6,ATT7791R,AM%,DEN,1420564394,1001\n',
'SPRb484HA6,VYW5940P,LAS,SIN,1420565203,1843\n',
'UES915*GS5,SQU6245R,DEN,FRA,1420564460,1049\n',
'WBE6935NU$,XOY7948U,ATL,LHR,1420564038,877\n',
'WTC9125IE5,XIL3623J,PEK,L}X,1420564414,1302\n',
'WYU2010YH8,XIL3623J,PEe,LAX,1420564414,1302\n',
'WYu2010YH8,FYL5866L,ATL,HKG,1420565140,1751\n',
'YMH6360YP0,ATT7791R,A;S,DEN,1420564394,1001\n']
def hamming_distance(s1, s2):
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))
for r in correct_format:
    for i in wrong_format:
        if hamming_distance(r[0:10], i[0:10]) == 1:
            corrected.append(r)
            wrong_format.remove(i)
        elif hamming_distance(r[11:19], i[11:19]) == 1:
            corrected.append(r)
            wrong_format.remove(i)
        elif hamming_distance(r[20:23], i[20:23]) == 1:
            corrected.append(r)
            wrong_format.remove(i)
        elif hamming_distance(r[24:27], i[24:27]) == 1:
            corrected.append(r)
            wrong_format.remove(i)
So the corrected list is expected to be filled with the correctly formatted strings, and the wrong format list should end up empty:
corrected = ['POP2875LH3,MBW8071P,KUL,PEK,1420563856,572\n',
'UES9157GS5,SQU6245R,DEN,FRA,1420564460,1049\n',
'CXN7304ER2,GMO5937W,LHR,PEK,1420564317,1057\n',
'SJD8775RZ4,BQU6245R,DEN,FRA,1420564460,1049\n',
'PUD8209OG3,XXQ4064X,JFK,FRA,1420563917,802\n',
'SJD8775RZ4,ULZ8130D,CAN,DFW,1420564983,1683\n',
'ONL0812DH1,BER7172M,KUL,LAS,1420565167,1848\n',
'YMH6360YP0,ATT7791R,AMS,DEN,1420564394,1001\n',]
However, this is what I get
corrected = ['POP2875LH3,MBW8071P,KUL,PEK,1420563856,572\n',
'UES9157GS5,SQU6245R,DEN,FRA,1420564460,1049\n',
'UES9157GS5,SQU6245R,DEN,FRA,1420564460,1049\n',
'CXN7304ER2,GMO5937W,LHR,PEK,1420564317,1057\n',
'SJD8775RZ4,4QU6245R,DEN,FRA,1420564460,1049\n',
'PUD8209OG3,XXQ4064X,JFK,FRA,1420563917,802\n',
'ONL0812DH1,BER7172M,KUL,LAS,1420565167,1848\n',
'YMH6360YP0,ATT7791R,AMS,DEN,1420564394,1001\n']
With this small sample I get duplicate data, but with a large one I get both duplicate and incorrect results. Any thoughts on what is causing this issue?
Thank you very much for your help.
Some changes and suggestions for the code:
You want to compare the items at the same index in both lists, but right now you have two nested for loops, which compare every element of one list with every element of the other.
You can avoid removing data from wrong_format.
You can split both strings on , instead of using explicit indexes, and use a generator over the resulting fields to compare both items and append accordingly.
So the updated code will look like this:
corrected = []
correct_format = ['EZC9678QI6,VYW5940P,LAS,SIN,1420565203,1843\n',
'EZC9678QI6,RUM0422W,MUC,MAD,1420563539,194\n',
'CKZ3132BR4,XXQ4064B,JFK,FRA,1420563917,802\n',
'HCA3158QA6,GMO5938W,LHR,PEK,1420564317,1057\n',
'JBE2302VO4,VDC9164W,FCO,LAS,1420564698,1276\n',
'XFG5747ZT9,PME8178S,DEN,PEK,1420564409,1322\n',
'CDC0302NN5,QHU1140O,CDG,LAS,1420564498,1133\n',
'CYJ0225CH1,YZO4444S,BKK,MIA,1420565330,2027\n',
'PIT2755XC1,VYW5940P,LAS,SIN,1420565203,1843\n',
'IEG9308EA5,SQU6245R,DEN,FRA,1420564460,1049\n',
'LLZ3798PE3,ULZ8130D,CAN,DFW,1420564983,1683\n',
'LLZ3798PE3,MBA8071P,KUL,PEK,1420563856,572\n',
'PIT2755XC1,SOH3431A,ORD,MIA,1420563649,250\n',
'XFG5747ZT9,XXQ4064B,JFK,FRA,1420563917,802\n',
'HCA3158QA6,SQU6245R,DEN,FRA,1420564460,1049\n',
'JBE2302VO4,HZT2506M,IAH,AMS,1420564324,1044\n',
'VZY2993ME1,WSK1289Z,CLT,DEN,1420563542,278\n',
'SJD8775RZ4,TMV7633W,UGK,DXB,1420563958,849\n',
'EDV2089LK5,ATT7791R,AMS,DEN,1420564394,1001\n',
'SPR4484HA6,VDC9164W,FCO,LAS,1420564698,1276\n',
'UES9151GS5,DAU2617A,CGK,SFO,1420564986,1811\n',
'WBE6935NU3,KJR6646J,IAH,BKK,1420565203,1928\n',
'CDC0302NN5,XIL3623J,PEK,LAX,1420564414,1302\n',
'WYU2010YH8,JVY9791G,PVG,FCO,1420564561,1189\n']
wrong_format = ['BWI0520BG6,VYW5940P,LAS,SI|,1420565203,1843\n',
'CKZ3132BR4,RUM0422W,MUC,;AD,1420563539,194\n',
'CKZ313\\BR4,QHU1140O,CDG,LAS,1420564498,1133\n',
'CXN7304ER2,GMO593[W,LHR,PEK,1420564317,1057\n',
'CXN7304ER2,VDCP164W,FCO,LAS,1420564698,1276\n',
'DAZ3029XA0,WPW9201U,DFW,yEK,1420564869,1452\n',
'HGO4350KK1,QHU1140O,CDG,vAS,1420564498,1133\n',
'JJM4724RF7,YZO4444S,BKK,MI^,1420565330,2027\n',
'KKP5277HZ7,VYW5940P,LAS,:IN,1420565203,1843\n',
'MXU9187YC7,MOO1786A,MAD,]RA,1420563408,184\n',
'ONL0812DH1,BER7172M,KUL,[AS,1420565167,1848\n',
'PAJ3974RK1,EWH6301Y,~AN,DFW,1420564967,1683\n',
'POP2875LH3,MBw8071P,KUL,PEK,1420563856,572\n',
'PUD8209OG3,SOH3431A,OR8,MIA,1420563649,250\n',
'PUD8209OG3,XXQ4064%,JFK,FRA,1420563917,802\n',
'SJD8775RZ4,4QU6245R,DEN,FRA,1420564460,1049\n',
'SJD8775RZ4,HZT2506M,IAH,#MS,1420564324,1044\n',
'SJD8775RZ4,WSK1289Z,CLT,vEN,1420563542,278\n',
'SJD8|75RZ4,ULZ8130D,CAN,DFW,1420564983,1683\n',
'SPR4484HA6,ATT7791R,AM%,DEN,1420564394,1001\n',
'SPRb484HA6,VYW5940P,LAS,SIN,1420565203,1843\n',
'UES915*GS5,SQU6245R,DEN,FRA,1420564460,1049\n',
'WBE6935NU$,XOY7948U,ATL,LHR,1420564038,877\n',
'WTC9125IE5,XIL3623J,PEK,L}X,1420564414,1302\n',
'WYU2010YH8,XIL3623J,PEe,LAX,1420564414,1302\n',
'WYu2010YH8,FYL5866L,ATL,HKG,1420565140,1751\n',
'YMH6360YP0,ATT7791R,A;S,DEN,1420564394,1001\n']
def hamming_distance(s1, s2):
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))
#Iterate over both lists together
for r, i in zip(correct_format, wrong_format):
    #Split string into words by comma
    li_r = r.split(',')
    li_i = i.split(',')
    #Check all if conditions and append accordingly
    if any(hamming_distance(item_r, item_i) == 1 for item_r, item_i in zip(li_r,li_i)):
        corrected.append(r)
print(corrected)
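If the two lists are not aligned by position (they have different lengths in the question), pairing them with zip may miss matches. A different sketch, offered only as an assumption about what was intended, keeps the question's matching rule (one of the first four fields differs by exactly one character) but avoids the two likely sources of trouble: calling remove on a list while iterating over it, and appending the same record more than once. The helper is_near_match and the list still_wrong are hypothetical names, and only a few records from the question are used.

```python
def hamming_distance(s1, s2):
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

def is_near_match(good, bad):
    # Compare the first four comma-separated fields pairwise; a match is any
    # field of equal length that differs in exactly one character.
    good_fields = good.split(",")[:4]
    bad_fields = bad.split(",")[:4]
    return any(
        len(g) == len(b) and hamming_distance(g, b) == 1
        for g, b in zip(good_fields, bad_fields)
    )

correct_format = [
    'EZC9678QI6,VYW5940P,LAS,SIN,1420565203,1843\n',
    'LLZ3798PE3,ULZ8130D,CAN,DFW,1420564983,1683\n',
]
wrong_format = [
    'BWI0520BG6,VYW5940P,LAS,SI|,1420565203,1843\n',
    'KKP5277HZ7,VYW5940P,LAS,:IN,1420565203,1843\n',
    'WBE6935NU$,XOY7948U,ATL,LHR,1420564038,877\n',
]

corrected = []    # correct records that matched at least one wrong record
still_wrong = []  # wrong records with no match; built instead of using remove()

for bad in wrong_format:
    # Take the first correct record that matches, or None if there is none.
    match = next((good for good in correct_format if is_near_match(good, bad)), None)
    if match is None:
        still_wrong.append(bad)
    elif match not in corrected:
        corrected.append(match)  # skip records that were already collected

print(corrected)
print(still_wrong)
```

Building still_wrong as a new list leaves wrong_format untouched during iteration, so no element is skipped.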

Pandas drop row based on year

I want to delete or drop some rows from the dataframe based on the year column. I'm utilizing the following code to do it...
usa_population.drop('year' == '1959-', axis=0, inplace=True)
I'm passing an expression hoping to target those rows. I get no error when running this code; however, when I query the dataframe, those rows are still there...
usa_population[usa_population.year == '1959-']
       year  p_age    p_female      p_male     p_total
2886  1959-      0  1996399.23  2064922.61  4061321.83
2887  1959-      1  1998220.09  2070499.94  4068720.04
2888  1959-      2  1966510.93  2034099.69  4000610.62
2889  1959-      3  1921734.50  1985181.41  3906915.91
How can I drop these rows?
The preferred way of doing that is boolean indexing (just invert the condition):
usa_population = usa_population[usa_population['year'] != '1959-']
If you want to use drop, you need to pass the index labels of the rows to be dropped. From your selection usa_population[usa_population.year == '1959-'], you can get those labels through the index attribute: usa_population[usa_population.year == '1959-'].index. If you pass this to the drop method and assign the result back (or pass inplace=True), it will do the same thing:
usa_population = usa_population.drop(usa_population[usa_population.year == '1959-'].index)
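Both approaches can be compared side by side on a small frame. This is a sketch with invented rows; only the column names year and p_age come from the question.

```python
import pandas as pd

# Hypothetical sample data; the values are made up for illustration.
usa_population = pd.DataFrame({
    "year": ["1958-", "1959-", "1959-", "1960-"],
    "p_age": [0, 0, 1, 0],
})

mask = usa_population["year"] == "1959-"

# Boolean indexing: keep every row where the condition is False.
by_mask = usa_population[~mask]

# drop: pass the index labels of the rows to remove.
by_drop = usa_population.drop(usa_population[mask].index)

print(by_mask)
print(by_mask.equals(by_drop))
```

Neither call changes usa_population itself; the result has to be assigned back to the name if the filtered frame should replace the original.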
