I want to delete or drop some rows from the dataframe based on the year column. I'm using the following code to do it...
usa_population.drop('year' == '1959-', axis=0, inplace=True)
I'm passing an expression hoping to target those rows. This code runs without error; however, when I query the dataframe, those rows are still there...
usa_population[usa_population.year == '1959-']
       year  p_age    p_female      p_male     p_total
2886  1959-      0  1996399.23  2064922.61  4061321.83
2887  1959-      1  1998220.09  2070499.94  4068720.04
2888  1959-      2  1966510.93  2034099.69  4000610.62
2889  1959-      3  1921734.50  1985181.41  3906915.91
How can I drop these rows?
The preferred way of doing that is boolean indexing (just invert the condition):
usa_population = usa_population[usa_population['year'] != '1959-']
If you want to use drop, you need to pass the indices of the rows to be dropped. So from your selection of usa_population[usa_population.year == '1959-'], you can access the index attribute with usa_population[usa_population.year == '1959-'].index. If you pass this to the drop method, it will do the same thing:
usa_population.drop(usa_population[usa_population.year == '1959-'].index)
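Note that drop returns a new DataFrame by default, so remember to assign the result back (or pass inplace=True, as in your original attempt):

usa_population = usa_population.drop(usa_population[usa_population.year == '1959-'].index)
# or, modifying in place:
usa_population.drop(usa_population[usa_population.year == '1959-'].index, inplace=True)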
I recently needed to fill blank string values within a pandas dataframe with the value of an adjacent column in the same row.
I attempted df.apply(lambda x: x['A'].replace(...)) as well as np.where. Neither worked. There were anomalies with the "blank string values": I couldn't pick them up via '' or df['A'].replace(r'^\s$', df['B'], regex=True), or with df['B'] replaced by e.g. '-'. The only two things that worked were .isnull() and iterrows, where they appeared as nan.
So iterrows worked, but I'm not saving the changes.
How is pandas saving the changes?
import pandas as pd

mylist = {'A': ['fe', 'fi', 'fo', ''], 'B': ['fe1,', 'fi2', 'fi3', 'thum']}
coffee = pd.DataFrame(mylist)
print("output1\n", coffee.head())
for index, row in coffee.iterrows():
    if str(row['A']) == '':
        row['A'] = row['B']
print("output2\n", coffee.head())
output1
      A     B
0    fe  fe1,
1    fi   fi2
2    fo   fi3
3        thum
output2
      A     B
0    fe  fe1,
1    fi   fi2
2    fo   fi3
3  thum  thum
Note: the dataframe's columns are of object dtype, BTW.
About pandas.DataFrame.iterrows, the documentation says:
"You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect."
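In other words, whether such a loop appears to work depends on the frame's dtypes. A quick illustration (hypothetical data) with a mixed-dtype frame, where each row comes back as a copy and the writes are silently lost:

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})  # mixed dtypes
for _, row in df.iterrows():
    row["a"] = 99  # writes to a temporary copy of the row
print(df)  # unchanged: with mixed dtypes, iterrows hands back copies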
In your case, you can use one of these solutions (each should work with your real dataset as well):
coffee.loc[coffee["A"].eq(""), "A"] = coffee["B"]
Or:
coffee["A"] = coffee["B"].where(coffee["A"].eq(""), coffee["A"])
Or:
coffee["A"] = coffee["A"].replace({"": None}).fillna(coffee["B"])
(Note the dict form in the last one: replace("", None) with a plain scalar and value=None would pad-fill from the previous row instead of inserting a missing value.)
It is still strange behaviour, though, that your original dataframe got updated within the loop without any re-assignment. Not to mention that the row Series is supposed to be a copy, not a view...
I have 2 dataframes, both with a date column:
I need to set in the first dataframe the value of a specific column found in the second dataframe.
So, first of all, I find the correct row of the first dataframe with:
id_row = int(dataset.loc[dataset["time"] == str(searchs.index[x])].index[0]) #example: 910
and then I want to update the value of the ['search_volume'] column at this row, 910.
I do this with:
dataset['search_volume'][id_row] = searchs[kw_list[0]][x]
but I get back this error:
/root/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:8: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
My full code is below, but it's not working and nothing is updated:
for x in range(len(searchs)):
    id_row = int(dataset.loc[dataset["time"] == str(searchs.index[x])].index[0])
    dataset['search_volume'][id_row] = searchs[kw_list[0]][x]
It works fine if I test the update manually with:
dataset['search_volume'][910] = searchs[kw_list[0]][47]
What's happening?!
Use .loc:
dataset.loc[910, 'search_volume'] = searchs.loc[47, kw_list[0]]
For more info about the error message, see this
Also, there are way more efficient methods for doing this. As a rule of thumb, if you are looping over a dataframe, you are generally doing something wrong. Some potential solutions: pd.DataFrame.join, pd.merge, masking, pd.DataFrame.where, etc.
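For instance, a minimal sketch of the map-based approach, assuming (as in the question) that searchs is indexed by the timestamps and kw_list[0] names the column holding the volumes:

lookup = searchs[kw_list[0]].copy()
lookup.index = lookup.index.astype(str)  # match the string-typed 'time' column
dataset['search_volume'] = dataset['time'].map(lookup).fillna(dataset['search_volume'])

This replaces the whole loop with a single vectorized lookup.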
I keep getting the warning "A value is trying to be set on a copy of a slice from a DataFrame".
How can I fix it? Any alternatives?
# check for NaN
# capitalise first letter
# assign 'Male' for 'm'
# assign 'Female' for 'f'
myDataFrame.to_csv('new_H.csv')
genderList = myDataFrame.loc[:, "Gender"]  # extract Gender column
for i in range(0, len(genderList)):
    if type(genderList[i]) == float:  # check for empty spaces (NaN)
        genderList[i] = 'NAN'
    elif genderList[i].startswith('f'):
        genderList[i] = 'Female'
    elif genderList[i].startswith('m'):
        genderList[i] = 'Male'
for row in myDataFrame.itertuples():
    if isinstance(row.Gender, float):  # NaN shows up as a float
        myDataFrame.at[row.Index, 'Gender'] = 'NAN'
    elif row.Gender.startswith('f'):
        myDataFrame.at[row.Index, 'Gender'] = 'Female'
    elif row.Gender.startswith('m'):
        myDataFrame.at[row.Index, 'Gender'] = 'Male'
The line genderList = myDataFrame.loc[:,"Gender"] causes the warning, since you are assigning a piece of your data frame, which could be a copy, so the update may not be applied to the original dataframe. In the code above I used the itertuples method, which is a more "correct" way to iterate through rows in pandas, and wrote each value back to the original dataframe with .at rather than mutating a slice: you don't need to create a slice of a row to act on it; you just update the value of the column in that row directly.
From what I see, your goal is to replace values in Gender based on the previous values. In that case I recommend checking pandas' replace method, which is made for that exact reason, together with a filter. But, since your filter is quite simple, you can do the following:
myDataFrame.loc[myDataFrame["Gender"].str.contains('^f', na=False), "Gender"] = "Female"
to update all females. I used .loc with a boolean mask, where the condition is myDataFrame["Gender"].str.contains('^f', na=False) (na=False keeps the NaN rows out of the mask); selecting the "Gender" column explicitly ensures the other columns are not overwritten.
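For completeness, a minimal fully vectorized sketch covering all three cases at once (column name as in the question; the first-letter mapping is an assumption based on the stated rules):

first = myDataFrame["Gender"].str.lower().str[0]  # 'f', 'm', or NaN
myDataFrame["Gender"] = first.map({'f': 'Female', 'm': 'Male'}).fillna('NAN')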
Nested for loops are very time-inefficient. I have some ideas to make this efficient; wondering if better alternatives can be shared.
I am trying to create a dataframe in Python, pulling values from multiple other dataframes. For a small number of variables/columns I can perform simple assignments. In the example below I want a cell each in two dataframes to be compared and an assignment made if they are equal. If they are not equal, I need to iterate through the second dataframe until every cell is evaluated before making any assignment.
"""iterated through each row of first dataframe and then the second. This is to control for values in compared column
are matched correctly. """
for i in range(len(df10)):
for j in range(len(df6)): # this is not an efficient way to perform this action.
if df10.iloc[i,0] == df6.iloc[j,1]:
df10.iloc[i,23] = df6.iloc[j,6]
df10.iloc[i,24] = df6.iloc[j,1]
df10.sample(n=5)
Here is how you can do it; please see the comments for a description. Leave a comment if something is not clear.
import numpy as np
import pandas as pd

np.random.seed(10)
df10 = pd.DataFrame(np.random.choice(5, (5, 5)))
df6 = pd.DataFrame(np.random.choice(5, (4, 6)))
display(df10)
display(df6)
## compare each pair of values from the 0th column of df10 and the 1st column
## of df6 using numpy broadcasting, which returns a boolean matrix that is
## True at element i,j where the values are equal
cond = df10.iloc[:, 0].values[:, np.newaxis] == df6.iloc[:, 1].values
## get the matching positions in the flattened matrix
indx = np.arange(cond.size)[cond.ravel()]
## convert each flattened position back to a pair (i, j),
## where i corresponds to a row index in df10 and j corresponds to
## a row index in df6
i, j = indx // len(df6), indx % len(df6)
## set the values using fancy indexing
df10.iloc[i, 3] = df6.iloc[j, 4].values
df10
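An alternative sketch using a plain map-based lookup (this assumes, matching the loop's behaviour, that when several rows of df6 match, the last one wins):

## keep the last row of df6 for each value in column 1, then map column 0
## of df10 through that lookup; unmatched rows keep their current value
lookup = df6.drop_duplicates(subset=1, keep='last').set_index(1)[4]
df10[3] = df10[0].map(lookup).fillna(df10[3])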
Suppose I have 3 dataframe variables: res_df_union is the main dataframe, and df_res and df_vacant are sub-dataframes created from res_df_union. They all share 2 columns called uniqueid and vacant_has_res. My goal is to compare the uniqueid column values in df_res and df_vacant and, if they match, to assign vacant_has_res in res_df_union the value 1.
*Note: I am using GeoPandas (a gpd GeoDataFrame) instead of just pandas because I am working with spatial data, but the concept is the same.
res_df_union = gpd.read_file(union, layer=union_feat)
df_parc_res = res_df_union[res_df_union.Parc_Res == 1]
unq_id_res = df_parc_res.uniqueid.unique()
df_parc_vacant = res_df_union[res_df_union.Parc_Vacant == 1]
unq_id_vac = df_parc_vacant.uniqueid.unique()
vacant_res_ids = []
for id_a in unq_id_vac:
    for id_b in unq_id_res:
        if id_a == id_b:
            vacant_res_ids.append(id_a)
The code up to this point works: I have a list of uniqueids that match. Now I just want to look for those unique ids in res_df_union and then assign res_df_union['vacant_has_res'] = 1. When I run the following, it either causes my IDE to crash or never finishes running (after several hours). What am I doing wrong, and is there a more efficient way to do this?
def u_row(row, id_val):
    if row['uniqueid'] == id_val:
        return 1

for item in res_df_union['uniqueid']:
    if item in vacant_res_ids:
        res_df_union['Has_Res_Association'] = res_df_union.apply(lambda row: u_row(row, item), axis=1)
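For reference, a minimal vectorized sketch of what this loop is trying to do (column names taken from the question; note the prose asks for vacant_has_res while the loop writes Has_Res_Association):

# isin vectorizes the membership test; converting the list to a set makes each lookup O(1)
res_df_union['vacant_has_res'] = res_df_union['uniqueid'].isin(set(vacant_res_ids)).astype(int)

This replaces the apply-inside-a-loop (which rebuilds the entire column once per matching row) with a single pass over the frame.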