I have a dataset with 250,000 samples. The column "CHANNEL" has 7 missing values. I want to delete those 7 rows. Here is my code:
mask = df_train["CHANNEL"].notnull()
df_train = df_train[mask]
I checked the shape by
df_train.shape
It correctly outputs 249993 rows. However, when I display the entire dataset, the index still runs from 0 to 249999.
I also checked the number of missing values in each column of df_train, and all of them are zero. This matters because I want to do concatenation later and the leftover index causes issues. Did I miss something when using the commands above? I would appreciate any suggestions and comments!
Try using dropna()
df_train = df_train.dropna()
You may see that the last row still has index 249999; that is because dropping rows keeps the original index labels. To reset the index of the new data frame without the missing values, use reset_index():
df_train = df_train.dropna()
df_train = df_train.reset_index(drop=True)
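If you prefer to keep the boolean-mask approach from the question, the same reset applies there too:

# keep only rows where CHANNEL is present, then renumber the index from 0
mask = df_train["CHANNEL"].notnull()
df_train = df_train[mask].reset_index(drop=True)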
Summary
I am trying to loop through a pandas dataframe, and to run a secondary loop at each iteration. The secondary loop calculates something that I want to append into the original dataframe, so that when the primary loop advances, some of the rows are recalculated based on the changed values. (For those interested, this is a simple advective model of carbon accumulation in soils. When a new layer of soil is deposited, mixing processes penetrate into older layers and transform their properties to a set depth. Thus, each layer deposited changes those below it incrementally, until a former layer lies below the mixing depth.)
I have produced an example of how I want this to work, however it is generating the common error message:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
self._setitem_single_block(indexer, value, name)
I have looked into the documentation linked in the error message as well as myriad posts on this forum, but none of them address repeatedly looping through a dataframe that changes as it is processed.
What I've tried, and some possible solutions
Below is some example code. This code works more or less as well as I want it to. But it produces the warning. Should I:
Suppress the warning and continue working with this architecture? In this case, am I asking for trouble with un-reproducible results?
Try a different architecture altogether, like a numpy array from the original dataframe?
Try df.append() or df.copy() to avoid the warning?
I have tried `df.copy()` to no avail; the warning was still thrown.
Example code:
import pandas as pd
a = pd.DataFrame(
{
'a':[x/2 for x in range(1,11)],
'b':['hot dog', 'slider', 'watermelon', 'funnel cake', 'cotton candy', 'lemonade', 'fried oreo', 'ice cream', 'corn', 'sausage'],
'c':['meat', 'meat', 'vegan', 'vegan', 'vegan', 'vegan', 'dairy','dairy', 'vegan', 'meat']
}
)
print(a)
z = [x/(x+2) for x in range(1,5)]
print(z)
#Primary loop through rows of the main dataframe
for ind, row in a.iterrows():
    # Pull out a chunk of the dataframe. This is the portion of the dataframe that
    # will be modified. What is below this is already modified and locked into the
    # geological record. What is above has not yet been deposited.
    b = a.iloc[ind:(ind+len(z)), :]
    # Define the size of the secondary loop. Taking the minimum avoids the model
    # mixing below the boundary layer (key error).
    loop = min([len(z), len(b)])
    # Now loop through the sub-dataframe and change accordingly.
    for fraction in range(loop):
        b['a'].iloc[fraction] = b['a'].iloc[fraction]*z[fraction]
    # Append the original dataframe with new data:
    a.iloc[ind:(ind+loop), :] = b
    # Try df.copy(), but still throws warning!
    #a.iloc[ind:(ind+loop), :] = b.copy()
print(a)
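One way to avoid the warning, sketched under the assumption that the frame keeps its default integer index (so the iterrows index equals the row position), is to skip the intermediate slice b and write straight into a with a single .iloc call:

import pandas as pd

a = pd.DataFrame(
    {
        'a': [x/2 for x in range(1, 11)],
        'b': ['hot dog', 'slider', 'watermelon', 'funnel cake', 'cotton candy',
              'lemonade', 'fried oreo', 'ice cream', 'corn', 'sausage'],
        'c': ['meat', 'meat', 'vegan', 'vegan', 'vegan',
              'vegan', 'dairy', 'dairy', 'vegan', 'meat'],
    }
)
z = [x/(x+2) for x in range(1, 5)]

col = a.columns.get_loc('a')              # integer position of column 'a'
for ind in range(len(a)):
    # only mix as far down as the remaining rows allow
    loop = min(len(z), len(a) - ind)
    for fraction in range(loop):
        # a single .iloc assignment on the parent frame: no intermediate
        # slice, so no SettingWithCopyWarning
        a.iloc[ind + fraction, col] = a.iloc[ind + fraction, col] * z[fraction]
print(a)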
I have been trying to address an issue mentioned here
I had been trying to use a list of dates to filter a dataframe, and a very gracious person was helping me, but now with the current code, I am receiving these errors.
# Assign a sequential number to each trading day
df_melt_test_percent = df_melt_test_percent.sort_index().assign(DayNumber=lambda x: range(len(x)))
# Find the indices of the FOMC_dates
tmp = pd.merge(
df_FOMC_dates, df_melt_test_percent[['DayNumber']],
left_on='FOMC_date', right_on='DayNumber'
)
# For each row, get the FOMC_dates ± 3 days
tmp['delta'] = tmp.apply(lambda _: range(-3, 4), axis=1)
tmp = tmp.explode('delta')
tmp['DayNumber'] += tmp['delta']
# Assemble the result
result = pd.merge(tmp, df_melt_test_percent, on='DayNumber')
Screenshots of dataframes:
If anyone has any advice on how to fix this, it would be greatly appreciated.
The columns on which you want to merge do not have the same type in both dataframes; likely one is a string and the other an integer. You should convert them to the same type before merging. Assuming from the little bit you showed, before your merge, run:
tmp['DayNumber'] = tmp['DayNumber'].astype(int)
Alternatively:
df_melt_test_percent['DayNumber'] = df_melt_test_percent['DayNumber'].astype(str)
NB. This might not work, as you did not provide a full example. Either check the correct types yourself or provide a reproducible example.
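As a minimal sketch of the check and the fix, using hypothetical frames (the real contents were only shown as screenshots):

import pandas as pd

# hypothetical frames reproducing the symptom: the same key column,
# stored as strings on one side and integers on the other
tmp = pd.DataFrame({'DayNumber': ['3', '4', '5'], 'delta': [-1, 0, 1]})
df_melt_test_percent = pd.DataFrame({'DayNumber': range(10), 'pct': range(10)})

# inspect the key dtypes first; they must match for the merge to work
print(tmp['DayNumber'].dtype, df_melt_test_percent['DayNumber'].dtype)  # object vs int64

# convert one side so both keys share a type, then merge
tmp['DayNumber'] = tmp['DayNumber'].astype(int)
result = pd.merge(tmp, df_melt_test_percent, on='DayNumber')
print(result)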
So I'm trying to normalize my features by using .apply() iteratively on all columns of the dataframe, but it gives a KeyError. Can someone help?
I've tried the code below, but it doesn't work:
for x in df.columns:
df[x+'_norm'] = df[x].apply(lambda x:(x-df[x].mean())/df[x].std())
I don't think it's a good idea to use the mean and std functions inside the apply: you are recalculating them for every single row. Instead, calculate them once at the start of each loop iteration and use those values inside the apply function, like below. (Note that the lambda argument is renamed to y; in your version the lambda parameter x shadows the loop variable x, so df[x] inside the lambda looks up a cell value as a column name, which is what raises the KeyError.)
for x in df.columns:
mean = df[x].mean()
std = df[x].std()
df[x+'_norm'] = df[x].apply(lambda y:(y-mean)/std)
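Since the whole calculation is vectorized in pandas anyway, a simpler sketch (assuming all columns are numeric, which the normalization requires) drops apply entirely:

import pandas as pd

# hypothetical numeric data for illustration
df = pd.DataFrame({'height': [1.6, 1.7, 1.8, 1.9], 'weight': [55.0, 70.0, 85.0, 90.0]})

for col in df.columns:
    # column-wise z-score; mean() and std() are each computed once per column
    df[col + '_norm'] = (df[col] - df[col].mean()) / df[col].std()
print(df)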
I have a DataFrame with a column named 'UserNbr' and a column named 'Spclty', which is composed of elements like this:
[['104', '2010-01-31'], ['215', '2014-11-21'], ['352', '2016-07-13']]
where there can be 0 or more elements in the list.
Some UserNbr keys appear in multiple rows, and I wish to collapse each such group into a single row such that 'Spclty' contains all the unique dicts like those in the list shown above.
To save overhead on appending to a DataFrame, I'm appending each output row to a list instead of to the DataFrame.
My code is working, but it's taking hours to run on 0.7M rows of input. (Actually, I've never been able to keep my laptop open long enough for it to finish executing.)
Is there a better way to aggregate into a structure like this, maybe using a library that provides more data-reshaping options instead of looping over UserNbr? (In R, I'd use the data.table and dplyr libraries.)
# loop over all UserNbr:
# consolidate specialty fields into dict-like sets (to remove redundant codes);
# output one row per user to new data frame
out_rows = list()
spcltycol = df_tmp.columns.get_loc('Spclty')
all_UserNbr = df_tmp['UserNbr'].unique()
for user in all_UserNbr:
    df_user = df_tmp.loc[df_tmp['UserNbr'] == user]
    if df_user.shape[0] > 0:
        open_combined = df_user.iloc[0, spcltycol]    # capture 1st row
        for row in range(1, df_user.shape[0]):        # union with any subsequent rows
            open_combined = open_combined.union(df_user.iloc[row, spcltycol])
        new_row = df_user.drop(['Spclty', 'StartDt'], axis=1).iloc[0].tolist()
        new_row.append(open_combined)
        out_rows.append(new_row)

# construct new dataframe with no redundant UserID rows:
df_out = pd.DataFrame(out_rows, columns=['UserNbr', 'Spclty'])

# convert Spclty sets to dicts:
df_out['Spclty'] = [dict(df_out['Spclty'][row]) for row in range(df_out.shape[0])]
The conversion to dict gets rid of specialties that are repeated between rows. In the output, a Spclty value should look like this:
{'104': '2010-01-31', '215': '2014-11-21', '352': '2016-07-13'}
except that there may be more key-value pairs than in any corresponding input row (resulting from aggregation over UserNbr).
I withdraw this question.
I had hoped there was an efficient way to use groupby with something else, but I haven't found any examples with a complex data structure like this one and have received no guidance.
For anyone who gets similarly stuck with very slow aggregation problems in Python, I suggest stepping up to PySpark. I am now tackling this problem with a Databricks notebook and am making headway with the pyspark.sql.window Window functions. (Now, it only takes minutes to run a test instead of hours!)
A partial solution is in the answer here:
PySpark list() in withColumn() only works once, then AssertionError: col should be Column
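For anyone who wants to stay in pandas, here is a hedged sketch of a groupby-based version, assuming (as in the example at the top of the question) that 'Spclty' holds lists of [code, date] pairs and that only UserNbr and Spclty are needed in the output:

import pandas as pd

# hypothetical input mirroring the structure described in the question
df_tmp = pd.DataFrame({
    'UserNbr': [1, 1, 2],
    'Spclty': [
        [['104', '2010-01-31'], ['215', '2014-11-21']],
        [['104', '2010-01-31'], ['352', '2016-07-13']],
        [['215', '2014-11-21']],
    ],
})

# one pass over the groups: merge every user's pairs into a single dict,
# which drops duplicate specialty codes automatically
df_out = pd.DataFrame([
    {
        'UserNbr': user,
        'Spclty': dict(tuple(pair) for pairs in grp for pair in pairs),
    }
    for user, grp in df_tmp.groupby('UserNbr')['Spclty']
])
print(df_out)

The single groupby pass avoids filtering the whole frame once per user, which is where the original loop spends most of its time.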
I am facing a weird problem with pandas, and I do not know where I am going wrong. When I convert columns with pd.to_numeric and assign them back through .iloc, the dtypes do not change. But when I create a new df from the converted columns, there seems to be no problem. Any idea why?
Edit:
sat=pd.read_csv("2012_SAT_Results.csv")
sat.head()
#converted columns to numeric types
sat.iloc[:,2:]=sat.iloc[:,2:].apply(pd.to_numeric,errors="coerce")
sat.dtypes
sat_1=sat.iloc[:,2:].apply(pd.to_numeric,errors="coerce")
sat_1.head()
The fact that you can't apply to_numeric directly using .iloc appears to be a bug, but to get the same results that you're looking for (applying to_numeric to multiple columns at the same time), you could instead use:
df = pd.DataFrame({'a':['1','2'],'b':['3','4']})
# If you're applying to entire columns
df[df.columns[1:]] = df[df.columns[1:]].apply(pd.to_numeric, errors = 'coerce')
# If you want to apply to specific rows within columns
df.loc[df.index[1:], df.columns[1:]] = df.loc[df.index[1:], df.columns[1:]].apply(pd.to_numeric, errors = 'coerce')
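Applied to the example from the question, the same label-based pattern would look like this (assuming, as in the question, that the columns to convert start at position 2):

import pandas as pd

sat = pd.read_csv("2012_SAT_Results.csv")

# convert every column from position 2 onward, assigning by column labels
cols = sat.columns[2:]
sat[cols] = sat[cols].apply(pd.to_numeric, errors='coerce')
print(sat.dtypes)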