pd.to_numeric not working - python-3.x

I am facing a weird problem with pandas and I don't know where I am going wrong. Converting columns to numeric with pd.to_numeric through .iloc leaves the dtypes unchanged, but when I assign the result to a new dataframe, there is no problem. Any idea why?
Edit:
import pandas as pd

sat = pd.read_csv("2012_SAT_Results.csv")
sat.head()

# converted columns to numeric types -- but sat.dtypes still shows object
sat.iloc[:, 2:] = sat.iloc[:, 2:].apply(pd.to_numeric, errors="coerce")
sat.dtypes

# assigning the result to a new dataframe works fine
sat_1 = sat.iloc[:, 2:].apply(pd.to_numeric, errors="coerce")
sat_1.head()

The fact that you can't apply to_numeric directly through .iloc assignment appears to be a bug, but to get the result you're looking for (applying to_numeric to multiple columns at once), you can instead use:
import pandas as pd

df = pd.DataFrame({'a': ['1', '2'], 'b': ['3', '4']})

# If you're applying to entire columns
df[df.columns[1:]] = df[df.columns[1:]].apply(pd.to_numeric, errors='coerce')

# If you want to apply to specific rows within columns
df.loc[df.index[1:], df.columns[1:]] = df.loc[df.index[1:], df.columns[1:]].apply(pd.to_numeric, errors='coerce')
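A quick check, with made-up data, that the column-label assignment really does change the dtypes, and that unparseable strings are coerced to NaN:

```python
import pandas as pd

# Hypothetical frame: an id column plus two string columns to convert
df = pd.DataFrame({'id': ['x', 'y'], 'a': ['1', '2'], 'b': ['3', 'oops']})

# Convert every column except the first; 'oops' cannot be parsed, so it becomes NaN
df[df.columns[1:]] = df[df.columns[1:]].apply(pd.to_numeric, errors='coerce')

print(df.dtypes)           # 'a' and 'b' are now numeric, 'id' stays object
print(df['b'].isna().sum())  # 1 -- the coerced value
```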

Related

Delete rows in Dataframe using Pandas

I have a dataset with 250,000 samples. The column "CHANNEL" has 7 missing values. I want to delete those 7 rows. Here is my code:
mask = df_train["CHANNEL"].notnull()
df_train = df_train[mask]
I checked the shape by
df_train.shape
It correctly outputs 249993 rows. However, when I output the entire dataset, the index still runs from 0 to 249999.
I also checked the number of missing values in each column of df_train, and each is zero. This matters because I want to do concatenation later and the leftover index causes issues. I am not sure if I missed something when using the above commands. I would appreciate any suggestions and comments!
Try using dropna()
df_train = df_train.dropna()
You may see that the frame still ends with index 249999; that's just because dropping rows does not renumber the original index. To reset the index of the new dataframe without the missing values, use reset_index() with drop=True so the old index is discarded rather than added back as a column:
df_train = df_train.dropna()
df_train = df_train.reset_index(drop=True)
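For example, with a small made-up frame containing one missing value:

```python
import pandas as pd
import numpy as np

# Hypothetical data: one NaN in the middle of the column
df = pd.DataFrame({'CHANNEL': [1.0, np.nan, 3.0]})

df = df.dropna()                # row is gone, but the index keeps its gap: 0, 2
df = df.reset_index(drop=True)  # index renumbered 0, 1; old index discarded

print(df.index.tolist())  # [0, 1]
```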

Trying to merge 2 dataframes but receiving value error of merging object and int32 columns

I have been trying to address an issue mentioned here. I had been using a list of dates to filter a dataframe, and a very gracious person was helping me, but with the current code I now get a ValueError about merging object and int32 columns.
# Assign a sequential number to each trading day
df_melt_test_percent = df_melt_test_percent.sort_index().assign(DayNumber=lambda x: range(len(x)))

# Find the indices of the FOMC dates
tmp = pd.merge(
    df_FOMC_dates, df_melt_test_percent[['DayNumber']],
    left_on='FOMC_date', right_on='DayNumber'
)

# For each row, get the FOMC dates ± 3 days
tmp['delta'] = tmp.apply(lambda _: range(-3, 4), axis=1)
tmp = tmp.explode('delta')
tmp['DayNumber'] += tmp['delta']

# Assemble the result
result = pd.merge(tmp, df_melt_test_percent, on='DayNumber')
If anyone has any advice on how to fix this, it would be greatly appreciated.
The columns on which you want to merge do not have the same type in both dataframes; likely one is string and the other integer. You should convert them to the same type before merging. Assuming from the little you showed, run this before your merge:
tmp['DayNumber'] = tmp['DayNumber'].astype(int)
Alternatively:
df_melt_test_percent['DayNumber'] = df_melt_test_percent['DayNumber'].astype(str)
NB: this might not work as you did not provide a full example. Either check the right types yourself or provide a reproducible example.
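Here is a minimal sketch, with hypothetical frames, reproducing the mismatch and the fix:

```python
import pandas as pd

# Hypothetical frames: 'DayNumber' is string on one side, integer on the other
left = pd.DataFrame({'DayNumber': ['0', '1', '2']})
right = pd.DataFrame({'DayNumber': [0, 1, 2], 'v': ['a', 'b', 'c']})

# left.merge(right, on='DayNumber') would raise a ValueError here,
# because pandas refuses to merge object and integer key columns.

# Convert one side so the key dtypes match, then merge
left['DayNumber'] = left['DayNumber'].astype(int)
merged = left.merge(right, on='DayNumber')

print(merged['v'].tolist())  # ['a', 'b', 'c']
```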

replacing a special character in a pandas dataframe

I have a dataset that uses '?' instead of NaN for missing values. I could go through each column using replace, but I have 22 columns. I am trying to write a loop to do it efficiently, but I am getting it wrong. Here is what I am doing:
for col in adult.columns:
    if adult[col] == '?':
        adult[col] = adult[col].str.replace('?', 'NaN')
The plan is to use the 'NaN' values with the fillna function, or to drop them with dropna. The second problem is that not all the columns are strings, so the str accessor is also wrong for some of them. How can I deal with this situation easily?
If you're reading the data from a .csv or .xlsx file you can use the na_values parameter:
adult = pd.read_csv('path/to/file.csv', na_values=['?'])
Otherwise, do what @MasonCaiby said and use adult.replace('?', float('nan'))
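For example, a small sketch with made-up data showing that replace works frame-wide, so there is no per-column loop and no str accessor to worry about:

```python
import pandas as pd
import numpy as np

# Hypothetical data: '?' marks the missing values in both columns
adult = pd.DataFrame({'age': ['39', '?'], 'workclass': ['Private', '?']})

adult = adult.replace('?', np.nan)  # a real NaN, not the string 'NaN'
adult = adult.dropna()              # now dropna/fillna behave as expected

print(len(adult))  # 1 -- only the complete row remains
```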

Warning: A value is trying to be set on a copy of a slice from a DataFrame -- Using List of Columns

I am getting the following warning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
Here is my code that is getting the warning:
col_names = ['Column1', 'Column2']
features = X_train[col_names]
scaler = StandardScaler().fit(features.values)
features = scaler.transform(features.values)
X_train[col_names] = features
I realize this is happening because I'm copying the dataframe. But what I am doing here is not like any of the answers I found by googling, so I can't figure out how to apply them to my situation. The usual scenario for this warning looks like:
d2 = data[data['name'] == 'fred']
So .loc doesn't seem to apply, and .assign doesn't either, because I have a list of columns rather than a single column to assign. I'm just not sure how to handle this the way the warning wants me to.
It works fine the way it is, other than the warning. So the way I have it is correct.
I think the warning is telling you to do:
X_train.loc[:, col_names] = features
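A small self-contained sketch of the same pattern, with made-up data; a plain z-score stands in for StandardScaler.transform, which likewise returns a 2-D array:

```python
import pandas as pd

# Hypothetical training frame with two feature columns and one other column
X_train = pd.DataFrame({'Column1': [1.0, 2.0, 3.0],
                        'Column2': [10.0, 20.0, 30.0],
                        'y': [0, 1, 0]})
col_names = ['Column1', 'Column2']

features = X_train[col_names]
# Stand-in for scaler.transform(features.values): z-score with population std
scaled = (features - features.mean()) / features.std(ddof=0)

# Writing back through .loc assigns into the original frame directly,
# which avoids the chained-assignment warning
X_train.loc[:, col_names] = scaled.values

print(X_train['Column1'].tolist())  # centered around 0; 'y' is untouched
```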

Loading Pandas Data Frame into Excel using writer.save() and getting indexing error

I am aggregating a Pandas DF using numpy size and then want to load the results into an Excel using writer.save. But I am getting the following error: NotImplementedError: Writing as Excel with a MultiIndex is not yet implemented.
My data looks something like this:
agt_id unique_id
abc123 ab12345
abc123 cd23456
abc123 de34567
xyz987 ef45678
xyz987 fg56789
My results should look like:
agt_id unique_id
abc123 3
xyz987 2
This is an example of my code:
df_agtvol = df_agt.groupby('agt_id').agg({'unique_id':[np.size]})
writer = pd.ExcelWriter(outfilepath, engine='xlsxwriter')
df_agtvol.to_excel(writer, sheet_name='agt_vols')
I have tried to reset the index by using:
df_agt_vol_final = df_agtvol.set_index([df_agtvol.index, 'agt_id'], inplace=True)
based on some research, but am getting a completely different error.
I am relatively new to working with Pandas dataframes, so any help would be appreciated.
You don't need a MultiIndex. The reason you get one is that np.size is wrapped in a list.
Although not explicitly documented, Pandas interprets everything in the list as a subindex for 'unique_id'. This use case falls under the "nested dict of names -> dicts of functions" case in the linked documentation.
So
df_agtvol = df_agt.groupby('agt_id').agg({'unique_id':[np.size]})
Should be
df_agtvol = df_agt.groupby('agt_id').agg({'unique_id': np.size})
This is still overly complicated, and you can get the same result with a call to the count method:
df_agtvol = df_agt.groupby('agt_id').count()
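Using the sample data from the question, count produces a flat result that to_excel can write without the MultiIndex error:

```python
import pandas as pd

# The sample data from the question
df_agt = pd.DataFrame({
    'agt_id': ['abc123', 'abc123', 'abc123', 'xyz987', 'xyz987'],
    'unique_id': ['ab12345', 'cd23456', 'de34567', 'ef45678', 'fg56789'],
})

df_agtvol = df_agt.groupby('agt_id').count()

print(df_agtvol['unique_id'].tolist())  # [3, 2]
print(df_agtvol.columns.nlevels)        # 1 -- flat columns, safe for to_excel
```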
