Given the following pandas dataframe
+----+----------------+----------------+-----------+
|    | AgeAt_X        | AgeAt_Y        |   AgeAt_Z |
|----+----------------+----------------+-----------|
|  0 | Older than 100 | Older than 100 |     74.13 |
|  1 | nan            | nan            |     58.46 |
|  2 | nan            | 8.4            |     54.15 |
|  3 | nan            | nan            |     57.04 |
|  4 | nan            | 57.04          |     nan   |
+----+----------------+----------------+-----------+
How can I replace the values equal to Older than 100 in specific columns with nan, so that I end up with the following?
+----+----------------+----------------+-----------+
|    | AgeAt_X        | AgeAt_Y        |   AgeAt_Z |
|----+----------------+----------------+-----------|
|  0 | nan            | nan            |     74.13 |
|  1 | nan            | nan            |     58.46 |
|  2 | nan            | 8.4            |     54.15 |
|  3 | nan            | nan            |     57.04 |
|  4 | nan            | 57.04          |     nan   |
+----+----------------+----------------+-----------+
Notes
After removing the Older than 100 string from the desired columns, I convert the columns to numeric in order to perform calculations on said columns.
There are other columns in this dataframe (that I have excluded from this example), which will not be converted to numeric, so the conversion to numeric must be done one column at a time.
What I've tried
Attempt 1
if df.isin('Older than 100'):
    df.loc[df['AgeAt_X']] = ''
else:
    df['AgeAt_X'] = pd.to_numeric(df["AgeAt_X"])
Attempt 2
if df.loc[df['AgeAt_X']] == 'Older than 100r':
    df.loc[df['AgeAt_X']] = ''
elif df.loc[df['AgeAt_X']] == '':
    df['AgeAt_X'] = pd.to_numeric(df["AgeAt_X"])
Attempt 3
df['AgeAt_X'] = ['' if ele == 'Older than 100' else df.loc[df['AgeAt_X']] for ele in df['AgeAt_X']]
Attempts 1, 2 and 3 return the following error:
KeyError: 'None of [0 NaN\n1 NaN\n2 NaN\n3 NaN\n4 NaN\n5 NaN\n6 NaN\n7 NaN\n8 NaN\n9 NaN\n10 NaN\n11 NaN\n12 NaN\n13 NaN\n14 NaN\n15 NaN\n16 NaN\n17 NaN\n18 NaN\n19 NaN\n20 NaN\n21 NaN\n22 NaN\n23 NaN\n24 NaN\n25 NaN\n26 NaN\n27 NaN\n28 NaN\n29 NaN\n ..\n6332 NaN\n6333 NaN\n6334 NaN\n6335 NaN\n6336 NaN\n6337 NaN\n6338 NaN\n6339 NaN\n6340 NaN\n6341 NaN\n6342 NaN\n6343 NaN\n6344 NaN\n6345 NaN\n6346 NaN\n6347 NaN\n6348 NaN\n6349 NaN\n6350 NaN\n6351 NaN\n6352 NaN\n6353 NaN\n6354 NaN\n6355 NaN\n6356 NaN\n6357 NaN\n6358 NaN\n6359 NaN\n6360 NaN\n6361 NaN\nName: AgeAt_X, Length: 6362, dtype: float64] are in the [index]'
Attempt 4
df['AgeAt_X'] = df['AgeAt_X'].replace({'Older than 100': ''})
Attempt 4 returns the following error:
TypeError: Cannot compare types 'ndarray(dtype=float64)' and 'str'
I've also looked at a few posts. The two below do not actually replace the value but create a new column derived from others:
Replace specific values in Pandas DataFrame
Pandas replace DataFrame values
We can loop through each column and check whether the string is present. If we get a hit, we replace the string with 'NaN' using Series.str.replace and right after convert the column to numeric with Series.astype, in this case float:
df.dtypes
AgeAt_X object
AgeAt_Y object
AgeAt_Z float64
dtype: object
sent = 'Older than 100'
for col in df.columns:
    if sent in df[col].values:
        df[col] = df[col].str.replace(sent, 'NaN')
        df[col] = df[col].astype(float)
print(df)
AgeAt_X AgeAt_Y AgeAt_Z
0 NaN NaN 74.13
1 NaN NaN 58.46
2 NaN 8.40 54.15
3 NaN NaN 57.04
4 NaN 57.04 NaN
df.dtypes
AgeAt_X float64
AgeAt_Y float64
AgeAt_Z float64
dtype: object
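A small note on why this works: astype(float) parses the literal string 'NaN' into an actual NaN, which is why the string is replaced with 'NaN' rather than an empty string (casting '' to float would raise).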
If I understand you correctly, you can replace all occurrences of Older than 100 with np.nan with a single call to DataFrame.replace. If all remaining values are numeric, then the replace will implicitly change the data type of the column to numeric:
import numpy as np
import pandas as pd

# Minimal example DataFrame
df = pd.DataFrame({'AgeAt_X': ['Older than 100', np.nan, np.nan],
                   'AgeAt_Y': ['Older than 100', np.nan, 8.4],
                   'AgeAt_Z': [74.13, 58.46, 54.15]})
df
AgeAt_X AgeAt_Y AgeAt_Z
0 Older than 100 Older than 100 74.13
1 NaN NaN 58.46
2 NaN 8.4 54.15
df.dtypes
AgeAt_X object
AgeAt_Y object
AgeAt_Z float64
dtype: object
# Replace occurrences of 'Older than 100' with np.nan in any column
df.replace('Older than 100', np.nan, inplace=True)
df
AgeAt_X AgeAt_Y AgeAt_Z
0 NaN NaN 74.13
1 NaN NaN 58.46
2 NaN 8.4 54.15
df.dtypes
AgeAt_X float64
AgeAt_Y float64
AgeAt_Z float64
dtype: object
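Since the question notes that only specific columns should be touched (the real frame has other, non-numeric columns), a minimal sketch that limits the replacement and the numeric conversion to a chosen subset could look like this (the column list is an assumption taken from the example):
age_cols = ['AgeAt_X', 'AgeAt_Y', 'AgeAt_Z']  # assumed subset; adjust to the real columns
df[age_cols] = df[age_cols].replace('Older than 100', np.nan)
# Conversion done one column at a time, as the notes require;
# errors='coerce' turns any other stray strings into NaN instead of raising
for col in age_cols:
    df[col] = pd.to_numeric(df[col], errors='coerce')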
Related
I have a dataframe in dynamic format for each ID
df:
ID |Start Date|End date |claim_no|claim_type|Admission_date|Discharge_date|Claim_amt|Approved_amt
10 |01-Apr-20 |31-Mar-21| 1123 |CSHLESS | 23-Aug-2020 | 25-Aug-2020 | 25406 | 19351
10 |01-Apr-20 |31-Mar-21| 1212 |POSTHOSP | 30-Aug-2020 | 01-Sep-2020 | 4209 | 3964
10 |01-Apr-20 |31-Mar-21| 1680 |CSHLESS | 18-Mar-2021 | 23-Mar-2021 | 18002 | 0
11 |12-Dec-20 |11-Dec-21| 1503 |CSHLESS | 12-Jan-2021 | 15-Jan-2021 | 76137 | 50286
11 |12-Dec-20 |11-Dec-21| 1505 |CSHLESS | 05-Jan-2021 | 07-Jan-2021 | 30000 | 0
Based on the ID column, I am trying to convert all the dynamic variables into a static format so that I can have a single row for each ID.
Columns such as ID, Start Date, End date are static in nature, and the rest of the columns are dynamic in nature for each ID.
In order to achieve the below output:
ID |Start Date|End date |claim_no_1|claim_type_1|Admission_date_1|Discharge_date_1|Claim_amt_1|Approved_amt_1|claim_no_2|claim_type_2|Admission_date_2|Discharge_date_2|Claim_amt_2|Approved_amt_2|claim_no_3|claim_type_3|Admission_date_3|Discharge_date_3|Claim_amt_3|Approved_amt_3
10 |01-Apr-20 |31-Mar-21| 1123 |CSHLESS | 23-Aug-2020 | 25-Aug-2020 | 25406 | 19351 | 1212 |POSTHOSP | 30-Aug-2020 | 01-Sep-2020 | 4209 | 3964 | 1680 |CSHLESS | 18-Mar-2021 | 23-Mar-2021 | 18002 | 0
I am using the below code:
# Index columns
idx = ['ID', 'Start Date', 'End date']
# Sequential counter to identify unique rows per index columns
cols = df.groupby(idx).cumcount() + 1
# Reshape using stack and unstack
df_out = df.set_index([*idx, cols]).stack().unstack([-2, -1])
# Flatten the multiindex columns
df_out.columns = df_out.columns.map('{0[1]}_{0[0]}'.format)
but it throws a ValueError: Unstacked DataFrame is too big, causing int32 overflow
Try this:
# Index columns (very similar to your code)
idx = ['ID', 'Start Date', 'End date']
# Sequential counter to identify unique rows per index columns
df['nrow'] = df.groupby(idx)['claim_no'].transform('rank')
df['nrow'] = df['nrow'].astype(int).astype(str)
Instead of stack and unstack, use melt and pivot. With these functions you have better control over the columns:
df1 = pd.melt(df, id_vars=['nrow', *idx],
              value_vars=['claim_no', 'claim_type', 'Admission_date',
                          'Discharge_date', 'Claim_amt', 'Approved_amt'],
              value_name='var')
df2 = df1.pivot(index=[*idx],
                columns=['variable', 'nrow'], values='var')
df2.columns = ['_'.join(col).rstrip('_') for col in df2.columns.values]
print(df2)
claim_no_1 claim_no_2 claim_no_3 claim_type_1 claim_type_2 claim_type_3 Admission_date_1 Admission_date_2 Admission_date_3 Discharge_date_1 Discharge_date_2 Discharge_date_3 Claim_amt_1 Claim_amt_2 Claim_amt_3 Approved_amt_1 Approved_amt_2 Approved_amt_3
ID Start Date End date
10 01-Apr-20 31-Mar-21 1123 1212 1680 CSHLESS POSTHOSP CSHLESS 23-Aug-2020 30-Aug-2020 18-Mar-2021 25-Aug-2020 01-Sep-2020 23-Mar-2021 25406 4209 18002 19351 3964 0
11 12-Dec-20 11-Dec-21 1503 1505 NaN CSHLESS CSHLESS NaN 12-Jan-2021 05-Jan-2021 NaN 15-Jan-2021 07-Jan-2021 NaN 76137 30000 NaN 50286 0 NaN
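If you also want the claim-wise column order from the desired output (claim_no_1, claim_type_1, ..., claim_no_2, ...) and the index levels back as regular columns, a small follow-up sketch (my addition, assuming df and df2 from above) would be:
value_cols = ['claim_no', 'claim_type', 'Admission_date',
              'Discharge_date', 'Claim_amt', 'Approved_amt']
# Regroup the flattened columns claim by claim, then flatten the index
ordered = [f'{v}_{n}' for n in sorted(df['nrow'].unique()) for v in value_cols]
df2 = df2[ordered].reset_index()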
I have a dataframe with NaN and non-NaN values, and I want to compare two columns in the same dataframe and check, row by row, whether each value is null or not null. For example:
If column a_1 has a null value and column a_2 has a not-null value, then for that particular
row, the result should be 1 in the new column a_12.
If the values in both a_1 (e.g. 123) and a_2 (e.g. 345) are not null, and the values are
not equal, then the result should be 3 in column a_12.
Below is the code snippet I have used for the comparison; for scenario 1, I am getting the result 3 instead of 1. Please guide me to get the correct output.
try:
    if (x[cols[0]]==x[cols[1]]) & (~np.isnan(x[cols[0]])) & (~np.isnan(x[cols[1]])):
        return 0
    elif (np.isnan(x[cols[0]])) & (np.isnan(x[cols[1]])):
        return 0
    elif (~np.isnan(x[cols[0]])) & (np.isnan(x[cols[1]])):
        return 1
    elif (np.isnan(x[cols[0]])) & (~np.isnan(x[cols[1]])):
        return 2
    elif (x[cols[0]]!=x[cols[1]]) & (~np.isnan(x[cols[0]])) & (~np.isnan(x[cols[1]])):
        return 3
    else:
        pass
except Exception as exc:
    if (x[cols[0]]==x[cols[1]]) & (pd.notna(x[cols[0]])) & (pd.notna(x[cols[1]])):
        return 0
    elif (pd.isna(x[cols[0]])) & (pd.isna(x[cols[1]])):
        return 0
    elif (pd.notna(x[cols[0]])) & (pd.isna(x[cols[1]])):
        return 1
    elif (pd.isna(x[cols[0]])) & (pd.notna(x[cols[1]])):
        return 2
    elif (x[cols[0]]!=x[cols[1]]) & (pd.notna(x[cols[0]])) & (pd.notna(x[cols[1]])):
        return 3
    else:
        pass
I have used pd.isna() and pd.notna(), and also np.isnan() and ~np.isnan(), because for some columns the second method (np.isnan()) works, while for other columns it just throws an error.
Please guide me to achieve the expected result.
Expected Output:
| a_1 | a_2 | result |
|-----------|---------|--------|
| gssfwe | gssfwe | 0 |
| <NA> | <NA> | 0 |
| fsfsfw | <NA> | 1 |
| <NA> | qweweqw | 2 |
| adsadgsgd | wwuwquq | 3 |
Output Got with the above code:
| a_1 | a_2 | result |
|-----------|---------|--------|
| gssfwe | gssfwe | 0 |
| <NA> | <NA> | 0 |
| fsfsfw | <NA> | 3 |
| <NA> | qweweqw | 3 |
| adsadgsgd | wwuwquq | 3 |
Going by the logic in your code, you'd want to define a function and apply it across your DataFrame.
df = pd.DataFrame({'a_1': [1, 2, np.nan, np.nan, 1], 'a_2': [2, np.nan, 1, np.nan, 1]})
The categories you want map neatly to binary numbers, which you can use to write a short function like -
def nan_check(row):
    x, y = row
    if x != y:
        return int(f'{int(pd.notna(y))}{int(pd.notna(x))}', base=2)
    return 0
df['flag'] = df.apply(nan_check, axis=1)
Output
a_1 a_2 flag
0 1.0 2.0 3
1 2.0 NaN 1
2 NaN 1.0 2
3 NaN NaN 0
4 1.0 1.0 0
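To unpack the trick: the f-string builds the two-bit number 2 * notna(a_2) + notna(a_1), so "only a_1 present" gives 1, "only a_2 present" gives 2, and "both present but different" gives 3. Equal values fall through to the final return 0, and the both-NaN case (NaN != NaN is True) encodes to 00, which is also 0.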
You can try np.select, but I think you need to rethink the conditions and the expected output.
Condition 1: if column a_1 has a null value and column a_2 has a not-null value, then for that particular row, the result should be 1 in the new column a_12.
Condition 2: if the values in both a_1 & a_2 are not null, and the values are not equal, then the result should be 3 in column a_12.
df['a_12'] = np.select(
    [df['a_1'].isna() & df['a_2'].notna(),
     df['a_1'].notna() & df['a_2'].notna() & df['a_1'].ne(df['a_2'])],
    [1, 3],
    default=0
)
print(df)
a_1 a_2 result a_12
0 gssfwe gssfwe 0 0
1 NaN NaN 0 0
2 fsfsfw NaN 1 0 # Shouldn't be Condition 1 since a_1 is not NaN
3 NaN qweweqw 2 1 # Condition 1
4 adsadgsgd wwuwquq 3 3
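If you want all four cases from the expected-output table in a single np.select (following that table rather than the two written conditions), a sketch would be:
import numpy as np

conditions = [
    df['a_1'].isna() & df['a_2'].isna(),                              # both null        -> 0
    df['a_1'].notna() & df['a_2'].isna(),                             # only a_2 null    -> 1
    df['a_1'].isna() & df['a_2'].notna(),                             # only a_1 null    -> 2
    df['a_1'].notna() & df['a_2'].notna() & df['a_1'].ne(df['a_2']),  # both set, differ -> 3
]
df['a_12'] = np.select(conditions, [0, 1, 2, 3], default=0)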
I need to create new columns, Billing and Non-Billing, based on the Billable column. If Billable is 'Yes' then I should fill the Billing column, and if it is 'No' then I need to fill the Non-Billing column, and I need to calculate it. The calculation should be along the row axis.
Calculation for Billing in row:
Billing = df[Billing] * sum/168 * 100
Calculation for Non-Billing in row:
Non-Billing = df[Non-Billing] * sum/ 168 * 100
Data
| Employee Name | Java | Python | .Net | React | Billable |
|---------------|------|--------|------|-------|----------|
| Priya         | 10   |        | 5    |       | Yes      |
| Krithi        |      | 10     | 20   |       | No       |
| Surthi        |      | 5      |      |       | yes      |
| Meena         |      | 20     |      | 10    | No       |
| Manju         | 20   | 10     | 10   |       | Yes      |
Output
I have tried using an insert statement, but I cannot keep on inserting it. I tried append also, but it's not working.
Bill_amt = []
Non_Bill_amt = []
for i in df['Billable']:
    if i == "Yes" or i == None:
        Bill_amt = (df[Bill_amt].sum(axis=1)/168 * 100).round(2)
        df.insert(len(df.columns), column='Billable Amount', value=Bill_amt)  # inserting the column and its name
        # CANNOT INSERT ROW AFTER IT AND CANNOT APPEND IT TOO
    else:
        Non_Bill_amt = (df[Non_Bill_amt].sum(axis=1) / 168 * 100).round(2)
        df.insert(len(df.columns), column='Non Billable Amount', value=Non_Bill_amt)  # inserting the column and its name
        # CANNOT INSERT ROW AFTER IT.
Use .sum(axis=1) and then np.where() to put the values in respective columns. For example:
x = df.loc[:, "Java":"React"].sum(axis=1) / 168 * 100
df["Bill"] = np.where(df["Billable"].str.lower() == "yes", x, "")
df["Non_Bill"] = np.where(df["Billable"].str.lower() == "no", x, "")
print(df)
Prints:
Employee_Name Java Python .Net React Billable Bill Non_Bill
0 Priya 10.0 NaN 5.0 NaN Yes 8.928571428571429
1 Krithi NaN 10.0 20.0 NaN No 17.857142857142858
2 Surthi NaN 5.0 NaN NaN yes 2.976190476190476
3 Meena NaN 20.0 NaN 10.0 No 17.857142857142858
4 Manju 20.0 10.0 10.0 NaN Yes 23.809523809523807
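One design note: filling the other column with an empty string makes both new columns object dtype. If you need them numeric for further calculations, a variant of the same idea (my assumption, not part of the answer above) that uses NaN instead would be:
import numpy as np

x = df.loc[:, "Java":"React"].sum(axis=1) / 168 * 100  # same row sum as above
df["Bill"] = np.where(df["Billable"].str.lower() == "yes", x, np.nan)      # NaN keeps the column float
df["Non_Bill"] = np.where(df["Billable"].str.lower() == "no", x, np.nan)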
I made a function that takes one single argument, but now I want to apply it to a whole dataframe where the index is the argument. My first impulse is to do a for loop but I know those are frowned upon in Pandas.
I have some data from the World Bank API, in a dataframe "df" for many countries and years:
+-------------------------------------------------------------+
| ODA gdp_per_cap sant mort |
| country date |
| Afghanistan 2010 6235319824.219 11.264 34.177 87.600 |
| 2009 6113120117.188 18.515 32.910 91.400 |
| 2008 4811209960.938 1.594 31.655 95.400 |
| 2007 4982609863.281 11.023 30.412 99.500 |
| 2006 2895830078.125 2.253 29.181 103.700 |
+-------------------------------------------------------------+
The country and date are indexes. I need to create a new dataframe and populate it with calculations. The new dataframe has the same country index but no date.
I wrote this function to calculate some fields:
def fill_df(country):
    total_oda = bil(df.loc[country, 'ODA'].sum()/10)
    gdp = df.loc[country, 'gdp_per_cap'].mean()
    sanitation = percent_change(country, 'sant')
    mortality = percent_change(country, 'mort')
    metric = (-pow(total_oda, .5) + gdp/4 + sanitation*.5 - mortality*0.5 + 80)*.2
    final_df.loc[country] = [total_oda, gdp, sanitation, mortality, metric]
fill_df('Afghanistan')
fill_df('Burundi')
Ok so the function works for one country at a time. This is the new final_df:
+-------------------------------------------------------------+
| ODA gdp sant mort metric |
| country |
| Afghanistan 3.329 6.606 45.293 -29.695 23.464 |
| Burundi 0.392 0.236 1.115 -39.439 19.942 |
| Burkina Faso NaN NaN NaN NaN NaN |
| Central African Republic NaN NaN NaN NaN NaN |
| Congo, Dem. Rep. NaN NaN NaN NaN NaN |
| Eritrea NaN NaN NaN NaN NaN |
+-------------------------------------------------------------+
Now I want to apply it to all of final_df. Below is the idea, but it does not work because my function takes one argument instead of an index of values.
country_idx = df.index.get_level_values(0).unique()
final_df.apply(fill_df, axis=0, args=country_idx)
How to apply the function to final_df?
You can use groupby with named aggregations (agg) on level=0 (the country level of the index); the metric can then be computed column-wise afterwards:
final_df = df.groupby(level=0).agg(
    total_oda=('ODA', lambda x: x.sum()/10),
    gdp=('gdp_per_cap', 'mean'),
    sanitation=('sant', percent_change),
    mortality=('mort', percent_change),
)
final_df['metric'] = (-final_df['total_oda'].pow(.5) + final_df['gdp']/4
                      + final_df['sanitation']*.5 - final_df['mortality']*.5 + 80)*.2
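One caveat (my reading, not stated in the answer): inside agg, percent_change receives the 'sant' or 'mort' Series for a single country, whereas the question's percent_change takes a country name and a column name, so it would need a small adaptation to a Series-based signature before this runs as written.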
I have a pandas data frame with several thousand observations and I would like to create "leakage-free" variables in Python. So I am looking for a way to calculate e.g. a group-specific mean of a variable without the single observation in row i.
For example:
| Group | Price | leakage-free Group Mean |
-------------------------------------------
| 1 | 20 | 25 |
| 1 | 40 | 15 |
| 1 | 10 | 30 |
| 2 | ... | ... |
I would like to do that for several variables, and I would like to compute the mean, median and variance in this way, so a computationally fast method would be good. If a group has only one row, I would like to enter 0 in the leakage-free variable.
As I am rather a beginner in Python, some piece of code might be very helpful. Thank You!!
With a one-liner:
df = pd.DataFrame({'Group': [1,1,1,2], 'Price':[20,40,10,30]})
df['lfgm'] = df.groupby('Group').transform(lambda x: (x.sum()-x)/(len(x)-1)).fillna(0)
print(df)
Output:
Group Price lfgm
0 1 20 25.0
1 1 40 15.0
2 1 10 30.0
3 2 30 0.0
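The lambda is just the leave-one-out mean written arithmetically: for each row, (group sum - own value) / (group size - 1); the fillna(0) covers one-row groups, where the division gives NaN.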
Update:
For median and variance (not one-liners unfortunately):
df = pd.DataFrame({'Group': [1,1,1,1,2], 'Price':[20,100,10,70,30]})
def f(x):
    for i in x.index:
        z = x.loc[x.index!=i, 'Price']
        x.at[i, 'mean'] = z.mean()
        x.at[i, 'median'] = z.median()
        x.at[i, 'var'] = z.var()
    return x[['mean', 'median', 'var']]
df = df.join(df.groupby('Group').apply(f))
print(df)
Output:
Group Price mean median var
0 1 20 60.000000 70.0 2100.000000
1 1 100 33.333333 20.0 1033.333333
2 1 10 63.333333 70.0 1633.333333
3 1 70 43.333333 20.0 2433.333333
4 2 30 NaN NaN NaN
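If the apply above is too slow, the leave-one-out variance can also be vectorised with transform (a sketch of mine, not part of the answer; the median has no comparable closed form, so it still needs apply):
grp = df.groupby('Group')['Price']
n = grp.transform('count')
s = grp.transform('sum')
q = df['Price'].pow(2).groupby(df['Group']).transform('sum')  # group sum of squares

loo_mean = (s - df['Price']) / (n - 1)
# sample variance (ddof=1) of the group with the current row left out;
# groups with fewer than 3 rows come out as NaN
df['loo_var'] = (q - df['Price'].pow(2) - (n - 1) * loo_mean.pow(2)) / (n - 2)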
Use:
grp = df.groupby('Group')
n = grp['Price'].transform('count')
mean = grp['Price'].transform('mean')
df['new_col'] = (mean*n - df['Price'])/(n-1)
print(df)
Group Price new_col
0 1 20 25.0
1 1 40 15.0
2 1 10 30.0
Note: this solution will be faster than using apply; you can check by running %%timeit on each approach.
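Since the question asks for 0 when a group has only one row (where n - 1 is 0 and the division gives NaN), a small addition to the code above (my sketch) would be:
df['new_col'] = ((mean*n - df['Price'])/(n-1)).fillna(0)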