How to solve the ValueError: Unstacked DataFrame is too big, causing int32 overflow in python? - python-3.x

I have a dataframe in a dynamic format, with multiple claim rows for each ID.
df:
ID |Start Date|End date |claim_no|claim_type|Admission_date|Discharge_date|Claim_amt|Approved_amt
10 |01-Apr-20 |31-Mar-21| 1123 |CSHLESS | 23-Aug-2020 | 25-Aug-2020 | 25406 | 19351
10 |01-Apr-20 |31-Mar-21| 1212 |POSTHOSP | 30-Aug-2020 | 01-Sep-2020 | 4209 | 3964
10 |01-Apr-20 |31-Mar-21| 1680 |CSHLESS | 18-Mar-2021 | 23-Mar-2021 | 18002 | 0
11 |12-Dec-20 |11-Dec-21| 1503 |CSHLESS | 12-Jan-2021 | 15-Jan-2021 | 76137 | 50286
11 |12-Dec-20 |11-Dec-21| 1505 |CSHLESS | 05-Jan-2021 | 07-Jan-2021 | 30000 | 0
Based on the ID column I am trying to convert all the dynamic variables into a static format so that I have a single row for each ID.
Columns such as ID, Start Date and End date are static in nature; the rest of the columns are dynamic for each ID.
In order to achieve the output below:
ID |Start Date|End date |claim_no_1|claim_type_1|Admission_date_1|Discharge_date_1|Claim_amt_1|Approved_amt_1|claim_no_2|claim_type_2|Admission_date_2|Discharge_date_2|Claim_amt_2|Approved_amt_2|claim_no_3|claim_type_3|Admission_date_3|Discharge_date_3|Claim_amt_3|Approved_amt_3
10 |01-Apr-20 |31-Mar-21| 1123 |CSHLESS | 23-Aug-2020 | 25-Aug-2020 | 25406 | 19351 | 1212 |POSTHOSP | 30-Aug-2020 | 01-Sep-2020 | 4209 | 3964 | 1680 |CSHLESS | 18-Mar-2021 | 23-Mar-2021 | 18002 | 0
I am using the code below:
# Index columns
idx = ['ID', 'Start Date', 'End date']
# Sequential counter to identify unique rows per index columns
cols = df.groupby(idx).cumcount() + 1
# Reshape using stack and unstack
df_out = df.set_index([*idx, cols]).stack().unstack([-2, -1])
# Flatten the multiindex columns
df_out.columns = df_out.columns.map('{0[1]}_{0[0]}'.format)
but it throws a ValueError: Unstacked DataFrame is too big, causing int32 overflow

Try this:
# Index columns (very similar to your code)
idx = ['ID', 'Start Date', 'End date']
# Sequential counter to identify unique rows per index columns
df['nrow'] = df.groupby(idx)['claim_no'].transform('rank')
df['nrow'] = df['nrow'].astype(int).astype(str)
Instead of stack and unstack, use melt and pivot. These functions give you better control over the columns:
df1 = pd.melt(df, id_vars=['nrow', *idx],
              value_vars=['claim_no', 'claim_type', 'Admission_date',
                          'Discharge_date', 'Claim_amt', 'Approved_amt'],
              value_name='var')
df2 = df1.pivot(index=[*idx],
                columns=['variable', 'nrow'],
                values='var')
df2.columns = ['_'.join(col).rstrip('_') for col in df2.columns.values]
print(df2)
claim_no_1 claim_no_2 claim_no_3 claim_type_1 claim_type_2 claim_type_3 Admission_date_1 Admission_date_2 Admission_date_3 Discharge_date_1 Discharge_date_2 Discharge_date_3 Claim_amt_1 Claim_amt_2 Claim_amt_3 Approved_amt_1 Approved_amt_2 Approved_amt_3
ID Start Date End date
10 01-Apr-20 31-Mar-21 1123 1212 1680 CSHLESS POSTHOSP CSHLESS 23-Aug-2020 30-Aug-2020 18-Mar-2021 25-Aug-2020 01-Sep-2020 23-Mar-2021 25406 4209 18002 19351 3964 0
11 12-Dec-20 11-Dec-21 1503 1505 NaN CSHLESS CSHLESS NaN 12-Jan-2021 05-Jan-2021 NaN 15-Jan-2021 07-Jan-2021 NaN 76137 30000 NaN 50286 0 NaN
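If you also want the columns grouped per claim (claim_no_1, claim_type_1, ..., then claim_no_2, ...) as in your desired output, you can sort the flattened column names by their numeric suffix. A minimal sketch, assuming the variable_number names produced above:
order = ['claim_no', 'claim_type', 'Admission_date',
         'Discharge_date', 'Claim_amt', 'Approved_amt']
df2 = df2[sorted(df2.columns,
                 key=lambda c: (int(c.rsplit('_', 1)[1]),           # claim number suffix first
                                order.index(c.rsplit('_', 1)[0])))] # then field order within a claim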

Related

compare columns with NaN or <NA> values pandas

I have a dataframe with NaN and regular values, and I want to compare two columns in the same dataframe to check whether each row's values are null or not null. For example,
if column a_1 has a null value and column a_2 has a non-null value, then for that particular
row the result should be 1 in the new column a_12.
If the values in both a_1 (value is 123) and a_2 (value is 345) are not null, and the values are
not equal, then the result should be 3 in column a_12.
Below is the code snippet I have used for the comparison. For scenario 1, I am getting the result as 3 instead of 1. Please guide me to get the correct output.
try:
    if (x[cols[0]] == x[cols[1]]) & (~np.isnan(x[cols[0]])) & (~np.isnan(x[cols[1]])):
        return 0
    elif (np.isnan(x[cols[0]])) & (np.isnan(x[cols[1]])):
        return 0
    elif (~np.isnan(x[cols[0]])) & (np.isnan(x[cols[1]])):
        return 1
    elif (np.isnan(x[cols[0]])) & (~np.isnan(x[cols[1]])):
        return 2
    elif (x[cols[0]] != x[cols[1]]) & (~np.isnan(x[cols[0]])) & (~np.isnan(x[cols[1]])):
        return 3
    else:
        pass
except Exception as exc:
    if (x[cols[0]] == x[cols[1]]) & (pd.notna(x[cols[0]])) & (pd.notna(x[cols[1]])):
        return 0
    elif (pd.isna(x[cols[0]])) & (pd.isna(x[cols[1]])):
        return 0
    elif (pd.notna(x[cols[0]])) & (pd.isna(x[cols[1]])):
        return 1
    elif (pd.isna(x[cols[0]])) & (pd.notna(x[cols[1]])):
        return 2
    elif (x[cols[0]] != x[cols[1]]) & (pd.notna(x[cols[0]])) & (pd.notna(x[cols[1]])):
        return 3
    else:
        pass
I have used pd.isna() and pd.notna(), and also np.isnan() and ~np.isnan(), because for some columns the second method (np.isnan()) works, while for other columns it just throws an error.
Please guide me to achieve the expected result.
Expected Output:
| a_1 | a_2 | result |
|-----------|---------|--------|
| gssfwe | gssfwe | 0 |
| <NA> | <NA> | 0 |
| fsfsfw | <NA> | 1 |
| <NA> | qweweqw | 2 |
| adsadgsgd | wwuwquq | 3 |
Output Got with the above code:
| a_1 | a_2 | result |
|-----------|---------|--------|
| gssfwe | gssfwe | 0 |
| <NA> | <NA> | 0 |
| fsfsfw | <NA> | 3 |
| <NA> | qweweqw | 3 |
| adsadgsgd | wwuwquq | 3 |
Going by the logic in your code, you'd want to define a function and apply it across your DataFrame.
df = pd.DataFrame({'a_1': [1, 2, np.nan, np.nan, 1], 'a_2': [2, np.nan, 1, np.nan, 1]})
The categories you want map neatly to binary numbers, which you can use to write a short function like this:
def nan_check(row):
    x, y = row  # x = a_1, y = a_2
    if x != y:  # note: NaN != NaN evaluates to True, but then both bits below are 0
        # two bits (a_2 present, a_1 present) read as a base-2 number: 1, 2 or 3
        return int(f'{int(pd.notna(y))}{int(pd.notna(x))}', base=2)
    return 0

df['flag'] = df.apply(nan_check, axis=1)
Output
a_1 a_2 flag
0 1.0 2.0 3
1 2.0 NaN 1
2 NaN 1.0 2
3 NaN NaN 0
4 1.0 1.0 0
You can try np.select, but I think you need to rethink the condition and the expected output
Condition 1: if the column a_1 have null values, column a_2 have not null values, then for that particular row, the result should be 1 in the new column a_12.
Condition 2: If the values in both a_1 & a_2 is not null, and the values are not equal, then the result should be 3 in column a_12.
df['a_12'] = np.select(
    [df['a_1'].isna() & df['a_2'].notna(),
     df['a_1'].notna() & df['a_2'].notna() & df['a_1'].ne(df['a_2'])],
    [1, 3],
    default=0
)
print(df)
a_1 a_2 result a_12
0 gssfwe gssfwe 0 0
1 NaN NaN 0 0
2 fsfsfw NaN 1 0 # Shouldn't be Condition 1 since a_1 is not NaN
3 NaN qweweqw 2 1 # Condition 1
4 adsadgsgd wwuwquq 3 3
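If you do want all four categories from your original logic (0 when the values are equal or both are missing, 1 when only a_2 is missing, 2 when only a_1 is missing, 3 when both are present but different), np.select can express them directly. A minimal sketch, assuming the same DataFrame:
df['a_12'] = np.select(
    [df['a_1'].notna() & df['a_2'].isna(),                              # only a_2 missing -> 1
     df['a_1'].isna() & df['a_2'].notna(),                              # only a_1 missing -> 2
     df['a_1'].notna() & df['a_2'].notna() & df['a_1'].ne(df['a_2'])],  # both present, different -> 3
    [1, 2, 3],
    default=0                                                           # equal, or both missing -> 0
)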

Create new column and calculate values to the column in python row wise

I need to create a new column, Billing or Non-Billing, based on the Billable column. If Billable is 'Yes' then I should create a new column Billing, and if it is 'No' then I need to create a new column Non-Billing, and calculate it. The calculation should be along the row axis.
Calculation for Billing in row:
Billing = df[Billing] * sum/168 * 100
Calculation for Non-Billing in row:
Non-Billing = df[Non-Billing] * sum/ 168 * 100
Data
Employee Name | Java | Python| .Net | React | Billable|
----------------------------------------------------------------
|Priya | 10 | | 5 | | Yes |
|Krithi | | 10 | 20 | | No |
|Surthi | | 5 | | | yes |
|Meena | | 20 | | 10 | No |
|Manju | 20 | 10 | 10 | | Yes |
Output
I have tried using an insert statement but I cannot keep on inserting it. I tried append also but it's not working.
Bill_amt = []
Non_Bill_amt = []
for i in df['Billable']:
    if i == "Yes" or i is None:
        Bill_amt = (df[Bill_amt].sum(axis=1) / 168 * 100).round(2)
        df.insert(len(df.columns), column='Billable Amount', value=Bill_amt)  # inserting the column and its name
        # CANNOT INSERT ROW AFTER IT AND CANNOT APPEND IT TOO
    else:
        Non_Bill_amt = (df[Non_Bill_amt].sum(axis=1) / 168 * 100).round(2)
        df.insert(len(df.columns), column='Non Billable Amount', value=Non_Bill_amt)  # inserting the column and its name
        # CANNOT INSERT ROW AFTER IT.
Use .sum(axis=1) and then np.where() to put the values in respective columns. For example:
x = df.loc[:, "Java":"React"].sum(axis=1) / 168 * 100
df["Bill"] = np.where(df["Billable"].str.lower() == "yes", x, "")
df["Non_Bill"] = np.where(df["Billable"].str.lower() == "no", x, "")
print(df)
Prints:
Employee_Name Java Python .Net React Billable Bill Non_Bill
0 Priya 10.0 NaN 5.0 NaN Yes 8.928571428571429
1 Krithi NaN 10.0 20.0 NaN No 17.857142857142858
2 Surthi NaN 5.0 NaN NaN yes 2.976190476190476
3 Meena NaN 20.0 NaN 10.0 No 17.857142857142858
4 Manju 20.0 10.0 10.0 NaN Yes 23.809523809523807
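Note that writing an empty string into the non-matching rows makes Bill and Non_Bill object columns. If you want them to stay numeric (e.g. for further calculations), here is a minimal variant of the same idea, assuming np.nan is acceptable for the non-matching rows:
x = (df.loc[:, "Java":"React"].sum(axis=1) / 168 * 100).round(2)
df["Bill"] = np.where(df["Billable"].str.lower() == "yes", x, np.nan)      # NaN keeps the column numeric
df["Non_Bill"] = np.where(df["Billable"].str.lower() == "no", x, np.nan)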

How to create this year_month sales and previous year_month sales in two different columns?

I need to create two different columns, one for this year_month's sales and one for the previous year_month's sales, from transaction-level data.
Data format:-
Date | bill amount
2019-07-22 | 500
2019-07-25 | 200
2020-11-15 | 100
2020-11-06 | 900
2020-12-09 | 50
2020-12-21 | 600
Required format:-
Year_month |This month Sales | Prev month sales
2019_07 | 700 | -
2020_11 | 1000 | -
2020_12 | 650 | 1000
The relatively tricky bit is to figure out what the previous month is. We do it by computing the beginning of the month for each date and then rolling it back by 1 month. Note that this takes care of the January -> December-of-the-previous-year case.
We start by creating a sample dataframe and importing some useful modules
import pandas as pd
from io import StringIO
from datetime import datetime, timedelta
from dateutil.relativedelta import relativedelta
data = StringIO(
"""
date|amount
2019-07-22|500
2019-07-25|200
2020-11-15|100
2020-11-06|900
2020-12-09|50
2020-12-21|600
""")
df = pd.read_csv(data,sep='|')
df['date'] = pd.to_datetime(df['date'])
df
we get
date amount
0 2019-07-22 500
1 2019-07-25 200
2 2020-11-15 100
3 2020-11-06 900
4 2020-12-09 50
5 2020-12-21 600
Then we figure out the month start and the previous month start using datetime utilities
df['month_start'] = df['date'].apply(lambda d:datetime(year = d.year, month = d.month, day = 1))
df['prev_month_start'] = df['month_start'].apply(lambda d:d+relativedelta(months = -1))
Then we summarize monthly sales using groupby on month start
ms_df = df.drop(columns = 'date').groupby('month_start').agg({'prev_month_start':'first','amount':sum}).reset_index()
ms_df
so we get
month_start prev_month_start amount
0 2019-07-01 2019-06-01 700
1 2020-11-01 2020-10-01 1000
2 2020-12-01 2020-11-01 650
Then we join (merge) ms_df on itself by mapping 'prev_month_start' to 'month_start'
ms_df2 = ms_df.merge(ms_df, left_on='prev_month_start', right_on='month_start', how = 'left', suffixes = ('','_prev'))
We are more or less there, but now we make it pretty by getting rid of superfluous columns, adding labels, etc.
ms_df2['label'] = ms_df2['month_start'].dt.strftime('%Y_%m')
ms_df2 = ms_df2.drop(columns = ['month_start','prev_month_start','month_start_prev','prev_month_start_prev'])
columns = ['label','amount','amount_prev']
ms_df2 = ms_df2[columns]
and we get
| | label | amount | amount_prev |
|---:|--------:|---------:|--------------:|
| 0 | 2019_07 | 700 | nan |
| 1 | 2020_11 | 1000 | nan |
| 2 | 2020_12 | 650 | 1000 |
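A shorter variant of the same idea, sketched here with the df built above: group by the calendar month as a pandas Period, then look up each month's predecessor by shifting the PeriodIndex back one period.
monthly = df.groupby(df['date'].dt.to_period('M'))['amount'].sum()
prev = monthly.reindex(monthly.index - 1)                 # total of the preceding calendar month (NaN if absent)
out = pd.DataFrame({'label': monthly.index.strftime('%Y_%m'),
                    'amount': monthly.to_numpy(),
                    'amount_prev': prev.to_numpy()})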
Using @piterbarg's data, we can use resample combined with shift and concat to get your desired data:
import pandas as pd
from io import StringIO
data = StringIO(
"""
date|amount
2019-07-22|500
2019-07-25|200
2020-11-15|100
2020-11-06|900
2020-12-09|50
2020-12-21|600
"""
)
df = pd.read_csv(data, sep="|", parse_dates=["date"])
df
date amount
0 2019-07-22 500
1 2019-07-25 200
2 2020-11-15 100
3 2020-11-06 900
4 2020-12-09 50
5 2020-12-21 600
Get the sum for current sales:
data = df.resample(on="date", rule="1M").amount.sum().rename("This_month")
data
date
2019-07-31 700
2019-08-31 0
2019-09-30 0
2019-10-31 0
2019-11-30 0
2019-12-31 0
2020-01-31 0
2020-02-29 0
2020-03-31 0
2020-04-30 0
2020-05-31 0
2020-06-30 0
2020-07-31 0
2020-08-31 0
2020-09-30 0
2020-10-31 0
2020-11-30 1000
2020-12-31 650
Freq: M, Name: This_month, dtype: int64
Now we can shift the series by one row (one month) to get the values for the previous month, and drop rows that have 0 as the total sales to get your final output:
(pd.concat([data, data.shift().rename("previous_month")], axis=1)
   .query("This_month != 0")
   .fillna(0))
This_month previous_month
date
2019-07-31 700 0.0
2020-11-30 1000 0.0
2020-12-31 650 1000.0
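One caveat with the query("This_month != 0") step: it would also drop a month that genuinely had zero sales. If that matters, a hedged alternative is to keep only the months that actually occur in the raw data, for example by comparing year-month strings:
months_present = set(df["date"].dt.strftime("%Y-%m"))            # months that occur in the raw data
mask = data.index.strftime("%Y-%m").isin(months_present)         # boolean mask over the resampled index
pd.concat([data, data.shift().rename("previous_month")], axis=1).loc[mask].fillna(0)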

Python, converting int to str, trailing/leading decimal/zeros

I convert my dataframe values to str, but when I concatenate them together, the values that were originally ints end up with trailing decimals.
df["newcol"] = df['columna'].map(str) + '_' + df['columnb'].map(str) + '_' + df['columnc'].map(str)
This is giving me output like 500.0. How can I get rid of these leading zeros and trailing decimals? Sometimes my data in columna will have non-alphanumeric characters.
+---------+---------+---------+------------------+----------------------+
| columna | columnb | columnc | expected | currently getting |
+---------+---------+---------+------------------+----------------------+
| | -1 | 27 | _-1_27 | _-1.0_27.0 |
| | -1 | 42 | _-1_42 | _-1.0_42.0 |
| | -1 | 67 | _-1_67 | _-1.0_67.0 |
| | -1 | 95 | _-1_95 | _-1.0_95.0 |
| 91_CCMS | 14638 | 91 | 91_CCMS_14638_91 | 91_CCMS_14638.0_91.0 |
| DIP96 | 1502 | 96 | DIP96_1502_96 | DIP96_1502.0_96.0 |
| 106 | 11694 | 106 | 106_11694_106 | 00106_11694.0_106.0 |
+---------+---------+---------+------------------+----------------------+
Error:
invalid literal for int() with base 10: ''
Edit:
If your df has more than 3 columns and you want to join only 3 of them, you can specify those columns using column slicing. Assume your df has 5 columns named AA, BB, CC, DD, EE and you want to join only columns CC, DD, EE. You just need to specify those 3 columns before the fillna and assign the result to newcol as you want:
df["newcol"] = df[['CC', 'DD', 'EE']].fillna('') \
.applymap(lambda x: x if isinstance(x, str) else str(int(x))).agg('_'.join, axis=1)
Note: I just break the command into 2 lines using '\' for easier reading.
Original:
I guess your real data in columna, columnb, columnc contains str, float, int, empty strings, blank spaces, and maybe even NaN.
A float whose fractional part is .00, stored in a column of dtype object, may display without the decimal.
Assume your df has only 3 columns, columna, columnb, columnc, as you said. The command below handles str, float, int and NaN, and joins the 3 columns into one as you want:
df.fillna('').applymap(lambda x: x if isinstance(x, str) else str(int(x))).agg('_'.join, axis=1)
I created a sample similar to yours:
columna columnb columnc
0 -1 27
1 NaN -1 42
2 -1 67
3 -1 95
4 91_CCMS 14638 91
5 DIP96 96
6 106 11694 106
Using your command returns the concatenated string having '.0' as you described
df['columna'].map(str) + '_' + df['columnb'].map(str) + '_' + df['columnc'].map(str)
Out[1926]:
0 _-1.0_27.0
1 nan_-1.0_42.0
2 _-1.0_67.0
3 _-1.0_95.0
4 91_CCMS_14638_91
5 DIP96__96
6 106_11694_106
dtype: object
Using my command:
df.fillna('').applymap(lambda x: x if isinstance(x, str) else str(int(x))).agg('_'.join, axis=1)
Out[1927]:
0 _-1_27
1 _-1_42
2 _-1_67
3 _-1_95
4 91_CCMS_14638_91
5 DIP96__96
6 106_11694_106
dtype: object
I couldn't reproduce this error but maybe you could try something like:
df["newcol"] = df['columna'].map(lambda x: str(int(x)) if isinstance(x, int) else str(x)) + '_' + df['columnb'].map(lambda x: str(int(x))) + '_' + df['columnc'].map(lambda x: str(int(x)))

Replace specific value in pandas dataframe column, else convert column to numeric

Given the following pandas dataframe
+----+------------------+-------------------------------------+--------------------------------+
| | AgeAt_X | AgeAt_Y | AgeAt_Z |
|----+------------------+-------------------------------------+--------------------------------+
| 0 | Older than 100 | Older than 100 | 74.13 |
| 1 | nan | nan | 58.46 |
| 2 | nan | 8.4 | 54.15 |
| 3 | nan | nan | 57.04 |
| 4 | nan | 57.04 | nan |
+----+------------------+-------------------------------------+--------------------------------+
How can I replace values in specific columns which equal Older than 100 with nan?
+----+------------------+-------------------------------------+--------------------------------+
| | AgeAt_X | AgeAt_Y | AgeAt_Z |
|----+------------------+-------------------------------------+--------------------------------+
| 0 | nan | nan | 74.13 |
| 1 | nan | nan | 58.46 |
| 2 | nan | 8.4 | 54.15 |
| 3 | nan | nan | 57.04 |
| 4 | nan | 57.04 | nan |
+----+------------------+-------------------------------------+--------------------------------+
Notes
After removing the Older than 100 string from the desired columns, I convert the columns to numeric in order to perform calculations on said columns.
There are other columns in this dataframe (that I have excluded from this example), which will not be converted to numeric, so the conversion to numeric must be done one column at a time.
What I've tried
Attempt 1
if df.isin('Older than 100'):
    df.loc[df['AgeAt_X']] = ''
else:
    df['AgeAt_X'] = pd.to_numeric(df["AgeAt_X"])
Attempt 2
if df.loc[df['AgeAt_X']] == 'Older than 100r':
    df.loc[df['AgeAt_X']] = ''
elif df.loc[df['AgeAt_X']] == '':
    df['AgeAt_X'] = pd.to_numeric(df["AgeAt_X"])
Attempt 3
df['AgeAt_X'] = ['' if ele == 'Older than 100' else df.loc[df['AgeAt_X']] for ele in df['AgeAt_X']]
Attempts 1, 2 and 3 return the following error:
KeyError: 'None of [0 NaN\n1 NaN\n2 NaN\n3 NaN\n4 NaN\n5 NaN\n6 NaN\n7 NaN\n8 NaN\n9 NaN\n10 NaN\n11 NaN\n12 NaN\n13 NaN\n14 NaN\n15 NaN\n16 NaN\n17 NaN\n18 NaN\n19 NaN\n20 NaN\n21 NaN\n22 NaN\n23 NaN\n24 NaN\n25 NaN\n26 NaN\n27 NaN\n28 NaN\n29 NaN\n ..\n6332 NaN\n6333 NaN\n6334 NaN\n6335 NaN\n6336 NaN\n6337 NaN\n6338 NaN\n6339 NaN\n6340 NaN\n6341 NaN\n6342 NaN\n6343 NaN\n6344 NaN\n6345 NaN\n6346 NaN\n6347 NaN\n6348 NaN\n6349 NaN\n6350 NaN\n6351 NaN\n6352 NaN\n6353 NaN\n6354 NaN\n6355 NaN\n6356 NaN\n6357 NaN\n6358 NaN\n6359 NaN\n6360 NaN\n6361 NaN\nName: AgeAt_X, Length: 6362, dtype: float64] are in the [index]'
Attempt 4
df['AgeAt_X'] = df['AgeAt_X'].replace({'Older than 100': ''})
Attempt 4 returns the following error:
TypeError: Cannot compare types 'ndarray(dtype=float64)' and 'str'
I've also looked at a few posts. The two below do not actually replace the value but create a new column derived from others
Replace specific values in Pandas DataFrame
Pandas replace DataFrame values
We can loop through each column and check if the sentence is present. If we get a hit, we replace the sentence with the string 'NaN' using Series.str.replace and right after convert the column to numeric with Series.astype, in this case float:
df.dtypes
AgeAt_X object
AgeAt_Y object
AgeAt_Z float64
dtype: object
sent = 'Older than 100'
for col in df.columns:
    if sent in df[col].values:
        df[col] = df[col].str.replace(sent, 'NaN')
        df[col] = df[col].astype(float)
print(df)
AgeAt_X AgeAt_Y AgeAt_Z
0 NaN NaN 74.13
1 NaN NaN 58.46
2 NaN 8.40 54.15
3 NaN NaN 57.04
4 NaN 57.04 NaN
df.dtypes
AgeAt_X float64
AgeAt_Y float64
AgeAt_Z float64
dtype: object
If I understand you correctly, you can replace all occurrences of Older than 100 with np.nan with a single call to DataFrame.replace. If all remaining values are numeric, then the replace will implicitly change the data type of the column to numeric:
# Minimal example DataFrame
df = pd.DataFrame({'AgeAt_X': ['Older than 100', np.nan, np.nan],
                   'AgeAt_Y': ['Older than 100', np.nan, 8.4],
                   'AgeAt_Z': [74.13, 58.46, 54.15]})
df
AgeAt_X AgeAt_Y AgeAt_Z
0 Older than 100 Older than 100 74.13
1 NaN NaN 58.46
2 NaN 8.4 54.15
df.dtypes
AgeAt_X object
AgeAt_Y object
AgeAt_Z float64
dtype: object
# Replace occurrences of 'Older than 100' with np.nan in any column
df.replace('Older than 100', np.nan, inplace=True)
df
AgeAt_X AgeAt_Y AgeAt_Z
0 NaN NaN 74.13
1 NaN NaN 58.46
2 NaN 8.4 54.15
df.dtypes
AgeAt_X float64
AgeAt_Y float64
AgeAt_Z float64
dtype: object
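If the AgeAt_ columns could contain other stray strings besides 'Older than 100', a hedged alternative is pd.to_numeric with errors='coerce', applied one column at a time as your notes require; anything that cannot be parsed as a number becomes NaN:
for col in ['AgeAt_X', 'AgeAt_Y', 'AgeAt_Z']:
    df[col] = pd.to_numeric(df[col], errors='coerce')   # e.g. 'Older than 100' becomes NaN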
