Combine text from multiple rows in pandas - python-3.x

I want to merge the content of certain rows into the previous row, but only where specific conditions are met.
Here is the test dataframe I am working on:
Date Desc Debit Credit Bal
0 04-08-2019 abcdef 45654 NaN 345.0
1 NaN jklmn NaN NaN 6
2 04-08-2019 pqr NaN 23 368.06
3 05-08-2019 abd 23 NaN 345.06
4 06-08-2019 xyz NaN 350.0 695.06
Here, I want to join each row whose Date is NaN to the previous row.
Output required:
Date Desc Debit Credit Bal
0 04-08-2019 abcdefjklmn 45654 NaN 345.06
1 NaN jklmn NaN NaN 6
2 04-08-2019 pqr NaN 23 368.06
3 05-08-2019 abd 23 NaN 345.06
4 06-08-2019 xyz NaN 350.0 695.06
Could anybody help me out with this? I have tried the following:
for j in [x for x in range(lst[0], lst[-1]+1) if x not in lst]:
    print(test.loc[j-1:j, ].apply(lambda x: ''.join(str(x)), axis=1))
But could not get the expected result.

You can use
d = df["Date"].fillna(method='ffill')
df.update(df.groupby(d).transform('sum'))
print(df)
Output:
Date Desc Debit Credit Bal
0 04-08-2019 abcdefjklmn 45654.0 0.0 351.0
1 NaN abcdefjklmn 45654.0 0.0 351.0
2 05-08-2019 abd 45.0 0.0 345.0
3 06-08-2019 xyz 0.0 345.0 54645.0
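Note that two different rows can share the same date (the pqr row above also falls on 04-08-2019), so grouping by the forward-filled date would merge those as well. If you only want each NaN-date row joined to the row directly above it, here is a minimal sketch, assuming the frame is named test as in the question:
# every non-NaN Date starts a new group, so a NaN-date row falls
# into the same group as the row directly above it
key = test["Date"].notna().cumsum()

# concatenate the Desc strings within each group
test["Desc"] = test.groupby(key)["Desc"].transform("".join)

# optionally drop the continuation rows afterwards
# test = test.dropna(subset=["Date"]).reset_index(drop=True)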

idx = test.loc[test["Date"].isna()].index
test.loc[idx-1, "Desc"] = test.loc[idx-1]["Desc"].str.cat(test.loc[idx]["Desc"])
test.loc[idx-1, "Bal"] = (test.loc[idx-1]["Bal"].astype(str)
.str.cat(test.loc[idx]["Bal"].astype(str)))
## Adding the two values numerically gives 345.0 + 6 = 351.0, not the "345.06" from the expected output
# test.loc[idx-1, "Bal"] = test.loc[idx-1]["Bal"].values + test.loc[idx]["Bal"].values
Date Desc Debit Credit Bal
0 04-08-2019 abcdefjklmn 45654.0 NaN 345.06.0
1 NaN jklmn NaN NaN 6
2 05-08-2019 abd 45.0 NaN 345
3 06-08-2019 xyz NaN 345.0 54645

Related

INDEX and MATCH with multiple criteria in Pandas Python

I am trying to do an INDEX/MATCH-style lookup between 2 data sets but am having trouble. Here is an example of what I am trying to do: I want to fill in the empty columns "a", "b", "c" in df with the data from df2, matching on "Machine", "Year", and "Order Type".
The first dataframe, let's call this one "df":
Machine Year Cost a b c
0 abc 2014 5500 nan nan nan
1 abc 2015 89 nan nan nan
2 abc 2016 600 nan nan nan
3 abc 2017 250 nan nan nan
4 abc 2018 2100 nan nan nan
5 abc 2019 590 nan nan nan
6 dcb 2014 3000 nan nan nan
7 dcb 2015 100 nan nan nan
The second data set is called "df2"
Order Type Machine Year Total Count
0 a abc 2014 1
1 b abc 2014 1
2 c abc 2014 2
3 a abc 2016 3
4 c dcb 2015 4
Final Output is:
Machine Year Cost a b c
0 abc 2014 5500 1 1 2
1 abc 2015 89 nan nan nan
2 abc 2016 600 3 nan nan
3 abc 2017 250 nan nan nan
4 abc 2018 2100 nan nan nan
5 abc 2019 590 nan nan nan
6 dcb 2014 3000 nan nan nan
7 dcb 2015 100 nan nan 4
Thanks for help in advance
Consider DataFrame.pivot to reshape df2, then merge the result with df.
final_df = (
    df.reindex(["Machine", "Year", "Cost"], axis=1)
    .merge(
        df2.pivot(
            index=["Machine", "Year"],
            columns="Order Type",
            values="Total Count"
        ).reset_index(),
        on=["Machine", "Year"],
        how="left"  # keep the df rows that have no match in df2
    )
)
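For reference, a minimal end-to-end sketch of the above (assuming the sample frames as shown, with the dcb rows in 2014/2015; list-valued pivot indexes need a reasonably recent pandas):
import pandas as pd

df = pd.DataFrame({
    "Machine": ["abc"] * 6 + ["dcb"] * 2,
    "Year": [2014, 2015, 2016, 2017, 2018, 2019, 2014, 2015],
    "Cost": [5500, 89, 600, 250, 2100, 590, 3000, 100],
})
df2 = pd.DataFrame({
    "Order Type": ["a", "b", "c", "a", "c"],
    "Machine": ["abc", "abc", "abc", "abc", "dcb"],
    "Year": [2014, 2014, 2014, 2016, 2015],
    "Total Count": [1, 1, 2, 3, 4],
})

# spread "Order Type" out into columns a/b/c, one row per Machine/Year pair
wide = df2.pivot(index=["Machine", "Year"],
                 columns="Order Type",
                 values="Total Count").reset_index()
# a left join keeps every row of df, filling unmatched rows with NaN
final_df = df.merge(wide, on=["Machine", "Year"], how="left")
print(final_df)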

Filter the rows based on missing values in a specific set of columns - pandas

I have data frame as shown below
ID Type Desc D_N D_A C_N C_A
1 Edu Education 3 100 NaN NaN
1 Bank In_Pay NaN NaN NaN 900
1 Eat Food 4 200 NaN NaN
1 Edu Education NaN NaN NaN NaN
1 Bank NaN NaN NaN 4 700
1 Eat Food NaN NaN NaN NaN
2 Edu Education NaN NaN 1 100
2 Bank NaN NaN NaN 8 NaN
2 NaN Food 4 NaN NaN NaN
3 Edu Education NaN NaN NaN NaN
3 Bank NaN 2 300 NaN NaN
3 Eat Food NaN 140 NaN NaN
From the above df, I would like to filter the rows where exactly one of the columns D_N, D_A, C_N and C_A is non-NaN.
Expected Output:
ID Type Desc D_N D_A C_N C_A
1 Bank In_Pay NaN NaN NaN 900
2 Bank NaN NaN NaN 8 NaN
2 NaN Food 4 NaN NaN NaN
3 Eat Food NaN 140 NaN NaN
I tried the code below but it does not work.
df[df.loc[:, ["D_N", "D_A", "C_N", "C_A"]].isna().sum(axis=1).eq(1)]
Use DataFrame.count, which counts values while excluding missing values:
df1 = df[df[["D_N", "D_A", "C_N", "C_A"]].count(axis=1).eq(1)]
print (df1)
ID Type Desc D_N D_A C_N C_A
1 1 Bank In_Pay NaN NaN NaN 900.0
7 2 Bank NaN NaN NaN 8.0 NaN
8 2 NaN Food 4.0 NaN NaN NaN
11 3 Eat Food NaN 140.0 NaN NaN
Your solution can be modified to test for non-missing values instead:
df1 = df[df[["D_N", "D_A", "C_N", "C_A"]].notna().sum(axis=1).eq(1)]

Remove the rows when the values of all 4 specific columns are NaN in pandas

I have a df as shown below
df:
ID Type Desc D_N D_A C_N C_A
1 Edu Education 3 100 NaN NaN
1 Bank In_Pay NaN NaN 8 900
1 Eat Food 4 200 NaN NaN
1 Edu Education NaN NaN NaN NaN
1 Bank NaN NaN NaN 4 700
1 Eat Food NaN NaN NaN NaN
2 Edu Education NaN NaN 1 100
2 Bank In_Pay NaN NaN 8 NaN
2 Eat Food 4 200 NaN NaN
3 Edu Education NaN NaN NaN NaN
3 Bank NaN 2 300 NaN NaN
3 Eat Food NaN NaN NaN NaN
About the df:
whenever D_N is non-NaN, D_A should also be non-NaN; at the same time C_N and C_A should be NaN, and vice versa.
From the above data, I would like to keep the rows where at least one of D_N, D_A, C_N and C_A is non-NaN, i.e. drop the rows where all four are NaN.
Expected Output:
ID Type Desc D_N D_A C_N C_A
1 Edu Education 3 100 NaN NaN
1 Bank In_Pay NaN NaN 8 900
1 Eat Food 4 200 NaN NaN
1 Bank NaN NaN NaN 4 700
2 Edu Education NaN NaN 1 100
2 Bank In_Pay NaN NaN 8 NaN
2 Eat Food 4 200 NaN NaN
3 Bank NaN 2 300 NaN NaN
print(df.dropna(subset=["D_N", "D_A", "C_N", "C_A"], how="all"))
Prints:
ID Type Desc D_N D_A C_N C_A
0 1 Edu Education 3.0 100.0 NaN NaN
1 1 Bank In_Pay NaN NaN 8.0 900.0
2 1 Eat Food 4.0 200.0 NaN NaN
4 1 Bank NaN NaN NaN 4.0 700.0
6 2 Edu Education NaN NaN 1.0 100.0
7 2 Bank In_Pay NaN NaN 8.0 NaN
8 2 Eat Food 4.0 200.0 NaN NaN
10 3 Bank NaN 2.0 300.0 NaN NaN
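An equivalent boolean-mask version, if you prefer a filter expression over dropna (a sketch assuming the same df):
cols = ["D_N", "D_A", "C_N", "C_A"]
# keep the rows where at least one of the four columns is non-NaN
out = df[df[cols].notna().any(axis=1)]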

Calculating rolling sum in a pandas dataframe on the basis of 2 variable constraints

I want to create a variable, SumOfPrevious5OccurencesAtIDLevel, which is the sum of the previous 5 values of Var1 (ordered by the Date variable) within each ID (column 1); where there are fewer than 5 previous values it should take a value of NA.
Sample Data and Output:
ID Date Var1 SumOfPrevious5OccurencesAtIDLevel
1 1/1/2018 0 NA
1 1/2/2018 1 NA
1 1/3/2018 2 NA
1 1/4/2018 3 NA
2 1/1/2018 4 NA
2 1/2/2018 5 NA
2 1/3/2018 6 NA
2 1/4/2018 7 NA
2 1/5/2018 8 NA
2 1/6/2018 9 30
2 1/7/2018 10 35
2 1/8/2018 11 40
Use groupby with transform and the rolling and shift functions:
df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y')
# sort by ID and datetime in case the data is not already ordered
df = df.sort_values(['ID','Date'])
df['new'] = df.groupby('ID')['Var1'].transform(lambda x: x.rolling(5).sum().shift())
print (df)
ID Date Var1 SumOfPrevious5OccurencesAtIDLevel new
0 1 2018-01-01 0 NaN NaN
1 1 2018-01-02 1 NaN NaN
2 1 2018-01-03 2 NaN NaN
3 1 2018-01-04 3 NaN NaN
4 2 2018-01-01 4 NaN NaN
5 2 2018-01-02 5 NaN NaN
6 2 2018-01-03 6 NaN NaN
7 2 2018-01-04 7 NaN NaN
8 2 2018-01-05 8 NaN NaN
9 2 2018-01-06 9 30.0 30.0
10 2 2018-01-07 10 35.0 35.0
11 2 2018-01-08 11 40.0 40.0
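Note that rolling(5).sum().shift() sums the current value together with the four before it and then shifts the result down one row; shifting first is an equivalent way to exclude the current row (a sketch, assuming the same df):
# shift() first, then take the 5-value rolling sum: identical result
df['new2'] = df.groupby('ID')['Var1'].transform(
    lambda x: x.shift().rolling(5).sum()
)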

Number of NaN values before first non NaN value Python dataframe

I have a dataframe with several columns, some of which contain NaN values. For each row I would like to create another column containing the total number of value columns minus the number of NaN values before the first non-NaN value.
Original dataframe:
ID Value0 Value1 Value2 Value3
1 10 10 8 15
2 NaN 45 52 NaN
3 NaN NaN NaN NaN
4 NaN NaN 100 150
The extra column would look like:
ID NewColumn
1 4
2 3
3 0
4 2
Thanks in advance!
Set the index to ID
Attach a non-null column to stop/catch the argmax
Use argmax to find the first non-null value
Subtract those values from the length of the relevant columns
df.assign(
    # 4 value columns minus the position of the first non-null entry;
    # the appended all-True "notnull" column makes argmax land on it
    # for rows that are entirely NaN, giving NewColumn 0
    NewColumn=df.shape[1] - 1 -
              df.set_index('ID').assign(notnull=1).notnull().values.argmax(1)
)
ID Value0 Value1 Value2 Value3 NewColumn
0 1 10.0 10.0 8.0 15.0 4
1 2 NaN 45.0 52.0 NaN 3
2 3 NaN NaN NaN NaN 0
3 4 NaN NaN 100.0 150.0 2
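A more explicit spelling of the same idea (a sketch, assuming the frame above):
import numpy as np

vals = df.set_index('ID').notna().to_numpy()
first = vals.argmax(axis=1)  # position of the first non-NaN in each row
# argmax also returns 0 for all-NaN rows, so handle those separately
df['NewColumn'] = np.where(vals.any(axis=1), vals.shape[1] - first, 0)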
