groupby Sum on columns of multiple dataframes in Pandas - python-3.x

I have 2 dataframes (df1 and df2), shown below. Based on the values of df1.col1 and df2.col1, I want the values in df1.col2 and df2.col2 to be added. This sum should then go against each row in df1 as a 3rd column. For example: wherever col1 is A01, the col2 values should be summed, so 1 + 7 + 5 = 13. The same applies to every other col1 value.
To clarify further, I have included a snapshot of how df1 should look at the end (End Result).
**df1**
| col1 | col2 |
|------|------|
| A01  | 1    |
| A02  | 0    |
| A03  | 0    |
| A01  | 7    |
| A02  | 1    |
**df2**
| col1 | col2 | col3 |
|------|------|------|
| A01  | 5    | x    |
| A02  | 0    | y    |
| A06  | 0    | asa  |
| A07  | 1    | asa  |
| A02  | 4    | st   |
End Result:
**df1**
| col1 | col2 | col3 |
|------|------|------|
| A01  | 1    | 13   |
| A02  | 0    | 5    |
| A03  | 0    | 0    |
| A01  | 7    | 13   |
| A02  | 1    | 5    |

I found a solution based on a combination of merge and groupby operations. First I grouped df1 and df2 by 'col1' and calculated the sums, then I merged the resulting 'sum' back onto df1.
import pandas as pd

dic_df1 = {'col1': ['A01', 'A02', 'A03', 'A01', 'A02'], 'col2': [1, 0, 0, 7, 1]}
df1 = pd.DataFrame(dic_df1)
dic_df2 = {'col1': ['A01', 'A02', 'A06', 'A07', 'A02'], 'col2': [5, 0, 0, 1, 4], 'col3': ['x', 'y', 'asa', 'asa', 'st']}
df2 = pd.DataFrame(dic_df2)
print(df1)
print(df2)
Then group both dataframes by 'col1' and sum 'col2'; after merging the two grouped results, the combined total for each col1 value goes into a 'sum' column:
df2=df2.groupby(['col1'], as_index=False)['col2'].sum()
merged=(df1.groupby(['col1'], as_index=False)['col2'].sum()).merge(df2, left_on='col1', right_on='col1', how='left')
merged['sum']=merged['col2_x']+merged['col2_y']
Finally, merge the 'sum' column back onto df1:
df1=df1.merge(merged[['col1', 'sum']], left_on='col1', right_on='col1', how='left')
The final df1 then matches the End Result above, except that rows whose col1 has no match in df2 (A03 here) end up with NaN in the 'sum' column. There are many ways to fill those NaN values; one option is sketched below.
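One simple option (a minimal sketch, not part of the original answer) is to fill the missing totals with 0 and cast back to integers:

# A03 has no match in df2, so its 'sum' is NaN after the merge;
# filling with 0 reproduces the End Result column exactly
df1['sum'] = df1['sum'].fillna(0).astype(int)
print(df1)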

Related

Pandas: for each row count occurrence in another df within specific dates

I have the following 2 dfs:
df1
| company | company_id |    date    |  max_date  |
|---------|------------|------------|------------|
|   A21   |     5      | 2021-02-04 | 2021-02-11 |
|   A21   |    10      | 2020-10-04 | 2020-10-11 |
|   B52   |     8      | 2021-03-04 | 2021-04-11 |
|   B52   |     6      | 2020-04-04 | 2020-04-11 |
and
df2:
| company | company_id |  date_df2  |
|---------|------------|------------|
|   A21   |     5      | 2021-02-05 |
|   A21   |     5      | 2021-02-08 |
|   A21   |     5      | 2021-02-12 |
|   A21   |     5      | 2021-02-11 |
|   A21   |    10      | 2020-10-07 |
|   B52   |     8      | 2021-03-07 |
|   B52   |     6      | 2020-04-08 |
|   B52   |     6      | 2020-04-12 |
|   B52   |     6      | 2020-04-05 |
Logic:
For each company and company_id in df1, I want to count how many rows there are in df2 whose date_df2 falls between that row's date and max_date from df1.
Expected results:
| company | company_id |    date    |  max_date  | count |
|---------|------------|------------|------------|-------|
|   A21   |     5      | 2021-02-04 | 2021-02-11 |   3   |
|   A21   |    10      | 2020-10-04 | 2020-10-11 |   1   |
|   B52   |     8      | 2021-03-04 | 2021-04-11 |   1   |
|   B52   |     6      | 2020-04-04 | 2020-04-11 |   2   |
How can this be achieved in pandas?
Code to reproduce the dfs:
import pandas as pd

# df1
list_columns = ['company','company_id','date','max_date']
list_data = [
['A21',5,'2021-02-04','2021-02-11'],
['A21',10,'2020-10-04','2020-10-11'],
['B52',8,'2021-03-04','2021-04-11'],
['B52',6,'2020-04-04','2020-04-11']
]
df1 = pd.DataFrame(columns=list_columns, data=list_data)
#df2
list_columns = ['company','company_id','date']
list_data = [
['A21',5,'2021-02-05'],
['A21',5,'2021-02-08'],
['A21',5,'2021-02-12'],
['A21',5,'2021-02-11'],
['A21',10,'2020-10-07'],
['B52',8,'2021-03-07'],
['B52',6,'2020-04-08'],
['B52',6,'2020-04-12'],
['B52',6,'2020-04-05']
]
df2 = pd.DataFrame(columns=list_columns, data=list_data)
Use DataFrame.merge with the default inner join, then filter the matched values with Series.between, aggregate counts with GroupBy.size, and append the result as a new column, replacing missing values if necessary:
df1['date'] = pd.to_datetime(df1['date'])
df1['max_date'] = pd.to_datetime(df1['max_date'])
df2['date'] = pd.to_datetime(df2['date'])
df = df1.merge(df2, on=['company','company_id'], suffixes=('','_'))
s = (df[df['date_'].between(df['date'], df['max_date'])]
       .groupby(['company','company_id'])
       .size())
df1 = df1.join(s.rename('count'), on=['company','company_id']).fillna({'count':0})
print (df1)
  company  company_id       date   max_date  count
0     A21           5 2021-02-04 2021-02-11      3
1     A21          10 2020-10-04 2020-10-11      1
2     B52           8 2021-03-04 2021-04-11      1
3     B52           6 2020-04-04 2020-04-11      2
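For comparison, the same counts can also be computed row by row with DataFrame.apply. This is only a sketch of an alternative (not part of the original answer; count_in_window is a hypothetical helper name), and it is easier to read but noticeably slower on large frames:

def count_in_window(row):
    # rows of df2 for the same company/company_id whose date falls in [date, max_date]
    mask = ((df2['company'] == row['company'])
            & (df2['company_id'] == row['company_id'])
            & df2['date'].between(row['date'], row['max_date']))
    return int(mask.sum())

df1['count'] = df1.apply(count_in_window, axis=1)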

How to merge two rows in the same dataframe

I have a data frame which contains two rows. The value in the column "ID" for both these rows is the same. How can I create a new data frame and bring all the values in both the rows into one row, but in separate columns?
For example, if in the input data frame, there is a column called "Amount" in both the rows, The new data frame should contain one-row with two different columns as Amount_1 and Amount_2.
groupby does not work as I do not want all the information in the same columns.
I can not merge, as this is not from two different data frames.
Turn:
+------+--------+----------+---------+
| ID | Amount |Name |State |
|------|--------|----------+---------+
| 1 | 16 |A |CA |
| 2 | 32 |B |GA |
| 2 | 64 |C |NY |
+------+--------+----------+---------+
into:
+------+----------+----------+-------+--------+---------+--------+
| ID | Amount_1 | Amount_2 |Name_1 | Name_2 | State_1 | State_2|
|------|----------|----------|-------+--------+---------+--------+
| 1 | 16 | |A | | CA | |
| 2 | 32 | 64 |B |C | GA | NY |
+------+----------+----------+-------+--------+---------+--------+
Add a column that will contain the column names of the new DataFrame by using cumcount. After that, use pivot:
df['amountnr'] = 'Amount_' + df.groupby('ID').cumcount().add(1).astype(str)
df.pivot(index='ID', columns= 'amountnr', values='Amount')
#amountnr Amount_1 Amount_2
#ID
#1 16.0 NaN
#2 32.0 64.0
Edit
With your new specifications, I feel you should really use a MultiIndex, like so (if the 'amountnr' helper column from the first snippet is still present, drop it first so it does not get unstacked as well):
df['cumcount'] = df.groupby('ID').cumcount().add(1)
df2 = df.set_index(['ID', 'cumcount']).unstack()
# Amount Name State
#cumcount 1 2 1 2 1 2
#ID
#1 16.0 NaN A NaN CA NaN
#2 32.0 64.0 B C GA NY
If you insist, you can always flatten the MultiIndex columns afterwards by joining them:
df2.columns = ['_'.join([coltype, str(count)]) for coltype, count in df2.columns.values]
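Putting the pieces together (a self-contained sketch of the approach above, built on the sample data from the question):

import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 2],
                   'Amount': [16, 32, 64],
                   'Name': ['A', 'B', 'C'],
                   'State': ['CA', 'GA', 'NY']})

# number the rows within each ID, reshape, then flatten the MultiIndex columns
df['cumcount'] = df.groupby('ID').cumcount().add(1)
df2 = df.set_index(['ID', 'cumcount']).unstack()
df2.columns = ['_'.join([coltype, str(count)]) for coltype, count in df2.columns.values]
print(df2.reset_index())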

Sorting rows in pandas first by timestamp values and then by giving particular order to categorical values of a column

I have a pandas dataframe with a column "user" containing categorical values (a, b, c, d). I only care that a comes before d; the positions of b and c do not matter, so both (a, b, c, d) and (a, c, b, d) are fine for me.
How to create this ordering is the first part of the question.
Secondly, I have another column containing timestamps. I want to order the rows first by timestamp, and then, for rows with the same timestamp, by the categorical ordering above.
Let's say my data frame looks like this:
+-----------+------+
| Timestamp | User |
+-----------+------+
| 1 | b |
| 2 | d |
| 1 | a |
| 1 | c |
| 1 | d |
| 2 | a |
| 2 | b |
+-----------+------+
I want first this kind of sorting to happen
+-----------+------+
| Timestamp | User |
+-----------+------+
| 1 | b |
| 1 | a |
| 1 | c |
| 1 | d |
| 2 | d |
| 2 | a |
| 2 | b |
+-----------+------+
Followed by the categorical ordering of "user"
+-----------+------+
| Timestamp | User |
+-----------+------+
| 1 | a |
| 1 | b |
| 1 | c |
| 1 | d |
| 2 | a |
| 2 | b |
| 2 | d |
+-----------+------+
OR
+-----------+------+
| Timestamp | User |
+-----------+------+
| 1 | a |
| 1 | c |
| 1 | b |
| 1 | d |
| 2 | a |
| 2 | b |
| 2 | d |
+-----------+------+
As you can see, the relative order of "c" and "b" does not matter.
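For reference, the sample frame can be reproduced with something like this (a small sketch; it is not part of the original question):

import pandas as pd

df = pd.DataFrame({'Timestamp': [1, 2, 1, 1, 1, 2, 2],
                   'User': ['b', 'd', 'a', 'c', 'd', 'a', 'b']})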
You can specify the order with an ordered Categorical via the categories argument and then call DataFrame.sort_values:
df['User'] = pd.Categorical(df['User'], ordered=True, categories=['a','b','c','d'])
df = df.sort_values(['Timestamp','User'])
print (df)
   Timestamp User
2          1    a
0          1    b
3          1    c
4          1    d
5          2    a
6          2    b
1          2    d
If User has many possible values, the categories can be built dynamically:
import numpy as np

vals = ['a', 'd']
cats = vals + np.setdiff1d(df['User'], vals).tolist()
print (cats)
['a', 'd', 'b', 'c']
df['User'] = pd.Categorical(df['User'], ordered=True, categories=cats)
df = df.sort_values(['Timestamp','User'])
print (df)
   Timestamp User
2          1    a
4          1    d
0          1    b
3          1    c
5          2    a
1          2    d
6          2    b
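A different route (a sketch of an alternative, assuming pandas >= 1.1, which added the key argument to sort_values) is to map the users to numeric ranks instead of using a Categorical. Giving 'b' and 'c' the same rank makes their relative order explicitly irrelevant:

# only the ranks of 'a' and 'd' matter; 'b' and 'c' share a rank
order = {'a': 0, 'b': 1, 'c': 1, 'd': 2}
df = df.sort_values(['Timestamp', 'User'],
                    key=lambda s: s.map(order) if s.name == 'User' else s)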

PySpark: How to fillna values in dataframe for specific columns?

I have the following sample DataFrame:
a | b | c |
1 | 2 | 4 |
0 | null | null|
null | 3 | 4 |
And I want to replace null values only in the first 2 columns - Column "a" and "b":
a | b | c |
1 | 2 | 4 |
0 | 0 | null|
0 | 3 | 4 |
Here is the code to create sample dataframe:
rdd = sc.parallelize([(1,2,4), (0,None,None), (None,3,4)])
df2 = sqlContext.createDataFrame(rdd, ["a", "b", "c"])
I know how to replace all null values using:
df2 = df2.fillna(0)
And when I try this, I lose the third column:
df2 = df2.select(df2.columns[0:1]).fillna(0)
df2.fillna(0, subset=['a', 'b'])
fillna has a subset parameter for choosing which columns to fill; it is available as long as your Spark version is at least 1.3.1.
Alternatively, use a dictionary to fill values of certain columns:
df2.fillna({'a': 0, 'b': 0})
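A quick check on the sample frame (a sketch; both calls assume the df2 built earlier in the question):

# both forms fill nulls in "a" and "b" only and leave column "c" untouched
df2.fillna(0, subset=['a', 'b']).show()
df2.fillna({'a': 0, 'b': 0}).show()
# expected rows: (1, 2, 4), (0, 0, null), (0, 3, 4)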

Pandas DataFrame, Iterate through groups is very slow

I have a dataframe df with ~300,000 rows and plenty of columns:
| COL_A | ... | COL_B | COL_C |
-----+--------+-...--+--------+--------+
IDX
-----+--------+-...--+--------+--------+
'AAA'| 'A1' | ... | 'B1' | 0 |
-----+--------+-...--+--------+--------+
'AAB'| 'A1' | ... | 'B2' | 2 |
-----+--------+-...--+--------+--------+
'AAC'| 'A1' | ... | 'B3' | 1 |
-----+--------+-...--+--------+--------+
'AAD'| 'A2' | ... | 'B3' | 0 |
-----+--------+-...--+--------+--------+
I need to group by COL_A, and from each row of each group I need the value of IDX (e.g. 'AAA') and COL_B (e.g. 'B1'), in the order given by COL_C.
For A1 I thus need: [['AAA','B1'], ['AAC','B3'], ['AAB','B2']]
This is what I do.
grouped_by_A = self.df.groupby(COL_A)
for col_A, group in grouped_by_A:
    group = group.sort_values(by=[COL_C], ascending=True)
    ...
It works fine, but it's horribly slow (Core i7, 16 GB RAM); it already takes ~5 minutes even when I do nothing with the values. Do you know a faster way?
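One common way to speed this up (a sketch, not an answer from the original thread; how much it helps depends on what the loop body does) is to sort the whole frame by COL_C once, up front. pandas' groupby preserves the row order within each group, so the per-group sort_values call disappears from the loop:

# sort once instead of sorting every group separately
df_sorted = self.df.sort_values(by=[COL_C], ascending=True)

for col_A, group in df_sorted.groupby(COL_A):
    # index/COL_B pairs, already in COL_C order,
    # e.g. for A1: [('AAA', 'B1'), ('AAC', 'B3'), ('AAB', 'B2')]
    pairs = list(zip(group.index, group[COL_B]))
    # ... per-group work goes here ...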
