standardization of pandas dataframe groups/chunks (how to insert rows fast) - python-3.x

I have "objects" (represented each by some rows in a table) which are described in multiple rows. But the problem is, that objects sometimes miss rows. My goal is to have a DataFrame where each object has the same amount of rows (same shape), where missing rows of an object are filled with empty rows.
For example:
object 1
O-ID | key 1 | key 2 | ... | key N | value 1 | value 2 | value N
0 | A 11 | A 21 | ... | key N1 | | |
0 | A 13 | A 23 | ... | key N3 | | |
0 | A 16 | A 26 | ... | key N6 | | |
object 2
O-ID | key 1 | key 2 | ... | key N | value 1 | value 2 | value N
1 | A 12 | A 22 | ... | key N2 | | |
1 | A 13 | A 23 | ... | key N3 | | |
1 | A 14 | A 24 | ... | key N4 | | |
"O-ID" is the Object-ID. We can see that there are 6 different kinds of rows in total. In the end, I want each object to have all 6 rows. key 1 .. key-N are keys in sense of key-value pairs (with value 1 ... value N).
The result should look like this:
object 1:
O-ID | key 1 | key 2 | ... | key N | value 1 | value 2 | value N
0 | A 11 | A 21 | ... | key N1 | | |
0 | A 12 | A 22 | ... | key N2 | Null | Null | Null
0 | A 13 | A 23 | ... | key N3 | | |
0 | A 14 | A 24 | ... | key N4 | Null | Null | Null
0 | A 15 | A 25 | ... | key N5 | Null | Null | Null
0 | A 16 | A 26 | ... | key N6 | | |
object 2:
O-ID | key 1 | key 2 | ... | key N | value 1 | value 2 | value N
1 | A 11 | A 21 | ... | key N1 | Null | Null | Null
1 | A 12 | A 22 | ... | key N2 | | |
1 | A 13 | A 23 | ... | key N3 | | |
1 | A 14 | A 24 | ... | key N4 | | |
1 | A 15 | A 25 | ... | key N5 | Null | Null | Null
1 | A 16 | A 26 | ... | key N6 | Null | Null | Null
I don't know how to do this besides using a slow for-loop...
Do you know a better/faster way to find out which rows are missing, and how to insert "Null"-rows?
I already had the idea of grouping them by "O-ID" and then using a map on the groups. But how do I insert the "null"-rows in the right order in a fast way?
I'm using the latest pandas version and the latest Python 3.

First we create a MultiIndex from all the keys we need in the result dataframe res. Then we reindex our dataframe with this new MultiIndex. In the last step we convert the key tuples back into individual columns, reorder the columns and sort the rows as needed.
import pandas as pd

df = pd.DataFrame({'O_ID': [0, 0, 0, 1, 1, 1, 2],
                   'key_1': ['A11', 'A13', 'A16', 'A12', 'A13', 'A14', 'A15'],
                   'key_2': ['A21', 'A23', 'A26', 'A22', 'A23', 'A24', 'A25'],
                   'key_n': ['key N1', 'key N3', 'key N6', 'key N2', 'key N3', 'key N4', 'key N5'],
                   'value_1': [11, 12, 13, 14, 15, 16, 17],
                   'value_2': [21, 22, 23, 24, 25, 26, 27],
                   'value_n': [121, 122, 123, 124, 125, 126, 127]})
keycols = [c for c in df.columns if c.startswith('key')]
valcols = [c for c in df.columns if c.startswith('value')]
# combine the key columns of each row into one tuple
keys = df[keycols].apply(tuple, axis=1)
# create multiindex of all combinations of O_ID and key tuples
idx = pd.MultiIndex.from_product([df.O_ID.unique(), keys.unique()], names=['O_ID','key_tuples'])
# set index of O_ID and key tuples and reindex with new multiindex
res = df.set_index(['O_ID',keys]).drop(columns=keycols)
res = res.reindex(idx).reset_index()
# split key tuples back into individual columns and reorder/sort as needed
res = pd.DataFrame(res.key_tuples.to_list(), index=res.index, columns=keycols).join(res).drop(columns=['key_tuples'])
res = res.reindex(columns=['O_ID']+keycols+valcols).sort_values(['O_ID']+keycols)
Result:
O_ID key_1 key_2 key_n value_1 value_2 value_n
0 0 A11 A21 key N1 11.0 21.0 121.0
3 0 A12 A22 key N2 NaN NaN NaN
1 0 A13 A23 key N3 12.0 22.0 122.0
4 0 A14 A24 key N4 NaN NaN NaN
5 0 A15 A25 key N5 NaN NaN NaN
2 0 A16 A26 key N6 13.0 23.0 123.0
6 1 A11 A21 key N1 NaN NaN NaN
9 1 A12 A22 key N2 14.0 24.0 124.0
7 1 A13 A23 key N3 15.0 25.0 125.0
10 1 A14 A24 key N4 16.0 26.0 126.0
11 1 A15 A25 key N5 NaN NaN NaN
8 1 A16 A26 key N6 NaN NaN NaN
12 2 A11 A21 key N1 NaN NaN NaN
15 2 A12 A22 key N2 NaN NaN NaN
13 2 A13 A23 key N3 NaN NaN NaN
16 2 A14 A24 key N4 NaN NaN NaN
17 2 A15 A25 key N5 17.0 27.0 127.0
14 2 A16 A26 key N6 NaN NaN NaN
(I had to add a third object with key A15; otherwise it is unclear from your sample data where this key should come from, i.e. this method uses all keys that exist in the input data. If you know all the key values in advance and want to build the result dataframe with these keys, whether or not they occur in the input dataframe, then you can create your multiindex from these known key values instead of the unique keys present in the input data.)
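A minimal sketch of that variant, assuming the six key tuples are known up front (values taken from the sample data above):
known_keys = [('A11', 'A21', 'key N1'), ('A12', 'A22', 'key N2'),
              ('A13', 'A23', 'key N3'), ('A14', 'A24', 'key N4'),
              ('A15', 'A25', 'key N5'), ('A16', 'A26', 'key N6')]
# build the MultiIndex from the known key tuples instead of keys.unique()
idx = pd.MultiIndex.from_product([df.O_ID.unique(), known_keys], names=['O_ID', 'key_tuples'])
# the remaining steps (set_index, reindex, splitting the tuples back) stay the same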

filter and get rows between the conditions in a dataframe

My DataFrame looks something like this:
+----------------------------------+---------+
| Col1 | Col2 |
+----------------------------------+---------+
| Start A | 1 |
| value 1 | 2 |
| value 2 | 3 |
| value 3 | 4 |
| value 5 | 5 |
| End A | 6 |
| value 6 | 3 |
| value 7 | 4 |
| value 8 | 5 |
| Start B | 1 |
| value 1 | 2 |
| value 2 | 3 |
| value 3 | 4 |
| value 5 | 5 |
| End B | 6 |
| value 6 | 3 |
| value 7 | 4 |
| value 8 | 5 |
| Start C | 1 |
| value 1 | 2 |
| value 2 | 3 |
| value 3 | 4 |
| value 5 | 5 |
| End C | 6 |
+----------------------------------+---------+
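For reproducibility, a minimal sketch constructing this sample data (Col1/Col2 values taken from the table above):
import pandas as pd

rows = []
for label in ['A', 'B', 'C']:
    rows += [(f'Start {label}', 1), ('value 1', 2), ('value 2', 3),
             ('value 3', 4), ('value 5', 5), (f'End {label}', 6)]
    if label != 'C':
        rows += [('value 6', 3), ('value 7', 4), ('value 8', 5)]
df = pd.DataFrame(rows, columns=['Col1', 'Col2'])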
What I am trying to achieve: whenever a "Start ..." row and the matching "End ..." row are present, I want the rows between them (inclusive).
Expected Result is:
+----------------------------------+---------+
| Col1 | Col2 |
+----------------------------------+---------+
| Start A | 1 |
| value 1 | 2 |
| value 2 | 3 |
| value 3 | 4 |
| value 5 | 5 |
| End A | 6 |
| Start B | 1 |
| value 1 | 2 |
| value 2 | 3 |
| value 3 | 4 |
| value 5 | 5 |
| End B | 6 |
| Start C | 1 |
| value 1 | 2 |
| value 2 | 3 |
| value 3 | 4 |
| value 5 | 5 |
| End C | 6 |
+----------------------------------+---------+
I tried the code from this question: How to filter dataframe columns between two rows that contain specific string in column?
m = df['Col1'].isin(['Start A', 'End A']).cumsum().eq(1)
df[m|m.shift()]
But this only returns the first Start/End pair, and it also expects the exact strings.
output:
+----------------------------------+---------+
| Col1 | Col2 |
+----------------------------------+---------+
| Start A | 1 |
| value 1 | 2 |
| value 2 | 3 |
| value 3 | 4 |
| value 5 | 5 |
| End A | 6 |
+----------------------------------+---------+
The answer you linked to was designed to work with a single pair of Start/End.
A more generic variant of it is to check the parity of the cumulative Start/End count (assuming strictly alternating Start/End markers):
m1 = df['Col1'].str.match(r'Start|End').cumsum().mod(2).eq(1)
# boolean indexing
out = df[m1|m1.shift()]
Alternatively, use each Start as a flag to keep the following rows and each End as a flag to drop them. This, however, doesn't take the A/B/C letter after Start/End into account the way @Quang's nice answer does:
# extract Start/End
s = df['Col1'].str.extract(r'^(Start|End)', expand=False)
# set flags and ffill
m1 = s.replace({'Start': True, 'End': False}).ffill()
# boolean slicing
out = df[m1|m1.shift()]
Output:
Col1 Col2
0 Start A 1
1 value 1 2
2 value 2 3
3 value 3 4
4 value 5 5
5 End A 6
9 Start B 1
10 value 1 2
11 value 2 3
12 value 3 4
13 value 5 5
14 End B 6
18 Start C 1
19 value 1 2
20 value 2 3
21 value 3 4
22 value 5 5
23 End C 6
Let's try:
# extract the label after Start/End
groups = df['Col1'].str.extract(r'^(?:Start|End) (.*)', expand=False)
# keep rows with equal forward fill and backward fill
df[groups.bfill() == groups.ffill()]
Output:
Col1 Col2
0 Start A 1
1 value 1 2
2 value 2 3
3 value 3 4
4 value 5 5
5 End A 6
9 Start B 1
10 value 1 2
11 value 2 3
12 value 3 4
13 value 5 5
14 End B 6
18 Start C 1
19 value 1 2
20 value 2 3
21 value 3 4
22 value 5 5
23 End C 6
One option is with an interval index:
Get the positions of the starts and ends:
starts = df.Col1.str.startswith("Start").to_numpy().nonzero()[0]
ends = df.Col1.str.startswith("End").to_numpy().nonzero()[0]
Build an interval index, and get matches where the index lies between Start and End:
intervals = pd.IntervalIndex.from_arrays(starts, ends, closed='both')
intervals = intervals.get_indexer(df.index)
Filter the original dataframe with the computed positions; get_indexer returns -1 for rows that fall inside no Start/End interval, so keep only positions that are not less than 0:
df.loc[intervals >= 0]
Col1 Col2
0 Start A 1
1 value 1 2
2 value 2 3
3 value 3 4
4 value 5 5
5 End A 6
9 Start B 1
10 value 1 2
11 value 2 3
12 value 3 4
13 value 5 5
14 End B 6
18 Start C 1
19 value 1 2
20 value 2 3
21 value 3 4
22 value 5 5
23 End C 6

Splitting ID column in pandas dataframe to multiple columns

I have a pandas dataframe like below :
| ID | Value |
+----------+--------+
|1C16 | 34 |
|1C1 | 45 |
|7P.75 | 23 |
|7T1 | 34 |
|1C10DG | 34 |
+----------+--------+
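For reference, a minimal sketch constructing this sample frame (values as shown above):
import pandas as pd

df = pd.DataFrame({'ID': ['1C16', '1C1', '7P.75', '7T1', '1C10DG'],
                   'Value': [34, 45, 23, 34, 34]})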
I want to split the ID column (it's a string column) so that it looks like below:
| ID | Value | Code | Core |size |
+----------+--------+-------+------+-----+
|1C16 | 34 | C | 1 | 16 |
|1C1 | 45 | C | 1 | 1 |
|7P.75 | 23 | P | 7 | .75 |
|7T1 | 34 | T | 7 | 1 |
|1C10DG | 34 | C | 1 | 10 |
+----------+--------+-------+------+-----+
So how can this be achieved? Thanks.
You can try .str.extract with the regex (?P<Core>\d+)(?P<Code>[A-Z])(?P<size>[.0-9]+) to capture the patterns (group names chosen to match your expected Core/Code/size columns):
df.ID.str.extract(r'(?P<Core>\d+)(?P<Code>[A-Z])(?P<size>[.0-9]+)')
# Core Code size
#0 1 C 16
#1 1 C 1
#2 7 P .75
#3 7 T 1
#4 1 C 10
Use .str.extract() with multiple capturing groups and join:
df.join(
    df['ID'].str.extract(r'(\d)(\w)(\.?\d+)').rename(
        columns={0: 'Core', 1: 'Code', 2: 'Size'}))
ID Value Core Code Size
0 1C16 34 1 C 16
1 1C1 45 1 C 1
2 7P.75 23 7 P .75
3 7T1 34 7 T 1
4 1C10DG 34 1 C 10

rolling average and aggregate more than one column in pandas

How do I aggregate the 'reviewer' lists together with the average of 'quantity'?
For a data frame like the one below I can successfully calculate the rolling average of the quantities per group over every 3 rows. How do I add an extra column that aggregates the values of the 'reviewer' column for every window as well? For example, for company 'A' in year 1993 that column would be [[p1,p2],[p3,p2],[p4]].
df = pd.DataFrame(data=[
    ['A', 1990, 2, ['p1', 'p2']],
    ['A', 1991, 3, ['p3', 'p2']],
    ['A', 1993, 5, ['p4']],
    ['A', 2000, 4, ['p1', 'p5', 'p7']],
    ['B', 2000, 1, ['p3']],
    ['B', 2001, 2, ['p6', 'p9']],
    ['B', 2002, 3, ['p10', 'p1']]],
    columns=['company', 'year', 'quantity', 'reviewer'])

df['rolling_average'] = (df.groupby(['company'])
                         .rolling(3).agg({'quantity': 'mean'})
                         .reset_index(level=[0], drop=True))
The output currently looks like:
| index | company | year | quantity | reviewer | rolling_average |
| :---- | :------ | :--- | :------- | :------- | :-------------- |
| 0 | A | 1990 | 2 | [p1, p2] | NaN |
| 1 | A | 1991 | 3 | [p3, p2] | NaN |
| 2 | A | 1993 | 5 | [p4] | 3.33 |
| 3 | A | 2000 | 4 | [p1, p5, p7] | 4.00 |
| 4 | B | 2000 | 1 | [p3] | NaN |
| 5 | B | 2001 | 2 | [p6, p9] | NaN |
| 6 | B | 2002 | 3 | [p10, p1]| 2.00 |
Since rolling() cannot handle non-numeric data, we need to build the rolling window ourselves here (the flattened result is assigned back by position, which works because the rows are already ordered by company):
import numpy as np

n = 3
df['new'] = df.groupby(['company'])['reviewer'].apply(
    lambda x: [x[y-n:y].tolist() if y >= n else np.nan for y in range(1, len(x)+1)]).explode().values
df
company year quantity reviewer new
0 A 1990 2 [p1, p2] NaN
1 A 1991 3 [p3, p2] NaN
2 A 1993 5 [p4] [[p1, p2], [p3, p2], [p4]]
3 A 2000 4 [p1, p5, p7] [[p3, p2], [p4], [p1, p5, p7]]
4 B 2000 1 [p3] NaN
5 B 2001 2 [p6, p9] NaN
6 B 2002 3 [p10, p1] [[p3], [p6, p9], [p10, p1]]

Pandas: for each row count occurrence in another df within specific dates

I have the following 2 dfs:
df1
|company|company_id| date | max_date |
| A21 | 5 |2021-02-04| 2021-02-11|
| A21 | 10 |2020-10-04| 2020-10-11|
| B52 | 8 |2021-03-04| 2021-04-11|
| B52 | 6 |2020-04-04| 2020-04-11|
-------------------------------------------
and
df2:
|company|company_id| date_df2 |
| A21 | 5 |2021-02-05|
| A21 | 5 |2021-02-08|
| A21 | 5 |2021-02-12|
| A21 | 5 |2021-02-11|
| A21 | 10 |2020-10-07|
| B52 | 8 |2021-03-07|
| B52 | 6 |2020-04-08|
| B52 | 6 |2020-04-12|
| B52 | 6 |2020-04-05|
-------------------------------
Logic:
For each company and company_id in df1, I want to count how many occurrences there are in df2 where date_df2 in df2 is between the date and max_date from df1.
Expected results:
|company|company_id| date | max_date |count|
| A21 | 5 |2021-02-04| 2021-02-11| 3 |
| A21 | 10 |2020-10-04| 2020-10-11| 1 |
| B52 | 8 |2021-03-04| 2021-04-11| 1 |
| B52 | 6 |2020-04-04| 2020-04-11| 2 |
------------------------------------------------
How can this be achieved in pandas?
Code to reproduce the df:
#df1
list_columns = ['company','company_id','date','max_date']
list_data = [
['A21',5,'2021-02-04','2021-02-11'],
['A21',10,'2020-10-04','2020-10-11'],
['B52',8,'2021-03-04','2021-04-11'],
['B52',6,'2020-04-04','2020-04-11']
]
df1 = pd.DataFrame(columns=list_columns, data=list_data)
#df2
list_columns = ['company','company_id','date']
list_data = [
['A21',5,'2021-02-05'],
['A21',5,'2021-02-08'],
['A21',5,'2021-02-12'],
['A21',5,'2021-02-11'],
['A21',10,'2020-10-07'],
['B52',8,'2021-03-07'],
['B52',6,'2020-04-08'],
['B52',6,'2020-04-12'],
['B52',6,'2020-04-05']
]
df2 = pd.DataFrame(columns=list_columns, data=list_data)
Use DataFrame.merge with the default inner join, then filter the matched values with Series.between, aggregate the counts with GroupBy.size, and append the new column, replacing missing values if necessary:
df1['date'] = pd.to_datetime(df1['date'])
df1['max_date'] = pd.to_datetime(df1['max_date'])
df2['date'] = pd.to_datetime(df2['date'])
df = df1.merge(df2, on=['company','company_id'], suffixes=('','_'))
s = (df[df['date_'].between(df['date'], df['max_date'])]
.groupby(['company','company_id'])
.size())
df1 = df1.join(s.rename('count'), on=['company','company_id']).fillna({'count':0})
print (df1)
company company_id date max_date count
0 A21 5 2021-02-04 2021-02-11 3
1 A21 10 2020-10-04 2020-10-11 1
2 B52 8 2021-03-04 2021-04-11 1
3 B52 6 2020-04-04 2020-04-11 2
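If the intermediate merged frame grows too large, a rough per-row alternative sketch (not the approach above; slower for big frames, but it avoids materializing the merge), assuming df1 and df2 with the datetime conversion already applied:
def count_between(row):
    # count df2 rows of the same company/company_id whose date falls inside the window
    same = (df2['company'] == row['company']) & (df2['company_id'] == row['company_id'])
    return int((same & df2['date'].between(row['date'], row['max_date'])).sum())

df1['count'] = df1.apply(count_between, axis=1)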

SUM values in one column if criteria exists at least once per value in another column

| A B C D | E F | G H
----|----------------------------------------------------|-----------------------|-------------------
1 | | |
2 | Products date quantity | |
----|----------------------------------------------------|-----------------------|-------------------
3 | Product_A 2020-01-08 0 | From 2020-01-01 | Result: 800
4 | Product_A 2020-12-15 0 | to 2020-10-31 |
5 | Product_A 2020-12-23 0 | |
6 | Product_A 500 | |
----|----------------------------------------------------|-----------------------|------------------
7 | Product_B 2020-11-09 0 | |
8 | Product_B 2021-03-14 0 | |
9 | Product_B 700 | |
----|----------------------------------------------------|-----------------------|------------------
10 | Product_C 2020-02-05 0 | |
11 | Product_C 2020-07-19 0 | |
12 | Product_C 2020-09-18 0 | |
13 | Product_C 2020-09-25 0 | |
14 | Product_C 300 | |
15 | | |
16 | | |
In the table I have listed different products with multiple dates per product.
Below each product there is a row in which a quantity is displayed.
Now in Cell H3 I want to get the sum of the quantities of all products that have at least one date between the dates in Cell F3 and Cell F4. In the example this applies to Product_A and Product_C, therefore the sum is 500+300=800.
I have no clue what kind of formula I need to achieve this.
I guess it must be something like this:
SUMIFS(Date in Cell F3 OR in Cell F4 exists for Product in Column C THEN SUM over Column D)
Do you have an idea of how this formula has to look?
One way would be with SUMPRODUCT() combined with COUNTIFS(): for every row, COUNTIFS counts how many dates of the same product fall between F3 and F4, so each row of a product with at least one date in range yields TRUE, which is then multiplied by the quantities in column D:
=SUMPRODUCT((COUNTIFS(B3:B14,B3:B14,C3:C14,">="&F3,C3:C14,"<="&F4)>0)*D3:D14)
