Filter rows of the 1st dataframe by per-ID starting dates from the 2nd dataframe - python-3.x

I have two dataframes from which a new dataframe has to be created.
The first one is given below.
import pandas as pd

data = {'ID': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C', 'C', 'D', 'D', 'D'],
        'Date': ['2021-2-13', '2021-2-14', '2021-2-15', '2021-2-16', '2021-2-17', '2021-2-16', '2021-2-17', '2021-2-18', '2021-2-19',
                 '2021-2-12', '2021-2-13', '2021-2-14', '2021-2-15', '2021-2-16', '2021-2-17', '2021-2-14', '2021-2-15', '2021-2-16'],
        'Steps': [1000, 1200, 1500, 2000, 1400, 4000, 3400, 5000, 1000, 3500, 4000, 5000, 5300, 2000, 3500, 5000, 5500, 5200]}
df1 = pd.DataFrame(data)
df1
The 2nd dataframe contains the starting date for each participant, as given below.
data1 = {'ID': ['A', 'B', 'C', 'D'],
         'Date': ['2021-2-15', '2021-2-17', '2021-2-16', '2021-2-15']}
df2 = pd.DataFrame(data1)
df2
Now, the resulting dataframe has to be such that, for each participant in df1, the rows start from the date given for that participant in df2. The rows prior to that starting date have to be deleted.
Any help is greatly appreciated.
Thanks

You can use .merge + boolean indexing:
df1["Date"] = pd.to_datetime(df1["Date"])
df2["Date"] = pd.to_datetime(df2["Date"])
x = df1.merge(df2, on="ID", suffixes=("", "_y"))
print(x.loc[x.Date >= x.Date_y, df1.columns].reset_index(drop=True))
Prints:
ID Date Steps
0 A 2021-02-15 1500
1 A 2021-02-16 2000
2 A 2021-02-17 1400
3 B 2021-02-17 3400
4 B 2021-02-18 5000
5 B 2021-02-19 1000
6 C 2021-02-16 2000
7 C 2021-02-17 3500
8 D 2021-02-15 5500
9 D 2021-02-16 5200
Or, if some ID is missing in df2:
x = df1.merge(df2, on="ID", suffixes=("", "_y"), how="outer").fillna(pd.Timestamp(0))
print(x.loc[x.Date >= x.Date_y, df1.columns].reset_index(drop=True))
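A variant (a sketch, not from the answer above) merges with how="left" and fills only the merged date column, so any other NaNs in the frame are left untouched:
x = df1.merge(df2, on="ID", suffixes=("", "_y"), how="left")
# IDs absent from df2 get the epoch as their start date, so all of their rows are kept
x["Date_y"] = x["Date_y"].fillna(pd.Timestamp(0))
print(x.loc[x.Date >= x.Date_y, df1.columns].reset_index(drop=True))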

If the ID in df2 is unique, you can map df2 onto df1, compare the dates, and use the resulting boolean series to index df1:
df1.loc[df1.Date >= df1.ID.map(df2.set_index('ID').squeeze())]
ID Date Steps
2 A 2021-02-15 1500
3 A 2021-02-16 2000
4 A 2021-02-17 1400
6 B 2021-02-17 3400
7 B 2021-02-18 5000
8 B 2021-02-19 1000
13 C 2021-02-16 2000
14 C 2021-02-17 3500
16 D 2021-02-15 5500
17 D 2021-02-16 5200
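The .squeeze() above turns the single remaining column of df2.set_index('ID') into a Series. If df2 ever gains more columns, selecting the column explicitly is safer (a small sketch, assuming both Date columns were already converted with pd.to_datetime as in the first answer):
start = df2.set_index('ID')['Date']  # Series mapping ID -> start date
df1.loc[df1['Date'] >= df1['ID'].map(start)]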

Related

Filter dataframe on multiple conditions within different columns

I have a sample of the dataframe as given below.
data = {'ID': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
        'Date': ['2021-2-13', '2021-2-14', '2021-2-14', '2021-2-14', '2021-2-15', '2021-2-14', '2021-2-14', '2021-2-15', '2021-2-15'],
        'Modified_Date': ['3/19/2021 6:34:20 PM', '3/20/2021 4:57:39 PM', '3/21/2021 4:57:40 PM', '3/22/2021 4:57:57 PM', '3/23/2021 4:57:41 PM',
                          '3/25/2021 11:44:15 PM', '3/26/2021 2:16:09 PM', '3/20/2021 2:16:04 PM', '3/21/2021 4:57:40 PM'],
        'Steps': [1000, 1200, 1500, 2000, 1400, 4000, 5000, 1000, 3500]}
df1 = pd.DataFrame(data)
df1
This data has to be filtered so that, for each 'ID' and then for each 'Date', the row with the latest 'Modified_Date' is selected.
E.g., for ID='A' and Date='2021-2-14', the latest Modified_Date is '3/22/2021 4:57:57 PM', so this row has to be selected.
I have been stuck on this for a while.
Try:
df1["Date"] = pd.to_datetime(df1["Date"])
df1["Modified_Date"] = pd.to_datetime(df1["Modified_Date"])
df_out = df1.groupby(["ID", "Date"], as_index=False).apply(
    lambda x: x.loc[x["Modified_Date"].idxmax()]
)
print(df_out)
Prints:
ID Date Modified_Date Steps
0 A 2021-02-13 2021-03-19 18:34:20 1000
1 A 2021-02-14 2021-03-22 16:57:57 2000
2 A 2021-02-15 2021-03-23 16:57:41 1400
3 B 2021-02-14 2021-03-26 14:16:09 5000
4 B 2021-02-15 2021-03-21 16:57:40 3500
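An equivalent idiom (a common alternative, not part of the answer above) selects the per-group idxmax row labels directly and avoids the apply:
# assumes Modified_Date was converted to datetime as above
df_out = df1.loc[df1.groupby(["ID", "Date"])["Modified_Date"].idxmax()].reset_index(drop=True)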
Or: .sort_values + .groupby:
df_out = (
    df1.sort_values(["ID", "Date", "Modified_Date"])
       .groupby(["ID", "Date"], as_index=False)
       .last()
)
The easiest/most straightforward approach is to sort by date and take the last row per group. (Note that Modified_Date is still a string here, so the sort is lexicographic; it happens to give the right order on this sample, but convert it with pd.to_datetime for robustness.)
(df1.sort_values(by='Modified_Date')
    .groupby(['ID', 'Date'], as_index=False)
    .last()
)
output:
ID Date Modified_Date Steps
0 A 2021-2-13 3/19/2021 6:34:20 PM 1000
1 A 2021-2-14 3/22/2021 4:57:57 PM 2000
2 A 2021-2-15 3/23/2021 4:57:41 PM 1400
3 B 2021-2-14 3/26/2021 2:16:09 PM 5000
4 B 2021-2-15 3/21/2021 4:57:40 PM 3500
You can also sort_values and drop_duplicates:
First convert the two columns to datetime (since they are strings in the example):
df1["Date"] = pd.to_datetime(df1["Date"])
df1["Modified_Date"] = pd.to_datetime(df1["Modified_Date"])
Then sort values on Modified_Date and drop_duplicates, keeping the last values:
out = (df1.sort_values('Modified_Date')
          .drop_duplicates(['ID', 'Date'], keep='last')
          .sort_index())
print(out)
ID Date Modified_Date Steps
0 A 2021-02-13 2021-03-19 18:34:20 1000
3 A 2021-02-14 2021-03-22 16:57:57 2000
4 A 2021-02-15 2021-03-23 16:57:41 1400
6 B 2021-02-14 2021-03-26 14:16:09 5000
8 B 2021-02-15 2021-03-21 16:57:40 3500

How to multiply pandas dataframe columns by different values

I'm a beginner.
I have a pandas dataframe as below. I would like to multiply the elements of each column by a=100, b=200, c=300. Can someone help me understand how to do that?
There are n columns.
Thank you.
index       a   b   c
2021-01-01  22  20  18
2021-01-02  25  29   7
2021-01-03  15  30  20
Create a dictionary of coefficients and apply the operation to your dataframe:
coeff = {'a': 100, 'b': 200, 'c': 300}
df.update(df[coeff.keys()].mul(pd.Series(coeff), axis=1))
>>> df
index a b c
0 2021-01-01 2200 4000 5400
1 2021-01-02 2500 5800 2100
2 2021-01-03 1500 6000 6000
Alternative with a list:
df[['a', 'b', 'c']] *= [100, 200, 300]
Assuming your dataframe is called df, then it is as simple as (if I understand it correctly):
df.a = df.a * 100
df.b = df.b * 200
df.c = df.c * 300
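Since the question mentions n columns, a hedged generalization of the list approach is to collect every value column and multiply them in one statement; the column names and factor order here are assumptions for illustration:
value_cols = df.columns.drop('index')  # every column except the date column
factors = [100, 200, 300]              # assumption: one factor per value column, in the same order
df[value_cols] = df[value_cols].mul(factors, axis=1)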

Indexing based on multiple columns

I'm new to Python, and below is an ongoing data engineering issue I'm currently trying to resolve.
Table structure:
Index 1: sequential; increments by 1 as rows are added.
Index 2: the problem (this is the value that needs to be computed).
It depends on the values stored in the columns [A, B, C, D, E]. If the value remains the same across rows, we need to assign a single index to those rows.
E.g., rows 1, 2, 3 have 567 as the value for A, B, C respectively.
Therefore, index 2 is 100 for these 3 rows.
Record types :
1 - A
2 - B
3 - C
4 - D
5 - E
Code
data = [(100, 100, 1, 567, '', '', '', ''),
        (101, 100, 2, '', 567, '', '', ''),
        (102, 100, 3, '', '', 567, '', ''),
        (103, 101, 3, '', '', 568, '', ''),
        (104, 101, 4, '', '', '', 568, ''),
        (105, 101, 5, '', '', '', '', 568)]
# Create the data frame
df = pd.DataFrame(data, columns=['index1', 'index2', 'record_type', 'A', 'B', 'C', 'D', 'E'], dtype=str)
# Combine columns A-E into a single string per row, joined with '$' (empty cells contribute only separators)
df['combined'] = df[['A', 'B', 'C', 'D', 'E']].stack().groupby(level=0).agg('$'.join)
# Clean the 'combined' column by stripping the '$' separators
df['combined_cleaned'] = df['combined'].replace({r'\$': ''}, regex=True)
I'm attempting to use the combined_cleaned column to calculate index2.
Not sure if this is the right approach; open to suggestions.
A few assumptions here, but seem to fit your problem.
If there is only ever one value across those columns for each row, you can take the max along the row, then find consecutive groups by checking whether that Series is equal to itself shifted.
We add 99 because the counting by definition starts at 1, but you seem to want it to start at 100.
val_cols = ['A', 'B', 'C', 'D', 'E']
s = df[val_cols].apply(pd.to_numeric).max(1)
#0 567.0
#1 567.0
#2 567.0
#3 568.0
#4 568.0
#5 568.0
#dtype: float64
df['index2'] = s.ne(s.shift()).cumsum() + 99
print(df)
index1 record_type A B C D E index2
0 100 1 567 100
1 101 2 567 100
2 102 3 567 100
3 103 3 568 101
4 104 4 568 101
5 105 5 568 101
If instead of a single value, 'record_type' points to the appropriate column you can use numpy indexing.
import numpy as np
arr = df[val_cols].to_numpy()
idx = df['record_type'].astype(int).to_numpy()
vals = arr[np.arange(len(arr)), idx-1]
#array(['567', '567', '567', '568', '568', '568'], dtype=object)
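From there, the same shift/cumsum grouping as above can be reused on the extracted values (a sketch):
s = pd.Series(vals, index=df.index).astype(float)
df['index2'] = s.ne(s.shift()).cumsum() + 99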
The combined_cleaned column could be generated directly using
cols = ['A', 'B', 'C', 'D', 'E']
df[cols].replace('', np.nan).apply(lambda x: x.dropna().item(), axis=1)
You can also try with stack followed by factorize:
cols = ['A', 'B', 'C', 'D', 'E']
s = pd.factorize(df[cols].replace('', np.nan).stack())[0]
df['index2_new'] = int(df['index1'].iat[0]) + s
print(df)
index1 index2 record_type A B C D E index2_new
0 100 100 1 567 100
1 101 100 2 567 100
2 102 100 3 567 100
3 103 101 3 568 101
4 104 101 4 568 101
5 105 101 5 568 101

Append one dataframe's column values to another dataframe

I have two dataframes. df1 is an empty dataframe and df2 has some data, as shown. A few columns are common to both dfs. I want to append df2's column data into df1's columns. df3 is the expected result.
I have referred to Python + Pandas + dataframe: couldn't append one dataframe to another, but it's not working. It gives the following error:
ValueError: Plan shapes are not aligned
df1:
Empty DataFrame
Columns: [a, b, c, d, e]
Index: []
df2:
c e
0 11 55
1 22 66
df3 (expected output):
a b c d e
0 11 55
1 22 66
I tried with append but am not getting the desired result:
import pandas as pd

l1 = ['a', 'b', 'c', 'd', 'e']
l2 = []
df1 = pd.DataFrame(l2, columns=l1)
l3 = ['c', 'e']
l4 = [[11, 55],
      [22, 66]]
df2 = pd.DataFrame(l4, columns=l3)
print("concat", "\n", pd.concat([df1, df2]))                     # df1's column order is kept
print("merge NaN", "\n", pd.merge(df2, df1, how='left', on=l3))  # column order is not preserved
#### Output ####
#concat
a b c d e
0 NaN NaN 11 NaN 55
1 NaN NaN 22 NaN 66
#merge
c e a b d
0 11 55 NaN NaN NaN
1 22 66 NaN NaN NaN
Append seems to work for me. Does this not do what you want?
df1 = pd.DataFrame(columns=['a', 'b', 'c'])
print("df1: ")
print(df1)
df2 = pd.DataFrame(columns=['a', 'c'], data=[[0, 1], [2, 3]])
print("df2:")
print(df2)
print("df1.append(df2):")
print(df1.append(df2, ignore_index=True, sort=False))
Output:
df1:
Empty DataFrame
Columns: [a, b, c]
Index: []
df2:
a c
0 0 1
1 2 3
df1.append(df2):
a b c
0 0 NaN 1
1 2 NaN 3
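One caveat: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current versions the equivalent call is pd.concat:
print(pd.concat([df1, df2], ignore_index=True, sort=False))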
Have you tried pd.concat?
pd.concat([df1,df2])
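On the question's own data this already produces the expected df3, since concat aligns on column names and keeps df1's column order (a minimal sketch):
df1 = pd.DataFrame(columns=['a', 'b', 'c', 'd', 'e'])
df2 = pd.DataFrame({'c': [11, 22], 'e': [55, 66]})
df3 = pd.concat([df1, df2], ignore_index=True)  # a, b, d stay NaN; c and e are filled
print(df3)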

Combining dataframes in pandas and populating with maximum values

I'm trying to combine multiple data frames in pandas and I want the new dataframe to contain the maximum element within the various dataframes. All of the dataframes have the same row and column labels. How can I do this?
Example:
df1 =   Date  A  B  C
      1/1/15  3  5  1
      2/1/15  2  4  7
df2 =   Date  A  B  C
      1/1/15  7  2  2
      2/1/15  1  5  4
I'd like the result to look like this.
df =    Date  A  B  C
      1/1/15  7  5  2
      2/1/15  2  5  7
You can use np.where to return an array of the values that satisfy your boolean condition; this can then be used to construct a dataframe:
In [5]:
vals = np.where(df1 > df2, df1, df2)
vals
Out[5]:
array([['1/1/15', 7, 5, 2],
       ['2/1/15', 2, 5, 7]], dtype=object)
In [6]:
pd.DataFrame(vals, columns = df1.columns)
Out[6]:
Date A B C
0 1/1/15 7 5 2
1 2/1/15 2 5 7
I don't know if Date is a column or index but the end result will be the same.
EDIT
Actually just use np.maximum:
In [8]:
np.maximum(df1,df2)
Out[8]:
Date A B C
0 1/1/15 7 5 2
1 2/1/15 2 5 7
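Since the question mentions combining multiple dataframes, np.maximum can also be folded over a whole list (a sketch, assuming every frame shares the same row and column labels):
from functools import reduce
import numpy as np

dfs = [df1, df2]                # any number of identically-labelled frames
result = reduce(np.maximum, dfs)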
