How to compare datetime between dataframes in multi logic statements? - python-3.x

I am having an issue comparing dates between two dataframes inside a multi-logic statement.
df1:
EmailAddress DateTimeCreated
1#1 2019-02-12 20:47:00
df2:
EmailAddress DateTimeCreated
1#1.com 2019-02-07 20:47:00
2#2.com 2018-11-13 20:47:00
3#3.com 2018-11-04 20:47:00
I want to do three things whenever there is a row in df1:
1. Check whether `EmailAddress` from df1 is present in df2.
1a. If `EmailAddress` is present, compare `DateTimeCreated` in df1 to `DateTimeCreated` in df2.
2. If `DateTimeCreated` in df2 is older than today minus 90 days, append the row from df1 into df2.
In simpler words:
I want to see whether the email address is present in df2 and, if it is, compare `DateTimeCreated` in df2 to see whether it has been more than today minus 90 days since the last time the person answered. If it has been more than 90 days, append the row from df1 into df2.
My logic is appending everything, and I am not sure what I am doing wrong. Here is what I tried:
import pandas as pd
from datetime import datetime, timedelta
df2.append(df2.loc[df2.EmailAddress.isin(df1.EmailAddress)&(df2.DateTimeCreated.ge(datetime.today() - timedelta(90)))])
What am I doing wrong with the date comparison?
EDIT:
In the above example, the row from df1 would not be appended, because the matching df2 `DateTimeCreated` is within TODAY() - 90 days.

Please refer to the inline comments for the explanation. Note that you need to rename your df1 columns to match the df2 columns for this solution.
import pandas as pd
from datetime import timedelta, datetime

df1 = pd.DataFrame({'EmailAddress': ['2#2.com'],
                    'DateTimeCreated': [datetime(2019, 2, 12, 20, 47, 0)]})
df2 = pd.DataFrame({'EmailAddress': ['1#1.com', '2#2.com', '3#3.com'],
                    'DateTimeCreated': [datetime(2019, 2, 7, 20, 47, 0),
                                        datetime(2018, 11, 13, 20, 47, 0),
                                        datetime(2019, 11, 4, 20, 47, 0)]})
# Get all expired rows
df3 = df2.loc[datetime.now() - df2['DateTimeCreated'] > timedelta(days=90)]
# Update it with the timestamp from df1
df3 = df3.set_index('EmailAddress').join(df1.set_index('EmailAddress'), how='inner', rsuffix='_r')
df3.drop('DateTimeCreated', axis=1, inplace=True)
df3.columns = ['DateTimeCreated']
# Patch df2 with the latest timestamp
df2 = df3.combine_first(df2.set_index('EmailAddress')).reset_index()
# Patch again for rows in df1 that are not in df2
df1 = df1.loc[~df1['EmailAddress'].isin(df2['EmailAddress'])]
df2 = pd.concat([df2, df1])
>>> df2
EmailAddress DateTimeCreated
0 1#1.com 2019-02-07 20:47:00
1 2#2.com 2019-02-12 20:47:00
2 3#3.com 2019-11-04 20:47:00

Try:
1. Left-join df1 and df2 on the email address (condition 1):
combined_df = df1.set_index("EmailAddress").join(df2.set_index("EmailAddress"), how="left", rsuffix="_df2")
2. Calculate the gap between df1's DateTimeCreated and today:
combined_df["gap"] = pd.Timestamp.today() - combined_df.DateTimeCreated
3. Build a mask for the rows whose gap is greater than 90 days:
mask = combined_df.gap > pd.Timedelta(days=90)
4. Append the matching rows from df1 to df2:
df2 = df2.append(df1.loc[mask.values])
Note: I think you may need combined_df only; the append in step 4 could lead to duplicated or confusing data. Either way, you can use steps 1-4, or only steps 1-3.
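Putting the question's intended logic together, here is a minimal runnable sketch (frames modeled on the question's samples; the df1 email is assumed to be '1#1.com', and the outcome of the 90-day check depends on the date you run it):

```python
import pandas as pd

# Frames modeled on the question's samples (df1 email assumed to be '1#1.com')
df1 = pd.DataFrame({'EmailAddress': ['1#1.com'],
                    'DateTimeCreated': pd.to_datetime(['2019-02-12 20:47:00'])})
df2 = pd.DataFrame({'EmailAddress': ['1#1.com', '2#2.com', '3#3.com'],
                    'DateTimeCreated': pd.to_datetime(['2019-02-07 20:47:00',
                                                       '2018-11-13 20:47:00',
                                                       '2018-11-04 20:47:00'])})

cutoff = pd.Timestamp.today() - pd.Timedelta(days=90)

# Emails present in both frames whose df2 record is older than the cutoff
stale = df2.loc[df2['EmailAddress'].isin(df1['EmailAddress'])
                & (df2['DateTimeCreated'] < cutoff), 'EmailAddress']

# Append only the df1 rows for those stale emails
df2 = pd.concat([df2, df1[df1['EmailAddress'].isin(stale)]], ignore_index=True)
```

At the time the question was asked, 2019-02-07 was still within the 90-day window, so nothing would have been appended; run today, that record counts as stale and the df1 row is added.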

Related

Is there a way to compare the values of a Pandas DataFrame with the values of a second DataFrame?

I have 2 pandas dataframes with 5 columns and about 1000 rows each (working with Python 3).
I'm interested in making a comparison between the first column of df1 and the first column of df2 as follows:
DF1
[index] [col1]
1 "foobar"
2 "acksyn"
3 "foobaz"
4 "ackfin"
... ...
DF2
[index] [col1]
1 "old"
2 "fin"
3 "new"
4 "bar"
... ...
What I want to achieve is this: for each row of DF1, if DF1.col1 ends in any values of DF2.col1, drop the row.
In this example the resulting DF1 should be:
DF1
[index] [col1]
2 "acksyn"
3 "foobaz"
... ...
(the DF2 values at indexes 2 and 4 are the final part of the DF1 values at indexes 1 and 4)
I tried using an internally defined function like:
def check_presence(df1_col1, second_csv):
    for index, row in second_csv.iterrows():
        search_string = "(?P<first_group>^(" + some_string + "))(?P<the_rest>" + row["col1"] + "$)"
        if re.search(search_string, df1_col1):
            return True
    return False
and instructions with this format:
indexes = csv[csv.col1.str.contains(some_regex, regex= True, na=False)].index
but in both cases the Python console complains about not being able to compare non-string objects with a string.
What am I doing wrong? I could even try a solution after joining the 2 CSVs, but I think I would end up needing to do the same thing anyway.
Thanks for your patience, I'm new to Python...
You will need to join your keywords from df2 into a single pattern first if you want to use the str.contains method.
import pandas as pd
df = pd.DataFrame({'col1': {0: 'foobar', 1: 'acksyn', 2: 'foobaz', 3: 'ackfin'}})
df2 = pd.DataFrame({'col1': {0: 'old', 1: 'fin', 2: 'new', 3: 'bar'}})
print (df["col1"].str.contains("|".join(df2["col1"])))
# Output:
0 True
1 False
2 False
3 True
Possible Solution
"" for each row of DF1, if DF1.col1 ends in any values of DF2.col1, drop the row.""
This is a one-liner if I understand properly:
# Search for Substring
# Generate an "OR" statement with a join
# Drop if match.
df[~df.col1.str.contains('|'.join(df2.col1.values))]
This will keep only the rows where DF2.Col1 is NOT found in DF1.Col1.
pd.Series.str.contains
Take your frames
frame1 = pd.DataFrame({"col1": ["foobar", "acksyn", "foobaz", "ackfin"]})
frame2 = pd.DataFrame({"col1": ["old", "fin", "new", "bar"]})
Then
myList = frame2.col1.values
pattern = '|'.join(myList)
Finally
frame1["col2"] = frame1["col1"].str.contains(pattern)
frame1.loc[frame1["col2"] == True]
col1 col2
0 foobar True
3 ackfin True
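Since the requirement is specifically an ends-with match, one stricter alternative to `str.contains` (my suggestion, not from the answers above) is `str.endswith`, which accepts a tuple of suffixes and avoids regex metacharacter issues:

```python
import pandas as pd

df = pd.DataFrame({'col1': ['foobar', 'acksyn', 'foobaz', 'ackfin']})
df2 = pd.DataFrame({'col1': ['old', 'fin', 'new', 'bar']})

# str.endswith takes a tuple of plain-string suffixes (no regex involved)
mask = df['col1'].str.endswith(tuple(df2['col1']))
result = df[~mask]  # keep only rows that do NOT end in any df2 value
print(result)
#      col1
# 1  acksyn
# 2  foobaz
```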

How to apply multi logic conditions between two dataframes?

I have two dataframes like so:
df1
email datetimecreated
1#1.com 2019-02-07 20:47:00
df2
email datetimecreated
1#1.com 2019-02-12 20:47:00
I want to create the following logic:
1. Check if the email field in df2 is present in df1.
If the email address is present, then:
2. Check if DatetimeCreated in df1 is greater than TODAY - 90 days.
If both are TRUE, do not append the row from df2 into df1.
I am doing the date check like this:
from datetime import datetime, timedelta
df1 = df1[df1.DateTimeCreated >= (datetime.today() - timedelta(90))]
I am doing the email check like this:
boolean = session_final_101.EmailAddress.iat[0] in df3
How do I combine the two statements?
I tried this:
if boolean == False:
    Winner_final = pd.concat([df1, df2], ignore_index=True, sort=False)
but the boolean variable is always False, and I am not sure what I am doing wrong.
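One way the two checks could be combined (a sketch, assuming the frames look like the samples above; note that `x in df` tests column names, which is why the boolean stays False — membership should be tested against the column's values):

```python
import pandas as pd
from datetime import datetime, timedelta

df1 = pd.DataFrame({'email': ['1#1.com'],
                    'datetimecreated': pd.to_datetime(['2019-02-07 20:47:00'])})
df2 = pd.DataFrame({'email': ['1#1.com'],
                    'datetimecreated': pd.to_datetime(['2019-02-12 20:47:00'])})

# Check 1: the df2 email exists among df1's email VALUES (note .values, not the frame)
present = df2['email'].iat[0] in df1['email'].values

# Check 2: df1's datetimecreated is within the last 90 days
recent = df1['datetimecreated'].iat[0] >= datetime.today() - timedelta(days=90)

# Append only when NOT both are true
if not (present and recent):
    df1 = pd.concat([df1, df2], ignore_index=True)
```

With these 2019 sample dates the 90-day check fails when run today, so the row is appended; at the time of the question it would not have been.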

Pandas data frame merge select columns

I want to get data from only df2 (all columns) by comparing the 'no' field in both df1 and df2.
My 3-line code is below; with it I'm getting all columns from df1 and df2 and am not able to trim the fields from df1. How do I achieve this?
I have 2 pandas dataframes like below:
df1:
no,name,salary
1,abc,100
2,def,105
3,abc,110
4,def,115
5,abc,120
df2:
no,name,salary,dept,addr
1,abc,100,IT1,ADDR1
2,abc,101,IT2,ADDR2
3,abc,102,IT3,ADDR3
4,abc,103,IT4,ADDR4
5,abc,104,IT5,ADDR5
6,abc,105,IT6,ADDR6
7,abc,106,IT7,ADDR7
8,abc,107,IT8,ADDR8
df1 = pd.read_csv("D:\\data\\data1.csv")
df2 = pd.read_csv("D:\\data\\data2.csv")
resDF = pd.merge(df1, df2, on='no' , how='inner')
I think you need to filter only the no column; then the on and how parameters are not necessary:
resDF = pd.merge(df1[['no']], df2)
Or use boolean indexing with filtering by isin:
resDF = df2[df2['no'].isin(df1['no'])]
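For example, with inline frames standing in for the two CSVs above, the isin filter returns only df2's rows for the shared no values, keeping all of df2's columns:

```python
import pandas as pd

# Inline stand-ins for data1.csv and data2.csv from the question
df1 = pd.DataFrame({'no': [1, 2, 3, 4, 5],
                    'name': ['abc', 'def', 'abc', 'def', 'abc'],
                    'salary': [100, 105, 110, 115, 120]})
df2 = pd.DataFrame({'no': list(range(1, 9)),
                    'name': ['abc'] * 8,
                    'salary': list(range(100, 108)),
                    'dept': ['IT%d' % i for i in range(1, 9)],
                    'addr': ['ADDR%d' % i for i in range(1, 9)]})

# Keep only df2 rows whose 'no' also appears in df1; all df2 columns survive
resDF = df2[df2['no'].isin(df1['no'])]
print(resDF)  # rows with no = 1..5, columns no, name, salary, dept, addr
```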

How to do multi logic value comparisons between dataframes?

I have two dataframes like so:
df1:
Email DateTimeCompleted
2#2.com 2019-02-09T01:34:44.591Z
df2:
Email DateTimeCompleted
b#b.com 2019-01-29T01:34:44.591Z
2#2.com 2018-01-29T01:34:44.591Z
How do I look up the Email value in df2, check whether DateTimeCompleted is greater than TODAY minus 90 days, and append the df1 row data into df2? To add to that, sometimes df2 can be empty, if that makes a difference.
df2 updated would look like this:
Email DateTimeCompleted
b#b.com 2019-01-29T01:34:44.591Z
2#2.com 2018-01-29T01:34:44.591Z
2#2.com 2019-02-09T01:34:44.591Z
I tried this:
from datetime import date
if df1.Email in df2.Email & df2.DateTimeCompleted >= date.today()-90 :
    print('true')
i get error:
TypeError: 'Series' objects are mutable, thus they cannot be hashed
Also tried:
if df2.Email.str.contains(df1.Email.iat[0]):
    print('true')
got error:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
You can do the following:
1. Merge the two dataframes on the key column Email so you know which rows exist in both dataframes.
2. Filter the rows whose DateTimeCompleted is greater than today - 90 days.
3. Concat the dataframes to final with pd.concat
Code:
from datetime import datetime as dt
from datetime import timedelta
import pandas as pd

# Merge dataframes together
df3 = pd.merge(df1, df2, on=['Email'], suffixes=['', '_2'])
# Filter the rows
df3 = df3[df3.DateTimeCompleted > (dt.today() - timedelta(90))]
# Drop the column we dont need
df3.drop(['DateTimeCompleted_2'], axis=1, inplace=True)
# Create final dataframe by concatting
df_final = pd.concat([df2, df3], ignore_index=True)
Email DateTimeCompleted
0 b#b.com 2019-01-29 01:34:44.591
1 2#2.com 2018-01-29 01:34:44.591
2 2#2.com 2019-02-09 01:34:44.591
I wrote a function to do the following.
The function takes the arguments mailid, dataframe1, dataframe2:
def process(mailid, df1, df2):
    if mailid in df2.Email.values:
        b = df1.loc[df1.Email == mailid, "DateTimeCompleted"].head(1)
        if (not b.empty) and ((pd.to_datetime('today') - pd.to_datetime(b.iloc[0])).days > 90):
            df1 = pd.concat([df1, pd.DataFrame([[mailid, b.iloc[0]]], columns=['Email', 'DateTimeCompleted'])], axis=0)
            print("Added the row")
        else:
            print("Condition failed")
            print("False")
    else:
        print("The mail is not there in the dataframe")
    return df1

Which row is extra in a dataframe?

I have two dataframes that contain daily end-of-day market data. They are supposed to contain identical starting dates, ending dates, and numbers of rows, but when I print the len of each, one is bigger than the other by one:
DF1
close
date
2008-01-01 45.92
2008-01-02 45.16
2008-01-03 45.33
2008-01-04 42.09
2008-01-07 46.98
...
[2870 rows x 1 columns]
DF2
close
date
2008-01-01 60.48
2008-01-02 59.71
2008-01-03 58.43
2008-01-04 56.64
2008-01-07 56.98
...
[2871 rows x 1 columns]
How can I show which row either:
has a duplicate row,
or has an extra date
so that I can delete the [probable] weekend/holiday date row that is in DF2 but not in DF1?
I have tried things like:
df1 = df1.drop_duplicates(subset='date', keep='first')
df2 = df1.drop_duplicates(subset='date', keep='first')
but can't get it to work [ValueError: not enough values to unpack (expected 2, got 0)].
Extra:
How do I remove weekend dates from a dataframe?
Maybe using .loc:
DF2 = DF2.loc[DF1.index]
To check the index difference between DF1 and DF2:
DF2.index.difference(DF1.index)
To check whether DF2 has a duplicate index:
DF2[DF2.index.duplicated(keep=False)]
To check for weekends:
df.index.day_name().isin(['Sunday', 'Saturday'])
To fix your code:
df1 = df1.reset_index().drop_duplicates(subset='date', keep='first').set_index('date')
df2 = df2.reset_index().drop_duplicates(subset='date', keep='first').set_index('date')
Also, for dropping duplicates I recommend duplicated:
df2 = df2[~df2.index.duplicated()]
About the business days:
def B_day(date):
    return bool(len(pd.bdate_range(date, date)))
df.index.map(B_day)
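Putting these pieces together on two small synthetic frames (dates chosen so DF2 has one extra weekend row; 2008-01-05 was a Saturday):

```python
import pandas as pd

# DF2 has one extra (weekend) date relative to DF1
idx1 = pd.to_datetime(['2008-01-02', '2008-01-03', '2008-01-04', '2008-01-07'])
idx2 = pd.to_datetime(['2008-01-02', '2008-01-03', '2008-01-04', '2008-01-05', '2008-01-07'])
DF1 = pd.DataFrame({'close': [45.16, 45.33, 42.09, 46.98]}, index=idx1)
DF2 = pd.DataFrame({'close': [59.71, 58.43, 56.64, 57.00, 56.98]}, index=idx2)

# Which date is in DF2 but not in DF1?
extra = DF2.index.difference(DF1.index)
print(extra)  # DatetimeIndex(['2008-01-05'], ...)

# Drop weekend rows from DF2
DF2 = DF2[~DF2.index.day_name().isin(['Saturday', 'Sunday'])]
```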
