I have two dataframes like so:
df1:
email    datetimecreated
1#1.com  2019-02-07 20:47:00
df2:
email    datetimecreated
1#1.com  2019-02-12 20:47:00
I want to create the following logic:
1. Check whether the email field in df2 is present in df1.
2. If the email address is present, check whether DateTimeCreated in df1 is greater than TODAY - 90 days.
If both are TRUE, do not append the row from df2 into df1.
I am doing the date check like this:
from datetime import datetime, timedelta
df1 = df1[df1.DateTimeCreated >= (datetime.today() - timedelta(90))]
I am doing the email check like this:
boolean = session_final_101.EmailAddress.iat[0] in df3
How do I combine the two statements?
I tried this:
if boolean == False:
    Winner_final = pd.concat([df1, df2], ignore_index=True, sort=False)
but the boolean variable is always False and I am not sure what I am doing wrong.
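One note: if df3 is a DataFrame, the expression `x in df3` tests the column labels, not the values, which would keep the flag False. The two checks can be combined into a single boolean mask; a minimal sketch, assuming the columns are named EmailAddress and DateTimeCreated:

```python
import pandas as pd
from datetime import datetime, timedelta

df1 = pd.DataFrame({'EmailAddress': ['1#1.com'],
                    'DateTimeCreated': [pd.Timestamp('2019-02-07 20:47:00')]})
df2 = pd.DataFrame({'EmailAddress': ['1#1.com', '2#2.com'],
                    'DateTimeCreated': [pd.Timestamp('2019-02-12 20:47:00'),
                                        pd.Timestamp('2018-11-13 20:47:00')]})

cutoff = datetime.today() - timedelta(days=90)
# Emails that already have a recent-enough record in df1
recent = set(df1.loc[df1.DateTimeCreated >= cutoff, 'EmailAddress'])
# Append only the df2 rows that do NOT satisfy both conditions
winner_final = pd.concat([df1, df2[~df2.EmailAddress.isin(recent)]],
                         ignore_index=True, sort=False)
```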
Related
I have a client data df with 200+ columns, say A, B, C, D, ..., X, Y, Z. One column in this df holds CAMPAIGN_ID. I have another dataframe, mapping_csv, that maps each CAMPAIGN_ID to the set of columns I need from df. I need to split df into one CSV file per campaign, containing the rows for that campaign and only the columns listed in mapping_csv.
I am getting type error as below.
TypeError: unhashable type: 'list'
This is what I tried.
for campaign in df['CAMPAIGN_ID'].unique():
    df2 = df[df['CAMPAIGN_ID'] == campaign]
    # remove blank columns
    df2.dropna(how='all', axis=1, inplace=True)
    for column in df2.columns:
        if df2[column].unique()[0] == "0000-00-00" and df2[column].unique().shape[0] == 1:
            df2 = df2.drop(column, axis=1)
    for column in df2.columns:
        if df2[column].unique()[0] == '0' and df2[column].unique().shape[0] == 1:
            df2 = df2.drop(column, axis=1)
    # select required columns
    df2 = df2[mapping_csv.loc[mapping_csv['CAMPAIGN_ID'] == campaign, 'Variable_List'].str.replace(" ", "").str.split(",")]
    file_shape = df2.shape[0]
    filename = "cart_" + str(dt.date.today().strftime('%Y%m%d')) + "_" + campaign + "_rowcnt_" + str(file_shape)
    df2.to_csv(filename + ".csv", index=False)
Any help will be appreciated.
This is how the data looks -
This is how the mapping looks -
This addresses your core problem.
df = pd.DataFrame(dict(id=['foo','foo','bar','bar',],a=[1,2,3,4,], b=[5,6,7,8], c=[1,2,3,4]))
mapper = dict(foo=['a','b'], bar=['b','c'])
for each_id in df.id.unique():
    # .str methods inside query require the python engine
    df_id = df.query(f'id.str.contains("{each_id}")', engine='python').loc[:, mapper[each_id]]
    print(df_id)
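As for the TypeError itself: `.str.split(",")` returns a Series whose single element is a list, and using that Series as a column indexer is what raises "unhashable type: 'list'". A sketch of the fix, with made-up sample data and assuming mapping_csv has one row per CAMPAIGN_ID:

```python
import pandas as pd

df = pd.DataFrame({'CAMPAIGN_ID': ['c1', 'c1', 'c2'],
                   'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
mapping_csv = pd.DataFrame({'CAMPAIGN_ID': ['c1', 'c2'],
                            'Variable_List': ['A, B', 'B, C']})

frames = {}
for campaign in df['CAMPAIGN_ID'].unique():
    sub = df[df['CAMPAIGN_ID'] == campaign]
    # .iloc[0] pulls the single list out of the Series,
    # so the indexer is a plain list of column names
    cols = (mapping_csv.loc[mapping_csv['CAMPAIGN_ID'] == campaign,
                            'Variable_List']
            .str.replace(' ', '').str.split(',').iloc[0])
    frames[campaign] = sub[cols]
    # frames[campaign].to_csv(f'cart_{campaign}.csv', index=False)
```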
I have 2 Pandas Dataframes with 5 columns and about 1000 rows each (working with python3).
I'm interested in making a comparison between the first column in df1 and the first column of df2 as follows:
DF1
[index] [col1]
1 "foobar"
2 "acksyn"
3 "foobaz"
4 "ackfin"
... ...
DF2
[index] [col1]
1 "old"
2 "fin"
3 "new"
4 "bar"
... ...
What I want to achieve is this: for each row of DF1, if DF1.col1 ends in any values of DF2.col1, drop the row.
In this example the resulting DF1 should be:
DF1
[index] [col1]
2 "acksyn"
3 "foobaz"
... ...
(see how DF2 indexes 2 and 4 are the final parts of DF1 indexes 4 and 1, respectively)
I tried using an internally defined function like:
def check_presence(df1_col1, second_csv):
    for index, row in second_csv.iterrows():
        search_string = "(?P<first_group>^(" + some_string + "))(?P<the_rest>" + row["col1"] + "$)"
        if re.search(search_string, df1_col1):
            return True
    return False
and instructions with this format:
indexes = csv[csv.col1.str.contains(some_regex, regex= True, na=False)].index
but in both cases the Python console complains about not being able to compare non-string objects with a string.
What am I doing wrong? I could even try a solution after joining the 2 CSVs, but I think I would need to do the same thing in the end.
Thanks for your patience, I'm new to Python...
You will need to join your keywords in df2 first if you want to use str.contains method.
import pandas as pd
df = pd.DataFrame({'col1': {0: 'foobar', 1: 'acksyn', 2: 'foobaz', 3: 'ackfin'}})
df2 = pd.DataFrame({'col1': {0: 'old', 1: 'fin', 2: 'new', 3: 'bar'}})
print(df["col1"].str.contains("|".join(df2["col1"])))

0     True
1    False
2    False
3     True
Name: col1, dtype: bool
Possible Solution
"" for each row of DF1, if DF1.col1 ends in any values of DF2.col1, drop the row.""
This is a one-liner if I understand properly:
# Search for Substring
# Generate an "OR" statement with a join
# Drop if match.
df[~df.col1.str.contains('|'.join(df2.col1.values))]
This will keep only the rows where DF2.Col1 is NOT found in DF1.Col1.
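Since the requirement is "ends in" rather than "contains anywhere", the alternation can also be anchored with `$` so only true suffixes match; a sketch with the same sample data:

```python
import pandas as pd
import re

df = pd.DataFrame({'col1': ['foobar', 'acksyn', 'foobaz', 'ackfin']})
df2 = pd.DataFrame({'col1': ['old', 'fin', 'new', 'bar']})

# Anchor the alternation with $ so only true suffixes match;
# re.escape guards against regex metacharacters in the values
pattern = '(?:' + '|'.join(map(re.escape, df2['col1'])) + ')$'
result = df[~df['col1'].str.contains(pattern)]
# result keeps 'acksyn' and 'foobaz'
```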
pd.Series.str.contains
Take your frames
frame1 = pd.DataFrame({"col1": ["foobar", "acksyn", "foobaz", "ackfin"]})
frame2 = pd.DataFrame({"col1": ["old", "fin", "new", "bar"]})
Then
myList = frame2.col1.values
pattern = '|'.join(myList)
Finally
frame1["col2"]=frame1["col1"].str.contains(pattern)
frame1.loc[frame1["col2"]==True]
col1 col2
0 foobar True
3 ackfin True
I am having an issue comparing dates between two dataframes inside a multi-condition statement.
df1:
EmailAddress DateTimeCreated
1#1 2019-02-12 20:47:00
df2:
EmailAddress DateTimeCreated
1#1.com 2019-02-07 20:47:00
2#2.com 2018-11-13 20:47:00
3#3.com 2018-11-04 20:47:00
I want to do three things whenever there is a row in df1:
1. Check whether `EmailAddress` from df1 is present in df2.
2. If `EmailAddress` is present, compare `DateTimeCreated` in df1 to `DateTimeCreated` in df2.
3. If `DateTimeCreated` in df1 is greater than today - 90 days, append the df1 row into df2.
In simpler words: I want to see whether the email address is present in df2 and, if it is, check `DateTimeCreated` in df2 to see whether it has been more than today - 90 days since the last time the person answered. If it has been more than 90 days, append the row from df1 into df2.
My logic is appending everything and I am not sure what I am doing wrong:
import pandas as pd
from datetime import datetime, timedelta
df2.append(df2.loc[df2.EmailAddress.isin(df1.EmailAddress)&(df2.DateTimeCreated.ge(datetime.today() - timedelta(90)))])
What am I doing wrong to mess up on the date?
EDIT:
In the above example, the row from df1 would not be appended, because its DateTimeCreated falls within TODAY() - 90 days.
Please refer to the inline comments for the explanation. Note that you need to rename your df1 columns to match the df2 columns in this solution.
import pandas as pd
from datetime import timedelta, datetime

df1 = pd.DataFrame({'EmailAddress': ['2#2.com'],
                    'DateTimeCreated': [datetime(2019, 2, 12, 20, 47, 0)]})
df2 = pd.DataFrame({'EmailAddress': ['1#1.com', '2#2.com', '3#3.com'],
                    'DateTimeCreated': [
                        datetime(2019, 2, 7, 20, 47, 0),
                        datetime(2018, 11, 13, 20, 47, 0),
                        datetime(2019, 11, 4, 20, 47, 0)]})
# Get all expired rows
df3 = df2.loc[datetime.now() - df2['DateTimeCreated'] > timedelta(days=90)]
# Update it with the timestamp from df1
df3 = df3.set_index('EmailAddress').join(df1.set_index('EmailAddress'), how='inner', rsuffix='_r')
df3.drop('DateTimeCreated', axis=1, inplace=True)
df3.columns = ['DateTimeCreated']
# Patch df2 with the latest timestamp
df2 = df3.combine_first(df2.set_index('EmailAddress')).reset_index()
# Patch again for rows in df1 that are not in df2
df1 = df1.loc[df1['EmailAddress'].apply(lambda x: 1 if x not in df2['EmailAddress'].tolist() else 0) == 1]
df2 = pd.concat([df2, df1])
>>> df2
EmailAddress DateTimeCreated
0 1#1.com 2019-02-07 20:47:00
1 2#2.com 2019-02-12 20:47:00
2 3#3.com 2019-11-04 20:47:00
Try:
1. Left join df1 and df2 on the email address (condition 1):
combined_df = df1.set_index("EmailAddress").join(df2.set_index("EmailAddress"), how="left", lsuffix="_df1", rsuffix="_df2")
2. Calculate the gap between the df1 DateTimeCreated and today:
gap = pd.Timestamp.today() - combined_df.DateTimeCreated_df1
3. Build the mask for rows whose gap is greater than 90 days (the comparison must be against a Timedelta, not a plain integer):
mask = gap > pd.Timedelta(days=90)
4. Append:
df2.append(df1[mask])
Note: I think you may need the combined_df only; the 4th append step could lead to duplicated or confusing data. Anyway, you can use steps 1, 2, 3, 4 or only steps 1, 2, 3.
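One detail worth illustrating: the gap computed in step 2 is a Series of Timedelta values, so the 90-day threshold has to be a Timedelta as well; a minimal illustration:

```python
import pandas as pd

created = pd.to_datetime(pd.Series(['2019-02-12 20:47:00',
                                    '2018-11-13 20:47:00']))
gap = pd.Timestamp.today() - created
# gap > 90 would raise a TypeError in recent pandas;
# compare against a Timedelta instead
mask = gap > pd.Timedelta(days=90)
```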
I want to get data from only df2 (all columns) by comparing the 'no' field in both df1 and df2.
My 3-line code is below; with it I'm getting all columns from df1 and df2 and am not able to trim the df1 fields. How do I achieve this?
I've 2 pandas dataframes like below :
df1:
no,name,salary
1,abc,100
2,def,105
3,abc,110
4,def,115
5,abc,120
df2:
no,name,salary,dept,addr
1,abc,100,IT1,ADDR1
2,abc,101,IT2,ADDR2
3,abc,102,IT3,ADDR3
4,abc,103,IT4,ADDR4
5,abc,104,IT5,ADDR5
6,abc,105,IT6,ADDR6
7,abc,106,IT7,ADDR7
8,abc,107,IT8,ADDR8
df1 = pd.read_csv("D:\\data\\data1.csv")
df2 = pd.read_csv("D:\\data\\data2.csv")
resDF = pd.merge(df1, df2, on='no' , how='inner')
I think you need to filter only the no column; then the on and how parameters are not necessary:
resDF = pd.merge(df1[['no']], df2)
Or use boolean indexing with filtering by isin:
resDF = df2[df2['no'].isin(df1['no'])]
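Both variants can be checked quickly on the sample data; each keeps all five df2 columns and only the matching no values. A sketch reading the CSVs from strings instead of the D:\data files:

```python
import pandas as pd
from io import StringIO

df1 = pd.read_csv(StringIO("no,name,salary\n1,abc,100\n2,def,105\n3,abc,110"))
df2 = pd.read_csv(StringIO(
    "no,name,salary,dept,addr\n"
    "1,abc,100,IT1,ADDR1\n"
    "2,abc,101,IT2,ADDR2\n"
    "6,abc,105,IT6,ADDR6\n"))

res_merge = pd.merge(df1[['no']], df2)        # merge on the shared 'no' column
res_isin = df2[df2['no'].isin(df1['no'])]     # boolean indexing
# Both keep all df2 columns for no in {1, 2}
```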
I have two dataframes like so:
df1:
Email DateTimeCompleted
2#2.com 2019-02-09T01:34:44.591Z
df2:
Email DateTimeCompleted
b#b.com 2019-01-29T01:34:44.591Z
2#2.com 2018-01-29T01:34:44.591Z
How do I look up the Email value in df2, check whether DateTimeCompleted is greater than TODAY minus 90 days, and append the df1 row data into df2? To add: sometimes df2 can be empty, if that makes a difference.
df2 updated would look like this:
Email DateTimeCompleted
b#b.com 2019-01-29T01:34:44.591Z
2#2.com 2018-01-29T01:34:44.591Z
2#2.com 2019-02-09T01:34:44.591Z
I tried this:
from datetime import date
if df1.Email in df2.Email & df2.DateTimeCompleted >= date.today()-90 :
print('true')
I get this error:
TypeError: 'Series' objects are mutable, thus they cannot be hashed
Also tried:
if df2.Email.str.contains(df1.Email.iat[0]):
print('true')
I got this error:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
You can do the following:
1. Merge the two dataframes on the key column Email so you know which rows exist in both dataframes.
2. Filter the rows whose DateTimeCompleted is greater than today - 90 days.
3. Concat the dataframes into the final result with pd.concat.
Code:
from datetime import datetime as dt, timedelta

# Merge dataframes together
df3 = pd.merge(df1, df2, on=['Email'], suffixes=['', '_2'])
# Filter the rows (assumes DateTimeCompleted was parsed with pd.to_datetime)
df3 = df3[df3.DateTimeCompleted > (dt.today() - timedelta(90))]
# Drop the column we don't need
df3.drop(['DateTimeCompleted_2'], axis=1, inplace=True)
# Create the final dataframe by concatting
df_final = pd.concat([df2, df3], ignore_index=True)
Email DateTimeCompleted
0 b#b.com 2019-01-29 01:34:44.591
1 2#2.com 2018-01-29 01:34:44.591
2 2#2.com 2019-02-09 01:34:44.591
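The question also notes that df2 can sometimes be empty; a simple guard around the same merge-filter-concat flow handles that case. A sketch, with the date strings parsed via pd.to_datetime:

```python
import pandas as pd
from datetime import datetime, timedelta

df1 = pd.DataFrame({'Email': ['2#2.com'],
                    'DateTimeCompleted': pd.to_datetime(['2019-02-09T01:34:44.591Z'])})
df2 = pd.DataFrame(columns=['Email', 'DateTimeCompleted'])  # the empty case

if df2.empty:
    # Nothing to compare against, so every df1 row is new
    df_final = df1.copy()
else:
    df3 = pd.merge(df1, df2, on=['Email'], suffixes=['', '_2'])
    df3 = df3[df3.DateTimeCompleted > (datetime.today() - timedelta(days=90))]
    df3 = df3.drop(columns=['DateTimeCompleted_2'])
    df_final = pd.concat([df2, df3], ignore_index=True)
```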
I wrote a function to do the following. The function takes the arguments mailid, dataframe1, dataframe2:
def process(mailid, df1, df2):
    if mailid in df2.Email.values:
        b = df1.loc[df1.Email == mailid, "DateTimeCompleted"].head(1)
        if not b.empty and (pd.to_datetime('today') - pd.to_datetime(b.iloc[0])).days > 90:
            df1 = pd.concat([df1, pd.DataFrame([[mailid, b.iloc[0]]], columns=['Email', 'DateTimeCompleted'])], axis=0)
            print("Added the row")
        else:
            print("Condition failed")
    else:
        print("The mail is not there in dataframe")
    return df1