Which row is extra in a dataframe? - python-3.x

I have two dataframes that contain daily end-of-day market data. They are supposed to have identical start dates, end dates, and numbers of rows, but when I print the length of each, one is longer than the other by one row:
DF1
close
date
2008-01-01 45.92
2008-01-02 45.16
2008-01-03 45.33
2008-01-04 42.09
2008-01-07 46.98
...
[2870 rows x 1 columns]
DF2
close
date
2008-01-01 60.48
2008-01-02 59.71
2008-01-03 58.43
2008-01-04 56.64
2008-01-07 56.98
...
[2871 rows x 1 columns]
How can I show which row either has a duplicate date or is an extra date, so that I can delete the [probable] weekend/holiday row that is in DF2 but not in DF1?
I have tried things like:
df1 = df1.drop_duplicates(subset='date', keep='first')
df2 = df2.drop_duplicates(subset='date', keep='first')
but can't get it to work [ValueError: not enough values to unpack (expected 2, got 0)].
Extra:
How do I remove weekend dates from a dataframe?

You may use .loc:
DF2 = DF2.loc[DF1.index]
To check the index difference between DF1 and DF2:
DF2.index.difference(DF1.index)
To check whether DF2 has a duplicated index:
DF2[DF2.index.duplicated(keep=False)]
To check for weekends:
df.index.weekday_name.isin(['Sunday','Saturday'])
To fix your code (date is the index, so reset it first, then set it back):
df1 = df1.reset_index().drop_duplicates(subset='date', keep='first').set_index('date')
df2 = df2.reset_index().drop_duplicates(subset='date', keep='first').set_index('date')
Also, for this I recommend duplicated:
df2 = df2[~df2.index.duplicated(keep='first')]
About the business days:
def B_day(date):
    return bool(len(pd.bdate_range(date, date)))

df.index.map(B_day)
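Putting these pieces together, a minimal sketch (assuming both frames are indexed by date as in the printouts above; DF1 and DF2 are the frames from the question):
import pandas as pd

# Date(s) present in DF2 but missing from DF1
extra_dates = DF2.index.difference(DF1.index)
print(extra_dates)

# Keep only the first occurrence of any duplicated date
DF2 = DF2[~DF2.index.duplicated(keep='first')]

# Drop Saturdays and Sundays (weekday 5 and 6)
DF2 = DF2[DF2.index.weekday < 5]

# Finally, align DF2 to the dates present in DF1
DF2 = DF2.loc[DF2.index.intersection(DF1.index)]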

Related

How do I give column names for the reduce way of merging data frames?

I have two dfs, df1 and df2:
dfs=[df1,df2]
df_final = reduce(lambda left,right: pd.merge(left,right,on='Serial_Nbr'), dfs)
I want to select only one column apart from the merge column Serial_Nbr in df1 while doing the merge.
How do I do this?
Filter the columns in df1:
dfs=[df1[['Serial_Nbr']],df2]
Or, if there are only 2 DataFrames, remove reduce:
df_final = pd.merge(df1[['Serial_Nbr']], df2, on='Serial_Nbr')
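If you also want to keep one extra column from df1, the reduce pattern from the question still works once df1 is trimmed; a small sketch, where Some_Col is a hypothetical column name standing in for the one you want to keep:
from functools import reduce
import pandas as pd

# Keep only the join key plus the single extra column from df1
dfs = [df1[['Serial_Nbr', 'Some_Col']], df2]

df_final = reduce(lambda left, right: pd.merge(left, right, on='Serial_Nbr'), dfs)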

Get only rows of dataframe where a subset of columns exist in another dataframe

I want to get all rows of a dataframe (df2) where the city column value and postcode column value also exist in another dataframe (df1).
The important point here is that I want the combination of both columns, not each column looked at individually.
My approach was this:
#1. Get all combinations
df_combinations=np.array(df1.select("Ort","Postleitzahl").dropDuplicates().collect())
sc.broadcast(df_combinations)
#2.Define udf
def combination_in_vx(ort, plz):
    for arr_el in df_combinations:
        if str(arr_el[0]) == ort and int(arr_el[1]) == plz:
            return True
    return False
combination_in_vx = udf(combination_in_vx, BooleanType())
#3.
df_tmp=df_2.withColumn("Combination_Exists", combination_in_vx('city','postcode'))
df_result=df_tmp.filter(df_tmp.Combination_Exists)
Although this theoretically works, it takes forever!
Does anybody know about a better solution here? Thank you very much!
You can do a left semi join using the two columns. This keeps the rows of df2 whose combination of values in the two specified columns exists in df1:
import pyspark.sql.functions as F
df_result = df2.join(df1, ["Ort", "Postleitzahl"], 'left_semi')
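A self-contained sketch of the same idea with toy data (the column names are taken from the question, the values are made up, and df2's columns are assumed to have been renamed to match df1's):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame(
    [("Berlin", 10115), ("Hamburg", 20095)],
    ["Ort", "Postleitzahl"])
df2 = spark.createDataFrame(
    [("Berlin", 10115, "keep"), ("Muenchen", 80331, "drop")],
    ["Ort", "Postleitzahl", "payload"])

# left_semi keeps only the df2 rows whose (Ort, Postleitzahl) pair exists in df1
df_result = df2.join(df1, ["Ort", "Postleitzahl"], "left_semi")
df_result.show()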

How to compare multiple columns in two tables and find out the duplicates?

I have two dataframes:
Dataframe 1
Dataframe 2
The ID column is not unique in the two tables. I want to compare all the columns in both tables except the IDs and print the unique rows.
Expected output
I tried the isin function, but it is not working. Each dataframe has 150000 rows and I removed duplicates in both tables. Please advise how to do this.
You can use df.append to combine the dataframes, then use df.duplicated, which will flag the duplicates.
df3 = df1.append(df2, ignore_index=True)
df4 = df3.duplicated(subset=['Team', 'name', 'Country', 'Token'], keep=False)
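To keep only the rows that do not appear in both frames (the unique rows asked for), that mask can be inverted; a small sketch using pd.concat in place of append, with the same hypothetical column names as above:
import pandas as pd

# Stack both frames, ignoring the non-unique ID column
df3 = pd.concat([df1, df2], ignore_index=True)

# keep=False marks every member of a duplicate group
dup_mask = df3.duplicated(subset=['Team', 'name', 'Country', 'Token'], keep=False)

# Rows that occur in only one of the two frames
unique_rows = df3[~dup_mask]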

Python: DataFrame Index shifting

I have several dataframes that I have concatenated with pandas in the line:
xspc = pd.concat([df1,df2,df3], axis = 1, join_axes = [df3.index])
In df2 the index values read one day later than the values of df1 and df3. So, for instance, when the most recent date is 7/1/19, the index values for df1 and df3 will read '7/1/19' while df2 reads '7/2/19'. I would like to concatenate the series so that each dataframe is joined on its most recent date; in other words, I would like all the dataframe values from df1 index value '7/1/19' to be concatenated with dataframe 2 index value '7/2/19' and dataframe 3 index value '7/1/19'. What methods can I use to shift the data around to join on these non-matching index values?
You can reset the index of each data frame and then concat the dataframes:
df1=df1.reset_index()
df2=df2.reset_index()
df3=df3.reset_index()
df_final = pd.concat([df1,df2,df3],axis=1, join_axes=[df3.index])
This should work since you mentioned that the date in df2 will be one day after df1 or df3
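An alternative, if the one-day offset in df2 is systematic, is to shift df2's dates back by a day before concatenating so all three frames align on the same index; a hedged sketch under that assumption:
import pandas as pd

# Shift df2's dates back by one day so they line up with df1 and df3
df2_shifted = df2.copy()
df2_shifted.index = df2_shifted.index - pd.Timedelta(days=1)

# Now an ordinary index-aligned concat works, restricted to df3's dates
xspc = pd.concat([df1, df2_shifted, df3], axis=1).reindex(df3.index)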

How to compare datetime between dataframes in multi logic statements?

I am having an issue comparing dates between two dataframes inside a multi-logic statement.
df1:
EmailAddress DateTimeCreated
1#1 2019-02-12 20:47:00
df2:
EmailAddress DateTimeCreated
1#1.com 2019-02-07 20:47:00
2#2.com 2018-11-13 20:47:00
3#3.com 2018-11-04 20:47:00
I want to do three things, whenever there is a row in df1:
1. Compare to see if `EmailAddress` from df1 is present in df2:
1a. If `EmailAddress` is present, compare `DateTimeCreated` in df1 to `DateTimeCreated` in df2,
2. If `DateTimeCreated` in df1 is greater than today - 90 days, append the df1 row into df2.
In simpler words:
I want to see whether the email address is present in df2 and, if it is, compare DateTimeCreated in df2 to see whether it has been more than today - 90 days since the last time the person answered. If it has been more than 90 days, then append the row from df1 into df2.
My logic is appending everything, and I am not sure what I am doing wrong:
import pandas as pd
from datetime import datetime, timedelta
df2.append(df2.loc[df2.EmailAddress.isin(df1.EmailAddress)&(df2.DateTimeCreated.ge(datetime.today() - timedelta(90)))])
What am I doing wrong with the date comparison?
EDIT:
In the above example, the row from df1 would not be appended, because its DateTimeCreated falls within today - 90 days.
Please refer to the inline comments for the explanation. Note that you need to rename your df1 columns to match the df2 columns in this solution.
import pandas as pd
from datetime import timedelta, datetime

df1 = pd.DataFrame({'EmailAddress': ['2#2.com'],
                    'DateTimeCreated': [datetime(2019, 2, 12, 20, 47, 0)]})
df2 = pd.DataFrame({'EmailAddress': ['1#1.com', '2#2.com', '3#3.com'],
                    'DateTimeCreated': [datetime(2019, 2, 7, 20, 47, 0),
                                        datetime(2018, 11, 13, 20, 47, 0),
                                        datetime(2019, 11, 4, 20, 47, 0)]})
# Get all expired rows
df3 = df2.loc[datetime.now() - df2['DateTimeCreated'] > timedelta(days=90)]
# Update it with the timestamp from df1
df3 = df3.set_index('EmailAddress').join(df1.set_index('EmailAddress'), how='inner', rsuffix='_r')
df3.drop('DateTimeCreated', axis=1, inplace=True)
df3.columns = ['DateTimeCreated']
# Patch df2 with the latest timestamp
df2 = df3.combine_first(df2.set_index('EmailAddress')).reset_index()
# Patch again for rows in df1 that are not in df2
df1 = df1.loc[df1['EmailAddress'].apply(lambda x: 1 if x not in df2['EmailAddress'].tolist() else 0) == 1]
df2 = pd.concat([df2, df1])
>>>df2
EmailAddress DateTimeCreated
0 1#1.com 2019-02-07 20:47:00
1 2#2.com 2019-02-12 20:47:00
2 3#3.com 2019-11-04 20:47:00
Try:
1. Left join df1 and df2 on the email address (condition 1):
combined_df = df1.join(df2.set_index('EmailAddress'), on='EmailAddress', how='left', lsuffix='_df1', rsuffix='_df2')
2. Calculate the gap between df1's DateTimeCreated and today:
gap = pd.Timestamp.today() - combined_df.DateTimeCreated_df1
3. Build a mask for the rows whose gap is greater than 90 days:
mask = gap > pd.Timedelta(days=90)
4. Append those rows from df1 to df2:
df2.append(df1[mask])
Note: I think you may need combined_df only; the append in step 4 could lead to duplicated or confusing data. Anyway, you can use steps 1, 2, 3, 4 or only steps 1, 2, 3.
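A fuller hedged sketch of the logic described in the question (merge on EmailAddress, then append a df1 row only when the matching df2 entry is older than 90 days; the column names come from the question):
import pandas as pd

# 1. Inner-merge on the email address, so only people present in both frames are considered
combined = df1.merge(df2, on='EmailAddress', suffixes=('_df1', '_df2'))

# 2. How long ago was the last answer recorded in df2?
gap = pd.Timestamp.today() - combined['DateTimeCreated_df2']

# 3. Rows from df1 whose df2 entry is older than 90 days
stale = combined.loc[gap > pd.Timedelta(days=90), ['EmailAddress', 'DateTimeCreated_df1']]
stale = stale.rename(columns={'DateTimeCreated_df1': 'DateTimeCreated'})

# 4. Append them to df2
df2 = pd.concat([df2, stale], ignore_index=True)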
