Groupby with Cumcount - Not Working As Expected - python-3.x

I am looking to return a dataframe containing the last 5 dates each Product was sold, and I'm running into issues.
Here is my dataframe:
import numpy as np
import pandas as pd

np.random.seed(1111)
df = pd.DataFrame({
    'Category': np.random.choice(['Group A', 'Group B'], 10000),
    'Sub-Category': np.random.choice(['X', 'Y', 'Z'], 10000),
    'Sub-Category-2': np.random.choice(['G', 'F', 'I'], 10000),
    'Product': np.random.choice(['Product 1', 'Product 2', 'Product 3'], 10000),
    'Units_Sold': np.random.randint(1, 100, size=10000),
    'Dollars_Sold': np.random.randint(100, 1000, size=10000),
    'Customer': np.random.choice(pd.util.testing.rands_array(10, 25, dtype='str'), 10000),
    'Date': np.random.choice(pd.date_range('1/1/2016', '12/31/2018', freq='D'), 10000)})
I thought I could sort the Dataframe by date then use .cumcount() to create a helper column to later filter by. Here's what I tried:
df = df.sort_values('Date',ascending=False)
df['count_product'] = df.groupby(['Date','Product']).cumcount() + 1
df2 = df.loc[df.count_product < 5]
This does not work as intended. Based on the data above, I would have expected Product 1 to have the following dates included in the new dataframe: 2018-12-31, 2018-12-30, 2018-12-29, 2018-12-28, & 2018-12-27. Product 3 would have the dates 2018-12-31, 2018-12-30, 2018-12-29, 2018-12-28, & 2018-12-26.
Any suggestions?

Use drop_duplicates, then groupby with head; after filtering, merge back to the original frame:
yourdf = df.drop_duplicates(['Product','Date']).groupby('Product').head(5)[['Product','Date']].merge(df)
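For readability, the same idea can be unpacked into steps (a sketch, assuming df is already sorted by Date descending as in the question):
last_dates = (df.drop_duplicates(['Product', 'Date'])   # one row per Product/Date pair, most recent first
                .groupby('Product').head(5)             # the 5 most recent sale dates per product
                [['Product', 'Date']])
df2 = last_dates.merge(df, on=['Product', 'Date'])      # bring back every row that falls on those dates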

You can create a filter from the groupby:
s = df.groupby('Product').apply(
    lambda x: x.Date.ge(x.Date.drop_duplicates().nlargest(5).iloc[-1])
).reset_index(level=0, drop=True)
df2 = df.loc[s]
Just to check:
df2.groupby('Product').Date.agg(['min', 'max'])
min max
Product
Product 1 2018-12-27 2018-12-31
Product 2 2018-12-27 2018-12-31
Product 3 2018-12-26 2018-12-31
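The lambda can also be unpacked into explicit steps, for example (a sketch: compute each product's 5th most recent unique date as a cutoff, then keep rows on or after it):
cutoff = (df.drop_duplicates(['Product', 'Date'])   # unique Product/Date pairs
            .groupby('Product')['Date']
            .nlargest(5)                            # 5 most recent dates per product
            .groupby(level=0).min())                # the oldest of those is the cutoff
df2 = df[df['Date'] >= df['Product'].map(cutoff)]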

Related

How do I get the maximum and minimum values of a column depending on another two columns in pandas dataframe?

This is my first time asking a question. I have a dataframe that looks like below:
import pandas as pd

data = [['AK', 'Co', 2957],
        ['AK', 'Ot', 15],
        ['AK', 'Petr', 86848],
        ['AL', 'Co', 167],
        ['AL', 'Ot', 10592],
        ['AL', 'Petr', 1667]]
my_df = pd.DataFrame(data, columns=['State', 'Energy', 'Elec'])
print(my_df)
I need to find the maximum and minimum values of the third column based on the first two columns. I did browse through a few stackoverflow questions but couldn't find the right way to solve this.
My output should look like below:
data = [['AK', 'Ot', 15],
        ['AK', 'Petr', 86848],
        ['AL', 'Co', 167],
        ['AL', 'Ot', 10592]]
my_df = pd.DataFrame(data, columns=['State', 'Energy', 'Elec'])
print(my_df)
Note: Please let me know where I am going wrong before downvoting the question.
This link helped me: Python pandas dataframe: find max for each unique values of an another column
Try idxmin and idxmax with a .loc filter:
new_df = (my_df.loc[
    my_df.groupby(["State"])
         .agg(ElecMin=("Elec", "idxmin"), ElecMax=("Elec", "idxmax"))
         .stack()
].reset_index(drop=True))
print(new_df)
State Energy Elec
0 AK Ot 15
1 AK Petr 86848
2 AL Co 167
3 AL Ot 10592
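An alternative sketch that should give the same rows (assuming Elec values are unique within each State): collect the index labels of each group's min and max and select them directly:
idx = pd.concat([my_df.groupby('State')['Elec'].idxmin(),
                 my_df.groupby('State')['Elec'].idxmax()])
new_df = my_df.loc[idx.sort_values()].reset_index(drop=True)
print(new_df)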

Is there a way to compare the values of a Pandas DataFrame with the values of a second DataFrame?

I have 2 Pandas Dataframes with 5 columns and about 1000 rows each (working with python3).
I'm interested in making a comparison between the first column in df1 and the first column of df2 as follows:
DF1
[index] [col1]
1 "foobar"
2 "acksyn"
3 "foobaz"
4 "ackfin"
... ...
DF2
[index] [col1]
1 "old"
2 "fin"
3 "new"
4 "bar"
... ...
What I want to achieve is this: for each row of DF1, if DF1.col1 ends in any values of DF2.col1, drop the row.
In this example the resulting DF1 should be:
DF1
[index] [col1]
2 "acksyn"
3 "foobaz"
... ...
(note that DF2 indexes 2 and 4 are the final parts of DF1 indexes 4 and 1, respectively)
I tried using an internally defined function like:
def check_presence(df1_col1, second_csv):
    for index, row in second_csv.iterrows():
        search_string = "(?P<first_group>^(" + some_string + "))(?P<the_rest>" + row["col1"] + "$)"
        if re.search(search_string, df1_col1):
            return True
    return False
and instructions with this format:
indexes = csv[csv.col1.str.contains(some_regex, regex= True, na=False)].index
but in both cases the python console complains about not being able to compare non-string objects with a string.
What am I doing wrong? I could even try a solution after joining the 2 CSVs, but I think I would need to do the same thing in the end.
Thanks for your patience, I'm new to Python...
You will need to join your keywords in df2 first if you want to use the str.contains method.
import pandas as pd
df = pd.DataFrame({'col1': {0: 'foobar', 1: 'acksyn', 2: 'foobaz', 3: 'ackfin'}})
df2 = pd.DataFrame({'col1': {0: 'old', 1: 'fin', 2: 'new', 3: 'bar'}})
print(df["col1"].str.contains("|".join(df2["col1"])))
0 True
1 False
2 False
3 True
Possible Solution
"" for each row of DF1, if DF1.col1 ends in any values of DF2.col1, drop the row.""
This is a one-liner if I understand properly:
# Search for Substring
# Generate an "OR" statement with a join
# Drop if match.
df[~df.col1.str.contains('|'.join(df2.col1.values))]
This will keep only the rows where DF2.Col1 is NOT found in DF1.Col1.
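One caveat: str.contains matches anywhere in the string, while the question asks for "ends in". If a strict end-of-string match is needed, a sketch with an anchored, escaped pattern (re.escape guards against regex metacharacters in the keywords):
import re
import pandas as pd
df = pd.DataFrame({'col1': ['foobar', 'acksyn', 'foobaz', 'ackfin']})
df2 = pd.DataFrame({'col1': ['old', 'fin', 'new', 'bar']})
# anchor the alternation to the end of the string
pattern = '(?:' + '|'.join(map(re.escape, df2['col1'])) + ')$'
df[~df['col1'].str.contains(pattern, na=False)]   # keeps 'acksyn' and 'foobaz'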
pd.Series.str.contains
Take your frames
frame1 = pd.DataFrame({"col1": ["foobar", "acksyn", "foobaz", "ackfin"]})
frame2 = pd.DataFrame({"col1": ["old", "fin", "new", "bar"]})
Then
myList = frame2.col1.values
pattern = '|'.join(myList)
Finally
frame1["col2"]=frame1["col1"].str.contains(pattern)
frame1.loc[frame1["col2"]==True]
col1 col2
0 foobar True
3 ackfin True

Split rows with same ID into different columns python

I want to have a dataframe with repeated values that share the same id number, but I want to split the repeated rows into columns.
import pandas as pd

data = [[10450015, 4.4], [16690019, 4.1], [16690019, 4.0], [16510069, 3.7]]
df = pd.DataFrame(data, columns=['id', 'k'])
print(df)
The resulting dataframe would have n_k columns (where n counts the repeated id rows). Each repeated id gets an individual column, and when an id has no repeated rows, it gets a 0 in the new column.
data_merged = {'id': [10450015, 16690019, 16510069], '1_k': [4.4, 4.1, 3.7], '2_k': [0, 4.0, 0]}
print(data_merged)
Try assigning a column index reference using DataFrame.assign and groupby.cumcount, then reshape with DataFrame.pivot_table. Finally, use a list comprehension to rename the columns:
df_new = (df.assign(col=df.groupby('id').cumcount().add(1))
.pivot_table(index='id', columns='col', values='k', fill_value=0))
df_new.columns = [f"{x}_k" for x in df_new.columns]
print(df_new)
          1_k  2_k
id
10450015  4.4  0.0
16510069  3.7  0.0
16690019  4.1  4.0
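If id should come back as a regular column, as in the data_merged example, you can reset the index afterwards (small follow-up to the df_new above):
df_new = df_new.reset_index()   # move 'id' from the index back into a column
print(df_new)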

How to compare datetime between dataframes in multi logic statements?

I am having an issue comparing dates between two dataframes inside a multi-condition statement.
df1:
EmailAddress DateTimeCreated
1#1 2019-02-12 20:47:00
df2:
EmailAddress DateTimeCreated
1#1.com 2019-02-07 20:47:00
2#2.com 2018-11-13 20:47:00
3#3.com 2018-11-04 20:47:00
I want to do three things whenever there is a row in df1:
1. Check whether `EmailAddress` from df1 is present in df2.
1a. If `EmailAddress` is present, compare `DateTimeCreated` in df1 to `DateTimeCreated` in df2.
2. If `DateTimeCreated` in df1 is greater than today - 90 days, append the df1 row to df2.
In simpler words:
I want to see whether the email address is present in df2 and, if it is, compare DateTimeCreated in df2 to see whether it has been more than 90 days (today - 90 days) since the last time the person answered. If it has been more than 90 days, append the row from df1 into df2.
My logic below is appending everything, and I'm not sure what I am doing wrong:
import pandas as pd
from datetime import datetime, timedelta
df2.append(df2.loc[df2.EmailAddress.isin(df1.EmailAddress)&(df2.DateTimeCreated.ge(datetime.today() - timedelta(90)))])
what am I doing wrong to mess up on the date?
EDIT:
In the above example, the row from df1 would not be appended, because DateTimeCreated in df2 is within TODAY() - 90 days.
Please refer to the inline comments for the explanation. Note that you need to rename your df1 columns to match the df2 columns in this solution.
import pandas as pd
from datetime import datetime, timedelta

df1 = pd.DataFrame({'EmailAddress': ['2#2.com'],
                    'DateTimeCreated': [datetime(2019, 2, 12, 20, 47, 0)]})
df2 = pd.DataFrame({'EmailAddress': ['1#1.com', '2#2.com', '3#3.com'],
                    'DateTimeCreated': [
                        datetime(2019, 2, 7, 20, 47, 0),
                        datetime(2018, 11, 13, 20, 47, 0),
                        datetime(2019, 11, 4, 20, 47, 0)]})
# Get all expired rows
df3 = df2.loc[datetime.now() - df2['DateTimeCreated'] > timedelta(days=90)]
# Update it with the timestamp from df1
df3 = df3.set_index('EmailAddress').join(df1.set_index('EmailAddress'), how='inner', rsuffix='_r')
df3.drop('DateTimeCreated', axis=1, inplace=True)
df3.columns = ['DateTimeCreated']
# Patch df2 with the latest timestamp
df2 = df3.combine_first(df2.set_index('EmailAddress')).reset_index()
# Patch again for rows in df1 that are not in df2
df1 = df1.loc[df1['EmailAddress'].apply(lambda x: 1 if x not in df2['EmailAddress'].tolist() else 0) == 1]
df2 = pd.concat([df2, df1])
>>>df2
EmailAddress DateTimeCreated
0 1#1.com 2019-02-07 20:47:00
1 2#2.com 2019-02-12 20:47:00
2 3#3.com 2019-11-04 20:47:00
Try:
1. Left join df1 and df2 on the email address (condition 1):
combined_df = df1.set_index('EmailAddress').join(df2.set_index('EmailAddress'), how="left", lsuffix="_df1", rsuffix="_df2")
2. Calculate the gap between the df1 DateTimeCreated and today:
combined_df['gap'] = pd.Timestamp.today() - combined_df.DateTimeCreated_df1
3. Build a mask for the rows whose gap is greater than 90 days:
mask = combined_df.gap > pd.Timedelta(days=90)
4. Append those rows from df1 to df2:
df2.append(df1[mask.values])
Note: I think you may need combined_df only; the step 4 append could lead to duplicated or confusing data. Either way, you can use steps 1-4 or only steps 1-3.
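For comparison, a compact sketch of the rule as the question describes it (this assumes the intended behaviour is: append a df1 row when its address is missing from df2, or when df2's last DateTimeCreated for that address is more than 90 days old):
cutoff = pd.Timestamp.today() - pd.Timedelta(days=90)
last_seen = df2.groupby('EmailAddress')['DateTimeCreated'].max()   # last answer date per address in df2
stale = df1['EmailAddress'].map(last_seen)                         # NaT for addresses not in df2
to_append = df1[stale.isna() | (stale < cutoff)]
df2 = pd.concat([df2, to_append], ignore_index=True)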

Which row is extra in a dataframe?

I have two dataframes that contain daily end-of-day market data. They are supposed to contain identical starting dates, ending dates, and number of rows, but when I print the len of each, one is bigger than the other by one:
DF1
close
date
2008-01-01 45.92
2008-01-02 45.16
2008-01-03 45.33
2008-01-04 42.09
2008-01-07 46.98
...
[2870 rows x 1 columns]
DF2
close
date
2008-01-01 60.48
2008-01-02 59.71
2008-01-03 58.43
2008-01-04 56.64
2008-01-07 56.98
...
[2871 rows x 1 columns]
How can I show which row either:
has a duplicate row,
or has an extra date
so that I can delete the [probable] weekend/holiday date row that is in DF2 but not in DF1?
I have tried things like:
df1 = df1.drop_duplicates(subset='date', keep='first')
df2 = df1.drop_duplicates(subset='date', keep='first')
but can't get it to work [ValueError: not enough values to unpack (expected 2, got 0)].
Extra:
How do I remove weekend dates from a dataframe?
You may use .loc:
DF2=DF2.loc[DF1.index]
To check the index difference between DF1 and DF2:
DF2.index.difference(DF1.index)
To check whether DF2 has a duplicated index:
DF2[DF2.index.duplicated(keep=False)]
To check for weekends:
df.index.day_name().isin(['Saturday', 'Sunday'])
To fix your code:
df1 = df1.reset_index().drop_duplicates(subset='date', keep='first').set_index('date')
df2 = df2.reset_index().drop_duplicates(subset='date', keep='first').set_index('date')
Also, for dropping the duplicates I recommend duplicated:
df2 = df2[~df2.index.duplicated()]
About the business days:
def B_day(date):
    return bool(len(pd.bdate_range(date, date)))

df.index.map(B_day)
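Putting those pieces together, a sketch (assuming both frames have a DatetimeIndex named date): find the dates that only appear in DF2, drop them along with any duplicated dates, then filter out weekends:
extra = DF2.index.difference(DF1.index)                         # dates present in DF2 but not in DF1
DF2 = DF2.drop(extra)                                           # remove the extra rows
DF2 = DF2[~DF2.index.duplicated(keep='first')]                  # remove duplicated dates, if any
DF2 = DF2[~DF2.index.day_name().isin(['Saturday', 'Sunday'])]   # drop weekend rows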
