I have this store_df DataFrame:
store_id date sales
0 1 2023-1-2 11
1 2 2023-1-3 22
2 3 2023-1-4 33
3 1 2023-1-5 44
4 2 2023-1-6 55
5 3 2023-1-7 66
6 1 2023-1-8 77
7 2 2023-1-9 88
8 3 2023-1-10 99
I was not able to solve this in an interview.
This was the exact question asked:
- Create a dataset with 3 columns – store_id, date, sales
- Create 3 store_ids
- Each store_id has 3 consecutive dates
- Sales are recorded for 9 rows
- We are considering the same 9 dates across all stores
- Sales can be any random number
Write a function that fetches the previous day’s sales as output once we give store_id & date as input
The question can be handled in multiple ways.
If you just want to get the previous row per group (assuming the values are consecutive and sorted by increasing date within each store), use a groupby.shift:
# previous row's sales within each store (rows must be sorted by date per store)
store_df['prev_day_sales'] = store_df.groupby('store_id')['sales'].shift()
Output (shown for a variant of the input in which the stores share the same dates, as in the interview prompt):
store_id date sales prev_day_sales
0 1 2023-01-02 11 NaN
1 2 2023-01-02 22 NaN
2 3 2023-01-02 33 NaN
3 1 2023-01-03 44 11.0
4 2 2023-01-03 55 22.0
5 3 2023-01-03 66 33.0
6 1 2023-01-04 77 44.0
7 2 2023-01-05 88 55.0
8 3 2023-01-04 99 66.0
If you really want to get the previous day's value (not the previous available day), use a merge:
store_df['date'] = pd.to_datetime(store_df['date'])
# shift each row's date forward one day, then left-join back on (store_id, date)
store_df.merge(store_df.assign(date=lambda d: d['date'].add(pd.Timedelta('1D'))),
               on=['store_id', 'date'], suffixes=(None, '_prev_day'), how='left')
Note: this makes it easy to handle other deltas, such as business days (replace pd.Timedelta('1D') with pd.offsets.BusinessDay(1)).
Example (with a different input):
store_id date sales sales_prev_day
0 1 2023-01-02 11 NaN
1 2 2023-01-02 22 NaN
2 3 2023-01-02 33 NaN
3 1 2023-01-03 44 11.0
4 2 2023-01-03 55 22.0
5 3 2023-01-03 66 33.0
6 1 2023-01-04 77 44.0
7 2 2023-01-05 88 NaN # there is no data for 2023-01-04
8 3 2023-01-04 99 66.0
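The interview prompt also asks for a function that takes store_id and date and returns the previous day's sales. Here is a minimal sketch built on the merge logic above (the function name, the None return for missing data, and the delta parameter are my own choices; it assumes store_df['date'] has already been converted with pd.to_datetime):
def previous_day_sales(store_df, store_id, date, delta=pd.Timedelta('1D')):
    # look up the sales recorded for store_id on (date - delta)
    target = pd.to_datetime(date) - delta
    match = store_df.loc[(store_df['store_id'] == store_id)
                         & (store_df['date'] == target), 'sales']
    # None when there is no row for that store/date combination
    return match.iloc[0] if not match.empty else None

previous_day_sales(store_df, store_id=1, date='2023-01-03')
Passing delta=pd.offsets.BusinessDay(1) gives the business-day variant mentioned in the note above.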
I have got a dataframe df:
Aisle Table_no Table_Bit
11 2 1
11 2 2
11 2 3
11 3 1
11 3 2
11 3 3
14 2 1
14 2 2
14 2 3
and another dataframe spc_df:
Aisle Table_no Item Item_time Space
11 2 Mango 2 0.25
11 2 Lemon 1 0.125
11 3 Apple 3 0.75
14 2 Orange 1 0.125
14 2 Melon 2 0.25
I need to add the columns spc_df['Item'] and spc_df['Space'] to dataframe df, repeating each value the number of times given in spc_df['Item_time'], as shown in the expected output.
Additional note (may or may not be needed for the logic): the sum of Item_time for every Aisle-Table_no combination equals the max of Table_Bit for that combination.
Expected Output:
Aisle Table_no Table_Bit Item Space
11 2 1 Mango 0.25
11 2 2 Mango 0.25
11 2 3 Lemon 0.125
11 3 1 Apple 0.75
11 3 2 Apple 0.75
11 3 3 Apple 0.75
14 2 1 Orange 0.125
14 2 2 Melon 0.25
14 2 3 Melon 0.25
First repeat the rows of spc_df by Item_time and add a counter column with GroupBy.cumcount; this makes a left join with the original df possible, provided Table_Bit is a counter column starting at 1 per group:
# repeat each row of spc_df Item_time times, numbering the repeats 1..n per group
df2 = (spc_df.loc[spc_df.index.repeat(spc_df['Item_time'])]
             .assign(Table_Bit=lambda x: x.groupby(['Aisle', 'Table_no']).cumcount().add(1)))

df = df.merge(df2, how='left')
print(df)
Aisle Table_no Table_Bit Item Item_time Space
0 11 2 1 Mango 2 0.250
1 11 2 2 Mango 2 0.250
2 11 2 3 Lemon 1 0.125
3 11 3 1 Apple 3 0.750
4 11 3 2 Apple 3 0.750
5 11 3 3 Apple 3 0.750
6 14 2 1 Orange 1 0.125
7 14 2 2 Melon 2 0.250
8 14 2 3 Melon 2 0.250
If Table_Bit is not a counter column starting at 1, create a helper column new in both dataframes instead:
# number the repeated rows per group, and number df's rows the same way,
# so the left join can align on the helper column
df2 = (spc_df.loc[spc_df.index.repeat(spc_df['Item_time'])]
             .assign(new=lambda x: x.groupby(['Aisle', 'Table_no']).cumcount()))

df = (df.assign(new=df.groupby(['Aisle', 'Table_no']).cumcount())
        .merge(df2, how='left')
        .drop('new', axis=1))
print(df)
Aisle Table_no Table_Bit Item Item_time Space
0 11 2 1 Mango 2 0.250
1 11 2 2 Mango 2 0.250
2 11 2 3 Lemon 1 0.125
3 11 3 1 Apple 3 0.750
4 11 3 2 Apple 3 0.750
5 11 3 3 Apple 3 0.750
6 14 2 1 Orange 1 0.125
7 14 2 2 Melon 2 0.250
8 14 2 3 Melon 2 0.250
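To match the expected output exactly, the Item_time column carried over from spc_df can be dropped after the merge; a small finishing step, assuming the merged df from above:
df = df.drop(columns='Item_time')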
I have two dataframes df1 and df2:
data1 = {'A':[1,3,2,1.4,2,1,2,4], 'B':[10,30,20,1.4,2,78,2,78],'C':[200,340,20,180,2,201,2,100]}
df1 = pd.DataFrame(data1)
print(df1)
A B C
0 1.0 10.0 200
1 3.0 30.0 340
2 2.0 20.0 20
3 1.4 1.4 180
4 2.0 2.0 2
5 1.0 78.0 201
6 2.0 2.0 2
7 4.0 78.0 100
data2 = {'D':['a1','a2','a3','a4',2,1,'a3',4], 'E':['b1','b2',20,1.4,2,78,2,78],'F':[200,340,'c1',180,2,'c2',2,100]}
df2 = pd.DataFrame(data2)
print(df2)
D E F
0 a1 b1 200
1 a2 b2 340
2 a3 20 c1
3 a4 1.4 180
4 2 2 2
5 1 78 c2
6 a3 2 2
7 4 78 100
I want to insert df2 into df1, replacing column B of df1. How can a dataframe be inserted in place of a column of another dataframe?
desired result:
A D E F C
0 1.0 a1 b1 200 200
1 3.0 a2 b2 340 340
2 2.0 a3 20 c1 20
3 1.4 a4 1.4 180 180
4 2.0 2 2 2 2
5 1.0 1 78 c2 201
6 2.0 a3 2 2 2
7 4.0 4 78 100 100
The idea is to use concat, selecting columns by position with DataFrame.iloc and Index.get_loc, and finally remove the original column with DataFrame.drop:
c = 'B'
pos = df1.columns.get_loc(c)
df = pd.concat([df1.iloc[:, :pos], df2, df1.iloc[:, pos:]], axis=1).drop(c, axis=1)
print (df)
A D E F C
0 1.0 a1 b1 200 200
1 3.0 a2 b2 340 340
2 2.0 a3 20 c1 20
3 1.4 a4 1.4 180 180
4 2.0 2 2 2 2
5 1.0 1 78 c2 201
6 2.0 a3 2 2 2
7 4.0 4 78 100 100
Tested with other columns:
c = 'A'
pos = df1.columns.get_loc(c)
df = pd.concat([df1.iloc[:, :pos], df2, df1.iloc[:, pos:]], axis=1).drop(c, axis=1)
print (df)
D E F B C
0 a1 b1 200 10.0 200
1 a2 b2 340 30.0 340
2 a3 20 c1 20.0 20
3 a4 1.4 180 1.4 180
4 2 2 2 2.0 2
5 1 78 c2 78.0 201
6 a3 2 2 2.0 2
7 4 78 100 78.0 100
c = 'C'
pos = df1.columns.get_loc(c)
df = pd.concat([df1.iloc[:, :pos], df2, df1.iloc[:, pos:]], axis=1).drop(c, axis=1)
print (df)
A B D E F
0 1.0 10.0 a1 b1 200
1 3.0 30.0 a2 b2 340
2 2.0 20.0 a3 20 c1
3 1.4 1.4 a4 1.4 180
4 2.0 2.0 2 2 2
5 1.0 78.0 1 78 c2
6 2.0 2.0 a3 2 2
7 4.0 78.0 4 78 100
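If this is needed more than once, the steps can be wrapped in a small helper; a sketch (the function name is my own):
def replace_column_with_df(df1, df2, col):
    # insert all columns of df2 at the position of `col`, then drop `col`
    pos = df1.columns.get_loc(col)
    return (pd.concat([df1.iloc[:, :pos], df2, df1.iloc[:, pos:]], axis=1)
              .drop(col, axis=1))

df = replace_column_with_df(df1, df2, 'B')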
IIUC, we can use itertools with reindex:
import itertools

# build the target column order: expand 'B' into df2's columns
l = list(itertools.chain.from_iterable(
    list(df2) if item == 'B' else [item] for item in list(df1)))
pd.concat([df1, df2], axis=1).reindex(columns=l)
A D E F C
0 1.0 a1 b1 200 200
1 3.0 a2 b2 340 340
2 2.0 a3 20 c1 20
3 1.4 a4 1.4 180 180
4 2.0 2 2 2 2
5 1.0 1 78 c2 201
6 2.0 a3 2 2 2
7 4.0 4 78 100 100
I have the following data frame
item_id group price
0 1 A 10
1 3 A 30
2 4 A 40
3 6 A 60
4 2 B 20
5 5 B 50
I am looking to add a quantile column based on the price for each group like below:
item_id group price quantile
0 1 A 10 0.25
1 3 A 30 0.5
2 4 A 40 0.75
3 6 A 60 1.0
4 2 B 20 0.5
5 5 B 50 1.0
I could loop over the entire data frame and perform the computation for each group. However, I am wondering: is there a more elegant way to do this? Thanks!
You need df.rank() with pct=True:
pct : bool, default False
Whether or not to display the returned rankings in percentile form.
df['quantile']=df.groupby('group')['price'].rank(pct=True)
print(df)
item_id group price quantile
0 1 A 10 0.25
1 3 A 30 0.50
2 4 A 40 0.75
3 6 A 60 1.00
4 2 B 20 0.50
5 5 B 50 1.00
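Keep in mind that rank uses method='average' by default, so tied prices share an averaged percentile; a quick sketch with toy data of my own:
s = pd.Series([10, 10, 30])
print(s.rank(pct=True))
# 0    0.5
# 1    0.5
# 2    1.0
# dtype: float64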
Although the df.rank method above is probably the way to go for this problem, here's another solution using pd.qcut with GroupBy:
# assumes prices are distinct within each group (qcut requires unique bin edges);
# group_keys=False keeps the original index so the assignment aligns
df['quantile'] = (
    df.groupby('group', group_keys=False)['price']
      .apply(lambda x: pd.qcut(x, q=len(x), labels=False).add(1).div(len(x)))
)
item_id group price quantile
0 1 A 10 0.25
1 3 A 30 0.50
2 4 A 40 0.75
3 6 A 60 1.00
4 2 B 20 0.50
5 5 B 50 1.00
I have a dataframe with three columns: Year, Product, Price. I want to calculate the minimum value, excluding zero, of Price for each year, and also populate the adjacent value from the Product column for that minimum.
Data:
Year Product Price
2000 Grapes 0
2000 Apple 220
2000 pear 185
2000 Watermelon 172
2001 Orange 0
2001 Muskmelon 90
2001 Pear 165
2001 Watermelon 99
Desirable output in new dataframe:
Year Minimum Price Product
2000 172 Watermelon
2001 90 Muskmelon
First filter out 0 rows by boolean indexing:
df1 = df[df['Price'] != 0]
Then use DataFrameGroupBy.idxmin to get the index of the minimal Price per group, and select those rows with loc:
df2 = df1.loc[df1.groupby('Year')['Price'].idxmin()]
An alternative is to use sort_values with drop_duplicates:
df2 = df1.sort_values(['Year', 'Price']).drop_duplicates('Year')
print (df2)
Year Product Price
3 2000 Watermelon 172
5 2001 Muskmelon 90
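To match the desired output's column names and order exactly, a small finishing step on df2 (a sketch):
out = (df2.rename(columns={'Price': 'Minimum Price'})
          [['Year', 'Minimum Price', 'Product']]
          .reset_index(drop=True))
print(out)
   Year  Minimum Price     Product
0  2000            172  Watermelon
1  2001             90   Muskmelon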
If multiple minimal values are possible and you need all of them per group:
print (df)
Year Product Price
0 2000 Grapes 0
1 2000 Apple 220
2 2000 pear 172
3 2000 Watermelon 172
4 2001 Orange 0
5 2001 Muskmelon 90
6 2001 Pear 165
7 2001 Watermelon 99
df1 = df[df['Price'] != 0]
# keep every row whose Price equals its group's minimum
df = df1[df1['Price'].eq(df1.groupby('Year')['Price'].transform('min'))]
print (df)
Year Product Price
2 2000 pear 172
3 2000 Watermelon 172
5 2001 Muskmelon 90
EDIT:
print (df)
Year Product Price
0 2000 Grapes 0
1 2000 Apple 220
2 2000 pear 185
3 2000 Watermelon 172
4 2001 Orange 0
5 2001 Muskmelon 90
6 2002 Pear 0
7 2002 Watermelon 0
import numpy as np

# treat 0 as missing so it can never be the minimum
df['Price'] = df['Price'].replace(0, np.nan)
df2 = df.sort_values(['Year', 'Price']).drop_duplicates('Year')
df2['Product'] = df2['Product'].mask(df2['Price'].isnull(), 'No data')
print (df2)
Year Product Price
3 2000 Watermelon 172.0
5 2001 Muskmelon 90.0
6 2002 No data NaN