I have this store_df DataFrame:
store_id date sales
0 1 2023-1-2 11
1 2 2023-1-3 22
2 3 2023-1-4 33
3 1 2023-1-5 44
4 2 2023-1-6 55
5 3 2023-1-7 66
6 1 2023-1-8 77
7 2 2023-1-9 88
8 3 2023-1-10 99
I was not able to solve this in the interview.
This was the exact question asked:
Create a dataset with 3 columns: store_id, date, sales.
Create 3 store_ids.
Each store_id has 3 consecutive dates.
Sales are recorded for 9 rows.
We are considering the same 9 dates across all stores.
Sales can be any random number.
Write a function that fetches the previous day's sales as output once we give store_id and date as input.
The question can be handled in multiple ways.
If you want to just get the previous row per group, assuming that the values are consecutive and sorted by increasing dates, use a groupby.shift:
store_df['prev_day_sales'] = store_df.groupby('store_id')['sales'].shift()
Output (with different input dates than in the question):
store_id date sales prev_day_sales
0 1 2023-01-02 11 NaN
1 2 2023-01-02 22 NaN
2 3 2023-01-02 33 NaN
3 1 2023-01-03 44 11.0
4 2 2023-01-03 55 22.0
5 3 2023-01-03 66 33.0
6 1 2023-01-04 77 44.0
7 2 2023-01-05 88 55.0
8 3 2023-01-04 99 66.0
If you really want to get the previous day's value (not the previous available day), use a merge:
import pandas as pd

store_df['date'] = pd.to_datetime(store_df['date'])
# shift a copy of the data forward by one day, then join it back on (store_id, date)
store_df.merge(store_df.assign(date=lambda d: d['date'].add(pd.Timedelta('1D'))),
               on=['store_id', 'date'], suffixes=(None, '_prev_day'), how='left'
               )
Note. This makes it easy to handle other deltas, like business days (replace pd.Timedelta('1D') with pd.offsets.BusinessDay(1)).
Example (with a different input):
store_id date sales sales_prev_day
0 1 2023-01-02 11 NaN
1 2 2023-01-02 22 NaN
2 3 2023-01-02 33 NaN
3 1 2023-01-03 44 11.0
4 2 2023-01-03 55 22.0
5 3 2023-01-03 66 33.0
6 1 2023-01-04 77 44.0
7 2 2023-01-05 88 NaN # there is no data for 2023-01-04
8 3 2023-01-04 99 66.0
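Since the interview question asks for a function taking store_id and date, here is a minimal sketch of one way to write it. The name prev_day_sales and the choice to return None when there is no row for the previous calendar day are assumptions, not part of the original question; the data is the "different input" from the example above.

```python
import pandas as pd

# Sample data matching the "different input" example (store 2 has no row for 2023-01-04).
store_df = pd.DataFrame({
    "store_id": [1, 2, 3, 1, 2, 3, 1, 2, 3],
    "date": pd.to_datetime(
        ["2023-01-02"] * 3 + ["2023-01-03"] * 3
        + ["2023-01-04", "2023-01-05", "2023-01-04"]
    ),
    "sales": [11, 22, 33, 44, 55, 66, 77, 88, 99],
})


def prev_day_sales(df, store_id, date):
    """Hypothetical helper: sales of `store_id` on the calendar day before `date`.

    Returns None when that store has no row for the previous day.
    """
    prev = pd.Timestamp(date) - pd.Timedelta(days=1)
    match = df.loc[(df["store_id"] == store_id) & (df["date"] == prev), "sales"]
    return None if match.empty else match.iloc[0]


print(prev_day_sales(store_df, 1, "2023-01-03"))
print(prev_day_sales(store_df, 2, "2023-01-05"))
```

The boolean-mask lookup is fine for a handful of calls; for many lookups, setting a (store_id, date) index once would likely be faster.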
I have the following data frame
item_id group price
0 1 A 10
1 3 A 30
2 4 A 40
3 6 A 60
4 2 B 20
5 5 B 50
I am looking to add a quantile column based on the price for each group like below:
item_id group price quantile
0 1 A 10 0.25
1 3 A 30 0.5
2 4 A 40 0.75
3 6 A 60 1.0
4 2 B 20 0.5
5 5 B 50 1.0
I could loop over the entire data frame and perform the computation for each group. However, I am wondering whether there is a more elegant way to do this. Thanks!
You need df.rank() with pct=True:
pct : bool, default False
Whether or not to display the returned rankings in percentile form.
df['quantile']=df.groupby('group')['price'].rank(pct=True)
print(df)
item_id group price quantile
0 1 A 10 0.25
1 3 A 30 0.50
2 4 A 40 0.75
3 6 A 60 1.00
4 2 B 20 0.50
5 5 B 50 1.00
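To make it clearer what pct=True computes here, the following sketch rebuilds the sample frame and checks that the percentile rank equals the ordinary within-group rank divided by the group size. The variable names pct_rank and manual are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "item_id": [1, 3, 4, 6, 2, 5],
    "group": list("AAAABB"),
    "price": [10, 30, 40, 60, 20, 50],
})

grouped = df.groupby("group")["price"]

# Percentile rank within each group.
pct_rank = grouped.rank(pct=True)

# Same thing by hand: rank within the group divided by the group's row count.
manual = grouped.rank() / grouped.transform("count")

print(pct_rank.tolist())
print((pct_rank == manual).all())
```

With ties, rank defaults to method='average', so tied prices share the same percentile.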
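Although the df.rank method above is probably the way to go for this problem, here's another solution using pd.qcut with GroupBy:
# transform keeps the original row index, so the result aligns with df;
# with pandas >= 2.0, apply would prepend the group key to the index here
df['quantile'] = (
    df.groupby('group')['price']
      .transform(lambda x: pd.qcut(x, q=len(x), labels=False)
                             .add(1)
                             .div(len(x)))
)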
item_id group price quantile
0 1 A 10 0.25
1 3 A 30 0.50
2 4 A 40 0.75
3 6 A 60 1.00
4 2 B 20 0.50
5 5 B 50 1.00
Below is my dataframe. I believe I need to use groupby or pivot, but I haven't gotten anything to work correctly.
   LOGIN    MANAGER  7     8     9     10    11    UNITS  HOURS  UPH
0  joeblow  MSmith   1                             21     1      47.01
1  joeblow  MSmith         0.25                    18     0.25   75.83
2  joeblow  MSmith               1                 12     1      87.05
3  joeblow  MSmith                     0.26        13     0.26   206.9
4  joeblow  MSmith                           0.43  23     0.43   53.18
My expected output would look like below, where UNITS and HOURS are summed, UPH is averaged, and the other columns are grouped:
LOGIN MANAGER 7 8 9 10 11 UNITS HOURS UPH
0 joeblow MSmith 1 0.25 1 0.26 0.43 66 2.94 93.994
First, create a dict that maps each column to its aggregation function:
d={'7':'first','8':'first','9':'first','10':'first','11':'first','UNITS':'sum','HOURS':'sum','UPH':'mean'}
Then pass it to agg:
yourdf=df.groupby(['LOGIN','MANAGER']).agg(d)
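A self-contained sketch of the same idea follows. It assumes the blank cells in the hour columns are NaN (not empty strings) and that the column labels are strings; 'first' returns the first non-null value per group, which is what collapses the staggered hour columns into one row. The names hour_cols and agg_map are illustrative.

```python
import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame({
    "LOGIN": ["joeblow"] * 5,
    "MANAGER": ["MSmith"] * 5,
    # Assumption: blanks in the hour columns are NaN.
    "7": [1, nan, nan, nan, nan],
    "8": [nan, 0.25, nan, nan, nan],
    "9": [nan, nan, 1, nan, nan],
    "10": [nan, nan, nan, 0.26, nan],
    "11": [nan, nan, nan, nan, 0.43],
    "UNITS": [21, 18, 12, 13, 23],
    "HOURS": [1, 0.25, 1, 0.26, 0.43],
    "UPH": [47.01, 75.83, 87.05, 206.9, 53.18],
})

hour_cols = ["7", "8", "9", "10", "11"]
# 'first' skips NaN, so each hour column keeps its single populated value.
agg_map = {**{c: "first" for c in hour_cols},
           "UNITS": "sum", "HOURS": "sum", "UPH": "mean"}

# as_index=False keeps LOGIN and MANAGER as ordinary columns.
out = df.groupby(["LOGIN", "MANAGER"], as_index=False).agg(agg_map)
print(out)
```

With the five sample rows as listed, UNITS sums to 87 rather than the 66 shown in the expected output, while HOURS (2.94) and UPH (93.994) match.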
I have data like this. The first column is the number of days from a starting point. The second column is the value generated on that day.
For example, after day 1 I get $5, after day 2 I get $3, and so on. There may be days with no revenue, like day 4, so the day numbers are not consecutive.
data =pd.DataFrame({'day':[1,2,3,5,6,7,8,9,10,11,14,15,17,18,19],
'value':[5,3,7,8,9,4,6,5,2,8,6,7,9,5,2]})
I want to find the total value for every 7-day window.
The output should look like:
day value
7 36
14 27
21 23
I am using a loop to achieve this. Is there a better, more Pythonic way of doing this?
rows = []  # collect one dict per completed window (DataFrame.append was removed in pandas 2.0)
sum_value = 0
for index, row in data.iterrows():
    sum_value += row['value']
    if row['day'] % 7 == 0:
        rows.append({'day': row['day'], 'sum_value': sum_value})
        sum_value = 0
df = pd.DataFrame(rows)
print(df)
Also, how can I find the sum of the previous 7 days' values at each day (each row)?
Expected output:
day value
1 5
2 8
3 15
5 23
6 32
7 36
8 37
9 39
10 34
and so on...
I hope I did the calculation right. It is basically a running total of the previous 7 days of values. It would be easier if no numbers were missing from the day column.
Use groupby with a helper Series built by subtracting 1 from day and integer-dividing by 7, then aggregate with last for day and sum for value:
df = data.groupby((data['day'] - 1) // 7, as_index=False).agg({'day':'last', 'value':'sum'})
print (df)
day value
0 7 36
1 14 27
2 19 23
Details:
print ((data['day'] - 1) // 7)
0 0
1 0
2 0
3 0
4 0
5 0
6 1
7 1
8 1
9 1
10 1
11 2
12 2
13 2
14 2
Name: day, dtype: int64
A similar solution if each row should be labeled with the last day of its 7-day window (7, 14, 21) instead of the last day present in the data:
df = data.groupby((data['day'] - 1) // 7)['value'].sum().reset_index()
df['day'] = (df['day'] + 1) * 7
print (df)
day value
0 7 36
1 14 27
2 21 23
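The binning step can be wrapped in a small function so the window width is a parameter. This is only a sketch; window_totals and width are hypothetical names, and it assumes day counting starts at 1 as in the question.

```python
import pandas as pd

data = pd.DataFrame({
    "day": [1, 2, 3, 5, 6, 7, 8, 9, 10, 11, 14, 15, 17, 18, 19],
    "value": [5, 3, 7, 8, 9, 4, 6, 5, 2, 8, 6, 7, 9, 5, 2],
})


def window_totals(frame, width=7):
    """Sum `value` over consecutive blocks of `width` days (days 1..width, and so on)."""
    # Days 1..7 map to bin 0, days 8..14 to bin 1, etc.
    bins = (frame["day"] - 1) // width
    totals = frame.groupby(bins)["value"].sum()
    # Relabel each bin with the last day of its window.
    totals.index = (totals.index + 1) * width
    return totals.rename_axis("day").reset_index()


weekly = window_totals(data)
print(weekly)
```

Windows that contain no rows at all simply do not appear in the result, because groupby only produces groups that exist in the data.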
EDIT: For the running total you need rolling with sum, but first add the missing days with reindex. This requires unique values in the day column.
import numpy as np

idx = np.arange(data['day'].min(), data['day'].max() + 1)
df = data.set_index('day').reindex(idx).rolling(7, min_periods=1).sum()
df = df[df.index.isin(data['day'])]
print (df)
value
day
1 5.0
2 8.0
3 15.0
5 23.0
6 32.0
7 36.0
8 37.0
9 39.0
10 34.0
11 42.0
14 27.0
15 28.0
17 30.0
18 27.0
19 29.0
If you get:
ValueError: cannot reindex from a duplicate axis
it means the day column contains duplicate values. The solution is to aggregate with sum first:
# duplicated day 1
data =pd.DataFrame({'day':[1,1,3,5,6,7,8,9,10,11,14,15,17,18,19],
'value':[5,3,7,8,9,4,6,5,2,8,6,7,9,5,2]})
idx = np.arange(data['day'].min(), data['day'].max() + 1)
df = data.groupby('day')['value'].sum().reindex(idx).rolling(7, min_periods=1).sum()
df = df[df.index.isin(data['day'])]
print (df)
day
1 8.0
3 15.0
5 23.0
6 32.0
7 36.0
8 34.0
9 39.0
10 34.0
11 42.0
14 27.0
15 28.0
17 30.0
18 27.0
19 29.0
Name: value, dtype: float64
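Both cases (unique or duplicated days) can be handled by one helper that always aggregates per day before reindexing. A minimal sketch, where trailing_sum and window are hypothetical names:

```python
import numpy as np
import pandas as pd


def trailing_sum(frame, window=7):
    """Sum of `value` over the current day and the preceding `window - 1` days."""
    # Collapse duplicate days first so the index is unique.
    per_day = frame.groupby("day")["value"].sum()
    # Insert the missing days as NaN so the window spans calendar days, not rows.
    full_range = np.arange(per_day.index.min(), per_day.index.max() + 1)
    rolled = per_day.reindex(full_range).rolling(window, min_periods=1).sum()
    # Keep only the days that were present in the input.
    return rolled.loc[per_day.index]


data = pd.DataFrame({
    "day": [1, 1, 3, 5, 6, 7, 8, 9, 10, 11, 14, 15, 17, 18, 19],
    "value": [5, 3, 7, 8, 9, 4, 6, 5, 2, 8, 6, 7, 9, 5, 2],
})
result = trailing_sum(data)
print(result)
```

Aggregating first is harmless when the days are already unique, so the same function covers the original data as well.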
I've got a headache which I would love some help with.
So I have the following table:
Table 1:
Date Hour Volume Value Price
10/09/2018 1 10 400 40.0
10/09/2018 2 80 200 2.5
10/09/2018 3 14 190 13.6
10/09/2018 4 74 140 1.9
11/09/2018 1 34 547 16.1
11/09/2018 2 26 849 32.7
11/09/2018 3 95 279 2.9
11/09/2018 4 31 216 7.0
Then what I want to do is view the weighted average by hour.
e.g.
Hour Price
1 21.52272727
2 9.896226415
3 4.302752294
4 3.39047619
And, if possible (bonus points), I'd like to be able to restrict this to a time period, e.g. each hour within specified dates.
The way it looks in Excel:
A B C D E
1 Date Hour Volume Value Price
2 10/09/2018 1 10 400 40.0
3 10/09/2018 2 80 200 2.5
4 10/09/2018 3 14 190 13.6
5 10/09/2018 4 74 140 1.9
6 11/09/2018 1 34 547 16.1
7 11/09/2018 2 26 849 32.7
8 11/09/2018 3 95 279 2.9
9 11/09/2018 4 31 216 7.0
The output should be:
Hour Price
1 21.52272727
2 9.896226415
3 4.302752294
4 3.39047619
Calculated Like:
Hour Price
1 ((E2*C2)+(E6*C6))/SUM(C2,C6)
2 ((E3*C3)+(E7*C7))/SUM(C3,C7)
3 ((E4*C4)+(E8*C8))/SUM(C4,C8)
4 ((E5*C5)+(E9*C9))/SUM(C5,C9)
I've looked at lots of weighted average questions and answers and they all make sense but I can't quite put them together the way I want.
I hope that makes sense.
Thanks guys,
I have reproduced your desired report:
There is no need to involve price in the calculations. The weighted average price is simply total value / total volume for the selected set of dates, because each row's price is value / volume, so price * volume is just the value. Let's say your table is called "Data". Create a measure:
Weighted Average Price = DIVIDE( SUM(Data[Value]), SUM(Data[Volume]))
Put it into a pivot table against hours, and you are done.
The formula will work correctly for any set of dates you select. For example, in a version of the above report with both hours and dates on the pivot:
you can see that it calculates prices correctly on individual dates, subtotals and the total.
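For readers who want to check the arithmetic outside a pivot table, the same total value / total volume calculation can be sketched in pandas. This is not part of the original answer; weighted_price_by_hour, start and end are hypothetical names, and the dates are assumed to be day-first (dd/mm/yyyy).

```python
import pandas as pd

data = pd.DataFrame({
    # Assumption: dates are day/month/year.
    "Date": pd.to_datetime(["10/09/2018"] * 4 + ["11/09/2018"] * 4, format="%d/%m/%Y"),
    "Hour": [1, 2, 3, 4] * 2,
    "Volume": [10, 80, 14, 74, 34, 26, 95, 31],
    "Value": [400, 200, 190, 140, 547, 849, 279, 216],
})


def weighted_price_by_hour(frame, start=None, end=None):
    """Volume-weighted price per hour, optionally limited to an inclusive date range."""
    if start is not None:
        frame = frame[frame["Date"] >= pd.Timestamp(start)]
    if end is not None:
        frame = frame[frame["Date"] <= pd.Timestamp(end)]
    totals = frame.groupby("Hour")[["Value", "Volume"]].sum()
    # Weighted average price = total value / total volume.
    return totals["Value"] / totals["Volume"]


print(weighted_price_by_hour(data))
print(weighted_price_by_hour(data, "2018-09-10", "2018-09-10"))
```

The optional date bounds play the role of the date selection in the pivot table.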