create a python function as store_id & date as input and output as the previous date's sales - python-3.x

I have this store_df DataFrame:
store_id date sales
0 1 2023-1-2 11
1 2 2023-1-3 22
2 3 2023-1-4 33
3 1 2023-1-5 44
4 2 2023-1-6 55
5 3 2023-1-7 66
6 1 2023-1-8 77
7 2 2023-1-9 88
8 3 2023-1-10 99
I am not able to solve this in the interview.
This was the exact question asked :
Create a dataset with 3 columns – store_id, date, sales Create 3 Store_id Each store_id has 3 consecutive dates Sales are recorded for 9 rows We are considering the same 9 dates across all stores Sales can be any random number
Write a function that fetches the previous day’s sales as output once we give store_id & date as input

The question can be handled in multiple ways.
If you want to just get the previous row per group, assuming that the values are consecutive and sorted by increasing dates, use a groupby.shift:
store_df['prev_day_sales'] = store_df.groupby('store_id')['sales'].shift()
Output:
store_id date sales prev_day_sales
0 1 2023-01-02 11 NaN
1 2 2023-01-02 22 NaN
2 3 2023-01-02 33 NaN
3 1 2023-01-03 44 11.0
4 2 2023-01-03 55 22.0
5 3 2023-01-03 66 33.0
6 1 2023-01-04 77 44.0
7 2 2023-01-05 88 55.0
8 3 2023-01-04 99 66.0
If, you really want to get the previous day's value (not the previous available day), use a merge:
store_df['date'] = pd.to_datetime(store_df['date'])
store_df.merge(store_df.assign(date=lambda d: d['date'].add(pd.Timedelta('1D'))),
on=['store_id', 'date'], suffixes=(None, '_prev_day'), how='left'
)
Note. This makes it easy to handle other deltas, like business days (replace pd.Timedelta('1D') with pd.offsets.BusinessDay(1)).
Example (with a different input):
store_id date sales sales_prev_day
0 1 2023-01-02 11 NaN
1 2 2023-01-02 22 NaN
2 3 2023-01-02 33 NaN
3 1 2023-01-03 44 11.0
4 2 2023-01-03 55 22.0
5 3 2023-01-03 66 33.0
6 1 2023-01-04 77 44.0
7 2 2023-01-05 88 NaN # there is no data for 2023-01-04
8 3 2023-01-04 99 66.0

Related

Pandas: Combine pandas columns that have the same column name

If we have the following df,
df
A A B B B
0 10 2 0 3 3
1 20 4 19 21 36
2 30 20 24 24 12
3 40 10 39 23 46
How can I combine the content of the columns with the same names?
e.g.
A B
0 10 0
1 20 19
2 30 24
3 40 39
4 2 3
5 4 21
6 20 24
7 10 23
8 Na 3
9 Na 36
10 Na 12
11 Na 46
I tried groupby and merge and both are not doing this job.
Any help is appreciated.
If columns names are duplicated you can use DataFrame.melt with concat:
df = pd.concat([df['A'].melt()['value'], df['B'].melt()['value']], axis=1, keys=['A','B'])
print (df)
A B
0 10.0 0
1 20.0 19
2 30.0 24
3 40.0 39
4 2.0 3
5 4.0 21
6 20.0 24
7 10.0 23
8 NaN 3
9 NaN 36
10 NaN 12
11 NaN 46
EDIT:
uniq = df.columns.unique()
df = pd.concat([df[c].melt()['value'] for c in uniq], axis=1, keys=uniq)
print (df)
A B
0 10.0 0
1 20.0 19
2 30.0 24
3 40.0 39
4 2.0 3
5 4.0 21
6 20.0 24
7 10.0 23
8 NaN 3
9 NaN 36
10 NaN 12
11 NaN 46

How to do substruction in the cells of columns in python

I have this dataframe (df) in python:
Cumulative sales
0 12
1 28
2 56
3 87
I want to create a new column in which I whould have the the number of new sales (N-(N-1)) as below:
Cumulative sales New Sales
0 12 12
1 28 16
2 56 28
3 87 31
You can do
df['new sale']=df.Cumulativesales.diff().fillna(df.Cumulativesales)
df
Cumulativesales new sale
0 12 12.0
1 28 16.0
2 56 28.0
3 87 31.0
Do this:
df['New_sales'] = df['Cumlative_sales'].diff()
df.fillna(df.iloc[0]['Cumlative_sales'], inplace=True)
print(df)
Output:
Cumlative_sales New_sales
0 12 12.0
1 28 16.0
2 56 28.0
3 87 31.0

Extarct Rows Until a Certain Row with Certain Word of a Column Pandas

I have a data frame like this,
Name Product Quantity
0 NaN 1010 10
1 NaN 2010 12
2 NaN 4145 18
3 NaN 5225 14
4 Total 6223 16
5 RRA 7222 18
6 MLQ 5648 45
Now, I need to extract rows/new dataframe that has rows until Total that is in Name column.
Output needed:
Name Product Quantity
0 NaN 1010 10
1 NaN 2010 12
2 NaN 4145 18
3 NaN 5225 14
I tried this,
df[df.Name.str.contains("Total", na=False)]
This is not helpful for now. Any suggestion would be great.
Select the index where the True value is located and slice using df.iloc:
df_new=df.iloc[:df.loc[df.Name.str.contains('Total',na=False)].index[0]]
or using series.idxmax() which allows you to get the index of max value (max of True/False is True):
df_new=df.iloc[:df.Name.str.contains('Total',na=False).idxmax()]
print(df_new)
Name Product Quantity
0 NaN 1010 10
1 NaN 2010 12
2 NaN 4145 18
3 NaN 5225 14

find running total on every 7th day in pandas

I have a data like this. first column is the number of days from one starting point. second column is value generated after each number of days as given.
example after 1 day i get 5$, after 2nd day i get 3$ and so on. there may be some time where there is no revenue like 4th day. the numbers are not consecutive.
data =pd.DataFrame({'day':[1,2,3,5,6,7,8,9,10,11,14,15,17,18,19],
'value':[5,3,7,8,9,4,6,5,2,8,6,7,9,5,2]})
I want to find total value after every 7 day window.
output should be like
day value
7 36
14 27
21 23
I am using loop to achieve this. is there a better pythonic way of doing this.
df =pd.DataFrame({})
sum_value=0
for index, row in data.iterrows():
sum_value+= row['value']
if row['day'] %7==0:
df = df.append(pd.DataFrame({'day':row['day'],'sum_value':[sum_value]}))
sum_value=0
pritn(df)
Also, how to find sum of previous 7 day values at each day (each row)
expected output
day value
1 5
2 8
3 15
5 23
6 32
7 36
8 37
9 39
10 34
and so on...
I hope i did the calculation right. it is basically running total of previous 7 days of values. it would be easier if the numbers are not missing in days column.
Use groupby with helper Series with subtract 1 and integer division with aggregate sum and last:
df = data.groupby((data['day'] - 1) // 7 , as_index=False).agg({'day':'last', 'value':'sum'})
print (df)
day value
0 7 36
1 14 27
2 19 23
Details:
print ((data['day'] - 1) // 7)
0 0
1 0
2 0
3 0
4 0
5 0
6 1
7 1
8 1
9 1
10 1
11 2
12 2
13 2
14 2
Name: day, dtype: int64
Similar solution if need divide day column by 7:
df = data.groupby((data['day'] - 1) // 7)['value'].sum().reset_index()
df['day'] = (df['day'] + 1) * 7
print (df)
day value
0 7 36
1 14 27
2 21 23
EDIT: Need rolling with sum, but first is necessary add missing dates by reindex - necessary unique values of day column.
idx = np.arange(data['day'].min(), data['day'].max() + 1)
df = data.set_index('day').reindex(idx).rolling(7, min_periods=1).sum()
df = df[df.index.isin(data['day'])]
print (df)
value
day
1 5.0
2 8.0
3 15.0
5 23.0
6 32.0
7 36.0
8 37.0
9 39.0
10 34.0
11 42.0
14 27.0
15 28.0
17 30.0
18 27.0
19 29.0
If get:
ValueError: cannot reindex from a duplicate axis
it means duplicates day values and solution is aggregate sum first:
#duplicated day 1
data =pd.DataFrame({'day':[1,1,3,5,6,7,8,9,10,11,14,15,17,18,19],
'value':[5,3,7,8,9,4,6,5,2,8,6,7,9,5,2]})
idx = np.arange(data['day'].min(), data['day'].max() + 1)
df = data.groupby('day')['value'].sum().reindex(idx).rolling(7, min_periods=1).sum()
df = df[df.index.isin(data['day'])]
print (df)
day
1 8.0
3 15.0
5 23.0
6 32.0
7 36.0
8 34.0
9 39.0
10 34.0
11 42.0
14 27.0
15 28.0
17 30.0
18 27.0
19 29.0
Name: value, dtype: float64

Python: Summing every five rows of column b data and create a new column

I have a dataframe like below. I would like to sum row 0 to 4 (every 5 rows) and create another column with summed value ("new column"). My real dataframe has 263 rows so, last three rows every 12 rows will be sum of three rows only. How I can do this using Pandas/Python. I have started to learn Python recently. Thanks for any advice in advance!
My data patterns is more complex as I am using the index as one of my column values and it repeats like:
Row Data "new column"
0 5
1 1
2 3
3 3
4 2 14
5 4
6 8
7 1
8 2
9 1 16
10 0
11 2
12 3 5
0 3
1 1
2 2
3 3
4 2 11
5 2
6 6
7 2
8 2
9 1 13
10 1
11 0
12 1 2
...
259 50 89
260 1
261 4
262 5 10
I tried iterrows and groupby but can't make it work so far.
Use this:
df['new col'] = df.groupby(df.index // 5)['Data'].transform('sum')[lambda x: ~(x.duplicated(keep='last'))]
Output:
Data new col
0 5 NaN
1 1 NaN
2 3 NaN
3 3 NaN
4 2 14.0
5 4 NaN
6 8 NaN
7 1 NaN
8 2 NaN
9 1 16.0
Edit to handle updated question:
g = df.groupby(df.Row).cumcount()
df['new col'] = df.groupby([g, df.Row // 5])['Data']\
.transform('sum')[lambda x: ~(x.duplicated(keep='last'))]
Output:
Row Data new col
0 0 5 NaN
1 1 1 NaN
2 2 3 NaN
3 3 3 NaN
4 4 2 14.0
5 5 4 NaN
6 6 8 NaN
7 7 1 NaN
8 8 2 NaN
9 9 1 16.0
10 10 0 NaN
11 11 2 NaN
12 12 3 5.0
13 0 3 NaN
14 1 1 NaN
15 2 2 NaN
16 3 3 NaN
17 4 2 11.0
18 5 2 NaN
19 6 6 NaN
20 7 2 NaN
21 8 2 NaN
22 9 1 13.0
23 10 1 NaN
24 11 0 NaN
25 12 1 2.0

Resources