I need advice on how to compact this dataframe - python-3.x

Below is my dataframe, I believe I need to use groupby or pivot but haven't gotten anything to work correctly.
LOGIN MANAGER 7 8 9 10 11 UNITS HOURS UPH
0 joeblow MSmith 1 21 1 47.01
1 joeblow MSmith 0.25 18 0.25 75.83
2 joeblow MSmith 1 12 1 87.05
3 joeblow MSmith 0.26 13 0.26 206.9
4 joeblow MSmith 0.43 23 0.43 53.18
My expected output would look like below, where the UNITS and HOURS are summed and UPH is averaged and the other columns groupby:
LOGIN MANAGER 7 8 9 10 11 UNITS HOURS UPH
0 joeblow MSmith 1 0.25 1 0.26 0.43 66 2.94 93.994

First Create your columns dict with functions
d={'7':'first','8':'first','9':'first','10':'first','11':'first','UNITS':'sum','HOURS':'sum','UPH':'mean'}
Then do with agg
yourdf=df.groupby(['LOGIN','MANAGER']).agg(d)

Related

create a python function as store_id & date as input and output as the previous date's sales

I have this store_df DataFrame:
store_id date sales
0 1 2023-1-2 11
1 2 2023-1-3 22
2 3 2023-1-4 33
3 1 2023-1-5 44
4 2 2023-1-6 55
5 3 2023-1-7 66
6 1 2023-1-8 77
7 2 2023-1-9 88
8 3 2023-1-10 99
I am not able to solve this in the interview.
This was the exact question asked :
Create a dataset with 3 columns – store_id, date, sales Create 3 Store_id Each store_id has 3 consecutive dates Sales are recorded for 9 rows We are considering the same 9 dates across all stores Sales can be any random number
Write a function that fetches the previous day’s sales as output once we give store_id & date as input
The question can be handled in multiple ways.
If you want to just get the previous row per group, assuming that the values are consecutive and sorted by increasing dates, use a groupby.shift:
store_df['prev_day_sales'] = store_df.groupby('store_id')['sales'].shift()
Output:
store_id date sales prev_day_sales
0 1 2023-01-02 11 NaN
1 2 2023-01-02 22 NaN
2 3 2023-01-02 33 NaN
3 1 2023-01-03 44 11.0
4 2 2023-01-03 55 22.0
5 3 2023-01-03 66 33.0
6 1 2023-01-04 77 44.0
7 2 2023-01-05 88 55.0
8 3 2023-01-04 99 66.0
If, you really want to get the previous day's value (not the previous available day), use a merge:
store_df['date'] = pd.to_datetime(store_df['date'])
store_df.merge(store_df.assign(date=lambda d: d['date'].add(pd.Timedelta('1D'))),
on=['store_id', 'date'], suffixes=(None, '_prev_day'), how='left'
)
Note. This makes it easy to handle other deltas, like business days (replace pd.Timedelta('1D') with pd.offsets.BusinessDay(1)).
Example (with a different input):
store_id date sales sales_prev_day
0 1 2023-01-02 11 NaN
1 2 2023-01-02 22 NaN
2 3 2023-01-02 33 NaN
3 1 2023-01-03 44 11.0
4 2 2023-01-03 55 22.0
5 3 2023-01-03 66 33.0
6 1 2023-01-04 77 44.0
7 2 2023-01-05 88 NaN # there is no data for 2023-01-04
8 3 2023-01-04 99 66.0

How to subtract X rows in a dataframe with first value from another dataframe?

I am using pandas for this work.
I have a 2 datasets. The first dataset has approximately 6 million rows and 6 columns. For example the first data set looks something like this:
Date
Time
U
V
W
T
2020-12-30
2:34
3
4
5
7
2020-12-30
2:35
2
3
6
5
2020-12-30
2:36
1
5
8
5
2020-12-30
2:37
2
3
0
8
2020-12-30
2:38
4
4
5
7
2020-12-30
2:39
5
6
5
9
this is just the raw data collected from the machine.
The second is the average values of three rows at a time from each column (U,V,W,T).
U
V
W
T
2
4
6.33
5.67
3.66
4.33
3.33
8
What I am trying to do is calculate the perturbation for each column per second.
U(perturbation)=U(raw)-U(avg)
U(raw)= dataset 1
U(avg)= dataset 2
Basically take the first three rows from the first column of the first dataset and individually subtract them by the first value in the first column of the second dataset, then take the next three values from the first column of the first data set and individually subtract them by second value in the first column of the second dataset. Do the same for all three columns.
The desired final output should be as the following:
Date
Time
U
V
W
T
2020-12-30
2:34
1
0
-1.33
1.33
2020-12-30
2:35
0
-1
-0.33
-0.67
2020-12-30
2:36
-1
1
1.67
-0.67
2020-12-30
2:37
-1.66
-1.33
-3.33
0
2020-12-30
2:38
0.34
-0.33
1.67
-1
2020-12-30
2:39
1.34
1.67
1.67
1
I am new to pandas and do not know how to approach this.
I hope it makes sense.
a = df1.assign(index = df1.index // 3).merge(df2.reset_index(), on='index')
b = a.filter(regex = '_x', axis=1) - a.filter(regex = '_y', axis = 1).to_numpy()
pd.concat([a.filter(regex='^[^_]+$', axis = 1), b], axis = 1)
Date Time index U_x V_x W_x T_x
0 2020-12-30 2:34 0 0.00 0.00 -1.33 1.33
1 2020-12-30 2:35 0 -1.00 -1.00 -0.33 -0.67
2 2020-12-30 2:36 0 -2.00 1.00 1.67 -0.67
3 2020-12-30 2:37 1 -1.66 -1.33 -3.33 0.00
4 2020-12-30 2:38 1 0.34 -0.33 1.67 -1.00
5 2020-12-30 2:39 1 1.34 1.67 1.67 1.00
You can use numpy:
import numpy as np
df1[df2.columns] -= np.repeat(df2.to_numpy(), 3, axis=0)
NB. This modifies df1 in place, if you want you can make a copy first (df_final = df1.copy()) and apply the subtraction on this copy.

pandas: append a column with quantile values

I have the following data frame
item_id group price
0 1 A 10
1 3 A 30
2 4 A 40
3 6 A 60
4 2 B 20
5 5 B 50
I am looking to add a quantile column based on the price for each group like below:
item_id group price quantile
01 A 10 0.25
03 A 30 0.5
04 A 40 0.75
06 A 60 1.0
02 B 20 0.5
05 B 50 1.0
I could loop over entire data frame and perform computation for each group. However, I am wondering is there a more elegant way to resolve this? Thanks!
You need df.rank() with pct=True:
pct : bool, default False
Whether or not to display the returned rankings in percentile form.
df['quantile']=df.groupby('group')['price'].rank(pct=True)
print(df)
item_id group price quantile
0 1 A 10 0.25
1 3 A 30 0.50
2 4 A 40 0.75
3 6 A 60 1.00
4 2 B 20 0.50
5 5 B 50 1.00
Although the df.rank method above is probably the way to go for this problem. Here's another solution using pd.qcut with GroupBy:
df['quantile'] = (
df.groupby('group')['price']
.apply(lambda x: pd.qcut(x, q=len(x), labels=False)
.add(1)
.div(len(x))
)
)
item_id group price quantile
0 1 A 10 0.25
1 3 A 30 0.50
2 4 A 40 0.75
3 6 A 60 1.00
4 2 B 20 0.50
5 5 B 50 1.00

find running total on every 7th day in pandas

I have a data like this. first column is the number of days from one starting point. second column is value generated after each number of days as given.
example after 1 day i get 5$, after 2nd day i get 3$ and so on. there may be some time where there is no revenue like 4th day. the numbers are not consecutive.
data =pd.DataFrame({'day':[1,2,3,5,6,7,8,9,10,11,14,15,17,18,19],
'value':[5,3,7,8,9,4,6,5,2,8,6,7,9,5,2]})
I want to find total value after every 7 day window.
output should be like
day value
7 36
14 27
21 23
I am using loop to achieve this. is there a better pythonic way of doing this.
df =pd.DataFrame({})
sum_value=0
for index, row in data.iterrows():
sum_value+= row['value']
if row['day'] %7==0:
df = df.append(pd.DataFrame({'day':row['day'],'sum_value':[sum_value]}))
sum_value=0
pritn(df)
Also, how to find sum of previous 7 day values at each day (each row)
expected output
day value
1 5
2 8
3 15
5 23
6 32
7 36
8 37
9 39
10 34
and so on...
I hope i did the calculation right. it is basically running total of previous 7 days of values. it would be easier if the numbers are not missing in days column.
Use groupby with helper Series with subtract 1 and integer division with aggregate sum and last:
df = data.groupby((data['day'] - 1) // 7 , as_index=False).agg({'day':'last', 'value':'sum'})
print (df)
day value
0 7 36
1 14 27
2 19 23
Details:
print ((data['day'] - 1) // 7)
0 0
1 0
2 0
3 0
4 0
5 0
6 1
7 1
8 1
9 1
10 1
11 2
12 2
13 2
14 2
Name: day, dtype: int64
Similar solution if need divide day column by 7:
df = data.groupby((data['day'] - 1) // 7)['value'].sum().reset_index()
df['day'] = (df['day'] + 1) * 7
print (df)
day value
0 7 36
1 14 27
2 21 23
EDIT: Need rolling with sum, but first is necessary add missing dates by reindex - necessary unique values of day column.
idx = np.arange(data['day'].min(), data['day'].max() + 1)
df = data.set_index('day').reindex(idx).rolling(7, min_periods=1).sum()
df = df[df.index.isin(data['day'])]
print (df)
value
day
1 5.0
2 8.0
3 15.0
5 23.0
6 32.0
7 36.0
8 37.0
9 39.0
10 34.0
11 42.0
14 27.0
15 28.0
17 30.0
18 27.0
19 29.0
If get:
ValueError: cannot reindex from a duplicate axis
it means duplicates day values and solution is aggregate sum first:
#duplicated day 1
data =pd.DataFrame({'day':[1,1,3,5,6,7,8,9,10,11,14,15,17,18,19],
'value':[5,3,7,8,9,4,6,5,2,8,6,7,9,5,2]})
idx = np.arange(data['day'].min(), data['day'].max() + 1)
df = data.groupby('day')['value'].sum().reindex(idx).rolling(7, min_periods=1).sum()
df = df[df.index.isin(data['day'])]
print (df)
day
1 8.0
3 15.0
5 23.0
6 32.0
7 36.0
8 34.0
9 39.0
10 34.0
11 42.0
14 27.0
15 28.0
17 30.0
18 27.0
19 29.0
Name: value, dtype: float64

Multiply a column by a constant value per year in Tableau

I have an Excel file with normalized structure:
[table1]
id percentage region year
1 0.10 A 01-01-1995
2 0.61 A 01-02-1995
3 0.97 A 01-03-1995
4 0.11 B 01-01-1995
5 0.21 B 01-02-1995
6 0.99 B 01-03-1995
7 0.02 A 01-01-1996
8 0.61 A 01-02-1996
9 0.96 A 01-03-1996
10 0.05 B 01-01-1996
11 0.55 B 01-02-1996
12 0.99 B 01-03-1996
and a second table, with
[table2]
other_id region value year
1 A 99 1995
2 A 76 1996
3 B 102 1995
4 B 50 1996
and now I need to multiply the values from the first table to the second table, to get the total by month by year evolution. I tried to do a calculated field like
[table1] * [table2]
but I get the not aggregated error. The calc
avg[table1] * avg[table2]
or
attr[table1] * [table2]
is valid in Tableau, but the result is not correct. How can I accomplish this directly on Tableau?
EDIT: The expected result would be
[table1]
id percentage region year
1 0.10 * 99 A 01-01-1995
2 0.61 * 99 A 01-02-1995
3 0.97 * 99 A 01-03-1995
4 0.11 * 102 B 01-01-1995
5 0.21 * 102 B 01-02-1995
6 0.99 * 102 B 01-03-1995
7 0.02 * 76 A 01-01-1996
8 0.61 * 76 A 01-02-1996
9 0.96 * 76 A 01-03-1996
10 0.05 * 50 B 01-01-1996
11 0.55 * 50 B 01-02-1996
You can try creating a formula. It can go something like this-
For each row in table1,
Find a corresponding row in table2 (with same year and region).
Multiply the percentage in table1 with value in table2

Resources