Here is a sample of my data:

import pandas as pd

dic = {'Drug': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B'],
       'Date': ['01-01-20', '01-02-20', '01-03-20', '01-04-20', '01-05-20', '01-10-20',
                '01-15-20', '01-20-20', '01-21-20', '01-01-20', '01-02-20', '01-03-20',
                '01-04-20', '01-05-20'],
       'Amount': [10, 20, 30, 40, 50, 60, 70, 80, 90, 10, 20, 30, 40, 50]}
df = pd.DataFrame(dic)
| Drug | Date | Amount |
| ---- | -------- | ------ |
| A | 01-01-20 | 10 |
| | 01-02-20 | 20 |
| | 01-03-20 | 30 |
| | 01-04-20 | 40 |
| | 01-05-20 | 50 |
| | 01-10-20 | 60 |
| | 01-15-20 | 70 |
| | 01-20-20 | 80 |
| | 01-21-20 | 90 |
| B | 01-01-20 | 10 |
| | 01-02-20 | 20 |
| | 01-03-20 | 30 |
| | 01-04-20 | 40 |
| | 01-05-20 | 50 |
I have performed a groupby on Drug and want to apply a lambda function that calculates 3 metrics:

Lag -> the amount for a drug x days ago.
Trend -> the difference between the amount for a drug today and the amount x days ago.
Window -> the mean of the amounts for a drug between today and x days ago. (Days not seen in the dataframe are assumed to have the same value as the most recent day before them in the data, i.e. Jan 6th 2020 has the same value as Jan 5th 2020; days in 2019 are considered to have the same value as Jan 1st 2020.)

Here is my desired output for the case where x = 7:
| Drug | Date | Amount | Date 7 Days Ago | Lag | Trend | Window |
| ---- | -------- | ------ | --------------- | --- | ----- | ------ |
| A | 01-01-20 | 10 | 12-26-19 | 10 | 0 | 10.00 |
| | 01-02-20 | 20 | 12-27-19 | 10 | 10 | 11.43 |
| | 01-03-20 | 30 | 12-28-19 | 10 | 20 | 14.29 |
| | 01-04-20 | 40 | 12-29-19 | 10 | 30 | 18.57 |
| | 01-05-20 | 50 | 12-30-19 | 10 | 40 | 24.29 |
| | 01-10-20 | 60 | 01-04-20 | 40 | 20 | 50.00 |
| | 01-15-20 | 70 | 01-09-20 | 50 | 20 | 60.00 |
| | 01-20-20 | 80 | 01-14-20 | 60 | 20 | 70.00 |
| | 01-21-20 | 90 | 01-15-20 | 70 | 20 | 74.29 |
| B | 01-01-20 | 10 | 12-26-19 | 10 | 0 | 10.00 |
| | 01-02-20 | 20 | 12-27-19 | 10 | 10 | 11.43 |
| | 01-03-20 | 30 | 12-28-19 | 10 | 20 | 14.29 |
| | 01-04-20 | 40 | 12-29-19 | 10 | 30 | 18.57 |
| | 01-05-20 | 50 | 12-30-19 | 10 | 40 | 24.29 |
I have implemented the above using for loops, but I want a more idiomatic pandas way of doing this, which I am unable to figure out.
First get your df in order (parse Date as datetime and expand each drug to a complete daily series), then df.shift(...) and a rolling mean can do the rest.
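A minimal sketch of that idea, with the dates parsed and each group reindexed to daily frequency. Note that the desired output treats "7 days ago" as an inclusive 7-day window, i.e. 6 calendar days back (01-01-20 maps to 12-26-19), so the code shifts by x - 1:

import pandas as pd

df['Date'] = pd.to_datetime(df['Date'], format='%m-%d-%y')
x = 7

def add_metrics(g):
    # Daily series reaching back to the start of the first window; gaps carry
    # the previous day's value (ffill) and 2019 days take the Jan 1st value (bfill).
    full = pd.date_range(g['Date'].min() - pd.Timedelta(days=x - 1),
                         g['Date'].max(), freq='D')
    s = g.set_index('Date')['Amount'].reindex(full).ffill().bfill()
    out = g.copy()
    out['Date 7 Days Ago'] = out['Date'] - pd.Timedelta(days=x - 1)
    # Lag: the value x - 1 calendar days earlier.
    out['Lag'] = s.shift(x - 1).loc[out['Date']].to_numpy()
    out['Trend'] = out['Amount'] - out['Lag']
    # Window: rolling mean over the inclusive x-day window ending today.
    out['Window'] = s.rolling(x).mean().loc[out['Date']].to_numpy()
    return out

result = df.groupby('Drug', group_keys=False).apply(add_metrics)

This reproduces the table above, e.g. Window for drug A on 01-21-20 is mean(70, 70, 70, 70, 70, 80, 90) = 74.29.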
How to join tables creating a new row if it doesn't exist.
I tried: Products.join(Class, Products.id_product == Class.id_product, 'right')
Products
+-----------+-------+----------+-----------+
|day | store | quantity | id_product|
|2022-05-05 | 01 | 10 | 1 |
|2022-05-05 | 01 | 10 | 2 |
|2022-05-05 | 01 | 7 | 3 |
|2022-05-22 | 01 | 8 | 1 |
+-----------+-------+----------+-----------+
Class
+-----------+-----+
|id_product | size|
|1 | S |
|2 | L |
|3 | XL |
+-----------+-----+
I would like new rows to be created with a null value for quantity, while keeping the day, store, id_product and size information.
My result
+-----------+-------+----------+------------+-----+
|day | store | quantity | id_product | size|
|2022-05-05 | 01 | 10 | 1 | S |
|2022-05-05 | 01 | 10 | 2 | L |
|2022-05-05 | 01 | 7 | 3 | XL |
|2022-05-22 | 01 | 8 | 1 | S |
+-----------+-------+----------+------------+-----+
Expected
+-----------+-------+----------+------------+-----+
|day | store | quantity | id_product | size|
|2022-05-05 | 01 | 10 | 1 | S |
|2022-05-05 | 01 | 10 | 2 | L |
|2022-05-05 | 01 | 7 | 3 | XL |
|2022-05-22 | 01 | 8 | 1 | S |
|2022-05-22 | 01 | null | 2 | L |
|2022-05-22 | 01 | null | 3 | XL |
+-----------+-------+----------+------------+-----+
I guess you first need all combinations of day, store, id_product and size, and can then left-join products onto those keys.
products
  .select($"day", $"store")
  .distinct
  .crossJoin(classes)
  .as("keys")
  .join(
    products.as("prds"),
    Seq("day", "store", "id_product"),
    "left"
  )
  .select(
    $"keys.day", $"keys.store", $"prds.quantity",
    $"keys.id_product", $"keys.size"
  )
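Since the attempt in the question uses PySpark syntax, a rough Python equivalent of the same idea might look like this (a sketch, assuming the DataFrames are bound to the names products and classes as above):

# All (day, store) pairs crossed with all products, then left-join the stock.
keys = products.select("day", "store").distinct().crossJoin(classes)
result = (
    keys.join(products, ["day", "store", "id_product"], "left")
        .select("day", "store", "quantity", "id_product", "size")
)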
Excel
| A | B | C | D | E | F | G | H |
---|-----------------|----------|--------|--------|-----------|-------------|---------|----------|---
1 | Loan | 50.000 | Year | Start | Interests | Repayment | Annuity | End |
2 | Interests p.a. | 2.5% | 1 | 50.000 | -1.250 | -1.750 | -3.000 | 48.250 |
3 | Annuity p.a. | 3.000 | 2 | 48.250 | -1.206 | -1.794 | -3.000 | 46.456 |
4 | Maturity | ?? | 3 | 46.456 | -1.161 | -1.839 | -3.000 | 44.618 |
5 | | | 4 | 44.618 | -1.115 | -1.885 | -3.000 | 42.733 |
| | | | | | | | |
| | | | | | | | |
21 | | | 20 | 8.094 | -202 | -2.798 | -3.000 | 5.297 |
22 | | | 21 | 5.297 | -132 | -2.868 | -3.000 | 2.429 |
23 | | | 22 | 2.429 | -61 | -2.939 | -3.000 | 0 |
The above loan of 50.000 has an interest rate of 2.5% and an annuity of 3.000.
In the table in C1:H23 the annual development of the remaining loan balance is displayed.
Based on this helper table I know that the maturity of the loan is 22 years, using the following formula in cell B4:
B4 = COUNTA(C2:C23)
However, my question is whether there is an Excel formula that can calculate the maturity in one cell, so that I do not need the helper table in C1:H23.
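One candidate is Excel's built-in NPER function, which returns the number of payment periods for a constant-annuity loan. Assuming the loan is in B1, the interest rate in B2 and the annuity in B3 as in the layout above, rounding it up should give the maturity directly:

B4 = ROUNDUP(NPER(B2, -B3, B1), 0)

With a 2.5% rate, an annuity of 3.000 and a loan of 50.000 this evaluates to roughly 21.8 and rounds up to 22, matching the helper table.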
I have two dataframes
df1
+----+-------+
| | Key |
|----+-------|
| 0 | 30 |
| 1 | 31 |
| 2 | 32 |
| 3 | 33 |
| 4 | 34 |
| 5 | 35 |
+----+-------+
df2
+----+-------+--------+
| | Key | Test |
|----+-------+--------|
| 0 | 30 | Test4 |
| 1 | 30 | Test5 |
| 2 | 30 | Test6 |
| 3 | 31 | Test4 |
| 4 | 31 | Test5 |
| 5 | 31 | Test6 |
| 6 | 32 | Test3 |
| 7 | 33 | Test3 |
| 8 | 33 | Test3 |
| 9 | 34 | Test1 |
| 10 | 34 | Test1 |
| 11 | 34 | Test2 |
| 12 | 34 | Test3 |
| 13 | 34 | Test3 |
| 14 | 34 | Test3 |
| 15 | 35 | Test3 |
| 16 | 35 | Test3 |
| 17 | 35 | Test3 |
| 18 | 35 | Test3 |
| 19 | 35 | Test3 |
+----+-------+--------+
I want to count how many times each Test is listed for each Key.
+----+-------+-------+-------+-------+-------+-------+-------+
| | Key | Test1 | Test2 | Test3 | Test4 | Test5 | Test6 |
|----+-------|-------|-------|-------|-------|-------|-------|
| 0 | 30 | | | | 1 | 1 | 1 |
| 1 | 31 | | | | 1 | 1 | 1 |
| 2 | 32 | | | 1 | | | |
| 3 | 33 | | | 2 | | | |
| 4 | 34 | 2 | 1 | 3 | | | |
| 5 | 35 | | | 5 | | | |
+----+-------+-------+-------+-------+-------+-------+-------+
What I've tried
Using join and groupby, I first got the count for each Key, regardless of Test.
result_df = df1.join(df2.groupby('Key').size().rename('Count'), on='Key')
+----+-------+---------+
| | Key | Count |
|----+-------+---------|
| 0 | 30 | 3 |
| 1 | 31 | 3 |
| 2 | 32 | 1 |
| 3 | 33 | 2 |
| 4 | 34 | 6 |
| 5 | 35 | 5 |
+----+-------+---------+
I then tried grouping by both Key and Test
result_df = df1.join(df2.groupby(['Key', 'Test']).size().rename('Count'), on='Key')
but this returns an error
ValueError: len(left_on) must equal the number of levels in the index of "right"
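The error occurs because the grouped result is indexed by two levels (Key and Test), while on='Key' supplies only one key. Unstacking Test into columns first makes the right-hand side joinable; a minimal sketch of that fix:

# Pivot the Test level into columns, then join on the single Key level.
counts = df2.groupby(['Key', 'Test']).size().unstack('Test')
result_df = df1.join(counts, on='Key')

This yields one column per Test, with NaN where a combination never occurs, matching the expected layout above.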
Check with crosstab:

pd.crosstab(df2.Key, df2.Test).reindex(df1.Key).replace({0: ''})
Here is another solution with groupby and pivot. With this approach you don't need df1 at all.

import numpy as np
import pandas as pd

# create some dummy data (randint's upper bound is exclusive, so 36 yields keys 30-35)
tests = ['Test' + str(i) for i in range(1, 7)]
df = pd.DataFrame({'Test': np.random.choice(tests, size=100),
                   'Key': np.random.randint(30, 36, size=100)})
df['Count Variable'] = 1

# group & count aggregation; reset_index so pivot sees Key and Test as columns
df = df.groupby(['Key', 'Test']).count().reset_index()
df = df.pivot(index='Key', columns='Test', values='Count Variable').reset_index()
I am new to this site and haven't done much in Excel for decades (yes, decades), so I have forgotten more than I know now.
Background: I am working on a simple pay-sheet checking spreadsheet. One worksheet is the input timesheet for data entry, the complex one does all the calculations (hourly rate, shift loading, tax formula, etc.), and the final worksheet presents the results in the same format as the pay slip. Having finished the complex formulas in the calculation sheet, I am now stuck on condensing the results for the final sheet. I have tried numerous functions, including VLOOKUP, INDEX, MATCH, RANK.EQ, SMALL and others, as per examples from other questions on this site. Sample data is:
+----+-----------------------------------------------------+----------------+------------+--------------+--------+--------+-----+--------+
| | A | B | C | D | E | F | G | H |
+----+-----------------------------------------------------+----------------+------------+--------------+--------+--------+-----+--------+
| 1 | Sample data: | | | | | | | |
| 2 | Monday | Ordinary Hours | 30/04/2018 | Day Shift | 10.85 | 21.85 | 1 | 237.07 |
| 3 | Tuesday | Ordinary Hours | 1/05/2018 | | | 21.85 | 1 | |
| 4 | Wednesday | Ordinary Hours | 2/05/2018 | | | 21.85 | 1 | |
| 5 | Thursday | Ordinary Hours | 3/05/2018 | | | 21.85 | 1 | |
| 6 | Friday | Ordinary Hours | 4/05/2018 | | | 21.85 | 1 | |
| 7 | | | | | | | | |
| 8 | | | | | | | | |
| 9 | Monday | Ordinary Hours | 7/05/2018 | | | 21.85 | 1 | |
| 10 | Tuesday | Ordinary Hours | 8/05/2018 | | | 21.85 | 1 | |
| 11 | Wednesday | Ordinary Hours | 9/05/2018 | Day Shift | 10.85 | 21.85 | 1 | 237.07 |
| 12 | Thursday | Ordinary Hours | 10/05/2018 | Day Shift | 10.85 | 21.85 | 1 | 237.07 |
| 13 | Friday | Ordinary Hours | 11/05/2018 | | | 21.85 | 1 | |
| 14 | | | | | | | | |
| 15 | Monday | Overtime 1.5 | 30/04/2018 | | | 21.85 | 1.5 | |
| 16 | Tuesday | Overtime 1.5 | 1/05/2018 | Overtime 1.5 | 2 | 21.85 | 1.5 | 65.55 |
| 17 | Wednesday | Overtime 1.5 | 2/05/2018 | | | 21.85 | 1.5 | |
| 18 | Thursday | Overtime 1.5 | 3/05/2018 | | | 21.85 | 1.5 | |
| 19 | Friday | Overtime 1.5 | 4/05/2018 | | | 21.85 | 1.5 | |
| 20 | Saturday | Overtime 1.5 | 5/05/2018 | | | 21.85 | 1.5 | |
| 21 | | | | | | | | |
| 22 | Monday | Overtime 1.5 | 7/05/2018 | | | 21.85 | 1.5 | |
| 23 | Tuesday | Overtime 1.5 | 8/05/2018 | | | 21.85 | 1.5 | |
| 24 | Wednesday | Overtime 1.5 | 9/05/2018 | | | 21.85 | 1.5 | |
| 25 | Thursday | Overtime 1.5 | 10/05/2018 | | | 21.85 | 1.5 | |
| 26 | Friday | Overtime 1.5 | 11/05/2018 | | | 21.85 | 1.5 | |
| 27 | Saturday | Overtime 1.5 | 12/05/2018 | | | 21.85 | 1.5 | |
| 28 | | | | | | | | |
| 29 | | | | | | | | |
| 30 | Required result on separate sheet in same workbook: | | | | | | | |
| 31 | Taxable Allowances | Comments | Qty | Rate | Factor | Amount | | |
| 32 | Ordinary Hours | 30/04/2018 | 10.85 | 21.85 | 1 | 237.07 | | |
| 33 | Ordinary Hours | 9/05/2018 | 10.85 | 21.85 | 1 | 237.07 | | |
| 34 | Ordinary Hours | 10/05/2018 | 10.85 | 21.85 | 1 | 237.07 | | |
| 35 | Overtime 1.5 | 1/05/2018 | 2 | 21.85 | 1.5 | 65.55 | | |
| 36 | | | | | | | | |
| 37 | | | | | | | | |
| 38 | | | | | | | | |
| 39 | | | | | | | | |
| 40 | | | | | | | | |
+----+-----------------------------------------------------+----------------+------------+--------------+--------+--------+-----+--------+
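One common pattern for condensing just the non-blank rows is an INDEX/SMALL array formula. A sketch only, under these assumptions: the sample data lives on a sheet named Calc (a hypothetical name) in rows 2:27, a row should be kept when its column H amount is non-blank, and the result block starts in row 32. Entered with Ctrl+Shift+Enter and filled down:

A32 = IFERROR(INDEX(Calc!B:B, SMALL(IF(Calc!$H$2:$H$27<>"", ROW(Calc!$H$2:$H$27)), ROWS(A$32:A32))), "")

The INDEX column (Calc!B:B here) is swapped for C, E, F, G or H in the other result columns.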
Let's say that I have a table like the one below:
| | Value 1 | Value 2 | Value 3 | |
|---|---------|---------|---------|---|
| A | 22 | 12 | 3 | |
| A | 5 | 6 | 12 | |
| A | 19 | 9 | 13 | |
| A | 22 | 43 | 31 | |
| B | 7 | 12 | 23 | |
| B | 5 | 5 | 8 | |
| B | 35 | 78 | 9 | |
| B | 45 | 1 | 8 | |
| C | 34 | 56 | 0 | |
| C | 22 | 1 | 14 | |
| C | 13 | 46 | 45 | |
and that I'd need to transform it into the one below:
| | Value 1 | Value 2 | Value 3 | |
|---|---------|---------|---------|---|
| A | 22 | 12 | 3 | |
| A | 5 | 6 | 12 | |
| A | 19 | 9 | 13 | |
| A | 22 | 43 | 31 | |
| | 68 | 70 | 59 | |
| | | | | |
| B | 7 | 12 | 23 | |
| B | 5 | 5 | 8 | |
| B | 35 | 78 | 9 | |
| B | 45 | 1 | 8 | |
| | 92 | 96 | 48 | |
| | | | | |
| C | 34 | 56 | 0 | |
| C | 22 | 1 | 14 | |
| C | 13 | 46 | 45 | |
| | 69 | 103 | 59 | |
How could I obtain the desired effect automatically?
After each group there would be the sums of each column within the group, followed by n empty rows.
You can use the Subtotal feature of Excel, found on the Data tab of the ribbon, to add the totals between groupings automatically. I don't think it adds the blank row, though; if you absolutely need the blank row, I can generate some VBA that will work.
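For reference, the rows that the Subtotal feature inserts use the SUBTOTAL function with function number 9 (SUM); for the first group above, assuming its values sit in B2:B5, the inserted total would be something like:

B6 = SUBTOTAL(9, B2:B5)

Unlike a plain SUM, these totals are ignored by any grand total row that Subtotal adds, and they respect filtered-out rows.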