Dynamically pass in name of column based on value from another - apache-spark

Say I have a dataframe like below.
eff_month eff_date latest_volume_month_1 latest_volume_month_2 latest_volume_month_3 pct
1 2022-01-13 55 60 70 .5
2 2022-02-10 40 50 60 .1
3 2022-03-02 30 50 70 .2
I am trying to create a new column, vol, computed from the latest_volume_month_{n} columns and pct, where eff_month decides which month columns are included.
The math: for the 1st month, take only latest_volume_month_1 and multiply it by pct to get vol.
For the 2nd month, take the sum of latest_volume_month_1 and latest_volume_month_2, then multiply by pct.
For the 3rd month, take the sum of months 1, 2 and 3 and multiply by pct to get vol.
The expected output would look something like this.
eff_month eff_date latest_volume_month_1 latest_volume_month_2 latest_volume_month_3 pct vol
1 2022-01-13 55 60 70 .5 27.5
2 2022-02-10 40 50 60 .1 9.0
3 2022-03-02 30 50 70 .2 30.0
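The rule described in the question can be sketched as follows. This is a pandas sketch rather than Spark code, and it assumes the month columns are numbered 1 to 3 as in the sample; month_cols and running are illustrative names. Each month column is masked to 0 on rows where its number exceeds eff_month, the masked columns are summed, and the sum is multiplied by pct. The values are computed directly from the rule as stated.

```python
import pandas as pd

df = pd.DataFrame({
    "eff_month": [1, 2, 3],
    "eff_date": ["2022-01-13", "2022-02-10", "2022-03-02"],
    "latest_volume_month_1": [55, 40, 30],
    "latest_volume_month_2": [60, 50, 50],
    "latest_volume_month_3": [70, 60, 70],
    "pct": [0.5, 0.1, 0.2],
})

# Assumption: month columns follow the latest_volume_month_{n} naming pattern.
month_cols = [f"latest_volume_month_{i}" for i in (1, 2, 3)]

# Keep a month's volume only on rows where that month number is <= eff_month,
# replace it with 0 elsewhere, then add the masked columns together.
running = sum(
    df[col].where(df["eff_month"] >= i, 0)
    for i, col in enumerate(month_cols, start=1)
)
df["vol"] = running * df["pct"]
print(df[["eff_month", "pct", "vol"]])
```

The same masking idea should carry over to Spark by building one pyspark.sql.functions.when(...).otherwise(0) expression per month column and adding them up, though that variant is not tested here.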

Related

Max (date) # - Min(Date) # to get the weight difference - Excel Formula

I have several weight entries per ID, recorded on different dates, and want the weight difference between the latest date and the earliest date for each ID.
ID Date Weight Difference
12345 1/1/22 100 -10
12345 2/1/22 95 -10
12345 3/1/22 90 -10
44440 1/1/22 90 10
44440 2/1/22 95 10
44440 3/1/22 100 10
18258 2/1/22 105 -1
18258 3/1/22 104 -1
23584 3/1/22 110 0
So far, I have used the following formula: =MAX(IF(A:A=A2,U:U))-MIN(IF(A:A=A2,U:U))
A is the ID, U is the Weight
The formula does not take the dates into account: it subtracts the minimum weight from the maximum weight, instead of subtracting the weight on the earliest date from the weight on the latest date.
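This is not an Excel formula, but the intended calculation can be sketched in pandas to make the logic concrete. The sketch assumes the dates are in month/day/year form and that the columns are named as in the table; ordered, by_id and diff are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [12345, 12345, 12345, 44440, 44440, 44440, 18258, 18258, 23584],
    "Date": ["1/1/22", "2/1/22", "3/1/22", "1/1/22", "2/1/22", "3/1/22",
             "2/1/22", "3/1/22", "3/1/22"],
    "Weight": [100, 95, 90, 90, 95, 100, 105, 104, 110],
})
# Assumption: dates are month/day/two-digit year.
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%y")

# Sort by date within each ID so first()/last() pick the earliest/latest entry.
ordered = df.sort_values(["ID", "Date"])
by_id = ordered.groupby("ID")["Weight"]

# Weight on the latest date minus weight on the earliest date, per ID.
diff = by_id.last() - by_id.first()

# Broadcast the per-ID difference back onto every row.
df["Difference"] = df["ID"].map(diff)
print(df)
```

The key point is that the lookup is driven by the date ordering, not by the size of the weight values.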

How do I create series in Excel using criteria from values in other cells?

I have an Excel spreadsheet populated as below:
Latitude  Longitude  Altitude  Value
10        10         1         100
10        10         5         105
10        10         20        120
10        5          1         150
10        5          5         155
10        5          20        170
15        10         1         500
15        10         5         505
15        10         20        520
15        5          1         550
15        5          5         555
15        5          20        570
Using this data, I would like to create a chart in Excel with Value on the X-axis, Altitude on the Y-axis, and one series for each unique combination of Latitude and Longitude.
This should result in 4 series being plotted on the chart, each with 3 values (one for each Altitude). I feel like this should be easy, but I'm struggling to do it myself or to find anything via Google.
Any help you could provide this Excel noob would be greatly appreciated!
If you rearrange your data like this:
long-lat:  10-10            10-5             15-10            15-5
           value  altitude  value  altitude  value  altitude  value  altitude
           100    1         150    1         500    1         550    1
           105    5         155    5         505    5         555    5
           120    20        170    20        520    20        570    20
you can add the four curves individually, as separate series, to an XY (scatter) chart.
(The original answer included a screenshot here showing how the series are defined.)
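For reference, a minimal pandas sketch of the same rearrangement, assuming the data has been loaded into a DataFrame with the four columns from the question. The names blocks and wide are illustrative; the result is one value/altitude pair of columns per latitude-longitude combination, which could then be written back to a sheet.

```python
import pandas as pd

df = pd.DataFrame({
    "Latitude":  [10] * 6 + [15] * 6,
    "Longitude": [10, 10, 10, 5, 5, 5] * 2,
    "Altitude":  [1, 5, 20] * 4,
    "Value":     [100, 105, 120, 150, 155, 170,
                  500, 505, 520, 550, 555, 570],
})

# One block of (Value, Altitude) rows per unique Latitude/Longitude pair,
# kept in the order the pairs first appear in the data.
blocks = {}
for (lat, lon), grp in df.groupby(["Latitude", "Longitude"], sort=False):
    blocks[f"{lat}-{lon}"] = grp[["Value", "Altitude"]].reset_index(drop=True)

# Place the blocks side by side: the top column level is the series label.
wide = pd.concat(blocks, axis=1)
print(wide)
```

Each top-level label in wide corresponds to one series of the scatter chart, with Value as X and Altitude as Y.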

Grouping data based on month-year in pandas and then dropping all entries except the latest one- Python

Below is my example dataframe
Date Indicator Value
0 2000-01-30 A 30
1 2000-01-31 A 40
2 2000-03-30 C 50
3 2000-02-27 B 60
4 2000-02-28 B 70
5 2000-03-31 C 90
6 2000-03-28 C 100
7 2001-01-30 A 30
8 2001-01-31 A 40
9 2001-03-30 C 50
10 2001-02-27 B 60
11 2001-02-28 B 70
12 2001-03-31 C 90
13 2001-03-28 C 100
Desired Output
Date Indicator Value
2000-01-31 A 40
2000-02-28 B 70
2000-03-31 C 90
2001-01-31 A 40
2001-02-28 B 70
2001-03-31 C 90
I want to write code that groups the data by month-year, keeps the entry with the latest date in each month-year, and drops the rest. The data runs until the year 2020.
So far I was only able to get the count per month-year. I have not been able to write code that groups the data by month-year and indicator and gives the correct result.
Use Series.dt.to_period to convert the dates to monthly periods, get the index of the latest date in each group with DataFrameGroupBy.idxmax, and then pass those index labels to DataFrame.loc:
import pandas as pd
df['Date'] = pd.to_datetime(df['Date'])
# Monthly period (year-month) of each row
print(df['Date'].dt.to_period('M'))
0 2000-01
1 2000-01
2 2000-03
3 2000-02
4 2000-02
5 2000-03
6 2000-03
7 2001-01
8 2001-01
9 2001-03
10 2001-02
11 2001-02
12 2001-03
13 2001-03
Name: Date, dtype: period[M]
# Keep, for each year-month, the row with the latest date
df = df.loc[df.groupby(df['Date'].dt.to_period('M'))['Date'].idxmax()]
print(df)
Date Indicator Value
1 2000-01-31 A 40
4 2000-02-28 B 70
5 2000-03-31 C 90
8 2001-01-31 A 40
11 2001-02-28 B 70
12 2001-03-31 C 90
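The question also mentions grouping by indicator. A possible variant, shown here on the first year of the sample data only, groups by both the monthly period and Indicator. It uses sort_values plus GroupBy.tail(1) rather than idxmax, which is a swapped-in technique and not the approach above; the helper column name month is an assumption.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["2000-01-30", "2000-01-31", "2000-03-30", "2000-02-27",
             "2000-02-28", "2000-03-31", "2000-03-28"],
    "Indicator": ["A", "A", "C", "B", "B", "C", "C"],
    "Value": [30, 40, 50, 60, 70, 90, 100],
})
df["Date"] = pd.to_datetime(df["Date"])

latest = (
    df.assign(month=df["Date"].dt.to_period("M"))  # helper year-month column
      .sort_values("Date")                         # latest date last in each group
      .groupby(["month", "Indicator"])
      .tail(1)                                     # keep the last row per group
      .drop(columns="month")
      .reset_index(drop=True)
)
print(latest)
```

If an indicator can appear more than once per month and only one row per month is wanted, grouping by the month alone, as in the answer above, is the right choice.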

Maximum for each column, return value of other for max, create new dataframe of returns

I hope the title is not misleading.
I need to go from this dataframe:
Column_1 Columns_2 First Second Third
0 Element_1 to_be_ignored 10 5 77
1 Element_2 to_be_ignored 30 30 11
2 Element_3 to_be_ignored 60 7 3
3 Element_4 to_be_ignored 20 87 90
to:
New_Column New_Column_1 Max
0 Element_3 First 60
1 Element_4 Second 87
2 Element_4 Third 90
Get the maximum value of every column
Get the corresponding value of Column_1 for each maximum
Transform the result into a new dataframe
What I have so far:
import pandas as pd
data = {'Column_1': ['Element_1', 'Element_2', 'Element_3', 'Element_4'],
'Columns_2': ['to_be_ignored', 'to_be_ignored', 'to_be_ignored', 'to_be_ignored'],
'First': [10,30,60,20], 'Second': [5,30,7,87], 'Third': [77,11,3,90]}
df = pd.DataFrame(data)
# Skip the two label columns and look up Column_1 at the row of each column's maximum
df.loc[df.iloc[:, 2:].idxmax(), ['Column_1']]
So I am able to get the index position and the Column_1 value for the maximum of each column:
2 Element_3
3 Element_4
3 Element_4
Unfortunately I can't figure out the rest.
THX
If I understand correctly, use melt, then sort_values + drop_duplicates:
df.melt(['Column_1','Columns_2']).sort_values('value').drop_duplicates(['variable'],keep='last')
Column_1 Columns_2 variable value
2 Element_3 to_be_ignored First 60
7 Element_4 to_be_ignored Second 87
11 Element_4 to_be_ignored Third 90
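To get exactly the three columns requested in the question, one option is to build the result from idxmax and max directly. This is a sketch under the assumption that the value columns are First, Second and Third; value_cols, idx and result are illustrative names.

```python
import pandas as pd

data = {
    "Column_1": ["Element_1", "Element_2", "Element_3", "Element_4"],
    "Columns_2": ["to_be_ignored"] * 4,
    "First": [10, 30, 60, 20],
    "Second": [5, 30, 7, 87],
    "Third": [77, 11, 3, 90],
}
df = pd.DataFrame(data)

value_cols = ["First", "Second", "Third"]

# Row label of the maximum in each value column.
idx = df[value_cols].idxmax()

result = pd.DataFrame({
    "New_Column": df.loc[idx, "Column_1"].to_numpy(),  # Column_1 at each maximum
    "New_Column_1": value_cols,                        # which column the maximum came from
    "Max": df[value_cols].max().to_numpy(),            # the maximum itself
})
print(result)
```

Note that idxmax returns the first row when a maximum is tied, whereas the melt approach with keep='last' keeps the last row after sorting.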

Apply multiple operations on same columns after groupby

I have the following df,
id year_month amount
10 201901 10
10 201901 20
10 201901 30
20 201902 40
20 201902 20
I want to group by id and year_month and then get the group size and the sum of amount:
df.groupby(['id', 'year_month'], as_index=False)['amount'].sum()
df.groupby(['id', 'year_month'], as_index=False).size().reset_index(name='count')
I am wondering how to do both at the same time in one line, to get:
id year_month amount count
10 201901 60 3
20 201902 60 2
Use agg:
df.groupby(['id', 'year_month']).agg({'amount': ['count', 'sum']})
amount
count sum
id year_month
10 201901 3 60
20 201902 2 60
If you want to remove the multi-index, use MultiIndex.droplevel:
s = df.groupby(['id', 'year_month']).agg({'amount': ['count', 'sum']}).rename(columns ={'sum': 'amount'})
s.columns = s.columns.droplevel(level=0)
s.reset_index()
id year_month count amount
0 10 201901 3 60
1 20 201902 2 60
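An alternative that avoids the multi-level columns altogether is named aggregation, available in pandas 0.25 and later. This is a sketch on the sample data from the question; the output column names amount and count are chosen to match the desired result.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [10, 10, 10, 20, 20],
    "year_month": [201901, 201901, 201901, 201902, 201902],
    "amount": [10, 20, 30, 40, 20],
})

# Each keyword is an output column: (source column, aggregation function).
out = df.groupby(["id", "year_month"], as_index=False).agg(
    amount=("amount", "sum"),
    count=("amount", "size"),
)
print(out)
```

Here 'size' counts all rows in the group, while 'count' would skip rows where amount is missing; on this sample the two give the same result.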
