Extract weekly data from daily data and reshape it from long to wide format using Pandas

Given the sample data below, I want to extract one entry per week; if a week has multiple entries, I will use the entry with the largest date (latest weekday) in that week:
date variable value
0 2020-11-4 quantity 564.0000
1 2020-11-11 quantity 565.0000
2 2020-11-18 quantity 566.0000
3 2020-11-25 quantity 566.0000
4 2020-11-2 price 1829.1039
5 2020-11-3 price 1789.5883
6 2020-11-4 price 1755.4307
7 2020-11-5 price 1750.0727
8 2020-11-6 price 1746.7239
9 2020-11-9 price 1756.1005
10 2020-11-10 price 1752.0820
11 2020-11-11 price 1814.3693
12 2020-11-12 price 1833.7922
13 2020-11-13 price 1833.7922
14 2020-11-16 price 1784.2302
15 2020-11-17 price 1764.1376
16 2020-11-18 price 1770.1654
17 2020-11-19 price 1757.4400
18 2020-11-20 price 1770.1654
To get the week number of each date, I use df['week_number'] = pd.to_datetime(df['date']).dt.week.
date variable value week_number
0 2020-11-4 quantity 564.0000 45 --> to keep
1 2020-11-11 quantity 565.0000 46 --> to keep
2 2020-11-18 quantity 566.0000 47 --> to keep
3 2020-11-25 quantity 566.0000 48 --> to keep
4 2020-11-2 price 1829.1039 45
5 2020-11-3 price 1789.5883 45
6 2020-11-4 price 1755.4307 45
7 2020-11-5 price 1750.0727 45
8 2020-11-6 price 1746.7239 45 --> to keep, since it's the largest weekday for this week
9 2020-11-9 price 1756.1005 46
10 2020-11-10 price 1752.0820 46
11 2020-11-11 price 1814.3693 46
12 2020-11-12 price 1833.7922 46
13 2020-11-13 price 1833.7922 46 --> to keep, since it's the largest weekday for this week
14 2020-11-16 price 1784.2302 47
15 2020-11-17 price 1764.1376 47
16 2020-11-18 price 1770.1654 47
17 2020-11-19 price 1757.4400 47
18 2020-11-20 price 1770.1654 47 --> to keep, since it's the largest weekday for this week
Finally, I will reshape the rows marked to keep into the expected result as follows:
variable the_45th_week the_46th_week the_47th_week the_48th_week
0 quantity 564.0000 565.0000 566.0000 566.0
1 price 1746.7239 1833.7922 1770.1654 NaN
How could I manipulate the data to get the expected result? Sincere thanks.
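For reference, the sample frame can be rebuilt with a snippet along these lines (dates kept as plain strings, exactly as printed above):
import pandas as pd

quantity_dates = ['2020-11-4', '2020-11-11', '2020-11-18', '2020-11-25']
price_dates = ['2020-11-2', '2020-11-3', '2020-11-4', '2020-11-5', '2020-11-6',
               '2020-11-9', '2020-11-10', '2020-11-11', '2020-11-12', '2020-11-13',
               '2020-11-16', '2020-11-17', '2020-11-18', '2020-11-19', '2020-11-20']
df = pd.DataFrame({
    'date': quantity_dates + price_dates,
    'variable': ['quantity'] * 4 + ['price'] * 15,
    'value': [564.0, 565.0, 566.0, 566.0,
              1829.1039, 1789.5883, 1755.4307, 1750.0727, 1746.7239,
              1756.1005, 1752.0820, 1814.3693, 1833.7922, 1833.7922,
              1784.2302, 1764.1376, 1770.1654, 1757.4400, 1770.1654]
})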
EDIT: I tried the following, but it keeps the wrong rows (for example 2020-11-2 instead of 2020-11-6 for week 45): sorting descending and keeping the last duplicate retains the earliest date per week, and the string dates do not sort chronologically either:
df = df.sort_values(by=['variable','date'], ascending=False)
df.drop_duplicates(['variable', 'week_number'], keep='last')
Out:
date variable value week_number
0 2020-11-4 quantity 564.0000 45
3 2020-11-25 quantity 566.0000 48
2 2020-11-18 quantity 566.0000 47
1 2020-11-11 quantity 565.0000 46
4 2020-11-2 price 1829.1039 45
14 2020-11-16 price 1784.2302 47
10 2020-11-10 price 1752.0820 46

In your solution it is possible to add pivot with rename. Parse the dates first so they sort chronologically, sort ascending, and keep the last (latest) row per variable and week:
df['date'] = pd.to_datetime(df['date'])
# dt.week is deprecated in recent pandas; dt.isocalendar().week is the current spelling
df['week_number'] = df['date'].dt.isocalendar().week
df = df.sort_values(by=['variable', 'date'])
df = df.drop_duplicates(['variable', 'week_number'], keep='last')
f = lambda x: f'the_{x}th_week'
out = df.pivot(index='variable', columns='week_number', values='value').rename(columns=f)
print(out)
week_number the_45th_week the_46th_week the_47th_week the_48th_week
variable
price 1746.7239 1833.7922 1770.1654 NaN
quantity 564.0000 565.0000 566.0000 566.0
Or remove DataFrame.drop_duplicates, so it is possible to use DataFrame.pivot_table with the aggregate function last (the ascending sort again guarantees that last means the latest date):
df['date'] = pd.to_datetime(df['date'])
df['week_number'] = df['date'].dt.isocalendar().week
df = df.sort_values(by=['variable', 'date'])
f = lambda x: f'the_{x}th_week'
out = df.pivot_table(index='variable', columns='week_number', values='value', aggfunc='last').rename(columns=f)
EDIT: to get exactly the same result as the expected one:
out.reset_index().rename_axis(None, axis=1)
Out:
variable the_45th_week the_46th_week the_47th_week the_48th_week
0 price 1746.7239 1833.7922 1770.1654 NaN
1 quantity 564.0000 565.0000 566.0000 566.0
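A further sketch, assuming the frame still has the parsed dates and the week_number column and has not been deduplicated: an ascending sort plus groupby with last builds the same table in one chain (f is the rename lambda from above):
out2 = (df.sort_values(['variable', 'date'])
          .groupby(['variable', 'week_number'])['value']
          .last()
          .unstack()
          .rename(columns=f)
          .reset_index()
          .rename_axis(None, axis=1))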

Related

How to sum by month in timestamp Data Frame?

I have a dataframe like this:
trx_date trx_amount
2013-02-11 35
2014-03-10 26
2011-02-9 10
2013-02-12 5
2013-01-11 21
How do I group that by month and year so that I can sum the trx_amount?
Example expected output:
trx_monthly trx_sum
2013-02 40
2013-01 21
2014-03 26
You can convert the values to month periods with Series.dt.to_period and then aggregate with sum:
df['trx_date'] = pd.to_datetime(df['trx_date'])
df1 = (df.groupby(df['trx_date'].dt.to_period('m').rename('trx_monthly'))['trx_amount']
         .sum()
         .reset_index(name='trx_sum'))
print(df1)
trx_monthly trx_sum
0 2011-02 10
1 2013-01 21
2 2013-02 40
3 2014-03 26
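The same totals can also be sketched with pd.Grouper, assuming trx_date has already been converted to datetime as above. Note that this behaves like a resample: months with no transactions between the first and last date also appear, with a sum of 0:
df3 = (df.groupby(pd.Grouper(key='trx_date', freq='M'))['trx_amount']
         .sum()
         .reset_index(name='trx_sum'))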
Or convert the datetimes to strings in YYYY-MM format with Series.dt.strftime:
df2 = (df.groupby(df['trx_date'].dt.strftime('%Y-%m').rename('trx_monthly'))['trx_amount']
         .sum()
         .reset_index(name='trx_sum'))
print(df2)
trx_monthly trx_sum
0 2011-02 10
1 2013-01 21
2 2013-02 40
3 2014-03 26
Or group by year and month separately; the output is then different, with 3 columns:
df2 = (df.groupby([df['trx_date'].dt.year.rename('year'),
                   df['trx_date'].dt.month.rename('month')])['trx_amount']
         .sum()
         .reset_index(name='trx_sum'))
print(df2)
year month trx_sum
0 2011 2 10
1 2013 1 21
2 2013 2 40
3 2014 3 26
You can try this (note the amount column is trx_amount, not trx_sum, and that grouping by the bare month number merges the same month across different years):
df['trx_month'] = df['trx_date'].dt.month
df_agg = df.groupby('trx_month')['trx_amount'].sum()
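If you take this shorter route, a sketch that keeps different years separate is to group on a year-month period instead of the bare month number:
df['trx_month'] = df['trx_date'].dt.to_period('m')
df_agg = df.groupby('trx_month')['trx_amount'].sum()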

Unstack a dataframe with duplicated index in Pandas

Given a toy dataset as follows, which has duplicated price and quantity rows:
city item value
0 bj price 12
1 bj quantity 15
2 bj price 12
3 bj quantity 15
4 bj level a
5 sh price 45
6 sh quantity 13
7 sh price 56
8 sh quantity 7
9 sh level b
I want to reshape it into the following dataframe, which means adding a sell_ prefix for the first pair and a buy_ prefix for the second pair:
city sell_price sell_quantity buy_price buy_quantity level
0 bj 12 15 12 15 a
1 sh 45 13 56 7 b
I have tried df.set_index(['city', 'item']).unstack().reset_index(), but it raises ValueError: Index contains duplicate entries, cannot reshape.
How could I get the desired output as above? Thanks.
You can prepend buy_ to the second duplicate and sell_ to the first duplicate, changing the values in item before applying your solution:
import numpy as np

# first occurrence of a duplicated (city, item) pair gets sell_, the second gets buy_
m1 = df.duplicated(['city', 'item'])
m2 = df.duplicated(['city', 'item'], keep=False)
df['item'] = np.where(m1, 'buy_', np.where(m2, 'sell_', '')) + df['item']
df = (df.set_index(['city', 'item'])['value']
        .unstack()
        .reset_index()
        .rename_axis(None, axis=1))
# change the order of the columns
df = df[['city','sell_price','sell_quantity','buy_price','buy_quantity','level']]
print(df)
city sell_price sell_quantity buy_price buy_quantity level
0 bj 12 15 12 15 a
1 sh 45 13 56 7 b
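An alternative sketch for building the prefixes, assuming every duplicated (city, item) pair appears exactly twice: number the occurrences with cumcount and map the position to a prefix:
dup = df.duplicated(['city', 'item'], keep=False)
order = df.groupby(['city', 'item']).cumcount()
# rows that are not duplicated (such as level) keep their original name
df['item'] = np.where(dup, order.map({0: 'sell_', 1: 'buy_'}), '') + df['item']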

Grouping data based on month-year in pandas and then dropping all entries except the latest one - Python

Below is my example dataframe
Date Indicator Value
0 2000-01-30 A 30
1 2000-01-31 A 40
2 2000-03-30 C 50
3 2000-02-27 B 60
4 2000-02-28 B 70
5 2000-03-31 C 90
6 2000-03-28 C 100
7 2001-01-30 A 30
8 2001-01-31 A 40
9 2001-03-30 C 50
10 2001-02-27 B 60
11 2001-02-28 B 70
12 2001-03-31 C 90
13 2001-03-28 C 100
Desired Output
Date Indicator Value
2000-01-31 A 40
2000-02-28 B 70
2000-03-31 C 90
2001-01-31 A 40
2001-02-28 B 70
2001-03-31 C 90
I want to write code that groups the data by month-year and then keeps only the entry with the latest date in that month-year, dropping the rest. The data runs until 2020.
I was only able to fetch the count by month-year. I am not able to write proper code that groups the data by month-year and indicator and returns the correct results.
Use Series.dt.to_period for month periods, aggregate the index of the maximal date per group with DataFrameGroupBy.idxmax, and then pass it to DataFrame.loc:
df['Date'] = pd.to_datetime(df['Date'])
print(df['Date'].dt.to_period('m'))
0 2000-01
1 2000-01
2 2000-03
3 2000-02
4 2000-02
5 2000-03
6 2000-03
7 2001-01
8 2001-01
9 2001-03
10 2001-02
11 2001-02
12 2001-03
13 2001-03
Name: Date, dtype: period[M]
df = df.loc[df.groupby(df['Date'].dt.to_period('m'))['Date'].idxmax()]
print(df)
Date Indicator Value
1 2000-01-31 A 40
4 2000-02-28 B 70
5 2000-03-31 C 90
8 2001-01-31 A 40
11 2001-02-28 B 70
12 2001-03-31 C 90
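A sketch of the same selection without idxmax, starting again from the original frame with Date already parsed: sort chronologically and keep the last row of each month group:
out = (df.sort_values('Date')
         .groupby(df['Date'].dt.to_period('m'))
         .tail(1))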

Pandas dataframe row, divide the row elements with a fixed float value, Dataframe has only one row

QTY 1 QTY 2 QTY 3 QTY 4 QTY 5 QTY 6 QTY 7 QTY 8 QTY 9 QTY 10 QTY 11 QTY 12 QTY 13 QTY 14 QTY 15 QTY 16 QTY 17 QTY 18 QTY 19 QTY 20 REV Price
28 39 41 44 33 33 44 36 41 46 29 34 35 31 28 38 31 36 45 30 250 25
I have only one row in a pandas DataFrame, which has 22 columns.
I want to divide all the columns that start with QTY by mean_val, which is 25.3654.
How can I do this?
You can do this; let's say your dataframe is in the variable df. Here df.columns.str.startswith('QTY') builds a boolean mask over the columns, and .loc uses it to divide just those columns in place:
df.loc[:,df.columns.str.startswith('QTY')] /= mean_val
import pandas as pd
# Sample data frame defined as per your requirement, plus the mean value.
# Note: filter's regex is case-sensitive, so use 'QTY.*' for columns named
# like the ones in the question.
data = {'qty 1': [23], 'qty 2': [26], 'price': [25], 'rate': [250]}
df = pd.DataFrame(data=data, index=None)
mean_value = 25.3654
# Use the dataframe's update() and filter() functions to divide only the qty columns
df.update(df.filter(regex='qty.*') / mean_value)
print(df)
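For completeness, a minimal end-to-end sketch of the first approach, using a two-QTY stand-in for the 22-column frame (mean_val is the fixed divisor from the question):
import pandas as pd

mean_val = 25.3654
df = pd.DataFrame([[28, 39, 250, 25]], columns=['QTY 1', 'QTY 2', 'REV', 'Price'])
# divide only the QTY columns in place; REV and Price stay untouched
df.loc[:, df.columns.str.startswith('QTY')] /= mean_val
print(df)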

How can I delete duplicates across 3 columns using two criteria (the first two columns)?

This is my data set:
Year created Week created SUM_New SUM_Closed SUM_Open
0 2018 1 17 0 82
1 2018 6 62 47 18
2 2018 6 62 47 18
3 2018 6 62 47 18
4 2018 6 62 47 18
The last three columns already contain the sums for the year and week. I need to get rid of duplicates so that the table contains unique rows (for the example above):
Year created Week created SUM_New SUM_Closed SUM_Open
0 2018 1 17 0 82
4 2018 6 62 47 18
I tried to group the data, but it somehow works wrong: it does what I need, but only for one column.
df.groupby(['Year created', 'Week created']).size()
And output:
Year created Week created
2017 48 2
49 25
50 54
51 36
52 1
2018 1 17
2 50
3 37
But that is just one column, and I don't know which one, because even if I split the data into three parts and run the same procedure on each part, I get the same result (as above) for all of them.
I believe you need drop_duplicates:
df = df.drop_duplicates(['Year created', 'Week created'])
print(df)
Year created Week created SUM_New SUM_Closed SUM_Open
0 2018 1 17 0 82
1 2018 6 62 47 18
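An alternative sketch, starting from the original frame: because the sum columns are identical within each (year, week) group, grouping and taking the first row gives the same result:
out = df.groupby(['Year created', 'Week created'], as_index=False).first()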
Or, if rows should only count as duplicates when the sum columns match too, include them in the criteria:
df2 = df.drop_duplicates(['Year created', 'Week created', 'SUM_New', 'SUM_Closed'])
print(df2)
Hope this helps.
