Pandas: Transpose the first 3 rows but duplicate the rest of the data so I get a unique row and a larger table - python-3.x

I'm trying to convert an existing Excel sheet that has 3 layers of column headers. The first layer is the year, but it's a merged cell. The 2nd layer is the months, also merged, and the 3rd layer alternates Rent | Other.
My data, as pandas reads it, looks like this:

2022     Unnamed: 1  Unnamed: 2  Unnamed: 3  ...  2023     Unnamed: 135  Unnamed: 136  ...
January  NaN         February    NaN              January  NaN           February
Rent     Other       Rent        Other            Rent     Other         Rent
100      0           120         30               110      25            100
I added the "...." to the table, this continues for ~130 or so columns per year.
I tried to forward fill the year and months:
2022     2022     2022      2022
January  January  February  February
Rent     Other    Rent      Other
100      0        120       30
I want it to look like this:
Year  Month     Rent  Other
2022  January   100   0
2022  February  120   30

Flying blind here since I don't have access to your Excel file.
import pandas as pd

df = (
    # Your file looks to be row-oriented instead of pandas' usual column-oriented
    # format. We will import it without column names and assign them later.
    pd.read_excel("file.xlsx", header=None)
    # Fill in the blanks since some of the cells are merged
    .ffill(axis=1)
    # Set the row labels, then transpose the dataframe to the usual
    # column-oriented format
    .set_axis(["Year", "Month", "Metric", "Value"])
    .T
)

# Month names are usually a pain in the neck to work with. By default they sort
# alphabetically (April, August, February, ...). It's best to convert them to
# numbers, but if you want to keep the names, use an ordered CategoricalDtype
# to keep them in calendar order.
MonthDType = pd.CategoricalDtype(
    pd.date_range("2022-01-01", "2022-12-01", freq="MS").strftime("%B"), ordered=True
)
df["Month"] = df["Month"].astype(MonthDType)

# The final pivot
df = df.pivot_table(
    index=["Year", "Month"],
    columns="Metric",
    values="Value",
    aggfunc="sum",
    observed=True,
).reset_index()
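If the pivot comes back empty or complains about non-numeric data, the Value row may have been read in as text. This is an assumption since I can't see your file, but coercing it to numbers before the pivot_table call should cover it:

df["Value"] = pd.to_numeric(df["Value"], errors="coerce")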

Related

Pandas groupby Month with annual summary

I have a list of items describing some orders placed like this
items=[('September 2021',1,40),('June 2022',1,77),....]
In order to get a dataframe grouped by date, showing how many orders I received and how much I got paid, I do the following:
tabla2=tabla.sort_values(by=['Date']).groupby(['Date']).agg({'Subscriptions':'count','Total amount (€)':'sum'}).astype('float64').round(2)
What I want is to include a row with the yearly numbers after the months of each year, and a Totals row at the bottom.
For the totals I do the following
df1 = pd.DataFrame(pd.Series({
    'Date': "<b>Totals</b>",
    'Subscriptions': "<b>{}</b>".format(tabla['Subscriptions'].sum().astype('int')),
    'Total amount (€)': "<b>{}</b>".format(tabla['Total amount (€)'].sum().round(2)),
})).T.set_index(['Date'])
tabla2 = tabla2.append(df1)
The <b> is for making it bold later when representing it with plotly.
So I end up having something like this
Date Subscriptions Total amount (€)
September 2021 15 345
.... ... ...
<b>2021</b> 132 1256
June 2022 17 452
... ... ...
<b>2022</b> 144 3215
<b>Totals</b> 1234 4567
What is the most pythonic way of accomplishing this from the tabla2 dataframe?
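A minimal sketch of one way to interleave the yearly rows, assuming tabla is built from the items list above and that the year can be taken from the end of the Date string (the <b> markup is left off the subtotal rows for brevity):

import pandas as pd

# Reconstruction of the asker's data from the items shown above
items = [('September 2021', 1, 40), ('June 2022', 1, 77)]
tabla = pd.DataFrame(items, columns=['Date', 'Subscriptions', 'Total amount (€)'])

tabla2 = (tabla.groupby('Date', as_index=False)
               .agg({'Subscriptions': 'count', 'Total amount (€)': 'sum'}))

# Yearly subtotal rows, keyed by the year pulled off the end of the Date string
tabla2['Year'] = tabla2['Date'].str.split().str[-1]
yearly = (tabla2.groupby('Year', as_index=False)
                .agg({'Subscriptions': 'sum', 'Total amount (€)': 'sum'}))
yearly['Date'] = '<b>' + yearly['Year'] + '</b>'

# A stable sort by year keeps each year's months ahead of that year's subtotal
out = (pd.concat([tabla2, yearly], ignore_index=True)
         .sort_values('Year', kind='stable')
         .drop(columns='Year'))

# Grand total at the bottom
totals = pd.DataFrame([{'Date': '<b>Totals</b>',
                        'Subscriptions': tabla2['Subscriptions'].sum(),
                        'Total amount (€)': tabla2['Total amount (€)'].sum()}])
out = pd.concat([out, totals], ignore_index=True)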

How to join 1.5 million rows of data based on 15 fields

I have a fact table of sales data from the year 2019, which has about 1.5 million rows of data. I need to compare 2019 sales with 2018 sales. The 2018 sales fact table also has about 1.5 million rows of data.
Each fact table has 15 of the same columns which include fields such as date, category, department, location, etc.
Date      Field 1  Field 2..  …field 15  Sales
01.01.18  ABC      XYZ        A12        100
01.02.18  ABCD     XXY        A13        200
01.03.18  ABB      XYY        A14        300
01.04.18  ACC      ZXX        A15        400
Date      Field 1  Field 2..  …field 15  Sales
01.01.19  ABC      XYZ        A12        110
01.02.19  ABCD     XXY        A13        210
01.03.19  ABB      XYY        A14        310
01.04.19  ACC      ZXX        A15        410
I need to have 2018 sales and 2019 sales in two columns that are next to each other.
I have tried this through a left join (matching the minimum number of fields needed for a correct mapping), but my PC ran out of memory. I also tried doing it through Power Pivot, but my PC again ran out of memory while attempting to load the second fact table into the data model.
How can I have 2018 Sales and 2019 Sales, with the correct mapping, in columns next to each other?
Date '18  Date '19  Field 1  Field 2..  …field 15  Sales 2018  Sales 2019
01.01.18  01.01.19  ABC      XYZ        A12        100         110
01.02.18  01.02.19  ABCD     XXY        A13        200         210
01.03.18  01.03.19  ABB      XYY        A14        300         310
01.04.18  01.04.19  ACC      ZXX        A15        400         410
Assuming the csv data is imported to Sheet2 (2018) & Sheet3 (2019) via Data > Get External Data > From Text, put this in A1 of a new sheet:
=OFFSET(INDIRECT(CHOOSE(2-MOD(COLUMN(),2),"Sheet2","Sheet3")&"!A1",TRUE),ROW()-1,INT(COLUMN()/2+0.5)-1)
and drag right+downwards.
Idea: use COLUMN() with MOD() to 'drive' the OFFSET() cell selection; CHOOSE() does the sheet selection.
Please share if it works/not. ( :
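If the data can be pulled into Python instead of Excel, a pandas merge on the descriptive fields is another option. This is only a sketch: the file names and key column names below are placeholders for the asker's 15 fields.

import pandas as pd

key_cols = ['Field 1', 'Field 2', 'Field 15']   # stand-ins for the 15 descriptive fields

df18 = pd.read_csv('sales_2018.csv')            # hypothetical file names
df19 = pd.read_csv('sales_2019.csv')

wide = df18.merge(
    df19,
    on=key_cols,
    how='left',
    suffixes=(' 2018', ' 2019'),   # overlapping columns become Date 2018 / Date 2019 etc.
)

If memory is still tight, read_csv's usecols argument and a 'category' dtype for the repetitive text fields usually shrink both frames considerably before the merge.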

Excel: Calculation with different beginning dates

I have a long time series of monthly returns for the S&P 500. I would like to calculate the 3-month return for a subset of dates.
For example, I would like to calculate the 3-month return for the following
A      B     C
Month  Year  3-Month Return
Jan    1929
Dec    1948
July   1981
I have monthly return data like:
A      B     C
Month  Year  Monthly Return
Jan    1929  0.102
Feb    1929  0.072
Mar    1929  -0.231
....
Dec    2019  0.157
So the first calculation would be something like (1+0.102)(1+0.072)(1-0.231)-1. I can do this manually but have many calculations, unfortunately.
To find the match you need a unique column, so combine A & B into another column so that the month+year combination is unique, then use that combination to look up the values.
Suppose you have the data arranged so that the combined month+year keys are in column U, the monthly returns in column V, your query dates (also combined) in column P, and the match index in column Q:
The formula for the match index is =MATCH(P3,U$3:U$13,0) and that for the 3-month return is =(1+INDEX(V$3:V$13,Q3))*(1+INDEX(V$3:V$13,Q3+1))*(1+INDEX(V$3:V$13,Q3+2))-1
You can put the MATCH inside the formula in column C to avoid the helper column, but you'd have to repeat it 3 times. You can also use a different combination for the combined date, such as an actual mm/yyyy date; it will still work.
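For comparison, a minimal pandas sketch of the same compounding, assuming the monthly data sits in a dataframe with Month, Year and Monthly Return columns as in the table above:

import pandas as pd

returns = pd.DataFrame({
    'Month': ['Jan', 'Feb', 'Mar'],
    'Year': [1929, 1929, 1929],
    'Monthly Return': [0.102, 0.072, -0.231],
})

# Compounded 3-month return starting at each row: (1+r0)(1+r1)(1+r2) - 1
returns['3-Month Return'] = (
    (1 + returns['Monthly Return'])
    .rolling(3)
    .apply(lambda w: w.prod(), raw=True)
    .shift(-2)          # align each window's product with its starting month
    - 1
)

# Look up the starting dates of interest
print(returns.loc[(returns['Month'] == 'Jan') & (returns['Year'] == 1929), '3-Month Return'])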

Splitting a Panda Column based on character count

I have a pandas dataframe which includes the date column below, with more than a thousand rows in the format [YearMonth]:
Date:
_____
201801
201802
201910
How can I split them so the year (2018) goes in one column and the month in another? I tried splitting the string but it's hard to get the character count right.
Appreciate your help
You can use to_datetime, then use the .dt accessor to get the year, month, etc.
s = pd.to_datetime(df.Date, format='%Y%m')
df['Year'] = s.dt.year
df['Month'] = s.dt.month
df
Date Year Month
0 201801 2018 1
1 201802 2018 2
2 201910 2019 10
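Since the question was specifically about splitting on character count, here is an equivalent slicing version; a sketch, assuming Date is stored as an integer or a string of the form YYYYMM:

s = df['Date'].astype(str)
df['Year'] = s.str[:4].astype(int)
df['Month'] = s.str[4:].astype(int)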

List all the corresponding numbers which fall within a month

I have a table which looks something like this
Month      Item     Haul
June       Gravel   23
July       Asphalt  45
June       Asphalt  5
June       Asphalt  7
September  Asphalt  26
October    Gravel   17
June       Asphalt  21
September  Gravel   25
I want to create a function that will list all of the different "Asphalt" hauls which happen within a given month in another sheet so that I can calculate the tonnages of each haul. The result should look something like this
June
5
7
21
Is this even possible?
With a PivotTable (Month for ROWS, Item for COLUMNS and Haul for Σ VALUES) you can set Σ VALUES to calculate the average for you, and if you want the detail for Asphalt in June, just double-click on that row/column intersection. The drill-down will automatically open in another sheet, and the PivotTable will default to a separate sheet unless you choose otherwise.
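The same drill-down in pandas is a one-line boolean filter; a sketch, assuming the table above is loaded into a dataframe df with Month, Item and Haul columns:

june_asphalt = df.loc[(df['Month'] == 'June') & (df['Item'] == 'Asphalt'), 'Haul']
print(june_asphalt.tolist())   # [5, 7, 21]
print(june_asphalt.sum())      # total tonnage for June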
