Pandas merge two dataframes by taking the mean between the columns - python-3.x

I have two pandas DataFrames and want to average them: for rows that share the same time, take the mean of the columns that have the same name in both frames.
For example:
df1
time x y z
0 1 1.25 2.5 0.75
1 2 2.75 2.5 3.00
2 3 1.50 2.5 1.25
3 4 3.00 2.5 3.50
4 5 0.50 2.5 2.25
df2
time x y z
0 2 0.75 2.5 1.75
1 3 3.00 2.5 3.00
2 4 1.25 2.5 0.25
3 5 3.50 2.5 2.00
4 6 2.25 2.5 2.25
and the result I am looking for is
df3
time x y z
0 1 1.25 2.5 0.75
1 2 1.75 2.5 2.375
2 3 2.25 2.5 2.125
3 4 2.125 2.5 1.875
4 5 2.00 2.5 2.125
5 6 2.25 2.5 2.25
Is there a simple way in Pandas that I can do this, using the merge function or similar?
I am looking for a way of doing it without having to specify the name of the columns.

I think you need concat + groupby and aggregate mean:
df = pd.concat([df1, df2]).groupby('time', as_index=False).mean()
print (df)
time x y z
0 1 1.250 2.5 0.750
1 2 1.750 2.5 2.375
2 3 2.250 2.5 2.125
3 4 2.125 2.5 1.875
4 5 2.000 2.5 2.125
5 6 2.250 2.5 2.250
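For reference, here is a self-contained sketch of the concat + groupby approach above. The frames are rebuilt by hand from the tables in the question, and the name df3 is taken from the expected result; nothing else is assumed.

```python
import pandas as pd

# Sample frames copied from the question.
df1 = pd.DataFrame({
    "time": [1, 2, 3, 4, 5],
    "x": [1.25, 2.75, 1.50, 3.00, 0.50],
    "y": [2.5] * 5,
    "z": [0.75, 3.00, 1.25, 3.50, 2.25],
})
df2 = pd.DataFrame({
    "time": [2, 3, 4, 5, 6],
    "x": [0.75, 3.00, 1.25, 3.50, 2.25],
    "y": [2.5] * 5,
    "z": [1.75, 3.00, 0.25, 2.00, 2.25],
})

# Stack the rows, then average every remaining column per time value.
# No column names other than the key need to be listed.
df3 = pd.concat([df1, df2]).groupby("time", as_index=False).mean()
print(df3)
```

Times that appear in only one frame (1 and 6 here) keep their original values, because the mean of a single row is that row.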

Related

Finding the Max of Excel Matrix Data based on Criteria from Matrix

I have data in a matrix, and I also have criteria data in a second matrix of the same shape. See below.
Data from the Matrix
Period  0.0    30     45     60     75     90     105    120    135    150    180
6.0     0.356  0.443  0.469  0.505  0.579  0.525  0.516  0.475  0.342  0.271  0.171
7.0     0.439  0.541  0.558  0.678  0.802  0.642  0.747  0.499  0.436  0.336  0.232
8.0     0.505  0.544  0.591  0.694  0.759  0.747  0.736  0.584  0.560  0.467  0.269
9.0     0.489  0.614  0.618  0.630  0.791  0.687  0.631  0.577  0.507  0.562  0.340
10.0    0.538  0.603  0.572  0.580  0.703  0.643  0.619  0.556  0.489  0.459  0.399
11.0    0.503  0.491  0.513  0.578  0.585  0.630  0.587  0.542  0.439  0.459  0.345
12.0    0.517  0.446  0.539  0.588  0.546  0.564  0.552  0.497  0.411  0.412  0.355
13.0    0.470  0.439  0.545  0.534  0.530  0.482  0.510  0.470  0.422  0.404  0.329
14.0    0.399  0.427  0.469  0.442  0.462  0.434  0.409  0.425  0.382  0.395  0.340
15.0    0.370  0.390  0.388  0.397  0.421  0.393  0.355  0.387  0.355  0.341  0.331
Criteria for the matrix
Period  0.0  30  45  60  75  90  105  120  135  150  180
6.0     3    5   5   6   7   6   6    5    3    2    0
7.0     5    6   7   9   10  8   10   6    5    3    1
8.0     6    6   7   9   10  10  9    7    7    5    2
9.0     6    8   8   8   10  9   8    7    6    7    3
10.0    6    7   7   7   9   8   8    7    6    5    4
11.0    6    6   6   7   7   8   7    6    5    5    3
12.0    6    5   6   7   6   7   7    6    4    4    3
13.0    5    5   6   6   6   5   6    5    4    4    3
14.0    4    5   5   5   5   5   4    5    4    4    3
15.0    4    4   4   4   4   4   3    4    3    3    3
Is there any way to find the maximum for a given criterion, such as 3 or 10, where the maximum is taken from the data matrix at the positions where the criteria matrix holds that number?
For example, for criterion 10 the result should be the maximum of the data values at [7,75], [7,105], [8,75], [8,90] and [9,75].
I am expecting an Excel function or VBA to find the maximum of those values.
Thanks a lot for your help and thoughts.
Excel Function or Excel VBA
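Assume both tables start (with header row and column) in cell A1 of two sheets named Criteria and Data:
=SUMPRODUCT(MAX( (Criteria!B2:L11=10) * (Data!B2:L11) ) )
The product is 0 wherever the criterion does not match, so this assumes the data values are non-negative.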
Max in Matrix Using Criteria Matrix
If you have Microsoft 365 and if the criteria are in the range N2:N12, in cell O2 of sheet Criteria you could use:
=MAX(TOCOL(($B$2:$L$11=N2)*Data!$B$2:$L$11))
or (more of the same i.e. getting familiar with the LET function)
=LET(tCriteria,$B$2:$L$11,tData,Data!$B$2:$L$11,Criteria,N2,
MAX(TOCOL((tCriteria=Criteria)*tData)))
used in cell P2, then copied down.
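For readers who would rather check the result outside Excel, the same masked maximum can be sketched in Python with NumPy. This is only an illustration, not part of the Excel answers: the arrays are typed in from the two tables in the question, and the helper name max_for_criterion is hypothetical.

```python
import numpy as np

# Data matrix from the question (rows: Period 6.0..15.0, columns: 0.0..180).
data = np.array([
    [0.356, 0.443, 0.469, 0.505, 0.579, 0.525, 0.516, 0.475, 0.342, 0.271, 0.171],
    [0.439, 0.541, 0.558, 0.678, 0.802, 0.642, 0.747, 0.499, 0.436, 0.336, 0.232],
    [0.505, 0.544, 0.591, 0.694, 0.759, 0.747, 0.736, 0.584, 0.560, 0.467, 0.269],
    [0.489, 0.614, 0.618, 0.630, 0.791, 0.687, 0.631, 0.577, 0.507, 0.562, 0.340],
    [0.538, 0.603, 0.572, 0.580, 0.703, 0.643, 0.619, 0.556, 0.489, 0.459, 0.399],
    [0.503, 0.491, 0.513, 0.578, 0.585, 0.630, 0.587, 0.542, 0.439, 0.459, 0.345],
    [0.517, 0.446, 0.539, 0.588, 0.546, 0.564, 0.552, 0.497, 0.411, 0.412, 0.355],
    [0.470, 0.439, 0.545, 0.534, 0.530, 0.482, 0.510, 0.470, 0.422, 0.404, 0.329],
    [0.399, 0.427, 0.469, 0.442, 0.462, 0.434, 0.409, 0.425, 0.382, 0.395, 0.340],
    [0.370, 0.390, 0.388, 0.397, 0.421, 0.393, 0.355, 0.387, 0.355, 0.341, 0.331],
])

# Criteria matrix from the question, same shape as the data matrix.
criteria = np.array([
    [3, 5, 5, 6, 7, 6, 6, 5, 3, 2, 0],
    [5, 6, 7, 9, 10, 8, 10, 6, 5, 3, 1],
    [6, 6, 7, 9, 10, 10, 9, 7, 7, 5, 2],
    [6, 8, 8, 8, 10, 9, 8, 7, 6, 7, 3],
    [6, 7, 7, 7, 9, 8, 8, 7, 6, 5, 4],
    [6, 6, 6, 7, 7, 8, 7, 6, 5, 5, 3],
    [6, 5, 6, 7, 6, 7, 7, 6, 4, 4, 3],
    [5, 5, 6, 6, 6, 5, 6, 5, 4, 4, 3],
    [4, 5, 5, 5, 5, 5, 4, 5, 4, 4, 3],
    [4, 4, 4, 4, 4, 4, 3, 4, 3, 3, 3],
])


def max_for_criterion(data, criteria, value):
    """Return the largest data value where criteria == value, or None if no cell matches."""
    mask = criteria == value
    if not mask.any():
        return None
    return float(data[mask].max())


print(max_for_criterion(data, criteria, 10))  # 0.802, found at [7, 75]
```

Unlike the multiply-by-mask formulas, boolean indexing selects only the matching cells, so it also works when the data contain negative values.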

Quartiles calculations and classifications filtering by product

I am having a hard time to get this done:
What I have: pandas dataframe:
product seller price
A Yo 10
A Ka 5
A Poy 7.5
A Nyu 2.5
A Poh 1.25
B Poh 11.25
What I want:
Given a df like the one above (product, seller, price), I want to calculate the 4 quartiles of the price column for each particular product and classify the price of each seller of that product into these quartiles.
When all prices are the same, the 4 quartiles have the same value and the price is classified as 1st quartile.
Expected Output:
product seller price Quartile 1Q 2Q 3Q 4Q
A Yo 10 4 2.5 5 7.5 10
A Ka 5 2 2.5 5 7.5 10
A Poy 7.5 3 2.5 5 7.5 10
A Nyu 2.5 1 2.5 5 7.5 10
A Poh 1.25 1 2.5 5 7.5 10
B Poh 11.25 1 11.25 11.25 11.25 11.25
What I did so far:
If I use df['Price'].quantile([0.25,0.5,0.75,1]), it calculates the 4 quartiles of all prices without filtering by product, so it is wrong.
I am lost because I don't know how to do this in Python.
Can anyone shed some light here?
Thanks
#Hamza, look at the output below. There's something still not working properly.
Try:
dfQuantile = df.groupby("product")['Price'].quantile([0.25,0.5,0.75,1]).unstack().reset_index().rename(columns={0.25:"1Q",0.5:"2Q",0.75:"3Q",1:"4Q"})
out = pd.merge(df,dfQuantile,on="product",how="left")
out["Quantile"] = df.groupby(['product'])['Price'].transform(
lambda x: pd.qcut(x, 4, labels=False, duplicates="drop")).fillna(0).add(1)
print(out)
product seller Price Quantile 1Q 2Q 3Q 4Q
0 A Yo 10.00 4 2.50 5.00 7.50 10.00
1 A Ka 5.00 2 2.50 5.00 7.50 10.00
2 A Poy 7.50 3 2.50 5.00 7.50 10.00
3 A Nyu 2.50 1 2.50 5.00 7.50 10.00
4 A Poh 1.25 1 2.50 5.00 7.50 10.00
5 B Poh 11.25 1 11.25 11.25 11.25 11.25
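As a cross-check, here is a self-contained sketch that swaps pd.qcut for a plain threshold comparison: a price's quartile is taken as 1 plus the number of thresholds among 1Q, 2Q and 3Q that lie strictly below it. This is an alternative to the answer's labelling step, not the same technique. The column name Price (capitalised) follows the answer's code, and Quartile follows the question's expected output.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "product": ["A", "A", "A", "A", "A", "B"],
    "seller": ["Yo", "Ka", "Poy", "Nyu", "Poh", "Poh"],
    "Price": [10, 5, 7.5, 2.5, 1.25, 11.25],
})

# Per-product quartile thresholds, one row per product.
quartiles = (
    df.groupby("product")["Price"]
    .quantile([0.25, 0.5, 0.75, 1.0])
    .unstack()
    .rename(columns={0.25: "1Q", 0.5: "2Q", 0.75: "3Q", 1.0: "4Q"})
    .reset_index()
)

# Attach the thresholds to every seller row of the same product.
out = df.merge(quartiles, on="product", how="left")

# Count how many of the lower three thresholds are strictly below the price.
below = out[["1Q", "2Q", "3Q"]].lt(out["Price"], axis=0).sum(axis=1)
out["Quartile"] = 1 + below
print(out)
```

With this rule, a product whose prices are all equal (product B here) lands in quartile 1 without special handling, since no threshold is strictly below the price.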

Mapping percentage change into string description in Pandas or Numpy

I have a dataframe as follows:
date price pct
0 2020/6/1 6.000 NaN
1 2020/6/2 3.000 -0.500000
2 2020/6/3 4.000 0.333333
3 2020/6/4 -1.000 -1.250000
4 2020/6/5 -1.025 0.025000
5 2020/6/6 3.000 -3.926829
6 2020/6/7 3.000 0.000000
7 2020/6/8 15.000 4.000000
8 2020/6/9 2.000 -0.866667
9 2020/6/10 2.500 0.250000
Now I would like to create a new column pct_desc to map the values from pct to string description of pct based on the following conditions:
(-float(inf), -1] ---> severe decrease
(-1, -0.5] ---> decrease
(-0.5, 0.5] ---> stable
(0.5, 1] ---> increase
(1, float(inf)] ---> severe increase
The expected output will like this:
date price pct pct_desc
0 2020/6/1 6.000 NaN NaN
1 2020/6/2 3.000 -0.500000 decrease
2 2020/6/3 4.000 0.333333 stable
3 2020/6/4 -1.000 -1.250000 severe decrease
4 2020/6/5 -1.025 0.025000 stable
5 2020/6/6 3.000 -3.926829 severe decrease
6 2020/6/7 3.000 0.000000 stable
7 2020/6/8 15.000 4.000000 severe increase
8 2020/6/9 2.000 -0.866667 decrease
9 2020/6/10 2.500 0.250000 stable
How could I do that in Pandas or Numpy? Thanks.
You can use pd.cut (the labels are abbreviated here):
pd.cut(df.pct,[-np.inf,-1,-0.5,0.5,1,np.inf],labels=['se d','de','st','in','se i'])
0 NaN
1 de
2 st
3 se d
4 st
5 se d
6 st
7 se i
8 de
9 st
Name: pct, dtype: category
Categories (5, object): [se d < de < st < in < se i]
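The same call with the full descriptions from the question can be sketched as follows. The frame is rebuilt from the pct column in the question; pd.cut uses right-closed intervals by default, which matches the (a, b] conditions listed there.

```python
import numpy as np
import pandas as pd

# pct values copied from the question; the first one is missing.
df = pd.DataFrame({
    "pct": [np.nan, -0.5, 0.333333, -1.25, 0.025,
            -3.926829, 0.0, 4.0, -0.866667, 0.25],
})

bins = [-np.inf, -1, -0.5, 0.5, 1, np.inf]
labels = ["severe decrease", "decrease", "stable", "increase", "severe increase"]

# Each value is assigned the label of the (left, right] interval it falls in.
# Missing values stay missing.
df["pct_desc"] = pd.cut(df["pct"], bins=bins, labels=labels)
print(df)
```

The result is a categorical column; call .astype(str) or .astype(object) on it if plain strings are needed downstream.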

Fill the missing values in the data set

I have a dataset as below.
building_id meter meter_reading primary_use square_feet air_temperature dew_temperature sea_level_pressure wind_direction wind_speed hour day weekend month
0 0 0 NaN 0 7432 25.0 20.0 1019.7 0.0 0.0 0 1 4 1
1 1 0 NaN 0 2720 25.0 20.0 1019.7 0.0 0.0 0 1 4 1
2 2 0 NaN 0 5376 25.0 20.0 1019.7 0.0 0.0 0 1 4 1
3 3 0 NaN 0 23685 25.0 20.0 1019.7 0.0 0.0 0 1 4 1
4 4 0 NaN 0 116607 25.0 20.0 1019.7 0.0 0.0 0 1 4 1
You can see that the values under meter_reading are NaN, and I would like to fill them with the mean of that column grouped by the "primary_use" and "square_feet" columns. Which API could I use to achieve this? I am currently using scikit-learn's imputer.
Thanks and your help is highly appreciated.
If you use a pandas DataFrame, it already provides everything you need.
Note that primary_use is a categorical feature while square_feet is continuous. So you would first split square_feet into categories, and then calculate the mean meter_reading per group.
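A minimal sketch of that idea follows. The rows are made up for illustration (the question's sample has no non-missing meter_reading values to average), and the bin edges and the helper column sqft_bin are assumptions, not values from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: same column names as the question, invented values.
df = pd.DataFrame({
    "primary_use": [0, 0, 0, 0, 1, 1],
    "square_feet": [1000, 1200, 50000, 52000, 1100, 1300],
    "meter_reading": [10.0, np.nan, 100.0, np.nan, 20.0, np.nan],
})

# Turn the continuous square_feet into integer bin codes (edges are an assumption).
df["sqft_bin"] = pd.cut(
    df["square_feet"], bins=[0, 10_000, 100_000, np.inf], labels=False
)

# Mean meter_reading per (primary_use, sqft_bin) group, aligned to the original rows.
group_mean = df.groupby(["primary_use", "sqft_bin"])["meter_reading"].transform("mean")

# Fill only the missing readings with their group's mean.
df["meter_reading"] = df["meter_reading"].fillna(group_mean)
print(df)
```

If a whole group has no observed readings, its mean is NaN and those rows stay missing, so a fallback such as the overall column mean may still be needed.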

Python 3.x: Pandas DataFrame How do we combine multiple csv files into one csv file?

I have multiple datasets that have the same number of rows and columns. The first column is 0.1, 2, 3, 4, 5, 6, 7, 8.
For instance,
Data1
0.1 3
2 3
3 0.1
4 10
5 5
6 7
7 9
8 2
Data2
0.1 2
2 1
3 0.1
4 0.5
5 4
6 0.3
7 9
8 2
I want to combine the datasets by keeping the shared first column once and placing the second column of each file side by side:
0.1 3 2
2 3 1
3 0.1 0.1
4 10 0.5
5 5 4
6 7 0.3
7 9 9
8 2 2
I prefer to use Pandas Dataframe. Any clever way to go about this?
Assuming the first column is the index and the second is data:
df = Data1.join(Data2, lsuffix='_1', rsuffix='_2')
Or use merge, after setting the column names to 'A' and 'B':
pd.merge(df1, df2, on='A',suffixes=('_data1','_data2'))
A B_data1 B_data2
0 0.1 3.0 2.0
1 2.0 3.0 1.0
2 3.0 0.1 0.1
3 4.0 10.0 0.5
4 5.0 5.0 4.0
5 6.0 7.0 0.3
6 7.0 9.0 9.0
7 8.0 2.0 2.0
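A self-contained version of the merge variant might look like this. The frames are built by hand from Data1 and Data2 in the question; in practice each would come from pd.read_csv, whose arguments depend on the files and are not shown here. The column names 'A' and 'B' follow the answer.

```python
import pandas as pd

# Data1 and Data2 from the question, with the columns named 'A' (key) and 'B' (value).
df1 = pd.DataFrame({
    "A": [0.1, 2, 3, 4, 5, 6, 7, 8],
    "B": [3, 3, 0.1, 10, 5, 7, 9, 2],
})
df2 = pd.DataFrame({
    "A": [0.1, 2, 3, 4, 5, 6, 7, 8],
    "B": [2, 1, 0.1, 0.5, 4, 0.3, 9, 2],
})

# Join on the shared key; the suffixes tell the two 'B' columns apart.
combined = pd.merge(df1, df2, on="A", suffixes=("_data1", "_data2"))
print(combined)
```

For more than two files, the same merge can be repeated in a loop, or each frame can be indexed by 'A' and the frames combined with pd.concat(..., axis=1).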

Resources