I have a dataframe with 217 rows:
df
          A              C
         fm     ae      fm     ae
0     0.491  0.368   0.789  0.789
1     0.369  0.333   0.433  0.412
2     0.372  0.276   0.772  0.759
3     0.346  0.300   0.474  0.391
4     0.213  0.161   0.323  0.312
..      ...    ...     ...    ...
212   0.000  0.000   1.000  1.000
213   1.000  1.000   1.000  1.000
214   1.000  1.000   1.000  1.000
215   1.000  1.000   1.000  1.000
216   1.000  1.000   1.000  1.000

[217 rows x 4 columns]
I need to report the mean results, and I get the following results:
df.mean()
A  fm    0.548
   ae    0.508
C  fm    0.671
   ae    0.650
dtype: float64
But I would like the output to look like this:
      fm     ae
A  0.548  0.508
C  0.671  0.650
I tried df.mean().groupby(level=0), but that did not give me a table like this.
What is the cleanest way to organize the aggregate results this way?
One way is to unstack it:
out = df.mean().unstack()
or you could stack it, then use groupby + mean:
out = df.stack(level=0).groupby(level=1).mean()
Output:
      ae     fm
A  0.508  0.548
C  0.650  0.671
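For reference, a minimal self-contained sketch (the data here is random, so only the shape of the result matches the question):

import numpy as np
import pandas as pd

# stand-in for the real 217-row frame: same two-level column structure
cols = pd.MultiIndex.from_product([['A', 'C'], ['fm', 'ae']])
df = pd.DataFrame(np.random.rand(217, 4), columns=cols)

out = df.mean().unstack()   # rows A/C, columns ae/fm
print(out)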
I am using pandas for this work.
I have 2 datasets. The first dataset has approximately 6 million rows and 6 columns; for example, it looks something like this:
Date        Time  U  V  W  T
2020-12-30  2:34  3  4  5  7
2020-12-30  2:35  2  3  6  5
2020-12-30  2:36  1  5  8  5
2020-12-30  2:37  2  3  0  8
2020-12-30  2:38  4  4  5  7
2020-12-30  2:39  5  6  5  9
this is just the raw data collected from the machine.
The second dataset holds the average of every three rows of each column (U, V, W, T):
U     V     W     T
2     4     6.33  5.67
3.66  4.33  3.33  8
What I am trying to do is calculate the perturbation for each column per second:
U(perturbation) = U(raw) - U(avg)
where U(raw) is dataset 1 and U(avg) is dataset 2.
Basically: take the first three rows of the first column of the first dataset and subtract the first value of the first column of the second dataset from each of them, then take the next three values of the first column of the first dataset and subtract the second value of the first column of the second dataset from each of them. Do the same for the remaining columns.
The desired final output should look like the following:
Date        Time   U      V      W      T
2020-12-30  2:34   1      0     -1.33   1.33
2020-12-30  2:35   0     -1     -0.33  -0.67
2020-12-30  2:36  -1      1      1.67  -0.67
2020-12-30  2:37  -1.66  -1.33  -3.33   0
2020-12-30  2:38   0.34  -0.33   1.67  -1
2020-12-30  2:39   1.34   1.67   1.67   1
I am new to pandas and do not know how to approach this.
I hope it makes sense.
# label every block of three rows with the same group id, then merge with df2
a = df1.assign(index=df1.index // 3).merge(df2.reset_index(), on='index')
# raw columns get the suffix _x, averaged columns the suffix _y; subtract them
b = a.filter(regex='_x', axis=1) - a.filter(regex='_y', axis=1).to_numpy()
# stitch the unsuffixed columns (Date, Time, index) back onto the differences
pd.concat([a.filter(regex='^[^_]+$', axis=1), b], axis=1)
         Date  Time  index   U_x   V_x   W_x   T_x
0  2020-12-30  2:34      0  1.00  0.00 -1.33  1.33
1  2020-12-30  2:35      0  0.00 -1.00 -0.33 -0.67
2  2020-12-30  2:36      0 -1.00  1.00  1.67 -0.67
3  2020-12-30  2:37      1 -1.66 -1.33 -3.33  0.00
4  2020-12-30  2:38      1  0.34 -0.33  1.67 -1.00
5  2020-12-30  2:39      1  1.34  1.67  1.67  1.00
You can use numpy:
import numpy as np
df1[df2.columns] -= np.repeat(df2.to_numpy(), 3, axis=0)
NB: this modifies df1 in place; if you want to keep the original, make a copy first (df_final = df1.copy()) and apply the subtraction to that copy.
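If the second dataset only exists to hold the three-row averages, a groupby-based sketch (assuming df1 keeps its default 0..n-1 integer index and the value columns are named U, V, W, T) computes the perturbation directly from df1:

cols = ['U', 'V', 'W', 'T']
# mean of each consecutive block of three rows, broadcast back to every row
block_means = df1.groupby(df1.index // 3)[cols].transform('mean')
df1[cols] = df1[cols] - block_means

This avoids building the second dataset by hand; like the numpy version, it modifies df1 in place unless you work on a copy.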
I have two dataframes, df1 and df2 respectively:
A1 A2
0 0.001 0.002
1 100 200
2 0.3 0.4
B1 B2
86 0.0002 0.003
12 0.2 0.3
123 -0.001 0.000
Due to their generation mechanism, their indexes do not match, and one of the dataframes is not even indexed in order. I need to combine these two dataframes regardless of index; the expected result is as follows.
   A1     A2     B1      B2
0  0.001  0.002  0.0002  0.003
1  100    200    0.2     0.3
2  0.3    0.4    -0.001  0.000
In other words, I just need to combine them based on row order. I tried
pd.concat([df1, df2], index=1, ignore_index=True) and pd.DataFrame([df1, df2]). Neither works.
Use pd.concat:
>>> pd.concat([df1.reset_index(drop=True),
...            df2.reset_index(drop=True)], axis=1)
A1 A2 B1 B2
0 0.001 0.002 0.0002 0.003
1 100.000 200.000 0.2000 0.300
2 0.300 0.400 -0.0010 0.000
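If you prefer to keep df1's original index rather than resetting both, a small variant (assuming the two frames have the same number of rows) is to relabel df2's rows with df1's index before concatenating:

out = pd.concat([df1, df2.set_axis(df1.index)], axis=1)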
1) A numpy array r which consists of strings:
import numpy as np
r = np.array([['S', 'S'],['S', 'V1'],['S', 'V2'],['V1', 'S'],['V1', 'V1']])
2) A numpy array acc containing values. The first value corresponds to the first element of each row of the two-dimensional array r, and the second value corresponds to the second element:
acc = np.array([0.613,0.387])
3) Question: I want to fill dataframe df1. For example: Row 1) r[0] = ['S', 'S'] contains 'S' twice, so fill S = 0.613 + 0.387 = 1.0 in df1, and V1 and V2 are zero because they do not appear in the row. Row 2) r[1] = ['S', 'V1'] contains one 'S', so fill S = 0.613 and V1 = 0.387, and V2 = 0 (does not appear) ... and so on.
Desired output:
import pandas as pd
df1 = pd.DataFrame({'S':[1,0.613,0.613,0.387,0], 'V1': [0,0.387,0,0.613,1],'V2': [0,0,0.387,0,0]})
print(df1)
S V1 V2
0 1.000 0.000 0.000
1 0.613 0.387 0.000
2 0.613 0.000 0.387
3 0.387 0.613 0.000
4 0.000 1.000 0.000
You can stack the dataframe, map the values and pivot back:
s = pd.DataFrame(r).stack().reset_index(name='val')
s['level_1'] = acc[s['level_1']]
s.pivot_table(index='level_0',
              columns='val',
              values='level_1',
              aggfunc='sum',
              fill_value=0)
Output:
val S V1 V2
level_0
0 1.000 0.000 0.000
1 0.613 0.387 0.000
2 0.613 0.000 0.387
3 0.387 0.613 0.000
4 0.000 1.000 0.000
Another way, using pd.get_dummies(), np.vectorize and df.groupby() on axis=1:
df = pd.get_dummies(pd.DataFrame(r), prefix='', prefix_sep='')
s = pd.Series(acc, index=range(1, len(acc) + 1))
final = (pd.DataFrame(np.vectorize(s.get)(np.where(df.eq(1), df.cumsum(axis=1), df)),
                      columns=df.columns)
         .groupby(df.columns, axis=1).sum())
S V1 V2
0 1.000 0.000 0.000
1 0.613 0.387 0.000
2 0.613 0.000 0.387
3 0.387 0.613 0.000
4 0.000 1.000 0.000
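If the vectorized versions are hard to follow, a plain-loop sketch gives the same result and makes the accumulation explicit (r and acc as defined above; the label set ['S', 'V1', 'V2'] is assumed to be known up front):

import pandas as pd

labels = ['S', 'V1', 'V2']
out = pd.DataFrame(0.0, index=range(len(r)), columns=labels)
for i, row in enumerate(r):
    for label, weight in zip(row, acc):
        out.loc[i, label] += weight   # e.g. ['S', 'S'] accumulates 0.613 + 0.387 = 1.0
print(out)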
I want to import the data for a three-dimensional parameter p(i,j,k) that is stored in k Excel sheets, but GAMS does not let me use dollar control statements inside loops. Is there any way to do this using loops or other flow-control statements like 'for' or 'while'?
I need to do something like this, but it is seemingly impossible:
loop(k,
$call gdxxrw Data.xlsx par=temp rng=k!A1:Z20 rdim=1 cdim=1
$gdxin Data.gdx
$load temp
$gdxin
p(i,j,k)=temp(i,j);
);
Suppose each sheet contains a table with rows i1..i3 and columns j1..j5; the only difference between the sheets is that sheet1 holds 1's, sheet2 holds 2's, and sheet3 holds 3's.
To read this do:
$set xls d:\tmp\test2.xlsx
$set gdx s.gdx
set
i /i1*i3/
j /j1*j5/
k 'sheet names' /Sheet1*Sheet3/
;
parameter
s(i,j) 'single sheet'
a(i,j,k) 'all data'
;
file f /task.txt/;
loop(k,
putclose f,'par=s rng=',k.tl:0,'!a1 rdim=1 cdim=1'/
execute 'gdxxrw i=%xls% o=%gdx% #task.txt trace=2';
execute_loaddc '%gdx%',s;
a(i,j,k) = s(i,j);
);
display a;
My results are:
---- 23 PARAMETER a all data
sheet1 sheet2 sheet3
i1.j1 1.000 2.000 3.000
i1.j2 1.000 2.000 3.000
i1.j3 1.000 2.000 3.000
i1.j4 1.000 2.000 3.000
i1.j5 1.000 2.000 3.000
i2.j1 1.000 2.000 3.000
i2.j2 1.000 2.000 3.000
i2.j3 1.000 2.000 3.000
i2.j4 1.000 2.000 3.000
i2.j5 1.000 2.000 3.000
i3.j1 1.000 2.000 3.000
i3.j2 1.000 2.000 3.000
i3.j3 1.000 2.000 3.000
i3.j4 1.000 2.000 3.000
i3.j5 1.000 2.000 3.000
In addition to my previous question: how can I extract column-formatted values from a large text file, and how can I extract the values of one specific column? The text file looks like the one shown below.
I want to extract the RESOL and FSC values given in the text so I can plot them in Numbers or Excel.
To be more general, what should be done if I later want to extract the Part_FSC and CC values as well?
Thanks
Text File:
Opening MRC/CCP4 file for WRITE...
File : map2.mrc
NX, NY, NZ: 600 600 600
MODE : real
SYMMETRY REDUNDANCY: 1
Fraction_mask, Fraction_particle = 0.1193 0.0111
C sqrt sqrt
C NO. RESOL RING RAD FSPR FSC Part_FSC Part_SSNR Rec_SSNR CC
C 2 952.50 0.0017 0.09 1.000 0.000 400.0145 479.40 0.1222
C 3 476.25 0.0033 0.19 1.000 0.000 159.3959 159.98 0.1586
C 4 317.50 0.0050 0.92 0.999 0.000 48.2248 43.27 0.0155
C 5 238.12 0.0067 0.42 1.000 0.000 88.3074 76.69 0.2637
C 6 190.50 0.0083 0.48 0.999 0.000 64.0162 56.25 0.4148
C 7 158.75 0.0100 1.41 0.992 0.000 17.1695 15.64 0.1282
C 8 136.07 0.0117 5.56 0.954 0.000 6.8244 6.47 0.0171
C 9 119.06 0.0133 1.49 0.993 0.000 16.1918 16.42 0.2729
C 10 105.83 0.0150 1.68 0.990 0.000 12.8313 13.83 0.3729
C 11 95.25 0.0167 3.55 0.969 0.000 6.8012 7.95 0.2624
C 12 86.59 0.0183 16.00 0.830 0.000 2.5273 3.13 0.0826
Untested, but should work:
awk '$1 == "C"{printf "%s\t%s\n", $3,$6}' <filename>
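If the extraction needs to be more general later (for example Part_FSC and CC as well), a pandas sketch along the following lines may be easier to extend; the file name is hypothetical and the column names are taken from the header shown above:

import pandas as pd

cols = ['C', 'NO', 'RESOL', 'RING_RAD', 'FSPR', 'FSC',
        'Part_FSC', 'Part_SSNR', 'Rec_SSNR', 'CC']

rows = []
with open('resolution_table.txt') as fh:   # hypothetical file name
    for line in fh:
        parts = line.split()
        # keep only the data rows: 'C', a shell number, then eight numeric fields
        if len(parts) == len(cols) and parts[0] == 'C' and parts[1].isdigit():
            rows.append(parts)

df = pd.DataFrame(rows, columns=cols).astype({c: float for c in cols[1:]})
df[['RESOL', 'FSC']].to_csv('resol_fsc.csv', index=False)   # open in Numbers or Excel
# any other columns work the same way, e.g. df[['Part_FSC', 'CC']]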