My Excel sheet looks like this:
Name C.p Value
a 1 1.75
b 1 2.35
c 1 1.32
d 1 2.45
a 2 2.7
b 2 1.85
c 2 1.9
d 2 2.6
a 3 3.2
b 3 4.5
c 3 9.2
d 3 5.01
The real sheets look like this, with 4 to 5 names, 50 to 60 check points, and a value at each check point.
I want the Excel sheet to look like this:
C.p a b c d
1 1.75 2.35 1.32 2.45
2 2.7 1.85 1.9 2.6
3 3.2 4.5 9.2 5.01
Here C.p is the check point. It is not always 1, 2, 3, ...; its values change from sheet to sheet.
Could someone help with the code?
Thank you.
If that is the only thing you want to do, you can do it quickly with a pivot table in Excel itself. You will get some extra columns, such as Grand Total, which you can remove. Removing those unwanted columns in code afterwards takes very little effort.
See the picture below.
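If the data is loaded into pandas instead of being pivoted in Excel, the same reshaping could be sketched roughly as follows. This assumes the columns are named Name, C.p and Value as in the question, and that each (C.p, Name) pair occurs only once; the sample frame stands in for whatever you would read from the sheet.

```python
import pandas as pd

# Sample data copied from the question (long format: one row per name and check point).
df = pd.DataFrame({
    "Name": list("abcd") * 3,
    "C.p": [1] * 4 + [2] * 4 + [3] * 4,
    "Value": [1.75, 2.35, 1.32, 2.45, 2.7, 1.85, 1.9, 2.6, 3.2, 4.5, 9.2, 5.01],
})

# One row per check point, one column per name.
# pivot() raises an error if a (C.p, Name) pair is duplicated.
wide = df.pivot(index="C.p", columns="Name", values="Value")
print(wide)
```

Because the check point values are taken from the data, this does not depend on them being 1, 2, 3.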
I have this df:
d = pd.DataFrame({'Name':['Andres','Lars','Paul','Mike'],
                  'target':['A','A','B','C'],
                  'number':[10,12.3,11,6]})
I want to classify each number into a quartile. I am doing this:
(d.groupby(['Name','target','number'])['number']
.quantile([0.25,0.5,0.75,1]).unstack()
.reset_index()
.rename(columns={0.25:"1Q",0.5:"2Q",0.75:"3Q",1:"4Q"})
)
But as you can see, the 4 quartiles are all equal, because the code above calculates them per group, and each group contains only one number.
If I run this instead:
d['number'].quantile([0.25,0.5,0.75,1])
Then I have the 4 quartiles I am looking for:
0.25 9.000
0.50 10.500
0.75 11.325
1.00 12.300
What I need as output (showing only the first 2 rows):
Name target number 1Q 2Q 3Q 4Q Rank
0 Andres A 10.0 9.0 10.5 11.325 12.30 1
1 Lars A 12.3 9.0 10.5 11.325 12.30 4
As you can see, every row has the quartile values computed over all values in the number column. Besides that, there is now a column named Rank that classifies the number according to its quartile, e.g. in the first row 10 is within the 1st quartile.
Here's one way that builds on the quantiles you've created by turning them into a DataFrame and joining it to d. It also assigns a "Rank" column using the rank method:
out = (d.join(d['number'].quantile([0.25,0.5,0.75,1])
               .set_axis([f'{i}Q' for i in range(1,5)], axis=0)
               .to_frame().T
               .pipe(lambda x: x.loc[x.index.repeat(len(d))])
               .reset_index(drop=True))
        .assign(Rank=d['number'].rank(method='dense')))
Output:
Name target number 1Q 2Q 3Q 4Q Rank
0 Andres A 10.0 9.0 10.5 11.325 12.3 2.0
1 Lars A 12.3 9.0 10.5 11.325 12.3 4.0
2 Paul B 11.0 9.0 10.5 11.325 12.3 3.0
3 Mike C 6.0 9.0 10.5 11.325 12.3 1.0
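Note that a dense rank only coincides with the quartile number here because there are exactly four distinct values. If the goal is the quartile bin itself, a possible variant is to let pd.qcut assign it. This is a sketch under the assumption that the bins should be the four equal-frequency intervals of the whole number column; the integer labels 1 to 4 are my own choice.

```python
import pandas as pd

d = pd.DataFrame({'Name': ['Andres', 'Lars', 'Paul', 'Mike'],
                  'target': ['A', 'A', 'B', 'C'],
                  'number': [10, 12.3, 11, 6]})

# Quartile boundaries over the whole column, broadcast to every row.
q = d['number'].quantile([0.25, 0.5, 0.75, 1])
out = d.assign(**{f'{i}Q': v for i, v in enumerate(q, start=1)})

# qcut places each number in one of 4 equal-frequency bins (labelled 1..4 here).
out['Rank'] = pd.qcut(d['number'], q=4, labels=[1, 2, 3, 4]).astype(int)
print(out)
```

With more than four rows, several numbers can share the same quartile label, which a dense rank would not give you.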
I have a dataset similar to this, but really extensive:
Row  Levels  Level 1  Size  Department
1    1       AA       2.0   Dept 1
2    2       AA       0.8   Dept 1
3    3       AA       1.5   Dept 1
4    2       BB       3.0   Dept 1
5    3       BB       2.0   Dept 1
6    3       BB       2.5   Dept 2
7    2       CC       5.0   Dept 2
8    3       CC       1.5   Dept 2
9    3       DD       0.5   Dept 2
10   3       DD       3.0   Dept 2
11   2       EE       4.0   Dept 2
12   3       EE       2.0   Dept 2
What I need is a total size per Department, but summing only the first match per Level 1 within each department, i.e.:
Department 1 would be 2.0 (row 1) + 3.0 (row 4) = 5.0
Department 2 would be 2.5 (row 6) + 5.0 (row 7) + 0.5 (row 9) + 4.0 (row 11) = 12.0
Does anyone have any idea how to accomplish this in Excel?
Alternate solution to the same formula:
=SUM(XLOOKUP(UNIQUE(FILTER(C:C,(ROW(C:C)>1)*(E:E=@$F$2#))&@$F$2#),C:C&E:E,D:D))
Where F2 holds =UNIQUE(FILTER(E:E,(ROW(E:E)>1)*(E:E<>"")))
If you have Excel 365, you could try something like this:
=LET(FilteredLevel,FILTER(C$2:C$13,E$2:E$13=H2),
SUM(XLOOKUP(UNIQUE(FilteredLevel),FilteredLevel,FILTER(D$2:D$13,E$2:E$13=H2))))
Note
You can also use full-column references if you wish
=LET(FilteredLevel,FILTER(C:C,E:E=H2),
SUM(XLOOKUP(UNIQUE(FilteredLevel),FilteredLevel,FILTER(D:D,E:E=H2))))
SUMIFS() on its own will not do what you want. Use SUMPRODUCT() with some Boolean logic:
=SUMPRODUCT($C$2:$C$13*($D$2:$D$13=F2)*(COUNTIFS(OFFSET($B$2,0,0,ROW($B$2:$B$13)-1),$B$2:$B$13,OFFSET($D$2,0,0,ROW($B$2:$B$13)-1),F2)=1))
One note: the use of OFFSET() makes this a volatile formula, meaning that it recalculates with every change made in the workbook. If there are too many of these formulas, Excel will become less responsive.
To do it without the volatility we need a helper column. In E2 put:
=COUNTIFS($D$2:D2,D2,$B$2:B2,B2)=1
And copy down. Then we can use SUMIFS():
=SUMIFS(C:C,D:D,F2,E:E,TRUE)
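For comparison, the helper-column idea (flag the first occurrence of each Level 1 within a Department, then sum only the flagged rows) could be sketched in pandas like this. It assumes the sheet has been loaded into a DataFrame with the column names from the question; the frame below is just the sample data typed in by hand.

```python
import pandas as pd

df = pd.DataFrame({
    "Level 1": ["AA", "AA", "AA", "BB", "BB", "BB", "CC", "CC", "DD", "DD", "EE", "EE"],
    "Size": [2.0, 0.8, 1.5, 3.0, 2.0, 2.5, 5.0, 1.5, 0.5, 3.0, 4.0, 2.0],
    "Department": ["Dept 1"] * 5 + ["Dept 2"] * 7,
})

# Keep only the first row of every (Department, Level 1) pair,
# which is what the COUNTIFS(...)=1 helper column flags as TRUE.
first_rows = df.drop_duplicates(subset=["Department", "Level 1"], keep="first")

# Sum the remaining sizes per department.
totals = first_rows.groupby("Department")["Size"].sum()
print(totals)
```

This gives 5.0 for Dept 1 and 12.0 for Dept 2, matching the totals worked out in the question.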
I have a DataFrame like this:
shops prod_id atv_y1
company_b A 56.3
company_b B 4.3
company_b C 136.3
company_b D 89.3
company_c A 7.3
company_c B 64.0
company_c A 34.7
For plotting purposes, I would like to remove the repeated company_b/company_c values so that each one appears only the first time it is referenced, like below:
shops prod_id atv_y1
company_b A 56.3
B 4.3
C 136.3
D 89.3
company_c A 7.3
B 64.0
A 34.7
How can I do this in pandas?
By the way, you might be able to handle this within the plotting code itself. But if you really want the df transformed as you asked, you could try something like the code below.
It may not be the best way, but it does the job.
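import numpy as np

# Blank out every repeat of a shop name, keeping only its first occurrence.
# Assumes df has the default RangeIndex, so positions and labels coincide.
for shop in df['shops'].unique():
    pos = np.where(df['shops'] == shop)[0]   # row positions of this shop
    df.loc[pos[1:], 'shops'] = ''            # all but the first one
print(df)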
prints
shops prod_id atv_y1
0 company_b A 56.3
1 B 4.3
2 C 136.3
3 D 89.3
4 company_c A 7.3
5 B 64.0
6 A 34.7
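A loop-free variant of the same idea is possible with Series.duplicated, which marks every occurrence of a value after its first. This is only a sketch; the frame is rebuilt from the question's sample data, and the result is written to a new frame so the original stays intact.

```python
import pandas as pd

df = pd.DataFrame({
    "shops": ["company_b"] * 4 + ["company_c"] * 3,
    "prod_id": ["A", "B", "C", "D", "A", "B", "A"],
    "atv_y1": [56.3, 4.3, 136.3, 89.3, 7.3, 64.0, 34.7],
})

# duplicated() is False for the first occurrence of each shop and True afterwards;
# mask() replaces the True positions with an empty string.
display_df = df.assign(shops=df["shops"].mask(df["shops"].duplicated(), ""))
print(display_df)
```

Note that duplicated() looks at the whole column, so a shop that reappears later in the frame is also blanked, just as in the loop version.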
I have a DataFrame that contains time series data. I want to efficiently fill all the missing values in the different columns with a median taken over a timedelta of, say, N minutes. For example, if a column has data for 10:20, 10:21, 10:22, 10:23, 10:24, ... and the value at 10:22 is missing, then with a timedelta of 2 minutes I would want it filled with the median of the values at 10:20, 10:21, 10:23 and 10:24.
One way I can do this is:
for each column in the dataframe:
    find the indexes that have a NaN value
    for each index that has a NaN value:
        extract all values using between_time with index - timedelta and index + timedelta
        find the median of the extracted values
        set the value at that index to the extracted median
This is two nested for loops, which is not very efficient. Is there a more efficient way to do it?
Thanks
IIUC, you can resample on your time column so that the missing timestamps show up as NaN, then fillna from a centered rolling window:
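import numpy as np
import pandas as pd

# dummy data setup: 15 one-minute steps, with two rows dropped to create gaps
np.random.seed(500)
n = 2  # number of minutes to look at on each side of a gap
df = pd.DataFrame({"time":pd.to_timedelta([f"10:{i}:00" for i in range(15)]),
                   "value":np.random.randint(2, 10, 15)})
df = df.drop(df.index[[5,10]]).reset_index(drop=True)
print(df)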
time value
0 10:00:00 4
1 10:01:00 9
2 10:02:00 3
3 10:03:00 3
4 10:04:00 8
5 10:06:00 9
6 10:07:00 2
7 10:08:00 9
8 10:09:00 9
9 10:11:00 7
10 10:12:00 3
11 10:13:00 3
12 10:14:00 7
# regular 1-minute grid; the missing minutes become NaN rows
s = df.set_index("time").resample("1min").asfreq()
# median of up to n rows on each side, as the question asks for a median
print(s.fillna(s.rolling(n*2+1, min_periods=1, center=True).median()))
value
time
10:00:00 4.0
10:01:00 9.0
10:02:00 3.0
10:03:00 3.0
10:04:00 8.0
10:05:00 5.5
10:06:00 9.0
10:07:00 2.0
10:08:00 9.0
10:09:00 9.0
10:10:00 8.0
10:11:00 7.0
10:12:00 3.0
10:13:00 3.0
10:14:00 7.0
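Since the question mentions several columns, here is a small sketch showing that the same centered rolling fill works column by column on a whole DataFrame without explicit loops. The helper name fill_with_rolling_median, the dates and the second column are made up for illustration; the first column reuses the values printed above. It assumes the index is already on a regular one-minute grid, for example after resample(...).asfreq().

```python
import numpy as np
import pandas as pd

def fill_with_rolling_median(frame, n):
    """Fill NaNs with the median of up to n rows on each side (hypothetical helper)."""
    window = 2 * n + 1
    # rolling().median() skips NaNs, so a gap is filled from its neighbours only.
    medians = frame.rolling(window, min_periods=1, center=True).median()
    return frame.fillna(medians)

idx = pd.date_range("2024-01-01 10:00", periods=15, freq="min")
other = np.arange(15, dtype=float)
other[7] = np.nan  # an extra gap in a second, made-up column
frame = pd.DataFrame(
    {"value": [4, 9, 3, 3, 8, np.nan, 9, 2, 9, 9, np.nan, 7, 3, 3, 7],
     "other": other},
    index=idx,
)

filled = fill_with_rolling_median(frame, n=2)
print(filled)
```

Existing values are left untouched because fillna only writes into the NaN positions. If a gap is wider than the window, min_periods=1 still fills it from whatever neighbours fall inside the window, and it stays NaN only when the window contains no data at all.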
ID Height Phase Corrected_Height Final
1 0 A 0 0
2 1.2 A 1.2 1.2
3 3.9 A 3.9 3.9
4 5.8 A 5.8 5.8
5 4.6 A NA 7.7
6 7.7 A 7.7 9.3
7 9.3 A 9.3 10.8
..
300 237.5 P 237.5 ..
301 234.7 D 234.7 ..
302 233.3 D 233.3 ..
303 235.1 D NA ..
555 1.0 D 1.0
I have a set of data with a similar structure. The Phase column was calculated with the formula =IF(B2=MAX(B:B);"P";IF(ROW(B2)<MATCH(MAX(B:B);B:B;0);"A";"D")), thanks to @Scott Craner for the solution in "Naming a behavior in Excel".
To calculate the Corrected_Height column I used =IF(C4="A" & B4>B3;B4; IF(C4="D" & B4<B3;B4;"NA"))). However, I did not get the required result. The idea is that in the "A" phase, a value lower than the previous one should change to NA, and in the "D" phase, a value higher than the previous one should change to NA. Any suggestions on what I should change in the formula? I also want a Final column that gives me the values without the NAs. A, P, and D in the Phase column mean Ascent, Peak, and Descent.
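In Excel, & is the string concatenation operator, so it cannot be used to logically AND two conditions in a formula. Instead, use the AND() function (written here with comma separators; keep ; if that is what your regional settings use):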
=IF(AND(C4="A", B4>B3), B4, IF(AND(C4="D", B4<B3), B4, "NA"))
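For anyone doing the same clean-up outside Excel, the Phase, Corrected_Height and Final logic could be sketched in pandas as below. Two things here are my assumptions and not part of the formula above: the first row and the peak row are kept as they are (the sample table shows both as kept), and the heights after 9.3 are made-up values to exercise the descent phase.

```python
import numpy as np
import pandas as pd

# First seven heights come from the question; the last three are invented.
df = pd.DataFrame({"Height": [0, 1.2, 3.9, 5.8, 4.6, 7.7, 9.3, 8.0, 8.5, 1.0]})

# Phase: A before the (first) maximum, P at it, D after it.
peak = df["Height"].to_numpy().argmax()
pos = np.arange(len(df))
df["Phase"] = np.where(pos < peak, "A", np.where(pos == peak, "P", "D"))

# Keep a height if it rises during ascent or falls during descent.
prev = df["Height"].shift()
keep = (
    (df["Phase"] == "P")                              # assumption: keep the peak
    | prev.isna()                                     # assumption: keep the first row
    | ((df["Phase"] == "A") & (df["Height"] > prev))
    | ((df["Phase"] == "D") & (df["Height"] < prev))
)
df["Corrected_Height"] = df["Height"].where(keep)     # NaN plays the role of "NA"

# Final: the corrected heights with the NA rows removed.
final = df["Corrected_Height"].dropna().reset_index(drop=True)
print(df)
print(final.tolist())
```

Unlike in Excel, & in pandas is an element-wise logical AND on boolean Series, which is why each comparison is wrapped in parentheses.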