winsorize does not affect the outlier - python-3.x

I have this set of data in a DataFrame:
    data       winsor_data
0   1660       1660
1   600        600
2   50         50
3   3173.55    3173.55
4   30         30
5   120        120
6   7.84       7.84
7   1660       1660
8   33.3       33.3
9   2069.49    2069.49
10  42         42
11  384.29     384.29
12  1660       1660
13  1338.57    1338.57
14  200000     200000
15  1760       1760
The value at index 14 (200000) is clearly an outlier.
from scipy.stats.mstats import winsorize
dfdailyIncome['winsor_data'] = winsorize(df['data'], limits=(0,0.95))
I do not understand why the outlier is not clipped. Maybe it has something to do with the way the quantiles are calculated.

I think you are misinterpreting the 'limits' parameter.
If you want to cut 10 percent of your largest values, you need:
dfdailyIncome['winsor_data'] = winsorize(df['data'], limits=[0,0.1])
In your example, limits=(0,0.95) cuts 95 percent of your largest data.
Hint: Even with winsorize(df['data'], limits=[0,0.05]), your data would stay the same. You have fewer than 20 values, so 5 percent of the sample amounts to less than one value and nothing is clipped.
See the example in the scipy.stats.mstats.winsorize documentation for further explanation.
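As a rough check of the point about limits, here is a minimal sketch that runs winsorize on the 16 values from the question. It uses a plain NumPy array instead of the DataFrame column (an assumption made to keep the example self-contained), and the comments about how many values are replaced assume the default inclusive setting.

```python
import numpy as np
from scipy.stats.mstats import winsorize

# The 16 values from the question; 200000 is the outlier.
data = np.array([1660, 600, 50, 3173.55, 30, 120, 7.84, 1660,
                 33.3, 2069.49, 42, 384.29, 1660, 1338.57, 200000, 1760])

# limits=(lower, upper) gives the fraction of values to replace at each end.
top_10 = winsorize(data, limits=(0, 0.1))   # 10% of 16 values -> 1 value replaced
top_5 = winsorize(data, limits=(0, 0.05))   # 5% of 16 values -> 0 values replaced

print(top_10.max())                              # 3173.55
print(np.array_equal(np.asarray(top_5), data))   # True, nothing was clipped
```

With limits=(0, 0.1) the outlier is replaced by the next largest value, while limits=(0, 0.05) leaves a sample this small untouched.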

Related

How do I create series in Excel using criteria from values in other cells?

I have an Excel spreadsheet populated as below:
Latitude  Longitude  Altitude  Value
10        10         1         100
10        10         5         105
10        10         20        120
10        5          1         150
10        5          5         155
10        5          20        170
15        10         1         500
15        10         5         505
15        10         20        520
15        5          1         550
15        5          5         555
15        5          20        570
Using this data, I would like to create a Chart in Excel where I have Value on the X-axis, Altitude on the Y-Axis and a series for each unique combination of Latitude and Longitude.
This should result in 4 series being plotted on the Chart, with each series having 3 values (one value for each Altitude). I feel like this should be easy to do, but I'm struggling to do it myself or find something using the grand-old Google.
Any help you could provide this Excel-noob would be greatly appreciated!
If you re-arrange your data like this:
           value  altitude  value  altitude  value  altitude  value  altitude
lat-long:  10-10            10-5             15-10            15-5
           100    1         150    1         500    1         550    1
           105    5         155    5         505    5         555    5
           120    20        170    20        520    20        570    20
you can insert the four curves individually into a "points (x/y)" diagram:
Here is a screenshot of how the curves are defined:
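The rearrangement itself is just a grouping of the rows by their Latitude and Longitude pair. As an illustration only (the answer does this by hand in Excel, and the names `rows` and `series` are hypothetical), a minimal Python sketch of the same grouping could look like this:

```python
from collections import defaultdict

# (Latitude, Longitude, Altitude, Value) rows from the question.
rows = [
    (10, 10, 1, 100), (10, 10, 5, 105), (10, 10, 20, 120),
    (10, 5, 1, 150), (10, 5, 5, 155), (10, 5, 20, 170),
    (15, 10, 1, 500), (15, 10, 5, 505), (15, 10, 20, 520),
    (15, 5, 1, 550), (15, 5, 5, 555), (15, 5, 20, 570),
]

# One series per unique Latitude-Longitude pair, holding (value, altitude) points.
series = defaultdict(list)
for lat, lon, altitude, value in rows:
    series[f"{lat}-{lon}"].append((value, altitude))

for name, points in series.items():
    print(name, points)
```

Each key corresponds to one pair of value/altitude columns in the rearranged sheet, that is, one curve in the chart.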

Spotfire Add several columns with a custom expression

I would like to add several columns to the Y-axis of a Bar Chart with a custom expression. I have several columns whose names begin with "HB" or "PASS".
Their number and their names change every time I refresh the table, but HB or PASS always remains in the column name.
I tried to use this expression :
Sum($map("[$csearch([pvtable],"PASS*")]",","))/Count([SUBLOT_ID])
or
$map("[$csearch([pvtable],"PASS*")]",","))
If only one column has PASS or HB in its name, it works, but not if several columns have these keywords in their names.
Here is an example of my data. The values are percentages.
LOT_ID SUBLOT_ID WL_PART_CNT PASS_HB1 PASS_HB2 HB5 HB10 HB13 HB25
Q640123 01 3841 86 11 0.25 0.5 0.25 2
Q640123 05 3841 96 3 0 1 0 0
Q640123 10 3841 80 12 0 2 4 2
Q640123 16 3841 40 50 1 1 4 4
Q640123 22 3841 85 5 9 0.5 0.5 0
Q640345 01 3841 86 11 0.25 0.5 0.25 2
Q640345 05 3841 96 3 1 0 0 0
Q640345 10 3841 80 12 0 2 4 2
Q640345 16 3841 40 50 1 1 4 4
Q640345 22 3841 85 5 9 0.5 0.5 0
I want to put LOT_ID on X and all the PASS columns together on Y. I don't want to color my bar chart, but I would like a result like this: one bar chart with all the PASS columns and another with all the HB columns.
This bar chart represents HB.
Thank you for your help, regards, Laurent
You shouldn't need the $map function, only $csearch:
Sum($csearch([pvtable],"PASS*")) /Count([SUBLOT_ID])
EDIT
After looking at your test data, you will need to map the values.
$map("sum([$csearch([pvtable],"PASS*")])","+"),$map("sum([$csearch([pvtable],"HB*")])","+")
Then, on your X-AXIS you will need: <[LOT_ID] NEST [Axis.Default.Names]>
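To make the effect of the mapped expression easier to see, here is a small Python sketch of the expansion: the template is filled in once per matching column and the pieces are joined with "+". The `expand` helper is hypothetical and is not a Spotfire API, and its prefix matching via `fnmatchcase` is only an assumption about how `$csearch` treats the pattern.

```python
from fnmatch import fnmatchcase

# Column names from the sample data.
columns = ["LOT_ID", "SUBLOT_ID", "WL_PART_CNT",
           "PASS_HB1", "PASS_HB2", "HB5", "HB10", "HB13", "HB25"]

def expand(template, pattern, separator, names):
    """Hypothetical stand-in for $map + $csearch: one term per matching column."""
    matches = [name for name in names if fnmatchcase(name, pattern)]
    return separator.join(template.format(col=name) for name in matches)

pass_expr = expand("sum([{col}])", "PASS*", "+", columns)
hb_expr = expand("sum([{col}])", "HB*", "+", columns)

print(pass_expr)  # sum([PASS_HB1])+sum([PASS_HB2])
print(hb_expr)    # sum([HB5])+sum([HB10])+sum([HB13])+sum([HB25])
```

Under these assumptions, each axis expression becomes a single sum over all matching columns, so it keeps working when the set of PASS or HB columns changes.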

Average formula using number of blank rows above

I'm working on a spreadsheet with logged flows that are not at uniform intervals.
I'm looking for a formula for Col G that will average the values in Col A logged during the previous 10 minutes.
Here's the spreadsheet data:
Flow Time min sec sec 10_min Average
187.29 06:10:09 10 9 609
202.90 06:11:21 11 21 681
280.94 06:12:37 12 37 757
218.51 06:13:43 13 43 823
187.29 06:15:13 15 13 913
124.86 06:16:26 16 26 986
109.25 06:18:52 18 52 1132
109.25 06:20:00 20 0 1200 1 177.54
202.90 06:22:30 22 30 1350
265.33 06:23:36 23 36 1416
280.94 06:24:42 24 42 1482
249.73 06:25:58 25 58 1558
218.51 06:27:39 27 39 1659
421.41 06:28:47 28 47 1727
421.41 06:30:00 30 0 1800 1 294.32
Use AVERAGEIFS and construct the criteria with the TEXT function, offsetting one criterion by ten minutes.
=AVERAGEIFS(A:A,B:B, TEXT(B9-TIME(0, 10, 0), "\>0.0###############"),B:B, TEXT(B9, "\<\=0.0###############"))
Note that times can also be resolved as decimal numbers, which is what I have used here. My second average came up slightly different from yours. You may wish to change the \>\= to \>.
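As a cross-check of the window logic, here is a minimal Python sketch of the same trailing average: it keeps readings whose time is later than ten minutes before the end time and no later than the end time. The function name `trailing_average` is hypothetical, and the data is copied from the question.

```python
from datetime import datetime, timedelta

# (Time, Flow) readings from the question.
readings = [
    ("06:10:09", 187.29), ("06:11:21", 202.90), ("06:12:37", 280.94),
    ("06:13:43", 218.51), ("06:15:13", 187.29), ("06:16:26", 124.86),
    ("06:18:52", 109.25), ("06:20:00", 109.25), ("06:22:30", 202.90),
    ("06:23:36", 265.33), ("06:24:42", 280.94), ("06:25:58", 249.73),
    ("06:27:39", 218.51), ("06:28:47", 421.41), ("06:30:00", 421.41),
]

def trailing_average(rows, end, window=timedelta(minutes=10)):
    """Average the flows logged in the half-open window (end - window, end]."""
    end_time = datetime.strptime(end, "%H:%M:%S")
    start_time = end_time - window
    flows = [flow for stamp, flow in rows
             if start_time < datetime.strptime(stamp, "%H:%M:%S") <= end_time]
    return sum(flows) / len(flows)

print(round(trailing_average(readings, "06:20:00"), 2))  # 177.54
print(round(trailing_average(readings, "06:30:00"), 2))  # 294.32
```

Both results match the averages in the question, which suggests a strict lower bound: the reading logged exactly ten minutes earlier is excluded.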

Excel - running % of running total in pivot table

I have a table like:
periodo quintil pos
201611 1 10
201611 2 20
201611 3 30
201611 4 40
201611 5 50
201612 1 9
201612 2 19
201612 3 29
201612 4 39
201612 5 49
I need to create a pivot table like:
periodo quintil running_pos running_%
201611
1 10 7%
2 30 20%
3 60 40%
4 100 67%
5 150 100%
201612
1 9 6%
2 28 19%
3 57 39%
4 96 66%
5 145 100%
Since the running total is not a new field but a way of showing an existing field (pos, shown as a running total in quintil), the problem arises when I try to create the running % of the running total.
How can I also add this field (running % of running total)?
In the Spanish version there is no option whose name looks like a translation of "running total"...
To display what you want in a Pivot Table
- Drag pos to the values area three times
- For the first, use the SUM
- For the second, use the "show as running total"
- For the third, use the "show as % running total"
Here are the results with minimal formatting
Here are the value settings for the third column:
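For reference, the numbers the pivot table should show can be reproduced with a short Python sketch: a running sum of pos within each periodo, divided by that periodo's total. This only illustrates the arithmetic and is not how Excel computes it internally; the names `rows` and `result` are hypothetical.

```python
from itertools import accumulate, groupby

# (periodo, quintil, pos) rows from the question, already sorted by periodo.
rows = [
    (201611, 1, 10), (201611, 2, 20), (201611, 3, 30), (201611, 4, 40), (201611, 5, 50),
    (201612, 1, 9), (201612, 2, 19), (201612, 3, 29), (201612, 4, 39), (201612, 5, 49),
]

result = {}
for periodo, group in groupby(rows, key=lambda row: row[0]):
    group = list(group)
    running = list(accumulate(pos for _, _, pos in group))
    total = running[-1]
    # Each entry: (quintil, running_pos, running % of the period total).
    result[periodo] = [(quintil, run, round(100 * run / total))
                       for (_, quintil, _), run in zip(group, running)]

for periodo, lines in result.items():
    print(periodo)
    for quintil, run, pct in lines:
        print(f"  {quintil} {run} {pct}%")
```

The output matches the target table in the question, for example 30 and 20% for quintil 2 of 201611.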

How to format a number to appear as percentage in Excel

Let's say I have a few numbers in a sheet:
a b c d
1 33 53 23 11
2 42 4 83 64
3 75 3 48 38
4 44 0 22 45
5 2 34 76 6
6
7 Total 85
I would like to display those numbers so that the cell value still holds the original figure (A1 = 33)
but the cell displays both the number and a percentage from the total (B7) eg
a b c d
1 33 (39%) 53 (62%) 23 (27%) 11 (13%)
2 42 (49%) 4 (5%) 83 (98%) 64 (75%)
3 75 (88%) 3 (4%) 48 (56%) 38 (45%)
4 44 (52%) 0 (0%) 22 (26%) 45 (53%)
5 2 (2%) 34 (40%) 76 (89%) 6 (7%)
6
7 Total 85
I know how to format a cell as a percentage, but I can't figure out how to display both original values, the calculated percentage value (value/total*100), but not change the cell value so I could still sum the cells in the end (eg. A6 =SUM(A1:A5) = 196)
Does anyone have an idea? I was hoping there could be a way to duplicate and calculate the figure using text formatting, but I can't get anything to work.
I'm guessing this is a trivial answer and maybe not what you're looking for, but why not just add a column for each of the columns you have now?
a a' b b' c c' d d'
1 33 (39%) 53 (62%) 23 (27%) 11 (13%)
2 42 (49%) 4 (5%) 83 (98%) 64 (75%)
3 75 (88%) 3 (4%) 48 (56%) 38 (45%)
4 44 (52%) 0 (0%) 22 (26%) 45 (53%)
5 2 (2%) 34 (40%) 76 (89%) 6 (7%)
6
7 Total 85
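The idea behind the helper columns is to keep the numbers numeric and build the display text separately. A minimal Python sketch of that separation, using column a and the total of 85 from the question (the function name `with_share` is hypothetical):

```python
def with_share(value, total):
    """Return the value followed by its share of the total, e.g. '33 (39%)'."""
    return f"{value} ({value / total:.0%})"

column_a = [33, 42, 75, 44, 2]
total = 85

# Display strings live beside the data; the original numbers are untouched.
labels = [with_share(value, total) for value in column_a]

print(labels)         # ['33 (39%)', '42 (49%)', '75 (88%)', '44 (52%)', '2 (2%)']
print(sum(column_a))  # 196
```

Because the labels are derived text, the original column can still be summed as usual.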
@Ari's answer seems to meet the requirements in your question: it repeats no more information than the example output you gave, and it stays viable for up to around 8000 columns (unless you are on a very old version of Excel). Jerry's comment is also correct that what you want to achieve is not possible the way you want to achieve it.
However, there are other approaches that might be acceptable substitutes. One is to copy your data and use Paste Special with Operation Divide, either elsewhere or over the top of your data. Pasting over the top shows either the values or the percentages; pasting elsewhere duplicates your data. Pasting over the top would also require something like Operation Multiply to revert to the values, and reformatting each time if the cells are to appear as in your example.
Another is to use a PivotTable with some calculated fields. Both are shown below:
I appreciate neither is exactly what you are asking for.
