find average variance across columns & average of variance across rows in python - python-3.x

I have the data as follows:-
a b c d
5 6 32 12
9 8 16 23
15 8 14 20
I want to compute the variance of each column, so that I have one variance per column. Then I would like to average those variances across columns and arrive at a single number for the entire dataset.
Can anyone please help?

Do you want var and mean?
df.var().mean()
output: 39.083333333333336
intermediate:
df.var()
a 25.333333
b 1.333333
c 97.333333
d 32.333333
dtype: float64
NB. By default, var uses 1 degree of freedom (ddof=1) to compute the variance. If you want 0, use df.var(ddof=0).
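To see what ddof changes, the figures above can be cross-checked with Python's standard statistics module, where variance is the sample variance (ddof=1) and pvariance is the population variance (ddof=0). This is only a sketch on the question's twelve numbers; the dictionary layout and variable names are illustrative assumptions. It also covers the row-wise average mentioned in the title, which in pandas would be df.var(axis=1).mean().

```python
from statistics import mean, pvariance, variance

# The question's data, stored column by column (layout chosen for illustration).
data = {
    "a": [5, 9, 15],
    "b": [6, 8, 8],
    "c": [32, 16, 14],
    "d": [12, 23, 20],
}

# One variance per column, then the average of those variances.
sample_var = {col: variance(vals) for col, vals in data.items()}       # ddof=1
population_var = {col: pvariance(vals) for col, vals in data.items()}  # ddof=0

mean_sample_var = mean(sample_var.values())          # same as df.var().mean()
mean_population_var = mean(population_var.values())  # same as df.var(ddof=0).mean()

# Row-wise variant: one sample variance per row, then their average.
rows = list(zip(*data.values()))
mean_row_var = mean(variance(row) for row in rows)

print(round(mean_sample_var, 6))      # 39.083333
print(round(mean_population_var, 6))  # 26.055556
```

With only three rows, the choice of ddof changes the result by a factor of 2/3, so it is worth deciding which definition is wanted before reporting the single number.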

Related

how to get percentage of columns to sum of row in python [duplicate]

This question already has an answer here:
Normalize rows of pandas data frame by their sums [duplicate]
(1 answer)
Closed 2 years ago.
I have very high-dimensional data with more than 100 columns. As an example, I am sharing a simplified version of it below:
date product price amount
11/17/2019 A 10 20
11/24/2019 A 10 20
12/22/2020 A 20 30
15/12/2019 C 40 50
02/12/2020 C 40 50
I am trying to calculate each column's percentage of the total row sum, as illustrated below:
date product price amount
11/17/2019 A 10/(10+20) 20/(10+20)
11/24/2019 A 10/(10+20) 20/(10+20)
12/22/2020 A 20/(20+30) 30/(20+30)
15/12/2019 C 40/(40+50) 50/(40+50)
02/12/2020 C 40/(40+50) 50/(40+50)
Is there any way to do this efficiently for high dimensional data? Thank you.
In addition to the provided link (Normalize rows of pandas data frame by their sums), you need to select the specific columns, as your first two columns are non-numeric:
cols = df.columns[2:]
df[cols] = df[cols].div(df[cols].sum(axis=1), axis=0)
Out[1]:
date product price amount
0 11/17/2019 A 0.3333333333333333 0.6666666666666666
1 11/24/2019 A 0.3333333333333333 0.6666666666666666
2 12/22/2020 A 0.4 0.6
3 15/12/2019 C 0.4444444444444444 0.5555555555555556
4 02/12/2020 C 0.4444444444444444 0.5555555555555556
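With more than 100 columns, picking the numeric columns by position may be fragile. A possible variant, sketched below, selects them by dtype with select_dtypes instead. It assumes every numeric column should take part in the row sum, and it rebuilds the question's five sample rows by hand; the names num_cols and result are illustrative.

```python
import pandas as pd

# The simplified sample from the question.
df = pd.DataFrame({
    "date": ["11/17/2019", "11/24/2019", "12/22/2020", "15/12/2019", "02/12/2020"],
    "product": ["A", "A", "A", "C", "C"],
    "price": [10, 10, 20, 40, 40],
    "amount": [20, 20, 30, 50, 50],
})

# Pick numeric columns by dtype rather than by position.
num_cols = df.select_dtypes(include="number").columns

# Divide each numeric value by its row total (axis=0 aligns the sums on the row index).
result = df.copy()
result[num_cols] = df[num_cols].div(df[num_cols].sum(axis=1), axis=0)

print(result)
```

If some numeric columns (an ID column, for example) should not be included in the row sum, they would need to be dropped from num_cols first.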

Difference between consecutive maxima and minima in a .csv dataset

I have a dataset which represents tracking data of a mouse's paw moving up and down in the y-axis as it reaches up for and pulls down on a piece of string.
The output of the data is a list of y-coordinates corresponding to a fraction of a second. For example:
1 333.9929833
2 345.4504726
3 355.7046572
4 367.6136684
5 379.7906121
6 390.5470788
7 397.9017118
8 403.677123
9 412.1550843
10 416.516814
11 419.8205706
12 423.7994881
13 429.4874275
14 419.2652898
15 360.1626136
16 298.8212249
17 264.3647809
18 265.0078862
19 268.1828407
20 283.101321
21 294.8219163
22 308.4875135
In this series, there is a max value of 429... and a minimum of 264... However, as the example image shows (image not included here; excuse the gaps), there are multiple consecutive wave-like maxima and minima.
The goal is to find the difference between each maximum and the following minimum, and between each minimum and the following maximum (i.e. max1-min1, min2-max1, max2-min2...). Ideally, this would also provide the timepoints of each max and min (e.g. 13 and 17 for the provided dataset). There is a column with integer labels (1, 2, 3...) corresponding to each coordinate.
Thanks for your help!
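One possible approach, sketched here in plain Python under the assumption that the trace is smooth enough for every change of direction to count as a real extremum, is to walk through the series, record each point where the slope changes sign, and then subtract consecutive extrema. The function names are illustrative. On noisy tracking data, smoothing first, or a peak finder with a prominence threshold such as scipy.signal.find_peaks, would probably be needed.

```python
def turning_points(times, values):
    """Return (time, value, kind) for each interior local maximum or minimum."""
    points = []
    prev_slope = 0
    for i in range(1, len(values)):
        step = values[i] - values[i - 1]
        slope = (step > 0) - (step < 0)  # +1 rising, -1 falling, 0 flat
        if slope == 0:
            continue  # flat step: keep the previous direction
        if prev_slope > 0 and slope < 0:
            points.append((times[i - 1], values[i - 1], "max"))
        elif prev_slope < 0 and slope > 0:
            points.append((times[i - 1], values[i - 1], "min"))
        prev_slope = slope
    return points


def consecutive_differences(points):
    """Return (earlier time, later time, later value - earlier value) for neighbouring extrema."""
    return [(t0, t1, v1 - v0) for (t0, v0, _), (t1, v1, _) in zip(points, points[1:])]


# The 22 samples from the question.
y = [333.9929833, 345.4504726, 355.7046572, 367.6136684, 379.7906121,
     390.5470788, 397.9017118, 403.677123, 412.1550843, 416.516814,
     419.8205706, 423.7994881, 429.4874275, 419.2652898, 360.1626136,
     298.8212249, 264.3647809, 265.0078862, 268.1828407, 283.101321,
     294.8219163, 308.4875135]
t = list(range(1, len(y) + 1))

extrema = turning_points(t, y)
diffs = consecutive_differences(extrema)
print(extrema)  # [(13, 429.4874275, 'max'), (17, 264.3647809, 'min')]
print(diffs)
```

The first and last samples are never reported, because it cannot be known from the data whether the series turns there.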

Microsoft Excel Pie Chart bug

Excel does not calculate the pie chart percentages correctly. In the picture you can see that the c and d values are exactly the same, but for some reason "c" has a higher percentage assigned to it and I can't figure out why.
The values are a-21; b-5; c-11; d-11; e-3; f-5; g-1; h-39. On the pie chart the percentages shown are a-22%; b-5%; c-12%; d-11%; e-3%; f-5%; g-1%; h-41%.
While not the ideal solution, if you right-click on one of the labels and choose the Format Data Labels option, you can change the Number display type to Percentage. This increases the number of decimal places in the percentage shown, but gives you the accurate result you asked for.
The problem is caused by your actual percentages being:
Name Val %
a 21 21.875
b 5 5.208333333
c 11 11.45833333
d 11 11.45833333
e 3 3.125
f 5 5.208333333
g 1 1.041666667
h 39 40.625
As you can see, these numbers can't be represented exactly as whole-number percentages, so compensations have to be made somewhere. It just so happened that the compensations were made on numbers that should be the same.
Another possible option would be to round your percentage results:
Name Val % Rounded %
a 21 21.875 22
b 5 5.208333333 5
c 11 11.45833333 11
d 11 11.45833333 11
e 3 3.125 3
f 5 5.208333333 5
g 1 1.041666667 1
h 39 40.625 41
The sum of these values is now 99 instead of 96 as in your original, which results in a better graph:
You can do this using the formula =ROUND(num,0) for each of your calculated percentages.
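The arithmetic can be checked with a short Python sketch. It does not reproduce Excel's own labelling logic, which is not described in this thread; it only shows that independently rounded percentages add up to 99, so any set of whole-number labels that totals 100 has to gain one point somewhere. That is consistent with c being shown as 12% while d stays at 11%.

```python
values = {"a": 21, "b": 5, "c": 11, "d": 11, "e": 3, "f": 5, "g": 1, "h": 39}

total = sum(values.values())  # 96

# Exact share of each slice, in percent.
exact = {name: 100 * v / total for name, v in values.items()}

# Each share rounded on its own, as =ROUND(num,0) would do.
rounded = {name: round(p) for name, p in exact.items()}

rounded_sum = sum(rounded.values())
shortfall = 100 - rounded_sum  # points missing if the labels must total 100

print(rounded)      # {'a': 22, 'b': 5, 'c': 11, 'd': 11, 'e': 3, 'f': 5, 'g': 1, 'h': 41}
print(rounded_sum)  # 99
print(shortfall)    # 1
```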

Histogram in Excel with bins

I have an Excel spreadsheet with scores and the frequency of each score, like this:
Score Count
0 2297802
1 2392803
2 1258527
3 969550
4 818579
5 675646
6 591326
7 598960
8 506268
9 448232
10 414830
11 382808
...
I'm looking for a way to 'bucket' these scores in intervals of (say) 3 and plot them to show the distribution:
Score Count
0-2 5949132
3-5 2463775
...
And so on
I'm using Excel for Mac and I tried defining a 3 interval bin in the Analysis ToolPak but that appears to work only on raw data as opposed to the counts that I already have.
In cells D2 downwards, enter your upper (inclusive) bin limits.
In cell E2, enter =SUMIF(A:A,"<="&D2,B:B)-SUM(E$1:E1)
Copy E2 downwards.
Example result: (image not included)
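The logic of that formula (cumulative count up to the bin limit, minus everything already assigned to earlier bins) can be sketched in Python as a cross-check. The sketch uses only the twelve scores listed in the question, since the remaining rows are elided there, and the bin limits 2, 5, 8 and 11 are assumed for illustration.

```python
# Scores 0..11 and their counts, as listed in the question.
scores = list(range(12))
counts = [2297802, 2392803, 1258527, 969550, 818579, 675646,
          591326, 598960, 506268, 448232, 414830, 382808]

# Upper (inclusive) bin limits, i.e. what would go in D2 downwards.
upper_limits = [2, 5, 8, 11]

binned = []
for limit in upper_limits:
    # SUMIF(A:A,"<="&D2,B:B): total count of all scores up to this limit
    cumulative = sum(c for s, c in zip(scores, counts) if s <= limit)
    # -SUM(E$1:E1): subtract what earlier bins already hold
    binned.append(cumulative - sum(binned))

print(binned)  # [5949132, 2463775, 1696554, 1245870]
```

The first two bins match the 0-2 and 3-5 totals given in the question.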

In Excel aggregate values from a column if the following values are larger than zero

I would like Excel to iterate through a column of precipitation values and keep adding them as long as the value is larger than zero. For example, the table looks like this, with the required result in NewCol.
Date prec NewCol
x 10 10
x 8 18
x 3 21
x 0 0
x 1 1
x 0 0
I would like to use the value 21 (and the other peak values) to assess consecutive rainfall days. Is this possible in Excel? I haven't been able to find the solution in MATLAB.
Thanks in Advance!
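The rule described in the question is a running total that resets to zero whenever the precipitation is zero. A minimal Python sketch of that logic follows; it is not an Excel formula, and the function and variable names are illustrative. It also extracts the total of each wet spell, which is the value the running total reaches just before it resets.

```python
def running_total_with_reset(values):
    """Accumulate values, restarting from zero whenever a value is not positive."""
    totals = []
    current = 0
    for v in values:
        current = current + v if v > 0 else 0
        totals.append(current)
    return totals


prec = [10, 8, 3, 0, 1, 0]
new_col = running_total_with_reset(prec)

# Total of each wet spell: a positive running total followed by a reset (or by the end of the data).
spell_totals = [t for t, nxt in zip(new_col, new_col[1:] + [0]) if t > 0 and nxt == 0]

print(new_col)       # [10, 18, 21, 0, 1, 0]
print(spell_totals)  # [21, 1]
```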
