Computing Standard Deviation without packages in Python? - python-3.x

I'm trying to figure out how to write a script that calculates the standard deviation of the values in a file. For example, say I downloaded a CSV containing a list of values and I want to find their standard deviation by running a Python program. We are not using numpy here!

If you allow the use of the standard library:
import math

xs = [0.5, 0.7, 0.3, 0.2]
mean = sum(xs) / len(xs)                           # arithmetic mean
var = sum(pow(x - mean, 2) for x in xs) / len(xs)  # population variance
std = math.sqrt(var)                               # standard deviation
If not, you need to approximate sqrt by hand, for example with binary search or Newton's method. Wikipedia has a page on methods for computing square roots.
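For instance, here is a minimal sketch of Newton's method for the square root (the tolerance of 1e-10 is an arbitrary choice):

def newton_sqrt(a, tol=1e-10):
    """Approximate sqrt(a) for a >= 0 by Newton's method."""
    if a == 0:
        return 0.0
    x = a  # initial guess
    while abs(x * x - a) > tol:
        x = (x + a / x) / 2  # average of x and a/x
    return x

std = newton_sqrt(var)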

With Python 3.4 and above, the standard library includes a statistics module that provides the population standard deviation (pstdev), the sample standard deviation (stdev), and other functions.
Here is an example of how to use it:
import statistics
data = [1, 1, 2.5, 6.5, 7.3, 8, 9.2]
print(statistics.pstdev(data))
# 3.2159043543498815
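pstdev divides by n; if you want the sample standard deviation, which divides by n - 1, use statistics.stdev on the same data:

print(statistics.stdev(data))
# ≈ 3.4736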

from math import sqrt

n = [11, 8, 8, 3, 4, 4, 5, 6, 6, 7, 8]
mean = sum(n) / len(n)

total = 0
for i in n:
    total += (i - mean) ** 2

stdev = sqrt(total / (len(n) - 1))  # sample standard deviation (n - 1 in the denominator)
print(stdev)

filename = r"C:\Users\mmb0368\Desktop\input.txt"  # raw string keeps the backslashes literal
with open(filename) as file:                      # text mode, so lines are str, not bytes
    num_list = [int(line.rstrip("\n")) for line in file]

mean = sum(num_list) / len(num_list)
print(mean, max(num_list), min(num_list))

variance = sum((x - mean) ** 2 for x in num_list) / len(num_list)
stdev = variance ** 0.5  # population standard deviation
print(stdev)

from math import sqrt

def getAverage(mylist):
    """
    Calculate the average of a list of numbers.

    Parameters:
        mylist (list): List of numbers

    Returns:
        float: Average of the numbers in the list

    Example:
        >>> getAverage([1, 5, 10])
        5.333333333333333
    """
    return sum(mylist) / len(mylist)

def getStandardDeviation(mylist):
    """
    Calculate the sample standard deviation of a list of numbers.

    Parameters:
        mylist (list): List of numbers

    Returns:
        float: Standard deviation of the numbers in the list

    Example:
        >>> getStandardDeviation([1, 5, 10])
        4.509249752822894
    """
    average = getAverage(mylist)  # compute the mean once, not on every iteration
    ls = []
    for i in mylist:
        ls.append((i - average) ** 2)
    return sqrt(sum(ls) / (len(mylist) - 1))

mylist = [1, 5, 10]
getAverage(mylist=mylist)
# 5.333333333333333
getStandardDeviation(mylist=mylist)
# 4.509249752822894
This code defines two functions, getAverage and getStandardDeviation, for calculating the average and the sample standard deviation of a list of numbers. getAverage takes a list of numbers and returns their mean. getStandardDeviation squares each number's difference from the average, divides the sum of those squares by len(mylist) - 1, and returns the square root of the result. A sample list mylist is defined at the end, and both functions are called with it as the argument.

Related

How to use random_split with percentage split (sum of input lengths does not equal the length of the input dataset)

I tried to use torch.utils.data.random_split as follows:
import torch
from torch.utils.data import DataLoader, random_split
list_dataset = [1,2,3,4,5,6,7,8,9,10]
dataset = DataLoader(list_dataset, batch_size=1, shuffle=False)
random_split(dataset, [0.8, 0.1, 0.1], generator=torch.Generator().manual_seed(123))
However, when I tried this, I got the error ValueError: Sum of input lengths does not equal the length of the input dataset!
I looked at the docs and it seems like I should be able to pass in decimals that sum to 1, but clearly it's not working.
I also Googled this error and the closest thing that comes up is this issue.
What am I doing wrong?
You're likely using an older version of PyTorch, such as PyTorch 1.10, which does not have this functionality.
To replicate the behavior in the older version, you can just copy the source code of the newer version:
import math
import warnings
from typing import List

from torch import default_generator, randperm
from torch._utils import _accumulate
from torch.utils.data.dataset import Subset

def random_split(dataset, lengths,
                 generator=default_generator):
    r"""
    Randomly split a dataset into non-overlapping new datasets of given lengths.

    If a list of fractions that sum up to 1 is given,
    the lengths will be computed automatically as
    floor(frac * len(dataset)) for each fraction provided.

    After computing the lengths, if there are any remainders, 1 count will be
    distributed in round-robin fashion to the lengths
    until there are no remainders left.

    Optionally fix the generator for reproducible results, e.g.:

    >>> random_split(range(10), [3, 7], generator=torch.Generator().manual_seed(42))
    >>> random_split(range(30), [0.3, 0.3, 0.4], generator=torch.Generator(
    ...   ).manual_seed(42))

    Args:
        dataset (Dataset): Dataset to be split
        lengths (sequence): lengths or fractions of splits to be produced
        generator (Generator): Generator used for the random permutation.
    """
    if math.isclose(sum(lengths), 1) and sum(lengths) <= 1:
        subset_lengths: List[int] = []
        for i, frac in enumerate(lengths):
            if frac < 0 or frac > 1:
                raise ValueError(f"Fraction at index {i} is not between 0 and 1")
            n_items_in_split = int(
                math.floor(len(dataset) * frac)  # type: ignore[arg-type]
            )
            subset_lengths.append(n_items_in_split)
        remainder = len(dataset) - sum(subset_lengths)  # type: ignore[arg-type]
        # add 1 to all the lengths in round-robin fashion until the remainder is 0
        for i in range(remainder):
            idx_to_add_at = i % len(subset_lengths)
            subset_lengths[idx_to_add_at] += 1
        lengths = subset_lengths
        for i, length in enumerate(lengths):
            if length == 0:
                warnings.warn(f"Length of split at index {i} is 0. "
                              f"This might result in an empty dataset.")

    # Cannot verify that dataset is Sized
    if sum(lengths) != len(dataset):  # type: ignore[arg-type]
        raise ValueError("Sum of input lengths does not equal the length of the input dataset!")

    indices = randperm(sum(lengths), generator=generator).tolist()  # type: ignore[call-overload]
    return [Subset(dataset, indices[offset - length : offset])
            for offset, length in zip(_accumulate(lengths), lengths)]
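With this backport in scope, the fraction-based call from the question should work; note that random_split expects the dataset itself (here the plain list), not a DataLoader:

import torch

list_dataset = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
train, val, test = random_split(
    list_dataset, [0.8, 0.1, 0.1],
    generator=torch.Generator().manual_seed(123))
print(len(train), len(val), len(test))  # 8 1 1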
Alternatively, if you know the length of your dataset, i.e. it implements __len__, you can compute integer lengths up front:
proportions = [.75, .10, .15]
lengths = [int(p * len(dataset)) for p in proportions]
lengths[-1] = len(dataset) - sum(lengths[:-1])
tr_dataset, vl_dataset, ts_dataset = random_split(dataset, lengths)

Matrix Sum logic

I am working with a 2D matrix and finding sums of its elements; below is my logic:
def calculateSum(a, x, y):
    s = 0
    for i in range(0, x + 1):
        for j in range(0, y + 1):
            s = s + a[i][j]
    print(s)
    return s

def check(a):
    arr = []
    x = 0
    y = 0
    for i in range(len(a)):
        row = []
        y = 0
        for j in range(len(a[i])):
            row.append(calculateSum(a, x, y))
            y = y + 1
        x = x + 1
        print(row)

check([[1, 2], [3, 4]])
calculateSum is the function that calculates the sum of elements.
Now my question is: if the matrix is huge, is there a way to improve the performance of the above program?
Update:
import numpy as np

def calculateSum(a, x, y):
    return np.sum(a[x:, y:])
After switching to numpy I get the error TypeError: list indices must be integers or slices, not tuple.
As the matrix dimensions increase, efficiency falls. The efficient way to deal with this is to parallelize the task of summing the values, which is possible because addition is associative.
Luckily for you, this parallelization is already implemented in a library known as numpy.
To get started with numpy, use pip install numpy. For an overview of the library, visit: https://www.geeksforgeeks.org/numpy-in-python-set-1-introduction/
For your question you will need the function numpy.sum().
Edit:
As @Mad Physicist pointed out, numpy also has a packed memory layout, and its routines are implemented in C, which boosts its speed even further.
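As for the TypeError in the update: a[x:, y:] indexing only works on a numpy array, so the nested list has to be converted with np.array first. A minimal sketch, assuming the goal is the same top-left block sums that check() prints (cumulative sums compute them all in one pass):

import numpy as np

a = np.array([[1, 2], [3, 4]])  # convert the nested list to an ndarray first

# prefix[x, y] equals the sum of the block a[:x+1, :y+1]
prefix = a.cumsum(axis=0).cumsum(axis=1)
print(prefix)
# [[ 1  3]
#  [ 4 10]]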

Difference Between ADD and SUM in Python

Explanation of sum's results when given two 2-d arrays.
When I run the code in the Spyder IDE, the sum function and the numpy.add function show different results. Can anyone help me understand how sum produces its output when it is given two 2-d arrays as its two parameters, instead of an array and a number? Thank you.
import numpy as np
x = np.array([[1,2],[3,4]], dtype=np.float64)
y = np.array([[5,6],[7,8]], dtype=np.float64)
print(x)
print(y)
print (x+y)
print(sum(x,y))
print(np.add(x,y))
Output
[[1. 2.]
[3. 4.]]
[[5. 6.]
[7. 8.]]
[[ 6. 8.]
[10. 12.]]
[[ 9. 12.]
[11. 14.]]
[[ 6. 8.]
[10. 12.]]
In Numpy, the + operator is defined to be element-wise addition and in fact equivalent to np.add(...).
The sum(iterable, [start]) built-in function
Sums start and the items of an iterable from left to right and returns the total. start defaults to 0.
So if given only one matrix, it performs a column-wise summation. If given a second argument, that start value is added (element-wise) to the total. Some smaller examples:
sum(x)
> array([4., 6.])
# i.e. [(1+3), (2+4)]
sum(x, 1)
> array([5., 7.])
# i.e. [(1+1+3), (1+2+4)]
sum(y)
> array([12., 14.])
# i.e. [(5+7), (6+8)]
sum(x, sum(y))
> array([16., 20.])
# i.e. [((5+7)+1+3), ((6+8)+2+4)]
sum(x, y)
> array([[ 9., 12.],
[11., 14.]])
# i.e. [[(5+1+3), (6+2+4)],
# [(7+1+3), (8+2+4)]]
The last sum() performs the column-wise sum of x and then adds the result to each row of y, broadcast across columns. Written with Numpy, it's equivalent to
sum(x, y) == x.sum(axis=0) + y
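A quick check of that equivalence:

import numpy as np
x = np.array([[1, 2], [3, 4]], dtype=np.float64)
y = np.array([[5, 6], [7, 8]], dtype=np.float64)
print(np.array_equal(sum(x, y), x.sum(axis=0) + y))  # True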

In SymPy, how to calculate posterior probability?

Assuming that I have defined two random variables in SymPy:
x = Normal('x', 0, 2)
y = 2*x + Normal('e', 0, 3)
Now, given the evidence that y = 4, is it possible to define a new random variable that follows the posterior distribution P(x | y=4)?
It is easy to simply multiply the two probability density functions by hand; however, I wonder whether SymPy has a feature to yield such a random variable directly.
The typical way is to pass conditions as the second argument without creating a new random symbol: for example,
density(x, Eq(y, 4)) # Lambda(x, 5*sqrt(2)*exp(8/25)*exp(-x**2/8)*exp(-2*(-x + 2)**2/9)/(12*sqrt(pi)))
P(x > 0, Eq(y, 4)) # -erfc(8*sqrt(2)/15)/2 + 1
But it's also possible to create a random variable with a custom density using ContinuousRV:
from sympy import Eq, Symbol
from sympy.stats import ContinuousRV, density
x_post = Symbol("x_post")
X_post = ContinuousRV(x_post, density(x, Eq(y, 4))(x_post))
For example, simplify(E(X_post)) returns 16*erf(3*sqrt(2)/10)/25 + 16*erfc(3*sqrt(2)/10)/25 + 16/25.
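Putting it all together, a minimal self-contained sketch (the noise variable's name e is arbitrary):

from sympy import Eq, Symbol, simplify
from sympy.stats import ContinuousRV, E, Normal, P, density

x = Normal('x', 0, 2)        # prior: x ~ N(0, 2)
y = 2*x + Normal('e', 0, 3)  # observation model with N(0, 3) noise

print(density(x, Eq(y, 4))(Symbol('t')))  # posterior density of x given y = 4
print(P(x > 0, Eq(y, 4)))                 # posterior probability

x_post = Symbol('x_post')
X_post = ContinuousRV(x_post, density(x, Eq(y, 4))(x_post))
print(simplify(E(X_post)))  # posterior mean; the returned expression equals 32/25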

Using Theano.scan with multidimensional arrays

To speed up my code I am converting a multidimensional sumproduct function from Python to Theano. My Theano code reaches the same result, but only calculates the result for one dimension at a time, so that I have to use a Python for-loop to get the end result. I assume that would make the code slow, because Theano cannot optimize memory usage and transfer (for the gpu) between multiple function calls. Or is this a wrong assumption?
So how can I change the Theano code, so that the sumprod is calculated in one function call?
The original Python function:
def sumprod(a1, a2):
    """Sum the element-wise products of `a1` and `a2`."""
    result = numpy.zeros_like(a1[0])
    for i, j in zip(a1, a2):
        result += i * j
    return result
For the following input
a1 = ([1, 2, 4], [5, 6, 7])
a2 = ([1, 2, 4], [5, 6, 7])
the output would be [ 26. 40. 65.], that is, 1*1 + 5*5, 2*2 + 6*6, and 4*4 + 7*7.
The Theano version of the code:
import theano
import theano.tensor as T
import numpy
a1 = ([1, 2, 4], [5, 6, 7])
a2 = ([1, 2, 4], [5, 6, 7])
# wanted result: [ 26. 40. 65.]
# that is 1*1 + 5*5, 2*2 + 6*6 and 4*4 + 7*7
Tk = T.iscalar('Tk')
Ta1_shared = theano.shared(numpy.array(a1).T)
Ta2_shared = theano.shared(numpy.array(a2).T)
outputs_info = T.as_tensor_variable(numpy.asarray(0, 'float64'))
Tsumprod_result, updates = theano.scan(fn=lambda Ta1_shared, Ta2_shared, prior_value:
prior_value + Ta1_shared * Ta2_shared,
outputs_info=outputs_info,
sequences=[Ta1_shared[Tk], Ta2_shared[Tk]])
Tsumprod_result = Tsumprod_result[-1]
Tsumprod = theano.function([Tk], outputs=Tsumprod_result)
result = numpy.zeros_like(a1[0])
for i in range(len(a1[0])):
result[i] = Tsumprod(i)
print result
First, more people will answer Theano questions on the Theano mailing list than on Stack Overflow. But I'm here :)
Second, your function isn't a good fit for the GPU. Even if everything were well optimized, transferring the input to the GPU just to multiply and sum would take more time than the Python version.
Your Python code is also slow; here is a version that should be faster:
def sumprod(a1, a2):
    """Sum the element-wise products of `a1` and `a2`."""
    a1 = numpy.asarray(a1)
    a2 = numpy.asarray(a2)
    result = (a1 * a2).sum(axis=0)
    return result
For the Theano code, here is the equivalent of this faster Python version (no need for scan):
m1 = theano.tensor.matrix()
m2 = theano.tensor.matrix()
f = theano.function([m1, m2], (m1 * m2).sum(axis=0))
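A quick usage sketch (assuming Theano's default floatX of float64, since T.matrix() uses floatX):

import numpy
print(f(numpy.asarray(a1, dtype='float64'),
        numpy.asarray(a2, dtype='float64')))
# [ 26.  40.  65.]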
The thing to remember from this is that you need to "vectorize" your code. "Vectorize" is used here in the NumPy sense: use numpy.ndarray and functions that work on the full tensor at a time. This is always faster than doing it with a loop (a Python loop or theano.scan). Theano also optimizes some of those cases by moving the computation outside the scan, but it doesn't always do so.
