How to get the specific range values from numpy.random.normal()? - python-3.x

Is there a way to get the specific range of array from the results of numpy.random.normal()? without computing all the random numbers, it only computes the said range limits
Normal application
random_numbers = numpy.random.normal(0, 1, 1000)
What i want is get the range of this random_numbers without computing it all first
first_100_random_numbers = needs the results of the first 100 values
300th_400th_random_numbers = needs the results of the 300 - 400 values

If you generate the random numbers one at a time, you can just keep track of whether they increase the max or min values. You will still have to compute the values, but you won't run into a memory issue since you only have to store three numbers (max, min, and latest_random)
import numpy as np
max_=0
min_=0
for i in range(1000):
new_number=np.random.normal(0,1,1)
if new_number>max_:
max_=new_number
if new_number<min_:
min_=new_number
range_=max_-min_
print(range_)
To speed up the computation you can do larger blocks at a time. If you want to do a run with a billion numbers, you can calculate a million at a time and run the loop a thousand times. Modified code and time results below
import numpy as np
import time
max_=0
min_=0
start=time.time()
for i in range(1000):
new_array=np.random.normal(0,1,1000000)
new_max=np.max(new_array)
new_min=np.min(new_array)
if new_max>max_:
max_=new_max
if new_min<min_:
min_=new_min
range_=max_-min_
print('Range ', range_)
end = time.time()
Time=end - start
print('Time ',Time)
Range 12.421138327443614
Time 36.7797749042511
Comparing the results of running one random number at a time vs. ten at a time to see if results are significantly different
(each one run three times)
One at a time:
new_numbers=[]
for i in range(10):
new_numbers.append(np.random.normal(0,1,1)[0])
print(new_numbers)
[-1.0145267697638918, -1.1291506481372602, 1.3622608858856742, 0.16024562390261188, 1.062550043104352, -0.4160329548439351, -0.05464203711515494, -0.7416629430695286, 0.35066071936940363, 0.06498345663995017]
[-1.5632632129838873, -1.0314300796946991, 0.5014408178125339, -0.37806631815396563, 0.45396918178048334, -0.6630479858064194, -0.47097483551189306, 0.40734077106402056, 1.1167819302886144, -0.6594075991871857]
[0.4448783416507262, 0.20160041940565818, -0.4781753245124433, -0.7130750653981222, -0.8035305391034386, -0.41543648761183466, 0.25166027175788847, -0.7051417978559822, 0.6017351178904993, -1.3719596304190458]
Ten at a time:
np.random.normal(0,1,10)
array([-1.79498658, 0.89073416, -0.25302627, -0.17237986, -0.38988131,
-0.93635678, 0.28824899, 0.52675642, 0.86195635, -0.89584341])
array([ 1.41602405, 1.33800937, 1.87837334, 0.2082182 , -0.25116545,
1.37953259, 0.34445565, -0.33647043, -0.24414261, -0.14505838])
array([ 0.43848371, -0.60967936, 1.2902231 , 0.44589728, -2.39725248,
-1.42715386, -1.0627627 , 1.15998483, 0.96427742, -2.01062938])

maybe just draw them from a np.random.RandomState:
import numpy as np
# random state
RS = np.random.RandomState(seed = 0)
# first 10 elments
print(RS.normal(0, 1, 10))
# another 20
print(RS.normal(0, 1, 20))
Its allays going to be the same random numbers to the according seed.
first_100_random_numbers = RS.normal(0, 1, 100)
100th_200th_random_numbers = RS.normal(0, 1, 100)
200th_400th_random_numbers = RS.normal(0, 1, 200)
Otherwise you could think about using a generator.

Related

Implementing probability over time

I have a set of 100 diseased people. According to data, the probability of them getting themselves reported/tested is 0.03. Let us assume that the count of the diseased remains a constant over 10 days. How do I implement the logic so that at the end of 10 days, 3-4 diseased people get reported? Assuming that the resolution of time chosen is 1-day. A small example of the current implementation is given below:
import numpy as np
Population = 1000 # Total Population
# Store Infection Status
Infection_status = np.zeros(Population)
# Total Infections
Infection_status[:100] = 1
# Store Test Status
Test_status = np.zeros(Population)
for t in range(10):
# generate N random numbers
temp = np.random.uniform(0,1,Population)
# Choose 1 in 30
Test_status[(temp<0.03) & (Infection_status==1)] = 1

Is there a way to define the size of a chunk in pandas as a function of available memory?

I am aware that I can load a file containing data in chunks:
import pandas
for chunk in pandas.read_csv("path_to_my_csv.csv", chunksize=1e9):
# Process
where the value of chunksize corresponds to the number of rows each "chunk" contains. What I want to be able to do is something like:
import pandas
for chunk in pandas.read_csv("path_to_my_csv.csv", chunkmem="200GB"):
# Process
The reason I want to do this is to be able to process data on different machines (with different amounts of available RAM), and to parameterise my chunking in an automated way using psutil.virtual_memory or similar.
One way of doing this would be to calculate the memory footprint of a single row (from the datatypes of each column), and use that to parameterise the value of chunksize, but I'd ideally like to be able to do this with datasets of different structures.
Edit (In response to Bill Huang):
The way I would do this, given that there is no direct implementation in the Pandas API, is first to estimate the memory footprint of the data frame:
import pandas
numberOfRows = int(1e10) # Known a-priori
firstRecord = pandas.read_csv("big_data.csv", chunksize=2).get_chunk()
firstTwoRecords = pandas.read_csv("big_data.csv", chunksize=3).get_chunk()
rowFootprint = (firstTwoRecords.memory_usage().sum() -
firstRecord.memory_usage().sum())
estimatedFootprint = numberOfRows * rowFootprint
print(estimatedFootprint)
Then to divide available memory (from psutil.virtual_memory) by this estimate to get a chunk size. This estimator only requires reading the first two rows of the file.
Short answer: No. There is just no such parameter as the documentation shows.
Useful answer: You can estimate the suitable chunksize. For instance, guessing the average size of rows by reading the first 30 lines or so. But the odds of failure is really data-dependent.
import psutil
from pathlib import Path
# https://www.kaggle.com/tmdb/tmdb-movie-metadata?select=tmdb_5000_movies.csv
file_path = Path("/mnt/ramdisk/tmdb_5000_movies.csv")
fill_rate = 0.1
n_rows = 10
def estimate_bpl(file_path, n_rows=10):
"""Return estimates of bytes per line using the first n lines"""
with open(file_path) as f:
length = 0
for i, line in enumerate(f):
if i == n_rows:
break
length += len(line.encode('utf8'))
return length / n_rows
avail_mem = psutil.virtual_memory().available
bpl = estimate_bpl(file_path, n_rows)
chunksize = int(avail_mem * fill_rate / bpl)
print(f"avail_mem={avail_mem}, fill rate={fill_rate}, bytes per line={bpl}, chunksize={chunksize}")
avail_mem=11166822400, fill rate=0.1, bytes per line=1409.4, chunksize=792310

Generate High, Medium, Low categories from a skewed distribution

I have been working on a Churn Prediction use case in Python using XGBoost. The data trained on various parameters like Age, Tenure, Last 6 months income etc gives us the prediction if an employee is likely to leave based on its employee ID.
Additionally, if the user wants to the see why this ML system categorised the employee as such, the user can see the features that contributed to this, which are extracted form the model via eli5 library.
So to make this more explainable to the users, we had created some ranges for each feature:
Tenure (in days)
[0-100] = High Risk
[101-300] = Medium Risk
[301-800] = Low Risk
To define these ranges we've analysed the distributions of each feature and manually defined the ranges for our use in the system. We saw the impact of each feature on the target variable IsTerminated in training data. Following is an example of Tenure distribution.
Here the green bar represents the employees who are terminated or left and pink represents those who didn't.
So the question is that, as time passes and new data would be added to the model the such features' risk ranges would change. In this case of Tenure, if an employee has tenure of 780 days, after a month his tenure feature would show 810. Obviously, we keep the upper end on "Low Risk" as open ended. But real problem is, how can we define the internal boundaries / ranges programtically ?
EDIT: Thanks for the clarification. I have changed the answer.
It is important to realize that you are trying to project a selection in multi-dimensional space into a 1D space. Not in every case you will be able to see a clear separation like the one you got. There are also various possibilities to do that, here I made a simple example that could help your client to interpret the model, but does not represent the full complexity of the model, of course.
You did not provide any sample data, so I will generate some from the breast cancer dataset.
First let's import what we need:
from sklearn import datasets
from xgboost import XGBClassifier
import pandas as pd
import numpy as np
And now import the dataset and train a very simple XGBoost Model
cancer = datasets.load_breast_cancer()
X = cancer.data
y = cancer.target
xgb_model = XGBClassifier(n_estimators=5,
objective="binary:logistic",
random_state=42)
xgb_model.fit(X, y)
y_prob = pd.DataFrame(xgb_model.predict_proba(X))[0]
There are multiple ways to solve this.
One approach is to bin in the probability given by the model. So you will decide which probabilities you consider to be "High Risk", "Medium Risk" and "Low Risk" and the intervals on data can be classified. In this example I considered low to be 0 <= p <= 0.5, medium for 0.5 < p <= 0.8 and high for 0.8 < p <= 1.
First you have to calculate the probability for each prediction. I would suggest to maybe use the test set for that, to avoid bias from a possible model overfitting.
y_prob = pd.DataFrame(xgb_model.predict_proba(X))[0]
df = pd.DataFrame(X, columns=cancer.feature_names)
# Stores the probability of a malignant cancer
df['probability'] = y_prob
Then you have to bin your data and calculate average probabilities for each of those bins. I would suggest to bin your data using np.histogram_bin_edges automatic calculation:
def calculate_mean_prob(feat):
"""Calculates mean probability for a feature value, binning it."""
# Bins from the automatic rules from numpy, check docs for details
bins = np.histogram_bin_edges(df[feat], bins='auto')
binned_values = pd.cut(df[feat], bins)
return df['probability'].groupby(binned_values).mean()
Now you can classify each bin following what you would consider to be a low/medium/high probability:
def classify_probability(prob, medium=0.5, high=0.8, fillna_method= 'ffill'):
"""Classify the output of each bin into a risk group,
according to the probability.
Following the follow rules:
0 <= p <= medium: Low risk
medium < p <= high: Medium risk
high < p <= 1: High Risk
If a bin has no entries, it will be filled using fillna with the method
specified in fillna_method
"""
risk = pd.cut(prob, [0., medium, high, 1.0], include_lowest=True,
labels=['Low Risk', 'Medium Risk', 'High Risk'])
risk.fillna(method=fillna_method, inplace=True)
return risk
This will return you the risk for each bin that you divided your data. Since you will probably have multiple bins that have consecutive values, you might want to merge the consecutive pd.Interval bins. The code for that is shown below:
def sum_interval(i1, i2):
if i2 is None:
return None
if i1.right == i2.left:
return pd.Interval(i1.left, i2.right)
return None
def sum_intervals(args):
"""Given a list of pd.Intervals,
returns a list summing consecutive intervals."""
result = list()
current_interval = args[0]
for next_interval in list(args[1:]) + [None]:
# Try to sum the current interval and nex interval
# The None in necessary for the last interval
sum_int = sum_interval(current_interval, next_interval)
if sum_int is not None:
# Update the current_interval in case if it is
# possible to sum
current_interval = sum_int
else:
# Otherwise tries to start a new interval
result.append(current_interval)
current_interval = next_interval
if len(result) == 1:
return result[0]
return result
def combine_bins(df):
# Group them by label
grouped = df.groupby(df).apply(lambda x: sorted(list(x.index)))
# Sum each category in intervals, if consecutive
merged_intervals = grouped.apply(sum_intervals)
return merged_intervals
Now you can combine all the functions to calculate the bins for each feature:
def generate_risk_class(feature, medium=0.5, high=0.8):
mean_prob = calculate_mean_prob(feature)
classification = classify_probability(mean_prob, medium=medium, high=high)
merged_bins = combine_bins(classification)
return merged_bins
For example, generate_risk_class('worst radius') results in:
Low Risk (7.93, 17.3]
Medium Risk (17.3, 18.639]
High Risk (18.639, 36.04]
But in case you get features which are not so good discriminators (or that do not separate the high/low risk linearly), you will have more complicated regions. For example generate_risk_class('mean symmetry') results in:
Low Risk [(0.114, 0.209], (0.241, 0.249], (0.272, 0.288]]
Medium Risk [(0.209, 0.225], (0.233, 0.241], (0.249, 0.264]]
High Risk [(0.225, 0.233], (0.264, 0.272], (0.288, 0.304]]

How to process the data returned from a function (Python 3.7)

Background:
My question should be relatively easy, however I am not able to figure it out.
I have written a function regarding queueing theory and it will be used for ambulance service planning. For example, how many calls for service can I expect in a given time frame.
The function takes two parameters; a starting value of the number of ambulances in my system starting at 0 and ending at 100 ambulances. This will show the probability of zero calls for service, one call for service, three calls for service….up to 100 calls for service. Second parameter is an arrival rate number which is the past historical arrival rate in my system.
The function runs and prints out the result to my screen. I have checked the math and it appears to be correct.
This is Python 3.7 with the Anaconda distribution.
My question is this:
I would like to process this data even further but I don’t know how to capture it and do more math. For example, I would like to take this list and accumulate the probability values. With an arrival rate of five, there is a cumulative probability of 61.56% of at least five calls for service, etc.
A second example of how I would like to process this data is to format it as percentages and write out a text file
A third example would be to process the cumulative probabilities and exclude any values higher than the 99% cumulative value (because these vanish into extremely small numbers).
A fourth example would be to create a bar chart showing the probability of n calls for service.
These are some of the things I want to do with the queueing theory calculations. And there are a lot more. I am planning on writing a larger application. But I am stuck at this point. The function writes an output into my Python 3.7 console. How do I “capture” that output as an object or something and perform other processing on the data?
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import math
import csv
def probability_x(start_value = 0, arrival_rate = 0):
probability_arrivals = []
while start_value <= 100:
probability_arrivals = [start_value, math.pow(arrival_rate, start_value) * math.pow(math.e, -arrival_rate) / math.factorial(start_value)]
print(probability_arrivals)
start_value = start_value + 1
return probability_arrivals
#probability_x(arrival_rate = 5, x = 5)
#The code written above prints to the console, but my goal is to take the returned values and make other calculations.
#How do I 'capture' this data for further processing is where I need help (for example, bar plots, cumulative frequency, etc )
#failure. TypeError: writerows() argument must be iterable.
with open('ExpectedProbability.csv', 'w') as writeFile:
writer = csv.writer(writeFile)
for value in probability_x(arrival_rate = 5):
writer.writerows(value)
writeFile.close()
#Failure. Why does it return 2. Yes there are two columns but I was expecting 101 as the length because that is the end of my loop.
print(len(probability_x(arrival_rate = 5)))
The problem is, when you write
probability_arrivals = [start_value, math.pow(arrival_rate, start_value) * math.pow(math.e, -arrival_rate) / math.factorial(start_value)]
You're overwriting the previous contents of probability_arrivals. Everything that it held previously is lost.
Instead of using = to reassign probability_arrivals, you want to append another entry to the list:
probability_arrivals.append([start_value, math.pow(arrival_rate, start_value) * math.pow(math.e, -arrival_rate) / math.factorial(start_value)])
I'll also note, your while loop can be improved. You're basically just looping over start_value until it reaches a certain value. A for loop would be more appropriate here:
for s in range(start_value, 101): # The end value is exclusive, so it's 101 not 100
probability_arrivals = [s, math.pow(arrival_rate, s) * math.pow(math.e, -arrival_rate) / math.factorial(s)]
print(probability_arrivals)
Now you don't need to manually worry about incrementing the counter.

Why is my notebook crashing when I run this for loop and what is the fix?

I have taken code in relation to the Kalman Filter and am attempting to iterate through each column of data. What I would like to have happen is:
The column data is fed into the filter
The filtered column data (xhat) is placed into another DataFrame (filtered)
The filtered column data (xhat) is used to produce a visual.
I have created a for loop to iterate through the column data, but when I run the cell, I crash the notebook. When it doesn't crash, I get this warning:
C:\Users\perso\Anaconda3\envs\learn-env\lib\site-packages\ipykernel_launcher.py:45: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`).
Thanks in advance for any help. I hope this question is detailed enough. I bombed on the last one.
'''A Python implementation of the example given in pages 11-15 of "An
Introduction to the Kalman Filter" by Greg Welch and Gary Bishop,
University of North Carolina at Chapel Hill, Department of Computer
Science, TR 95-041,
https://www.cs.unc.edu/~welch/media/pdf/kalman_intro.pdf'''
# by Andrew D. Straw
import numpy as np
import matplotlib.pyplot as plt
# dataframe created to hold filtered data
filtered = pd.DataFrame()
# intial parameters
for column in data:
n_iter = len(data.index) #number of iterations equal to sample numbers
sz = (n_iter,) # size of array
z = data[column] # observations
Q = 1e-5 # process variance
# allocate space for arrays
xhat=np.zeros(sz) # a posteri estimate of x
P=np.zeros(sz) # a posteri error estimate
xhatminus=np.zeros(sz) # a priori estimate of x
Pminus=np.zeros(sz) # a priori error estimate
K=np.zeros(sz) # gain or blending factor
R = 1.0**2 # estimate of measurement variance, change to see effect
# intial guesses
xhat[0] = z[0]
P[0] = 1.0
for k in range(1,n_iter):
# time update
xhatminus[k] = xhat[k-1]
Pminus[k] = P[k-1]+Q
# measurement update
K[k] = Pminus[k]/( Pminus[k]+R )
xhat[k] = xhatminus[k]+K[k]*(z[k]-xhatminus[k])
P[k] = (1-K[k])*Pminus[k]
# add new data to created dataframe
filtered.assign(a = [xhat])
#create visualization of noise reduction
plt.rcParams['figure.figsize'] = (10, 8)
plt.figure()
plt.plot(z,'k+',label='noisy measurements')
plt.plot(xhat,'b-',label='a posteri estimate')
plt.legend()
plt.title('Estimate vs. iteration step', fontweight='bold')
plt.xlabel('column data')
plt.ylabel('Measurement')
This seems like a pretty straightforward error. The warning indicates that you have attempted to plot more figures than the current limit before a warning is created (a parameter you can change but which by default is set to 20). This is because in each iteration of your for loop, you create a new figure. Depending on the size of n_iter, you are opening potentially hundreds or thousands of figures. Each of these figures takes resources to generate and show, so you are creating a very large resource load on your system. Either it is processing very slowly due or is crashing altogether. In any case, the solution is to plot fewer figures.
I don't know exactly what you're plotting in your loop but it seems like each iteration of your loop corresponds to one time step and at each time step you'd like to plot the estimated and actual values. In this case, you need to define a figure and figure options once, outside of the loop, rather than at each iteration. But a better way to do this is probably to generate all of the data you want to plot ahead of time and store it in an easy-to-plot datatype like lists, then plot it once at the end.

Resources