I have been working on a churn prediction use case in Python using XGBoost. The model is trained on features such as Age, Tenure, and income over the last 6 months, and predicts whether a given employee (looked up by employee ID) is likely to leave.
Additionally, if the user wants to see why the ML system categorised an employee this way, they can view the features that contributed to the prediction, which are extracted from the model via the eli5 library.
So to make this more explainable to the users, we created some ranges for each feature:
Tenure (in days)
[0-100] = High Risk
[101-300] = Medium Risk
[301-800] = Low Risk
To define these ranges we analysed the distribution of each feature and manually set the boundaries for our use in the system. We looked at the impact of each feature on the target variable IsTerminated in the training data. The following is an example of the Tenure distribution.
Here the green bar represents the employees who were terminated or left, and pink represents those who stayed.
So the question is: as time passes and new data is added to the model, these features' risk ranges will change. In the case of Tenure, if an employee has a tenure of 780 days, a month later the tenure feature will show 810. Obviously, we keep the upper end of "Low Risk" open-ended. But the real problem is: how can we define the internal boundaries/ranges programmatically?
EDIT: Thanks for the clarification. I have changed the answer.
It is important to realize that you are trying to project a selection in a multi-dimensional space onto a 1D space. You will not always see a clear separation like the one you got. There are also various ways to do this; here I made a simple example that could help your client interpret the model, but it does not represent the full complexity of the model, of course.
You did not provide any sample data, so I will generate some from the breast cancer dataset.
First let's import what we need:
from sklearn import datasets
from xgboost import XGBClassifier
import pandas as pd
import numpy as np
And now import the dataset and train a very simple XGBoost model:
cancer = datasets.load_breast_cancer()
X = cancer.data
y = cancer.target
xgb_model = XGBClassifier(n_estimators=5,
                          objective="binary:logistic",
                          random_state=42)
xgb_model.fit(X, y)
There are multiple ways to solve this.
One approach is to bin the probability given by the model. You decide which probabilities you consider to be "High Risk", "Medium Risk" and "Low Risk", and the intervals in the data can be classified accordingly. In this example I considered low to be 0 <= p <= 0.5, medium to be 0.5 < p <= 0.8 and high to be 0.8 < p <= 1.
First you have to calculate the probability for each prediction. I would suggest using a test set for this, to avoid bias from possible model overfitting.
y_prob = pd.DataFrame(xgb_model.predict_proba(X))[0]
df = pd.DataFrame(X, columns=cancer.feature_names)
# Stores the probability of a malignant cancer
df['probability'] = y_prob
Then you have to bin your data and calculate the average probability for each of those bins. I would suggest binning your data using the automatic calculation in np.histogram_bin_edges:
def calculate_mean_prob(feat):
    """Calculates the mean probability for a feature value, binning it."""
    # Bins from numpy's automatic rules, check the docs for details
    bins = np.histogram_bin_edges(df[feat], bins='auto')
    binned_values = pd.cut(df[feat], bins)
    return df['probability'].groupby(binned_values).mean()
Now you can classify each bin following what you would consider to be a low/medium/high probability:
def classify_probability(prob, medium=0.5, high=0.8, fillna_method='ffill'):
    """Classify the output of each bin into a risk group,
    according to the probability, following these rules:
        0 <= p <= medium: Low Risk
        medium < p <= high: Medium Risk
        high < p <= 1: High Risk
    If a bin has no entries, it is filled using fillna with the
    method specified in fillna_method.
    """
    risk = pd.cut(prob, [0., medium, high, 1.0], include_lowest=True,
                  labels=['Low Risk', 'Medium Risk', 'High Risk'])
    risk.fillna(method=fillna_method, inplace=True)
    return risk
This returns the risk for each of the bins into which you divided your data. Since you will probably end up with multiple consecutive bins carrying the same label, you might want to merge the consecutive pd.Interval bins. The code for that is shown below:
def sum_interval(i1, i2):
    if i2 is None:
        return None
    if i1.right == i2.left:
        return pd.Interval(i1.left, i2.right)
    return None
def sum_intervals(args):
    """Given a list of pd.Intervals,
    returns a list merging consecutive intervals."""
    result = list()
    current_interval = args[0]
    for next_interval in list(args[1:]) + [None]:
        # Try to merge the current interval and the next interval
        # (the None is necessary for the last interval)
        sum_int = sum_interval(current_interval, next_interval)
        if sum_int is not None:
            # Update current_interval when merging is possible
            current_interval = sum_int
        else:
            # Otherwise start a new interval
            result.append(current_interval)
            current_interval = next_interval
    if len(result) == 1:
        return result[0]
    return result
def combine_bins(df):
    # Group the bins by label
    grouped = df.groupby(df).apply(lambda x: sorted(list(x.index)))
    # Merge the intervals of each category, if consecutive
    merged_intervals = grouped.apply(sum_intervals)
    return merged_intervals
Now you can combine all the functions to calculate the bins for each feature:
def generate_risk_class(feature, medium=0.5, high=0.8):
    mean_prob = calculate_mean_prob(feature)
    classification = classify_probability(mean_prob, medium=medium, high=high)
    merged_bins = combine_bins(classification)
    return merged_bins
For example, generate_risk_class('worst radius') results in:
Low Risk (7.93, 17.3]
Medium Risk (17.3, 18.639]
High Risk (18.639, 36.04]
But for features which are not such good discriminators (or which do not separate high/low risk linearly), you will get more complicated regions. For example, generate_risk_class('mean symmetry') results in:
Low Risk [(0.114, 0.209], (0.241, 0.249], (0.272, 0.288]]
Medium Risk [(0.209, 0.225], (0.233, 0.241], (0.249, 0.264]]
High Risk [(0.225, 0.233], (0.264, 0.272], (0.288, 0.304]]
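Applied back to the churn question, the same pipeline would run on the employee data instead. A minimal usage sketch, assuming df holds the employee features plus a 'probability' column from the fitted churn model ('Tenure' being the feature column name):
# Hypothetical: derive the tenure risk boundaries from the current model
tenure_risk = generate_risk_class('Tenure')
print(tenure_risk)
Because the boundaries come from the model's current probabilities, rerunning this after each retraining updates the ranges automatically as the data drifts.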
I'm working with Keras and trying to create a learning rate scheduler that schedules based on the number of batches processed, instead of the number of epochs. To do this, I've inserted the scheduling code into the get_updates method of my Optimizer. For the most part, I've tried to use regular Python variables for values that remain constant during a given training run, and computational-graph nodes only for parameters that actually vary.
My two questions are:
Does the code below look like it will behave properly as a learning rate scheduler, if placed within the get_updates method of a Keras Optimizer?
How could one embed this code in a class similar to LearningRateScheduler, but which schedules based on the number of batches rather than the number of epochs?
# Copy the graph node that stores the original value of the learning rate
lr = self.lr

# Get the current number of processed batches from the graph node
# and convert it to a numeric value for use in K.pow()
curr_batch = float(K.get_value(self.iterations))

# Check whether the learning rate schedule is to be used
if self.initial_lr_decay > 0:
    # this decay mimics exponential decay from
    # tensorflow/python/keras/optimizer_v2/exponential_decay

    # Create a graph node containing the lr decay factor
    # Note: self.lr_decay_steps is a number, not a node
    #       self.lr_decay is a node, not a number
    decay_factor = K.pow(self.lr_decay, (curr_batch / self.lr_decay_steps))

    # Reassign lr to the graph node formed by the product of the node
    # containing the decay factor and the node containing the original
    # learning rate
    lr = lr * decay_factor

# Multiply two numbers to calculate the number of batches processed
# in the warmup period
num_warmup_batches = self.steps_per_epoch_num * self.warmup_epochs

# Compare numbers to determine whether we are in the warmup period
if (self.warmup_epochs > 0) and (curr_batch < num_warmup_batches):
    # Create a node with the warmed-up learning rate by multiplying a
    # number by a node and then dividing by a number
    # (dividing by num_warmup_batches rather than curr_batch, so the
    # rate actually ramps up during warmup)
    lr = (self.initial_lr *
          K.cast(self.iterations, K.floatx()) / num_warmup_batches)
Easier than messing with the Keras source code (it's possible, but it's complex and fragile), you could use a callback.
import keras
from keras.callbacks import LambdaCallback

total_batches = 0

def what_to_do_when_batch_ends(batch, logs):
    global total_batches  # needed because we reassign this module-level counter
    total_batches += 1    # or use the "batch" variable,
                          # which is the batch index of the last finished batch

    # change the learning rate at will
    if your_condition == True:
        keras.backend.set_value(model.optimizer.lr, newLrValueAsPythonFloat)
When training, use the callback:
lrUpdater = LambdaCallback(on_batch_end = what_to_do_when_batch_ends)
model.fit(........, callbacks = [lrUpdater, ...other callbacks...])
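To also address the second question (a reusable class similar to LearningRateScheduler, but scheduling per batch), here is a minimal sketch built on the standard keras.callbacks.Callback API; the class name and the example schedule are illustrative, not an existing Keras class:

import keras.backend as K
from keras.callbacks import Callback

class BatchLearningRateScheduler(Callback):
    """Like LearningRateScheduler, but the schedule receives the total
    number of batches processed so far instead of the epoch index."""

    def __init__(self, schedule):
        super(BatchLearningRateScheduler, self).__init__()
        self.schedule = schedule  # maps total batch count -> new lr (float)
        self.total_batches = 0

    def on_batch_begin(self, batch, logs=None):
        # Update the optimizer's lr variable before every batch
        K.set_value(self.model.optimizer.lr, self.schedule(self.total_batches))
        self.total_batches += 1

# Illustrative schedule: exponential decay per 1000 batches
scheduler = BatchLearningRateScheduler(lambda b: 0.001 * (0.96 ** (b / 1000.)))
# model.fit(..., callbacks=[scheduler])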
Is there a way to get a specific range of the array produced by numpy.random.normal() without computing all the random numbers first, i.e. computing only the values within the given range of indices?
Normal usage:
random_numbers = numpy.random.normal(0, 1, 1000)
What I want is to get a slice of random_numbers without computing the whole array first:
first_100_random_numbers = needs the results of the first 100 values
300th_400th_random_numbers = needs the results of the 300th - 400th values
If you generate the random numbers one at a time, you can just keep track of whether they increase the max or min values. You will still have to compute the values, but you won't run into memory issues, since you only have to store three numbers (max, min, and latest_random).
import numpy as np

max_ = 0
min_ = 0
for i in range(1000):
    new_number = np.random.normal(0, 1, 1)
    if new_number > max_:
        max_ = new_number
    if new_number < min_:
        min_ = new_number

range_ = max_ - min_
print(range_)
To speed up the computation you can do larger blocks at a time. If you want to do a run with a billion numbers, you can calculate a million at a time and run the loop a thousand times. Modified code and time results below
import numpy as np
import time

max_ = 0
min_ = 0
start = time.time()
for i in range(1000):
    new_array = np.random.normal(0, 1, 1000000)
    new_max = np.max(new_array)
    new_min = np.min(new_array)
    if new_max > max_:
        max_ = new_max
    if new_min < min_:
        min_ = new_min

range_ = max_ - min_
print('Range ', range_)

end = time.time()
Time = end - start
print('Time ', Time)
Range 12.421138327443614
Time 36.7797749042511
Comparing the results of running one random number at a time vs. ten at a time, to see if the results are significantly different (each one run three times).
One at a time:
new_numbers = []
for i in range(10):
    new_numbers.append(np.random.normal(0, 1, 1)[0])
print(new_numbers)
[-1.0145267697638918, -1.1291506481372602, 1.3622608858856742, 0.16024562390261188, 1.062550043104352, -0.4160329548439351, -0.05464203711515494, -0.7416629430695286, 0.35066071936940363, 0.06498345663995017]
[-1.5632632129838873, -1.0314300796946991, 0.5014408178125339, -0.37806631815396563, 0.45396918178048334, -0.6630479858064194, -0.47097483551189306, 0.40734077106402056, 1.1167819302886144, -0.6594075991871857]
[0.4448783416507262, 0.20160041940565818, -0.4781753245124433, -0.7130750653981222, -0.8035305391034386, -0.41543648761183466, 0.25166027175788847, -0.7051417978559822, 0.6017351178904993, -1.3719596304190458]
Ten at a time:
np.random.normal(0,1,10)
array([-1.79498658, 0.89073416, -0.25302627, -0.17237986, -0.38988131,
-0.93635678, 0.28824899, 0.52675642, 0.86195635, -0.89584341])
array([ 1.41602405, 1.33800937, 1.87837334, 0.2082182 , -0.25116545,
1.37953259, 0.34445565, -0.33647043, -0.24414261, -0.14505838])
array([ 0.43848371, -0.60967936, 1.2902231 , 0.44589728, -2.39725248,
-1.42715386, -1.0627627 , 1.15998483, 0.96427742, -2.01062938])
Maybe just draw them from a np.random.RandomState:
import numpy as np
# random state
RS = np.random.RandomState(seed = 0)
# first 10 elements
print(RS.normal(0, 1, 10))
# another 20
print(RS.normal(0, 1, 20))
It's always going to be the same random numbers for a given seed.
first_100_random_numbers = RS.normal(0, 1, 100)
random_numbers_100_to_200 = RS.normal(0, 1, 100)
random_numbers_200_to_400 = RS.normal(0, 1, 200)
Otherwise you could think about using a generator.
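A minimal sketch of that generator idea (the normal_chunks wrapper below is an assumption for illustration, not a NumPy API): a seeded RandomState handed out chunk by chunk, so the n-th chunk is always the same numbers without ever holding the full array:

import numpy as np

def normal_chunks(seed=0, chunk_size=100):
    """Yield successive, reproducible chunks of N(0, 1) samples."""
    rs = np.random.RandomState(seed=seed)
    while True:
        yield rs.normal(0, 1, chunk_size)

gen = normal_chunks(seed=0, chunk_size=100)
first_100 = next(gen)   # values 0-99
second_100 = next(gen)  # values 100-199

Note that to reach, say, values 300-400 you still have to draw (and discard) everything before them; the generator avoids storing the whole array, but it does not skip the computation.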
I have hit a mental roadblock here, or perhaps it's too much coding and I can't think straight.
I have a simple program to model total factory output, combining existing factories with a known production profile (which decays each year) and new factories added over the next few years.
The goal is to get a total production series which incorporates both. Hopefully the comments in the code better indicate what my issues are.
import pandas as pd

# constants
inv_yr = 5       # years of making new investments
asset_life = 30  # years of production for a new factory

# create blank time series
new_factory_adds = pd.Series(0, index=range(1, asset_life + inv_yr + 1), name='new factories')
production = pd.Series(0, index=range(1, asset_life + 1))

# Fill series of new factory adds
for i in range(1, inv_yr + 1):
    new_factory_adds[i] = 2  # for the first 5 years, 2 factories are added each year

# Fill series of a single new production line
for j in range(1, asset_life + 1):
    if j == 1:
        production[j] = 100
    else:
        production[j] = production[j - 1] * 0.95

# to calculate total production for each year...
# create blank time series
tot_prod = pd.Series(0, index=range(1, asset_life + inv_yr + 1), name='Tot Prod')

# data frame to combine them
df = pd.concat([new_factory_adds, tot_prod], axis=1)
print(df)

# fill the Tot Prod series - this is where I am having difficulties
for k in range(1, asset_life + inv_yr + 1):
    if k == 1:
        tot_prod[k] = new_factory_adds[k] * production[k]
    elif k <= inv_yr:
        tot_prod[k] = new_factory_adds[k] * production[k] + production[k - 1]
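For reference, the intended total for year k is a sum over build years b: (factories added in year b) times (per-factory output at age k - b + 1). A minimal sketch of that loop, assuming every factory follows the same decay profile from its build year:

# Fill tot_prod by summing the contribution of each vintage of factories
for k in range(1, asset_life + inv_yr + 1):
    total = 0
    for b in range(1, min(k, inv_yr) + 1):  # build years producing in year k
        age = k - b + 1                     # production year for that vintage
        if age <= asset_life:               # a factory stops after asset_life years
            total += new_factory_adds[b] * production[age]
    tot_prod[k] = total

This is a discrete convolution of new_factory_adds with production, so np.convolve(new_factory_adds, production)[:asset_life + inv_yr] yields the same series.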
I have a test list of 10,000 IDs and this is what I have to do:
For every test ID, calculate its rank by comparing it with other IDs, i.e. people from the same company.
Check if the rank for this test ID is above 'normal' by a) calculating the ranks (same as step 1) of 1000 randomly selected IDs and b) comparing these 1000 ranks with the rank of the test ID.
Do this (steps 1 and 2) for 10,000 test IDs, with data from 10 different months.
To store the master data of 14,000 IDs and the observations for 10 months I am using SQLite, as it makes querying and ranking easier and faster.
To reduce the run time, I am using 'multiprocessing' to parallelize the calculations over months, i.e. ranks for different months are calculated on different cores. This works well for a smaller number of test IDs (<=2000) or fewer random ranks (<=200), but if I calculate ranks for all 10 months in parallel using 1000 random ranks per ID, the script freezes after a few hours. No error is given. I believe SQLite is the culprit and need your help to figure out the issue.
Here is my code:
nproc = 10     ## Number of cores
randNum = 1000 ## Number of random ranks for each ID

def main():
    '''
    This will go through every specified column one by one, and for each entry
    a rank is computed and compared with the ranks of 1000 randomly selected
    entries from the same column
    '''
    ## Read master file with 14000 rows X 20 cols, each row pertains to an ID,
    ## first 9 columns have info related to the ID and the last 10 have observed
    ## values from 10 different months
    resList = List with 14000 entries Eg. [(123,"ABC",.....),(234,"DEF",........)....14000n]

    ## Read test file, for which ranks are to be calculated. Contains 10,000 IDs and their names
    global testIDList ## for p-value calculation
    testIDList = List with 10,000 entries Eg. [(123,"ABC"),(234,"DEF")..10,000n]

    ## Create identifier SET - used for random selection of IDs
    global idSET ## Used in rankCalcTest
    idSET = SET OF ALL IDs FROM MASTER FILE

    global trackTableName,coordsDB,chrLimit ## Globals for all other functions

    ## Specify column numbers in the master file that hold the values for each ID from different months
    trackList = [10,11,12,13,14,15,16,17,18,19,20] ## Columns in file with 14000 rows each

    ### Parallel
    allTrackPvals = PPResults(rankCalcTest,trackList)
    DO SOME PROCESSING
    SCRIPT ENDS

def rankCalcTest(col):
    '''
    Calculates ranks for test IDs using the column/month specified by the 'main()' function
    '''
    DB = '%s_%s.db' % (coordsDB.split('.')[0],col) ## New DB for every column/month - because this function is parallelized, every core works on one column/month
    conn = sqlite3.connect(DB)
    trackPvals = [] ## Temporary list that will hold ranks for a single column/month
    tableCols = [col] ## Column with observed values from one month, used to generate column-specific ranks

    ## Make sqlite3 table for the current track
    trackTableName = 'track_%s' % (col) ## Here a table is created containing all IDs and observations from the specific column
    trackTableName = tableMaker(trackTableName,annoDict,resList,tableCols,conn) ## This module is not included in the example, as it works well - uses SQLite
    chrLimit = chrLimits(trackTableName,conn) ## This module is not included in the example, as it works well - uses SQLite

    for ent in testIDList: ## Total 10,000 entries
        ## Generate relative rank for the ID of interest
        mainID = ent[0] ## ID number
        mainRank = rankGenerator(mainID,trackTableName,chrLimit,conn) ## See below for function
        randomIDs = randomSelect(idSET,randNum)
        randomRanks = []
        for randID in randomIDs:
            randomRank = rankGenerator(randID,trackTableName,chrLimit,conn)
            randomRanks.append(randomRank)
        ### Some calculation
        probRR = DO SOME CALCULATION
        trackPvals.append(round(probRR,5))
    conn.close()
    return trackPvals

def rankGenerator(ID,trackTableName,chrLimit,conn):
    '''
    Generates a rank for each ID provided by the 'rankCalcTest' function
    '''
    print('\nRank is being calculated for ID:%s' % (ID))
    IDCoord = aDict[ID] ## Get required info to construct the read query
    company = IDCoord[0]

    intervalIDs = [] ## List to hold all the IDs in an interval
    rank = 0 ## Initialize
    cur = conn.cursor()
    print('ID class 0')
    cur.execute("SELECT ID,hours FROM %s WHERE chr = '%s' AND start between %s and %s ORDER BY hours desc" % (trackTableName,company))
    intIDs = cur.fetchall()
    intervalIDs.extend(intIDs) ## There is one more query in certain cases, removed for brevity

    Rank = SOME CALCULATION
    print('Relative Rank for %s: %s' % (ID,str(Rank)))
    return Rank

def PPResults(module,alist):
    npool = Pool(int(nproc))
    res = npool.map_async(module, alist)
    results = (res.get())
    return results
The script freezes in the 'rankGenerator' function:
Rank is being calculated for ID:1423187_at
Rank is being calculated for ID:1452528_a_at
Coordinates found for:1423187_at - 8,111940709,111952915
Coordinates found for:1452528_a_at - 19,43612500,43614912
ID class 0
As the run was performed in parallel, it is hard to say at which line the script is freezing, but it seems like the query in 'rankGenerator' is the freezing point. Is it related to locks in SQLite?
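For reference, if SQLite locking is suspected, two documented sqlite3/SQLite settings either reduce lock contention or make it fail loudly instead of stalling (a hedged suggestion, not a confirmed diagnosis):

import sqlite3

# Wait up to 60 s for a lock before raising OperationalError (the default is 5 s)
conn = sqlite3.connect(DB, timeout=60)
# WAL journal mode lets readers run concurrently with a single writer
conn.execute("PRAGMA journal_mode=WAL")

That said, each worker here writes to its own per-month database file, so cross-process locking should only arise if two workers ever touch the same file.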
Sorry for the large block of code. It is actually a very trimmed version that took me 3 hrs to prepare. I hope to get some help.
AK