Generate High, Medium, Low categories from a skewed distribution - python-3.x

I have been working on a Churn Prediction use case in Python using XGBoost. The model is trained on features such as Age, Tenure, and income over the last 6 months, and predicts whether an employee is likely to leave, given their employee ID.
Additionally, if the user wants to see why the ML system categorised an employee this way, they can inspect the features that contributed to the prediction, which are extracted from the model via the eli5 library.
So, to make this more explainable to the users, we created ranges for each feature:
Tenure (in days)
[0-100] = High Risk
[101-300] = Medium Risk
[301-800] = Low Risk
To define these ranges, we analysed the distribution of each feature and manually set the boundaries for our use in the system. We looked at the impact of each feature on the target variable IsTerminated in the training data. The following is an example of the Tenure distribution.
Here the green bars represent the employees who were terminated or left, and the pink bars represent those who didn't.
So the question is: as time passes and new data gets added to the model, these features' risk ranges will change. In the case of Tenure, an employee with a tenure of 780 days today will show 810 after a month. Obviously, we keep the upper end of "Low Risk" open-ended. But the real problem is: how can we define the internal boundaries/ranges programmatically?

EDIT: Thanks for the clarification. I have changed the answer.
It is important to realize that you are trying to project a selection in a multi-dimensional space onto a 1D space. You will not see a clear separation like the one you got in every case. There are also various ways to do this; here I made a simple example that could help your client interpret the model, but it does not represent the full complexity of the model, of course.
You did not provide any sample data, so I will generate some from the breast cancer dataset.
First let's import what we need:
from sklearn import datasets
from xgboost import XGBClassifier
import pandas as pd
import numpy as np
And now import the dataset and train a very simple XGBoost model:
cancer = datasets.load_breast_cancer()
X = cancer.data
y = cancer.target

xgb_model = XGBClassifier(n_estimators=5,
                          objective="binary:logistic",
                          random_state=42)
xgb_model.fit(X, y)
There are multiple ways to solve this.
One approach is to bin the probability given by the model. You decide which probabilities you consider to be "High Risk", "Medium Risk" and "Low Risk", and the intervals in the data can be classified accordingly. In this example I considered low to be 0 <= p <= 0.5, medium 0.5 < p <= 0.8, and high 0.8 < p <= 1.
First you have to calculate the probability of each prediction. I would suggest using a test set for that, to avoid bias from possible model overfitting.
y_prob = pd.DataFrame(xgb_model.predict_proba(X))[0]
df = pd.DataFrame(X, columns=cancer.feature_names)
# Stores the probability of a malignant cancer
df['probability'] = y_prob
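As a side note: if you want to score on held-out data instead, a minimal sketch could look like this (the split size and random_state are my own arbitrary choices, not part of the original code):

from sklearn.model_selection import train_test_split

# Hold out 30% of the data; fit on the rest and score the held-out part
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
xgb_model.fit(X_train, y_train)
y_prob = pd.DataFrame(xgb_model.predict_proba(X_test))[0]
df = pd.DataFrame(X_test, columns=cancer.feature_names)
df['probability'] = y_prob

The rest of the answer would then work on this df in exactly the same way.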
Then you have to bin your data and calculate the average probability for each bin. I would suggest binning your data using the automatic calculation in np.histogram_bin_edges:
def calculate_mean_prob(feat):
    """Calculates mean probability for a feature value, binning it."""
    # Bins from the automatic rules from numpy, check docs for details
    bins = np.histogram_bin_edges(df[feat], bins='auto')
    binned_values = pd.cut(df[feat], bins)
    return df['probability'].groupby(binned_values).mean()
Now you can classify each bin following what you would consider to be a low/medium/high probability:
def classify_probability(prob, medium=0.5, high=0.8, fillna_method='ffill'):
    """Classify the output of each bin into a risk group,
    according to the probability, following these rules:
        0 <= p <= medium: Low Risk
        medium < p <= high: Medium Risk
        high < p <= 1: High Risk
    If a bin has no entries, it will be filled using fillna with the
    method specified in fillna_method.
    """
    risk = pd.cut(prob, [0., medium, high, 1.0], include_lowest=True,
                  labels=['Low Risk', 'Medium Risk', 'High Risk'])
    risk.fillna(method=fillna_method, inplace=True)
    return risk
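As a quick sanity check on the thresholds (an illustrative call, not part of the pipeline):

print(classify_probability(pd.Series([0.1, 0.6, 0.9])))
# -> Low Risk, Medium Risk, High Risk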
This returns the risk for each of the bins you divided your data into. Since you will probably have multiple consecutive bins with the same label, you might want to merge consecutive pd.Interval bins. The code for that is shown below:
def sum_interval(i1, i2):
    if i2 is None:
        return None
    if i1.right == i2.left:
        return pd.Interval(i1.left, i2.right)
    return None
def sum_intervals(args):
    """Given a list of pd.Intervals,
    returns a list, summing consecutive intervals."""
    result = list()
    current_interval = args[0]
    for next_interval in list(args[1:]) + [None]:
        # Try to sum the current interval and the next interval.
        # The None is necessary for the last interval.
        sum_int = sum_interval(current_interval, next_interval)
        if sum_int is not None:
            # Update the current_interval when it is
            # possible to sum them
            current_interval = sum_int
        else:
            # Otherwise start a new interval
            result.append(current_interval)
            current_interval = next_interval
    if len(result) == 1:
        return result[0]
    return result
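To see what this does, here is a small illustration with hypothetical intervals:

ints = [pd.Interval(0.0, 1.0), pd.Interval(1.0, 2.0), pd.Interval(3.0, 4.0)]
print(sum_intervals(ints))
# -> [Interval(0.0, 2.0, closed='right'), Interval(3.0, 4.0, closed='right')]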
def combine_bins(df):
    # Group the bins by label
    grouped = df.groupby(df).apply(lambda x: sorted(list(x.index)))
    # Sum each category's intervals, if consecutive
    merged_intervals = grouped.apply(sum_intervals)
    return merged_intervals
Now you can combine all the functions to calculate the bins for each feature:
def generate_risk_class(feature, medium=0.5, high=0.8):
    mean_prob = calculate_mean_prob(feature)
    classification = classify_probability(mean_prob, medium=medium, high=high)
    merged_bins = combine_bins(classification)
    return merged_bins
For example, generate_risk_class('worst radius') results in:
Low Risk (7.93, 17.3]
Medium Risk (17.3, 18.639]
High Risk (18.639, 36.04]
But for features that are not such good discriminators (or that do not separate the high/low risk linearly), you will get more complicated regions. For example, generate_risk_class('mean symmetry') results in:
Low Risk [(0.114, 0.209], (0.241, 0.249], (0.272, 0.288]]
Medium Risk [(0.209, 0.225], (0.233, 0.241], (0.249, 0.264]]
High Risk [(0.225, 0.233], (0.264, 0.272], (0.288, 0.304]]
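If you want the merged ranges for every feature at once, a simple convenience loop (my addition, not in the original code; note it assumes every risk class actually occurs for each feature, so features missing a class might need a guard) would be:

risk_classes = {feat: generate_risk_class(feat)
                for feat in cancer.feature_names}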

Related

Contribution analysis versus lca.score

I am interested in which processes/activities contribute most to the Life Cycle Impact Assessment (LCIA) that I am conducting. For this, I run a contribution analysis (see code below). To crosscheck the results of my contribution analysis and to ensure that I get everything right, I wanted to compare the returned contributions with the impact assessment result (lca.score).
The documentation of ca.annotated_top_processes(lca) says: "Returns a list of tuples: (lca score, supply, activity)."
In my understanding, lca.score should be the same value as the sum of all the first values in the tuples that are returned by ca.annotated_top_processes(lca) (the printed values). However, this is not the case. What am I missing? Is there some sort of cut-off applied or did I misunderstand something?
import bw2analyzer as bwa
import bw2data

random_act = db_ei381.random()
lca = bw2data.LCA(
    {random_act: 1},
    ('ReCiPe Midpoint (H) V1.13', 'water depletion', 'WDP')
)
lca.lci()
lca.lcia()
print(lca.score)
# %% Contribution analysis
ca = bwa.ContributionAnalysis()
contributions = ca.annotated_top_processes(lca)
print(sum([i[0] for i in contributions]))
It is not well documented, but you can pass an argument limit that specifies the number of activities considered in the contribution analysis. The default value is, I think, 25, and the results are sorted so that the most important activities come first. If you write something like the following, you should see the result converge to the total score as the number of activities increases:
import matplotlib.pyplot as plt

cutoff = [25, 50, 100, 500, 1000, 1200]
scores = []
for n in cutoff:
    contributions = ca.annotated_top_processes(lca, limit=n)
    contr_sum = sum([i[0] for i in contributions])
    scores.append(contr_sum)
plt.plot(cutoff, scores)
plt.axhline(lca.score, ls='--', color='r');

Why is my notebook crashing when I run this for loop and what is the fix?

I have taken some code relating to the Kalman filter and am attempting to iterate through each column of my data. What I would like to happen is:
The column data is fed into the filter
The filtered column data (xhat) is placed into another DataFrame (filtered)
The filtered column data (xhat) is used to produce a visual.
I have created a for loop to iterate through the column data, but when I run the cell, the notebook crashes. When it doesn't crash, I get this warning:
C:\Users\perso\Anaconda3\envs\learn-env\lib\site-packages\ipykernel_launcher.py:45: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`).
Thanks in advance for any help. I hope this question is detailed enough. I bombed on the last one.
'''A Python implementation of the example given in pages 11-15 of "An
Introduction to the Kalman Filter" by Greg Welch and Gary Bishop,
University of North Carolina at Chapel Hill, Department of Computer
Science, TR 95-041,
https://www.cs.unc.edu/~welch/media/pdf/kalman_intro.pdf'''
# by Andrew D. Straw

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# dataframe created to hold filtered data
filtered = pd.DataFrame()

# initial parameters
for column in data:
    n_iter = len(data.index)  # number of iterations equal to sample numbers
    sz = (n_iter,)            # size of array
    z = data[column]          # observations
    Q = 1e-5                  # process variance

    # allocate space for arrays
    xhat = np.zeros(sz)       # a posteriori estimate of x
    P = np.zeros(sz)          # a posteriori error estimate
    xhatminus = np.zeros(sz)  # a priori estimate of x
    Pminus = np.zeros(sz)     # a priori error estimate
    K = np.zeros(sz)          # gain or blending factor
    R = 1.0**2                # estimate of measurement variance, change to see effect

    # initial guesses
    xhat[0] = z[0]
    P[0] = 1.0

    for k in range(1, n_iter):
        # time update
        xhatminus[k] = xhat[k-1]
        Pminus[k] = P[k-1] + Q

        # measurement update
        K[k] = Pminus[k] / (Pminus[k] + R)
        xhat[k] = xhatminus[k] + K[k]*(z[k] - xhatminus[k])
        P[k] = (1 - K[k])*Pminus[k]

    # add new data to created dataframe
    filtered.assign(a=[xhat])

    # create visualization of noise reduction
    plt.rcParams['figure.figsize'] = (10, 8)
    plt.figure()
    plt.plot(z, 'k+', label='noisy measurements')
    plt.plot(xhat, 'b-', label='a posteriori estimate')
    plt.legend()
    plt.title('Estimate vs. iteration step', fontweight='bold')
    plt.xlabel('column data')
    plt.ylabel('Measurement')
This seems like a pretty straightforward error. The warning indicates that you have opened more figures than the limit at which a warning is raised (a parameter you can change, but which by default is set to 20). This is because each iteration of your for loop creates a new figure, so you open one figure per column in data; with many columns, that is potentially hundreds of figures. Each of these figures takes resources to generate and show, creating a very large resource load on your system: either it processes very slowly, or it crashes altogether. In either case, the solution is to plot fewer figures, or to close each figure before opening the next one.
I don't know exactly what you're plotting in your loop, but it seems like each iteration corresponds to one column, and for each column you'd like to plot the estimated and actual values. In that case you could define the figure and its options once, outside of the loop, rather than at each iteration. An even better approach is usually to generate all of the data you want to plot ahead of time, store it in an easy-to-plot datatype such as lists, then plot it once at the end.
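For example, a minimal restructuring (a sketch, assuming data is your original DataFrame and z/xhat come from the filter code above) would save each figure to disk and close it immediately, so at most one figure is open at a time:

for column in data:
    # ... run the Kalman filter for this column to obtain z and xhat ...
    fig, ax = plt.subplots(figsize=(10, 8))
    ax.plot(z, 'k+', label='noisy measurements')
    ax.plot(xhat, 'b-', label='a posteriori estimate')
    ax.legend()
    fig.savefig('kalman_{}.png'.format(column))  # keep the visual on disk
    plt.close(fig)  # release the figure's memory before the next iteration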

efficient way of calculating Monte Carlo results for different impact assessment methods in Brightway

I am trying to do a comparative Monte Carlo calculation with brightway2 using different impact assessment methods. I thought about using the switch_method method to be more efficient, since the technosphere matrix is the same for a given iteration. However, I am getting an assertion error. Code to reproduce it could look like this:
import brightway2 as bw

bw.projects.set_current('ei35')  # project with ecoinvent 3.5
db = bw.Database("ei_35cutoff")

# select two different transport activities to compare
activity_name = 'transport, freight, lorry >32 metric ton, EURO4'
for activity in bw.Database("ei_35cutoff"):
    if activity['name'] == activity_name:
        truckE4 = bw.Database("ei_35cutoff").get(activity['code'])
        print(truckE4['name'])
        break

activity_name = 'transport, freight, lorry >32 metric ton, EURO6'
for activity in bw.Database("ei_35cutoff"):
    if activity['name'] == activity_name:
        truckE6 = bw.Database("ei_35cutoff").get(activity['code'])
        print(truckE6['name'])
        break

demands = [{truckE4: 1}, {truckE6: 1}]

# impact assessment method:
recipe_midpoint = [method for method in bw.methods.keys()
                   if method[0] == "ReCiPe Midpoint (H)"]

mc_mm = bw.MonteCarloLCA(demands[0], recipe_midpoint[0])
next(mc_mm)
If I try switch_method, I get the assertion error:
mc_mm.switch_method(recipe_midpoint[1])
assert mc_mm.method==recipe_midpoint[1]
mc_mm.redo_lcia()
next(mc_mm)
Am I doing something wrong here?
I usually store characterization factor matrices in a temporary dict and multiply these cfs with the LCI resulting from MonteCarloLCA directly.
import brightway2 as bw
import numpy as np
# Generate objects for analysis
bw.projects.set_current("my_mcs")
my_db = bw.Database('db')
my_act = my_db.random()
my_demand = {my_act:1}
my_methods = [bw.methods.random() for _ in range(2)]
I wrote this simple function to get the characterization factor matrices for the product system I will generate in the MonteCarloLCA. It uses a temporary "sacrificial LCA" object that will have the same A and B matrices as the MonteCarloLCA.
This may seem like a waste of time, but it is only done once, and will make MonteCarlo quicker and simpler.
def get_C_matrices(demand, list_of_methods):
    """Return a dict of {method tuple: cf_matrix} for a list of methods.
    Uses a "sacrificial LCA" with exactly the same demand as will be used
    in the MonteCarloLCA.
    """
    C_matrices = {}
    sacrificial_LCA = bw.LCA(demand)
    sacrificial_LCA.lci()
    for method in list_of_methods:
        sacrificial_LCA.switch_method(method)
        C_matrices[method] = sacrificial_LCA.characterization_matrix
    return C_matrices
Then:
# Create array that will store MC results.
# Shape is (number of methods, number of iterations)
my_iterations = 10
mc_scores = np.empty(shape=[len(my_methods), my_iterations])

# Instantiate MonteCarloLCA object
my_mc = bw.MonteCarloLCA(my_demand)

# Get characterization factor matrices
my_C_matrices = get_C_matrices(my_demand, my_methods)

# Generate results
for iteration in range(my_iterations):
    next(my_mc)  # updates my_mc.inventory with a new sample
    for i, m in enumerate(my_methods):
        mc_scores[i, iteration] = (my_C_matrices[m] * my_mc.inventory).sum()
All your results are in mc_scores: each row corresponds to a method, each column to an MC iteration.
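Summarizing from there is plain numpy; for example (just an illustrative post-processing step):

for i, m in enumerate(my_methods):
    print(m, mc_scores[i].mean(), mc_scores[i].std())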
Not very elegant, but try this:
iterations = 10
simulations = []
for _ in range(iterations):
    mc_mm = bw.MonteCarloLCA(demands[0], recipe_midpoint[0])
    next(mc_mm)
    mcresults = []
    for i in demands:
        print(i)
        for m in recipe_midpoint[0:3]:
            mc_mm.switch_method(m)
            print(mc_mm.method)
            mc_mm.redo_lcia(i)
            print(mc_mm.score)
            mcresults.append(mc_mm.score)
    simulations.append(mcresults)

CC_truckE4 = [i[1] for i in simulations]      # Climate Change, truck E4
CC_truckE6 = [i[1 + 3] for i in simulations]  # Climate Change, truck E6

from matplotlib import pyplot as plt
plt.plot(CC_truckE4, CC_truckE6, 'o')
If you then make a test and run the simulation twice for the same demand vector, by setting demands = [{truckE4: 1}, {truckE4: 1}], and plot the result, you should get a straight line. This means that you are doing dependent sampling, re-using the same technosphere matrix for each demand vector and for each LCIA method. I am not 100% sure of this, but I hope it answers your question.

How to feed multiple time series sets into LSTM model for prediction?

I am trying to take this dataset and predict news popularity levels over time.
The dataset is made up of 145 columns: column 1 is the ID linking to the actual news story in a separate file, and columns 2-145 are 144 20-minute time slices, where each cell in a row records the popularity level of the corresponding news story at that time.
I have already normalized the dataset "Facebook_Economy.csv" to range from 0 to 1. At the moment I can only feed a single time series set into my model (train on ~100 time slices and test on ~44 time slices). My aim is to take several rows of 144 time slices to train on, and to test on several other rows, e.g., train on the time series data for news stories 1-20 and test on news stories 21-30, etc.
This is how I am currently feeding data into my model:
from pandas import read_csv, DataFrame

def run(filename):
    series = read_csv(filename, header=0, index_col=0)
    repeats = 1
    results = DataFrame()
    timesteps = 1
    for i in range(len(series)):
        # experiment(repeats, series, timesteps) is defined elsewhere
        results['results'] = experiment(repeats, series.iloc[i].squeeze(), timesteps)
    print(results.describe())
For some insight into how the rest of my code looks, I have been following this tutorial from Jason Brownlee for guidance.
I'm not sure I understood the question well, but I think I have already done something similar. First, you have to stack all your features in an array. I found this link very helpful: https://machinelearningmastery.com/how-to-develop-lstm-models-for-multi-step-time-series-forecasting-of-household-power-consumption/
Here is the code I use (fen_pred = input size, n_output = output size):
dataset_train_features = np.hstack((dataset_train, new_features_train,
                                    ssa_feature1_train, ssa_feature2_train))
dataset_train_labels = dataset_train

features_set = list()
labels = list()

# X_train
for i in range(fen_pred, len_train):
    if len(dataset_train_features[i:i+n_output]) < n_output:
        break
    features_set.append(dataset_train_features[i-fen_pred:i])

# y_train
for i in range(fen_pred, len_train):
    if len(dataset_train_labels[i:i+n_output]) < n_output:
        break
    labels.append(dataset_train_labels[i:i+n_output])

X_train = np.array(features_set)
y_train = np.array(labels)
(Please note that I want to predict timesteps of one time series; that's why I don't predict multiple features.)
In your case, I would edit the array dataset_train_features: add every feature you need to train your model to this array, and then reproduce this technique to create your test set.
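For the popularity dataset in the question, a hedged sketch of the same idea (the file name, the ~100/~44 split and the story selection come from the question; the exact layout of your CSV may differ):

import numpy as np
from pandas import read_csv

df = read_csv('Facebook_Economy.csv', header=0, index_col=0)

window, horizon = 100, 44  # ~100 slices to train on, ~44 to predict
X_train, y_train = [], []
for story in df.index[:20]:  # e.g. news stories 1-20 for training
    series = df.loc[story].to_numpy()
    X_train.append(series[:window])
    y_train.append(series[window:window + horizon])

# LSTM input shape is (samples, timesteps, features); features == 1 here
X_train = np.array(X_train)[:, :, np.newaxis]
y_train = np.array(y_train)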

How to deal with temporal correlation/trend in mppm

Good day,
I have been working through Baddeley et al. 2015 to fit a point process model to several point patterns using mppm {spatstat}.
My point patterns are annual count data of large herbivores (i.e. point localities (x, y) of male/female animals * 5 years) in a protected area (owin). I have a number of spatial covariates e.g. distance to rivers (rivD) and vegetation productivity (NDVI).
Originally I fitted a model where the herbivore response was a function of rivD + NDVI, allowing the coefficients to vary by sex (see mppm1 in the reproducible example below). However, my annual point patterns are not independent between years, in that there is a temporally increasing trend (i.e. there are exponentially more animals in year 5 than in year 1).
So I added year as a random effect, thinking that if I allowed the intercept to change per year I could account for this (see mppm2).
Now I'm wondering if this is the right way to go about it? If I were fitting a GAMM with gamm {mgcv}, I would add a temporal correlation structure, e.g. correlation = corAR1(form=~year), but I don't think this is possible in mppm (see mppm3)?
I would really appreciate any ideas on how to deal with this temporal correlation structure in a replicated point pattern with mppm {spatstat}.
Thank you very much
Sandra
# R version 3.3.1 (64-bit)
library(spatstat) # spatstat version 1.45-2.008
#### Simulate point patterns
# multitype Neyman-Scott process (each cluster is a multitype process)
nclust2 = function(x0, y0, radius, n, types=factor(c("male", "female"))) {
  X = runifdisc(n, radius, centre=c(x0, y0))
  M = sample(types, n, replace=TRUE)
  marks(X) = M
  return(X)
}
year1 = rNeymanScott(5,0.1,nclust2, radius=0.1, n=5)
# plot(year1)
#-------------------
year2 = rNeymanScott(15,0.1,nclust2, radius=0.1, n=10)
# plot(year2)
#-------------------
year3 = rNeymanScott(20,0.1,nclust2, radius=0.1, n=10)
# plot(year3)
#-------------------
year4 = rNeymanScott(25,0.1,nclust2, radius=0.1, n=15)
# plot(year4)
#-------------------
year5 = rNeymanScott(30,0.1,nclust2, radius=0.1, n=15)
# plot(year5)
#### Simulate distance to rivers
line <- psp(runif(10), runif(10), runif(10), runif(10), window=owin())
# plot(line)
# plot(year1, add=TRUE)
#------------------------ UPDATE ------------------------#
#### Create hyperframe
#---> NDVI simulated with distmap to point patterns (not ideal but just to test)
hyp.years = hyperframe(year=factor(2010:2014),
                       ppp=list(year1, year2, year3, year4, year5),
                       NDVI=list(distmap(year5), distmap(year1), distmap(year2),
                                 distmap(year3), distmap(year4)),
                       rivD=distmap(line),
                       stringsAsFactors=TRUE)
hyp.years$numYear = with(hyp.years,as.numeric(year)-1)
hyp.years
#### Run mppm models
# mppm1 = mppm(ppp~(NDVI+rivD)/marks,data=hyp.years); summary(mppm1)
#..........................
# mppm2 = mppm(ppp~(NDVI+rivD)/marks,random = ~1|year,data=hyp.years); summary(mppm2)
#..........................
# correlation = corAR1(form=~year)
# mppm3 = mppm(ppp~(NDVI+rivD)/marks,correlation = corAR1(form=~year),use.gam = TRUE,data=hyp.years); summary(mppm3)
###---> Run mppm model with annual trend and random variation in growth
mppmCorr = mppm(ppp~(NDVI+rivD+numYear)/marks,random = ~1|year,data=hyp.years)
summary(mppmCorr)
If there's a trend in population size over time, then it might make sense to include this trend in the systematic part of the model. I would suggest you add a new numeric variable NumYear to the data frame (e.g. giving the number of years since 2010). Then try adding simple trend terms such as +NumYear to the model formula (this would correspond to the exponential growth in population that you observed). You can keep the 1|year random-effect term, which will then allow for random variation in population size around the long-term growth trend.
There's no need to split the data patterns for each year into separate male and female patterns. The variable marks in the model formula can be used to specify any model that depends on sex.
I'm pretty sure that mppm with use.gam=TRUE does not recognise the argument correlation, and it is probably just ignored. (It depends on what happens inside gam.)
