Visualizing multiple linear regression predictions using a heatmap - scikit-learn

I am using multiple linear regression to predict the temperature in every region of a field where wireless sensors are deployed: 42 sensors are spread over a 1000x600 m² surface, each recording the temperature at its location once per hour, see picture:
(Image: sensor placement)
There are two features here (the location, i.e. x and y) and one output, the temperature. I fit my model on 70% of the dataset so I can compute accuracy later. After fitting, I want to predict the temperature over the whole surface, specifically as a heat map that gives the temperature as a function of x and y (see picture: Heatmap).
I am stuck at the visualization part: my dataset contains only the 42 known locations and their respective temperatures. How can I plot predictions for every x in [0, 1000] and y in [0, 600]?
Do I have to build an n x 2 matrix iterating over all values of x and y and feed it to my fitted model, or is there a simpler way?

You can use np.meshgrid to create a grid of points, then use your model to predict on that grid.
import numpy as np
import matplotlib.pyplot as plt
grid_x, grid_y = np.meshgrid(np.linspace(0, 1000, 100),
                             np.linspace(0, 600, 60))
X = np.stack([grid_x.ravel(), grid_y.ravel()]).T
y_pred = model.predict(X)  # use your fitted scikit-learn model here
image = np.reshape(y_pred, grid_x.shape)
# extent puts the axes in field coordinates instead of pixel indices
plt.imshow(image, origin="lower", extent=[0, 1000, 0, 600])
plt.colorbar()
plt.show()
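For reference, here is a minimal sketch of how the model above might be produced. The variable names (sensor_xy, sensor_temp) and the dummy data are placeholders standing in for the 42 real sensor readings, not something from the question:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
# Placeholder data standing in for the 42 real (x, y, temperature) readings
rng = np.random.default_rng(0)
sensor_xy = rng.uniform([0, 0], [1000, 600], size=(42, 2))
sensor_temp = 20 + 0.01 * sensor_xy[:, 0] + rng.normal(0, 1, 42)
# 70/30 split, as described in the question
X_train, X_test, y_train, y_test = train_test_split(
    sensor_xy, sensor_temp, train_size=0.7, random_state=0)
model = LinearRegression().fit(X_train, y_train)
print("R^2 on the held-out 30%:", model.score(X_test, y_test))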

Related

How to interpolate 2D spatial data with kriging in Python?

I have a spatial 2D domain, say [0,1]×[0,1]. In this domain, there are 6 points where some scalar quantity of interest has been observed (e.g., temperature, mechanical stress, fluid density, etc.). How can I predict the quantity of interest at unobserved points? In other words, how may I interpolate spatial data in Python?
For example, consider the following coordinates for points in the 2D domain (inputs) and corresponding observations of the quantity of interest (outputs).
import numpy as np
coordinates = np.array([[0.0, 0.0], [0.5, 0.0], [1.0, 0.0], [0.0, 1.0], [0.5, 1.0], [1.0, 1.0]])
observations = np.array([1.0,0.5,0.75,-1.0,0.0,1.0])
The X and Y coordinates can be extracted with:
x = coordinates[:,0]
y = coordinates[:,1]
The following script creates a scatter plot where yellow (resp. blue) represents high (resp. low) output values.
import matplotlib.pyplot as plt
fig = plt.figure()
plt.scatter(x, y, c=observations, cmap='viridis')
plt.colorbar()
plt.show()
I would like to use Kriging to predict the scalar quantity of interest on a regular grid within the 2D input domain. How can I do this in Python?
In OpenTURNS, the KrigingAlgorithm class can estimate the hyperparameters of a Gaussian process model based on the known output values at specific input points. The getMetamodel method of KrigingAlgorithm, then, returns a function which interpolates the data.
First, we need to convert the Numpy arrays coordinates and observations to OpenTURNS Sample objects:
import openturns as ot
input_train = ot.Sample(coordinates)
output_train = ot.Sample(observations, 1)
The array coordinates has shape (6, 2), so it is turned into a Sample of size 6 and dimension 2. The array observations has shape (6,), which is ambiguous: Is it going to be a Sample of size 6 and dimension 1, or a Sample of size 1 and dimension 6? To clarify this, we specify the dimension (1) in the call to the Sample class constructor.
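Equivalently, the ambiguity can be removed on the NumPy side before the conversion; this is just an alternative sketch, assuming the usual NumPy-to-Sample conversion of a 2D array:
# An explicit (6, 1) array converts to a Sample of size 6 and dimension 1
output_train = ot.Sample(observations.reshape(-1, 1))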
In the following, we define a Gaussian process model with constant trend function and squared exponential covariance kernel:
inputDimension = 2
basis = ot.ConstantBasisFactory(inputDimension).build()
covariance_kernel = ot.SquaredExponential([1.0]*inputDimension, [1.0])
algo = ot.KrigingAlgorithm(input_train, output_train,
                           covariance_kernel, basis)
We then fit the value of the trend and the parameters of the covariance kernel (amplitude parameter and scale parameters) and obtain a metamodel:
# Fit
algo.run()
result = algo.getResult()
krigingMetamodel = result.getMetaModel()
The resulting krigingMetamodel is a Function which takes a 2D Point as input and returns a 1D Point: it predicts the quantity of interest.
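For example, it can be called directly on a single point; the input [0.5, 0.5] here is an arbitrary illustration:
# Predict at one point: returns a 1D Point containing the prediction
print(krigingMetamodel([0.5, 0.5]))
To map the predictions over the whole input domain, let us build the 2D domain [0,1]×[0,1] and discretize it with a regular grid: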
# Create the 2D domain
myInterval = ot.Interval([0.0, 0.0], [1.0, 1.0])
# Define the number of intervals in each direction of the box
nx = 20
ny = 10
myIndices = [nx - 1, ny - 1]
myMesher = ot.IntervalMesher(myIndices)
myMeshBox = myMesher.build(myInterval)
Using our krigingMetamodel to predict the values taken by the quantity of interest on this mesh can be done with the following statements. We first get the vertices of the mesh as a Sample, and then evaluate the predictions with a single call to the metamodel (there is no need for a for loop here):
# Predict
vertices = myMeshBox.getVertices()
predictions = krigingMetamodel(vertices)
In order to see the result with Matplotlib, we first have to create the data required by the pcolor function:
# Format for plot
X = np.array(vertices[:, 0]).reshape((ny, nx))
Y = np.array(vertices[:, 1]).reshape((ny, nx))
predictions_array = np.array(predictions).reshape((ny, nx))
The following script produces the plot:
# Plot
import matplotlib.pyplot as plt
fig = plt.figure()
plt.pcolor(X, Y, predictions_array)
plt.colorbar()
plt.show()
We see that the predictions of the metamodel are equal to the observations at the observed input points.
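This can be checked directly, since kriging without a nugget effect interpolates its training data. The following quick sanity check is not in the original script:
# Predictions at the training points should reproduce the observations
print(krigingMetamodel(input_train))
print(output_train)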
The metamodel is a smooth function of the coordinates: its smoothness is inherited from the covariance kernel, and the squared exponential kernel happens to be very smooth (infinitely differentiable).

PySpark: Get Threshold (cutoff) values for each point in ROC curve

I'm starting with PySpark, building binary classification models (logistic regression), and I need to find the optimal threshold (cutoff) point for my models.
I want to use the ROC curve to find this point, but I don't know how to extract the threshold value for each point on this curve. Is there a way to find these values?
Things I've found:
This post shows how to extract the ROC curve, but only the values for the TPR and FPR. It's useful for plotting and for selecting the optimal point, but I can't find the threshold value.
I know I can find the threshold values for each point in the ROC curve using H2O (I've done it before), but I'm working in PySpark.
Here is a post describing how to do it with R... but, again, I need to do it with PySpark.
Other facts
I'm using Apache Spark 2.4.0.
I'm working with Data Frames (I really don't know - yet - how to work with RDDs, but I'm not afraid to learn ;) )
If you specifically need to generate ROC curves for different thresholds, one approach could be to generate a list of threshold values you're interested in and fit/transform on your dataset for each threshold. Or you could manually calculate the ROC curve for each threshold point using the probability field in the response from model.transform(test).
Alternatively, you can use BinaryClassificationMetrics to extract a curve plotting various metrics (F1 score, precision, recall) by threshold.
Unfortunately it appears the PySpark version doesn't implement most of the methods the Scala version does, so you'd need to wrap the class to do it in Python.
For example:
from pyspark.mllib.evaluation import BinaryClassificationMetrics

# Scala version implements .roc() and .pr()
# Python: https://spark.apache.org/docs/latest/api/python/_modules/pyspark/mllib/common.html
# Scala: https://spark.apache.org/docs/latest/api/java/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.html
class CurveMetrics(BinaryClassificationMetrics):
    def __init__(self, *args):
        super(CurveMetrics, self).__init__(*args)

    def _to_list(self, rdd):
        points = []
        # Note this collect could be inefficient for large datasets
        # considering there may be one probability per datapoint (at most).
        # The Scala version takes a numBins parameter,
        # but it doesn't seem possible to pass this from Python to Java.
        for row in rdd.collect():
            # Results are returned as type scala.Tuple2,
            # which doesn't appear to have a py4j mapping
            points += [(float(row._1()), float(row._2()))]
        return points

    def get_curve(self, method):
        rdd = getattr(self._java_model, method)().toJavaRDD()
        return self._to_list(rdd)
Usage:
import matplotlib.pyplot as plt
preds = predictions.select('label', 'probability').rdd.map(lambda row: (float(row['probability'][1]), float(row['label'])))
# Returns a list of (false positive rate, true positive rate) points
points = CurveMetrics(preds).get_curve('roc')
plt.figure()
x_val = [x[0] for x in points]
y_val = [x[1] for x in points]
plt.title('ROC curve')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.plot(x_val, y_val)
The result is a standard ROC curve plot. If you aren't married to ROC, the same wrapper can also produce an F1 score curve by threshold value via get_curve('fMeasureByThreshold').
One way is to use sklearn.metrics.roc_curve.
First use your fitted model to make predictions:
from pyspark.ml.classification import LogisticRegression
lr = LogisticRegression(labelCol="label", featuresCol="features")
model = lr.fit(trainingData)
predictions = model.transform(testData)
Then collect your scores and labels (see the note below):
preds = predictions.select('label','probability')\
.rdd.map(lambda row: (float(row['probability'][1]), float(row['label'])))\
.collect()
Now transform preds to work with roc_curve:
from sklearn.metrics import roc_curve
y_score, y_true = zip(*preds)
fpr, tpr, thresholds = roc_curve(y_true, y_score, pos_label=1)
Notes:
I am not 100% certain that the probabilities vector will always be ordered such that the positive label will be at index 1. However in a binary classification problem, you'll know right away if your AUC is less than 0.5. In that case, just take 1-p for the probabilities (since the class probabilities sum to 1).
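Once you have fpr, tpr and thresholds, one common heuristic for the optimal cutoff (a standard choice, not prescribed by the question) is Youden's J statistic, i.e. the threshold maximizing tpr - fpr:
import numpy as np
# Youden's J: the point where the ROC curve is farthest above the diagonal
j_scores = np.asarray(tpr) - np.asarray(fpr)
best_threshold = thresholds[np.argmax(j_scores)]
print("Optimal cutoff:", best_threshold)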

Plotting residuals of masked values with `statsmodels`

I'm using statsmodels.api to compute the statistical parameters for an OLS fit between two variables:
def computeStats(x, y, yName):
    '''
    Takes two arrays and a string naming the y array.
    Uses Ordinary Least Squares to compute the statistical parameters for the
    array against log(z), and determines the equation for the line of best fit.
    Returns the results summary, residuals, statistical parameters in a list,
    and the best-fit equation.
    '''
    # Mask NaN values in both axes
    mask = ~np.isnan(y) & ~np.isnan(x)
    # Compute model parameters
    model = sm.OLS(y, sm.add_constant(x), missing='drop')
    results = model.fit()
    residuals = results.resid
    # Compute fit parameters
    params = stats.linregress(x[mask], y[mask])
    fit = params[0]*x + params[1]
    fitEquation = r'$(%s)=(%.4g \pm %.4g) \times redshift+%.4g$' % (yName,
                  params[0],   # slope
                  params[4],   # stderr in slope
                  params[1])   # y-intercept
    return results, residuals, params, fit, fitEquation
The second part of the function (using stats.linregress) plays nicely with the masked values, but statsmodels does not. When I try to plot the residuals against the x values with plt.scatter(x, resids), the dimensions do not match:
ValueError: x and y must be the same size
because there are 29007 x-values, and 11763 residuals (that's how many y-values made it through the masking process). I tried changing the model variable to
model = sm.OLS(y[mask], sm.add_constant(x[mask]), missing= 'drop')
but this had no effect.
How can I scatter-plot the residuals against the x-values they match with?
@jim421616: since statsmodels dropped the rows with missing values, you should use the model's exog attribute to plot the scatter, as shown below.
plt.scatter(model.model.exog[:,1], model.resid)
For reference, a complete dummy example:
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
# Generate data
x = np.random.rand(1000)
y = np.sin(x*25) + 0.1*np.random.rand(1000)
# Set some values to NaN
y[np.random.choice(np.arange(1000), size=100)] = np.nan
x[np.random.choice(np.arange(1000), size=80)] = np.nan
# Fit model, dropping the missing values
model = sm.OLS(y, sm.add_constant(x), missing='drop').fit()
print(model.summary())
# Plot residuals against the x-values that were actually used in the fit
plt.scatter(model.model.exog[:, 1], model.resid)
plt.show()

sklearn - label points of PCA

I am generating a PCA with scikit-learn, numpy and matplotlib, and I want to know how to label each point (i.e. each row in my data). I found annotate in matplotlib, but it seems to be for labeling specific coordinates, or just putting text on arbitrary points in the order they appear. I'm trying to abstract away from that, but I'm struggling because the PCA steps come before the matplotlib code. Is there a way to do this with sklearn while I'm still generating the plot, so each point doesn't lose its connection to the row it came from?
Here's my code:
# Create a Randomized PCA model that takes two components
randomized_pca = decomposition.RandomizedPCA(n_components=2)
# Fit and transform the data to the model
reduced_data_rpca = randomized_pca.fit_transform(x)
# Create a regular PCA model
pca = decomposition.PCA(n_components=2)
# Fit and transform the data to the model
reduced_data_pca = pca.fit_transform(x)
# Inspect the shape
reduced_data_pca.shape
# Print out the data
print(reduced_data_rpca)
print(reduced_data_pca)

def rand_jitter(arr):
    stdev = .01*(max(arr)-min(arr))
    return arr + np.random.randn(len(arr)) * stdev

colors = ['red', 'blue']
for i in range(len(colors)):
    w = reduced_data_pca[:, 0][y == i]
    z = reduced_data_pca[:, 1][y == i]
    plt.scatter(w, z, c=colors[i])

targ_names = ["Negative", "Positive"]
plt.legend(targ_names, bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.title("PCA Scatter Plot")
plt.show()
PCA is a projection, not a clustering (you tagged this as clustering).
There is no concept of a label in PCA.
You can draw text labels onto a scatter plot, but it usually becomes too crowded. You can find answers to this on Stack Overflow already.
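That said, if you do want per-point labels despite the clutter, here is a minimal sketch using annotate. It assumes labels is a hypothetical list with one string per row of x, and reuses reduced_data_pca from the question's code:
import matplotlib.pyplot as plt
# Hypothetical labels: one string per row of the original data
labels = ["row %d" % i for i in range(len(reduced_data_pca))]
for label, (px, py) in zip(labels, reduced_data_pca):
    # Offset each text slightly so it doesn't sit on top of its marker
    plt.annotate(label, (px, py), textcoords="offset points", xytext=(3, 3), fontsize=8)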

How do I use Theanets LSTM RNN's on my time series data?

I have a simple dataframe consisting of one column with 10320 numerical observations. I simulate time-series data by plotting the data in windows of 200 observations each. Here is the plotting code:
import time
import numpy as np
import matplotlib.pyplot as plt
from IPython import display
from matplotlib.backends.backend_agg import FigureCanvasAgg as FigureCanvas

fig_size = plt.rcParams["figure.figsize"]
fig, axes = plt.subplots(1, 1, figsize=(19, 5))
df = dframe.set_index(np.arange(0, len(dframe)))
std = dframe[0].std() * 6
window = 200
iterations = int(len(dframe)/window)
i = 0
dframe = dframe.set_index(np.arange(0, len(dframe)))
while i < iterations:
    frm = window*i
    if i == iterations:
        to = len(dframe)
    else:
        to = frm + window
    df = dframe[frm:to]
    if len(df) > 100:
        df = df.set_index(np.arange(0, len(df)))
        plt.gca().cla()
        plt.plot(df.index, df[0])
        plt.axhline(y=std, xmin=0, xmax=len(df[0]), c='gray', linestyle='--', lw=2)
        plt.axhline(y=-std, xmin=0, xmax=len(df[0]), c='gray', linestyle='--', lw=2)
        plt.ylim(min(dframe[0]) - 0.5, max(dframe[0]))
        plt.xlim(-50, window + 50)
        display.clear_output(wait=True)
        display.display(plt.gcf())
        canvas = FigureCanvas(fig)
        canvas.print_figure('fig.png', dpi=72, bbox_inches='tight')
    i += 1
plt.close()
This simulates a flow of real-time data and visualizes it. What I want is to apply a theanets RNN LSTM to the data to detect anomalies, unsupervised. Because it is unsupervised, I don't think I need to split my data into training and test sets. I haven't found much that makes sense to me so far, and I have been googling for about two hours, so I'm hoping you may be able to help. I also want to put the prediction output of the RNN on the graph and define a threshold such that, if the error is too large, the values are flagged as anomalous. If you need more information, please comment and let me know. Thank you!
READING
Like neurons, LSTM networks are built from interconnected LSTM blocks, and training is done via Backpropagation Through Time.
Classical anomaly detection on time series requires predicting the series' output in the future (at one or more points) and computing the error of those predictions against the true values. A prediction error above a threshold indicates an anomaly.
SOLUTION
Having said this:
You have to train the network, so you need both training and test sets.
Use N inputs to predict M outputs (decide on N and M by experimentation: pick values for which the training error is low).
Scroll a window of (N+M) elements over the input data and use each such array of (N+M) items, also termed a frame, to train or test the network (see the sketch after this list).
Typically we use the first 90% of the series for training and the last 10% for testing.
This scheme will fail if the training is not done properly, since there will then be false prediction errors that are not anomalies. So make sure to provide enough training, and, most importantly, shuffle the training frames so that all variations are covered.
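As a concrete illustration of the frame construction above, here is a minimal sketch in plain NumPy (theanets itself is not shown; the toy series and the N, M values are arbitrary assumptions):
import numpy as np

def make_frames(series, n_inputs, m_outputs):
    # Slide a window of (n_inputs + m_outputs) items over the series and
    # split each frame into an input part and a target part.
    frame_len = n_inputs + m_outputs
    inputs, targets = [], []
    for start in range(len(series) - frame_len + 1):
        frame = series[start:start + frame_len]
        inputs.append(frame[:n_inputs])
        targets.append(frame[n_inputs:])
    return np.array(inputs), np.array(targets)

# Toy series standing in for the 10320 real observations
series = np.sin(np.linspace(0, 50, 1000))
X, Y = make_frames(series, n_inputs=20, m_outputs=5)
# First 90% of frames for training, last 10% for testing
split = int(0.9 * len(X))
X_train, Y_train = X[:split], Y[:split]
X_test, Y_test = X[split:], Y[split:]
# Shuffle only the training frames, as suggested above
perm = np.random.permutation(len(X_train))
X_train, Y_train = X_train[perm], Y_train[perm]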
