sklearn - label points of PCA - python-3.x

I am generating a PCA using scikit-learn, NumPy and matplotlib. I want to know how to label each point (each row in my data). I found "annotate" in matplotlib, but this seems to be for labeling specific coordinates, or just putting text on arbitrary points in the order they appear. I'm trying to abstract away from this but am struggling because of the PCA steps that come before the matplotlib code. Is there a way I can do this with sklearn, while I'm still generating the plot, so I don't lose the connection to the row each point came from?
Here's my code:
# Create a Randomized PCA model that takes two components
randomized_pca = decomposition.RandomizedPCA(n_components=2)
# Fit and transform the data to the model
reduced_data_rpca = randomized_pca.fit_transform(x)
# Create a regular PCA model
pca = decomposition.PCA(n_components=2)
# Fit and transform the data to the model
reduced_data_pca = pca.fit_transform(x)
# Inspect the shape
reduced_data_pca.shape
# Print out the data
print(reduced_data_rpca)
print(reduced_data_pca)
def rand_jitter(arr):
    stdev = .01*(max(arr)-min(arr))
    return arr + np.random.randn(len(arr)) * stdev
colors = ['red', 'blue']
for i in range(len(colors)):
    w = reduced_data_pca[:, 0][y == i]
    z = reduced_data_pca[:, 1][y == i]
    plt.scatter(w, z, c=colors[i])
targ_names = ["Negative", "Positive"]
plt.legend(targ_names, bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.title("PCA Scatter Plot")
plt.show()

PCA is a projection, not a clustering (you tagged this as clustering).
There is no concept of a label in PCA.
You can draw texts onto a scatterplot, but usually it becomes too crowded. You can find answers to this on stackoverflow already.
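As a minimal sketch of drawing text onto the scatter plot: fit_transform returns one output row per input row, in the same order, so the row index can serve as the label. The data x, the classes y, and the label strings below are made-up placeholders, not taken from the question.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line to use plt.show()
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stand-in data: 10 rows, 4 features, binary classes
rng = np.random.default_rng(0)
x = rng.normal(size=(10, 4))
y = np.array([0, 1] * 5)
row_labels = [f"row {i}" for i in range(len(x))]

# Row i of the output corresponds to row i of the input
reduced = PCA(n_components=2).fit_transform(x)

fig, ax = plt.subplots()
colors = ["red", "blue"]
for cls, name in enumerate(["Negative", "Positive"]):
    pts = reduced[y == cls]
    ax.scatter(pts[:, 0], pts[:, 1], c=colors[cls], label=name)

# One annotation per row, placed at that row's projected coordinates
annotations = [
    ax.annotate(label, (px, py), textcoords="offset points",
                xytext=(4, 4), fontsize=8)
    for label, (px, py) in zip(row_labels, reduced)
]

ax.set_xlabel("First Principal Component")
ax.set_ylabel("Second Principal Component")
ax.legend()
```

With more than a few dozen rows the labels will overlap, so it may be worth annotating only a subset of rows.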

Related

Visualizing multiple linear regression predictions using a heatmap

I am using multiple linear regression to predict the temperature in every region of a field where wireless sensors are deployed, the sensors are as follows : 42 sensors deployed in a 1000x600 m² surface and collecting the temperature in each of these 42 locations per hour, see picture:
Sensors placement
We have two features here (the location, i.e. x and y) and one output, the temperature. I fit my model on 70% of the dataset so that I can compute its accuracy later. After fitting, however, I want temperature predictions over the whole surface, specifically a heat map that gives the temperature as a function of x and y (see picture: Heatmap).
I am stuck on the visualization part, since my dataset contains only the 42 known locations and their respective temperatures. How can I plot predictions for every x in [0, 1000] and y in [0, 600]?
Do I have to build an n×2 matrix by iterating over all the values of x and y and then feed it to my fitted model, or is there a simpler way?
You can use np.meshgrid to create a grid of points, then use your model to predict on this grid of points.
import numpy as np
import matplotlib.pyplot as plt
grid_x, grid_y = np.meshgrid(np.linspace(0, 1000, 100),
np.linspace(0, 600, 60))
X = np.stack([grid_x.ravel(), grid_y.ravel()]).T
y_pred = model.predict(X) # use your scikit-learn model here
image = np.reshape(y_pred, grid_x.shape)
plt.imshow(image, origin="lower")
plt.colorbar()
plt.show()
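For a self-contained view of the shapes involved, here is a sketch that swaps the scikit-learn model for a plain least-squares plane fitted with np.linalg.lstsq. The sensor positions and temperatures are randomly generated placeholders, and predict is a hypothetical helper standing in for model.predict.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical sensor data: 42 locations in a 1000 x 600 field
sensors = rng.uniform([0, 0], [1000, 600], size=(42, 2))
temps = 20 + 0.005 * sensors[:, 0] - 0.01 * sensors[:, 1] + rng.normal(0, 0.1, 42)

# Fit temperature = a*x + b*y + c by ordinary least squares
design = np.column_stack([sensors, np.ones(len(sensors))])
coef, *_ = np.linalg.lstsq(design, temps, rcond=None)

def predict(points):
    """Stand-in for model.predict: points has shape (n, 2)."""
    return points @ coef[:2] + coef[2]

# Grid of query points: 100 columns along x, 60 rows along y
grid_x, grid_y = np.meshgrid(np.linspace(0, 1000, 100), np.linspace(0, 600, 60))
X = np.column_stack([grid_x.ravel(), grid_y.ravel()])  # shape (6000, 2)
image = predict(X).reshape(grid_x.shape)               # shape (60, 100)
```

Passing extent=[0, 1000, 0, 600] to plt.imshow labels the axes in metres rather than in grid indices.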

How to interpolate 2D spatial data with kriging in Python?

I have a spatial 2D domain, say [0,1]×[0,1]. In this domain, there are 6 points where some scalar quantity of interest has been observed (e.g., temperature, mechanical stress, fluid density, etc.). How can I predict the quantity of interest at unobserved points? In other words, how may I interpolate spatial data in Python?
For example, consider the following coordinates for points in the 2D domain (inputs) and corresponding observations of the quantity of interest (outputs).
import numpy as np
coordinates = np.array([[0.0,0.0],[0.5,0.0],[1.0,0.0],[0.0,1.0],[0.5,1.],[1.0,1.0]])
observations = np.array([1.0,0.5,0.75,-1.0,0.0,1.0])
The X and Y coordinates can be extracted with:
x = coordinates[:,0]
y = coordinates[:,1]
The following script creates a scatter plot where yellow (resp. blue) represents high (resp. low) output values.
import matplotlib.pyplot as plt
fig = plt.figure()
plt.scatter(x, y, c=observations, cmap='viridis')
plt.colorbar()
plt.show()
I would like to use Kriging to predict the scalar quantity of interest on a regular grid within the 2D input domain. How can I do this in Python?
In OpenTURNS, the KrigingAlgorithm class can estimate the hyperparameters of a Gaussian process model based on the known output values at specific input points. The getMetaModel method of the result returned by KrigingAlgorithm then provides a function which interpolates the data.
First, we need to convert the Numpy arrays coordinates and observations to OpenTURNS Sample objects:
import openturns as ot
input_train = ot.Sample(coordinates)
output_train = ot.Sample(observations, 1)
The array coordinates has shape (6, 2), so it is turned into a Sample of size 6 and dimension 2. The array observations has shape (6,), which is ambiguous: Is it going to be a Sample of size 6 and dimension 1, or a Sample of size 1 and dimension 6? To clarify this, we specify the dimension (1) in the call to the Sample class constructor.
In the following, we define a Gaussian process model with constant trend function and squared exponential covariance kernel:
inputDimension = 2
basis = ot.ConstantBasisFactory(inputDimension).build()
covariance_kernel = ot.SquaredExponential([1.0]*inputDimension, [1.0])
algo = ot.KrigingAlgorithm(input_train, output_train,
covariance_kernel, basis)
We then fit the value of the trend and the parameters of the covariance kernel (amplitude parameter and scale parameters) and obtain a metamodel:
# Fit
algo.run()
result = algo.getResult()
krigingMetamodel = result.getMetaModel()
The resulting krigingMetamodel is a Function which takes a 2D Point as input and returns a 1D Point. It predicts the quantity of interest. To illustrate this, let us build the 2D domain [0,1]×[0,1] and discretize it with a regular grid:
# Create the 2D domain
myInterval = ot.Interval([0.0, 0.0], [1.0, 1.0])
# Define the number of vertices in each direction of the box
# (the mesher takes the number of intervals, hence nx - 1 and ny - 1 below)
nx = 20
ny = 10
myIndices = [nx - 1, ny - 1]
myMesher = ot.IntervalMesher(myIndices)
myMeshBox = myMesher.build(myInterval)
Using our krigingMetamodel to predict the values taken by the quantity of interest on this mesh can be done with the following statements. We first get the vertices of the mesh as a Sample, and then evaluate the predictions with a single call to the metamodel (there is no need for a for loop here):
# Predict
vertices = myMeshBox.getVertices()
predictions = krigingMetamodel(vertices)
In order to see the result with Matplotlib, we first have to create the data required by the pcolor function:
# Format for plot
X = np.array(vertices[:, 0]).reshape((ny, nx))
Y = np.array(vertices[:, 1]).reshape((ny, nx))
predictions_array = np.array(predictions).reshape((ny,nx))
The following script produces the plot:
# Plot
import matplotlib.pyplot as plt
fig = plt.figure()
plt.pcolor(X, Y, predictions_array)
plt.colorbar()
plt.show()
We see that the predictions of the metamodel are equal to the observations at the observed input points.
This metamodel is a smooth function of the coordinates: its smoothness increases with that of the covariance kernel, and squared exponential kernels are infinitely differentiable.
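To make the interpolation property concrete without OpenTURNS, here is a NumPy-only sketch of kriging with a squared exponential kernel. It is not what KrigingAlgorithm does internally: the amplitude and scale are fixed by hand instead of being estimated, and the constant trend is approximated by the sample mean. The names sq_exp_kernel and krige are hypothetical helpers.

```python
import numpy as np

coordinates = np.array([[0.0, 0.0], [0.5, 0.0], [1.0, 0.0],
                        [0.0, 1.0], [0.5, 1.0], [1.0, 1.0]])
observations = np.array([1.0, 0.5, 0.75, -1.0, 0.0, 1.0])

def sq_exp_kernel(a, b, scale=0.5, amplitude=1.0):
    """Squared exponential covariance between point sets a (n, 2) and b (m, 2)."""
    sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return amplitude**2 * np.exp(-0.5 * sq_dist / scale**2)

def krige(new_points, train_points, train_values):
    """Predict at new_points from noise-free observations (hand-picked hyperparameters)."""
    trend = train_values.mean()  # crude constant trend, an assumption of this sketch
    K = sq_exp_kernel(train_points, train_points)
    weights = np.linalg.solve(K, train_values - trend)
    return trend + sq_exp_kernel(new_points, train_points) @ weights

# Predict on a regular 20 x 10 grid over [0, 1] x [0, 1]
gx, gy = np.meshgrid(np.linspace(0, 1, 20), np.linspace(0, 1, 10))
grid = np.column_stack([gx.ravel(), gy.ravel()])
predictions_grid = krige(grid, coordinates, observations).reshape(gx.shape)

# At the observed points the predictor reproduces the observations
at_train = krige(coordinates, coordinates, observations)
```

Because no noise term is added to the diagonal of K, the predictor passes exactly through the six observations, which is the behaviour described above.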

PySpark: Get Threshold (cutoff) values for each point in ROC curve

I'm starting with PySpark, building binary classification models (logistic regression), and I need to find the optimal threshold (cutoff) point for my models.
I want to use the ROC curve to find this point, but I don't know how to extract the threshold value for each point on the curve. Is there a way to find these values?
Things I've found:
This post shows how to extract the ROC curve, but only the values for the TPR and FPR. It's useful for plotting and for selecting the optimal point, but I can't find the threshold value.
I know I can find the threshold values for each point in the ROC curve using H2O (I've done it before), but I'm working on Pyspark.
Here is a post describing how to do it with R... but, again, I need to do it with Pyspark
Other facts
I'm using Apache Spark 2.4.0.
I'm working with Data Frames (I really don't know - yet - how to work with RDDs, but I'm not afraid to learn ;) )
If you specifically need to generate ROC curves for different thresholds, one approach could be to generate a list of threshold values you're interested in and fit/transform on your dataset for each threshold. Or you could manually calculate the ROC curve for each threshold point using the probability field in the response from model.transform(test).
Alternatively, you can use BinaryClassificationMetrics to extract a curve plotting various metrics (F1 score, precision, recall) by threshold.
Unfortunately it appears the PySpark version doesn't implement most of the methods the Scala version does, so you'd need to wrap the class to do it in Python.
For example:
from pyspark.mllib.evaluation import BinaryClassificationMetrics
# Scala version implements .roc() and .pr()
# Python: https://spark.apache.org/docs/latest/api/python/_modules/pyspark/mllib/common.html
# Scala: https://spark.apache.org/docs/latest/api/java/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.html
class CurveMetrics(BinaryClassificationMetrics):
    def __init__(self, *args):
        super(CurveMetrics, self).__init__(*args)
    def _to_list(self, rdd):
        points = []
        # Note this collect could be inefficient for large datasets
        # considering there may be one probability per datapoint (at most)
        # The Scala version takes a numBins parameter,
        # but it doesn't seem possible to pass this from Python to Java
        for row in rdd.collect():
            # Results are returned as type scala.Tuple2,
            # which doesn't appear to have a py4j mapping
            points += [(float(row._1()), float(row._2()))]
        return points
    def get_curve(self, method):
        rdd = getattr(self._java_model, method)().toJavaRDD()
        return self._to_list(rdd)
Usage:
import matplotlib.pyplot as plt
preds = predictions.select('label','probability').rdd.map(lambda row: (float(row['probability'][1]), float(row['label'])))
# Returns as a list (false positive rate, true positive rate)
points = CurveMetrics(preds).get_curve('roc')
plt.figure()
x_val = [x[0] for x in points]
y_val = [x[1] for x in points]
plt.title(title)
plt.xlabel(xlabel)
plt.ylabel(ylabel)
plt.plot(x_val, y_val)
Results in:
Here's an example of an F1 score curve by threshold value if you aren't married to ROC:
One way is to use sklearn.metrics.roc_curve.
First use your fitted model to make predictions:
from pyspark.ml.classification import LogisticRegression
lr = LogisticRegression(labelCol="label", featuresCol="features")
model = lr.fit(trainingData)
predictions = model.transform(testData)
Then collect your scores and labels (see the note at the end):
preds = predictions.select('label','probability')\
.rdd.map(lambda row: (float(row['probability'][1]), float(row['label'])))\
.collect()
Now transform preds to work with roc_curve:
from sklearn.metrics import roc_curve
y_score, y_true = zip(*preds)
fpr, tpr, thresholds = roc_curve(y_true, y_score, pos_label = 1)
Notes:
I am not 100% certain that the probabilities vector will always be ordered such that the positive label will be at index 1. However in a binary classification problem, you'll know right away if your AUC is less than 0.5. In that case, just take 1-p for the probabilities (since the class probabilities sum to 1).
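Once roc_curve has returned the thresholds, one common (but not the only) way to pick an operating point is Youden's J statistic, which maximizes TPR minus FPR. The sketch below uses made-up scores and labels in place of the collected preds; the criterion itself is an assumption about what "optimal" means for your problem.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Made-up (score, label) pairs standing in for the collected `preds`
preds = [(0.05, 0.0), (0.1, 0.0), (0.2, 0.0), (0.3, 0.0), (0.4, 0.0), (0.6, 0.0),
         (0.5, 1.0), (0.7, 1.0), (0.8, 1.0), (0.9, 1.0), (1.0, 1.0)]
y_score, y_true = zip(*preds)

fpr, tpr, thresholds = roc_curve(y_true, y_score, pos_label=1)

# Youden's J: pick the threshold whose ROC point lies farthest above the diagonal
j_scores = tpr - fpr
best = int(np.argmax(j_scores))
best_threshold = thresholds[best]
print(f"threshold={best_threshold:.2f}  TPR={tpr[best]:.3f}  FPR={fpr[best]:.3f}")
```

If false positives and false negatives have different costs, a cost-weighted criterion may be a better fit than Youden's J.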

Plotting residuals of masked values with `statsmodels`

I'm using statsmodels.api to compute the statistical parameters for an OLS fit between two variables:
def computeStats(x, y, yName):
    '''
    Takes as an argument an array, and a string for the array name.
    Uses Ordinary Least Squares to compute the statistical parameters for the
    array against log(z), and determines the equation for the line of best fit.
    Returns the results summary, residuals, statistical parameters in a list, and the
    best fit equation.
    '''
    # Mask NaN values in both axes
    mask = ~np.isnan(y) & ~np.isnan(x)
    # Compute model parameters
    model = sm.OLS(y, sm.add_constant(x), missing= 'drop')
    results = model.fit()
    residuals = results.resid
    # Compute fit parameters
    params = stats.linregress(x[mask], y[mask])
    fit = params[0]*x + params[1]
    fitEquation = '$(%s)=(%.4g \\pm %.4g) \\times redshift+%.4g$'%(yName,
        params[0], # slope
        params[4], # stderr in slope
        params[1]) # y-intercept
    return results, residuals, params, fit, fitEquation
The second part of the function (using stats.linregress) plays nicely with the masked values, but statsmodels does not. When I try to plot the residuals against the x values with plt.scatter(x, resids), the dimensions do not match:
ValueError: x and y must be the same size
because there are 29007 x-values, and 11763 residuals (that's how many y-values made it through the masking process). I tried changing the model variable to
model = sm.OLS(y[mask], sm.add_constant(x[mask]), missing= 'drop')
but this had no effect.
How can I scatter-plot the residuals against the x-values they match with?
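Hi @jim421616. Since statsmodels dropped the rows with missing values, you should plot the residuals against the model's exog array, which holds only the rows that were actually used in the fit. In the line below, model is the fitted results object (results in your function); column 0 of exog is the constant added by add_constant, and column 1 is x.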
plt.scatter(model.model.exog[:,1], model.resid)
For reference a complete dummy example
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
#generate data
x = np.random.rand(1000)
y =np.sin( x*25)+0.1*np.random.rand(1000)
# Make some as NAN
y[np.random.choice(np.arange(1000), size=100)]= np.nan
x[np.random.choice(np.arange(1000), size=80)]= np.nan
# fit model
model = sm.OLS(y, sm.add_constant(x) ,missing='drop').fit()
print(model.summary())
# plot
plt.scatter(model.model.exog[:,1], model.resid)
plt.show()
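An equivalent approach, if you already build the NaN mask yourself, is to plot against x[mask], since the residuals line up one-to-one with the rows that survive the mask. The sketch below illustrates this with np.polyfit standing in for the OLS fit, so that it runs without statsmodels; the data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(1000)
y = 2.0 * x + 1.0 + 0.1 * rng.standard_normal(1000)

# Knock out some values in each array (the two index sets may overlap)
y[rng.choice(1000, size=100, replace=False)] = np.nan
x[rng.choice(1000, size=80, replace=False)] = np.nan

# Keep only rows where both x and y are present
mask = ~np.isnan(x) & ~np.isnan(y)

# Straight-line fit on the surviving rows (stand-in for sm.OLS with missing='drop')
slope, intercept = np.polyfit(x[mask], y[mask], deg=1)
residuals = y[mask] - (slope * x[mask] + intercept)

# residuals[i] belongs to x[mask][i], so these two arrays can be scattered together
x_used = x[mask]
```

With matplotlib this becomes plt.scatter(x_used, residuals).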

Apply power fit to data by using levenberg-marquardt algorithm in python

Hi everybody!
I am a beginner in Python and data analysis, and I ran into a problem while fitting a power function to my data.
Here I plotted my dataset as a scatterplot
I want to plot a power function with an exponent around -1, but after I apply the Levenberg-Marquardt method, using the lmfit library in Python, I get the following faulty image. I tried to modify the initial parameters, but it didn't help.
Here is my code:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from lmfit import minimize, Parameters, Parameter, report_fit
be = pd.read_table('...',
skipinitialspace=True,
names = ["CoM", "slope", "slope2"])
x=be["CoM"]
data=be["slope"]
def fcn2min(params, x, data):
    n2 = params['n2'].value
    n1 = params['n1'].value
    model = n1 * x ** n2
    return model - data #that's what you want to minimize
# create a set of Parameters
# 'value' is the initial condition
params = Parameters()
params.add('n2', value= -1.00)
params.add('n1',value= 23.0)
# do fit, here with leastsq model
result = minimize(fcn2min, params, args=(be["CoM"],be["slope"]))
#calculate final result
final = data + result.residual
resid = result.residual
# write error report
report_fit(result)
#plot results
xplot = x
yplot = result.params['n1'].value * x ** result.params['n2'].value
plt.figure(figsize=(15,6))
plt.ylabel('OD-slope',fontsize=18, color='blue')
plt.xlabel('CoM height_Sz [m]',fontsize=18, color='blue')
plt.plot(be["CoM"],be["slope"],"o", label="slope_flat")
plt.plot(be["CoM"],be["slope2"],"+",color='r', label="slope_curv")
plt.plot(xplot,yplot)
plt.legend()
plt.savefig('plot2')
plt.show()
I don't quite understand what the problem is, so any observations are welcome. Thank you very much.
It's a little hard to tell what the question is. It looks to me like the fit completed and gave a reasonably good fit, but you don't provide the fit statistics or the report of the parameters.
If you're asking about all the green lines for the "CoM" array (the best fit?), this is almost certainly because the x-axis "height_Sz" data was not sorted in increasing order. That's OK for the fit, but plotting an X-Y trace as a line expects the data to be in order.
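A minimal sketch of that fix, assuming the fitted curve is being evaluated at the original, unsorted x values: sort x once with np.argsort and draw the model in that order. The arrays and the values n1 = 23 and n2 = -1 (the question's initial guesses) are placeholders for the real data and the fitted parameters.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Placeholder data: unsorted x values and a noisy power law
rng = np.random.default_rng(0)
x = rng.uniform(0.5, 3.0, size=50)
data = 23.0 * x ** -1.0 + rng.normal(0, 0.5, size=50)
n1, n2 = 23.0, -1.0  # stand-ins for result.params['n1'].value and ['n2'].value

# Sort once, then evaluate the model curve at the sorted x values
order = np.argsort(x)
x_sorted = x[order]
y_model = n1 * x_sorted ** n2

fig, ax = plt.subplots()
ax.plot(x, data, "o", label="data")      # markers do not need sorting
ax.plot(x_sorted, y_model, label="fit")  # a connected line does
ax.legend()
```

If x is a pandas Series, as in the question, converting it with x.to_numpy() before sorting avoids index-alignment surprises.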
