Kmeans - Find the closest distance between two points of two cluster - python-3.x

I'm using KMeans to divide the data from a csv file into 2 clusters. The blue one is the benign URLs and the red one is the malicious URLs. The problem is that my instructor wants to know the closest points of the two clusters and the distance between them. I tried using the distance between the two centroids, but it didn't work!
Thank you for your help!
Here is the code and the clusters:
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import euclidean_distances
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
data = pd.read_csv("major_combined.csv")
X_mal = data.iloc[:,[1, 5, 6, 11, 13, 15, 19, 20, 23]]
kmeans_mal = KMeans(n_clusters=2, init='random', max_iter=300, n_init=10, random_state = 0)
y_kmeans_mal = kmeans_mal.fit_predict(X_mal)
plt.figure(figsize=(10,10))
plt.scatter(X_mal.iloc[y_kmeans_mal == 0,4], X_mal.iloc[y_kmeans_mal == 0,8], s= 100, color = 'blue', label ='0')
plt.scatter(X_mal.iloc[y_kmeans_mal == 1,4], X_mal.iloc[y_kmeans_mal == 1,8], s= 100, color = 'red', label ='1')
plt.scatter(kmeans_mal.cluster_centers_[:,4], kmeans_mal.cluster_centers_[:,8], s= 300, color = 'yellow', label ='Centroid')
plt.title('k-means clustering')
plt.xlabel('document')
plt.ylabel('window')
plt.legend()
plt.grid()
plt.show()
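The centroid distance measures how far apart the cluster centres are, not how close the clusters get to each other. A minimal sketch of the closest-pair search follows, assuming plain Euclidean distance on the same feature columns. The helper name `closest_pair_between_clusters` and the demo arrays are made up for illustration; with the code above you would pass `X_mal.to_numpy()` and `y_kmeans_mal` instead.

```python
import numpy as np

def closest_pair_between_clusters(X, labels, a=0, b=1):
    """Return (row index in cluster a, row index in cluster b, distance)
    for the closest pair of points drawn from two different clusters."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    idx_a = np.flatnonzero(labels == a)   # rows belonging to cluster a
    idx_b = np.flatnonzero(labels == b)   # rows belonging to cluster b
    # Pairwise Euclidean distances, shape (len(idx_a), len(idx_b))
    diff = X[idx_a][:, None, :] - X[idx_b][None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    i, j = np.unravel_index(np.argmin(dist), dist.shape)
    return idx_a[i], idx_b[j], dist[i, j]

# Hypothetical toy data: two points per cluster on a line
X_demo = np.array([[0.0, 0.0], [1.0, 0.0], [4.0, 0.0], [6.0, 0.0]])
labels_demo = np.array([0, 0, 1, 1])
row_a, row_b, d = closest_pair_between_clusters(X_demo, labels_demo)
print(row_a, row_b, d)
```

The full pairwise matrix needs memory proportional to the product of the two cluster sizes, so for large clusters a tree-based nearest-neighbour search is probably a better fit.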

Related

Use for loop for multi row column plot

I am attempting to run a for loop in order to plot multiple scatter plots. With the code that I have, I only get one plot at the end. How do I generate the correct row x column grid of plots to save?
I have checked out some of the answers given here and here, but they do not work for me. Is there a better way to generate these plots?
Here is my code:
from sklearn.datasets import make_classification
import seaborn as sns
import numpy as np
from matplotlib import pyplot as plt
import pandas as pd
# Generate noisy Data
num_trainsamples = 500
num_testsamples = 50
X_train,y_train = make_classification(n_samples=num_trainsamples,
                                      n_features=240,
                                      n_informative=9,
                                      n_redundant=0,
                                      n_repeated=0,
                                      n_classes=10,
                                      n_clusters_per_class=1,
                                      class_sep=9,
                                      flip_y=0.2,
                                      #weights=[0.5,0.5],
                                      random_state=17)
n_components=2
n_neighbours=[1, 5, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 25, 30]
local_connectivity=2
min_dist=0.15
target_names = ['t1', 't2', 't3', 't4', 't5', 't6', 't7', 't8', 't9', 't10']
plt.figure(figsize=(15,15))
for i in range(0, len(n_neighbours)):
    plt.subplot(3,5,i+1)
    plt.clf()
    plt.scatter(
        X_train[:, 0],
        X_train[:, 1],
        s = 20,
        c=y_train,
        cmap=plt.cm.nipy_spectral,
        edgecolor="k",
        linewidths=0.75,
        label=y_train,
        alpha=0.45,
    )
    plt.title(f'n_components = {n_components}, n_neighbors = {n_neighbours[i]}, local_conn = {local_connectivity}, min_dist = {min_dist}')
    cbar = plt.colorbar(boundaries=np.arange(11)-0.5)
    cbar.set_ticks(np.arange(10))
    cbar.set_ticklabels(target_names)
The reason you see just one plot is the line plt.clf(). This command tells matplotlib to clear the current figure, so on every pass through the loop it wipes out all the subplots drawn so far and you end up with only the last one. Removing (or commenting out) that line gives the full grid of subplots, which I think is what you are looking for.
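As a rough sketch of the same idea without plt.clf(), the grid can also be built with plt.subplots so that each pass of the loop draws on its own Axes. The random data, the shortened parameter list and the output file name are placeholders, not the data from the question.

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder data standing in for X_train / y_train from the question
rng = np.random.default_rng(17)
points = rng.normal(size=(200, 2))
classes = rng.integers(0, 10, size=200)
n_neighbours = [1, 5, 10, 15, 20, 30]

# One Axes per parameter value; nothing is cleared inside the loop
fig, axes = plt.subplots(2, 3, figsize=(12, 7), constrained_layout=True)
for ax, n in zip(axes.ravel(), n_neighbours):
    sc = ax.scatter(points[:, 0], points[:, 1], c=classes,
                    cmap="nipy_spectral", s=20, edgecolor="k",
                    linewidths=0.75, alpha=0.45)
    ax.set_title(f"n_neighbors = {n}")

# A single colorbar shared by all panels
fig.colorbar(sc, ax=axes, label="class")
fig.savefig("neighbour_grid.png")
```

Sharing one colorbar keeps the panels larger than drawing a colorbar per subplot, but that is a layout preference rather than a requirement.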

creating a rainfall colormap for points inside a watershed polygon

I would really appreciate your help in developing my code, since I am not an expert in Python. I am trying to write code that can:
Read all the points (longitude, latitude, cumulative forecasted rainfall for 24, 48, and 72 hours) from a csv file (Mean_PCP_REPS_12_20220809_Gridded.csv).
Read the polygon representing the watershed boundary (NelsonRiverBasin.shp).
Mask/remove the points outside of the watershed polygon.
Create a rainfall colormap image or raster for the points inside the watershed polygon.
Color boundaries should be based on rainfall value. I defined the rainfall range for each color in my code.
I tried many ways but I was not successful in creating an image or raster with the desired color map (please click here for an example of the intended image). My Python code is as follows. It creates and saves "new_ras.tiff", but it cannot remap the colors of this image based on the rainfall ranges after the image has been created.
from __future__ import division
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
from shapely.geometry import Point, Polygon, MultiPolygon
import operator
#extending the code
import os
from matplotlib.patches import Patch
from matplotlib.colors import ListedColormap
import matplotlib.colors as colors
import seaborn as sns
import numpy as np
import rioxarray as rxr
import earthpy as et
import earthpy.plot as ep
from scipy.interpolate import griddata #added code up to here
import rasterio
# load the data that should be cropped by the polygon
# this assumes that the csv file already includes
# a geometry column with point data as performed below
dat_gpd = pd.read_csv(r'Mean_PCP_REPS_12_20220809_Gridded.csv')
# make shapely points out of the X and Y coordinates
point_data = [Point(xy) for xy in zip(dat_gpd.iloc[:,0], dat_gpd.iloc[:,1])]
all_pts = list(zip(dat_gpd.iloc[:,0], dat_gpd.iloc[:,1]))
# assign shapely points as geometry to a geodataframe
# Like this you can also inspect the individual points if needed
arr_gpd = gpd.GeoDataFrame(dat_gpd, crs=4269, geometry=point_data)
# assign defined polygon to a new dataframe
nlpoly = gpd.read_file('NelsonRiverBasin.shp')
nlpoly = nlpoly.to_crs('epsg:4269')
mask = [nlpoly.contains(Point(p)).any() for p in all_pts]
# define a new dataframe from the spatial join of the dataframe with the data to be cropped
# and the dataframe with the polygon data, using the within function.
#dat_fin = gpd.sjoin(arr_gpd, nlpoly[['OCEAN_EN', 'COUNT', 'geometry']], predicate = 'within')
#dat_fin = dat_fin.to_crs('epsg:4326')
#dat_fin.plot(column= 'Hr72')
#plt.savefig('Raster2.tiff')
data = dat_gpd[['Long', 'Lat', 'Hr72']]
pts = list(zip(data.Long, data.Lat))
print (pts)
print(type(pts))
pts2 = [pts[i] for i in range(len(pts)) if mask[i]]
print(pts2)
print(type(pts2))
pts_val = data.Hr72.values
pts_val2 = [pts_val[i] for i in range(len(pts_val)) if mask[i]]
new_pts = [Point(xy) for xy in pts2]
print(type(pts_val2[1]))
pts3=[]
for tup, j in zip(pts2,range(len(pts_val2))):
    pts3.append(list(tup)+[pts_val2[j]])
print(type(pts3))
masked_pts = pd.DataFrame(pts3)
print(masked_pts)
masked_pts.columns = pd.Series(['Long', 'Lat', 'Hr72'])
new_arr_gpd = gpd.GeoDataFrame(masked_pts, crs = 4269, geometry = new_pts)
new_arr_gpd.plot(column = 'Hr72')
plt.savefig('new_ras.tiff')
rRes = 0.01
#xRange = np.arange(data.Long.min(), data.Long.max(), rRes)
#yRange = np.arange(data.Lat.min(), data.Lat.max(), rRes)
#print(xRange[:5],yRange[:5])
#gridX, gridY = np.meshgrid(xRange, yRange)
#grid_pcp = griddata(pts2, pts_val2, (gridX, gridY), method = 'linear')
#Extending the code
sns.set(font_scale = 1, style = "white")
lidar_chm = rxr.open_rasterio(r'new_ras.tiff', masked=True).squeeze()
# Define the colors you want
cmap = ListedColormap(["white", "lightskyblue","dodgerblue","mediumblue","lawngreen","limegreen", "forestgreen","darkgreen", "yellow", "orange","darkorange", "chocolate", "red", "maroon", "indianred","lightpink", "pink", "lightgray", "whitesmoke" ])
# Define a normalization from values -> colors
norm = colors.BoundaryNorm([0, 1, 5, 10, 15, 20, 25, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 150, 200, 250], 19)
fig, ax = plt.subplots(figsize=(9, 5))
chm_plot = ax.imshow(np.squeeze(r'new_ras.tiff'),cmap=cmap,norm=norm)
#print(chm_plot)
map_title = input ("Enter a title for this map (for ex. 72-hr accumulated forecast map):")
ax.set_title("Hydrologic Forecast Centre (MTI)\n" + map_title)
# Add a legend for labels
legend_labels = {"white": "0-1", "lightskyblue": "1-5","dodgerblue": "5-10","mediumblue": "10-15","lawngreen": "15-20","limegreen": "20-25", "forestgreen": "25-30","darkgreen": "30-40", "yellow": "40-50", "orange": "50-60","darkorange": "60-70", "chocolate": "70-80", "red": "80-90", "maroon": "90-100","indianred": "100-110", "lightpink": "110-120", "pink": "120-150", "lightgray": "150-200", "whitesmoke": "200-250"}
patches = [Patch(color=color, label=label) for color, label in legend_labels.items()]
ax.legend(handles=patches,bbox_to_anchor=(1.2, 1),facecolor="white")
ax.set_axis_off()
plt.show()
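One thing that may be worth checking: plt.savefig writes the rendered picture (pixel colors), not the rainfall values, so reading that file back cannot recover rainfall for recoloring, and np.squeeze(r'new_ras.tiff') passes a file name string rather than an array. A possible alternative, sketched below under those assumptions, is to apply the color boundaries directly when plotting the masked points. The random points, the shortened boundary list and the output file name are placeholders for illustration only.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import BoundaryNorm, ListedColormap

# Shortened, illustrative rainfall classes: one color per interval
bounds = [0, 1, 5, 10, 25, 50, 100, 250]
colours = ["white", "lightskyblue", "dodgerblue", "lawngreen",
           "yellow", "orange", "red"]
cmap = ListedColormap(colours)
norm = BoundaryNorm(bounds, cmap.N)   # maps a value to the index of its interval

# Placeholder points standing in for the masked Long / Lat / Hr72 columns
rng = np.random.default_rng(0)
lon = rng.uniform(-105, -95, 300)
lat = rng.uniform(49, 56, 300)
rain = rng.uniform(0, 250, 300)

fig, ax = plt.subplots(figsize=(9, 5))
sc = ax.scatter(lon, lat, c=rain, cmap=cmap, norm=norm, s=15)
fig.colorbar(sc, ax=ax, ticks=bounds, label="72-hr rainfall")
ax.set_axis_off()
fig.savefig("rainfall_classes.png")
```

The same cmap and norm pair can be passed to GeoDataFrame.plot or to imshow on an interpolated grid, as long as the array being plotted holds rainfall values.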

How to visualize a list of strings on a colorbar in matplotlib

I have a dataset like
x = 3,4,6,77,3
y = 8,5,2,5,5
labels = "null","exit","power","smile","null"
Then I use
from matplotlib import pyplot as plt
plt.scatter(x,y)
colorbar = plt.colorbar(labels)
plt.show()
to make a scatter plot, but I cannot get the colorbar to show the labels as its colors.
How can I do this?
I'm not sure, if it's a good idea to do that for scatter plots in general (you have the same description for different data points, maybe just use some legend here?), but I guess a specific solution to what you have in mind, might be the following:
from matplotlib import pyplot as plt
# Data
x = [3, 4, 6, 77, 3]
y = [8, 5, 2, 5, 5]
labels = ('null', 'exit', 'power', 'smile', 'null')
# Customize colormap and scatter plot
cm = plt.get_cmap('hsv')  # plt.cm.get_cmap is deprecated in newer Matplotlib releases
sc = plt.scatter(x, y, c=range(5), cmap=cm)
cbar = plt.colorbar(sc, ticks=range(5))
cbar.ax.set_yticklabels(labels)
plt.show()
This will result in such an output:
The code combines this Matplotlib demo and this SO answer.
Hope that helps!
EDIT: Incorporating the comments, I can only think of some kind of label color dictionary, generating a custom colormap from the colors, and before plotting explicitly grabbing the proper color indices from the labels.
Here's the updated code (I added some additional colors and data points to check scalability):
from matplotlib import pyplot as plt
from matplotlib.colors import LinearSegmentedColormap
import numpy as np
# Color information; create custom colormap
label_color_dict = {'null': '#FF0000',
                    'exit': '#00FF00',
                    'power': '#0000FF',
                    'smile': '#FF00FF',
                    'addon': '#AAAAAA',
                    'addon2': '#444444'}
all_labels = list(label_color_dict.keys())
all_colors = list(label_color_dict.values())
n_colors = len(all_colors)
cm = LinearSegmentedColormap.from_list('custom_colormap', all_colors, N=n_colors)
# Data
x = [3, 4, 6, 77, 3, 10, 40]
y = [8, 5, 2, 5, 5, 4, 7]
labels = ('null', 'exit', 'power', 'smile', 'null', 'addon', 'addon2')
# Get indices from color list for given labels
color_idx = [all_colors.index(label_color_dict[label]) for label in labels]
# Customize colorbar and plot
sc = plt.scatter(x, y, c=color_idx, cmap=cm)
c_ticks = np.arange(n_colors) * (n_colors / (n_colors + 1)) + (2 / n_colors)
cbar = plt.colorbar(sc, ticks=c_ticks)
cbar.ax.set_yticklabels(all_labels)
plt.show()
And, the new output:
Finding the correct middle point of each color segment is (still) not good, but I'll leave this optimization to you.
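One way to get the ticks onto the middle of each color segment, offered here as a sketch rather than as the original answer's method, is to swap LinearSegmentedColormap for a ListedColormap combined with a BoundaryNorm whose boundaries sit at half-integers. Each integer index is then the centre of its own segment. The output file name is a placeholder.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import BoundaryNorm, ListedColormap

label_color_dict = {'null': '#FF0000', 'exit': '#00FF00', 'power': '#0000FF',
                    'smile': '#FF00FF', 'addon': '#AAAAAA', 'addon2': '#444444'}
all_labels = list(label_color_dict)
n_colors = len(all_labels)

# One discrete color per label; boundaries at -0.5, 0.5, ..., n - 0.5
cmap = ListedColormap([label_color_dict[name] for name in all_labels])
norm = BoundaryNorm(np.arange(n_colors + 1) - 0.5, n_colors)

x = [3, 4, 6, 77, 3, 10, 40]
y = [8, 5, 2, 5, 5, 4, 7]
labels = ('null', 'exit', 'power', 'smile', 'null', 'addon', 'addon2')
color_idx = [all_labels.index(label) for label in labels]

fig, ax = plt.subplots()
sc = ax.scatter(x, y, c=color_idx, cmap=cmap, norm=norm)
# Integer ticks now fall in the centre of each color segment
cbar = fig.colorbar(sc, ax=ax, ticks=np.arange(n_colors))
cbar.ax.set_yticklabels(all_labels)
fig.savefig("label_colorbar.png")
```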

How to plot vertical lines in plotly offline?

How would one plot a vertical line in plotly offline, using python? I want to add lines at x=20, x=40, and x=60, all in the same plot.
def graph_contracts(self):
    trace1 = go.Scatter(
        x=np.array(range(len(all_prices))),
        y=np.array(all_prices), mode='markers', marker=dict(size=10, color='rgba(152, 0, 0, .8)'))
    data = [trace1]
    layout = go.Layout(title='Market Contracts by Period',
                       xaxis=dict(title='Contract #',
                                  titlefont=dict(family='Courier New, monospace', size=18, color='#7f7f7f')),
                       yaxis=dict(title='Prices ($)',
                                  titlefont=dict(family='Courier New, monospace', size=18, color='#7f7f7f')))
    fig = go.Figure(data=data, layout=layout)
    py.offline.plot(fig)
You can add lines via shape in layout, e.g.
import plotly
plotly.offline.init_notebook_mode()
import random
x=[i for i in range(100)]
trace = plotly.graph_objs.Scatter(x=x,
y=[random.random() for _ in x],
mode='markers')
shapes = list()
for i in (20, 40, 60):
    shapes.append({'type': 'line',
                   'xref': 'x',
                   'yref': 'y',
                   'x0': i,
                   'y0': 0,
                   'x1': i,
                   'y1': 1})
layout = plotly.graph_objs.Layout(shapes=shapes)
fig = plotly.graph_objs.Figure(data=[trace],
layout=layout)
plotly.offline.plot(fig)
would give you
Here is my example. The key statement is:
fig.add_trace(go.Scatter(x=[12, 12], y=[-300,300], mode="lines", name="SIGNAL"))
The most important argument is mode="lines".
Strictly speaking, this draws a line segment at x=12 (from y=-300 to y=300) rather than an unbounded vertical line.
Full example:
import pandas as pd
import plotly.graph_objects as go
import matplotlib.pyplot as plt
import numpy as np
import plotly.tools as tls
df1 = pd.read_csv('./jnjw_f8.csv')
layout = go.Layout(
xaxis = go.layout.XAxis(
tickmode = 'linear',
tick0 = 1,
dtick = 3
),
yaxis = go.layout.YAxis(
tickmode = 'linear',
tick0 = -100,
dtick = 3
))
fig = go.Figure(layout = layout)
fig.add_trace(go.Scatter(x = df1['x'], y =
df1['y1'],name='JNJW_sqrt'))
fig.add_trace(go.Scatter(x=[12, 12], y=[-300,300],
mode="lines", name="SIGNAL"))
fig.show()
Look here too.
how to plot a vertical line with plotly
A feature for vertical and horizontal lines is implemented with Plotly.py 4.12 (released 11/20). It works for plotly express and graph objects. See here: https://community.plotly.com/t/announcing-plotly-py-4-12-horizontal-and-vertical-lines-and-rectangles/46783
Simple example:
import plotly.express as px
df = px.data.stocks(indexed=True)
fig = px.line(df)
fig.add_vline(x='2018-09-24')
fig.show()
fig.add_vline(x=2.5, line_width=3, line_dash="dash", line_color="green")

Smooth curves in Python Plots [duplicate]

I've got the following simple script that plots a graph:
import matplotlib.pyplot as plt
import numpy as np
T = np.array([6, 7, 8, 9, 10, 11, 12])
power = np.array([1.53E+03, 5.92E+02, 2.04E+02, 7.24E+01, 2.72E+01, 1.10E+01, 4.70E+00])
plt.plot(T,power)
plt.show()
As it is now, the line goes straight from point to point, which looks OK but could be better in my opinion. What I want is to smooth the line between the points. In Gnuplot I would have plotted with smooth csplines.
Is there an easy way to do this in PyPlot? I've found some tutorials, but they all seem rather complex.
You could use scipy.interpolate.spline to smooth out your data yourself:
from scipy.interpolate import spline
# 300 represents number of points to make between T.min and T.max
xnew = np.linspace(T.min(), T.max(), 300)
power_smooth = spline(T, power, xnew)
plt.plot(xnew,power_smooth)
plt.show()
spline was deprecated in SciPy 0.19.0 and has since been removed, so the snippet above only runs on old SciPy versions; use the BSpline class instead.
Switching from spline to BSpline isn't a straightforward copy/paste and requires a little tweaking:
from scipy.interpolate import make_interp_spline, BSpline
# 300 represents number of points to make between T.min and T.max
xnew = np.linspace(T.min(), T.max(), 300)
spl = make_interp_spline(T, power, k=3) # type: BSpline
power_smooth = spl(xnew)
plt.plot(xnew, power_smooth)
plt.show()
Before:
After:
For this example spline works well, but if the function is not smooth inherently and you want to have smoothed version you can also try:
from scipy.ndimage import gaussian_filter1d  # the scipy.ndimage.filters namespace is deprecated
ysmoothed = gaussian_filter1d(y, sigma=2)
plt.plot(x, ysmoothed)
plt.show()
If you increase sigma you get a smoother curve.
Proceed with caution with this one: the smoothed curve no longer passes through the original data values, which may not be what you want.
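To make the effect of sigma concrete, here is a small self-contained sketch on made-up data (a noisy sine wave). The helper name `roughness` is just an ad hoc measure of how jagged a curve is: the mean absolute difference between neighbouring samples.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

# Made-up noisy signal for illustration
rng = np.random.default_rng(0)
x = np.linspace(0, 2 * np.pi, 200)
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)
y_original = y.copy()

y_smooth = gaussian_filter1d(y, sigma=2)     # mild smoothing
y_smoother = gaussian_filter1d(y, sigma=8)   # stronger smoothing

def roughness(values):
    """Mean absolute step between neighbouring samples."""
    return np.abs(np.diff(values)).mean()

print(roughness(y), roughness(y_smooth), roughness(y_smoother))
```

The filter returns a new array and leaves the input array untouched, so the raw data can still be plotted alongside the smoothed curve.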
See the scipy.interpolate documentation for some examples.
The following example demonstrates its use, for linear and cubic spline interpolation:
import matplotlib.pyplot as plt
import numpy as np
from scipy.interpolate import interp1d
# Define x, y, and xnew to resample at.
x = np.linspace(0, 10, num=11, endpoint=True)
y = np.cos(-x**2/9.0)
xnew = np.linspace(0, 10, num=41, endpoint=True)
# Define interpolators.
f_linear = interp1d(x, y)
f_cubic = interp1d(x, y, kind='cubic')
# Plot.
plt.plot(x, y, 'o', label='data')
plt.plot(xnew, f_linear(xnew), '-', label='linear')
plt.plot(xnew, f_cubic(xnew), '--', label='cubic')
plt.legend(loc='best')
plt.show()
Slightly modified for increased readability.
One of the easiest implementations I found was to use the exponential moving average that TensorBoard uses:
from typing import List
def smooth(scalars: List[float], weight: float) -> List[float]: # Weight between 0 and 1
    last = scalars[0] # First value in the plot (first timestep)
    smoothed = list()
    for point in scalars:
        smoothed_val = last * weight + (1 - weight) * point # Calculate smoothed value
        smoothed.append(smoothed_val) # Save it
        last = smoothed_val # Anchor the last smoothed value
    return smoothed
ax.plot(x_labels, smooth(train_data, .9), x_labels, train_data)
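For a feel of what the weight does, here is a tiny worked example of the same recurrence (each output is weight times the previous output plus (1 - weight) times the current point). The input values are arbitrary and the function is restated so the snippet runs on its own.

```python
from typing import List

def smooth(scalars: List[float], weight: float) -> List[float]:
    """Exponential moving average; a higher weight gives a smoother curve."""
    last = scalars[0]          # start from the first data point
    smoothed = []
    for point in scalars:
        last = last * weight + (1 - weight) * point
        smoothed.append(last)
    return smoothed

print(smooth([1.0, 2.0, 3.0], 0.5))   # [1.0, 1.5, 2.25]
print(smooth([1.0, 2.0, 3.0], 0.0))   # weight 0 returns the data unchanged
```

With weight 0.5 the third value is 0.5 * 1.5 + 0.5 * 3.0 = 2.25, which shows how the smoothed curve lags behind a rising signal.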
I presume you mean curve-fitting and not anti-aliasing from the context of your question. PyPlot doesn't have any built-in support for this, but you can easily implement some basic curve-fitting yourself, like the code seen here, or if you're using GuiQwt it has a curve fitting module. (You could probably also steal the code from SciPy to do this as well).
Here is a simple solution for dates:
from scipy.interpolate import make_interp_spline
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as dates
from datetime import datetime
data = {
datetime(2016, 9, 26, 0, 0): 26060, datetime(2016, 9, 27, 0, 0): 23243,
datetime(2016, 9, 28, 0, 0): 22534, datetime(2016, 9, 29, 0, 0): 22841,
datetime(2016, 9, 30, 0, 0): 22441, datetime(2016, 10, 1, 0, 0): 23248
}
#create data
date_np = np.array(list(data.keys()))
value_np = np.array(list(data.values()))
date_num = dates.date2num(date_np)
# smooth
date_num_smooth = np.linspace(date_num.min(), date_num.max(), 100)
spl = make_interp_spline(date_num, value_np, k=3)
value_np_smooth = spl(date_num_smooth)
# plot
plt.plot(date_np, value_np)
plt.plot(dates.num2date(date_num_smooth), value_np_smooth)
plt.show()
It's worth your time looking at seaborn for plotting smoothed lines.
The seaborn lmplot function will plot data and regression model fits.
The following illustrates both polynomial and lowess fits:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
T = np.array([6, 7, 8, 9, 10, 11, 12])
power = np.array([1.53E+03, 5.92E+02, 2.04E+02, 7.24E+01, 2.72E+01, 1.10E+01, 4.70E+00])
df = pd.DataFrame(data = {'T': T, 'power': power})
sns.lmplot(x='T', y='power', data=df, ci=None, order=4, truncate=False)
sns.lmplot(x='T', y='power', data=df, ci=None, lowess=True, truncate=False)
The order = 4 polynomial fit is overfitting this toy dataset. I don't show it here but order = 2 and order = 3 gave worse results.
The lowess = True fit is underfitting this tiny dataset but may give better results on larger datasets.
Check the seaborn regression tutorial for more examples.
Another way to go, which slightly modifies the function depending on the parameters you use:
from statsmodels.nonparametric.smoothers_lowess import lowess
def smoothing(x, y):
    lowess_frac = 0.15 # size of data (%) for estimation =~ smoothing window
    lowess_it = 0
    x_smooth = x
    y_smooth = lowess(y, x, is_sorted=False, frac=lowess_frac, it=lowess_it, return_sorted=False)
    return x_smooth, y_smooth
That was better suited than other answers for my specific application case.
