I have a pandas dataframe of size (1280,2). The head of the data looks as follows:
I'm using a clustering-based anomaly detection method built on k-means. It creates 'k' clusters of similar data points. Data points that fall outside of these groups are marked as anomalies.
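def getDistanceByPoint(data, model):
    # Distance from each point to the centroid of the cluster it was assigned to
    distances = []
    for i in range(0, len(data)):
        Xa = np.array(data.loc[i])
        # labels_ already holds 0-based cluster indices, so no "-1" offset is needed
        Xb = model.cluster_centers_[model.labels_[i]]
        distances.append(np.linalg.norm(Xa - Xb))
    # Series.set_value was removed from pandas, so build the Series in one go
    return pd.Series(distances)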
kmeans = KMeans(n_clusters=9).fit(data)
outliers_fraction = 0.01
distance = getDistanceByPoint(data, kmeans)
number_of_outliers = int(outliers_fraction*len(distance))
threshold = distance.nlargest(number_of_outliers).min()
# anomaly1 holds the anomaly flag (0: normal, 1: anomaly)
df['anomaly1'] = (distance >= threshold).astype(int)
I want to plot data frame with the x-axis as time elapsed and the y-axis as value. I would like to plot the normal data values in blue and the anomaly values in red. How could I plot this?
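This should do what you need. Remember to change time and value to your own column names; anomaly1 is the flag column created in your code.
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
# Rows flagged as anomalies
a = df.loc[df['anomaly1'] == 1, ['time', 'value']]
# Full series in blue, anomalies overlaid in red
ax.plot(df['time'], df['value'], color='blue')
ax.scatter(a['time'], a['value'], color='red')
plt.show()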
Check this notebook out for more information.
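The distance-to-centroid step from the question can also be written without an explicit loop. The following is a minimal sketch, assuming NumPy is available; the function name distance_to_assigned_center and the small arrays are hypothetical stand-ins for data.values, kmeans.cluster_centers_ and kmeans.labels_.

```python
import numpy as np

def distance_to_assigned_center(points, centers, labels):
    """Euclidean distance from each point to the centroid of its own cluster."""
    # centers[labels] picks, for every point, the centroid it was assigned to
    return np.linalg.norm(points - centers[labels], axis=1)

# Hypothetical stand-ins for data.values, kmeans.cluster_centers_, kmeans.labels_
points = np.array([[0.0, 0.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [10.0, 15.0]])
centers = np.array([[0.5, 0.0], [10.0, 12.0]])
labels = np.array([0, 0, 1, 1, 1])

distance = distance_to_assigned_center(points, centers, labels)

# Flag the most distant fraction of points, as in the question
outliers_fraction = 0.2
number_of_outliers = max(1, int(outliers_fraction * len(distance)))
threshold = np.sort(distance)[-number_of_outliers]
anomaly = (distance >= threshold).astype(int)  # 0: normal, 1: anomaly
```

The resulting anomaly array can be assigned to a data frame column and plotted exactly as in the answer above. The max(1, ...) guard is an addition for very small inputs, where the fraction would otherwise round down to zero outliers.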
I am trying to create custom cross sections of archived HRRR GRIB2 output data. I have been following the cross section example provided here, and I followed up on the issues I had with the file format itself on the Unidata site here. I have produced some cross section plots, such as the following, where the x-axis uses latitude and the y-axis uses isobaric pressure:
My goal is to output my plots with the x-axis showing distance from the start of my transect line to the end of it. This would help me determine the horizontal scale of certain near-surface meteorological features such as lake breezes and outflow. An example of what I am hoping to do is in the following image, where the x-axis indicates distance along the transect line in meters or kilometers instead of GPS coordinates:
How can I convert the coordinates to distances for my plot?
My Code:
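# Imports (not shown in the original snippet)
import pygrib
import xarray as xr
import numpy as np
import matplotlib.pyplot as plt
import metpy.calc as mpcalc
from metpy.interpolate import cross_section
from metpy.units import units
from pyproj import CRS
# Input points for the transect line as (latitude, longitude) pairs
startpoint = (42.857, -85.381)
endpoint = (43.907, -83.910)
# Open the GRIB2 file
file = 'file.grib2'
grib = pygrib.open(file)
# Use a GRIB message to get the projection information for the data set
msg = grib.message(1)
# Convert the GRIB file into an xarray Dataset and assign x and y coordinate values
# (HRRR uses a Lambert conformal conic projection by default)
ds = xr.open_dataset(file, engine="cfgrib",
                     backend_kwargs={'filter_by_keys': {'typeOfLevel': 'isobaricInhPa'}})
ds = ds.metpy.assign_crs(CRS(msg.projparams).to_cf()).metpy.assign_y_x()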
#metpy cross section function to create a transect line and create cross section.
cross = cross_section(ds, startpoint, endpoint).set_coords(('latitude', 'longitude'))
#create variables
temperature = cross['t']
pressure = cross['isobaricInhPa']
cross['Potential_temperature'] = mpcalc.potential_temperature(cross['isobaricInhPa'],cross['t'])
cross['u_wind'] = cross['u'].metpy.convert_units('knots')
cross['v_wind'] = cross['v'].metpy.convert_units('knots')
cross['t_wind'], cross['n_wind'] = mpcalc.cross_section_components(cross['u_wind'],cross['v_wind'])
cross['qv'] = cross['q'] *1000* units('g/kg')
#HRRR plot test
fig = plt.figure(1, figsize=(20,9))
ax = plt.axes()
#levels = np.linspace(325,365,50)
temp = ax.contourf(cross['latitude'], cross['isobaricInhPa'], cross['qv'], 100, cmap='rainbow')
clb = fig.colorbar(temp)
clb.set_label(r'g $\mathregular{kg^{-1}}$')
theta_contour = ax.contour(cross['latitude'], cross['isobaricInhPa'], cross['Potential_temperature'],
400, colors='k', linewidths=2)
theta_contour.clabel(theta_contour.levels[1::2], fontsize=8, colors='k', inline=1,
inline_spacing=8, fmt='%i', rightside_up=True, use_clabeltext=True)
ax.set_ylim(775,1000)
ax.invert_yaxis()
plt.title('HRRR contour fill of Mixing ratio(g/kg), contour of Potential Temperature (K),\n Tangential/Normal Winds (knots)')
plt.title('Run: '+date+'', loc='left', fontsize='small')
plt.title('Valid: '+date+' '+f_hr, loc='right', fontsize='small')
plt.xlabel('Latitude')
plt.ylabel('Pressure (hPa)')
wind_slc_vert = list(range(0, 19, 2)) + list(range(19, 29))
wind_slc_horz = slice(5, 100, 5)
ax.barbs(cross['latitude'][wind_slc_horz], cross['isobaricInhPa'][wind_slc_vert],
cross['t_wind'][wind_slc_vert, wind_slc_horz],
cross['n_wind'][wind_slc_vert, wind_slc_horz], color='k')
# Adjust y-axis to log scale
ax.set_yscale('symlog')
ax.set_yticklabels(np.arange(1000, 775,-100))
#ax.set_ylim(cross['isobaricInhPa'].max(), cross['isobaricInhPa'].min())
ax.set_yticks(np.arange(1000, 775,-100))
plt.show()
You should be able to do this using pyproj's Geod class, which is what MetPy uses under the hood for calculations such as lat_lon_grid_deltas. Something like this should work, I think:
import pyproj
geod = pyproj.Geod(ellps='sphere')
_, _, dist = geod.inv(cross['longitude'][0], cross['latitude'][0],
cross['longitude'], cross['latitude'])
I haven't actually tried that code, so you may need to convert the xarray DataArrays into plain NumPy arrays (for example with .values). The distances returned by geod.inv are in meters.
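To illustrate the idea without any geodesy dependency, here is a minimal sketch that swaps in the haversine formula on a spherical Earth in place of pyproj, using only the standard library. The helper names, the Earth radius constant and the five interpolated points are assumptions made for illustration; in the question's code the inputs would be the values of cross['latitude'] and cross['longitude'].

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; assumption of the spherical model

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def distances_along_transect(lats, lons):
    """Distance of every transect point from the first point, in km."""
    return [haversine_km(lats[0], lons[0], lat, lon)
            for lat, lon in zip(lats, lons)]

# Hypothetical stand-in for the cross-section coordinates: five points
# interpolated linearly between the start and end points from the question
start = (42.857, -85.381)
end = (43.907, -83.910)
n = 5
lats = [start[0] + (end[0] - start[0]) * i / (n - 1) for i in range(n)]
lons = [start[1] + (end[1] - start[1]) * i / (n - 1) for i in range(n)]

dist_km = distances_along_transect(lats, lons)
```

The resulting list could then be passed as the x argument to contourf, contour and barbs in place of cross['latitude'], with the x-axis label changed to a distance. For this transect the spherical result is on the order of 165 km end to end; an ellipsoidal model in pyproj would give a slightly different value.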
I am trying to fill the area between two vertical curves (RHOB and NPHI) using matplotlib.pyplot. RHOB and NPHI are plotted on x-axes with different scales.
But when I try to plot, fill_betweenx fills the area between RHOB and NPHI as if both were on the same scale.
#well_data is the data frame i am reading to get my data
#creating my subplot
fig, ax=plt.subplots(1,2,figsize=(8,6),sharey=True)
ax[0].get_xaxis().set_visible(False)
ax[0].invert_yaxis()
#subplot 1:
#ax01 to house the NPHI curve (NPHI curve are having values between 0-45)
ax01=ax[0].twiny()
ax01.set_xlim(-15,45)
ax01.invert_xaxis()
ax01.set_xlabel('NPHI',color='blue')
ax01.spines['top'].set_position(('outward',0))
ax01.tick_params(axis='x',colors='blue')
ax01.plot(well_data.NPHI,well_data.index,color='blue')
#ax02 to house the RHOB curve (RHOB curve having values between 1.95,2.95)
ax02=ax[0].twiny()
ax02.set_xlim(1.95,2.95)
ax02.set_xlabel('RHOB',color='red')
ax02.spines['top'].set_position(('outward',40))
ax02.tick_params(axis='x',colors='red')
ax02.plot(well_data.RHOB,well_data.index,color='red')
# ax03=ax[0].twiny()
# ax03.set_xlim(0,50)
# ax03.spines['top'].set_position(('outward',80))
# ax03.fill_betweenx(well_data.index,well_data.RHOB,well_data.NPHI,alpha=0.5)
plt.show()
ax03=ax[0].twiny()
ax03.set_xlim(0,50)
ax03.spines['top'].set_position(('outward',80))
ax03.fill_betweenx(well_data.index,well_data.RHOB,well_data.NPHI,alpha=0.5)
Above is the code that I tried, but the end result is not what I expected.
It fills the area between RHOB and NPHI as if RHOB and NPHI were on the same scale.
How can I fill the area between the blue and the red curve?
Since the data are on two different axes, but each artist needs to be on one axes alone, this is hard. What would need to be done here is to calculate all data in a single unit system. You might opt to transform both datasets to display-space first (meaning pixels), then plot those transformed data via fill_betweenx without transforming again (transform=None).
import numpy as np
import matplotlib.pyplot as plt
y = np.linspace(0, 22, 101)
x1 = np.sin(y)/2
x2 = np.cos(y/2)+20
fig, ax1 = plt.subplots()
ax2 = ax1.twiny()
ax1.tick_params(axis="x", colors="C0", labelcolor="C0")
ax2.tick_params(axis="x", colors="C1", labelcolor="C1")
ax1.set_xlim(-1,3)
ax2.set_xlim(15,22)
ax1.plot(x1,y, color="C0")
ax2.plot(x2,y, color="C1")
x1p, yp = ax1.transData.transform(np.c_[x1,y]).T
x2p, _ = ax2.transData.transform(np.c_[x2,y]).T
ax1.autoscale(False)
ax1.fill_betweenx(yp, x1p, x2p, color="C9", alpha=0.4, transform=None)
plt.show()
We might equally opt to transform the data from the second axes to the first. This has the advantage that it's not defined in pixel space and hence circumvents a problem that occurs when the figure size is changed after the figure is created.
x2p, _ = (ax2.transData + ax1.transData.inverted()).transform(np.c_[x2,y]).T
ax1.autoscale(False)
ax1.fill_betweenx(y, x1, x2p, color="grey", alpha=0.4)
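For linear axes, the second approach amounts to a linear rescaling from one axis range to the other, which can also be done by hand. A minimal sketch, assuming linear (non-log) axes and using the axis limits from the question; the function name rescale and the three sample values are hypothetical.

```python
def rescale(values, src_lim, dst_lim):
    """Linearly map values from the src_lim axis range onto the dst_lim range."""
    src_lo, src_hi = src_lim
    dst_lo, dst_hi = dst_lim
    scale = (dst_hi - dst_lo) / (src_hi - src_lo)
    return [dst_lo + (v - src_lo) * scale for v in values]

# Axis limits from the question. The NPHI axis is inverted, so it runs
# from 45 on the left to -15 on the right.
rhob_lim = (1.95, 2.95)
nphi_lim = (45, -15)

rhob = [1.95, 2.45, 2.95]  # hypothetical sample values
rhob_on_nphi_axis = rescale(rhob, rhob_lim, nphi_lim)  # approximately [45, 15, -15]
```

With the RHOB values expressed in NPHI axis units, both curves live in one coordinate system, so a single fill_betweenx call on the NPHI axes (ax01 in the question) can shade the region between them. The mapping has to be recomputed if either axis limit changes.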
I have plotted a box-and-whisker plot of my data.
My Code:
red_diamond = dict(markerfacecolor='r', marker='D')
fig3, ax3 = plt.subplots()
ax3.set_title('Changed Outlier Symbols')
ax3.boxplot(maximum.values[:,1], flierprops=red_diamond)
and I obtained a plot as follows:
What I want to do: Print the values of the whiskers, the outliers (red diamonds), the quartiles and the median on the plot itself.
ax.boxplot returns a dictionary with all the lines that are plotted in the making of the box and whisker plot. One option would be to interrogate this dictionary, and create labels from the information it contains. The relevant keys are:
boxes for the IQR
medians for the median
caps for the whiskers
fliers for the outliers
Note, the function below only really works for a single boxplot (if you create multiple boxes in one go, you will need to be more careful how you grab the information from the dictionary).
An alternative would be to find the information from the data array itself (finding the median and IQR is easy). I'm not sure exactly how matplotlib determines what a flier is and where the caps should go. If you want to do that, it should be easy enough to modify the function below.
import matplotlib.pyplot as plt
import numpy as np
# Make some dummy data
np.random.seed(1)
dummy_data = np.random.lognormal(size=40)
def make_labels(ax, boxplot):
    # Grab the relevant Line2D instances from the boxplot dictionary
    iqr = boxplot['boxes'][0]
    caps = boxplot['caps']
    med = boxplot['medians'][0]
    fly = boxplot['fliers'][0]
    # The x position of the median line
    xpos = med.get_xdata()
    # Let's make the text have a horizontal offset which is some
    # fraction of the width of the box
    xoff = 0.10 * (xpos[1] - xpos[0])
    # The x position of the labels
    xlabel = xpos[1] + xoff
    # The median is the y-position of the median line
    median = med.get_ydata()[1]
    # The 25th and 75th percentiles are found from the
    # top and bottom (max and min) of the box
    pc25 = iqr.get_ydata().min()
    pc75 = iqr.get_ydata().max()
    # The caps give the vertical position of the ends of the whiskers
    capbottom = caps[0].get_ydata()[0]
    captop = caps[1].get_ydata()[0]
    # Make some labels on the figure using the values derived above
    ax.text(xlabel, median,
            'Median = {:6.3g}'.format(median), va='center')
    ax.text(xlabel, pc25,
            '25th percentile = {:6.3g}'.format(pc25), va='center')
    ax.text(xlabel, pc75,
            '75th percentile = {:6.3g}'.format(pc75), va='center')
    ax.text(xlabel, capbottom,
            'Bottom cap = {:6.3g}'.format(capbottom), va='center')
    ax.text(xlabel, captop,
            'Top cap = {:6.3g}'.format(captop), va='center')
    # Many fliers, so we loop over them and create a label for each one
    for flier in fly.get_ydata():
        ax.text(1 + xoff, flier,
                'Flier = {:6.3g}'.format(flier), va='center')
# Make the figure
red_diamond = dict(markerfacecolor='r', marker='D')
fig3, ax3 = plt.subplots()
ax3.set_title('Changed Outlier Symbols')
# Create the boxplot and store the resulting python dictionary
my_boxes = ax3.boxplot(dummy_data, flierprops=red_diamond)
# Call the function to make labels
make_labels(ax3, my_boxes)
plt.show()
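The alternative mentioned above, deriving the numbers from the data rather than from the plotted lines, could be sketched as follows. This assumes matplotlib's default setting whis=1.5, under which the whiskers reach the most extreme data points lying within 1.5 times the IQR of the box and everything beyond them is drawn as a flier. The function names and the sample data are hypothetical, and the percentile helper uses linear interpolation, which is also NumPy's default.

```python
def percentile(sorted_vals, q):
    """Percentile with linear interpolation between neighbouring values."""
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)

def box_stats(data, whis=1.5):
    """Quartiles, whisker ends and fliers for a single box."""
    vals = sorted(data)
    q1, med, q3 = (percentile(vals, q) for q in (25, 50, 75))
    iqr = q3 - q1
    low_fence, high_fence = q1 - whis * iqr, q3 + whis * iqr
    # Whiskers stop at the last data point inside each fence
    inside = [v for v in vals if low_fence <= v <= high_fence]
    return {
        "q1": q1, "median": med, "q3": q3,
        "whisker_low": min(inside), "whisker_high": max(inside),
        "fliers": [v for v in vals if v < low_fence or v > high_fence],
    }

stats = box_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])  # hypothetical data
```

The values in stats can then be written onto the axes with ax.text in the same way as in make_labels. If the boxplot is created with a different whis argument, the same value has to be passed to box_stats.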
I am generating a PCA plot using scikit-learn, NumPy and matplotlib. I want to know how to label each point (each row in my data). I found "annotate" in matplotlib, but this seems to be for labeling specific coordinates, or for putting text on arbitrary points in the order they appear. I'm trying to generalize this but am struggling because of the PCA steps that come before the matplotlib code. Is there a way I can do this with sklearn while I'm still generating the plot, so that each point keeps its connection to the row it came from?
Here's my code:
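# Create a randomized PCA model that takes two components
# (RandomizedPCA was removed from scikit-learn; PCA with svd_solver='randomized' replaces it)
randomized_pca = decomposition.PCA(n_components=2, svd_solver='randomized')
# Fit and transform the data to the model
reduced_data_rpca = randomized_pca.fit_transform(x)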
# Create a regular PCA model
pca = decomposition.PCA(n_components=2)
# Fit and transform the data to the model
reduced_data_pca = pca.fit_transform(x)
# Inspect the shape
reduced_data_pca.shape
# Print out the data
print(reduced_data_rpca)
print(reduced_data_pca)
def rand_jitter(arr):
    stdev = .01*(max(arr)-min(arr))
    return arr + np.random.randn(len(arr)) * stdev
colors = ['red', 'blue']
for i in range(len(colors)):
    w = reduced_data_pca[:, 0][y == i]
    z = reduced_data_pca[:, 1][y == i]
    plt.scatter(w, z, c=colors[i])
targ_names = ["Negative", "Positive"]
plt.legend(targ_names, bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.title("PCA Scatter Plot")
plt.show()
PCA is a projection, not a clustering (you tagged this as clustering).
There is no concept of a label in PCA.
You can draw text onto a scatter plot, but it usually becomes too crowded. You can find answers to this on Stack Overflow already.
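Because fit_transform returns one output row per input row, in the same order, row i of the projected array still corresponds to row i of the input, so labels can simply be zipped with the projected points. A minimal sketch of drawing such text with matplotlib's annotate; the points and the row labels are hypothetical stand-ins for reduced_data_pca and for whatever identifies each row of the data.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Hypothetical stand-ins: projected points (as returned by pca.fit_transform)
# and one label per original row, in the same order
reduced = [(-1.2, 0.4), (0.3, -0.9), (1.5, 0.2)]
row_labels = ["sample_a", "sample_b", "sample_c"]

fig, ax = plt.subplots()
ax.scatter([p[0] for p in reduced], [p[1] for p in reduced])

# Label every point, offset by a few typographic points so that the
# text does not sit on top of the marker
annotations = [
    ax.annotate(label, xy=point, xytext=(4, 4), textcoords="offset points")
    for label, point in zip(row_labels, reduced)
]

ax.set_xlabel("First Principal Component")
ax.set_ylabel("Second Principal Component")
```

Removing the matplotlib.use line and calling plt.show() displays the figure interactively. With many rows the labels will overlap, as noted above, so labeling only a subset of points may be more readable.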
I want to calculate and plot the cumulative distribution function (CDF) of a given sample, new_dO18, and then overlay the CDF of a normal distribution with a given mean and standard deviation on the same plot. I am having problems normalizing the CDF. I should have values ranging between 0 and 1 on the x axis. Can someone guide me as to where I went wrong? I'm sure it's a simple fix, but I'm very new to Python. I've included my steps so far. Thanks!
# Use np.histogram to get counts in each bin. See the help page or
# documentation on how to use this function, and what it returns.
# normalize the data new_dO18 using a for loop
norm_newdO18 = []
for element in new_dO18:
    x = element
    y = (x - np.mean(new_dO18))/np.std(new_dO18)
    norm_newdO18.append(y)
print('normalized dO18 values, excluding outliers:', norm_newdO18)
print()
# Use the histogram function to bin the data
num_bins = 20
# Raw counts per bin (the old normed argument has been removed from NumPy)
counts, bin_edges = np.histogram(norm_newdO18, bins=num_bins)
# Calculate and plot CDF of sample
cdf = np.cumsum(counts)
scale = 1.0/cdf[-1]
norm_cdf = scale * cdf
plt.plot(bin_edges[1:], norm_cdf, label = 'dO18 values')
plt.legend(bbox_to_anchor=(0, 1), loc='upper left', ncol=1)
plt.xlabel('normalized dO18 data')
plt.ylabel('frequency')
# Calculate and overlay the CDF of a normal distribution with sample mean and std
# as parameters.
# specific normally distributed function with mean and st. dev
mu, sigma = np.mean(new_dO18), np.std(new_dO18)
norm_theoretical = np.random.normal(mu, sigma, 1000)
# Calculate and plot CDF of theoretical sample
counts1, bin_edges1 = np.histogram(norm_theoretical, bins=20)
cdft= np.cumsum(counts1)
scale = 1.0/cdft[-1]
norm_cdft = scale * cdf
plt.plot(bin_edges[1:], norm_cdft, label = 'theoretical values')
plt.legend(bbox_to_anchor=(0, 1), loc='upper left', ncol=1)
plt.show()
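A CDF takes values between 0 and 1 on the y-axis, while the x-axis carries the (standardized) data values. In the code above, the theoretical curve appears to reuse cdf and bin_edges from the sample instead of cdft and bin_edges1, and the theoretical draw uses the mean and standard deviation of the unstandardized data, so the two curves would not be comparable. The following is a minimal sketch of one way around both issues, using only the standard library and skipping the histogram step. The function names and the sample values are hypothetical, and the normal CDF is evaluated analytically with math.erf rather than estimated from random draws.

```python
import math

def standardize(sample):
    """Subtract the mean and divide by the population standard deviation."""
    n = len(sample)
    mu = sum(sample) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in sample) / n)
    return [(v - mu) / sigma for v in sample]

def empirical_cdf(sample):
    """Sorted values and the fraction of the sample at or below each one."""
    xs = sorted(sample)
    n = len(xs)
    return xs, [(i + 1) / n for i in range(n)]

def normal_cdf(x, mu=0.0, sigma=1.0):
    """CDF of a normal distribution, evaluated analytically."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

sample = [-1.2, 0.3, 0.8, -0.4, 1.9, 0.1]  # hypothetical stand-in for new_dO18
xs, ps = empirical_cdf(standardize(sample))

# After standardizing, the matching theoretical curve is the standard normal
theory = [normal_cdf(x) for x in xs]
```

Plotting ps and theory against xs puts both curves on the same axes, with the y-axis running from 0 to 1. To compare on the original scale instead, the sample could be left unstandardized and normal_cdf called with the sample mean and standard deviation.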