Capping the outliers - python-3.x

I have a data frame with 3 numerical variables for which I am trying to cap the outliers between the 0.01 and 0.99 quantiles, but it's not working.
df[['TotalVisits', 'Total Time Spent on Website',
'Page Views Per Visit']].describe(percentiles=[.25, .5, .75, .90, .95, .99])
This is the output
Now I tried to cap the outliers as this:
q_l = df['TotalVisits'].quantile(0.00)
q_h = df['TotalVisits'].quantile(0.99)
df['TotalVisits'][df['TotalVisits']<= q_l] = q_l
df['TotalVisits'][df['TotalVisits']>= q_h] = q_h
But the output remains the same; the max value does not change to 17.

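You are assigning through chained indexing (df['TotalVisits'][mask] = ...), which can write to a temporary copy and leave df unchanged.
Fix your code by doing the selection and assignment in a single .loc call: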
q_l = df['TotalVisits'].quantile(0.00)
q_h = df['TotalVisits'].quantile(0.99)
df.loc[df['TotalVisits']<= q_l, 'TotalVisits'] = q_l
df.loc[df['TotalVisits']>= q_h, 'TotalVisits'] = q_h
You can also simplify it with the pandas clip method:
df['TotalVisits'] = df['TotalVisits'].clip(lower = q_l, upper = q_h)
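A minimal sketch of the clip approach on made-up data (the values below are hypothetical, not the asker's data, and the 0.01/0.99 quantiles are taken from the question text), showing that the minimum and maximum end up at the chosen quantiles:

```python
import pandas as pd

# Hypothetical data: 99 ordinary values plus one extreme outlier.
df = pd.DataFrame({"TotalVisits": list(range(1, 100)) + [1000]})

q_l = df["TotalVisits"].quantile(0.01)
q_h = df["TotalVisits"].quantile(0.99)

# clip returns a new Series, so assign it back to the column.
df["TotalVisits"] = df["TotalVisits"].clip(lower=q_l, upper=q_h)

# After clipping, the extremes coincide with the quantile bounds.
print(df["TotalVisits"].min() == q_l)
print(df["TotalVisits"].max() == q_h)
```

Because the whole column is replaced in one assignment, this avoids the chained-indexing problem entirely.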

Related

How to find the shortest distance between two line segments capturing the sign values with python

I have a pandas dataframe of the form:
benchmark_x benchmark_y ref_point_x ref_point_y
0 525039.140 175445.518 525039.145 175445.539
1 525039.022 175445.542 525039.032 175445.568
2 525038.944 175445.558 525038.954 175445.588
3 525038.855 175445.576 525038.859 175445.576
4 525038.797 175445.587 525038.794 175445.559
5 525038.689 175445.609 525038.679 175445.551
6 525038.551 175445.637 525038.544 175445.577
7 525038.473 175445.653 525038.459 175445.594
8 525038.385 175445.670 525038.374 175445.610
9 525038.306 175445.686 525038.289 175445.626
I am trying to find the shortest distance from the line to the benchmark such that if the line is above the benchmark the distance is positive and if it is below the benchmark the distance is negative. See image below:
I used the KDTree from scipy like so:
from scipy.spatial import KDTree
tree=KDTree(df[["benchmark_x", "benchmark_y"]])
test = df.apply(lambda row: tree.query(row[["ref_point_x", "ref_point_y"]]), axis=1)
test=test.apply(pd.Series, index=["distance", "index"])
This seems to work, except that it fails to produce negative values where the line is below the benchmark.
import numpy as np
import pandas as pd
# recreating your example
columns = "benchmark_x benchmark_y ref_point_x ref_point_y".split(" ")
data = """525039.140 175445.518 525039.145 175445.539
525039.022 175445.542 525039.032 175445.568
525038.944 175445.558 525038.954 175445.588
525038.855 175445.576 525038.859 175445.576
525038.797 175445.587 525038.794 175445.559
525038.689 175445.609 525038.679 175445.551
525038.551 175445.637 525038.544 175445.577
525038.473 175445.653 525038.459 175445.594
525038.385 175445.670 525038.374 175445.610
525038.306 175445.686 525038.289 175445.626"""
data = [float(x) for x in data.replace("\n"," ").split(" ") if len(x)>0]
arr = np.array(data).reshape(-1,4)
df = pd.DataFrame(arr, columns=columns)
# adding your two new columns to the df
from scipy.spatial import KDTree
tree=KDTree(df[["benchmark_x", "benchmark_y"]])
df["distance"], df["index"] = tree.query(df[["ref_point_x", "ref_point_y"]])
Now to compare if one line is above the other or not, we have to evaluate y at the same x position. Therefore we need to interpolate the y points for the x positions of the other line.
df = df.sort_values("ref_point_x") # sorting is required for interpolation
xy_refpoint = df[["ref_point_x", "ref_point_y"]].values
df["ref_point_y_at_benchmark_x"] = np.interp(df["benchmark_x"], xy_refpoint[:,0], xy_refpoint[:,1])
And finally your criterion can be evaluated and applied:
df["distance"] = np.where(df["ref_point_y_at_benchmark_x"] < df["benchmark_y"], -df["distance"], df["distance"])
# or change the < to >, <= or >= as you wish
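The interpolate-then-compare idea can be illustrated on a tiny example. This is only a sketch: the arrays are hypothetical, and the vertical gap is used as a stand-in for the KDTree distance so that the block has no SciPy dependency.

```python
import numpy as np

# Hypothetical polylines; ref_x must be increasing for np.interp.
bench_x = np.array([0.5, 1.25, 2.5])
bench_y = np.array([0.0, 0.0, 0.0])
ref_x = np.array([0.0, 1.0, 2.0, 3.0])
ref_y = np.array([1.0, 1.0, -1.0, -1.0])

# Evaluate the reference line at the benchmark x positions.
ref_y_at_bench_x = np.interp(bench_x, ref_x, ref_y)

# Stand-in for the unsigned nearest-neighbour distance (assumption:
# the vertical gap is used here instead of the KDTree result).
unsigned = np.abs(ref_y_at_bench_x - bench_y)

# Positive where the reference line is above the benchmark, negative below.
signed = np.where(ref_y_at_bench_x < bench_y, -unsigned, unsigned)
print(signed)
```

The first two benchmark points lie under the reference line and get positive values; the third lies above it and gets a negative value.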

Multiple time frame moving average showing consistent colors

The goal is to have different time frames of the same MA period colored based on the criteria below. For example, I want the MA from the 5min chart to be plotted with the proper colors regardless of what happens.
The problem is that the script splits a higher time frame into my current one: a 5min bar is split into five 1min bars, and the colors of these 1min bars do not match the 5min TF color when plotted. In the attached picture, the circled squares are at regular intervals, but there are grey squares in between that I want to either color green or, ideally, get rid of. I'm on a 1min time frame and the circles are from the 5min TF.
Here is the code for the 1min; the higher TFs replace '1' with '5' where needed.
//overlay is false
timeframe1 = input(1, "Timeframe 1", minval=1)
calculateBg(timeframe) =>
sma = security(syminfo.tickerid, tostring(timeframe), sma(close, 20))
uptrend = sma > sma[1] and close > sma
downtrend = sma < sma[1] and close < sma
downtrend ? color.red : uptrend ? color.green : color.new(#B2B5BE, 50)
plotshape(1.00, "1", shape.square, location.absolute, color=calculateBg(timeframe1), size=size.tiny, transp=0)
Your calculations must be done in security()'s HTF context. The default HTF was changed to 5min:
//@version=4
study("")
timeframe1 = input(5, "Timeframe 1", minval=1)
calculateBg(timeframe) =>
sma = sma(close, 20)
uptrend = sma > sma[1] and close > sma
downtrend = sma < sma[1] and close < sma
[up, dn] = security(syminfo.tickerid, tostring(timeframe), [uptrend, downtrend])
dn ? color.red : up ? color.green : color.new(#B2B5BE, 50)
plotshape(1.00, "1", shape.square, location.absolute, color=calculateBg(timeframe1), size=size.tiny, transp=0)

Obtaining hyperpolarization depth from electrophysiological graph

I am working on electrophysiological data which is in .abf format.
I want to obtain the hyperpolarization depth as indicated above in the figure. This is what I have done so far:
import matplotlib.pyplot as plt
import pyabf
import pandas as pd
abf = pyabf.ABF("test.abf")
abf.setSweep(10) # I can access a given sweep. Here sweep 10
df = pd.DataFrame({'time': abf.sweepX, 'current':abf.sweepY})
df1 = df.loc[15650:15800]
df1.plot(x='time', y='current')
I am thinking of using the change in the derivative to find the first point of interest (x1,y1) and then the lower point (x2,y2), but it looks complex. I would appreciate it if someone could give a hint or a procedure.
The dataset is as follows:
time current
0.7825 -63.323975
0.78255 -63.171387
0.7826 -62.89673
0.78265 -62.713623
0.7827 -62.469482
0.78275 -62.37793
0.7828 -62.10327
0.78285 -61.950684
0.7829 -61.76758
0.78295 -61.584473
0.783 -61.401367
0.78305 -61.24878
0.7831 -61.035156
0.78315 -60.85205
0.7832 -60.72998
0.78325 -60.516357
0.7833 -60.455322
0.78335 -60.2417
0.7834 -60.08911
0.78345 -59.96704
0.7835 -59.814453
0.78355 -59.661865
0.7836 -59.509277
0.78365 -59.417725
0.7837 -59.23462
0.78375 -59.11255
0.7838 -58.95996
0.78385 -58.86841
0.7839 -58.685303
0.78395 -58.59375
0.784 -58.441162
0.78405 -58.34961
0.7841 -58.19702
0.78415 -58.044434
0.7842 -57.922363
0.78425 -57.769775
0.7843 -57.678223
0.78435 -57.434082
0.7844 -57.34253
0.78445 -56.9458
0.7845 -56.274414
0.78455 -54.96216
0.7846 -53.253174
0.78465 -51.208496
0.7847 -48.950195
0.78475 -46.325684
0.7848 -43.09082
0.78485 -38.42163
0.7849 -31.036377
0.78495 -22.033691
0.785 -13.397217
0.78505 -6.072998
0.7851 -0.61035156
0.78515 2.7160645
0.7852 3.9367676
0.78525 3.4179688
0.7853 1.3427734
0.78535 -1.4953613
0.7854 -5.0964355
0.78545 -9.185791
0.7855 -13.641357
0.78555 -18.249512
0.7856 -23.132324
0.78565 -27.98462
0.7857 -32.714844
0.78575 -37.261963
0.7858 -41.47339
0.78585 -45.22705
0.7859 -48.553467
0.78595 -51.54419
0.786 -53.985596
0.78605 -56.18286
0.7861 -58.013916
0.78615 -59.539795
0.7862 -60.760498
0.78625 -61.88965
0.7863 -62.652588
0.78635 -63.323975
0.7864 -63.934326
0.78645 -64.2395
0.7865 -64.60571
0.78655 -64.78882
0.7866 -65.00244
0.78665 -64.971924
0.7867 -65.093994
0.78675 -65.03296
0.7868 -64.971924
0.78685 -64.819336
0.7869 -64.78882
0.78695 -64.66675
0.787 -64.48364
0.78705 -64.42261
0.7871 -64.2395
0.78715 -64.11743
0.7872 -63.964844
0.78725 -63.842773
0.7873 -63.659668
0.78735 -63.568115
0.7874 -63.446045
0.78745 -63.26294
0.7875 -63.171387
0.78755 -62.98828
0.7876 -62.89673
0.78765 -62.74414
0.7877 -62.713623
0.78775 -62.530518
0.7878 -62.438965
0.78785 -62.37793
0.7879 -62.25586
0.78795 -62.164307
0.788 -62.042236
0.78805 -62.01172
0.7881 -61.88965
0.78815 -61.88965
0.7882 -61.73706
0.78825 -61.706543
0.7883 -61.645508
0.78835 -61.61499
0.7884 -61.523438
0.78845 -61.462402
0.7885 -61.431885
0.78855 -61.340332
0.7886 -61.37085
0.78865 -61.279297
0.7887 -61.279297
0.78875 -61.157227
0.7888 -61.187744
0.78885 -61.09619
0.7889 -61.157227
0.78895 -61.12671
0.789 -61.09619
0.78905 -61.12671
0.7891 -61.00464
0.78915 -61.00464
0.7892 -60.97412
0.78925 -60.97412
0.7893 -60.943604
0.78935 -61.00464
0.7894 -60.913086
0.78945 -60.97412
0.7895 -60.943604
0.78955 -60.913086
0.7896 -60.943604
0.78965 -60.85205
0.7897 -60.85205
0.78975 -60.821533
0.7898 -60.88257
0.78985 -60.88257
0.7899 -60.913086
0.78995 -60.88257
0.79 -60.913086
We can plot the difference in current between consecutive points (which is, up to a constant factor, the derivative, since the times are evenly spaced). The first chart shows the actual diffs. Based on this we can set a threshold, such as 0.3, and apply it to filter the main DataFrame. The filtered values are shown in orange on the second chart:
fig, ax = plt.subplots(2, figsize=(8,8))
# plot derivative
df['current'].diff().plot(ax=ax[0])
# current (assumes df is indexed by time, e.g. df = df.set_index('time'))
threshold = 0.3
df['filtered'] = df.loc[df['current'].diff().abs() > threshold, 'current']
df.plot(ax=ax[1])
# add spans
x = df['filtered'].dropna()
ax[1].axhspan(x.iloc[0], x.iloc[-1], alpha=0.3, edgecolor='skyblue', facecolor="none", hatch='////')
ax[1].axvspan(x.index.min(), x.index.max(), alpha=0.3, edgecolor='orange', facecolor="none", hatch='\\\\')
Output:
If you're interested in the range, you can drop the NaN values in the filtered subset and take the min and max of the index:
print('min', df['filtered'].dropna().index.min())
print('max', df['filtered'].dropna().index.max())
Output:
min 0.78445
max 0.7865
For the value of the gap you can use:
abs(df['filtered'].dropna().iloc[-1] - df['filtered'].dropna().iloc[0])
Output:
7.6599100000000035
Note: We can alternatively also get left edges of these spans as points where diff in the point is lower than the threshold and diff in the next point is higher than the threshold, and similarly for the right edges. This would also work in case we have multiple peaks:
threshold = 0.3
x = df['current'].diff().abs()
spanA = df.loc[(x < threshold) & (x.shift(-1) >= threshold)]
spanB = df.loc[(x >= threshold) & (x.shift(-1) < threshold)]
print(spanA)
current
time
0.7844 -57.34253
print(spanB)
current
time
0.7865 -64.60571
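To illustrate the multiple-peak case, here is a minimal sketch on a made-up trace with two spikes (the values are hypothetical, and a plain integer index is used instead of time). It applies the same left-edge and right-edge conditions and pairs them up:

```python
import pandas as pd

# Hypothetical trace: two spikes on a flat baseline.
current = pd.Series(
    [0, 0, 0, 5, 9, 5, 0, 0, 0, 0, 6, 10, 6, 0, 0, 0],
    dtype=float,
)

threshold = 0.3
x = current.diff().abs()

# Left edge: small change at this sample, large change at the next one.
left = current[(x < threshold) & (x.shift(-1) >= threshold)].index.tolist()
# Right edge: large change at this sample, small change at the next one.
right = current[(x >= threshold) & (x.shift(-1) < threshold)].index.tolist()

# Pair each left edge with the corresponding right edge.
spans = list(zip(left, right))
print(spans)
```

Pairing with zip assumes every spike starts and ends inside the recording, so that the two lists have the same length.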

Python 3D image segmentation find local peaks in distance map for watershed

I am trying to segment 3D tomographs of porous networks in Python. I am able to calculate the distance map with ndimage.distance_transform_edt and the peaks with feature.peak_local_max. When I apply the watershed algorithm I get an acceptable result, but the markers of the peaks are not located at the visible peaks of the distance map (see image).
Thanks in advance.
Here is the code; a is the image:
D = ndimage.distance_transform_edt(a)
localMax = feature.peak_local_max(D, indices=False, min_distance=50,
labels=a)
localMax2 = feature.peak_local_max(D, indices=True, min_distance=50,
labels=a)
markers = ndimage.label(localMax, structure=np.ones((3,3,3)))[0]
labels = morphology.watershed(-D,markers,mask=a)
I found a way: I had to exclude the borders and apply a threshold.
D = ndimage.distance_transform_edt(a)
localMax = feature.peak_local_max(D, indices=False, min_distance=30,
labels=a,threshold_abs=9,exclude_border=1)
localMax2 = feature.peak_local_max(D, indices=True, min_distance=30,
labels=a,threshold_abs=9,exclude_border=1)
#markers = ndimage.label(localMax, structure=np.ones((3,3,3)))[0]
markers = ndimage.label(localMax, structure=np.ones((3,3,3)))[0]
labels = morphology.watershed(-D,markers,mask=a)
regions=measure.regionprops(labels,intensity_image=a)

how to calculate the feature vectors approximately for storing in excel sheet?

In my GUI I am using this MATLAB code to store the values in an Excel sheet. The code calculates six GLCM features.
function [Contrast,cor,ener,homo,Var,Entropy] = glcm_feature_extraction(I1)
Contrast = graycoprops(graycomatrix(rgb2gray(I1)),'Contrast')
cor= graycoprops(graycomatrix(rgb2gray(I1)), 'Correlation')
ener = graycoprops(graycomatrix(rgb2gray(I1)), 'Energy')
homo = graycoprops(graycomatrix(rgb2gray(I1)), 'Homogeneity')
img = double(I1);
Var = var((img(:)))
Entropy=entropy(I1)
Here suppose I get these values after calculation:
[0.603606103 : 0.785092239 : 0.271728411 : 0.855429408 :1889.578963 : 7.178149206]
But I need only approximate values, like:
[0.6 : 0.7 : .2 ....]
How to modify this code to get this result?
For example, consider Contrast = 0.603606103.
To approximate it as 0.6, use the following:
sprintf('%.1f',Contrast);
This returns the character string '0.6', that is, the value rounded to one decimal place.
Do the same for the remaining 5 variables.
