Robust statistics linear regression in seaborn pairplot

Trying to implement robust statistics instead of ordinary least squares (OLS) fitting so that outliers aren't such a problem for my fits.
I was hoping to implement this in the pairplot function of seaborn, but I can't see an easy way to add it from the API documentation, as there doesn't seem to be a keyword argument for the fit.
The scipy lectures suggest using the following, but I guess that's for regplot, where you can define the fit using
`fit = statsmodels.formula.api.rlm()`
Here is some sample code:
import seaborn as sns; sns.set(style="ticks", color_codes=True)
import matplotlib.pyplot as plt
%matplotlib inline
iris = sns.load_dataset("iris")
sns.pairplot(iris, kind="reg")#, robust = True)
plt.show()
Thanks in advance!
Edit: I found a workaround, but apparently I lose the 'hue' functionality that pairplot provides. It would be a nice feature to add a robust option to pairplot.
Code:
from scipy import stats

def corrfunc(x, y, **kws):
    # Annotate the panel with the Pearson correlation coefficient
    r, _ = stats.pearsonr(x, y)
    ax = plt.gca()
    ax.annotate("r = {:.2f}".format(r), xy=(.1, .9), xycoords=ax.transAxes)

g = sns.PairGrid(df1, palette=["red"])
g.map_upper(sns.regplot, robust=True)
g.map_diag(sns.distplot, kde=True)
g.map_lower(sns.kdeplot, cmap="Blues_d")
g.map_lower(corrfunc)

Extra keyword arguments, such as robust=True, can be passed through to regplot via the plot_kws argument:
sns.pairplot(df1, kind='reg', hue='species', plot_kws=dict(robust=True, n_boot=50))
NB: In this example I have also decreased n_boot to reduce the runtime (see "robust" in the regplot documentation), so the confidence intervals might be incorrect.
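For the iris example in the question, a minimal sketch of this approach (assuming statsmodels is installed, since regplot's robust fitting relies on it) would be:
import seaborn as sns
import matplotlib.pyplot as plt

iris = sns.load_dataset("iris")
# robust=True is forwarded to every regplot call; hue is preserved
sns.pairplot(iris, kind="reg", hue="species",
             plot_kws=dict(robust=True, n_boot=50))
plt.show()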

Related

Is there a way using librosa's waveplot to store the coordinates of the graph rather than show the image of the waveplot?

I am working on an audio project where I am using Librosa and have the following code from an example online. Rather than opening up an image with a graph of the amplitude versus time, I want to be able to store the coordinates that make up the graph in an array. I have tried a lot of different examples found on stackoverflow as well as other websites with no luck. I am relatively new to python and this is my first question on stackoverflow so please be kind.
import librosa
import librosa.display
import matplotlib.pyplot as plt
from IPython.display import display, Audio

filename = 'queen2.mp3'
y, sr = librosa.load(filename)
display(Audio(filename))
plt.figure(figsize=(12, 4))
librosa.display.waveplot(y, sr=sr, max_points=200)
plt.show()
librosa is open-source (under the ISC license), so you can look at the code to see how it does this. The documentation for each function has a handy [source] link which takes you to the code. For librosa.display.waveplot you will see that it calls a function __envelope() to compute the envelope. Presumably it is these coordinates you are after.
hop_length = 1
y = __envelope(y, hop_length)
y_top = y[0]
y_bottom = -y[-1]
import numpy as np
from librosa import util

def __envelope(x, hop):
    '''Compute the max-envelope of non-overlapping frames of x at length hop

    x is assumed to be multi-channel, of shape (n_channels, n_samples).
    '''
    x_frame = np.abs(util.frame(x, frame_length=hop, hop_length=hop))
    return x_frame.max(axis=1)
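If all you need are the (time, amplitude) coordinates the envelope is built from, a minimal sketch using only numpy and the public librosa API (the hop of 512 samples per point is an arbitrary choice, not what waveplot itself uses) could be:
import numpy as np
import librosa

y, sr = librosa.load('queen2.mp3')
hop = 512  # samples per envelope point

# Max absolute amplitude in each non-overlapping block of `hop` samples
n_frames = len(y) // hop
env = np.abs(y[:n_frames * hop]).reshape(n_frames, hop).max(axis=1)

# Time (in seconds) of each envelope point
times = np.arange(n_frames) * hop / sr

coords = np.column_stack([times, env])  # shape (n_frames, 2)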

Scipy.detrend: Function changes range of values

I am trying to detrend this one dimensional array:
array([13.64352283, 13.48914862, 13.00767009, 13.35416524, 13.60143818,
13.40895156, 13.48349417, 13.65703125, 13.4959721 , 13.28891263,
12.97999066, 13.01112397, 12.79519705, 13.32030445, 13.19949068,
12.88691975, 13.32079707])
The function runs without errors but changes the range of values from ~[12,14] to ~[-0.4,0.4].
I believe this happens because of the small standard deviation of the values.
Any ideas how to fix this, so I can plot the array with trend and the detrended one into one plot?
Normalization is not an option.
Please help.
Well, that is exactly what detrend does: it subtracts the least-squares linear fit from the input.
Here is a plot to illustrate what happens:
from scipy import signal
import numpy as np
import matplotlib.pyplot as plt
y = np.array([13.64352283, 13.48914862, 13.00767009, 13.35416524, 13.60143818,
13.40895156, 13.48349417, 13.65703125, 13.4959721, 13.28891263,
12.97999066, 13.01112397, 12.79519705, 13.32030445, 13.19949068,
12.88691975, 13.32079707])
plt.plot(y, color='dodgerblue')
plt.plot(signal.detrend(y), color='limegreen')
plt.plot(y - signal.detrend(y), color='crimson')
plt.show()
The red line in the plot is the linear approximation that got subtracted from the original data to obtain detrend(y).
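If you want the detrended series to sit in the same value range as the original (without normalizing), one option, continuing the snippet above, is to add the mean back after detrending, since the fitted line that gets subtracted includes that offset:
# Shift the detrended signal back to the original level for plotting
detrended = signal.detrend(y) + y.mean()

plt.plot(y, color='dodgerblue', label='original')
plt.plot(detrended, color='limegreen', label='detrended + mean')
plt.legend()
plt.show()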

Using weighted adjacency matrices to calculate global efficiency of said matrix using networkx

I have been trying to study the impact on a network by looking at deletions of different combinations of nodes.
To study this I have used the networkx graph theory metric global efficiency. But I realised that the networkx code ignores weight when calculating global efficiency. So I went in and changed the source code and added weight as a metric. It seems to be working and gives me different values than the unweighted approach, but it is exceptionally slow (about 20 times slower).
How can I speed up these computations?
## The code I am running
import networkx
import numpy as np
from networkx import algorithms
from networkx.algorithms import efficiency
from networkx.algorithms.efficiency import global_efficiency
import pandas
data=pandas.read_csv("ones.csv")
lol = data.values.tolist()
data=pandas.read_csv("twos.csv")
lol2 = data.values.tolist()
combo=[["10pp", "10d"]]
GE_list=[]
for row in combo:
    values = row
    datasafe = pandas.read_csv("b1.csv", index_col=0)
    datasafe.loc[values, :] = 0
    datasafe[values] = 0
    g = networkx.from_pandas_adjacency(datasafe)
    ge = global_efficiency(g)
    GE_list.append(ge)
extra=[""]
extra2=["full"]
combo.append(extra)
combo.append(extra2)
datasafe=pandas.read_csv("b1.csv", index_col=0)
g=networkx.from_pandas_adjacency(datasafe)
ge=global_efficiency(g)
GE_list.append(ge)
values = ["s6-8","p9-46v","p47r","p10p","IFSp","IFSa",'IFJp','IFJa','i6-8','a9-46v','a47r','a10p','9p','9a','9-46d','8C','8BL','8AV','8AD','47s','47L','10pp','10d','46','45','44']
datasafe=pandas.read_csv("b1.csv", index_col=0)
datasafe.loc[values, :] = 0
datasafe[values] = 0
g=networkx.from_pandas_adjacency(datasafe)
ge=global_efficiency(g)
GE_list.append(ge)
output=pandas.DataFrame(list(zip(combo, GE_list)))
output.to_csv('delete 1.csv',index=None)
## The change I made to the original networkx code
try:
    eff = 1 / nx.shortest_path_length(G, u, v)
## changed to
try:
    eff = 1 / nx.shortest_path_length(G, u, v, weight='weight')
Previously, with my unweighted graphs, I was able to process my data in 2 hours; currently it's taking the same time to do a twentieth of the data. Please do suggest any improvements to my code or any other pieces of code that I can run.
PS: I don't have a great understanding of Python, so please do bear with me :)
Using weights, you exchange breadth-first search for Dijkstra's algorithm, which increases the runtime by a factor of log|V|; see the second comment on https://stackoverflow.com/a/25449911
If you have problems with the runtime, you should rather replace networkx, which is implemented in Python, with a C implementation like graph-tool or igraph; see e.g. this (probably biased) performance comparison: https://graph-tool.skewed.de/performance
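Before switching libraries, it may also be worth avoiding one Dijkstra run per node pair: a single pass of nx.all_pairs_dijkstra_path_length reuses each source's search for all targets. A minimal sketch of a weighted analogue of global_efficiency along those lines (the function name and the 'weight' attribute are assumptions, not part of networkx):
import networkx as nx

def weighted_global_efficiency(G, weight="weight"):
    # One Dijkstra search per source instead of one call per node pair;
    # unreachable pairs contribute zero efficiency, as in networkx.
    n = G.number_of_nodes()
    if n < 2:
        return 0.0
    inv_sum = 0.0
    for source, lengths in nx.all_pairs_dijkstra_path_length(G, weight=weight):
        for target, dist in lengths.items():
            if target != source and dist > 0:
                inv_sum += 1.0 / dist
    return inv_sum / (n * (n - 1))

# ge = weighted_global_efficiency(g)  # drop-in replacement for global_efficiency(g)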

Reprojecting Xarray Dataset

I'm trying to reproject a Lambert Conformal dataset to Plate Carree. I know that this can easily be done visually using cartopy. However, I'm trying to create a new dataset rather than just show a reprojected image. Below is the methodology I have mapped out, but I'm unable to subset the dataset properly (Python 3.5, macOS).
from siphon.catalog import TDSCatalog
import xarray as xr
from xarray.backends import NetCDF4DataStore
import numpy as np
import cartopy.crs as ccrs
from scipy.interpolate import griddata
import numpy.ma as ma
from pyproj import Proj, transform
import metpy
# Declare bounding box
min_lon = -78
min_lat = 36
max_lat = 40
max_lon = -72
boundinglat = [min_lat, max_lat]
boundinglon = [min_lon, max_lon]
# Load the dataset
cat = TDSCatalog('https://thredds.ucar.edu/thredds/catalog/grib/NCEP/HRRR/CONUS_2p5km/latest.xml')
dataset_name = sorted(cat.datasets.keys())[-1]
dataset = cat.datasets[dataset_name]
ds = dataset.remote_access(service='OPENDAP')
ds = NetCDF4DataStore(ds)
ds = xr.open_dataset(ds)
# parse the temperature at 850 hPa and the 0z reftime
tempiso = ds.metpy.parse_cf('Temperature_isobaric')
t850 = tempiso[0][2]
# transform bounding lat/lons to src_proj
src_proj = tempiso.metpy.cartopy_crs #aka lambert conformal conical
extents = src_proj.transform_points(ccrs.PlateCarree(), np.array(boundinglon), np.array(boundinglat))
# subset the data using the indexes of the closest values to the src_proj extents
t850_subset = t850[(np.abs(tempiso.y.values - extents[1][0])).argmin():(np.abs(tempiso.y.values - extents[1][1])).argmin()][(np.abs(tempiso.x.values - extents[0][1])).argmin():(np.abs(tempiso.x.values - extents[0][0])).argmin()]
# t850_subset should be a small, reshaped dataset, but it's shape is 0x2145
# now use nplinspace, npmeshgrid & scipy interpolate to reproject
My "transform points, then find the nearest values" subsetting isn't working. It claims the closest points are outside the extent of the dataset. As noted, I plan to use np.linspace, np.meshgrid and scipy.interpolate to create a new, square lat/lon dataset from t850_subset.
Is there an easier way to resize & reproject an xarray dataset?
Your easiest path forward is to take advantage of xarray's ability to do pandas-like data selection; this is IMO the best part of xarray. Replace your last two lines with:
# By transposing the result of transform_points, we can unpack the
# x and y coordinates into individual arrays.
x_lim, y_lim, _ = src_proj.transform_points(ccrs.PlateCarree(),
                                            np.array(boundinglon),
                                            np.array(boundinglat)).T
t850_subset = t850.sel(x=slice(*x_lim), y=slice(*y_lim))
You can find more information in the documentation on xarray's selection and indexing functionality. You would probably also be interested in xarray's built-in support for interpolation. And if interpolation methods beyond SciPy's are of interest, MetPy also has a suite of other interpolation methods.
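For instance, a minimal sketch of the interpolation step with xarray's built-in interp() (this needs SciPy installed; the 200-point target grid is an arbitrary choice, and it only regrids in the native projection coordinates, so reprojecting to lat/lon still requires transforming the target grid first):
import numpy as np

# Regular grid of projection coordinates spanning the subset
new_x = np.linspace(float(t850_subset.x.min()), float(t850_subset.x.max()), 200)
new_y = np.linspace(float(t850_subset.y.min()), float(t850_subset.y.max()), 200)

# xarray interpolates the DataArray onto the new coordinates
t850_regular = t850_subset.interp(x=new_x, y=new_y)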
We have various "regridding" methods in Iris, if that isn't too much of a context switch for you.
Xarray explains its relationship to Iris here, and provides a to_iris() method.
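As a rough sketch (assuming the iris package is installed), the conversion is a one-liner:
# Convert the xarray DataArray to an Iris cube, which can then be passed
# to Iris's regridding schemes.
cube = t850_subset.to_iris()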

Python equivalent of Pulse Integration (pulsint) MATLAB function?

I am searching online for a Python equivalent of MATLAB's pulsint function, and so far I haven't found anything significantly close to it.
Any heads up in this regard will be really helpful!
You can easily create your own pulsint function.
The pulsint formula (noncoherent integration of N pulses) is y(m) = sqrt(|x(m,1)|^2 + |x(m,2)|^2 + ... + |x(m,N)|^2).
You only need the numpy and matplotlib libraries to keep things simple:
import numpy as np
import numpy.matlib
import matplotlib.pyplot as plt

# Non-coherent (video) integration across the pulse dimension
def pulsint(x):
    return np.sqrt(np.sum(np.power(np.absolute(x), 2), 0))

npulse = 10
# Test data: 10 noisy copies of one cycle of a sine wave (shape 10x100)
x = np.matlib.repmat(np.sin(2*np.pi*np.arange(0, 100)/100), npulse, 1) \
    + 0.1*np.random.randn(npulse, 100)

# Plot the integrated pulse train
plt.plot(pulsint(x))
plt.ylabel('Magnitude')
plt.show()
