Remove outliers in multiple columns from a Spark DataFrame - apache-spark

I have a dataset with around 10 integer features, and I wish to remove outliers from each feature.
What I have done in the past is compute the average and standard deviation for each feature and then do a pass over the dataset, discarding rows that qualify as outliers. Doing this for each column/feature lets me get rid of rows that have at least one outlier feature.
Since scanning the dataset multiple times is not optimal, I am looking for a more computationally efficient approach. Can someone propose a better way, so that the dataset is scanned only once and all outlier rows are removed?
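For reference, a single-pass version of this mean/standard-deviation filter can be sketched in PySpark roughly as follows (df, features and the 3-sigma cutoff are hypothetical placeholders, not code from the question):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical DataFrame and feature list; replace with your own.
features = ["f1", "f2", "f3"]
df = spark.createDataFrame([(1, 2, 3), (100, 2, 3), (1, 2, 300)], features)

# One aggregation computes the mean and stddev of every feature at once.
stats = df.agg(
    *[F.mean(c).alias(c + "_mean") for c in features],
    *[F.stddev(c).alias(c + "_std") for c in features],
).first()

# Build a single combined filter: keep rows whose every feature lies within 3 std devs.
condition = None
for c in features:
    lo = stats[c + "_mean"] - 3 * stats[c + "_std"]
    hi = stats[c + "_mean"] + 3 * stats[c + "_std"]
    col_ok = (F.col(c) >= lo) & (F.col(c) <= hi)
    condition = col_ok if condition is None else (condition & col_ok)

cleaned = df.filter(condition)

The statistics for all features then come from one aggregation job, and the rows are dropped with a single combined filter, rather than one scan of the data per feature.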

Related

Regression analysis with All text data

I want to know the best approach for handling a regression analysis when all of the data is text. I have the following dataset.
My feature columns are: Strength, area of development, leadership, satisfactory.
The values of these columns come from a predefined set of texts, e.g. "Continuous Improvement,Self-Development,Coaching and Mentoring,Creativity,Adaptability".
Based on the values in these columns I want to predict the label (overall Performance): Outstanding, Exceeding Expectation, or Meeting Expectation.
What would be the best approach to deal with this dataset?
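For what it's worth, one common way to turn such comma-separated categorical text values into numeric features is multi-hot encoding, e.g. with scikit-learn's MultiLabelBinarizer. The sketch below uses made-up rows and is only an illustration, not necessarily the best approach for this dataset:

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Made-up rows mimicking one text-valued feature column plus the label.
df = pd.DataFrame({
    "Strength": [
        "Continuous Improvement,Creativity",
        "Coaching and Mentoring,Adaptability",
        "Self-Development,Creativity,Adaptability",
    ],
    "overall Performance": ["Outstanding", "Meeting Expectation", "Exceeding Expectation"],
})

# Split each cell on commas and multi-hot encode the resulting sets of values.
mlb = MultiLabelBinarizer()
strength_features = pd.DataFrame(
    mlb.fit_transform(df["Strength"].str.split(",")),
    columns=["Strength=" + v for v in mlb.classes_],
    index=df.index,
)
print(strength_features)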

Pearson correlation in python data.corr()

I have a matrix of shape (20, 17), where the rows are time points and the columns are variables.
When I compute the correlation matrix using data.corr(), I naturally get a (17, 17) matrix.
My questions:
Is there a way to normalise the variables directly within the .corr() function? (I know I can do that beforehand and then apply the function.)
My correlation matrix is large and I have trouble viewing everything in one go (I have to scroll down to do the necessary comparisons). Is there a way to present the results concisely (like a heat map) where I can easily tell the highest correlations from the lowest?
Many Thanks
You could use matplotlib's imshow() to view a heatmap of any matrix.
Also, consider using pandas DataFrames; that way you can sort by correlation strength and keep the labels of each row and column.
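For example, a minimal sketch with pandas and matplotlib, using random placeholder data in place of your (20, 17) matrix:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Placeholder data: 20 time points x 17 variables.
data = pd.DataFrame(np.random.randn(20, 17), columns=["var%d" % i for i in range(17)])

corr = data.corr()  # (17, 17) Pearson correlation matrix

fig, ax = plt.subplots(figsize=(8, 7))
im = ax.imshow(corr, vmin=-1, vmax=1, cmap="coolwarm")
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=90)
ax.set_yticks(range(len(corr.columns)))
ax.set_yticklabels(corr.columns)
fig.colorbar(im, ax=ax, label="Pearson correlation")
plt.tight_layout()
plt.show()

Note that Pearson correlation is already invariant to shifting and (positive) rescaling of each variable, so normalising beforehand does not change the output of .corr().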

PCA results on imbalanced data with duplicates

I am using sklearn's IPCA decomposition and am surprised that, if I delete duplicates from my dataset, the result differs from the "unclean" one.
What is the reason? As far as I can tell, the variance should be the same.
The answer is simple: the duplicates in the dataset change the variance.
https://stats.stackexchange.com/a/381983/230117
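As a toy illustration (made-up numbers, not the asker's data), duplicating a value shifts both the mean and the variance, and hence the components the decomposition finds:

import numpy as np

with_duplicates = np.array([1.0, 2.0, 3.0, 3.0, 3.0])
deduplicated = np.array([1.0, 2.0, 3.0])

print(with_duplicates.var())  # 0.64
print(deduplicated.var())     # ~0.667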

Averaging many curves with different x and y values

I have several curves that contain many data points. The x-axis is time and let's say I have n curves with data points corresponding to times on the x-axis.
Is there a way to get an "average" of the n curves, despite the fact that the data points are located at different x-points?
I was thinking maybe something like using a histogram to bin the values, but I am not sure which code to start with that could accomplish something like this.
Can Excel or MATLAB do this?
I would also like to plot the standard deviation of the averaged curve.
One concern: the distribution of the x-values is not uniform. There are many more values close to t=0, while at t=5 (for example) the data points are much sparser.
Another concern: what happens if two values fall within one bin? I assume I would need to average those values before calculating the averaged curve.
I hope this conveys what I would like to do.
Any ideas on what code I could use (MATLAB, EXCEL etc) to accomplish my goal?
Since your series are not uniformly sampled, interpolating prior to computing the mean is one way to avoid biasing towards times where you have more frequent samples. Note that, by definition, interpolation will likely reduce the range of your values, i.e. the interpolated points are unlikely to fall exactly at the times of your measured points. This has a greater effect on the extreme statistics (e.g. the 5th and 95th percentiles) than on the mean. If you plan on going this route, you'll need the interp1 and mean functions.
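If it helps, here is a rough sketch of that interpolate-then-average route in Python/NumPy (np.interp standing in for MATLAB's interp1; the curves are made-up placeholders). The same steps translate directly to interp1, mean and std in MATLAB:

import numpy as np

# Made-up curves: each has its own, non-uniform time points.
rng = np.random.default_rng(0)
curves = []
for _ in range(5):
    t = np.sort(rng.uniform(0, 5, size=40))
    y = np.exp(-t) + 0.05 * rng.standard_normal(t.size)
    curves.append((t, y))

# Common grid restricted to the range covered by every curve, so we never extrapolate.
t_min = max(t.min() for t, _ in curves)
t_max = min(t.max() for t, _ in curves)
grid = np.linspace(t_min, t_max, 200)

# Interpolate each curve onto the grid, then average point-wise.
interped = np.vstack([np.interp(grid, t, y) for t, y in curves])
mean_curve = interped.mean(axis=0)
std_curve = interped.std(axis=0)  # spread of the curves at each grid time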
An alternative is to compute a weighted mean. That way you avoid truncating the range of your measured values. Assuming x is a vector of measured values and t is a vector of measurement times (in seconds from some reference time), you can compute the weighted mean as:
timeStep = diff(t);                                          % time elapsed between successive samples
weightedMean = sum(timeStep .* x(1:end-1)) / sum(timeStep);  % duration-weighted mean of x
As mentioned in the comments above, a sample of your data would help a lot in suggesting the appropriate method for calculating the "average".

Bootstrapping with Replacement

I'm reading a paper and am confused by the bootstrap method it describes. The text says:
the uncertainties associated with each stacked flux density are
obtained via the bootstrap method, during which random subsamples
(with replacement) of sources are chosen and re-stacked. The number of
sources in each subsample is equal to the original number of sources
in the stack. This process is repeated 10000 times in order to
determine the representative spread in the properties of the
population being stacked.
So, say I have 50 values. I find the average of these values. According to this method, I would draw a subsample from this original population of 50 and find its average, and repeat this 10,000 times. Now, how would I get a subsample "equal to the original number of sources in the stack" without my subsample being exactly the same as the original, and thus having the exact same mean, which would tell us nothing?
You can reuse values. So if I have ABCDE as my values, I can bootstrap with AABCD, etc. I can use a value twice; that is the key.
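A minimal sketch of that resampling-with-replacement idea in Python/NumPy (the 50 values are made-up placeholders):

import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(loc=10.0, scale=2.0, size=50)  # stand-in for the 50 measured values

# Each bootstrap subsample has the same size as the original (50) but is drawn
# WITH replacement, so some values appear more than once and others not at all;
# the subsample means therefore vary from draw to draw.
boot_means = np.array([
    rng.choice(values, size=values.size, replace=True).mean()
    for _ in range(10000)
])

print(values.mean())     # mean of the original sample
print(boot_means.std())  # bootstrap estimate of the spread (uncertainty) of that mean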
