I currently have some code that computes the overall time taken to run the count operation on a dataframe. I have another implementation which measures the time taken to run count on a sampled version of this dataframe.
sampled_df = df.sample(withReplacement=False, fraction=0.1)
sampled_df.count()
I then extrapolate the overall count from the sampled count. But I do not see an overall decrease in the time taken for calculating this sampled count when compared to doing a count on the whole dataset. Both seem to take around 40 seconds. Is there a reason this happens? Also, is there an improvement in terms of memory when using a sampled count over count on whole dataframe?
You can use countApprox. This lets you choose how long you're willing to wait for an approximate count/confidence interval.
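For example, a minimal PySpark sketch (the DataFrame here is just a stand-in for yours; note that countApprox lives on the RDD API, not the DataFrame):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(10_000_000)   # stand-in for your DataFrame

# timeout is in milliseconds; the call returns whatever estimate is
# available at the requested confidence once the timeout elapses.
approx_count = df.rdd.countApprox(timeout=2000, confidence=0.95)
print(approx_count)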
sample() still needs to access every partition to produce a uniform sample, so you aren't really saving any time by sampling.
I'm trying to build an Excel sheet that calculates synthetic options prices and Greeks for time series data to model intraday options pricing; the input is simply intraday price data, anywhere from tick level to a 5 minute interval. I found this https://www.thebiccountant.com/2021/12/28/black-scholes-option-pricing-with-power-query-in-power-bi/ which covers Black-Scholes with Power Query in Power BI, but possibly not very accurately. I prefer the binomial method (I have used this excellent tutorial to build a manual version for a large number of strikes, but it takes a long time to calculate, is very complex, and is also inaccurate because Excel tops out before I can calculate many steps: https://www.macroption.com/binomial-option-pricing-excel/).
Does anyone have any idea if it is possible to create an entire column in Power Query that calculates binomially derived options pricing using >100, even up to 1000, steps? The reason is that intraday pricing with high-resolution data (5 minute, 1 minute, seconds, and tick) needs, I think, a large number of steps to properly converge. This is just about building a good-enough model that can be used for visualising the progress of a trade on a given day.
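For reference, the per-row calculation I'm trying to reproduce is roughly the following (sketched in Python only to pin down the recurrence I would need to express in M; the parameter names are generic placeholders, not columns from my sheet):

import math

def binomial_call(spot, strike, T, r, sigma, steps=500):
    # Cox-Ross-Rubinstein binomial tree for a European call.
    dt = T / steps
    u = math.exp(sigma * math.sqrt(dt))      # up move per step
    d = 1.0 / u                              # down move per step
    p = (math.exp(r * dt) - d) / (u - d)     # risk-neutral up probability
    disc = math.exp(-r * dt)                 # one-step discount factor
    # Option values at expiry for every terminal node.
    values = [max(spot * u**j * d**(steps - j) - strike, 0.0)
              for j in range(steps + 1)]
    # Step back through the tree, discounting the expected value at each node.
    for _ in range(steps):
        values = [disc * (p * values[j + 1] + (1 - p) * values[j])
                  for j in range(len(values) - 1)]
    return values[0]

# e.g. binomial_call(100, 105, 30/365, 0.05, 0.25, steps=1000)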
Any pointers on how this could be done and calculated using M Language would be much appreciated and useful!
Can we calculate the overall kth percentile if we have the kth percentile over each 1-minute window for the same time period?
The underlying data is not available; only the kth percentile and the count of the underlying data for each window are available.
Are there any existing algorithms available for this?
How approximate will the calculated kth percentile be?
No. If you have only one percentile (and count) for each time window, then you cannot reasonably estimate that same percentile for the entire time period.
This is because percentiles are order statistics (like Medians): on their own they don't tell you enough about how the data is distributed above and below the measured value in each window. There are a couple of exceptions to the above.
1. If the percentile that you have is the 50th percentile (i.e., the Median), then you can do some extrapolation to the Median of the whole period, but it's a bit sketchy and I'm not sure how bad the variance would be.
2. If all of your percentile measures are very close together (compared to the actual range of the measured population), then obviously you can use that common value as a reasonable estimate of the overall percentile.
3. If you can assume with high confidence that every minute's data is an independent sample from exactly the same population distribution (i.e., there is no time-dependence), then you may be able to combine them, possibly even if the exact distribution is not fully known (it has parameters that are unknown but known to be fixed over the time period). Again, I am not sure what the valid combining functions and variance calculations are for this.
4. If the distribution is known (or can be assumed) to be a specific function or shape with some unknown value(s), and time-dependence has a known role in that function, then you should be able to use weighting and time adjustments to transform this into the same situation as #3 above. For instance, if the distributions were a time-varying exponential distribution of the form pdf(x; k, t) = (k*t) * e^(-(k*t)*x), then I believe you could derive an overall percentile estimate by estimating the value of k and adjusting it for each different minute (t).
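As a rough Python sketch of the general shape of cases #3 and #4 (and this really is just a sketch: it assumes an exponential family, and the count-weighted pooling rule is a heuristic I can't vouch for statistically):

import math

def pooled_exponential_percentile(minute_stats, k):
    # minute_stats: list of (kth_percentile, count) pairs, one per 1-minute window.
    # Assumes each window's data is Exponential; k is a fraction, e.g. 0.95.
    c = -math.log(1.0 - k)                  # for an exponential, q_k = c / rate
    total_n = sum(n for _, n in minute_stats)
    # Invert each window's percentile to a per-window mean (1/rate),
    # pool the means weighted by count, then convert back to a percentile.
    pooled_mean = sum(n * (q / c) for q, n in minute_stats) / total_n
    return c * pooled_mean

# e.g. pooled_exponential_percentile([(120.0, 600), (95.0, 550)], k=0.95)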
Unfortunately, I am not a professional statistician. I have a Math/CS background, enough to have some idea of what's mathematically possible/reasonable, but not enough to tell you exactly how to do it. If you think that your situation falls into one of the above categories, then you might be able to take it to https://stats.stackexchange.com, but you will need to also provide the information I mentioned in those categories and/or detailed, specific information about what you are measuring and how you are measuring it.
Based on statistical instinct, the error will be roughly proportional to the standard deviation of the total set if you build an approximation for a longer time span out of the discrete per-window kth percentiles. [Clarification may be needed to prove this.]
I'm in Excel trying to count how many "peaks" there are in my data. Initially I thought: find the median, set a threshold value, and count the number of values above that threshold. So I ended up using =MEDIAN(), =MAX() and =COUNTIF().
The problem is that a peak can have several data points on its "slope" that are also above the threshold value, so a single peak gets counted more than once.
Wondering if there's an easy way in Excel to count said peaks, or if I have to figure out a way to convert the data to a function and take a second derivative to find local maxima.
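To be concrete, the counting logic I'm after is roughly this (a rough Python sketch of the idea; in Excel it would presumably be a helper column plus a sum):

def count_peaks(values, threshold):
    # Count upward crossings of the threshold: a run of consecutive
    # above-threshold points is treated as a single peak, so the points
    # on a peak's slope are not counted again.
    peaks = 0
    above = False
    for v in values:
        if v > threshold and not above:
            peaks += 1
        above = v > threshold
    return peaks

# e.g. count_peaks([1, 5, 7, 5, 1, 2, 6, 8, 6, 1], threshold=4) -> 2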
I have a dataset:
I want to apply a clustering technique to group every 5 minutes' worth of data and calculate the average of the last column, i.e. percentage congestion.
How do I create such clusters for every 5 minutes? I want to use this analysis further for decision making; the decision will be made on the basis of the calculated average percentage.
That is a simple aggregation, and not clustering.
Use a loop, read one record at a time, and every 5 minutes output the average and reinitialize the accumulators.
Or round every timestamp down to 5-minute granularity, then take the average over the now-identical keys. That is a SQL GROUP BY.
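For example, in pandas the second approach would look roughly like this (the column names here are assumptions, not taken from your dataset):

import pandas as pd

df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2023-01-01 10:01", "2023-01-01 10:03",
        "2023-01-01 10:06", "2023-01-01 10:09",
    ]),
    "congestion_pct": [40.0, 60.0, 20.0, 30.0],
})

# Round every timestamp down to its 5-minute bucket, then average per bucket.
df["bucket"] = df["timestamp"].dt.floor("5min")
avg_per_bucket = df.groupby("bucket")["congestion_pct"].mean()
print(avg_per_bucket)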
I'm having trouble understanding whether Spotfire allows conditional computations between arbitrary rows containing numerical data repeated over data groups. I could not find anything to point me toward the right solution.
Context (simplified): I have data from a sensor reporting the state of a process, and this data is grouped into bursts/groups, each representing a measurement taking several minutes.
Within each burst the sensor is measuring a signal, and if a predefined feature (signal shape) is detected, the sensor outputs a calculated value V quantifying this feature and also reports the RunTime (RT) at which this happened.
So in essence I have three columns: Burst number, a set of RTs within this burst and Values associated with these RTs.
I need to add a calculated column that computes a ratio of the Values in rows where RT equals specific numbers, let's say 1.89 and 2.76.
The high level logic would be:
If a Value exists at 1.89 Run Time and a Value exists at 2.76 Run Time then compute the ratio of these values. Repeat for every Burst.
I understand I can repeat the computation over groups using the OVER operator, but I'm struggling with the logic within each group...
Any tips would be appreciated.
Many thanks!
The first thing you need to do here is apply an order to your dataset. I assume the sample data is complete and encompasses the cases in your real data; with that, we create a calculated column:
RowID() as [ROWID]
Once this is done, we can create a calculated column which will compute your ratio over its respective groups. Just a note: your B4 example is incorrect compared to the other groups; you have the numerator and denominator reversed.
If(([RT]=1.89) or ([RT]=2.76),[Value] / Max([Value]) OVER (Intersect([Burst],Previous([ROWID]))))
Breaking this down...
If(([RT]=1.89) or ([RT]=2.76), limits the rows to those where the RT = 1.89 or 2.76.
Next comes what gets evaluated when the above condition is TRUE.
[Value] / Max([Value]) OVER (Intersect([Burst],Previous([ROWID])))) This takes the value for the row and divides it by the Max([Value]) over the grouping of [Burst] and Previous([ROWID]); the Intersect() function combines those two groupings. So the denominator will always be the previous row's value within the burst. Note that Max() is just a simple aggregate used here; any aggregate should do in this case, since we are only expecting a single value. All OVER() expressions require an aggregate to limit the result set to a single row.
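If it helps to see the same per-burst logic outside Spotfire, here is a rough pandas sketch (the column names and the numerator/denominator order are assumptions based on the question, not on your actual data):

import pandas as pd

df = pd.DataFrame({
    "Burst": [1, 1, 1, 2, 2],
    "RT":    [1.89, 2.10, 2.76, 1.89, 2.76],
    "Value": [10.0, 5.0, 20.0, 8.0, 16.0],
})

def burst_ratio(group):
    # Ratio of the Value at RT 2.76 to the Value at RT 1.89 within one burst.
    num = group.loc[group["RT"] == 2.76, "Value"]
    den = group.loc[group["RT"] == 1.89, "Value"]
    if len(num) and len(den):
        return num.iloc[0] / den.iloc[0]
    return float("nan")  # undefined if either RT is missing in the burst

ratios = df.groupby("Burst")[["RT", "Value"]].apply(burst_ratio)
print(ratios)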