Julia - describe() function displays incomplete summary statistics

I'm trying some basic data analysis with Julia.
I'm following this tutorial with the train dataset that can be found here (the one named train_u6lujuX_CVtuZ9i.csv), using the following code:
using DataFrames, RDatasets, CSV, StatsBase
train = CSV.read("/Path/to/train_u6lujuX_CVtuZ9i.csv");
describe(train[:LoanAmount])
and get this output:
Summary Stats:
Length: 614
Type: Union{Missing, Int64}
Number Unique: 204
instead of the output of the tutorial:
Summary Stats:
Mean: 146.412162
Minimum: 9.000000
1st Quartile: 100.000000
Median: 128.000000
3rd Quartile: 168.000000
Maximum: 700.000000
Length: 592
Type: Int64
% Missing: 3.583062
This also corresponds to the output that StatsBase.jl's describe() function should give.

This is how it is currently (in the current release) implemented in StatsBase.jl. In short, train.LoanAmount does not have an eltype that is a subtype of Real, so StatsBase.jl falls back to a method that only prints the length, eltype, and number of unique values. You can write describe(collect(skipmissing(train.LoanAmount))) to get the full summary statistics (except the number of missings, of course).
However, I would recommend a different approach. If you want more verbose output for a single column, use:
describe(train, :all, cols=:LoanAmount)
You will get output that is additionally returned as a DataFrame, so you can not only see the statistics but also access them programmatically.
The :all option prints all statistics; please refer to the describe docstring in DataFrames.jl for the available options.
You can find examples of using this function with the current release of DataFrames.jl here.

Related

PromQL: increase over counter

Here is my current counter value:
method_timed_seconds_count{method="getByUId"} ---> 68
After sending an HTTP request to my service, this counter is incremented:
method_timed_seconds_count{method="getByUId"} ---> 69
I want to see how much this counter has increased inside a 30s window, using this:
increase(method_timed_seconds_count{method="getByUId"}[30s]) ---> 2
However, I'm getting the value 2! Why? I was expecting to get 1!
Scrape interval is 15s.
Any ideas?
Prometheus has the following issues with increase() calculations:
It extrapolates increase() results - see this issue and the sketch below this list.
It doesn't take into account the increase between the last raw sample before the specified lookbehind window in square brackets and the first raw sample inside the lookbehind window. See this design doc for details.
It misses the increase for the first raw sample in a time series. For example, if a time series starts from 5 and has the following samples: 5 6 9 12, then increase over these samples would return something around 12-5=7 instead of the expected 12.
That's why it isn't recommended to use increase() in Prometheus for calculating the exact counter increases.
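To make the extrapolation issue concrete, here is a simplified, hedged Python sketch of how a raw increase of 1 between two scrapes 15 seconds apart gets scaled up to roughly 2 over a 30s window (the real Prometheus logic has extra guards around series start and end that are omitted here):

# Simplified model of increase(): scale the raw increase between the first
# and last samples in the window up to the full window length.
def approx_increase(samples, window_seconds):
    # samples: list of (timestamp_seconds, counter_value) inside the window
    if len(samples) < 2:
        return 0.0
    (t_first, v_first), (t_last, v_last) = samples[0], samples[-1]
    raw_increase = v_last - v_first          # 69 - 68 = 1
    covered = t_last - t_first               # 15s between the two scrapes
    return raw_increase * window_seconds / covered

# Two scrapes 15 seconds apart inside a 30s lookbehind window:
print(approx_increase([(0, 68), (15, 69)], 30))  # -> 2.0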
P.S. If you need to calculate exact counter increases over the specified lookbehind window, try VictoriaMetrics, a Prometheus-like monitoring system I work on. It provides an increase() function that is free from the issues mentioned above.

Does Spark use memoization when applying a transformation?

For example, assume we have the following dataset:
Student   Grade
Bob       10
Sam       30
Tom       30
Vlad      30
When Spark executes the following transformation:
df.withColumn("Grade_minus_average", df("Grade") - lit(average) )
will Spark compute "30 - average" three times, or will it reuse the computation?
(Let's assume there is only one partition.)
No.
An excellent source: https://www.linkedin.com/pulse/catalyst-tungsten-apache-sparks-speeding-engine-deepak-rajak/
Constant folding is the process of recognizing and evaluating constant expressions at compile time rather than computing them at runtime. This is not in any particular way specific to Catalyst. It is just a standard compilation technique and its benefits should be obvious. It is better to compute expression once than repeat this for each row.
Since you have a constant and a column value, the subtraction is performed for each row: there is no mechanism within Spark that tracks at run time whether a previous invocation had the same input, nor any cache of previous results (see the sketch below).
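A hedged PySpark sketch of this (the question's snippet is Scala; the DataFrame contents and the value of average are made up for illustration): constant folding would pre-compute an expression built only from literals, but Grade - lit(average) depends on a column, so the physical plan contains a per-row projection and no reuse of results for repeated inputs.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("constant-folding-sketch").getOrCreate()

df = spark.createDataFrame(
    [("Bob", 10), ("Sam", 30), ("Tom", 30), ("Vlad", 30)],
    ["Student", "Grade"],
)
average = 25.0  # assume this was computed earlier

# lit(average) is already a constant; Catalyst folds expressions made only of
# constants, but (Grade - 25.0) depends on a column, so it is evaluated for
# every row: the three identical Grade values of 30 are not deduplicated.
result = df.withColumn("Grade_minus_average", F.col("Grade") - F.lit(average))

result.explain()  # the plan shows a Project computing (Grade - 25.0) per row
result.show()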

In Weka 3.8.3 I get different results when setting probabilityEstimates to true or false

I use the same training data set and testing data set.
I choose Weka classifiers -> functions -> LibSVM, and use the default parameters.
With the default parameters I get this result:
https://imgur.com/aIq90wP
When I set the parameter probabilityEstimates to true, I get this result:
https://imgur.com/NGVY5No
The default parameter settings are like this:
https://imgur.com/GOfLnVd
Why am I getting different results?
Maybe it's a silly question but I'll be grateful if someone can answer this.
Thanks!
This seems to be related to the random number process.
I used the same libSVM, all defaults, with diabetes.arff (comes with the software).
Run 1: no probabilityEstimates, 500 correct
Run 2: same, 500 correct
Run 3: probabilityEstimates, 498 correct
Run 4: same, 498 correct (so, with identical parameters, the process replicates)
Run 5: probabilityEstimates, but change seed from 1 to 55, 500 correct.
Run 6: probabilityEstimates, but change seed from 55 to 666, 498 correct.
Run 7: probabilityEstimates, but change seed from 666 to 1492, 499 correct.
The algorithm, for whatever reason, needs a different number of random numbers or uses them in a different order, resulting in slight perturbations in the number correct when probabilityEstimates are requested. We get the same effect if we change the random number seed (which tells the random number generator where to start).

What factors influence the "Avoids Enormous Network Payloads" message?

According to the official documentation, the target is 1600KB:
https://developers.google.com/web/tools/lighthouse/audits/network-payloads
However, even when the payload is larger than the target, it sometimes still shows as a passed audit:
payload of 2405KB still passes
What are the other conditions that allow large payloads to pass the audit?
Lighthouse scores are 0–100, based on some log-normal distribution math.
1600KB is not the pass threshold; it is approximately the point at which you get the maximum score of 100.
As of right now, the values used for the distribution calculation are a 2500KB point of diminishing returns and a 4000KB median, which correspond to scores of about 93 and 50 respectively.
That puts the 2405KB result at a score of ~94, which is sufficient to pass.
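As a rough illustration of that math (not Lighthouse's exact implementation), the following Python sketch fits a log-normal curve to the control points quoted above (4000KB at score 50, 2500KB at roughly score 93) and reproduces ~100 for 1600KB and ~94 for 2405KB:

import math
from scipy.stats import norm

MEDIAN_KB = 4000.0   # control point mapped to score 50
PODR_KB = 2500.0     # point of diminishing returns, mapped to roughly score 93

# Pick the log-normal spread so that PODR_KB lands at the ~0.93 control score.
sigma = abs(math.log(PODR_KB / MEDIAN_KB)) / abs(norm.ppf(1 - 0.93))

def payload_score(size_kb):
    # Score in [0, 1]: the share of the log-normal distribution above size_kb.
    z = math.log(size_kb / MEDIAN_KB) / sigma
    return 1.0 - norm.cdf(z)

for kb in (1600, 2405, 2500, 4000):
    print(kb, round(payload_score(kb) * 100))
# -> roughly 100, 94, 93 and 50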

Find a sub-sequence of events in a stream of events

I am giving a miniature version of my issue below.
I have 2 different sensors sending 1/0 values as a stream. I am able to consume the stream using Kafka and bring it into Spark for processing. Please see the sample stream below.
Time         -> 1 2 3 4 5 6 7 8 9 10
Sensor Name  -> A A B B B B A B A  A
Sensor Value -> 1 0 1 0 1 0 0 1 1  0
I want to identify a sub-sequence pattern occurring in this stream. For example, if A = 0 and the very next value (based on time) in the stream is B = 1, then I want to push an alert. In the example above there are 2 such places (times 2-3 and times 7-8) where I want to give an alert. In general it will be like:
“If a set of sensor-event combinations happens within a time interval, raise an alert.”
I am new to Spark and don't know Scala. I am currently doing my coding in Python.
My actual problem involves more sensors, and each sensor can have different value combinations, meaning my sub-sequences and event stream will be more complex.
I have tried a couple of options without success:
1- Window functions: useful for moving averages, cumulative sums, etc., but not for this use case.
2- Bringing Spark DataFrames/RDDs into local Python structures like lists and pandas DataFrames and doing the sub-sequencing there: it takes lots of shuffles, and the Spark event streams queue up after some iterations.
3- updateStateByKey: tried a couple of ways, but I was not able to fully understand how it works and whether it is applicable to this use case.
Anyone looking for a solution to this question can use my approach:
1- To keep the events connected, you need to gather them with collect_list.
2- It's best to sort the collected list, but be cautious: array_sort arranges the data by the first struct field, so it's important to put the DateTime in that field.
3- I then drop the DateTime from the collected elements, as in the example.
4- Finally, you should concatenate all the elements so you can explore the result with string functions like contains to find your sub-sequence.
.agg(expr("array_join(transform(array_sort(collect_list(struct(Time, SensorValue))), a -> a.SensorValue), '')").as("MySequence"))
After this agg function, you can use any regular expression or string function to detect your pattern.
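Since the asker works in Python, here is a hedged PySpark sketch of the same idea; the DataFrame, the column names (Time, SensorName, SensorValue), the single global group, and the "A0B1" pattern string are illustrative assumptions based on the miniature example in the question, not the actual schema:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("subsequence-sketch").getOrCreate()

# Miniature stream from the question (times 1..10).
events = spark.createDataFrame(
    [(1, "A", 1), (2, "A", 0), (3, "B", 1), (4, "B", 0), (5, "B", 1),
     (6, "B", 0), (7, "A", 0), (8, "B", 1), (9, "A", 1), (10, "A", 0)],
    ["Time", "SensorName", "SensorValue"],
)

# One token per event, e.g. "A0"; combined with Time so we can sort by it.
events = events.withColumn(
    "token", F.concat(F.col("SensorName"), F.col("SensorValue").cast("string"))
)

# Collect (Time, token) structs, sort by Time, drop Time, join into a string.
# In a real job this would be grouped by a device key and/or a time window.
sequenced = events.agg(
    F.expr(
        "array_join(transform(array_sort(collect_list(struct(Time, token))), "
        "a -> a.token), '')"
    ).alias("MySequence")
)

# "A0B1" encodes "A=0 immediately followed by B=1"; flag sequences containing it.
alerts = sequenced.filter(F.col("MySequence").contains("A0B1"))
alerts.show(truncate=False)

Running this on the miniature stream yields the sequence A1A0B1B0B1B0A0B1A1A0, which contains A0B1 twice (times 2-3 and 7-8), matching the two alert positions described in the question.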
check this link for more information about collect_list:
collect list
check this link for more information about sorting a collect_list:
sort a collect list
