Calculate mean deviation with Splunk - statistics

I have a list of values in Splunk. I can use this list to calculate avg(vals) and stdev(vals). How do I calculate the mean deviation?
The mean deviation is the average absolute difference between the mean and each value in the list.
(Sum_x |mean-x|) / N

The following SPL can be used to calculate the mean deviation of all values.
| eventstats mean(value) as mean | eval distance=abs(mean-value) | stats avg(distance) as mean_deviation
For example, this will generate 10 random values and then calculate the mean deviation.
| makeresults count=10 | eval value=random()%10 | eventstats mean(value) as mean | eval distance=abs(mean-value) | stats avg(distance) as mean_deviation
eventstats is used to calculate the mean of all the values and add this new field to each event. Then, eval is used to calculate the absolute distance of each value from the mean. The final stats command simply takes the average of these distances.
Documentation for eventstats is at https://docs.splunk.com/Documentation/SplunkCloud/latest/SearchReference/Eventstats , and a good blog post on the differences between stats, eventstats, and streamstats can be found at https://www.splunk.com/en_us/blog/tips-and-tricks/search-command-stats-eventstats-and-streamstats-2.html
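For a quick sanity check outside Splunk, here is a minimal Python sketch of the same calculation; NumPy and the sample values are assumptions for illustration, not part of the search above.

import numpy as np

# Hypothetical sample values standing in for the Splunk field "value"
values = np.array([3, 7, 1, 9, 4, 4, 8, 2, 6, 5])

# Mean deviation: the average absolute distance from the mean
mean_deviation = np.abs(values - values.mean()).mean()
print(mean_deviation)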

Related

How to convert an array of values so that each value is closer to the mean, but with a similarly shaped distribution (i.e. reduce the stdev) in PySpark

I hope I've described the job I need to do in the correct terms. Essentially, I need to 'compress' a series of values so that all the values are closer to the mean, but their values should be reduced (or increased) relative to their distance from the mean...
The dataframe looks like this:
>>> df[['population', 'postalCode']].show(10)
+----------+----------+
|population|postalCode|
+----------+----------+
|      1464|     96028|
|       465|     96015|
|       366|     96016|
|      5490|     96101|
|       183|     96068|
|       569|     96009|
|       366|     96054|
|        90|     96119|
|       557|     96006|
|       233|     96116|
+----------+----------+
only showing top 10 rows
>>> df.describe().show()
+-------+------------------+------------------+
|summary|        population|        postalCode|
+-------+------------------+------------------+
|  count|              1082|              1082|
|   mean|23348.511090573014| 93458.60813308688|
| stddev|21825.045923603615|1883.6307236060127|
+-------+------------------+------------------+
The population mean is about right for my purposes, but I need the variance around it to be smaller...
Hope that makes sense; any help performing this job in either pyspark or node.js would be greatly appreciated.
The general idea is to:
1. translate the mean to zero,
2. rescale to the new standard deviation,
3. translate to the desired mean (in this case, the original mean).
In pseudo-code, if your values are stored in the variable x:
x.scaled = new.mean + (x - mean(x)) * new.SD/sd(x)
Or, for the specific case of, say, SD=1000 and no change to the mean:
x.scaled = mean(x) + (x - mean(x)) * 1000/sd(x)
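If it helps, here is a minimal PySpark sketch of that pseudo-code, using the df and population column from the question and a target standard deviation of 1000 from the example above; the other variable names are assumptions.

from pyspark.sql import functions as F

# Current mean and standard deviation of the column
stats = df.select(
    F.mean("population").alias("mu"),
    F.stddev("population").alias("sigma"),
).first()

new_sd = 1000.0  # desired standard deviation, per the example above

# new.mean + (x - mean(x)) * new.SD / sd(x), with new.mean = original mean
df_scaled = df.withColumn(
    "population_scaled",
    stats["mu"] + (F.col("population") - stats["mu"]) * new_sd / stats["sigma"],
)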

Calculate the standard deviation of a cluster of datapoints

So, I have a list of data points that all belong to a cluster (each item is a numpy array with 3 features, representing a point). I compute their centroid (the mean of the points). I want to calculate the standard deviation of a point from the centroid. To put it more precisely, I want to find out how many standard deviations away a point is from the centroid of the cluster. Please help me in coding it.
My list of data points looks something like this
([-5.75204079 8.78545302 8.00800119],....)
Assuming the data points in a cluster are stored in a list called data, the following code will calculate the standard deviation of that set of data.
import math

# Calculate the mean
mean = sum(data) / len(data)

# Sum of squared differences of the data points from the mean
dev = 0
for rec in data:
    dev += pow((rec - mean), 2)

# Calculate the variance
var = dev / len(data)

# Calculate the standard deviation
std_dev = math.sqrt(var)
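Since the points in the question are numpy arrays with 3 features, a vectorized sketch may be closer to what is needed; the array name points, and reading "how many standard deviations away" as each point's distance to the centroid divided by the standard deviation of those distances, are assumptions:

import numpy as np

points = np.array(data)                 # shape (n_points, 3)
centroid = points.mean(axis=0)          # centroid of the cluster

# Euclidean distance of every point from the centroid
distances = np.linalg.norm(points - centroid, axis=1)

# Standard deviation of the distances, and each point's distance
# expressed in units of that standard deviation
std_dev = distances.std()
z_scores = distances / std_dev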

How to do a calculation by evaluating several conditions?

I have this table in access called invoices.
It has a field that performs a calculation based on two fields, that is, field_price*field_Quantity. Now I have added two new columns, field_priceUnit and field_quantityUnit, containing the units of measure for field_weight and field_quantity.
Now, instead of just multiplying, I want the calculation to check whether the units of measure match; if they don't match, it should convert field_quantity into the unit of measure of field_priceUnit.
example:
Row1:
ID:1|Field_Quantity:23|field_quantityUnit:LB|Field_weight:256|field_priceunit:KG|field_price:24| Calculated_Column:
The calculated_column should do the calculation this way:
1. if field_quantityunit=LB and field_priceunit=LB then field_quantity*field_price
else
if field_quantityUnit=LB and field_priceUnit=KG
THEN ((field_quantity/0.453592)*field_price) <<
Please help me.
I have to do this for multiple conditions.
Field_priceunit may have LB, KG, or MT as its value.
Field_quantityUnit may also have LB, KG, or MT as its value.
If the two units don't match, then I want to do the conversion and calculate based on the converted value, as seen in the example.
Thank you
The following formula should get you running if your units are only lb and kg and you only have to check one direction:
IIf(field_quantityunit='LB' AND field_priceunit='LB', field_quantity*field_price, (field_quantity/0.453592)*field_price)
This doesn't scale well though as you may have to convert field_price or you may add other units. This iif formula will grow WAY out of hand quickly.
Instead create a new table called unit_conversion or whatever you like:
unit | conversion
lb   | 0.453592
kg   | 1
g    | 0.001
mg   | 0.000001
Now in your query join:
LEFT OUTER JOIN unit_conversion as qty_conversion
ON field_quantityunit = qty_conversion.unit
LEFT OUTER JOIN unit_conversion as price_conversion
On field_priceUnit = price_conversion.unit
Up in your SELECT portion of the query you can now just do:
(field_quantity * qty_conversion.conversion) * (field_price / price_conversion.conversion)
And you don't have to worry what the units are: the quantity converts to kilograms, the price converts to a price per kilogram, and the two multiply out consistently.
You could convert everything over to pounds, or really any unit of weight here if you want, but the nice thing is that you only need to add new units to your conversion table to handle them in any SQL you write this way, so it's very scalable.
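For illustration only, here is the same lookup-table idea sketched in Python; the dictionary, function name, and the choice of kilograms as the common unit are assumptions rather than part of the Access schema:

# Conversion factors to kilograms, mirroring the unit_conversion table above
TO_KG = {"lb": 0.453592, "kg": 1.0, "g": 0.001, "mg": 0.000001}

def line_total(quantity, quantity_unit, price, price_unit):
    # Convert the quantity to kilograms and the price to a per-kilogram
    # price, then multiply, so any pair of supported units works.
    qty_kg = quantity * TO_KG[quantity_unit.lower()]
    price_per_kg = price / TO_KG[price_unit.lower()]
    return qty_kg * price_per_kg

# Example values from the question: 23 LB at a price of 24 per KG
total = line_total(23, "LB", 24, "KG")   # roughly 250.4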

Geospatial to count NN within Radius - in Big Data

I have successfully done this in python using cKDTree and multiprocessing over 20+ threads, but it really hits a wall when trying to scale it up to millions of rows. I'm wondering how someone could do this in Hive or Spark.
One dataset contains all customers and their GPS coordinates. The other contains all fire hydrants and their GPS coordinates. Like so:
tblCustomer (100MM rows)
+-------------+-------------------+--------------------+
| customer_id | customer_latitude | customer_longitude |
+-------------+-------------------+--------------------+
|         123 |         42.123456 |         -56.123456 |
+-------------+-------------------+--------------------+
|         456 |         44.123456 |         -55.123456 |
+-------------+-------------------+--------------------+
tblFireHydrants (50MM rows)
+----------------+------------------+-------------------+
| firehydrant_id | hydrant_latitude | hydrant_longitude |
+----------------+------------------+-------------------+
|         123456 |        42.987654 |        -55.984657 |
+----------------+------------------+-------------------+
|         456233 |        45.569841 |        -55.978946 |
+----------------+------------------+-------------------+
The goal is to query how many fire hydrants are within radius r (meters) of each customer_id.
After I repeat this a few times for various distances, the final result will look like this:
+-------------+----------------------+----------------------+-----------------------+
| customer_id | hydrants_within_100m | hydrants_within_500m | hydrants_within_1000m |
+-------------+----------------------+----------------------+-----------------------+
|      123456 |                    0 |                    1 |                     6 |
+-------------+----------------------+----------------------+-----------------------+
|      456233 |                    1 |                    1 |                     9 |
+-------------+----------------------+----------------------+-----------------------+
I could start from scratch and try to use KD-trees in Scala, or I noticed there are some geospatial UDFs for Hive-MapReduce that might work. I'm not really sure.
Looking for any suggestion of where to start, hopefully not starting from scratch.
If I understand correctly, you shouldn't use kNN-queries.
kNN queries return the closest 'k' hydrants. The problem is that you have to set 'k' quite high to be sure you don't miss any hydrants within the required distance. However, using a high 'k' may return lots of hydrants that are outside that distance.
I suggest using window queries or, if available, range queries.
Range queries simply return everything within a given distance. This would be ideal, but the feature is often not supported in kd-trees.
Window queries would be a good alternative. A window query typically returns all data within a specified axis-aligned rectangle. You have to make the window large enough to ensure that it returns all possible hydrants in the required distance. Afterwards, you would have to filter out all hydrants that are in the rectangle, but exceed your maximum distance, i.e. hydrants that are near the corners of the query window.
Range and window queries should be much more efficient than kNN queries, even if they may return additional results (in the case of window queries).
To put it another way, a naive kNN algorithm can be thought of as a series of window queries. The algorithm starts with a small window and then makes it larger until it finds 'k' elements. In the best case, the algorithm performs just one window query and finds the right number of results (not too few, not too many), but in the typical case it has to perform many such queries.
I don't know how the cKDTree works internally, but I would strongly assume that window/range queries are much cheaper than kNN queries.
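As an aside, SciPy's cKDTree does expose a range query directly via query_ball_point. A minimal sketch of that idea follows, assuming both point sets have already been projected to planar coordinates in meters (raw latitude/longitude degrees cannot be compared to a radius in meters directly); the variable names are placeholders:

import numpy as np
from scipy.spatial import cKDTree

hydrant_xy = np.asarray(hydrant_coords)    # shape (n_hydrants, 2), meters
customer_xy = np.asarray(customer_coords)  # shape (n_customers, 2), meters

tree = cKDTree(hydrant_xy)

# A range query returns everything within radius r, so there is no need
# to guess a 'k' as with kNN queries.
counts = {}
for r in (100, 500, 1000):
    neighbours = tree.query_ball_point(customer_xy, r)
    counts[r] = [len(idx) for idx in neighbours]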

Summing up a related table's values in PowerPivot/DAX

Say I have two tables. attrsTable:
file | attribute | value
------------------------
A | xdim | 5
A | ydim | 6
B | xdim | 7
B | ydim | 3
B | zdim | 2
C | xdim | 1
C | ydim | 7
sizeTable:
file | size
-----------
A | 17
B | 23
C | 34
I have these tables related via the 'file' field. I want a PowerPivot measure within attrsTable whose calculation uses size. For example, let's say I want (xdim + ydim)/size for each of A, B, C. The calculations would be:
A: (5+6)/17
B: (7+3)/23
C: (1+7)/34
I want the measure to be generic enough so I can use slicers later on to slice by file or attribute. How do I accomplish this?
I tried:
dimPerSize := CALCULATE([value]/SUM(sizeTable[size])) # Calculates 0
dimPerSize := CALCULATE([value]/SUM(RELATED(sizeTable[size]))) # Produces an error
Any idea what I'm doing wrong? I'm probably missing some fundamental concepts here of how to use DAX with relationships.
Hi Redstreet,
taking a step back from your solution and the one proposed by Jacob, I think it might be useful to create another table that would aggregate all the calculations (especially given you probably have more than 2 tables with file-specific attributes).
So I have created one more table (fileTable) that contains only unique file names and is related to both attrsTable and sizeTable.
It's much simpler to add necessary measures (no need for calculated columns). I have actually tested 2 scenarios:
1) create simple SUM measures for both Attribute Value and File Size. Then divide those two measures and job done :-).
2) use SUMX functions to have a bit more universal solution. Then the final formula for DimPerSize calculation could look like this:
=DIVIDE(
SUMX(DISTINCT(fileTable[file]),[Sum of AttrValue]),
SUMX(DISTINCT(fileTable[file]),[Sum of FileSize]),
BLANK()
)
With [Sum of AttrValue] being:
=SUM(attrsTable[value])
And Sum of FileSize being:
=SUM(sizeTable[size])
This worked perfectly fine, even though SUMX in both cases goes over all instances of a given file name. So for file B it also calculates with zdim (if there is a need to filter this out, then use a simple CALCULATE/FILTER combination). In the case of file size, I am using SUMX as well, even though it's not really needed since the table contains only one record for each file name. If there were two instances, then use SUMX or AVERAGEX depending on the desired outcome.
This is the link to my source file in Excel (2010).
Hope this helps.
You look to have the concept of relationships right, but you aren't on the right track with CALCULATE(), either in terms of its structure or the fact that you can't simply use 'naked' numerical columns; they need to be packaged in some way.
Your desired approach is correct in that once you get a simple version of the thing running, you will be able to slice and dice it over any of your related dimensions.
Best practice is probably to build this up using several measures:
[xdim] = CALCULATE(SUM('attrstable'[value]), 'attrstable'[attribute] = "xdim")
[ydim] = CALCULATE(SUM('attrstable'[value]), 'attrstable'[attribute] = "ydim")
[dimPerSize] = ([xdim] + [ydim]) / VALUES('sizeTable'[size])
But depending on exactly how your pivot is set up, this is likely to also throw an error because it will try and use the whole 'size' column in your totals. There are two main strategies for dealing with this:
Use an 'iterative' formula such as SUMX() or AVERAGEX() to iterate individually over the 'file' field and then add up or average for the total, e.g.
[ItdimPerSize] = AVERAGEX(VALUES('sizeTable'[file]), [dimPerSize])
Depending on the maths you want to use, you might find that to produce a useful average you need to use SUMX but divide by the number of cases, i.e. COUNTROWS(VALUES('sizeTable'[file])).
You might decide that the totals are irrelevant and simply introduce an error handling element that will make them blank e.g.
[NtdimPerSize] = IF(HASONEVALUE('sizeTable'[file]),[dimPerSize],BLANK())
NB: all of this assumes that when you are creating your pivot you are 'dragging in' the file field from the 'sizeTable'.
