Calculating median values in HIVE - statistics

I have the following table t1:
key value
1 38.76
1 41.19
1 42.22
2 29.35182
2 28.32192
3 33.66
3 33.47
3 33.35
3 33.47
3 33.11
3 32.98
3 32.5
I want to compute the median for each key group. According to the documentation, the percentile_approx function should work for this. The median values for each group are:
1 41.19
2 28.83
3 33.35
However, the percentile_approx function returns these:
1 39.974999999999994
2 28.32192
3 33.230000000000004
These are clearly not the median values.
This was the query I ran:
select key, percentile_approx(value, 0.5, 10000) as median
from t1
group by key
It seems to leave one value per group out of account, resulting in a wrong median. Ordering the data beforehand does not change the result. Any ideas?
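For reference, Hive also has an exact percentile() UDAF, but it only accepts integral columns. A rough sketch of a workaround (the scale factor below is an assumption; pick one that preserves enough precision for your data) is to scale the values to integers and scale the result back:

-- Sketch only: percentile() needs a BIGINT column, so scale the doubles up,
-- take the exact 50th percentile, and scale the result back down.
select key,
       percentile(cast(round(value * 100000) as bigint), 0.5) / 100000 as median
from t1
group by key;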

In Hive, the median cannot be calculated directly with the available built-in functions. The query below finds the median:
set hive.exec.parallel=true;

select temp1.key, temp2.value
from
(
  select key, cast(sum(rank) / count(key) as int) as final_rank
  from
  (
    select key, value,
           row_number() over (partition by key order by value) as rank
    from t1
  ) temp
  group by key
) temp1
inner join
(
  select key, value,
         row_number() over (partition by key order by value) as rank
  from t1
) temp2
on temp1.key = temp2.key
and temp1.final_rank = temp2.rank;
The query above computes a row_number for each key by ordering the values within the key, then joins back to pick the middle row_number of each key, which gives the median value. I have also set "hive.exec.parallel=true;", which allows independent stages of the job to run in parallel.
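The query above returns a single middle row, so for a group with an even number of values (key 2 in the sample) it picks one of the two central values rather than their mean. A sketch of a variant that averages the two middle values instead (the usual statistical definition of the median):

-- Sketch only: for an even count this averages the two middle rows;
-- for an odd count both expressions point at the same row.
select ranked.key, avg(ranked.value) as median
from
(
  select key, value,
         row_number() over (partition by key order by value) as rn,
         count(*) over (partition by key) as cnt
  from t1
) ranked
where ranked.rn in (floor((ranked.cnt + 1) / 2), floor(ranked.cnt / 2) + 1)
group by ranked.key;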

Related

STRING_SPLIT on INNER JOIN takes different indexes and bad performance

I have a query using STRING_SPLIT in an INNER JOIN, and it shows very different behaviour depending on how I join.
EXAMPLE 1 (11 seconds - 161K rows)
DECLARE @SucursalFisicaIDs AS TABLE (value int)
INSERT INTO @SucursalFisicaIDs
SELECT value FROM STRING_SPLIT('16531,16532,16533,16534,16536,16537,16538,16539,16541,16543,16591,16620,17071',',')
SELECT ArticuloID, SUM(Existencias) AS Existencias
FROM ArticulosExistenciasSucursales_VIEW WITH (NOEXPAND)
INNER JOIN @SucursalFisicaIDs AS SucursalFisicaIDs ON SucursalFisicaIDs.value = ArticulosExistenciasSucursales_VIEW.SucursalFisicaID
GROUP BY ArticuloID
EXAMPLE 2 (2 seconds - 161K rows)
SELECT ArticuloID, SUM(Existencias) AS Existencias
FROM ArticulosExistenciasSucursales_VIEW WITH (NOEXPAND)
WHERE SucursalFisicaID NOT IN (16531,16532,16533,16534,16536,16537,16538,16539,16541,16543,16591,16620,17071)
GROUP BY ArticuloID
Both queries read from the view (it is indexed).
EXAMPLE 3 (3 seconds - 161K rows)
SELECT ArticuloID, SUM(Existencias) AS Existencias
FROM ArticulosExistenciasSucursales_VIEW WITH (NOEXPAND)
INNER JOIN STRING_SPLIT('16531,16532,16533,16534,16536,16537,16538,16539,16541,16543,16591,16620,17071',',') AS SucursalFisicaIDs ON SucursalFisicaIDs.value = ArticulosExistenciasSucursales_VIEW.SucursalFisicaID
GROUP BY ArticuloID
EXAMPLE 4 (6 seconds - 161K rows)
DECLARE @SucursalFisicaIDs AS TABLE (value int)
INSERT INTO @SucursalFisicaIDs
SELECT value FROM STRING_SPLIT('16531,16532,16533,16534,16536,16537,16538,16539,16541,16543,16591,16620,17071',',')
SELECT ArticuloID, SUM(Existencias) AS Existencias
FROM ArticulosExistenciasSucursales_VIEW WITH (NOEXPAND)
WHERE SucursalFisicaID NOT IN (SELECT value FROM @SucursalFisicaIDs)
GROUP BY ArticuloID
Can anyone tell me why the first example doesn't perform well? It should be the same as, or better than, the others. Is there anything I should do about the types?
Note the table scan on the @SucursalFisicaIDs table in example 4 and the estimated number of rows (3M???).
Regards.
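Not something from the post, just an assumption worth testing: declaring the table variable with a PRIMARY KEY gives the optimizer a unique index on value (and a better cardinality picture) for the join in example 1.

-- Sketch only: same as example 1, but the table variable gets a primary key.
DECLARE @SucursalFisicaIDs AS TABLE (value int PRIMARY KEY)

INSERT INTO @SucursalFisicaIDs
SELECT value FROM STRING_SPLIT('16531,16532,16533,16534,16536,16537,16538,16539,16541,16543,16591,16620,17071', ',')

SELECT ArticuloID, SUM(Existencias) AS Existencias
FROM ArticulosExistenciasSucursales_VIEW WITH (NOEXPAND)
INNER JOIN @SucursalFisicaIDs AS SucursalFisicaIDs
    ON SucursalFisicaIDs.value = ArticulosExistenciasSucursales_VIEW.SucursalFisicaID
GROUP BY ArticuloID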

Query optimization in Cassandra

I have a Cassandra database that I need to query.
My table looks like this:
Cycle Parameters Value
1 a 999
1 b 999
1 c 999
2 a 999
2 b 999
2 c 999
3 a 999
3 b 999
3 c 999
4 a 999
4 b 999
4 c 999
I need to get values for parameters "a" and "b" for two cycles, no matter which cycles they are.
Example results:
Cycle Parameters Value
1 a 999
1 b 999
2 a 999
2 b 999
or
Cycle Parameters Value
1 a 999
1 b 999
3 a 999
3 b 999
Since the database is quite huge, every query optimization is welcome.
My requirements are:
I want to do everything in one query.
An answer with no nested query would be a plus.
So far, I was able to accomplish these requirements with something like this:
select * from table where Parameters in ('a','b') sort by cycle, parameters limit 4
However, this query needs a "sort by" operation that causes huge processing in the database...
Any clues on how to do it? ...limit by partition, maybe?
EDIT:
The table schema is:
CREATE TABLE cycle_data (
cycle int,
parameters text,
value double,
primary key(parameters,cycle)
)
"parameters" is the partition key and "cycle" is the clustering column
You can't run a query like this without ALLOW FILTERING. Don't use ALLOW FILTERING in production; only use it for development!
Read the DataStax documentation about using ALLOW FILTERING: https://docs.datastax.com/en/cql/3.3/cql/cql_reference/select_r.html?hl=allow,filter
I assume your current schema is:
CREATE TABLE data (
cycle int,
parameters text,
value double,
primary key(cycle, parameters)
)
You need another table, or to change your table schema, to be able to query like this:
CREATE TABLE cycle_data (
cycle int,
parameters text,
value double,
primary key(parameters,cycle)
)
Now you can query:
SELECT * FROM cycle_data WHERE parameters IN ('a','b');
The results will automatically be sorted in ascending order by cycle within each parameters partition.
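If the cluster is on Cassandra 3.6 or later (an assumption, the version isn't stated in the question), PER PARTITION LIMIT can cap the number of rows returned from each parameters partition, which gives two cycles per parameter without any extra sort step:

-- Sketch only, Cassandra 3.6+: returns the first two rows (lowest cycles,
-- i.e. clustering order) from each of the 'a' and 'b' partitions.
SELECT * FROM cycle_data
WHERE parameters IN ('a', 'b')
PER PARTITION LIMIT 2;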

Cassandra aggregation

I have a Cassandra cluster with 4 tables and data inside.
I want to make requests with aggregation functions (sum, max, ...), but I've read here that it's impossible:
http://www.datastax.com/documentation/cql/3.1/cql/cql_reference/cql_function_r.html
Is there a way to do sum, average, and group by without buying the enterprise version? Can I use Presto, or other solutions?
Thanks
Aggregate functions will be available as part of Cassandra 3.0
https://issues.apache.org/jira/browse/CASSANDRA-4914
Sure there is:
Native aggregates
Count
The count function can be used to count the rows returned by a query.
Example:
SELECT COUNT (*) FROM plays;
SELECT COUNT (1) FROM plays;
It can also be used to count the non-null values of a given column:
SELECT COUNT (scores) FROM plays;
Max and Min
The max and min functions can be used to compute the maximum and the
minimum value returned by a query for a given column. For instance:
SELECT MIN (players), MAX (players) FROM plays WHERE game = 'quake';
Sum
The sum function can be used to sum up all the values returned by a
query for a given column. For instance:
SELECT SUM (players) FROM plays;
Avg
The avg function can be used to compute the average of all the values
returned by a query for a given column. For instance:
SELECT AVG (players) FROM plays;
You can also create your own aggregates, more documentation on aggregates here: http://cassandra.apache.org/doc/latest/cql/functions.html?highlight=aggregate
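As a sketch of the user-defined aggregate route (the names here are illustrative, and user-defined functions have to be enabled in cassandra.yaml), an average aggregate along the lines of the example in the Cassandra documentation looks like this:

-- Sketch only: state function accumulates (count, sum), final function divides.
CREATE OR REPLACE FUNCTION avgState(state tuple<int, bigint>, val int)
  CALLED ON NULL INPUT
  RETURNS tuple<int, bigint>
  LANGUAGE java
  AS 'if (val != null) { state.setInt(0, state.getInt(0) + 1); state.setLong(1, state.getLong(1) + val.intValue()); } return state;';

CREATE OR REPLACE FUNCTION avgFinal(state tuple<int, bigint>)
  CALLED ON NULL INPUT
  RETURNS double
  LANGUAGE java
  AS 'double r = 0; if (state.getInt(0) == 0) return null; r = state.getLong(1); r /= state.getInt(0); return Double.valueOf(r);';

CREATE OR REPLACE AGGREGATE average(int)
  SFUNC avgState
  STYPE tuple<int, bigint>
  FINALFUNC avgFinal
  INITCOND (0, 0);

-- usage, reusing the plays table from the quoted documentation:
SELECT average(players) FROM plays;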

Get the average value between a specified set of rows

Everyone, I'm building a report using Visual Studio 2012
I want to be able to average a group of values between a specific set of rows.
What I have so far is something like this.
=(Count(Fields!SomeField.Value))*.1
and
=(Count(Fields!SomeField.Value))*.9
I want to use those two values to get the average of Fields!SomeField.Value between those two row numbers. Basically, I'm removing the top and bottom 10% of the data and averaging the middle 80%. Maybe there is a better way to do this? Thanks for any help.
Handle it in SQL itself.
Method 1:
Use the NTILE function. Go through this link to learn more about NTILE.
Try something like this:
WITH someCTE AS (
SELECT SomeField, NTILE(10) OVER (ORDER BY SomeField) as percentile
FROM someTable)
SELECT AVG(SomeField) as myAverage
FROM someCTE WHERE percentile BETWEEN 2 and 9
If your dataset is bigger:
WITH someCTE AS (
SELECT SomeField, NTILE(100) OVER (ORDER BY SomeField) as percentile
FROM someTable)
SELECT AVG(SomeField) as myAverage
FROM someCTE WHERE percentile BETWEEN 11 AND 90
Method 2:
SELECT AVG(SomeField) AS myAvg
FROM someTable
WHERE SomeField NOT IN
(
    SELECT SomeField FROM
        (SELECT TOP 10 PERCENT SomeField FROM someTable ORDER BY SomeField ASC) AS bottom10
    UNION ALL
    SELECT SomeField FROM
        (SELECT TOP 10 PERCENT SomeField FROM someTable ORDER BY SomeField DESC) AS top10
)
Note:
Test for boundary conditions to make sure you are getting what you need; tweak the SQL above if needed.
For NTILE: make sure your NTILE parameter is less than (or equal to) the number of rows in the table.

How to compute median value of sales figures?

In MySQL, while there is an AVG function, there is no MEDIAN. So I need to create a way to compute the median value for sales figures.
I understand that the median is the middle value. However, it isn't clear to me how you handle a list that doesn't have an odd number of items. How do you determine which value to select as the median, or is further computation needed to determine this? Thanks!
I'm a fan of including an explicit ORDER BY statement:
SELECT t1.val AS median_val
FROM (
    SELECT @rownum := @rownum + 1 AS row_number, d.val
    FROM data d, (SELECT @rownum := 0) r
    WHERE 1
    -- put some where clause here
    ORDER BY d.val
) AS t1,
(
    SELECT COUNT(*) AS total_rows
    FROM data d
    WHERE 1
    -- put same where clause here
) AS t2
WHERE 1
AND t1.row_number = FLOOR(total_rows / 2) + 1;
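Since the question asks specifically about lists without an odd number of items: the query above picks the single row at FLOOR(total_rows / 2) + 1, which is the upper of the two middle values when the count is even. A sketch of a variant that averages the two middle rows instead (and still returns the single middle row for an odd count), using the same hypothetical data table and val column:

-- Sketch only: for an even count the IN list selects the two middle rows
-- and AVG takes their mean; for an odd count both expressions hit the same row.
SELECT AVG(t1.val) AS median_val
FROM (
    SELECT @rownum := @rownum + 1 AS row_number, d.val
    FROM data d, (SELECT @rownum := 0) r
    WHERE 1
    -- put some where clause here
    ORDER BY d.val
) AS t1,
(
    SELECT COUNT(*) AS total_rows
    FROM data d
    WHERE 1
    -- put same where clause here
) AS t2
WHERE t1.row_number IN (FLOOR((total_rows + 1) / 2), FLOOR(total_rows / 2) + 1);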
