Computing skew in Hive?

This paper defines sample skew as
s = E[(X - E(X))^3] / Var(X)^(3/2)
What's the easiest way to compute this in Hive?
I imagine a two-pass algorithm: one pass gets E(X) and Var(X), the other computes E[(X - E(X))^3] and rolls it up.
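Roughly, something like the sketch below (just an illustration, assuming a table t with a numeric column x; the constants produced by pass 1 would have to be substituted into pass 2 by hand or via a hiveconf variable):
-- Pass 1: compute the mean and population variance.
SELECT avg(x) AS e_x, var_pop(x) AS var_x FROM t;
-- Pass 2: plug in the pass-1 results (1.23 and 4.56 are placeholders here)
-- to get the third central moment divided by Var(X)^(3/2).
SELECT avg(pow(x - 1.23, 3)) / pow(4.56, 1.5) AS skew FROM t;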

I think you are on the right track with a two-step approach, especially if you are strictly using Hive. Here is one way to accomplish it in two steps, i.e. one query with one subquery:
Calculate E(X) using the OVER () clause so we can avoid aggregating the data (this is so we can later compute X - E(X) per row):
select x, avg(x) over () as e_x
from table;
Using the above as a subquery, calculate Var(X) and E[(X - E(X))^3], which aggregates the data and produces the final statistic:
select pow(avg(x - e_x), 3)/sqrt(pow(variance(x), 3))
from (select x, avg(x) over () as e_x
from table) tb
;

The above formula isn't correct, at least for Pearson's skew: pow(avg(x - e_x), 3) cubes the average of the deviations (which is always zero) rather than averaging the cubed deviations.
The following works at least with Impala:
with d as (select somevar as x from yourtable where what>2),
agg as (select avg(x) as m,STDDEV_POP(x) as s,count(*) as n from d),
sk as (select avg(pow(((x-m)/s),3)) as skew from d,agg)
select skew,m,s,n from agg,sk;
I tested it via:
with dual as (select 1.0 as x),
d as (select 1*x as x from dual union select 2*x from dual union select 4*x from dual union select 8*x from dual union select 16*x from dual union select 32*x from dual), -- This generates 1,2,4,8,16,32
agg as (select avg(x) as m,STDDEV_POP(x) as s,count(*) as n from d),
sk as (select avg(pow(((x-m)/s),3)) as skew from d,agg)
select skew,m,s,n from agg,sk;
And it gives the same answer as R:
require(moments)
skewness(c(1,2,4,8,16,32)) #gives 1.095221
See https://en.wikipedia.org/wiki/Skewness#Pearson.27s_moment_coefficient_of_skewness
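The same pattern should carry over to Hive, which also has avg, stddev_pop, pow and (as of Hive 0.13) the WITH clause. A sketch, reusing the table and column names from the query above:
with d as (select somevar as x from yourtable where what > 2),
agg as (select avg(x) as m, stddev_pop(x) as s, count(*) as n from d),
sk as (select avg(pow((x - m) / s, 3)) as skew from d cross join agg)
select skew, m, s, n from agg cross join sk;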

Related

Error Snowflake - Unsupported subquery type cannot be evaluated

I am facing an error in Snowflake saying "Unsupported subquery type cannot be evaluated" after executing, for example, the statement below. How should I write this statement to avoid the error?
select A
from (
select b
, c
FROM test_table
) ;
The outer query column list needs to be within the column list of the subquery. example: select b from (select b,c from test_table);
ignoring "columns" the query you have shown will never trigger this error.
You would get it from this form though:
select A.*
from tableA as A
where a.x = (select b.y FROM test_table as b where b.z = a.z)
This form, assuming there is only one b.y per b.z, can be turned into an inner join like:
select A.*
from tableA as A
join test_table as b
on b.z = a.z and a.x = b.y
Other forms of this pattern use something like max(b.y), and those can be turned into a sub-select like:
select A.*
from tableA as A
join (
select c.z, max(c.y) as y from test_table as c group by 1
) as b
on b.z = a.z and a.x = b.y
But the general pattern is: in other databases there is no "cost" to row-by-row queries, whereas Snowflake is more efficient at pre-building tables of similar data and then equi-joining those results together. So the "how to write it" examples pivot from for-each-row thinking to building the set of all possible answers and then joining against it, which allows for the most parallel processing of the data possible. It does mean that you, the developer, need to understand your data to get the best performance out of it, but if you are doing large-scale data processing you should be understanding your data anyway, so this cost is rather acceptable imho.
If you are trying to match two attributes against the subquery, use the forms below.
If both need to match:
select * from Table WHERE a IN ( select b FROM test_table ) AND a IN ( select c FROM test_table )
If either one needs to match:
select * from Table WHERE a IN ( select b FROM test_table ) OR a IN ( select c FROM test_table )

Spark filter pushdown with multiple values in subquery

I have a small table adm with one column x that contains only 10 rows. Now I want to filter another table big that is partitioned by y with the values from adm using partition pruning.
While here
select * from big b
where b.y = ( select max(a.x) from adm a)
the partition filter pushdown works, but unfortunately this:
select * from big b
where b.y IN (select a.x from adm a )
results in a broadcast join between a and b.
How can the subquery be pushed down as a partition filter even when I use IN?
This is happening because the result of your subquery is itself an RDD, so Spark deals with it in a truly distributed fashion -- via broadcast and join -- as it would for any other column, not just a partition column.
To work around this, you will need to execute the subquery separately, collect the result and format it into a value usable in an IN clause.
scala> import org.apache.spark.sql.Encoders
scala> import scala.collection.JavaConverters._  // for .asScala on the collected java.util.List
scala> val ax = spark.sql("select a.x from adm a")
scala> val inclause = ax.as(Encoders.STRING).map(x => "'" + x + "'").collectAsList().asScala.mkString(",")
scala> spark.sql("select * from big b where b.y IN (" + inclause + ")")
(This assumes x and y are strings.)

how to sort field with char and decimal value

I am stuck with a scenario where I have to sort
'a-2.3'
'a-1.1' and
'a-1.02'.
How do we do this using JPQL in Spring Data JPA, or using a SQL query? I would appreciate your personal experience and ideas.
The expected sorting is in ascending order based on the numerical value after a-.
This will depend on the database you're using, but e.g. in Oracle, you could call TO_NUMBER(SUBSTR(col, 3)) with SQL:
WITH t (col) AS (
SELECT 'a-2.3' FROM dual UNION ALL
SELECT 'a-1.1' FROM dual UNION ALL
SELECT 'a-1.02' FROM dual
)
SELECT col
FROM t
ORDER BY to_number(substr(col, 3))
This yields:
a-1.02
a-1.1
a-2.3
Of course, you'll have to adapt the parsing in case your prefix isn't always exactly a-, but something more dynamic.
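For example, if the prefix can be any run of non-numeric characters, one option (a sketch, again assuming Oracle and a '.' decimal separator) is to strip it with a regular expression before converting:
SELECT col
FROM t
ORDER BY to_number(regexp_replace(col, '^[^0-9]+', ''))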
In JPQL, this could be feasible: CAST(SUBSTRING(col, 3, LENGTH(col) - 2) AS NUMBER)

Using Large Look up table

Problem statement:
I have two tables - Data (40 cols) and LookUp (2 cols). I need to use col10 in the Data table with the LookUp table to extract the relevant value.
However, I cannot do an equi join. I need a join based on like/contains, as the values in the LookUp table contain only partial content of the values in the Data table, not the complete values. Hence some regex-based matching is required.
Data size:
Data table: approx. 2.3 billion entries (1 TB of data)
LookUp table: approx. 1.4 million entries (50 MB of data)
Approaches tried:
1. Using the database (I am using Google BigQuery) - a join based on LIKE takes close to 3 hours, yet returns no result. I believe a regex-based join leads to a Cartesian join.
2. Using Apache Beam/Spark - I tried to construct a trie for the lookup table, which is then shared/broadcast to the worker nodes. However, with this approach I am getting OOM errors because I am creating too many Strings. I tried increasing memory to 4GB+ per worker node, but to no avail.
I am using the trie to extract the longest matching prefix.
I am open to using other technologies like Apache Spark, Redis etc.
Please suggest how I can go about handling this problem. This processing needs to be performed on a day-to-day basis, hence both time and resources need to be optimized.
However, I cannot do an equi join
Below is just to give you an idea to explore for addressing your equi-join issue in pure BigQuery.
It is based on an assumption I derived from your comments - it covers the use case where you are looking for the longest match from the very right to the left; matches in the middle do not qualify.
The approach is to reverse both the url (col10) and shortened_url (col2) fields and then SPLIT() and UNNEST() them while preserving positions:
UNNEST(SPLIT(REVERSE(field), '.')) part WITH OFFSET position
With this done, you can now do an equi join, which can potentially address your issue to some extent.
So, you JOIN by parts and positions, then GROUP BY the original url and shortened_url, keeping only those groups HAVING a count of matches equal to the count of parts in shortened_url, and finally you GROUP BY url, keeping only the entry with the highest number of matching parts.
Hope this can help :o)
This is for BigQuery Standard SQL
#standardSQL
WITH data_table AS (
SELECT 'cn456.abcd.tech.com' url UNION ALL
SELECT 'cn457.abc.tech.com' UNION ALL
SELECT 'cn458.ab.com'
), lookup_table AS (
SELECT 'tech.com' shortened_url, 1 val UNION ALL
SELECT 'abcd.tech.com', 2
), data_table_parts AS (
SELECT url, x, y
FROM data_table, UNNEST(SPLIT(REVERSE(url), '.')) x WITH OFFSET y
), lookup_table_parts AS (
SELECT shortened_url, a, b, val,
ARRAY_LENGTH(SPLIT(REVERSE(shortened_url), '.')) len
FROM lookup_table, UNNEST(SPLIT(REVERSE(shortened_url), '.')) a WITH OFFSET b
)
SELECT url,
ARRAY_AGG(STRUCT(shortened_url, val) ORDER BY weight DESC LIMIT 1)[OFFSET(0)].*
FROM (
SELECT url, shortened_url, COUNT(1) weight, ANY_VALUE(val) val
FROM data_table_parts d
JOIN lookup_table_parts l
ON x = a AND y = b
GROUP BY url, shortened_url
HAVING weight = ANY_VALUE(len)
)
GROUP BY url
with the result:
Row url shortened_url val
1 cn457.abc.tech.com tech.com 1
2 cn456.abcd.tech.com abcd.tech.com 2

How to compute median value of sales figures?

In MySQL, while there is an AVG function, there is no MEDIAN. So I need to create a way to compute the median value for sales figures.
I understand that the median is the middle value. However, it isn't clear to me how you handle a list that doesn't have an odd number of entries. How do you determine which value to select as the median, or is further computation needed to determine this? Thanks!
I'm a fan of including an explicit ORDER BY statement:
SELECT t1.val as median_val
FROM (
  SELECT @rownum:=@rownum+1 as row_number, d.val
  FROM data d, (SELECT @rownum:=0) r
  WHERE 1
  -- put some where clause here
  ORDER BY d.val
) as t1,
(
  SELECT count(*) as total_rows
  FROM data d
  WHERE 1
  -- put same where clause here
) as t2
WHERE 1
AND t1.row_number=floor(total_rows/2)+1;
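For an even number of rows this picks the upper of the two middle values. If you want the conventional median (the average of the two middle values) instead, here is a sketch of the same pattern, assuming the same hypothetical data table and val column:
SELECT AVG(t1.val) as median_val
FROM (
  SELECT @rownum:=@rownum+1 as row_number, d.val
  FROM data d, (SELECT @rownum:=0) r
  WHERE 1
  -- put some where clause here
  ORDER BY d.val
) as t1,
(
  SELECT count(*) as total_rows
  FROM data d
  WHERE 1
  -- put same where clause here
) as t2
WHERE 1
AND t1.row_number IN (floor((total_rows+1)/2), ceil((total_rows+1)/2));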
