STRING_SPLIT on INNER JOIN takes different indexes and bad performance

I have a query using STRING_SPLIT in an INNER JOIN, and it shows very different behaviour depending on how I join.
EXAMPLE 1 (11 seconds - 161K rows)
DECLARE @SucursalFisicaIDs AS TABLE (value int)
INSERT INTO @SucursalFisicaIDs
SELECT value FROM STRING_SPLIT('16531,16532,16533,16534,16536,16537,16538,16539,16541,16543,16591,16620,17071',',')
SELECT ArticuloID, SUM(Existencias) AS Existencias
FROM ArticulosExistenciasSucursales_VIEW WITH (NOEXPAND)
INNER JOIN @SucursalFisicaIDs AS SucursalFisicaIDs ON SucursalFisicaIDs.value = ArticulosExistenciasSucursales_VIEW.SucursalFisicaID
GROUP BY ArticuloID
EXAMPLE 2 (2 seconds - 161K rows)
SELECT ArticuloID, SUM(Existencias) AS Existencias
FROM ArticulosExistenciasSucursales_VIEW WITH (NOEXPAND)
WHERE SucursalFisicaID NOT IN (16531,16532,16533,16534,16536,16537,16538,16539,16541,16543,16591,16620,17071)
GROUP BY ArticuloID
Both queries read from the view (it is indexed).
EXAMPLE 3 (3 seconds - 161K rows)
SELECT ArticuloID, SUM(Existencias) AS Existencias
FROM ArticulosExistenciasSucursales_VIEW WITH (NOEXPAND)
INNER JOIN STRING_SPLIT('16531,16532,16533,16534,16536,16537,16538,16539,16541,16543,16591,16620,17071',',') AS SucursalFisicaIDs ON SucursalFisicaIDs.value = ArticulosExistenciasSucursales_VIEW.SucursalFisicaID
GROUP BY ArticuloID
EXAMPLE 4 (6 seconds - 161K rows)
DECLARE @SucursalFisicaIDs AS TABLE (value int)
INSERT INTO @SucursalFisicaIDs
SELECT value FROM STRING_SPLIT('16531,16532,16533,16534,16536,16537,16538,16539,16541,16543,16591,16620,17071',',')
SELECT ArticuloID, SUM(Existencias) AS Existencias
FROM ArticulosExistenciasSucursales_VIEW WITH (NOEXPAND)
WHERE SucursalFisicaID NOT IN (SELECT value FROM @SucursalFisicaIDs)
GROUP BY ArticuloID
Can anyone tell me why the first example doesn't perform well? It should be as fast as or faster than the others. Is there anything about the types I should be doing differently?
Note, in example 4, the table scan on the @SucursalFisicaIDs table variable and its estimated number of rows (3M?).
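A minimal sketch of one commonly suggested workaround, assuming the bad plan comes from the table variable's fixed cardinality estimate (this sketch is untested against the schema above): give the table variable a primary key and add OPTION (RECOMPILE) so the optimizer sees the real row count.

DECLARE @SucursalFisicaIDs AS TABLE (value int PRIMARY KEY)

INSERT INTO @SucursalFisicaIDs
SELECT value FROM STRING_SPLIT('16531,16532,16533,16534,16536,16537,16538,16539,16541,16543,16591,16620,17071',',')

-- OPTION (RECOMPILE) makes the optimizer recompile with the actual number of rows in @SucursalFisicaIDs
SELECT ArticuloID, SUM(Existencias) AS Existencias
FROM ArticulosExistenciasSucursales_VIEW WITH (NOEXPAND)
INNER JOIN @SucursalFisicaIDs AS SucursalFisicaIDs ON SucursalFisicaIDs.value = ArticulosExistenciasSucursales_VIEW.SucursalFisicaID
GROUP BY ArticuloID
OPTION (RECOMPILE)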
Regards.

Related

Looping a string list and getting a no-record count from a table

I have string values obtained from a table using listagg(column, ',').
I want to loop over this string list and put each value into the WHERE clause of a query against another table, then get a count of the times the table has no records (the number of times with no record).
I'm writing this inside a PL/SQL procedure. The two tables look like this:
order_id  name
10        test1
20        test2
22        test3
25        test4

col_id  product  order_id
1       pro1     10
2       pro2     30
3       pro2     38
Expected result: count (number of times with no record) in the 2nd table:
count = 3
because there are no records for order_ids 20, 22 and 25 in the 2nd table; only order_id 10 has a record.
My queries:
SELECT listagg(ord.order_id,',')
into wk_orderids
from orders ord
where ord.id_no = wk_id_no;
loop
-- do my stuff
end loop;
wk_orderids value = '10,20,22,25'
I want to loop over this (wk_orderids), set each value one by one into the WHERE clause of a SELECT query, and then get the count of times there is no record.
If you want to count ORDER_IDs from the 1st table that don't exist in the ORDER_ID column of the 2nd table, then your current approach looks as if you were given the task of doing it in the most complicated way possible. Aggregating values, looping through them, injecting values into a WHERE clause (which then requires dynamic SQL) ... OK, but why? Why not simply:
select count(*)
from (select order_id from first_table
minus
select order_id from second_table
);
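If the count then needs to land in a PL/SQL variable inside the procedure, a minimal sketch of the same idea could look like this (first_table and second_table are placeholders for the two tables shown above):

declare
  wk_no_record_count number;
begin
  -- order_ids present in the 1st table but missing from the 2nd table
  select count(*)
    into wk_no_record_count
    from (select order_id from first_table
          minus
          select order_id from second_table);
  dbms_output.put_line('count = ' || wk_no_record_count);
end;
/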

Correct way to get the last value for a field in Apache Spark or Databricks Using SQL (Correct behavior of last and last_value)?

What is the correct behavior of the last and last_value functions in Apache Spark/Databricks SQL? The way I'm reading the documentation (here: https://docs.databricks.com/spark/2.x/spark-sql/language-manual/functions.html), it sounds like it should return the last value of whatever is in the expression.
So if I have a select statement that does something like
select
person,
last(team)
from
(select * from person_team order by date_joined)
group by person
I should get the last team a person joined, yes/no?
The actual query I'm running is shown below. It is returning a different number each time I execute the query.
select count(distinct patient_id) from (
select
patient_id,
org_patient_id,
last_value(data_lot) data_lot
from
(select * from my_table order by data_lot)
where 1=1
and org = 'my_org'
group by 1,2
order by 1,2
)
where data_lot in ('2021-01','2021-02')
;
What is the correct way to get the last value for a given field (for either the team example or my specific example)?
--- EDIT -------------------
I'm thinking collect_set might be useful here, but I get the error shown when I try to run this:
select
patient_id,
last_value(collect_set(data_lot)) data_lot
from
covid.demo
group by patient_id
;
Error in SQL statement: AnalysisException: It is not allowed to use an aggregate function in the argument of another aggregate function. Please use the inner aggregate function in a sub-query.;;
Aggregate [patient_id#89338], [patient_id#89338, last_value(collect_set(data_lot#89342, 0, 0), false) AS data_lot#91848]
+- SubqueryAlias spark_catalog.covid.demo
The posts linked below discuss how to get max values, which is not the same as the last value in a list ordered by a different field: I want the last team a player joined, so if the player joined the Reds, the A's, the Zebras, and the Yankees, in that order timewise, I'm looking for the Yankees. Those posts also reach the solution procedurally using Python/R; I'd like to do this in SQL.
Getting last value of group in Spark
Find maximum row per group in Spark DataFrame
--- SECOND EDIT -------------------
I ended up using something like this based upon the accepted answer.
select
row_number() over (order by provided_date, data_lot) as row_num,
demo.*
from demo
You can assign row numbers based on an ordering on data_lot if you want to get its last value:
select count(distinct patient_id) from (
select * from (
select *,
row_number() over (partition by patient_id, org_patient_id, org order by data_lot desc) as rn
from my_table
where org = 'my_org'
)
where rn = 1
)
where data_lot in ('2021-01','2021-02');
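For the original team example, the same idea can be written with last_value as a window function. This is only a sketch using the person_team columns from the question; the explicit ORDER BY plus the full frame is what makes the result deterministic, whereas last()/last_value() as an aggregate over an unordered GROUP BY is not:

select distinct
       person,
       last_value(team) over (
           partition by person
           order by date_joined
           rows between unbounded preceding and unbounded following  -- full frame, so every row sees the final team
       ) as last_team
from person_team;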

How to optimize a join?

I have a query to join the tables. How do I optimize to run it faster?
val q = """
| select a.value as viewedid,b.other as otherids
| from bm.distinct_viewed_2610 a, bm.tets_2610 b
| where FIND_IN_SET(a.value, b.other) != 0 and a.value in (
| select value from bm.distinct_viewed_2610)
|""".stripMargin
val rows = hiveCtx.sql(q).repartition(100)
Table descriptions:
hive> desc distinct_viewed_2610;
OK
value string
hive> desc tets_2610;
OK
id int
other string
the data looks like this:
hive> select * from distinct_viewed_2610 limit 5;
OK
1033346511
1033419148
1033641547
1033663265
1033830989
and
hive> select * from tets_2610 limit 2;
OK
1033759023
103973207,1013425393,1013812066,1014099507,1014295173,1014432476,1014620707,1014710175,1014776981,1014817307,1023740250,1031023907,1031188043,1031445197
The distinct_viewed_2610 table has 1.1 million records, and I am trying to get similar IDs for it from the tets_2610 table, which has 200,000 rows, by splitting the second column.
For 100,000 records the job takes 8.5 hours to complete on two machines:
one with 16 GB RAM and 16 cores,
the second with 8 GB RAM and 8 cores.
Is there a way to optimize the query?
Now you are doing a Cartesian join. A Cartesian join gives you 1.1M * 200K = 220 billion rows. Only after the Cartesian join is the result filtered by where FIND_IN_SET(a.value, b.other) != 0.
Analyze your data.
If the 'other' string contains 10 elements on average, then exploding it will give you about 2M rows in table b. And if we suppose only 1/10 of those rows will join, you will end up with about 200K rows because of the INNER JOIN.
If these assumptions are correct, then exploding the array and joining will perform better than the Cartesian join plus filter.
select distinct a.value as viewedid, b.otherids
from bm.distinct_viewed_2610 a
inner join (select e.otherid, b.other as otherids
from bm.tets_2610 b
lateral view explode (split(b.other ,',')) e as otherid
)b on a.value=b.otherid
And you do not need this:
and a.value in (select value from bm.distinct_viewed_2610)
Sorry, I cannot test the query; please try it yourself.
If you are using ORC format, change to Parquet, and depending on your data I would say choose range partitioning. Choose proper parallelization so it executes fast.
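A rough sketch of those two suggestions in Spark SQL; the new table name and the partition count are only illustrative, so adjust them to your data and cluster:

-- rewrite the lookup table as Parquet (table name is illustrative)
CREATE TABLE bm.tets_2610_parquet STORED AS PARQUET AS
SELECT * FROM bm.tets_2610;

-- pick a sensible number of shuffle partitions for the join
SET spark.sql.shuffle.partitions=200;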
I have answered on the following link; it may help you:
Spark doing exchange of partitions already correctly distributed
Also please read this:
http://dev.sortable.com/spark-repartition/

Informix - Count and make pagination at same time

Currently I'm doing this (pagination plus count) in Informix:
select a.*, b.total from (select skip 0 first 10 * from TABLE) a,(select count(*) total from TABLE) b
The problem is that I'm repeating the same pattern: I get the first ten results and then I count all the results.
I want to make something like this:
select *, count(*) from TABLE:
so I can make my query much faster. Is it possible?
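If your Informix version supports the OLAP window functions (12.10 and later), one way to get the page and the total count in a single pass is a sketch like this; TABLE stands for your real table name, as in the question, and this is untested:

SELECT SKIP 0 FIRST 10 q.*
FROM (
    SELECT t.*, COUNT(*) OVER () AS total  -- total row count, computed over the whole result before paging
    FROM TABLE t
) AS q;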

Calculating median values in HIVE

I have the following table t1:
key value
1 38.76
1 41.19
1 42.22
2 29.35182
2 28.32192
3 33.66
3 33.47
3 33.35
3 33.47
3 33.11
3 32.98
3 32.5
I want to compute the median for each key group. According to the documentation, the percentile_approx function should work for this. The median values for each group are:
1 41.19
2 28.83
3 33.35
However, the percentile_approx function returns these:
1 39.974999999999994
2 28.32192
3 33.230000000000004
Which clearly are not the median values.
This was the query I ran:
select key, percentile_approx(value, 0.5, 10000) as median
from t1
group by key
It seems to be not taking into account one value per group, resulting in a wrong median. Ordering does not affect the result. Any ideas?
In Hive, the median cannot be calculated directly with the available built-in functions. The query below can be used to find the median.
set hive.exec.parallel=true;
select temp1.key,temp2.value
from
(
select key,cast(sum(rank)/count(key) as int) as final_rank
from
(
select key,value,
row_number() over (partition by key order by value) as rank
from t1
) temp
group by key )temp1
inner join
( select key,value,row_number() over (partition by key order by value) as rank
from t1 )temp2
on
temp1.key=temp2.key and
temp1.final_rank=temp2.rank;
The query above finds the row_number for each key by ordering the values within the key. It then takes the middle row_number of each key, which gives the median value. I have also added the setting "hive.exec.parallel=true;", which allows independent tasks to run in parallel.
