How to sort huge dataset with Spark Sql - apache-spark

I have a partitioned Hive table with about 2 billion rows, like:
id, num, num_partition
1, 1253742321.53124121, 12
4, 1253742323.53124121, 12
2, 1353742324.53124121, 13
3, 1253742325.53124121, 12
And I want to have a table like:
id, rank, rank_partition
89, 1, 0
...
1, 1253742321,12
7, 1253742322,12
4, 1253742323,12
8, 1253742324,12
3, 1253742325,12
...
2, 1353742324,13
...
I have tried to do this:
df = spark.sql("select *, rank/10000000 from (select id, row_number() over (order by num asc) as rank from table) t1")
It was very slow, since a global order by runs through a single reducer.
I've also tried this:
df = spark.sql("select *, rank/10000000 from (select id, row_number() over (distribute by num_partition order by num_partition asc, num asc) as rank from table) t1")
But in the result, num_partition is not sorted.
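A common workaround (a sketch, not from the original post; mytable stands in for your table name) is to rank each num_partition independently in parallel, then shift those ranks by a per-partition offset computed from the partition row counts. This yields a correct global rank only if every num in a lower num_partition is smaller than every num in a higher one, which the sample data suggests:

with counts as (
  -- rows per partition
  select num_partition, count(*) as cnt
  from mytable
  group by num_partition
),
offsets as (
  -- rows that sort before this partition = running total of earlier partitions
  select num_partition,
         coalesce(sum(cnt) over (order by num_partition
                                 rows between unbounded preceding
                                          and 1 preceding), 0) as offset
  from counts
)
select t.id,
       o.offset + row_number() over (partition by t.num_partition
                                     order by t.num asc) as rank
from mytable t
join offsets o
  on o.num_partition = t.num_partition

Each partition is ranked in parallel, and the offsets make the ranks globally consecutive; rank_partition can then be derived with the same rank/10000000 expression as in your first query.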

Related

KQL change order of columns in evaluate pivot

I have the following query:
let fooTable = datatable(TIMESTAMP: datetime, list_id: int, dim_count: int) [
datetime("2022-01-17T08:00:00Z"), -1, 120,
datetime("2022-01-17T08:00:00Z"), 1, 50,
datetime("2022-01-17T08:00:00Z"), 2, 30,
datetime("2022-01-17T08:00:00Z"), 8, 30,
datetime("2022-01-17T08:00:00Z"), 2001, 30,
datetime("2022-01-17T08:00:00Z"), 4, 30,
];
fooTable
| order by TIMESTAMP desc, dim_count desc
| evaluate pivot(list_id, take_any(dim_count), TIMESTAMP)
This produces the following results:
| TIMESTAMP            | 1  | -1  | 2  | 2001 | 4  | 8  |
|----------------------|----|-----|----|------|----|----|
| 2022-01-17T08:00:00Z | 50 | 120 | 30 | 30   | 30 | 30 |
This produces almost what I need: it groups by the TIMESTAMP value and creates a column for each list_id value, using dim_count as the cell value.
But I expected a different order of the columns (like in the input).
| TIMESTAMP            | -1  | 1  | 2  | 8  | 2001 | 4  |
|----------------------|-----|----|----|----|------|----|
| 2022-01-17T08:00:00Z | 120 | 50 | 30 | 30 | 30   | 30 |
How can I order the columns that way? (The number and values of the columns are dynamic.)
Or, how can I control the order of the returned columns?
In reality I have more data (with more buckets of time), and I'd like to return the columns in order of the largest sum of the column's dim_count.
So the column order in the output should match the result of:
fooTable
| summarize sum(dim_count) by list_id
| order by sum_dim_count desc
| project list_id
Which produces
-1
1
2
8
2001
4
And this is how I'd like the order of the columns (like in my expected output).
If the order is based on the row number, you can encode it in the column names and use the project-reorder operator; the client will then need to parse the column names and strip the prefix. For example:
let fooTable = datatable(TIMESTAMP: datetime, list_id: int, dim_count: int) [
datetime("2022-01-17T08:00:00Z"), -1, 120,
datetime("2022-01-17T08:00:00Z"), 1, 50,
datetime("2022-01-17T08:00:00Z"), 2, 30,
datetime("2022-01-17T08:00:00Z"), 8, 30,
datetime("2022-01-17T08:00:00Z"), 2001, 30,
datetime("2022-01-17T08:00:00Z"), 4, 30,
];
fooTable
| serialize
| extend list_id = strcat(row_number(), "_", list_id)
| order by TIMESTAMP desc, dim_count desc
| evaluate pivot(list_id, take_any(dim_count), TIMESTAMP)
| project-reorder *
| TIMESTAMP                   | 1_-1 | 2_1 | 3_2 | 4_8 | 5_2001 | 6_4 |
|-----------------------------|------|-----|-----|-----|--------|-----|
| 2022-01-17 08:00:00.0000000 | 120  | 50  | 30  | 30  | 30     | 30  |

How to set limit for joined mysql table?

I'm trying to combine data from 3 tables with the query below:
SELECT `media_category`.id as cat_id,
`media_category`.category_name as cat_name,
`video`.`id` as vid_id,
`video`.`name` as vid_name,
`screenshots`.name as screenshot_name
FROM `media_category`
LEFT JOIN video
ON `media_category`.id = `video`.`category_id`
LEFT JOIN screenshots
ON `video`.id = `screenshots`.`media_id`
WHERE `screenshots`.name NOT LIKE '%\_%'
Version: MySQL 5.7.
It works well, but I need to limit the rows fetched from the video table to 10 per category.
Any ideas?
The question mentions MySQL 5.7, but the cleanest solution uses window functions, which require MySQL 8.x; a rough 5.7 fallback is sketched after the example data below.
In MySQL 8.x you can use the DENSE_RANK() function to identify the rows you want. Then a simple predicate will remove the ones you don't want.
For example, to keep at most 2 rows of b for each a (irrespective of the rows in c), you can do:
select *
from (
  select
    a.id,
    b.id as bid,
    dense_rank() over (partition by a.id order by b.id) as drank,
    c.id as cid
  from a
  left join b on b.aid = a.id
  left join c on c.bid = b.id
) x
where drank <= 2
Result:
id  bid  drank  cid
--  ---  -----  ------
1   11   1      100
1   11   1      101
1   11   1      102
1   11   1      103
1   12   2      120
2   20   1      200
2   20   1      201
2   21   2      202
3   30   1      <null>
As you can see, only b-rows 11 and 12 appear for id = 1, even though b has four rows for that id; the five result rows for it are exactly the ones with rank 1 or 2. You can see the running example at DB Fiddle. The data script for this example is:
create table a (id int primary key not null);
insert into a (id) values (1), (2), (3);
create table b (id int primary key not null, aid int references a (id));
insert into b (id, aid) values
(11, 1), (12, 1), (13, 1), (14, 1),
(20, 2), (21, 2),
(30, 3);
create table c (id int primary key not null, bid int references b (id));
insert into c (id, bid) values
(100, 11), (101, 11), (102, 11), (103, 11),
(120, 12), (130, 13), (140, 14),
(200, 20), (201, 20),
(202, 21);
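Since the question pins the version to MySQL 5.7, which lacks window functions, a rough fallback is to emulate DENSE_RANK() with user variables. This is a sketch only: it depends on the left-to-right evaluation of variable assignments in the SELECT list, which MySQL documents as not guaranteed, so verify it against your data:

select id, bid, cid
from (
  select
    a.id,
    b.id as bid,
    c.id as cid,
    -- same b.id within an a.id keeps the rank, a new b.id bumps it,
    -- and a new a.id resets it to 1
    @rnk := if(@prev_a = a.id,
               if(@prev_b <=> b.id, @rnk, @rnk + 1),
               1) as drank,
    @prev_a := a.id,
    @prev_b := b.id
  from (select @rnk := 0, @prev_a := null, @prev_b := null) init
  cross join a
  left join b on b.aid = a.id
  left join c on c.bid = b.id
  order by a.id, b.id
) x
where drank <= 2;

The tables and the drank <= 2 predicate match the 8.x example above; for the original query you would rank video.id within each media_category.id and keep drank <= 10.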

Expand (transform) some of the Hive columns as rows (records)

Is there an efficient way to transform the Hive table below into the target shown? The source table has ~1500 columns.
Using Spark 2.0, with source and target as DataFrames.
(id, dt , source1_ColA, source1_ColB, source2_ColA, source2_ColB)
------------------------------------------------------------
(10,"2018-06-01", 10, 9, 5, 8 )
(20,"2018-06-01", 20, 12, 16, 11 )
The A and B columns are transformed into rows as shown below.
Target table
(id, dt , element_name, source1, source2 )
---------------------------------------
(10,"2018-06-01", ColA , 10 , 5 )
(10,"2018-06-01", ColB , 9 , 8 )
(20,"2018-06-01", ColA , 20 , 16 )
(20,"2018-06-01", ColB , 12 , 11 )

Expand rows by column value in Presto

I have data with columns id, number like:
id, number
1, 5
2, 3
And I would like data to be:
id, number
1, 0
1, 1
1, 2
1, 3
1, 4
1, 5
2, 0
2, 1
2, 2
2, 3
sequence(0, t.number) builds the array [0, 1, ..., number], and unnest expands it into one row per element:
select t.id, s.n
from mytable t
cross join unnest(sequence(0, t.number)) as s (n);

Add character to string in SQL

I have got two strings:
12, H220, H280
and
11, 36, 66, 67, H225, H319, H336
and I want to prepend the character 'A' to every item that does not start with 'H', so the strings should look like
A12, H220, H280
and
A11, A36, A66, A67, H225, H319, H336
One nested-REPLACE approach: prefix every item that follows a comma, undo the prefix on the H items, then prefix the first item (this assumes no item legitimately starts with 'AH' and the first item is never an H item):
select 'A' + REPLACE(REPLACE(Test, ', ', ', A'), ', AH', ', H') AS Test
from (select '11, 36, 66, 67, H225, H319, H336' as Test) S
Try this:
SQL Fiddle demo
--Sample data
DECLARE @T TABLE (ID INT, COL1 VARCHAR(100))
INSERT @T (ID, COL1)
VALUES (1, '12, H220, H280'), (2, '11, 36, 66, 67, H225, H319, H336')
--Query
;WITH CTE AS
(
SELECT ID, STUFF(COL1, PATINDEX('%[^H]%', COL1), 0, 'A') COL1, 1 NUMBER
FROM @T
UNION ALL
SELECT CTE.ID, STUFF(CTE.COL1, PATINDEX('%[,][ ][^HA]%', CTE.COL1) + 2, 0, 'A'), NUMBER + 1
FROM CTE JOIN @T T
ON CTE.ID = T.ID
WHERE PATINDEX('%[,][ ][^HA]%', CTE.COL1) > 0
)
,
CTE2 AS
(
SELECT *, ROW_NUMBER() OVER (PARTITION BY ID ORDER BY NUMBER DESC) rn
FROM CTE
)
SELECT ID,COL1 FROM CTE2 WHERE RN = 1
Results:
| ID | COL1 |
|----|--------------------------------------|
| 1 | A12, H220, H280 |
| 2 | A11, A36, A66, A67, H225, H319, H336 |
