I'm trying to combine data from 3 tables with the query below:
SELECT `media_category`.id as cat_id,
`media_category`.category_name as cat_name,
`video`.`id` as vid_id,
`video`.`name` as vid_name,
`screenshots`.name as screenshot_name
FROM `media_category`
LEFT JOIN video
ON `media_category`.id = `video`.`category_id`
LEFT JOIN screenshots
ON `video`.id = `screenshots`.`media_id`
WHERE `screenshots`.name NOT LIKE '%\_%'
Version: MySQL 5.7
It works well, but I need to limit the rows fetched from the video table to 10 per category.
Any idea how to do that?
Window functions are not available in MySQL 5.7, so the approach below assumes MySQL 8.x.
In MySQL 8.x you can use the DENSE_RANK() function to identify the rows you want. Then a simple predicate will remove the ones you don't want.
For example, to limit the result to 2 rows of b for each a (irrespective of the rows in c), you can do:
select *
from (
select
a.id,
b.id as bid,
dense_rank() over(partition by a.id order by b.id) as drank,
c.id as cid
from a
left join b on b.aid = a.id
left join c on c.bid = b.id
) x
where drank <= 2
Result:
id bid drank cid
-- --- ----- ------
1 11 1 100
1 11 1 101
1 11 1 102
1 11 1 103
1 12 2 120
2 20 1 200
2 20 1 201
2 21 2 202
3 30 1 <null>
As you can see, only b rows 11 and 12 are shown for id = 1. They still account for five result rows, because every matching row of c is kept (all five have rank 1 or 2). You can see the running example at DB Fiddle. The data script for this example is:
create table a (id int primary key not null);
insert into a (id) values (1), (2), (3);
create table b (id int primary key not null, aid int references a (id));
insert into b (id, aid) values
(11, 1), (12, 1), (13, 1), (14, 1),
(20, 2), (21, 2),
(30, 3);
create table c (id int primary key not null, bid int references b (id));
insert into c (id, bid) values
(100, 11), (101, 11), (102, 11), (103, 11),
(120, 12), (130, 13), (140, 14),
(200, 20), (201, 20),
(202, 21);
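To make the ranking logic concrete, here is a minimal plain-Python sketch (not MySQL) of what the DENSE_RANK() filter does to the joined rows. The function name `limit_b_per_a` and the tuple layout are assumptions made for illustration; the data mirrors the script above.

```python
from itertools import groupby

# Rows as produced by the two left joins: (a.id, b.id, c.id); None stands for NULL.
joined = [
    (1, 11, 100), (1, 11, 101), (1, 11, 102), (1, 11, 103),
    (1, 12, 120), (1, 13, 130), (1, 14, 140),
    (2, 20, 200), (2, 20, 201), (2, 21, 202),
    (3, 30, None),
]

def limit_b_per_a(rows, limit):
    """Keep rows whose b.id is among the first `limit` distinct b.id values per a.id."""
    kept = []
    rows = sorted(rows, key=lambda r: (r[0], r[1]))
    for a_id, group in groupby(rows, key=lambda r: r[0]):
        rank, previous_b = 0, object()
        for row in group:
            if row[1] != previous_b:   # a new distinct b.id bumps the dense rank
                rank += 1
                previous_b = row[1]
            if rank <= limit:          # same test as "where drank <= 2"
                kept.append(row + (rank,))
    return kept

result = limit_b_per_a(joined, 2)
```

Rows that share a b.id share a rank, which is why all four c rows of b = 11 survive the filter while b = 13 and b = 14 are dropped.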
Related
I have a partitioned table with about 2 billion rows in Hive, like:
id, num, num_partition
1, 1253742321.53124121, 12
4, 1253742323.53124121, 12
2, 1353742324.53124121, 13
3, 1253742325.53124121, 12
And I want to have a table like:
id, rank,rank_partition
89, 1, 0
...
1, 1253742321,12
7, 1253742322,12
4, 1253742323,12
8, 1253742324,12
3, 1253742325,12
...
2, 1353742324,13
...
I have tried to do this:
df = spark.sql("select *, rank/10000000 from (select id,row_number() over(order by num asc) rank from table)t1")
It was very slow, since the ORDER BY uses only one reducer.
I've also tried this:
df = spark.sql("select *, rank/10000000 from (select id,row_number() over(distribute by num_partition order by num_partition asc,num asc) rank from table)t1")
But in the result, num_partition is not sorted.
Is there an efficient way to transform the Hive table below into the target shown? The source table has about 1500 columns.
Using Spark 2.0, with source and target as DataFrames.
(id, dt , source1_ColA, source1_ColB, source2_ColA, source2_ColB)
------------------------------------------------------------
(10,"2018-06-01", 10, 9, 5, 8 )
(20,"2018-06-01", 20, 12, 16, 11 )
The columns A,B are transformed as shown below
Target table
(id, dt , element_name, source1, source2 )
---------------------------------------
(10,"2018-06-01", ColA , 10 , 5 )
(10,"2018-06-01", ColB , 9 , 8 )
(20,"2018-06-01", ColA , 20 , 16 )
(20,"2018-06-01", ColB , 12 , 11 )
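The reshaping itself can be illustrated with a small plain-Python sketch (not Spark). It assumes every non-key column is named `<source>_<element>`, as in the example above; the function name `wide_to_long` and the dictionary representation of a row are hypothetical.

```python
from collections import defaultdict

def wide_to_long(row, key_columns=("id", "dt")):
    """Turn one wide record into one record per element name (ColA, ColB, ...)."""
    keys = {k: row[k] for k in key_columns}
    by_element = defaultdict(dict)
    for column, value in row.items():
        if column in key_columns:
            continue
        # "source1_ColA" -> ("source1", "ColA")
        source, element = column.split("_", 1)
        by_element[element][source] = value
    return [{**keys, "element_name": element, **sources}
            for element, sources in by_element.items()]

wide = {"id": 10, "dt": "2018-06-01",
        "source1_ColA": 10, "source1_ColB": 9,
        "source2_ColA": 5, "source2_ColB": 8}
long_rows = wide_to_long(wide)
```

Each wide row yields as many output rows as there are distinct element names, so the output grows with the number of element names rather than the number of source columns.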
I am new to Spark and I’m having difficulties wrapping my mind around this way of thinking.
The following problems seem generic, but I have no idea how I can solve them using Spark and the memory of its nodes only.
I have two lists (i.e.: RDDs):
List1 - (id, start_time, value) where the tuple (id, start_time) is unique
List2 - (id, timestamp)
First problem: go over List2 and for each (id, timestamp) find in List1 a value that has the same id and the maximal start_time that is before the timestamp.
For example:
List1:
(1, 10:00, a)
(1, 10:05, b)
(1, 10:30, c)
(2, 10:02, d)
List2:
(1, 10:02)
(1, 10:29)
(2, 10:03)
(2, 10:04)
Result:
(1, 10:02) => a
(1, 10:29) => b
(2, 10:03) => d
(2, 10:04) => d
Second problem: very similar to the first problem, but now the start_time and timestamp are fuzzy. This means that a time t may be anywhere between (t - delta) and (t + delta). Again, I need to time join the lists.
Notes:
There is a solution to the first problem using Cassandra, but I'm interested in solving it using Spark and the memory of the nodes only.
List1 has thousands of entries.
List2 has tens of millions of entries.
For brevity I have converted your time values such as 10:02 to decimals such as 10.02; in practice, use a function that converts the time string to a number.
The first problem can be easily solved using SparkSQL as shown below.
val list1 = spark.sparkContext.parallelize(Seq(
(1, 10.00, "a"),
(1, 10.05, "b"),
(1, 10.30, "c"),
(2, 10.02, "d"))).toDF("col1", "col2", "col3")
val list2 = spark.sparkContext.parallelize(Seq(
(1, 10.02),
(1, 10.29),
(2, 10.03),
(2, 10.04)
)).toDF("col1", "col2")
list1.createOrReplaceTempView("table1")
list2.createOrReplaceTempView("table2")
scala> spark.sql("""
| SELECT col1,col2,col3
| FROM
| (SELECT
| t2.col1, t2.col2, t1.col3,
| ROW_NUMBER() over(PARTITION BY t2.col1, t2.col2 ORDER BY t1.col2 DESC) as rank
| FROM table2 t2
| LEFT JOIN table1 t1
| ON t1.col1 = t2.col1
| AND t2.col2 > t1.col2) tmp
| WHERE tmp.rank = 1""").show()
+----+-----+----+
|col1| col2|col3|
+----+-----+----+
| 1|10.02| a|
| 1|10.29| b|
| 2|10.03| d|
| 2|10.04| d|
+----+-----+----+
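The same "latest start_time strictly before the timestamp" rule can be sketched in plain Python with a sorted index and a binary search. This is only an illustration of the per-record logic, not Spark code; it assumes List1 is small enough to hold in memory, and the names `build_index` and `lookup` are made up for the example.

```python
from bisect import bisect_left
from collections import defaultdict

list1 = [(1, 10.00, "a"), (1, 10.05, "b"), (1, 10.30, "c"), (2, 10.02, "d")]
list2 = [(1, 10.02), (1, 10.29), (2, 10.03), (2, 10.04)]

def build_index(entries):
    """Map each id to parallel lists of start_times (sorted) and values."""
    grouped = defaultdict(list)
    for key, start_time, value in entries:
        grouped[key].append((start_time, value))
    index = {}
    for key, pairs in grouped.items():
        pairs.sort()
        index[key] = ([s for s, _ in pairs], [v for _, v in pairs])
    return index

def lookup(index, key, timestamp):
    """Return the value with the largest start_time strictly before timestamp, or None."""
    starts, values = index.get(key, ([], []))
    position = bisect_left(starts, timestamp)   # number of start_times < timestamp
    return values[position - 1] if position else None

index = build_index(list1)
matches = [(key, ts, lookup(index, key, ts)) for key, ts in list2]
```

The binary search keeps each lookup logarithmic in the number of List1 entries for that id, which matters when List2 has tens of millions of records.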
Similarly, the solution to the second problem can be derived by changing the join condition, as shown below:
// delta must be defined beforehand; the s prefix interpolates ${delta} into the query
spark.sql(s"""
SELECT col1,col2,col3
FROM
(SELECT
t2.col1, t2.col2, t1.col3,
ROW_NUMBER() over(PARTITION BY t2.col1, t2.col2 ORDER BY t1.col2 DESC) as rank
FROM table2 t2
LEFT JOIN table1 t1
ON t1.col1 = t2.col1
AND t2.col2 between t1.col2 - ${delta} and t1.col2 + ${delta} ) tmp
WHERE tmp.rank = 1""").show()
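As a rough plain-Python illustration of that fuzzy join condition (not Spark code), the sketch below keeps the entries whose start_time lies within delta of the timestamp and picks the one with the largest start_time. Times are expressed as integer minutes since midnight, which is an assumption made here to avoid floating-point comparisons at the window edges; `fuzzy_lookup` is a hypothetical name.

```python
# (id, start_time in minutes since midnight, value): 10:00 -> 600, 10:05 -> 605, ...
entries = [(1, 600, "a"), (1, 605, "b"), (1, 630, "c"), (2, 602, "d")]

def fuzzy_lookup(entries, key, timestamp, delta):
    """Among entries with the same id whose start_time lies within delta of
    timestamp, return the value with the largest start_time, or None."""
    candidates = [(start, value) for entry_key, start, value in entries
                  if entry_key == key and start - delta <= timestamp <= start + delta]
    return max(candidates)[1] if candidates else None

first = fuzzy_lookup(entries, 1, 602, delta=3)    # both 600 and 605 are in range
second = fuzzy_lookup(entries, 1, 629, delta=3)   # only 630 is in range
missing = fuzzy_lookup(entries, 2, 700, delta=3)  # nothing within 3 minutes
```

Note that with a symmetric window a start_time slightly after the timestamp can win, exactly as in the BETWEEN condition above.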
How can I divide a linestring into line segments using SAP HANA DB?
For example:
'LINESTRING(0 0, 2 2, 0 2, 0 5)'
will become:
'LINESTRING(0 0, 2 2) LINESTRING(2 2, 0 2) LINESTRING(0 2, 0 5)'
You can extract individual points with the function ST_PointN().
To call it for points 1, 2, 3, ... N, I use a temp table called TMP that stores counter values:
CREATE TABLE TMP (NUM INT);
insert into TMP select TOP 100 ROW_NUMBER() OVER () from PUBLIC.M_TABLES;
Then I apply ST_PointN for every point using the counter table.
select NUM, LS.ST_PointN(NUM).ST_WKT() as CUR_POINT
from(
select NEW ST_LineString('LINESTRING(0 0, 2 2, 0 2, 0 5)') as LS, NUM
from TMP
)nested1
where NUM<=LS.ST_NumPoints()
This returns
NUM| CUR_POINT
1 |POINT (0 0)
2 |POINT (2 2)
3 |POINT (0 2)
4 |POINT (0 5)
You can easily concatenate those into a ST_Multipoint geometry with an aggregation using ST_UnionAggr():
select ST_UnionAggr(CUR_POINT).ST_AsWKT()
from(
select NUM, LS.ST_PointN(NUM) as CUR_POINT
from(
select NEW ST_LineString('LINESTRING(0 0, 2 2, 0 2, 0 5)') as LS, NUM
from TMP
)nested1
where NUM<=LS.ST_NumPoints()
)nested2
This returns
MULTIPOINT ((0 0),(2 2),(0 2),(0 5))
Note: we could do a loop instead of using a counter table.
Now for your exact question, we will build 3 ST_LineString() using a window function to combine the current point with the next one. There are multiple ways to write such a query; here's one:
select NEW ST_LineString('LineString ('||START_POINT.ST_X()||' '||START_POINT.ST_Y()||','||END_POINT.ST_X()||' '||END_POINT.ST_Y()||')').ST_AsWKT() as LS
from(
select CUR_POINT as START_POINT,
NEW ST_Point(FIRST_VALUE(CUR_POINT) OVER(order by NUM asc rows BETWEEN 1 following and 1 following) )as END_POINT
from(
select NUM, LS.ST_PointN(NUM) as CUR_POINT, LS.ST_NumPoints() as NB_POINTS
from(
select NEW ST_LineString('LINESTRING(0 0, 2 2, 0 2, 0 5)') as LS, NUM
from TMP
)nested1
where NUM<=LS.ST_NumPoints()
)nested2
)nested3
where END_POINT is not null
Tada:
LS
LINESTRING (0 0,2 2)
LINESTRING (2 2,0 2)
LINESTRING (0 2,0 5)
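The pairing of each point with its successor can also be sketched outside the database. The plain-Python snippet below works on the WKT text only and does not use any HANA function; the name `split_linestring` is made up for the example, and it assumes a simple 2D LINESTRING with comma-separated points.

```python
import re

def split_linestring(wkt):
    """Split 'LINESTRING(x1 y1, x2 y2, ...)' into one two-point LINESTRING per segment."""
    match = re.fullmatch(r"\s*LINESTRING\s*\((.*)\)\s*", wkt, flags=re.IGNORECASE)
    if match is None:
        raise ValueError("not a LINESTRING: %r" % wkt)
    points = [point.strip() for point in match.group(1).split(",")]
    # Pair every point with its successor: (p1, p2), (p2, p3), ...
    return ["LINESTRING (%s,%s)" % (start, end)
            for start, end in zip(points, points[1:])]

segments = split_linestring("LINESTRING(0 0, 2 2, 0 2, 0 5)")
```

A linestring with N points yields N - 1 segments, which corresponds to the `END_POINT is not null` filter in the query above.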
I have got two strings:
12, H220, H280
and
11, 36, 66, 67, H225, H319, H336
and I want to add character A to every place where there is no 'H', so the strings should look like
A12, H220, H280
and
A11, A36, A66, A67, H225, H319, H336
select REPLACE(Test,Test,'A'+Test) from (
select REPLACE(Test,', ', ',A') Test from (
select REPLACE(Test,', H',',H') Test from (
select '11, 36, 66, 67, H225, H319, H336' as Test) S) S1 ) S2
Try this:
SQL Fiddle demo
--Sample data
DECLARE @T TABLE (ID INT, COL1 VARCHAR(100))
INSERT @T (ID, COL1)
VALUES (1, '12, H220, H280'), (2, '11, 36, 66, 67, H225, H319, H336')
--Query
;WITH CTE AS
(
SELECT ID, STUFF(COL1, PATINDEX('%[^H]%', COL1), 0, 'A') COL1, 1 NUMBER
FROM @T
UNION ALL
SELECT CTE.ID, STUFF(CTE.COL1, PATINDEX('%[,][ ][^HA]%', CTE.COL1) + 2, 0, 'A'), NUMBER + 1
FROM CTE JOIN @T T
ON CTE.ID = T.ID
WHERE PATINDEX('%[,][ ][^HA]%', CTE.COL1) > 0
)
,
CTE2 AS
(
SELECT *, ROW_NUMBER() OVER (PARTITION BY ID ORDER BY NUMBER DESC) rn
FROM CTE
)
SELECT ID,COL1 FROM CTE2 WHERE RN = 1
Results:
| ID | COL1 |
|----|--------------------------------------|
| 1 | A12, H220, H280 |
| 2 | A11, A36, A66, A67, H225, H319, H336 |
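For comparison, the same transformation is short to express when the string is split into tokens first. This plain-Python sketch is not T-SQL and is not the method used above; it assumes the items are separated by commas, and `prefix_non_h` is a hypothetical name.

```python
def prefix_non_h(text, prefix="A"):
    """Prefix every comma-separated item that does not start with 'H'."""
    tokens = [token.strip() for token in text.split(",")]
    return ", ".join(token if token.startswith("H") else prefix + token
                     for token in tokens)

first = prefix_non_h("12, H220, H280")
second = prefix_non_h("11, 36, 66, 67, H225, H319, H336")
```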