I have got two strings:
12, H220, H280
and
11, 36, 66, 67, H225, H319, H336
and I want to prepend the character 'A' to every item that doesn't start with 'H', so the strings should look like
A12, H220, H280
and
A11, A36, A66, A67, H225, H319, H336
select 'A' + Test as Test from (                 -- prefix the first item (assumes it never starts with 'H')
select REPLACE(Test, ', AH', ', H') Test from (  -- undo the prefix on items that start with 'H'
select REPLACE(Test, ', ', ', A') Test from (    -- prefix every item that follows ", "
select '11, 36, 66, 67, H225, H319, H336' as Test) S) S1 ) S2
Try this:
SQL Fiddle demo
--Sample data
DECLARE @T TABLE (ID INT, COL1 VARCHAR(100))
INSERT @T (ID, COL1)
VALUES (1, '12, H220, H280'), (2, '11, 36, 66, 67, H225, H319, H336')
--Query
;WITH CTE AS
(
-- Anchor: insert 'A' at the first character that is not 'H' (prefixes the first item)
SELECT ID, STUFF(COL1, PATINDEX('%[^H]%', COL1), 0, 'A') COL1, 1 NUMBER
FROM @T
UNION ALL
-- Recursive step: after each ", ", insert 'A' when the next character is neither 'H' nor an already-inserted 'A'
SELECT CTE.ID, STUFF(CTE.COL1, PATINDEX('%[,][ ][^HA]%', CTE.COL1) + 2, 0, 'A'), NUMBER + 1
FROM CTE JOIN @T T
ON CTE.ID = T.ID
WHERE PATINDEX('%[,][ ][^HA]%', CTE.COL1) > 0
)
,
CTE2 AS
(
-- Keep only the final (fully prefixed) iteration per ID
SELECT *, ROW_NUMBER() OVER (PARTITION BY ID ORDER BY NUMBER DESC) rn
FROM CTE
)
SELECT ID, COL1 FROM CTE2 WHERE rn = 1
Results:
| ID | COL1 |
|----|--------------------------------------|
| 1 | A12, H220, H280 |
| 2 | A11, A36, A66, A67, H225, H319, H336 |
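If you are on SQL Server 2022 or later, a minimal non-recursive sketch is also possible with STRING_SPLIT (using its ordinal argument, available from 2022) and STRING_AGG; this is an assumed alternative reusing the @T sample above, not a drop-in for older versions:
-- Sketch for SQL Server 2022+: split, prefix non-'H' items, reassemble in order
SELECT ID,
       STRING_AGG(CASE WHEN s.item LIKE 'H%' THEN s.item ELSE 'A' + s.item END, ', ')
           WITHIN GROUP (ORDER BY s.ordinal) AS COL1
FROM @T
CROSS APPLY (SELECT LTRIM(value) AS item, ordinal
             FROM STRING_SPLIT(COL1, ',', 1)) s
GROUP BY ID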
I need to create a new DAX column that will search a string from another column in the same table. It will search for any of the values in a 2nd table, and return True if any of those values are found. Simplified example:
Let's say I have a table named Sentences with 1 column:
Sentences
Col1
----------------
"The aardvark admitted it was wrong"
"The attractive peanut farmer graded the term paper"
"The awning was too tall to touch"
And another table named FindTheseWords with a list of values
FindTheseWords
Col1
----------------
peanut
aardvark
I'll be creating Col2 in the Sentences table, which should return
Sentences
Col1 Col2
---------------------------------------------------- ------------------------
"The aardvark admitted it was wrong" TRUE
"The attractive peanut farmer graded the term paper" TRUE
"The awning was too tall to touch" FALSE
The list in FindTheseWords is actually pretty long, so I can't just hardcode the values and chain ORs; I need to reference the table. I don't care about word boundaries, so a sentence containing "peanuts" should also return true for "peanut".
I've seen a good implementation of this in M, but the performance of my load took a pretty good hit, so I'm hoping to find a DAX option for a new column.
The M Solution, for reference: How to search multiple strings in a string?
fact table
| Column1 |
|------------------------------------------------------|
| The aardvark admitted it was wrong |
| The attractive peanut farmer graded the term paper |
| The awning was too tall to touch |
| This is text string |
| Tester is needed |
sentence table
| Column1 |
|------------|
| attractive |
| peanut |
| aardvark |
| Tester |
Calculated column
Column =
// _1: replace spaces with "|" so each sentence can be handled with the PATH functions
VAR _1 =
ADDCOLUMNS ( 'fact', "newColumn", SUBSTITUTE ( 'fact'[Column1], " ", "|" ) )
// _2: explode every sentence into one row per word via PATHLENGTH/PATHITEM
VAR _2 =
GENERATE (
_1,
ADDCOLUMNS (
GENERATESERIES ( 1, PATHLENGTH ( [newColumn] ) ),
"Words", PATHITEM ( [newColumn], [Value], TEXT )
)
)
// _3: flag each word that also appears in the lookup table
VAR _3 =
ADDCOLUMNS (
_2,
"test", CONTAINS ( VALUES ( sentence[Column1] ), sentence[Column1], [Words] )
)
// _4: distinct sentences that have at least one match (& "" coerces values to text)
VAR _4 =
DISTINCT (
SELECTCOLUMNS (
FILTER ( _3, [test] = TRUE ),
"Column1", [Column1] & "",
"test", [test] & ""
)
)
// _5: distinct sentences paired with a FALSE flag
VAR _5 =
DISTINCT (
SELECTCOLUMNS (
FILTER ( _3, [test] = FALSE ),
"Column1", [Column1] & "",
"test", [test] & ""
)
)
// _7: the FALSE rows whose sentence equals the last (MAXX) sentence in _4
VAR _7 =
FILTER ( _5, [Column1] = MAXX ( _4, [Column1] ) )
// _8: the matched rows plus the retained unmatched rows
VAR _8 =
UNION ( _4, _7 )
RETURN
// look up the flag recorded for the current row's sentence
MAXX (
FILTER ( _8, [Column1] = CALCULATE ( MAX ( 'fact'[Column1] ) ) ),
[test]
)
I'm trying to combine data from 3 tables with the query below:
SELECT `media_category`.id as cat_id,
`media_category`.category_name as cat_name,
`video`.`id` as vid_id,
`video`.`name` as vid_name,
`screenshots`.name as screenshot_name
FROM `media_category`
LEFT JOIN video
ON `media_category`.id = `video`.`category_id`
LEFT JOIN screenshots
ON `video`.id = `screenshots`.`media_id`
WHERE `screenshots`.name NOT LIKE '%\_%'
version: MySQL 5.7
It works well, but I need to limit the rows coming from the video table to 10 per category.
Any idea for that?
Note that window functions such as DENSE_RANK() require MySQL 8.x; they are not available in 5.7 (a user-variable workaround for 5.7 is sketched after the data script below).
In MySQL 8.x you can use the DENSE_RANK() function to identify the rows you want. Then a simple predicate will remove the ones you don't want.
For example, if we want to limit to 2 rows of b on each a (irrespective of rows in c), you can do:
select *
from (
select
a.id,
b.id as bid,
dense_rank() over(partition by a.id order by b.id) as drank,
c.id as cid
from a
left join b on b.aid = a.id
left join c on c.bid = b.id
) x
where drank <= 2
Result:
id bid drank cid
-- --- ----- ------
1 11 1 100
1 11 1 101
1 11 1 102
1 11 1 103
1 12 2 120
2 20 1 200
2 20 1 201
2 21 2 202
3 30 1 <null>
As you can see, only bid values 11 and 12 appear for id = 1, even though table b has four rows for that id; the five result rows for id = 1 all have rank 1 or 2. You can see the running example at DB Fiddle. The data script for this example is:
create table a (id int primary key not null);
insert into a (id) values (1), (2), (3);
create table b (id int primary key not null, aid int references a (id));
insert into b (id, aid) values
(11, 1), (12, 1), (13, 1), (14, 1),
(20, 2), (21, 2),
(30, 3);
create table c (id int primary key not null, bid int references b (id));
insert into c (id, bid) values
(100, 11), (101, 11), (102, 11), (103, 11),
(120, 12), (130, 13), (140, 14),
(200, 20), (201, 20),
(202, 21);
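Since the question states MySQL 5.7, which has no window functions, here is a minimal sketch of the classic user-variable emulation of dense_rank(). The variable names are assumptions, and the pattern depends on the (not formally guaranteed) left-to-right evaluation of the SELECT list, so verify the output before relying on it:
-- MySQL 5.7 sketch: emulate dense_rank() with user variables
select id, bid, cid
from (
  select
    a.id,
    b.id as bid,
    c.id as cid,
    @dr := if(@prev_id = a.id, if(@prev_bid = b.id, @dr, @dr + 1), 1) as drank,
    @prev_id := a.id,
    @prev_bid := b.id
  from a
  left join b on b.aid = a.id
  left join c on c.bid = b.id
  cross join (select @dr := 0, @prev_id := null, @prev_bid := null) init
  order by a.id, b.id
) x
where drank <= 2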
I have a partitioned table with about 2 billion rows in hive like:
id, num, num_partition
1, 1253742321.53124121, 12
4, 1253742323.53124121, 12
2, 1353742324.53124121, 13
3, 1253742325.53124121, 12
And I want to have a table like:
id, rank,rank_partition
89, 1, 0
...
1, 1253742321,12
7, 1253742322,12
4, 1253742323,12
8, 1253742324,12
3, 1253742325,12
...
2, 1353742324,13
...
I have tried to do this:
df = spark.sql("select *, rank/10000000 from (select id,row_number() over(order by num asc) rank from table)t1")
It was very slow, since a global order by funnels every row through a single reducer,
and I've tried to do this:
df = spark.sql("select *, rank/10000000 from (select id,row_number() over(distribute by num_partition order by num_partition asc,num asc) rank from table)t1")
But in the result the partitions themselves are not ordered by num_partition, so the row numbers are not globally consistent.
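A minimal sketch of one common workaround, assuming (as the sample suggests) that the num_partition boundaries line up with the num ranges: rank within each partition in parallel, then shift each partition's ranks by the total row count of all earlier partitions. Names follow the question; treat this as a starting point rather than a definitive implementation:
df = spark.sql("""
select r.id,
       r.local_rank + coalesce(o.offset, 0) as rank,
       r.num_partition
from (
    -- rank within each partition; runs in parallel across partitions
    select id, num, num_partition,
           row_number() over (partition by num_partition order by num) as local_rank
    from table
) r
join (
    -- per-partition offset = total rows in all earlier partitions
    select num_partition,
           sum(count(*)) over (order by num_partition
                               rows between unbounded preceding and 1 preceding) as offset
    from table
    group by num_partition
) o
on r.num_partition = o.num_partition
""")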
Is there an efficient way to transform the Hive table below into the target layout shown? The column count in the source table is ~1500.
Using spark 2.0, source and target as dataframes.
(id, dt , source1_ColA, source1_ColB, source2_ColA, source2_ColB)
------------------------------------------------------------
(10,"2018-06-01", 10, 9, 5, 8 )
(20,"2018-06-01", 20, 12, 16, 11 )
The columns A,B are transformed as shown below
Target table
(id, dt , element_name, source1, source2 )
---------------------------------------
(10,"2018-06-01", ColA , 10 , 5 )
(10,"2018-06-01", ColB , 9 , 8 )
(20,"2018-06-01", ColA , 20 , 16 )
(20,"2018-06-01", ColB , 12 , 11 )
I am new to Spark and I’m having difficulties wrapping my mind around this way of thinking.
The following problems seem generic, but I have no idea how I can solve them using Spark and the memory of its nodes only.
I have two lists (i.e.: RDDs):
List1 - (id, start_time, value) where the tuple (id, start_time) is unique
List2 - (id, timestamp)
First problem: go over List2 and for each (id, timestamp) find in List1 a value that has the same id and the maximal start_time that is before the timestamp.
For example:
List1:
(1, 10:00, a)
(1, 10:05, b)
(1, 10:30, c)
(2, 10:02, d)
List2:
(1, 10:02)
(1, 10:29)
(2, 10:03)
(2, 10:04)
Result:
(1, 10:02) => a
(1, 10:29) => b
(2, 10:03) => d
(2, 10:04) => d
Second problem: very similar to the first problem, but now the start_time and timestamp are fuzzy. This means that a time t may be anywhere between (t - delta) and (t + delta). Again, I need to time join the lists.
Notes:
There is a solution to the first problem using Cassandra, but I'm interested in solving it using Spark and the memory of the nodes only.
List1 has thousands of entries.
List2 has tens of millions of entries.
For brevity I have converted your time data 10:02 to the decimal 10.02; just use a function that converts the time string to a number, as sketched below.
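For instance, a minimal Spark SQL expression for the conversion (assuming all times fall within one day, so the decimal ordering matches the time ordering):
-- hedged example: turn the string '10:02' into the double 10.02
select cast(regexp_replace('10:02', ':', '.') as double)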
The first problem can be easily solved using SparkSQL as shown below.
val list1 = spark.sparkContext.parallelize(Seq(
(1, 10.00, "a"),
(1, 10.05, "b"),
(1, 10.30, "c"),
(2, 10.02, "d"))).toDF("col1", "col2", "col3")
val list2 = spark.sparkContext.parallelize(Seq(
(1, 10.02),
(1, 10.29),
(2, 10.03),
(2, 10.04)
)).toDF("col1", "col2")
list1.createOrReplaceTempView("table1")
list2.createOrReplaceTempView("table2")
scala> spark.sql("""
| SELECT col1,col2,col3
| FROM
| (SELECT
| t2.col1, t2.col2, t1.col3,
| ROW_NUMBER() over(PARTITION BY t2.col1, t2.col2 ORDER BY t1.col2 DESC) as rank
| FROM table2 t2
| LEFT JOIN table1 t1
| ON t1.col1 = t2.col1
| AND t2.col2 > t1.col2) tmp
| WHERE tmp.rank = 1""").show()
+----+-----+----+
|col1| col2|col3|
+----+-----+----+
| 1|10.02| a|
| 1|10.29| b|
| 2|10.03| d|
| 2|10.04| d|
+----+-----+----+
Similarly, the solution to the second problem can be derived by just changing the join condition, as shown below:
spark.sql("""
SELECT col1,col2,col3
FROM
(SELECT
t2.col1, t2.col2, t1.col3,
ROW_NUMBER() over(PARTITION BY t2.col1, t2.col2 ORDER BY t1.col2 DESC) as rank
FROM table2 t2
LEFT JOIN table1 t1
ON t1.col1 = t2.col1
AND t2.col2 between t1.col2 - ${delta} and t1.col2 + ${delta} ) tmp // replace delta with actual value
WHERE tmp.rank = 1""").show()