From these tables:
select group, group_ids
from some.groups_and_ids;
Result:
  group  | group_ids
---------+-----------
 winners | 1$4
 losers  | 4
 others  | 2$3$4
and:
select id,name from some.ids_and_names;
 id | name
----+---------
  1 | bob
  2 | robert
  3 | dingus
  4 | norbert
How would you go about returning something like:
winners | bob, norbert
losers | norbert
others | robert, dingus, norbert
with normalized (group_name, id) as (
select group_name, unnest(string_to_array(group_ids,'$')::int[])
from groups_and_ids
)
select n.group_name, string_agg(p.name, ', ' order by p.name)
from normalized n
join ids_and_names p on p.id = n.id
group by n.group_name;
The first part (the common table expression) normalizes your broken table design by presenting a properly normalized view of the groups_and_ids table: one (group_name, id) pair per row. The actual query then joins the ids_and_names table to that normalized version of your groups and aggregates the names back together.
Note I renamed group to group_name because group is a reserved keyword.
SQLFiddle: http://sqlfiddle.com/#!15/2205b/2
Is it possible to redesign your database? Putting all the group_ids into one column makes life hard. If your table were instead, e.g.:
group | group_id
winners | 1
winners | 4
losers | 4
etc., this would be trivially easy (see the sketch after the query below). As it is, the query below will do it, although I hesitated to post it, since it encourages bad database design (IMHO)!
p.s. I took the liberty of renaming some columns, because they are reserved words. You can escape them, but why make life difficult for yourself?
select group_name, array_to_string(array_agg(username), ', ')  -- aggregate into an array and make it into a string
from (
    select group_name, theids, username
    from ids_and_names
    inner join (
        select group_name, unnest(string_to_array(group_ids, '$')) as theids  -- unnest a string_to_array to get rows
        from groups_and_ids
    ) i on i.theids = cast(id as text)
) a
group by group_name
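For comparison, here is a sketch of how simple the aggregation becomes with the normalized design suggested above (assuming the renamed columns group_name, group_id and username):

select group_name, array_to_string(array_agg(username), ', ')
from groups_and_ids g
inner join ids_and_names p on p.id = g.group_id
group by group_name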
Related
I have two tables mentioned below:
Reports
Id | Status
 1 | Active

Reports_details
Id | Cntry | State | City
 1 | IN    | UP    | Delhi
 1 | US    | Texas | Salt lake
Now my requirement is:
Select distinct r.Id from Reports r left join Reports_details rd on r.Id=rd.Id where r.status='Active' and contains(city,'"Del*"')
Note: using contains for full-text search
Problem: how do I add a where clause on both tables' Bookshelf models simultaneously,
and how do I fetch the above query's data with pagination?
I tried creating 2 respective models with belongsTo and hasMany, but the issue comes when applying where on either model: it does not accept a where clause referencing both tables (error: Invalid column name).
I would appreciate your suggestions on a workaround. Thank you.
I am trying to understand salting techniques to tackle skew in Spark SQL. I have done some reading online, and I have come up with a very rudimentary implementation of it in the Spark SQL API.
Let's assume that table1 is skewed on cid=1:
Table 1:
cid | item
---------
1 | light
1 | cookie
1 | ketchup
1 | bottle
2 | dish
3 | cup
As shown above, cid=1 occurs more than other keys.
Table 2:
cid | vehicle
---------
1 | taxi
1 | truck
2 | cycle
3 | plane
Now my code looks like the following:
create temporary view table1_salt as
select
  cid, item, concat(cid, '-', floor(rand() * 20)) as salted_key  -- random salt in [0, 19], matching the 20 splits below
from table1;
create temporary view table2_salt as
select
cid, vehicle, explode(array(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19)) as salted_key
from table2;
Final Query:
select a.cid, a.item, b.vehicle
from table1_salt a
inner join table2_salt b
on a.salted_key = concat(b.cid, '-', b.salted_key);
In the above example, I have used 20 salts/splits.
Questions:
Is there any rule of thumb for choosing the optimal number of splits to use? For example, if table1 has 10 million records, how many bins/buckets should I use? (In this simple test example I have used 20.)
As shown above, when creating table2_salt, I am hardcoding the salts (0, 1, 2, 3, ... through 19). Is there a better way to implement the same functionality, but without the hardcoding and the clutter? (What if I want to use 100 splits! See the sketch below.)
Since we are replicating the second table (table2) N times, doesn't that degrade the join performance?
Note: I need to use Spark 2.4 SQL API only.
Also, kindly let me know if there are any advanced examples available on the net. Any help is appreciated.
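On the hardcoding question, here is one possible sketch. It assumes Spark 2.4, whose SQL dialect ships a built-in sequence function, so changing the number of splits means editing a single literal:

create temporary view table2_salt as
select
  cid, vehicle, explode(sequence(0, 19)) as salted_key  -- sequence(0, 19) generates [0, 1, ..., 19]; use sequence(0, 99) for 100 splits
from table2;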
I have just 2 tables, wherein I need to get the records from the first table (a big table, 10M rows) whose transaction date is less than or equal to the effective date present in the second table (a small table with 1 row); this result set will then be consumed by downstream queries.
Table Transact:
tran_id | cust_id | tran_amt | tran_dt
1234 | XYZ | 12.55 | 10/01/2020
5678 | MNP | 25.99 | 25/02/2020
5561 | XYZ | 32.45 | 30/04/2020
9812 | STR | 10.32 | 15/08/2020
Table REF:
eff_dt
----------
30/07/2020
Hence, per the logic, I should get back the first 3 rows and discard the last record, since its date is greater than the reference date (present in the REF table).
Hence, I have used a non-equi (Cartesian) join between these tables:
select
/*+ MAPJOIN(b) */
a.tran_id,
a.cust_id,
a.tran_amt,
a.tran_dt
from transact a
inner join ref b
on a.tran_dt <= b.eff_dt
However, this SQL takes forever to complete, due to the cross join against the transact table, even with the broadcast hint.
So, is there any smarter way to implement the same logic that would be more efficient? In other words, is it possible to optimize the theta join in this query?
Thanks in advance.
Referring to https://databricks.com/session/optimizing-apache-spark-sql-joins:
Can you try bucketing on tran_dt (bucketed on year/month only), and writing 2 queries to do the same work? A sketch of both follows.
First query: tran_dt (year/month) < eff_dt (year/month). This could help you pick up whole buckets earlier than 2020/07, rather than checking the tran_dt of each and every record.
Second query: tran_dt (year/month) = eff_dt (year/month) and tran_dt (day) <= eff_dt (day).
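A minimal sketch of the two queries unioned together (my illustration, not tested; it assumes tran_dt and eff_dt are DATE columns and that transact is stored bucketed/partitioned by month so the first query can prune; trunc(d, 'MM') truncates a date to the first day of its month in Spark SQL):

-- earlier months: every row in those buckets qualifies
select a.tran_id, a.cust_id, a.tran_amt, a.tran_dt
from transact a
inner join ref b
on trunc(a.tran_dt, 'MM') < trunc(b.eff_dt, 'MM')
union all
-- the boundary month: compare the full dates
select a.tran_id, a.cust_id, a.tran_amt, a.tran_dt
from transact a
inner join ref b
on trunc(a.tran_dt, 'MM') = trunc(b.eff_dt, 'MM')
and a.tran_dt <= b.eff_dt;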
I have a table that has a column of list type (tags):
CREATE TABLE "Videos" (
video_id UUID,
title VARCHAR,
tags LIST<VARCHAR>,
PRIMARY KEY (video_id, upload_timestamp)
) WITH CLUSTERING ORDER BY (upload_timestamp DESC);
I have plenty of rows containing various values in the tags column, e.g. ["outdoor","funny cats","funny mice"].
I want to perform a SELECT query that will return all rows that contain "funny cats" in the tags column. How can I do that?
To directly answer your question: yes, there is a way to accomplish this. As of Cassandra 2.1 you can create a secondary index on a collection. First, I'll re-create your column family definition (with timeuuid for upload_timestamp) and put some values in it.
aploetz#cqlsh:stackoverflow> SELECT * FROM videos ;
video_id | upload_timestamp | tags | title
--------------------------------------+--------------------------------------+-----------------------------------------------+---------------------------
2977b806-df76-4dd7-a57e-11d361e72ce1 | fc011080-64f9-11e4-a819-21b264d4c94d | ['sci-fi', 'action', 'adventure'] | Star Wars
ab696e1f-78c0-45e6-893f-430e88db7f46 | 8db7c4b0-64fa-11e4-a819-21b264d4c94d | ['documentary'] | The Witches of Whitewater
15e6bc0d-6195-4d8b-ad25-771966c780c8 | 1680d120-64fa-11e4-a819-21b264d4c94d | ['dark comedy', 'action', 'language warning'] | Pulp Fiction
(3 rows)
Next, I'll create a secondary index on the tags column:
aploetz#cqlsh:stackoverflow> CREATE INDEX ON videos (tags);
Now, if I want to query the videos that contain the tag 'action', I can accomplish this with the CONTAINS keyword:
aploetz#cqlsh:stackoverflow> SELECT * FROM videos WHERE tags CONTAINS 'action';
video_id | upload_timestamp | tags | title
--------------------------------------+--------------------------------------+-----------------------------------------------+--------------
2977b806-df76-4dd7-a57e-11d361e72ce1 | fc011080-64f9-11e4-a819-21b264d4c94d | ['sci-fi', 'action', 'adventure'] | Star Wars
15e6bc0d-6195-4d8b-ad25-771966c780c8 | 1680d120-64fa-11e4-a819-21b264d4c94d | ['dark comedy', 'action', 'language warning'] | Pulp Fiction
(2 rows)
With all this being said, I should pass along a couple of warnings:
Secondary indexes do not perform well at scale. They exist to provide convenience, not performance. If you expect to query by tag often, then the right way to solve this would be to create a videosbytag query table, with the same data but keyed like this: PRIMARY KEY (tag, video_id). A sketch follows below.
You don't need the double quotes in your table name. In fact, quoting it may cause you problems (OK, maybe minor irritations) down the road.
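A sketch of such a videosbytag query table (my illustration; column types taken from the definition above):

CREATE TABLE videosbytag (
    tag VARCHAR,
    video_id UUID,
    upload_timestamp TIMEUUID,
    title VARCHAR,
    PRIMARY KEY (tag, video_id)
);

-- querying by tag is then a straight partition-key lookup:
SELECT * FROM videosbytag WHERE tag = 'action';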
David, could I ask for some clarification on what you say about joins in this answer?
When you say "You cannot, using the join of the relational stores, join one entry to multiple ones", does that apply in any direction?
E.g. Store 1:
| Key 1 | Measure1 |
Store 2:
| Key 1 | SomeId1 | Measure2 | Measure3 |
| Key 1 | SomeId2 | Measure4 | Measure4 |
So is it not possible to join these two stores by putting the join from Store 2 to Store 1?
And if not, are you saying then that the only way to manage this is to duplicate the entries in Store 1? E.g.:
Store 1
| Key 1 | SomeId1 | Measure1 | Measure2 | Measure3 |
| Key 1 | SomeId2 | Measure1 | Measure4 | Measure4 |
The direction matters for the one-to-many: it depends on which store is the "parent" one.
The relational stores include the concept of an "ActivePivot Store", which is your main store (the one your schema is based on). This store can then be joined, on a set of key fields, to one or more other stores, which we'll call "child" stores for simplicity. Each of these child stores can in turn be joined with other stores, and so on (you can represent it as a directed graph).
The main rule to respect is that you should never have a "parent" store entry resolving to multiple "child" store entries (nor, I believe, should you have any cyclic relationships).
The simplified idea behind the relational stores (as of RS 1.5.x / AP 4.4.x) is that when an entry is submitted into the "ActivePivot Store", then, starting from the ActivePivot Store, the joins are recursively resolved in order to retrieve at most one entry from each of the joined stores. Depending on your schema definition, these entries are then used to populate the fact before inserting it into the cube.
If resolving a join results in more than one entry, then AP cannot choose which one to use to populate the fact, and it will throw an exception.
Coming back to your example: you can do the join between Store 1 and Store 2 only in the case where Store 2 is your ActivePivot Store or a "parent" of Store 1 (APStore->...->Store2->Store1), which seems to be your case.
If not (Store1->Store2), you will then have to duplicate the entries of Store 1 in order to ensure that resolving the join always finds at most one entry. Store 1 will then look like:
| Key 1 | SomeId1 | Measure1
| Key 1 | SomeId2 | Measure1
Your join with Store 2 will then be done on the fields "Key, SomeId" instead of just "Key", which ensures that at most one entry is found when resolving Store1->Store2.