Distribute By vs Shuffle in SparkSQL Query Join - apache-spark

I believe that when Spark performs a JOIN between two tables, both tables are redistributed into partitions on the join key so that matching rows from both sides are co-located. If I am not mistaken, this redistribution is called a SHUFFLE.
However, I also read that there is a DISTRIBUTE BY clause which can be used in a SQL query to pre-distribute the data by a specified key. So logically, using DISTRIBUTE BY on the joining tables before a join should achieve the same thing as a normal SHUFFLE.
For example:
create or replace temporary view cust AS
select id, name
from customers
distribute by id;
create or replace temporary view prods AS
select id, pname
from products
distribute by id;
select a.id, a.name, b.pname
from cust a
INNER JOIN prods b
ON a.id = b.id
So, if DISTRIBUTE BY also distributes the data to co-locate it across both tables, how is it any different from a shuffle? Can a DISTRIBUTE BY eliminate the shuffle?
Also, how can DISTRIBUTE BY / CLUSTER BY be leveraged to improve query performance?
If possible, please share an example.
Can anyone please clarify.

From the manuals:
DISTRIBUTE BY
Repartition rows in the relation based on a set of expressions. Rows with the same expression values will be hashed to the same
worker. You cannot use this with ORDER BY or CLUSTER BY.
It amounts to the same thing: a shuffle still occurs, so you cannot eliminate it; DISTRIBUTE BY is just an alternative interface to the same repartitioning. The pre-distribution can only be picked up by the later join because of Spark's lazy evaluation.
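As a rough illustration (a minimal sketch, assuming Spark 2.x+ and that the customers/products tables from the question are registered in the catalog), DISTRIBUTE BY is the SQL counterpart of Dataset.repartition, and both surface as an Exchange (i.e. a shuffle) in the physical plan:

// Sketch only: table names are the placeholders used in the question.
val cust  = spark.sql("SELECT id, name  FROM customers DISTRIBUTE BY id")
val prods = spark.sql("SELECT id, pname FROM products  DISTRIBUTE BY id")

// DataFrame-API equivalent of the DISTRIBUTE BY above:
// spark.table("customers").select("id", "name").repartition(col("id"))

val joined = cust.join(prods, "id")

// The physical plan still contains Exchange hashpartitioning(id, ...) nodes:
// the DISTRIBUTE BY *is* the shuffle, it does not make it go away.
joined.explain()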

Related

ORDER BY vs SORT BY in Spark SQL

I use Spark 2.4 and use the %sql mode to query tables.
If I am using a window function on a large dataset, which of ORDER BY vs SORT BY will be more efficient from a query-performance standpoint?
I understand that ORDER BY ensures global ordering, but the computation gets pushed to a single reducer. SORT BY, on the other hand, sorts within each partition, but the partitions may receive overlapping ranges.
I want to understand whether SORT BY could also be used in this case, and which one will be more efficient when processing a large dataset (say 100 M rows)?
For example:
ROW_NUMBER() OVER (PARTITION BY prsn_id ORDER BY purch_dt desc) AS RN
VS
ROW_NUMBER() OVER (PARTITION BY prsn_id SORT BY purch_dt desc) AS RN
Can anyone please help. Thanks.
It does not matter whether you use SORT BY or ORDER BY here. You are probably thinking of a Hive behaviour, but you are using Spark, which has no such issue.
As for the PARTITION BY: the single-reducer aspect is only a problem if you have nothing to partition by. You do have prsn_id, so it is not an issue.
In general, SORT BY is applied within each partition and does not guarantee that the entire dataset is sorted,
whereas ORDER BY is applied to the entire dataset (in a single reducer).
Since your query partitions the data and then sorts/orders within each partition key, both variants return the same output.
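A quick way to convince yourself (a sketch, assuming Spark 2.4 and a hypothetical purchases table with prsn_id and purch_dt columns) is to compare the physical plans; both variants compile to the same Window operator over data hash-partitioned by prsn_id and sorted by purch_dt within each partition:

// Sketch only: "purchases" is a stand-in table name, not from the question.
val q1 = spark.sql("""
  SELECT prsn_id, purch_dt,
         ROW_NUMBER() OVER (PARTITION BY prsn_id ORDER BY purch_dt DESC) AS rn
  FROM purchases""")
val q2 = spark.sql("""
  SELECT prsn_id, purch_dt,
         ROW_NUMBER() OVER (PARTITION BY prsn_id SORT BY purch_dt DESC) AS rn
  FROM purchases""")

// Neither plan contains a global sort; the data is only sorted inside each
// prsn_id partition, so nothing is funnelled through a single task.
q1.explain()
q2.explain()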

cassandra - SELECT result as WHERE condition

I want to use the result of a SELECT query as input to another query's condition, like this:
DELETE FROM message_user WHERE id = 8a81de70-1991-11e9-a38f-9e0aa7c9f25f and group = e5b04c50-1982-11e9-abf3-b17ecbb80329 and receiver in (SELECT member FROM chat_group_member WHERE id = e5b04c50-1982-11e9-abf3-b17ecbb80329)
Cassandra is a distributed database, and a nested query like this is effectively a join. Data may be stored across multiple hosts, so to evaluate the join a large amount of data might need to be pulled onto a single node. That can cause performance problems, since all nodes are commodity, peer-to-peer hardware. Hence it is not supported.
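Since CQL has no subqueries, the usual workaround is two round trips from the application: read the members first, then issue the deletes. A minimal sketch, assuming the DataStax Java driver 4.x used from Scala 2.13 and reusing the literal UUIDs from the question (the keyspace name is a placeholder):

import com.datastax.oss.driver.api.core.CqlSession
import scala.jdk.CollectionConverters._

val session = CqlSession.builder().withKeyspace("my_keyspace").build()  // hypothetical keyspace

// Step 1: run the inner SELECT on the application side.
val members = session
  .execute("SELECT member FROM chat_group_member WHERE id = e5b04c50-1982-11e9-abf3-b17ecbb80329")
  .all().asScala
  .map(_.getUuid("member"))

// Step 2: one DELETE per member, feeding in the values from step 1.
members.foreach { member =>
  session.execute(
    "DELETE FROM message_user WHERE id = 8a81de70-1991-11e9-a38f-9e0aa7c9f25f " +
      "AND group = e5b04c50-1982-11e9-abf3-b17ecbb80329 AND receiver = ?",
    member)
}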

Selecting from multiple tables in Cassandra CQL

So I have two tables in the query I am using:
SELECT
R.dst_ap, B.name
FROM airports as A, airports as B, routes as R
WHERE R.src_ap = A.iata
AND R.dst_ap = B.iata;
However it is throwing the error:
mismatched input 'as' expecting EOF (..., B.name FROM airports [as] A...)
Is there any way I can do what I am attempting to do (which is how it works relationally) in Cassandra CQL?
The short answer is that there are no joins in Cassandra. Period. So using SQL-based JOIN syntax will yield an error similar to what you posted above.
The idea with Cassandra (or any distributed database) is to ensure that your queries can be served by a single node (cutting down on network time). There really isn't a way to guarantee that data from different tables could be queried from a single node. For this reason, distributed joins are typically seen as an anti-pattern. To that end, Cassandra simply doesn't allow them.
In Cassandra you need to take a query-based modeling approach. So you could solve this by building a table from your post-join result set, consisting of desired combinations of dst_ap and name. You would have to find an appropriate way to partition this table, but ultimately you would want to build it based on A) the result set you expect to see and B) the properties you expect to filter on in your WHERE clause.
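For illustration, such a query-first table could look roughly like this (a sketch only; routes_by_src_ap, its columns, and the keyspace are illustrative, and the right partition key depends on the queries you actually run), shown here through the DataStax Java driver:

import com.datastax.oss.driver.api.core.CqlSession

val session = CqlSession.builder().withKeyspace("my_keyspace").build()  // hypothetical keyspace

// Sketch: a denormalized "post-join" table, one partition per source airport,
// with the destination airport's name copied in at write time.
session.execute("""
  CREATE TABLE IF NOT EXISTS routes_by_src_ap (
    src_ap   text,
    dst_ap   text,
    dst_name text,
    PRIMARY KEY (src_ap, dst_ap)
  )""")

// Reads then become single-partition lookups served by one node, e.g.
// SELECT dst_ap, dst_name FROM routes_by_src_ap WHERE src_ap = 'LAX';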

Is there a data architecture for efficient joins in Spark (a la RedShift)?

I have data that I would like to do a lot of analytic queries on and I'm trying to figure out if there is a mechanism I can use to store it so that Spark can efficiently do joins on it. I have a solution using RedShift, but would ideally prefer to have something that is based on files in S3 instead of having a whole RedShift cluster up 24/7.
Introduction to the data
This is a simplified example. We have 2 initial CSV files.
Person records
Event records
The two tables are linked via the person_id field. person_id is unique in the Person table. Events have a many-to-one relationship with person.
The goal
I'd like to understand how to set up the data so I can efficiently perform the following query. I will need to perform many queries like this (all queries are evaluated on a per person basis):
The query is to produce a data frame with one row for every person and the following columns.
person_id - person_id for each person in the data set
age - "age" field from the person record
cost - The sum of the "cost" field for all event records for that person where "date" is during the month of 6/2013
All current solutions I have with Spark to this problem involve reshuffling all the data, which ends up making the process slow for large amounts (hundreds of millions of people). I am happy with a solution that requires me to reshuffle the data and write it to a different format once if that can then speed up later queries.
The solution using RedShift
I can accomplish this solution using RedShift in a fairly straightforward way:
Both files are loaded in as RedShift tables, with DISTKEY person_id, SORTKEY person_id. This distributes the data so that all the data for a person is on a single node. The following query will produce the desired data frame:
select person_id, age, e.cost from person
left join (select person_id, sum(cost) as cost from events
where date between '2013-06-01' and '2013-06-30'
group by person_id) as e using (person_id)
The solution using Spark/Parquet
I have thought of several potential ways to handle this in Spark, but none accomplishes what I need. My ideas and the issues are listed below:
Spark Dataset write 'bucketBy' - Read the CSV files and then rewrite them out as parquet files using "bucketBy". Queries on these parquet files could then be very fast. This would produce a data setup similar to RedShift, but bucketBy is not supported when writing plain parquet files with save(); it only works through saveAsTable into a metastore table (a sketch of that variant follows this list).
Spark parquet partitioning - Parquet does support partitioning. Because parquet creates a separate set of files for each partition key, you have to create a computed column to partition on, using a hash of person_id as the partition key. However, when you later join these tables in Spark on "partition_key" and "person_id", the query plan still does a full hash partition, so this approach is no better than just reading the CSVs and shuffling every time.
Stored in some other data format besides parquet - I am open to this, but don't know of another data source that will work.
Using a compound record format - Parquet supports hierarchical data formats, so you can pre-join both tables into a hierarchical record (where a person record has an "events" field which is an array of struct elements) and then process that. With a hierarchical record, there are two approaches to processing it:
Use explode to create separate records - With this approach you explode the array fields into full rows, use standard data frame operations to do the analytics, and then join them back to the main table. Unfortunately, I've been unable to get this approach to compile into efficient queries.
Use UDFs to perform operations on subrecords - This preserves the structure and executes without shuffles, but is an awkward and verbose way to program. It also requires lots of UDFs, which aren't great for performance (although they beat large-scale shuffling of data).
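For reference, here is a hedged sketch of the bucketBy variant mentioned in the first idea above. It assumes a Hive-compatible metastore is available, because bucketBy only works through saveAsTable, not through save() to a bare parquet path; the file paths, table names and bucket count are placeholders:

// Sketch only: paths, table names and the bucket count are illustrative.
val person = spark.read.option("header", "true").csv("s3://bucket/person.csv")
val events = spark.read.option("header", "true").csv("s3://bucket/events.csv")

person.write.format("parquet")
  .bucketBy(200, "person_id").sortBy("person_id")
  .saveAsTable("person_bucketed")

events.write.format("parquet")
  .bucketBy(200, "person_id").sortBy("person_id")
  .saveAsTable("events_bucketed")

// With both tables bucketed the same way on person_id, later joins on
// person_id can avoid the shuffle (no Exchange in the plan).
spark.table("events_bucketed")
  .join(spark.table("person_bucketed"), Seq("person_id"))
  .explain()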
For my use cases, Spark has advantages over RedShift which aren't obvious in this simple example, so I'd prefer to do this with Spark. Please let me know if I am missing something and there is a good approach to this.
Edited per comment.
Assumptions:
Using parquet
Here's what I would try:
val eventAgg = spark.sql("""select person_id, sum(cost) as cost
from events
where date between '2013-06-01' and '2013-06-30'
group by person_id""")
eventAgg.cache.count
val personDF = spark.sql("""SELECT person_id, age from person""")
personDF.cache.count // cache is less important here, so feel free to omit
eventAgg.join(personDF, Seq("person_id"), "left")
I just did this with some of my data and here's how it went (9-node / 140 vCPU cluster, ~600 GB RAM):
27,000,000,000 "events" (aggregated to 14,331,487 "people")
64,000,000 "people" (~20 columns)
aggregated events building and caching took ~3 min
people caching took ~30 seconds (pulling from network, not parquet)
left joining took several seconds
Not caching the "people" led to the join taking a few seconds longer. Forcing Spark to broadcast the couple hundred MB of aggregated events then made the join take under 1 second.
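The broadcast mentioned above looks roughly like this (a sketch, reusing eventAgg and personDF from the snippet above, written with the person table on the left as in the original RedShift query, since it is the small right-hand side that gets broadcast):

import org.apache.spark.sql.functions.broadcast

// Sketch: ship the small aggregated side to every executor so the join becomes
// a BroadcastHashJoin and the large person table is never shuffled.
val result = personDF.join(broadcast(eventAgg), Seq("person_id"), "left")
result.explain()  // plan should show BroadcastHashJoin instead of SortMergeJoin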

Real time complex queries on Cassandra

We're looking for a tool (preferably open source) which helps us perform complex queries (advanced filtering and joins; no need for full SQL) in real time.
Assume that all the data needed fits in memory, and we want to avoid, if possible, the overhead of map reduce tools.
To be more specific, we need to load n partitions of a single table, and join them by clustering column.
Variables Table:
Variable ID: Partition key
Person ID: Clustering key
Variable Value
Desired output columns:
Person ID, Variable 1 Value, Variable 2 Value, ..., Variable N Value
We can achieve it by an in-memory load-filter-join process, but we were wondering if there's any tool out there with this use case covered out of the box and with a fair performance.
We've tested Spark, but the partitioning of Spark C* connector is based on the primary key, so each Variable ID would be loaded in a different Spark node, and the join process would be really slow (all the data would travel all over the Spark cluster).
Any tips? known tools?
I believe that you have a number of options to perform this task:
Rethink your database schema, denormalize it. var_id:person_id:value rows are not the best table schema if you want to query by person_id (and it smells really bad as an entity-attribute-value db antipattern):
EAV gives a flexibility to the developer to define the schema as needed and this is good in some circumstances. On the other hand it performs very poorly in the case of an ill-defined query and can support other bad practices. In other words, EAV gives you enough rope to hang yourself and in this industry, things should be designed to the lowest level of complexity because the guy replacing you on the project will likely be an idiot.
You can use schema with multiple columns (cassandra can handle a lot of them):
create table person_data (
person_id int primary key,
var1 text,
var2 text,
var3 text,
var4 text,
....
);
If you don't have a predefined set of variables, you can use CQL3 collections such as a map to store the data in a more flexible way (see the sketch after this list).
Create a secondary index on person_id (even though it's already a clustering key). You can then query all data for a specific user without joins, but with some issues:
As your query will hit multiple partitions, it will require not a single disk seek, but a series of them, so your query latency may be higher than you're expecting.
Secondary indexes are not free: C* must perform more work under the hood whenever you insert a row into a table with indexed columns.
Use an external index like Elasticsearch/Solr if you plan to have a lot of complex queries which do not fit well into CQL3.
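For reference, the map-based variant of the denormalized schema mentioned above could look something like this (a sketch only, assuming the DataStax Java driver 4.x from Scala 2.13; the table, keyspace and variable names are illustrative):

import com.datastax.oss.driver.api.core.CqlSession
import scala.jdk.CollectionConverters._

val session = CqlSession.builder().withKeyspace("my_keyspace").build()  // hypothetical keyspace

// Sketch: one row per person, variables kept in a map instead of fixed columns.
session.execute("""
  CREATE TABLE IF NOT EXISTS person_vars (
    person_id int PRIMARY KEY,
    vars      map<text, text>
  )""")

// A person's variables come back in one single-partition read, already shaped
// like the desired "Person ID, Variable 1 Value, ..., Variable N Value" row.
val row  = session.execute("SELECT vars FROM person_vars WHERE person_id = 42").one()
val vars = row.getMap("vars", classOf[String], classOf[String]).asScala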
