ORDER BY vs SORT BY in Spark SQL - apache-spark

I use Spark 2.4 and the %sql mode to query tables.
If I am using a window function on a large data-set, which of ORDER BY or SORT BY will be more efficient from a query-performance standpoint?
I understand that ORDER BY ensures global ordering but the computation gets pushed to only 1 reducer. However, SORT BY will sort within each partition but the partitions may receive overlapping ranges.
I want to understand whether SORT BY could also be used in this case, and which one will be more efficient when processing a large data-set (say 100M rows)?
For example:
ROW_NUMBER() OVER (PARTITION BY prsn_id ORDER BY purch_dt desc) AS RN
VS
ROW_NUMBER() OVER (PARTITION BY prsn_id SORT BY purch_dt desc) AS RN
Can anyone please help. Thanks.

It does not matter whether you use SORT BY or ORDER BY. The single-reducer behavior you are likely referring to is a notion from Hive, but you are using Spark, which has no such issue.
As for PARTITION BY: the one-reducer aspect is only an issue if you have nothing to partition by. You do have prsn_id, so it is not an issue.

SORT BY is applied within each partition (bucket) and does not guarantee that the entire dataset is sorted.
ORDER BY, on the other hand, is applied to the entire dataset (in a single reducer).
Since your query is partitioned, and the rows are sorted/ordered within each partition key, both usages return the same output.
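To see the difference outside a window, here is a minimal sketch against a hypothetical table t(k, v):

-- ORDER BY yields one total order across the whole result set.
SELECT k, v FROM t ORDER BY v;

-- SORT BY only sorts rows within each partition; different
-- partitions may hold overlapping ranges of v.
SELECT k, v FROM t SORT BY v;

Inside an OVER clause, though, the sort only needs to hold within each PARTITION BY group, so the two are equivalent there.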

Related

row_number and orderby window function narrow transformation equivalent in spark

df.withColumn("col1", functions.row_number().over(Window.partitionBy("col2").orderBy("col3")));
I think this operation is a wide transformation: it sorts the data and partitions it across the cluster, all partitions together. Because of that, a lot of shuffling occurs and causes performance issues.
I have a use case where the data in each partition is independent of the other partitions. So I want to run the above row-number function per partition only, i.e. I need a narrow transformation, and I couldn't find a way to do it.
Is there a way to achieve it ?
I have read about the orderBy and sort methods being wide and narrow transformations respectively. Does that hold true in this case as well?

Distribute By vs Shuffle in SparkSQL Query Join

I believe that in Spark when there is a JOIN between two tables, both the tables are distributed to the partitions on the same join-key to co-locate the data (from both the tables) to find a match. If I am not mistaken, this action is called SHUFFLE.
However, I also read that there is a DISTRIBUTE BY clause which can be used in a sql query to also pre-distribute the data by the specified key. So logically using a distribute by on the joining tables before a Join will also give the same results as a normal SHUFFLE.
For example:
CREATE OR REPLACE TEMPORARY VIEW cust AS
SELECT id, name
FROM customers
DISTRIBUTE BY id;

CREATE OR REPLACE TEMPORARY VIEW prods AS
SELECT id, pname
FROM products
DISTRIBUTE BY id;

SELECT a.id, a.name, b.pname
FROM cust a
INNER JOIN prods b
ON a.id = b.id;
So, if DISTRIBUTE BY also distributes the data to co-locate the rows of both tables, how is it any different from a shuffle? Can a DISTRIBUTE BY eliminate the shuffle?
Also, how can DISTRIBUTE BY / CLUSTER BY be leveraged to improve query performance?
If possible, please share an example.
Can anyone please clarify.
From the manuals:
DISTRIBUTE BY
Repartition rows in the relation based on a set of expressions. Rows with the same expression values will be hashed to the same
worker. You cannot use this with ORDER BY or CLUSTER BY.
It amounts to the same thing: a shuffle occurs, that is to say you cannot eliminate it; these are just alternative interfaces to the same operation. Of course, this is only possible due to the lazy evaluation Spark employs.
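As a sanity check, you can ask Spark for the physical plan: whether or not you pre-distribute via the views, the plan will still contain an Exchange hashpartitioning step on the join key. A sketch using the views defined above:

EXPLAIN
SELECT a.id, a.name, b.pname
FROM cust a
INNER JOIN prods b
ON a.id = b.id;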

How can I reduce the partition count for a large amount of data in Cassandra, and is it necessary at all?

I have an estimated ~500 million rows of data with 5 million unique numbers. My query must get data by number and event_date. With number as the partition key, there will be 5 million partitions. I think having that many small partitions is not good, and timeouts occur during queries. I'm having trouble defining the partition key. I have found some synthetic sharding strategies, but I couldn't apply them to my model. I could define the partition key as number modulo some value, but then the rows aren't distributed evenly among the partitions.
How can I model this to reduce the partition count, and is it even necessary to reduce it? Is there any limit on partition count?
CREATE TABLE events_by_number_and_date (
    number bigint,
    event_date int, /* e.g. 20200520 */
    event text,
    col1 int,
    col2 decimal,
    PRIMARY KEY (number, event_date)
);
For your query, changing the data model won't help, as you're using a query that is unsuitable for Cassandra. Although Cassandra supports aggregations such as max, count, avg, and sum, they are designed to work inside a single partition, not across the whole cluster. If you issue them without a restriction on the partition key, the coordinator node needs to reach every node in the cluster, and each of them will need to go through all of its data.
You can still do this kind of query, but it's better to use something like Spark for it, as it's heavily optimized for parallel data processing, and the Spark Cassandra Connector is able to perform the querying of the data correctly. If you can't use Spark, you can implement your own full token-range scan, using code similar to this. But in any case, don't expect a "real-time" answer (< 1 sec).
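To illustrate the Spark route: the Spark Cassandra Connector exposes Cassandra tables as a Spark SQL data source, and Spark splits the scan by token range and runs it in parallel. A minimal sketch, assuming a keyspace named my_ks:

-- Register the Cassandra table as a Spark SQL view.
CREATE TEMPORARY VIEW events
USING org.apache.spark.sql.cassandra
OPTIONS (keyspace 'my_ks', table 'events_by_number_and_date');

-- The scan is distributed across executors instead of being
-- funneled through a single coordinator node.
SELECT number, count(*) AS event_count
FROM events
GROUP BY number;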

How to search records using ORDER BY without the partition key

I'm debugging an issue, and the logs should be sitting in a time range between 4/23/19 and 4/25/19.
There are hundreds of millions of records on our production.
It's impossible to locate the target records using random sort.
Is there any workaround to search in a time range without partition key?
select * from XXXX.report_summary order by modified_at desc
Schema
...
"modified_at" "TimestampType" "regular"
"record_end_date" "TimestampType" "regular"
"record_entity_type" "UTF8Type" "clustering_key"
"record_frequency" "UTF8Type" "regular"
"record_id" "UUIDType" "partition_key"
First, ORDER BY is really quite superfluous in Cassandra. It can only operate on your clustering columns within a partition, and then only in the exact order of the clustering columns. The reason for this is that Cassandra reads sequentially from disk, so it writes all data according to the defined clustering order to begin with.
So IMO, ORDER BY in Cassandra is pretty useless, except for cases where you want to change the sort direction (ascending/descending).
Secondly, due to its distributed nature, you need to take a query-oriented approach to data modeling. In other words, your tables must be designed to support the queries you intend to run. Now you can find ways around this, but then you're basically doing a full table scan on a distributed cluster, which won't end well for anyone.
Therefore, the recommended way to go about that, would be to build a table like this:
CREATE TABLE stackoverflow.report_summary_by_month (
    record_id uuid,
    record_entity_type text,
    modified_at timestamp,
    month_bucket bigint,
    record_end_date timestamp,
    record_frequency text,
    PRIMARY KEY (month_bucket, modified_at, record_id)
) WITH CLUSTERING ORDER BY (modified_at DESC, record_id ASC);
Then, this query will work:
SELECT * FROM report_summary_by_month
WHERE month_bucket = 201904
AND modified_at >= '2019-04-23' AND modified_at < '2019-04-26';
The idea here is that, since you care about the order of the results, you need to partition by something else for the sorting to work. For this example I picked month, hence I've "bucketed" your results by month into a partition key called month_bucket. Within each month, I'm clustering on modified_at in DESCending order. This way, the most-recent results are at the "top" of the partition. Then, I threw in record_id as a tie-breaker key to help ensure uniqueness.
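On the write side, the application derives month_bucket (yyyyMM here) from modified_at before inserting. A minimal sketch with made-up values:

INSERT INTO report_summary_by_month
    (month_bucket, modified_at, record_id, record_entity_type)
VALUES (201904, '2019-04-24 10:15:00+0000', uuid(), 'some_type');

Here uuid() just generates a placeholder; in practice the application would supply the real record_id.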
If you're still focused on doing this the wrong way:
You can actually run a range query on your current schema, but with "hundreds of millions of records" across several nodes, I don't have high hopes for it working. You can do it with the ALLOW FILTERING directive (which you shouldn't ever really use):
SELECT * FROM report_summary
WHERE modified_at >= '2019-04-23'
AND modified_at < '2019-04-26' ALLOW FILTERING;
This approach has the following caveats:
With many records across many nodes, it will likely time out.
Without being able to identify a single partition for this query, a coordinator node will be chosen, and that node has a high chance of becoming overloaded.
As this is pulling rows from multiple partitions, a sort order cannot be enforced.
ALLOW FILTERING makes Cassandra work in ways that it really wasn't designed to, so I would never use that on a production system.
If you really need to run a query like this, I recommend using an in-memory aggregation tool, like Spark.
Also, as the original question was about ORDER BY, I wrote an article a while back which better explains this topic: https://www.datastax.com/dev/blog/we-shall-have-order

Avoiding filtering with a compound partition key in Cassandra

I am fairly new to Cassandra and currently have the following table:
CREATE TABLE time_data (
    id int,
    secondary_id int,
    timestamp timestamp,
    value bigint,
    PRIMARY KEY ((id, secondary_id), timestamp)
);
The compound partition key (with secondary_id) is necessary in order to not violate max partition sizes.
The issue I am running into is that I would like to run the query SELECT * FROM time_data WHERE id = ?. Because the table has a compound partition key, this query requires filtering. I realize this queries a lot of data and partitions, but it is necessary for the application. For reference, id has relatively low cardinality and secondary_id has high cardinality.
What is the best way around this? Should I simply allow filtering on the query? Or is it better to create a secondary index like CREATE INDEX id_idx ON time_data (id)?
You will need to specify the full partition key in your queries (ALLOW FILTERING will impact performance badly in most cases).
One way to go, if you know all the secondary_id values (you could add a table to track them if necessary), is to do the job in your application: query all (id, secondary_id) pairs and process the results afterwards. This has the disadvantage of being more complex, but the advantage that it can be done with async queries in parallel, so many nodes in your cluster participate in processing your task.
See also https://www.datastax.com/dev/blog/java-driver-async-queries
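A minimal sketch of that approach, with a hypothetical tracking table:

-- Track which secondary_id values exist for each id.
CREATE TABLE secondary_ids_by_id (
    id int,
    secondary_id int,
    PRIMARY KEY (id, secondary_id)
);

-- Step 1: look up the pairs for the id of interest.
SELECT secondary_id FROM secondary_ids_by_id WHERE id = ?;

-- Step 2: issue one (async) query per (id, secondary_id) pair
-- and merge the results in the application.
SELECT * FROM time_data WHERE id = ? AND secondary_id = ?;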
