What is the best practice of groupby in Spark SQL?

I have a Spark SQL query that groups by multiple columns. I was wondering whether the order of the columns matters for query performance.
Does placing the column with more distinct values earlier help? I assume the groupby is based on some hash/shuffle algorithm. If the first groupby can distribute the data into small subsets that can each be held on one machine, the later groupbys can be done locally. Is this true?
What is the best practice for groupby?

group by, as you assumed, uses a hash function on the grouping columns to decide which partition each set of group by keys ends up in.
You can use distribute by to tell Spark which columns to use - https://docs.databricks.com/spark/latest/spark-sql/language-manual/select.html
As for any other manipulation of the data (like placing columns with more distinct values earlier), note that if you have 2 group by statements in your query, you end up with 2 shuffles. And the result of the first one is obviously quite big (as it's not the final aggregation). So I would try to have as few group by statements as possible.
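
To illustrate the hash-partitioning idea above, here is a minimal pure-Python sketch. It is not Spark's actual implementation: the rows, the column names, the `partition_for` helper and the use of CRC32 are all assumptions made for illustration. Each row is routed to a partition by hashing all group by key columns together, so rows with the same key combination always meet in the same partition, whichever order the columns are listed in.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 4  # hypothetical stand-in for spark.sql.shuffle.partitions


def partition_for(key_values, num_partitions=NUM_PARTITIONS):
    """Map a tuple of group-by key values to a partition index."""
    encoded = "\x1f".join(str(v) for v in key_values).encode("utf-8")
    return zlib.crc32(encoded) % num_partitions


def shuffle(rows, key_columns):
    """Route every row to the partition chosen by hashing its key columns."""
    partitions = defaultdict(list)
    for row in rows:
        key = tuple(row[c] for c in key_columns)
        partitions[partition_for(key)].append(row)
    return partitions


def partitions_per_key(rows, key_columns):
    """Count how many distinct partitions each key combination lands in."""
    seen = defaultdict(set)
    for index, part in shuffle(rows, key_columns).items():
        for row in part:
            seen[frozenset((c, row[c]) for c in key_columns)].add(index)
    return {key: len(parts) for key, parts in seen.items()}


# Made-up rows for illustration only.
rows = [
    {"country": "DE", "city": "Berlin", "amount": 10},
    {"country": "DE", "city": "Berlin", "amount": 5},
    {"country": "DE", "city": "Munich", "amount": 7},
    {"country": "FR", "city": "Paris", "amount": 3},
    {"country": "FR", "city": "Paris", "amount": 8},
    {"country": "FR", "city": "Lyon", "amount": 1},
]

by_country_city = partitions_per_key(rows, ["country", "city"])
by_city_country = partitions_per_key(rows, ["city", "country"])

# Every key combination sits in exactly one partition in both cases.
print(all(n == 1 for n in by_country_city.values()))
print(all(n == 1 for n in by_city_country.values()))
```

Swapping the column order may change which partition a key lands in, but not the fact that all rows for a key are co-located after a single shuffle.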

Related

What is the most efficient way to select distinct value from a spark dataframe?

Of the various ways that you've tried, e.g. df.select('column').distinct(), df.groupby('column').count() etc., what is the most efficient way to extract distinct values from a column?

It does not matter, as you can see in this excellent reference: https://www.waitingforcode.com/apache-spark-sql/distinct-vs-group-by-key-difference/read.
This is because Apache Spark has a logical optimization rule called ReplaceDistinctWithAggregate that rewrites an expression using the distinct keyword into an aggregation.
In the simple context of selecting unique values for a column, DISTINCT and GROUP BY execute the same way, i.e. as an aggregation.
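
As a rough analogy in plain Python (not Spark code; the `values` list is made up), selecting distinct values and grouping by the value both reduce to "one output entry per key", which is why the two forms can share an execution strategy:

```python
from collections import Counter

# Hypothetical single-column data.
values = ["a", "b", "a", "c", "b", "a"]

# Analogue of SELECT DISTINCT column: keep each value once.
distinct_values = set(values)

# Analogue of GROUP BY column with a count aggregate: one group per value.
grouped_counts = Counter(values)

print(sorted(distinct_values))  # ['a', 'b', 'c']
print(sorted(grouped_counts))   # ['a', 'b', 'c']
```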

For larger datasets, groupby is an efficient method.

Spark dataframe distinct write is increasing the output size by almost 10 fold

I have a case where I am trying to write some results to S3 using DataFrame write with the query below. input_table_1 is 13 GB in size and input_table_2 is 1 MB.
input_table_1 has columns account, membership and
input_table_2 has columns role, id, membership_id, quantity, start_date
SELECT
/*+ BROADCASTJOIN(input_table_2) */
account,
role,
id,
quantity,
cast(start_date AS string) AS start_date
FROM
input_table_1
INNER JOIN
input_table_2
ON array_contains(input_table_1.membership, input_table_2.membership_id)
where the membership array contains a list of member_ids.
Writing this dataset with the Spark DataFrame writer generates around 1.1 TiB of data in S3, with around 700 billion records.
We identified that there are duplicates and used dataframe.distinct.write.parquet("s3path") to remove them. The record count dropped to almost a third of the previous total, around 200 billion rows, but we observed that the output size in S3 is now 17.2 TiB.
I am very confused about how this can happen.
I have used the following Spark conf setting:
spark.sql.shuffle.partitions=20000
I have tried to do a coalesce and write to S3, but it did not work.
Please suggest whether this is expected and what can be done.

There are two sides to this:
1) Physical translation of distinct in Spark
The Spark catalyst optimiser turns a distinct operation into an aggregation by means of the ReplaceDeduplicateWithAggregate rule (Note: in the execution plan distinct is named Deduplicate).
This basically means df.distinct() on all columns is translated into a groupBy on all columns with an empty aggregation:
df.groupBy(df.columns:_*).agg(Map.empty).
Spark uses a HashPartitioner when shuffling data for a groupBy on respective columns. Since the groupBy clause in your case contains all columns (well, implicitly, but it does), you're more or less randomly shuffling data to different nodes in the cluster.
Increasing spark.sql.shuffle.partitions in this case is not going to help.
Now on to the 2nd side, why does this affect the size of your parquet files so much?
2) Compression in parquet files
Parquet is a columnar format, which is to say your data is organised in columns rather than row by row. This allows for powerful compression if the data is adequately laid out and ordered. E.g. if a column contains the same value for a number of consecutive rows, it is enough to write that value just once and make a note of the number of repetitions (a strategy called run length encoding). But Parquet also uses various other compression strategies.
Unfortunately, data ends up pretty randomly distributed in your case after shuffling to remove duplicates. The original partitioning of input_table_1 was a much better fit.
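
The effect of row order on run length encoding can be sketched in a few lines of plain Python. This is a toy model, not Parquet's actual encoder, and the column values are made up: the same 1,000 values are encoded once in clustered order and once after a random shuffle.

```python
import random
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]


# Hypothetical column: 1,000 rows drawn from only 5 distinct account names,
# laid out so that equal values sit next to each other.
clustered = [f"account_{i}" for i in range(5) for _ in range(200)]

# The same values after a random shuffle (seeded for reproducibility).
shuffled = clustered[:]
random.Random(42).shuffle(shuffled)

clustered_runs = run_length_encode(clustered)
shuffled_runs = run_length_encode(shuffled)

print(len(clustered_runs))  # 5 runs: one per distinct value
print(len(shuffled_runs))   # far more runs for exactly the same data
```

The content of the column is identical in both cases; only the ordering differs, and with it the number of runs the encoder has to store.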
Solutions
There's no single answer to how to solve this, but here are a few pointers on what I'd suggest doing next:
What's causing the duplicates? Could these be removed upstream? Or is there a problem with the join condition causing duplicates?
A simple solution is to just repartition the dataset after distinct to match the partitioning of your input data. Adding a secondary sort (sortWithinPartitions) is likely going to give you even better compression. However, this comes at the cost of an additional shuffle!
As @matt-andruff pointed out below, you can also achieve this in SQL using cluster by. Obviously, that also requires you to move the distinct keyword into your SQL statement.
Write your own deduplication algorithm as a Spark Aggregator and group / shuffle the data just once in a meaningful way.

Order of rows shown changes on selection of columns from dependent pyspark dataframe

Why does the order of rows displayed differ, when I take a subset of the dataframe columns to display, via show?
Here is the original dataframe:
Here dates are in the given order, as you can see, via show.
Now the order of rows displayed via show changes when I select a subset of predict_df by method of column selection for a new dataframe.

Because a Spark dataframe is itself unordered. This is due to the parallel processing principles which Spark uses. Different records may be located in different files (and on different nodes), and different executors may read the data at different times and in a different sequence.
So you have to explicitly specify the order in a Spark action using the orderBy (or sort) method. E.g.:
df.orderBy('date').show()
In this case the result will be ordered by the date column and will be more predictable. But if many records have an equal date value, the records within each such date subset will still be unordered. So in this case, in order to obtain strictly ordered data, we have to perform orderBy on a set of columns, and the combination of values in that set of columns must be unique for every row. E.g.:
df.orderBy(col("date").asc(), col("other_column").desc())
In general, unordered datasets are the normal case for data processing systems. Even a "traditional" DBMS like PostgreSQL or MS SQL Server in general returns unordered records, and we have to explicitly use an ORDER BY clause in the SELECT statement. And even if we sometimes see the same results from one query, the DBMS does not guarantee that another execution will return the same result. This is especially true when a large amount of data is being read.
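
The tie-breaking point can be shown without Spark. In this minimal pure-Python sketch (the rows are made up, and `sorted` stands in for orderBy), sorting by date alone leaves rows that share a date in whatever order they arrived, while sorting by a unique column combination gives the same sequence for any arrival order.

```python
# Hypothetical rows of (date, id); the pair is unique, date alone is not.
rows = [
    ("2021-01-02", 3),
    ("2021-01-01", 1),
    ("2021-01-02", 2),
    ("2021-01-01", 4),
]
# The same rows arriving in a different order, e.g. read by other executors.
rows_other_arrival = list(reversed(rows))

# Ordering by date only: ties keep their arrival order.
by_date_a = sorted(rows, key=lambda r: r[0])
by_date_b = sorted(rows_other_arrival, key=lambda r: r[0])

# Ordering by a unique combination of columns: fully determined.
by_date_id_a = sorted(rows, key=lambda r: (r[0], r[1]))
by_date_id_b = sorted(rows_other_arrival, key=lambda r: (r[0], r[1]))

print(by_date_a == by_date_b)        # False
print(by_date_id_a == by_date_id_b)  # True
```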

The situation occurs because show is an action, and it is called twice.
As no .cache is applied, the whole cycle starts again from the beginning. Moreover, I tried this a few times and sometimes got the same order and sometimes, as the questioner observed, a different order. Processing is non-deterministic.
As soon as I used .cache, I always got the same result.
This means that ordering is preserved over a narrow transformation on a dataframe if caching has been applied; otherwise the 2nd action will invoke processing from the start again. The basics are evident here as well. And maybe the bottom line is: always do ordering explicitly if it matters.

As @Ihor Konovalenko and @mck mentioned, a Spark dataframe is unordered by its nature. Also, it looks like your dataframe doesn't have a reliable key to order by, so one solution is to use monotonically_increasing_id https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.monotonically_increasing_id.html to create an id column that you can then order by to keep your dataframe consistently ordered. However, if your dataframe is big, be aware that this function might take some time to generate an id for each row.

Does it help to filter down a dataframe before a left outer join?

I've only seen sources say that this helps for RDDs, so I was wondering if it helped for DataFrames since the Spark core and spark SQL engines optimize differently.
Let's say table 1 has 6mil records and we're joining to table 2 which has 600mil records. We are joining these two tables on table 2's primary key, 'key2'.
If we plan to do:
table 3 = table1.join(table2, 'key2', 'left_outer'),
Is it worth it to filter down table2's 600mil records with a WHERE table2.key2 IN table1.key2 before the join? And if so, what's the best way to do it? I know the DataFrame LEFT SEMI JOIN method is similar to a WHERE IN filter, but I'd like to know if there are better ways to filter it down.

TL;DR It is not possible to answer without data, but probably not.
Pre-filtering may provide a performance boost if you significantly reduce the number of records to be shuffled. For that to happen:
It has to be highly selective.
Size of the key column is << size of all columns.
The first one is obvious. If there is no reduction, you filter for nothing.
The second is subtle - WHERE ... IN (SELECT ... FROM ...) requires a shuffle, the same as a join. So the keys are actually shuffled twice.
Using bloom filters might scale more gracefully (no need to shuffle).
If you have a 100-fold difference in the number of records, it might be better to consider a broadcast join.
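
To make the bloom filter suggestion concrete, here is a minimal pure-Python sketch of the idea. It is not Spark's implementation, and the class, its parameters and the sample tables are all assumptions for illustration. A compact filter is built from the keys of the small table and used to discard most non-matching rows of the large table before any join. False positives are possible, so the join itself is still needed afterwards; false negatives are not.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


# Hypothetical data: keys of the small table and rows of the large table.
small_table_keys = list(range(0, 1000, 10))  # 100 keys
large_table = [(key, f"payload_{key}") for key in range(5000)]

# Build the filter once from the small side.
bloom = BloomFilter()
for key in small_table_keys:
    bloom.add(key)

# Drop rows of the large side whose key is definitely not in the small side.
prefiltered = [row for row in large_table if bloom.might_contain(row[0])]

print(len(large_table), len(prefiltered))
```

The filter is a fixed-size bit array, so in a distributed setting it can be shipped to every worker instead of shuffling the keys themselves.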

Joining two result sets into one

I wanted to know if there's a way to join two or more result sets into one.
I actually need to execute more than one query and return just one result set. I can't use the UNION or the JOIN operators because I'm working with Cassandra (CQL)
Thanks in advance !

Frameworks like PlayOrm provide support for JOIN queries (INNER and LEFT JOINs) in Cassandra.
http://buffalosw.com/wiki/Command-Line-Tool/
You may see more examples at:
https://github.com/deanhiller/playorm/blob/master/src/test/java/com/alvazan/test/TestJoins.java

If you want to query multiple rows within the same column family, you can use the IN keyword:
SELECT * FROM testCF WHERE key IN ('rowKeyA', 'rowKeyB', 'rowKeyZ') LIMIT 10;
This will get you back 10 results from each row.
If you need to join results from different CFs, or query with differing WHERE clauses, then you need to run multiple queries and merge the results in code - Cassandra doesn't cater for that kind of thing.
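
Merging in code can be as simple as the following sketch. The row dictionaries and the `merge_result_sets` helper are hypothetical; the lists stand in for whatever your Cassandra client returns for two separate queries. The helper behaves roughly like a UNION, keeping the first row seen for each key.

```python
from itertools import chain


def merge_result_sets(*result_sets, key="id"):
    """Concatenate result sets, keeping the first row seen for each key."""
    merged = {}
    for row in chain.from_iterable(result_sets):
        merged.setdefault(row[key], row)
    return list(merged.values())


# Hypothetical rows, standing in for the results of two separate CQL queries.
users_by_city = [{"id": 1, "name": "ann"}, {"id": 2, "name": "bob"}]
users_by_age = [{"id": 2, "name": "bob"}, {"id": 3, "name": "cy"}]

combined = merge_result_sets(users_by_city, users_by_age)
print([row["id"] for row in combined])  # [1, 2, 3]
```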

PlayOrm can do joins, but you may need to have PlayOrm partitioning turned on so that you still scale (i.e. you don't want to join 1 billion rows with 1 billion rows). Typically you instead join one partition with another partition, or a partition of the Account table with a partition of the Users table. I.e. make sure you still design for scale.
