How is the Spark select-explode idiom implemented? - apache-spark

Assume we have a DataFrame with a string column, col1, and an array column, col2. I was wondering what happens behind the scenes in the Spark operation:
df.select('col1', explode('col2'))
It seems that select takes a sequence of Column objects as input, and explode returns a Column so the types match. But the column returned by explode('col2') is logically of different length than col1, so I was wondering how select knows to "sync" them when constructing its output DataFrame. I tried looking at the Column class for clues but couldn't really find anything.

The answer is simple: there is no such data structure as Column. While Spark SQL uses columnar storage for caching and can leverage the data layout for some low-level operations, columns are just descriptions of data and transformations, not data containers. So, simplifying things a bit, explode is yet another flatMap on the Dataset[Row].
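To make the flatMap analogy concrete, here is a minimal pure-Python sketch. It is a model of the behaviour, not Spark's implementation: rows are plain dicts, and the helper name explode_rows and its parameters are made up for illustration. The output column name "col" mirrors the default name Spark gives to an exploded array column.

```python
from itertools import chain

def explode_rows(rows, keep, array_col, out_name="col"):
    """Toy model of df.select(keep, explode(array_col)) as a flatMap over rows."""
    def per_row(row):
        # One output row per array element; the kept value is repeated each time.
        # A row whose array is empty or None produces no output rows.
        return [{keep: row[keep], out_name: item} for item in (row[array_col] or [])]
    return list(chain.from_iterable(per_row(r) for r in rows))

rows = [
    {"col1": "a", "col2": [1, 2]},
    {"col1": "b", "col2": []},
    {"col1": "c", "col2": [3]},
]
exploded = explode_rows(rows, "col1", "col2")
print(exploded)
```

Because each input row is expanded on its own, col1 and the exploded values never need to be "synced": the pairing is decided row by row.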

Related

What is the most efficient way to select distinct value from a spark dataframe?

Of the various ways that you've tried, e.g. df.select('column').distinct(), df.groupby('column').count() etc., what is the most efficient way to extract distinct values from a column?
It does not matter, as you can see in this excellent reference: https://www.waitingforcode.com/apache-spark-sql/distinct-vs-group-by-key-difference/read.
This is because Apache Spark has a logical optimization rule called ReplaceDistinctWithAggregate that rewrites an expression using the distinct keyword into an aggregation.
In the simple case of selecting the unique values of a column, DISTINCT and GROUP BY execute the same way, i.e. as an aggregation.
For larger datasets, groupby is an efficient method.

How does table data get loaded into a DataFrame in Databricks: row by row or in bulk?

I am new to Databricks notebooks and DataFrames. I have a requirement to load a few columns (out of many) from a table of around 14 million records into a DataFrame. Once the table is loaded, I need to create a new column based on the values present in two columns.
I want to write the logic for the new column along with the select command while loading the table into the DataFrame.
Ex:
df = (
    spark.read.table(tableName)
    .select(columnsList)
    .withColumn('newColumnName', logic)  # logic: placeholder for a Column expression
)
Will it have any performance impact? Is it better to first load the few columns of the table into the df and then perform the column manipulation on the loaded df?
Does the table data get loaded all at once or row by row into the df? If row by row, am I causing any performance degradation by including the column manipulation logic while reading the table?
Thanks in advance!!
This really depends on the underlying format of the table: is it backed by Parquet or Delta, or is it an interface to an actual database, etc. In general, Spark tries to read only the necessary data, and if, for example, Parquet (or Delta) is used, this is easier because it is a column-oriented file format, so the data for each column is stored together.
Regarding the question about reading: Spark is lazy by default, so even if you assign df = spark.read.table(....) to a separate variable, then add .select, and then add .withColumn, it won't do anything until you call some action, for example .count, or write your results. Until then, Spark will just check that the table exists, that your operations are correct, etc. You can always call .explain on the resulting DataFrame to see how Spark will perform the operations.
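The lazy behaviour described above can be illustrated with a toy model in plain Python. This is only a sketch under stated assumptions: LazyTable, with_column and read_table are hypothetical names invented here, not Spark's API, and the real engine builds and optimizes a query plan rather than a list of Python callables.

```python
class LazyTable:
    """Toy stand-in for a lazily evaluated DataFrame (not Spark's API)."""

    def __init__(self, source, steps=()):
        self._source = source        # callable that produces rows when invoked
        self._steps = tuple(steps)   # recorded transformations, not yet applied

    def select(self, *names):
        step = lambda row: {n: row[n] for n in names}
        return LazyTable(self._source, self._steps + (step,))

    def with_column(self, name, fn):
        step = lambda row: {**row, name: fn(row)}
        return LazyTable(self._source, self._steps + (step,))

    def count(self):
        # The "action": only now is the source read and every step applied.
        total = 0
        for row in self._source():
            for step in self._steps:
                row = step(row)
            total += 1
        return total

reads = []  # records every time the underlying table is actually scanned

def read_table():
    reads.append("scan")
    return [{"a": 1, "b": 2, "c": 3}, {"a": 4, "b": 5, "c": 6}]

df = (
    LazyTable(read_table)
    .select("a", "b")
    .with_column("a_plus_b", lambda r: r["a"] + r["b"])
)
reads_before_action = len(reads)  # nothing has been read yet
n = df.count()                    # triggers the single scan
print(reads_before_action, n, len(reads))
```

In this model it makes no difference whether select and with_column are chained onto the read or applied to a separately stored variable: both only record steps, and the table is scanned once when the action runs.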
P.S. I recommend grabbing a free copy of Learning Spark, 2nd edition, which is provided by Databricks. It will give you a foundation for developing code for Spark/Databricks.

Spark RDD write to Cassandra

I have a below Cassandra Table schema.
ColumnA Primary Key
ColumnB Clustering Key
ColumnC
ColumnD
Now, I have a Spark RDD with columns ordered as
RDD[ColumnC, ColumnA, ColumnB, ColumnD]
So, when I write to the Cassandra table, I need to make sure the ordering is correct. To do that, I have to specify the column ordering using SomeColumns:
rdd.saveToCassandra(keyspace, table, SomeColumns("ColumnA", "ColumnB", "ColumnC", "ColumnD"))
Is there any way I can pass all the column names as a list instead? I am asking because I have around 140 columns in my target table and cannot give all the names as part of SomeColumns, so I am looking for a cleaner approach.
PS: I cannot write it from a DataFrame; I am looking only for a solution based on RDDs.
You can use the following syntax to expand a sequence into a list of arguments:
SomeColumns(names_as_sequence: _*)
Update:
If you have a sequence of column names as strings, then you need to do:
SomeColumns(names_as_string_seq.map(x => x.as(x)): _*)

What is the best practice of groupby in Spark SQL?

I have a Spark SQL query that groups by multiple columns. I was wondering whether the order of the columns matters for query performance.
Does placing the column with more distinct values earlier help? I assume the groupby is based on some hash/shuffle algorithm. If the first groupby can distribute the data into small subsets that can each be held on one machine, the later groupbys can be done locally. Is this true?
What is the best practice of groupby?
group by, as you assumed, uses a hash function on the columns to decide which partition each set of group by keys ends up in.
You can use distribute by to tell Spark which columns to use: https://docs.databricks.com/spark/latest/spark-sql/language-manual/select.html
As for any other manipulation of the data (like placing columns with more distinct values earlier), note that if you have 2 group by statements in your query, you end up with 2 shuffles. And the result of the first one is obviously quite big (as it's not the final aggregation). So I would try to have as few group by statements as possible.
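The hash-based placement described in this answer can be modelled in a few lines of plain Python. This is a rough sketch, not Spark's implementation: the function names are invented for illustration, and zlib.crc32 is only a stand-in for whatever hash function Spark uses internally. The assumption being illustrated is that the whole tuple of group by columns is hashed at once, in a single shuffle.

```python
import zlib
from collections import defaultdict

def partition_for(key, num_partitions):
    # Stable hash of the whole key tuple (a stand-in for Spark's own hash).
    return zlib.crc32(repr(key).encode("utf-8")) % num_partitions

def shuffle_by_key(rows, key_cols, num_partitions):
    # Send every row to the partition chosen by its full group-by key.
    partitions = defaultdict(list)
    for row in rows:
        key = tuple(row[c] for c in key_cols)
        partitions[partition_for(key, num_partitions)].append(row)
    return partitions

def group_count(rows, key_cols, num_partitions=4):
    # After the shuffle, each partition can be aggregated independently,
    # because all rows sharing a key live in the same partition.
    counts = {}
    for part_rows in shuffle_by_key(rows, key_cols, num_partitions).values():
        for row in part_rows:
            key = tuple(row[c] for c in key_cols)
            counts[key] = counts.get(key, 0) + 1
    return counts

rows = [
    {"country": "US", "city": "NYC"},
    {"country": "US", "city": "NYC"},
    {"country": "US", "city": "LA"},
    {"country": "FR", "city": "Paris"},
]
by_country_city = group_count(rows, ["country", "city"])
by_city_country = group_count(rows, ["city", "country"])
print(by_country_city)
print(by_city_country)
```

In this model, reordering the key columns may change which partition a group lands in, but it produces the same groups and the same number of shuffles, which is why the column order is not the thing to optimize.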

How do you eliminate data skew when joining large tables in pyspark?

Table A has ~150M rows while Table B has about 60. In Table A, column_1 can and often does contain a large number of NULLs. This causes the data to become badly skewed, and one executor ends up doing all of the work after the LEFT JOIN.
I've read several posts on a solution but I've been unable to wrap my head around the different approaches that span several different versions of Spark.
What operation do I need to perform on Table A, and what operation do I need to perform on Table B, to eliminate the skewed partitioning that occurs as a result of the LEFT JOIN?
I'm using Spark 2.3.0 and writing in Python. In the code snippet below, I'm attempting to derive a new column that's devoid of NULLs (which would be used to execute the join), but I'm not sure where to take it (and I have no idea what to do with Table B)
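from pyspark.sql.functions import col, rand, when
# Replace NULLs in column_1 with a random value, giving a join key without NULLs
new_column1 = when(col('column_1').isNull(), rand()).otherwise(col('column_1'))
df1 = df1.withColumn('no_nulls_here', new_column1)
df1.persist().count()  # cache df1 and force it to be computed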
