What is the best way to join multiple jdbc connection tables in spark? - apache-spark

I'm trying to migrate a query to pyspark and need to join multiple tables in it. All the tables in question are in Redshift and I'm using the jdbc connector to talk to them.
My problem is: how do I do these joins optimally without reading in too much data (i.e. loading each full table and joining on a key), and without just blatantly using:
spark.sql("""join table1 on x=y join table2 on y=z""")
Is there a way to push the queries down to Redshift but still use the Spark DataFrame API for writing the logic, and also use DataFrames from the Spark context, without saving them to Redshift just for the joins?

Here are some points to consider:
The connector will push down the specified filters only if there is a filter specified in your Spark code, e.g. select * from tbl where id > 10000. You can confirm that yourself by checking the responsible Scala code, and the corresponding test demonstrates exactly that. The test test("buildWhereClause with multiple filters") verifies that the variable expectedWhereClause is equal to the whereClause generated by the connector. The generated where clause should be:
"""
|WHERE "test_bool" = true
|AND "test_string" = \'Unicode是樂趣\'
|AND "test_double" > 1000.0
|AND "test_double" < 1.7976931348623157E308
|AND "test_float" >= 1.0
|AND "test_int" <= 43
|AND "test_int" IS NOT NULL
|AND "test_int" IS NULL
"""
which results from the Spark filters specified above.
The driver also supports column filtering, meaning it will load only the required columns by pushing the valid columns down to Redshift. You can again verify that from the corresponding Scala tests test("DefaultSource supports simple column filtering") and test("query with pruned and filtered scans").
In your case though, you haven't specified any filters in your join query, hence Spark cannot leverage the two previous optimisations. If you are aware of such filters, feel free to apply them; a sketch of what that looks like follows.
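For illustration, here is a minimal PySpark sketch of what that could look like with the spark-redshift connector; the URL, tempdir, table and column names below are placeholders rather than values from the question:
# Sketch only: connection details, table and column names are placeholders.
table1_df = (spark.read
    .format("com.databricks.spark.redshift")
    .option("url", "jdbc:redshift://host:5439/db?user=USER&password=PASS")
    .option("dbtable", "table1")
    .option("tempdir", "s3a://some-bucket/tmp/")
    .load()
    # Both of the following give the connector something to push down:
    .filter("id > 10000")    # becomes a WHERE clause in Redshift
    .select("id", "x"))      # only these columns are unloaded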
Last but not least, and as Salim already mentioned, the official Spark connector for Redshift can be found here. The Spark connector is built on top of the Amazon Redshift JDBC driver, so it will try to use it anyway, as specified in the connector's code.
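If the goal is to run the join itself inside Redshift while still getting a DataFrame back, one workaround is to push the whole join down as a derived-table subquery through the JDBC reader, similar to the Databricks JDBC example further down this page. A sketch, with hypothetical connection settings and table/column names:
# Sketch only: the URL, credentials and table/column names are placeholders.
jdbc_url = "jdbc:redshift://host:5439/db"
connection_props = {"user": "USER", "password": "PASS",
                    "driver": "com.amazon.redshift.jdbc42.Driver"}

# The outer parentheses and the alias are required for a subquery.
pushdown_query = """(
    SELECT t1.x, t2.y, t3.z
    FROM table1 t1
    JOIN table2 t2 ON t1.x = t2.y
    JOIN table3 t3 ON t2.y = t3.z
) AS joined"""

df = spark.read.jdbc(url=jdbc_url, table=pushdown_query, properties=connection_props)
# The join executes in Redshift; Spark reads only the joined result, which can
# then be combined with DataFrames already registered in the Spark session.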

Related

How can I implement pyspark Cassandra "keybased" connector?

I am using Spark 2.4.7 and have implemented the normal PySpark Cassandra connector, but there is a use case where I need a key-based connector. I am not finding useful blogs/tutorials around it; could someone please help me with it?
I have tried the normal pyspark-cassandra connector and it works fine.
Now I want to implement a key-based connector, which I am unable to find.
Normally the connector loads the entire table, but I don't want to load the entire table; I want to run a query on the source and fetch only the required data.
By key-based I mean getting data using some keys, i.e. using a where condition like
Select *
From <table_name>
Where <column_name>!=0
should run on the source and load only the data that satisfies this condition.
To get this functionality you need to understand how Spark and Cassandra work, separately and together:
When you do spark.read, Spark doesn't load all the data - it just fetches metadata, such as the table structure, column names and types, partitioning schema, etc.
When you run a query with a condition (where or filter), the Spark Cassandra Connector tries to perform so-called predicate pushdown - converting the Spark SQL query into the corresponding CQL query - but whether it can really depends on the condition; if it's not possible, it goes through all the data and performs the filtering on the Spark side. For example, if you have a condition on a column that is the partition key, it will be converted into the CQL expression SELECT ... FROM table WHERE pk = XXX. Similarly, there are some optimizations for queries on clustering columns - Spark will still need to go through all partitions, but it can filter data based on the clustering columns, which is more efficient. See the connector's documentation on predicate pushdown to understand which conditions can be pushed down to Cassandra and which cannot. The rule of thumb is: if you can execute the query in cqlsh without ALLOW FILTERING, it will be pushed down.
In your specific example, you're using an inequality predicate (<> or !=), which isn't supported by Cassandra, so the Spark Cassandra Connector will need to go through all the data, and the filtering will happen on the Spark side. Both cases are sketched below.
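For illustration, a minimal PySpark sketch (the keyspace, table and column names are hypothetical) contrasting a partition-key filter, which can be pushed down, with the inequality filter from the question, which cannot:
from pyspark.sql import functions as F

# Hypothetical keyspace/table/columns, read via the Spark Cassandra Connector.
df = (spark.read
    .format("org.apache.spark.sql.cassandra")
    .options(keyspace="my_ks", table="my_table")
    .load())

# Condition on the partition key: converted into a CQL WHERE clause,
# so Cassandra returns only the matching partition.
by_key = df.filter(F.col("pk") == "some_key")

# Inequality condition: not supported by CQL, so the connector reads
# all the data and Spark filters it afterwards.
not_zero = df.filter(F.col("some_col") != 0)

# explain() shows which predicates were actually pushed down to Cassandra.
by_key.explain()
not_zero.explain()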

Repartition in Spark - SQL API

We use the SQL API of Spark to execute queries on Hive tables on the cluster. How can I perform a REPARTITION on a column in my query via the SQL API? Please note that we do not use the DataFrame API; instead we use the SQL API (e.g. SELECT * FROM table WHERE col = 1).
I understand that PySpark offers a function for this in the DataFrame API.
However, I want to know the syntax to specify a REPARTITION (on a specific column) in a SQL query via the SQL API (through a SELECT statement).
Consider the following query:
select a.x, b.y
from a
JOIN b
on a.id = b.id
Any help is appreciated.
We use Spark 2.4
Thanks
You can provide hints to trigger a repartition in Spark SQL, for example:
spark.sql('''SELECT /*+ REPARTITION(colname) */ col1,col2 from table''')
You can use either, but with %sql use DISTRIBUTE BY; from the manuals:
DISTRIBUTE BY
Repartition rows in the relation based on a set of expressions. Rows with the same expression values will be hashed to the same worker. You cannot use this with ORDER BY or CLUSTER BY.
It all amounts to the same thing: a shuffle occurs either way, and you cannot eliminate it; these are just alternative interfaces. And of course this is only possible because of the 'lazy' evaluation Spark employs.
%sql
SELECT * FROM boxes DISTRIBUTE BY width
SELECT * FROM boxes DISTRIBUTE BY width SORT BY width
This is the %sql alternative, via DISTRIBUTE BY, to the hint in the other answer.
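Applied to the query from the question, that could look like the following sketch (the join key a.id is added to the select list here so the DISTRIBUTE BY expression resolves cleanly; adjust to your schema):
# Sketch: repartition the join result by the join key.
result = spark.sql("""
    SELECT a.id, a.x, b.y
    FROM a
    JOIN b
      ON a.id = b.id
    DISTRIBUTE BY a.id
""")
result.explain()  # the plan should show an Exchange hashpartitioning on id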

Pulling only required columns in Spark from Cassandra without loading all the columns

Using the spark-elasticsearch connector it is possible to load only the required columns directly from ES into Spark. However, there doesn't seem to be such a straightforward option to do the same using the spark-cassandra connector.
Reading data from ES into Spark
-- here only the required columns are brought from ES to Spark:
spark.conf.set('es.nodes', ",".join(ES_CLUSTER))
es_epf_df = spark.read.format("org.elasticsearch.spark.sql") \
.option("es.read.field.include", "id_,employee_name") \
.load("employee_0001") \
Reading data from Cassandra into Spark
-- here all the columns' data is brought into Spark and then a select is applied to pull the columns of interest:
spark.conf.set('spark.cassandra.connection.host', ','.join(CASSANDRA_CLUSTER))
cass_epf_df = spark.read.format('org.apache.spark.sql.cassandra') \
.options(keyspace="db_0001", table="employee") \
.load() \
.select("id_", "employee_name")
Is it possible to do the same for Cassandra? If yes, then how? If not, why not?
Actually, the connector should do that itself, without your needing to set anything explicitly; it's called "predicate pushdown" (together with column pruning), and the cassandra-connector does it, according to the documentation:
The connector will automatically pushdown all valid predicates to
Cassandra. The Datasource will also automatically only select columns
from Cassandra which are required to complete the query. This can be
monitored with the explain command.
source: https://github.com/datastax/spark-cassandra-connector/blob/master/doc/14_data_frames.md
The code you have written is already doing that. You wrote the select after the load, so you may think that all the columns are pulled first and the selected columns filtered afterwards, but that is not the case.
Assumption : select * from db_0001.employee;
Actual : select id_, employee_name from db_0001.employee;
Spark understands which columns you need and queries only those from the Cassandra database. This feature is called column pruning (closely related to predicate pushdown). It is not limited to Cassandra; many sources support this feature (it is a feature of Spark's data source API, not of Cassandra itself).
For more info: https://docs.datastax.com/en/dse/6.7/dse-dev/datastax_enterprise/spark/sparkPredicatePushdown.html
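As the documentation quoted above suggests, this can be checked with explain; a quick sketch using the reader from the question:
cass_epf_df = (spark.read
    .format("org.apache.spark.sql.cassandra")
    .options(keyspace="db_0001", table="employee")
    .load()
    .select("id_", "employee_name"))

# The Cassandra scan node in the physical plan should list only the two
# requested columns (plus any pushed filters), confirming the pruning
# happens in Cassandra rather than in Spark.
cass_epf_df.explain()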

How to specify the filter condition with spark DataFrameReader API for a table?

I was reading about Spark in the Databricks documentation: https://docs.databricks.com/data/tables.html#partition-pruning-1
It says
When the table is scanned, Spark pushes down the filter predicates
involving the partitionBy keys. In that case, Spark avoids reading
data that doesn’t satisfy those predicates. For example, suppose you
have a table that is partitioned by <date>. A query
such as SELECT max(id) FROM <example-data> WHERE date = '2010-10-10'
reads only the data files containing tuples whose date value matches
the one specified in the query.
How can I specify such filter condition in DataFrameReader API while reading a table?
As Spark is lazily evaluated, when you read the data using the DataFrame reader it is just added as a stage in the underlying DAG.
When you then run a SQL query over the data, that too is added as another stage in the DAG.
And when you apply any action on the DataFrame, the DAG is evaluated and all the stages are optimized by the Catalyst optimizer, which in the end generates the most cost-effective physical plan.
At the time of DAG evaluation, predicate conditions are pushed down and only the required data is read into memory.
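For example, a minimal PySpark sketch (the table name and partition column mirror the placeholders from the Databricks docs quoted in the question and are not real objects):
# Read a metastore table partitioned by `date` and filter on the partition column.
df = spark.read.table("example_data").filter("date = '2010-10-10'")

# Nothing is read yet; when an action runs, the filter is applied as a
# partition filter, so only the matching partition's files are scanned.
df.selectExpr("max(id) AS max_id").show()

# The physical plan lists the pushed filter under PartitionFilters.
df.explain()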
DataFrameReader is created (available) exclusively using SparkSession.read.
That means it is created when the following code is executed (example of csv file load)
val df = spark.read.csv("path1,path2,path3")
Spark provides a pluggable data provider framework (the Data Source API) to roll out your own data source. Basically, it provides interfaces that can be implemented for reading from and writing to your custom data source. That's where partition pruning and predicate filter pushdown are generally implemented.
Databricks spark supports many built-in datasources (along with predicate pushdown and partition pruning capabilities) as per https://docs.databricks.com/data/data-sources/index.html.
So, if the need is to load data from a JDBC table and specify filter conditions, see the following example:
// Note: The parentheses are required.
val pushdown_query = "(select * from employees where emp_no < 10008) emp_alias"
val df = spark.read.jdbc(url=jdbcUrl, table=pushdown_query, properties=connectionProperties)
display(df)
Please refer to more details here
https://docs.databricks.com/data/data-sources/sql-databases.html

Does spark load all data from Kudu when a filter is used?

I am new to spark. Will the following code load all data or just filter data from kudu?
val df: DataFrame = spark.read.options(Map(
"kudu.master" -> kuduMaster,
"kudu.table" -> s"impala::platform.${table}")).kudu
val outPutDF = df.filter(row => {
val recordAt: Long = row.getAs("record_at").toString.toLong
recordAt >= XXX && recordAt < YYY
})
The easiest way to check whether the filter is pushed down for a given connector is to use the Spark UI.
The scan nodes in the Spark plan carry a metric for the number of records read from the data source (you can check this in the Spark UI -> SQL tab after running a query).
Write a query with and without an explicit predicate (on a small dataset).
Inferences
1. If the number of records in the scan node is the same with and without the predicate, Spark has read the data completely from the data source and the filtering is done in Spark.
2. If the numbers are different, predicate pushdown has been implemented in the data source connector.
3. Using this experiment you can also figure out which kinds of predicates are pushed down (this depends on the connector implementation); a sketch of the experiment follows.
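A quick PySpark sketch of that experiment (the Kudu master address, table name and filter bounds are placeholders modelled on the question):
df = (spark.read
    .format("org.apache.kudu.spark.kudu")
    .option("kudu.master", "kudu-master:7051")            # placeholder
    .option("kudu.table", "impala::platform.some_table")  # placeholder
    .load())

# Run the same action with and without the predicate, then compare the
# "number of output rows" metric on the scan node in Spark UI -> SQL.
df.count()
df.filter("record_at >= 1000 AND record_at < 2000").count()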
Try .explain.
Not sure I got the code, but here is an example of some code that works.
val dfX = df.map(row => XYZ(row.getAs[Long]("someVal"))).filter($"someVal" === 2) // XYZ is a case class with a someVal field; requires import spark.implicits._
But assuming you can get the code to work, Spark "predicate pushdown" will apply in your case and the filtering will be applied in the Kudu storage manager. So, not all the data is loaded.
This is from the KUDU Guide:
<> and OR predicates are not pushed to Kudu, and instead will be
evaluated by the Spark task. Only LIKE predicates with a suffix
wildcard are pushed to Kudu, meaning that LIKE "FOO%" is pushed down
but LIKE "FOO%BAR" isn’t.
That is to say, your case is OK: a >= / < range is pushable. Note, though, that the row-level lambda in your filter is opaque to Spark and prevents the pushdown; use column expressions instead, as in the sketch below.
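For completeness, a PySpark sketch of the column-expression form that allows the pushdown (the master address, table name and bounds are placeholders, and record_at is assumed to be numeric):
from pyspark.sql import functions as F

df = (spark.read
    .format("org.apache.kudu.spark.kudu")
    .option("kudu.master", "kudu-master:7051")        # placeholder
    .option("kudu.table", "impala::platform.events")  # placeholder
    .load())

# Placeholder bounds standing in for XXX and YYY from the question.
lower, upper = 1500000000, 1600000000

# Column expressions (unlike a row-level lambda) are visible to Catalyst,
# so the >= and < predicates can be pushed down to Kudu.
out_df = df.filter((F.col("record_at") >= lower) & (F.col("record_at") < upper))
out_df.explain()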
