How to specify the filter condition with spark DataFrameReader API for a table? - apache-spark

I was reading about spark on databricks documentation https://docs.databricks.com/data/tables.html#partition-pruning-1
It says
When the table is scanned, Spark pushes down the filter predicates
involving the partitionBy keys. In that case, Spark avoids reading
data that doesn’t satisfy those predicates. For example, suppose you
have a table that is partitioned by <date>. A query
such as SELECT max(id) FROM <example-data> WHERE date = '2010-10-10'
reads only the data files containing tuples whose date value matches
the one specified in the query.
How can I specify such filter condition in DataFrameReader API while reading a table?

Since Spark is lazily evaluated, reading the data with the DataFrame reader just adds a stage to the underlying DAG.
When you then run a SQL query (or a filter) over the data, that is added as another stage in the DAG.
Only when you apply an action on the DataFrame is the DAG evaluated; all the stages are optimized by the Catalyst optimizer, which in the end generates the most cost-effective physical plan.
At the time of DAG evaluation, predicate conditions are pushed down and only the required data is read into memory.
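A minimal sketch of the above, assuming a Parquet dataset partitioned by a date column at a hypothetical path:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("pushdown-sketch").getOrCreate()
import spark.implicits._

// Reading is lazy: nothing is scanned yet, the read is just a node in the plan.
val events = spark.read.parquet("/data/events")        // hypothetical path

// The filter is another lazy node; Catalyst pushes it down at planning time.
val filtered = events.filter($"date" === "2010-10-10")

// Only an action (or explain) triggers planning. The physical plan should list
// the predicate under PartitionFilters / PushedFilters.
filtered.explain(true)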

DataFrameReader is created (available) exclusively using SparkSession.read.
That means it is created when the following code is executed (example of a CSV file load):
val df = spark.read.csv("path1", "path2", "path3")
Spark provides a pluggable data provider framework (the Data Source API) to roll out your own data source. Basically, it provides interfaces that can be implemented for reading from and writing to your custom data source. That's where partition pruning and predicate pushdown are generally implemented.
Databricks Spark supports many built-in data sources (along with predicate pushdown and partition pruning capabilities); see https://docs.databricks.com/data/data-sources/index.html.
So, if you need to load data from a JDBC table and specify filter conditions, see the following example:
// Note: The parentheses are required.
val pushdown_query = "(select * from employees where emp_no < 10008) emp_alias"
val df = spark.read.jdbc(url=jdbcUrl, table=pushdown_query, properties=connectionProperties)
display(df)
More details are available here:
https://docs.databricks.com/data/data-sources/sql-databases.html
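To come back to the original question about a table partitioned by date: you don't pass the predicate to DataFrameReader itself; you read the table and attach the condition with where/filter, and Catalyst prunes the partitions at planning time. A sketch, where the table and column names mirror the Databricks example above and spark is the usual SparkSession:

import org.apache.spark.sql.functions.max
import spark.implicits._

val maxId = spark.read.table("example_data")
  .where($"date" === "2010-10-10")    // predicate on the partition column
  .agg(max($"id"))

maxId.explain(true)   // PartitionFilters in the physical plan confirm pruning
maxId.show()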

Related

How to extract only specific rows from mongodb using Pyspark?

I am extracting data from mongodb collection and writing it to bigquery table using Spark python code.
below is my code snippet:
df = spark.read \
    .format("com.mongodb.spark.sql.DefaultSource") \
    .option("uri", "mongodb_url") \
    .option("database", "db_name") \
    .option("collection", "collection_name") \
    .load()

df.write \
    .format("bigquery") \
    .mode("append") \
    .option("temporaryGcsBucket", "gcs_bucket") \
    .option("createDisposition", "CREATE_IF_NEEDED") \
    .save("bq_dataset_name.collection_name")
This extracts all the data from the MongoDB collection, but I want to extract only the documents that satisfy a condition (like a WHERE condition in a SQL query).
One way I found was to read the whole dataset into a DataFrame and apply a filter on it, like below:
df2 = df.filter(df['date'] < '12-03-2020 10:12:40')
But as my source Mongo collection has 8-10 GB of data, I cannot afford to read the whole dataset from Mongo every time.
How can I apply filtering while reading data from Mongo using spark.read?
Have you checked whether your whole collection is actually being scanned even after applying the filters?
Assuming you are using the official MongoDB Spark connector, filter/predicate pushdown is supported:
“Predicates pushdown” is an optimization from the connector and the
Catalyst optimizer to automatically “push down” predicates to the data
nodes. The goal is to maximize the amount of data filtered out on the
data storage side before loading it into Spark’s node memory.
There are two kinds of predicates automatically pushed down by the connector to MongoDB:
the select clause (projections) as a $project
the filter clause content (where) as one or more $match
You can find the supporting code for this over here.
Note:
There have been some issues regarding predicate pushdown on nested fields but this is a bug in spark itself and affects other sources as well. This has been fixed in Spark 3.x. Check this answer.
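A sketch in Scala (the same pattern applies in PySpark): attach the filter immediately after load() so the connector can push it down as a $match. The connection options are the placeholders from the question, and spark is the usual SparkSession:

import spark.implicits._

val filtered = spark.read
  .format("com.mongodb.spark.sql.DefaultSource")
  .option("uri", "mongodb_url")
  .option("database", "db_name")
  .option("collection", "collection_name")
  .load()
  .filter($"date" < "12-03-2020 10:12:40")  // if date is stored as a string,
                                            // this is a lexicographic comparison

// Look for PushedFilters in the physical plan to see what reached MongoDB.
filtered.explain(true)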

What is the best way to join multiple jdbc connection tables in spark?

I'm trying to migrate a query to pyspark and need to join multiple tables in it. All the tables in question are in Redshift and I'm using the jdbc connector to talk to them.
My problem is: how do I do these joins optimally without reading too much data (i.e., loading each full table and joining on a key), and without just blatantly using:
spark.sql("""join table1 on x=y join table2 on y=z""")
Is there a way to push the queries down to Redshift while still using the Spark DataFrame API to write the logic, and also to use DataFrames from the Spark context without saving them to Redshift just for the joins?
Here are some points to consider:
The connector will push down the specified filters only if a filter is specified in your Spark code, e.g. select * from tbl where id > 10000. You can confirm that yourself; just check the responsible Scala code. Also, here is the corresponding test which demonstrates exactly that: test("buildWhereClause with multiple filters") verifies that the variable expectedWhereClause is equal to the whereClause generated by the connector. The generated WHERE clause should be:
"""
|WHERE "test_bool" = true
|AND "test_string" = \'Unicode是樂趣\'
|AND "test_double" > 1000.0
|AND "test_double" < 1.7976931348623157E308
|AND "test_float" >= 1.0
|AND "test_int" <= 43
|AND "test_int" IS NOT NULL
|AND "test_int" IS NULL
"""
which is derived from the Spark filters specified above.
The driver also supports column filtering, meaning it will load only the required columns by pushing the valid columns down to Redshift. You can again verify that from the corresponding Scala tests test("DefaultSource supports simple column filtering") and test("query with pruned and filtered scans").
In your case, however, you haven't specified any filters in your join query, so Spark cannot leverage the two previous optimisations. If you are aware of such filters, feel free to apply them.
Last but not least, and as Salim already mentioned, the official Spark connector for Redshift can be found here. The Spark connector is built on top of the Amazon Redshift JDBC driver, so it will try to use it anyway, as specified in the connector's code.
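One way to make Redshift do the join itself is to hand the whole join over as a subquery through the JDBC table argument, the same pattern as the employees example earlier. A sketch; the table and column names are placeholders, and jdbcUrl and connectionProperties are assumed to be defined as in that example:

val joinPushdown =
  """(select t1.x, t1.y, t2.z
    |   from table1 t1
    |   join table2 t2 on t1.y = t2.y
    |  where t1.x > 10000) q_alias""".stripMargin

val joined = spark.read.jdbc(
  url = jdbcUrl,                      // placeholder connection details
  table = joinPushdown,
  properties = connectionProperties
)

joined.explain(true)

If you prefer to keep the join logic in the DataFrame API, at least attach filters to each side so they are pushed down; the join itself will then run in Spark.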

Performance consideration when reading from hive view Vs hive table via DataFrames

We have a view that unions multiple Hive tables. If I use Spark SQL in PySpark to read that view, will there be any performance issue compared with reading directly from the table?
In Hive we had something called a full table scan if we don't limit the WHERE clause to an exact table partition. Is Spark intelligent enough to directly read the table that has the data we are looking for, rather than scanning the entire view?
Please advise.
You are talking about partition pruning.
Yes, Spark supports it: Spark automatically omits large reads when partition filters are specified.
Partition pruning is possible when data within a table is split across multiple logical partitions. Each partition corresponds to a particular value of a partition column and is stored as a subdirectory within the table root directory on HDFS. Where applicable, only the required partitions (subdirectories) of a table are queried, thereby avoiding unnecessary I/O.
After partitioning the data, subsequent queries can omit large amounts of I/O when the partition column is referenced in predicates. For example, the following query automatically locates and loads the file under peoplePartitioned/age=20/ and omits all others:
val peoplePartitioned = spark.read.format("orc").load("peoplePartitioned")
peoplePartitioned.createOrReplaceTempView("peoplePartitioned")
spark.sql("SELECT * FROM peoplePartitioned WHERE age = 20")
More detailed info is provided here.
You can also see this in the query plan if you run explain(True) on your query:
spark.sql("SELECT * FROM peoplePartitioned WHERE age = 20").explain(True)
It will show which partitions are read by Spark.
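For completeness, a sketch of how a layout like peoplePartitioned is usually produced in the first place (the people DataFrame is an assumption): partitionBy writes one age=<value> subdirectory per distinct value, which is what makes the pruning above possible.

people.write
  .format("orc")
  .partitionBy("age")            // creates peoplePartitioned/age=<value>/ directories
  .save("peoplePartitioned")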

How can you pushdown predicates to Cassandra or limit requested data when using Pyspark / Dataframes?

For example, docs.datastax.com mentions:
table1 = sqlContext.read.format("org.apache.spark.sql.cassandra").options(table="kv", keyspace="ks").load()
and it's the only way I know, but let's say that I want to load only the last one million entries from this table. I don't want to load the whole table into memory every time, especially if this table has, for example, over 10 million entries.
Thanks!
While you can't load data faster, you can load portions of the data or terminate early. Spark DataFrames utilize Catalyst to optimize their underlying query plans, which enables them to take some shortcuts.
For example, calling limit allows Spark to skip reading some portions of the underlying data source. This limits the amount of data read from Cassandra by cancelling tasks before they are executed.
Calling filter, or adding filters, can be used by the underlying data source to restrict the amount of information actually pulled from Cassandra. There are limitations on what can be pushed down, but this is all detailed in the documentation:
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/14_data_frames.md#pushing-down-clauses-to-cassandra
Note that all of this is accomplished simply by doing further API calls on your DataFrame after loading it. For example:
val df = sqlContext
  .read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "kv", "keyspace" -> "ks"))
  .load()

df.show(10) // Will compute only enough tasks to get 10 records and no more
df.filter("clusteringKey > 5").show() // Will push the clustering predicate down to C*
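A rough sketch of the "terminate early" and pushdown checks described above; the keyspace, table and column names are placeholders and spark is the usual SparkSession (the DataFrame API behaves the same way as the older sqlContext one):

val kv = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "kv", "keyspace" -> "ks"))
  .load()

// limit lets Spark stop scheduling scan tasks once enough rows are produced;
// note it returns the first rows it happens to read, not "the last" million.
kv.limit(1000000).show(5)

// explain shows which predicates were actually pushed down to Cassandra.
kv.filter("clusteringKey > 5").explain(true)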

How to parallelize RDD work when using the Cassandra Spark connector for data aggregation?

Here is a sample scenario: we have real-time data records in Cassandra, and we want to aggregate the data over different time ranges. I wrote code like this:
val timeRanges = getTimeRanges(report)
timeRanges.foreach { timeRange =>
  val (timestampStart, timestampEnd) = timeRange
  val query = _sc.get.cassandraTable(report.keyspace, utilities.Helper.makeStringValid(report.scope))
    .where("TIMESTAMP > ?", timestampStart)
    .where("VALID_TIMESTAMP <= ?", timestampEnd)
  // ......do the aggregation work....
}
The issue with this code is that the aggregation work for the individual time ranges does not run in parallel. My question is: how can I parallelize the aggregation work? Since an RDD can't be used inside another RDD or a Future, is there any way to parallelize the work, or can't we use the Spark connector here?
Use the joinWithCassandraTable function. This allows you to use the data from one RDD to access C* and pull records just like in your example.
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/2_loading.md#performing-efficient-joins-with-cassandra-tables-since-12
joinWithCassandraTable utilizes the java driver to execute a single
query for every partition required by the source RDD so no unneeded
data will be requested or serialized. This means a join between any
RDD and a Cassandra Table can be performed without doing a full table
scan. When performed between two Cassandra Tables which share the
same partition key this will not require movement of data between
machines. In all cases this method will use the source RDD's
partitioning and placement for data locality.
Finally, we use union to join the resulting RDDs, which makes the work run in parallel.
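A minimal sketch of the joinWithCassandraTable approach, assuming each time range can be mapped to values of the table's partition key; the keyspace, table and bucket key column are placeholders and sc is the SparkContext:

import com.datastax.spark.connector._

case class Bucket(bucket: String)                     // shape of the partition key

val buckets = Seq("2020-03-01", "2020-03-02")         // placeholder key values
val bucketsRdd = sc.parallelize(buckets.map(Bucket(_)))

// One targeted query per key in the driving RDD, executed in parallel across
// the cluster -- no per-range scan loop and no nested RDD/Future needed.
val joined = bucketsRdd.joinWithCassandraTable("report_keyspace", "report_table")
// joined: RDD[(Bucket, CassandraRow)] -- do the per-range aggregation from here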
