How to extract only specific rows from mongodb using Pyspark? - apache-spark

I am extracting data from mongodb collection and writing it to bigquery table using Spark python code.
below is my code snippet:
df = spark.read\
.format("com.mongodb.spark.sql.DefaultSource")\
.option("uri","mongodb_url")\
.option("database","db_name")\
.option("collection", "collection_name")\
.load()
df.write \
.format("bigquery") \
.mode("append")\
.option("temporaryGcsBucket","gcs_bucket") \
.option("createDisposition","CREATE_IF_NEEDED")\
.save("bq_dataset_name.collection_name")
This will extract all the data from mongodb collection. but i want to extract only the documents which satisfied the condition(like where condition in sql query).
One way i found was to read whole data in dataframe and use filter on that dataframe like below:
df2 = df.filter(df['date'] < '12-03-2020 10:12:40')
But as my source mongo collection has 8-10 Gb of data, i cannot afford to read whole data everytime from mongo.
How can i use filtering while reading data from mongo using spark.read?

Have you tried checking if your whole data is being scanned even after applying filters?
Assuming you are using the official connector with spark, filter/ predicate pushdown is supported.
“Predicates pushdown” is an optimization from the connector and the
Catalyst optimizer to automatically “push down” predicates to the data
nodes. The goal is to maximize the amount of data filtered out on the
data storage side before loading it into Spark’s node memory.
There are two kinds of predicates automatically pushed down by the connector to MongoDB:
the select clause (projections) as a $project
the filter clause content (where) as one or more $match
You can find the supporting code for this over here.
Note:
There have been some issues regarding predicate pushdown on nested fields but this is a bug in spark itself and affects other sources as well. This has been fixed in Spark 3.x. Check this answer.

Related

How to paginate hive table in spark?

I wanted to do a pagination on a hive table having ~1.5 billion rows using pyspark. I came across one solution using ROW_NUMBER(). When I tried it, I am running out memory. Not sure whether spark is trying to bring in the complete table to it's memory and then doing a pagination.
After that, I came across this LIMIT clause in Hive SQL (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Select#LanguageManualSelect-LIMITClause) and tried it. But it failed in spark, the reason which I figured out was that hiveQL is not completely supported in spark.sql(). Spark SQL limit does not support multiple arguments for offset -> https://spark.apache.org/docs/3.0.0/sql-ref-syntax-qry-select-limit.html
Is there a good approach where in I can do pagination using spark?
PS: The hive table does not have an ID column, with which I can sort and do a pagination. :)
basic use of spark :
# Extract the data
df = spark.read.table("my_table")
# Transform the data
df = df.withColumn("new_col", some_transformation())
# Load the data
df.write ... # write wherever you want

How to specify the filter condition with spark DataFrameReader API for a table?

I was reading about spark on databricks documentation https://docs.databricks.com/data/tables.html#partition-pruning-1
It says
When the table is scanned, Spark pushes down the filter predicates
involving the partitionBy keys. In that case, Spark avoids reading
data that doesn’t satisfy those predicates. For example, suppose you
have a table that is partitioned by <date>. A query
such as SELECT max(id) FROM <example-data> WHERE date = '2010-10-10'
reads only the data files containing tuples whose date value matches
the one specified in the query.
How can I specify such filter condition in DataFrameReader API while reading a table?
As spark is lazily evaluated when your read the data using dataframe reader it is just added as a stage in the underlying DAG.
Now when you run SQL query over the data it is also added as another stage in the DAG.
And when you apply any action over the dataframe, then the DAG is evaluated and all the stages are optimized by catalyst optimized which in the end generated the most cost effective physical plan.
At the time of DAG evaluation predicate conditions are pushed down and only the required data is read into the memory.
DataFrameReader is created (available) exclusively using SparkSession.read.
That means it is created when the following code is executed (example of csv file load)
val df = spark.read.csv("path1,path2,path3")
Spark provides a pluggable Data Provider Framework (Data Source API) to rollout your own datasource. Basically, it provides interfaces that can be implemented for reading/writing to your custom datasource. That's where generally the partition pruning and predicate filter pushdowns are implemented.
Databricks spark supports many built-in datasources (along with predicate pushdown and partition pruning capabilities) as per https://docs.databricks.com/data/data-sources/index.html.
So, if the need is to load data from JDBC table and specify filter conditions, please see the following example
// Note: The parentheses are required.
val pushdown_query = "(select * from employees where emp_no < 10008) emp_alias"
val df = spark.read.jdbc(url=jdbcUrl, table=pushdown_query, properties=connectionProperties)
display(df)
Please refer to more details here
https://docs.databricks.com/data/data-sources/sql-databases.html

Spark-Cassandra connector: how to change collections write behavior

In Java, I have a Spark dataset (Spark Structured Streaming) with a column of type java.util.ArrayList<Short> and I want to write the dataset in a Cassandra table which has a corresponding list<smallint>.
Each time I write the row in Cassandra it updates an existing row and I want to customize the write behavior of the list in order to control if
the written list will overwrite the existing list or
the content of the written list will be appended to the content of the list already saved in Cassandra
I found in the spark-cassandra-connector source code a class CollectionBehavior which is extended by both CollectionAppend and CollectionOverwrite. It seems exatcly what I am looking for but I didn't find a way to use it while writing to Cassandra.
The dataset is written to Cassandra using:
dataset.write()
.format("org.apache.spark.sql.cassandra")
.option("table", table)
.option("keyspace", keyspace)
.mode(SaveMode.Append)
.save();
Is it possible to change this behavior?
To save to Cassandra collections while setting the save mode for the collection, use the RDD API. Dataset API seem to be missing this so far. So changing the dataset to RDD and using RDD methods to save to cassandra should be able to give you the behaviour you want.
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/5_saving.md

Optimized hive data aggregation using hive

I have a hive table (80 million records) with the followig schema (event_id ,country,unit_id,date) and i need to export this data to a text file as with the following requirments:
1-Rows are aggregated(combined) by event_id.
2-Aggregated rows must be sorted according to date.
For example rows with same event_id must be combined as a list of lists, ordered according to date.
What is the best performance wise solution to make this job using spark ?
Note: This is expected to be a batch job.
Performance-wise, I think the best solution is to write a spark program (scala or python) to read in the underlying files to the hive table, do your transformations, and then write the output as a file.
I've found that it's much quicker to just read the files in spark rather than querying hive through spark and pulling the result into a dataframe.

Spark-Cassandra: how to efficiently restrict partitions

After days thinking about it I'm still stuck with this problem: I have one table where "timestamp" is the partition key. This table contains billions of rows.
I also have "timeseries" tables that contain timestamps related to specific measurement processes.
With Spark I want to analyze the content of the big table. Of course it is not efficient to do a full table scan, and with a rather fast lookup in the timeseries table I should be able to target only, say, 10k partitions.
What is the most efficient way to achieve this?
Is SparkSQL smart enough to optimize something like this
sqlContext.sql("""
SELECT timeseries.timestamp, bigtable.value1 FROM timeseries
JOIN bigtable ON bigtable.timestamp = timeseries.timestamp
WHERE timeseries.parameter = 'xyz'
""")
Ideally I would expect Cassandra to fetch the timestamps from the timeseries table and then use that to query only that subset of partitions from bigtable.
If you add an "Explain" call to your query you'll see what the Catalyst planner will do for your query but I know it will not do the optimizations you want.
Currently Catalyst has no support for pushing down joins to DataSources which means the structure of your query is most likely got to look like.
Read Data From Table timeseries with predicate parameter = 'xyz'
Read Data From Table bigtable
Join these two results
Filter on bigtable.timestamp == timeseries.timestamp
The Spark Cassandra Connector will be given the predicate from the timeseries table read and will be able to optimize it if is a clustering key or a partition key. See the Spark Cassandra Connector Docs. If it doesn't fit into one of those pushdown categories it will require a Full Table Scan followed by a filter in Spark.
Since the Read Data From Table bigtable has no restrictions on it, Spark will instruct the Connector to read the entire table (Full Table Scan).
I can only take a guess on the optimizations done by the driver, but I'd surely expect a query such as that to restrict the JOIN on the WHERE, which means that your simple query will be optimized.
What I will do as well is point you in the general direction of optimizing Spark SQL. Have a look at Catalyst for Spark SQL, which is a tool for greatly optimizing queries all the way down to the physical level.
Here is a breakdown of how it works:
Deep Dive into Spark SQL Catalyst Optimizer
And the link to the git-repo: Catalyst repo

Resources