Spark-Cassandra connector: how to change collections write behavior - apache-spark

In Java, I have a Spark dataset (Spark Structured Streaming) with a column of type java.util.ArrayList<Short>, and I want to write the dataset to a Cassandra table that has a corresponding list<smallint> column.
Each time I write a row to Cassandra it updates an existing row, and I want to customize the write behavior of the list in order to control whether
the written list overwrites the existing list, or
the content of the written list is appended to the content of the list already saved in Cassandra.
I found in the spark-cassandra-connector source code a class CollectionBehavior, which is extended by both CollectionAppend and CollectionOverwrite. It seems to be exactly what I am looking for, but I didn't find a way to use it while writing to Cassandra.
The dataset is written to Cassandra using:
dataset.write()
.format("org.apache.spark.sql.cassandra")
.option("table", table)
.option("keyspace", keyspace)
.mode(SaveMode.Append)
.save();
Is it possible to change this behavior?

To save to Cassandra collections while setting the save mode for the collection, use the RDD API; the Dataset API seems to be missing this so far. Converting the dataset to an RDD and using the RDD methods to save to Cassandra should give you the behaviour you want.
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/5_saving.md
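For reference, here is a minimal Scala sketch of the RDD-based save with an appending collection behavior, following the collection-saving examples in the linked documentation. The keyspace, table, and column names are placeholders, not taken from the question.

import com.datastax.spark.connector._  // brings saveToCassandra and the "lcol" append column syntax into scope

// Hypothetical table: my_keyspace.my_table(key int PRIMARY KEY, lcol list<smallint>)
val listElements = sc.parallelize(Seq(
  (1, Vector[Short](1, 2)),
  (1, Vector[Short](3))
))

// Append behavior: the written values are added to the list already stored in Cassandra
listElements.saveToCassandra("my_keyspace", "my_table", SomeColumns("key", "lcol" append))

// Default behavior (overwrite): the stored list is replaced by the written one
listElements.saveToCassandra("my_keyspace", "my_table", SomeColumns("key", "lcol"))

If the dataset comes from Structured Streaming, this conversion would typically happen inside foreachBatch, turning each micro-batch dataset into an RDD before saving.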

Related

How to paginate hive table in spark?

I wanted to paginate a Hive table with ~1.5 billion rows using PySpark. I came across one solution using ROW_NUMBER(). When I tried it, I ran out of memory. I am not sure whether Spark tries to bring the complete table into its memory and then does the pagination.
After that, I came across the LIMIT clause in Hive SQL (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Select#LanguageManualSelect-LIMITClause) and tried it. But it failed in Spark; the reason, as I figured out, is that HiveQL is not completely supported in spark.sql(). Spark SQL's LIMIT does not support an additional offset argument -> https://spark.apache.org/docs/3.0.0/sql-ref-syntax-qry-select-limit.html
Is there a good approach by which I can do pagination using Spark?
PS: The Hive table does not have an ID column with which I can sort and paginate. :)
Basic use of Spark:
# Extract the data
df = spark.read.table("my_table")
# Transform the data
df = df.withColumn("new_col", some_transformation())
# Load the data
df.write ... # write wherever you want
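If explicit page ranges are still needed without an ID column, one possible approach (a rough sketch in Scala; the table name, page size and output path are made up) is to attach a stable row index with zipWithIndex, which is computed across partitions instead of pulling the whole table through a single ROW_NUMBER() window, and then filter to the requested page:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}

val df = spark.read.table("my_table")

val pageSize = 1000000L
val page = 3L  // hypothetical 0-based page number

// zipWithIndex assigns a contiguous index in a distributed way
// (note: the ordering is simply the order in which the underlying files are read)
val indexedRdd = df.rdd.zipWithIndex.map { case (row, idx) => Row.fromSeq(row.toSeq :+ idx) }
val indexedSchema = StructType(df.schema.fields :+ StructField("row_idx", LongType, nullable = false))
val indexed = spark.createDataFrame(indexedRdd, indexedSchema)

// Keep only the rows that fall inside the requested page
val pageDf = indexed
  .filter(indexed("row_idx") >= page * pageSize)
  .filter(indexed("row_idx") < (page + 1) * pageSize)

pageDf.write.parquet(s"/tmp/pages/page_" + page)  // write wherever you want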

How to extract only specific rows from mongodb using Pyspark?

I am extracting data from a MongoDB collection and writing it to a BigQuery table using Spark Python code.
Below is my code snippet:
df = spark.read\
.format("com.mongodb.spark.sql.DefaultSource")\
.option("uri","mongodb_url")\
.option("database","db_name")\
.option("collection", "collection_name")\
.load()
df.write \
.format("bigquery") \
.mode("append")\
.option("temporaryGcsBucket","gcs_bucket") \
.option("createDisposition","CREATE_IF_NEEDED")\
.save("bq_dataset_name.collection_name")
This extracts all the data from the MongoDB collection, but I want to extract only the documents that satisfy a condition (like a WHERE condition in a SQL query).
One way I found was to read the whole collection into a dataframe and apply a filter on that dataframe, like below:
df2 = df.filter(df['date'] < '12-03-2020 10:12:40')
But as my source Mongo collection has 8-10 GB of data, I cannot afford to read the whole data from Mongo every time.
How can I apply filtering while reading data from Mongo using spark.read?
Have you tried checking whether your whole dataset is actually being scanned even after applying the filters?
Assuming you are using the official connector with Spark, filter/predicate pushdown is supported (a short sketch follows after the note below).
“Predicates pushdown” is an optimization from the connector and the Catalyst optimizer to automatically “push down” predicates to the data nodes. The goal is to maximize the amount of data filtered out on the data storage side before loading it into Spark's node memory.
There are two kinds of predicates automatically pushed down by the connector to MongoDB:
the select clause (projections) as a $project
the filter clause content (where) as one or more $match
You can find the supporting code for this over here.
Note:
There have been some issues regarding predicate pushdown on nested fields but this is a bug in spark itself and affects other sources as well. This has been fixed in Spark 3.x. Check this answer.
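To illustrate (a Scala sketch, mirroring the placeholders from the question): keep the filter on the dataframe returned by load() and let the connector push it down, then check the physical plan to confirm what was actually pushed.

// Read as in the question; connection values are placeholders
val df = spark.read
  .format("com.mongodb.spark.sql.DefaultSource")
  .option("uri", "mongodb_url")
  .option("database", "db_name")
  .option("collection", "collection_name")
  .load()

// The connector translates this DataFrame filter into a $match stage,
// so only matching documents should be fetched from MongoDB
val filtered = df.filter(df("date") < "12-03-2020 10:12:40")

// Inspect the physical plan to verify which filters were pushed down
filtered.explain(true)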

Write a spark DataFrame to a table

I am trying to understand the Spark DataFrame API method called saveAsTable.
I have the following questions:
If I simply write a dataframe using the saveAsTable API,
df7.write.saveAsTable("t1") (assuming t1 did not exist earlier), will the newly created table be a Hive table which can be read outside Spark using HiveQL?
Does Spark also create some non-Hive tables (created using the saveAsTable API but not readable outside Spark using HiveQL)?
How can I check whether a table is a Hive table or a non-Hive table?
(I am new to big data processing, so pardon me if the question is not phrased properly.)
Yes. The newly created table will be a Hive table and can be queried from the Hive CLI (only if the DataFrame is created from a single, non-partitioned input HDFS path).
Below is the documentation comment from the DataFrameWriter.scala class (documentation link):
When the DataFrame is created from a non-partitioned HadoopFsRelation with a single input path, and the data source provider can be mapped to an existing Hive builtin SerDe (i.e. ORC and Parquet), the table is persisted in a Hive compatible format, which means other systems like Hive will be able to read this table. Otherwise, the table is persisted in a Spark SQL specific format.
Yes, you can. Your table can be partitioned by a column, but it cannot use bucketing (that is an incompatibility between Spark and Hive).
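For the third question (how to check whether the table is Hive-compatible), a quick way is to inspect the metadata Spark records in the catalog. A sketch in Scala, reusing the t1 table name from the question:

// Save as in the question (t1 assumed not to exist yet)
df7.write.saveAsTable("t1")

// DESCRIBE FORMATTED shows the provider, SerDe library and storage properties.
// A Hive-compatible table typically shows a Hive SerDe (e.g. a Parquet or ORC SerDe),
// while a Spark-specific table carries its format only in Spark-specific table properties.
spark.sql("DESCRIBE FORMATTED t1").show(100, truncate = false)

// The catalog API also lists the table type (MANAGED / EXTERNAL) and database
spark.catalog.listTables().show(truncate = false)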

Reading Big Query using Spark BigQueryConnector

I want to read a BigQuery table using the Spark BigQuery connector and pass partition information into it. This is working fine, but it's reading the full table. I want to filter the data based on some partition value. How can I do it? I don't want to read the full table and then apply a filter on the Spark dataset; I want to pass the partition information while reading itself. Is that even possible?
Dataset<Row> testDS = session.read().format("bigquery")
.option("table", <TABLE>)
//.option("partition",<PARTITION>)
.option("project", <PROJECT_ID>)
.option("parentProject", <PROJECT_ID>)
.load();
Filtering works this way: .option("filter", "_PARTITIONTIME = '2020-11-23 13:00:00'")
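Put together (a Scala sketch; the table and project values remain placeholders as in the question), the filter option sits alongside the other read options:

// The filter is applied on the BigQuery side, so only the matching
// partition is scanned instead of the full table
val testDS = spark.read.format("bigquery")
  .option("table", "<TABLE>")
  .option("project", "<PROJECT_ID>")
  .option("parentProject", "<PROJECT_ID>")
  .option("filter", "_PARTITIONTIME = '2020-11-23 13:00:00'")
  .load()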

Optimized hive data aggregation using hive

I have a Hive table (80 million records) with the following schema (event_id, country, unit_id, date), and I need to export this data to a text file with the following requirements:
1- Rows are aggregated (combined) by event_id.
2- Aggregated rows must be sorted by date.
For example, rows with the same event_id must be combined as a list of lists, ordered by date.
What is the best solution, performance-wise, to do this job using Spark?
Note: This is expected to be a batch job.
Performance-wise, I think the best solution is to write a Spark program (Scala or Python) to read in the underlying files of the Hive table, do your transformations, and then write the output as a file.
I've found that it's much quicker to just read the files in spark rather than querying hive through spark and pulling the result into a dataframe.
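A sketch of that approach in Scala (the input path and output format are assumptions; adjust them to match how the Hive table is actually stored):

import org.apache.spark.sql.functions._

// Read the files backing the Hive table directly (hypothetical Parquet path);
// spark.read.table("events") would also work if going through the metastore.
val events = spark.read.parquet("/warehouse/events")

// Group by event_id and collect the remaining columns as a list of structs.
// sort_array orders the structs by their first field, so putting date first
// gives each per-event list sorted by date.
val aggregated = events
  .groupBy("event_id")
  .agg(sort_array(collect_list(struct(col("date"), col("country"), col("unit_id")))).as("rows"))

// Write the result out; JSON keeps the nested list structure in a plain text file.
aggregated.write.json("/output/events_by_event_id")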
