How to create a custom writer for a Spark DataFrame? - apache-spark

How can I create a custom write format for a Spark DataFrame so I can use it like df.write.format("com.mycompany.mydb").save()? I've tried reading through the DataStax Cassandra connector code but still couldn't figure it out.

Spark 3.0 completely changes the API. Some new interfaces, e.g. TableProvider and SupportsWrite, have been added.
You might find this guide helpful.

For Spark 2.3+, use Spark's DataSourceV2 API.
If you are using a Spark version earlier than 2.3, you can use the Data Source API V1.
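Building on the Spark 3.x interfaces mentioned above (TableProvider, SupportsWrite), here is a rough, heavily trimmed sketch of what such a sink can look like. All package and class names (com.mycompany.mydb.DefaultSource, MyDbTable, and so on) are placeholders, and the actual database I/O is left as comments; treat it as a starting point, not a complete implementation.

package com.mycompany.mydb

// df.write.format("com.mycompany.mydb") resolves to <that package>.DefaultSource
import java.util

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.catalog.{SupportsWrite, Table, TableCapability, TableProvider}
import org.apache.spark.sql.connector.expressions.Transform
import org.apache.spark.sql.connector.write._
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.util.CaseInsensitiveStringMap

class DefaultSource extends TableProvider {
  // A pure sink has no schema of its own; the data schema arrives with the write.
  override def inferSchema(options: CaseInsensitiveStringMap): StructType = new StructType()

  override def getTable(schema: StructType,
                        partitioning: Array[Transform],
                        properties: util.Map[String, String]): Table = new MyDbTable(schema)
}

class MyDbTable(tableSchema: StructType) extends SupportsWrite {
  override def name(): String = "mydb"
  override def schema(): StructType = tableSchema
  // BATCH_WRITE enables df.write...save(); ACCEPT_ANY_SCHEMA lets the sink accept
  // whatever columns the query produces instead of enforcing a fixed table schema.
  override def capabilities(): util.Set[TableCapability] =
    util.EnumSet.of(TableCapability.BATCH_WRITE, TableCapability.ACCEPT_ANY_SCHEMA)

  override def newWriteBuilder(info: LogicalWriteInfo): WriteBuilder = new WriteBuilder {
    // info.schema() is the schema of the DataFrame being written.
    override def buildForBatch(): BatchWrite = new MyDbBatchWrite(info.schema())
  }
}

class MyDbBatchWrite(schema: StructType) extends BatchWrite {
  override def createBatchWriterFactory(info: PhysicalWriteInfo): DataWriterFactory =
    new MyDbWriterFactory(schema)
  override def commit(messages: Array[WriterCommitMessage]): Unit = { /* finalize the job */ }
  override def abort(messages: Array[WriterCommitMessage]): Unit = { /* clean up partial writes */ }
}

class MyDbWriterFactory(schema: StructType) extends DataWriterFactory {
  // Runs on the executors, one writer per partition/task.
  override def createWriter(partitionId: Int, taskId: Long): DataWriter[InternalRow] =
    new MyDbDataWriter(schema)
}

class MyDbDataWriter(schema: StructType) extends DataWriter[InternalRow] {
  override def write(record: InternalRow): Unit = {
    // Convert the InternalRow fields using the schema and send them to your database,
    // e.g. record.getLong(0), record.getUTF8String(1).toString, ...
  }
  override def commit(): WriterCommitMessage = new WriterCommitMessage {}
  override def abort(): Unit = { /* roll back this task's writes */ }
  override def close(): Unit = { /* release connections */ }
}

Note that with a plain TableProvider sink, Spark 3.x expects an explicit save mode such as df.write.format("com.mycompany.mydb").mode("append").save(); supporting overwrite additionally requires implementing SupportsTruncate or SupportsOverwrite on the WriteBuilder.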

Related

Unsupported encoding: DELTA_BYTE_ARRAY when reading from Kusto using Kusto Spark connector or using Kusto export with Spark version < 3.3.0

Since last week we have been getting java.lang.UnsupportedOperationException: Unsupported encoding: DELTA_BYTE_ARRAY while reading from Kusto using the Kusto Spark connector in 'Distributed' mode (the same thing happens when using the export command and reading the result with a Parquet reader). How can we resolve this issue? Is it caused by a change in the Kusto service or in Spark?
We tried setting the configs spark.sql.parquet.enableVectorizedReader=false and parquet.split.files=false. This works, but we are worried about the consequences of this approach.
The change in behavior is due to Kusto rolling out a new Parquet writer implementation that uses new encoding schemes, one of which is delta byte array for strings and other byte-array-based Parquet types. This encoding scheme has been part of the Parquet format for a few years now, and modern readers (i.e. Spark 3.3.0 and above) are expected to support it. It brings performance and cost improvements, so we highly advise customers to move to Spark 3.3.0 or above. The Kusto Spark connector uses Kusto export for reading, and therefore produces Parquet files written with the new writer.
Possible solutions in case this is not an option:
Use Kusto Spark connector version 3.1.10, which checks the Spark version and disables the new writer in the export command if the Spark version is lower than 3.3.0.
Disable the Spark configs:
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
spark.conf.set("parquet.split.files", "false")
If none of the above solves the issue, you may open a support ticket to ADX to disable the feature (this is a temporary solution).
Note: Synapse workspaces will receive the updated connector version in the following days.

How to solve the following issue in Spark 3.0? Can not create the managed table. The associated location already exists.;

In my Spark job, I tried to overwrite a table in each micro-batch of a Structured Streaming query:
batchDF.write.mode(SaveMode.Overwrite).saveAsTable("mytable")
It generated the following error.
Can not create the managed table('`mytable`'). The associated location('file:/home/ec2-user/environment/spark/spark-local/spark-warehouse/mytable') already exists.;
I know that in Spark 2.x, the way to solve this issue is to add the following option:
spark.conf.set("spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation","true")
It works well in Spark 2.x. However, this option was removed in Spark 3.0.0, so how should we solve this issue in Spark 3.0.0?
Thanks!
It looks like you run your test data generation and your actual test in the same process - can you just replace these with createOrReplaceTempView to save them to Spark's in-memory catalog instead of into a Hive catalog?
Something like: batchDF.createOrReplaceTempView("mytable")
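A minimal sketch of that inside a foreachBatch sink (streamingDF and the downstream query are placeholders):

import org.apache.spark.sql.DataFrame

streamingDF.writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    // Register the micro-batch in the in-memory catalog instead of overwriting a managed table.
    batchDF.createOrReplaceTempView("mytable")
    // ...run whatever needs "mytable" here, for example:
    batchDF.sparkSession.sql("SELECT COUNT(*) FROM mytable").show()
  }
  .start()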

Which is the best HBase connector to use for batch loading data into HBase from Spark?

As also mentioned in
Which HBase connector for Spark 2.0 should I use?
there are mainly two options:
RDD based https://github.com/apache/hbase/tree/master/hbase-spark
DataFrame based https://github.com/hortonworks-spark/shc
I do understand the optimizations and the differences with regard to READING from HBase.
However, it's not clear to me which one I should use for BATCH inserting into HBase.
I am not interested in writing records one by one, but in high throughput.
After digging through the code, it seems that both resort to TableOutputFormat,
http://hbase.apache.org/1.2/book.html#arch.bulk.load
The project uses Scala 2.11, Spark 2 and HBase 1.2.
Does the DataFrame library provide any performance improvements over the RDD lib specifically for BULK LOAD?
The hbase-spark connector has recently been released to a new Maven Central repository with version 1.0.0; it supports Spark 2.4.0 and Scala 2.11.12:
<dependency>
<groupId>org.apache.hbase.connectors.spark</groupId>
<artifactId>hbase-spark</artifactId>
<version>1.0.0</version>
</dependency>
It supports both RDDs and DataFrames. Please refer to spark-hbase-connectors for more details.
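As a rough sketch, a DataFrame-based write with this connector could look like the following. The table name "person", column family "c" and the column mapping are placeholders, and the option names follow the connector's DataFrame examples, so double-check them against the docs for your version.

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.spark.HBaseContext
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("hbase-df-write").getOrCreate()

// The connector picks up its HBase configuration from an HBaseContext created up front.
new HBaseContext(spark.sparkContext, HBaseConfiguration.create())

val df = spark.createDataFrame(Seq(("row1", "alice"), ("row2", "bob"))).toDF("key", "name")

df.write
  .format("org.apache.hadoop.hbase.spark")
  .option("hbase.columns.mapping", "key STRING :key, name STRING c:name")
  .option("hbase.table", "person")
  .save()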
Happy Learning !!
Have you looked at the bulk load examples in the HBase project?
See HBase Bulk Examples; the GitHub page has Java examples, and you can easily write the Scala equivalent.
Also read Apache Spark Comes to Apache HBase with HBase-Spark Module.
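The HBaseContext bulkPut pattern described there looks roughly like this (the table name "t1", column family "cf", and the sample data are placeholders):

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.spark.HBaseContext
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("hbase-bulk-put"))
val hbaseContext = new HBaseContext(sc, HBaseConfiguration.create())

// Each record is (row key, value); puts are buffered and flushed per partition.
val rdd = sc.parallelize(Seq(("row1", "value1"), ("row2", "value2")))

hbaseContext.bulkPut[(String, String)](
  rdd,
  TableName.valueOf("t1"),
  record => new Put(Bytes.toBytes(record._1))
    .addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col1"), Bytes.toBytes(record._2))
)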
Given the choice between RDD and DataFrame, we should use DataFrame, as per the recommendation in the official documentation:
A DataFrame is a Dataset organized into named columns. It is
conceptually equivalent to a table in a relational database or a data
frame in R/Python, but with richer optimizations under the hood.
Hoping this helps.
Cheers !

Spark 1.6 a dataframe insert to Cassandra

I am trying to insert a DataFrame into Cassandra.
When I write
rdd.saveToCassandra("keyspace", "table")
there is no problem, but I can't write with this function:
myDataFrame.saveToCassandra("keyspace", "table")
I also tried the following, but it didn't save:
myDataFrame.write.format("org.apache.spark.sql.cassandra").mode('append').options(table="mytable", keyspace="mykeyspace").save()
Do you have any idea, other than moving to the new API in Spark 2.0?
Thanks
For Python there is currently no streaming Sink for Cassandra in the Spark Cassandra Connector; you will have to implement your own.
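For a batch (non-streaming) DataFrame write from Scala, the connector's documented pattern is roughly the following (keyspace and table names are placeholders, and the target table must already exist):

import org.apache.spark.sql.SaveMode

myDataFrame.write
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "mykeyspace", "table" -> "mytable"))
  .mode(SaveMode.Append)
  .save()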

Spark saving to Cassandra with TTL

I am using Spark-Cassandra connector 1.1.0 with Cassandra 2.0.12.
I write RDDs to Cassandra via the saveToCassandra() Java API method.
Is there a way to set the TTL property of the persisted records with the connector?
Thanks,
Shai
Unfortunately, it doesn't seem like there is a way to do this (that I know of) with version 1.1.0 of the connector. There is a way in 1.2.0-alpha3, however.
saveToCassandra() is a wrapper over WriterBuilder which has a withTTL method. Instead of using saveToCassandra you could use writerBuilder(keyspace,table,rowWriter).withTTL(seconds).saveToCassandra().
Yes, we can.
Just set the Spark config key "spark.cassandra.output.ttl" while creating the SparkConf object.
Note: the value should be in seconds.
