update query in Spark SQL - apache-spark

I wonder whether I can use an UPDATE query in Spark SQL, like this:
sqlContext.sql("update users set name = '*' where name is null")
I got the error:
org.apache.spark.sql.AnalysisException:
Unsupported language features in query:update users set name = '*' where name is null
Does Spark SQL not support UPDATE queries, or am I writing the code incorrectly?

Spark SQL doesn't support UPDATE statements yet.
Hive has supported UPDATE since version 0.14, but even with Hive, updates and deletes are only allowed on tables that support transactions, as mentioned in the Hive documentation.
See the answers in the Databricks forums confirming that UPDATE/DELETE are not supported in Spark SQL because it doesn't support transactions. If you think about it, supporting random updates is very complex with most big-data storage formats: it requires scanning huge files, updating specific records, and rewriting potentially terabytes of data. That is not normal SQL.
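The usual workaround is to read the data, apply the change as a transformation, and rewrite the output in full. A minimal sketch, assuming the users table is backed by Parquet files at a hypothetical path:
import org.apache.spark.sql.functions._
// Hypothetical input path; adjust to wherever the users data actually lives.
val users = sqlContext.read.parquet("/data/users")
// Equivalent of: UPDATE users SET name = '*' WHERE name IS NULL
val patched = users.withColumn("name", coalesce(col("name"), lit("*")))
// Rewriting the whole dataset is exactly the cost described above.
patched.write.mode("overwrite").parquet("/data/users_patched")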

Now it's possible, with Databricks Delta Lake

Spark SQL now supports UPDATE, DELETE and similar data-modification operations when the underlying table is stored in the Delta format.
Check this out:
https://docs.delta.io/0.4.0/delta-update.html#update-a-table
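For reference, a minimal sketch of the update through the Delta Lake Scala API described in those docs; the table path and column name are hypothetical:
import io.delta.tables._
import org.apache.spark.sql.functions._
// Hypothetical Delta table location
val deltaTable = DeltaTable.forPath(spark, "/data/users")
// Equivalent of: UPDATE users SET name = '*' WHERE name IS NULL
deltaTable.update(
  expr("name is null"),
  Map("name" -> lit("*")))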

Related

Getting duplicate records while querying Hudi table using Hive on Spark Engine in EMR 6.3.1

I am querying a Hudi table using Hive running on the Spark engine in an EMR 6.3.1 cluster.
The Hudi version is 0.7.
I inserted a few records and then updated them using Hudi Merge on Read. This internally creates new files under the same partition with the updated data/records.
Now, when I query the same table using Spark SQL, it works fine and does not return any duplicates; it only honours the latest records/parquet files. It also works fine when I use Tez as the underlying engine for Hive.
But when I run the same query at the Hive prompt with Spark as the underlying execution engine, it returns all the records and does not filter out the previous parquet files.
I have tried setting the property spark.sql.hive.convertMetastoreParquet=false, but it still did not work.
Please help.
This is a known issue in Hudi.
Still, using the property below, I am able to remove the duplicates in RO (read-optimised) Hudi tables. The issue persists in RT (real-time) tables.
set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat

Performance of spark while reading from hive vs parquet

Assuming I have an external Hive table on top of Parquet/ORC files partitioned on date, what would be the performance impact of using
spark.read.parquet("s3a://....").filter("date_col='2021-06-20'")
vs.
spark.sql("select * from table").filter("date_col='2021-06-20'")
After reading into a DataFrame, it will be followed by a series of transformations and aggregations.
spark version : 2.3.0 or 3.0.2
hive version : 1.2.1000
number of records per day : 300-700 Mn
My hunch is that there won't be any performance difference between the two queries, since Parquet natively provides most of the optimizations a Hive metastore can offer and Spark is able to use them: predicate push-down, the advantages of columnar storage, etc.
As a follow-up question, what happens if:
The underlying data is CSV instead of Parquet. Does having a Hive table on top improve performance?
The Hive table is bucketed. Does it make sense to read the underlying file system in this case instead of reading from the table?
Also, are there any situations where reading directly from Parquet is a better option compared to Hive?
Hive should actually be faster here: both have push-downs, but Hive already has the schema stored, whereas the Parquet read as you have it will need to infer the merged schema. You can make them about the same by providing the schema.
You can make the Parquet version even faster by navigating directly to the partition. This avoids having to do the initial filter on the available partitions.
So something like this would do it:
spark.read.option("basePath", "s3a://....").parquet("s3a://..../date_col=2021-06-20")
Note this works best if you already have a schema, because this also skips schema merging.
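For instance, a minimal sketch of supplying the schema up front (column names are hypothetical), which skips schema inference and merging on the Parquet read:
import org.apache.spark.sql.types._
// Hypothetical schema; date_col is the partition column recovered from the path.
val schema = StructType(Seq(
  StructField("id", LongType),
  StructField("value", DoubleType),
  StructField("date_col", StringType)))
val df = spark.read
  .schema(schema)
  .option("basePath", "s3a://....")
  .parquet("s3a://..../date_col=2021-06-20")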
As to your follow-ups:
It would make a huge difference if the data is CSV, as Spark would then have to parse all of the data before it can discard the unneeded columns and rows. CSV is really bad for large datasets.
Reading the file system directly for a bucketed table shouldn't really gain you all that much and may get you into trouble. The metadata that Hive stores allows Spark to navigate your data more efficiently than you could by doing it yourself.

Writing Data into Hive Transactional table

I am trying to write data into a Hive transactional table using Spark. The following is the sample code I used to insert data:
dataSet.write().format("orc")
.partitionBy("column1")
.bucketBy(2,"column2")
.insertInto("table");
but unfortunately I get the following error while running the application:
org.apache.spark.sql.AnalysisException: 'insertInto' does not support
bucketBy right now;
The Spark and Hive versions I am using are 2.4 and 3.1. I have googled a lot but didn't find any solution. I am pretty new to Hive; any help would be appreciated.
https://issues.apache.org/jira/browse/SPARK-15348 clearly states that Spark does not currently support Hive ORC ACID processing. A pity, but not possible.
You need to write Hive scripts with Tez or MR as the underlying engine for Hive.

Importing blob data from RDBMS (Sybase) to Cassandra

I am trying to import large blob data (around 10 TB) from an RDBMS (Sybase ASE) into Cassandra, using DataStax Enterprise (DSE) 5.0.
Is Sqoop still the recommended way to do this in DSE 5.0? As per the release notes (http://docs.datastax.com/en/latest-dse/datastax_enterprise/RNdse.html):
Hadoop and Sqoop are deprecated. Use Spark instead. (DSP-7848)
So should I use Spark SQL with JDBC data source to load data from Sybase, and then save the data frame to a Cassandra table?
Is there a better way to do this? Any help/suggestions will be appreciated.
Edit: As per the DSE documentation (http://docs.datastax.com/en/latest-dse/datastax_enterprise/spark/sparkIntro.html), writing to blob columns from Spark is not supported.
The following Spark features and APIs are not supported:
Writing to blob columns from Spark
Reading columns of all types is supported; however, you must convert collections of blobs to byte arrays before serialising.
Spark is preferred for ETL of large data sets because it performs a distributed ingest. The RDBMS data can be loaded into Spark RDDs or DataFrames and then written out with saveToCassandra(keyspace, tablename). Cassandra Summit 2016 had a presentation, Using Spark to Load Oracle Data into Cassandra by Jim Hatcher, which discusses this topic in depth and provides examples.
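A rough sketch of that route for the Sybase case, assuming the Sybase jConnect JDBC driver and the spark-cassandra-connector are on the classpath; the URL, table names, and partitioning column are all hypothetical (sqlContext is the DataFrame entry point in DSE 5.0's Spark 1.6):
// Read from Sybase over JDBC, splitting the pull across executors.
val source = sqlContext.read
  .format("jdbc")
  .option("url", "jdbc:sybase:Tds:sybase-host:5000/mydb")   // hypothetical host/db
  .option("dbtable", "blob_source_table")                   // hypothetical table
  .option("user", "etl_user")
  .option("password", "secret")
  .option("partitionColumn", "id")                          // numeric key to split on
  .option("lowerBound", "1")
  .option("upperBound", "1000000000")
  .option("numPartitions", "200")
  .load()
// Write to Cassandra through the connector (requires spark.cassandra.connection.host
// in the Spark config). Note the caveat quoted above: writing to blob columns from
// Spark is not supported, so the blob payload itself may need special handling.
source.write
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "my_ks", "table" -> "blob_target"))
  .mode("append")
  .save()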
Sqoop is deprecated but should still work in DSE 5.0. If it's a one-time load and you're already comfortable with Sqoop, try that.

Does Spark support insert overwrite static partitions?

I noticed in the current Spark SQL manual that inserting into a dynamic partition is not supported:
Major Hive Features
Spark SQL does not currently support inserting to tables using dynamic partitioning.
However, is insert/overwriting into static partitions supported?
As of Spark 1.1, Spark SQL does not support inserting into tables using dynamic partitioning.
Static partitions are supported; the data just needs to be written into the Hive table's location.
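For example, a static-partition insert overwrite, where the partition value is spelled out in the statement itself, works on those early versions. A minimal sketch with a hypothetical table and partition column, run through a HiveContext instance:
// Static partition: the partition value is fixed in the statement.
hiveContext.sql("""
  INSERT OVERWRITE TABLE events PARTITION (dt = '2014-10-01')
  SELECT id, name FROM staging_events
""")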
According to the release notes, Spark 1.2.0 supports dynamically partitioned inserts. Refer to SPARK-3007.
