Does Spark support insert overwrite static partitions? - apache-spark

I noticed in the current Spark Sql manual that inserting into a dynamic partition is not supported:
Major Hive Features
Spark SQL does not currently support inserting to tables using dynamic partitioning.
However, is insert/overwriting into static partitions supported?

Spark SQL does not currently support inserting to tables using dynamic partitioning as of version spark 1.1
Static is supported, you need to write data in hive table location.

According to the release notes, Spark 1.2.0 supports dynamically partitioned inserts. Refer to SPARK-3007.

Related

Hive beeline and spark load count doesn't match for hive tables

I am using spark 2.4.4 and hive 2.3 ...
Using spark, I am loading a dataframe as Hive table using DF.insertInto(hiveTable)
if new table is created during run (of course before insertInto thru spark.sql) or existing tables created by spark 2.4.4 - everything works fine.
Issue is, if I am attempting to load some existing tables (older tables created spark 2.2 or before) - facing issues with COUNT of records. Diff count when count of target hive table is done thru beeline vs spark sql.
Please assist.
There seems to be an issue with sync of hive-Metastore and spark-catalog for hive tables (with parquet file format) created o2.n spark 2 (or before - with comple /nested data tydata type) and loaded using spark v2.4.
Usual case, spark.catalog.refresh(<hive-table-name>) will refresh the stats from hiveMetastore to spark.catalog.
In this case, an explicit spark.catalog.resfreshByPath(<location-maprfs-path>) need to bed executed to refresh the stats.pet*

how to register an existing delta table to hive

We are using spark for reading/writing data in delta format stored in HDFS (Databricks Delta table version 0.5.0).
We would like to utilize the power of Hive to interact with the delta tables.
How can we register an existing data in delta format from a path on HDFS to Hive?
Please note that currently we are running spark (2.4.0) on cloudera platform (CDH 6.3.3)
The only way I can do this so far is by registering it as an unmanaged table. The most significant difference, as far as I can tell, is that if you drop an unmanaged table, it does not drop the underlying data.

Databricks Delta and Hive Transactional Table

I've seen from two sources that right now you cannot interact in any meaningful way with HIVE Transactional Tables from Spark.
Hive ACID
Hive Transactional Tables are not readable by spark
I see Databricks has released a Transactional feature called Databricks Delta. Is it possible to now read HIVE Transactional Tables using this feature?
Nope. Not the Hive Transactional tables. You create a new type of table called Databricks Delta Table(Spark table of parquets) and leverage the Hive metastore to read/write to these tables.
Its a kind of External table but its more like data to schema. More of Spark and Parquet.
The solution for your problem might be to read the hive files and Impose the schema accordingly in a Databricks notebook and then save it as a databricks delta table.
like this : df.write.mode('overwrite').format('delta').save(/mnt/out/put/path)
You would still need to write a DDL pointing to that location.Just FYI DELTA table is Transactional.
I don't see the point on stressing on just Spark for accessing Hive ACID.
Actually Spark relies on a host language, Python and Scala being the most popular choices.
You could use Hive ACID from Python with no issues, this is a very well proven integration.
Your data can reside on Spark dataframes or RDDs, but as long as you can transfer it to standard Python data structures, you can interoperate with Hive ACID directly from these.

Importing blob data from RDBMS (Sybase) to Cassandra

I am trying to import large blob data ( around 10 TB ) from an RDBMS (Sybase ASE) into Cassandra, using DataStax Enterprise(DSE) 5.0 .
Is sqoop still the recommended way to do this in DSE 5.0? As per the release notes(http://docs.datastax.com/en/latest-dse/datastax_enterprise/RNdse.html) :
Hadoop and Sqoop are deprecated. Use Spark instead. (DSP-7848)
So should I use Spark SQL with JDBC data source to load data from Sybase, and then save the data frame to a Cassandra table?
Is there a better way to do this? Any help/suggestions will be appreciated.
Edit: As per DSE documentation (http://docs.datastax.com/en/latest-dse/datastax_enterprise/spark/sparkIntro.html), writing to blob columns from spark is not supported.
The following Spark features and APIs are not supported:
Writing to blob columns from Spark
Reading columns of all types is supported; however, you must convert collections of blobs to byte arrays before serialising.
Spark for the ETL of large data sets is preferred because it performs a distributed injest. Oracle data can be loaded into Spark RDDs or data frames and then just use saveToCassandra(keyspace, tablename). Cassandra Summit 2016 had a presentation Using Spark to Load Oracle Data into Cassandra by Jim Hatcher which discusses this topic in depth and provides examples.
Sqoop is deprecated but should still work in DSE 5.0. If its a one-time load and you're already confortable with Squoop, try that.

update query in Spark SQL

I wonder can I use the update query in sparksql just like:
sqlContext.sql("update users set name = '*' where name is null")
I got the error:
org.apache.spark.sql.AnalysisException:
Unsupported language features in query:update users set name = '*' where name is null
If the sparksql does not support the update query or am i writing the code incorrectly?
Spark SQL doesn't support UPDATE statements yet.
Hive has started supporting UPDATE since hive version 0.14. But even with Hive, it supports updates/deletes only on those tables that support transactions, it is mentioned in the hive documentation.
See the answers in databricks forums confirming that UPDATES/DELETES are not supported in Spark SQL as it doesn't support transactions. If we think, supporting random updates is very complex with most of the storage formats in big data. It requires scanning huge files, updating specific records and rewriting potentially TBs of data. It is not normal SQL.
Now it's possible, with Databricks Delta Lake
Spark SQL now supports update, delete and such data modification operations if the underlying table is in delta format.
Check this out:
https://docs.delta.io/0.4.0/delta-update.html#update-a-table

Resources