What are the differences between saveAsTable and insertInto in different SaveMode(s)? - apache-spark

I'm trying to write a DataFrame into a Hive table (on S3) in Overwrite mode (necessary for my application) and need to decide between two methods of DataFrameWriter (Spark / Scala). From what I can read in the documentation, df.write.saveAsTable differs from df.write.insertInto in the following respects:
saveAsTable uses column-name based resolution while insertInto uses position-based resolution
In Append mode, saveAsTable pays more attention to the underlying schema of the existing table when resolving columns
Overall, it gives me the impression that saveAsTable is just a smarter version of insertInto. Alternatively, depending on the use case, one might prefer insertInto.
But does each of these methods come with caveats of its own, such as a performance penalty in the case of saveAsTable (since it packs in more features)? Are there any other differences in their behaviour apart from what is stated (not very clearly) in the docs?
EDIT-1
Documentation says this regarding insertInto
Inserts the content of the DataFrame to the specified table
and this for saveAsTable
In the case the table already exists, behavior of this function
depends on the save mode, specified by the mode function
Now I can list my doubts
Does insertInto always expect the table to exist?
Do SaveModes have any impact on insertInto?
If the answer above is yes, then
what's the difference between saveAsTable with SaveMode.Append and insertInto given that the table already exists?
does insertInto with SaveMode.Overwrite make any sense?

DISCLAIMER I've been exploring insertInto for some time and although I'm far from an expert in this area I'm sharing the findings for greater good.
Does insertInto always expect the table to exist?
Yes (per the table name and the database).
Moreover, not all tables can be inserted into: a (permanent) table, a temporary view or a global temporary view are fine, but not:
a bucketed table
an RDD-based table
Do SaveModes have any impact on insertInto?
(That's recently been my question, too!)
Yes, but only SaveMode.Overwrite. When you think about it, the other 3 save modes don't make much sense for insertInto (as it simply inserts a dataset).
what's the difference between saveAsTable with SaveMode.Append and insertInto given that the table already exists?
That's a very good question! I'd say none, but let's see by just one example (hoping that proves something).
scala> spark.version
res13: String = 2.4.0-SNAPSHOT
scala> sql("create table my_table (id long)")
scala> spark.range(3).write.mode("append").saveAsTable("my_table")
org.apache.spark.sql.AnalysisException: The format of the existing table default.my_table is `HiveFileFormat`. It doesn't match the specified format `ParquetFileFormat`.;
at org.apache.spark.sql.execution.datasources.PreprocessTableCreation$$anonfun$apply$2.applyOrElse(rules.scala:117)
at org.apache.spark.sql.execution.datasources.PreprocessTableCreation$$anonfun$apply$2.applyOrElse(rules.scala:76)
...
scala> spark.range(3).write.insertInto("my_table")
scala> spark.table("my_table").show
+---+
| id|
+---+
|  2|
|  0|
|  1|
+---+
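One difference the example above does not exercise is the column resolution mentioned in the question. The following is a minimal sketch of it, assuming a local SparkSession and a hypothetical Parquet table named resolution_demo (neither comes from the original post). It follows the pattern of the example in the DataFrameWriter scaladoc.

```scala
import org.apache.spark.sql.SparkSession

object ResolutionDemo {
  // Returns the table rows as (i, j) pairs after one insertInto
  // and one saveAsTable append of the same DataFrame.
  def run(spark: SparkSession): Set[(Int, Int)] = {
    import spark.implicits._

    // Hypothetical table name, created as a data source (Parquet) table.
    spark.sql("DROP TABLE IF EXISTS resolution_demo")
    spark.sql("CREATE TABLE resolution_demo (i INT, j INT) USING parquet")

    // The DataFrame columns are deliberately in the opposite order: (j, i).
    val swapped = Seq((3, 4)).toDF("j", "i")

    // insertInto ignores names: 3 lands in the first column (i), 4 in the second (j).
    swapped.write.insertInto("resolution_demo")

    // saveAsTable in append mode matches columns by name: i = 4, j = 3.
    swapped.write.mode("append").saveAsTable("resolution_demo")

    spark.table("resolution_demo")
      .collect()
      .map(row => (row.getInt(0), row.getInt(1)))
      .toSet
  }
}
```

So for an existing table the two calls can store the same DataFrame differently whenever its column order differs from the table's.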
does insertInto with SaveMode.Overwrite make any sense?
I think so, given that insertInto pays so much attention to SaveMode.Overwrite. It replaces the existing contents of the target table while keeping the table definition itself.
scala> spark.range(3).write.mode("overwrite").insertInto("my_table")
scala> spark.table("my_table").show
+---+
| id|
+---+
|  1|
|  0|
|  2|
+---+
scala> Seq(100, 200, 300).toDF.write.mode("overwrite").insertInto("my_table")
scala> spark.table("my_table").show
+---+
| id|
+---+
|200|
|100|
|300|
+---+

I want to point out a major difference between saveAsTable and insertInto in Spark.
For a partitioned table, the Overwrite SaveMode works differently for saveAsTable and insertInto.
Consider the example below, where I create a partitioned table and write to it using the saveAsTable method.
hive> CREATE TABLE `db.companies_table`(`company` string) PARTITIONED BY ( `id` date);
OK
Time taken: 0.094 seconds
scala> import org.apache.spark.sql._
scala> import spark.implicits._
scala> val targetTable = "db.companies_table"
scala> val companiesDF = Seq(("2020-01-01", "Company1"), ("2020-01-02", "Company2")).toDF("id", "company")
scala> companiesDF.write.mode(SaveMode.Overwrite).partitionBy("id").saveAsTable(targetTable)
scala> spark.sql("select * from db.companies_table").show()
+--------+----------+
| company|        id|
+--------+----------+
|Company1|2020-01-01|
|Company2|2020-01-02|
+--------+----------+
Now I am adding 2 new rows with 2 new partition values.
scala> val companiesDF = Seq(("2020-01-03", "Company1"), ("2020-01-04", "Company2")).toDF("id", "company")
scala> companiesDF.write.mode(SaveMode.Append).partitionBy("id").saveAsTable(targetTable)
scala> spark.sql("select * from db.companies_table").show()
+--------+----------+
| company|        id|
+--------+----------+
|Company1|2020-01-01|
|Company2|2020-01-02|
|Company1|2020-01-03|
|Company2|2020-01-04|
+--------+----------+
As you can see, 2 new rows are added to the table.
Now let's say I want to overwrite the data in partition 2020-01-02.
scala> val companiesDF = Seq(("2020-01-02", "Company5")).toDF("id", "company")
scala> companiesDF.write.mode(SaveMode.Overwrite).partitionBy("id").saveAsTable(targetTable)
By our logic only partition 2020-01-02 should be overwritten, but saveAsTable behaves differently: it overwrites the entire table, as you can see below.
scala> spark.sql("select * from db.companies_table").show()
+--------+----------+
| company|        id|
+--------+----------+
|Company5|2020-01-02|
+--------+----------+
So overwriting only certain partitions of a table is not possible with saveAsTable.
Refer to this link for more details:
https://towardsdatascience.com/understanding-the-spark-insertinto-function-1870175c3ee9
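For comparison, here is a minimal sketch of overwriting a single partition with insertInto. It rests on several assumptions that are not in the original answer: a local SparkSession, a hypothetical data source (Parquet) table named companies_demo rather than a Hive-serde table, and Spark 2.3 or later, where the spark.sql.sources.partitionOverwriteMode setting is available. For Hive-serde tables the behaviour depends on the Hive dynamic-partition settings instead.

```scala
import org.apache.spark.sql.SparkSession

object PartitionOverwriteDemo {
  // Returns the (company, id) pairs left in the table after overwriting one partition.
  def run(spark: SparkSession): Set[(String, String)] = {
    import spark.implicits._

    // Assumption: Spark 2.3+. Only partitions present in the incoming data are replaced.
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

    // Hypothetical table name; the partition column is declared last.
    spark.sql("DROP TABLE IF EXISTS companies_demo")
    spark.sql(
      "CREATE TABLE companies_demo (company STRING, id STRING) " +
        "USING parquet PARTITIONED BY (id)")

    // insertInto is positional, so the DataFrame must end with the partition column.
    Seq(("Company1", "2020-01-01"), ("Company2", "2020-01-02"))
      .toDF("company", "id")
      .write.insertInto("companies_demo")

    // Overwrite with data for partition 2020-01-02 only.
    Seq(("Company5", "2020-01-02"))
      .toDF("company", "id")
      .write.mode("overwrite").insertInto("companies_demo")

    spark.table("companies_demo")
      .collect()
      .map(row => (row.getString(0), row.getString(1)))
      .toSet
  }
}
```

Note that insertInto is called without partitionBy; the partitioning comes from the table definition.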

I recently started converting my Hive Scripts to Spark and I am still learning.
There is one important behavior I noticed with saveAsTable and insertInto which has not been discussed.
df.write.mode("overwrite").saveAsTable("schema.table") drops the existing table "schema.table" and recreates a new table based on the 'df' schema. The schema of the existing table becomes irrelevant and does not have to match with df. I got bitten by this behavior since my existing table was ORC and the new table created was parquet (Spark Default).
df.write.mode("overwrite").insertInto("schema.table") does not drop the existing table and expects the schema of the existing table to match with the schema of 'df'.
I checked the Create Time for the table using both options and reaffirmed the behavior.
Original Table stored as ORC - Wed Sep 04 21:27:33 GMT 2019
After saveAsTable (storage changed to Parquet) - Wed Sep 04 21:56:23 GMT 2019 (Create Time changed)
Dropped and Recreated original table (ORC) - Wed Sep 04 21:57:38 GMT 2019
After insertInto (Still ORC) - Wed Sep 04 21:57:38 GMT 2019 (Create Time Not changed)

Another important point that I consider when inserting data into an EXISTING dynamically partitioned Hive table from Spark 2.x:
df.write.mode("append").insertInto("dbName.tableName")
The command above maps each row of "df" to its partition based on the partition column values and appends the data to the existing table, creating partitions that do not exist yet.
I hope this adds another point for deciding when to use "insertInto".

Here are the overall differences in a summary table.

Related

Partitioned ORC table shows up empty in Hive

I've written a Spark dataframe to partitioned ORC files like this:
df.repartition("LOADED")\
.write\
.partitionBy("LOADED")\
.format("orc")\
.save("location")
Everything is on the disk correctly.
After that, I wanted to create a Hive table from it, like:
CREATE TABLE table USING ORC LOCATION 'location'
The command runs without any errors. But if I try to query the table, it's empty.
I've tried to do the same without partitioning, and it works just fine. What am I doing wrong?
The partitioned folders look like: LOADED=2019-11-16
For reference: I want to write the data to Azure Blob Storage, and create a Hive table from it in a different cluster.
You just need to update the partition info on the table so Hive can list the partitions present. This is done through the MSCK REPAIR command:
spark.sql("MSCK REPAIR TABLE <tableName>")
A quick example:
scala> spark.sql("select * from table").show
20/03/28 17:12:46 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
+------+------+
|column|LOADED|
+------+------+
+------+------+
scala> spark.sql("MSCK REPAIR TABLE table")
scala> spark.sql("select * from table").show
+------+----------+
|column|    LOADED|
+------+----------+
|     a|2019-11-16|
|     c|2019-11-16|
|     b|2019-11-17|
+------+----------+
You are writing data directly to the location rather than through HiveQL statements, so the metadata of the Hive table needs to be updated from Hive or Spark using:
msck repair table <db_name>.<table_name>;
(or)
alter table <db_name>.<table_name> add partition(`LOADED`='<value>') location '<location_of the specific partition>';
Then run the below command to list out partitions from the table:
show partitions <db_name>.<table_name>;
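If you go the ALTER TABLE route and there are many partitions, the statements can be generated from the directory names. This is only a sketch under stated assumptions: the helper name addPartitionStatements is hypothetical, it handles a single partition level, and it reads a local directory with java.io.File (for Azure Blob Storage or HDFS the listing would have to go through the Hadoop FileSystem API instead).

```scala
import java.io.File

object PartitionStatements {
  // Builds one ALTER TABLE ... ADD PARTITION statement per `column=value`
  // directory found directly under `tableRoot` (single-level partitioning only).
  def addPartitionStatements(table: String, tableRoot: File): Seq[String] = {
    // listFiles() returns null when tableRoot is not a readable directory.
    val children = Option(tableRoot.listFiles()).getOrElse(Array.empty[File])
    children.toSeq
      .filter(dir => dir.isDirectory && dir.getName.contains("="))
      .sortBy(_.getName)
      .map { dir =>
        // Split "LOADED=2019-11-16" into column name and value.
        val Array(column, value) = dir.getName.split("=", 2)
        s"ALTER TABLE $table ADD IF NOT EXISTS PARTITION (`$column`='$value') " +
          s"LOCATION '${dir.getPath}'"
      }
  }
}
```

Each returned string can then be run through spark.sql or the Hive CLI.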

Spark partitionBy | save by column value rather than columnName={value}

I am using Scala and Spark; my Spark version is 2.4.3.
My dataframe looks like this; there are other columns which I have not included and which are not relevant.
+-----------+---------+---------+
|ts_utc_yyyy|ts_utc_MM|ts_utc_dd|
+-----------+---------+---------+
|2019       |01       |20       |
|2019       |01       |13       |
|2019       |01       |12       |
|2019       |01       |19       |
|2019       |01       |19       |
+-----------+---------+---------+
Basically I want to store the data in a directory layout like
2019/01/12/data
2019/01/13/data
2019/01/19/data
2019/01/20/data
I am using following code snippet
df.write
.partitionBy("ts_utc_yyyy","ts_utc_MM","ts_utc_dd")
.format("csv")
.save(outputPath)
But the problem is that the data gets stored with the column name in the path, like below.
ts_utc_yyyy=2019/ts_utc_MM=01/ts_utc_dd=12/data
ts_utc_yyyy=2019/ts_utc_MM=01/ts_utc_dd=13/data
ts_utc_yyyy=2019/ts_utc_MM=01/ts_utc_dd=19/data
ts_utc_yyyy=2019/ts_utc_MM=01/ts_utc_dd=20/data
How do I save without the column name in the folder name?
Thanks.
This is the expected behaviour. Spark uses Hive partitioning so it writes using this convention, which enables partition discovery, filtering and pruning. In short, it optimises your queries by ensuring that the minimum amount of data is read.
Spark isn't really designed for the output you need. The easiest way for you to solve this is to have a downstream task that will simply rename the directories by splitting on the equals sign.
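A downstream rename step of that kind could look like the sketch below. It assumes the output sits on a local file system (on S3 or HDFS the same idea would need the Hadoop FileSystem API), and the object and method names are hypothetical.

```scala
import java.io.{File, IOException}

object PartitionDirs {
  // Recursively renames every `column=value` directory under `dir` to just `value`.
  def stripColumnNames(dir: File): Unit = {
    // listFiles() returns null when dir is not a readable directory.
    val children = Option(dir.listFiles()).getOrElse(Array.empty[File])
    children.filter(_.isDirectory).foreach { child =>
      // Handle nested partition levels before renaming their parent.
      stripColumnNames(child)
      val name = child.getName
      val sep = name.indexOf('=')
      if (sep >= 0) {
        val target = new File(dir, name.substring(sep + 1))
        if (!child.renameTo(target)) {
          throw new IOException(s"Could not rename $child to $target")
        }
      }
    }
  }
}
```

Calling PartitionDirs.stripColumnNames(new File(outputPath)) after the write would turn ts_utc_yyyy=2019/ts_utc_MM=01/ts_utc_dd=12 into 2019/01/12. Keep in mind that partitionBy does not store the partition columns inside the data files, so after the rename their values survive only as directory names and Spark's partition discovery no longer applies.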

Spark SQL HAVING clause without Group/Aggregate

I am wondering how the HAVING clause works in Spark SQL without GROUP BY or any aggregate function.
1) Can we rely on HAVING without an aggregate function?
2) Is there any other way to filter the columns that are generated at that select level?
I have tried executing the Spark SQL below and it works fine, but can we rely on this?
spark.sql("""
select 1 as a having a=1
""").show()
spark.sql("""
select 1 as a having a=2
""").show()
+---+
|  a|
+---+
|  1|
+---+
+---+
|  a|
+---+
+---+
In some databases / engines, when the GROUP BY is not used in conjunction with HAVING, HAVING defaults to a WHERE clause.
Normally the WHERE clause is used.
I would not rely on HAVING without a GROUP BY.
To answer question 1), "Can we rely on HAVING without an aggregate function?":
No, you cannot rely on this behavior. Treating HAVING without GROUP BY as WHERE is not valid SQL, and it is no longer the default behavior as of Spark 2.4. If you really want to use HAVING like a WHERE clause, you can restore the old behavior by setting spark.sql.legacy.parser.havingWithoutGroupByAsWhere = true.
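As for question 2), one portable way to filter on a column produced at the select level is to wrap the projection in a subquery and use a regular WHERE. A minimal sketch, assuming a local SparkSession; the object name and the subquery alias t are made up for illustration:

```scala
import org.apache.spark.sql.SparkSession

object AliasFilterDemo {
  // Filters on a select-level alias by wrapping the projection in a subquery,
  // then returns how many rows survive the filter.
  def countWhere(spark: SparkSession, value: Int): Long =
    spark.sql(s"SELECT a FROM (SELECT 1 AS a) t WHERE a = $value").count()
}
```

With value 1 the single row is kept, and with value 2 the result is empty, which matches the output shown in the question without depending on how HAVING is interpreted.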

Describe on Dataframe is not displaying the complete resultset

I am using Spark 1.6. Calling describe on a data frame does not display the column headers and values. Please see below:
val data=sc.textFile("/tmp/sample.txt")
data.toDF.describe().show
This gives the below result:
Please let me know why it is not displaying the entire result set.
+-------+
|summary|
+-------+
|  count|
|   mean|
| stddev|
|    min|
|    max|
+-------+
I think you just need to use the show method.
sc.textFile("/tmp/sample.txt").toDF.show
As far as displaying the complete RDD, be careful with this as you will need to collect the results on the driver in order to do this. You may want to consider using take instead if the csv file is large.
val data = sc.textFile("/tmp/sample.txt").toDF
data.collect.foreach(println)
or
data.take(100).foreach(println)
This is because Spark 1.6 treated every field as a String by default, and it does not provide summary stats for String columns. In Spark 2.1, however, the columns were correctly inferred as their respective data types (Int/String/Double etc.), and the summary stats covered all the columns in the file rather than only the numerical fields.
I feel df.describe() works more elegantly in Spark 2.1 than in Spark 1.6.
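Whatever the version, casting the text column to a numeric type before calling describe makes the statistics appear. The sketch below is written against the Spark 2.x SparkSession API rather than the 1.6 SQLContext from the question, and it uses an in-memory Seq as a stand-in for the lines of /tmp/sample.txt; both are assumptions, as is the object name.

```scala
import org.apache.spark.sql.SparkSession

object DescribeDemo {
  // Returns the describe() output as a summary -> value map
  // for a single text column that has been cast to double.
  def summary(spark: SparkSession, lines: Seq[String]): Map[String, String] = {
    import spark.implicits._

    // Stand-in for sc.textFile(...).toDF: one string column named "value".
    val raw = lines.toDF("value")

    // Cast the text to a numeric type so describe has something to summarise.
    val typed = raw.select($"value".cast("double").as("value"))

    typed.describe("value")
      .collect()
      .map(row => row.getString(0) -> row.getString(1))
      .toMap
  }
}
```

For a file with several fields per line, the lines would first need to be split into columns (or read with a CSV reader) before casting.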

How does computing table stats in hive or impala speed up queries in Spark SQL?

For increasing performance (e.g. for joins) it is recommended to compute table statistics first.
In Hive I can do:
analyze table <table name> compute statistics;
In Impala:
compute stats <table name>;
Does my spark application (reading from hive-tables) also benefit from pre-computed statistics? If yes, which one do I need to run? Are they both saving the stats in the hive metastore? I'm using spark 1.6.1 on Cloudera 5.5.4
Note:
In the Docs of spark 1.6.1 (https://spark.apache.org/docs/1.6.1/sql-programming-guide.html) for the parameter spark.sql.autoBroadcastJoinThreshold I found a hint:
Note that currently statistics are only supported for Hive Metastore
tables where the command ANALYZE TABLE COMPUTE STATISTICS
noscan has been run.
This answer is based on the upcoming Spark 2.3.0 (perhaps some of the features have already been released in 2.2.1 or earlier).
Does my spark application (reading from hive-tables) also benefit from pre-computed statistics?
It could if Impala or Hive recorded the table statistics (e.g. table size or row count) in a Hive metastore in the table metadata that Spark can read from (and translate to its own Spark statistics for query planning).
You can easily check it out by using DESCRIBE EXTENDED SQL command in spark-shell.
scala> spark.version
res0: String = 2.4.0-SNAPSHOT
scala> sql("DESC EXTENDED t1 id").show
+--------------+----------+
|info_name     |info_value|
+--------------+----------+
|col_name      |id        |
|data_type     |int       |
|comment       |NULL      |
|min           |0         |
|max           |1         |
|num_nulls     |0         |
|distinct_count|2         |
|avg_col_len   |4         |
|max_col_len   |4         |
|histogram     |NULL      |
+--------------+----------+
ANALYZE TABLE COMPUTE STATISTICS noscan computes one statistic that Spark uses, i.e. the total size of a table (with no row count metric due to noscan option). If Impala and Hive recorded it to a "proper" location, Spark SQL would show it in DESC EXTENDED.
Use DESC EXTENDED tableName for table-level statistics and see if you find the ones that were generated by Impala or Hive. If they are in DESC EXTENDED's output they will be used for optimizing joins (and with cost-based optimization turned on also for aggregations and filters).
Column statistics are stored (in a Spark-specific serialized format) in table properties and I really doubt that Impala or Hive could compute the stats and store them in the Spark SQL-compatible format.
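If the Hive or Impala column statistics turn out not to be visible, one option is to let Spark compute them itself with its own ANALYZE TABLE command. A minimal sketch, assuming a recent Spark version (the column form of DESC EXTENDED shown above is not available in 1.6), a local SparkSession, and a hypothetical table named stats_demo:

```scala
import org.apache.spark.sql.SparkSession

object ColumnStatsDemo {
  // Computes column statistics with Spark itself and returns the
  // DESC EXTENDED output for the column as an info_name -> info_value map.
  def describeColumn(spark: SparkSession): Map[String, String] = {
    // Hypothetical table with a single bigint column `id` holding 0 and 1.
    spark.sql("DROP TABLE IF EXISTS stats_demo")
    spark.range(2).write.saveAsTable("stats_demo")

    // Let Spark gather the column-level statistics in its own format.
    spark.sql("ANALYZE TABLE stats_demo COMPUTE STATISTICS FOR COLUMNS id")

    spark.sql("DESC EXTENDED stats_demo id")
      .collect()
      .map(row => row.getString(0) -> row.getString(1))
      .toMap
  }
}
```

The returned map should then contain entries such as min, max and num_nulls for the column, in the same shape as the DESC EXTENDED output above.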
I am assuming you are using Hive on Spark, or Spark SQL with a Hive context. If that is the case, you should run analyze in Hive.
Analyze table <...> typically needs to run after the table is created or after significant inserts/changes. You can do this at the end of your load step itself, if this is an MR or Spark job.
At the time of analysis, if you are using Hive on Spark, please also use the configurations in the link below. You can set them at the session level for each query. I have used the parameters from this link https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started in production and they work fine.
From what I understand, compute stats on Impala is the more recent implementation and frees you from tuning Hive settings.
From official doc:
If you use the Hive-based methods of gathering statistics, see the
Hive wiki for information about the required configuration on the Hive
side. Cloudera recommends using the Impala COMPUTE STATS statement to
avoid potential configuration and scalability issues with the
statistics-gathering process.
If you run the Hive statement ANALYZE TABLE COMPUTE STATISTICS FOR
COLUMNS, Impala can only use the resulting column statistics if the
table is unpartitioned. Impala cannot use Hive-generated column
statistics for a partitioned table.
Useful link:
https://www.cloudera.com/documentation/enterprise/5-5-x/topics/impala_perf_stats.html
