Spark Scala insert overwrite into Hive is taking too long

I am trying to load a Spark dataframe into Hive like below:
df.repartition(col(col_nme)).write.mode("overwrite").format("ORC").option("compression","snappy").insertInto(hive_tbl)
The same df loads in 2 minutes with PySpark, but with Scala it takes 15 minutes.
Any suggestions or clues?

Related

Spark is ignoring bucketing setting for Hive table

I am working with a one-terabyte dataset on S3. The data is in Parquet files. After executing the following code, many files are created in each partition, but not the right number (6).
import org.apache.spark.sql.SaveMode
val dates = List(201208, 201209)
spark.sqlContext.sql("use db")
dates.foreach { date =>
  val df = spark.sqlContext
    .sql("select * from db.orig_parquet_0 where departure_date_year_month_int=" + date)
  df.write.format("orc")
    .option("compression", "zlib")
    .option("path", "s3://s3-bucket/test_orc_opt_1")
    .sortBy("departure_date_year", "activity_date_int", "agency_continent")
    .partitionBy("departure_date_year_month_int")
    .bucketBy(6, "departure_date_year")
    .mode(SaveMode.Append)
    .saveAsTable("db.test_orc_opt_1")
}
When I try to query it from Presto it throws the following exception:
Query 20180820_074141_00004_46w5b failed: Hive table 'db.test_orc_opt_1' is corrupt. The number of files in the directory (13) does not match the declared bucket count (6) for partition: departure_date_year_month_int=201208
Is there a way to enforce bucketing for Spark?
Spark version 2.3.1
Try changing
.bucketBy(6, "departure_date_year")
to
.bucketBy(13, "departure_date_year")
Which version of Spark are you using?
Spark bucketing is different from Hive bucketing. Use Hive to insert into the table instead of Spark.
Please look at slide 42:
https://www.slideshare.net/databricks/hive-bucketing-in-apache-spark-with-tejas-patil
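A minimal Scala sketch of that approach, assuming the data is first staged from Spark as a plain (non-bucketed) partitioned ORC table and the bucketed insert is then left to Hive itself; the staging table name and the HiveQL in the trailing comment are illustrative, not from the question:

import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder()
  .appName("stage-for-hive-bucketing")
  .enableHiveSupport()
  .getOrCreate()

val dates = List(201208, 201209)

dates.foreach { date =>
  // Spark 2.3 cannot write Hive/Presto-compatible bucket files, so write a
  // plain partitioned staging table from Spark.
  spark.sql(s"select * from db.orig_parquet_0 where departure_date_year_month_int=$date")
    .write
    .format("orc")
    .option("compression", "zlib")
    .partitionBy("departure_date_year_month_int")
    .mode(SaveMode.Append)
    .saveAsTable("db.test_orc_stage")   // hypothetical staging table
}

// Then let Hive perform the bucketed insert (run in Hive/beeline, not Spark),
// along the lines of:
//   SET hive.enforce.bucketing = true;
//   INSERT OVERWRITE TABLE db.test_orc_opt_1 PARTITION (departure_date_year_month_int)
//   SELECT * FROM db.test_orc_stage;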

Understanding how Hive SQL gets executed in Spark

I am new to Spark and Hive. I need to understand what happens behind the scenes when a Hive table is queried in Spark. I am using PySpark.
Ex:
warehouse_location = '/user/hive/warehouse'
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("Pyspark") \
    .config("spark.sql.warehouse.dir", warehouse_location) \
    .enableHiveSupport() \
    .getOrCreate()
DF = spark.sql("select * from hive_table")
In the above case, does the actual SQL run in the Spark framework, or does it run in Hive's MapReduce framework?
I am just wondering whether the SQL is processed in Hive or in Spark.
enableHiveSupport() and HiveContext are quite misleading, as they suggest some deeper relationship with Hive.
In practice, Hive support means that Spark will use the Hive metastore to read and write metadata. Before 2.0 there were some additional benefits (window function support, a better parser), but that is no longer the case today.
Hive support does not imply:
Full Hive Query Language compatibility.
Any form of computation on Hive.
SparkSQL allows reading and writing data to Hive tables. In addition to Hive data, any RDD can be converted to a DataFrame, and SparkSQL can be used to run queries on the DataFrame.
The actual execution will happen in Spark. You can check this in your example by running DF.count() and tracking the job via the Spark UI at http://localhost:4040.
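For example, a minimal Scala sketch of that check (the PySpark session in the question behaves the same way); the warehouse path and table name are the placeholders from the question:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hive-metastore-check")
  .config("spark.sql.warehouse.dir", "/user/hive/warehouse")
  .enableHiveSupport()   // only wires up the Hive metastore for metadata
  .getOrCreate()

val df = spark.sql("select * from hive_table")

// count() is an action, so it launches Spark jobs; open http://localhost:4040
// while it runs and you will see Spark stages and tasks, not a Hive MapReduce job.
println(df.count())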

Spark (pyspark) speed test

I am connected via JDBC to a DB with 500,000,000 rows and 14 columns.
Here is the code used:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
properties = {'jdbcurl': 'jdbc:db:XXXXXXXXX','user': 'XXXXXXXXX', 'password': 'XXXXXXXXX'}
data = spark.read.jdbc(properties['jdbcurl'], table='XXXXXXXXX', properties=properties)
data.show()
The code above took 9 seconds to display the first 20 rows of the DB.
Later I created a SQL temporary view via
data[['XXX','YYY']].createOrReplaceTempView("ZZZ")
and I ran the following query:
spark.sql('SELECT AVG(XXX) FROM ZZZ').show()
The code above took 1355.79 seconds (circa 23 minutes). Is this ok? It seems to be a large amount of time.
In the end I tried to count the number of rows of the DB
spark.sql('SELECT COUNT(*) FROM ZZZ').show()
It took 2848.95 seconds (circa 48 minutes).
Am I doing something wrong or are these amounts standard?
When you read a JDBC source with this method you lose parallelism, the main advantage of Spark. Please read the official Spark JDBC guidelines, especially regarding partitionColumn, lowerBound, upperBound and numPartitions. This will allow Spark to run multiple JDBC queries in parallel, resulting in a partitioned dataframe, as sketched below.
Tuning the fetchsize parameter may also help for some databases.
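A minimal Scala sketch of a partitioned JDBC read (the PySpark reader accepts the same options); the URL, table and credentials are the placeholders from the question, and the partition column, bounds and partition count are made-up values you would replace with a real numeric or date column and its actual min/max:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("jdbc-parallel-read").getOrCreate()

val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:db:XXXXXXXXX")
  .option("dbtable", "XXXXXXXXX")
  .option("user", "XXXXXXXXX")
  .option("password", "XXXXXXXXX")
  // split the read into parallel queries, one per partition
  .option("partitionColumn", "id")      // hypothetical numeric column
  .option("lowerBound", "1")
  .option("upperBound", "500000000")
  .option("numPartitions", "100")
  .option("fetchsize", "10000")         // rows fetched per round trip
  .load()

df.createOrReplaceTempView("ZZZ")
spark.sql("SELECT COUNT(*) FROM ZZZ").show()

With a setup like this, the COUNT(*) and AVG queries are executed by many concurrent JDBC partitions instead of a single cursor, which addresses the loss of parallelism described above.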

Spark dataframe saveAsTable not truncating data from Hive table

I am using Spark 2.1.0 and the Java SparkSession to run my Spark SQL.
I am trying to save a Dataset<Row> named 'ds' into a Hive table named schema_name.tbl_name using overwrite mode.
But when I run the statement below,
ds.write().mode(SaveMode.Overwrite)
    .option("header", "true")
    .option("truncate", "true")
    .saveAsTable(ConfigurationUtils.getProperty(ConfigurationUtils.HIVE_TABLE_NAME));
the table gets dropped after the first run.
When I rerun it, the table is created again with the data loaded.
Even using the truncate option didn't resolve my issue. Does saveAsTable truncate the data instead of dropping and recreating the table? If so, what is the correct way to do it in Java?
This is the Apache JIRA reference for my question; it seems to still be unresolved:
https://issues.apache.org/jira/browse/SPARK-21036
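If the table already exists with a matching schema, one possible workaround (a sketch only, not the resolution of that JIRA) is to bypass saveAsTable and use insertInto with overwrite mode, which replaces the data while keeping the table definition. Shown here in Scala; the Java API uses the same calls, and the helper name is made up:

import org.apache.spark.sql.{DataFrame, SaveMode}

// 'ds' is the DataFrame/Dataset<Row> from the question; the table name is
// resolved from configuration exactly as in the original code.
def overwriteKeepingTable(ds: DataFrame, tableName: String): Unit = {
  ds.write
    .mode(SaveMode.Overwrite)
    .insertInto(tableName)   // overwrites the rows, keeps the existing table/DDL
}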

How to load specific Hive partition in DataFrame Spark 1.6?

From Spark 1.6 onwards, as per the official docs, we cannot load specific Hive partitions into a DataFrame.
Up to Spark 1.5 the following used to work, and the dataframe would have the entity column and the data, as shown below:
DataFrame df = hiveContext.read().format("orc").load("path/to/table/entity=xyz")
However, this would not work in Spark 1.6.
If I give the base path like the following, it does not contain the entity column which I want in the DataFrame, as shown below:
DataFrame df = hiveContext.read().format("orc").load("path/to/table/")
How do I load a specific Hive partition into a dataframe? What was the driver behind removing this feature? I believe it was efficient. Is there an alternative way to achieve that in Spark 1.6?
As per my understanding, Spark 1.6 loads all partitions, and if I filter for specific partitions it is not efficient; it hits memory and throws GC (garbage collection) errors because thousands of partitions get loaded into memory instead of only the specific partition.
To load a specific partition into a DataFrame using Spark 1.6, we have to do the following: first set basePath and then give the path of the partition that needs to be loaded:
DataFrame df = hiveContext.read().format("orc")
    .option("basePath", "path/to/table/")
    .load("path/to/table/entity=xyz")
The above code will load only the specific partition into the DataFrame, and the entity partition column will be present.
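As a quick check, a Scala sketch of the same read in Spark 1.6 that confirms the partition column is recovered; the paths are the placeholders from the answer:

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)   // sc: an existing SparkContext

val df = hiveContext.read
  .format("orc")
  .option("basePath", "path/to/table/")
  .load("path/to/table/entity=xyz")

// The 'entity' column is derived from the directory name under basePath.
df.printSchema()                  // should list 'entity' alongside the data columns
df.select("entity").distinct().show()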
