Spark DataFrame saveAsTable with partitionBy creates no ORC file in HDFS - apache-spark

I have a Spark dataframe which I want to save as Hive table with partitions. I tried the following two statements but they don't work. I don't see any ORC files in HDFS directory, it's empty. I can see baseTable is there in Hive console but obviously it's empty because of no files inside HDFS.
The following two lines saveAsTable() and insertInto()do not work. registerDataFrameAsTable() method works but it creates in memory table and causing OOM in my use case as I have thousands of Hive partitions to process. I am new to Spark.
dataFrame.write().mode(SaveMode.Append).partitionBy("entity","date").format("orc").saveAsTable("baseTable");
dataFrame.write().mode(SaveMode.Append).format("orc").partitionBy("entity","date").insertInto("baseTable");
//the following works but creates in memory table and seems to be reason for OOM in my case
hiveContext.registerDataFrameAsTable(dataFrame, "baseTable");

Hope you have already got your answer , but posting this answer for others reference, partitionBy was only supported for Parquet till Spark 1.4 , support for ORC ,JSON, text and avro was added in version 1.5+ please refer the doc below
https://spark.apache.org/docs/1.6.1/api/java/org/apache/spark/sql/DataFrameWriter.html

Related

Table created with "stored as Parquet" option using PySpark SQL or Hive does not actually store data files in Parquet format

I create table on Hadoop cluster using PySpark SQL:spark.sql("CREATE TABLE my_table (...) PARTITIONED BY (...) STORED AS Parquet") and load some data with: spark.sql("INSERT INTO my_table SELECT * FROM my_other_table"), however the resulting files do not seem to be Parquet files, they're missing ".snappy.parquet" extension.
The same problem occurs when repeating those steps in Hive.
But surprisingly when I create table using PySpark DataFrame: df.write.partitionBy("my_column").saveAsTable(name="my_table", format="Parquet")
everything works just fine.
So, my question is: what's wrong with the SQL way of creating and populating Parquet table?
Spark version 2.4.5, Hive version 3.1.2.
Update (27 Dec 2022 after #mazaneicha answer)
Unfortunately, there is no parquet-tools on the cluster I'm working with, so the best I could do is to check the content of the files with hdfs dfs -tail (and -head). And in all cases there is "PAR1" both at the beginning and at the end of the file. And even more - the meta-data of parquet version (implementation):
Method # of files Total size Parquet version File name
Hive Insert 8 34.7 G Jparquet-mr version 1.10.0 xxxxxx_x
PySpark SQL Insert 8 10.4 G Iparquet-mr version 1.6.0 part-xxxxx-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx.c000
PySpark DF insertInto 8 10.9 G Iparquet-mr version 1.6.0 part-xxxxx-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx.c000
PySpark DF saveAsTable 8 11.5 G Jparquet-mr version 1.10.1 part-xxxxx-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx-c000.snappy.parquet
(To create the same number of files I used "repartition" with df, and "distribute by" with SQL).
So, considering the above mentioned, it's still not clear:
Why there is no file extension in 3 out of 4 cases?
Why files created with Hive are so big? (no compression, I suppose).
Why PySpark SQL and PySpark Dataframe versions/implementations of parquet differ and how set them explicitly?
File format is not defined by the extension, but rather by the contents. You can quickly check if format is parquet by looking for magic bytes PAR1 at the very beginning and the very end of a file.
For in-depth format, metadata and consistency checking, try opening a file with parquet-tools.
Update:
As mentioned in online docs, parquet is supported by Spark as one of the many data sources via its common DataSource framework, so that it doesn't have to rely on Hive:
"When reading from Hive metastore Parquet tables and writing to non-partitioned Hive metastore Parquet tables, Spark SQL will try to use its own Parquet support instead of Hive SerDe for better performance..."
You can find and review this implementation in Spark git repo (its open-source! :))

spark structured streaming producing .c000.csv files

i am trying to fetch data from the kafka topic and pushing the same to hdfs location. I am facing following issue.
After every message (kafka) the hdfs location is updated with part files with .c000.csv format.i have created one hive table on top of the HDFS location, but the HIVE is not able to read data whatever written from spark structured streaming.
below is the file format after spark structured streaming
part-00001-abdda104-0ae2-4e8a-b2bd-3cb474081c87.c000.csv
Here is my code to insert:
val kafkaDatademostr = spark.readStream.format("kafka").option("kafka.bootstrap.servers","ttt.tt.tt.tt.com:8092").option("subscribe","demostream").option("kafka.security.protocol","SASL_PLAINTEXT").load
val interval=kafkaDatademostr.select(col("value").cast("string")) .alias("csv").select("csv.*")
val interval2=interval.selectExpr("split(value,',')[0] as rog" ,"split(value,',')[1] as vol","split(value,',')[2] as agh","split(value,',')[3] as aght","split(value,',')[4] as asd")
// interval2.writeStream.outputMode("append").format("console").start()
interval2.writeStream.outputMode("append").partitionBy("rog").format("csv").trigger(Trigger.ProcessingTime("30 seconds")).option("path", "hdfs://vvv/apps/hive/warehouse/area.db/test_kafcsv/").start()
Can someone help me, why is it creating files like this?
If I do dfs -cat /part-00001-ad35a3b6-8485-47c8-b9d2-bab2f723d840.c000.csv i can see my values.... but its not reading with hive due to format issue...
This c000 files are temporary files in which streaming data writes it data. As you are on appending mode, spark executor holds that writer thread , that's why on run time you are not able to read it using hive serializer, though hadoop fs -cat is working .

SparkSQL attempts to read data from non-existing path

I am having an issue with pyspark sql module. I created a partitioned table and saved it as parquet file into hive table by running spark job after multiple transformations.
Data load is successful into hive and also able to query the data. But when I try to query the same data from spark it says file path doesn't exist.
java.io.FileNotFoundException: File hdfs://localhost:8020/data/path/of/partition partition=15f244ee8f48a2f98539d9d319d49d9c does not exist
The partition which is mentioned in above error was the old partitioned column data which doesn't even exist now.
I have run the spark job which populates a new partition value.
I searched for solutions but all I can see is people say there was no issue in spark version 1.4 and there is an issue in 1.6
Can someone please suggest me the solution for this problem.

Spark SQL Fails when hive partition is missing

I have a table which has some missing partions. When I call it on hive it works fine
SELECT *
FROM my_table
but when call it from pyspark (v. 2.3.0) it fails with message Input path does not exist: hdfs://path/to/partition. The spark code I am running is just naive:
spark = ( SparkSession
.builder
.appName("prueba1")
.master("yarn")
.config("spark.sql.hive.verifyPartitionPath", "false")
.enableHiveSupport()
.getOrCreate())
spark.table('some_schema.my_table').show(10)
the config("spark.sql.hive.verifyPartitionPath", "false") has been proposed is
this question but seems to not work fine for me
Is there any way I can configure SparkSession so I can get rid of these. I am afraid that in the future more partitions will miss, so a hardcode solution is not possible
This error occurs when partitioned data dropped from HDFS i.e not using Hive commands to drop the partition.
If the data dropped from HDFS directly Hive doesn't know about the dropped the partition, when we query hive table it still looks for the directory and the directory doesn't exists in HDFS it results file not found exception.
To fix this issue we need to drop the partition associated with the directory in Hive table also by using
alter table <db_name>.<table_name> drop partition(<partition_col_name>=<partition_value>);
Then hive drops the partition from the metadata this is the only way to drop the metadata from the hive table if we dropped the partition directory from HDFS.
msck repair table doesn't drop the partitions instead only adds the new partitions if the new partition got added into HDFS.
The correct way to avoid these kind of issues in future drop the partitions by using Hive drop partition commands.
Does the other way around, .config("spark.sql.hive.verifyPartitionPath", "true") work for you? I have just managed to load data using spark-sql with this setting, while one of the partition paths from Hive was empty, and partition still existed in Hive metastore. Though there are caveats - it seems it takes significantly more time to load data compared to when this setting it set to false.

Does Spark support Partition Pruning with Parquet Files

I am working with a large dataset, that is partitioned by two columns - plant_name and tag_id. The second partition - tag_id has 200000 unique values, and I mostly access the data by specific tag_id values. If I use the following Spark commands:
sqlContext.setConf("spark.sql.hive.metastorePartitionPruning", "true")
sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")
val df = sqlContext.sql("select * from tag_data where plant_name='PLANT01' and tag_id='1000'")
I would expect a fast response as this resolves to a single partition. In Hive and Presto this takes seconds, however in Spark it runs for hours.
The actual data is held in a S3 bucket, and when I submit the sql query, Spark goes off and first gets all the partitions from the Hive metastore (200000 of them), and then calls refresh() to force a full status list of all these files in the S3 object store (actually calling listLeafFilesInParallel).
It is these two operations that are so expensive, are there any settings that can get Spark to prune the partitions earlier - either during the call to the metadata store, or immediately afterwards?
Yes, spark supports partition pruning.
Spark does a listing of partitions directories (sequential or parallel listLeafFilesInParallel) to build a cache of all partitions first time around. The queries in the same application, that scan data takes advantage of this cache. So the slowness that you see could be because of this cache building. The subsequent queries that scan data make use of the cache to prune partitions.
These are the logs which shows partitions being listed to populate the cache.
App > 16/11/14 10:45:24 main INFO ParquetRelation: Listing s3://test-bucket/test_parquet_pruning/month=2015-01 on driver
App > 16/11/14 10:45:24 main INFO ParquetRelation: Listing s3://test-bucket/test_parquet_pruning/month=2015-02 on driver
App > 16/11/14 10:45:24 main INFO ParquetRelation: Listing s3://test-bucket/test_parquet_pruning/month=2015-03 on driver
These are the logs showing pruning is happening.
App > 16/11/10 12:29:16 main INFO DataSourceStrategy: Selected 1 partitions out of 20, pruned 95.0% partitions.
Refer convertToParquetRelation and getHiveQlPartitions in HiveMetastoreCatalog.scala.
Just a thought:
Spark API documentation for HadoopFsRelation says,
( https://spark.apache.org/docs/1.6.2/api/java/org/apache/spark/sql/sources/HadoopFsRelation.html )
"...when reading from Hive style partitioned tables stored in file
systems, it's able to discover partitioning information from the paths
of input directories, and perform partition pruning before start
reading the data..."
So, i guess "listLeafFilesInParallel" could not be a problem.
A similar issue is already in spark jira: https://issues.apache.org/jira/browse/SPARK-10673
In spite of "spark.sql.hive.verifyPartitionPath" set to false and, there is no effect in performance, I suspect that the
issue might have been caused by unregistered partitions. Please list out the partitions of the table and verify if all
the partitions are registered. Else, recover your partitions as shown in this link:
Hive doesn't read partitioned parquet files generated by Spark
Update:
I guess appropriate parquet block size and page size were set while writing the data.
Create a fresh hive table with partitions mentioned, and file-format as parquet, load it from non-partitioned table using dynamic partition approach.
( https://cwiki.apache.org/confluence/display/Hive/DynamicPartitions )
Run a plain hive query and then compare by running a spark program.
Disclaimer: I am not a spark/parquet expert. The problem sounded interesting, and hence responded.
similar question popped up here recently:
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-reads-all-leaf-directories-on-a-partitioned-Hive-table-td35997.html#a36007
This question is old but I thought I'd post the solution here as well.
spark.sql.hive.convertMetastoreParquet=false
will use the Hive parquet serde instead of the spark inbuilt parquet serde. Hive's Parquet serde will not do a listLeafFiles on all partitions, but only and directly read from the selected partitions. On tables with many partitions and files, this is much faster (and cheaper, too). Feel free to try it ou! :)

Resources