Spark creates extra partitions inside partition - apache-spark

I have a Spark program that reads data from a text file as an RDD and converts it into a Parquet file using Spark SQL, partitioned on a single partition key. Once in a while, instead of creating one partition it creates a partition inside a partition.
My data is partitioned on date and the output folder is in s3://datalake/intk/parquetdata.
After the Spark job runs I see output such as:
s3://datalake/intk/parquetdata/datekey=102018/a.parquet
s3://datalake/intk/parquetdata/datekey=102118/a.parquet
s3://datalake/intk/parquetdata/datekey=102218/datekey=102218/a.parquet
Code snippet:
val savePath = "s3://datalake/intk/parquetdata/"
InputDF.write
  .mode(savemode)
  .partitionBy(partitionKey)
  .parquet(savePath)
I am running the job on an EMR 5.16 cluster with Spark 2.2 and Scala 2.11, and the output location is S3. I am not sure why this behavior takes place; it does not follow any pattern and only occurs once in a while.
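A small diagnostic sketch for spotting the nested directories after a run, assuming the datekey=... layout shown above; the Hadoop FileSystem calls are standard, and the path is the one from the question:

import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}

val savePath = "s3://datalake/intk/parquetdata/"
val fs = FileSystem.get(new URI(savePath), spark.sparkContext.hadoopConfiguration)

// List the first-level partition directories and flag any that contain
// another datekey=... directory nested inside them.
fs.listStatus(new Path(savePath))
  .filter(s => s.isDirectory && s.getPath.getName.startsWith("datekey="))
  .foreach { outer =>
    fs.listStatus(outer.getPath)
      .filter(s => s.isDirectory && s.getPath.getName.startsWith("datekey="))
      .foreach(n => println(s"Nested partition found: ${n.getPath}"))
  }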

Related

How Spark performs write operation to Hive

I am new to Spark. I am working on a job that reads data from a source, does some transformations, and writes to Hive.
For writing to Hive, I am doing dataframe.write.insertInto(hive_table)
My question is: how does Spark write the entire dataframe to Hive? Will different partitions be written in parallel by different executors, or will Spark collect all the data from the various partitions onto the driver and then try to insert it in one go?
Spark partitions and Hive partitions are different things. Spark executors write to the various Hive partitions in parallel.
Spark partitions are processed in parallel by the executors, and when an executor encounters a Hive partition key, it writes to a new file under the Hive location for that key.
So if you have 5 Spark partitions being processed in parallel and the data belongs to 3 Hive partitions, each executor writes a file for every Hive partition key it encounters.
You can therefore end up with 5 files in each of the 3 Hive partition locations, one written by each executor.
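A rough sketch of that scenario, assuming a pre-existing Hive table named hive_table whose last column is a partition column hive_part_key; the column names, partition counts, and the dynamic-partitioning setup are illustrative only:

import org.apache.spark.sql.functions.col

// 5 Spark partitions, 3 distinct Hive partition keys: every Spark partition can
// contain every key, so the insert may produce up to 5 x 3 = 15 files.
// (Dynamic partitioning may need to be enabled in Hive for this insert.)
val df = spark.range(0, 1000)
  .withColumn("hive_part_key", (col("id") % 3).cast("string"))
  .repartition(5)

df.write.mode("append").insertInto("hive_table")

// Shuffling by the partition key first groups each key into a single Spark
// partition, which typically yields one file per Hive partition instead.
df.repartition(col("hive_part_key"))
  .write.mode("append").insertInto("hive_table")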

spark structured streaming producing .c000.csv files

I am trying to fetch data from a Kafka topic and push it to an HDFS location. I am facing the following issue.
After every Kafka message, the HDFS location is updated with part files in the .c000.csv format. I have created a Hive table on top of the HDFS location, but Hive is not able to read whatever Spark Structured Streaming has written.
Below is the file name format produced by Spark Structured Streaming:
part-00001-abdda104-0ae2-4e8a-b2bd-3cb474081c87.c000.csv
Here is my code to insert:
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.streaming.Trigger

val kafkaDatademostr = spark.readStream.format("kafka").option("kafka.bootstrap.servers", "ttt.tt.tt.tt.com:8092").option("subscribe", "demostream").option("kafka.security.protocol", "SASL_PLAINTEXT").load()
val interval = kafkaDatademostr.select(col("value").cast("string")).alias("csv").select("csv.*")
val interval2 = interval.selectExpr("split(value,',')[0] as rog", "split(value,',')[1] as vol", "split(value,',')[2] as agh", "split(value,',')[3] as aght", "split(value,',')[4] as asd")
// interval2.writeStream.outputMode("append").format("console").start()
// note: the file sink also needs a checkpoint location (option("checkpointLocation", ...) or the spark.sql.streaming.checkpointLocation conf)
interval2.writeStream.outputMode("append").partitionBy("rog").format("csv").trigger(Trigger.ProcessingTime("30 seconds")).option("path", "hdfs://vvv/apps/hive/warehouse/area.db/test_kafcsv/").start()
Can someone help me understand why it is creating files like this?
If I do dfs -cat /part-00001-ad35a3b6-8485-47c8-b9d2-bab2f723d840.c000.csv I can see my values, but Hive is not able to read it because of the format issue.
These c000 files are the temporary files into which the streaming query writes its data. Since you are in append mode, the Spark executor holds the writer thread open on them, which is why at run time you cannot read them through the Hive serializer even though hadoop fs -cat works.
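If the goal is just to confirm that the streamed CSV output is readable, one option is to read the directory back with Spark itself rather than through Hive. A minimal sketch, assuming the output path from the question; the column names are left as the defaults since no header is written:

// Batch-read the streamed CSV part files; the partitionBy("rog") directories are
// picked up by partition discovery, the remaining columns come back as _c0.._c3.
val written = spark.read
  .option("header", "false")
  .csv("hdfs://vvv/apps/hive/warehouse/area.db/test_kafcsv/")

written.printSchema()
written.show(10, truncate = false)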

Is it possible to use Spark with ORC file format without Hive?

I am working with HDP 2.6.4, more specifically Hive 1.2.1 with Tez 0.7.0 and Spark 2.2.0.
My task is simple. Store data in ORC file format then use Spark to process the data. To achieve this, I am doing this:
Create a Hive table through HiveQL
Use spark.sql("select ... from ...") to load data into a dataframe
Process against the dataframe
My questions are:
1. What is Hive's role behind the scene?
2. Is it possible to skip Hive?
You can skip Hive and use Spark SQL to run the command in step 1.
In your case, Hive is defining a schema over your data and providing a query layer for Spark and external clients to communicate with.
Otherwise, spark.read.orc and df.write.orc exist for reading and writing dataframes directly on the filesystem, as sketched below.
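A minimal sketch of the Hive-free path; the paths, table name, and column names below are placeholders, not anything from your cluster:

// Work against ORC files directly on the filesystem, no Hive table involved.
val df = spark.read.orc("hdfs:///data/events_orc")            // placeholder path
val summary = df.filter("amount > 0").groupBy("customer_id").count()
summary.write.mode("overwrite").orc("hdfs:///data/events_summary_orc")

// If a queryable table is still wanted, Spark SQL can define one without HiveQL:
spark.sql("CREATE TABLE IF NOT EXISTS events USING orc LOCATION 'hdfs:///data/events_orc'")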

Apache Spark parquet partition

I am trying to save a DataFrame to an Amazon S3 parquet folder using the date as the partition key. I am loading the data day by day.
The first time I save it, I see the partition folder (i.e. "txDate=20160714").
When I process the next files, they all go to "txDate=__HIVE_DEFAULT_PARTITION__" (see the screenshot of the parquet Hive partitions).
txDate is int
I am using Databricks platform, Apache Spark 1.6.2 and Hadoop 2.
My code is in Python (PySpark):
# initial save
df_newTx.write.partitionBy(['txDate']).format('parquet').mode('append').save("/mnt/dm.Inv/f_Tx.parquet")
# incremental save
df_tx_all.write.partitionBy(['txDate']).format('parquet').mode('append').save("/mnt/dm.Inv/f_Tx.parquet")

Spark DataFrame saveAsTable with partitionBy creates no ORC file in HDFS

I have a Spark dataframe which I want to save as a Hive table with partitions. I tried the following two statements but they don't work. I don't see any ORC files in the HDFS directory; it's empty. I can see that baseTable exists in the Hive console, but obviously it's empty because there are no files in HDFS.
The following two lines, saveAsTable() and insertInto(), do not work. The registerDataFrameAsTable() method works, but it creates an in-memory table and causes OOM in my use case, as I have thousands of Hive partitions to process. I am new to Spark.
dataFrame.write().mode(SaveMode.Append).partitionBy("entity","date").format("orc").saveAsTable("baseTable");
dataFrame.write().mode(SaveMode.Append).format("orc").partitionBy("entity","date").insertInto("baseTable");
//the following works but creates in memory table and seems to be reason for OOM in my case
hiveContext.registerDataFrameAsTable(dataFrame, "baseTable");
Hope you have already got your answer, but posting this for others' reference: partitionBy was only supported for Parquet until Spark 1.4; support for ORC, JSON, text and Avro was added in version 1.5+. Please refer to the doc below:
https://spark.apache.org/docs/1.6.1/api/java/org/apache/spark/sql/DataFrameWriter.html
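For Spark 1.5+ a rough sketch of the two write paths, assuming a HiveContext named hiveContext is in scope and that baseTable is either created by Spark (first path) or pre-created as a partitioned ORC table (second path); the dynamic-partition settings are the usual Hive ones and may already be set in your environment:

import org.apache.spark.sql.SaveMode

// Path 1: let Spark create and append to the partitioned ORC table.
dataFrame.write
  .mode(SaveMode.Append)
  .partitionBy("entity", "date")
  .format("orc")
  .saveAsTable("baseTable")

// Path 2: insert into an existing partitioned Hive table. Do not combine
// partitionBy with insertInto; enable dynamic partitioning and make sure the
// partition columns ("entity", "date") are the last columns of the dataframe.
hiveContext.setConf("hive.exec.dynamic.partition", "true")
hiveContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
dataFrame.write
  .mode(SaveMode.Append)
  .insertInto("baseTable")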