I wrote the following code (Scala) on a Databricks cluster:
val input = spark.read.format("csv").option("header",true).load("input_path")
input.write.format("delta")
.partitionBy("col1Name","col2Name")
.mode("overwrite")
.save("output_path")
The input is read properly and it contains the col1Name and col2Name columns.
The problem is that the data is written as plain Parquet and the _delta_log folder stays empty (no items), so when I try to read the Delta data I get the error: `output_path` is not a Delta table.
How can I change the code so that the data is properly written in Delta format, with the _delta_log folder populated?
I have tried different configurations on the Databricks cluster (Apache Spark 2.4.5 with Scala 2.11, and Apache Spark 3.1.2 with Scala 2.12), but all of them give the same result.
Any idea how to fix this please?
Thanks a lot
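For reference, here is a minimal sketch of how the same write is typically set up on open-source Spark 3.1.2 / Scala 2.12, in case the cluster configuration is the culprit. The delta-core coordinates and the two session configs are standard Delta Lake settings and an assumption on my part, not something taken from the question; on Databricks the Delta library is built in and these configs are normally preset.

import org.apache.spark.sql.SparkSession

// Minimal sketch, assuming io.delta:delta-core_2.12 is on the classpath.
val spark = SparkSession.builder()
  .appName("csv-to-delta")
  .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
  .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
  .getOrCreate()

val input = spark.read.format("csv").option("header", true).load("input_path")

input.write.format("delta")
  .partitionBy("col1Name", "col2Name")
  .mode("overwrite")
  .save("output_path")

// Sanity check: a Delta table at output_path should now have a populated
// _delta_log directory and be readable as delta.
spark.read.format("delta").load("output_path").show(5)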
Related
I am facing the following issue while migrating data from Hive to SQL Server using a Spark job, with the query supplied through a JSON file.
Caused by: org.apache.spark.sql.execution.QueryExecutionException: Parquet column cannot be converted in file.
Column: [abc], Expected: string, Found: INT32
From what I understand, the Parquet files have a different column type than the Hive view declares. I am able to retrieve the data using tools like Teradata, etc.; the issue only appears while loading it to the other server.
Can anyone help me understand the problem and suggest a workaround?
Edit:
spark version 2.4.4.2
Scala version 2.11.12
Hive 2.3.6
SQL Server 2016
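Not an authoritative fix, but a hedged sketch of one workaround that is often tried for this "Expected: string, Found: INT32" mismatch: let Hive's SerDe reconcile the metastore-vs-file schema difference instead of Spark's native Parquet reader, and cast the offending column explicitly before writing to SQL Server. The table name and JDBC settings below are hypothetical placeholders, not taken from the question.

import org.apache.spark.sql.functions.col

// Hedged sketch: disable Spark's native Parquet conversion for Hive tables,
// then cast the mismatched column before the JDBC write.
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")

val df = spark.sql("SELECT * FROM hive_db.hive_view")  // hypothetical names
  .withColumn("abc", col("abc").cast("string"))

df.write
  .format("jdbc")
  .option("url", "jdbc:sqlserver://<host>:1433;databaseName=<db>")
  .option("dbtable", "dbo.target_table")
  .option("user", "<user>")
  .option("password", "<password>")
  .mode("append")
  .save()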
I am trying to fetch data from a Kafka topic and push it to an HDFS location, and I am facing the following issue.
After every Kafka message, the HDFS location is updated with part files in .c000.csv format. I have created a Hive table on top of the HDFS location, but Hive is not able to read any of the data written by Spark Structured Streaming.
Below is the file name format produced by Spark Structured Streaming:
part-00001-abdda104-0ae2-4e8a-b2bd-3cb474081c87.c000.csv
Here is my code to insert:
val kafkaDatademostr = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "ttt.tt.tt.tt.com:8092")
  .option("subscribe", "demostream")
  .option("kafka.security.protocol", "SASL_PLAINTEXT")
  .load

val interval = kafkaDatademostr.select(col("value").cast("string")).alias("csv").select("csv.*")

val interval2 = interval.selectExpr(
  "split(value,',')[0] as rog",
  "split(value,',')[1] as vol",
  "split(value,',')[2] as agh",
  "split(value,',')[3] as aght",
  "split(value,',')[4] as asd")

// interval2.writeStream.outputMode("append").format("console").start()
interval2.writeStream
  .outputMode("append")
  .partitionBy("rog")
  .format("csv")
  .trigger(Trigger.ProcessingTime("30 seconds"))
  .option("path", "hdfs://vvv/apps/hive/warehouse/area.db/test_kafcsv/")
  .start()
Can someone help me understand why it is creating files like this?
If I do dfs -cat /part-00001-ad35a3b6-8485-47c8-b9d2-bab2f723d840.c000.csv I can see my values, but Hive is not reading it because of the format issue.
These c000 files are the temporary part files the streaming query writes its data into. Because you are in append mode, the Spark executor holds the writer thread on the in-progress file; that is why, at run time, you cannot read it through the Hive serializer even though hadoop fs -cat works.
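For what it's worth, one quick way to confirm the rows themselves are fine while the stream is still running is to read the directory back with Spark rather than Hive. A minimal sketch; the path is the one from the question, everything else is illustrative:

// Hedged sketch: read the streaming output back with Spark's CSV reader to
// verify the contents, independent of the Hive table definition.
val check = spark.read
  .option("header", "false")
  .csv("hdfs://vvv/apps/hive/warehouse/area.db/test_kafcsv/")

check.show(10, truncate = false)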
Using Kafka 0.9.0 and Spark 2.1.0, I am using PySpark Structured Streaming to compute results and output them to a Kafka topic. I am following the Spark docs for this:
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-modes
Now, when I run the command below (output mode is complete since it is aggregating the streaming data):
(mydataframe.writeStream
.outputMode("complete")
.format("kafka")
.option("kafka.bootstrap.servers", "x.x.x.x:9092")
.option("topic", "topicname")
.option("checkpointLocation","/data/checkpoint/1")
.start())
it gives me the error below:
ERROR StreamExecution: Query [id = 0686130b-8668-48fa-bdb7-b79b63d82680, runId = b4b7494f-d8b8-416e-ae49-ad8498dfe8f2] terminated with error
org.apache.spark.sql.AnalysisException: Required attribute 'value' not found;
at org.apache.spark.sql.kafka010.KafkaWriter$$anonfun$6.apply(KafkaWriter.scala:73)
at org.apache.spark.sql.kafka010.KafkaWriter$$anonfun$6.apply(KafkaWriter.scala:73)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.kafka010.KafkaWriter$.validateQuery(KafkaWriter.scala:72)
at org.apache.spark.sql.kafka010.KafkaWriter$.write(KafkaWriter.scala:88)
at org.apache.spark.sql.kafka010.KafkaSink.addBatch(KafkaSink.scala:38)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$1.apply$mcV$sp(StreamExecution.scala:503)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$1.apply(StreamExecution.scala:503)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$1.apply(StreamExecution.scala:503)
at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:262)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:46)
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch(StreamExecution.scala:502)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1$$anonfun$1.apply$mcV$sp(StreamExecution.scala:255)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1$$anonfun$1.apply(StreamExecution.scala:244)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1$$anonfun$1.apply(StreamExecution.scala:244)
at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:262)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:46)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1.apply$mcZ$sp(StreamExecution.scala:244)
at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:43)
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches(StreamExecution.scala:239)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:177)
I am not sure what 'value' attribute it expects. I need help resolving this.
The console output sink produces the correct output, so the code seems to work fine; only when using Kafka as the output sink does this issue occur.
Your myDataFrame needs a column value (of either StringType or BinaryType) containing the payload (message) which you want to send to Kafka.
Currently you are trying to write to Kafka, but you don't describe which data is to be written.
One way to obtain such a column is to rename an existing column using .withColumnRenamed. If you want to write multiple columns, it is usually a good idea to create a column containing a JSON representation of the dataframe, which can be obtained using the to_json SQL function. But beware of .toJSON!
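A minimal sketch of both options, shown in Scala (the PySpark equivalents are functions.to_json, struct and withColumnRenamed). mydataframe is the aggregated frame from the question; "payloadCol" is a hypothetical existing column name:

import org.apache.spark.sql.functions.{col, struct, to_json}

// Option 1: rename an existing payload column to "value".
val forKafka1 = mydataframe.withColumnRenamed("payloadCol", "value")

// Option 2: pack all columns into a single JSON string column named "value",
// which is the attribute the Kafka sink looks for.
val forKafka2 = mydataframe
  .select(to_json(struct(mydataframe.columns.map(col): _*)).alias("value"))

forKafka2.writeStream
  .outputMode("complete")
  .format("kafka")          // note: the kafka sink requires Spark 2.2+, per the next answer
  .option("kafka.bootstrap.servers", "x.x.x.x:9092")
  .option("topic", "topicname")
  .option("checkpointLocation", "/data/checkpoint/1")
  .start()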
Spark 2.1.0 does not support Kafka as an output sink; it was introduced in 2.2.0, as per the documentation.
See also this answer, which links to the commit introducing the feature and provides an alternate solution, as well as this JIRA, which added the documentation in 2.2.1.
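For completeness, the alternate solution usually given for 2.1.x is a foreach sink that hands each row to a KafkaProducer. This is only a hedged sketch in Scala (the Python foreach sink arrived later); the broker address and topic come from the question, everything else is illustrative:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.sql.{ForeachWriter, Row}

// Hedged sketch: send the "value" column of each row to Kafka from a ForeachWriter.
class KafkaForeachWriter(brokers: String, topic: String) extends ForeachWriter[Row] {
  @transient private var producer: KafkaProducer[String, String] = _

  override def open(partitionId: Long, version: Long): Boolean = {
    val props = new Properties()
    props.put("bootstrap.servers", brokers)
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    producer = new KafkaProducer[String, String](props)
    true
  }

  override def process(row: Row): Unit =
    producer.send(new ProducerRecord[String, String](topic, row.getAs[String]("value")))

  override def close(errorOrNull: Throwable): Unit =
    if (producer != null) producer.close()
}

// Usage:
// mydataframe.writeStream
//   .outputMode("complete")
//   .foreach(new KafkaForeachWriter("x.x.x.x:9092", "topicname"))
//   .option("checkpointLocation", "/data/checkpoint/1")
//   .start()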
I am using Spark 2.1.0 and using Java SparkSession to run my SparkSQL.
I am trying to save a Dataset<Row> named ds into a Hive table named schema_name.tbl_name using overwrite mode.
But when I run the statement below,
ds.write().mode(SaveMode.Overwrite)
.option("header","true")
.option("truncate", "true")
.saveAsTable(ConfigurationUtils.getProperty(ConfigurationUtils.HIVE_TABLE_NAME));
the table is getting dropped after the first run.
When I rerun it, the table is created with the data loaded.
Even using the truncate option didn't resolve my issue. Does saveAsTable consider truncating the data instead of dropping and recreating the table? If so, what is the correct way to do it in Java?
Here is the reference to the Apache JIRA for my question. It seems to be unresolved till now.
https://issues.apache.org/jira/browse/SPARK-21036
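There is no definitive answer while that JIRA is open, but a commonly used workaround is to keep the table definition and only replace its rows: truncate the existing table, then append with insertInto instead of saveAsTable with overwrite. A hedged sketch (Scala shown for brevity; the Java DataFrameWriter calls are the same), assuming schema_name.tbl_name already exists and its column order matches the Dataset:

import org.apache.spark.sql.SaveMode

// Hedged workaround sketch: empty the existing Hive table without dropping it,
// then append the new rows. insertInto matches columns by position.
spark.sql("TRUNCATE TABLE schema_name.tbl_name")

ds.write
  .mode(SaveMode.Append)
  .insertInto("schema_name.tbl_name")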
I have a Spark dataframe which I want to save as a Hive table with partitions. I tried the following two statements but they don't work. I don't see any ORC files in the HDFS directory; it's empty. I can see that baseTable exists in the Hive console, but obviously it's empty because there are no files in HDFS.
The following two lines, saveAsTable() and insertInto(), do not work. The registerDataFrameAsTable() method works, but it creates an in-memory table and is causing OOM in my use case, as I have thousands of Hive partitions to process. I am new to Spark.
dataFrame.write().mode(SaveMode.Append).partitionBy("entity","date").format("orc").saveAsTable("baseTable");
dataFrame.write().mode(SaveMode.Append).format("orc").partitionBy("entity","date").insertInto("baseTable");
//the following works but creates in memory table and seems to be reason for OOM in my case
hiveContext.registerDataFrameAsTable(dataFrame, "baseTable");
Hope you have already got your answer, but I'm posting this for others' reference: partitionBy was only supported for Parquet until Spark 1.4; support for ORC, JSON, text and Avro was added in version 1.5+. Please refer to the doc below:
https://spark.apache.org/docs/1.6.1/api/java/org/apache/spark/sql/DataFrameWriter.html
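For reference, a hedged sketch of the pattern once you are on Spark 1.5+ (the saveAsTable call from the question is essentially correct there), written in Scala; the dynamic-partition settings in the second part are an assumption on my side, commonly needed when appending into a pre-existing partitioned Hive table with insertInto:

import org.apache.spark.sql.SaveMode

// On Spark 1.5+ the partitioned ORC saveAsTable from the question should work as-is.
dataFrame.write
  .mode(SaveMode.Append)
  .format("orc")
  .partitionBy("entity", "date")
  .saveAsTable("baseTable")

// If you instead insertInto an existing partitioned table, Hive's dynamic
// partitioning usually has to be enabled first (assumption, not from the answer):
hiveContext.setConf("hive.exec.dynamic.partition", "true")
hiveContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
dataFrame.write.mode(SaveMode.Append).insertInto("baseTable")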