Sometimes I get this error when a Databricks job is writing to Azure Data Lake:
HttpRequest: 409,err=PathAlreadyExists,appendpos=,cid=f448-0832-41ac-a2ab-8821453ef3c8,rid=7d4-101f-005a-578c-f82000000,connMs=0,sendMs=0,recvMs=38,sent=0,recv=168,method=PUT,url=https://awutmp.dfs.core.windows.net/bronze/app/_delta_log/_last_checkpoint?resource=file&timeout=90
My code reads from Blob Storage using Auto Loader and writes to Azure Data Lake:
Schemas:
val binarySchema = StructType(List(
  StructField("path", StringType, true),
  StructField("modificationTime", TimestampType, true),
  StructField("length", LongType, true),
  StructField("content", BinaryType, true)
))

val jsonSchema = StructType(List(
  StructField("EquipmentId", StringType, true),
  StructField("EquipmentName", StringType, true),
  StructField("EquipmentType", StringType, true),
  StructField("Name", StringType, true),
  StructField("Value", StringType, true),
  StructField("ValueType", StringType, true),
  StructField("LastSourceTimeStamp", StringType, true),
  StructField("LastReprocessDate", StringType, true),
  StructField("LastStateDuration", StringType, true),
  StructField("MessageId", StringType, true)
))
Create the Delta table if it does not exist:
val sinkPath = "abfss://bronze@awutmp.dfs.core.windows.net/app"

val tableSQL =
  s"""
  CREATE TABLE IF NOT EXISTS bronze.awutmpapp(
    path STRING,
    file_modification_time TIMESTAMP,
    file_length LONG,
    value STRING,
    json struct<EquipmentId STRING, EquipmentName STRING, EquipmentType STRING, Name STRING, Value STRING, ValueType STRING, LastSourceTimeStamp STRING, LastReprocessDate STRING, LastStateDuration STRING, MessageId STRING>,
    job_name STRING,
    job_version STRING,
    schema STRING,
    schema_version STRING,
    timestamp_etl_process TIMESTAMP,
    year INT GENERATED ALWAYS AS (YEAR(file_modification_time)) COMMENT 'generated from file_modification_time',
    month INT GENERATED ALWAYS AS (MONTH(file_modification_time)) COMMENT 'generated from file_modification_time',
    day INT GENERATED ALWAYS AS (DAY(file_modification_time)) COMMENT 'generated from file_modification_time'
  )
  USING DELTA
  PARTITIONED BY (year, month, day)
  LOCATION '${sinkPath}'
  """

spark.sql(tableSQL)
Options:
val options = Map[String, String](
  "cloudFiles.format" -> "BinaryFile",
  "cloudFiles.useNotifications" -> "true",
  "cloudFiles.queueName" -> queue,
  "cloudFiles.connectionString" -> queueConnString,
  "cloudFiles.validateOptions" -> "true",
  "cloudFiles.allowOverwrites" -> "true",
  "cloudFiles.includeExistingFiles" -> "true",
  "recursiveFileLookup" -> "true",
  "modifiedAfter" -> "2022-01-01T00:00:00.000+0000",
  "pathGlobFilter" -> "*.json.gz",
  "ignoreCorruptFiles" -> "true",
  "ignoreMissingFiles" -> "true"
)
Method to process each micro-batch:
import java.io.ByteArrayInputStream
import java.util.zip.GZIPInputStream
import scala.util.Try

// Gunzip the binary file content; returns None if the payload cannot be decompressed
def decompress(compressed: Array[Byte]): Option[String] =
  Try {
    val inputStream = new GZIPInputStream(new ByteArrayInputStream(compressed))
    scala.io.Source.fromInputStream(inputStream).mkString
  }.toOption

// UDF wrapping decompress; returns null for rows that could not be decompressed
def binaryToStringUDF: UserDefinedFunction = {
  udf { (data: Array[Byte]) => decompress(data).orNull }
}
def processMicroBatch: (DataFrame, Long) => Unit = (df: DataFrame, id: Long) => {
  val resultDF = df
    .withColumn("content_string", binaryToStringUDF(col("content")))
    .withColumn("array_value", split(col("content_string"), "\n"))
    .withColumn("array_noempty_values", expr("filter(array_value, value -> value <> '')"))
    .withColumn("value", explode(col("array_noempty_values")))
    .withColumn("json", from_json(col("value"), jsonSchema))
    .withColumnRenamed("length", "file_length")
    .withColumnRenamed("modificationTime", "file_modification_time")
    .withColumn("job_name", lit("jobName"))
    .withColumn("job_version", lit("1.0"))
    .withColumn("schema", lit(schema.toString))
    .withColumn("schema_version", lit("1.0"))
    .withColumn("timestamp_etl_process", current_timestamp())
    .withColumn("timestamp_tz", expr("current_timezone()"))
    .withColumn("timestamp_etl_process",
      to_utc_timestamp(col("timestamp_etl_process"), col("timestamp_tz")))
    .drop("timestamp_tz", "array_value", "array_noempty_values", "content", "content_string")

  resultDF
    .write
    .format("delta")
    .mode("append")
    .option("path", sinkPath)
    .save()
}
val storagePath = "wasbs://signal@externalaccount.blob.core.windows.net/"
val checkpointPath = "/checkpoint/signal/autoloader"
spark
  .readStream
  .format("cloudFiles")
  .options(options)
  .schema(binarySchema)
  .load(storagePath)
  .writeStream
  .format("delta")
  .outputMode("append")
  .foreachBatch(processMicroBatch)
  .option("checkpointLocation", checkpointPath)
  .trigger(Trigger.AvailableNow)
  .start()
  .awaitTermination()
Here is additional information I have seen in Azure Log Analytics:
How can I solve this error?
Related
How to write a schema for the JSON below:
"place_results": {
"title": "W2A Architects",
"place_id": "ChIJ4SUGuHw5xIkRAl0856nZrBM",
"data_id": "0x89c4397cb80625e1:0x13acd9a9e73c5d02",
"data_cid": "1417747306467056898",
"reviews_link": "httpshl=en",
"photos_link": "https=en",
"gps_coordinates": {
"latitude": 40.6027801,
"longitude": -75.4701499
},
"place_id_search": "http",
"rating": 3.7,
I am getting nulls when I apply the schema below. How can I know the correct data type to use?
StructField('place_results', StructType([
StructField('address', StringType(), True),
StructField('data_cid', StringType(), True),
StructField('data_id', StringType(), True),
StructField('gps_coordinates', StringType(), True),
StructField('open_state', StringType(), True),
StructField('phone', StringType(), True),
StructField('website', StringType(), True)
])),
This should work:
StructType([
    StructField('place_results',
        StructType([
            StructField('data_cid', StringType(), True),
            StructField('data_id', StringType(), True),
            StructField('gps_coordinates', StructType([
                StructField('latitude', DoubleType(), True),
                StructField('longitude', DoubleType(), True)]), True),
            StructField('photos_link', StringType(), True),
            StructField('place_id', StringType(), True),
            StructField('place_id_search', StringType(), True),
            StructField('rating', DoubleType(), True),
            StructField('reviews_link', StringType(), True),
            StructField('title', StringType(), True)]), True)
])
I got this using this command:
spark.read.option("multiLine", True).json("dbfs:/test/sample.json").schema
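For completeness, here is a minimal sketch (not from the original post) of how the inferred schema can be used: pass an explicit, even partial, schema when reading, then select the nested fields with dot notation. It assumes the same sample path, dbfs:/test/sample.json, and an existing spark session.
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Partial schema covering only the fields we need; JSON fields not listed here are ignored
schema = StructType([
    StructField('place_results', StructType([
        StructField('title', StringType(), True),
        StructField('rating', DoubleType(), True),
        StructField('gps_coordinates', StructType([
            StructField('latitude', DoubleType(), True),
            StructField('longitude', DoubleType(), True)]), True)]), True)
])

df = (spark.read
      .option("multiLine", True)
      .schema(schema)
      .json("dbfs:/test/sample.json"))

# Nested struct fields are addressed with dot notation
df.select("place_results.title",
          "place_results.gps_coordinates.latitude",
          "place_results.rating").show()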
I am trying to save a Hudi table from a Jupyter notebook with Hive sync enabled. I am using EMR 5.28.0 with AWS Glue enabled as the catalog:
# Create a DataFrame
inputDF = spark.createDataFrame(
[
("100", "2015-01-01", "2015-01-01T13:51:39.340396Z"),
("101", "2015-01-01", "2015-01-01T12:14:58.597216Z"),
("102", "2015-01-01", "2015-01-01T13:51:40.417052Z"),
("103", "2015-01-01", "2015-01-01T13:51:40.519832Z"),
("104", "2015-01-02", "2015-01-01T12:15:00.512679Z"),
("105", "2015-01-02", "2015-01-01T13:51:42.248818Z"),
],
["id", "creation_date", "last_update_time"]
)
# Specify common DataSourceWriteOptions in the single hudiOptions variable
hudiOptions = {
'hoodie.table.name': 'my_hudi_table',
'hoodie.datasource.write.recordkey.field': 'id',
'hoodie.datasource.write.partitionpath.field': 'creation_date',
'hoodie.datasource.write.precombine.field': 'last_update_time',
'hoodie.datasource.hive_sync.enable': 'true',
'hoodie.datasource.hive_sync.table': 'my_hudi_table',
'hoodie.datasource.hive_sync.partition_fields': 'creation_date',
'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor'
}
# Write a DataFrame as a Hudi dataset
(inputDF.write
.format('org.apache.hudi')
.option('hoodie.datasource.write.operation', 'insert')
.options(**hudiOptions)
.mode('overwrite')
.save('s3://dytyniak-test-data/myhudidataset/'))
I am receiving the following error:
An error occurred while calling o309.save.
: org.apache.hudi.hive.HoodieHiveSyncException: Cannot create hive connection jdbc:hive2://localhost:10000/
I assume you are following the tutorial from the AWS documentation. I got it to work with Hudi 0.9.0 by setting hive_sync.mode to hms in hudiOptions (see the Hudi docs):
hudiOptions = {
'hoodie.table.name': 'my_hudi_table',
'hoodie.datasource.write.recordkey.field': 'id',
'hoodie.datasource.write.partitionpath.field': 'creation_date',
'hoodie.datasource.write.precombine.field': 'last_update_time',
'hoodie.datasource.hive_sync.enable': 'true',
'hoodie.datasource.hive_sync.table': 'my_hudi_table',
'hoodie.datasource.hive_sync.partition_fields': 'creation_date',
'hoodie.datasource.hive_sync.partition_extractor_class':
'org.apache.hudi.hive.MultiPartKeysValueExtractor',
'hoodie.datasource.hive_sync.mode': 'hms'
}
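The write call itself stays the same; a minimal sketch, reusing the question's inputDF and S3 path (which are only illustrative here), would be:
# Same write as in the question, now picking up the updated hudiOptions
# (including 'hoodie.datasource.hive_sync.mode': 'hms')
(inputDF.write
    .format('org.apache.hudi')
    .option('hoodie.datasource.write.operation', 'insert')
    .options(**hudiOptions)
    .mode('overwrite')
    .save('s3://dytyniak-test-data/myhudidataset/'))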
Does Spark v2.3.1 depend on the local timezone when reading from a JSON file?
My src/test/resources/data/tmp.json:
[
{
"timestamp": "1970-01-01 00:00:00.000"
}
]
and Spark code:
SparkSession.builder()
.appName("test")
.master("local")
.config("spark.sql.session.timeZone", "UTC")
.getOrCreate()
.read()
.option("multiLine", true).option("mode", "PERMISSIVE")
.schema(new StructType()
.add(new StructField("timestamp", DataTypes.TimestampType, true, Metadata.empty())))
.json("src/test/resources/data/tmp.json")
.show();
Result:
+-------------------+
| timestamp|
+-------------------+
|1969-12-31 22:00:00|
+-------------------+
How can I make Spark return 1970-01-01 00:00:00.000?
P.S. This question is not a duplicate of Spark Strutured Streaming automatically converts timestamp to local time, because the solution provided there does not work for me and is already included in my question (see .config("spark.sql.session.timeZone", "UTC")).
I'm trying to write a Spark Structured Streaming (2.3) dataset to ScyllaDB (Cassandra).
My code to write the dataset:
def saveStreamSinkProvider(ds: Dataset[InvoiceItemKafka]) = {
ds
.writeStream
.format("cassandra.ScyllaSinkProvider")
.outputMode(OutputMode.Append)
.queryName("KafkaToCassandraStreamSinkProvider")
.options(
Map(
"keyspace" -> namespace,
"table" -> StreamProviderTableSink,
"checkpointLocation" -> "/tmp/checkpoints"
)
)
.start()
}
My ScyllaDB streaming sink classes:
class ScyllaSinkProvider extends StreamSinkProvider {
override def createSink(sqlContext: SQLContext,
parameters: Map[String, String],
partitionColumns: Seq[String],
outputMode: OutputMode): ScyllaSink =
new ScyllaSink(parameters)
}
class ScyllaSink(parameters: Map[String, String]) extends Sink {
override def addBatch(batchId: Long, data: DataFrame): Unit =
data.write
.cassandraFormat(
parameters("table"),
parameters("keyspace")
//parameters("cluster")
)
.mode(SaveMode.Append)
.save()
}
However, when I run this code, I receive an exception:
...
[error] +- StreamingExecutionRelation KafkaSource[Subscribe[transactions_load]], [key#7, value#8, topic#9, partition#10, offset#11L, timestamp#12, timestampType#13]
[error] at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:295)
[error] at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189)
[error] Caused by: org.apache.spark.sql.AnalysisException: 'write' can not be called on streaming Dataset/DataFrame;
[error] at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
[error] at org.apache.spark.sql.Dataset.write(Dataset.scala:3103)
[error] at cassandra.ScyllaSink.addBatch(CassandraDriver.scala:113)
[error] at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$3$$anonfun$apply$16.apply(MicroBatchExecution.scala:477)
...
I have seen a similar question, but that is for CosmosDB - Spark CosmosDB Sink: org.apache.spark.sql.AnalysisException: 'write' can not be called on streaming Dataset/DataFrame
You could convert it to an RDD first and then write:
class ScyllaSink(parameters: Map[String, String]) extends Sink {
  override def addBatch(batchId: Long, data: DataFrame): Unit = synchronized {
    val schema = data.schema
    // this ensures that the same query plan will be used
    val rdd: RDD[Row] = data.queryExecution.toRdd.mapPartitions { rows =>
      val converter = CatalystTypeConverters.createToScalaConverter(schema)
      rows.map(converter(_).asInstanceOf[Row])
    }
    // write the RDD to Cassandra
  }
}
I have created a dummy HBase table called emp having one record. Below is the data.
> hbase(main):005:0> put 'emp','1','personal data:name','raju'
> 0 row(s) in 0.1540 seconds
> hbase(main):006:0> scan 'emp'
> ROW                COLUMN+CELL
>  1                 column=personal data:name, timestamp=1512478562674, value=raju
> 1 row(s) in 0.0280 seconds
Now I have established a connection between HBase and PySpark using shc. Can you please help me with the code to read the above HBase table as a DataFrame in PySpark?
Version Details:
Spark Version 2.2.0, HBase 1.3.1, HCatalog 2.3.1
You can try it like this:
pyspark --master local --packages com.hortonworks:shc-core:1.1.1-1.6-s_2.10 --repositories http://repo.hortonworks.com/content/groups/public/ --files /etc/hbase/conf.cloudera.hbase/hbase-site.xml
empdata = ''.join("""
{
'table': {
'namespace': 'default',
'name': 'emp'
},
'rowkey': 'key',
'columns': {
'emp_id': {'cf': 'rowkey', 'col': 'key', 'type': 'string'},
'emp_name': {'cf': 'personal data', 'col': 'name', 'type': 'string'}
}
}
""".split())
df = sqlContext \
.read \
.options(catalog=empdata) \
.format('org.apache.spark.sql.execution.datasources.hbase') \
.load()
df.show()
Refer to this blog for more info: https://diogoalexandrefranco.github.io/interacting-with-hbase-from-pyspark/