Kafka JSON Data with Schema is Null in PySpark Structured Streaming. Mismatched input on new schema

I am trying to read JSON messages from Kafka in Spark Structured Streaming. An example of the messages in Kafka is as follows:
{
"_id": {
"$oid": "5eb292531c7d910b8c98dbce"
},
"Id": 37,
"Timestamp": {
"$date": 1582889068616
},
"TTNR": "R902170286",
"SNR": 91177446,
"State": 0,
"I_A1": "FALSE",
"I_B1": "FALSE",
"I1": 0.0037385,
"Mabs": -20.9814753,
"p_HD1": 31.0069236,
"pG": 27.640614,
"pT": 1.7169713,
"pXZ": 3.4712914,
"T3": 25.2174444,
"nan": 179.3099976,
"Q1": 0,
"a_01X": [
62.7839925,
62.7839925,
62.7839925,
62.7839925,
62.7839925,
62.7839925,
62.7839925,
62.7839925,
62.7839925,
62.7839925,
62.7839925,
62.7839925,
62.7839925,
62.7839925,
62.7839925,
62.7839925,
62.7839925,
62.7839925,
62.7839925,
62.7839925,
62.7839925,
62.7839925,
62.7839925,
62.7839925,
62.7839925,
62.7839925,
62.7839925,
62.7839925,
62.7839925,
62.7839925,
62.7839925,
62.7839925,
62.7839925,
62.7839925,
62.7839925,
62.7839925,
62.7839925,
62.7839925,
62.7839925,
62.7839925,
62.7839925,
62.7839925,
62.7839925,
62.7839925,
62.7839925,
62.7839925,
62.7839925,
62.7839925,
62.7839925,
62.7839925
]
}
After reading the stream from Kafka, the value field cast as a string looks like this:
|value |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|
{"_id":{"$oid":"5eb292531c7d910b8c98dbce"},"Id":37,"Timestamp":{"$date":1582889068616},"TTNR":"R902170286","SNR":91177446,"State":0,"I_A1":"FALSE","I_B1":"FALSE","I1":0.0037385,"Mabs":-20.9814753,"p_HD1":31.0069236,"pG":27.640614,"pT":1.7169713,"pXZ":3.4712914,"T3":25.2174444,"nan":179.3099976,"Q1":0,"a_01X":[62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925]}
|
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
A schema has been defined to select some fields as follows:
json_schema = StructType([ \
    StructField("_id", StructField("$oid", StringType())), \
    StructField("Id", DoubleType()), \
    StructField('Timestamp', StructField("$date", LongType())), \
    StructField("TTNR", StringType()), \
    StructField("SNR", DoubleType()), \
    StructField("State", LongType()), \
    StructField("I_A1", StringType()), \
    StructField("I_B1", StringType()), \
    StructField("I1", DoubleType()), \
    StructField("Mabs", DoubleType()), \
    StructField("p_HD1", DoubleType()), \
    StructField("pG", DoubleType()), \
    StructField("pT", DoubleType()), \
    StructField("pXZ", DoubleType()), \
    StructField("T3", DoubleType()), \
    StructField("nan", DoubleType()), \
    StructField("Q1", LongType()), \
    StructField("a_01X", ArrayType(DoubleType()))
])
(Solved with parsing error) But after trying to print to the console, I get null values:
data_stream_json = data_stream_value.select(from_json(col("value"), json_schema).alias("json_detail"))
data_stream_output = data_stream_json \
.writeStream \
.outputMode("append") \
.format("console") \
.start()
+----+----+----+----+
| Id|TTNR| SNR| Q1|
+----+----+----+----+
|null|null|null|null|
+----+----+----+----+
(New Error) After changing the schema, there is a new problem parsing the string:
pyspark.sql.utils.ParseException: u'\nmismatched input \'{\' expecting {\'SELECT\', \'FROM\', \'ADD\', \'AS\', \'ALL\', \'ANY\', \'DISTINCT\', \'WHERE\', \'GROUP\', \'BY\', \'GROUPING\', \'SETS\', \'CUBE\', \'ROLLUP\', \'ORDER\', \'HAVING\', \'LIMIT\', \'AT\', \'OR\', \'AND\', \'IN\', NOT, \'NO\', \'EXISTS\', \'BETWEEN\', \'LIKE\', RLIKE, \'IS\', \'NULL\', \'TRUE\', \'FALSE\', \'NULLS\', \'ASC\', \'DESC\', \'FOR\', \'INTERVAL\', \'CASE\', \'WHEN\', \'THEN\', \'ELSE\', \'END\', \'JOIN\', \'CROSS\', \'OUTER\', \'INNER\', \'LEFT\', \'SEMI\', \'RIGHT\', \'FULL\', \'NATURAL\', \'ON\', \'PIVOT\', \'LATERAL\', \'WINDOW\', \'OVER\', \'PARTITION\', \'RANGE\', \'ROWS\', \'UNBOUNDED\', \'PRECEDING\', \'FOLLOWING\', \'CURRENT\', \'FIRST\', \'AFTER\', \'LAST\', \'ROW\', \'WITH\', \'VALUES\', \'CREATE\', \'TABLE\', \'DIRECTORY\', \'VIEW\', \'REPLACE\', \'INSERT\', \'DELETE\', \'INTO\', \'DESCRIBE\', \'EXPLAIN\', \'FORMAT\', \'LOGICAL\', \'CODEGEN\', \'COST\', \'CAST\', \'SHOW\', \'TABLES\', \'COLUMNS\', \'COLUMN\', \'USE\', \'PARTITIONS\', \'FUNCTIONS\', \'DROP\', \'UNION\', \'EXCEPT\', \'MINUS\', \'INTERSECT\', \'TO\', \'TABLESAMPLE\', \'STRATIFY\', \'ALTER\', \'RENAME\', \'ARRAY\', \'MAP\', \'STRUCT\', \'COMMENT\', \'SET\', \'RESET\', \'DATA\', \'START\', \'TRANSACTION\', \'COMMIT\', \'ROLLBACK\', \'MACRO\', \'IGNORE\', \'BOTH\', \'LEADING\', \'TRAILING\', \'IF\', \'POSITION\', \'EXTRACT\', \'DIV\', \'PERCENT\', \'BUCKET\', \'OUT\', \'OF\', \'SORT\', \'CLUSTER\', \'DISTRIBUTE\', \'OVERWRITE\', \'TRANSFORM\', \'REDUCE\', \'SERDE\', \'SERDEPROPERTIES\', \'RECORDREADER\', \'RECORDWRITER\', \'DELIMITED\', \'FIELDS\', \'TERMINATED\', \'COLLECTION\', \'ITEMS\', \'KEYS\', \'ESCAPED\', \'LINES\', \'SEPARATED\', \'FUNCTION\', \'EXTENDED\', \'REFRESH\', \'CLEAR\', \'CACHE\', \'UNCACHE\', \'LAZY\', \'FORMATTED\', \'GLOBAL\', TEMPORARY, \'OPTIONS\', \'UNSET\', \'TBLPROPERTIES\', \'DBPROPERTIES\', \'BUCKETS\', \'SKEWED\', \'STORED\', \'DIRECTORIES\', \'LOCATION\', \'EXCHANGE\', \'ARCHIVE\', \'UNARCHIVE\', \'FILEFORMAT\', \'TOUCH\', \'COMPACT\', \'CONCATENATE\', \'CHANGE\', \'CASCADE\', \'RESTRICT\', \'CLUSTERED\', \'SORTED\', \'PURGE\', \'INPUTFORMAT\', \'OUTPUTFORMAT\', DATABASE, DATABASES, \'DFS\', \'TRUNCATE\', \'ANALYZE\', \'COMPUTE\', \'LIST\', \'STATISTICS\', \'PARTITIONED\', \'EXTERNAL\', \'DEFINED\', \'REVOKE\', \'GRANT\', \'LOCK\', \'UNLOCK\', \'MSCK\', \'REPAIR\', \'RECOVER\', \'EXPORT\', \'IMPORT\', \'LOAD\', \'ROLE\', \'ROLES\', \'COMPACTIONS\', \'PRINCIPALS\', \'TRANSACTIONS\', \'INDEX\', \'INDEXES\', \'LOCKS\', \'OPTION\', \'ANTI\', \'LOCAL\', \'INPATH\', IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 0)\n\n== SQL 
==\n{"fields":[{"metadata":{},"name":"_id","nullable":true,"type":{"metadata":{},"name":"$oid","nullable":true,"type":"string"}},{"metadata":{},"name":"Id","nullable":true,"type":"double"},{"metadata":{},"name":"Timestamp","nullable":true,"type":{"metadata":{},"name":"$date","nullable":true,"type":"long"}},{"metadata":{},"name":"TTNR","nullable":true,"type":"string"},{"metadata":{},"name":"SNR","nullable":true,"type":"double"},{"metadata":{},"name":"State","nullable":true,"type":"long"},{"metadata":{},"name":"I_A1","nullable":true,"type":"string"},{"metadata":{},"name":"I_B1","nullable":true,"type":"string"},{"metadata":{},"name":"I1","nullable":true,"type":"double"},{"metadata":{},"name":"Mabs","nullable":true,"type":"double"},{"metadata":{},"name":"p_HD1","nullable":true,"type":"double"},{"metadata":{},"name":"pG","nullable":true,"type":"double"},{"metadata":{},"name":"pT","nullable":true,"type":"double"},{"metadata":{},"name":"pXZ","nullable":true,"type":"double"},{"metadata":{},"name":"T3","nullable":true,"type":"double"},{"metadata":{},"name":"nan","nullable":true,"type":"double"},{"metadata":{},"name":"Q1","nullable":true,"type":"long"},{"metadata":{},"name":"a_01X","nullable":true,"type":{"containsNull":true,"elementType":"double","type":"array"}}],"type":"struct"}\n^^^\n'
I want some help with this.

Note: If you have complex nested JSON, try using the DataType.fromJson method to convert the JSON schema into a StructType schema, and keep the JSON schema outside of your code. For any schema change, just update the JSON schema and restart your application; it will pick up the new schema automatically.
I have converted your JSON data to a schema string; please check the code below.
scala> val jsonSchema = """{"type":"struct","fields":[{"name":"I1","type":"double","nullable":true,"metadata":{}},{"name":"I_A1","type":"string","nullable":true,"metadata":{}},{"name":"I_B1","type":"string","nullable":true,"metadata":{}},{"name":"Id","type":"long","nullable":true,"metadata":{}},{"name":"Mabs","type":"double","nullable":true,"metadata":{}},{"name":"Q1","type":"long","nullable":true,"metadata":{}},{"name":"SNR","type":"long","nullable":true,"metadata":{}},{"name":"State","type":"long","nullable":true,"metadata":{}},{"name":"T3","type":"double","nullable":true,"metadata":{}},{"name":"TTNR","type":"string","nullable":true,"metadata":{}},{"name":"Timestamp","type":{"type":"struct","fields":[{"name":"$date","type":"long","nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}},{"name":"_id","type":{"type":"struct","fields":[{"name":"$oid","type":"string","nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}},{"name":"a_01X","type":{"type":"array","elementType":"double","containsNull":true},"nullable":true,"metadata":{}},{"name":"nan","type":"double","nullable":true,"metadata":{}},{"name":"pG","type":"double","nullable":true,"metadata":{}},{"name":"pT","type":"double","nullable":true,"metadata":{}},{"name":"pXZ","type":"double","nullable":true,"metadata":{}},{"name":"p_HD1","type":"double","nullable":true,"metadata":{}}]}"""
jsonSchema: String = {"type":"struct","fields":[{"name":"I1","type":"double","nullable":true,"metadata":{}},{"name":"I_A1","type":"string","nullable":true,"metadata":{}},{"name":"I_B1","type":"string","nullable":true,"metadata":{}},{"name":"Id","type":"long","nullable":true,"metadata":{}},{"name":"Mabs","type":"double","nullable":true,"metadata":{}},{"name":"Q1","type":"long","nullable":true,"metadata":{}},{"name":"SNR","type":"long","nullable":true,"metadata":{}},{"name":"State","type":"long","nullable":true,"metadata":{}},{"name":"T3","type":"double","nullable":true,"metadata":{}},{"name":"TTNR","type":"string","nullable":true,"metadata":{}},{"name":"Timestamp","type":{"type":"struct","fields":[{"name":"$date","type":"long","nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}},{"name":"_id","type":{"type":"struct","fields":[{"name":"$oid","type":"string","nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}},{"name":"a_01X","type":{"type":"array","elementType":"double","containsNull":true},"nullable":true,"metadata":{}},{"name":"nan","type":"double","nullable":true,"metadata":{}},{"name":"pG","type":"double","nullable":true,"metadata":{}},{"name":"pT","type":"double","nullable":true,"metadata":{}},{"name":"pXZ","type":"double","nullable":true,"metadata":{}},{"name":"p_HD1","type":"double","nullable":true,"metadata":{}}]}
scala> val schema = DataType.fromJson(jsonSchema).asInstanceOf[StructType]
schema: org.apache.spark.sql.types.StructType = StructType(StructField(I1,DoubleType,true), StructField(I_A1,StringType,true), StructField(I_B1,StringType,true), StructField(Id,LongType,true), StructField(Mabs,DoubleType,true), StructField(Q1,LongType,true), StructField(SNR,LongType,true), StructField(State,LongType,true), StructField(T3,DoubleType,true), StructField(TTNR,StringType,true), StructField(Timestamp,StructType(StructField($date,LongType,true)),true), StructField(_id,StructType(StructField($oid,StringType,true)),true), StructField(a_01X,ArrayType(DoubleType,true),true), StructField(nan,DoubleType,true), StructField(pG,DoubleType,true), StructField(pT,DoubleType,true), StructField(pXZ,DoubleType,true), StructField(p_HD1,DoubleType,true))
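Since the question itself is in PySpark, roughly the same thing can be done there. A minimal sketch, assuming the schema JSON string above is available as a Python string named json_schema_str:
import json
from pyspark.sql.types import StructType

# Parse the JSON schema string (kept outside the code) into a StructType
json_schema = StructType.fromJson(json.loads(json_schema_str))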

I don't know your whole code, but from the part you have posted here it seems to me that you first need to convert your Kafka value into a string, since it initially arrives as binary, and then apply your schema to that string.
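A minimal PySpark sketch of that idea, assuming data_stream is the raw DataFrame returned by spark.readStream with the Kafka source (variable names follow the question):
from pyspark.sql.functions import col, from_json

# Kafka delivers key and value as binary, so cast the value to a string first
data_stream_value = data_stream.selectExpr("CAST(value AS STRING)")

# Then apply the schema to the string column
data_stream_json = data_stream_value.select(from_json(col("value"), json_schema).alias("json_detail"))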

I figured it out.
The trick was to change my Kafka serializer from AVRO to string format. Although AVRO preserves the schema, it also introduced some preceding characters, like a newline (see below), that were hard to remove and parse as JSON in my case.
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|value |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|
{"_id":{"$oid":"5e58f86d5afd84039c135405"},"Id":1,"Timestamp":{"$date":1582889068580},"TTNR":"R902170286","SNR":92177446,"State":0,"I_A1":"FALSE","I_B1":"FALSE","I1":0.0036622,"Mabs":-20.5236976,"p_HD1":30.985062,"pG":27.7779473,"pT":1.727958,"pXZ":3.4487671,"T3":25.2296518,"nan":215.3000031,"Q1":0,"a_01X":[62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925]}
|
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Getting my input as a string introduced more fields, but they were easier to remove. I had to define a bigger schema, but parsing was successful.
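For reference, a minimal sketch of how the parsed struct can then be flattened into the columns shown in the console output above, assuming data_stream_json from the earlier snippet:
from pyspark.sql.functions import col

# Pull the nested fields out of the parsed struct and stream them to the console
data_stream_output = data_stream_json \
    .select(col("json_detail.Id"), col("json_detail.TTNR"), col("json_detail.SNR"), col("json_detail.Q1")) \
    .writeStream \
    .outputMode("append") \
    .format("console") \
    .start()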

Related

write delta lake in Databricks error: HttpRequest 409 err PathAlreadyExist

Sometimes I get this error when a job in Databricks is writing to Azure Data Lake:
HttpRequest: 409,err=PathAlreadyExists,appendpos=,cid=f448-0832-41ac-a2ab-8821453ef3c8,rid=7d4-101f-005a-578c-f82000000,connMs=0,sendMs=0,recvMs=38,sent=0,recv=168,method=PUT,url=https://awutmp.dfs.core.windows.net/bronze/app/_delta_log/_last_checkpoint?resource=file&timeout=90
My code reads from blob storage using Auto Loader and writes to Azure Data Lake:
Schemas:
val binarySchema = StructType(List(
StructField("path", StringType, true),
StructField("modificationTime", TimestampType, true),
StructField("length", LongType, true),
StructField("content", BinaryType, true)
))
val jsonSchema = StructType(List(
StructField("EquipmentId", StringType, true),
StructField("EquipmentName", StringType, true),
StructField("EquipmentType", StringType, true),
StructField("Name", StringType, true),
StructField("Value", StringType, true),
StructField("ValueType", StringType, true),
StructField("LastSourceTimeStamp", StringType, true),
StructField("LastReprocessDate", StringType, true),
StructField("LastStateDuration", StringType, true),
StructField("MessageId", StringType, true)
))
Create delta table if not exists:
val sinkPath = "abfss://bronze@awutmp.dfs.core.windows.net/app"
val tableSQL =
s"""
CREATE TABLE IF NOT EXISTS bronze.awutmpapp(
path STRING,
file_modification_time TIMESTAMP,
file_length LONG,
value STRING,
json struct<EquipmentId STRING, EquipmentName STRING, EquipmentType STRING, Name STRING, Value STRING,ValueType STRING, LastSourceTimeStamp STRING, LastReprocessDate STRING, LastStateDuration STRING, MessageId STRING>,
job_name STRING,
job_version STRING,
schema STRING,
schema_version STRING,
timestamp_etl_process TIMESTAMP,
year INT GENERATED ALWAYS AS (YEAR(file_modification_time)) COMMENT 'generated from file_modification_time',
month INT GENERATED ALWAYS AS (MONTH(file_modification_time)) COMMENT 'generated from file_modification_time',
day INT GENERATED ALWAYS AS (DAY(file_modification_time)) COMMENT 'generated from file_modification_time'
)
USING DELTA
PARTITIONED BY (year, month, day)
LOCATION '${sinkPath}'
"""
spark.sql(tableSQL)
Options:
val options = Map[String, String](
"cloudFiles.format" -> "BinaryFile",
"cloudFiles.useNotifications" -> "true",
"cloudFiles.queueName" -> queue,
"cloudFiles.connectionString" -> queueConnString,
"cloudFiles.validateOptions" -> "true",
"cloudFiles.allowOverwrites" -> "true",
"cloudFiles.includeExistingFiles" -> "true",
"recursiveFileLookup" -> "true",
"modifiedAfter" -> "2022-01-01T00:00:00.000+0000",
"pathGlobFilter" -> "*.json.gz",
"ignoreCorruptFiles" -> "true",
"ignoreMissingFiles" -> "true"
)
Method to process each microbatch:
def decompress(compressed: Array[Byte]): Option[String] =
  Try {
    val inputStream = new GZIPInputStream(new ByteArrayInputStream(compressed))
    scala.io.Source.fromInputStream(inputStream).mkString
  }.toOption

def binaryToStringUDF: UserDefinedFunction = {
  udf { (data: Array[Byte]) => decompress(data).orNull }
}

def processMicroBatch: (DataFrame, Long) => Unit = (df: DataFrame, id: Long) => {
  val resultDF = df
    .withColumn("content_string", binaryToStringUDF(col("content")))
    .withColumn("array_value", split(col("content_string"), "\n"))
    .withColumn("array_noempty_values", expr("filter(array_value, value -> value <> '')"))
    .withColumn("value", explode(col("array_noempty_values")))
    .withColumn("json", from_json(col("value"), jsonSchema))
    .withColumnRenamed("length", "file_length")
    .withColumnRenamed("modificationTime", "file_modification_time")
    .withColumn("job_name", lit("jobName"))
    .withColumn("job_version", lit("1.0"))
    .withColumn("schema", lit(schema.toString))
    .withColumn("schema_version", lit("1.0"))
    .withColumn("timestamp_etl_process", current_timestamp())
    .withColumn("timestamp_tz", expr("current_timezone()"))
    .withColumn("timestamp_etl_process",
      to_utc_timestamp(col("timestamp_etl_process"), col("timestamp_tz")))
    .drop("timestamp_tz", "array_value", "array_noempty_values", "content", "content_string")

  resultDF
    .write
    .format("delta")
    .mode("append")
    .option("path", sinkPath)
    .save()
}
val storagePath = "wasbs://signal@externalaccount.blob.core.windows.net/"
val checkpointPath = "/checkpoint/signal/autoloader"
spark
.readStream
.format("cloudFiles")
.options(options)
.schema(binarySchema)
.load(storagePath)
.writeStream
.format("delta")
.outputMode("append")
.foreachBatch(processMicroBatch)
.option("checkpointLocation", checkpointPath)
.trigger(Trigger.AvailableNow)
.start()
.awaitTermination()
This is additional information I have seen in Azure Log Analytics:
How can I solve this error?

EMR Hudi cannot create hive connection jdbc:hive2://localhost:10000/

I am trying to save a Hudi table from a Jupyter notebook with hive-sync enabled. I am using EMR 5.28.0 with AWS Glue enabled as the catalog:
# Create a DataFrame
inputDF = spark.createDataFrame(
[
("100", "2015-01-01", "2015-01-01T13:51:39.340396Z"),
("101", "2015-01-01", "2015-01-01T12:14:58.597216Z"),
("102", "2015-01-01", "2015-01-01T13:51:40.417052Z"),
("103", "2015-01-01", "2015-01-01T13:51:40.519832Z"),
("104", "2015-01-02", "2015-01-01T12:15:00.512679Z"),
("105", "2015-01-02", "2015-01-01T13:51:42.248818Z"),
],
["id", "creation_date", "last_update_time"]
)
# Specify common DataSourceWriteOptions in the single hudiOptions variable
hudiOptions = {
'hoodie.table.name': 'my_hudi_table',
'hoodie.datasource.write.recordkey.field': 'id',
'hoodie.datasource.write.partitionpath.field': 'creation_date',
'hoodie.datasource.write.precombine.field': 'last_update_time',
'hoodie.datasource.hive_sync.enable': 'true',
'hoodie.datasource.hive_sync.table': 'my_hudi_table',
'hoodie.datasource.hive_sync.partition_fields': 'creation_date',
'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor'
}
# Write a DataFrame as a Hudi dataset
(inputDF.write
.format('org.apache.hudi')
.option('hoodie.datasource.write.operation', 'insert')
.options(**hudiOptions)
.mode('overwrite')
.save('s3://dytyniak-test-data/myhudidataset/'))
receiving the following error:
An error occurred while calling o309.save.
: org.apache.hudi.hive.HoodieHiveSyncException: Cannot create hive connection jdbc:hive2://localhost:10000/
I assume you are following the tutorial from AWS documentation. I got it to work using Hudi 0.9.0 by setting hive_sync.mode to hms in hudiOptions (see hudi docs):
hudiOptions = {
'hoodie.table.name': 'my_hudi_table',
'hoodie.datasource.write.recordkey.field': 'id',
'hoodie.datasource.write.partitionpath.field': 'creation_date',
'hoodie.datasource.write.precombine.field': 'last_update_time',
'hoodie.datasource.hive_sync.enable': 'true',
'hoodie.datasource.hive_sync.table': 'my_hudi_table',
'hoodie.datasource.hive_sync.partition_fields': 'creation_date',
'hoodie.datasource.hive_sync.partition_extractor_class':
'org.apache.hudi.hive.MultiPartKeysValueExtractor',
'hoodie.datasource.hive_sync.mode': 'hms'
}

How to group stream of Map to Map<String, Map<String, String>>?

I have a stream of maps like [name:name1, type:type1, desc:desc1, ordinal:1]. How can I convert/group it (with Groovy) into a map of maps, e.g. Map(type1: Map(name:name1, desc:desc1, ordinal:1))?
Stream of maps:
[name:productName, type:IN, ordinal:1, description:desc]
[name:productName1, type:IN, ordinal:2, description:desc]
[name:productName2, type:OUT, ordinal:3, description:desc]
and I want to get a map:
IN: Map[
[name:productName, type:IN, ordinal:1, description:desc.],
[name:productName1, type:IN, ordinal:2, description:desc.]]
OUT: Map[
[name:productName2, type:OUT, ordinal:3, description:desc.]]
You can use the Stream.collect() method with Collectors.groupingBy { it.type } to collect all elements into a map keyed by type, whose values are lists of elements. Consider the following example:
import java.util.stream.Collectors
import java.util.stream.Stream
def input = Stream.of(
[name: 'productName', type: 'IN', ordinal: 1, description: 'desc'],
[name: 'productName1', type: 'IN', ordinal: 2, description: 'desc'],
[name: 'productName2', type: 'OUT', ordinal: 3, description: 'desc'],
)
def result = input.collect(Collectors.groupingBy { it.type })
result.each { println it }
Output:
IN=[{name=productName, type=IN, ordinal=1, description=desc}, {name=productName1, type=IN, ordinal=2, description=desc}]
OUT=[{name=productName2, type=OUT, ordinal=3, description=desc}]
Alternatively, if your input is not a Stream but a List, you can use the good old Groovy Collection.groupBy(), which has the same effect:
def input2 = [[name: 'productName', type: 'IN', ordinal: 1, description: 'desc'],
[name: 'productName1', type: 'IN', ordinal: 2, description: 'desc'],
[name: 'productName2', type: 'OUT', ordinal: 3, description: 'desc']]
def result2 = input2.groupBy { it.type }
result2.each { println it }

How to read data from HBase table using pyspark?

I have created a dummy HBase table called emp with one record. Below is the data.
hbase(main):005:0> put 'emp','1','personal data:name','raju'
0 row(s) in 0.1540 seconds

hbase(main):006:0> scan 'emp'
ROW                  COLUMN+CELL
 1                   column=personal data:name, timestamp=1512478562674, value=raju
1 row(s) in 0.0280 seconds
Now I have established a connection between HBase and PySpark using shc. Can you please help me with the code to read the above HBase table as a DataFrame in PySpark?
Version Details:
Spark Version 2.2.0, HBase 1.3.1, HCatalog 2.3.1
You can try it like this:
pyspark --master local --packages com.hortonworks:shc-core:1.1.1-1.6-s_2.10 --repositories http://repo.hortonworks.com/content/groups/public/ --files /etc/hbase/conf.cloudera.hbase/hbase-site.xml
empdata = """{
    "table": {
        "namespace": "default",
        "name": "emp"
    },
    "rowkey": "key",
    "columns": {
        "emp_id": {"cf": "rowkey", "col": "key", "type": "string"},
        "emp_name": {"cf": "personal data", "col": "name", "type": "string"}
    }
}"""
df = sqlContext \
.read \
.options(catalog=empdata) \
.format('org.apache.spark.sql.execution.datasources.hbase') \
.load()
df.show()
Refer to this blog for more info: https://diogoalexandrefranco.github.io/interacting-with-hbase-from-pyspark/

How to query datasets in avro format?

This works with Parquet:
val sqlDF = spark.sql("SELECT DISTINCT field FROM parquet.`file-path`")
I tried the same approach with Avro, but it keeps giving me an error even if I use com.databricks.spark.avro.
When I execute the following query:
val sqlDF = spark.sql("SELECT DISTINCT Source_Product_Classification FROM avro.`file path`")
I get an AnalysisException. Why?
org.apache.spark.sql.AnalysisException: Failed to find data source: avro. Please find an Avro package at http://spark.apache.org/third-party-projects.html;; line 1 pos 51
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at org.apache.spark.sql.execution.datasources.ResolveDataSource$$anonfun$apply$1.applyOrElse(rules.scala:61)
at org.apache.spark.sql.execution.datasources.ResolveDataSource$$anonfun$apply$1.applyOrElse(rules.scala:38)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$1.apply(LogicalPlan.scala:58)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$1.apply(LogicalPlan.scala:58)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:307)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:305)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:58)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$1.apply(LogicalPlan.scala:58)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$1.apply(LogicalPlan.scala:58)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:307)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:305)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:58)
at org.apache.spark.sql.execution.datasources.ResolveDataSource.apply(rules.scala:38)
at org.apache.spark.sql.execution.datasources.ResolveDataSource.apply(rules.scala:37)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82)
at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
at scala.collection.immutable.List.foldLeft(List.scala:84)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74)
at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:69)
at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:67)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:50)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:63)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:592)
Changing the name of the format to com.databricks.spark.avro does not make any difference and queries fail.
val sqlDF = spark.sql("SELECT DISTINCT Source_Product_Classification FROM com.databricks.spark.avro`file-path`")
org.apache.spark.sql.catalyst.parser.ParseException:
extraneous input '.' expecting {<EOF>, ',', 'SELECT', 'FROM', 'ADD', 'AS', 'ALL', 'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', 'ROLLUP', 'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', 'EXISTS', 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', 'ASC', 'DESC', 'FOR', 'INTERVAL', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', 'JOIN', 'CROSS', 'OUTER', 'INNER', 'LEFT', 'RIGHT', 'FULL', 'NATURAL', 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 'RANGE', 'ROWS', 'UNBOUNDED', 'PRECEDING', 'FOLLOWING', 'CURRENT', 'FIRST', 'LAST', 'ROW', 'WITH', 'VALUES', 'CREATE', 'TABLE', 'VIEW', 'REPLACE', 'INSERT', 'DELETE', 'INTO', 'DESCRIBE', 'EXPLAIN', 'FORMAT', 'LOGICAL', 'CODEGEN', 'CAST', 'SHOW', 'TABLES', 'COLUMNS', 'COLUMN', 'USE', 'PARTITIONS', 'FUNCTIONS', 'DROP', 'UNION', 'EXCEPT', 'MINUS', 'INTERSECT', 'TO', 'TABLESAMPLE', 'STRATIFY', 'ALTER', 'RENAME', 'ARRAY', 'MAP', 'STRUCT', 'COMMENT', 'SET', 'RESET', 'DATA', 'START', 'TRANSACTION', 'COMMIT', 'ROLLBACK', 'MACRO', 'IF', 'DIV', 'PERCENT', 'BUCKET', 'OUT', 'OF', 'SORT', 'CLUSTER', 'DISTRIBUTE', 'OVERWRITE', 'TRANSFORM', 'REDUCE', 'USING', 'SERDE', 'SERDEPROPERTIES', 'RECORDREADER', 'RECORDWRITER', 'DELIMITED', 'FIELDS', 'TERMINATED', 'COLLECTION', 'ITEMS', 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 'FUNCTION', 'EXTENDED', 'REFRESH', 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', 'FORMATTED', 'GLOBAL', TEMPORARY, 'OPTIONS', 'UNSET', 'TBLPROPERTIES', 'DBPROPERTIES', 'BUCKETS', 'SKEWED', 'STORED', 'DIRECTORIES', 'LOCATION', 'EXCHANGE', 'ARCHIVE', 'UNARCHIVE', 'FILEFORMAT', 'TOUCH', 'COMPACT', 'CONCATENATE', 'CHANGE', 'CASCADE', 'RESTRICT', 'CLUSTERED', 'SORTED', 'PURGE', 'INPUTFORMAT', 'OUTPUTFORMAT', DATABASE, DATABASES, 'DFS', 'TRUNCATE', 'ANALYZE', 'COMPUTE', 'LIST', 'STATISTICS', 'PARTITIONED', 'EXTERNAL', 'DEFINED', 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', 'MSCK', 'REPAIR', 'RECOVER', 'EXPORT', 'IMPORT', 'LOAD', 'ROLE', 'ROLES', 'COMPACTIONS', 'PRINCIPALS', 'TRANSACTIONS', 'INDEX', 'INDEXES', 'LOCKS', 'OPTION', 'ANTI', 'LOCAL', 'INPATH', 'CURRENT_DATE', 'CURRENT_TIMESTAMP', IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 65)
== SQL ==
SELECT DISTINCT Source_Product_Classification FROM com.databricks.spark.avro`/uat/myfile`
-----------------------------------------------------------------^^^
at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:197)
at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:99)
at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:45)
at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:592)
... 48 elided
Spark SQL supports the Avro format through a separate spark-avro module.
A library for reading and writing Avro data from Spark SQL.
Please note that spark-avro is a separate module that is not included in Spark by default.
You should load the module using spark-submit --packages, e.g.
$ bin/spark-shell --packages com.databricks:spark-avro_2.11:3.2.0
See "With spark-shell or spark-submit" in the spark-avro documentation.
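As an illustration in PySpark (to stay consistent with the main question's language), once the package is on the classpath you can also load the Avro file explicitly and query it through a temporary view; a sketch using the /uat/myfile path from the question:
# Started with: pyspark --packages com.databricks:spark-avro_2.11:3.2.0
df = spark.read.format("com.databricks.spark.avro").load("/uat/myfile")
df.createOrReplaceTempView("avro_table")
sqlDF = spark.sql("SELECT DISTINCT Source_Product_Classification FROM avro_table")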
Jacek's answer works in general, but in my environment it was not working for obscure reasons: spark-shell --packages com.databricks:spark-avro_2.11:3.2.0 was hanging for a long time without producing any result.
I solved this problem using the --jars option along with spark-shell.
Steps:
1) Go to https://mvnrepository.com/artifact/com.databricks/spark-avro_2.11/4.0.0 and copy the link address of the jar: http://central.maven.org/maven2/com/databricks/spark-avro_2.11/4.0.0/spark-avro_2.11-4.0.0.jar
2) wget http://central.maven.org/maven2/com/databricks/spark-avro_2.11/4.0.0/spark-avro_2.11-4.0.0.jar .
3) spark-shell --jars <path where you downloaded the jar file>/spark-avro_2.11-4.0.0.jar
4) spark.read.format("com.databricks.spark.avro").load("s3://MYAVROLOCATION.avro")
This got converted into a DataFrame and I was able to print it.
In your case, once you get the DataFrame you can run SQL your way.
Note: If you are not using spark-shell, you can build an uber jar with sbt or Maven that includes spark-avro_2.11-4.0.0.jar, using the Maven coordinates below.
<dependency>
<groupId>com.databricks</groupId>
<artifactId>spark-avro_2.11</artifactId>
<version>4.0.0</version>
</dependency>
Note: A built-in Avro data source was introduced from Spark 2.4 onwards (SPARK-24768: Have a built-in AVRO data source implementation),
which means that all of the above is no longer necessary.
See the Spark 2.4.0 release notes.
Spark Avro Integration:
With Spark, we can integrate the Avro format using the spark-avro module. The spark-avro library was originally developed by Databricks as an open source library. The spark-avro module is external and is not included with spark-submit or spark-shell by default, so it has to be specified explicitly when submitting a Spark job.
In the following sections, I will explain how to integrate Spark with the Avro data format.
Spark version >= 2.4
From the Spark 2.4 release onwards, Spark SQL provides built-in support for reading and writing Apache Avro data.
Maven Dependency:
https://mvnrepository.com/artifact/org.apache.spark/spark-avro
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-avro_2.12</artifactId>
<version>2.4.5</version>
</dependency>
Spark Submit:
./bin/spark-submit --packages org.apache.spark:spark-avro_2.12:2.4.5 ...
SparkShell:
./bin/spark-shell --packages org.apache.spark:spark-avro_2.12:2.4.5 ...
Example:
SparkAvroWriteExample.scala
import org.apache.spark.SparkConf;
import org.apache.spark.sql.SparkSession;
case class Employee( id:Long, name:String, salary:Float, deptId: Int)
object SparkAvroWriteExample {
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setIfMissing("spark.master", "local[*]").setAppName("Spark Avro Read Examples")
val spark = SparkSession.builder().config(conf).getOrCreate();
val employeeList = List(Employee(1, "Ranga", 10000, 1),
Employee(2, "Vinod", 1000, 1),
Employee(3, "Nishanth", 500000, 2),
Employee(4, "Manoj", 25000, 1),
Employee(5, "Yashu", 1600, 1),
Employee(6, "Raja", 50000, 2)
);
val employeeDF = spark.createDataFrame(employeeList);
employeeDF.coalesce(1).write.format("avro").mode("overwrite").save("employees.avro");
spark.close();
}
}
SparkAvroReadExample.scala
import org.apache.spark.SparkConf;
import org.apache.spark.sql.SparkSession;
object SparkAvroReadExample {
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setIfMissing("spark.master", "local[*]").setAppName("Spark Avro Read Examples")
val spark = SparkSession.builder().config(conf).getOrCreate();
val employeeDF = spark.read.format("avro").load("employees.avro");
employeeDF.printSchema();
employeeDF.foreach(employee => {println(employee);});
spark.close();
}
}
Github link
https://github.com/rangareddy/ranga-spark-poc/tree/master/spark-2.4/SparkAvro
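Since the main question uses PySpark, here is a rough PySpark equivalent of the write/read above, as a sketch assuming Spark >= 2.4 with the spark-avro package loaded as shown:
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("Spark Avro Example").getOrCreate()

employees = [(1, "Ranga", 10000.0, 1), (2, "Vinod", 1000.0, 1), (3, "Nishanth", 500000.0, 2)]
employeeDF = spark.createDataFrame(employees, ["id", "name", "salary", "deptId"])

# Built-in Avro data source (Spark >= 2.4)
employeeDF.coalesce(1).write.format("avro").mode("overwrite").save("employees.avro")
spark.read.format("avro").load("employees.avro").printSchema()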
Spark version < 2.4
In Spark versions below 2.4, we need to explicitly specify the Avro format as com.databricks.spark.avro, otherwise we will get the error org.apache.spark.sql.AnalysisException: Failed to find data source: avro.
Maven Dependency:
Spark Version    Compatible version of Avro Data Source for Spark
1.2              0.2.0
1.3              1.0.0
1.4+             2.0.1
2.0 - 2.1        3.2.0
2.2 - 2.3        4.0.0
https://mvnrepository.com/artifact/com.databricks/spark-avro
<dependency>
<groupId>com.databricks</groupId>
<artifactId>spark-avro_2.11</artifactId>
<version>4.0.0</version>
</dependency>
Spark Submit:
./bin/spark-submit --packages com.databricks:spark-avro_2.11:4.0.0 ...
SparkShell:
./bin/spark-shell --packages com.databricks:spark-avro_2.11:4.0.0 ...
Examples:
SparkAvroWriteExample.scala
import org.apache.spark.SparkConf;
import org.apache.spark.sql.SparkSession;
case class Employee( id:Long, name:String, salary:Float, deptId: Int)
object SparkAvroWriteExample {
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setIfMissing("spark.master", "local[*]").setAppName("Spark Avro Read Examples")
val spark = SparkSession.builder().config(conf).getOrCreate();
val employeeList = List(Employee(1, "Ranga", 10000, 1),
Employee(2, "Vinod", 1000, 1),
Employee(3, "Nishanth", 500000, 2),
Employee(4, "Manoj", 25000, 1),
Employee(5, "Yashu", 1600, 1),
Employee(6, "Raja", 50000, 2)
);
val employeeDF = spark.createDataFrame(employeeList);
employeeDF.coalesce(1).write.format("com.databricks.spark.avro").mode("overwrite").save("employees.avro");
spark.close();
}
}
SparkAvroReadExample.scala
import org.apache.spark.SparkConf;
import org.apache.spark.sql.SparkSession;
object SparkAvroReadExample {
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setIfMissing("spark.master", "local[*]").setAppName("Spark Avro Read Examples")
val spark = SparkSession.builder().config(conf).getOrCreate();
val employeeDF = spark.read.format("com.databricks.spark.avro").load("employees.avro");
employeeDF.printSchema();
employeeDF.foreach(employee => {println(employee);});
spark.close();
}
}
Github link
https://github.com/rangareddy/ranga-spark-poc/tree/master/spark-2.3/SparkAvro
That's all, folks!
