I'm trying to use Spark Structured Streaming with the availableNow trigger to ingest data from an Azure Event Hub into a Delta Lake table in Databricks.
My code:
conn_str = "my conn string"
ehConf = {
    "eventhubs.connectionString":
        spark.sparkContext._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(conn_str),
    "eventhubs.consumerGroup": "my-consumer-grp",
}
read_stream = spark.readStream \
    .format("eventhubs") \
    .options(**ehConf) \
    .load()

stream = read_stream.writeStream \
    .format("delta") \
    .option("checkpointLocation", checkpoint_location) \
    .trigger(availableNow=True) \
    .toTable(full_table_name, mode="append")
According to the documentation (https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#triggers), the availableNow trigger should process all the data currently available, in micro-batch style.
However, this is not happening; instead, it processes only 1000 rows.
The output of the stream tells the story:
{
  "sources" : [ {
    "description" : "org.apache.spark.sql.eventhubs.EventHubsSource#2c5bba32",
    "startOffset" : {
      "my-hub-name" : {
        "0" : 114198857
      }
    },
    "endOffset" : {
      "my-hub-name" : {
        "0" : 119649573
      }
    },
    "latestOffset" : {
      "my-hub-name" : {
        "0" : 119650573
      }
    },
    "numInputRows" : 1000,
    "inputRowsPerSecond" : 0.0,
    "processedRowsPerSecond" : 36.1755236407047
  } ]
}
We can clearly see that the offset advances by far more than the 1000 rows processed: the batch range spans from 114198857 to 119649573, roughly 5.45 million events, yet numInputRows is only 1000.
I have verified the content of the target table; it contains only the last 1000 offsets.
According to the Event Hubs configuration for PySpark (https://github.com/Azure/azure-event-hubs-spark/blob/master/docs/PySpark/structured-streaming-pyspark.md#event-hubs-configuration), maxEventsPerTrigger defaults to 1000 * partitionCount. That should only affect how many events are processed per micro-batch, though, not the total number of records processed by the availableNow trigger.
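For reference, that limit is raised through the same ehConf dictionary. A minimal sketch, assuming the option key is maxEventsPerTrigger as listed in the linked configuration table (the exact key name may differ between connector versions):

# Assumption: the key name "maxEventsPerTrigger" comes from the linked
# Event Hubs configuration docs; verify it against your connector version.
ehConf["maxEventsPerTrigger"] = 100000

read_stream = spark.readStream \
    .format("eventhubs") \
    .options(**ehConf) \
    .load()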
Running the same query with the trigger set to once=True will instead ingest all of the events (assuming the batch size is set large enough).
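For comparison, the once=True variant that does ingest everything differs only in the trigger argument:

# Trigger.Once: process all available data in a single batch, then stop.
stream = read_stream.writeStream \
    .format("delta") \
    .option("checkpointLocation", checkpoint_location) \
    .trigger(once=True) \
    .toTable(full_table_name, mode="append")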
It seems like the availableNow trigger is not working as intended for Azure Event Hubs.
The availableNow trigger does not appear to be implemented yet in the azure-event-hubs-spark package.
But there is a workaround: use the Kafka connector against the Event Hubs Kafka endpoint - https://github.com/Azure/azure-event-hubs-for-kafka/tree/master/tutorials/spark
So essentially the previous code becomes:
bootstrap_servers = "my-evh-namespace.servicebus.windows.net:9093"
eventhub_endpoint = "my-evh-namespace-endpoint"
# The 'kafkashaded' part here is because it's running in Databricks.
# Otherwise drop that part.
EH_SASL = f"kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required username=\"$ConnectionString\" password=\"{eventhub_endpoint}\";"
topic = "my-eventhub-name"
read_stream = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", bootstrap_servers) \
    .option("kafka.sasl.mechanism", "PLAIN") \
    .option("kafka.security.protocol", "SASL_SSL") \
    .option("kafka.sasl.jaas.config", EH_SASL) \
    .option("subscribe", topic) \
    .option("maxOffsetsPerTrigger", 1000) \
    .option("startingOffsets", "earliest") \
    .option("includeHeaders", "true") \
    .load()

# Notice that the output writeStream remains the same.
stream = read_stream.writeStream \
    .format("delta") \
    .option("checkpointLocation", checkpoint_location) \
    .trigger(availableNow=True) \
    .toTable(full_table_name, mode="append")
This results in the stream performing as expected: ingesting all events up until the start time, in batches of size maxOffsetsPerTrigger.
Related
I am working on a pipeline in Databricks > Workflows > Delta Live Tables and having an issue with the streaming part.
Expectations:
One bronze table reads the JSON files with Auto Loader (cloudFiles) in streaming mode (spark.readStream)
One silver table reads and flattens the bronze table in streaming mode (dlt.read_stream)
Result:
When taking the root location as the source (loading /*, several hundred files): the pipeline starts, but the number of rows/files appended is not updated in the graph until the bronze part has completed. Then the silver part starts, the number of files/rows never updates either, and the pipeline terminates with a memory error.
When taking a very small number of files (/specific_folder among hundreds): the pipeline runs well and terminates with no error, but again, the number of rows/files appended is not updated in the graph until each part has completed.
This led me to the conclusion that the pipeline does not seem to run in streaming mode.
Maybe I am missing something about the configuration, or about how to run a DLT pipeline properly, and I would appreciate your help with this.
Here is the configuration of the pipeline:
{
  "id": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
  "clusters": [
    {
      "label": "default",
      "aws_attributes": {
        "instance_profile_arn": "arn:aws:iam::xxxxxxxxxxxx:instance-profile/iam_role_example"
      },
      "autoscale": {
        "min_workers": 1,
        "max_workers": 10,
        "mode": "LEGACY"
      }
    }
  ],
  "development": true,
  "continuous": false,
  "channel": "CURRENT",
  "edition": "PRO",
  "photon": false,
  "libraries": [
    {
      "notebook": {
        "path": "/Repos/user_example#xxxxxx.xx/dms/bronze_job"
      }
    }
  ],
  "name": "01-landing-task-1",
  "storage": "dbfs:/pipelines/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
  "configuration": {
    "SCHEMA": "example_schema",
    "RAW_MOUNT_NAME": "xxxx",
    "DELTA_MOUNT_NAME": "xxxx",
    "spark.sql.parquet.enableVectorizedReader": "false"
  },
  "target": "landing"
}
Here is the code of the pipeline (the query in the silver table actually contains many more columns, ~30, extracted with get_json_object):
import dlt
import pyspark.sql.functions as F
import pyspark.sql.types as T
from pyspark.sql.window import Window
RAW_MOUNT_NAME = spark.conf.get("RAW_MOUNT_NAME")
SCHEMA = spark.conf.get("SCHEMA")
SOURCE = spark.conf.get("SOURCE")
TABLE_NAME = spark.conf.get("TABLE_NAME")
PRIMARY_KEY_PATH = spark.conf.get("PRIMARY_KEY_PATH")
@dlt.table(
    name=f"{SCHEMA}_{TABLE_NAME}_bronze",
    table_properties={
        "quality": "bronze"
    }
)
def bronze_job():
    load_path = f"/mnt/{RAW_MOUNT_NAME}/{SOURCE}/5e*"
    return spark \
        .readStream \
        .format("text") \
        .option("encoding", "UTF-8") \
        .load(load_path) \
        .select("value", "_metadata") \
        .withColumnRenamed("value", "json") \
        .withColumn("id", F.expr(f"get_json_object(json, '$.{PRIMARY_KEY_PATH}')")) \
        .withColumn("_etl_timestamp", F.col("_metadata.file_modification_time")) \
        .withColumn("_metadata", F.col("_metadata").cast(T.StringType())) \
        .withColumn("_etl_operation", F.lit("U")) \
        .withColumn("_etl_to_delete", F.lit(False)) \
        .withColumn("_etl_file_name", F.input_file_name()) \
        .withColumn("_etl_job_processing_timestamp", F.current_timestamp()) \
        .withColumn("_etl_table", F.lit(f"{TABLE_NAME}")) \
        .withColumn("_etl_partition_date", F.to_date(F.col("_etl_timestamp"), "yyyy-MM-dd")) \
        .select("_etl_operation", "_etl_timestamp", "id", "json", "_etl_file_name", "_etl_job_processing_timestamp", "_etl_table", "_etl_partition_date", "_etl_to_delete", "_metadata")
@dlt.table(
    name=f"{SCHEMA}_{TABLE_NAME}_silver",
    table_properties={
        "quality": "silver",
        "delta.autoOptimize.optimizeWrite": "true",
        "delta.autoOptimize.autoCompact": "true"
    }
)
def silver_job():
    df = dlt.read_stream(f"{SCHEMA}_{TABLE_NAME}_bronze").where("_etl_table == 'extraction'")
    return df.select(
        df.id.alias('medium_id'),
        F.get_json_object(df.json, '$.request').alias('request_id'))
Thank you very much for your help!
I have a Spark Structured Streaming process built with PySpark that reads an Avro message from a Kafka topic, makes some transformations, and loads the data as Avro into a target topic.
I use the ABRiS package (https://github.com/AbsaOSS/ABRiS) to serialize/deserialize Confluent Avro, integrating with the Schema Registry.
The schema contains integer columns as follows:
{
  "name": "total_images",
  "type": [
    "null",
    "int"
  ],
  "default": null
},
{
  "name": "total_videos",
  "type": [
    "null",
    "int"
  ],
  "default": null
},
The process raises the following error: Cannot convert Catalyst type IntegerType to Avro type ["null","int"].
I've tried to convert the columns to be nullable, but the error persists.
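For context, the kind of nullability conversion I mean is along the lines of the sketch below (one common way to force a column's Catalyst nullability to true in PySpark before calling to_avro; not necessarily the exact code used here):

import pyspark.sql.functions as F

# when() without otherwise() produces a nullable column, which flips the
# Catalyst nullability flag; "total_images" is one of the schema fields above.
df = df.withColumn(
    "total_images",
    F.when(F.col("total_images").isNotNull(), F.col("total_images"))
)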
If someone has a suggestion, I would appreciate it.
I burned hours on this one.
Actually, it is unrelated to the ABRiS dependency (the behaviour is the same with the native spark-avro APIs).
There may be several root causes, but in my case (Spark 3.0.1, Scala with Dataset) it was related to the encoder and a wrong type in the case class holding the data.
In short, an Avro field defined with "type": ["null","int"] can't be mapped to a Scala Int; it needs Option[Int].
Using the following code:
test("Avro Nullable field") {
val schema: String =
"""
|{
| "namespace": "com.mberchon.monitor.dto.avro",
| "type": "record",
| "name": "TestAvro",
| "fields": [
| {"name": "strVal", "type": ["null", "string"]},
| {"name": "longVal", "type": ["null", "long"]}
| ]
|}
""".stripMargin
val topicName = "TestNullableAvro"
val testInstance = TestAvro("foo",Some(Random.nextInt()))
import sparkSession.implicits._
val dsWrite:Dataset[TestAvro] = Seq(testInstance).toDS
val allColumns = struct(dsWrite.columns.head, dsWrite.columns.tail: _*)
dsWrite
.select(to_avro(allColumns,schema) as 'value)
.write
.format("kafka")
.option("kafka.bootstrap.servers", bootstrap)
.option("topic", topicName)
.save()
val dsRead:Dataset[TestAvro] = sparkSession.read
.format("kafka")
.option("kafka.bootstrap.servers", bootstrap)
.option("subscribe", topicName)
.option("startingOffsets", "earliest")
.load()
.select(from_avro(col("value"), schema) as 'Metric)
.select("Metric.*")
.as[TestAvro]
assert(dsRead.collect().contains(testInstance))
}
It fails if the case class is defined as follows:
case class TestAvro(strVal: String, longVal: Long)
Cannot convert Catalyst type LongType to Avro type ["null","long"].
org.apache.spark.sql.avro.IncompatibleSchemaException: Cannot convert Catalyst type LongType to Avro type ["null","long"].
at org.apache.spark.sql.avro.AvroSerializer.newConverter(AvroSerializer.scala:219)
at org.apache.spark.sql.avro.AvroSerializer.$anonfun$newStructConverter$1(AvroSerializer.scala:239)
It works properly with:
case class TestAvro(strVal: String, longVal: Option[Long])
By the way, it would be more than nice to have support for SpecificRecord within Spark encoders (you can use Kryo, but it is sub-efficient), since, in order to use typed Datasets efficiently with my Avro data, I need to create additional case classes (which are duplicates of my SpecificRecords).
Here, I am sending JSON data to Kafka on the "test" topic, reading it back in Spark, applying a schema to the JSON, doing some transformations, and printing the result to the console.
Here is the code:
val kafkadata = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("zookeeper.connect", "localhost:2181")
  .option("subscribe", "test")
  .option("startingOffsets", "earliest")
  .option("max.poll.records", 10)
  .option("failOnDataLoss", false)
  .load()

val schema1 = new StructType()
  .add("id_sales_order", StringType)
  .add("item_collection",
    MapType(
      StringType,
      new StructType()
        .add("id", LongType)
        .add("ip", StringType)
        .add("description", StringType)
        .add("temp", LongType)
        .add("c02_level", LongType)
        .add("geo",
          new StructType()
            .add("lat", DoubleType)
            .add("long", DoubleType)
        )
    )
  )
val df = kafkadata.selectExpr("cast (value as string) as json")
  .select(from_json($"json", schema = schema1).as("data"))
  .select($"data.id_sales_order", explode($"data.item_collection"))

val query = df.writeStream
  .outputMode("append")
  .queryName("table")
  .format("console")
  .start()

query.awaitTermination()
spark.stop()
I am sending data to Kafka in two ways:
1) Single-line JSON:
{"id_sales_order": "2", "item_collection": {"2": {"id": 10,"ip": "68.28.91.22","description": "Sensor attached to the container ceilings","temp":35,"c02_level": 1475,"geo": { "lat":38.00, "long":97.00}}}}
It gives me the following output:
+--------------+---+--------------------+
|id_sales_order|key| value|
+--------------+---+--------------------+
| 2| 2|[10,68.28.91.22,S...|
+--------------+---+--------------------+
2) Multi-line JSON:
{
  "id_sales_order": "2",
  "item_collection": {
    "2": {
      "id": 10,
      "ip": "68.28.91.22",
      "description": "Sensor attached to the container ceilings",
      "temp": 35,
      "c02_level": 1475,
      "geo": {
        "lat": 38.00,
        "long": 97.00
      }
    }
  }
}
It does not give me any output:
+--------------+---+-----+
|id_sales_order|key|value|
+--------------+---+-----+
+--------------+---+-----+
The JSON coming from the source looks like the second one.
How do you handle such JSON while reading streaming data from Kafka?
I think the problem may be that the from_json function doesn't understand multi-line JSON.
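One way to narrow this down is to run from_json over the multi-line payload in a plain batch DataFrame; if it parses there, from_json itself is fine and the problem is more likely in how the messages reach Kafka (for example, a console producer sending each line of the document as a separate record). A minimal sketch, written in PySpark here for brevity:

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, MapType, StringType, LongType

spark = SparkSession.builder.getOrCreate()

# Reduced version of the schema above, just enough for the check.
schema1 = StructType() \
    .add("id_sales_order", StringType()) \
    .add("item_collection", MapType(StringType(), StructType().add("id", LongType())))

# The multi-line document, pasted verbatim as a single string value.
multiline_json = """{
  "id_sales_order": "2",
  "item_collection": {
    "2": { "id": 10 }
  }
}"""

df = spark.createDataFrame([(multiline_json,)], ["json"]) \
    .select(F.from_json(F.col("json"), schema1).alias("data")) \
    .select("data.id_sales_order", F.explode("data.item_collection"))

df.show()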
As described in Spark Structured Streaming with Hbase integration, I'm interested in writing data to HBase from the Structured Streaming framework. I've cloned the SHC code from GitHub, extended it with a sink provider, and tried to write records to HBase. But I received the error: "Queries with streaming sources must be executed with writeStream.start()". My Python code is below:
JARs:
Spark version: 2.4.0, scala.version: 2.11.8, com.hortonworks:shc-core:1.1.1-2.1-s_2.11, spark-sql-kafka-0-10_2.11
import io

import avro.schema
import avro.io

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession \
    .builder \
    .appName("SparkConsumer") \
    .getOrCreate()
print 'read Avro schema from file: {}...'.format(schema_name)
schema = avro.schema.parse(open(schema_name, 'rb').read())
reader = avro.io.DatumReader(schema)
print 'the schema is read'
rows = spark \
    .readStream \
    .format('kafka') \
    .option('kafka.bootstrap.servers', brokers) \
    .option('subscribe', topic) \
    .option('group.id', group_id) \
    .option('maxOffsetsPerTrigger', 1000) \
    .option("startingOffsets", "earliest") \
    .load()
rows.printSchema()
schema = StructType([
    StructField('consumer_id', StringType(), False),
    StructField('audit_system_id', StringType(), False),
    StructField('object_path', StringType(), True),
    StructField('object_type', StringType(), False),
    StructField('what_action', StringType(), False),
    StructField('when', LongType(), False),
    StructField('where', StringType(), False),
    StructField('who', StringType(), True),
    StructField('workstation', StringType(), True)
])
def decode_avro(msg):
    bytes_reader = io.BytesIO(bytes(msg))
    decoder = avro.io.BinaryDecoder(bytes_reader)
    data = reader.read(decoder)
    return (
        data['consumer_id'],
        data['audit_system_id'],
        data['object_path'],
        data['object_type'],
        data['what_action'],
        data['when'],
        data['where'],
        data['who'],
        data['workstation']
    )
udf_decode_avro = udf(decode_avro, schema)
values = rows.select('value')
values.printSchema()
changes = values.withColumn('change', udf_decode_avro(col('value'))).select('change.*')
changes.printSchema()
change_catalog = '''
{
    "table": {
        "namespace": "uba_input",
        "name": "changes"
    },
    "rowkey": "consumer_id",
    "columns": {
        "consumer_id": {"cf": "rowkey", "col": "consumer_id", "type": "string"},
        "audit_system_id": {"cf": "data", "col": "audit_system_id", "type": "string"},
        "object_path": {"cf": "data", "col": "object_path", "type": "string"},
        "object_type": {"cf": "data", "col": "object_type", "type": "string"},
        "what_action": {"cf": "data", "col": "what_action", "type": "string"},
        "when": {"cf": "data", "col": "when", "type": "bigint"},
        "where": {"cf": "data", "col": "where", "type": "string"},
        "who": {"cf": "data", "col": "who", "type": "string"},
        "workstation": {"cf": "data", "col": "workstation", "type": "string"}
    }
}'''
query = changes \
    .writeStream \
    .outputMode("append") \
    .format('HBase.HBaseSinkProvider') \
    .option('hbasecat', change_catalog) \
    .option("checkpointLocation", '/tmp/checkpoint') \
    .start()
# .format('org.apache.spark.sql.execution.datasources.hbase')\
# query = changes \
# .writeStream \
# .format('console') \
# .start()
query.awaitTermination()
I am not able to create an AWS EMR cluster when I add the "--enable-debugging" option.
I am able to create the cluster without the enable-debugging option.
I get an error like: aws: error: invalid json argument for option --configurations
My script to create the cluster is:
aws emr create-cluster \
--name test-cluster \
--release-label emr-5.5.0 \
--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=1,InstanceType=m3.xlarge \
--no-auto-terminate \
--termination-protected \
--visible-to-all-users \
--use-default-roles \
--log-uri s3://testlogs/ \
--enable-debugging \
--tags Owner=${OWNER} Environment=Dev Name=${OWNER}-test-cluster \
--ec2-attributes KeyName=$KEY,SubnetId=$SUBNET \
--applications Name=Hadoop Name=Pig Name=Hive \
--security-configuration test-sec-config \
--configurations s3://configurations/mapreduceconfig.json
The mapreduceconfig.json file is:
[
  {
    "Classification": "mapred-site",
    "Properties": {
      "mapred.tasktracker.map.tasks.maximum": 2
    }
  },
  {
    "Classification": "hadoop-env",
    "Properties": {},
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
          "HADOOP_DATANODE_HEAPSIZE": 2048,
          "HADOOP_NAMENODE_OPTS": "-XX:GCTimeRatio=19"
        }
      }
    ]
  }
]
Well, the error is self-apparent: the --configurations option does not support the s3:// file system. As per the examples and documentation (http://docs.aws.amazon.com/cli/latest/reference/emr/create-cluster.html), it only supports file:// paths and direct public links to a file in S3, like https://s3.amazonaws.com/myBucket/mapreduceconfig.json.
So your configuration file has to be publicly accessible.
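For example, you can download the file and pass it as a local path, e.g. --configurations file://./mapreduceconfig.json, or make the object public and reference it as --configurations https://s3.amazonaws.com/configurations/mapreduceconfig.json (the bucket and key here simply mirror the s3:// URI used above).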
Not sure how you got it working without the --enable-debugging option, though.