Avro bytes from Event Hub cannot be deserialized with PySpark - apache-spark

We are sending Avro data encoded with azure.schemaregistry.encoder.avroencoder to Event Hub from a standalone Python job, and we can deserialize it with the same encoder in another standalone Python consumer. The schema registry client is also supplied to the Avro encoder in this case.
This is the standalone producer I use:
import os
from azure.eventhub import EventHubProducerClient, EventData
from azure.schemaregistry import SchemaRegistryClient
from azure.schemaregistry.encoder.avroencoder import AvroEncoder
from azure.identity import DefaultAzureCredential
os.environ["AZURE_CLIENT_ID"] = ''
os.environ["AZURE_TENANT_ID"] = ''
os.environ["AZURE_CLIENT_SECRET"] = ''
token_credential = DefaultAzureCredential()
fully_qualified_namespace = ""
group_name = "testSchemaReg"
eventhub_connection_str = ""
eventhub_name = ""
definition = """
{"namespace": "example.avro",
"type": "record",
"name": "User",
"fields": [
{"name": "name", "type": "string"},
{"name": "favorite_number", "type": ["int", "null"]},
{"name": "favorite_color", "type": ["string", "null"]}
]
}"""
schema_registry_client = SchemaRegistryClient(fully_qualified_namespace, token_credential)
avro_encoder = AvroEncoder(client=schema_registry_client, group_name=group_name, auto_register=True)
eventhub_producer = EventHubProducerClient.from_connection_string(
conn_str=eventhub_connection_str,
eventhub_name=eventhub_name
)
with eventhub_producer, avro_encoder:
event_data_batch = eventhub_producer.create_batch()
dict_content = {"name": "Bob", "favorite_number": 7, "favorite_color": "red"}
event_data = avro_encoder.encode(dict_content, schema=definition, message_type=EventData)
event_data_batch.add(event_data)
eventhub_producer.send_batch(event_data_batch)
I was able to deserialize it using this standalone consumer:
import asyncio
from azure.eventhub.aio import EventHubConsumerClient

async def on_event(partition_context, event):
    print("Received the event: \"{}\" from the partition with ID: \"{}\"".format(
        event.body_as_str(encoding='UTF-8'), partition_context.partition_id))
    print("message type is :")
    print(type(event))
    # avro_encoder is the same AvroEncoder instance shown in the producer above
    dec = avro_encoder.decode(event)
    print("decoded msg:\n")
    print(dec)
    await partition_context.update_checkpoint(event)

async def main():
    client = EventHubConsumerClient.from_connection_string(
        conn_str="connection str",
        consumer_group="$Default",
        eventhub_name="")
    async with client:
        await client.receive(on_event=on_event, starting_position="-1")

asyncio.run(main())
As a next step, I replaced the standalone Python consumer with a PySpark consumer running in a Synapse notebook. Below are the problems I faced.
The from_avro function in Spark is not able to deserialize the Avro message encoded with the Azure encoder.
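For reference, the failing call looked roughly like this (a sketch rather than the exact notebook code; ehConf stands for the connector configuration, and definition is the same Avro schema JSON used by the producer above):
from pyspark.sql.functions import col
from pyspark.sql.avro.functions import from_avro

# Sketch of the attempt: read the stream with the azure-event-hubs-spark
# connector and try to decode the binary `body` column with from_avro.
# `ehConf` (connection settings) is assumed here.
df = spark.readStream.format("eventhubs").options(**ehConf).load()
decoded = df.select(from_avro(col("body"), definition).alias("user"))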
As a workaround, I tried creating a UDF that uses the Azure encoder, but the encoder expects the event to be of type EventData, whereas when Spark reads the data through the Event Hubs API we get the body as a byte array.
from pyspark.sql.functions import udf

@udf
def decode(row_msg):
    encoder = AvroEncoder(client=schema_registry_client)
    return encoder.decode(bytes(row_msg))
I don't see any proper documentation on a deserializer that can be used with Spark or any other distributed system; all the examples use standalone clients. Is there any connector we can use with Spark/Flink?

Answering my own question: the Azure Event Hubs schema registry doesn't currently support Spark or any other distributed system.
They are working on adding this support to Spark:
https://github.com/Azure/azure-event-hubs-spark/pull/573

Because the Avro schema is not part of the payload ("the data"), the from_avro function in Spark will not be able to deserialize the message. This is expected.
In order to decode, you also need to pass the associated content_type value from the EventData object into the decode method. This content_type value holds the schema ID that is used to retrieve the schema for deserialization. You can set content_type along with content in the MessageContent dict. This sample should be helpful.
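As a rough sketch of that decode path (the namespace placeholder and the source of the content_type string are assumptions; the content type has to be carried over from the EventData, since Spark only hands you the raw body bytes):
from azure.identity import DefaultAzureCredential
from azure.schemaregistry import SchemaRegistryClient
from azure.schemaregistry.encoder.avroencoder import AvroEncoder

# Sketch only: rebuild the content/content_type pair the encoder expects.
# The content_type value (holding the schema ID) must be propagated from the
# original EventData by whatever reads the stream.
schema_registry_client = SchemaRegistryClient("<fully qualified namespace>", DefaultAzureCredential())
avro_encoder = AvroEncoder(client=schema_registry_client, group_name="testSchemaReg")

def decode_bytes(body_bytes, content_type):
    message_content = {"content": bytes(body_bytes), "content_type": content_type}
    return avro_encoder.decode(message_content)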
We currently don't have a connector to be used with Spark/Flink. However, if this is something you're interested in, please feel free to file a feature-request issue here: https://github.com/Azure/azure-sdk-for-python/issues.

Related

How to send data from a pyspark dataframe to Kinesis putRecord in parallel using boto3 in AWS Glue

Trying to send JSON data from a pyspark dataframe to Kinesis put_record using boto3.
I was able to write a for loop for this; here's the code:
for value in DataFrame.toJSON().toLocalIterator():
    jsondict = json.loads(value)
    # Send message to Kinesis DataStream
    client = boto3.client('kinesis')
    response = client.put_record(
        StreamName="kinesisStreamname",
        Data=bytes(json.dumps(jsondict), 'utf-8'),
        PartitionKey=str('part_key')
    )
But since the for loop is executed on the master (driver) node, it does not run in parallel on the worker nodes.
I tried map and foreach on the pyspark dataframe, but it's not working yet.
Here's the sample code I tried:
def pushToKinesis(value):
    # jsondict = json.loads(value)
    client = boto3.client('kinesis')
    response = client.put_record(
        StreamName="kinesis_streamName",
        Data=bytes(json.dumps("test"), 'utf-8'),
        PartitionKey=str('part_key')
    )

rdd = jsonMapDataFrame.toJSON().map(pushToKinesis)
Running in AWS Glue and sending data to Kinesis using boto3.
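Not part of the original post, but the usual way to push the sends onto the workers is foreachPartition, which is an action (unlike map, which is lazy and never executes without one). A minimal sketch, reusing the dataframe and the placeholder stream name from the loop above:
import boto3

def push_partition(rows):
    # One client per partition, created on the worker that processes it.
    client = boto3.client('kinesis')
    for value in rows:
        # value is already a JSON string produced by toJSON()
        client.put_record(
            StreamName="kinesisStreamname",
            Data=bytes(value, 'utf-8'),
            PartitionKey='part_key'
        )

DataFrame.toJSON().foreachPartition(push_partition)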

Alternative to Azure Event Hub Capture for sending Event Hub messages to Blob Storage?

Is there any way to send my Event Hub data, which is sent in JSON format via an HTTP POST from Postman, to Blob Storage in Azure?
I've tried using Event Hubs' Capture feature, but unfortunately the data is saved in Avro format, and I'm having a hard time converting it back to its original JSON format.
I would therefore like to send my Event Hub data directly to some kind of blob storage that keeps my messages in their original JSON format, which I can then retrieve from my SPA via an Azure Function (HTTP GET trigger).
Also, will I have to create a new blob for each message in a container? I don't think I'll be able to write them all into one blob, since I won't be able to retrieve my data via the frontend when I trigger my GET HTTP function at the same time.
Are there alternatives to Event Hub Capture? Is plain blob storage the best solution? I've read a few articles on Azure Time Series Insights and Cosmos DB, but I'm not sure whether these are the best ways to handle my problem.
So the issue is that I initially send this as raw data via Postman.
Raw data as JSON sent via Postman:
{
    "id": 1,
    "receiver": "2222222222222",
    "message": {
        "Name": "testing",
        "PersonId": 2,
        "CarId": 2,
        "GUID": "1s3q1d-s546dq1-8e22e",
        "LineId": 2,
        "SvcId": 2,
        "Lat": -64.546547,
        "Lon": -64.546547,
        "TimeStamp": "2021-03-18T08:29:36.758Z",
        "Recorder": "dq65ds4qdezzer",
        "Env": "DEV"
    },
    "operator": 20404,
    "sender": "MSISDN",
    "binary": 1,
    "sent": "2021-03-18T08:29:36.758Z"
}
Once this is caught by Event Hub Capture, it is converted to an Avro file.
I am trying to retrieve the data by using fastavro and converting it to JSON format.
The problem is that I am not getting back the same raw data that was initially sent via Postman, and I can't find a way to convert it back to its original state. Why does the Avro file also include additional information from Postman?
I probably need to find a way to convert only the "Body" field, but for some reason it also adds a "bytes" wrapper inside the body.
I am just trying to get back my original raw data that was sent via Postman.
__init__.py (Azure Function):
import logging
import os
import string
import json
import uuid
import avro.schema
import tempfile
import azure.functions as func
from azure.storage.blob import BlobServiceClient, BlobClient, ContainerClient, __version__
from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumReader, DatumWriter
from fastavro import reader, json_writer

# Because the Apache Python avro package is written in pure Python, it is relatively slow, therefore I make use of fastavro
def avroToJson(avroFile):
    with open("json_file.json", "w") as json_file:
        with open(avroFile, "rb") as avro_file:
            avro_reader = reader(avro_file)
            json_writer(json_file, avro_reader.writer_schema, avro_reader)

def main(req: func.HttpRequest) -> func.HttpResponse:
    logging.info('Python HTTP trigger function processed a request.')
    print('Processor started using path ' + os.getcwd())
    connect_str = "###########"
    container = ContainerClient.from_connection_string(connect_str, container_name="####")
    blob_list = container.list_blobs()  # List the blobs in the container.
    for blob in blob_list:
        # Content_length == 508 is an empty file, so process only content_length > 508 (skip empty files).
        if blob.size > 508:
            print('Downloaded a non empty blob: ' + blob.name)
            # Create a blob client for the blob.
            blob_client = ContainerClient.get_blob_client(container, blob=blob.name)
            # Construct a file name based on the blob name.
            cleanName = str.replace(blob.name, '/', '_')
            cleanName = os.getcwd() + '\\' + cleanName
            # Download the file.
            with open(cleanName, "wb+") as my_file:  # Open the file to write. Create it if it doesn't exist.
                my_file.write(blob_client.download_blob().readall())  # Write blob contents into the file.
            avroToJson(cleanName)
            with open('json_file.json', 'r') as file:
                jsonStr = file.read()
                return func.HttpResponse(jsonStr, status_code=200)
Expected result:
{
    "id": 1,
    "receiver": "2222222222222",
    "message": {
        "Name": "testing",
        "PersonId": 2,
        "CarId": 2,
        "GUID": "1s3q1d-s546dq1-8e22e",
        "LineId": 2,
        "SvcId": 2,
        "Lat": -64.546547,
        "Lon": -64.546547,
        "TimeStamp": "2021-03-18T08:29:36.758Z",
        "Recorder": "dq65ds4qdezzer",
        "Env": "DEV"
    },
    "operator": 20404,
    "sender": "MSISDN",
    "binary": 1,
    "sent": "2021-03-18T08:29:36.758Z"
}
Actual result:
{
    "SequenceNumber": 19,
    "Offset": "10928",
    "EnqueuedTimeUtc": "4/1/2021 8:43:19 AM",
    "SystemProperties": {
        "x-opt-enqueued-time": {
            "long": 1617266599145
        }
    },
    "Properties": {
        "Postman-Token": {
            "string": "37ff4cc6-9124-45e5-ba9d-######e"
        }
    },
    "Body": {
        "bytes": "{\r\n \"id\": 1,\r\n \"receiver\": \"2222222222222\",\r\n \"message\": {\r\n \"Name\": \"testing\",\r\n \"PersonId\": 2,\r\n \"CarId\": 2,\r\n \"GUID\": \"1s3q1d-s546dq1-8e22e\",\r\n \"LineId\": 2,\r\n \"SvcId\": 2,\r\n \"Lat\": -64.546547,\r\n \"Lon\": -64.546547,\r\n \"TimeStamp\": \"2021-03-18T08:29:36.758Z\",\r\n \"Recorder\": \"dq65ds4qdezzer\",\r\n \"Env\": \"DEV\"\r\n },\r\n \"operator\": 20404,\r\n \"sender\": \"MSISDN\",\r\n \"binary\": 1,\r\n \"sent\": \"2021-03-29T08:29:36.758Z\"\r\n}"
    }
}
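One way to get back only the original payload (a sketch, not from the original post): read the Capture file with fastavro and decode just the Body field of each record, instead of serializing whole records with json_writer, which is what produces the Avro-JSON "bytes" wrapper shown above.
import json
from fastavro import reader

def capture_bodies(avro_path):
    # Each Capture record carries the original event payload in its Body field.
    bodies = []
    with open(avro_path, "rb") as avro_file:
        for record in reader(avro_file):
            body = record["Body"]
            if isinstance(body, bytes):
                body = body.decode("utf-8")
            bodies.append(json.loads(body))
    return bodies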

Apache Flink Stateful Function - Serialization problem?

I'm trying to build a project using an Apache Flink Stateful Function in Python, but I can't seem to get it to work. What I've narrowed the issue down to is that it seems when I send the request to my stateful function through my protobuf schema, the serializer is unable to serialize my message into the class I'm expecting. Here's what I'm trying to do:
import json
from statefun import StatefulFunctions, RequestReplyHandler
from jobs.session_event_pb2 import Event

functions = StatefulFunctions()

@functions.bind("namespace/funcname")
def funcname(context, session: Event):
    print("hello world")

handler = RequestReplyHandler(functions)

if __name__ == '__main__':
    inputFile = open("my_file.json", "r")
    for line in inputFile:
        data = json.loads(line).get('properties')
        if data is not None and data.get('prop1') is not None and data.get('prop2') is not None:
            request = Event()
            request.prop1 = data["prop1"]
            request.prop2 = data["prop2"]
            request = request.SerializeToString()
            handler(request)
Here's my Protobuf schema:
syntax = "proto3";
package mypackage;
message Event {
string prop1 = 1;
string prop2 = 2;
}
What am I doing wrong here?
That's because the RequestReply handler does not take direct protobuf messages. The Flink runtime sends a type called ToFunction and receives a response of type FromFunction. This payload contains your caller messages along with persisted values and other meta information.
If you want to invoke the functions directly, such as in a test, I would encourage you to do that and not use the handler at all.
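A minimal sketch of that direct-call approach, reusing the names from the question (the stub context is just a placeholder, not the real statefun context):
from jobs.session_event_pb2 import Event

class StubContext:
    # Stand-in for the statefun context object; enough for a "hello world" test.
    pass

event = Event()
event.prop1 = "value1"
event.prop2 = "value2"
funcname(StubContext(), event)  # call the bound function directly, no handler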

Running a cloud function with a pubsub push trigger

I've setup a Python script that will take certain bigquery tables from one dataset, clean them with a SQL query, and add the cleaned tables to a new dataset. This script works correctly. I want to set this up as a cloud function that triggers at midnight every day.
I've also used cloud scheduler to send a message to a pubsub topic at midnight every day. I've verified that this works correctly. I am new to pubsub but I followed the tutorial in the documentation and managed to setup a test cloud function that prints out hello world when it gets a push notification from pubsub.
However, my issue is that when I try to combine the two and automate my script, I get a log message saying that the execution crashed:
Function execution took 1119 ms, finished with status: 'crash'
To help you understand what I'm doing, here is the code in my main.py:
# Global libraries
import base64

# Local libraries
from scripts.one_minute_tables import helper

def one_minute_tables(event, context):
    # Log out the message that triggered the function
    print("""This Function was triggered by messageId {} published at {}
    """.format(context.event_id, context.timestamp))
    # Get the message from the event data
    name = base64.b64decode(event['data']).decode('utf-8')
    # If it's the message for the daily midnight schedule, execute function
    if name == 'midnight':
        helper.format_tables('raw_data','table1')
    else:
        pass
For the sake of convenience, this is a simplified version of my python script:
# Global libraries
from google.cloud import bigquery
import os

# Login to bigquery by providing credentials
credential_path = 'secret.json'
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = credential_path

def format_tables(dataset, list_of_tables):
    # Initialize the client
    client = bigquery.Client()
    # Loop through the list of tables
    for table in list_of_tables:
        # Create the query object
        script = f"""
        SELECT *
        FROM {dataset}.{table}
        """
        # Call the API
        query = client.query(script)
        # Wait for job to finish
        results = query.result()
        # Print
        print('Data cleaned and updated in table: {}.{}'.format(dataset, table))
My folder structure is shown in a screenshot (not reproduced here).
And my requirements.txt file has only one entry in it: google-cloud-bigquery==1.24.0
I'd appreciate your help in figuring out what I need to fix to run this script with the pubsub trigger without getting a log message that says the execution crashed.
EDIT: Based on the comments, this is the log of the function crash
{
    "textPayload": "Function execution took 1078 ms, finished with status: 'crash'",
    "insertId": "000000-689fdf20-aee2-4900-b5a1-91c34d7c1448",
    "resource": {
        "type": "cloud_function",
        "labels": {
            "function_name": "one_minute_tables",
            "region": "us-central1",
            "project_id": "PROJECT_ID"
        }
    },
    "timestamp": "2020-05-15T16:53:53.672758031Z",
    "severity": "DEBUG",
    "labels": {
        "execution_id": "x883cqs07f2w"
    },
    "logName": "projects/PROJECT_ID/logs/cloudfunctions.googleapis.com%2Fcloud-functions",
    "trace": "projects/PROJECT_ID/traces/f391b48a469cbbaeccad5d04b4a704a0",
    "receiveTimestamp": "2020-05-15T16:53:53.871051291Z"
}
The problem comes from the list_of_tables parameter. You call your function like this:
if name == 'midnight':
    helper.format_tables('raw_data','table1')
and then you iterate over the 'table1' parameter, which is a string, so the loop runs character by character instead of table by table.
Pass a list instead and it should work:
if name == 'midnight':
    helper.format_tables('raw_data', ['table1'])

Get Facebook Marketing API Ads insights results as CSV or JSON format

I am attempting to use the Facebook-Python-Ads-SDK to automate reporting on ad account performance. I have successfully requested a report at the ad set level, but the output of the report is a Cursor object, whereas I would prefer JSON or CSV. I have tried the "export_format" option in params, but it does not seem to make any difference. The output looks like JSON, so I attempted to load the object into a pandas dataframe using pd.read_json(result), but it raises an error saying that the object type "Cursor" needs to be str or bytes.
Does anyone have experience with this API who can help me out? My code is below.
def report_request(start_date, end_date):
    fields = [
        'date_start',
        'account_name',
        'adset_name',
        'ad_name',
        'impressions',
        'clicks',
        'spend'
    ]
    params = {
        'time_range': {
            'since': start_date,
            'until': end_date,
        },
        'level': 'ad',
        'export_format': 'csv'
    }
    account_id = "<ACCOUNT_ID>"
    adAccount = AdAccount('act_' + account_id)
    api_batch = get_api().new_batch()
    request = adAccount.get_insights(fields=fields, params=params, async=False, batch=api_batch)
    result = request.execute()
    return result
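Not from the original post, but one way to turn the returned Cursor into CSV or JSON is to materialize each AdsInsights row as a plain dict first (export_all_data() is assumed to be available in the facebook_business SDK version in use) and hand the list to pandas:
import pandas as pd

result = report_request(start_date, end_date)
# Iterating the Cursor yields AdsInsights objects; export_all_data() (assumed
# present in this SDK version) returns each row as a plain dict.
rows = [row.export_all_data() for row in result]
df = pd.DataFrame(rows)
df.to_csv("ad_insights.csv", index=False)
df.to_json("ad_insights.json", orient="records")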
