Start DataProc job from Pub/Sub notification - apache-spark

I have a background service which produces files in Google Cloud Storage. Once it is done, it generates a file in the output folder.
In my flow I need to get the list of these files and start a DataProc Spark job with that list of files. The processing is not real-time and takes tens of minutes.
GCS has a notification system. It can stream notifications to the Pub/Sub service.
In GCS, a file .../feature/***/***.done will be created to identify the completion of the service job.
Can I subscribe to new files in GCS by wildcard?
Once the file is created, the notification gets to the Pub/Sub service.
I believe I can write a Cloud Function that reads this notification, somehow gets the location of the modified file, and lists all files from that folder. It can then publish another message to Pub/Sub with all the required information.
Is it possible to start a DataProc job from a Pub/Sub notification?
Ideally, it would be great to use Jobs instead of Streaming to reduce costs. This may mean that Pub/Sub initiates the job, instead of a streaming job pulling the new message from Pub/Sub.

Question 1: Can I subscribe to new files in GCS by wildcard?
You can set up GCS notifications to filter by path prefix. See the -p option here. Cloud Pub/Sub also has a filtering-by-attribute API in beta. You can use it to filter by the attributes set by GCS. The filtering language supports exact matches and prefix checks on those attributes.
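As an illustration, here is a minimal sketch using the Python storage client, the programmatic equivalent of the gsutil -p option. The bucket, topic, and prefix names are placeholders, not taken from the question.

# Sketch: create a GCS -> Pub/Sub notification limited to a path prefix.
# Bucket, topic, and prefix names below are illustrative placeholders.
from google.cloud import storage
from google.cloud.storage.notification import (
    JSON_API_V1_PAYLOAD_FORMAT,
    OBJECT_FINALIZE_EVENT_TYPE,
)

client = storage.Client()
bucket = client.bucket("my-feature-bucket")

notification = bucket.notification(
    topic_name="feature-done-files",            # Pub/Sub topic to publish to
    blob_name_prefix="feature/",                # only objects under this prefix
    event_types=[OBJECT_FINALIZE_EVENT_TYPE],   # only "object created" events
    payload_format=JSON_API_V1_PAYLOAD_FORMAT,
)
notification.create()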
The message published to the Cloud Pub/Sub topic will have attributes that give you the bucket and name of the object, so you should be able to easily read other files in that bucket/subpath.
Question 2: Is it possible to start a DataProc job by Pub/Sub notification?
Yes, you can set up a Cloud Function to subscribe from your Cloud Pub/Sub topic. The function can then start up a DataProc cluster using the DataProc client library, or do any other action.
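As a rough illustration only (project, region, cluster, and job script names are assumptions, not from the question), a Pub/Sub-triggered Cloud Function could react to the .done marker, list the sibling files, and submit a PySpark job:

# Sketch of a Pub/Sub-triggered Cloud Function (1st gen, Python runtime).
# Project, region, cluster, and script names are hypothetical.
from google.cloud import dataproc_v1, storage

PROJECT = "my-project"
REGION = "us-central1"
CLUSTER = "my-cluster"

def handle_notification(event, context):
    attrs = event.get("attributes", {})
    bucket_name = attrs["bucketId"]
    object_name = attrs["objectId"]

    # Only react to the completion marker produced by the background service.
    if not object_name.endswith(".done"):
        return

    # List every file sitting next to the .done marker.
    prefix = object_name.rsplit("/", 1)[0] + "/"
    blobs = storage.Client().list_blobs(bucket_name, prefix=prefix)
    input_uris = [f"gs://{bucket_name}/{b.name}" for b in blobs]

    # Submit a PySpark job to an existing DataProc cluster.
    job_client = dataproc_v1.JobControllerClient(
        client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
    )
    job = {
        "placement": {"cluster_name": CLUSTER},
        "pyspark_job": {
            "main_python_file_uri": f"gs://{PROJECT}-code/process_features.py",
            "args": input_uris,
        },
    }
    job_client.submit_job(request={"project_id": PROJECT, "region": REGION, "job": job})

The same client library also exposes ClusterControllerClient if you prefer to create a short-lived cluster per job and delete it afterwards to keep costs down.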

Related

How does the Snowpipe cloud messaging mechanism exactly work on Azure?

I've successfully integrated Snowpipe with a container inside Azure storage and loaded data into my target table, but now I can't exactly figure out how Snowpipe actually works. Also, please let me know if there is already a good resource that answers this question; I'd be very grateful.
In my example, I tested a Snowpipe mechanism that uses cloud messaging. So, from my understanding, when a file is uploaded into an Azure container, Azure Event Grid sends an event message to an Azure queue, from which Snowpipe is notified that a new file has been uploaded into the container. Then, Snowpipe in the background starts its loading process and imports the data into a target table.
If this is correct, I don't understand how the Azure queue informs Snowpipe about uploaded files. Is this connected to the "notification integration" inside Snowflake? Also, I don't understand what it means when they say on the Snowflake page that "Snowpipe copies the files into a queue, from which they are loaded into the target table...". Is this an Azure queue or some Snowflake queue?
I hope this question makes sense, any help or detailed explanation of the whole process is appreciated!
You've pretty much nailed it. To answer your specific questions... (and don't feel bad about them, this is definitely confusing)
how does Azure queue informs Snowpipe about uploaded files? Is this connected to the "notification integration" inside Snowflake?
Yes, this is the notification integration. But Azure is not "informing" Snowpipe; it's the other way around. The Azure queue creates a notification that various other applications can subscribe to (this has no awareness of Snowflake). The notification integration on the Snowflake side is Snowflake's way to integrate with these external notifications.
Snowpipe's queueing
Once Snowflake receives one of these notifications, it puts that notification into a Snowflake-side queue (or, according to that page, the file itself. I was surprised by this, but the end result is the same). Snowpipes are wired up to that notification integration (as part of the create statement). The files are directed to the appropriate Snowpipe based on the information in the "Stage" (also part of the pipe create statement. I'm actually not certain if this part is a push or a pull). Then it runs the COPY INTO on that file.

Using Pub/Sub for Google Cloud Storage with GKE

I have a GKE application that currently is driven by Notifications from a Google Cloud Storage bucket. I want to convert this node.js application to be triggered instead by PubSub notifications. I've been crawling through Google documentation pages most of the day, and do not have a clear answer. I see some python code that might do it, but it's not helping much.
The code as it is currently written is working: an image landing in my GCS bucket triggers a notification to my GKE pod(s), and my function runs. I'm trying to understand what I need to do inside my function to subscribe to a Pub/Sub topic to trigger the processing. Any and all suggestions welcome.
Firstly, thanks! I didn't know about the notification capability of GCS!
The principle is close, but you use Pub/Sub as an intermediary. Instead of notifying your application directly with a watchbucket command, you notify a Pub/Sub topic.
From there, the notifications arrive in the Pub/Sub topic, and you have to create a subscription. Two types are possible:
Push: you specify an HTTP URL that is called with a POST request, and the body contains the notification message.
Pull: your application needs to create a connection to the Pub/Sub subscription and read the messages.
Pros and cons
Push requires authentication from the Pub/Sub push subscription to your application. And if you use an internal IP, you can't use this solution (the URL endpoint must be publicly accessible). The main advantages are the scalability and the simplicity of the model.
Pull requires authentication of the subscriber (here, your application), and thus, even if your application is deployed privately, you can use a pull subscription. Pull is recommended for high throughput but requires more skill in processing and concurrency/multi-threading programming. You don't scale on request rate (as with the push model) but according to the number of messages that you read. And you need to acknowledge the messages manually (see the sketch below).
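For the pull side, here is a minimal synchronous-pull sketch with manual acknowledgement; the project and subscription names are placeholders, not from the question.

# Minimal synchronous pull with manual acknowledgement.
# Project and subscription names are placeholders.
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription = subscriber.subscription_path("my-project", "gcs-notifications-pull")

response = subscriber.pull(request={"subscription": subscription, "max_messages": 10})

ack_ids = []
for received in response.received_messages:
    msg = received.message
    print("New object:", msg.attributes.get("bucketId"), msg.attributes.get("objectId"))
    print("Payload:", msg.data.decode("utf-8"))   # JSON_API_V1 object resource
    ack_ids.append(received.ack_id)

if ack_ids:
    subscriber.acknowledge(request={"subscription": subscription, "ack_ids": ack_ids})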
The data model is described here. Your Pub/Sub message looks like this:
{
  "data": string,
  "attributes": {
    string: string,
    ...
  },
  "messageId": string,
  "publishTime": string,
  "orderingKey": string
}
The attributes are described in the documentation and the payload (base64 encoded, be careful) has this format. Very similar to what you get today.
So, why the attributes? Because you can use the filter feature of Pub/Sub to create a subscription that receives only a subset of the messages.
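For instance (names and filter are illustrative, not from the question), a subscription that only receives "object created" events could be created like this:

# Sketch: create a subscription that only receives OBJECT_FINALIZE notifications.
# Project, topic, and subscription names are placeholders.
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("my-project", "new-objects-only")

subscriber.create_subscription(
    request={
        "name": subscription_path,
        "topic": "projects/my-project/topics/gcs-notifications",
        "filter": 'attributes.eventType = "OBJECT_FINALIZE"',
    }
)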
You can also shift gears and use CloudEvents (based on Knative Eventing) if you use Cloud Run for Anthos in your GKE cluster. Here, the main advantage is the portability of the solution, because the messages are compliant with the CloudEvents format and not specific to GCP.

How to load test an Azure Event Hubs triggered consumer function with JMeter to find the consumer's processing speed to egress speed ratio?

I have an Event Hubs application that I haven't published to Azure yet, so it is running locally with the Azure emulator.
My consumer is an Azure Event Hubs triggered function running locally. So there are no multiple function instances; there is only one function instance backed by a single EventProcessorHost instance running locally, and there are 2 partitions in the Event Hub.
The consumer does not do any map/filter operation; it only saves the incoming message to Azure Blob storage.
I want to do load/stress testing, probably with JMeter, to find out and compare the consumer's processing time with the egress (the flow of data from the Event Hubs broker to the consumer). In this way I hope to be able to see whether my consumer is falling behind, so the ratio of the consumer's processing speed to the egress speed is the point of the load test.
I haven't done load testing with JMeter or any other tool before, but I have learned that you can use it to test things from simple APIs to JMS message brokers. API or web page load testing makes sense and looks straightforward, but I'm struggling to fully grasp how to test pub/sub programs. There are some tutorials like "Apache Kafka - How to Load Test with JMeter" that I have read, but for the .NET side and Event Hubs there are no descriptive tutorials that make sense.
Here are my questions:
Is it right to run the load test locally with the consumer function running in the Azure emulator and saving to Azure cloud Blob storage, or should I publish this function to the cloud and run the test there? Does it affect the results?
I couldn't find any JMeter tutorials specific to Event Hubs, so how can I do it with JMeter? Do I have to write a wrapper program?
If JMeter is not suitable for this task, how can I do load testing any other way?

GOOGLE STORAGE - how can I know if a new file was uploaded to my bucket

I'm working with Google Storage for the first time, with no experience with it.
My question is whether Google Storage has a feature where, if a new file is uploaded to storage, it will send a notification to clients that have subscribed,
or raise an event that can be watched from another place.
My current solution is to list all objects from the bucket and save the list to a temp file. After that, I list the objects again every minute; if a new file has been uploaded, it will not be in the temp file, so I can detect it. But I think this way is not so good if the feature above exists.
Many thanks.
Object Change Notification can be used to notify an application when an object is updated or added to a bucket.
Alternatively, you can use Cloud Pub/Sub Notifications for Cloud Storage, which are actually the recommended way to track changes to objects in your Cloud Storage buckets because they're faster, more flexible, easier to set up, and more cost-effective.
Ultimately, if you only want to trigger a lightweight, stand-alone function in response to events and don't want to manage a Cloud Pub/Sub topic, use Cloud Functions with Storage Triggers which can respond to change notifications emerging from Google Cloud Storage. These notifications can be configured to trigger in response to various events inside a bucket—object creation, deletion, archiving and metadata updates.
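For example, a minimal sketch of such a function (1st gen background function, Python runtime); the bucket it watches is chosen at deploy time with the google.storage.object.finalize trigger event, and the function name here is made up:

# Sketch of a Cloud Function with a Cloud Storage trigger, fired when a new
# object is created (finalized) in the bucket it is deployed against.
def on_new_object(event, context):
    bucket = event["bucket"]
    name = event["name"]
    print(f"New object gs://{bucket}/{name}, created at {event.get('timeCreated')}")
    # ...process the new file here...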

Google Cloud DataFlow for NRT data application

I'm evaluating Kafka/Spark/HDFS for developing an NRT (sub-second) Java application that receives data from an external gateway and publishes it to desktop/mobile clients (consumers) for various topics. At the same time the data will be fed through streaming and batching (persistent) pipelines for analytics and ML.
For example the flow would be...
A standalone TCP client reads streaming data from external TCP server
The client publishes data for different topics based on the packets (Kafka) and passes it to the streaming pipeline for analytics (Spark)
A desktop/mobile consumer app subscribes to various topics and receives NRT data events (Kafka)
The consumer also receives analytics from the streaming/batch pipelines (Spark)
Kafka clusters have to be managed, configured and monitored for optimal performance and scalability. This may require additional personnel and tooling to manage the operation.
Kafka, Spark and HDFS can optionally be deployed over Amazon EC2 (or Google Cloud using connectors).
I was reading about Google Cloud Dataflow, Cloud Storage, BigQuery and Pub/Sub. Dataflow provides auto-scaling and tools to monitor data pipelines in real time, which is extremely useful. But the setup has a few restrictions, e.g. Pub/Sub push requires the client to use an HTTPS endpoint, and the app deployment needs to use a web server, e.g. an App Engine web app or a web server on GCE.
This may not be as efficient (I'm concerned about latency when using HTTP) as deploying a bidirectional TCP/IP app that can leverage the Pub/Sub and Dataflow pipelines for streaming data.
Ideally, the preferable setup on Google Cloud would be to run the TCP client connecting to the external gateway deployed on GCE, which pushes data using Pub/Sub to the desktop consumer app. In addition, it would leverage the Dataflow pipeline for analytics and Cloud Storage with Spark for ML (the Prediction API is a bit restrictive), using the Cloudera Spark connector for Dataflow.
One could deploy Kafka/Spark/HDFS etc. on Google Cloud, but that kinda defeats the purpose of leveraging Google Cloud technology.
I'd appreciate any thoughts on whether the above setup is possible using Google Cloud, or whether I should stay with EC2/Kafka/Spark etc.
Speaking about the Cloud Pub/Sub side, there are a couple of things to keep in mind:
If you don't want to have a web server running in your subscribers, you could consider using the pull-based subscriber instead of the push-based one. To minimize latency, you want to have at least a few outstanding pull requests at any time (see the sketch after these points).
Having your desktop consumer app act as a subscriber to Pub/Sub directly will only work if you have no more than 10,000 clients; there is a limit of 10,000 subscriptions. If you need to scale beyond that, you should consider Google Cloud Messaging or Firebase.
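For example, a streaming-pull sketch that keeps several messages outstanding via flow control; the project and subscription names are placeholders, not from the question.

# Sketch: pull-based (streaming pull) subscriber with flow control, so several
# messages are outstanding at any time. Subscription name is a placeholder.
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription = subscriber.subscription_path("my-project", "gateway-events-sub")

def callback(message):
    print("Received", len(message.data), "bytes")   # hand the payload to the app here
    message.ack()

flow_control = pubsub_v1.types.FlowControl(max_messages=100)
future = subscriber.subscribe(subscription, callback=callback, flow_control=flow_control)

try:
    future.result()       # block the main thread while messages stream in
except KeyboardInterrupt:
    future.cancel()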
From the Dataflow side of things, this sounds like a good fit, particularly as you'll be mixing streaming and batch style analytics. If you haven't yet, check out our Mobile Gaming walkthrough.
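As a rough, illustrative sketch only (topic, table, and schema are made up, not from the question), a streaming Beam/Dataflow pipeline consuming the Pub/Sub feed might start like this:

# Sketch of a streaming Beam pipeline reading from Pub/Sub and writing to
# BigQuery. Topic, table, and schema are illustrative placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

options = PipelineOptions()   # pass --runner=DataflowRunner, --project, ... at launch
options.view_as(StandardOptions).streaming = True

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
            topic="projects/my-project/topics/gateway-events")
        | "ToRow" >> beam.Map(lambda payload: {"payload": payload.decode("utf-8")})
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            "my-project:analytics.gateway_events",
            schema="payload:STRING",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )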
I'm not quite sure what you mean about using Cloudera's Dataflow/Spark runner for ML. That runner runs Dataflow code on Spark, but not the other way around.
