In Azure I have an Event Hub with a partition count of 5 and a Stream Analytics job that persists data from the hub to blob storage as-is, in JSON format. As a result, 5 files are created to store the incoming data.
Is it possible, without changing the hub's partition count, to configure the Stream Analytics job so that it saves all the data to a single file?
For reference, the factors that determine how output files are split are described here.
In your case, the condition that is met is:
If the query is fully partitioned, and a new file is created for each output partition
That's the catch here: if your query is a passthrough (no shuffling across partitions) from the event hub (partitioned) to the storage account (matching the incoming partitions by splitting files), then your job is always fully partitioned.
What you can do, if you don't care about performance, is break the partition alignment. For that you can repartition your input or your query (e.g. via a snapshot aggregation).
In my opinion, though, you should look into using another tool (ADF, Power BI Dataflow) to process these files downstream. You should see them as landing files, optimized for throughput. If you remove the partition alignment from your job, you severely limit its ability to scale and absorb spikes in incoming traffic.
After experimenting with the partitioning suggested by this answer, I found out that my goal can be achieved by changing the Stream Analytics job's configuration.
There are different compatibility levels for Stream Analytics jobs, and the latest one at the moment (1.2) introduced automatic parallel query execution for input sources with multiple partitions:
Previous levels: Azure Stream Analytics queries required the use of PARTITION BY clause to parallelize query processing across input source partitions.
1.2 level: If query logic can be parallelized across input source partitions, Azure Stream Analytics creates separate query instances and runs computations in parallel.
So when I changed the job's compatibility level to 1.1, it started writing all the output to a single file in blob storage.
I have multiple (hundreds of) event streams, each persisted as multiple blobs in Azure blob storage, each encoded as multi-line JSON, and I need to perform an analysis on these streams.
For the analysis I need to "replay" them, which basically is a giant reduce operation per stream using a big custom function that is not commutative. Since other departments are using Databricks, I thought I could parallelize the tasks with it.
My main question: is Spark/Databricks a suitable tool for the job, and if so, how would you approach it?
I am completely new to Spark; I am currently reading up on it using the "Complete Guide" and "Learning Spark" (2nd ed.), but I have trouble answering that question myself.
As far as I can see, most of the Dataset / Spark SQL API is not suitable for this task. Can I just inject custom code into a Spark application that does not use these APIs, and how do I control how the tasks get distributed afterwards?
Can I read in all blob names, partition them by stream, and then generate tasks that read all the blobs in a partition and just feed them into my function, without Spark trying to be clever in the background?
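The approach described in the question can be sketched with the RDD API, which gives exactly that kind of control: key blob names by stream, group them, and feed each group to the custom reduce in a single task. Everything below is an illustrative assumption, not working code for a real storage layout: the "<stream_id>/<part>.json" naming, and the stream_id_of, read_blob, and replay helpers are hypothetical stand-ins for the real parsing and reduce logic.

```python
# Sketch: one task per stream; Spark only distributes the groups, the
# per-stream replay stays plain Python. Blob names are assumed to look
# like "<stream_id>/<part>.json".

def stream_id_of(blob_name):
    # Assumed convention: the stream id is the first path segment.
    return blob_name.split("/", 1)[0]

def replay(events):
    # Order-sensitive (non-commutative) fold; string concatenation
    # stands in for the real custom reduce function.
    state = ""
    for e in sorted(events, key=lambda e: e["ts"]):  # enforce event order
        state += e["action"]
    return state

def read_blob(name):
    # Placeholder: fetch and parse one multi-line JSON blob here.
    return [{"ts": 0, "action": name}]

def replay_all(sc, blob_names):
    # sc is a SparkContext; groupByKey puts all blobs of a stream into
    # one partition's task, so the custom function sees the whole stream.
    return (sc.parallelize(blob_names)
              .map(lambda n: (stream_id_of(n), n))
              .groupByKey()
              .mapValues(lambda names: replay(
                  [e for n in names for e in read_blob(n)])))
```

The point of the sketch is that nothing forces the per-stream logic through Spark SQL; Spark is only used for scheduling and data movement.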
We are ingesting data to an ADX Table using stream ingestion from an event hub source.
In order to plan for backup / disaster recovery, the documentation suggests configuring continuous export to recover from local outages and to provide a way to restore data to another cluster.
Reading through the documentation of continuous data export, I saw in the "Limitations" section that continuous export is not supported for tables configured for streaming ingestion.
Now I'm a bit stuck. What is the recommended way to back up those tables?
Support for continuous export defined on streaming-ingestion tables is still being worked on; it should be completed within 2021.
However, please note that for disaster recovery scenarios this is the highest-effort, lowest-resiliency option with the longest time to recover (RPO and RTO), so while it provides the lowest cost, you should be careful picking this option for DR.
If your data is coming from an Event Hub and you want to back it all up (effectively ingest it to two clusters), there is another option: create another consumer group on your Event Hub and set the secondary cluster to read from this additional consumer group.
What I have
I have a Spark Streaming application (fed from Kafka) on a Hadoop cluster that aggregates users' clicks and some actions done on a web site every 5 minutes and converts them into metrics.
I also have a table in GreenPlum (on its own cluster) with user data that may get updated. This table is filled using logical log streaming replication via Kafka. The table holds 100 million users.
What I want
I want to join the Spark streams with the static data from GreenPlum every 1 or 5 minutes and then aggregate the data using, e.g., user age from the static table.
Notes
Definitely, I don't need to read all records from the users table. There is a rather stable core segment, plus a number of new users registering each minute.
Currently I use PySpark 2.1.0
My solutions
1. Copy data from the GreenPlum cluster to the Hadoop cluster and save it as ORC/Parquet files. Every 5 minutes, add new files for new users. Once a day, reload all files.
2. Create a new DB on Hadoop and set up log replication via Kafka, as is done for GreenPlum. Read data from that DB and use the built-in Spark Streaming joins.
3. Read data from GreenPlum into a Spark cache. Join the stream data with the cache.
4. Every 5 minutes, save/append new user data to a file and ignore old user data. Store an extra column, e.g. last_action, to truncate this file if a user wasn't active on the web site during the last 2 weeks. Then join this file with the stream.
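For illustration, solution 3 (combined with solution 4's staleness rule) might look roughly like this in PySpark. The JDBC URL, credentials, and column names are assumptions made for the sketch:

```python
# Sketch of solution 3: pull the users table into a cached DataFrame,
# reload it periodically, and join each 5-minute micro-batch against it.
from datetime import timedelta

STALE_AFTER = timedelta(weeks=2)

def is_stale(last_action, now):
    # Solution 4's truncation rule: drop users inactive for 2 weeks.
    return now - last_action > STALE_AFTER

def load_users(spark):
    # JDBC URL, credentials and columns are placeholders.
    return (spark.read.format("jdbc")
            .option("url", "jdbc:postgresql://gp-master:5432/dwh")
            .option("dbtable", "public.users")
            .option("user", "spark_reader")
            .option("password", "secret")
            .load()
            .select("user_id", "age", "last_action")
            .cache())

def enrich(clicks, users):
    # Join the micro-batch with the cached static table, then aggregate
    # by an attribute taken from it (user age).
    return clicks.join(users, "user_id").groupBy("age").count()
```

To refresh the static side, you would unpersist and rebuild the cached DataFrame on a timer (every 5 minutes or once a day, per solution 1's schedule).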
Questions
Which of these solutions is more suitable for an MVP? For production?
Are there any better solutions/best practices for this sort of problem? Any literature?
Spark Streaming reading data from a cache like Apache Geode makes this better. I used this approach in a real-time fraud use case. In a nutshell, I had features generated on Greenplum Database using historical data. The feature data and some decision-making lookup data were pushed into Geode. The features were periodically refreshed (at a 10-minute interval) and then refreshed in Geode. A Spark streaming scoring job constantly scored the transactions as they came in, without reading from Greenplum. The Spark streaming job also put the score into Geode, which was synced back to Greenplum on a separate thread. I had Spark streaming running on Cloud Foundry using Kubernetes. This is very high level, but it should give you an idea.
You might want to check out the GPDB Spark Connector:
http://greenplum-spark-connector.readthedocs.io/en/latest/
https://greenplum-spark.docs.pivotal.io/130/index.html
You can load data directly from the segments into Spark.
Currently, if you want to write back to GPDB, you need to use standard JDBC to the master.
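As a hedged sketch, a connector read plus a JDBC write-back might look like the following. The format name and option keys are modeled on the connector docs linked above, so check them against the version you install; all hosts, credentials, and table names here are placeholders:

```python
# Sketch: parallel read from GreenPlum segments via the connector,
# write-back through plain JDBC to the master.
GREENPLUM_FORMAT = "greenplum"  # short name per the connector docs

def gp_read_options(host, db, schema, table, user, password):
    # Option keys follow the connector documentation; values are examples.
    return {
        "url": "jdbc:postgresql://{}/{}".format(host, db),
        "dbschema": schema,
        "dbtable": table,
        "user": user,
        "password": password,
        "partitionColumn": "user_id",  # enables parallel segment reads
    }

def read_users(spark):
    return (spark.read.format(GREENPLUM_FORMAT)
            .options(**gp_read_options("gp-master", "dwh", "public",
                                       "users", "spark_reader", "secret"))
            .load())

def write_scores(df):
    # Writes go through standard JDBC to the master, as noted above.
    (df.write.format("jdbc")
       .option("url", "jdbc:postgresql://gp-master:5432/dwh")
       .option("dbtable", "public.scores")
       .option("user", "spark_writer")
       .option("password", "secret")
       .mode("append")
       .save())
```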
I have a Spark 2.2 Structured Streaming flow from an on-premise system into a containerized cloud Spark cluster, where Kafka receives the data and SSS maintains a number of queries that flush to disk every ten seconds. A query's console sink is not accessible to sessions outside the streaming context (hence the CSV flush); the monitoring dashboard runs Spark SQL from another context to get metrics.
Right now I am only aggregating the data that has come in since streaming was last started. Now I need to aggregate all data ever received together with the incoming stream to provide (near-)realtime views. This will mean running a bunch of GROUP BYs over billions of records, maintaining several million aggregate rows in memory.
My question is about how Spark streaming queries can scale like this: how efficient is the memory usage (I'll probably use 32 worker containers), and is this the correct way to maintain a (near-)realtime view of incoming data using Kafka and SSS?
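For a sense of what such an always-growing streaming GROUP BY looks like, here is a minimal sketch. Broker, topic, and column names are assumptions, and the memory sink shown is only suitable when the result fits on the driver; for millions of aggregate rows you would instead write each trigger out to files. The per-key state itself is partitioned across executors, which is what lets more worker containers spread the memory load.

```python
# update_counts mirrors what the streaming GROUP BY's state store does:
# only keys seen in a batch change, as in "update" output mode.
def update_counts(state, batch):
    for key in batch:
        state[key] = state.get(key, 0) + 1
    return state

def build_query(spark):
    # Assumed broker/topic/column names; state (one row per group) is
    # held in executor memory, checkpointed for recovery.
    events = (spark.readStream.format("kafka")
              .option("kafka.bootstrap.servers", "kafka:9092")
              .option("subscribe", "events")
              .load()
              .selectExpr("CAST(key AS STRING) AS user_id"))
    totals = events.groupBy("user_id").count()
    return (totals.writeStream
            .outputMode("complete")      # full view each trigger
            .format("memory")            # queryable as table realtime_view
            .queryName("realtime_view")
            .option("checkpointLocation", "/chk/realtime_view")
            .trigger(processingTime="10 seconds"))
```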
I am developing a data pipeline that will consume data from Kafka, process it via Spark Streaming, and ingest it into Cassandra.
The pipeline that I put into production will definitely evolve after several months. How do I move from the old to the new data pipeline while maintaining continuous delivery and avoiding any data loss?
Thank you
The exact solution will depend on the specific requirements of your application. In general, Kafka will serve as your buffer. Messages going into Kafka are preserved for the topic's retention period.
In Spark Streaming, you need to track the consumed offsets, either automatically through checkpoints or manually (we do the latter, as that provides more recovery options).
Then you can stop, deploy a new version, and restart your pipeline from where you previously left off. In this model, messages are processed with at-least-once semantics and zero data loss.
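As a concrete sketch of the manual-offset approach: the Kafka source for Structured Streaming lets you hand the stored offsets back via its startingOffsets option (a JSON string of topic/partition/offset) when the new version starts. The broker address, topic name, and the idea of persisting offsets in Cassandra are assumptions for illustration:

```python
# Resume a redeployed pipeline from externally stored Kafka offsets.
import json

def starting_offsets(topic, committed):
    # committed: {partition -> next offset to read}, persisted by the
    # previous pipeline version (e.g. in Cassandra) after each batch.
    return json.dumps({topic: {str(p): o for p, o in committed.items()}})

def resume_stream(spark, topic, committed):
    # The new deployment picks up exactly where the old one stopped.
    return (spark.readStream.format("kafka")
            .option("kafka.bootstrap.servers", "kafka:9092")
            .option("subscribe", topic)
            .option("startingOffsets", starting_offsets(topic, committed))
            .load())
```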