Custom processing of multiple json event streams with spark/databricks - apache-spark

I have multiple (hundreds) of event streams, each persisted as multiple blobs in azure blob storage, each encoded as multi-line json, and I need to perform an analysis on these streams.
For the analysis I need to "replay" them, which basically is a giant reduce operation per stream using a big custom function, that is not commutative. Since other departments are using databricks, I thought I could parallelize the tasks with it.
My main question: Is spark/databricks a suitable tool for the job and if so, how would you approach it?
I am completely new to spark, but I am currently reading up on spark using the "Complete Guide" and the "Learning Spark 2.ed" and I have trouble to answer that question myself.
As far as I see, most of the dataset / Spark SQL is not suitable for this task? Can I just inject custom code in a spark application that is not using these APIs and how do I control how the tasks get distributed afterwards?
Can I read in all blob names, partition them by stream and then generate tasks that read in all blobs in a partition and just feed them into my function without spark trying to be clever in the background?

Related

is there any better approach to sync data from Bigquery to singlestore throgh pipelines?

I have data in the Bigquery table and wanted to sync it to singlestore table. I can see the singlestore pipeline documentation here https://docs.singlestore.com/db/v7.8/en/reference/sql-reference/pipelines-commands/create-pipeline.html. it has options to use GCS to load data from. it seems like it expects files from google cloud. I am new to singlestore, can somebody suggest a better approach. should I use pipelines or not? I have created a query stream from Bigquery and now want to insert data to singlestore DB in Nodejs. can we use write stream to singlestore? can we use the pipeline to insert records via the above stream from BQ?
The most efficient way to perform batch data movement from BigQuery to SingleStoreDB would be to perform exports of the data to GCS and use Pipelines to pull the data into SingleStoreDB. Pipelines are optimized for loading data into SingleStoreDB in parallel. If you export the data in Avro format, it will be even more efficient on both sides. It will likely be less complex and more efficient than trying to build the same workflow in Node.js.

Non-partitioned Stream Analytics Job output

In Azure I have an Event Hub with partition count 5 and a Stream Analytics Job which persists data from the hub to blob storage as is in json format. So now there are 5 files created to store incoming data.
Is it possible without changing hub partition to configure stream analytics job so it saves all the data to a single file?
For reference, what is taken into consideration for how to split output files is described here.
In your case, the condition that is met is:
If the query is fully partitioned, and a new file is created for each output partition
That's the trick here, if your query is passthrough (no shuffling around partitions) from event hub (partitioned) to storage account (matching incoming partitions via splitting files) then your job is always fully partitioned.
What you can do, if you don't care about performance, is to break the partition alignment. For that you can repartition your input or your query (via snapshot aggregation).
In my opinion though, you should look into using another tool (ADF, Power BI Dataflow) to process these downstream. You should see those files are landing files, optimized for query throughput. If you remove the partition alignment from your job, you severely limit its capability to scale and absorb spikes in incoming traffic.
After experimenting with partitioning suggested by this answer I found out that my goal can be achieved by changing Stream Analytics Job configuration.
There are different compatibility levels for stream analytics jobs and the latest one at the moment (1.2) introduced automatic parallel query execution for input sources with multiple partitions:
Previous levels: Azure Stream Analytics queries required the use of PARTITION BY clause to parallelize query processing across input source partitions.
1.2 level: If query logic can be parallelized across input source partitions, Azure Stream Analytics creates separate query instances and runs computations in parallel.
So when I changed compatibility level of the job to 1.1 it started to write all the output to a single file in a blob storage.

How to share data from Spark RDD between two applications

What is the best way to share spark RDD data between two spark jobs.
I have a case where job 1: Spark Sliding window Streaming App, will be consuming data at regular intervals and creating RDD. This we do not want to persist to storage.
Job 2: Query job that will access the same RDD created in job 1 and generate reports.
I have seen few queries where they were suggesting SPARK Job Server, but as it is a open source not sure if it a possible solution, but any pointers will be of great help.
thankyou !
The short answer is you can't share RDD's between jobs. The only way you can share data is to write that data to HDFS and then pull it within the other job. If speed is an issue and you want to maintain a constant stream of data you can use HBase which will allow for very fast access and processing from the second job.
To get a better idea you should look here:
Serializing RDD
You can share RDDs across different applications using Apache Ignite.
Apache ignite provides an abstraction to share the RDDs through which applications can access the RDDs corresponding to different applications. In addition Ignite has the support for SQL indexes, where as native Spark doesn't.
Please refer https://ignite.apache.org/features/igniterdd.html for more details.
According to the official document describes:
Note that none of the modes currently provide memory sharing across applications. If you would like to share data this way, we recommend running a single server application that can serve multiple requests by querying the same RDDs.
http://spark.apache.org/docs/latest/job-scheduling.html
You can save to a temporary view. Table will be available to other sessions until the one that creates it is closed

Google Dataflow vs Apache Spark

I am surveying Google Dataflow and Apache Spark to decide which one is more suitable solution for our bigdata analysis business needs.
I found there are Spark SQL and MLlib in the spark platform to do structured data query and machine learning.
I wonder is there any corresponding solution in the Google Dataflow platform?
It would help if you could expand a bit on your specific use case(s). What are you trying to accomplish in relation to "Bigdata analysis"? The short answer... it depends :-)
Here are some key architectural points to consider in relation to Google Cloud Dataflow v. Spark and Hadoop MR.
Resource Mgmt: Cloud Dataflow is a completely on demand execution environment. Specifically - when you execute a job in Dataflow the resources are allocated on demand for that job only. There is no sharing/contention of resources across jobs. In comparison to a Spark or MapReduce cluster you would typically deploy a cluster of X nodes and then submit jobs and then tune the node resources across jobs. Of course you can build up and tear down these clusters, but the Dataflow model is geared towards hands free dev ops in relation to resource management. If want to optimize resource usage to job demands Dataflow is a solid model to control cost and nearly forget about resource tuning. If you prefer a multi-tenant style cluster I'd suggest you look at Google Cloud Dataproc as it provides the on demand cluster management aspects like Dataflow, but focused on class Hadoop workloads like MR, Spark, Pig, ...
Interactivity: Currently Cloud Dataflow does not provide an interactive mode. Meaning once you submit a job the work resources are bound to the graph that was submitted AND the majority of the data is loaded into resources as needed. Spark can be a better model if you want to load data into the cluster via in memory RDD's and then dynamically execute queries. The challenge is that as your data sizes and query complexity increases you will have to handle the devOps. Now if most of your queries can be expressed in SQL syntax you may want to look at BigQuery. BigQuery provides the "on demand" aspects of Dataflow and enables you to interactively execute queries over massive amounts of data e.g petabytes. The biggest advantage in my opinion of BigQuery is that you do not have think/worry about hardware allocation to deal with your data sizes. Meaning as your data sizes grow you don't have to think about hardware (memory and disk size) configuration.
Programming Model: Dataflow's programming model is functionally biased vs. a classic MapReduce model. There are many similarities between Spark and Dataflow in terms of API primitives. Things to consider: 1) Dataflow's primary programming language is Java. There is a Python SDK in the works. The Dataflow Java SDK in open sourced and has been ported to Scala. Today, Spark has more SDK surface choice with GraphX, Streaming, Spark SQL, and ML. 2) Dataflow is a unified programming model for batch and streaming based DAG development. The goal was to remove the complexity and cost switching when moving between batch and streaming models. The same graph can seamlessly run in either mode. 3) Today, Cloud Dataflow does not support converging/iterative based graph execution. If you need the power of something like MLib then Spark is the way to go. Keep in mind this is the state of things today.
Streaming & Windowing: Dataflow (building on top of the unified programming model) was architected to be a highly reliable, durable, and scalable execution environment for streaming. One of the key differences between Dataflow and Spark is that Dataflow enables you to easily process data in terms of its true event time vs. solely processing it at it's arrival time into the graph. You can window data into fixed, sliding, session or custom windows based on event time or arrival time. Dataflow also provides Triggers (applied to Windows) that enable you to control how you want to handle late arriving data. Net-net you dial in the level of correctness control to meet the needs of your analysis. For example, lets say you have a mobile game that interacts with a 100 edge nodes. These nodes create 10000's events second related to game play. Let's say a group of nodes can't communicate with your back end streaming analysis system. In the case of Dataflow - once that data does arrive - you can control how you'd like to handle the data in relation to your query correctness needs. Dataflow also provides the ability to upgrade your streaming jobs while they are in flight. For example, let's say you discover a logical bug in a transform. You can upgrade your in flight job without losing your existing Windowed state. Net-net you can keep you business running.
Net-net:
- if you are really primarily doing ETL style work (filtering, shaping, joining, ...) or batch style MapReduce Dataflow is a great path if you want minimal devOps.
- if you need to implement ML style graphs, go the Spark path and give Dataproc a try
- if you are doing ML and you first need to do ETL to clean up your training data implement a hybrid with Dataflow and Dataproc
- if you need interactivity Spark is a solid choice, but so is BigQuery if you are/can express your queries in SQL
- if you need to process your ETL and or MR jobs over streams, Dataflow is a solid choice.
So... what are you scenarios?
I've tried both :
Dataflow is still very young, the is no "out-of-the-box" solution for doing ML with it (even though you could implement algorithms in transforms), you could output the processes data to cloud storage and read it later with another tool.
Spark would be recommended but you would have to manage your cluster yourself.
However there is a good alternative: Google Dataproc
You can develop analysis tools with spark and deploy them with one command on your cluster, dataproc will manage the cluster itself without having to tweak the configuration.
I have built code using spark,DataFlow .Let me put my thoughts.
Spark/DataProc: I have used spark (Pyspark) a lot for ETL. You can use SQL and any programming language of your choice. Lot of functions are available (Including Window functions). Build your dataframe and write your transformation and it can be super fast. Once data is cached , any operation on the Dataframe will quick.
You can simply build hive external table on the GCS. Then you can use Spark for ETL and Load data into Big Query. This is for Batch processing.
For streaming you can use spark Streaming and load data into Big query.
Now if you have cluster allready then you have think whether to move to Google cloud or not. I found Data proc (Google Cloud Hadoop/Spark) offering is better as you don't have to worry many cluster managements..
DataFlow : It's know as apache beam. Here you can write your code in Java/Python or any other language. You can execute the code in any framework (Spark/MR/Flink).This is a unified model. Here you can do both batch processing and Stream Data processing.
google now offers both programming models- mapreduce and spark.
Cloud DataFlow and Cloud DataProc they are respectively

Big Data Analytics using Redshift vs Spark, Oozie Workflow Scheduler with Redshift Analytics

We want to do Big Data Analytics on our data stored in Amazon Redshift (currently in Terabytes, but will grow with time).
Currently, it seems that all our Analytics can be done through Redshift queries (and hence, no distributed processing might be required at our end) but we are not sure if that will remain to be the case in future.
In order to build a generic system that should be able to cater our future needs as well, we are looking to use Apache Spark for data analytics.
I know that data can be read into Spark RDDs from HDFS, HBase and S3, but does it support data reading from Redshift directly?
If not, we can look to transfer our data to S3 and then read it in Spark RDDs.
My question is if we should carry out our Data Analytics through Redshift's queries directly or should we look to go with the approach above and do analytics through Apache Spark (Problem here is that Data Locality optimization might not be available)?
In case we do analytics through Redshift queries directly, can anyone please suggest a good Workflow Scheduler to write our Analytics jobs with. Our requirement is to be able to execute jobs as a DAG (Job2 should execute only if Job1 succeeds, etc) and be able to schedule our workflows through the proposed Workflow Engine.
Oozie seems like a good fit for our requirements but it turns out that Oozie cannot be used without Hadoop. Does it make sense to set up Hadoop on our machines and then use Oozie Workflow Scheduler to schedule our Data Analysis jobs through Redshift queries?
You cannot access data stored on Redshift nodes directly (each via Spark), only via SQL queries submitted the cluster as a whole.
My suggestion would be to use Redshift as long as possible and only take on the complexity of Spark/Hadoop when you absolutely need it.
If, in the future, you move to Hadoop then Cascading Lingual gives you the option of running your existing Redshift analytics more or less unchanged.
Regarding workflow, Oozie is not a good fit for Redshift. I would suggest you look at Azkaban (true DAG) or Luigi (uses a Python DSL).

Resources