Apache Spark on a cloud Infrastructure - apache-spark

How can I process data efficiently using Apache Spark on a cloud Infrastructure as a Service (IaaS) platform? I have a dataset of over 60 million records to process.

There are several ways to do this.
On Azure, you can use Synapse or Azure Data Factory.
On Google Cloud, you can use a Dataproc cluster with Cloud Composer.
It would help if you described the whole scenario: what exactly the source is (CSV, RDBMS table, IoT) and what the target/sink would be. That would make it easier to give a specific answer.
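Since the question states neither source nor sink, the following is only a minimal sketch that assumes CSV input and Parquet output on storage the cluster can reach. The `partition_count` helper and its target of one million rows per partition are hypothetical starting points for tuning, not recommendations from the answer above.

```python
import math


def partition_count(total_rows: int, rows_per_partition: int = 1_000_000) -> int:
    """Hypothetical sizing helper: roughly one partition per rows_per_partition rows."""
    if total_rows < 0 or rows_per_partition <= 0:
        raise ValueError("total_rows must be >= 0 and rows_per_partition > 0")
    return max(1, math.ceil(total_rows / rows_per_partition))


def run_job(source_path: str, sink_path: str, total_rows: int) -> None:
    """Read CSV files, repartition, and write Parquet (paths are placeholders)."""
    # Imported lazily so the sizing helper can be used without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("batch-sketch").getOrCreate()
    df = spark.read.option("header", "true").csv(source_path)
    (
        df.repartition(partition_count(total_rows))
        .write.mode("overwrite")
        .parquet(sink_path)
    )
    spark.stop()
```

On a managed service such as Dataproc or Synapse, `run_job` would be submitted as a regular PySpark job; the right partition size depends on row width and executor memory, so treat the default as something to measure against.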

Related

Customizing nodes of an Azure Synapse Workspace Spark Cluster

When creating a Spark cluster within an Azure Synapse workspace, is there a means to install arbitrary files and directories onto its cluster nodes and/or onto the nodes' underlying distributed filesystem?
By arbitrary files and directories, I literally mean arbitrary files and directories; not just extra Python libraries like demonstrated here.
Databricks smartly provided a means to do this on its cluster nodes (described in this document). Now I'm trying to see if there's a means to do the same on an Azure Synapse Workspace Spark Cluster.
Thank you.
Unfortunately, Azure Synapse Analytics doesn't support arbitrary binary installs or writing to Spark local storage.
I would suggest providing feedback on this:
https://feedback.azure.com/forums/307516-azure-synapse-analytics
All of the feedback you share in these forums will be monitored and reviewed by the Microsoft engineering teams responsible for building Azure.

Local instance of Databricks for development

I am currently working on a small team that is developing a Databricks-based solution. For now we are small enough to work off of cloud instances of Databricks. As the group grows, this will not really be practical.
Is there a "local" install of Databricks that can be installed for development purposes (it doesn't need to be a scalable version but does need to be essentially fully featured)? In other words, is there a way each developer can create their own development instance of Databricks on their local machine?
Is there another way to provide a dedicated Databricks environment for each developer?
Databricks, as a cloud-deployed platform, leverages many cloud technologies in its deployment. For example, Auto Loader incrementally ingests new data files as they arrive, using EventBridge, SNS and S3 on AWS, and Event Hubs, Notification Hubs and ADLS on Azure. Databricks aims to create a seamless look and feel across AWS, Azure and GCP, but can do this only in the cloud.
For local deployment, you may be able to use Apache Spark and MLflow to create a similar experience, but the notebook experience isn't open source. The Databricks workflow is proprietary, though Databricks has open-sourced many of its technologies, such as Delta Lake. Local Spark and MLflow may suffice for some teams, who can then use the cloud sparingly, but the seamless workflow offered by Databricks is challenging to replicate outside of the leading cloud vendors.
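As a rough illustration of the local Spark plus MLflow option, the sketch below assumes `pyspark` and `mlflow` are installed on the developer's machine. The function names, the app name, and the `./mlruns` directory are hypothetical, and this does not reproduce Databricks notebooks or workflows.

```python
from pathlib import Path


def local_master(cores=None) -> str:
    """Build a Spark master URL for local mode: all cores, or a fixed count."""
    if cores is None:
        return "local[*]"
    if cores < 1:
        raise ValueError("cores must be >= 1")
    return f"local[{cores}]"


def start_local_session(app_name: str = "dev-sketch", cores=None):
    """Start a SparkSession that runs entirely on the local machine."""
    # Imported lazily so this module loads even where Spark is not installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder.master(local_master(cores))
        .appName(app_name)
        .getOrCreate()
    )


def log_run(params: dict, tracking_dir: str = "./mlruns") -> None:
    """Record parameters in a local, file-based MLflow tracking store."""
    import mlflow

    mlflow.set_tracking_uri(Path(tracking_dir).resolve().as_uri())
    with mlflow.start_run():
        mlflow.log_params(params)
```

Each developer could then iterate against `start_local_session()` on sample data and reserve the shared cloud workspace for integration runs.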

Batch processing with spark and azure

I work for an energy provider. Currently, we generate 1 GB of data per day in the form of flat files. We have decided to store our data in Azure Data Lake Store and want to run batch processing on it daily. What is the best way to transfer the flat files into Azure Data Lake Store? And once the data is in Azure, is it a good idea to process it with HDInsight Spark (for example the DataFrame API or Spark SQL) and finally visualize it with Azure?
For a daily load from a local file system, I would recommend Azure Data Factory Version 2. You have to install Integration Runtimes on-premises (more than one for high availability) and consider several security topics (local firewalls, network connectivity, etc.). Detailed documentation can be found here, and there are also some good tutorials available. With Azure Data Factory you can trigger your upload to Azure with a Get Metadata activity and use, e.g., an Azure Databricks Notebook activity for further Spark processing.
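The Spark step of such a pipeline might look roughly like the sketch below. It assumes the files land in an ADLS Gen2 account addressed with `abfss://` URIs, in a `raw/YYYY/MM/DD/` folder layout; the layout, the `curated` output prefix, and the `meter_id` and `kwh` columns are all hypothetical, since the question does not describe the file schema.

```python
from datetime import date


def daily_path(account: str, container: str, day: date, prefix: str = "raw") -> str:
    """Build the abfss:// folder URI for one day's files (assumed layout)."""
    return (
        f"abfss://{container}@{account}.dfs.core.windows.net/"
        f"{prefix}/{day:%Y/%m/%d}/"
    )


def process_day(account: str, container: str, day: date) -> None:
    """Aggregate one day's flat files with Spark SQL and write Parquet."""
    # Imported lazily so the path helper works without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("daily-batch-sketch").getOrCreate()
    readings = (
        spark.read.option("header", "true")
        .option("inferSchema", "true")
        .csv(daily_path(account, container, day))
    )
    readings.createOrReplaceTempView("readings")
    # meter_id and kwh are placeholder column names.
    totals = spark.sql(
        "SELECT meter_id, SUM(kwh) AS total_kwh FROM readings GROUP BY meter_id"
    )
    totals.write.mode("overwrite").parquet(
        daily_path(account, container, day, prefix="curated")
    )
```

A notebook or job triggered by the Data Factory pipeline could call `process_day` with the run date passed in as a parameter.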

Apache Beam using Spark Runner Deployment on Pivotal Cloud Foundry

I have a requirement to deploy an Apache Beam application using the Spark runner as the execution engine. Can I deploy a Spark application on a Pivotal Cloud Foundry environment? Please provide examples if available.
Thanks
Yes, Cloud Foundry can run Apache Spark applications. CF now has the ability to mount persistent volumes, manage container networking for the Spark cluster itself, and provide isolation segments for different kinds of compute nodes (e.g. to identify a sub-cluster with high-performance networking that might be better suited to Spark apps than to general-purpose apps).
You still need a backing store outside of CF for the data you want to feed into Spark or output from Spark. This could be HDFS, Cassandra, JDBC/SQL, NFS, HTTP/S3, etc.
Cloud Foundry is stateless, but it's quite capable of running workloads like Spring Cloud Data Flow today, which integrates with Apache Spark quite nicely, along with HBase, Hadoop, regular RDBMSs, Kafka/Redis/RabbitMQ, FTP servers, cloud services, and whatever else you need.
The following links may help:
How to leverage Pivotal Cloud Foundry, Pivotal HD, Apache Spark and EMC ECS to analyze Twitter data
Spark on Cloud Foundry

Accessing Raw Data for Hadoop

I am looking at the data.seattle.gov data sets and I'm wondering, in general, how all of this large raw data can be sent to Hadoop clusters. I am using Hadoop on Azure.
It looks like data.seattle.gov is a self-contained data service, not built on top of the public cloud.
They have their own RESTful API for data access.
Therefore, I think the simplest way is to download the data you are interested in to your Hadoop cluster, or to S3 and then use EMR or your own clusters on Amazon EC2.
If data.seattle.gov has relevant query capabilities, you can query the data on demand from your Hadoop cluster, passing data references as input. This will only work well if the queries do very serious data reduction; otherwise network bandwidth will limit performance.
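To make the "data reduction in the query" idea concrete, here is a minimal sketch that assumes the site exposes a Socrata-style (SODA) endpoint of the form `https://data.seattle.gov/resource/<dataset-id>.json` accepting `$select`, `$where`, `$limit` and `$offset` parameters. The dataset id and column names in the usage note are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed base URL for SODA-style resources on the portal.
BASE_URL = "https://data.seattle.gov/resource"


def build_query_url(dataset_id: str, select: str, where: str,
                    limit: int = 1000, offset: int = 0) -> str:
    """Build a paged query URL that filters and projects on the server side."""
    params = {
        "$select": select,
        "$where": where,
        "$limit": limit,
        "$offset": offset,
    }
    return f"{BASE_URL}/{dataset_id}.json?{urlencode(params)}"


def fetch_page(url: str) -> list:
    """Download one page of results and decode the JSON array of rows."""
    with urlopen(url) as response:
        return json.load(response)
```

For example, `build_query_url("abcd-1234", "category, count(*)", "year = 2012")` asks the service to return only the reduced result, so far less data crosses the network than with a full download; pages can be fetched by increasing `offset` until an empty list comes back.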
In Windows Azure, you can place your data sets (unstructured data, etc.) in Windows Azure Storage and then access them from the Hadoop cluster.
Check out the blog post "Apache Hadoop on Windows Azure: Connecting to Windows Azure Storage from Hadoop Cluster":
http://blogs.msdn.com/b/avkashchauhan/archive/2012/01/05/apache-hadoop-on-windows-azure-connecting-to-windows-azure-storage-your-hadoop-cluster.aspx
You can also get data from the Azure Marketplace, e.g. government data sets:
http://social.technet.microsoft.com/wiki/contents/articles/6857.how-to-import-data-to-hadoop-on-windows-azure-from-windows-azure-marketplace.aspx

Resources