I have a requirement to deploy an Apache Beam application using Spark as the runtime engine. My question is whether I can deploy a Spark application in a Pivotal Cloud Foundry environment. Could you please provide examples if available?
Thanks
Yes, Cloud Foundry can run Apache Spark applications. CF now has the ability to mount persistent volumes, manage container networking for the Spark cluster itself, and provide isolation segments for different kinds of compute nodes (e.g. to identify a sub-cluster with high-performance networking that might be better suited to Spark apps than to general-purpose apps).
You still need a backing store outside of CF for the data you want to feed into Spark or output from Spark. This could be HDFS, Cassandra, JDBC/SQL, NFS, HTTP/S3, etc.
Cloud Foundry is stateless, but it's quite capable of running workloads like Spring Cloud Data Flow today, which integrates nicely with Apache Spark, along with HBase, Hadoop, regular RDBMSs, Kafka/Redis/RabbitMQ, FTP servers, cloud services, and so on: whatever you need, really.
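As a concrete illustration of running Beam on Spark with an external backing store, here is a minimal sketch using the Apache Beam Python SDK with the Spark runner. The Spark master URL and S3 bucket are hypothetical placeholders, and reading from S3 assumes the `apache_beam[aws]` extra is installed.

```python
# Minimal sketch: a Beam word-count pipeline submitted to a Spark cluster.
# The master URL and bucket names below are placeholders, not values from the question.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=SparkRunner",                           # run on Spark instead of the local DirectRunner
    "--spark_master_url=spark://spark-master:7077",   # hypothetical Spark master
])

with beam.Pipeline(options=options) as p:
    (p
     | "Read" >> beam.io.ReadFromText("s3://my-bucket/input/*.txt")   # external backing store
     | "Split" >> beam.FlatMap(lambda line: line.split())
     | "PairWithOne" >> beam.Map(lambda word: (word, 1))
     | "SumPerWord" >> beam.CombinePerKey(sum)
     | "Format" >> beam.Map(lambda kv: f"{kv[0]}: {kv[1]}")
     | "Write" >> beam.io.WriteToText("s3://my-bucket/output/wordcount"))
```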
Here are some links you can refer to:
How to leverage Pivotal Cloud Foundry, Pivotal HD, Apache Spark and EMC ECS to analyze Twitter data
Spark on Cloud Foundry
Related
How can I process data efficiently using Apache Spark on a cloud Infrastructure as a Service (IaaS) platform? I have a dataset of over 60 million records that I need to process effectively.
There are many options for this.
In Azure, you can use Synapse or Azure Data Factory.
In GCP, you can use a Dataproc cluster with Cloud Composer. It would be great if you could describe the whole scenario: what exactly is your source (CSV, RDBMS table, IoT) and what would be the target/sink? Then it would be easier to provide an answer.
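Whichever managed service you choose, the Spark job itself looks roughly the same. Below is a minimal PySpark sketch, assuming the source is a large CSV in cloud object storage; the bucket paths and column name are hypothetical placeholders.

```python
# Minimal PySpark sketch for aggregating ~60M rows from object storage.
# The same code runs unchanged on Dataproc, Synapse Spark pools, or Databricks.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sixty-million-rows").getOrCreate()

df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("gs://my-bucket/input/*.csv"))      # swap gs:// for abfss:// or s3:// as needed

# Repartition so the rows are spread evenly across executors,
# then do a simple aggregation and write the result back out.
result = (df.repartition(200)
            .groupBy("some_key_column")        # hypothetical column name
            .agg(F.count("*").alias("row_count")))

result.write.mode("overwrite").parquet("gs://my-bucket/output/aggregated")
```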
I am currently working on a small team that is developing a Databricks-based solution. For now, we are small enough to work off of cloud instances of Databricks. As the group grows, this will not really be practical.
Is there a "local" install of Databricks that can be installed for development purposes (it doesn't need to be a scalable version but does need to be essentially fully featured)? In other words, is there a way each developer can create their own development instance of Databricks on their local machine?
Is there another way to provide a dedicated Databricks environment for each developer?
Databricks, as a cloud-deployed platform, leverages many cloud technologies in its deployment. For example, Auto Loader incrementally ingests new data files as they arrive, using EventBridge, SNS and S3 on AWS, while on Azure it uses Event Hubs, Notification Hubs and ADLS. Databricks aims to create a seamless look and feel across AWS, Azure and GCP, but can do this only in the cloud.
For local deployment, you may be able to use Apache Spark and MLflow to create a similar experience, but the notebook experience isn't open source. The Databricks workflow is proprietary, though Databricks has open-sourced many of its technologies, like Delta Lake. A local Spark-plus-MLflow setup may suffice for some teams, who can then use the cloud sparingly, but the seamless workflow offered by Databricks is challenging to replicate outside of the leading cloud vendors.
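For the local-development route, here is a minimal sketch of what such a setup could look like, assuming only the open-source pieces (pyspark, delta-spark and mlflow installed via pip); the paths are hypothetical.

```python
# Minimal sketch of a "local Databricks-like" dev setup using open-source components.
import mlflow
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = (SparkSession.builder
           .appName("local-dev")
           .master("local[*]")
           .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
           .config("spark.sql.catalog.spark_catalog",
                   "org.apache.spark.sql.delta.catalog.DeltaCatalog"))
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Delta Lake table on the local filesystem instead of DBFS.
spark.range(1000).write.format("delta").mode("overwrite").save("/tmp/dev_table")

# MLflow tracking to a local file store instead of the Databricks workspace.
mlflow.set_tracking_uri("file:/tmp/mlruns")
with mlflow.start_run():
    mlflow.log_param("rows", 1000)
```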
I am using Apache Nutch to crawl sites for one of my projects. The data and content are crawled successfully. For indexing and search queries, I am using Elasticsearch clusters to process the data easily. I am really new to Elasticsearch clusters; locally, everything is working fine. But now I want to deploy the same local cluster on Azure services so I can access the data from another application that I am working on. I have seen some tutorials, but there is no option for deploying your local cluster on Azure.
Please guide me through this.
You cannot deploy your local cluster on Azure; you need to set up another Elasticsearch cluster on Azure nodes and point your writing application to that cluster.
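Once the new cluster is running on Azure, re-pointing the writing side is mostly a connection change. A minimal sketch with the Elasticsearch Python client (8.x), where the endpoint, credentials and index name are hypothetical placeholders:

```python
# Minimal sketch: point the indexing application at the Azure-hosted cluster
# instead of localhost. Endpoint, credentials and index name are placeholders.
from elasticsearch import Elasticsearch

es = Elasticsearch(
    "https://my-es-cluster.westeurope.cloudapp.azure.com:9200",  # hypothetical Azure endpoint
    basic_auth=("elastic", "change-me"),                         # placeholder credentials
    verify_certs=True,
)

# Index a crawled document into the remote cluster and check cluster health.
es.index(index="nutch-crawl", document={"url": "https://example.com", "content": "..."})
print(es.cluster.health())
```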
Has anyone tried using Azure Databricks as the Spark cluster for CDAP job processing? The CDAP documentation details how to add Azure HDInsight, but I'm wondering whether there is a way to configure CDAP to point to a Databricks Spark cluster. Is that even possible, or does this kind of integration need a specific Databricks client connector JAR? If anyone has any insights, that would be helpful.
There is no out-of-the-box support for Databricks Spark on Azure. That said, you can develop a new cloud runtime that is capable of submitting jobs to a Databricks Spark cluster. There are examples of how to write a runtime extension for Cloud Dataproc and EMR.
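Whatever form such a runtime extension takes, it ultimately has to submit work to Databricks. As a rough illustration (not a CDAP provisioner itself), here is a sketch of submitting a one-off Spark JAR run through the Databricks Jobs REST API; the workspace URL, token, JAR path and class name are all hypothetical placeholders.

```python
# Minimal sketch: submit a one-off Spark JAR run via the Databricks Jobs REST API.
# All identifiers below are placeholders; this only illustrates the submission step
# a CDAP runtime extension would need to wrap.
import requests

WORKSPACE = "https://adb-1234567890123456.7.azuredatabricks.net"  # hypothetical workspace URL
TOKEN = "dapiXXXXXXXXXXXXXXXX"                                     # placeholder personal access token

payload = {
    "run_name": "cdap-pipeline-run",
    "tasks": [
        {
            "task_key": "main",
            "new_cluster": {
                "spark_version": "13.3.x-scala2.12",
                "node_type_id": "Standard_DS3_v2",
                "num_workers": 2,
            },
            "libraries": [{"jar": "dbfs:/FileStore/jars/my-pipeline.jar"}],  # hypothetical JAR
            "spark_jar_task": {
                "main_class_name": "com.example.PipelineMain",               # hypothetical class
                "parameters": ["--date", "2024-01-01"],
            },
        }
    ],
}

resp = requests.post(
    f"{WORKSPACE}/api/2.1/jobs/runs/submit",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=payload,
    timeout=60,
)
resp.raise_for_status()
print("Submitted run_id:", resp.json()["run_id"])
```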
I am looking at the data.seattle.gov datasets, and I'm wondering in general how all of this large raw data can be sent to Hadoop clusters. I am using Hadoop on Azure.
It looks like data.seattle.gov is a self-contained data service, not built on top of a public cloud.
They have their own RESTful API for data access.
Therefore, I think the simplest way is to download the data you're interested in to your Hadoop cluster, or
to S3 and then use EMR or your own clusters on Amazon EC2.
If data.seattle.gov has relevant query capabilities, you could query the data on demand from your Hadoop cluster, passing data references as input. That would only work if you do very serious data reduction in those queries; otherwise, network bandwidth will limit performance.
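For the download route, here is a minimal sketch of pulling a dataset over the REST API and copying it into HDFS; the dataset id, local path and HDFS path are hypothetical placeholders, and the copy step assumes the standard `hdfs dfs -put` CLI is available on the cluster.

```python
# Minimal sketch: download one dataset from data.seattle.gov's REST API and
# copy it into HDFS. The dataset id and paths below are placeholders.
import subprocess
import requests

DATASET_URL = "https://data.seattle.gov/resource/xxxx-xxxx.csv"  # hypothetical dataset id
LOCAL_FILE = "/tmp/seattle_data.csv"

# Fetch the data as CSV (the $limit parameter caps the number of rows returned).
resp = requests.get(DATASET_URL, params={"$limit": 1_000_000}, timeout=300)
resp.raise_for_status()
with open(LOCAL_FILE, "wb") as out:
    out.write(resp.content)

# Push the downloaded file into the Hadoop cluster's filesystem.
subprocess.run(["hdfs", "dfs", "-put", "-f", LOCAL_FILE, "/data/seattle/"], check=True)
```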
In Windows Azure, you can place your datasets (unstructured data, etc.) in Windows Azure Storage and then access them from the Hadoop cluster.
Check out the blog post: Apache Hadoop on Windows Azure: Connecting to Windows Azure Storage from Hadoop Cluster:
http://blogs.msdn.com/b/avkashchauhan/archive/2012/01/05/apache-hadoop-on-windows-azure-connecting-to-windows-azure-storage-your-hadoop-cluster.aspx
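As a rough illustration of that approach, here is a minimal sketch using the azure-storage-blob Python SDK; the storage account, key and container names are hypothetical. Once uploaded, the blob can be referenced from the Hadoop cluster via a wasb:// path (or the older asv:// scheme used by the Hadoop-on-Azure preview described in the blog post above).

```python
# Minimal sketch: upload a local file to Azure Storage so a Hadoop cluster can read it.
# The connection string, container and blob names are placeholders.
from azure.storage.blob import BlobServiceClient

conn_str = "DefaultEndpointsProtocol=https;AccountName=myaccount;AccountKey=...;EndpointSuffix=core.windows.net"  # placeholder
service = BlobServiceClient.from_connection_string(conn_str)
container = service.get_container_client("datasets")

with open("/tmp/seattle_data.csv", "rb") as data:
    container.upload_blob(name="seattle/seattle_data.csv", data=data, overwrite=True)

# From the Hadoop side, the same file would then be addressable as e.g.:
#   wasb://datasets@myaccount.blob.core.windows.net/seattle/seattle_data.csv
```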
You can also get your data from the Azure Marketplace, e.g. government datasets:
http://social.technet.microsoft.com/wiki/contents/articles/6857.how-to-import-data-to-hadoop-on-windows-azure-from-windows-azure-marketplace.aspx