I am currently trying to retrieve information from the EPA into our web app, which needs to use IBM Bluemix and Apache Spark. The information that we are gathering from the EPA is this:
https://aqs.epa.gov/api and ftp://ftp.cdc.noaa.gov/Datasets/ncep.reanalysis.dailyavgs/surface/
Not only are we gathering historical data, we also want to keep it up to date by inserting new data into the web app every hour. I have a few questions about this:
1) Do we need to set up HDFS to store all the data? Or could we just retrieve the data by its URL and store it in a dataframe? IBM Bluemix said it would provide 5 GB of storage, so how would one use that to store the historical data and the hourly updates?
2) If we are going to update the data every hour by inserting new data into the data storage / data frame, should we still use Spark Streaming? If yes, how would we use Spark Streaming for URL data? A lot of the resources I see online are only useful if one has HDFS or a formal database.
What we are doing currently is fetching the URLs with urllib2:
url = "https://aqs.epa.gov/api/rawData?user=sogun3#gmail.com&pw=baycrane57&format=JSON&param=44201&bdate=20110501&edate=20110501&state=37&county=063"
import urllib2
content = urllib2.urlopen(url).read()
print content
However, if we use this method, Spark needs to be running 24-7 to ensure that the most recent data is used. How does one configure Spark to run 24-7? Or is there a better method to process all the data and put it in a dataframe so that it can be accessed easily later?
Also, in a web app, can one still use iPython for data processing? Or is iPython just for interacting with the data and understanding the data experimentally?
Thanks a lot!
You have options ;-) If you need to read the source EPA data and process it before you use it in your web app, you can use the Spark service to ETL (Extract, Transform, Load) the source data from the EPA web site, wrangle it into the shape and size you want, and then save it into a storage service like Bluemix Object Storage. Your web app would then read the data, in the format you want, directly from Object Storage. However, if the source EPA data is largely in a format you can use in the web app as-is, then you can certainly create RDDs directly from the web site and pull in the data as and when you need it.
These datasets look small from my quick peek, so I don't think you need to worry about Spark pulling the data directly into memory for you to work on; i.e. there is no need to try to store it locally with Spark in the Bluemix service cluster. Besides, the Spark service does not currently provide HDFS, so, as mentioned earlier, you would use an external storage service.
re: "IBM bluemix said it would provide 5 GB of storage". That space is intended for storing your personal and 3rd-party Spark libraries and such.
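As a rough illustration of the "transform" step in that ETL path, here is a minimal Python 3 sketch. It assumes the response body is a JSON array of flat records; that layout and the field names in the sample are made up for illustration and are not the actual AQS schema, and `to_csv` is a hypothetical helper name.

```python
import csv
import io
import json


def to_csv(json_text, columns):
    """Flatten a JSON array of records into CSV text, keeping only `columns`."""
    records = json.loads(json_text)
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    for rec in records:
        # Missing fields become empty cells instead of raising an error.
        writer.writerow({c: rec.get(c, "") for c in columns})
    return buf.getvalue()


# Made-up sample payload; the real EPA response will have a different shape.
sample = json.dumps([
    {"date": "2011-05-01", "hour": 0, "value": 0.031, "unit": "ppm"},
    {"date": "2011-05-01", "hour": 1, "value": 0.029, "unit": "ppm"},
])

csv_text = to_csv(sample, ["date", "hour", "value"])
```

The resulting `csv_text` is what you would then upload to Object Storage, using whichever client your service credentials support.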
re: "spark needs to be running 24-7". The Spark service runs 24x7. Your Spark code running on the service will run for as long as you program it to run ;-)
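"For as long as you program it to run" could be as simple as a polling loop. The following is only a sketch in Python 3: `poll`, `fetch` and `handle` are hypothetical names, and the one-hour default interval comes from the hourly update mentioned in the question. The `sleep` and `max_runs` parameters exist so the loop can be exercised without actually waiting.

```python
import time


def poll(fetch, handle, interval_s=3600, max_runs=None, sleep=time.sleep):
    """Call fetch() every interval_s seconds and pass the result to handle().

    Runs forever when max_runs is None. An error in one cycle is reported
    and skipped so a single failed request does not stop the loop.
    """
    runs = 0
    while max_runs is None or runs < max_runs:
        try:
            handle(fetch())
        except Exception as exc:  # keep going on transient failures
            print("cycle failed:", exc)
        runs += 1
        # Do not sleep after the final cycle.
        if max_runs is None or runs < max_runs:
            sleep(interval_s)
    return runs
```

In a notebook you would pass a `fetch` that downloads the latest EPA data and a `handle` that processes and stores it, and leave `max_runs` unset.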
IPython (or Jupyter notebooks) is intended as a REPL for the web. So, yes, interactive. In your case, you can certainly write your Spark code in an IPython notebook and have it run for as long as necessary, pulling and processing the EPA data for the web app and storing it in, say, Object Storage. The web app can then pull the data it needs from Object Storage. It is said that APIs will be provided for the Spark service in the future, at which point your web app could talk directly to the Spark service; in the meantime, you can certainly make something work with notebooks.
A peer of mine has created code that opens a restful api web service within an interactive spark job. The intent of our company is to use his code as a means of extracting data from various datasources. He can get it to work on his machine with a local instance of spark. He insists that this is a good idea and it is my job as DevOps to implement it with Azure Databricks.
As I understand it interactive jobs are for one-time analytics inquiries and for the development of non-interactive jobs to be run solely as ETL/ELT work between data sources. There is of course the added problem of determining the endpoint for the service binding within the spark cluster.
But I'm new to spark and I have scarcely delved into the mountain of documentation that exists for all the implementations of spark. Is what he's trying to do a good idea? Is it even possible?
The web service would need to act as a Spark driver. It is just like running spark-shell, issuing some commands, and then using collect() to bring the data back to the local environment: all of that runs in a single JVM. The driver would launch executors on a remote Spark cluster, then bring the data back over the network. Apache Livy is one existing implementation of a REST Spark submission server.
It can be done, but depending on the process it would be very asynchronous, and it is not suggested for large datasets, which are what Spark is meant for. Depending on the data you need (e.g. if you rely heavily on SparkSQL), it would be better to query a database directly.
I have a question regarding a specific spark application usage.
I want our Spark application to run as a REST API server, like a Spring Boot application. It will therefore not be a batch process; instead, we will load the application, keep it alive (no call to spark.close()), and use it as a realtime query engine via some API that we will define. I am planning to deploy it to Databricks. Any suggestions will be good.
I have checked Apache Livy, but I am not sure whether it will be a good option or not.
Any suggestions will be helpful.
Spark isn't designed to run like this; it has no built-in REST API server framework other than the HistoryServer and Worker UI.
If you wanted a long-running Spark action, you could use Spark Streaming and issue actions to it via raw sockets, Kafka, etc. rather than HTTP methods.
Good question. Let's discuss it step by step.
You can create it and it works fine. The following is an example:
https://github.com/vaquarkhan/springboot-microservice-apache-spark
I am sure you must be thinking of creating a Dataset or DataFrame, keeping it in memory, and using it as a cache (like Redis, Gemfire, etc.), but here is the catch:
i) If you only have a few 100k of data, you don't really need the power of Apache Spark; a plain Java app is good enough to return responses really fast.
ii) If your data is in the petabyte range, loading it into memory as a Dataset or DataFrame will not help, because Apache Spark doesn't support indexing: Spark is not a data management system but a fast batch data processing engine. With Gemfire you have the flexibility to add an index for fast retrieval of data.
Workarounds:
Using Apache Ignite's (https://ignite.apache.org/) in-memory indexes (refer to Fast Apache Spark SQL Queries)
Using data formats that support indexing, like ORC, Parquet, etc.
So why not use a Spring application with Apache Spark without calling spark.close()?
A Spring application deployed as a microservice needs other services, either in containers or on PCF/Bluemix/AWS/Azure/GCP etc., whereas Apache Spark lives in its own world and needs compute power that is not available on PCF.
Spark is not a database, so it cannot "store data". It processes data and holds it temporarily in memory, but that is not persistent storage.
Once a Spark job is submitted, you have to wait for the results; you cannot fetch data in between.
How to use Spark with a Spring application through REST API calls:
Apache Livy is a service that enables easy interaction with a Spark cluster over a REST interface. It enables easy submission of Spark jobs or snippets of Spark code, synchronous or asynchronous result retrieval, as well as Spark Context management, all via a simple REST interface or an RPC client library.
https://livy.apache.org/
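To make the REST interaction concrete, here is a small Python 3 sketch that builds (but does not send) the two requests typically involved: creating an interactive session and submitting a code snippet to it. The host, the session id `0` and the helper name `livy_request` are assumptions for illustration; check the Livy documentation for the exact payload options your version supports.

```python
import json
import urllib.request


def livy_request(base_url, path, payload):
    """Build (but do not send) a JSON POST request for a Livy server."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


LIVY = "http://localhost:8998"  # assumption: Livy on its default port, local host

# 1) Start an interactive PySpark session.
create_session = livy_request(LIVY, "/sessions", {"kind": "pyspark"})

# 2) Run a snippet in session 0 (assumed id; the real one is returned by step 1).
run_code = livy_request(
    LIVY,
    "/sessions/0/statements",
    {"code": "sc.parallelize(range(10)).count()"},
)
```

Each request would be sent with `urllib.request.urlopen(...)`; the statement's result is then fetched by polling the statement URL until it is no longer running, which is where the asynchronous nature of this approach shows up.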
I have written multiple Spark driver programs that load some data from HDFS into data frames, run Spark SQL queries on it, and persist the results in HDFS again. Now I need to provide a long-running Java program that receives requests and their parameters (such as the number of top rows that should be returned) from a web application (e.g. a dashboard) via POST and GET, and sends the results back to the web application. My web application is somewhere outside the Spark cluster. Briefly, my goal is to send requests and their accompanying data from the web application, via something such as POST, to the long-running Java program. It then receives the request, runs the corresponding Spark driver (Spark app), and returns the results, for example in JSON format.
Is there any solution to develop this use case?
Is Livy a good choice? If your answer is positive, what should I do?
Take a look at the Spark JobServer. I think it has the ability to share RDDs between jobs, which can be a huge performance boost.
https://github.com/spark-jobserver/spark-jobserver
I have been looking for a way to monitor performance in Spark on Bluemix. I know in the Apache Spark project, they provide a metrics service based on the Coda Hale Metrics Library. This allows users to report Spark metrics to a variety of sinks including HTTP, JMX, and CSV files. Details here: http://spark.apache.org/docs/latest/monitoring.html
Does anyone know of any way to do this in the Bluemix Spark service? Ideally, I would like to save the metrics to a csv file in Object Storage.
Appreciate the help.
Thanks
Saul
Currently, I do not see an option for usage of "Coda Hale Metrics Library" and reporting the job history or accessing the information via REST API.
However, on the main page of the Spark history server, you can see the Event log directory. It refers to your following user directory: file:/gpfs/fs01/user/USER_ID/events/
There I saw JSON (like) formatted files.
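If those event files are the usual Spark event logs (one JSON object per line with an "Event" field naming the listener event, which is an assumption about what is in that directory), a small Python 3 sketch like the following could turn them into a CSV summary. The function names and the sample lines are made up for illustration.

```python
import collections
import csv
import io
import json


def summarize_events(lines):
    """Count Spark listener events by type from event-log lines."""
    counts = collections.Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except ValueError:
            continue  # skip lines that are not valid JSON
        if isinstance(event, dict):
            counts[event.get("Event", "unknown")] += 1
    return counts


def counts_to_csv(counts):
    """Render the counts as CSV text, sorted by event name."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["event", "count"])
    for name in sorted(counts):
        writer.writerow([name, counts[name]])
    return buf.getvalue()


# Made-up sample lines standing in for a real event file.
sample = [
    '{"Event": "SparkListenerJobStart", "Job ID": 0}',
    '{"Event": "SparkListenerTaskEnd", "Stage ID": 0}',
    '{"Event": "SparkListenerTaskEnd", "Stage ID": 0}',
    "not json",
]
report = counts_to_csv(summarize_events(sample))
```

With a real file you would pass the open file object instead of `sample`, then write `report` to Object Storage.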
I am working on a Hadoop project and generating lots of data in my local cluster. Sooner or later I will be using a cloud-based Hadoop solution, because my Hadoop cluster is very small compared to the real workload; however, I don't know yet which one I will be using, i.e. Windows Azure based, EMR or something else. I am generating lots of data locally and want to store it in some cloud-based storage, knowing that I will use this data with Hadoop later, but very soon.
I am looking for suggestions to decide which cloud store to choose, based on someone's experience. Thanks in advance.
First of all, it is a great question. Let's try to understand how data is processed in Hadoop:
In Hadoop, all the data is processed on the Hadoop cluster. This means that when you process any data, that data is copied from its source to HDFS, which is an essential component of Hadoop.
Only after the data is copied to HDFS do you run Map/Reduce jobs on it to get your results.
That means it does not matter what or where your data source is (Amazon S3, Azure Blob, SQL Azure, SQL Server, an on-premise source, etc.); you will have to move/transfer/copy your data from the source to HDFS, within the limits of Hadoop.
Once the data is processed in the Hadoop cluster, the result will be stored in the location you configured in your job. The output destination can be HDFS or an outside location accessible from the Hadoop cluster.
Once you have data copied to HDFS you can keep it on HDFS as long as you want, but you will have to pay the price of using the Hadoop cluster.
In some cases, when you run Hadoop jobs at intervals and the data move/copy can be done quickly, it is good to have a strategy to 1) acquire a Hadoop cluster 2) copy data 3) run the job 4) release the cluster.
So, based on the above details, when you choose a data source in the cloud for your Hadoop cluster you would have to consider the following:
If you have large data to process (which is normal with Hadoop clusters), consider the different data sources and the time it will take to copy/move data from each of them to HDFS, because this will be your first step.
You would need to choose the data source with the lowest network latency, so you can get data in and out as fast as possible.
You also need to consider how you will move a large amount of data from your current location to any cloud store. The best option would be a storage service to which you can ship your data on physical media (HDD, tape, etc.), because uploading multiple TB of data will take a great amount of time.
Amazon EMR (already available), Windows Azure (HadoopOnAzure in CTP) and Google (BigQuery in Preview, based on Google Dremel) provide pre-configured Hadoop clusters in the cloud, so you can first choose where you want to run your Hadoop job and then consider the cloud storage.
Even if you choose one cloud data store and later decide to move to another because you want to use a different Hadoop cluster in the cloud, you can certainly transfer the data; however, consider the time and the data transfer support available to you.
For example, with HadoopOnAzure you can connect to various data sources, e.g. Amazon S3, Azure Blob Storage, SQL Server and SQL Azure, so supporting a variety of data sources is a plus for any cloud Hadoop cluster.
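To put a number on "uploading multiple TB of data will take a great amount of time", here is a back-of-the-envelope Python 3 sketch. The function name, the decimal unit convention and the `efficiency` factor are assumptions for illustration; real throughput depends on the provider and the network path.

```python
def transfer_hours(size_tb, link_mbit_s, efficiency=1.0):
    """Estimate the hours needed to move size_tb terabytes over a link.

    Uses decimal units (1 TB = 10**12 bytes, 1 Mbit = 10**6 bits).
    efficiency in (0, 1] scales the nominal rate to account for overhead.
    """
    if link_mbit_s <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link speed must be positive and efficiency in (0, 1]")
    bits = size_tb * 10**12 * 8
    seconds = bits / (link_mbit_s * 10**6 * efficiency)
    return seconds / 3600


hours = transfer_hours(5, 100)  # 5 TB over a 100 Mbit/s uplink
```

At the nominal rate this comes to roughly 111 hours (about 4.6 days) for 5 TB over 100 Mbit/s, which is why shipping disks can be the faster option for large initial loads.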