I have developed a web application where the user can choose a machine learning framework, the number of iterations, and some other tuning parameters. How can I invoke a Spark job from the user interface, passing all of these inputs, and display the response to the user? Depending on the framework (DL4J / Spark MLlib / H2O), the user can either upload an input CSV or the data can be read from Cassandra.
How can I call the Spark job from the user interface?
How can I display the result back to the user?
Please help.
You can take a look at this GitHub repository.
In it, as soon as a GET request arrives, the application reads the data from Cassandra, collects it, and sends it back as the response.
So in your case:
As soon as you receive a POST request, you can extract the parameters from it, perform the requested operations using those parameters, collect the result on the master, and then send it back to the user as the response.
P.S.: Collecting on the master is a bit tricky, and a lot of data can cause an OOM. What you can do instead is save the results to Hadoop and send back a URL to the results, or something along those lines.
For more info, look into this blog post related to that GitHub repository:
https://blog.knoldus.com/2016/10/12/cassandra-with-spark/
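As a rough TypeScript sketch of the POST-handler idea (not taken from the linked repository): an Express route that accepts the user's parameters, launches the Spark job via spark-submit, and returns a results URL instead of collecting everything into the HTTP response. The jar name, main class, flags and output path are all assumptions.

```
import express from "express";
import { spawn } from "child_process";
import { randomUUID } from "crypto";

const app = express();
app.use(express.json());

app.post("/jobs", (req, res) => {
  // Parameter names are assumptions; map them to whatever your UI actually sends.
  const { framework, iterations, inputPath } = req.body;
  const jobId = randomUUID();

  // Hypothetical spark-submit invocation; the jar, main class and flags depend
  // entirely on how your Spark job is packaged.
  const job = spawn("spark-submit", [
    "--class", "com.example.TrainingJob",
    "training-job.jar",
    "--framework", framework,
    "--iterations", String(iterations),
    "--input", inputPath ?? "cassandra",
    "--output", `hdfs:///results/${jobId}`,
  ]);

  job.on("close", (code) => console.log(`job ${jobId} exited with code ${code}`));

  // Don't hold the HTTP request open for a long-running job: hand back a URL
  // the UI can poll or download from once the job has written its results.
  res.status(202).json({ jobId, resultsUrl: `/results/${jobId}` });
});

app.listen(3000);
```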
Related
Let's say, hypothetically, I am working on a website which provides live score updates for sporting fixtures.
A script checks an external API for updates every few seconds. If there is a new update, the information is saved to a database, and then pushed out to the user.
When a new user accesses the website, a script queries the database and populates the page with all the information ingested so far.
I am using socket.io to push live updates. However, when someone is accessing the page for the first time, I have a couple of options:
I could use the existing socket.io infrastructure to populate the page
I could request the information when routing the user, pass it into res.render() as an argument and render the data using, for example, Pug.
In this circumstance, my instinct would be to utilise the existing socket.io infrastructure, purely because it would save me writing additional code. However, I am curious to know whether there are any other reasons for, or against, either approach. For example, would it be more performant to render the initial data using one approach or the other?
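For comparison, here is a minimal sketch of the two options, assuming Express, Socket.IO and Pug; getAllUpdates() and the "scores" template are placeholders:

```
import express from "express";
import http from "http";
import { Server } from "socket.io";

const app = express();
app.set("view engine", "pug");
const server = http.createServer(app);
const io = new Server(server);

// Placeholder for the query that returns everything ingested so far.
async function getAllUpdates(): Promise<object[]> {
  return [];
}

// Option 1: reuse the socket.io infrastructure and send the backlog on connect.
io.on("connection", async (socket) => {
  socket.emit("backlog", await getAllUpdates());
});

// Option 2: fetch the data while routing and let Pug render it server-side.
// "scores" is a hypothetical Pug template name.
app.get("/", async (_req, res) => {
  res.render("scores", { updates: await getAllUpdates() });
});

server.listen(3000);
```

In both cases the data comes from the same query; only the delivery path differs.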
I am creating a Node.js app that consumes the APIs of multiple servers in a sequential manner, as the next request depends on the results of previous requests.
For instance, user registration is done on our platform in a PostgreSQL database. User feeds, chats and posts are stored on getStream servers. User roles and permissions are managed through a CMS. If on a page we want to display a list of a user's followers with some buttons according to the user's permissions, then I first need to fetch the list of the current user's followers from getStream, then enrich them with data from my PostgreSQL DB, and then fetch their permissions from the CMS. Since each request has to wait for the previous one, it takes a long time to return a response.
I need to serve all that data in a certain format. I have used Promise.all() where the requests did not depend on each other.
I thought of storing pre-processed data that is ready to be served, but I am not sure how to do that. What is the best way to solve this problem?
sequential manner as the next request depends on results from previous requests
You could try using async/await so that each request runs in a sequential manner.
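A minimal sketch of that chaining for the follower example; the three helpers are hypothetical stand-ins for the getStream, PostgreSQL and CMS calls:

```
type Follower = { id: string; name?: string; permissions?: string[] };

// Hypothetical helpers; each one wraps the real getStream / PostgreSQL / CMS request.
async function fetchFollowersFromGetStream(userId: string): Promise<Follower[]> {
  return [];
}
async function enrichFromPostgres(followers: Follower[]): Promise<Follower[]> {
  return followers;
}
async function fetchPermissionsFromCms(followers: Follower[]): Promise<Follower[]> {
  return followers;
}

// Each await starts only after the previous request has resolved, so the
// dependency order is preserved.
async function getFollowersPage(userId: string): Promise<Follower[]> {
  const followers = await fetchFollowersFromGetStream(userId);
  const enriched = await enrichFromPostgres(followers);
  return fetchPermissionsFromCms(enriched);
}
```

Independent calls can still be grouped with Promise.all(), as you already do.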
I'm building a REST API with Node.js, with MongoDB as the database.
This API only has GET routes to retrieve data. The data is aggregated from other sources via cron jobs.
I'm not sure how to implement this correctly and what the best practices are.
Do I need to create POST/PUT routes to put data in the database?
Or do I just put data directly in the database?
Edit (more information):
Only the cron jobs would use the POST route.
The cron jobs get data from other REST APIs and some web scraping.
Is it a good idea to have my cron jobs in the same application as the API, or should I make another application to manage my cron jobs and populate my database?
I suggest creating an API endpoint that can be called with an accessKey for updating the data, because you would not want to write your MongoDB username and password in a shell file.
But if the cron job is a Kubernetes CronJob, or just a program that can access the DB in a secure way and is hosted in the same network, then you can go ahead and write from the cron job directly.
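A rough sketch of the accessKey idea, assuming Express, with the key read from an environment variable; the header name and route are made up:

```
import express from "express";

const app = express();
app.use(express.json());

// Only the cron job knows this key; keep it in the environment, not in a shell file.
const CRON_ACCESS_KEY = process.env.CRON_ACCESS_KEY;

app.post("/internal/refresh", (req, res) => {
  if (!CRON_ACCESS_KEY || req.header("x-access-key") !== CRON_ACCESS_KEY) {
    return res.sendStatus(401);
  }
  // ...validate req.body and upsert it into MongoDB here...
  res.sendStatus(204);
});

app.listen(3000);
```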
If you'd like to control the data flow entirely through the API at this point, creating a POST route would be the way to go. I am not sure how your GET routes are secured; if they are not secured at all, consider implementing, or at least hard-coding, some sort of security for routes that modify your data (OAuth2 or similar).
If you can access the database directly and that's what you want, you can just insert/update the data directly.
The second option would probably be quicker, but the first one leaves more room for expansion in the future and could be more useful overall.
So in the end, both options are valid; it's up to your preference and use case.
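For the second option, a minimal sketch of the cron job writing straight into MongoDB with the official driver; the connection string, database and collection names are placeholders:

```
import { MongoClient } from "mongodb";

// Called by the cron job after it has aggregated/scraped the data.
async function saveAggregatedData(docs: Record<string, unknown>[]): Promise<void> {
  const client = new MongoClient(process.env.MONGO_URL ?? "mongodb://localhost:27017");
  try {
    await client.connect();
    await client.db("aggregator").collection("items").insertMany(docs);
  } finally {
    await client.close();
  }
}
```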
I have a very large Node.js controller that runs through huge tables and checks data against other huge tables. This takes quite a while to run.
Today my page only shows the final result after 20 minutes.
Is there any way to stream the results continuously from the controller to the web page, like a real-time scrolling log of what's going on?
(Or is Socket.IO my only option?)
You can try node-cron: set up a dynamic query that fetches data with different limits, and append the data on the front-end side.
I am not sure whether this is the proper way to do it or not.
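Something like this rough sketch, where fetchBatch and pushToPage are placeholders (pushing to the page could use the Socket.IO you mentioned):

```
import cron from "node-cron";

const LIMIT = 1000;
let offset = 0;

// Placeholder: run the dynamic query with the current limit/offset.
async function fetchBatch(offset: number, limit: number): Promise<unknown[]> {
  return [];
}

// Placeholder: push a batch to the browser, e.g. io.emit("rows", rows) with Socket.IO.
function pushToPage(rows: unknown[]): void {}

// Every 10 seconds, fetch the next slice and append it on the front end.
const task = cron.schedule("*/10 * * * * *", async () => {
  const rows = await fetchBatch(offset, LIMIT);
  if (rows.length === 0) {
    task.stop(); // nothing left to send
    return;
  }
  pushToPage(rows);
  offset += rows.length;
});
```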
We would really like some input here on how the results of a Spark query can be made accessible to a web application. Given that Spark is widely used in the industry, I would have thought this would have lots of answers/tutorials, but I didn't find anything.
Here are a few options that come to mind:
Spark results are saved in another DB (perhaps a traditional one), and a query request returns the name of the new table, which is then accessed through a paginated query. That seems doable, although a bit convoluted, as we need to handle the completion of the query.
Spark results are pumped into a messaging queue, from which a socket-server-like connection is made.
What confuses me is that other connectors to Spark, like those for Tableau, using something like JDBC, should have access to all the data (not just the top 500 rows that we can typically get via Livy or other REST interfaces to Spark). How do those connectors get all the data through a single connection?
Can someone with expertise weigh in?
The standard way, I think, would be to use Livy, as you mention. Since it's a REST API, you wouldn't expect to get a JSON response containing the full result (it could be gigabytes of data, after all).
Rather, you'd use pagination with ?from=500 and issue multiple requests to get the number of rows you need. A web application would only need to display or visualize a small part of the data at a time anyway.
But from what you mentioned in your comment to Raphael Roth, you didn't mean to call this API directly from the web app (with good reason). So you'll have an API layer that is called by the web app and which then invokes Spark. But in this case, you can still use Livy+pagination to achieve what you want, unless you specifically need to have the full result available. If you do need the full results generated on the backend, you could design the Spark queries so they materialize the result (ideally to cloud storage) and then all you need is to have your API layer access the storage where Spark writes the results.
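To make that API layer concrete, here is a loose sketch of an endpoint the web app could call, which forwards a paginated request to Livy; the Livy URL, route, and the from/size parameters are assumptions rather than a definitive Livy contract (it also assumes Node 18+ for the built-in fetch):

```
import express from "express";

const app = express();
const LIVY_URL = process.env.LIVY_URL ?? "http://livy:8998";

// The web app calls this; only a slice of the result ever crosses the wire.
app.get("/api/results/:sessionId", async (req, res) => {
  const from = Number(req.query.from ?? 0);
  const size = Number(req.query.size ?? 100);

  const upstream = await fetch(
    `${LIVY_URL}/sessions/${req.params.sessionId}/statements?from=${from}&size=${size}`
  );
  res.status(upstream.status).json(await upstream.json());
});

app.listen(3000);
```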