Spark results accessible through API - apache-spark

We really would want to get an input here about how the results from a Spark Query will be accessible to a web-application. Given Spark is a well used in the industry I would have thought that this part would have lots of answers/tutorials about it, but I didnt find anything.
Here are a few options that come to mind
Spark results are saved in another DB ( perhaps a traditional one) and a request for query returns the new table name for access through a paginated query. That seems doable, although a bit convoluted as we need to handle the completion of the query.
Spark results are pumped into a messaging queue from which a socket server like connection is made.
What confuses me is that other connectors to spark, like those for Tableau, using something like JDBC should have all the data (not the top 500 that we typically can get via Livy or other REST interfaces to Spark). How do those connectors get all the data through a single connection.
Can someone with expertise help in that sense?

The standard way I think would be to use Livy, as you mention. Since it's a REST API you wouldn't expect to get a JSON response containing the full result (could be gigabytes of data, after all).
Rather, you'd use pagination with ?from=500 and issue multiple requests to get the number of rows you need. A web application would only need to display or visualize a small part of the data at a time anyway.
But from what you mentioned in your comment to Raphael Roth, you didn't mean to call this API directly from the web app (with good reason). So you'll have an API layer that is called by the web app and which then invokes Spark. But in this case, you can still use Livy+pagination to achieve what you want, unless you specifically need to have the full result available. If you do need the full results generated on the backend, you could design the Spark queries so they materialize the result (ideally to cloud storage) and then all you need is to have your API layer access the storage where Spark writes the results.

Related

CosmosDB return data from external API on read

I am attempting to write an Azure CosmosDB integration (Core SQL Api) that integrates with an external service to provide some of the query data. As an example, I need a query made on Cosmos DB to convert some of the data returned (e.g. ID's) by the query into real data by calling an external service via a REST API. This should only happen when querying certain columns.
I initially investigated using a JS stored procedure and/or a UDF to make this external call, but the JS environment seems to be extremely limited and doesn't provide any way to make external calls. I then tried using this https://github.com/Oblarg/cosmosdb-storedprocs-ts repository, which uses webpack to bundle all of node.js into the stored procedure, allowing node modules to be used in stored procedures. Whilst this does allow some node modules to be used, whenever I try and use "https", "fetch", or "axios" modules to make an HTTP GET request I get errors (the same code works fine in a normal node environment, but I'm not a JS expert and can't seem to work past these errors). After a day of attempts it seems like the stored procedure approach is not possible.
Is this the case or is there some way of making HTTP GET requests from a JS stored procedure? If not possible with stored procedures, are there any other techniques to achieve the requirement of reading data from a remote API when querying cosmos DB?
Thanks
There is no way to achieve this from CosmosDB directly, for queries you also cannot use the change feed as the document dont change, so really your only option is to use a function or some preprocessor app to handle it, as you say its not ideal but there is no other solution here. If it was an insert or an update then change feed would allow you to do this but for plain queries its not possible.

Bringing incremental data in from REST APIs into SQL azure

My needs are following:
- Need to fetch data from a 3rd party API into SQL azure.
The API's will be queried everyday for incremental data and may require pagination as by default any API response will give only Top N records.
The API also needs an auth token to work, which is the first call before we start downloading data from endpoints.
Due to last two reasons, I've opted for Function App which will be triggered daily rather than data factory which can query web APIs.
Is there a better way to do this?
Also I am thinking of pushing all JSON into Blob store and then parsing data from the JSON into SQL Azure. Any recommendations?
How long does it take to call all of the pages? If it is under ten minutes, then my recommendation would be to build an Azure Function that queries the API and inserts the json data directly into a SQL database.
Azure Function
Azure functions are very cost effective. The first million execution are free. If it takes longer than ten, then have a look at durable functions. For handling pagination, we have plenty of examples. Your exact solution will depend on the API you are calling and the language you are using. Here is an example in C# using HttpClient. Here is one for Python using Requests. For both, the pattern is similar. Get the total number of pages from the API, set a variable to that value, and loop over the pages; Getting and saving your data in each iteration. If the API won't provide the max number of pages, then loop until you get an error. Protip: Make sure specify an upper bound for those loops. Also, if your API is flakey or has intermittent failures, consider using a graceful retry pattern such as exponential backoff.
Azure SQL Json Indexed Calculated Columns
You mentioned storing your data as json files into a storage container. Are you sure you need that? If so, then you could create an external table link between the storage container and the database. That has the advantage of not having the data take up any space in the database. However, if the json will fit in the database, I would highly recommend dropping that json right into the SQL database and leveraging indexed calculated columns to make querying the json extremely quick.
Using this pairing should provide incredible performance per penny value! Let us know what you end up using.
Maybe you can create a time task by SQL server Agent.
SQL server Agent--new job--Steps--new step:
In the Command, put in your Import JSON documents from Azure Blob Storage sql statemanets for example.
Schedules--new schedule:
Set Execution time.
But I think Azure function is better for you to do this.Azure Functions is a solution for easily running small pieces of code, or "functions," in the cloud. You can write just the code you need for the problem at hand, without worrying about a whole application or the infrastructure to run it. Functions can make development even more productive, and you can use your development language of choice, such as C#, F#, Node.js, Java, or PHP.
It is more intuitive and efficient.
Hope this helps.
If you could set the default top N values in your api, then you could use web activity in azure data factory to call your rest api to get the response data.Then configure the response data as input of copy activity(#activity('ActivityName').output) and the sql database as output. Please see this thread :Use output from Web Activity call as variable.
The web activity support authentication properties for your access token.
Also I am thinking of pushing all JSON into Blob store and then
parsing data from the JSON into SQL Azure. Any recommendations?
Well,if you could dump the data into blob storage,then azure stream analytics is the perfect choice for you.
You could run the daily job to select or parse the json data with asa sql ,then dump the data into sql database.Please see this official sample.
One thing to consider for scale would be to parallelize both the query and the processing. If there is no ordering requirement, or if processing all records would take longer than the 10 minute function timeout. Or if you want to do some tweaking/transformation of the data in-flight, or if you have different destinations for different types of data. Or if you want to be insulated from a failure - e.g., your function fails halfway through processing and you don't want to re-query the API. Or you get data a different way and want to start processing at a specific step in the process (rather than running from the entry point). All sorts of reasons.
I'll caveat here to say that the best degree of parallelism vs complexity is largely up to your comfort level and requirements. The example below is somewhat of an 'extreme' example of decomposing the process into discrete steps and using a function for each one; in some cases it may not make sense to split specific steps and combine them into a single one. Durable Functions also help make orchestration of this potentially easier.
A timer-driven function that queries the API to understand the depth of pages required, or queues up additional pages to a second function that actually makes the paged API call
That function then queries the API, and writes to a scratch area (like Blob) or drops each row into a queue to be written/processed (e.g., something like a storage queue, since they're cheap and fast, or a Service Bus queue if multiple parties are interested (e.g., pub/sub)
If writing to scratch blob, a blob-triggered function reads the blob and queues up individual writes to a queue (e.g., a storage queue, since a storage queue would be cheap and fast for something like this)
Another queue-triggered function actually handles writing the individual rows to the next system in line, SQL or whatever.
You'll get some parallelization out of that, plus the ability to start from any step in the process, with a correctly-formatted message. If your processors encounter bad data, things like poison queues/dead letter queues would help with exception cases, so instead of your entire process dying, you can manually remediate the bad data.

Querying ArangoDB without leaving page

I'm relatively new to webdevelopment and have been using ArangoDB for most of that limited experience. I have a basic understanding of Node.js and creating express based CRUD apps with ArangoDB as the database.
I'm getting to a point though where I'd like to have the ability to query the database from inside the client. Say I would like to have a datalist-type element where the user types words into a searchbar. I'd like the ability to query the database from there rather than having to query the database for all of its files prior to creating the datalist. I have not found a single mention though of using database queries from the client side. I can't imagine that this is not possible. Surely when I search wikipedia through the search bar and it provides me with options I didn't just receive the entire wikipedia documents list upon loading the page? Please steer me in the right direction, I don't know how to tackle this problem.
Have a look at how to build dynamic forms, this will allow you to perform AJAX style calls from the browser window to a back end REST API service. This will allow your back end web service to gather the data for the response (from ArangoDB if required), and respond with that data, most likely in a JSON format.
Your UI can then take that response and dynamically update components in your DOM so that the user can see the data injected into the page without a page reload action taking place.
https://www.pluralsight.com/search?q=ajax is a great place to start.
Alternatively you can have a look at free content like https://www.youtube.com/watch?v=tNKD0kfel6o

Spark Machine learning design model from web application

I have developed a web application where user can choose machine learning framework/ number of iterations/ some other tuning parameter. How can I invoke Spark job from user interface by passing all the inputs and display response to user. Depending on the framework (dl4j/ spark mllib/ H2o) user can either upload input csv or the data can be read from Cassandra.
How can I call SPARK job from user interface?
How can I display the result back to user?
Please help.
You can take a look at this github repository.
In this what is being done is as soon as a GET request is arrived, it takes out the data from the Cassandra and then Collect the data and throws it back as the response.
So in your case :
What you can do is , as soon as you recieve a POST request , you can get the parameters from the request and perform the operations accordingly using these parameters and the collect the Result on the master and then throw it back to the user as the Response.
P.S: Collecting on Master is a bit tricky and lot of data can cause OOM. What you can do is save the results on hadoop and send back the URL to the Results or something like that.
For more info look into this blog related to this github:
https://blog.knoldus.com/2016/10/12/cassandra-with-spark/

Real-Time Database Messaging

We've got an application in Django running against a PGSQL database. One of the functions we've grown to support is real-time messaging to our UI when data is updated in the backend DB.
So... for example we show the contents of a customer table in our UI, as records are added/removed/updated from the backend customer DB table we echo those updates to our UI in real-time via some redis/socket.io/node.js magic.
Currently we've rolled our own solution for this entire thing using overloaded save() methods on the Django table models. That actually works pretty well for our current functions but as tables continue to grow into GB's of data, it is starting to slow down on some larger tables as our engine digs through the current 'subscribed' UI's and messages out appropriately which updates are needed as which clients.
Curious what other options might exist here. I believe MongoDB and other no-sql type engines support some constructs like this out of the box but I'm not finding an exact hit when Googling for better solutions.
Currently we've rolled our own solution for this entire thing using
overloaded save() methods on the Django table models.
Instead of working on the app level you might want to work on the lower, database level.
Add a PostgreSQL trigger after row insertion, and use pg_notify to notify external apps of the change.
Then in NodeJS:
var PGPubsub = require('pg-pubsub');
var pubsubInstance = new PGPubsub('postgres://username#localhost/tablename');
pubsubInstance.addChannel('channelName', function (channelPayload) {
// Handle the notification and its payload
// If the payload was JSON it has already been parsed for you
});
See that and that.
And you will be able to to the same in Python https://pypi.python.org/pypi/pgpubsub/0.0.2.
Finally, you might want to use data-partitioning in PostgreSQL. Long story short, PostgreSQL has already everything you need :)

Resources