Searching SQL data using Spark - search

I am creating a Spark job that searches for records (SQL rows) relevant to a keyword using a TF-IDF model. For testing, I currently spark-submit the job to get results. Ideally, however, I want to expose this job as a web service so that external users can search records through a REST API. This may generate many concurrent requests to run the job when multiple users search for their own keywords through the API.
I wonder if I should support this scenario with Spark Job Server so that users can submit jobs via an API, or whether you have any suggestions for this particular case based on your past experience. Thanks.

This would be an inappropriate use of Spark. Spark is for analytics jobs. Those take time (maybe less time than old-school MapReduce but time nonetheless), and REST clients demand immediate results.
You are on the right track though. As data comes in, you can use, for example, Spark Streaming and MLlib to process records according to your TF-IDF model and then store the indexed results in your SQL database. Then your REST clients will simply query your data, as with all the conventional web-with-SQL-backend applications our ancestors once built.
I suppose you could also look into giving admins the ability to start analytics jobs via a REST client too.
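A minimal sketch of that split, assuming a table named tfidf and a simple smoothed TF-IDF formula (both are illustrative choices, not something from the question). Plain Python stands in for the Spark batch step so the example runs anywhere; the point is that the search path is only an indexed SQL lookup.

```python
import math
import sqlite3
from collections import Counter


def build_index(conn, records):
    """Batch step: score every (term, record) pair once and persist it.

    In the design described above this work belongs to a Spark job;
    plain Python is used here only to keep the sketch self-contained.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tfidf (term TEXT, record_id INTEGER, score REAL)"
    )
    conn.execute("CREATE INDEX IF NOT EXISTS idx_tfidf_term ON tfidf (term)")

    tokenized = {rid: text.lower().split() for rid, text in records.items()}
    # Number of records each term appears in.
    doc_freq = Counter(t for tokens in tokenized.values() for t in set(tokens))
    n_docs = len(tokenized)

    rows = []
    for rid, tokens in tokenized.items():
        for term, count in Counter(tokens).items():
            tf = count / len(tokens)
            # Smoothed IDF (an assumption; any TF-IDF variant would do).
            idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1
            rows.append((term, rid, tf * idf))
    conn.executemany("INSERT INTO tfidf VALUES (?, ?, ?)", rows)
    conn.commit()


def search(conn, keyword, limit=10):
    """Serving step: what a REST handler would run per request."""
    cur = conn.execute(
        "SELECT record_id, score FROM tfidf "
        "WHERE term = ? ORDER BY score DESC LIMIT ?",
        (keyword.lower(), limit),
    )
    return cur.fetchall()


# Hypothetical sample rows, keyed by record id.
records = {
    1: "spark batch analytics",
    2: "spark spark streaming",
    3: "sql database index",
}
conn = sqlite3.connect(":memory:")
build_index(conn, records)
```

Calling search(conn, "spark") returns the matching record ids with their scores, highest first, without starting any Spark job at request time.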

Related

Spark job as a web service?

A peer of mine has written code that opens a RESTful API web service within an interactive Spark job. Our company intends to use his code as a means of extracting data from various data sources. He can get it to work on his machine with a local instance of Spark. He insists that this is a good idea, and it is my job as DevOps to implement it with Azure Databricks.
As I understand it, interactive jobs are for one-time analytics inquiries and for developing non-interactive jobs that run solely as ETL/ELT work between data sources. There is of course the added problem of determining the endpoint for the service binding within the Spark cluster.
But I'm new to Spark and I have scarcely delved into the mountain of documentation that exists for all the implementations of Spark. Is what he's trying to do a good idea? Is it even possible?
The web service would need to act as a Spark driver. Just as when you run spark-shell, issue some commands, and then call collect() to bring the data into the local environment, everything on the driver side runs in a single JVM. The driver would submit work to executors on a remote Spark cluster, then bring the data back over the network. Apache Livy is one existing implementation of a REST server for Spark submission.
It can be done, but depending on the process, it would be very asynchronous, and it is not suggested for large datasets, which are what Spark is meant for. Depending on the data that you need (e.g. if you rely heavily on Spark SQL), it'd be better to query a database directly.
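To illustrate what "very asynchronous" means for the client, here is a rough sketch of the submit-then-poll pattern such a service would probably need. JobRunner and top_rows are hypothetical names, and a thread pool stands in for the Spark driver; nothing here is a Spark or Livy API.

```python
import uuid
from concurrent.futures import ThreadPoolExecutor


class JobRunner:
    """Accept work immediately and let callers poll for the outcome."""

    def __init__(self, max_workers=2):
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._jobs = {}

    def submit(self, fn, *args):
        # Return an id right away instead of blocking the HTTP request.
        job_id = uuid.uuid4().hex
        self._jobs[job_id] = self._pool.submit(fn, *args)
        return job_id

    def status(self, job_id):
        future = self._jobs[job_id]
        if not future.done():
            return {"state": "running"}
        if future.exception() is not None:
            return {"state": "failed", "error": str(future.exception())}
        return {"state": "finished", "result": future.result()}

    def wait(self, job_id, timeout=None):
        # Convenience for demos and tests: block until the job settles.
        try:
            self._jobs[job_id].result(timeout=timeout)
        except Exception:
            pass  # failures and timeouts are reported through status()
        return self.status(job_id)


def top_rows(n):
    """Hypothetical stand-in for a slow Spark query."""
    return list(range(n))
```

A REST layer would map submit to a POST that returns the job id and status to a GET that the client polls, which is the same shape Livy exposes for real Spark jobs.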

Spark for Constraint Engine

I have just started learning Spark and have set up a small cluster on top of YARN and HDFS. I have submitted small jobs for testing as well. However, I want to know if we can use Spark as a real-time constraint engine. Let me give an example. I want to build a web app where a user provides a JSON object via an API. The user also provides some constraints (if some_value < 100, remove the row, etc.). Can I perform these operations concurrently using Spark and provide a quick response to the client (returned as the response of the API call)? One important requirement is that I have to respond in a very short time, less than a second.
I see that Spark is all about submitting jobs and processing them concurrently. Can the above requirements be fulfilled by Spark?

Spark Application as a Rest Service

I have a question regarding a specific spark application usage.
I want our Spark application to run as a REST API server, like a Spring Boot application. It will therefore not be a batch process; instead we will load the application and keep it live (no call to spark.close()), and use it as a real-time query engine via an API that we will define. I am targeting deployment to Databricks. Any suggestions will be good.
I have checked Apache Livy, but not sure whether it will be good option or not.
Any suggestions will be helpful.
Spark isn't designed to run like this; it has no built-in REST API server frameworks other than the HistoryServer and the Worker UI.
If you wanted a long-running Spark action, then you could use Spark Streaming and issue actions to it via raw sockets, Kafka, etc. rather than HTTP methods.
Good question. Let's discuss it step by step.
You can build it and it works fine; the following is an example:
https://github.com/vaquarkhan/springboot-microservice-apache-spark
I am sure you are thinking of creating a Dataset or DataFrame, keeping it in memory, and using it as a cache (like Redis, Gemfire, etc.), but here is the catch:
i) If your data is only a few hundred thousand records, you don't really need the power of Apache Spark; a plain Java app will return responses very fast.
ii) If you have petabytes of data, loading it into memory as a Dataset or DataFrame will not help, because Apache Spark doesn't support indexing: Spark is not a data management system but a fast batch data processing engine. With Gemfire, by contrast, you have the flexibility to add indexes for fast retrieval of data.
Workarounds:
Using Apache Ignite's (https://ignite.apache.org/) in-memory indexes (refer to "Fast Apache Spark SQL Queries")
Using data formats that support indexing, like ORC, Parquet, etc.
So why not use a Spring application with Apache Spark without calling spark.close()?
A Spring application run as a microservice needs other services, either in containers or on PCF/Bluemix/AWS/Azure/GCP etc., whereas Apache Spark lives in its own world and needs compute power that is not available on PCF.
Spark is not a database, so it cannot "store data". It processes data and holds it temporarily in memory, but that's not persistent storage.
Once a Spark job is submitted, you have to wait for the results; you cannot fetch data in between.
How to use Spark with a Spring application through REST API calls:
Apache Livy is a service that enables easy interaction with a Spark cluster over a REST interface. It enables easy submission of Spark jobs or snippets of Spark code, synchronous or asynchronous result retrieval, as well as Spark Context management, all via a simple REST interface or an RPC client library.
https://livy.apache.org/
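As a rough illustration of what "snippets of Spark code over REST" looks like, the sketch below builds the HTTP requests for Livy's interactive-session endpoints (POST /sessions, POST /sessions/{id}/statements, GET /sessions/{id}/statements/{id}). The host and port are assumptions (8998 is Livy's default), and the requests are only constructed, not sent, so the block runs without a Livy server.

```python
import json
from urllib.request import Request

# Assumption: a Livy server on its default port on the local machine.
LIVY_URL = "http://localhost:8998"


def _post(path, payload):
    """Build a JSON POST request for the given Livy path."""
    return Request(
        LIVY_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def create_session(kind="pyspark"):
    """Request a new interactive session (a long-lived Spark context)."""
    return _post("/sessions", {"kind": kind})


def run_statement(session_id, code):
    """Submit a code snippet to run inside an existing session."""
    return _post(f"/sessions/{session_id}/statements", {"code": code})


def get_statement(session_id, statement_id):
    """Build the polling request for a statement's state and output."""
    url = f"{LIVY_URL}/sessions/{session_id}/statements/{statement_id}"
    return Request(url, method="GET")
```

Against a running server, each request could be sent with urllib.request.urlopen, and a Spring (or any other) backend would poll the statement until Livy reports its result. The session keeps the Spark context alive between calls, which is the part of the original question that Livy addresses.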

running interactive sql queries over millions of parquet files

I have millions of streaming Parquet files being written. I want to support running ad hoc interactive queries for debugging and analytics purposes (with the added bonus of running streaming queries for real-time monitoring of key metrics as well).
What is a scalable solution for supporting this?
The two approaches I have seen are running Spark SQL interactively over millions of Parquet files (I am not too familiar with the Spark ecosystem, but does this mean running a Spark job for every SQL query a user submits, or do I need to run some streaming job and submit queries to it somehow?) and using a Presto SQL engine on top of Parquet (I am not exactly sure how Presto ingests new incoming Parquet files).
Any recommendations, or pros and cons of either approach? Are there any better solutions, considering I have more than ~10 TB of data produced every day?
Let me address your use cases:
Support running ad hoc interactive queries for debugging and analytics purpose
I would recommend building a Presto cluster if you care about minimizing the latency of your queries and are willing to invest in many machines with a large amount of memory.
Reason: Presto would run fully in-memory without touching disk (in most cases)
A Spark cluster can also do the job; however, it won't be as fast as Presto. The advantage of Spark over Presto is its fault tolerance and its ability to spill to disk in out-of-memory conditions, which may be important for you given how much data you have.
Run streaming queries for some real-time monitoring of key metrics as well
As long as you have basic queries, you can build dashboards on top of Presto which could run these queries every x minutes.
Having a considerable amount of processing may be a good reason to look at Spark streaming if real-time monitoring is important.
If it isn't then you could build an ETL (using Spark) for calculating your metrics, storing the data as a new hive table and then expose for querying via Presto/SparkSQL again.
How presto ingests new incoming parquet files?
I'm not aware of your architecture, but in any case, you need to provide Presto with a Hive connection (the Hive Metastore, to be precise).
Hive provides Presto with the schemas attached to the directories where you ingest your data. Presto dynamically sees the new data by default. Spark is no different, by the way.
Presto has nothing to do with data ingestion. It only starts its job once the data is there.

A Spark long running program as web server

I have written multiple Spark driver programs that load data from HDFS into DataFrames, run Spark SQL queries on them, and persist the results back to HDFS. Now I need to provide a long-running Java program that receives requests and their parameters (such as the number of top rows to return) from a web application (e.g. a dashboard) via POST and GET, and sends the results back to the web application. My web application is outside the Spark cluster. In short, my goal is to send requests and their accompanying data from the web application, via something such as POST, to the long-running Java program, which then receives the request, runs the corresponding Spark driver (Spark app), and returns the results, for example in JSON format.
Is there any solution to develop this use case?
Is Livy a good choice? If so, what should I do?
Take a look at Spark JobServer. I think it has the ability to share RDDs between jobs, which can be a huge performance boost.
https://github.com/spark-jobserver/spark-jobserver
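For the dashboard scenario in the question, a job submission to Spark JobServer might be built roughly as below. This is a sketch under assumptions: the port (8090), the app name, the class path, and the config key top.n are all hypothetical, and the request is only constructed, not sent. The appName, classPath, context, and sync query parameters follow the project's README; check them against the version you deploy.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumption: a JobServer instance on port 8090 of the local machine.
JOBSERVER_URL = "http://localhost:8090"


def submit_job(app_name, class_path, job_config, context=None, sync=False):
    """Build a POST /jobs request for an already uploaded application.

    job_config is sent as the request body (Typesafe Config text).
    Naming a long-lived context is what lets jobs reuse cached data.
    """
    params = {"appName": app_name, "classPath": class_path}
    if context:
        params["context"] = context
    if sync:
        params["sync"] = "true"  # wait for the result in the same call
    return Request(
        f"{JOBSERVER_URL}/jobs?{urlencode(params)}",
        data=job_config.encode("utf-8"),
        method="POST",
    )


# Hypothetical job: return the top 10 rows from a shared context.
request = submit_job(
    "dashboard", "com.example.TopRows", "top.n = 10",
    context="shared", sync=True,
)
```

The web application would send this with urllib.request.urlopen (or any HTTP client) and read the JSON response; without sync it would instead receive a job id to poll.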
