I am using Presto as a querying layer over Cassandra for various aggregations, but I am facing an issue where, if a node goes down or a timeout occurs for some reason, the running query fails. I need to have some kind of fallback mechanism.
Is there any alternative to Presto with which I can implement a fallback mechanism, or is there any way to implement it in Presto itself?
Some vendors like Qubole have implemented a retry mechanism specifically for these kinds of issues (which are more visible in the cloud, especially if you use spot nodes on AWS or preemptible VMs on GCP).
Note: This works only with Qubole's Managed Presto Service.
Disclaimer: I work for Qubole
A peer of mine has created code that opens a RESTful API web service within an interactive Spark job. The intent of our company is to use his code as a means of extracting data from various data sources. He insists that this is a good idea, and it is my job as DevOps to implement it with Azure Databricks.
As I understand it, interactive jobs are for one-time analytics inquiries and for the development of non-interactive jobs to be run solely as ETL/ELT work between data sources. There is, of course, the added problem of determining the endpoint for the service binding within the Spark cluster.
But I'm new to Spark, and I have scarcely delved into the mountain of documentation that exists for all the implementations of Spark. Is what he's trying to do a good idea? Is it even possible?
The web service would need to act as the Spark driver. Just like when you run spark-shell, run some commands, and then use collect() to bring all the data back into the local environment, everything runs in a single JVM (the driver). The service would request executors from a remote Spark cluster and then bring the data back over the network. Apache Livy is one existing implementation of a REST Spark submission server.
It can be done, but depending on the process it would be very asynchronous, and it is not suggested for the large datasets that Spark is meant for. Depending on the data that you need (e.g. if you are mostly using Spark SQL), it might be better to query a database directly.
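To make the "web service as driver" idea concrete, here is a minimal, hedged Java sketch (not a recommendation): an HTTP handler running in the same JVM as a SparkSession, so that JVM is the Spark driver. The master URL, port, and endpoint name are placeholders, and the example assumes a reachable standalone cluster.

```java
import com.sun.net.httpserver.HttpServer;
import org.apache.spark.sql.SparkSession;

import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

// Sketch only: an HTTP endpoint whose handler acts as the Spark driver.
public class SparkDriverService {
    public static void main(String[] args) throws Exception {
        // The SparkSession lives in this JVM, so this process *is* the driver.
        SparkSession spark = SparkSession.builder()
                .appName("rest-driver-sketch")
                .master("spark://spark-master:7077") // placeholder standalone master URL
                .getOrCreate();

        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/count", exchange -> {
            // Work is distributed to the executors; the result is collected back here
            // and returned over HTTP, which is exactly the data path described above.
            long count = spark.range(0, 1_000_000).count();
            byte[] body = ("count=" + count).getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, body.length);
            exchange.getResponseBody().write(body);
            exchange.close();
        });
        server.start();
    }
}
```

Note that every response waits for a full Spark job, which is why this pattern only feels reasonable for small results; anything large is better served asynchronously or straight from a database.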
I have an application using Hazelcast in embedded mode. I use MapLoaders and indexes. I want to change my application so I can use an existing external Hazelcast cluster, to which I will connect using the client API, and here lies my problem. There is no way of defining those MapLoaders in the client app, so I need to use the server mode. Right now I'm using the JoinConfig so I can join the cluster and define my MapLoaders, but if I understand correctly, by joining the cluster my app will become a part of the cluster itself (and host some data partitions), and this is something I would like to avoid. So is there another way of connecting to this external cluster so my app doesn't start hosting the cache data, or is my approach correct and I shouldn't mind being a part of the cluster?
If you use the Hazelcast client, it just connects to the cluster but does not become part of it, which seems to be what you want.
Concerning MapLoaders, please check the related code sample.
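As a rough illustration of the split (this is a sketch, not the referenced sample, and it assumes the Hazelcast 3.x APIs): the MapLoader is configured on the cluster members, while your application only runs a client. The map name "users", the loader class name, and the member address are placeholders.

```java
import com.hazelcast.client.HazelcastClient;
import com.hazelcast.client.config.ClientConfig;
import com.hazelcast.config.Config;
import com.hazelcast.config.MapStoreConfig;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;

public class ClientVsMemberSketch {
    // Member side (the external cluster): the MapLoader is configured here,
    // so the client application never needs to know about it.
    static HazelcastInstance startMember() {
        Config config = new Config();
        MapStoreConfig storeConfig = config.getMapConfig("users").getMapStoreConfig();
        storeConfig.setEnabled(true)
                   .setClassName("com.example.UserMapLoader"); // hypothetical loader class
        return Hazelcast.newHazelcastInstance(config);
    }

    // Client side (your application): connects to the cluster but holds no partitions.
    static HazelcastInstance startClient() {
        ClientConfig clientConfig = new ClientConfig();
        clientConfig.getNetworkConfig().addAddress("10.0.0.1:5701"); // placeholder address
        return HazelcastClient.newHazelcastClient(clientConfig);
    }

    public static void main(String[] args) {
        HazelcastInstance client = startClient();
        IMap<String, String> users = client.getMap("users");
        // A cache miss through the client triggers the MapLoader on the member owning the key.
        System.out.println(users.get("some-key"));
    }
}
```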
We are currently running a Hazelcast cluster, using it to communicate information on a queue to be picked up by a single node in the cluster. We are vulnerable, however, to a "rogue" node that joins the cluster without the right version of the software to handle the requests properly.
Is there a way to proactively remove rogue nodes of this nature, in a way that prevents them from re-joining the cluster? I haven't been able to find one in the documentation.
It looks like you are using the default hazelcast.xml. You should use a custom hazelcast.xml with updated group credentials, so that nodes without the correct credentials cannot join.
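For example, a minimal sketch of the same idea in programmatic configuration, assuming Hazelcast 3.x where GroupConfig carries a name and (in older 3.x versions) a password; the values below are placeholders, and in Hazelcast 4.x this became the cluster name instead, so check the docs for your version:

```java
import com.hazelcast.config.Config;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

public class PrivateClusterMember {
    public static void main(String[] args) {
        Config config = new Config();
        // Members that do not present the same group credentials are rejected at join time,
        // which keeps nodes started with the default configuration out of your cluster.
        config.getGroupConfig()
              .setName("prod-cluster")   // placeholder group name
              .setPassword("s3cret");    // placeholder password (checked only in older 3.x versions)
        HazelcastInstance member = Hazelcast.newHazelcastInstance(config);
    }
}
```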
I have a use case wherein I launch local Spark (embedded) inside an application server rather than going for a Spark REST job server or kernel, because the former (embedded Spark) has very low latency compared to the others. I am interested in:
Drawbacks of this approach, if there are any.
Can the same be used in production?
P.S. Low latency is priority here.
EDIT: The size of the data being processed will, in most cases, be less than 100 MB.
I don't think it is a drawback at all. If you have a look at the implementation of the Hive Thriftserver within the Spark project itself, it also manages the SQLContext etc. in the Hive server process. This is especially true if the amount of data is small and the driver can handle it easily. So I would also see this as a hint that this is okay for production use.
But I totally agree that the documentation and general advice on how to integrate Spark into interactive, customer-facing applications lags behind the information available for big-data pipelines.
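For what it's worth, here is a hedged Java sketch of the embedded pattern: one long-lived local SparkSession created at server startup and reused by request handlers, which is what keeps per-request latency low. The column name, file path, and config flag choices are illustrative assumptions, not part of the original question.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import java.util.List;

// Sketch of an embedded, long-lived local Spark session inside an application server.
public class EmbeddedSparkService {

    // Built once at server startup; a single SparkSession can serve jobs from multiple threads.
    private static final SparkSession SPARK = SparkSession.builder()
            .appName("embedded-spark-sketch")
            .master("local[*]")                  // all work stays inside this JVM
            .config("spark.ui.enabled", "false") // optional: avoid UI port clashes in an app server
            .getOrCreate();

    // Example request handler: aggregate a small (<100 MB) dataset and return the rows.
    public static List<Row> aggregate(String parquetPath) {
        Dataset<Row> df = SPARK.read().parquet(parquetPath);          // path supplied by the caller
        return df.groupBy("eventType").count().collectAsList();       // "eventType" is a placeholder column
    }

    public static void main(String[] args) {
        System.out.println(aggregate("/tmp/events.parquet"));         // placeholder path
    }
}
```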
I have a couple of use cases for Apache Spark applications/scripts, generally of the following form:
General ETL use case -
more specifically, a transformation of a Cassandra column family containing many events (think event sourcing) into various aggregated column families.
Streaming use case -
realtime analysis of the events as they arrive in the system.
For (1), I'll need to kick off the Spark application periodically.
For (2), just kick off the long running Spark Streaming process at boot time and let it go.
(Note: I'm using Spark Standalone as the cluster manager, so no YARN or Mesos.)
I'm trying to figure out the most common / best practice deployment strategies for Spark applications.
So far the options I can see are:
Deploying my program as a JAR and running the various tasks with spark-submit, which seems to be the way recommended in the Spark docs. Some thoughts about this strategy:
how do you start/stop tasks - just using simple bash scripts?
how is scheduling managed? - simply use cron?
any resilience? (e.g. who schedules the jobs to run if the driver server dies?)
Creating a separate webapp as the driver program.
creates a spark context programmatically to talk to the spark cluster
allowing users to kick off tasks through the http interface
using Quartz (for example) to manage scheduling
could use a cluster with ZooKeeper leader election for resilience
Spark job server (https://github.com/ooyala/spark-jobserver)
I don't think there's much benefit over (2) for me, as I don't (yet) have many teams and projects talking to Spark, and would still need some app to talk to the job server anyway
no scheduling built in as far as I can see
I'd like to understand the general consensus w.r.t. a simple but robust deployment strategy; I haven't been able to determine one by trawling the web so far.
Thanks very much!
Even though you are not using Mesos for Spark, you could have a look at
- Chronos, offering a distributed and fault-tolerant cron
- Marathon, a Mesos framework for long-running applications
Note that this doesn't mean you have to move your Spark deployment to Mesos; e.g., you could just use Chronos to trigger spark-submit.
I hope I understood your problem correctly and this helps you a bit!