Does Presto support adding data sources dynamically?

Does Presto support adding data sources dynamically?
If not, how can I add new catalogs by watching .properties files without restarting the cluster?

Currently Presto does not support adding or removing a catalog without a server restart. There is a long-running open issue about it which discusses the challenges of implementing it: https://github.com/prestodb/presto/issues/2445. I think the best you can do currently is to push the .properties changes to all nodes in the cluster and restart the Presto daemons. You can invoke a graceful shutdown on the worker nodes to minimize query failures, and have something like monit automatically bring the Presto server back up after it shuts down.
curl -v -XPUT --data '"SHUTTING_DOWN"' -H "Content-type: application/json" http://node-id:8081/v1/info/state
A restart of the Presto daemon on the coordinator would still cause a brief outage unless you have a coordinator HA setup.
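For example, a rough sketch of such a rolling update, assuming passwordless SSH, catalog files under /opt/presto/etc/catalog, the standard bin/launcher script and the default 8081 HTTP port (all of these are placeholders you would adjust to your setup):
# sketch only: push the new catalog files, drain each worker, then bring it back up
for node in worker1 worker2 worker3; do
  scp etc/catalog/*.properties "$node":/opt/presto/etc/catalog/
  curl -XPUT --data '"SHUTTING_DOWN"' -H "Content-type: application/json" "http://$node:8081/v1/info/state"
  sleep 180  # give running queries time to drain
  ssh "$node" /opt/presto/bin/launcher start   # or let monit restart the daemon for you
done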

Related

How do you kill a Spark job from the CLI?

Killing Spark job using command Prompt
This is the thread that I hoped would answer my question. But all four answers explain how to kill the entire application.
How can I stop a job? Like a count for example?
I can do it from the Spark Web UI by clicking "kill" on the respective job. I suppose it must be possible to list running jobs and interact with them also directly via CLI.
Practically speaking, I am working in a notebook with PySpark on a Glue endpoint. If I kill the application, the entire endpoint dies and I have to spin up a new cluster. I just want to stop a job. Cancelling it within the notebook only detaches the cell from the job; the job keeps running and blocks any further commands from being executed.
The Spark History Server provides a REST API interface. Unfortunately, it only exposes monitoring capabilities for applications, jobs, stages, etc.
There is also a REST Submission interface that provides the ability to submit, kill and check on the status of applications. It is undocumented AFAIK, and is only supported on Spark standalone and Mesos clusters, not YARN. (That's why there is no "kill" link in the Jobs UI screen for Spark on YARN, I guess.)
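If you want to experiment with that submission interface anyway, the kill endpoint on a standalone master looks roughly like this (6066 is the default REST port; the submission ID is the driver ID returned when the application was submitted through the same interface):
# sketch: only applies to apps submitted via the REST submission gateway in cluster mode
curl -X POST http://<spark-master-host>:6066/v1/submissions/kill/<submission-id>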
So you can try using that "hidden" API, but if you know your application's Spark UI URL and job id of a job you want to kill, the easier way is something like:
$ curl -G http://<Spark-Application-UI-host:port>/jobs/job/kill/?id=<job_id>
Since I don't work with Glue, I'd be interested to find out myself how it's going to react, because the kill normally results in org.apache.spark.SparkException: Job <job_id> cancelled.
Building on the answer by mazaneicha, it appears that for Spark 2.4.6 in standalone mode, for jobs submitted in client mode, the curl request to kill an app with a known applicationID is
curl -d "id=<your_appID>&terminate=true" -X POST <your_spark_master_url>/app/kill/
We had a similar problem with people not disconnecting their notebooks from the cluster and hence hogging resources.
We get the list of running applications by parsing the web UI. I'm pretty sure there are less painful ways to manage a Spark cluster...
List the job in Linux and kill it.
I would do
ps -ef | grep spark-submit
if it was started using spark-submit. Get the PID from the output and then
kill -9 <PID>
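If there is only one spark-submit process on the machine, the two steps can be collapsed into a single line (note that, as with the steps above, this kills the whole driver process, i.e. the entire application, not an individual job):
kill -9 $(pgrep -f spark-submit)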
Kill the running job by:
Open the Spark application UI.
Go to the Jobs tab.
Find the job among the running jobs.
Click the kill link and confirm.

Do we need to specify the same catalogs.properties files on both the Presto coordinator and the worker?

Well, I have two Docker containers, one for the Presto coordinator and one for a Presto worker. It works fine, but I need to specify the same catalogs.properties files on both the coordinator and the worker.
I assumed my Presto worker wouldn't need to know about my catalogs.properties files, since it could fetch the details from the master itself.
But if I don't specify them, my Presto launcher fails.
Is there any way to avoid duplicating the catalogs.properties files on both the master and the workers?
No, currently you need to configure your catalogs on each machine.
Note: in a typical production setup there will be some automation doing this for you, so the manual work is not multiplied.
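As a rough illustration of that kind of automation, assuming passwordless SSH and a standard /opt/presto layout (hostnames and paths are placeholders):
# push the same catalog definitions from the coordinator to every worker,
# then restart each worker so it picks up the new files
for node in worker1 worker2; do
  rsync -av /opt/presto/etc/catalog/ "$node":/opt/presto/etc/catalog/
  ssh "$node" /opt/presto/bin/launcher restart
done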
You may take a look at this; it has a Presto configuration showing how to automatically make sure all workers have the same catalog files. Note, this is just one of its features:
https://github.com/niths4u/prestodb-cluster/

How to update configuration of a Cassandra cluster

I have a 3-node Cassandra cluster and I want to make some adjustments to cassandra.yaml.
My question is, how should I perform this? One node at a time or is there a way to make it happen without shutting down nodes?
Btw, I am using Cassandra 2.2 and this is a production cluster.
There are multiple approaches here:
If you edit the cassandra.yaml file, you need to restart cassandra to re-read the contents of that file. If you restart all nodes at once, your cluster will be unavailable. Restarting one node at a time is almost always safe (provided you have sane replication-factors and consistency-levels). If your cluster is configured to survive a rack or datacenter outage, then you can safely restart more nodes concurrently.
Many settings can be changed without a restart via JMX, though I don't have a documentation link handy. Changing a setting via JMX WON'T change cassandra.yaml though, so you'll need to update that file as well or your config will revert to what's in the file when the node restarts.
If you're using DSE, OpsCenter's Lifecycle Manager feature makes updating configs a simple point-and-click affair (disclaimer, I'm biased as I'm an LCM dev).
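For the rolling restart described above, a minimal per-node sketch could look like this (the restart command depends on how Cassandra was installed; wait for the node to show UN in nodetool status before moving on to the next one):
nodetool drain                    # flush memtables and stop accepting new writes
sudo service cassandra restart    # or: sudo systemctl restart cassandra
nodetool status                   # wait until this node is UN again, then continue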

Cassandra JMX: need to connect to all the nodes

I am trying to get Cassandra cfstats information from all the machines using JMX. This can be done using OpsCenter, but I do not want to use it. I started building my own utility. For now, my Java program connects to JMX and fetches cfstats information such as estimateKeys, number of SSTables, etc.
My requirement is: this is a Java jar file that will run from one Cassandra node and should be able to connect to all the machines and fetch cfstats from each node's own JMX endpoint.
I am planning to use the Java driver for this, as the Java driver can discover all the machines in the cluster using the system.peers column family. Once the Java driver connects to the machines, I will form the service:jmx:rmi URL using the respective hostname and JMX port (7199). Then I will be able to connect to NodeProbe using this information.
My question is: after connecting to another node using the Java driver, will I be able to retain state there, and after forming the service:jmx:rmi URL, will that URL really connect to that node's JMX and pull cfstats information from it? The JMX hostname is taken from the cassandra-env.sh file. Can someone please help me with this?
Does this idea work, or is there a better way to achieve this?
It's possible to use JMX remotely, but that's not the easiest thing to do.
But if you are writing your own tool, maybe it's worth checking out a different kind of connection. For example, you can easily expose JMX calls over HTTP using Jolokia.
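As an illustration, a Jolokia read looks something like this, assuming the Jolokia JVM agent is attached to each Cassandra process on its default port 8778; the exact MBean and attribute names vary by Cassandra version, so verify them in jconsole first:
# example MBean shown is the storage Load counter; swap in the table-level metrics you need
curl http://<cassandra-node>:8778/jolokia/read/org.apache.cassandra.metrics:type=Storage,name=Load/Count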

How can I run more than one Cassandra server on a single machine and form one cluster ring?

I would like to know if there is any way to run multiple Cassandra servers on a single machine, so that all the servers on that machine form one ring (cluster).
Is there any way to run the Cassandra servers on a single machine?
There's always a way!
There is an excellent tool available that allows you to configure a multi-node cluster locally, but it's currently not supported under Windows. When you build a cluster and start it, it will configure the ring for you. You can check out the ring using ./nodetool -h 127.0.0.1 -p 7100 ring after it has started.
*Just a side note: the ccm tool starts the cluster as a background process.
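For reference, a minimal ccm session looks something like this (the Cassandra version is just an example):
ccm create test -v 2.2.19 -n 3 -s   # create a 3-node local cluster and start it
ccm status                          # show the state of each local node
ccm node1 ring                      # same as running nodetool ring against node1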
