We want to build a Presto production cluster on RHEL machines.
One of the machines is the Presto coordinator, and all the others are Presto workers.
What is the recommended minimum number of Presto workers for a production environment?
Some more details about Presto:
The Presto coordinator is the server that is responsible for parsing statements, planning queries, and managing Presto worker nodes. It is the “brain” of a Presto installation and is also the node to which a client connects to submit statements for execution. Every Presto installation must have a Presto coordinator alongside one or more Presto workers. For development or testing purposes, a single instance of Presto can be configured to perform both roles.
The coordinator keeps track of the activity on each worker and coordinates the execution of a query. The coordinator creates a logical model of a query involving a series of stages which is then translated into a series of connected tasks running on a cluster of Presto workers.
Coordinators communicate with workers and clients using a REST API.
Worker
A Presto worker is a server in a Presto installation which is responsible for executing tasks and processing data. Worker nodes fetch data from connectors and exchange intermediate data with each other. The coordinator is responsible for fetching results from the workers and returning the final results to the client.
When a Presto worker process starts up, it advertises itself to the discovery server in the coordinator, which makes it available to the Presto coordinator for task execution.
Workers communicate with other workers and Presto coordinators using a REST API.
The minimum number of Presto workers is 1, regardless of your environment type.
You can also configure the Presto coordinator node to run a worker, which gives you a minimal single-node setup, for example to evaluate the features. According to the official guide, you do this by setting the following parameters in config.properties:
coordinator=true
node-scheduler.include-coordinator=true
A reasonable minimum number of workers for production cannot be determined without additional information, such as the expected number of users, the number and size of the datasets, and the performance of your infrastructure.
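As a rough illustration of the coordinator and worker roles described above, here is a small Python sketch that renders a config.properties file for each role. The property names (coordinator, node-scheduler.include-coordinator, discovery-server.enabled, http-server.http.port, discovery.uri) are the ones used in the Presto deployment guide; the function name, the host name, and the port are assumptions made only for this example.

```python
def render_config(role, discovery_uri, include_coordinator=False, http_port=8080):
    """Return the text of a Presto config.properties for the given role.

    role is "coordinator" or "worker" (hypothetical helper, not part of Presto).
    """
    if role not in ("coordinator", "worker"):
        raise ValueError(f"unknown role: {role}")

    props = {}
    if role == "coordinator":
        props["coordinator"] = "true"
        # When true, the coordinator also schedules work on itself (single-node setup).
        props["node-scheduler.include-coordinator"] = str(include_coordinator).lower()
        # The coordinator runs the embedded discovery server that workers register with.
        props["discovery-server.enabled"] = "true"
    else:
        props["coordinator"] = "false"

    props["http-server.http.port"] = str(http_port)
    # Every node points at the discovery server hosted by the coordinator.
    props["discovery.uri"] = discovery_uri
    return "\n".join(f"{key}={value}" for key, value in props.items()) + "\n"


if __name__ == "__main__":
    uri = "http://coordinator.example.net:8080"  # assumed host name
    print(render_config("coordinator", uri, include_coordinator=True))
    print(render_config("worker", uri))
```

With include_coordinator=True the first file describes the single-node evaluation setup; with the default it describes a dedicated coordinator that needs at least one separate worker.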
I couldn't figure out the difference between the Spark driver and the application master. Basically, in terms of responsibilities when running an application, who does what?
In client mode, the client machine has the driver, and the app master runs on one of the cluster nodes. In cluster mode, the client has neither; the driver and the app master run on the same node (one of the cluster nodes).
What exactly are the operations that the driver performs, and what does the app master do?
References:
Spark Driver memory and Application Master memory
Spark yarn cluster vs client - how to choose which one to use?
As per the spark documentation
Spark Driver :
The Driver(aka driver program) is responsible for converting a user
application to smaller execution units called tasks and then schedules
them to run with a cluster manager on executors. The driver is also
responsible for executing the Spark application and returning the
status/results to the user.
Spark Driver contains various components – DAGScheduler,
TaskScheduler, BackendScheduler and BlockManager. They are responsible
for the translation of user code into actual Spark jobs executed on
the cluster.
Whereas the Application Master:
The Application Master is responsible for the execution of a single
application. It asks for containers from the Resource Scheduler
(Resource Manager) and executes specific programs on the obtained containers.
The Application Master is just a broker that negotiates resources with the Resource Manager and then, after obtaining containers, makes sure that tasks (which are picked from the scheduler queue) are launched on those containers.
In a nutshell, the driver program translates your custom logic into jobs, stages, and tasks, and the application master makes sure to get enough resources from the RM and to check the status of your tasks running in containers.
As already stated in the references you provided, the only difference between client and cluster mode is:
In client mode, the driver runs on the machine where we launched the Spark application/job, and the AM runs on one of the cluster nodes.
(AND)
In cluster mode, the driver runs inside the application master, which means the application master has much more responsibility.
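To make the difference concrete, here is a minimal Python sketch that builds the spark-submit command line for each mode and records where the driver and the application master end up. The --master and --deploy-mode flags are real spark-submit options; the function names, the dictionary layout, and the application file name are assumptions for illustration only.

```python
def spark_submit_command(deploy_mode, app="app.py"):
    """Build a spark-submit command for YARN in the given deploy mode."""
    if deploy_mode not in ("client", "cluster"):
        raise ValueError(f"unknown deploy mode: {deploy_mode}")
    return ["spark-submit", "--master", "yarn", "--deploy-mode", deploy_mode, app]


def placement(deploy_mode):
    """Describe where the driver and the YARN application master run."""
    if deploy_mode == "client":
        return {
            "driver": "client machine",          # the machine that ran spark-submit
            "application_master": "cluster node",
            "driver_inside_am": False,           # AM only requests resources
        }
    if deploy_mode == "cluster":
        return {
            "driver": "cluster node",
            "application_master": "cluster node",
            "driver_inside_am": True,            # driver runs inside the AM process
        }
    raise ValueError(f"unknown deploy mode: {deploy_mode}")


if __name__ == "__main__":
    for mode in ("client", "cluster"):
        print(" ".join(spark_submit_command(mode)), "->", placement(mode))
```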
References :
https://luminousmen.com/post/spark-anatomy-of-spark-application#:~:text=The%20Driver(aka%20driver%20program,status%2Fresults%20to%20the%20user.
https://www.edureka.co/community/1043/difference-between-application-master-application-manager#:~:text=The%20Application%20Master%20is%20responsible,class)%20on%20the%20obtained%20containers.
As Apache Spark is a suggested distributed processing engine for Cassandra, I know that there is a possibility to run Spark executors along with Cassandra nodes.
My question is if the driver and Spark connector are smart enough to understand partitioning and shard allocation so data are processed in a hyper-converged manner.
Simply put, do the executors read data from partitions hosted on the nodes where they are running, so that no unnecessary data is transferred across the network, as Spark does when it runs over HDFS?
Yes, Spark Cassandra Connector is able to do this. From the source code:
The getPreferredLocations method tells Spark the preferred nodes to fetch a partition from, so that the data for the partition are at the same node the task was sent to. If Cassandra nodes are collocated with Spark nodes, the queries are always sent to the Cassandra process running on the same node as the Spark Executor process, hence data are not transferred between nodes. If a Cassandra node fails or gets overloaded during read, the queries are retried to a different node.
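The idea behind preferred locations can be sketched in a few lines of Python. This is a toy model of a scheduler that honors locality hints, not the connector's or Spark's actual code; the function name, the host names, and the round-robin fallback are assumptions.

```python
def assign_tasks(partition_replicas, executor_hosts):
    """Map each partition to an executor host, preferring a host that holds a replica.

    partition_replicas: dict of partition id -> list of hosts storing that partition
    executor_hosts: list of hosts that run an executor
    Returns a dict of partition id -> (chosen host, is_local).
    """
    if not executor_hosts:
        raise ValueError("at least one executor host is required")

    assignments = {}
    for index, (partition, replicas) in enumerate(partition_replicas.items()):
        # Preferred locations: replica hosts that also run an executor.
        local = [host for host in replicas if host in executor_hosts]
        if local:
            assignments[partition] = (local[0], True)
        else:
            # No collocated executor: fall back to any executor (data crosses the network).
            fallback = executor_hosts[index % len(executor_hosts)]
            assignments[partition] = (fallback, False)
    return assignments


if __name__ == "__main__":
    replicas = {"p0": ["node1", "node2"], "p1": ["node3"], "p2": ["node9"]}
    print(assign_tasks(replicas, ["node1", "node3"]))
```

Partitions whose replicas sit on an executor host are read locally; only the remaining ones require a network transfer.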
Theoretically, yes. The same goes for HDFS. However, in practice I have seen less of this in the cloud, where separate nodes are used for Spark and Cassandra when managed cloud services are used. If you use IaaS and set up your own Cassandra and Spark, then you can achieve it.
I would like to add to Alex's answer:
Yes, Spark Cassandra Connector is able to do this. From the source
code:
The getPreferredLocations method tells Spark the preferred nodes to
fetch a partition from, so that the data for the partition are at the
same node the task was sent to. If Cassandra nodes are collocated with
Spark nodes, the queries are always sent to the Cassandra process
running on the same node as the Spark Executor process, hence data are
not transferred between nodes. If a Cassandra node fails or gets
overloaded during read, the queries are retried to a different node.
I would add that this is bad behavior.
In Cassandra, when you ask for the data of a particular partition, only one node is accessed. Spark can actually access 3 nodes thanks to replication (assuming a replication factor of 3). So without shuffling you have 3 nodes participating in the job.
In Hadoop, however, when you ask for the data of a particular partition, usually all nodes in the cluster are accessed, and Spark then uses all nodes in the cluster as executors.
So if you have 100 nodes: with Cassandra, Spark will take advantage of 3 nodes. With Hadoop, Spark will take advantage of all 100 nodes.
Cassandra is optimized for real-time operational systems, and therefore not optimized for analytics like data lakes.
I am aware of the basics of the YARN framework, but I still feel I lack some understanding with regard to MapReduce.
With YARN, I have read that MapReduce is just one of the applications that can run on top of YARN; for example, various different jobs (MapReduce jobs, Spark jobs, etc.) can run on the same cluster.
Now, the point is that each type of job has its "own" kind of "job phases"; for example, MapReduce has phases such as Mapper, Sorting, Shuffle, and Reducer.
Specific to this scenario, who "decides" and "controls" these phases? Is it the MapReduce framework?
As I understand it, YARN is an infrastructure on which different jobs run. So when we submit a MapReduce job, does it first go to the MapReduce framework, with the code then executed by YARN? I have this doubt because YARN is a general-purpose execution engine, so it won't have knowledge of mappers, reducers, etc., which are specific to MapReduce (and differ for other kinds of jobs). So does the MapReduce framework run on top of YARN, with YARN helping to execute the jobs, while the MapReduce framework is aware of the phases it has to go through for a particular kind of job?
Any clarification to understand this would be of great help.
If you take a look at this picture from the Hadoop documentation:
You'll see that there's no particular "job orchestration" component, but a resource requesting component, called application master. As you mentioned, YARN does resource management and with regards to application orchestration, it stops at an abstract level.
The per-application ApplicationMaster is, in effect, a framework specific library and is tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the tasks.
When applied to Spark, some of the components in that picture would be:
Client: the spark-submit process
App Master: Spark's application master, which hosts both the driver and the application master logic (cluster mode) or just the application master logic (client mode)
Container: Spark executors
Spark's YARN infrastructure provides the application master (in YARN terms), which knows about Spark's architecture. So when the driver runs, either in cluster mode or in client mode, it still decides on jobs/stages/tasks. This must be application/framework-specific (Spark being the "framework" when it comes to YARN).
From Spark documentation on YARN deployment:
In cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application. In client mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN
You can extend this abstraction to map-reduce, given your understanding of that framework.
So when we submit a MapReduce job, it first goes to the Resource Manager, which is the master daemon of YARN. The Resource Manager then selects a Node Manager (the Node Managers are the slave processes of YARN) and asks it to start a container running a very lightweight process known as the Application Master. Then the Resource Manager asks the Application Master to start execution of the job.
The Application Master first goes through the driver part of the job, from which it learns what resources the job will need, and accordingly requests those resources from the Resource Manager. The Resource Manager can assign the resources to the Application Master immediately or, if the cluster is too occupied, the request is rescheduled based on various scheduling algorithms.
After getting the resources, the Application Master goes to the Name Node to get the metadata of all the blocks that need to be processed for this job.
After receiving the metadata, the Application Master asks the Node Managers of the nodes where the blocks are stored to launch containers for processing their respective blocks. If those nodes are too busy, it falls back to a node in the same rack, and otherwise to any other node, depending on rack awareness.
The blocks are processed independently and in parallel on their respective nodes. After the entire processing is done, the result is stored in HDFS.
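The locality fallback described above (same node, then same rack, then any node) can be sketched as follows. This is a simplified model written for this answer, not YARN's scheduler code; the function name, the node and rack names, and the "busy" set are assumptions.

```python
def pick_node(block_hosts, rack_of, busy):
    """Choose a node to process a block, preferring data locality.

    block_hosts: hosts that store a replica of the block
    rack_of: dict of host -> rack id for every node in the cluster
    busy: set of hosts that cannot accept another container
    Returns (host, locality), where locality is one of
    "node-local", "rack-local", "off-rack", or "none".
    """
    # 1. Prefer a node that already holds the block.
    for host in block_hosts:
        if host not in busy:
            return host, "node-local"

    # 2. Otherwise, a free node in the same rack as one of the replicas.
    replica_racks = {rack_of[host] for host in block_hosts}
    for host in sorted(rack_of):
        if host not in busy and rack_of[host] in replica_racks:
            return host, "rack-local"

    # 3. Otherwise, any free node in the cluster.
    for host in sorted(rack_of):
        if host not in busy:
            return host, "off-rack"

    return None, "none"


if __name__ == "__main__":
    racks = {"n1": "r1", "n2": "r1", "n3": "r2"}
    print(pick_node(["n1"], racks, busy=set()))
    print(pick_node(["n1"], racks, busy={"n1"}))
    print(pick_node(["n1"], racks, busy={"n1", "n2"}))
```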
I have followed the instructions here to enable metrics export to Prometheus for Spark. In order to export metrics not just from the job but also from the master and workers, I have enabled the JMX agent for the Spark driver, master, worker, and executor.
This causes a problem because the Spark worker and executor are collocated on the same machine, so I need to pass different JMX ports to them. This is not a problem if there is a 1-1 relationship between Spark workers and executors. However, it breaks down in the multiple-executors-per-worker scenario, as there is no way to specify a different port for a specific executor during Spark job submission.
The situation is even worse when the job is submitted in cluster mode, since the driver, worker, and executors are all potentially collocated on the same node.
How have you resolved this problem?
I'm building an Apache Spark application that acts on multiple streams.
I did read the Performance Tuning section of the documentation:
http://spark.apache.org/docs/latest/streaming-programming-guide.html#performance-tuning
What I didn't get is:
1) Are the streaming receivers located on multiple worker nodes or on the driver machine?
2) What happens if one of the nodes that receives the data fails (power off/restart)?
Are the streaming receivers located on multiple worker nodes or on the
driver machine?
Receivers are located on worker nodes, which are responsible for the consumption of the source which holds the data.
What happens if one of the nodes that receives the data fails (power
off/restart)?
The receiver is located on the worker node. The worker node gets its tasks from the driver. The driver can either run on a dedicated master server if you're running in client mode, or on one of the workers if you're running in cluster mode. If a node that doesn't run the driver fails, the driver will re-assign the partitions held on the failed node to a different one, which will then be able to re-read the data from the source and do the additional processing needed to recover from the failure.
This is why a replayable source such as Kafka or AWS Kinesis is needed.
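A minimal Python sketch of why replayability matters: the consumer below only advances its committed offset after a batch has been processed, so a batch lost to a simulated node failure is simply read again. The class and function names are hypothetical, and the in-memory log merely stands in for a source such as Kafka; it does not use any Spark or Kafka API.

```python
class ReplayableLog:
    """In-memory stand-in for a replayable source: records stay readable by offset."""

    def __init__(self, records):
        self._records = list(records)

    def read(self, offset, max_records):
        return self._records[offset:offset + max_records]


def consume(log, batch_size, fail_on_attempt=None):
    """Process every record, committing the offset only after a batch succeeds.

    fail_on_attempt simulates a node failure on that read attempt (1-based):
    the batch is lost before processing and the offset is not advanced.
    Returns (processed records, number of read attempts that returned data).
    """
    committed = 0
    attempts = 0
    processed = []
    while True:
        batch = log.read(committed, batch_size)
        if not batch:
            break
        attempts += 1
        if attempts == fail_on_attempt:
            continue  # failure: nothing committed, so the same batch is re-read
        processed.extend(batch)
        committed += len(batch)
    return processed, attempts


if __name__ == "__main__":
    log = ReplayableLog(["a", "b", "c", "d", "e"])
    print(consume(log, batch_size=2, fail_on_attempt=2))
```

With a source that cannot be re-read from an earlier offset, the records in the failed batch would be lost instead of being processed on the retry.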