Spark, does size of master node on EMR matter? - apache-spark

When running a Spark ETL job on EMR, does the size of the master node instance matter? Based on my understanding, the master node does not handle processing/computation of data and is responsible for scheduling tasks, communicating with core and task nodes, and other admin tasks.
Does this mean if I have 10 TB of data that I need to transform and then write out, I can use 1 medium instance for master and 10 8xlarge for core nodes?
Based on reading, I see most people suggest master node instance type should be same as core instance type which I currently do and works fine. This would be 1 8xlarge for master and 10 8xlarge for core nodes.
According to AWS docs, we should use m4.large, so I'm confused what's right.
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-instances-guidelines.html
The master node does not have large computational requirements. For most clusters of 50 or fewer nodes, consider using an m4.large instance. For clusters of more than 50 nodes, consider using an m4.xlarge.

The way the question is asked is a little vague. Size does matter, i.e. load, etc. So I answer it from a slightly different perspective. That "most people ..." stuff is neither here nor there.
The way the Master was assigned in the past was a weakness of the EMR approach imho when I trialled it some 9 mths ago for a PoC. Allocate big resources for Workers and by default 1 went to the Master which was complete overkill.
So, if you did things standardly, you paid for non-needed larger than req'd resource for the Master Node. There is a way to define a smaller resource for the Master, but I am on hols and cannot find it back again.
However, look at the url here and you see now that during EMR Cluster
Config you can easily define a smaller Master Node or many such Master
Nodes for fail over, things have moved along since I last looked:
https://confusedcoders.com/data-engineering/how-to-create-emr-cluster-with-apache-spark-and-apache-zeppelin.
See also https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-ha-launch.html for multiple such Master Nodes.
In general the Master Node can differ in terms of characteristics from the Workers, usually smaller, but may be not in all cases. That said, the EMR purpose would tend to point to smaller Master Node config.

Related

Can I upgrade a Cassandra cluster swapping in new nodes running the updated version?

I am relatively new to Cassandra... both as a User and as an Operator. Not what I was hired for, but it's now on my plate. If there's an obvious answer or detail I'm missing, I'll be more than happy to provide it... just let me know!
I am unable to find any recent or concrete documentation that explicitly spells out how tolerant Cassandra nodes will be when a node with a higher Cassandra version is introduced to an existing cluster.
Hypothetically, let's say I have 4 nodes in a cluster running 3.0.16 and I wanted to upgrade the cluster to 3.0.24 (the latest version as of posting; 2021-04-19). For reasons that are not important here, running an 'in-place' upgrade on each existing node is not possible. That is: I can not simply stop Cassandra on the existing nodes and then do an nodetool drain; service cassandra stop; apt upgrade cassandra; service cassandra start.
I've looked at the change log between 3.0.17 and 3.0.24 (inclusive) and don't see anything that looks like a major breaking change w/r/t the transport protocol.
So my question is: Can I introduce new nodes (running 3.0.24) to the c* cluster (comprised of 3.0.16 nodes) and then run nodetool decommission on each of the 3.0.16 nodes to perform a "one for one" replacement to upgrade the cluster?
Do i risk any data integrity issues with this procedure? Is there a specific reason why the procedure outlined above wouldn't work? What about if the number of tokens each node was responsible for was increased with the new nodes? E.G.: 0.16 nodes equally split the keyspace over 128 tokens but the new nodes 0.24 will split everything across 256 tokens.
EDIT: After some back/forth on the #cassandra channel on the apache slack, it appears as though there's no issue w/ the procedure. There were some other comorbid issues caused by other bits of automation that did threaten the data-integrity of the cluster, however. In short, each new node was adding ITSSELF to list list of seed nodes as well. This can be seen in the logs: This node will not auto bootstrap because it is configured to be a seed node.
Each new node failed to bootstrap, but did not fail to take new writes.
EDIT2: I am not on a k8s environment; this is 'basic' EC2. Likewise, the volume of data / node size is quite small; ranging from tens of megabytes to a few hundred gigs in production. In all cases, the cluster is fewer than 10 nodes. The case I outlined above was for a test/dev cluster which is normally 2 nodes in two distinct rack/AZs for a total of 4 nodes in the cluster.
Running bootstrap & decommission will take quite a long time, especially if you have a lot of data - you will stream all data twice, and this will increase load onto cluster. The simpler solution would be to replace old nodes by copying their data onto new nodes that have the same configuration as old nodes, but with different IP and with 3.0.24 (don't start that node!). Step-by-step instructions are in this answer, when it's done correctly you will have minimal downtime, and won't need to wait for bootstrap decommission.
Another possibility if you can't stop running node is to add all new nodes as a new datacenter, adjust replication factor to add it, use nodetool rebuild to force copying of the data to new DC, switch application to new data center, and then decommission the whole data center without streaming the data. In this scenario you will stream data only once. Also, it will play better if new nodes will have different number of num_tokens - it's not recommended to have different num_tokens on the nodes of the same DC.
P.S. usually it's not recommended to do changes in cluster topology when you have nodes of different versions, but maybe it could be ok for 3.0.16 -> 3.0.24.
To echo Alex's answer, 3.0.16 and 3.0.24 still use the same SSTable file format, so the complexity of the upgrade decreases dramatically. They'll still be able to stream data between the different versions, so your idea should work. If you're in a K8s-like environment, it might just be easier to redeploy with the new version and attach the old volumes to the replacement instances.
"What about if the number of tokens each node was responsible for was increased with the new nodes? E.G.: 0.16 nodes equally split the keyspace over 128 tokens but the new nodes 0.24 will split everything across 256 tokens."
A couple of points jump out at me about this one.
First of all, it is widely recognized by the community that the default num_tokens value of 256 is waaaaaay too high. Even 128 is too high. I would recommend something along the lines of 12 to 24 (we use 16).
I would definitely not increase it.
Secondly, changing num_tokens requires a data reload. The reason, is that the token ranges change, and thus each node's responsibility for specific data changes. I have changed this before by standing up a new, logical data center, and then switching over to it. But I would recommend not changing that if at all possible.
"In short, each new node was adding ITSSELF to list list of seed nodes as well."
So, while that's not recommended (every node a seed node), it's not a show-stopper. You can certainly run a nodetool repair/rebuild afterward to stream data to them. But yes, if you can get to the bottom of why each node is adding itself to the seed list, that would be ideal.

The contact between Replication factor and Resource Usage

I am a Cassandra user in china. Recently we want to use Cassandra in our production environment. But I don't know the impact of data replica factor and resource consumption.
My stress test show that 3 replication factor use three times more resources than 1 replication factor. But I'm not sure it's right.
So, I would like to ask if there is a formula for replication factor and resource consumption? Or has anyone ever tested it?
I'm very grateful if anyone can reply me;
First of all, RF=3 means you need at least three servers (obviously). But really, it depends on what you mean by "resources." If that's mainly referring to disk space, then "yes" setting a RF=3 will use 3x the disk space that a single copy (RF=1) would.
So why would you want that? Because supporting data loads in highly-available (HA) scenarios is what Cassandra does really well. This means that Cassandra needs to be able to continue to serve requests if a node should fail. Achieving that means setting RF>1.
As for the remaining resources, if you're referring to network, CPU & RAM as well, then the answer is "it depends." An application can choose to query at different consistency levels, such as ONE, QUORUM, or ALL (and others). For ONE, it does just what it says: an operation (read or write) waits for acknowledgement from a single node.
So if an app is querying at a consistency of ONE, the answer is "no," it won't use three times the resources if RF=3.
Cassandra is distributed database so it stores the data based on partition and hash algorithm. We can configure replica of our data based on requirement and application nature. Default Cassandra cluster with minimum 3 node recommended for production but you should use or configure the replication factor(replica/copy of data) totally on your wish.
If you use 3 node cluster with RF=3 then your data will be distributed on each node (approx 1/3 data on each node). We need to consider the resource here for all 3 nodes like disk, CPU, Memory, I/O etc equally for better performance. However, we can tune multiple things(like consistency, compaction, network, OS) inside the Cassandra to improve the performance and resource effective. 3 copy of data will use more memory and disk as compared to 1 copy of data. But if you consider availability and performance you should use at least 2 copy of data. you can refer below link for more details regarding RF calculation etc:-
https://www.ecyrd.com/cassandracalculator/

Is it better to create many small Spark clusters or a smaller number of very large clusters

I am currently developing an application to wrangle a huge amount of data using Spark. The data is a mixture of Apache (and other) log files as well as csv and json files. The directory structure of my Google bucket will look something like this:
root_dir
web_logs
\input (subdirectory)
\output (subdirectory)
network_logs (same subdirectories as web_logs)
system_logs (same subdirectories as web_logs)
The directory structure under the \input directories is arbitrary. Spark jobs pick up all of their data from the \input directory and place it in the \output directory. There is an arbitrary number of *_logs directories.
My current plan is to split the entire wrangling task into about 2000 jobs and use the cloud dataproc api to spin up a cluster, do the job, and close down. Another option would be to create a smaller number of very large clusters and just send jobs to the larger clusters instead.
The first approach is being considered because each individual job is taking about an hour to complete. Simply waiting for one job to finish before starting the other will take too much time.
My questions are: 1) besides the cluster startup costs, are there any downside to taking the first approach? and 2) is there a better alternative?
Thanks so much in advance!
Besides startup overhead, the main other consideration when using single-use clusters per job is that some jobs might be more prone to "stragglers" where data skew leads to a small number of tasks taking much longer than other tasks, so that the cluster isn't efficiently utilized near the end of the job. In some cases this can be mitigated by explicitly downscaling, combined with the help of graceful decommissioning, but if a job is shaped such that many "map" partitions produce shuffle output across all the nodes but there are "reduce" stragglers, then you can't safely downscale nodes that are still responsible for serving shuffle data.
That said, in many cases, simply tuning the size/number of partitions to occur in several "waves" (i.e. if you have 100 cores working, carving the work into something like 1000 to 10,000 partitions) helps mitigate the straggler problem even in the presence of data skew, and the downside is on par with startup overhead.
Despite the overhead of startup and stragglers, though, usually the pros of using new ephemeral clusters per-job vastly outweigh the cons; maintaining perfect utilization of a large shared cluster isn't easy either, and the benefits of using ephemeral clusters includes vastly improved agility and scalability, letting you optionally adopt new software versions, switch regions, switch machine types, incorporate brand-new hardware features (like GPUs) if they become needed, etc. Here's a blog post by Thumbtack discussing the benefits of such "job-scoped clusters" on Dataproc.
A slightly different architecture if your jobs are very short (i.e. if each one only runs a couple minutes and thus amplify the downside of startup overhead) or the straggler problem is unsolveable, is to use "pools" of clusters. This blog post touches on using "labels" to easily maintain pools of larger clusters where you still teardown/create clusters regularly to ensure agility of version updates, adopting new hardware, etc.
You might want to explore my solution for Autoscaling Google Dataproc Clusters
The source code can be found here

Nifi GetEventHub is multiplying the data by the number of nodes

I have some flows that get the data from an azure eventhub, im using the GetAzureEventhub processor. The data that im getting is being multiplyed by the number of nodes that I have in the cluster, I have 4 nodes. If I indicate to the processor to just run on the primary node, the data is not replicated 4 times.
I found that the eventhub for each consumer group accepts up to 5 readers, I read this in this article, each reader will have its own separate offset and they consume the same data. So in conclussion Im reading the same data 4 times.
I have 2 questions:
How can I coordinate this 4 nodes in order to go throught the same reader?
In case this is not posible, how can indicate nifi to just one of the nodes to read?
Thanks, if you need any clarification, ask for it.
GetAzureEventHub currently does not perform any coordination across nodes so you would have to run it on primary node only to avoid duplication.
The processor would require refactoring to perform coordination across the nodes of the cluster and assign unique partitions to each node, and handle failures (i.e. if a node consuming partition 1 goes down, another node has to take over partition 1).
If the Azure client provided this coordination somehow (similar to the Kafka client) then it would require less work on the NiFi side, but I'm not familiar enough with Azure to know if it provides anything like this.

Is it ok to set all cassandra nodes as seeds?

I'm interested in speeding up the process of bootstrapping a cluster and adding/removing nodes (Granted, in the case of node removal, most time will be spend draining the node). I saw in the source code that nodes that are seeds are not bootstrapped, and hence do not sleep for 30 seconds while waiting for gossip to stabilize. Thus, if all nodes are declared to be seeds, the process of creating a cluster will run 30 seconds faster. My question is is this ok? and what are the downsides of this? Is there a hidden requirement in cassandra that we have at least one non-seed node to perform a bootstrap (as suggested in the answer to the following question)? I know I can shorten RING_DELAY by modifying /etc/cassandra/cassandra-env.sh, but if simply setting all nodes to be seeds would be better or faster in some way, that might be better. (Intuitively, there must be a downside to setting all nodes to be seeds since it appears to strictly improve startup time.)
Good question. Making all nodes seeds is not recommended. You want new nodes and nodes that come up after going down to automatically migrate the right data. Bootstrapping does that. When initializing a fresh cluster without data, turn off bootstrapping. For data consistency, bootstrapping needs to be on at other times. A new start-up option -Dcassandra.auto_bootstrap=false was added to Cassandra 2.1: You start Cassandra with the option to put auto_bootstrap=false into effect temporarily until the node goes down. When the node comes back up the default auto_bootstrap=true is back in effect. Folks are less likely to go on indefinitely without bootstrapping after creating a cluster--no need to go back and forth configuring the yaml on each node.
In multiple data-center clusters, the seed list should include at least one node from each data center. To prevent partitions in gossip communications, use the same list of seed nodes in all nodes in a cluster. This is critical the first time a node starts up.
These recommendations are mentioned on several different pages of 2.1 Cassandra docs: http://www.datastax.com/documentation/cassandra/2.1/cassandra/gettingStartedCassandraIntro.html.

Resources