Hi, the official Spark documentation states:
While Spark can perform a lot of its computation in memory, it still
uses local disks to store data that doesn’t fit in RAM, as well as to
preserve intermediate output between stages. We recommend having 4-8
disks per node, configured without RAID (just as separate mount
points). In Linux, mount the disks with the noatime option to reduce
unnecessary writes. In Spark, configure the spark.local.dir variable
to be a comma-separated list of the local disks. If you are running
HDFS, it’s fine to use the same disks as HDFS.
I wonder what the purpose of 4-8 disks per node is.
Is it for parallel writes? I am not sure I understand the reason, as it is not explained.
I also have no clue about this: "If you are running HDFS, it's fine to use the same disks as HDFS."
Any idea what is meant here?
The purpose of RAID is to mirror partitions, adding redundancy so that data is not lost in case of a hardware-level fault. The point of 4-8 separate (non-RAID) disks is indeed parallel I/O: Spark spreads its shuffle and spill files across all the directories listed in spark.local.dir, so more independent disks means more aggregate disk bandwidth. In the case of HDFS, the redundancy that RAID provides is not needed, since HDFS handles it by replicating blocks between nodes, which is why sharing the same plain disks with HDFS is fine.
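To make that concrete, here is a minimal sketch with hypothetical device names and mount points - each disk is mounted separately (no RAID) with noatime, and spark.local.dir lists one directory per disk so scratch I/O is striped across them:

    # /etc/fstab - one mount point per physical disk, no RAID, noatime to cut metadata writes
    /dev/sdb1    /mnt/disk1    ext4    defaults,noatime    0    2
    /dev/sdc1    /mnt/disk2    ext4    defaults,noatime    0    2

    # spark-defaults.conf - comma-separated list of the local disks
    spark.local.dir    /mnt/disk1/spark,/mnt/disk2/spark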
While reading the AWS EFS documentation, a thought came to mind: is it a good idea to mount a shared EFS on the EMR/Hadoop VMs during bootstrapping and use it as the local disk for Spark jobs?
It could take advantage of the performance of EFS.
Maybe it could reduce the time spent on data transfer?
Since all the VMs share the same EFS, is it possible to "tell" my Spark job: hey, all the data you need to shuffle is already accessible by the target VMs, here is the path... (over to you, the experts). I think each Spark executor runs in its own "YARN app space", which is private, so maybe the executors on other VMs can't access it? If it is possible, it seems it could save a lot of time during the Spark shuffle.
Correct me if I am wrong; I'd like to hear your opinions.
Thanks.
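For concreteness, what I have in mind is roughly the following (the mount point is hypothetical) - whether executors could actually pick up each other's shuffle files this way is exactly what I'm asking:

    # each VM mounts the same EFS file system at the same path during bootstrap,
    # then Spark's scratch space is pointed at it (hypothetical path)
    spark.local.dir    /mnt/efs/spark-tmp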
I am not able to find this configuration anywhere in the official documentation. Say I decide to install Spark, or use a Spark Docker image. I would like to configure where the "spill to disk" happens so that I can mount a volume that can accommodate it. Where is the default spill-to-disk location, and how can it be changed?
Cloud or bare-metal worker nodes have a per-node spill location on the local file system, not in HDFS. This is normally handled as standard, not by you explicitly. A certain amount of the file system is used for spilling and shuffling and is local fs; the rest is used for HDFS. You can name a location for the local fs yourself or leave it to the defaults, and the fs can also be an NFS mount, etc.
For Docker, say, you need simulated HDFS or some Linux-like fs for Spark's intermediate processing. See https://www.kdnuggets.com/2020/07/apache-spark-cluster-docker.html for an excellent guide.
For Spark on YARN, use yarn.nodemanager.local-dirs. See https://spark.apache.org/docs/latest/running-on-yarn.html
For Spark Standalone, use SPARK_LOCAL_DIRS: "scratch" space in Spark, including map output files and RDDs that get stored on disk. This should be on a fast, local disk in your system. It can also be a comma-separated list of multiple directories on different disks.
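Note that spark.local.dir defaults to /tmp. A minimal sketch of the three variants, with hypothetical directories (on YARN the YARN setting takes precedence and spark.local.dir is ignored):

    # spark-defaults.conf (or --conf on spark-submit)
    spark.local.dir    /mnt/disk1/spark,/mnt/disk2/spark

    # spark-env.sh (Spark Standalone)
    SPARK_LOCAL_DIRS=/mnt/disk1/spark,/mnt/disk2/spark

    # yarn-site.xml (Spark on YARN)
    <property>
      <name>yarn.nodemanager.local-dirs</name>
      <value>/mnt/disk1/yarn,/mnt/disk2/yarn</value>
    </property>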
The Context
I'm currently running tests with Apache Cassandra on a single-node cluster. I've ensured the cluster is up and running using nodetool status, I've done a multitude of reads and writes that suggest as much, and I'm confident my cluster is set up properly. I am now attempting to speed up my throughput by mounting an SSD on the directory where Cassandra writes its data.
My Solution
Cassandra generally writes its data to /var/lib/cassandra/data; however, I've since switched mine, via cassandra.yaml, to write to another location, where I've mounted my SSD. I've confirmed that Cassandra is writing to this location by checking the size of the data directory's contents with watch du -h and other methods. The directory I've mounted the SSD on includes table data, the commitlog, hints, a nested data directory, and saved_caches.
The Problem
I've been using YCSB benchmarks (see https://github.com/brianfrankcooper/YCSB) to test the average throughput and ops/sec of Cassandra. I've noticed no difference in average throughput when mounting an HDD vs. an SSD on the location where Cassandra writes its data. I've analyzed disk access through dstat -cd --disk-util --disk-tps and found that the HDD caps out on usage in multiple instances, whereas the SSD only spikes to around 80% on several occasions.
The Question
How can I speed up the throughput of Cassandra using an SSD over an HDD? I assume this is the correct place to mount my SSD, but is Cassandra not taking advantage of the extra speed? Any help would be greatly appreciated!
An SSD should always win over an HDD in terms of latency, etc.; it's just a law of physics. I think your test simply didn't put enough load on the system. Another problem could be that you mounted only the data directory on the SSD, but not the commit log - on HDDs the commit log should always be put on a separate disk to avoid clashes with the data load. On SSDs it can be put on the same disk as the data - please point all directories to the SSD to see a difference.
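For reference, pointing everything at the SSD would look roughly like this in cassandra.yaml (the mount point is hypothetical):

    # cassandra.yaml - all storage directories on the SSD mount
    data_file_directories:
        - /mnt/ssd/cassandra/data
    commitlog_directory: /mnt/ssd/cassandra/commitlog
    hints_directory: /mnt/ssd/cassandra/hints
    saved_caches_directory: /mnt/ssd/cassandra/saved_caches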
I recommend performing a comparison using the following tools:
perfscripts - it uses the fio tool to emulate Cassandra-like workloads; if you run it on both HDDs and SSDs, you will see the difference in latency. You may not even need to run it - just look at the historic folder, which contains results for different disk types;
DSBench - it was recently released by the DataStax team, which specializes in benchmarking Cassandra and DSE. There are built-in workloads described in the wiki that you can use for testing. Just make sure you run the load long enough to see the effect of compaction, etc.
I started learning Apache Cassandra. In conf/cassandra.yaml I noticed the following comment on the commit log setting:
commit log. when running on magnetic HDD, this should be a
separate spindle than the data directories.
If not set, the default directory is $CASSANDRA_HOME/data/commitlog.
Does that mean I should store the commit log on a different HDD than the data?
If yes, what's the reason behind this? And what will happen if I don't comply?
Thanks.
That is a recommendation from the days of spinning disks. Due to its log-based storage engine, Cassandra is very dependent on disk I/O. So it was recommended to have your commit log and data directories on separate disks to avoid a potential bottleneck (latency) due to heavy disk activity.
If you are using solid state drives (SSDs, and with Cassandra you really should be) then you don't need to worry about this.
NOTE: This is also why using a NAS or SAN with Cassandra is considered to be an anti-pattern.
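For illustration, on spinning disks the recommended split would look roughly like this in cassandra.yaml (the mount points are hypothetical):

    # cassandra.yaml - commit log on its own spindle, away from the data directories
    data_file_directories:
        - /mnt/hdd1/cassandra/data
    commitlog_directory: /mnt/hdd2/cassandra/commitlog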
The Spark FAQ specifically says that one doesn't have to use HDFS:
Do I need Hadoop to run Spark?
No, but if you run on a cluster, you will need some form of shared file system (for example, NFS mounted at the same path on each node). If you have this type of filesystem, you can just deploy Spark in standalone mode.
So, what are the advantages/disadvantages of using Apache Spark with HDFS vs. other distributed file systems (such as NFS) if I'm not planning to use Hadoop MapReduce? Will I be missing an important feature if I use NFS instead of HDFS for node storage (for checkpoints, shuffle spill, etc.)?
After a few months and some experience with both NFS and HDFS, I can now answer my own question:
NFS allows you to view/change files on a remote machine as if they were stored on a local machine.
HDFS can also do that, but it is distributed (as opposed to NFS) and also fault-tolerant and scalable.
The advantage of using NFS is the simplicity of setup, so I would probably use it for QA environments or small clusters.
The advantage of HDFS is of course its fault tolerance, but a bigger advantage, IMHO, is the ability to exploit data locality when HDFS is co-located with the Spark nodes, which gives the best performance for checkpoints, shuffle spill, etc.
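For checkpointing, for example, the only thing that changes on the Spark side is the path you hand to setCheckpointDir; a minimal PySpark sketch, where the paths and the namenode address are hypothetical:

    # Minimal PySpark sketch; paths and namenode address are hypothetical.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("checkpoint-demo").getOrCreate()
    sc = spark.sparkContext

    # With HDFS: checkpoints are replicated and can sit next to the Spark nodes (locality).
    sc.setCheckpointDir("hdfs://namenode:8020/spark/checkpoints")

    # With a shared NFS mount (same path on every node) it would simply be:
    # sc.setCheckpointDir("/mnt/nfs/spark/checkpoints")

    rdd = sc.parallelize(range(1000)).map(lambda x: x * x)
    rdd.checkpoint()   # mark the RDD for checkpointing
    rdd.count()        # the action triggers the checkpoint write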