EMR, Spark: proper place for a local shared cache - apache-spark

In our Spark application, we store the local application cache in the /mnt/yarn/app-cache/ directory, which is shared between application containers on the same EC2 instance.
/mnt/... is chosen because it is a fast NVMe SSD on r5d instances.
This approach worked well for several years on EMR 5.x: /mnt/yarn belongs to the yarn user, application containers run as yarn, and they can create directories there.
In EMR 6.x things changed: containers now run as the hadoop user, which does not have write access to /mnt/yarn/.
The hadoop user can create directories in /mnt/, but yarn cannot, and I want to keep compatibility: the app should be able to run successfully on both EMR 5.x and 6.x.
java.io.tmpdir doesn't work either: it is different for each container.
What is the proper place to store a cache on the NVMe SSDs (/mnt, /mnt1) so that it is accessible to all containers and works on both EMR 5.x and 6.x?

On your EMR cluster, you can add the yarn user to the superuser group; by default, this group is called supergroup. You can confirm whether this is the right group by checking the dfs.permissions.superusergroup property in hdfs-site.xml.
You could also try modifying the following HDFS properties (in the same file): dfs.permissions.enabled or dfs.datanode.data.dir.perm.
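If it helps, here is a minimal sketch (not part of the original answer) for confirming the effective values by reading hdfs-site.xml directly; /etc/hadoop/conf is the usual location on EMR, and the fallback defaults shown follow hdfs-default.xml:

```python
import xml.etree.ElementTree as ET

HDFS_SITE = "/etc/hadoop/conf/hdfs-site.xml"  # usual EMR location, adjust if needed

def hdfs_property(name, default=None):
    # hdfs-site.xml is <configuration><property><name>..</name><value>..</value></property>...
    root = ET.parse(HDFS_SITE).getroot()
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value")
    return default

print("dfs.permissions.superusergroup =", hdfs_property("dfs.permissions.superusergroup", "supergroup"))
print("dfs.permissions.enabled        =", hdfs_property("dfs.permissions.enabled", "true"))
print("dfs.datanode.data.dir.perm     =", hdfs_property("dfs.datanode.data.dir.perm", "700"))
```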

Related

Integrating hadoop yarn with mesos infra

I have created an HDFS cluster. I have to configure YARN so that the YARN application master can create containers for job processing on the Mesos cluster on demand.
How can I integrate the HDFS cluster with the Mesos infrastructure so that it can create containers on Mesos?
I need to figure out a way to run the containers created by the application master on resources other than the YARN cluster (a client node, an edge node, or resources spun up through the Mesos infrastructure). Basically, I have to create an on-demand, compute-only cluster that can run the YARN apps once YARN's capacity is used up.
Mesos was created as a more generic version of YARN; they're not really intended to be used together (YARN apps cannot be deployed to Mesos). Spark apps are about the only processes in the whole Hadoop ecosystem that can be deployed (independently) to both.
Worth pointing out that Mesos was moved to the Apache Attic (edit: and quickly moved out, it seems, but there have been no releases since then). In other words, it's seen as deprecated. With a bit of configuration, YARN can run plain Docker containers, if that's what you're using Mesos for. Apache Twill was a library for creating distributed applications on top of YARN, but that's also in the Apache Attic (and stayed there).
You also don't need special configuration to communicate with HDFS from Mesos applications, only the hadoop-client dependency and configured core-site.xml and hdfs-site.xml files.

When running Spark on Kubernetes, is it possible to run as a user other than root?

When I submit a Spark job to Kubernetes, everything in the containers is run as root. Is it possible to run the job as another user?
When I submit the job in client mode, the driver runs as the user who submitted it, but the executors run as root, which can lead to file access problems when accessing files created by the executors.
Unless full customization of the K8s pod is supported by Spark on K8s (in particular the runAsUser feature), the only ways to control it (as I see it for the moment) are:
- Build a Docker image that specifies USER in the Dockerfile
- Use some advanced K8s tools/controllers, e.g. Argo Events
- Customize spark-submit, or submit Spark pods directly as Kubernetes pods through the K8s APIs
Hope to see some improvements coming with Spark v3.0.0 soonish though.
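Whichever of the options above you try, a quick way to verify the result is to check which uid the executors actually end up with. A minimal PySpark diagnostic sketch (the app name is arbitrary):

```python
import os
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("executor-uid-check").getOrCreate()
sc = spark.sparkContext

# Run one tiny task per default partition and collect the distinct uids seen.
uids = (sc.parallelize(range(sc.defaultParallelism))
          .map(lambda _: os.getuid())   # numeric uid works even without an /etc/passwd entry
          .distinct()
          .collect())

print("Driver uid:   ", os.getuid())
print("Executor uids:", uids)
spark.stop()
```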

Is it possible to run ANY application or program with HADOOP YARN?

I've been studying distributed computing recently and found out that Hadoop YARN is one such system.
So I thought that if I just set up a Hadoop YARN cluster, then every application would run distributed.
But now someone told me that Hadoop YARN cannot do anything by itself and needs other things like MapReduce, Spark, and HBase.
If this is correct, does that mean only a limited set of tasks can be run with YARN?
Or can I apply YARN's distributed computing to any application I want?
Hadoop is the name which refers to the entire system.
HDFS is the actual storage system. Think of it as S3 or a distributed Linux filesystem.
YARN is a framework for scheduling jobs and allocating resources. It handles these things for you, but you don't interact very much with it.
Spark and MapReduce are managed by Yarn. With these two, you can actually write your code/applications and give work to the cluster.
HBase uses HDFS storage (which is file based) and provides NoSQL storage.
Theoretically you can run more than just Spark and MapReduce on YARN, and you can use something other than YARN (Kubernetes is in the works or supported now). You can even write your own processing tool, queue/resource management system, storage... Hadoop has many pieces which you may use or not, depending on your case. But the majority of Hadoop systems use YARN and Spark.
If you want to deploy Docker containers, for example, a plain Kubernetes cluster would be a better choice. If you need batch/real-time processing with Spark, use Hadoop.
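To make the "give work to the cluster" part concrete, here is a minimal PySpark sketch of a job you would hand to YARN via spark-submit --master yarn; the input path is a placeholder:

```python
from operator import add
from pyspark.sql import SparkSession

# The master (yarn) and deploy mode normally come from spark-submit, e.g.:
#   spark-submit --master yarn --deploy-mode cluster wordcount_sketch.py
spark = SparkSession.builder.appName("yarn-wordcount-sketch").getOrCreate()

# Placeholder input path on HDFS.
lines = spark.read.text("hdfs:///data/sample.txt").rdd.map(lambda row: row[0])

counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(add))

for word, count in counts.take(20):
    print(word, count)

spark.stop()
```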
YARN itself is a resource manager. You will need to write code that can be deployed onto those resources, and then that could do anything, given that the nodes running the tasks are themselves capable of running the job. For example, you cannot distribute a Python library without first installing the dependencies for that script. Mesos is a bit more generalized/accessible than YARN, if you want more flexibility for the same effect.
YARN mostly supports running JAR files and shell scripts (at least, from Oozie); Docker containers can be deployed to it as well (refer to the Apache docs).
You may also refer to the Apache Slider or Twill projects for more information.
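As an illustration of "YARN itself is a resource manager" that only tracks and schedules whatever gets submitted to it, here is a small sketch (not from the answer above) that lists applications through the ResourceManager REST API; the hostname is a placeholder and 8088 is the usual default web port:

```python
import json
import urllib.request

# Placeholder ResourceManager address.
RM_APPS_URL = "http://resourcemanager.example.com:8088/ws/v1/cluster/apps"

with urllib.request.urlopen(RM_APPS_URL) as resp:
    payload = json.load(resp)

# "apps" is null in the JSON response when nothing has been submitted yet.
apps = (payload.get("apps") or {}).get("app", [])
for app in apps:
    # applicationType is SPARK, MAPREDUCE, etc.; YARN does not care what runs inside.
    print(app["id"], app["applicationType"], app["state"])
```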

Does EMR still have any advantages over EC2 for Spark?

I know this question has been asked before, but those answers seem to revolve around Hadoop. For Spark you don't really need all the extra Hadoop cruft. With the spark-ec2 script (available via GitHub for 2.0), your environment is prepared for Spark. Are there any compelling use cases (other than a far superior boto3 SDK interface) for running with EMR over EC2?
This question boils down to the value of managed services, IMHO.
Running Spark standalone in local mode only requires that you get the latest Spark, untar it, cd to its bin path, and run spark-submit, etc.
However, creating a multi-node cluster that runs in cluster mode requires that you actually do real networking, configuring, tuning, etc. This means you've got to deal with IAM roles, Security groups, and there are subnet considerations within your VPC.
When you use EMR, you get a turnkey cluster in which you can 1-click install many popular applications (Spark included), and all of the security groups are already configured properly for network communication between nodes. You've got logging already set up and pointing at S3, easy SSH instructions, an already-installed apparatus for tunneling and viewing the various UIs, and visual usage metrics at the IO level, node level, and job submission level. You also have the ability to create and run Steps, which are jobs that can be run on the command line of the driver node or as Spark applications that leverage the whole cluster. Then, on top of that, you can export that whole cluster, Steps included, copy-paste the CLI script into a recurring job via Data Pipeline, and literally create an ETL pipeline in 60 seconds flat.
You wouldn't get any of that if you built it yourself in EC2. I know which one I would choose... EMR. But that's just me.
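For a sense of what "turnkey" looks like in practice, here is a hedged boto3 sketch that spins up an EMR cluster with Spark installed and a single Step; the region, release label, roles, instance types, and S3 paths are placeholders, not recommendations:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")  # region is an assumption

response = emr.run_job_flow(
    Name="spark-etl-example",
    ReleaseLabel="emr-6.9.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"Name": "master", "InstanceRole": "MASTER",
             "InstanceType": "r5d.xlarge", "InstanceCount": 1},
            {"Name": "core", "InstanceRole": "CORE",
             "InstanceType": "r5d.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,  # terminate once the steps finish
    },
    Steps=[{
        "Name": "run-spark-job",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster",
                     "s3://my-bucket/jobs/etl_job.py"],  # placeholder job location
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
    LogUri="s3://my-bucket/emr-logs/",  # placeholder log bucket
)
print("Started cluster:", response["JobFlowId"])
```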

Cassandra 2+ HPC Deployment

I am trying to deploy Cassandra on a Linux-based HPC cluster and I need some guidelines if possible. Specifically, what is the difference between running Cassandra locally and in a cluster?
When running locally (in which case it runs smoothly), we duplicate the original files for every node inside our Cassandra directory and apply the appropriate changes for IP address, RPC, JMX, etc. However, when managing a network, which files do we need to install on each node: the whole package with all the files, or just some of the required ones,
like bin/cassandra.in.sh, conf/cassandra.yaml, and bin/cassandra?
I am a little bit confused about what to store on each node separately in order to start working on the cluster.
You need to install Cassandra on each node (VM), i.e. the whole package, and then update the config files as necessary. As described here, to configure a cluster in a single data center you need to (see the sketch after this list):
Install Cassandra on each node
Configure cluster name
Configure seeds
Configure snitch, if needed
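As mentioned above, here is a rough Python sketch (assuming PyYAML and the default SimpleSeedProvider layout) of the per-node edits to conf/cassandra.yaml; the IPs, cluster name, and snitch are placeholders, and rewriting the file this way drops the comments that ship with it:

```python
import yaml

SEEDS = "10.0.0.1,10.0.0.2"      # same seed list on every node (placeholder IPs)
CLUSTER_NAME = "hpc-cluster"     # same cluster name on every node (placeholder)

def patch_cassandra_yaml(path, node_ip):
    with open(path) as f:
        conf = yaml.safe_load(f)

    conf["cluster_name"] = CLUSTER_NAME
    conf["listen_address"] = node_ip   # differs per node
    conf["rpc_address"] = node_ip      # differs per node
    conf["endpoint_snitch"] = "GossipingPropertyFileSnitch"  # placeholder choice
    # Default layout: one SimpleSeedProvider entry with a "seeds" parameter.
    conf["seed_provider"][0]["parameters"][0]["seeds"] = SEEDS

    with open(path, "w") as f:
        # Note: this rewrites the file without the original comments.
        yaml.safe_dump(conf, f, default_flow_style=False)

patch_cassandra_yaml("conf/cassandra.yaml", "10.0.0.3")
```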
