I am setting up Spark on a Hadoop YARN cluster on AWS EC2 machines.
This cluster will be ephemeral (up for a few hours within a day), and hence I want to forward the container logs it generates to S3.
I have seen that Amazon EMR supports this feature by forwarding logs to S3 every 5 minutes.
Is there any built-in configuration inside Hadoop/Spark that I can leverage?
Any other solution to this problem would also be helpful.
Sounds like you're looking for YARN log aggregation.
I haven't tried changing it myself, but you can configure yarn.nodemanager.remote-app-log-dir to point to an S3 filesystem, assuming you've set up your core-site.xml accordingly.
yarn.log-aggregation.retain-seconds and yarn.log-aggregation.retain-check-interval-seconds will determine how often the YARN containers ship out their logs.
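For reference, a minimal sketch of what those yarn-site.xml settings might look like; the s3a bucket path and the retention values are placeholders, and the fs.s3a credentials/implementation still have to be configured in core-site.xml as noted above:

```xml
<!-- yarn-site.xml (sketch): enable log aggregation and point it at S3 -->
<property>
  <name>yarn.log-aggregation-enable</name>
  <value>true</value>
</property>
<property>
  <name>yarn.nodemanager.remote-app-log-dir</name>
  <value>s3a://my-log-bucket/yarn-app-logs</value>
</property>
<property>
  <name>yarn.log-aggregation.retain-seconds</name>
  <value>86400</value>
</property>
<property>
  <name>yarn.log-aggregation.retain-check-interval-seconds</name>
  <value>3600</value>
</property>
```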
The alternative solution would be to build your own AMI that has Fluentd or Filebeat pointing at the local YARN log directories, then set up those log forwarders to write to a remote location. For example, Elasticsearch (or one of the AWS log solutions) would be a better choice than just S3.
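If you go the Fluentd route, a rough sketch of such a forwarder config could look like the following; the log path, Elasticsearch host, and tag are assumptions for illustration, and the elasticsearch output requires the fluent-plugin-elasticsearch plugin:

```
# fluentd.conf (sketch): tail local YARN container logs and forward them
<source>
  @type tail
  path /var/log/hadoop-yarn/containers/*/*/*.log
  pos_file /var/log/td-agent/yarn-containers.pos
  tag yarn.container
  <parse>
    @type none
  </parse>
</source>

<match yarn.container>
  @type elasticsearch
  host elasticsearch.internal
  port 9200
  logstash_format true
</match>
```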
In our Spark application, we store a local application cache in the /mnt/yarn/app-cache/ directory, which is shared between app containers on the same EC2 instance.
/mnt/... is chosen because it is a fast NVMe SSD on r5d instances.
This approach worked well for several years on EMR 5.x: /mnt/yarn belongs to the yarn user, app containers run as yarn, and it can create directories there.
In EMR 6.x things changed: containers now run as the hadoop user, which does not have write access to /mnt/yarn/.
The hadoop user can create directories in /mnt/, but yarn cannot, and I want to keep compatibility: the app should be able to run successfully on both EMR 5.x and 6.x.
java.io.tmpdir also doesn't work, as it is different for each container.
What is the proper place to store the cache on the NVMe SSDs (/mnt, /mnt1) so that it is accessible by all containers and works on both EMR 5.x and 6.x?
On your EMR cluster, you can add the yarn user to the superuser group; by default, this group is called supergroup. You can confirm that this is the right group by checking the dfs.permissions.superusergroup property in the hdfs-site.xml file.
You could also try modifying the following HDFS properties (in the same file): dfs.permissions.enabled or dfs.datanode.data.dir.perm.
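A minimal sketch of how you might check and apply this on the cluster nodes; it assumes the superuser group is indeed named supergroup and exists as an OS-level group:

```sh
# Check which group HDFS treats as the superuser group
hdfs getconf -confKey dfs.permissions.superusergroup

# Add the yarn user to that group (run on the relevant nodes)
sudo usermod -a -G supergroup yarn
```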
I am running Dataproc and submitting Spark Jobs using the default client-mode.
The logs for the jobs are visible in the GCP console and are available in the GCS bucket. However, I would like to see the logs in Stackdriver Logging.
Currently, the only way I found was to use cluster-mode instead.
Is there a way to push logs to Stackdriver when using client-mode?
This is something the Dataproc team is actively working on and should have a solution for you sometime soon. If you want to file a public feature request for tracking this that is an option, but I will try to update this response when this feature is usable by you.
Digging into it a bit, the reason why you can see the logs when using cluster-mode is that we have Fluentd configurations that pick up YARN container logs (userlogs) by default. When running in cluster-mode the driver runs in a YARN container and those logs are picked up by that configuration.
Currently, output produced by the driver is forwarded directly to GCS by the Dataproc agent. In the future there will be an option to have all driver output sent to Stackdriver when starting a cluster.
Update:
This feature is now in Beta and is stable to use. When creating a cluster, the property "dataproc:dataproc.logging.stackdriver.job.driver.enable" can be used to toggle whether the cluster will send Job driver logs to Stackdriver. Additionally, you can use the property "dataproc:dataproc.logging.stackdriver.job.yarn.container.enable" to have the cluster associate YARN container logs with the Job they were created by instead of the Cluster they ran on.
Documentation is available here
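As a concrete illustration, creating a cluster with both properties enabled might look like the following; the cluster name and region are placeholders:

```sh
gcloud dataproc clusters create my-cluster \
  --region=us-central1 \
  --properties="dataproc:dataproc.logging.stackdriver.job.driver.enable=true,dataproc:dataproc.logging.stackdriver.job.yarn.container.enable=true"
```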
I'm running a Spark job on EMR (YARN, cluster mode, transient: the cluster shuts down after the job is done) with debug mode turned on. All Spark logs are uploaded to S3 as expected, but I can't upload my own custom logs...
Using log4j, I'm trying to write them to the following path according to the Spark docs: log4j.appender.algoLog.File=${spark.yarn.app.container.log.dir}/algoLog.log
It seems like the variable is undefined; it tries to write directly to the root, /algoLog.log.
If I write them to some other arbitrary location, they just don't appear on S3.
Where should I write my own log files if I want EMR to upload them to S3 after the cluster shuts down?
Log4j isn't set up to write to object stores; its notion of a filesystem is different.
You may be able to get YARN to do it with its log collection. See How to keep YARN's log files?
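One common pattern (a sketch, not EMR-specific guidance) is to ship a custom log4j.properties with the job so the appender writes inside the YARN container log directory, which log aggregation then collects; the file and application names below are placeholders:

```sh
spark-submit \
  --master yarn --deploy-mode cluster \
  --files log4j.properties \
  --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:log4j.properties" \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:log4j.properties" \
  my_app.jar
```

With the config distributed this way, ${spark.yarn.app.container.log.dir} should be defined inside each container's JVM, so the appender path from the question resolves under the container log directory rather than the root.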
I know this question has been asked before, but those answers seem to revolve around Hadoop. For Spark you don't really need all the extra Hadoop cruft. With the spark-ec2 script (available via GitHub for 2.0), your environment is prepared for Spark. Are there any compelling use cases (other than a far superior boto3 SDK interface) for running with EMR over EC2?
This question boils down to the value of managed services, IMHO.
Running Spark standalone in local mode only requires that you get the latest Spark, untar it, cd to its bin path, and run spark-submit, etc.
However, creating a multi-node cluster that runs in cluster mode requires that you actually do real networking, configuring, tuning, etc. This means you've got to deal with IAM roles, Security groups, and there are subnet considerations within your VPC.
When you use EMR, you get a turnkey cluster in which you can 1-click install many popular applications (Spark included), and all of the Security Groups are already configured properly for network communication between nodes. You've got logging already set up and pointing at S3, easy SSH instructions, an already-installed apparatus for tunneling and viewing the various UIs, and visual usage metrics at the IO level, node level, and job-submission level. You also have the ability to create and run Steps, which are jobs that can be run on the command line of the driver node or as Spark applications that leverage the whole cluster. Then, on top of that, you can export that whole cluster, Steps included, copy-paste the CLI script into a recurring job via Data Pipeline, and literally create an ETL pipeline in 60 seconds flat.
You wouldn't get any of that if you built it yourself in EC2. I know which one I would choose... EMR. But that's just me.
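For a sense of what that exported CLI script looks like, here is an illustrative (not definitive) create-cluster call with Spark, S3 logging, and a Step; the release label, instance settings, bucket, and script path are placeholders:

```sh
aws emr create-cluster \
  --name "spark-cluster" \
  --release-label emr-6.10.0 \
  --applications Name=Spark \
  --instance-type m5.xlarge --instance-count 3 \
  --use-default-roles \
  --log-uri s3://my-bucket/emr-logs/ \
  --steps 'Type=Spark,Name=MyJob,ActionOnFailure=CONTINUE,Args=[--deploy-mode,cluster,s3://my-bucket/jobs/my_job.py]'
```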
I am trying to set up Hadoop permanently on Amazon EC2. Currently, every morning I launch EC2 instances and set up Hadoop. Is there any way I can avoid this tedious step? I am looking for a Hadoop image that can be loaded on EC2 and make things easy for me.
I know I can use EMR for Hadoop services. But I don't know how to start an EMR (Hadoop) cluster without submitting a job flow. I mean I need a Hadoop cluster without any jobs running in it.
Ultimately my aim is to run bioinformatics applications like Distmap and Seal. These applications have many dependencies, so I need an idle Hadoop cluster where I can set up the environment and then run them.
I hope it's clear what I am trying to do.
Thanks.
What you can do is one of the below:
Option 1. Start out with an EBS-backed EC2 instance running your favourite Linux distro. Go ahead and install the Hadoop software that you need, and create as many EC2 instances as the types of instances you are going to need (master/slaves/etc.). You can then create your own AMIs in the AWS Console (right-click on the EC2 instance and click "Create AMI") and launch your own instances, as many as you need, based on those AMIs. You can also create AMIs from instance-store-backed instances, but that will mean dumping everything to S3 and creating an AMI from there. There are a lot of tutorials about this available; please leave a comment if you need directions :)
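If you prefer the CLI to the console for the AMI step in Option 1, a minimal sketch (the instance ID and names are placeholders) would be:

```sh
aws ec2 create-image \
  --instance-id i-0123456789abcdef0 \
  --name "hadoop-master-node" \
  --description "Hadoop master with dependencies pre-installed"
```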
Option 2. Start out with a Hadoop-based AMI and repeat the steps above after doing your own configuration / adding your dependencies. I went ahead and searched for Hadoop AMIs from the AWS console and there are 48 in EU-West-1 (not sure which region you're working in).
Option 3. Start an EMR cluster in interactive mode. There is also an option to keep the cluster alive after finishing job flows. If you also set the EC2 keys for the EMR instances, you should be able to SSH into them and have a functional Hadoop cluster (not sure about the dependencies though; you might be better off rolling your own).
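A sketch of what Option 3 could look like with the CLI; the release label, key name, and instance settings are assumptions and should be adapted:

```sh
aws emr create-cluster \
  --name "interactive-hadoop" \
  --release-label emr-6.10.0 \
  --applications Name=Hadoop \
  --instance-type m5.xlarge --instance-count 3 \
  --use-default-roles \
  --ec2-attributes KeyName=my-ssh-key \
  --no-auto-terminate
```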
I hope I understood correctly what you're trying to achieve and this helps a little bit.
This is more of a configuration management and automation problem. Try configuration management tools like Chef or Puppet to get this done according to your needs.