How to set up Spark in AWS ECS - apache-spark

I have a Spark multi-master cluster managed by ZooKeeper and I want to move this multi-master architecture to AWS ECS; AWS EMR is not an option.
The question is how ECS load balancing will work, taking into account that every component (spark-master, spark-worker, zookeeper, livy) is deployed in a different ECS service. Thanks.

It cannot be done that way: the Spark components are connected through a config file at compile time, so the solution is to put them in one task definition and use the loopback interface, which provides a static IP (127.0.0.1) and hostname (localhost); e.g., a worker connecting to the master will use spark://127.0.0.1:7077.
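For illustration, a minimal boto3 sketch of that single-task layout, assuming placeholder image names, awsvpc networking and the default Spark/ZooKeeper ports; because all containers sit in one task definition they share the task's network namespace, so the worker can reach the master on 127.0.0.1:7077:

import boto3

ecs = boto3.client("ecs")

# All Spark components live in one task definition so they share the loopback
# interface; the image names below are placeholders.
ecs.register_task_definition(
    family="spark-standalone",
    networkMode="awsvpc",  # containers in the task share one network namespace
    containerDefinitions=[
        {"name": "zookeeper", "image": "my-registry/zookeeper:3.8",
         "memory": 1024,
         "portMappings": [{"containerPort": 2181}]},
        {"name": "spark-master", "image": "my-registry/spark:3.3",
         "memory": 2048,
         # keep the start script in the foreground so the container stays up
         "environment": [{"name": "SPARK_NO_DAEMONIZE", "value": "1"}],
         "command": ["/opt/spark/sbin/start-master.sh"],
         "portMappings": [{"containerPort": 7077}, {"containerPort": 8080}]},
        {"name": "spark-worker", "image": "my-registry/spark:3.3",
         "memory": 2048,
         "environment": [{"name": "SPARK_NO_DAEMONIZE", "value": "1"}],
         # the worker reaches the master over the shared loopback address
         "command": ["/opt/spark/sbin/start-worker.sh", "spark://127.0.0.1:7077"],
         "portMappings": [{"containerPort": 8081}]},
    ],
)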

Related

Spark cluster on Kubernetes without spark-submit

I have a spark application and want to deploy this on a Kubernetes cluster.
Following the documentation below, I have managed to create an empty Kubernetes cluster, generated a Docker image using the Dockerfile provided under kubernetes/dockerfiles/spark/Dockerfile, and deployed this on the cluster using spark-submit in a dev environment.
https://spark.apache.org/docs/latest/running-on-kubernetes.html
However, in a 'proper' environment we have a managed Kubernetes cluster (bespoke unlike EKS etc.) and will have to provide pod configuration files to get deployed.
I believe you can supply a pod template file as an argument to the spark-submit command.
https://spark.apache.org/docs/latest/running-on-kubernetes.html#pod-template
How can I do this without spark-submit? And are there any example yaml files?
PS: we have limited access to this cluster, e.g. we can install Helm charts but not an operator or controller.
You could try to use the Kubernetes Spark operator CRD (https://github.com/GoogleCloudPlatform/spark-on-k8s-operator) and provide a pod configuration through it.
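A rough sketch of how you might create a SparkApplication object from code with the official Kubernetes Python client, assuming the operator is already installed (e.g. through its Helm chart); the namespace, image and resource sizes are placeholders:

from kubernetes import client, config

# Assumes kubeconfig access to the target namespace is available.
config.load_kube_config()

spark_app = {
    "apiVersion": "sparkoperator.k8s.io/v1beta2",
    "kind": "SparkApplication",
    "metadata": {"name": "spark-pi", "namespace": "spark-jobs"},  # placeholder namespace
    "spec": {
        "type": "Scala",
        "mode": "cluster",
        "image": "my-registry/spark:3.3.0",  # placeholder image
        "mainClass": "org.apache.spark.examples.SparkPi",
        "mainApplicationFile": "local:///opt/spark/examples/jars/spark-examples_2.12-3.3.0.jar",
        "sparkVersion": "3.3.0",
        "driver": {"cores": 1, "memory": "1g", "serviceAccount": "spark"},
        "executor": {"cores": 1, "instances": 2, "memory": "1g"},
    },
}

# The operator watches SparkApplication objects and launches the driver and
# executor pods itself, so no spark-submit call is needed from the client side.
client.CustomObjectsApi().create_namespaced_custom_object(
    group="sparkoperator.k8s.io",
    version="v1beta2",
    namespace="spark-jobs",
    plural="sparkapplications",
    body=spark_app,
)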

Spark history-server stderr and stdout logs location when working on S3

I deployed a Spark history server that is supposed to serve multiple environments:
all Spark clusters will write to one bucket and the history server will read from that bucket.
I got everything working and set up, but when I try to access the stdout/stderr of a certain task it addresses the private IP of the worker that the task was running on (e.g. http://10.192.21.80:8081/logPage/?appId=app-20220510103043-0001&executorId=1&logType=stderr).
I want to access those logs from the UI, but of course there is no access to those internal IPs (private subnets and private IPs). Isn't there a way to also upload those stderr/stdout logs to the bucket and then access them from the history server UI?
I couldn't find anything in the documentation.
I am assuming that you are referring to running Spark jobs on AWS EMR here.
If you have enabled logging to an S3 bucket on your cluster [1], all the application logs are written to the S3 bucket path specified while launching the cluster.
You should find the logs in the following path:
s3://<bucketpath>/<clusterid>/containers/<applicationId>/<containerId>/stdout.gz
s3://<bucketpath>/<clusterid>/containers/<applicationId>/<containerId>/stderr.gz
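For example, a small boto3 sketch that pulls one of those compressed logs straight from S3; the bucket, cluster ID, application ID and container ID below are placeholders:

import gzip
import boto3

s3 = boto3.client("s3")

# All identifiers are placeholders; substitute your own bucket path, EMR
# cluster ID, YARN application ID and container ID.
bucket = "my-emr-logs"
key = ("j-1ABC2DEF3GHIJ/containers/application_1652345678901_0001/"
       "container_1652345678901_0001_01_000001/stderr.gz")

# EMR stores these logs gzip-compressed, so decompress before printing.
obj = s3.get_object(Bucket=bucket, Key=key)
print(gzip.decompress(obj["Body"].read()).decode("utf-8", errors="replace"))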
Hope the above information was helpful!
References:
[1] Configure cluster logging and debugging - Enable the debugging tool - https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-debugging.html#emr-plan-debugging-logs-archive-debug

How do I deregister old IP from NLB and update it with the new one using lambda function?

I have an EMR cluster running on AWS, and the target group of my NLB points to the IP of the master node of the EMR cluster. When this cluster terminates and is created again, I want the target group to point to the new IP of the cluster using a Lambda function.
How should I go about it? I will use AWS EventBridge to trigger the Lambda function every five minutes. Thanks in advance for your help!
Your Lambda function has to use register_targets and deregister_targets to add and remove targets from your target group.
It also needs permissions to do so in its execution role.
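A rough sketch of such a Lambda handler with boto3, assuming a single active cluster; the target group ARN and port are placeholders that would normally come from environment variables:

import boto3

# Placeholder values; in practice read these from environment variables.
TARGET_GROUP_ARN = "arn:aws:elasticloadbalancing:eu-west-1:123456789012:targetgroup/emr-master/abc123"
TARGET_PORT = 8443

emr = boto3.client("emr")
elbv2 = boto3.client("elbv2")

def lambda_handler(event, context):
    # Find the currently active EMR cluster (assumes there is exactly one).
    cluster_id = emr.list_clusters(ClusterStates=["RUNNING", "WAITING"])["Clusters"][0]["Id"]

    # Look up the private IP of the master node.
    master = emr.list_instances(ClusterId=cluster_id, InstanceGroupTypes=["MASTER"])["Instances"][0]
    master_ip = master["PrivateIpAddress"]

    # IPs currently registered in the NLB target group.
    health = elbv2.describe_target_health(TargetGroupArn=TARGET_GROUP_ARN)
    current_ips = {t["Target"]["Id"] for t in health["TargetHealthDescriptions"]}

    # Deregister stale IPs and register the new master IP if it changed.
    stale = [{"Id": ip, "Port": TARGET_PORT} for ip in current_ips if ip != master_ip]
    if stale:
        elbv2.deregister_targets(TargetGroupArn=TARGET_GROUP_ARN, Targets=stale)
    if master_ip not in current_ips:
        elbv2.register_targets(TargetGroupArn=TARGET_GROUP_ARN,
                               Targets=[{"Id": master_ip, "Port": TARGET_PORT}])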

How to pass scripts during AWS ECS cluster provisioning?

I'm learning AWS ECS and am able to run tasks/containers on ECS container instances. The problem is that there are certain scripts I was executing manually on the underlying EC2 instances (provisioned as ECS container instances) after SSHing (port 22) into them. Is there a better way to automate running the following script whenever the ECS cluster is set up, or whenever autoscaling kicks in:
sudo sysctl -w vm.max_map_count=262144
In your launch template (LT) or launch configuration (LC) for the autoscaling group of container instances, you can specify user_data. This is the place where you can add your bootstrap code for the instances.
You are probably already using user_data, as your instances need to be provided with the ECS cluster name to which they belong. So you can create a new LT/LC that also executes the sysctl command when the instances are launched by your autoscaling group.
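For illustration, a boto3 sketch that creates such a launch template, assuming a placeholder AMI, instance type, cluster name and template name; the user_data joins the ECS cluster and then applies the sysctl setting on every instance the ASG launches:

import base64
import boto3

ec2 = boto3.client("ec2")

# The script runs on every instance launched by the ASG: it joins the ECS
# cluster (as the ECS-optimized AMI expects) and applies the sysctl setting.
# Cluster name, template name, AMI and instance type are placeholders.
user_data = """#!/bin/bash
echo "ECS_CLUSTER=my-ecs-cluster" >> /etc/ecs/ecs.config
sysctl -w vm.max_map_count=262144
echo "vm.max_map_count=262144" >> /etc/sysctl.conf  # persist across reboots
"""

ec2.create_launch_template(
    LaunchTemplateName="ecs-instances-with-sysctl",
    LaunchTemplateData={
        "ImageId": "ami-0123456789abcdef0",  # placeholder ECS-optimized AMI
        "InstanceType": "m5.large",
        "UserData": base64.b64encode(user_data.encode()).decode(),
    },
)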

aws_emr_cluster - is it possible to retrieve the instance identifiers

I am creating an EMR cluster using the aws_emr_cluster resource in terraform.
I need to get access to the instance ID of the underlying EC2 hardware, specifically the MASTER node.
It does not appear in the attributes, nor when I perform a terraform show.
The data definitely exists and is available in AWS.
Does anyone know how I can get at this value and how to do it using Terraform?
You won't be able to access the nodes (EC2 instances) of an EMR cluster through Terraform. It is the same case for Auto Scaling groups too.
If Terraform tracked EMR or ASG nodes, the state file would change every time something changed in the EMR cluster or ASG, so storing the instance information isn't ideal for Terraform.
Instead, you can use AWS SDK/CLI/boto3 to see them.
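For example, a small boto3 sketch that looks up the master node's EC2 instance ID for a given cluster ID (the cluster ID itself can still come from the aws_emr_cluster resource's id attribute):

import boto3

emr = boto3.client("emr", region_name="eu-west-1")  # region is an assumption

# The cluster ID would typically be exported from Terraform
# (aws_emr_cluster.<name>.id); the value below is a placeholder.
cluster_id = "j-1ABC2DEF3GHIJ"

# Restricting the call to the MASTER instance group returns only the master node(s).
response = emr.list_instances(ClusterId=cluster_id, InstanceGroupTypes=["MASTER"])
for instance in response["Instances"]:
    print(instance["Ec2InstanceId"], instance["PrivateIpAddress"])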
Thanks.
