I ran the spark-ec2 script with --ebs-vol-size=1000 (and the 1000GB volumes are attached), but hadoop dfsadmin -report shows only:
Configured Capacity: 396251299840 (369.04 GB)
per node. How do I increase the space or tell HDFS to use the full capacity?
Run lsblk and see where the volume is mounted; it is probably /vol0. In your hdfs-site.xml, append /vol0 to the dfs.data.dir value, comma-separated after the existing default. Copy this file to all slaves and restart the cluster. You should see the full capacity now.
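For illustration, the property could end up looking like the snippet below; the first path is only a placeholder for whatever default is already in your hdfs-site.xml, so keep your existing value and just append the new mount point.
<property>
  <name>dfs.data.dir</name>
  <!-- existing default first (placeholder here), then the new EBS mount -->
  <value>/existing/hdfs/data/dir,/vol0</value>
</property>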
Revisiting the data locality for Spark on Kubernetes question: if the Spark pods are colocated on the same nodes as the HDFS datanode pods, does data locality work?
The Q&A session here: https://www.youtube.com/watch?v=5-4X3HylQQo seems to suggest it doesn't.
Locality is an issue with Spark on Kubernetes. Basic data locality does work if the Kubernetes provider supplies the network topology plugins required to resolve where the data is and where the Spark pods should be run, and you have built Kubernetes to include the code here.
There is a method to test this data locality. I have copied it here for completeness:
Here's how one can check if data locality in the namenode works.
1. Launch an HDFS client pod and go inside the pod.
$ kubectl run -i --tty hadoop --image=uhopper/hadoop:2.7.2 \
    --generator="run-pod/v1" --command -- /bin/bash
2. Inside the pod, create a simple text file on HDFS.
$ hadoop fs \
    -fs hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local \
    -cp file:/etc/hosts /hosts
3. Set the number of replicas for the file to the number of your cluster nodes. This ensures that there will be a copy of the file in the cluster node that your client pod is running on. Wait some time until this happens.
`$ hadoop fs -setrep NUM-REPLICAS /hosts`
4. Run the following hdfs cat command. From the debug messages, see which datanode is being used and make sure it is your local datanode. (You can get the local node's IP from $ kubectl get pods hadoop -o json | grep hostIP; run this outside the pod.)
$ hadoop --loglevel DEBUG fs \
    -fs hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local \
    -cat /hosts
... 17/04/24 20:51:28 DEBUG hdfs.DFSClient: Connecting to datanode 10.128.0.4:50010 ...
If not, check whether your local datanode is even in the list from the debug messages above. If it is not, then step (3) has not finished yet; wait longer. (You can use a smaller cluster for this test if that is possible.)
`17/04/24 20:51:28 DEBUG hdfs.DFSClient: newInfo = LocatedBlocks{ fileLength=199 underConstruction=false blocks=[LocatedBlock{BP-347555225-10.128.0.2-1493066928989:blk_1073741825_1001; getBlockSize()=199; corrupt=false; offset=0; locs=[DatanodeInfoWithStorage[10.128.0.4:50010,DS-d2de9d29-6962-4435-a4b4-aadf4ea67e46,DISK], DatanodeInfoWithStorage[10.128.0.3:50010,DS-0728ffcf-f400-4919-86bf-af0f9af36685,DISK], DatanodeInfoWithStorage[10.128.0.2:50010,DS-3a881114-af08-47de-89cf-37dec051c5c2,DISK]]}] lastLocatedBlock=LocatedBlock{BP-347555225-10.128.0.2-1493066928989:blk_1073741825_1001;`
5. Repeat the hdfs cat command multiple times. Check if the same datanode is being consistently used.
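One hedged way to automate that repetition from inside the client pod is a small loop that reruns the cat and greps the datanode line out of the debug output (Hadoop's console logging normally goes to stderr, hence the redirection):
$ for i in 1 2 3 4 5; do
    hadoop --loglevel DEBUG fs \
      -fs hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local \
      -cat /hosts 2>&1 >/dev/null | grep "Connecting to datanode"
  done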
I want to reduce the size of the EBS volume from 250GB to 100GB. I know it can't be done directly from the console. That's why I have tried a few links like Decrease the size of EBS volume in your EC2 instance and Amazon EBS volumes: How to Shrink ’em Down to Size, which haven't helped me. Maybe this works for plain data, but in my case I have to do it on /opt, which holds installations and configuration.
Please let me know if it is possible to do, and how.
Mount a new volume to /opt2, copy all the files from /opt with rsync or something that preserves the links etc., update your /etc/fstab, and reboot.
If all is good, unmount the old volume and detach it from the EC2 instance.
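A minimal sketch of that procedure, assuming the new 100GB volume appears as /dev/xvdf (check lsblk for the real device name on your instance):
$ sudo mkfs -t ext4 /dev/xvdf                 # format the new, smaller volume
$ sudo mkdir /opt2 && sudo mount /dev/xvdf /opt2
$ sudo rsync -aHAXv /opt/ /opt2/              # -a keeps links, perms and times
$ sudo blkid /dev/xvdf                        # note the UUID for /etc/fstab
# then point the /opt entry in /etc/fstab at the new UUID and reboot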
Hi Laurel and Jayesh, basically you have to follow these instructions:
First, shut down the instance (MyInstance) to prevent any problems.
Create a new 6GiB EBS volume.
Mount the new volume (myVolume)
Copy data from the old volume to the new volume (myVolume)
Use rsync to copy from the old volume to the new volume (myVolume): sudo rsync -axv / /mnt/myVolume/.
Wait until it’s finished. ✋
Install GRUB
Install GRUB on myVolume (the exact command is in the reference below; a hedged sketch also follows this answer).
Log out from the instance and shut it down.
Detach old volume and attach the new volume (myVolume) to /dev/xvda
Start the instance; you will see it is now running with a 6GiB EBS volume.
Reference: https://www.svastikkka.com/2021/04/create-custom-ami-with-default-6gib.html
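A hedged sketch of the sequence above; the device name /dev/xvdf for the new volume and the grub-install invocation are assumptions, so verify both against the reference before running anything:
$ sudo mkfs -t ext4 /dev/xvdf                      # new, smaller volume
$ sudo mkdir -p /mnt/myVolume && sudo mount /dev/xvdf /mnt/myVolume
$ sudo rsync -axv / /mnt/myVolume/                 # copy the root filesystem
$ sudo grub-install --root-directory=/mnt/myVolume /dev/xvdf   # assumed GRUB step
$ sudo umount /mnt/myVolume
# detach the old volume and attach the new one as /dev/xvda from the console, then start the instance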
We have provisioned an 11-node (1 master + 10 core) EMR cluster in AWS, with 100 GB of disk space chosen for each node.
When the cluster is provisioned, EMR automatically allocates only 10GB to the root partition (/dev/xvda1). After some days the root partition filled up, and because of this we couldn't run any jobs or install basic software like git using yum.
[hadoop@<<ip address>> ~]$ df -BG
Filesystem 1G-blocks Used Available Use% Mounted on
devtmpfs 79G 1G 79G 1% /dev
tmpfs 79G 0G 79G 0% /dev/shm
/dev/xvda1 10G 10G 0G 100% /
/dev/xvdb1 5G 1G 5G 4% /emr
/dev/xvdb2 95G 12G 84G 12% /mnt
/dev/xvdf 99G 12G 83G 12% /data
Could you please help us resolve this issue?
How do we increase the root partition (/dev/xvda1) disk space to 30GB?
By default all installations using yum or rpm go to the root partition (/dev/xvda1). How do we stop software installs from filling the root partition (/dev/xvda1)?
Whatever the solution, it should not disturb the existing EMR installation.
Help would be much appreciated.
Recently ran into the same issue. Find the corresponding EC2 instance and, in its Description tab, find and click the root device link. It points to an EBS ID; click on it. Under Actions, click Modify Volume and request the required total space. You might additionally have to run commands such as growpart so the OS picks up the new size.
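For example, once the EBS modification completes, something along these lines grows the partition and filesystem (assuming the root device is /dev/xvda with partition 1 and an ext4 filesystem; use xfs_growfs for XFS):
$ sudo growpart /dev/xvda 1     # grow partition 1 into the enlarged volume
$ sudo resize2fs /dev/xvda1     # grow the ext4 filesystem (xfs_growfs / for XFS)
$ df -h /                       # confirm the new size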
All EMR AMIs come with a fixed root volume of 10GB, and so will all EC2 instances of your EMR cluster. All applications that you select on EMR are installed on this root volume and are expected to take about 90% of this disk. At this moment, neither increasing this volume size nor the application installation behavior can be altered. So you should refrain from using this root volume to install applications, and instead install your custom apps on bigger volumes like /mnt/. You can also symlink some root directories to bigger volumes and then install your apps.
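A minimal sketch of that symlink approach; /opt/myapp is a hypothetical directory, and this should only be done for directories you own, not for EMR-managed ones:
$ sudo mkdir -p /mnt/opt-myapp
$ sudo rsync -a /opt/myapp/ /mnt/opt-myapp/   # move existing contents, if any
$ sudo rm -rf /opt/myapp
$ sudo ln -s /mnt/opt-myapp /opt/myapp        # the root path now points at the big /mnt volume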
Seems like /var/aws/emr/packages takes most of the space (30%). I don't know whether this folder can safely be removed with rm -rf /var/aws/emr/packages or should be symlinked to /mnt, but removing it seems to have worked for me.
The EBS root volume size can also be increased at the time of launching the EMR cluster. The default is 10GB.
Once the EMR cluster is up and running, the root volume can also be increased. Refer to this AWS Knowledge Center article: https://aws.amazon.com/premiumsupport/knowledge-center/ebs-volume-size-increase/
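If you launch with the AWS CLI, the root volume size can be set up front; a sketch with the cluster options trimmed down (release label, instance type and count are placeholders, adjust to your setup):
$ aws emr create-cluster \
    --name "my-cluster" \
    --release-label emr-5.30.0 \
    --instance-type m5.xlarge --instance-count 11 \
    --applications Name=Hadoop Name=Spark \
    --ebs-root-volume-size 30 \
    --use-default-roles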
Currently the commitlog directory points to Directory1. I want to change it to a different directory, D2. How should the migration be done?
This is how we did it. We have a load-balanced client that talks to Cassandra 1.1.2, and each client lives on each Cassandra node.
Drain your service.
Wait for the load balancer to remove the node.
Stop your service on the local node to halt direct Client-Cassandra writes: systemctl stop <your service name>
At this point there should be no more writes and greatly reduced disk activity:
iostat 2 - Disk activity should be near zero
nodetool gossipinfo
Disable Cassandra gossip protocol to mark the node dead and halt Cassandra-Cassandra writes: nodetool disablegossip
Flush all contents of the commit log into SSTables: nodetool flush
Drain the node – this command is more important than nodetool flush, (and might include all the behaviour of nodetool flush): nodetool drain
Stop the cassandra process: systemctl stop cassandra
Modify Cassandra config file(s), e.g. vi /etc/cassandra/default.conf/cassandra.yaml
Start Cassandra: systemctl start cassandra
Wait 10-20 minutes. Tail Cassandra logs to follow along, e.g. tail -F /var/log/cassandra/system.log
Confirm ring is healthy before moving on to next node: nodetool ring
Re-start client service: systemctl start <your service here>
Note that there was no need for us to manually copy the commitlog files themselves; flushing and draining took care of that. The files then slowly reappeared in the new commitlog_directory location.
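Put together, the per-node sequence looks roughly like this (a sketch assuming systemd, the config path from the steps above, and that Cassandra runs as the cassandra user; the new path /data/commitlog is only an example):
$ systemctl stop <your-service-name>      # stop local client writes
$ nodetool disablegossip                  # mark the node dead to its peers
$ nodetool flush                          # flush memtables to SSTables
$ nodetool drain                          # flush and stop accepting writes
$ systemctl stop cassandra
$ sudo vi /etc/cassandra/default.conf/cassandra.yaml   # set commitlog_directory: /data/commitlog
$ sudo mkdir -p /data/commitlog && sudo chown cassandra:cassandra /data/commitlog
$ systemctl start cassandra
$ nodetool ring                           # confirm the ring is healthy
$ systemctl start <your-service-name>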
You can change the commit log directory in cassandra.yaml (key: "commitlog_directory") and copy all logs to the new destination (see docs):
commitlog_directory
The directory where the commit log is stored. Default locations:
Package installations: /var/lib/cassandra/commitlog
Tarball installations: install_location/data/commitlog
For optimal write performance, place the commit log on a separate disk partition, or (ideally) a separate physical device from the data file directories. Because the commit log is append only, an HDD is acceptable for this purpose.
If you are using bitnami/cassandra containers, this should be done using this env var (see docs):
CASSANDRA_COMMITLOG_DIR: Directory where the commit logs will be stored. Default: /bitnami/cassandra/data/commitlog
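A quick sketch of how that could be passed when starting the container (the host mount path, container path and image tag are assumptions):
$ docker run -d --name cassandra \
    -e CASSANDRA_COMMITLOG_DIR=/bitnami/cassandra/data/commitlog-new \
    -v /fast-disk/commitlog:/bitnami/cassandra/data/commitlog-new \
    bitnami/cassandra:latest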
I have a single-node MemSQL cluster:
RAM: 16GB
Cores: 4
Ubuntu 14.04
I have Spark deployed on this MemSQL cluster for ETL purposes.
I am unable to configure Spark on MemSQL.
How do I set a rotation policy for the Spark work directory /var/lib/memsql-ops/data/spark/install/work/?
How can I change the path?
How large should spark.executor.memory be set to avoid OutOfMemoryExceptions?
How do I set different configuration settings for Spark deployed on a MemSQL cluster?
Hopefully the following will fix your issue:
See spark.worker.cleanup.enabled and related configuration options: https://spark.apache.org/docs/1.5.1/spark-standalone.html
The config can be changed in /var/lib/memsql-ops/data/spark/install/conf/spark_{master,worker}.conf. Once the configuration is changed, you must restart the Spark cluster with memsql-ops spark-component-stop --all and then memsql-ops spark-component-start --all.
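As an illustration, the standalone worker cleanup settings from the linked docs could be added to spark_worker.conf roughly like this (assuming the usual "key value" Spark properties format; the interval and TTL values are arbitrary examples), followed by the restart:
# enable periodic cleanup of old application work dirs
spark.worker.cleanup.enabled     true
# run the cleaner every 30 minutes
spark.worker.cleanup.interval    1800
# delete work dirs older than 7 days
spark.worker.cleanup.appDataTtl  604800
$ memsql-ops spark-component-stop --all
$ memsql-ops spark-component-start --all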