Equivalent of "du" command on an Amazon S3 bucket - linux

I'm looking for a solution to recursively get the size of all my folders on an Amazon S3 bucket that has a lot of nested folders.
The perfect example is the Linux du --si command:
12M ./folder1
50M ./folder2
50M ./folder2/subfolder1
etc...
I'm also open to any graphical tool. Is there any command or AWS API for that?

Use awscli
aws s3 ls s3://bucket --recursive --human-readable --summarize
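That lists every object and prints a total for the whole bucket (or prefix). For a per-folder breakdown closer to du, one option is to loop over the top-level prefixes and summarize each one. A rough sketch, where the bucket name is a placeholder:
#!/bin/bash
# list the top-level "folders" (common prefixes) and summarize each one
for prefix in $(aws s3 ls s3://bucket/ | awk '/PRE/ {print $2}'); do
    echo "== $prefix"
    aws s3 ls "s3://bucket/$prefix" --recursive --human-readable --summarize | tail -2
done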

s3cmd du -H s3://bucket-name
This command tells you the size of the bucket (human readable). If you want to know the sizes of subfolders you can list the folders in the bucket (s3cmd ls s3://bucket-name) and then iterate through them.
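A rough sketch of that iteration, with the bucket name as a placeholder:
#!/bin/bash
# s3cmd ls prints a "DIR s3://bucket-name/folder/" line for each top-level folder
for dir in $(s3cmd ls s3://bucket-name | awk '/DIR/ {print $2}'); do
    s3cmd du -H "$dir"
done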

Related

How do I run a python script and files located in an aws s3 bucket

I have a Python script, pscript.py, which takes the input parameters -c input.txt -s 5 -o out.txt. The files are all located in an AWS S3 bucket. How do I run it after creating an instance? Do I have to mount the bucket on the EC2 instance and execute the code, or use Lambda? I am not sure; reading so much AWS documentation is rather confusing.
Command line run is as follows:
python pscript.py -c input.txt -s 5 -o out.txt
You should copy the file from Amazon S3 to the EC2 instance:
aws s3 cp s3://my-bucket/pscript.py .
You can then run your command as above.
Please note that, to access the object in Amazon S3, you will need to assign an IAM Role to the EC2 instance. The role needs sufficient permission to access the bucket/object.
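For example, assuming the script and its input file live in the same bucket (my-bucket is a placeholder), the whole sequence on the instance might look like this:
# download the script and its input from S3
aws s3 cp s3://my-bucket/pscript.py .
aws s3 cp s3://my-bucket/input.txt .
# run it exactly as you would locally
python pscript.py -c input.txt -s 5 -o out.txt
# optionally, upload the result back to the bucket
aws s3 cp out.txt s3://my-bucket/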

How to delete file after sync from EC2 to s3

I have a file system where files can be dropped into an EC2 instance, and I have a shell script running to sync the newly dropped files to an S3 bucket. I'm looking to delete the files off the EC2 instance once they are synced. Specifically, the files are dropped into the "yyyyy" folder.
Below is my shell code:
#!/bin/bash
inotifywait -m -r -e create "yyyyy" | while read -r NEWFILE
do
    if lsof | grep "$NEWFILE" ; then
        echo "$NEWFILE";
    else
        sleep 15
        aws s3 sync yyyyy s3://xxxxxx-xxxxxx/
    fi
done
Instead of using aws s3 sync, you could use aws s3 mv (which is a 'move').
This will copy the file to the destination, then delete the original (effectively 'moving' the file).
It can also be used with --recursive to move a whole folder, or with --include and --exclude to specify multiple files.
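For instance, to move everything from the local yyyyy folder into the bucket in one step (paths taken from the question):
aws s3 mv yyyyy s3://xxxxxx-xxxxxx/ --recursive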

How to show the disk usage of each subdirectory in Linux?

I have a directory, /var/lib/docker, which contains several subdirectories:
/var/lib/docker$ sudo ls
aufs containers image network plugins swarm tmp trust volumes
I'd like to find out how big each directory is. However, using the du command as follows,
/var/lib/docker$ sudo du -csh .
15G .
15G total
I don't see the 'breakdown' for each directory. In the examples I've seen in http://www.tecmint.com/check-linux-disk-usage-of-files-and-directories/, however, it seems that I should see it. How might I obtain this overview to see which directory is taking up the most space?
Use an asterisk to get info for each directory, like this:
sudo du -hs *
It will output something like the below:
0 backup
0 bin
70M boot
0 cfg
8.0K data
0 dev
140K docs
ncdu is also a nice way to analyze disk usage. It lets you quickly navigate through subdirectories and identify the largest directories and files.
It should be available in most distributions' official repositories.
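For the directory in the question, that would simply be:
sudo ncdu /var/lib/docker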
Try using the --max-depth argument. It prints the total disk space usage for a directory (or file, with --all) only if it is N or fewer levels below the command-line argument.
For example, the following command will show the disk space usage of subdirectories up to three levels deep:
du --max-depth=3 -h
In general, use du --max-depth=N -h, where N is a positive integer.
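Applied to the directory from the question, one level is enough to see the per-subdirectory breakdown:
sudo du --max-depth=1 -h /var/lib/docker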
Let shell expand the directory contents:
du -h *
Call du for each directory (-maxdepth 1 limits it to the immediate subdirectories, and -print0 with -0 keeps it safe for names containing spaces):
find . -maxdepth 1 -type d -print0 | xargs -0 du -sh
In addition,
du -h <directory> should do it.
You can also simply use:
du /directory
This will give the space used by all of the subdirectories inside the parent directory.

docker container size much greater than actual size

I am trying to build an image from debian:latest. After the build, the reported virtual size of the image from the docker images command is 1.917 GB. I logged in to check the size (du -sh /) and it's 573 MB. I am pretty sure this huge size is not normally possible. What is going on here? How do I get the correct size of the image? More importantly, when I push this repository the size is 1.9 GB and not 573 MB.
Output of du -sh /*
8.9M /bin
4.0K /boot
0 /dev
1.1M /etc
4.0K /home
30M /lib
4.0K /lib64
4.0K /media
4.0K /mnt
4.0K /opt
du: cannot access '/proc/11/task/11/fd/4': No such file or directory
du: cannot access '/proc/11/task/11/fdinfo/4': No such file or directory
du: cannot access '/proc/11/fd/4': No such file or directory
du: cannot access '/proc/11/fdinfo/4': No such file or directory
0 /proc
427M /root
8.0K /run
3.9M /sbin
4.0K /srv
0 /sys
8.0K /tmp
88M /usr
15M /var
Do you build that image via a Dockerfile? If so, pay attention to your RUN statements. Each RUN statement creates a new image layer, which remains in the image's history and counts toward the image's total size.
So, for instance, if one RUN statement downloads a huge archive file, the next one unpacks it, and a following one cleans it up, the archive and its extracted files still remain in the image's history:
RUN curl <options> http://example.com/my/big/archive.tar.gz
RUN tar xvzf <options>
RUN <do whatever you need to do with the unpacked files>
RUN rm archive.tar.gz
It is more efficient, in terms of image size, to combine multiple steps into a single RUN statement using the && operator, like this:
RUN curl <options> http://example.com/my/big/archive.tar.gz \
&& tar xvzf <options> \
&& <do whatever you need to do with the unpacked files> \
&& rm archive.tar.gz
That way you can clean up files and folders that are needed during the build but not in the resulting image, and keep them out of the image's history as well. This is a quite common pattern for keeping image sizes small.
Of course, you then lose the fine-grained image history that you could otherwise reuse.
Update:
Like RUN statements, ADD statements also create new image layers. Whatever you add to an image that way stays in the history and counts toward the total image size. You cannot temporarily ADD things and then remove them so that they do not count toward the total.
Try to ADD as little as possible to the image, especially when you work with large files. Are there other ways to fetch those files temporarily within a RUN statement, so that you can clean them up during the same RUN execution? E.g. RUN git clone <your repo> && <do stuff> && rm -rf <clone dir>?
A good practice is to ADD only the things that are meant to stay in the image. Temporary things should be fetched and cleaned up within a single RUN statement instead, where possible.
The 1.9 GB is not just the image content, it's the image plus its history. Use docker history <image> to check what takes up so much space.
See also Why are Docker container images so large?
To reduce the size, you can change the way you build the image (it will depend on what you do; see the answers from the link above), use docker export (see How to flatten a Docker image?), or use other tools.
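A quick sketch of both approaches, with placeholder image and container names: inspect where the space goes, and, if you really need a single-layer image, flatten a container created from it.
# show the size contributed by each layer
docker history my-image:latest
# flatten: export a container's filesystem and re-import it as a new, single-layer image
docker run --name tmp my-image:latest true
docker export tmp | docker import - my-image:flat
docker rm tmp
Note that docker import keeps only the filesystem; Dockerfile metadata such as CMD, ENV and EXPOSE is lost in the flattened image.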

How to create instance from already uploaded VMDK image at S3 bucket

I have already uploaded my VMDK file to the S3 bucket using following command:
s3cmd put /root/Desktop/centos-ldaprad.vmdk --multipart-chunk-size-mb=10 s3://xxxxx
Now I would like to create an AWS instance from the same VMDK available in the S3 bucket:
ec2-import-instance centos-ldaprad.vmdk -f VMDK -t t2.micro -a x86_64 -b xxxxx -o <XXXX_ACCESS_KEY_XXXX> -w <XXXX_SECRET_KEY_XXX> -p Linux --dont-verify-format -s 5 --ignore-region-affinity
But it looks in the present working directory for the source VMDK file. I would be really grateful if you could guide me on how to point to the source VMDK in the bucket instead of a local source.
Does the --manifest-url option point to the S3 bucket? When I uploaded the file, I had no idea whether any such manifest was created; if one is created, where would it be?
Another thing: when I run the above ec2-import-instance command, it searches for the VMDK in the present working directory and, if found, starts uploading. Is there any provision to upload in parts and to resume in case of interruption?
It's not really the answer you were after, but I've attached the script I use to upload VMDKs and convert them to AMI images.
This uses ec2-resume-import, so you can restart it if an upload partially fails.
http://pastebin.com/bD8c3gQu
It's worth pointing out that when I register the device I specify a block device mapping. This is because my images always include a separate boot partition and an LVM-based root partition.
--root-device-name /dev/sda1 -b /dev/sda=$SNAPSHOT_ID:10:true --region $REGION -a x86_64 --kernel aki-52a34525
