I have a series of numbered files to be processed separately, one per server. Each split file was made with the Linux split command and then xz-compressed to save transfer time.
split_001 split_002 split_003 ... split_030
How can I push these files out to a group of 30 servers with Ansible? It does not matter which server gets which file, so long as each server gets a single unique file.
I had been using a bash script, but I am looking for a better solution, hopefully with Ansible. Afterwards I plan to run a shell command that uses at to start the computation, which runs for several hours or days.
scp -oStrictHostKeyChecking=no bt_5869_001.xz usr13@<ip>:/data/
scp -oStrictHostKeyChecking=no bt_5869_002.xz usr13@<ip>:/data/
scp -oStrictHostKeyChecking=no bt_5869_003.xz usr13@<ip>:/data/
...
http://docs.ansible.com/ansible/copy_module.html
# copy file but iterate through each of the split files
- copy: src=/mine/split_001.xz dest=/data/split_001.xz
- copy: src=/mine/compute dest=/data/ owner=root mode=0755
- copy: src=/mine/start.sh dest=/data/ owner=root mode=0755
- shell: xz -d /data/*.xz
- shell: at -f /data/start.sh now
For example:
tasks:
  - set_fact:
      # +1 because the split files are numbered starting at 001
      padded_host_index: "{{ '{0:03d}'.format(play_hosts.index(inventory_hostname) + 1) }}"
  - copy: src=/mine/split_{{ padded_host_index }}.xz dest=/data/
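Assuming these tasks live in a playbook (say distribute.yml, a name I'm making up) with hosts: all pointed at the 30 servers, you would run it against your inventory with something like:
ansible-playbook -i inventory.ini distribute.yml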
You can do this with Ansible. However, this seems like the wrong general approach to me.
You have a number of jobs. You need each of them to be processed, and you don't care which server processes which job, as long as each job is processed exactly once (and ideally the whole batch is done as efficiently as possible). This is precisely the situation a distributed queueing system is designed for.
You'll have workers running on each server and one master node (which may run on one of the servers) that knows about all of the workers. When you have tasks that need to get done, you queue them up with the master, and the master distributes them to workers as they become available, so you don't have to worry about having exactly as many servers as jobs.
There are many, many options for this, including beanstalkd, Celery, Gearman, and SQS. You'll have to do the legwork to find out which one works best for your situation. But this is definitely the architecture best suited to your problem.
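Just to illustrate the claim-once semantics those systems give you (this is only a toy sketch, not a replacement for beanstalkd or Celery, and the shared paths are hypothetical), a worker can "claim" a job by atomically moving its file out of a shared queue directory:
#!/bin/bash
# toy worker loop: /shared/queue and /shared/claimed are assumed shared (e.g. NFS) paths
for job in /shared/queue/*.xz; do
    # mv (rename) is atomic on a single filesystem, so only one worker wins each job
    if mv "$job" /shared/claimed/ 2>/dev/null; then
        xz -d "/shared/claimed/$(basename "$job")"
        # ... run the long computation on the decompressed file here ...
    fi
done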
Related
I have to scale nearly 200 microservices in my environment up and down in one go. Is there a way to automate the scale-up and scale-down through a script?
While using xargs or parallel or something similar might work, I suggest using command substitution or similar means, as much as possible, to give Kubernetes a list of objects right away.
For example, you can try the below, which does a dry run but shows you what would be scaled.
kubectl scale --replicas 2 $(kubectl get deploy -o name) --dry-run=client
The list of deployments could also come from a text file or somewhere else.
There is certainly a way to script that, but we are missing a bit of context about why you want to do it.
Generally speaking, I would prefer the approach of using the Horizontal Pod Autoscaler, or even things like Knative that allow scaling from zero, for automated scaling.
But if the reason is unusual, such as a maintenance window or something else, you can go with the scripting approach.
Then you have two weapons at your disposal: using the raw Kubernetes API through a Postman script or something similar, or drawing on your best shell-scripting techniques. Going the latter way, I can imagine something like this will work:
#!/bin/bash
set -eu -o pipefail
# each line of targets.txt is "namespace objectType/objectName";
# prefix the kubectl command with "echo" if you want to preview the commands first
cat targets.txt | parallel --colsep ' ' kubectl scale --replicas 0 -n {1} {2}
where targets.txt is formatted this way:
each line is one microservice with the format "namespace objectType/objectName"
Example:
default deploy/nginx
anotherns deploy/api
... more services
The advantage of using parallel over doing a simple bash loop is that this will be done in ... parallel :-)
This will save you a lot of time.
You might also want to include a display/check at the end showing the number of replicas, but this wasn't specified in your question.
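For instance, assuming the same targets.txt format as above, a quick check of the resulting objects (kubectl get shows the READY/replica columns) could be:
cat targets.txt | parallel --colsep ' ' kubectl get -n {1} {2}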
We have around 650 GB of data on google compute engine.
We need to move them to Cloud Storage, into a Coldline bucket, so the best option we could find is to copy them with gsutil in parallel mode.
The files range from a few kilobytes to 10 MB max, and there are a few million of them.
The command we used is
gsutil -m cp -r userFiles/ gs://removed-websites/
On the first run it copied around 200 GB and stopped with an error:
[972.2k/972.2k files][207.9 GiB/207.9 GiB] 100% Done 29.4 MiB/s ETA 00:00:00
Operation completed over 972.2k objects/207.9 GiB.
CommandException: 1 file/object could not be transferred.
On the second run it failed at almost the same place and stopped again.
How can we copy these files successfully ?
Also, the buckets that hold the partial data are not being removed after we delete them. The console just says "preparing to delete" and nothing happens; we waited more than 4 hours. Is there any way to remove those buckets?
Answering your first question, I can propose several options, all of them based on splitting the data and uploading it in small portions.
You can try distributed upload from several machines.
https://cloud.google.com/storage/docs/gsutil/commands/cp#copying-tofrom-subdirectories-distributing-transfers-across-machines
In this case you split the data into safe chunks, like 50 GB, and upload them from several machines in parallel. But it requires extra machines, which you shouldn't actually need.
You can still try such a split upload on a single machine, but you then need some chunking mechanism that does not upload all the files at once, but chunk by chunk. In this case, if something fails, you only need to re-upload that chunk. In addition, you will be able to localize the point of failure more precisely if something goes wrong.
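As a rough sketch of that single-machine, chunk-by-chunk idea (assuming userFiles/ contains top-level subdirectories that can serve as chunks; the layout and the failed-chunks file are assumptions):
#!/bin/bash
# upload one top-level directory at a time, so a failure only means redoing that chunk
for dir in userFiles/*/; do
    gsutil -m cp -r "${dir%/}" gs://removed-websites/ || echo "FAILED: ${dir}" >> failed_chunks.txt
done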
Regarding how you can delete them: use the same technique as for the upload. Divide the data into chunks and delete them chunk by chunk. Or you can try to remove the whole project, if that is suitable for your situation.
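For the stuck buckets specifically, deleting the objects from the command line and then removing the empty bucket is usually quicker than the console; a hedged example, with the bucket name taken from your question:
gsutil -m rm gs://removed-websites/**
gsutil rb gs://removed-websites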
Update 1
So, I checked the gsutil interface and it supports glob syntax. You can match, for example, 200 folders with a glob pattern and launch this command 150 times (this will upload 200 x 150 = 30 000 folders).
You can use such an approach and combine it with the -m option, which is partially what your script did, but it might work faster. This will work for folder names as well as file names.
If you provide examples of the folder names and file names, it would be easier to propose an appropriate glob pattern.
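Purely as an illustration (the folder naming here is invented), a glob that uploads one numbered range of folders at a time could look like:
gsutil -m cp -r 'userFiles/batch_01*' gs://removed-websites/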
It could be that you are affected by gsutil issue 464. This happens when you are running multiple gsutil instances concurrently with the -m option. Apparently these instances share a state directory, which causes weird behavior.
One of the workarounds is to add parameters: -o GSUtil:parallel_process_count=1 -o GSUtil:parallel_thread_count=24.
E.g.:
gsutil -o GSUtil:parallel_process_count=1 -o GSUtil:parallel_thread_count=24 -m cp -r gs://my-bucket .
I've just run into the same issue, and it turns out that it's caused by the cp command running into an uncopyable file (in my case, a broken symlink) and aborting.
Problem is, if you're running a massively parallel copy with -m, the broken file may not be immediately obvious. To figure out which one it is, try a dry run rsync -n instead:
gsutil -m rsync -n -r userFiles/ gs://removed-websites/
This will clearly flag the broken file and abort, and you can fix or delete it and try again. Alternatively, if you're not interested in symlinks, just use the -e option and they'll be ignored entirely.
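So the actual copy, skipping symlinks, would look something like:
gsutil -m rsync -e -r userFiles/ gs://removed-websites/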
I would like to copy multiple files simultaneously to speed up my process. I currently use the following:
scp -r root@xxx.xxx.xx.xx:/var/www/example/example.example.com .
but it only copies one file at a time. I have a 100 Mbps fibre connection, so I have the bandwidth available to copy a lot at the same time. Please help.
You can use background tasks with the wait command.
The wait command ensures that all the background tasks are completed before processing the next line, i.e. echo will be executed only after the scp to all three nodes has completed.
#!/bin/bash
scp -i anuruddha.pem myfile1.tar centos@192.168.30.79:/tmp &
scp -i anuruddha.pem myfile2.tar centos@192.168.30.80:/tmp &
scp -i anuruddha.pem myfile.tar centos@192.168.30.81:/tmp &
wait
echo "SCP completed"
SSH is able to do so-called "multiplexing": several sessions over one connection (to one server). It can be one way to achieve what you want. Look up keywords like "ControlMaster".
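A minimal sketch of that (the options are given on the command line here, though they usually go in ~/.ssh/config; host and file names are placeholders):
# the first scp opens a master connection; later scp/ssh calls reuse it instead of re-authenticating
scp -o ControlMaster=auto -o ControlPath=~/.ssh/cm-%r@%h:%p -o ControlPersist=10m file1 server:/tmp/
scp -o ControlMaster=auto -o ControlPath=~/.ssh/cm-%r@%h:%p -o ControlPersist=10m file2 server:/tmp/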
The second way is to use more connections and send every job to the background:
for file in file1 file2 file3 ; do
    scp "$file" server:/tmp/ &
done
But this is just the answer to your question, "How to copy multiple files simultaneously". To speed things up, you can use weaker encryption (rc4, etc.), and don't forget that the bottleneck can also be your hard drive, since scp doesn't implicitly limit the transfer speed.
The last option is using rsync; in some cases it can be a lot faster than scp...
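For example, reusing the path from the question (everything else left at defaults):
rsync -avz root@xxx.xxx.xx.xx:/var/www/example/example.example.com .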
I am not sure if this helps you, but I generally archive the files at the source (compression is not required; just archiving is sufficient), download the archive, and extract it. This speeds up the process significantly.
Before archiving it took > 8 hours to download 1GB
After archiving it took < 8 minutes to do the same
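A rough sketch of that approach (host, user, and paths are placeholders):
# on the source host: archive without compression
ssh user@source "tar cf /tmp/files.tar -C /var/www/example example.example.com"
# download the single archive instead of many small files
scp user@source:/tmp/files.tar .
# extract locally
tar xf files.tar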
You can use parallel-scp (AKA pscp): http://manpages.ubuntu.com/manpages/natty/man1/parallel-scp.1.html
With this tool, you can copy a file (or files) to multiple hosts simultaneously.
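For example, with a hosts.txt listing one server per line (the file and user names here are placeholders):
parallel-scp -h hosts.txt -l root myfile.tar /tmp/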
100 Mbit Ethernet is pretty slow, actually. You can expect about 12 MiB/s in theory; in practice, you usually get between 4 and 6 MiB/s at best.
That said, you won't see a speed increase if you run multiple sessions in parallel. You can try it yourself: simply run two parallel scp sessions copying two large files. My guess is that you won't see a noticeable speedup. The reasons for this are:
The slowest component on the network path between the two computers determines the max. speed.
Other people might be accessing example.com at the same time, reducing the bandwidth that it can give you
100mbit Ethernet requires pretty big gaps between two consecutive network packets. GBit Ethernet is much better in this regard.
Solutions:
Compress the data before sending it over the wire
Use a tool like rsync (which uses SSH under the hood) to copy only the files which have changed since the last time you ran the command.
Creating a lot of small files takes a lot of time. Try to create an archive of all the files on the remote side and send that as a single archive.
The last suggestion can be done like this:
ssh root@xxx "cd /var/www/example ; tar cf - example.example.com" > example.com.tar
or with compression:
ssh root@xxx "cd /var/www/example ; tar czf - example.example.com" > example.com.tar.gz
Note: bzip2 compresses better but slower. That's why I use gzip (z) for tasks like this.
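On the local side you then just unpack it:
tar xf example.com.tar        # or: tar xzf example.com.tar.gz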
If you specify multiple files, scp will download them sequentially:
scp -r root@xxx.xxx.xx.xx:/var/www/example/file1 root@xxx.xxx.xx.xx:/var/www/example/file2 .
Alternatively, if you want the files to be downloaded in parallel, then use multiple invocations of scp, putting each in the background.
#! /usr/bin/env bash
scp root@xxx.xxx.xx.xx:/var/www/example/file1 . &
scp root@xxx.xxx.xx.xx:/var/www/example/file2 . &
I would like to put a delay before a file transfer on Linux or Solaris.
As far as I know, the scp command has no delay option.
scp file_name root@dest_ip:/dest_path
So, how can I add a delay (it can vary between 250 ms and 500 ms)?
You can run a command at a specific time.
See e.g. here:
at 1200
scp file_name root@dest_ip:/dest_path
However, there may be better approaches - for example, if you are waiting for a file to be ready.
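If you literally need a 250-500 ms pause rather than a scheduled start time, a plain sleep before the transfer is enough. For example, with a random delay in that range (note that fractional seconds need GNU sleep or a shell whose sleep supports them):
# random delay between 250 and 499 ms, then transfer
sleep "0.$((RANDOM % 250 + 250))"
scp file_name root@dest_ip:/dest_path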
I have two servers and I want to move a 50 GB backup tar.bz file from one to the other.
I used axel to download the file from the source server. But now when I want to extract it, it gives me an "unexpected EOF" error. The sizes of the two files are the same, so it seems there is a problem with the content.
I want to know if there is a program/app/script that can compare these two files and correct only the damaged parts. Or do I need to split the file by hand and compare each part's hash?
The problem is that the source server has limited bandwidth and a low transfer speed, so I can't transfer it again from scratch.
You can use a checksum utility, such as md5 or sha, to see if the files are the same on both ends, e.g.:
$ md5 somefile
MD5 (somefile) = d41d8cd98f00b204e9800998ecf8427e
By running such a command on both ends and comparing the results, you can get some certainty as to whether the files are the same.
As for only downloading the erroneous portion of a file, this would require checksums on both sides for "pieces" of the data, such as with the bittorrent protocol.
OK, I found rdiff to be the best way to solve this problem. Just do the following:
On Destination Server:
rdiff signature destFile.tar.bz destFile.sig
Then transfer destFile.sig to the source server and run rdiff there, on the source server:
rdiff delta destFile.sig srcFile.tar.bz delta.rdiff
Then transfer delta.rdiff to the destination server and run rdiff once again, on the destination server:
rdiff patch destFile.tar.bz delta.rdiff fixedFile.tar.bz
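Afterwards it is worth verifying the result against a checksum taken on the source server, e.g.:
md5sum srcFile.tar.bz    # on the source server
md5sum fixedFile.tar.bz  # on the destination server; the two hashes should match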
This process really doesn't need a separate program; you can do it with a couple of simple commands. Split the file into pieces on both servers and compute an md5sum for each piece. If any of the md5sums don't match, copy over the mismatched piece(s) and concatenate the pieces back together. To make comparing the md5sums easier, run a diff between the two outputs (or md5sum the outputs themselves to see whether there is any difference at all, without having to copy the full output over).
split -b 1000000000 -d bigfile bigfile.
for i in bigfile.*
do
    md5sum "$i"
done
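Once the mismatched pieces have been re-copied, reassembling and re-checking might look like this (file names follow the split example above):
# lexical glob order matches split's numeric suffixes, so the pieces concatenate in order
cat bigfile.?? > bigfile.rebuilt
md5sum bigfile.rebuilt   # compare with the md5sum of the original on the source server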