What is the fastest and safest way to move an SVN repository from one host to another? - linux

I have two SVN repositories of about 1.5 GB each that I need to move from one CentOS 5.4 Linux machine to another. Each has three to four thousand revisions.
I could simply scp -r them over. I did try starting that process and it was clear that it was going to take several hours, maybe all night, so I stopped it to reconsider.
I could use svnadmin dump with or without the --deltas option, then compress and scp the dump file.
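For reference, the dump-and-load route I have in mind would look roughly like this (paths and hostnames are placeholders):
svnadmin dump --deltas /var/svn/repo | gzip > repo.dump.gz
scp repo.dump.gz user@newhost:/tmp/
# then, on the new host:
svnadmin create /var/svn/repo
gunzip -c /tmp/repo.dump.gz | svnadmin load /var/svn/repo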
Is there some better approach?

Yep, rsync.
Specifically:
rsync -hxDPavilyzH source/ user@remote:/target/ --stats

svnsync is designed for this, and should be safe provided nothing else writes to the target repo until the copy is complete.
However, rsync should also be safe, and it allows interruption and resumption (svnsync may as well; I'm not sure).
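A rough sketch of the svnsync route, with placeholder paths and URLs (use whatever access scheme the old host already serves, e.g. http://, svn://, or svn+ssh://):
# on the new host
svnadmin create /var/svn/newrepo
echo '#!/bin/sh' > /var/svn/newrepo/hooks/pre-revprop-change
chmod +x /var/svn/newrepo/hooks/pre-revprop-change   # svnsync needs permission to set revision properties
svnsync init file:///var/svn/newrepo svn+ssh://oldhost/var/svn/oldrepo
svnsync sync file:///var/svn/newrepo                 # resumable; re-run if interrupted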

Related

git forces refresh index after switching between Windows and Linux

I have a disk partition (format: NTFS) shared by Windows and Linux. It contains a git repository (about 6.7 GB).
If I only use Windows or only use Linux to manipulate the git repository everything is okay.
But every time I switch systems, the git status command refreshes the index, which takes about 1 minute. If I run git status again on the same system, it takes less than 1 second. Here is the result:
# Just after switch from windows
[#5#wangx@manjaro:duishang_design] git status # this command takes more than 60s
Refresh index: 100% (2751/2751), done.
On branch master
nothing to commit, working tree clean
[#10#wangx@manjaro:duishang_design] git status # this time the command takes less than 1s
On branch master
nothing to commit, working tree clean
[#11#wangx@manjaro:duishang_design] git status # this time the command takes less than 1s
On branch master
nothing to commit, working tree clean
I guess there is some problem with the git cache. For example: Windows and Linux both use the .git/index file as the cache file, but git on Linux can't use the .git/index written by Windows. So it has to refresh the index and rewrite the .git/index file, which makes the next git status on Linux super fast but git status on Windows very slow again (because Windows will refresh the index file in turn).
Is my guess correct? If so, how can I set a separate index file for each system? How can I solve the problem?
You are completely correct here:
The thing you're using here, which Git variously calls the index, the staging area, or the cache, does in fact contain cache data.
The cache data that it contains is the result of system calls.
The system call data returned by a Linux system is different from the system call data returned by a Windows system.
Hence, an OS switch completely invalidates all the cache data.
... how can I set a separate index file for each system?
Your best bet here is not to do this at all. Make two different work-trees, or perhaps even two different repositories. But, if that's more painful than this other alternative, try out these ideas:
The actual index file that Git uses merely defaults to .git/index. You can specify a different file by setting GIT_INDEX_FILE to some other (relative or absolute) path. So you could have .git/index-linux and .git/index-windows, and set GIT_INDEX_FILE based on whichever OS you're using.
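A minimal sketch of that idea, assuming the shared repository lives at /path/to/repo (the path and file names are placeholders):
# on the Linux side, e.g. in a shell you use only for this repository:
export GIT_INDEX_FILE=/path/to/repo/.git/index-linux
# on the Windows side (e.g. in Git Bash), point at its own copy:
export GIT_INDEX_FILE=/path/to/repo/.git/index-windows
Bear in mind that the variable affects every Git command run in that shell, so scope it carefully (a per-repository wrapper rather than a global export).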
Some Git commands use a temporary index. They do this by setting GIT_INDEX_FILE themselves. If they un-set it afterward, they may accidentally use .git/index at that point. So another option is to rename .git/index out of the way when switching OSes: keep a .git/index-windows and a .git/index-linux as before, but rename whichever one is in use to .git/index while it's in use, then rename it back to its per-OS name before switching to the other system.
Again, I don't recommend attempting either of these methods, but they are likely to work, more or less.
As torek mentioned, you probably don't want to do this. It's not generally a good idea to share a repo between operating systems.
However, it is possible, much like it's possible to share a repo between Windows and Windows Subsystem for Linux. You may want to try setting core.checkStat to minimal, and if that isn't sufficient, core.trustctime to false. That leads to the minimal amount of information being stored in the index, which means that the data is going to be as portable as possible.
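Concretely, that would be something like the following, run inside the shared repository so the settings stay local to it:
git config core.checkStat minimal
git config core.trustctime false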
Note, however, that if your repository has symlinks, it's likely that nothing you do will prevent refreshes. Linux typically considers the length of a symlink to be its length in bytes, while Windows considers it to occupy one or more disk blocks, so there will be a size mismatch between the operating systems. This isn't avoidable, since size is one of the index attributes that can't be disabled.
This might not apply to the original poster, but if Linux is being used under the Windows Subsystem for Linux (WSL), then a quick fix is to use git.exe on the Linux side as well. Use an alias or something to make it seamless. For example:
alias git=git.exe
Setting automatic line-ending conversion solved my issue, as in this discussion. I am referring to Windows, WSL2, a portable Linux OS, and native Linux, all of which I have set up and working as my work requires. I will update this answer if I face any issue with this approach when updating the code base from different filesystems (NTFS or a Linux filesystem).
git config --global core.autocrlf true

rsync hang in the middle of transfer with a fixed position

I am trying to use rsync to transfer some big files between servers.
For some reason, when the file is big enough (2 GB - 4 GB), rsync hangs in the middle at exactly the same position: the progress at which it hangs is always the same place, even if I retry.
If I remove the file from the destination server first, then the rsync would work fine.
This is the command I used:
/usr/bin/rsync --delete -avz --progress --exclude-from=excludes.txt /path/to/src user@server:/path/to/dest
I have tried adding --delete-during and --delete-delay, with no luck.
The rsync version is 3.1.0, protocol version 31.
Any advice please? Thanks!
Eventually I solved the problem by removing the compression option: -z
I still don't know why that is.
I had the same problem (trying to rsync multiple files of up to 500GiB each between my home NAS and a remote server).
In my case the solution (mentioned here) was to add to "/etc/ssh/sshd_config" (on the server to which I was connecting) the following:
ClientAliveInterval 15
ClientAliveCountMax 240
"ClientAliveInterval X" will send a kind of message/"probe" every X seconds to check if the client is still alive (default is not to do anything).
"ClientAliveCountMax Y" will terminate the connection if after Y-probes there has been no reply.
I guess that the root cause of the problem is that in some cases the compression (and/or block diff) that is performed locally on the server takes so much time that while that's being done the SSH-connection (created by the rsync-program) is automatically dropped while that's still ongoing.
Another workaround (e.g. if "sshd_config" cannot be changed) might be to use with rsync the option "--new-compress" and/or a lower compression level (e.g. "rsync --new-compress --compress-level=1" etc...): in my case the new compression (and diff) algorithm is a lot faster than the old/classical one, therefore the ssh-timeout might not occur than when using its default settings.
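Applied to the command in the question, that workaround might look like this (assuming an rsync new enough to support --new-compress):
/usr/bin/rsync --delete -avz --new-compress --compress-level=1 --progress --exclude-from=excludes.txt /path/to/src user@server:/path/to/dest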
The problem for me was that I thought I had plenty of disk space on a drive, but the partition was not using the whole disk and was half the size I expected.
So check the available space with lsblk and df -h, and make sure the disk you are writing to reports all the space available on the device.

How to use rsync instead of scp in my below shell script to copy the files?

I am using scp to copy files in parallel using GNU parallel with the shell script below, and it is working fine.
I am not sure how to use rsync in place of scp in the script. I want to see whether rsync gives better transfer speed than scp.
Below is my problem description:
I am copying files from machineB and machineC into machineA, as I run the shell script on machineA.
If a file is not on machineB, then it should definitely be on machineC, so I try copying it from machineB first; if it is not there, I try copying the same file from machineC.
I am copying the files in parallel using GNU Parallel, and it is working fine. Currently I am copying five files in parallel, for both PRIMARY and SECONDARY.
Below is the shell script I have:
#!/bin/bash
export PRIMARY=/test01/primary
export SECONDARY=/test02/secondary
readonly FILERS_LOCATION=(machineB machineC)
export FILERS_LOCATION_1=${FILERS_LOCATION[0]}
export FILERS_LOCATION_2=${FILERS_LOCATION[1]}
PRIMARY_PARTITION=(550 274 2 546 278) # this will have more file numbers
SECONDARY_PARTITION=(1643 1103 1372 1096 1369 1568) # this will have more file numbers
export dir3=/testing/snapshot/20140103
do_Copy() {
  el=$1
  PRIMSEC=$2
  scp david@$FILERS_LOCATION_1:$dir3/new_weekly_2014_"$el"_200003_5.data $PRIMSEC/. || scp david@$FILERS_LOCATION_2:$dir3/new_weekly_2014_"$el"_200003_5.data $PRIMSEC/.
}
export -f do_Copy
parallel --retries 10 -j 5 do_Copy {} $PRIMARY ::: "${PRIMARY_PARTITION[@]}" &
parallel --retries 10 -j 5 do_Copy {} $SECONDARY ::: "${SECONDARY_PARTITION[@]}" &
wait
echo "All files copied."
Is there any way to replace the scp command above with rsync, while still copying 5 files in parallel for both PRIMARY and SECONDARY simultaneously?
rsync is designed to efficiently synchronise two hierarchies of folders and files.
While it can be used to transfer individual files, it won't help you very much used like that, unless you already have a version of the file at each end with small differences between them. Running multiple instances of rsync in parallel on individual files within a hierarchy defeats the purpose of the tool.
While triplee is right that your task is I/O-bound rather than CPU-bound, so parallelizing it won't help in the typical case whether you use rsync or scp, there is one circumstance in which parallelizing network transfers can help: if the sender is throttling requests. In that case there may be some value in running an instance of rsync for each of a number of different folders, but it would complicate your code, and you'd have to profile both solutions to find out whether you actually gain anything.
In short: just run a single instance of rsync; any performance increase you're going to get from another approach is unlikely to be worth it.
You haven't really given us enough information to know if you are on a sensible path or not, but I suspect you should be looking at lsyncd or possibly even GlusterFS. These are different from what you are doing in that they are continuous sync tools rather than periodically run, though I suspect that you could run lsyncd periodically if that's what you really want. I haven't tried out lsyncd 2.x yet, but I see that they've added parallel synchronisation processes. If your actual scenario involves more than just the three machines you've described, it might even make sense to look at some of the peer-to-peer file sharing protocols.
In your current approach, unless your files are very large, most of the delay is likely to come from the overhead of setting up and authenticating connections. Doing that separately for every single file is expensive, particularly over an SSH-based protocol. You'd be better off breaking your file list into batches and passing those batches to your copying mechanism. Whether you use rsync for that is of lesser importance, but if you first construct a list of files for an rsync process to handle, you can pass it to rsync with the --files-from option.
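As a hedged sketch of that batching idea, reusing the variables from the script in the question (the machineC fallback still needs its own pass, done here with --ignore-existing):
# build a list of wanted file names, then hand the whole batch to a single rsync
printf 'new_weekly_2014_%s_200003_5.data\n' "${PRIMARY_PARTITION[@]}" > /tmp/primary.list
rsync -av --files-from=/tmp/primary.list david@machineB:/testing/snapshot/20140103/ /test01/primary/
# anything machineB didn't have can then be fetched from machineC
rsync -av --ignore-existing --files-from=/tmp/primary.list david@machineC:/testing/snapshot/20140103/ /test01/primary/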
You want to work out what the limiting factor is in your sync speed. Presumably it's one of network bandwidth, network latency, file I/O, or perhaps CPU (checksumming or compression, but probably only if you have low-end hardware).
It's also important to know something about the pattern of changes in files from one synchronisation run to another. Are there many unchanged files from the previous run? Do existing files change? Do those changes leave a significant number of blocks unchanged (e.g. database files), or only get appended to (e.g. log files)? Can you safely rely on metadata such as file modification times and sizes to identify what's changed, or do you need to checksum the entire content?
Is your file content compressible? E.g. if you're copying plain text, you probably want to use the compression options of scp or rsync, but if you have already-compressed image or video files, compressing again would only slow you down. rsync is mostly helpful if you have files where only part of the file changes.
You can download single files with rsync just as you would with scp. Just make sure not to use the rsync:// or hostname::path formats that call the daemon.
It can at the very least make the two remote hosts work at the same time. Additionally, if the files are on different physical disks or happen to be in cache, parallelizing the transfers even on a single host can help. That's why I disagree with the other answer saying a single instance is necessarily the way to go.
I think you can just replace
scp david@$FILERS_LOCATION_1:$dir3/new_weekly_2014_"$el"_200003_5.data $PRIMSEC/. || scp david@$FILERS_LOCATION_2:$dir3/new_weekly_2014_"$el"_200003_5.data $PRIMSEC/.
by
rsync david@$FILERS_LOCATION_1:$dir3/new_weekly_2014_"$el"_200003_5.data $PRIMSEC/new_weekly_2014_"$el"_200003_5.data || rsync david@$FILERS_LOCATION_2:$dir3/new_weekly_2014_"$el"_200003_5.data $PRIMSEC/new_weekly_2014_"$el"_200003_5.data
(note that the change is not only the command name: the destination now names the target file explicitly)
Perhaps you can get additional speed because rsync will use the delta-transfer algorithm, whereas scp blindly copies everything.

Alternative to creating multipart .tar.gz files?

I have a folder with >20GB of images on a linux server, I need to make a backup and download it, so I was thinking about using "split" to create 1GB files. My question is: instead of splitting a .tar.gz and then having to join it again on my computer, is there a way I could create 20 x 1GB valid .tar.gz files, so I can then view/extract them separately?
Edit: I forgot to add that I need to do it without ssh access. I'm using mostly PHP.
You could try rsnapshot to back up using rsync/hardlinks instead. It not only solves the file-size issue but also gives you high storage and bandwidth efficiency when existing images don't change often.
Why not just use rsync?
FYI, rsync is a command-line tool that synchronises directories between two machines across the network. If you have Linux at both ends and ssh access properly configured, it's as simple as rsync -av server:/path/to/images/ images/ (make sure the trailing slashes are there). It also optimises subsequent synchronisations so that only changes are transmitted. You can even tell it to compress data in transit, but that usually doesn't help with images.
First, I would give rsnapshot a miss if you don't have SSH access (though I have it and love it).
I would assume you're backing up JPEGs, which are already compressed. Zipping them up won't make them much smaller. Plus, you don't need exactly 1 GB files; it sounds like they can be a bit bigger or smaller.
So you could just write a script which bundles JPEGs into a .tar.gz (or whatever) until it has put about 1 GB in, and then starts a new archive.
You could do all this in PHP easily enough.
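As a rough illustration of that batching logic in shell (names and the 1 GB threshold are placeholders; the same thing can be done from PHP, and a very large batch may need xargs-style handling to stay under the argument-length limit):
#!/bin/bash
src=/path/to/images        # hypothetical source directory, no trailing slash
limit=$((1024*1024*1024))  # roughly 1 GB per archive
batch=1; size=0; files=()
while IFS= read -r -d '' f; do
  files+=("$f")
  size=$((size + $(stat -c%s "$f")))   # GNU stat: file size in bytes
  if (( size >= limit )); then
    tar -czf "images_part${batch}.tar.gz" -C "$src" "${files[@]#$src/}"
    batch=$((batch+1)); size=0; files=()
  fi
done < <(find "$src" -type f -print0)
# flush whatever is left over
(( ${#files[@]} )) && tar -czf "images_part${batch}.tar.gz" -C "$src" "${files[@]#$src/}"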

Can /tmp in Linux ever fill up?

I'm putting some files in /tmp on a web server that are used by a web application for a limited amount of time. If the files get left in the server's /tmp after the user quits the application, and this happens repeatedly, should I be concerned about the directory filling up? I read online that rebooting cleans out the /tmp directory, but this box doesn't get rebooted very often.
Tom
Yes, it will fill up. Consider implementing a cron job that will delete old files after a while.
Something like this should do the trick:
/usr/bin/find /tmp/mydata -type f -atime +1 -exec rm -f {} \;
This will delete files that haven't been accessed in more than a day.
Or as a crontab entry:
# run five minutes after midnight, every day
5 0 * * * /usr/bin/find /tmp/mydata -type f -atime +1 -exec rm -f {} \;
where /tmp/mydata is a subdirectory where your application stores its temporary files. (Simply deleting old files under /tmp would be a very bad idea, as someone else pointed out here.)
Look at the crontab and find man pages for the details. Don't go running scripts that delete files on your filesystem without understanding all the details - that's how bad things happen to good servers. :)
Of course, if you can just modify your application to delete its temporary files when it's done with them, that would generally be a far better solution.
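If the application can be wrapped in a shell script, here is a minimal sketch of that clean-up-after-yourself approach (the path is hypothetical and its directory must already exist):
tmpfile=$(mktemp /tmp/mydata/upload.XXXXXX)
trap 'rm -f "$tmpfile"' EXIT   # removed even if the script exits early
# ... write to and read from "$tmpfile" here ...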
You can't just blindly delete everything that hasn't been modified for a certain amount of time. A lot of programs store sockets in there, which never get modified but are still an integral part of the program working. Take for instance mysql from one of my servers:
srwxrwxrwx 1 mysql mysql 0 Sep 11 04:01 mysql.sock=
That's a valid, working "file" in /tmp. It just looks old because mysql hasn't been restarted in a while. Either limit your find with '-type f' or '-atime', or use one of the distro-provided tools others have mentioned.
The only thing you can write to without worrying it will fill up is /dev/null. Everything else will eventually run out of space if you keep dumping things in it.
One simple approach would be to have a cron job clean up all your /tmp files that are older than, say, a few days.
Yep, it will be linked to one of your disks/partitions and can fill up.
It gets deleted on a reboot.
When the user quits the application you should clean the files up after them.
In which language is your web application written? A lot of languages provide temp-file facilities:
C
python
php
...
Check whether your language offers such a feature.
Just a warning: not all Linux installations clean the /tmp directory after each reboot.
Some Linux distros have a package that will clean up old files in /tmp for you. It isn't hard to implement your own, as mentioned above. One thing to look out for is long-running processes, especially "zombies", which are ones that have died but haven't finished cleaning up after themselves. If a process has a file open, just deleting it from /tmp won't actually reclaim its space - you have to kill the process or somehow coerce it to close the file. Many programs that write log or temporary files are designed to catch a signal (often SIGUSR1) and close and re-open any log or temporary files for that reason.
Many Linux distributions include something named 'tmpwatch', or similar, which runs via cron and deletes things on a pre-defined gradient. Some are smart enough to go by the owner of the file: stuff owned by daemon users gets cleaned out faster than stuff owned by regular users. Check the mailing lists for your distro of choice to find out.
Still, you should have SNMP or some other kind of monitoring watching how much room is available; if /tmp fills up, services like Apache aren't going to be happy. For instance, eAccelerator for PHP needs plenty of room, some mail scanners don't clean up properly, etc.
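As a very rough stand-in for proper monitoring (the 90% threshold and the alerting command are placeholders):
usage=$(df -P /tmp | awk 'NR==2 {print $5+0}')   # percentage used on the filesystem holding /tmp
if [ "$usage" -gt 90 ]; then
  echo "/tmp filesystem is at ${usage}%" | mail -s "tmp space warning" root   # or log/alert however you prefer
fi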
