Reshard existing large SVN repository

Reshard existing large SVN repository - linux

we have a a pretty large SVN repository (50 GB, Over 100000 revisions). Working with it is pretty slow and my guess is that the reason for this is the flat directory structure in db/revs and db/revprops (Where each revision is one file).
We use the FSFS format with SVN 1.5 (On a linux server) but the repo was created with an older SVN version. Now I read that SVN 1.5 supports "sharding" and I understood that this feature distributes the revisions into multiple directories so a single directory doesn't contain so many files. This sounds pretty useful but unfortunately it looks like this feature is only used with repositories which are freshly created with SVN 1.5.
How can I convert a large existing linear repo to a sharded repo? The manual mentions the tool "fsfs-reshard.py" but this script says "This script is unfinished and not ready to be used on live data. Trust us.". So I definitely don't want to use that. Is there an alternative?

Will an svnadmin dump and svnadmin load do the trick? http://subversion.apache.org/faq.html#dumpload

The best way is as mentioned dump/load cylcle. But you can try the upgrade.
svnadmin upgrade
Make a copy of your repo first try the upgrade and test it....(don't miss to make a backup).

Because dump/restore process requires lot of disk space and processing time, I have published (in 2010) an improved version of fsfs-reshard.py which includes support for Subversion 1.6 FSFS format 5:
https://github.com/ymartin59/svn-fsfs-reshard
It supports switch between linear to sharded layouts, unpacking shards when required. Thanks to shard statistics computation, you may anticipate packed revision sizes selecting appropriate shard size.
Of course it must be used with care:
First test procedure on a repository copy if possible
Get a backup ready to be restored
Prevent access to repository when processing
Run svnadmin verify before put it live

Related

is it safe to use git in the same directory but from different OS (dual-boot)?

I am dual-booting Linux and Windows, and have git configured on both of them. When using Linux, is it safe to access the files directly from the Windows partition or should I just clone them from the remote repo for the new OS instead? thank you

The word safe is quite strong. You can definitely look around at the files in a foreign mount, but Git itself may behave oddly (and/or be extremely slow) if you try to do any actual work.
The reason is that Git's index (aka staging area or cache) contains cached data from the OS, and this cached data depends on the OS. Changing the underlying OS, then working with the repository, causes the cached data to be invalid.
Because the index itself is just a cache, it's possible to remove and rebuild it (rm .git/index && git reset). This undoes any staging action though. Using separate clones is certainly safer, and is how I would recommend working.

Check file history in Linux/Ubuntu/Centos

Can i check the file history like git or SVN in linux os. The modifications by date in Linux/Ubuntu/Centos. Any software that helps me do this?

Git and Subversion are software packages whose purpose in life is to keep track of content changes in the files of a project. The OSes usually do not care about files history; they don't provide such a feature.
Windows and macOS include backup tools that run automatically in background (if they are enabled) from time to time and can be used to access some (not all) past versions of the files. This functionality comes with the cost of disk space used to store the past versions of files.
Linux doesn't provide such a tool (but you can install one, if you need it.)
I guess you are out of luck. You cannot recover a previous version of the file but you can install a backup software to avoid reaching this situation in the future.

By default you can't. The filesystem simply stores the current state of the file, not its history (as 1615903 pointed out in the comments, there are some versioned filesystems that keep track of this kind of history, but they are largely unsupported in Linux - which means you probably aren't dealing with one, if you are, the filesystem documentation can guide you through the recovery of your file). It's possible that some forensics tool can at least attempt to recover some file's history but I'm not sure of that (and they will probably fail if the older file's sectors have been written on).
For the future, you can prepare in advance for similar problems by setting up some incremental backup (it can be done pretty easily with rsync), but it's still limited to the specific timeframes you set your script to run into.

Waht could be the reason for "svnrdump" and "svnadmin dump" to produce different sized dumps?

I tried to migrate multiple repositories to a different SVN server.
I have root access to the source server, so I first tried to dump the repositories locally on the server using "svnadmin dump". This works fine for the first couple of repositories until I encountered a repository that need more space to be dumped than the server has empty disk space.
So instead I switched to using "svnrdump dump" to dump the repositories onto a remote machine. As my root on the source server has no svn read access to the server I used my svn user account instead. That account has full read and write access to all the repositories.
To be sure I dumped all repositories (not just the missing one) again with "svnrdump dump"
After I was done I ended up with some repositories that were dumped two times (one time with svnadmin and one time with svnrdump).
I suddenly noticed that the size of one of the dumps was 115 MB for the dump created with "svnadmin dump" and only 78 MB for the dump created with "svnrdump dump".
The SVN server is a unix machine with SVN 1.6.17 and the remote machine used for svnrdump is a windows machine with Tortoise-SVN 1.9.4 and SVN 1.9.4.
So, now I am unsure if my dumps made with "svnrdump" are really correct.
Can the different size be because of the difference between the two accounts (root of the server on the one hand and svn user on the other hand)?
Or might it have something to do with the different versions of svn?
Regards,
Sebastian

When you used the 1.6.x client, the dump file contains the full info for every revision.
svn clients version 1.8.x and later only contain the delta info from revision to revision, so they're a lot smaller.
svnadmin dump has a switch --deltas to produce a dump using those deltas, which results in a smaller dump file. svnrdump does that unconditionally to reduce network traffic.

Write a system life saver for ubuntu for restoring broken system to a working state

I am thinking of writing a system life saver application for ubuntu, which can restore system to an earlier state. This could be much useful in situations of system break.
User can create restore point before and then use them to restore their system.
This would be used for packages initially and then later on for restoring previous versions of files,somewhat like system restore functionality in microsoft windows.
Here is the idea page Idea page
I have gone through some ideas of implementing it like that which is done in windows, by keeping information about the files in the filesystem, the filesystem is intelligent enough to be used for this feature. But we don't have such file system available in linux, one such file system is brtfs but using this will lead to users creating partitions, which will be cumbersome. So I am thinking of a "copy-on-write and save-on-delete" approach. When a restore point is created I will create a new directory for backup like "backup#1" in the restore folder created by application earlier and then create hard links for the files needed to be restored. Now if any file is deleted from its original location I would have its hard link with me which can be used to restore the file, when needed. But this approach doesn't work on modification. For modification I am thinking of creating hooks in the file system (using redirfs ) which will call my attached callbacks which will check for the modifications in various parts of the files. I will keep these all changes in the database and then reverse the changes as soon as a restore is needed.
Please suggest me some efficient approaches for doing this.
Thanks

Like the comments suggested, the LVM snapshot ability provides a good basis for such an undertaking. It would work on a per-partition level and saves only sectors changed in comparison with the current system state. The LVM howto gives a good overview.
You'll have to set up the system from the very start with LVM, though, and leave sufficient space for snapshots.

What's the best way to keep multiple Linux servers synced?

I have several different locations in a fairly wide area, each with a Linux server storing company data. This data changes every day in different ways at each different location. I need a way to keep this data up-to-date and synced between all these locations.
For example:
In one location someone places a set of images on their local server. In another location, someone else places a group of documents on their local server. A third location adds a handful of both images and documents to their server. In two other locations, no changes are made to their local servers at all. By the next morning, I need the servers at all five locations to have all those images and documents.
My first instinct is to use rsync and a cron job to do the syncing over night (1 a.m. to 6 a.m. or so), when none of the bandwidth at our locations is being used. It seems to me that it would work best to have one server be the "central" server, pulling in all the files from the other servers first. Then it would push those changes back out to each remote server? Or is there another, better way to perform this function?

The way I do it (on Debian/Ubuntu boxes):
Use dpkg --get-selections to get your installed packages
Use dpkg --set-selections to install those packages from the list created
Use a source control solution to manage the configuration files. I use git in a centralized fashion, but subversion could be used just as easily.

An alternative if rsync isn't the best solution for you is Unison. Unison works under Windows and it has some features for handling when there are changes on both sides (not necessarily needing to pick one server as the primary, as you've suggested).
Depending on how complex the task is, either may work.

One thing you could (theoretically) do is create a script using Python or something and the inotify kernel feature (through the pyinotify package, for example).
You can run the script, which registers to receive events on certain trees. Your script could then watch directories, and then update all the other servers as things change on each one.
For example, if someone uploads spreadsheet.doc to the server, the script sees it instantly; if the document doesn't get modified or deleted within, say, 5 minutes, the script could copy it to the other servers (e.g. through rsync)
A system like this could theoretically implement a sort of limited 'filesystem replication' from one machine to another. Kind of a neat idea, but you'd probably have to code it yourself.

AFAIK, rsync is your best choice, it supports partial file updates among a variety of other features. Once setup it is very reliable. You can even setup the cron with timestamped log files to track what is updated in each run.

I don't know how practical this is, but a source control system might work here. At some point (perhaps each hour?) during the day, a cron job runs a commit, and overnight, each machine runs a checkout. You could run into issues with a long commit not being done when a checkout needs to run, and essentially the same thing could be done rsync.
I guess what I'm thinking is that a central server would make your sync operation easier - conflicts can be handled once on central, then pushed out to the other machines.

rsync would be your best choice. But you need to carefully consider how you are going to resolve conflicts between updates to the same data on different sites. If site-1 has updated
'customers.doc' and site-2 has a different update to the same file, how are you going to resolve it?

I have to agree with Matt McMinn, especially since it's company data, I'd use source control, and depending on the rate of change, run it more often.
I think the central clearinghouse is a good idea.

Depends upon following
* How many servers/computers that need to be synced ?
** If there are too many servers using rsync becomes a problem
** Either you use threads and sync to multiple servers at same time or one after the other.
So you are looking at high load on source machine or in-consistent data on servers( in a cluster ) at given point of time in the latter case
Size of the folders that needs to be synced and how often it changes
If the data is huge then rsync will take time.
Number of files
If number of files are large and specially if they are small files rsync will again take a lot of time
So all depends on the scenario whether to use rsync , NFS , Version control
If there are less servers and just small amount of data , then it makes sense to run rysnc every hour.
You can also package content into RPM if data changes occasionally
With the information provided , IMO Version Control will suit you the best .
Rsync/scp might give problems if two people upload different files with same name .
NFS over multiple locations needs to be architect-ed with perfection
Why not have a single/multiple repositories and every one just commits to those repository .
All you need to do is keep the repository in sync.
If the data is huge and updates are frequent then your repository server will need good amount of RAM and good I/O subsystem

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string