How efficient is Puppet at handling large files? To give you a concrete example:
Let's assume we're dealing with configuration data (stored in files) on the order of gigabytes. Puppet needs to ensure that the files are up to date with every agent run.
Question: Does Puppet perform some file-digest type of operation beforehand, or does it just blindly copy every config file during agent runs?
When using file { 'name': source => <URL> }, the file content is not sent through the network unless there is a checksum mismatch between master and agent. The default checksum type is md5.
Beware of the content property for file: its value is part of the catalog. Don't assign it the contents of large files via the file() or template() functions.
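For illustration, a minimal sketch of the two forms (paths and module names are made up):

# Preferred for big files: the content stays out of the catalog and is only
# transferred when the agent's checksum differs from the master's copy.
file { '/opt/app/data.bin':
  ensure => file,
  source => 'puppet:///modules/app/data.bin',
}

# Avoid for big files: file()/template() embed the entire content in the catalog.
# file { '/opt/app/data.bin':
#   ensure  => file,
#   content => file('app/data.bin'),
# }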
So yes, you can technically manage files of arbitrary size through Puppet. In practice, I try to avoid it, because all of Puppet's files should be part of a git repo or similar. Don't push your tarballs inside there. Puppet can deploy them by other means (packages, HTTP, ...).
I'm not entirely certain how Puppet's file server works in the latest release, but in previous versions Puppet read the whole file into memory, which is why it was not recommended to use the file server to transfer files larger than 1 GB. I suggest you go through these answers and see if they make sense: https://serverfault.com/a/398133
Is it possible to use a node-local config file (Hiera?) that is used by the puppet master to compile the update list during a puppet run?
My use case is that Puppet will make changes to users' .bashrc files and to the users' home directories, but I would like to be able to control which users are managed using a file on the actual node itself, not in the site.pp manifest.
Sure, there are various ways to do this.
All information the master has about the current state of the target node comes in the form of node facts, provided to it by the node in its catalog request. A local file under local control, whose contents should be used to influence the contents of the node's own catalog, would fall into that category. Puppet supports structured facts (facts whose values have arbitrarily-nested list and/or hash structure), which should be sufficient for communicating the needed data to the master.
There are two different ways to add your own facts to those that Puppet will collect by default:
Write a Ruby plugin for Facter, and let Puppet distribute it automatically to nodes, or
Write an external fact program or script in the language of your choice, and distribute it to nodes as an ordinary file resource.
Either variety could read your data file and emit a corresponding fact (or facts) in appropriate form. The Facter documentation contains details about how to write facts of both kinds; "custom facts" (Facter plugins written in Ruby) integrate a bit more cleanly, but "external facts" work almost as well and are easier for people who are unfamiliar with Ruby.
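To make the external-fact route concrete, here is a rough sketch (all file names, fact names and user names are invented; the manifest syntax assumes Puppet 4, or Puppet 3 with the future parser and stringified facts disabled). A structured external fact can be as simple as a YAML file under Facter's facts.d directory, maintained locally on the node:

# /etc/facter/facts.d/managed_users.yaml -- edited on the node itself
managed_users:
  - alice
  - bob

The master then sees that fact in the node's catalog request and can use it while compiling:

# Somewhere in a class applied to the node
$facts['managed_users'].each |String $user| {
  file { "/home/${user}/.bashrc":
    ensure => file,
    source => 'puppet:///modules/profile/bashrc',
    owner  => $user,
  }
}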
In principle, you could also write a full-blown custom type and accompanying provider, and let the provider, which runs on the target node, take care of reading the appropriate local files. This would be a lot more work, and it would require structuring the solution a bit differently than you described. I do not recommend it for your problem, but I mention it for completeness.
Well, the question is not new, but I am still unable to find a nice solution.
I distribute binaries (100-300 MB files) via the Puppet file server, but performance is really bad, I'm sure because of the MD5 checks. I now have more than 100 servers, and my puppet master works really hard to handle all of that MD5 computation. In Puppet 3.x the checksum attribute for file{} does not work. I'm unable to upgrade to Puppet 4.x, and I have no way to change the workflow: files have to come from the Puppet file server.
So I can't believe that there is no custom file type with a fixed checksum option, but I can't find one :(
Or maybe there is some other way to download files from the Puppet file server?
Any advice will help!
rsync or packaging as a native package is not an option for me.
It is indeed reasonable to suppose that using the default checksum algorithm (MD5) when managing large files will have a substantial performance impact. The File resource has a checksum attribute that is supposed to be usable to specify an alternative checksumming algorithm among those supported by Puppet (some of which are not actually checksums per se), but it was buggy in many versions of Puppet 3. At this time, it does not appear that the fix implemented in Puppet 4 has been backported to the Puppet 3 series.
If you need only to distribute files, and don't care about afterward updating them or maintaining their consistency via Puppet, then you could consider turning off checksumming altogether. That might look something like this:
file { '/path/to/bigfile.bin':
  ensure   => 'file',
  source   => 'puppet:///modules/mymodule/bigfile.bin',
  owner    => 'root',
  group    => 'root',
  mode     => '0644',
  checksum => 'none',
  replace  => false,
}
If you do want to manage existing files, however, then Puppet needs a way to determine whether a file already present on the node is up to date. That is one of the two main purposes of checksumming. If you insist on distributing the file via the Puppet file server, and you are stuck on Puppet 3, then I'm afraid you are out of luck as far as lightening the load. Puppet's file server is tightly integrated with the File resource type, and not intended to serve general purposes. To the best of my knowledge, there is no third-party resource type that leverages it. In any case, the file server itself is a major contributor to the problem of File's checksum parameter not working -- buggy versions do not perform any type of checksumming other than MD5.
As an alternative, you might consider packaging your large file in your system's native packaging format, dropping it in your internal package repository, and managing the package (via a Package resource) instead of managing the file directly. That does get away from distributing it via the file server, but that's pretty much the point.
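A rough sketch of that approach, assuming the large file has been wrapped in a hypothetical in-house package ('myapp-data') published to your internal repository:

# The package manager, not the Puppet file server, moves the bytes;
# Puppet only checks the installed package version.
package { 'myapp-data':
  ensure => installed,
}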
I need to transfer the file contents from multiple servers to a central repository as soon as there is a change in the file. Also the requirement is that only the changed contents should be transferred, not the whole file.
Could someone let me know if it is possible using Spring-Integration File Inbound/Outbound Adapters.
The file adapters only work on local files (but they can work if you can mount the remote filesystems).
The current file adapters do not support transferring parts of files, but we are working on File Tailing Adapters, which should be in the code base soon. These will only work for text files, though (and only if you can mount the remote file system). For Windows (and other platforms that don't have a tail command), there's an Apache Commons Tailer implementation, but again, it will only work for text files, and only if you can mount the shares.
If you can't mount the remote files, or they are binary, there's no out-of-the-box solution, but if you come up with a custom solution to transfer the data (e.g. google tailing remote files), it's easy to then hook it into a Spring Integration flow to write the output.
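For the mounted-share, text-file case, here is a minimal sketch of the Apache Commons IO Tailer approach mentioned above (the path is a placeholder; how you forward the lines, e.g. into a Spring Integration channel, is up to you):

import java.io.File;

import org.apache.commons.io.input.Tailer;
import org.apache.commons.io.input.TailerListenerAdapter;

public class RemoteFileTail {

    public static void main(String[] args) throws InterruptedException {
        // Called for every new line appended to the tailed file.
        TailerListenerAdapter listener = new TailerListenerAdapter() {
            @Override
            public void handle(String line) {
                // Forward only the changed content here, e.g. send it to a
                // channel that writes to the central repository.
                System.out.println(line);
            }
        };

        // /mnt/remote/app.log stands in for a file on a mounted remote share.
        // Poll once per second, starting from the current end of the file.
        Tailer tailer = Tailer.create(new File("/mnt/remote/app.log"), listener, 1000, true);

        // Tailer.create starts a daemon thread, so keep the sketch alive here;
        // in a real application the container would keep the JVM running.
        Thread.currentThread().join();
    }
}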
I have a large Perforce depot and I believe my client currently has about 2 GB of files that are in sync with the server, but what's the best way to verify my files are complete, in sync, and up to date to a given change level (which is perhaps higher than the level a handful of files on the client currently have)?
I see the p4 verify command and its MD5s, but these just seem to be from the server's various revisions of the file. Is there a way to compare the MD5 on the server with the MD5 of the revision required on my client?
I am basically trying to minimize bandwidth and time consumed to achieve a complete verification. I don't want to have to sync -f to a specific revision number. I'd just like a list of any files that are inconsistent with the change level I am attempting to attain. Then I can programmatically force a sync of those few files.
You want "p4 diff -se".
This will compute an MD5 hash of each client file and compare it to the hash stored on the server.
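A hedged sketch of how that could feed a targeted force-sync (the depot path and change number are placeholders, and this assumes p4 diff -se prints one depot path per line):

# Files whose local content no longer matches the have revision:
p4 diff -se //depot/project/... > stale-files.txt

# Force-sync only those files to the desired change level:
sed 's/$/@12345/' stale-files.txt | p4 -x - sync -f

(p4 diff -sd can be used the same way to catch unopened files that are missing from the workspace entirely.)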
Perforce is designed to work when you keep it informed about the checked-out status of all your files. If you or other programmers on your team are using Perforce but editing files without checking them out, then that is the real issue you should fix.
There is p4 clean -n (equivalent to p4 reconcile -w -n), which would also get you a list of files that p4 would update. Of course, you can also pass a changelist to align to.
You might want to disable the check for local files that it would delete, though!
If you don't have many incoming updates, you might consider an offline local manifest file with the sizes and hashes of all files in the repository. Iterating over it and checking existence, size and hash yields the missing or changed files.
In our company, with the p4 server on the intranet, checking via a local manifest is actually not much faster than asking p4 clean. But a little! And it uses no bandwidth at all. Over the internet or a VPN it is even better!
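A small sketch of that manifest check (manifest.md5 is a made-up file containing one "<md5>  <path>" line per versioned file at the target change level):

# Reports every file that is missing or whose content differs, using
# local CPU only -- no server round-trips or bandwidth.
md5sum --check --quiet manifest.md5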
We have a pretty large SVN repository (50 GB, over 100,000 revisions). Working with it is pretty slow, and my guess is that the reason is the flat directory structure in db/revs and db/revprops (where each revision is one file).
We use the FSFS format with SVN 1.5 (on a Linux server), but the repo was created with an older SVN version. Now I read that SVN 1.5 supports "sharding", and I understand that this feature distributes the revisions into multiple directories so that a single directory doesn't contain so many files. This sounds pretty useful, but unfortunately it looks like this feature is only used for repositories that are freshly created with SVN 1.5.
How can I convert a large existing linear repo to a sharded repo? The manual mentions the tool fsfs-reshard.py, but this script says "This script is unfinished and not ready to be used on live data. Trust us." So I definitely don't want to use that. Is there an alternative?
Will an svnadmin dump and svnadmin load do the trick? http://subversion.apache.org/faq.html#dumpload
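Roughly, and with example paths (a repository created with SVN 1.5+ gets the sharded layout by default, and loading the dump rebuilds the full history in that format):

svnadmin create /srv/svn/repo-new
svnadmin dump /srv/svn/repo-old | svnadmin load /srv/svn/repo-new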
The best way is, as mentioned, a dump/load cycle. But you can also try the upgrade:
svnadmin upgrade
Make a copy of your repo first, then try the upgrade and test it (don't forget to make a backup).
Because the dump/restore process requires a lot of disk space and processing time, I published (in 2010) an improved version of fsfs-reshard.py which includes support for the Subversion 1.6 FSFS format 5:
https://github.com/ymartin59/svn-fsfs-reshard
It supports switching between the linear and sharded layouts, unpacking shards when required. Thanks to shard statistics computation, you can anticipate packed revision sizes when selecting an appropriate shard size.
Of course it must be used with care:
First, test the procedure on a repository copy if possible
Get a backup ready to be restored
Prevent access to repository when processing
Run svnadmin verify before putting it back live