Chef attributes versus data bags - attributes

I'm new to Chef, and after reading the documentation I'm still having trouble understanding when to use Attributes and when to use Data bags.
What kind of data should be stored as attributes, and what kind of data should be stored in data bags?
Thanks

Well, it depends. Although data bags and attributes both hold data, the major difference between them is that attributes are exposed as node properties when a recipe is run, whereas there is no clear overview of which data bags were used (short of going through the recipes in the run list).
What I personally store in attributes are:
Paths where something (files, programs) is installed, created
Software Versions
Urls, ports (to download from, servers listen on etc.)
Usernames
And in data bags:
Everything that cannot be exposed - in encrypted data bags (private keys, passwords)
user properties (name, shell, password hashes, public key, comment etc.)
Some other configurations that are more like objects than simple string or number data, and that are not important to the node itself.
About the last point: an example is a list of Maven repositories. A repository has properties: name, url, policy etc. It is not important to the node which repositories are configured; what matters is that it has Maven installed.
Another example is users: only the available usernames are in the attributes. All the other data is in a data bag, although it could be exposed, since there is no secret data there.

Of course this is one of those things where there isn't an easy answer. My rule of thumb is that anything that is one thing of many belongs in a data bag. For example, if you have a list of users and groups that you want to create on a node using fnichol's users cookbook, then that's a data bag. For tweaking parameters on a MySQL server, it's attributes.

Is there a cross between a relational database and Git?

I'm looking for a programmatic way to store persistent data that's easily searchable on any field (like SQL) and keeps all of its history (like Git or a Wiki). I.e. if a bad value is saved, it can be reverted to a previous good value but without the complexity of having "previous value" tables for all tables.
I also need integrity between certain parts of the data, e.g. entities of both class A and class B must point to a valid entity of class C (not deleted), but class A and class B are not subsets or supersets of each other.
It must be as performant as a database but it doesn't have to be a database, it's just what I know and am used to. Is there a searchable persistent storage system that keeps its history and allows reversion of values (not the whole dataset at once)? If not, is there an easy way to use an existing product in this way?
You might try Fossil (https://www.fossil-scm.org/). It's an SCM that stores commits in a SQLite database and supports bidirectional syncing with Git repos.

Combine CouchDB databases with replication while recording source db

I’m just starting out with CouchDB (2.1), and I’m planning to use it to replicate confidential per-user data from a mobile app up to my server. I’ve read that per-user databases are the best way to do this, and I’ve set that up. Each database has a mix of user-created documents of types Foo and Bar.
Now, I’d also like to be able to collect multi-user slices of that data together into one database and build views on it for admin reporting. Say I want a database which contains all the Foos from all users. So far so good, an entry in _replicator with a filter from each user database to one target does the job.
But looking at the combined database, I can’t tell which user a given Foo came from. I could write the user id into each document within the per-user database but that seems redundant and adds the complexity of validation. Is there any other way?
CouchDB's replicator simply tries to match up the exact state of a given document in the target database. If it can't, it stores more or less the exact source contents anyway (as a conflicting version).
Furthermore the _rev field of a document, which the replication system uses to check if a document needs to be updated, is actually based on (a hash over) the other document fields.
So unfortunately you can't add metadata during replication. This would indeed be handy for this and other per-user vs. shared replication situations, but it's not something CouchDB currently supports, and it would break some optimizations to add support for it.
I could write the user id into each document within the per-user database but that seems redundant and adds the complexity of validation. Is there any other way?
Including something like a .user field in each document is the right solution.
As for being redundant, I wouldn't think of it that way, or at least not as a bad thing. You'll find that with CouchDB (as with other NoSQL stores) there's a trend to "denormalize" data to begin with. Especially given the things replication lets me do operationally and architecturally, I'd much rather have a self-contained document than one that relies on metadata derived from a database name.
I'm not sure exactly how an extra field will make validation more complex in your case, so I can't fully speak to that. You do want to make sure the user writing the document has set it "honestly", so yes, there is a bit more complication, but it is usually not too burdensome.
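A minimal Python sketch of the "honesty" check, for illustration only. It assumes the per-user databases follow the `userdb-<hex of username>` naming used by CouchDB's couch_peruser feature, and that each document carries a hypothetical `user` field. In CouchDB itself this kind of check would normally live in a `validate_doc_update` function written in JavaScript.

```python
def peruser_db_name(username: str) -> str:
    """Database name for a user, assuming the userdb-<hex(username)> convention."""
    return "userdb-" + username.encode("utf-8").hex()


def user_field_is_honest(doc: dict, db_name: str) -> bool:
    """True if the document's (hypothetical) 'user' field matches the database owner."""
    user = doc.get("user")
    return isinstance(user, str) and peruser_db_name(user) == db_name


foo = {"_id": "foo-1", "type": "Foo", "user": "bob"}
print(user_field_is_honest(foo, peruser_db_name("bob")))    # True
print(user_field_is_honest(foo, peruser_db_name("alice")))  # False
```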

MEAN Stack: static list best practice

This is a general best practice question:
I am building a MEAN (mongo, express, angular, node) website. I have a user object that can have a gender [Mr or Miss] and a city [Paris, New York, Anything]
So this is quite a common problem: where should I store those lists that rarely change and never exceed, let's say, 50 rows.
1/ Is it better to store them in the database (mongo), with a foreign key in the user table? Then I would have a gender table and a city table, but every time I access these lists I would need to read the database.
2/ Is it better to store them in a file or in a controller? But this is a bit dangerous, I think.
3/ Maybe there is another way that I don't know about.
I am not sure what is the best solution.
Are you concerned about an extra database call to get a list out?
If it were me, I'd pick option 1 and store it in a database. If you store the value descriptions only in the front end, you run the risk of discrepancies: you may update your database's foreign keys but forget to update your controller or file, which makes this approach rather untrustworthy. It also makes internationalization more difficult, because you'll have to start storing names and genders in files or controllers in multiple languages. Storing things is what a database is for, and an additional call to get a list out really doesn't have that big an impact on your performance.
Angular's $http service, which you are probably using to call your API, has a caching option, which means you'll only need to retrieve the list once per app instantiation.
You could alternatively have a look at this post by Josh who found a way to pre populate a directive with JSON from the server before loading it.
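The fetch-once idea behind the caching option mentioned above, sketched in Python rather than Angular purely for illustration: the list is loaded on the first call and served from memory afterwards. `load_cities_from_db` is a hypothetical stand-in for the real database or API call.

```python
from functools import lru_cache

calls = {"count": 0}  # counts how often the "database" is actually hit


def load_cities_from_db() -> list:
    """Hypothetical stand-in for a real database or API call."""
    calls["count"] += 1
    return ["Paris", "New York"]


@lru_cache(maxsize=None)
def get_cities() -> tuple:
    """Fetch the list once; later calls return the cached tuple."""
    return tuple(load_cities_from_db())


get_cities()
get_cities()
print(calls["count"])  # 1
```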

How to selectively replicate private and shared portions of a CouchDB database?

We're looking into using CouchDB/CouchCocoa to replicate data to our mobile app.
Our system has a large number of users. Part of the database is private to each user -- for example their tasks. These I've been able to replicate without problem using filtered replication.
Here's the catch... The database also includes shared information only some of which pertains to a given user. How do I selectively replicate that shared information? For example a user's task might reference specific shared documents. Is there a way to make sure those documents are included in the replication without including all the shared documents?
From the documentation it seems that adding doc_ids to the replication (or adding another replication with those doc ids) might be one solution. Has anyone tried this? Are there other solutions?
EDIT: Given the number of users it seems impractical to tag each shared document with all the users sharing it but perhaps that's the only way to do this?
The final solution mostly depends on your document structure, but currently I see two use-cases:
1. As you keep everything within a single database, you probably have some fields set to indicate whether a document is shared or private, right? Example:
owner: "Mike"
participants: [] // if nobody is mentioned, the document looks private(?)
So you just need filters that handle only private documents or only shared ones: by tags, number of participants, references, or something similar.
Also, if you need to replicate some documents only for a specific user (e.g. only for Mike), then you need a special view to collect all these documents and, yes, use replication by document ids. This wouldn't be an atomic request, though: you need some service script to handle these steps. If shared documents are defined by references to them, then the only solution is the same: a service script, a view that generates the document reference tree, and replication by doc._id's.
2. Review your architecture. Having a per-user database is a normal use-case for CouchDB and follows its approach to data partitioning and isolation. So you may create a per-user database that is private to that user. For shared documents you may create additional databases and restrict them through the database members in the security options. Each "shared" database will serve only a certain set of participants, by name or by group, so there can't be any data leaks unless there is a CouchDB bug (:
This approach may look weird at first sight, but all you need is a management script that handles database creation and publication; replications will be as easy as possible and users' data is safe.
P.S. I've supposed that a "sharing" operation makes a document visible not to everyone, but to some set of users. If I was wrong and the "shared" state means "public", then p2. becomes simpler: N user databases + 1 public one.
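A rough sketch, in Python, of what the service script for replication by document ids might build. The `source`, `target` and `doc_ids` fields are standard for a `_replicator` document; the `shared_refs` field on task documents and the database names are hypothetical.

```python
def shared_ids_for_user(task_docs: list) -> list:
    """Collect the ids of shared documents referenced by a user's tasks.

    Assumes each task lists its references in a hypothetical 'shared_refs' field.
    """
    ids = set()
    for task in task_docs:
        ids.update(task.get("shared_refs", []))
    return sorted(ids)


def replication_doc(source: str, target: str, doc_ids: list) -> dict:
    """Build a document for the _replicator database limited to doc_ids."""
    return {"source": source, "target": target, "doc_ids": doc_ids}


mikes_tasks = [
    {"_id": "task-1", "owner": "Mike", "shared_refs": ["shared-a", "shared-b"]},
    {"_id": "task-2", "owner": "Mike", "shared_refs": ["shared-b"]},
]
doc = replication_doc("main", "mike-device", shared_ids_for_user(mikes_tasks))
print(doc["doc_ids"])  # ['shared-a', 'shared-b']
```

Since this is not atomic, such a script would have to run again whenever a task's references change.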

How do I secure data access in my new API?

I am designing an API, and I'd like to ask a few questions about how best to secure access to the data.
Suppose the API is allowing access to artists. Artists have albums, that have songs.
The users of the API have access to a subset of all the artists. If a user calls the API asking for some artist, it is easy to check if the user is allowed to do so.
Next, if the user asks for an album, the API has to check if the album belongs to an artist that the user is allowed to access. Accessing songs means that the API has to check the album and then the artist before access can be granted.
In database terms, I am looking at an increasing number of joins between tables for each additional layer that is added. I don't want to do all those joins, and I also don't want to store the user id everywhere in order to limit the number of joins.
To work around this, I came up with the following approach.
The API gives the user a reference to an object, for instance an artist object. The user can then ask that artist object for the albums, which returns a list object. The list object can be traversed, and album objects can be obtained from it. Likewise, from an album object a songlist object can be obtained and from that, the individual song objects.
Since the API trusts the artist object, it also trusts any objects (albums in this case) that the user gets from it, without further checks. And so forth for all the other objects. So I am delegating the security/trust to objects down the chain.
I would like to ask you what you think of it, what's good or bad about it, and of course, how you would solve this "problem".
Second, how would you approach this if the API should be RESTful? My approach seems less applicable in that case.
Is this a real program or rather a sample to illustrate a question?
Because it is not clear why you would restrict access to the artists and albums rather than just to individual media items or even tracks.
I don't think that the joins should cost you that much, any half-smart DB system will do them cheaply enough when you are making a fairly simple criteria match on multiple tables.
IMHO, the problem with putting that much security logic into queries is that it limits your ability to handle the more complex DRM issues that are sure to come up. For example, what if the album is a collection from multiple artists? What if the album contains a track which is a duet and I only have access to one artist? etc, etc.
My view is that in those situations, a convenient programming model with sensible exceptions is much more important than the performance of individual queries, which you could always cache or optimize in the future. What you are trying to do with queries sounds like premature optimization.
Design your programming model to be as flexible as possible. Define a sensible set of extensions, then work on implementing the database, and optimize queries after profiling the real system.
It is possible that doing the joins is much faster than your object approach (although the object approach is more elegant). With the joins you have only one db request; with the objects you have many. (Or you have to retrieve all the "possible" data in the first request, which could also slow things down.)
I recommend doing the joins. If there is a problem about the sql you can ask at stackoverflow :D
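To illustrate the single-request point, here is a small self-contained sketch using Python's built-in sqlite3 module. The table and column names (`artists`, `albums`, `songs`, `user_artists`) are invented for the example; one joined query decides whether a user may access a song.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE artists (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE albums (id INTEGER PRIMARY KEY, artist_id INTEGER, title TEXT);
    CREATE TABLE songs (id INTEGER PRIMARY KEY, album_id INTEGER, title TEXT);
    CREATE TABLE user_artists (user_id INTEGER, artist_id INTEGER);
    INSERT INTO artists VALUES (1, 'The Beatles');
    INSERT INTO albums VALUES (10, 1, 'White Album');
    INSERT INTO songs VALUES (100, 10, 'Happiness Is a Warm Gun');
    INSERT INTO user_artists VALUES (7, 1);
""")


def can_access_song(user_id: int, song_id: int) -> bool:
    """One query: song -> album -> artist -> user permission."""
    row = conn.execute(
        """
        SELECT 1
        FROM songs s
        JOIN albums al ON al.id = s.album_id
        JOIN user_artists ua ON ua.artist_id = al.artist_id
        WHERE s.id = ? AND ua.user_id = ?
        LIMIT 1
        """,
        (song_id, user_id),
    ).fetchone()
    return row is not None


print(can_access_song(7, 100))  # True
print(can_access_song(8, 100))  # False
```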
Another idea:
If you make urls like "/beatles/whitealbum/happinesisawarmgun"
then you would know the artist at the beginning of the request and could check the permission at once without traversing, because the url contains the traversal information. Just a thought.
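A small Python sketch of the URL idea. The path layout `/artist/album/song` comes from the example above; the `allowed_artists` mapping is a hypothetical permission store.

```python
def artist_from_path(path: str) -> str:
    """Return the first path segment, which names the artist."""
    segments = [s for s in path.split("/") if s]
    if not segments:
        raise ValueError("path does not name an artist")
    return segments[0]


# Hypothetical permission store: user -> artists they may access.
allowed_artists = {"alice": {"beatles"}, "bob": {"stones"}}


def may_access(user: str, path: str) -> bool:
    """Check permission from the URL alone, without loading album or song."""
    return artist_from_path(path) in allowed_artists.get(user, set())


print(may_access("alice", "/beatles/whitealbum/happinesisawarmgun"))  # True
print(may_access("bob", "/beatles/whitealbum/happinesisawarmgun"))    # False
```

This only settles the permission question; the lookup itself would still have to confirm that the album and song really exist under that artist.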
It is a good idea to include a security descriptor for each resource and not only for the top-level one. In your example the security descriptor is simply the artist's ID, or a list of artists' IDs if you support duets etc. So I would think about adding the list of IDs to both the artists and the songs tables. You can add a string field in which the artist IDs for the resource are written, comma-separated.
Such a solution scales well: you can add more layers without increasing the time needed for the security check. Adding a new resource also carries no additional penalty except for one more field to insert (based on the parent resource's field). And of course, this solution supports the special situations described above (like more than one artist etc.).
This kind of solution also doesn't violate RESTful architecture.
And the fact that each resource contains its own security descriptor generalizes the resource's access permissions, making it possible to implement some completely different security policy in future (for example, making access permissions more granular, based on albums, not only artists).
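A sketch of the security-descriptor idea in Python. It assumes the comma-separated string field described above (called `artist_ids` here, a made-up name) and treats access to any one listed artist as sufficient, which is just one possible policy for duets.

```python
def parse_descriptor(artist_ids: str) -> set:
    """Parse a comma-separated descriptor such as '1,5' into {1, 5}."""
    return {int(part) for part in artist_ids.split(",") if part.strip()}


def may_access(user_artist_ids: set, resource: dict) -> bool:
    """Grant access if the user may access at least one listed artist (assumed policy)."""
    return bool(user_artist_ids & parse_descriptor(resource["artist_ids"]))


album = {"title": "White Album", "artist_ids": "1"}
duet = {"title": "Some Duet", "artist_ids": "1,5"}  # a song with two artists

print(may_access({5}, album))  # False
print(may_access({5}, duet))   # True
print(may_access({1}, duet))   # True
```

The check costs the same at every layer, because each resource carries its own descriptor and no parent needs to be looked up.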
