How do I secure data access in my new API?

I am designing an API, and I'd like to ask a few questions about how best to secure access to the data.
Suppose the API allows access to artists. Artists have albums, which have songs.
The users of the API have access to a subset of all the artists. If a user calls the API asking for some artist, it is easy to check if the user is allowed to do so.
Next, if the user asks for an album, the API has to check if the album belongs to an artist that the user is allowed to access. Accessing songs means that the API has to check the album and then the artist before access can be granted.
In database terms, I am looking at an increasing number of joins between tables for each additional layer that is added. I don't want to do all those joins, and I also don't want to store the user id everywhere in order to limit the number of joins.
To work around this, I came up with the following approach.
The API gives the user a reference to an object, for instance an artist object. The user can then ask that artist object for the albums, which returns a list object. The list object can be traversed, and album objects can be obtained from it. Likewise, from an album object a songlist object can be obtained and from that, the individual song objects.
Since the API trusts the artist object, it also trusts any objects (albums in this case) that the user gets from it, without further checks. And so forth for all the other objects. So I am delegating the security/trust to objects down the chain.
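To make the idea concrete, here is a minimal TypeScript sketch of the delegation, assuming an in-memory catalog. All names (ArtistHandle, AlbumHandle, openArtist, grants) and the sample data are hypothetical. The permission check happens once, when the artist handle is issued; albums and songs are reachable only through that handle.

```typescript
// Hypothetical in-memory data; a real API would load this from a database.
interface SongRecord { title: string }
interface AlbumRecord { title: string; songs: SongRecord[] }
interface ArtistRecord { id: string; name: string; albums: AlbumRecord[] }

const catalog: ArtistRecord[] = [
  { id: "a1", name: "Artist One", albums: [{ title: "First", songs: [{ title: "Intro" }] }] },
  { id: "a2", name: "Artist Two", albums: [] },
];

// Assumed permission table: user id -> artist ids the user may access.
const grants: Record<string, string[]> = { alice: ["a1"] };

class AlbumHandle {
  private readonly album: AlbumRecord;
  constructor(album: AlbumRecord) { this.album = album; }
  get title(): string { return this.album.title; }
  // Returns a copy so callers cannot modify the underlying list.
  songs(): SongRecord[] { return this.album.songs.slice(); }
}

// By convention, handles are created only by openArtist, so holding one
// implies that the access check has already passed.
class ArtistHandle {
  private readonly artist: ArtistRecord;
  constructor(artist: ArtistRecord) { this.artist = artist; }
  get name(): string { return this.artist.name; }
  // No further permission check: trust is inherited from this handle.
  albums(): AlbumHandle[] { return this.artist.albums.map(a => new AlbumHandle(a)); }
}

// The single place where access is checked.
function openArtist(userId: string, artistId: string): ArtistHandle {
  const allowed = grants[userId] ?? [];
  if (allowed.indexOf(artistId) === -1) throw new Error("access denied");
  const artist = catalog.find(a => a.id === artistId);
  if (!artist) throw new Error("artist not found");
  return new ArtistHandle(artist);
}
```

This works as long as the handles live inside one process. Over a stateless protocol the server cannot keep such objects between requests, which is why the approach looks less applicable to REST.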
I would like to ask you what you think of it, what's good or bad about it, and of course, how you would solve this "problem".
Second, how would you approach this if the API should be RESTful? My approach seems less applicable in that case.

Is this a real program or rather a sample to illustrate a question?
Because it is not clear why you would restrict access to the artists and albums rather than just to individual media items or even tracks.
I don't think the joins should cost you that much; any half-smart DB system will do them cheaply enough when you are making a fairly simple criteria match across multiple tables.
IMHO, the problem with putting that much security logic into queries is that it limits your ability to handle the more complex DRM issues that are sure to come up. For example, what if the album is a collection from multiple artists? What if the album contains a track that is a duet and I only have access to one of the artists? And so on.
My view is that in those situations, a convenient programming model with sensible exceptions is much more important than the performance of individual queries, which you can always cache or optimize later. What you are trying to do with queries sounds like premature optimization.
Design your programming model to be as flexible as possible. Define a sensible set of extensions, then implement the database and optimize the queries after profiling the real system.

It is possible that doing the joins is much faster than your object approach (although the object approach is more elegant). With the joins you have only one DB request; with the objects you have many. (Or you have to retrieve all the "possible" data in the first request, which could also slow things down.)
I recommend doing the joins. If you run into a problem with the SQL, you can ask on Stack Overflow :D
Another idea:
If you make URLs like "/beatles/whitealbum/happinessisawarmgun"
then you would know the artist at the beginning of the request and could check the permission at once without traversing, because the URL contains the traversal information. Just a thought.
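A rough TypeScript sketch of that thought, assuming paths of the form /artist/album/song. The permission table allowedArtists and the function names are made up for illustration.

```typescript
// Assumed permission table: user id -> artist slugs the user may access.
const allowedArtists: Record<string, string[]> = {
  alice: ["beatles"],
};

interface ResourcePath { artist: string; album?: string; song?: string }

// Split "/artist/album/song" into its parts; album and song are optional.
function parseResourcePath(path: string): ResourcePath {
  const parts = path.split("/").filter(p => p.length > 0);
  if (parts.length === 0 || parts.length > 3) {
    throw new Error("unsupported path: " + path);
  }
  return { artist: parts[0], album: parts[1], song: parts[2] };
}

// One lookup on the first path segment decides access for the whole request.
function mayAccess(userId: string, path: string): boolean {
  const { artist } = parseResourcePath(path);
  return (allowedArtists[userId] ?? []).indexOf(artist) !== -1;
}
```

Note that the query that loads the album or song still has to be keyed by all the path segments, so that a permitted artist slug cannot be combined with another artist's album. The sketch only removes the separate permission traversal.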

It is a good idea to include a security descriptor for each resource, not only for the top-level ones. In your example the security descriptor is simply the artist's ID, or a list of artist IDs if you support duets etc. So I would think about adding the list of IDs to both the artists and the songs tables. You can add a string field in which the artist IDs for the resource are written as a comma-separated list.
Such a solution scales well: you can add more layers without increasing the time needed for the security check. Adding a new resource also carries no additional penalty except for one more field to insert (based on the resource's parent field). And of course, this solution supports the special situations described above (such as more than one artist).
This kind of solution also doesn't violate RESTful architecture.
And the fact that each resource contains its own security descriptor generalizes the resource's access permissions, making it possible to implement some completely different security policy in future (for example, making access permissions more granular, based on albums, not only artists).
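As an illustration, the descriptor check could look roughly like this in TypeScript. The row shape, the function names, and the "access to any one listed artist is enough" policy are assumptions of the sketch; a stricter policy could require access to every listed artist.

```typescript
// Every resource row carries its own descriptor:
// a comma-separated list of artist ids, as suggested above.
interface SecuredRow { id: string; artistIds: string }

// Turn "a1, a2" into ["a1", "a2"], ignoring blanks.
function parseDescriptor(descriptor: string): string[] {
  return descriptor.split(",").map(s => s.trim()).filter(s => s.length > 0);
}

// Assumed policy: the user may read the row if they have access
// to at least one of the artists listed in its descriptor.
function canRead(userArtistIds: Set<string>, row: SecuredRow): boolean {
  return parseDescriptor(row.artistIds).some(id => userArtistIds.has(id));
}

// A new child resource copies the descriptor from its parent at insert time,
// so the check never has to walk up the hierarchy.
function inheritDescriptor(parent: SecuredRow, childId: string): SecuredRow {
  return { id: childId, artistIds: parent.artistIds };
}
```

The check cost stays the same no matter how many layers sit between the resource and the artist, at the price of having to update the copies if ownership ever changes.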

Related

What are the RESTful API best practices in this case?

I am new to Node.js and have an application in which there are multiple organizations with multiple admins and multiple groups with multiple users who can make multiple posts. Like this:
Organization
Admins
Groups
Users
Posts
Admins have access to everything within the organization. Their primary goal is to observe and analyze the posts their users are making. Admins can:
Get all posts by organization
Get all posts by group
Get all posts by user
Because there are three specific ways in which posts can be queried, I have built three separate routes and handler functions for each of the ways to query posts:
/api/posts/organization/:organizationID
/api/posts/group/:groupID
/api/posts/user/:userID
As I have learned more about RESTful APIs, everything I see tells me that "path params are used to identify a specific resource or resources, while query parameters are used to sort/filter those resources."
This is a bit confusing for a beginner like me. It seems like "posts" are the "specific resource" here, so I should change my API to have one api/posts/ route and use query params to filter them. Is that right?

There is no single best practice, and your approach seems reasonable. Typically I would structure this as:
/users/:id/posts
/groups/:id/posts
/organizations/:id/posts
This makes the relationship clearer (posts belong to users).
But your approach, and using one endpoint with query parameters, are both reasonable too. The most important thing is to be consistent and ideally to find an existing style guide to follow.

What matters here is that you need to be able to identify your resources. The simplest approach is:
/api/posts/{postId}
/api/organizations/{organizationID}
/api/groups/{groupID}
/api/users/{userID}
As for admin vs. regular user, that is called role-based access control (RBAC). Roles can be another resource, though if you never want to edit them, then just hardcode them and manage roles by users and groups.
The path is for the hierarchical part of resource identification and the query is for the non-hierarchical part, but they sort of overlap, which is not a big deal; you can even support both. So the URI is a unique identifier, but it is not exclusive: you can have multiple URIs that identify the same resource.
As for the query part, I like to use it for filtering collections and to always return an array, even if it contains one or zero items. So with my approach:
/api/users/{userID}
/api/users/?id={userID}
These two are not exactly the same, because the second one returns an array with a single item. But this is not a standard, just my preferred approach.
I prefer the simple URIs above to heavily nested ones, and add more depth only if the API grows really big. It is like namespacing in a programming language. For a while you are OK with the global namespace, but if it grows big, you split it up into multiple namespaces.
In your case I think I would do it the opposite way from what you did:
Get all posts by organization: /posts/?organizationId={id}
Get all posts by group: /posts/?groupId={id}
Get all posts by user: /posts/?userId={id}
Another approach is:
Get all posts by organization: /organizations/{id}/posts/
Get all posts by group: /groups/{id}/posts
Get all posts by user: /users/{id}/posts
You can even support both approaches simultaneously or a different approach you like better.
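For the query-parameter variant, a single handler can apply whichever filters are present. Below is a minimal TypeScript sketch, assuming the query string has already been parsed into a plain object (Express's req.query is one such object). PostRecord, PostQuery, and listPosts are hypothetical names.

```typescript
interface PostRecord {
  id: number;
  userId: string;
  groupId: string;
  organizationId: string;
}

// Already-parsed query parameters; every filter is optional.
interface PostQuery {
  userId?: string;
  groupId?: string;
  organizationId?: string;
}

// Apply only the filters that were supplied.
// The result is always an array, even with one or zero matches.
function listPosts(all: PostRecord[], query: PostQuery): PostRecord[] {
  return all.filter(post =>
    (query.userId === undefined || post.userId === query.userId) &&
    (query.groupId === undefined || post.groupId === query.groupId) &&
    (query.organizationId === undefined || post.organizationId === query.organizationId)
  );
}
```

The nested routes can then be thin wrappers that call the same function with one filter filled in from the path, which is one way of supporting both styles at once.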
To be honest, when you do something that is really REST, the URI structure does not matter that much from the REST client's perspective, because the client checks the description of the hyperlink it got from the server and does not care much about the URI structure. So the response should contain something like the following:
{
  "type": "link",
  "operation": "listPostsForOrganization",
  "method": "get",
  "uri": "/api/organizations/123/posts/"
}
And you use the API with the client like this:
let organization = await api.getOrganizationForUser(session.currentUser)
let posts = await api.listPostsForOrganization(organization)
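Under the hood, such a client only needs to look up the link by its operation name. Here is a small TypeScript sketch of that lookup, assuming the link shape shown above; ApiLink, findLink, and toRequest are hypothetical names, and the actual HTTP call is left out.

```typescript
// Shape of a hyperlink as returned by the server (taken from the example above).
interface ApiLink {
  type: "link";
  operation: string;
  method: string;
  uri: string;
}

// Find the link for an operation among the links the server offered.
// The client never builds URIs on its own.
function findLink(links: ApiLink[], operation: string): ApiLink {
  const link = links.find(l => l.operation === operation);
  if (!link) throw new Error("operation not offered by server: " + operation);
  return link;
}

// Turn a link into the pieces an HTTP library would need.
function toRequest(link: ApiLink): { method: string; url: string } {
  return { method: link.method.toUpperCase(), url: link.uri };
}
```

If the server later moves the resource to a different URI, only the link in the response changes; the client code that asks for "listPostsForOrganization" stays the same.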

Reuse same database tables in different repositories (repositories overlap on the data they access)

Suppose I have database tables Customer, Order, Item. I have an OrderRepository that accesses, directly with SQL/my ORM, both the Order and Item tables. E.g. I could have a method getItems on the OrderRepository that returns all items of that order.
Suppose I now also create an ItemRepository. Given that I now have two repositories accessing the same database table, is that generally considered poor design? My thinking is that sometimes a user wants to update the details of an Item (e.g. its name), but when using the OrderRepository, it doesn't really make sense to be unable to access the items directly (you want to know about all the items in an order).
Of course, the OrderRepository could internally create* an ItemRepository and call methods like getItemsById(ids: string[]). However, consider the case that I want to get all orders and items ever purchased by a Customer. Assuming you had the orderIds for a customer, you could have a getOrders(ids: string[]) on the OrderRepository to fetch all the orders and then do a second query to fetch all the Items. I feel you make your life harder (and less efficient) in the sense you have to do the join to match items with orders in the app code rather than doing a join in SQL.
If it's not considered bad practice, is there some kind of limit to how much overlap repositories should have with each other? I've spent a while searching for this on the web, but it seems all the tutorials/blogs/videos really don't go further than one table per entity (which may be an anti-pattern).
Or am I missing a trick?
Thanks
FYI: using express with TypeScript (not C#)
Is a repository creating another repository considered acceptable? Shouldn't only the service layer do that?

It's difficult to separate the Database Model from the DDD design but you have to.
In your example:
GetItems should have this signature: OrderRepository.GetItems(Ids: int[]) : ItemEntity. Note that this method returns an Entity (not a DAO from your ORM). To get the ItemEntity, the method might pull information from several DAOs (tables, through your ORM), but it should only pull what it needs for the entity's hydration.
Say you want to update an item's name using the ItemRepository; your signature for that could look like ItemRepository.rename(Id: int, name: string) : void. When this method does its work, it could change the same table that GetItems above reads, but note that it could change other tables as well (for example, it could add an audit of the change to an AuditTable).
DDD gives you the ability to use different tables for different Contexts if you want. It gives you enough flexibility to make really bold choices when it comes to the infrastructure that surrounds your domain. So ultimately, it's a matter of what makes sense for your specific situation and team. Some teams would apply CQRS, and the GETOrder and Rename methods would look completely different under the covers.
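A rough TypeScript sketch of two repositories sharing a table, using in-memory arrays as stand-ins for the database. The row shapes, the sample data, and the choice to pass order ids to getItems (returning an array of entities) are assumptions made for the example.

```typescript
// In-memory stand-ins for database tables; a real version would use SQL or an ORM.
interface ItemRow { id: number; orderId: number; name: string; sku: string }
interface AuditRow { itemId: number; change: string }

const itemTable: ItemRow[] = [
  { id: 1, orderId: 10, name: "Widget", sku: "W-1" },
  { id: 2, orderId: 10, name: "Gadget", sku: "G-1" },
  { id: 3, orderId: 11, name: "Doohickey", sku: "D-1" },
];
const auditTable: AuditRow[] = [];

// Entity handed to callers: only what the domain needs, not the raw row.
interface ItemEntity { id: number; name: string }

class OrderRepository {
  // Reads the item table, filtered by the given order ids.
  getItems(orderIds: number[]): ItemEntity[] {
    return itemTable
      .filter(row => orderIds.indexOf(row.orderId) !== -1)
      .map(row => ({ id: row.id, name: row.name }));
  }
}

class ItemRepository {
  // Writes the same item table, and also records the change in an audit table.
  rename(id: number, name: string): void {
    const row = itemTable.find(r => r.id === id);
    if (!row) throw new Error("item not found: " + id);
    auditTable.push({ itemId: id, change: row.name + " -> " + name });
    row.name = name;
  }
}
```

Both classes touch the item table, but each exposes only the operations its own callers need, and neither leaks the row shape.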

CQRS/Event Sourcing - Does one expect to receive an Aggregate Id from the user/request?

I am currently just trying to learn some new programming patterns and I decided to give event sourcing a shot.
I have decided to model a warehouse as my aggregate root in the domain of shipping/inventory, where the number of warehouses is generally pretty constant (i.e. a company won't be adding warehouses too often).
I have run into the question of how to set my aggregateId, which should correspond to a warehouse, on my server. Most examples I have seen, including this one, show the aggregate ID being generated server side when a new aggregate is being created (in my case a warehouse), and then passed in the command request when referring to that aggregate for subsequent commands.
Would you say this is the correct approach? Can I expect the user to know and pass aggregate Ids when issuing commands? I realize this is probably domain dependent and could also be a UI/UX choice as well, just wondering what others have done. It would make more sense to me if my event-sourced aggregates were created more frequently, such as with meal tabs or shopping carts.
Thanks!

Heuristic: aggregate id, in many cases, is analogous to the primary key used to distinguish entities in a database table. Many of the lessons of natural vs surrogate keys apply.
Can I expect the user to know and pass aggregate Ids when issuing commands?
You probably can't depend on the human to know the aggregate ids. But the client that the human operator is using can very well know them.
For instance, if an operator is going to be working in a single warehouse during a session, then we might look up the appropriate identifier, cache it, and use it when constructing messages on behalf of the user.
Analogy: when you fill in a web form and submit it, the browser does the work of looking at the form action and using that information to construct the correct URI, and similarly the correct HTTP Request.
The client will normally know what the ID is, because it just got it during a previous query.
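As a sketch of what that looks like on the client, here is some TypeScript in which the session looks up the warehouse id once and then stamps it on every command. The WarehouseSession class, the ReceiveStock command, and the lookup function are all hypothetical.

```typescript
interface WarehouseSummary { id: string; name: string }

// Hypothetical command message; the operator never types the warehouseId.
interface ReceiveStockCommand {
  type: "ReceiveStock";
  warehouseId: string;
  sku: string;
  quantity: number;
}

class WarehouseSession {
  private warehouseId: string | undefined = undefined;
  private readonly lookup: (name: string) => WarehouseSummary | undefined;

  // The lookup stands in for a query against the read side.
  constructor(lookup: (name: string) => WarehouseSummary | undefined) {
    this.lookup = lookup;
  }

  // Resolve the human-friendly name to an aggregate id once and cache it.
  selectWarehouse(name: string): void {
    const found = this.lookup(name);
    if (!found) throw new Error("unknown warehouse: " + name);
    this.warehouseId = found.id;
  }

  // Build a command on behalf of the operator using the cached id.
  receiveStock(sku: string, quantity: number): ReceiveStockCommand {
    if (this.warehouseId === undefined) throw new Error("no warehouse selected");
    return { type: "ReceiveStock", warehouseId: this.warehouseId, sku, quantity };
  }
}
```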
Creation patterns are weird. It can, in some circumstances, make sense for the client to choose the identifier to be used when creating a new aggregate. In others, it makes sense for the client to provide an identifier for the command message, and the server decides for itself what the aggregate identifier should be.
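A minimal TypeScript sketch of the second variant, where the client supplies only a command id and the server picks the aggregate id. Using the command id to recognize retried messages is an assumption of this sketch rather than something stated above, and the counter-based id scheme is purely illustrative.

```typescript
// The client chooses commandId; it does not know the aggregate id yet.
interface CreateWarehouseCommand { commandId: string; name: string }

class WarehouseCommandHandler {
  // commandId -> aggregateId, so a retried command maps to the same aggregate.
  private readonly handled = new Map<string, string>();
  private counter = 0;

  // Hypothetical id scheme; a real system might generate GUIDs here instead.
  private nextAggregateId(): string {
    this.counter += 1;
    return "warehouse-" + this.counter;
  }

  // Returns the aggregate id the server decided on for this command.
  handle(command: CreateWarehouseCommand): string {
    const existing = this.handled.get(command.commandId);
    if (existing !== undefined) return existing; // retried message, same result
    const aggregateId = this.nextAggregateId();
    this.handled.set(command.commandId, aggregateId);
    return aggregateId;
  }
}
```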
It's messaging, so you want to be careful about coupling the client directly to your internal implementation details -- especially if that client is under a different development schedule. If you get the message contract right, then the server and client can evolve in any way consistent with the contract at any time.
You may want to review Greg Young's 10 year retrospective, which includes a discussion of warehouse systems. TL;DR - in many cases the messages coming from the human operators are events, not commands.

Would you say this is the correct approach?
You're asking if one of Greg Young's Event Sourcing samples represents the correct approach... Given that the combination of CQRS and Event Sourcing was essentially (re)invented by Greg, I'd say there's a pretty good chance of that.
In general, letting the code that implements the Command-side generate a GUID for every Command, Event, or other persistent object that it needs to write is by far the simplest implementation, since GUIDs are, for practical purposes, unique. In a distributed system, uniqueness without coordination is a big thing.
Can I expect the user to know and pass aggregate Ids when issuing commands?
No, and you particularly can't expect a user to know the GUID of their assets. What you may be able to do is to present the user with a list of his or her assets. Each item in the list will have the GUID associated, but it may not be necessary to surface that ID in the user interface. It's just data that the underlying UI object carries around internally.
In some cases, users do need to know the ID of some of their assets (e.g. if it involves phone support). In that case, you can add a lookup API to address that concern.

What are the efficiency costs associated with using a custom ID in MongoDB?

I plan on using this NPM package (shortid) to produce shorter IDs, primarily for use in URLs. I wish to use them, as directed, as the MongoDB id (at least for certain collections).
What are the costs associated with using custom IDs? Will it affect lookup time, write time etc. in any significant way?

These types of questions can quickly wander off into a battle of opinions so rather than stating an opinion I think providing some pros and cons and letting you decide which is better for this application would make more sense.
Assuming the "shortid" will be stored as a string, I think a response by Abigail Watson to a similar question on Google Groups sums up some of the larger points. Her response is primarily aimed at Meteor apps, so some of her pros/cons are tied to design decisions made by the Meteor team, but it shows that whether to use an ObjectId or a "shortid" is an application-specific decision.
Her entire response:
ObjectId Pros
it has an embedded timestamp in it.
it's the default Mongo _id type; ubiquitous
interoperability with other apps and drivers
ObjectId Cons
it's an object, and a little more difficult to manipulate in practice.
there will be times when you forget to wrap your string in new ObjectId()
it requires server side object creation to maintain _id uniqueness, which makes generating them client-side by minimongo problematic
String Pros
developers can create domain specific _id topologies
String Cons
developer has to ensure uniqueness of _ids
findAndModify() and getNextSequence() queries may be invalidated
Meteor's choice to go with a string, as I understand it, basically boils down to latency compensation and being able to generate the _id on the client-side in mini-mongo. The default ObjectId implementation didn't lend itself to being generated on the client as part of the latency compensation framework, so they decided to roll their own _id scheme.
Personally, I find the embedded timestamps in ObjectIds to be invaluable later in an application's lifecycle. They are more difficult to manipulate, and they add more debugging time to an application's development cycle. But for the extra 10 or 20 hours you put into debugging the ObjectIds, can return 10x or 100x savings down the road. Example: at work, we just salvaged a year's worth of production data because of the embedded timestamps, which has saved us probably hundreds of thousands of dollars of R&D time and effort.
ObjectId's are great if you can ensure that there's one central authority for generating them. They're also the preferred index type for any type of timeseries data. And while it may seem tempting to try to make a one-or-the-other decision for your entire app, I find choosing a string vs ObjectId (vs some other index scheme) really boils down to the topology of the data in the collection.
Some useful questions to maybe ask when choosing the _id for a collection:
Does the data in the collection need latency compensation?
Is it time-series data?
Will other applications or worker utilities be accessing the collection?
What is the topology of the data in the collection?
https://groups.google.com/d/msg/meteor-talk/f-ljBdZOwPk/oQYZQxCAKN8J
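The embedded timestamp mentioned in the quoted response can be read straight from the id: the first 4 bytes (8 hex characters) of an ObjectId are the creation time in seconds since the Unix epoch. The drivers expose this through getTimestamp() on an ObjectId; the standalone TypeScript helper below (objectIdTimestamp is a made-up name) just shows the arithmetic on the hex string.

```typescript
// Extract the creation time from a 24-character hex ObjectId string.
// The first 8 hex characters encode seconds since the Unix epoch.
function objectIdTimestamp(hex: string): Date {
  if (!/^[0-9a-fA-F]{24}$/.test(hex)) {
    throw new Error("not a 24-character hex ObjectId: " + hex);
  }
  const seconds = parseInt(hex.slice(0, 8), 16);
  return new Date(seconds * 1000);
}

// Example: the id 507f1f77bcf86cd799439011 was created on 2012-10-17 at 21:13:27 UTC.
const createdAt = objectIdTimestamp("507f1f77bcf86cd799439011");
```

A string id produced by a short-id generator carries no such information, so a creation date would have to be stored in a separate field.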
My two cents: if the main reason to use a "shortid" is shorter URLs, why not create a URL property that is also indexed and used only for fetching documents by URL id? You get to keep the ObjectId, so you don't have to worry about sharding or dependency issues down the road, while also having a shorter URL id value.
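To illustrate the shape of that idea, here is a TypeScript sketch with an in-memory store standing in for a collection. The ArticleDoc shape, the urlId field name, and the ArticleStore class are hypothetical; the second map plays the role of a unique index on the extra field.

```typescript
// Each document keeps its normal _id and gets a separate short id for URLs.
interface ArticleDoc { _id: string; urlId: string; title: string }

class ArticleStore {
  private readonly byId = new Map<string, ArticleDoc>();
  // Stand-in for a unique secondary index on urlId.
  private readonly byUrlId = new Map<string, ArticleDoc>();

  insert(doc: ArticleDoc): void {
    if (this.byId.has(doc._id)) throw new Error("duplicate _id: " + doc._id);
    if (this.byUrlId.has(doc.urlId)) throw new Error("duplicate urlId: " + doc.urlId);
    this.byId.set(doc._id, doc);
    this.byUrlId.set(doc.urlId, doc);
  }

  // Used only by URL routing; everything else keeps working with _id.
  findByUrlId(urlId: string): ArticleDoc | undefined {
    return this.byUrlId.get(urlId);
  }

  findById(id: string): ArticleDoc | undefined {
    return this.byId.get(id);
  }
}
```

In MongoDB itself the equivalent of the second map would be a unique index on the extra field, for example one created with createIndex({ urlId: 1 }, { unique: true }). The cost is one more index to maintain on writes.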

Complex Finds in Domain Driven Design

I'm looking into converting part of a large existing VB6 system into .NET. I'm trying to use domain driven design, but I'm having a hard time getting my head around some things.
One thing that I'm completely stumped on is how I should handle complex find statements. For example, we currently have a screen that displays a list of saved documents, that the user can select and print off, email, edit or delete. I have a SavedDocument object that does the trick for all the actions, but it only has the properties relevant to it, and I need to display the client name that the document is for and their email address if they have one. I also need to show the policy reference that this document may have come from. The Client and Policy are linked to the SavedDocument but are their own aggregate roots, so are not loaded at the same time the SavedDocuments are.
The user is also allowed to specify several filters to reduce the list. These too can be on properties that are stored on the SavedDocument or on the Client and Policy.
I'm not sure how to handle this from a Domain driven design point of view.
Do I have a function on a repository that takes the filters and returns me a list of SavedDocuments, which I then have to turn into a different object or DTO and fill with the additional client and policy information? That seems a little slow, as I have to load all the details using multiple calls.
Do I have a function on a repository that takes the filters and returns me a list of SavedDocumentsForList objects that contain just the information I want? This seems the quickest but doesn't feel like I'm using DDD.
Do I load everything from their objects and do all the filtering and column selection in a service? This seems the slowest, but also appears to be very domain-oriented.
I'm just really confused about how to handle these situations, and I'm not really seeing any other people asking questions about it, which makes me feel that I'm missing something.

Queries can be handled in a few ways in DDD. Sometimes you can use the domain entities themselves to serve queries. This approach can become cumbersome in scenarios such as yours when queries require projections of multiple aggregates. In this case, it is easier to use objects explicitly designed for the respective queries - effectively DTOs. These DTOs will be read-only and won't have any behavior. This can be referred to as the read-model pattern.
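Since the asker is on .NET, the TypeScript below is only meant to show the shape of such a read model. All row types, the SavedDocumentListItem DTO, the filter fields, and the in-memory join are assumptions; in a real system the function body would typically be a single SQL query with joins and a projection.

```typescript
// Hypothetical rows as they might come from three tables.
interface SavedDocumentRow { id: number; title: string; clientId: number; policyId: number | null }
interface ClientRow { id: number; name: string; email: string | null }
interface PolicyRow { id: number; reference: string }

// Read-only DTO shaped for the list screen; it has no behavior.
interface SavedDocumentListItem {
  readonly documentId: number;
  readonly title: string;
  readonly clientName: string;
  readonly clientEmail: string | null;
  readonly policyReference: string | null;
}

// Filters may refer to document, client, or policy fields.
interface ListFilter { titleContains?: string; clientName?: string; policyReference?: string }

function querySavedDocumentList(
  docs: SavedDocumentRow[],
  clients: ClientRow[],
  policies: PolicyRow[],
  filter: ListFilter
): SavedDocumentListItem[] {
  const clientById = new Map<number, ClientRow>();
  for (const c of clients) clientById.set(c.id, c);
  const policyById = new Map<number, PolicyRow>();
  for (const p of policies) policyById.set(p.id, p);

  const result: SavedDocumentListItem[] = [];
  for (const d of docs) {
    const client = clientById.get(d.clientId);
    if (!client) continue; // this sketch skips documents without a client row
    const policy = d.policyId === null ? undefined : policyById.get(d.policyId);
    const item: SavedDocumentListItem = {
      documentId: d.id,
      title: d.title,
      clientName: client.name,
      clientEmail: client.email,
      policyReference: policy ? policy.reference : null,
    };
    if (filter.titleContains !== undefined && item.title.indexOf(filter.titleContains) === -1) continue;
    if (filter.clientName !== undefined && item.clientName !== filter.clientName) continue;
    if (filter.policyReference !== undefined && item.policyReference !== filter.policyReference) continue;
    result.push(item);
  }
  return result;
}
```

The aggregates (SavedDocument, Client, Policy) are not loaded at all here; they stay reserved for the commands that print, email, edit, or delete a document.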
