Limit GraphQL Queries by Breadth

There are a number of articles, tutorials and even modules that can limit malicious recursive queries by inspecting query depth or cost, or that can rate-limit a GraphQL server against a high rate of consecutive requests (DoS attacks).
However, I haven't been able to find anything that would protect a GraphQL server from a "wide" query, one that simply pulls too many instances of a field in the same request, e.g.:
query MaliciousQuery {
  alias1: fieldName { subfield1 subfield2 ... }
  alias2: fieldName { subfield1 subfield2 ... }
  ...
  alias10: fieldName { subfield1 subfield2 ... }
  ...
  alias100: fieldName { subfield1 subfield2 ... }
  ...
  alias1000: fieldName { subfield1 subfield2 ... }
  ...
}
Yes, GraphQL allows clients to ask for exactly what they need, but there are situations where we may want to limit the number of objects of a particular type, especially if fetching such an object is expensive. I should mention that pagination is not desirable in my use case.
One way, of course, is to limit the overall length of the query string, but that is a crude way to accomplish this and will have unintended side effects on complex requests that don't even refer to the expensive objects. Cost analysis could also be used, but it seems like overkill for something this simple and would introduce complexities of its own.
It would be great if we could have a limiting directive in the schema where we could specify something like
@perRequestLimit(count: 5)
so clients cannot request more than, say, 5 of these expensive objects in a single query.
Is anyone aware of such a module, etc.? Or is there a different way to achieve this type of limiting?

A better approach would be simply to use query cost or query complexity analysis. There are existing libraries that can parse a requested query, determine its cost, and reject the request if the cost exceeds the allowed value. This lets you control both the depth and breadth of queries without having to whitelist individual queries, which would not scale well.

It appears that no such module/implementation exists, even though I feel this should be a core feature of GraphQL, especially since it can be a significant DoS vulnerability and, as mentioned, cost/complexity analysis can be overkill in some cases (as in our particular scenario).
I ended up writing a simple @resourceLimit directive to restrict individual fields from being requested multiple times via aliases. This still leaves us the option of adding cost/complexity analysis down the road if we ever need it, but for now it serves our limited requirement.
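For anyone who wants to roll their own, here is a minimal sketch of the alias-counting idea using the parse and visit helpers from graphql-js (the maxPerField threshold and the choice to count every field are assumptions; a real directive-based version would read the limit from the schema instead):

const { parse, visit } = require('graphql')

// Count how many times each field name is requested (aliases included)
// and reject the query once any field exceeds the per-request limit.
function assertBreadthWithinLimit(queryString, maxPerField = 5) {
  const counts = {}
  visit(parse(queryString), {
    Field(node) {
      const name = node.name.value // alias1: fieldName still counts as fieldName
      counts[name] = (counts[name] || 0) + 1
      if (counts[name] > maxPerField) {
        throw new Error(`Field "${name}" requested more than ${maxPerField} times in one query`)
      }
    },
  })
}

Running this before execution rejects the MaliciousQuery example above as soon as the sixth alias of fieldName is encountered.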

Related

MongoDB most efficient Query Strategy

I should say up front that I have already looked through the Mongo documentation without finding what I'm looking for. I've also read similar questions, but they always deal with very simple queries. I'm working with the native MongoDB driver for Node.js. This is a scalability problem, so the collections I'm talking about may hold anywhere from a few dozen to millions of records.
Basically I have a query and I need to validate all results (which have a complex structure). Two possible solutions come to mind:
I create a query that is as specific as possible and try to validate the results directly on the server
I use a cursor to go through the documents one by one from the client (this would also allow me to stop early if I'm looking for only one result)
Here is the question: which approach is the most efficient in terms of latency, overall time, bandwidth use, and computational load on the server and client? There is probably no single answer; in fact, I'd like to understand the pros and cons of the different approaches (and which one you'd recommend). I know the solution should be determined case by case, but I am trying to figure out what would best cover most cases.
Also, to be more specific:
A) Since this is a complex query (several nested objects with ranges of values and lists of allowed values), performing the validation on the server would certainly save bandwidth, but is that always possible? And in terms of computation, could it be more efficient to do it on the client?
B) I don't understand the cursor behaviour: is it a continuously open stream until it is closed by the server or client? Also, does each result consume resources on the server/client up front, or only when next() is called?
If anyone knows, I'd also like to understand how Mongoose solved these problems, for example in the case of custom validators.
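For what it's worth, here is a minimal sketch of option 2 with the native Node.js driver (the connection string, the items collection, the coarse filter and the validate() helper are all hypothetical stand-ins). The driver fetches cursor results from the server in batches, so iterating client-side and stopping early does not pull the entire result set over the wire:

const { MongoClient } = require('mongodb')

// Hypothetical stand-in for the complex validation logic.
function validate(doc) {
  return doc.value > 0
}

async function findFirstValid(uri) {
  const client = new MongoClient(uri)
  await client.connect()
  try {
    const cursor = client.db('mydb').collection('items')
      .find({ status: 'active' }) // coarse server-side filter
    for await (const doc of cursor) { // streamed in batches, not all at once
      if (validate(doc)) return doc // stop early; remaining batches are never fetched
    }
    return null
  } finally {
    await client.close() // also releases the server-side cursor
  }
}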

How to get all documents from a collection in FaunaDB?

I already have an answer:
const faunadb = require('faunadb')
const q = faunadb.query

exports.handler = async (event, context) => {
  const client = new faunadb.Client({
    secret: process.env.FAUNADB_SERVER_SECRET
  })
  try {
    // Getting the refs with a first query
    let refs = await client.query(q.Paginate(q.Match(q.Index('skus'))))
    // Forging a second query with the retrieved refs
    const bigQuery = refs.data.map((ref) => q.Get(ref))
    // Sending over that second query
    let allDocuments = await client.query(bigQuery)
    // All my documents are here!
    console.log('#allDocuments: ', allDocuments)
    // ...
  } catch (err) {
    // ...
  }
}
But I find this unsatisfying, because I'm making two queries for what seems like one of the most trivial DB calls. It feels inefficient and wordy to me.
As I'm just learning FaunaDB, there's probably something I don't grasp here.
My question can be split into three:
Can I query for all documents in a single call?
If not, why not? What's the logic behind such a design?
Could I make such a query without an index?
FaunaDB's FQL language is quite similar to JavaScript (which helps a lot if you want to do conditional transactions, etc.).
In essence, FaunaDB also has a Map. Given that your index contains only one value, the reference, you can write this:
q.Map(
  q.Paginate(q.Match(q.Index('skus'))),
  q.Lambda(x => q.Get(x))
)
For this specific case, you actually do not need an index, since each collection has a built-in default index that does a select-all via the Documents function.
q.Map(
  q.Paginate(q.Documents(q.Collection('<your collection>'))),
  q.Lambda(x => q.Get(x))
)
Now, in case the index you are using returns multiple values (because you want to sort on something other than ref), you need to give the Lambda the same number of parameters as there are values defined in the index. Say my index has ts and ref in values because I want to sort on time; then the query to get all values becomes:
q.Map(
  q.Paginate(q.Match(q.Index('<your index with ts and ref values>'))),
  q.Lambda((ts, ref) => q.Get(ref))
)
Values are used for range queries and sorting, but they also define what the index returns.
Coming back to your questions:
- Can I query for all documents in a single call?
Absolutely, and I would advise you to do so. Note that the documents you get back are paginated automatically. You can set the page size by passing a parameter to Paginate, and the result will contain an 'after' or 'before' attribute when there are more pages. That cursor can be passed back to the Paginate function to fetch the next or previous page: https://docs.fauna.com/fauna/current/api/fql/functions/paginate
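For example (a minimal sketch; the 'skus' collection and the page size of 100 are taken from the question's context and are otherwise arbitrary):

let page = await client.query(
  q.Map(
    q.Paginate(q.Documents(q.Collection('skus')), { size: 100 }),
    q.Lambda(ref => q.Get(ref))
  )
)
// Keep feeding the returned 'after' cursor back in until it is absent.
while (page.after) {
  page = await client.query(
    q.Map(
      q.Paginate(q.Documents(q.Collection('skus')), { size: 100, after: page.after }),
      q.Lambda(ref => q.Get(ref))
    )
  )
}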
- Could I make such a query without an index?
No, but you can use the built-in index as explained above. FaunaDB protects users from querying without an index. Since it is a scalable database that could contain massive amounts of data and is pay-as-you-go, it's a good idea to prevent users from shooting themselves in the foot :). Pagination and mandatory indexes help with that.
As to why FQL is different: FQL is not declarative like many query languages. Instead it is procedural; you write exactly how you fetch data. That has advantages:
By writing how data is retrieved you can exactly predict how a query behaves which is nice-to-have in a pay-as-you-go system.
The same language can be used for security rules or complex conditional transactions (update certain entities or many entities ranging over different collections depending on certain conditions). It's quite common in Fauna to write a query that does many things in one transaction.
Our flavour of 'stored procedures', called User Defined Functions, is written in FQL rather than another language.
Querying is also discussed in this tutorial that comes with code in a GitHub repository which might give you a more complete picture: https://css-tricks.com/rethinking-twitter-as-a-serverless-app/
Can I query for all documents in a single call?
Yes, if your collection is small. The Paginate function defaults to fetching 64 documents per page. You can adjust the page size up to 100,000 documents. If your collection has more than 100,000 documents, you have to execute multiple queries, using cursors to fetch subsequent pages.
See the Pagination tutorial for details: https://docs.fauna.com/fauna/current/tutorials/indexes/pagination
If not, why not? What's the logic behind such a design?
For an SQL database, SELECT * FROM table is both convenient and, potentially, a resource nightmare. If the table contains billions of rows, attempting to serve results for that query could consume the available resources on the server and/or the client.
Fauna is a shared database resource. We want queries to perform well for any user with any database, and that requires that we put sensible limits on the number of documents involved in any single transaction.
Could I make such a query without an index?
No, and yes.
Retrieving multiple results from Fauna requires an index, unless you are independently tracking the references for documents. However, with the Documents function, Fauna maintains an internal index so you don't need to create your own index to access all documents in a collection.
See the Documents reference page for details: https://docs.fauna.com/fauna/current/api/fql/functions/documents
Returning to your example code, you are executing two queries, but they could easily be combined into one. FQL is highly composable. For example:
let allDocuments = await client.query(
  q.Map(
    q.Paginate(q.Documents(q.Collection("skus"))),
    q.Lambda("X", q.Get(q.Var("X")))
  )
)
Your observation that FQL is wordy is correct. Many functional languages exhibit that wordiness. The advantage is that any function that accepts expressions can be composed at will. One of the best examples of composability, and of how to manage inter-document references, is presented in our e-commerce tutorial, specifically the section describing the submit_order function: https://docs.fauna.com/fauna/current/tutorials/ecommerce#function

MongoDB: impact of collection data structure on performance

When defining a collection's data structure, how do you judge which design is the better decision? The choice affects the performance of all subsequent database access.
For example, suppose a single record looks like this:
{
  _id: 'a',
  index: 1, // index 1~n
  name: 'john'
}
When n is large, the data volume becomes large and writes are frequent.
The collection could store the data as one flat document per record:
{
  _id: 'a',
  index: 1,
  name: 'john'
}
...
{
  _id: 'z',
  index: 99,
  name: 'jule'
}
Or a single composite document embedding the records in an array:
{
  _id: 'a',
  info: [
    { index: 1, name: 'john' }, ..., { index: 99, name: 'jule' }
  ]
}
The embedded-array design effectively reduces the number of documents, but queries against it are less convenient to write, and I am unsure whether it actually helps or hurts the performance of searches and writes. Or is the sheer number of documents the key factor in database performance?
"Better" means different things to different use cases. What works in your case might not necessarily work in other use cases.
Generally, it is better to avoid large arrays, due to:
MongoDB's document size limitation (16MB).
Indexing a large array is typically not very performant.
However, this is just a general observation and not a hard rule. If your data lends itself to an array-based representation and you're certain you'll never hit the 16 MB document size limit, then that design may be the way to go (again, specific to your use case).
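To illustrate the query-convenience point, here is roughly what a lookup of index 99 looks like under each design (a hypothetical sketch with the Node.js driver; collection names are made up):

// Flat design: one document per record, a plain indexed find.
await db.collection('flat').find({ index: 99 }).toArray()

// Embedded design: match the array element, then project just that element.
await db.collection('nested').aggregate([
  { $match: { 'info.index': 99 } },
  { $project: {
      entry: {
        $filter: {
          input: '$info',
          as: 'e',
          cond: { $eq: ['$$e.index', 99] }
        }
      }
    }
  }
]).toArray()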
You may find these links useful to get started in schema design:
6 Rules of Thumb for MongoDB Schema Design: Part 1
Part 2
Part 3
Data Models
Use Cases
Query Optimization
Explain Results

CouchDB query for more dynamic values

I have a number of location documents in my CouchDB, each with longitude and latitude fields. How can I find all location documents whose distance to a provided latitude and longitude is less than a provided distance?
There is a way to achieve this using vanilla CouchDB, but it's a bit tricky.
You can exploit the fact that you can apply two map functions during one request. The second map function can be created using the list mechanism.
Lists are not very efficient computationally, and they can't cache results the way views do. But they have one unique feature: you can pass several arguments into a list. Moreover, one of those arguments can even be JS code that is eval-ed inside the list function (risky!).
So the entire scheme looks like this:
Make a view that performs a coarse search.
Make a list that receives custom params and refines the data set.
Make a client-side API to ease up querying this chain.
I can't provide exact code for your particular case, since many details are unclear, but it seems the coarse search would have to group results into somehow linearly enumerated squares, with the list performing the more precise calculations.
Please note that this scheme may be inefficient for large datasets, since it's computationally hungry.
Vanilla CouchDB isn't really built for geospatial queries.
Your best bet is to either use GeoCouch, CouchDB-Lucene or something similar.
Failing that, you could emit a Geohash from your map function, and do range queries over those.
Caveats apply: queries around Geohash "fault lines" (the equator, the poles, longitude 180, etc.) can return too many or too few results.
There are multiple JavaScript libraries that can help convert to/from Geohash, as well as help with some of those caveats.
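As a minimal sketch, the map function could look like this (assuming a geohashEncode helper, e.g. inlined or required from a CommonJS module stored in the design document's views.lib):

function (doc) {
  if (doc.latitude != null && doc.longitude != null) {
    // Emit the Geohash as the key; a prefix range query then selects an area.
    emit(geohashEncode(doc.latitude, doc.longitude), null)
  }
}

Querying with startkey="dr5r"&endkey="dr5r\ufff0" would then return every document whose hash starts with dr5r, i.e. everything inside that cell.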
CouchDB is not built for dynamic queries, so there is no good/fast way of implementing this in vanilla CouchDB.
If you know beforehand which locations you want to calculate the distance from, you could create a view for each location and call it with ?startkey=0&endkey=max_distance:
function (doc) {
  function distance(...) { /* your function for calculating distance */ }
  var NY = { lat: 40, lon: -73 }
  emit(distance(NY, doc), doc._id)
}
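Calling it then looks like any other view query, filtering on the emitted distance key (the database, design document and view names here are made up):

// Hypothetical request: all documents within 5000 distance units of NY.
const res = await fetch(
  'http://localhost:5984/mydb/_design/geo/_view/distance_from_ny?startkey=0&endkey=5000'
)
const { rows } = await res.json()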
If you do not know the locations beforehand, you could solve it with a temporary view, but I would strongly advise against that, since temporary views are slow and should only be used for testing.

Transform MongoDB Data on Find

Is it possible to transform the returned data from a Find query in MongoDB?
As an example, I have first and last fields that store a user's first and last name. In certain queries, I wish to return the first name and last initial only (e.g. 'Joe Smith' returned as 'Joe S'). In MySQL, a SUBSTRING() function could be used on the field in the SELECT statement.
Are there data transformations or string functions in Mongo like there are in SQL? If so, can you please provide an example of their usage? If not, is there a recommended way of transforming the data aside from looping through the returned objects?
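(An aside for readers on newer MongoDB versions: the aggregation framework can now express exactly this kind of projection server-side. A minimal sketch, assuming a users collection with first and last fields:)

db.users.aggregate([
  { $project: {
      // 'Joe Smith' -> 'Joe S': concatenate first name with the last initial.
      display: { $concat: ['$first', ' ', { $substrCP: ['$last', 0, 1] }] }
    }
  }
])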
It is possible to do just about anything server-side with MongoDB. The reason you will usually hear "no" is that you sacrifice too much speed for it to make sense under ordinary circumstances. One of the main forces behind PyMongo, Mike Dirolf of 10gen, has a good blog post on using server-side JavaScript with PyMongo here: http://dirolf.com/2010/04/05/stored-javascript-in-mongodb-and-pymongo.html. His example stores a JavaScript function that returns the sum of two fields, but you could easily modify it to return the first letter of your user name field. The gist would be something like:
db.system_js.first_letter = "function (x) { return x.charAt(0); }"
Understand first, though, that MongoDB is made to be really good at retrieving your data, not at processing it. The recommendation (see, for example, 50 Tips and Tricks for MongoDB Developers by Kristina Chodorow, O'Reilly) is to do what Andrew tersely alluded to above: store a first-letter field and return that instead. Any processing can be done more efficiently in the application.
But if you feel that even querying for the full name before returning fullname[0] from your 'view' is too much of a performance hit, consider that you don't need to do everything the fastest possible way. I avoided map-reduce in MongoDB for a while because of all the public concerns about its speed. Then I ran my first map-reduce and twiddled my thumbs for 0.1 seconds as it processed 80,000 ten-kilobyte documents. I realize that in the scheme of things that's tiny, but it illustrates that just because some server-side processing would be a performance problem for a massive website doesn't mean it would matter to you. In my case, I imagine it would take me longer to migrate to Hadoop than to just eat that 0.1 seconds every now and then. Good luck with your site.
The question you should ask yourself is why you need that data. If you need it for display purposes, do the transformation in your view code. If you need it for query purposes, then do as Andrew suggested and store it as an extra field on the object. Mongo doesn't provide server-side transformations (usually, and where it does, you usually don't want to use them); the answer is usually not to treat your data as you would in a relational DB, but to use the more flexible nature of the data store to pre-bake your data into the formats you're going to be using.
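For instance, the pre-baked field might be computed once at write time (a hypothetical sketch; the collection and field names are made up):

await db.collection('users').insertOne({
  first: 'Joe',
  last: 'Smith',
  display: 'Joe S' // computed once on write, returned as-is on every read
})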
If you can provide more information on how this data should be used, then we might be able to answer a little more usefully.
