I know this question is as old as time -- and their is no silver bullet. But I think there might be a solid pattern out there, and I would like not to invent the wheel.
Consider the following two schema options:
Approach 1) My original implementation
type Query {
note(id: ID!): Note
notes(input: NotesQueryInput): [Note!]!
}
Approach 2) My current experimental approach
type DatedId {
date: DateTime!
id: ID!
}
type Query {
note(id: ID!): Note
notes(input: NotesQueryInput): [DatedId!]!
}
The differences are:
with approach 1) the notes query will either return a list of potentially large Note objects
with approach 2) the notes query will return a much lighter payload BUT then will need to execute n additional queries
So my question is with the Apollo Client / Server stack with in-memory-cache which is the best approach. to achieve a responsive client with a scalable server.
Notes
With approach 1 -- my 500mb dyno (heroku server) ran out of memory.
I expect with either approach I will implement pagination with the connection / edge pattern
the graphql server is primarly to serve my own frontend.
If you're running out memory on the server, it may be time to upgrade. If you're running out of memory now, imagine what will happen when you have multiple users hitting your endpoint.
The only other way to get around that specific problem is to break up your query into several smaller queries. However, your proposed approach suffers from a couple of problems:
You will end up hammering your server and your database with significantly more requests
Your UI may take longer to load, depending on whether the requested data needs to be rendered immediately
Handling the scenario when one of your requests fails, or attempting to retry a failed request, may be challenging
You already suggested added pagination, and I think it would be a much better way to break up your single large query into smaller ones. Not only does pagination lend itself to a better user experience, but by enforcing a limit on the size of the page, you can effectively enforce a limit on the size of a given query.
One other option you may consider exploring is using deferred queries. This experimental feature was added specifically with expensive queries in mind. By making one or more fields on your Note type deferred, you would effectively return null for them initially, and their values would be sent down in a second "patch" response after they finally resolve. This works great for fields that are expensive to resolve, but may also help with fields that return a large amount of data.
Related
I have more questions based on this question and answer, which is now quite old but still seems to be accurate.
The suggestion of storing results in memory seems problematic in many scenarios.
A web farm where the end-user isn't locked to a specific server.
Very large result sets.
Smaller result sets with many different users and queries.
I see a few ways of handling paging based on what I've read so far.
Use OFFSET and LIMIT at potentially high RU costs.
Use continuation tokens and caches with the scaling concerns.
Save the continuation tokens themselves to go back to previous pages.
This can get complex since there may not be a one to one relationship between tokens and pages.
See Understanding Query Executions
In addition, there are other reasons that the query engine might need to split query results into multiple pages. These include:
The container was throttled and there weren't available RUs to return more query results
The query execution's response was too large
The query execution's time was too long
It was more efficient for the query engine to return results in additional executions
Are there any other, maybe newer options for paging?
I understand that there are two ways of iterating over a large result set in Cassandra:
Querying explicitly with tokens, as discussed in this article on "Displaying rows from an unordered partitioner with the TOKEN function". This appears to have been the only way of doing things prior to Cassandra 2.0.
Using "paging state".
Paging state appears to be the suggested way of doing things these days, but doing it the old token way still works.
Aside from it being the blessed way of doing things, which is of course a type of advantage, I'd love to understand what are the particular advantages of using the "new" method over the "old"? Is there a reason I should not use token in this way?
The use of paging or tokens is really depending on your requirements, and technical abilities. From my point of view, use of paging is good for fetching of data from big partition, or when you have not so much data in table, so you can use select * from table.
But if you have multiple servers in cluster, and big amounts of data, use of token will allow you to read data from specific servers (if you set routing key correctly), and in parallel (Spark Cassandra Connector uses token exactly for this reason) - this is big advantage over use of paging where you're using one coordinator node that needs to go to other nodes for data that it doesn't have. But for some people, it's not really easy to implement, because you need to cover edge cases, like, when token range doesn't start exactly at minimum value. I have example in Java how to do it if you need.
I agree with Alex this answer, I will add that when you do that the old school(with tokens), you have the hand on your tokens, this mean that if you are dealing with a big amount of data, that you can save your check points for example, that you can handle well restart after failure, or just pausing your job for example, or also to launch multi-threading jobs and separate teach worker data, the way spark workers deal with data for example are token based too.
The driver handle the paging automatically for you so you don't have to fetch pages with all the advantage of a native thing handled for you, but the use of token give you full hands on the way you are paginating with all the advantage that you can get from it(attacking a specific range, specific server)
I hope this helps !
I'm building a small social network (users have posts and posts have comments - very basic), using clustered nodejs server and redis as a distributed cache.
My approach to cache users posts is to have a sorted set that contains all the user's posts ids ordered by rate(which should be updated every time someone add a like or comment), and actual objects sorted as hash objects.
So the get user's posts flow should look like this:
1. using zrange to get a range of ids from the sorted set.
2. using multi/exec and hgetall to fetch all the objects at once.
I have a couple of questions:
1. in regards of performance issues, will my approach scale when the cache size getting bigger, or maybe I should use lua or something?
1. in case if I want to continue with current approach, where I should save the sorted set in case of redis crash, if I use the redis persistence this will affect the overall performance, I thought about using a dedicated redis server for the sets (I searched If it is possible to backup only part of the redis data but didn't found anything about it.
My approach => getTopObjects({userID}, 0, 20) :
self.zrange = function(setID, start, stop, multi)
{
return execute(this, "zrange", [setID, start, stop], multi);
};
self.getObject = function(key, multi)
{
return execute(this, "hgetall", key, multi);
};
self.getObjects = function(keys)
{
let multi = thisArg.client.multi();
let promiseArray = [];
for (var i = 0, len = keys.length; i < len; i++)
{
promiseArray.push(this.getObject(keys[i], multi));
}
return execute(this, "exec", [], multi).then(function(results)
{
//TODO: do something with the result.
return Promise.all(promiseArray);
});
};
self.getTopObjects = function(setID, start, stop)
{
//TODO: validate the range
let thisArg = this;
return this.zrevrange(setID, start, stop).then(function(keys)
{
return thisArg.getObjects(keys);
});
};
It's an interesting intellectual exercise, but in my opinion this is classic premature optimization.
1) It's probably way too early to have even introduced redis, let alone be thinking about whether redis is fast enough. Your social network is almost certainly just fine up to about 1,000 users running off raw SQL queries against Mysql / Postgres / Random RDS. If it starts to slow down, get data on slow running queries and fix them with query optimizations and appropriate indexes. That'll get you past 10,000 users.
2) Now you can start introducing redis. In general, I'd encourage you to think about your redis as purely caching and not permanent storage; it shouldn't matter if it gets blown away, it just means your site is slower for the next few seconds because your users are getting their page loads from SQL queries instead of redis hits (each query re-populating that user's sorted list of posts in redis, of course).
Your strategy and example code for using redis seem fine to me, but until you have actual data on how users use your site (which may be drastically different than your current expectations), it's simply impossible to know what types of SQL indexes you will need, what keys and lists are ideal for caching in redis, etc.
I faced similar issues, I needed a way to query the data more efficiently. Can't say for sure but I heard Redis being single threaded blocks the main thread when running lua scripts, i'm sure that's not good for a social networking site. I heard about Tarantool and it looks promising, currently trying to wrap my head around it.
If you are concerned about your cache size growing bigger, I think most social networks keep two weeks worth of data in the users cache, anything older than two weeks gets deleted and you simply implement a scrolling feature that works with pagination, once the user scrolls down, fetch the next two weeks worth of data and add it back to memory only for that specific user (don't forget to specify the new ttl for the newly added data). This helps keep your cache size lean.
What happens when redis or any in memory data tool you are using crashes, you simply reload data back into the memory. They all have features where you save data to files as backup. I'm thinking of implementing another database layer don't know lets say Cassandra or Mongodb that holds the timelines of each user since inception. Sure this creates another overhead cause you have to keep three data layers (e.g mysql, redis and mongodb) in sync!
If this looks like a lot of work, feel free to use a 3rd party service to host your in memory data, at least you can sleep easy, but it's gonna cost you.
That said, this is highly opinionated. Got tired of people telling me to wait until my site explodes with users or the so called premature optimization reply you got :)
Ok, so this is basically a yes/no answer. I have a node application running on Heroku with a postgres plan. I use Objection.js as ORM. I'm facing 300+ms response times on most of the endpoints, and I have a theory about why this is. I would like to have this confirmed.
Most of my api endpoints do around 5-10 eager loads. Objection.js handles eager loading by doing additional WHERE IN queries and not by doing one big query with a lot of JOINS. The reason for this is that it was easier to build that way and that it shouldn't harm performance too much.
But that made me thinking: Heroku Postgres doesn't run on the same heroku dyno as the node application I assume, so that means there is latency for every query. Could it be that all these latencies add up, causing a total of 300ms delay?
Summarizing: would it be speedier to build your own queries with Knex instead of generating them through Objection.js, in case you have a separately hosted database?
Speculating why this kind of slowdowns happen is mostly useless (I'll speculate a bit in the end anyways ;) ). First thing to do in this kind of situation would be to measure where that 300ms are used.
From database you should be able to see query times to spot if there are any slow queries causing the problems.
Also knex outputs some information about performance to console when you run it with DEBUG=knex:* environment variable set.
Nowadays also node has builtin profiling support which you can enable by setting --inspect flag when starting node. Then you will be able to connect your node process with chrome dev tools and see where node is using its time. From that profile you will be able to see for example if database query result parsing is dominating execution time.
The best way to figure out the slowness is to isolate the slow part of the app and inspect that carefully or even post that example to stackoverflow that other people can tell you why it might be slow. General explanation like this doesn't give much tool for other to help to resolve the real issue.
Objection.js handles eager loading by doing additional WHERE IN queries and not by doing one big query with a lot of JOINS. The reason for this is that it was easier to build that way and that it shouldn't harm performance too much.
With objection you can select which eager algorithm you like to use. In most of the cases (when there are one-to-many or many-to-many relations) making multiple queries is actually more performant compared to using join, because with join amount of data explodes and transfer time + result parsing on node side will take too much time.
would it be speedier to build your own queries with Knex instead of generating them through Objection.js, in case you have a separately hosted database?
Usually no.
Speculation part:
You mentioned that Most of my api endpoints do around 5-10 eager loads.. Most of the times when I have encountered this kind of slowness in queries the reason has been that app is querying too big chunks of data from database. When query returns for example tens of thousands rows, it will be some megabytes of JSON data. Only parsing that amount of data from database to JavaScript objects takes hundreds of milliseconds. If your queries are causing also high CPU load during that 300ms then this might be your problem.
Slow queries. Somtimes database doesn't have indexes set correctly so queries just will have to scan all the tables linearly to get the results. Checking slow queries from DB logs will help to find these. Also if getting response is taking long, but CPU load is low on node process, this might be the case.
Hereby I can confirm that every query takes approx. 30ms, regardless the complexity of the query. Objection.js indeed does around 10 separate queries because of the eager loading, explaining the cumulative 300ms.
Just FYI; I'm going down this road now ⬇
I've started to delve into writing my own more advanced SQL queries. It seems like you can do pretty advanced stuff, achieving similar results to the eager loading of Objection.js
select
"product".*,
json_agg(distinct brand) as brand,
case when count(shop) = 0 then '[]' else json_agg(distinct shop) end as shops,
case when count(category) = 0 then '[]' else json_agg(distinct category) end as categories,
case when count(barcode) = 0 then '[]' else json_agg(distinct barcode.code) end as barcodes
from "product"
inner join "brand" on "product"."brand_id" = "brand"."id"
left join "product_shop" on "product"."id" = "product_shop"."product_id"
left join "shop" on "product_shop"."shop_code" = "shop"."code"
left join "product_category" on "product"."id" = "product_category"."product_id"
left join "category" on "product_category"."category_id" = "category"."id"
left join "barcode" on "product"."id" = "barcode"."product_id"
group by "product"."id"
This takes 19ms on 1000 products, but usually the limit is 25 products, so very performant.
As mentioned by other people, objection uses multiple queries instead of joins by default to perform eager loading. It's a safer default than completely join based loading which can become really slow in some cases. You can read more about the default eager algorithm here.
You can choose to use the join based algoritm simply by calling the joinEager instead method of eager. joinEager executes one single query.
Objection also used to have a (pretty stupid) default of 1 parallel queries per operation, which meant that all queries inside an eager call were executed sequentially. Now that default has been removed and one should get a better performance even in cases like yours.
Your trick of using json_agg is pretty clever and actually avoids the slowness problems I referred to that can arise in some cases when using joinEager. However, that cannot be easily used for nested loading or with other database engines.
I'm new in Node.js and Cloud Functions for Firebase, I'll try to be specific for my question.
I have a firebase-database with objects including a "score" field. I want the data to be retrieved based on that, and that can be done easily in client side.
The issue is that, if the database gets to grow big, I'm worried that either it will take too long to return and/or will consume a lot of resources. That's why I was thinking of a http service using Cloud Functions to store a cache with the top N objects that will be updating itself when the score of any objects change with a listener.
Then, client side just has to call something like https://myexampleprojectroute/givemethetoplevels to receive a Json with the top N levels.
Is it reasonable? If so, how can I approach that? Which structures do I need to use this cache, and how to return them in json format via http?
At the moment I'll keep doing it client side but I'd really like to have that both for performance and learning purpose.
Thanks in advance.
EDIT:
In the end I did not implement the optimization. The reason why is, first, that the firebase database does not contain a "child count" so I didn't find a way with my newbie javascript knowledge to implement that. Second, and most important, is that I'm pretty sure it won't scale up to millions, having at most 10K entries, and firebase has rules for sorted reading optimization. For more information please check out this link.
Also, I'll post a simple code snippet to retrieve data from your database via http request using cloud-functions in case someone is looking for it. Hope this helps!
// Simple Test function to retrieve a json object from the DB
// Warning: No security methods are being used such authentication, request methods, etc
exports.request_all_levels = functions.https.onRequest((req, res) => {
const ref = admin.database().ref('CustomLevels');
ref.once('value').then(function(snapshot) {
res.status(200).send(JSON.stringify(snapshot.val()));
});
});
You're duplicating data upon writes, to gain better read performance. That's a completely reasonable approach. In fact, it is so common in NoSQL databases to keep such derived data structures that it even has a name: denormalization.
A few things to keep in mind:
While Cloud Functions run in a more predictable environment than the average client, the resources are still limited. So reading a huge list of items to determine the latest 10 items, is still a suboptimal approach. For simple operations, you'll want to keep the derived data structure up to date for every write operation.
So if you have a "latest 10" and a new item comes in, you remove the oldest item and add the new one. With this approach you have at most 11 items to consider, compared to having your Cloud Function query the list of items for the latest 10 upon every write, which is a O(something-with-n) operation.
Same for an averaging operation: you'll find a moving average to be most performant, because it doesn't require any of the previous data.