Node.js Redis expiry vs fresh content - node.js

I have a classified advertisements website ala craigslist.org - just on a much smaller scale.
I'm running MongoDB and is caching all API requests in Redis where the Mongo query is the key and the value is the MongoDB result document.
Pseudo code:
// The mongo query
var query = {section: 'home', category: 'garden', 'region': 'APAC', country: 'au', city: 'sydney', limit: 100}
// Getting the mongo result..
// Storing in Redis
redisClient.set(JSON.stringify(query), result);
Now a user creates a new post in the same category, but Redis now serves up a stale record because Redis has no idea that the dataset has changed.
How can we overcome this?
I could set an expiry in general or on that particular key, but essentially mem cached keys need to expire in the moment a user creates a new post for those keys where the result set would include the newly created record.
One way is to iterate though all the Redis keys and come up with a pattern to detect which keys should be deleted based on the characteristics of the newly created record. Only this approach seem "too clever" and not quite right.
So we want to mem cache at the same time as serving up fresh content instantly.

I would avoid caching bulk query results as a single key. Redis is for use cases where you need to access and update data at very high frequency and where you benefit from use of data structures such as hashes, sets, lists, strings, or sorted sets [1]. Also, keep in mind MongoDB will already have part of the database cached in memory so you might not see much in the way of performance gains.
A better approach would be to cache each post individually. You can add keys to sets to group them into categories or even just pages (like the 20 or so posts that the user expects to see on each page). This way, every time a user makes a new post or updates an existing one, you can update the corresponding key in your Redis cache as well.

Related

Caching relational data using redis

I'm building a small social network (users have posts and posts have comments - very basic), using clustered nodejs server and redis as a distributed cache.
My approach to cache users posts is to have a sorted set that contains all the user's posts ids ordered by rate(which should be updated every time someone add a like or comment), and actual objects sorted as hash objects.
So the get user's posts flow should look like this:
1. using zrange to get a range of ids from the sorted set.
2. using multi/exec and hgetall to fetch all the objects at once.
I have a couple of questions:
1. in regards of performance issues, will my approach scale when the cache size getting bigger, or maybe I should use lua or something?
1. in case if I want to continue with current approach, where I should save the sorted set in case of redis crash, if I use the redis persistence this will affect the overall performance, I thought about using a dedicated redis server for the sets (I searched If it is possible to backup only part of the redis data but didn't found anything about it.
My approach => getTopObjects({userID}, 0, 20) :
self.zrange = function(setID, start, stop, multi)
{
return execute(this, "zrange", [setID, start, stop], multi);
};
self.getObject = function(key, multi)
{
return execute(this, "hgetall", key, multi);
};
self.getObjects = function(keys)
{
let multi = thisArg.client.multi();
let promiseArray = [];
for (var i = 0, len = keys.length; i < len; i++)
{
promiseArray.push(this.getObject(keys[i], multi));
}
return execute(this, "exec", [], multi).then(function(results)
{
//TODO: do something with the result.
return Promise.all(promiseArray);
});
};
self.getTopObjects = function(setID, start, stop)
{
//TODO: validate the range
let thisArg = this;
return this.zrevrange(setID, start, stop).then(function(keys)
{
return thisArg.getObjects(keys);
});
};
It's an interesting intellectual exercise, but in my opinion this is classic premature optimization.
1) It's probably way too early to have even introduced redis, let alone be thinking about whether redis is fast enough. Your social network is almost certainly just fine up to about 1,000 users running off raw SQL queries against Mysql / Postgres / Random RDS. If it starts to slow down, get data on slow running queries and fix them with query optimizations and appropriate indexes. That'll get you past 10,000 users.
2) Now you can start introducing redis. In general, I'd encourage you to think about your redis as purely caching and not permanent storage; it shouldn't matter if it gets blown away, it just means your site is slower for the next few seconds because your users are getting their page loads from SQL queries instead of redis hits (each query re-populating that user's sorted list of posts in redis, of course).
Your strategy and example code for using redis seem fine to me, but until you have actual data on how users use your site (which may be drastically different than your current expectations), it's simply impossible to know what types of SQL indexes you will need, what keys and lists are ideal for caching in redis, etc.
I faced similar issues, I needed a way to query the data more efficiently. Can't say for sure but I heard Redis being single threaded blocks the main thread when running lua scripts, i'm sure that's not good for a social networking site. I heard about Tarantool and it looks promising, currently trying to wrap my head around it.
If you are concerned about your cache size growing bigger, I think most social networks keep two weeks worth of data in the users cache, anything older than two weeks gets deleted and you simply implement a scrolling feature that works with pagination, once the user scrolls down, fetch the next two weeks worth of data and add it back to memory only for that specific user (don't forget to specify the new ttl for the newly added data). This helps keep your cache size lean.
What happens when redis or any in memory data tool you are using crashes, you simply reload data back into the memory. They all have features where you save data to files as backup. I'm thinking of implementing another database layer don't know lets say Cassandra or Mongodb that holds the timelines of each user since inception. Sure this creates another overhead cause you have to keep three data layers (e.g mysql, redis and mongodb) in sync!
If this looks like a lot of work, feel free to use a 3rd party service to host your in memory data, at least you can sleep easy, but it's gonna cost you.
That said, this is highly opinionated. Got tired of people telling me to wait until my site explodes with users or the so called premature optimization reply you got :)

Store and retrieve data from coordinates in Redis

I'm using Redis as a caching system to store data to avoid unnecessary consume on an API. What i was thinking is after get the result from the API, store the coordinate (as a key, for example) and the data on Redis and in a later search before consume the external api again, passing the a new coordinate check in Redis if matches with the saved coordinate (in meters, or anything) and if it matches, bring me the data stored.
I already searched for at least 1 hour and could not find any relevant results that fits my needs. For example, GEOADD could not help me because it doesn't expire automatically by Redis.
The only solution would be storing the coordinates as key (example: -51.356156,-50.356945) with a json value, and check with a functional programming all the keys (coordinates) if it's matches with an other coordinate. But it seens not elegant, also with bad performance.
Any ideas?
I'm using Redis in NodeJS (Express).
If I understand correctly, you want to able to:
Cache the API's response for a given tuple of coordinates
Be able to perform an efficient radius search over the cached responses
Use Redis' expiration to invalidate old cache entries
To satisfy #1, you've already outlined the right approach - store each API call under its own key. You can name the key by your coordinates, or use their geohash value (computing it can be done in the client or with a temporary element in Redis). Also don't forget setting a TTL on that key and the global maxmemory eviction policy for eviction to actually work.
The 2nd requirement calls for using a Geo Set. Store the coordinates and the key names in it. Perform your query by calling GEORADIUS and then fetch the relevant keys according to the reply.
While fetching the keys from #2's query, you may find some of them have been evicted from the keyspace but not in the Geo Set. Call ZREM for each of these to keep a semblance of sync between your index (the Geo Set) and the keyspace. Additionally, you can also run a periodic background task that ZSCAN's the Geo Set and does housekeeping. That should take care of #3.

Cloud Functions Http Request return cached Firebase database

I'm new in Node.js and Cloud Functions for Firebase, I'll try to be specific for my question.
I have a firebase-database with objects including a "score" field. I want the data to be retrieved based on that, and that can be done easily in client side.
The issue is that, if the database gets to grow big, I'm worried that either it will take too long to return and/or will consume a lot of resources. That's why I was thinking of a http service using Cloud Functions to store a cache with the top N objects that will be updating itself when the score of any objects change with a listener.
Then, client side just has to call something like https://myexampleprojectroute/givemethetoplevels to receive a Json with the top N levels.
Is it reasonable? If so, how can I approach that? Which structures do I need to use this cache, and how to return them in json format via http?
At the moment I'll keep doing it client side but I'd really like to have that both for performance and learning purpose.
Thanks in advance.
EDIT:
In the end I did not implement the optimization. The reason why is, first, that the firebase database does not contain a "child count" so I didn't find a way with my newbie javascript knowledge to implement that. Second, and most important, is that I'm pretty sure it won't scale up to millions, having at most 10K entries, and firebase has rules for sorted reading optimization. For more information please check out this link.
Also, I'll post a simple code snippet to retrieve data from your database via http request using cloud-functions in case someone is looking for it. Hope this helps!
// Simple Test function to retrieve a json object from the DB
// Warning: No security methods are being used such authentication, request methods, etc
exports.request_all_levels = functions.https.onRequest((req, res) => {
const ref = admin.database().ref('CustomLevels');
ref.once('value').then(function(snapshot) {
res.status(200).send(JSON.stringify(snapshot.val()));
});
});
You're duplicating data upon writes, to gain better read performance. That's a completely reasonable approach. In fact, it is so common in NoSQL databases to keep such derived data structures that it even has a name: denormalization.
A few things to keep in mind:
While Cloud Functions run in a more predictable environment than the average client, the resources are still limited. So reading a huge list of items to determine the latest 10 items, is still a suboptimal approach. For simple operations, you'll want to keep the derived data structure up to date for every write operation.
So if you have a "latest 10" and a new item comes in, you remove the oldest item and add the new one. With this approach you have at most 11 items to consider, compared to having your Cloud Function query the list of items for the latest 10 upon every write, which is a O(something-with-n) operation.
Same for an averaging operation: you'll find a moving average to be most performant, because it doesn't require any of the previous data.

How to account for a failed write or add process in Mongodb

So I've been trying to wrap my head around this one for weeks, but I just can't seem to figure it out. So MongoDB isn't equipped to deal with rollbacks as we typically understand them (i.e. when a client adds information to the database, like a username for example, but quits in the middle of the registration process. Now the DB is left with some "hanging" information that isn't assocaited with anything. How can MongoDb handle that? Or if no one can answer that question, maybe they can point me to a source/example that can? Thanks.
MongoDB does not support transactions, you can't perform atomic multistatement transactions to ensure consistency. You can only perform an atomic operation on a single collection at a time. When dealing with NoSQL databases you need to validate your data as much as you can, they seldom complain about something. There are some workarounds or patterns to achieve SQL like transactions. For example, in your case, you can store user's information in a temporary collection, check data validity, and store it to user's collection afterwards.
This should be straight forwards, but things get more complicated when we deal with multiple documents. In this case, you need create a designated collection for transactions. For instance,
transaction collection
{
id: ..,
state : "new_transaction",
value1 : values From document_1 before updating document_1,
value2 : values From document_2 before updating document_2
}
// update document 1
// update document 2
Ooohh!! something went wrong while updating document 1 or 2? No worries, we can still restore the old values from the transaction collection.
This pattern is known as compensation to mimic the transactional behavior of SQL.

Potential issue with Couchbase paging

It may be too much turkey over the holidays, but I've been thinking about a potential problem that we could have with Couchbase.
Currently we paginate based on time, but I'm thinking a similar issue could occur with other values used for paging for example the atomic counter. I'll try to explain best I can, this would only occur in a load balanced environment.
For example say we have 4 servers load balanced and storing data to our Couchbase cluster. We sort our records based on timestamps currently. If any of the 4 servers writing the data starts to lag behind the others than our pagination would possibly be missing records when retrieving client side. A SQL DB auto-increment and timestamps for example can be created when the record is stored to the DB which will avoid similar issues. Using a NoSql DB like Couchbase you define the data you need to retrieve on before it is stored to the DB. So what I am getting at is if there is a delay in storing to the DB and you are retrieving in a pagination fashion while this delay has occurred, you run the real possibility of missing data. Since we are paging that data may never be viewed.
Interested in what other thoughts people have on this.
EDIT**
Response to Andrew:
Example a facebook or pintrest type app is storing data to a DB, they have many load balanced servers from the frontend writing to the db. If for some reason writing is delayed its a non issue with a SQL DB because a timestamp or auto increment happens when the data is actually stored to the DB. There will be no missing data when paging. asking for 1-7 will give you data that is only stored in the DB, 7-* will contain anything that is delayed because an auto-increment value has not been created for that record becuase it is not actually stored.
In Couchbase its different, you actually get your auto increment value (atomic counter) and then save it. So for example say a record is going to be stored as atomic counter number 4. For some reasons this is delayed in storing to the DB. Other servers are grabbing 5, 6, 7 and storing that data just fine. The client now asks for all data between 1 and 7, 4 is still not stored. Then the next paging request is 7 to *. 4 will never be viewed.
Is there a way around this? Can it be modelled differently in CB, or is this just a potential weakness in CB when needing to page results. As I mentioned are paging is timestamp sensitive.
Michael,
Couchbase is an eventually consistent database with respect to views. It is ACID with respect to documents. There are durability interfaces that let you manage this. This means that you can rest assured you won't lose data and that indexes will catch up eventually.
In my experience with Couchbase, you need to expect that the nodes will never be in-sync. There are many things the database is doing, such as compaction and replication. The most important thing you can do to enhance performance is to put your views on a separate spindle from the data. And you need to ensure that your main data spindles across your cluster can sustain between 3-4 times your ingestion bandwidth. Also, make sure your main document key hashes appropriately to distribute the load.
It sounds like you are discussing a situation where the data exists in your system for less time than it takes to be processed through the view system. If you are removing data that fast, you need either a bigger cluster or faster disk arrays. Of the two choices, I would expand the size of your cluster. I like to think of Couchbase as building a RAIS, Redundant Array of Independent Servers. By expanding the cluster, you reduce the coincidence of hotspots and gain disk bandwidth. My ideal node has two local drives, one each for data and views, and enough RAM for my working set.
Anon,
Andrew

Resources