I would appreciate some advice from the gurus here about Redis. We all learned that Redis is used to cache static data, whether from an API or from a SQL or NoSQL database, i.e. data that doesn't change. Here comes my question: in a situation where you are retrieving the same set of data from the same table or from a view (a combination of joined tables already put into one view) with parameters, is such a query a good candidate for caching?
To simplify my question: suppose my query has a WHERE clause that I pass parameters to, e.g. between two dates, retrieving data by item id with dates, or retrieving records based on some criteria, but always from the same table or the same view (meaning the records returned will definitely change because of the different parameters passed). Is this kind of data good to cache using Redis?
I only want to know whether such data is also suitable for Redis caching, or whether only static data or API responses are.
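To make the scenario concrete, here is a minimal sketch of one common way to cache parameterised query results: derive the cache key from the query name plus its parameters, so each distinct parameter set gets its own entry with a time-to-live. The names cacheKey, getOrLoad, sales_by_item and loadSales are hypothetical, and a plain Map stands in for Redis so the sketch runs without a server. With a real Redis client the get and set calls would be asynchronous.

```javascript
// Hypothetical helper: builds a deterministic cache key from a query name and its parameters.
function cacheKey(queryName, params) {
  // Sort parameter names so {a, b} and {b, a} produce the same key.
  const parts = Object.keys(params)
    .sort()
    .map((name) => `${name}=${JSON.stringify(params[name])}`);
  return `${queryName}:${parts.join('&')}`;
}

// In-memory stand-in for Redis (assumption: a real client would replace this Map).
const store = new Map();

// Cache-aside: return cached rows while they are fresh, otherwise load and cache them.
function getOrLoad(key, ttlMs, loadRows, now = Date.now()) {
  const hit = store.get(key);
  if (hit && hit.expiresAt > now) return hit.rows;
  const rows = loadRows();
  store.set(key, { rows, expiresAt: now + ttlMs });
  return rows;
}

// Hypothetical loader standing in for the parameterised SQL query.
let dbCalls = 0;
const loadSales = () => {
  dbCalls += 1;
  return [{ itemId: 42, total: 10 }];
};

const key = cacheKey('sales_by_item', { to: '2024-01-31', itemId: 42, from: '2024-01-01' });
getOrLoad(key, 60000, loadSales); // first call runs the loader
getOrLoad(key, 60000, loadSales); // second call is served from the cache
```

Whether this pays off depends on how often the same parameter combinations repeat and how stale a result may be, which the time-to-live controls.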
I have a backend API built with Express. I've implemented logging with winston and morgan.
My next requirement is to record a user's activity (the timestamp, the user, and the content they fetched or changed) in a MySQL database. I've searched the web and found this. But since there is no answer yet, I've come here.
My Thought:
I can add another query that INSERTs all the information mentioned above in my route handlers, right before I respond to the client. But I'm curious whether there is a more elegant way to achieve it.
Select the approach that best suits your system from the following cases.
Decide whether your activity log should be persistent or in memory, based on your use case. Let's assume it is persistent and the DB is MySQL.
If your data is already in the DB, there is no point in storing all of it again. You can just store the keys/ids that identify the rows on which you performed CRUD operations. You can store them as foreign keys if the operations performed are always fixed, or as serialised JSON in the activity table.
For instance, the structure could look like this, where activity_data is a serialised JSON value.
ID | activity_name | activity_data | start_date | end_date |
If gathering the data again at the end, just before sending the response, is a real struggle, you can consider adding activity functions to the database abstraction layer or wrapper module created for MySQL (assuming there is one).
For instance:
try {
  await query(`SELECT * FROM products`);
  // on success: performActivity(insertion)
} catch (err) {
  // on failure: performErrorActivity(insertion)
}
Here we need to consider a minor performance trade-off, as we are performing an insert at each step.
If we want to do it all at once, we need to maintain a collection that accumulates references to all activities, in something like request.activityPayload or maybe a cache, and perform the insertion at the end.
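The "collect first, insert once" idea can be sketched as follows. This is only an illustration under stated assumptions: recordActivity and flushActivities are hypothetical helper names, insertActivities stands in for whatever performs a single multi-row INSERT into the activity table, and the row shape follows the activity_name / activity_data / start_date columns shown earlier.

```javascript
// Hypothetical helper: append one activity entry to the per-request collection.
function recordActivity(req, name, data) {
  if (!req.activityPayload) req.activityPayload = [];
  req.activityPayload.push({
    activity_name: name,
    activity_data: JSON.stringify(data), // store only ids/keys, serialised as JSON
    start_date: new Date().toISOString(),
  });
}

// Hypothetical helper: hand all collected entries to one insert call, then reset.
// `insertActivities` is an assumption standing in for a single multi-row INSERT.
function flushActivities(req, insertActivities) {
  const pending = req.activityPayload || [];
  req.activityPayload = [];
  if (pending.length > 0) insertActivities(pending);
  return pending.length;
}
```

In an Express app, one possible place to call flushActivities is a listener on the response's 'finish' event, so route handlers only need to call recordActivity.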
If you are thinking of adding a new data source specifically for activity, a non-relational DB is highly recommended for storing/dumping such data (MongoDB, in my opinion). This is because it doesn't enforce a schema structure the way a relational DB does, and you can gain performance benefits compared to MySQL, specifically for activity storage.
I'll try to make this as clear as possible so that an example isn't required, as this is a concept that I haven't grasped properly and am struggling with, rather than a problem with the data or the Spark code itself.
I'm required to insert each city's data into its own database (MongoDB), and I'm trying to perform those upserts as fast as possible.
Consider a sample DataFrame with the following columns, where I want to do some upserts against MongoDB based on, for example, year, city and zone.
year - city - zone - num_business - num_vehicles.
Having grouped by those columns, all that's left is to perform the upsert into the DB.
Using the MongoDB driver, I'm required to instantiate several WriteConfigs to cope with multiple databases (one database per city).
// the 'getDatabaseWriteConfigsPerCity' method filters the 'df' so it only contains the docs from a single city.
for (cityDBConnection <- getDatabaseWriteConfigsPerCity(df)) {
  cityDBConnection.getDf.foreach(
    ... // set MongoDB upsert criteria.
  )
}
Doing it that way works, but more performance could be gained by using foreachPartition, as I hope that the records within the DF would be spread across the executors so that more data is upserted concurrently.
However, I get erroneous results when using foreachPartition. Erroneous because they seem incomplete: counters are way off, and so on.
I suspect this is because the same keys end up in different partitions, and it's not until those are merged in the master that they are inserted into MongoDB as a single record.
Is there any way I can make sure partitions contain the total of documents related to an upsert key?
I don't really know if I'm being clear enough, but if it's still too complicated I will update as soon as possible.
"Is there any way I can make sure partitions contain the total of documents related to an upsert key?"
If you do:
df.repartition("city").foreachPartition{...}
you can be sure that all records with the same city are in the same partition (but there is probably more than one city per partition!).
In a Node.js application I have to maintain a "who was online in the last N minutes" state. Since there are potentially thousands of online users, I decided for performance reasons not to update my PostgreSQL user table for this task.
I chose to use Redis to manage the online status. It's very easy and efficient.
But now I want to run complex queries against the user table, sorted by online status.
I was thinking of creating an online table filled every minute from a Redis snapshot, but I'm not sure it's the best solution.
After the table is filled, will the next query referencing the online table take a big hit caused by creating or loading the new indexes?
Does anyone know a better solution?
I had to solve almost exactly the same issue, but I took a different approach because I didn't like the issues caused by trying to mix Redis and Postgres.
My solution was to collect the online data in a queue (ZeroMQ in my case, but any queueing system should work, or a stream processing facility like Amazon Kinesis, which was the alternative I looked at). I then inserted the data in batches into a second table (not the users table). I don't delete from or update that table; only inserts and queries are allowed.
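As a rough illustration of the batching step, the sketch below turns a batch of queued events into one parameterised multi-row INSERT. The table name user_online_events, its columns user_id and seen_at, and the helper name buildBatchInsert are all assumptions made for the example. The returned text and values use the numbered $1, $2 placeholder style that PostgreSQL accepts.

```javascript
// Hypothetical helper: build one multi-row INSERT for a batch of "user seen" events.
// Table and column names are assumptions for illustration.
function buildBatchInsert(events) {
  const values = [];
  const tuples = events.map((event, i) => {
    values.push(event.userId, event.seenAt);
    // Two placeholders per row: ($1, $2), ($3, $4), ...
    return `($${2 * i + 1}, $${2 * i + 2})`;
  });
  const text =
    'INSERT INTO user_online_events (user_id, seen_at) ' +
    `VALUES ${tuples.join(', ')}`;
  return { text, values };
}
```

A consumer would drain the queue every few seconds, call buildBatchInsert on what it collected, and run the resulting statement once per batch instead of once per event.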
Doing things this way preserved the ability to do joins between the last online data and the users table without bogging down the database or creating many updates on the user tables. It has the side effect of giving us a lot of other useful data.
One thing I have thought about when considering other solutions to this problem is that your users table is transactional data (OLTP), while the latest online information is really analytics data (OLAP). So if you have a data warehouse, data lake, big data platform, or whatever the term of the week is for storing and querying this type of data, that may be a better solution.
Right now my workflow looks like this:
get a list of rows from a Postgres database (let's say 10,000)
for each row, call an API endpoint and get a value, so 10,000 values are returned from the API
for each row that has a value returned, update a field in the database, so 10,000 rows are updated
Right now I am doing an update after each API fetch, but as you can imagine this isn't the most optimized way.
What other options do I have?
The bottleneck in that code is probably fetching the data from the API. The trick below only lets you send many small queries to the DB faster, without having to wait for a round trip between each update.
To do multiple updates in a single query, you could use common table expressions and pack multiple small queries into a single CTE query:
https://runkit.com/embed/uyx5f6vumxfy
knex
.with('firstUpdate', knex.raw('?', [knex('table').update({ colName: 'foo' }).where('id', 1)]))
.with('secondUpdate', knex.raw('?', [knex('table').update({ colName: 'bar' }).where('id', 2)]))
.select(1)
The knex.raw trick there is a workaround, since the .with(string, function) implementation has a bug.
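A different technique from the CTE approach above, shown here only as a hedged alternative, is to split the rows into chunks and update each chunk with one UPDATE ... FROM (VALUES ...) statement. The table name items, the column api_value, the integer/text casts, and the helper names chunk and buildBulkUpdate are assumptions for the sketch. The generated text uses PostgreSQL's numbered placeholders and could be passed to a driver along with the values array.

```javascript
// Split an array into chunks of at most `size` elements.
function chunk(rows, size) {
  const out = [];
  for (let i = 0; i < rows.length; i += size) out.push(rows.slice(i, i + size));
  return out;
}

// Hypothetical helper: build one UPDATE that sets api_value for every row in the chunk.
// Table name, column names and casts are assumptions for illustration.
function buildBulkUpdate(rows) {
  const values = [];
  const tuples = rows.map((row, i) => {
    values.push(row.id, row.apiValue);
    return `($${2 * i + 1}::int, $${2 * i + 2}::text)`;
  });
  const text =
    'UPDATE items AS t SET api_value = v.api_value ' +
    `FROM (VALUES ${tuples.join(', ')}) AS v(id, api_value) ` +
    'WHERE t.id = v.id';
  return { text, values };
}
```

With 10,000 rows and a chunk size of 500, this would send 20 statements instead of 10,000, although the API calls would still dominate the total time.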
Recently I've been working on a solution for storing users' search logs/query logs in an HBase table.
Let's simplify the raw query log:
query timestamp req_cookie req_ip ...
Data access patterns:
scan through all queries within a time range
scan through all search history for a specified query
I came up with the following row-key design:
<query>_<timestamp>
But the query may be very long or in a different encoding, so putting the query directly into the row key seems unwise.
I'm looking for help in optimizing this schema. Has anybody handled this scenario before?
1- You can do a full table scan with a time range. If you need real-time responses, you have to maintain a reverse row-key table <timestamp>_<query> (plan your region splitting policy carefully first).
Be warned that sequential row key prefixes will make some of your regions very hot if you have a lot of concurrency, so it would be wise to buffer writes to that table. Additionally, if you get more writes than a single region can handle, you'll need to implement some sort of sharding prefix (e.g. a modulo of the timestamp), although this will make your retrievals a lot more complex (you'll have to merge the results of multiple scans).
2- Hash the query string so that you always have a fixed-length row key without having to care about encoding (MD5, maybe?).
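The two row-key ideas above can be sketched with Node's built-in crypto module. The function names, the 13-digit millisecond timestamp padding, and the default of 16 buckets are assumptions for illustration, not values from the answer. The resulting strings would still need to be converted to bytes by whichever HBase client writes them.

```javascript
const crypto = require('crypto');

// Fixed-length (32 hex characters) digest of the query, independent of its length.
function queryHash(query) {
  return crypto.createHash('md5').update(query, 'utf8').digest('hex');
}

// Main table key: <md5(query)>_<timestamp>, so all history for one query is contiguous.
function queryRowKey(query, timestampMs) {
  return `${queryHash(query)}_${String(timestampMs).padStart(13, '0')}`;
}

// Reverse table key: <bucket>_<timestamp>_<md5(query)>.
// The bucket is the timestamp modulo the bucket count, which spreads sequential writes
// across regions; a time-range read then needs one scan per bucket, merged afterwards.
function timeRowKey(query, timestampMs, buckets = 16) {
  const bucket = String(timestampMs % buckets).padStart(2, '0');
  return `${bucket}_${String(timestampMs).padStart(13, '0')}_${queryHash(query)}`;
}
```

Because the hash is one-way, the original query text has to be stored in a column if it needs to be read back from a time-range scan.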