Postgres with Elasticsearch (keep in sync) - node.js

I want to set up Postgres and Elasticsearch. But before throwing data into Elasticsearch, I want to prevent data loss when the network or a server goes down. After reading up on this topic (https://gocardless.com/blog/syncing-postgres-to-elasticsearch-lessons-learned/), I came up with 3 solutions:
1. Create a database table (e.g. store) and add any new/updated data to it.
   During writes: insert the data into store.
   Select the new data: SELECT data FROM store WHERE modified > (:last modified time from Elasticsearch)
   Send the "new" data over to Elasticsearch.
2. Use Redis pub/sub for requests and have Elasticsearch listen/subscribe for incoming data. If Elasticsearch goes down, the data stays in the queue.
3. Catch any errors during the write to Elasticsearch and save the data somewhere safe (e.g. the store table mentioned above). Then have a cron job push that data back.
Of course the easiest thing would be to insert data into Elasticsearch straight away, but then there is no safe place for the data to fall back to when something breaks. Option 1 is too slow in my opinion, unlike option 2. And option 3 requires maintaining error-handling code.
For now 2 is my option.
Are there better ways to do this? I'd like to hear your opinions and new suggestions :D

Redis pub/sub (option 2) isn't reliable: if no subscriber is listening when a message is published, that message is simply lost.
What I decided to do is write data to Elasticsearch straight away and also add it to an updates table. Then I run a sync() function right after connecting the Elasticsearch client (in case the cluster went down earlier), plus a cron job every 24 hours that launches sync(). All sync() does is select the newest record (by time or id) from the updates table (A) and from Elasticsearch (B) and check whether there are records in A newer than B. If so, it inserts the missing data using the bulk API.
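For illustration, a minimal sketch of what such a sync() could look like, assuming the v8 @elastic/elasticsearch client, a pg Pool, and an updates table with id/modified/data columns (all names here are illustrative, not part of the original setup):

```js
// Hypothetical sketch only: index, table and column names are assumptions.
const { Client } = require('@elastic/elasticsearch'); // v8 client assumed
const { Pool } = require('pg');

const es = new Client({ node: 'http://localhost:9200' });
const pg = new Pool();

async function sync() {
  // B: the newest document already indexed in Elasticsearch
  const latest = await es.search({
    index: 'store',
    size: 1,
    sort: 'modified:desc',
    _source: ['modified'],
  });
  const lastIndexed =
    latest.hits.hits[0]?._source.modified ?? new Date(0).toISOString();

  // A: rows in Postgres that are newer than what Elasticsearch has
  const { rows } = await pg.query(
    'SELECT id, modified, data FROM updates WHERE modified > $1 ORDER BY modified',
    [lastIndexed]
  );
  if (rows.length === 0) return;

  // Push the missing rows with the bulk API
  const operations = rows.flatMap((row) => [
    { index: { _index: 'store', _id: row.id } },
    { modified: row.modified, ...row.data },
  ]);
  await es.bulk({ refresh: true, operations });
}
```

Calling this both right after the client connects and from the daily cron job keeps the catch-up logic in one place.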
Hope this helps :)
And I am still open to suggestions and feedback...

Related

Best way to run a script for a large userbase?

I have users stored in a PostgreSQL database (~10M) and I want to send all of them emails.
Currently I have written a Node.js script which basically fetches users 1000 at a time (OFFSET and LIMIT in SQL) and queues the requests in RabbitMQ. This seems clumsy to me: if the Node process fails at any point I have to restart it (I currently keep track of the number of users skipped per query and can restart at the previous offset found in the logs). This might lead to some users receiving duplicate emails and some not receiving any. I could create a new table with a column indicating whether the email has been sent to that person, but in my current situation I can't do so: I can neither create a new table nor add a new column to the existing table. (Seems like an idempotency problem to me?)
How would you approach this problem? Do you think compound indexes might help? Please explain.
The best way to handle this is indeed to store who received an email, so there's no chance of doing it twice.
If you can't add tables or columns to your existing database, just create a new database for this purpose. To be able to recover from crashes, you need to store who got the email somewhere, so if you are under hard restrictions not to keep this in your main database, get creative with another storage mechanism.
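As a rough sketch of that idea, assuming you are allowed a separate database you control: keyset pagination over the users table plus a small emailed_users table that records progress, so the job can be restarted safely (all names, including queueEmail, are hypothetical):

```js
// Hypothetical sketch: resume-safe email job using a separate progress database.
const { Pool } = require('pg');

const mainDb = new Pool({ /* connection to the main users database (read-only) */ });
const jobDb = new Pool({ /* connection to a separate database you control */ });

async function queueEmail(email) {
  // Placeholder: publish the email job to RabbitMQ (e.g. with amqplib) in the real script.
}

async function run() {
  // One-time setup in the separate database (allowed because it is not the main one).
  await jobDb.query(
    'CREATE TABLE IF NOT EXISTS emailed_users (user_id bigint PRIMARY KEY)'
  );

  let lastId = 0;
  for (;;) {
    // Keyset pagination: cheaper and more stable than OFFSET/LIMIT on ~10M rows.
    const { rows } = await mainDb.query(
      'SELECT id, email FROM users WHERE id > $1 ORDER BY id LIMIT 1000',
      [lastId]
    );
    if (rows.length === 0) break;

    for (const user of rows) {
      // Skip users we have already handled; this is what makes the job idempotent.
      const seen = await jobDb.query(
        'SELECT 1 FROM emailed_users WHERE user_id = $1',
        [user.id]
      );
      if (seen.rowCount === 0) {
        await queueEmail(user.email);
        await jobDb.query(
          'INSERT INTO emailed_users (user_id) VALUES ($1) ON CONFLICT DO NOTHING',
          [user.id]
        );
      }
    }
    lastId = rows[rows.length - 1].id;
  }
}
```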

Best practice for automatically updating a database every day?

I have an original source, Source A, where I am fetching data from. I am reformatting and filtering the data from Source A, and storing it in Source B.
I have created a REST API using Node/Express that uses Source B.
Source A gets updated once every day. I want to update Source B at the same rate. What is the best way to do it?
Ideas so far:
For every API call my server gets, before returning the data, check whether the data was last updated within a day; if not, update it first and then send it. This would mean that one API call per day would be extremely slow.
Perform Cron Scheduling
Are there other ways to do this, and how do the different approaches compare? I would also like to hear whether any of you do something like this in production and what method has worked.
Note: in my case Source A is a CSV file in a GitHub repo, and Source B is a MongoDB collection.
The best case you can achieve here is automatic updates to the MongoDB collection whenever that GitHub CSV file is updated. If you can hook your job execution into whatever event triggers the CSV upload, you would be golden. If you have to go through GitHub, look into GitHub webhooks and see if you can subscribe your workload to one of those events.
There is a nice option 3 you can do with Mongo, by the way: a MongoDB-backed job queue works well for scheduling jobs at precise intervals.
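A minimal sketch of the cron-scheduling option, assuming node-cron, the official mongodb driver, and Node 18+ for the built-in fetch; the CSV URL, parsing, and database/collection names are placeholders:

```js
// Hypothetical sketch: refresh Source B (MongoDB) from Source A (CSV on GitHub) once a day.
const cron = require('node-cron');
const { MongoClient } = require('mongodb');

const CSV_URL = 'https://raw.githubusercontent.com/some-org/some-repo/main/data.csv';
const client = new MongoClient('mongodb://localhost:27017');

async function refreshSourceB() {
  await client.connect();

  const res = await fetch(CSV_URL);                               // built-in fetch (Node 18+)
  const lines = (await res.text()).trim().split('\n').slice(1);   // skip the header row

  const ops = lines.map((line) => {
    const [id, name, value] = line.split(','); // naive CSV parsing, enough for the sketch
    return {
      updateOne: {
        filter: { _id: id },
        update: { $set: { name, value, updatedAt: new Date() } },
        upsert: true,
      },
    };
  });

  const coll = client.db('api').collection('sourceB');
  if (ops.length > 0) await coll.bulkWrite(ops);
}

// Run once a day at 03:00; adjust to match Source A's update time.
cron.schedule('0 3 * * *', () => {
  refreshSourceB().catch(console.error);
});
```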

Fetching 3.6 million records with Sequelize crashes Node, MariaDB Connector works. Any idea why?

As the title already says, I'm trying to run a raw SELECT query that returns 3.6 million records.
When I use the MariaDB Connector (https://github.com/mariadb-corporation/mariadb-connector-nodejs) it works just fine and is done in ~2 minutes.
But Sequelize takes much longer and in the end, Node crashes after 20 to 30 minutes.
Do you guys have any idea how to fix this?
Thank you very much and have a great day!
Take care,
Patrick
When you perform your request, Sequelize will run a SELECT on the underlying database.
Then two things will happen consecutively:
MariaDB will load all the data matching your criteria.
MariaDB will send all the data to Sequelize, which will:
overload your app's memory (all of the data is held in Node.js memory), and
crash Sequelize, because it is not made to handle that much data at once.
When you run a query over a huge dataset, use cursors. With cursors, MariaDB still loads all the data, but Sequelize fetches it in batches (for example, Sequelize loads 100 rows, you process them, then it loads the next 100, so at any given time only about 100 rows are held in Node.js memory).
https://github.com/Kaltsoon/sequelize-cursor-pagination
https://mariadb.com/kb/en/cursor-overview/
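If full cursor support turns out to be awkward, a similar effect can be had by fetching in keyset-paginated batches with a raw Sequelize query; here is a hedged sketch (table and column names are assumptions):

```js
// Hypothetical sketch: walk the 3.6M rows in fixed-size chunks instead of one giant result set.
const { Sequelize, QueryTypes } = require('sequelize');

const sequelize = new Sequelize('mydb', 'user', 'pass', {
  host: 'localhost',
  dialect: 'mariadb',
});

async function processAllRows(handleBatch, batchSize = 10000) {
  let lastId = 0;
  for (;;) {
    const rows = await sequelize.query(
      'SELECT id, payload FROM big_table WHERE id > :lastId ORDER BY id LIMIT :limit',
      { replacements: { lastId, limit: batchSize }, type: QueryTypes.SELECT }
    );
    if (rows.length === 0) break;
    await handleBatch(rows);              // process/aggregate this chunk, then let it be GC'd
    lastId = rows[rows.length - 1].id;    // keyset pagination: continue after the last seen id
  }
}

// Usage:
// await processAllRows(async (rows) => { /* aggregate or write the chunk somewhere */ });
```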
Redesign your app to do more of the work in the database. Downloading 3.6M rows is poor design for any application.
If you would like to explain what you are doing with that many rows, maybe we can help create a workaround. (And it may even work faster than 2 minutes!)

Redis and PostgreSQL synchronization (online user status)

In a Node.js application I have to maintain a "who was online in the last N minutes" state. Since there are potentially thousands of online users, for performance reasons I decided not to update my PostgreSQL user table for this task.
I chose to use Redis to manage the online status. It's very easy and efficient.
But now I want to run complex queries against the user table, sorted by online status.
I was thinking of creating an online table filled every minute from a Redis snapshot, but I'm not sure it's the best solution.
After that table is filled, will the next query referencing the online table take a big hit from index creation or loading?
Does anyone know a better solution?
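For context, here is a minimal sketch of the kind of Redis bookkeeping described in the question, using a sorted set scored by last-seen timestamp (assuming the node-redis v4 client; the key name and window are illustrative):

```js
// Hypothetical sketch of "online in the last N minutes" with a Redis sorted set.
const { createClient } = require('redis');

const redis = createClient();
const ready = redis.connect(); // connect once; await `ready` before first use

// Call on every authenticated request / websocket ping.
async function touchUser(userId) {
  await ready;
  await redis.zAdd('online_users', { score: Date.now(), value: String(userId) });
}

// Everyone seen within the last 5 minutes.
async function getOnlineUserIds() {
  await ready;
  const cutoff = Date.now() - 5 * 60 * 1000;
  await redis.zRemRangeByScore('online_users', 0, cutoff); // drop stale entries
  return redis.zRange('online_users', 0, -1);
}
```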
I had to solve almost this exact same issue, but I took a different approach because I didn't like the issues caused by trying to mix Redis and Postgres.
My solution was to collect the online data in a queue (ZeroMQ in my case, but any queueing system should work, or a stream-processing service like Amazon Kinesis, the alternative I looked at). I then inserted the data in batches into a second table (not the users table). I don't delete or update that table; only inserts and queries are allowed.
Doing things this way preserved the ability to join the latest online data against the users table without bogging down the database or creating many updates on the users table. It has the side effect of giving us a lot of other useful data.
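A rough sketch of that batch-insert step, with a plain in-memory buffer standing in for the queue consumer and an assumed user_online_events table (names are illustrative):

```js
// Hypothetical sketch: drain queued "user was online" events into an append-only
// Postgres table in batches, then join it against users when querying.
const { Pool } = require('pg');
const pg = new Pool();

const buffer = []; // filled by your queue consumer (ZeroMQ, Kinesis, ...)

async function flushOnlineEvents() {
  if (buffer.length === 0) return;
  const batch = buffer.splice(0, buffer.length);

  // Multi-row insert; the table is insert-only, never updated or deleted from.
  const values = [];
  const params = [];
  batch.forEach(({ userId, seenAt }, i) => {
    values.push(`($${i * 2 + 1}, $${i * 2 + 2})`);
    params.push(userId, seenAt);
  });
  await pg.query(
    `INSERT INTO user_online_events (user_id, seen_at) VALUES ${values.join(', ')}`,
    params
  );
}

setInterval(() => flushOnlineEvents().catch(console.error), 60 * 1000);

// Query users sorted by their latest online time:
//   SELECT u.*, MAX(e.seen_at) AS last_seen
//   FROM users u
//   LEFT JOIN user_online_events e ON e.user_id = u.id
//   GROUP BY u.id
//   ORDER BY last_seen DESC NULLS LAST;
```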
One thing I have thought about when considering other solutions to this problem: your users table is transactional data (OLTP), while the latest-online information is really analytics data (OLAP). So if you have a data warehouse, data lake, big data platform, or whatever term of the week you want to use for storing and querying this type of data, that may be a better solution.

MongoDB, can I trigger secondary replication only at a given time or manually?

I'm not a MongoDB expert, so I'm a little unsure about the server setup now.
I have a single instance running Mongo 3.0.2 with WiredTiger, accepting both read and write ops. It collects logs from clients, so the write load is decent. Once a day I want to process these logs and calculate some metrics using the aggregation framework; the data set to process is roughly all logs from the last month, and the whole calculation takes about 5-6 hours.
I'm thinking about splitting writes and reads to avoid locks on my collections (the server continues to write logs while I'm reading; newly written logs may match my queries, but I can skip them because I don't need 100% accuracy).
In other words, I want a setup with a secondary for reads, where replication does not run continuously but starts at a configured time, or better, is triggered before the read operations start.
I'm doing all my processing from Node.js, so one option I see is to export the data created in some period like [yesterday, today] myself, import it into the read instance, and run the calculations after the import is done. I was looking at replica sets and master/slave replication as possible setups, but I didn't figure out how to configure them for the described scenario.
So maybe I'm wrong and missing something here? Are there any other options to achieve this?
Your idea of using a replica-set is flawed for several reasons.
First, a replica-set always replicates the whole mongod instance. You can't enable it for individual collections, and certainly not only for specific documents of a collection.
Second, deactivating replication and re-enabling it before you start your report generation is not a good idea either. When you enable replication, the new slave will not be immediately up to date; it will take a while until it has processed the changes since its last contact with the master, and there is no way to tell how long this will take. (You can check how far a secondary is behind the primary using rs.status(), comparing the secondary's optimeDate with its lastHeartbeat date.)
But when you want to perform data-mining on a subset of your documents selected by timespan, there is another solution.
Transfer the documents you want to analyze to a new collection. You can do this with an aggregation pipeline consisting only of a $match stage which matches the documents from the last month, followed by $out. The out-operator specifies that the results of the aggregation are not sent to the application/shell, but instead written to a new collection (which is automatically emptied before this happens). You can then perform your reporting on the new collection without locking the actual one. It also has the advantage that you are now operating on a much smaller collection, so queries will be faster, especially those which can't use indexes. Also, your data won't change between aggregations, so your reports won't have inconsistencies caused by the data changing underneath them.
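A minimal sketch of that $match + $out pipeline with the Node.js driver (collection and field names are assumptions):

```js
// Hypothetical sketch: copy last month's logs into a separate collection and report on that.
const { MongoClient } = require('mongodb');

async function snapshotLastMonth() {
  const client = await MongoClient.connect('mongodb://localhost:27017');
  const logs = client.db('app').collection('logs');

  const since = new Date(Date.now() - 30 * 24 * 60 * 60 * 1000);
  await logs.aggregate([
    { $match: { createdAt: { $gte: since } } }, // only the documents you want to analyze
    { $out: 'logs_report' },                    // written to a new collection, replacing its contents
  ]).toArray(); // draining the cursor forces the pipeline (and $out) to execute

  // Run the heavy metric aggregations against 'logs_report' without touching 'logs'.
  await client.close();
}
```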
When you are certain that you will need a second server for report generation, you can still use replication and perform the aggregation on the secondary. However, I would really recommend you to build a proper replica-set (consisting of primary, secondary and an arbiter) and leave replication active at all times. Not only will that make sure that your data isn't outdated when you generate your reports, it also gives you the important benefit of automatic failover should your primary go down for some reason.
