Best practice for automatically updating a database every day? - node.js

I have an original source, Source A, where I am fetching data from. I am reformatting and filtering the data from Source A, and storing it in Source B.
I have created a REST API using Node/Express that uses Source B.
Source A gets updated once every day. I want to update Source B at the same rate. What is the best way to do it?
Ideas so far:
1. For every API call my server receives, check before returning the data whether it was last updated within a day. If not, update the data and then send it. This would mean that one API call per day would be extremely slow.
2. Schedule the update as a cron job.
I would like to know whether there are more ways to do this, and I would like a comparison of the different approaches. I would also like to know whether any of you do something like this in production, and which method has worked.
Note: In my case Source A is a CSV file in a GitHub repo, and Source B is a MongoDB collection.

The best case you can achieve here is automatic updates to the MongoDB collection whenever that GitHub CSV file is updated. If you can hook your job execution into whatever event triggers the CSV upload, you would be golden. If you have to go through GitHub, look into GitHub webhooks and see if you can subscribe your workload to one of those events.
There is also a nice option 3 that you can do with Mongo, by the way: Mongo queues are great for scheduling jobs at precise intervals.
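If the GitHub route is taken, a push webhook delivers a JSON payload whose commits list the files each commit added or modified. Below is a minimal sketch of the check a handler could run before re-importing. The function name and the path data/source.csv are hypothetical, and verifying the webhook signature is left out.

```javascript
// Returns true when a GitHub "push" webhook payload reports that
// filePath was added or modified by any commit in the push.
function pushTouchesFile(payload, filePath) {
  const commits = Array.isArray(payload && payload.commits) ? payload.commits : [];
  return commits.some((commit) => {
    const changed = [...(commit.added || []), ...(commit.modified || [])];
    return changed.includes(filePath);
  });
}

// Example payload, trimmed to the fields used here.
const examplePush = {
  ref: 'refs/heads/main',
  commits: [{ added: [], modified: ['data/source.csv'], removed: [] }],
};

if (pushTouchesFile(examplePush, 'data/source.csv')) {
  // Hypothetical: this is where the CSV -> MongoDB import job would start.
  console.log('CSV changed, refresh Source B');
}
```

Compared with a fixed cron schedule, this only does work when Source A actually changes, at the cost of exposing an endpoint that GitHub can reach.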

Related

Check if a DynamoDB table has been updated, or is currently being updated, with AWS SDK for Python (Boto3)

Even though I skimmed through the official docs, I couldn't find a simple example showing how to check this.
Context:
One or more files are uploaded into an S3 bucket. This triggers a data preprocessing Lambda function, which then writes these changes into certain DynamoDB tables. Later on, these tables need to be processed further.
To prevent the associated Lambda function from scanning all DynamoDB tables, there should be a way of knowing which tables have been updated.
Desirable solution:
Naively, there should be a way of doing the following via the Python SDK boto3:
for _table in entire_table_list:
    if is_updated(_table):
        ...  # trigger further processing
    elif is_being_updated(_table):
        ...  # do something else, maybe wait till the table update is done
As I have not found something akin to this, I'd be delighted to know how to handle this situation best.

Best way to run a script for large userbase?

I have users stored in a PostgreSQL database (~10 M) and I want to send all of them emails.
Currently I have written a Node.js script which fetches users 1000 at a time (OFFSET and LIMIT in SQL) and queues the requests in RabbitMQ. This seems clumsy to me: if the Node process fails at any point I have to restart it (I currently keep track of the number of users skipped per query, and can resume from the last skipped count found in the logs). This might lead to some users receiving duplicate emails and some not receiving any. I could create a new table with a column indicating whether the email has been sent to that person, but in my current situation I can't do so. I can neither create a new table nor add a new row to an existing table. (This seems to me like an idempotency problem?)
How would you approach this problem? Do you think compound indexes might help? Please explain.

The best way to handle this is indeed to store who received an email, so there's no chance of doing it twice.
If you can't add tables or columns to your existing database, just create a new database for this purpose. If you want to be able to recover from crashes, you will need to store who got the email somewhere, so if you are given hard restrictions on not storing this in your main database, get creative with another storage mechanism.
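A minimal sketch of that idea, using in-memory stand-ins rather than real PostgreSQL or RabbitMQ clients. All names here are hypothetical. It also swaps OFFSET paging for paging by last seen id (keyset pagination), which is an assumption on my part and not something the question already does.

```javascript
// In-memory stand-ins: `users` plays the users table, `sentIds` plays the
// separate "already emailed" store, and `enqueue` plays the queue publish.
function queuePendingEmails(users, sentIds, enqueue, batchSize = 1000) {
  const ordered = [...users].sort((a, b) => a.id - b.id);
  let lastId = -Infinity;
  let queued = 0;
  for (;;) {
    // Equivalent of: WHERE id > lastId ORDER BY id LIMIT batchSize
    const batch = ordered.filter((u) => u.id > lastId).slice(0, batchSize);
    if (batch.length === 0) break;
    for (const user of batch) {
      if (sentIds.has(user.id)) continue; // handled in an earlier run
      enqueue(user);
      sentIds.add(user.id); // record it so a restart skips this user
      queued += 1;
    }
    lastId = batch[batch.length - 1].id;
  }
  return queued;
}

// Simulated restart: user 2 was already emailed before the crash.
const alreadySent = new Set([2]);
const outbox = [];
const people = [{ id: 1 }, { id: 2 }, { id: 3 }];
queuePendingEmails(people, alreadySent, (u) => outbox.push(u.id), 2);
console.log(outbox); // only users 1 and 3 are queued
```

Running the function a second time against the same `alreadySent` store queues nothing, which is the property that makes restarts safe. A crash between `enqueue` and recording the id can still produce one duplicate, so the real store should be written as close to the publish as possible.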

Incremental Data Storage

I have time series daily data which I run a model on. The model runs in Spark.
I only want to run the model daily, and append the results to the historic results. It is important to have a 'merged single data source' containing historical data for the model to run successfully.
I have to use an AWS service to store the results. If I store in S3, I will end up storing backfill + 1 file per day (too many files). If I store in Redshift, it doesn't merge + upsert, therefore becoming complicated. The customer facing data is in Redshift, so dropping the table and reloading daily is not an option.
I am not sure how to cleverly (defined as minimal cost and subsequent processing) store the incremental data without re-processing everything daily to get a single file.

S3 is still your best shot. Since your results don't seem to need real-time access, this is more of a rolling data set.
If you are worried about the number of files it generates, there are at least two things you can do:
1. S3 object lifecycle management
You can define rules so that objects are removed or transitioned to another (cheaper) storage class after x days.
More examples: https://docs.aws.amazon.com/AmazonS3/latest/dev/lifecycle-configuration-examples.html
2. S3 notification
Basically, you can set up a listener on your S3 bucket that 'listens for' all objects matching your specified prefix and suffix and triggers other AWS services. One easy option is to trigger a Lambda function, do your processing there, and then do whatever you would like with the result.
https://docs.aws.amazon.com/AmazonS3/latest/user-guide/enable-event-notifications.html
Use S3 as your database whenever it's possible. It's damn cheap and it's AWS's backbone.
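As an illustration of the lifecycle option, here is a sketch that builds a rule in the shape accepted by S3's PutBucketLifecycleConfiguration API. The prefix, rule ID, storage class choice and day counts are placeholders, and expiring objects only makes sense if the old daily files are no longer needed once merged.

```javascript
// Builds a lifecycle configuration object: move objects under `prefix` to a
// cheaper storage class after some days, then delete them later.
function buildLifecycleConfiguration(prefix, transitionAfterDays, expireAfterDays) {
  if (expireAfterDays <= transitionAfterDays) {
    throw new RangeError('expireAfterDays must be greater than transitionAfterDays');
  }
  return {
    Rules: [
      {
        ID: 'archive-then-expire-daily-results', // placeholder rule name
        Status: 'Enabled',
        Filter: { Prefix: prefix },
        Transitions: [{ Days: transitionAfterDays, StorageClass: 'GLACIER' }],
        Expiration: { Days: expireAfterDays },
      },
    ],
  };
}

// Placeholder values: archive daily result files after 30 days, delete after a year.
const config = buildLifecycleConfiguration('model-results/daily/', 30, 365);
console.log(JSON.stringify(config, null, 2));
```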
You can also switch to an ETL tool. A very efficient one, which is open source, specialized in big data, fully automatable and easy to use, is Pentaho Data Integration.
It comes equipped with ready-made plugins for S3, Redshift (and others), and there is a single step to compare with previous values. From my experience it runs pretty fast. Plus it works for you during the night and sends you a morning mail saying everything went OK (or not).
Note to the moderators: this is an agnostic point of view. I could have recommended many others, but this one seems the most suited to the OP's need.

Mongodb, can i trigger secondary replication only at the given time or manually?

I'm not a MongoDB expert, so I'm a little unsure about my server setup.
I have a single instance running MongoDB 3.0.2 with WiredTiger, accepting both read and write operations. It collects logs from clients, so the write load is decent. Once a day I want to process these logs and calculate some metrics using the aggregation framework. The data set to process is roughly all logs from the last month, and the whole calculation takes about 5-6 hours.
I'm thinking about splitting writes and reads to avoid locks on my collections (the server continues to write logs while I'm reading; newly written logs may match my queries, but I can skip them because I don't need 100% accuracy).
In other words, I want a setup with a secondary for reads, where replication does not run continuously but starts at a configured time or, better, is triggered before the read operations start.
I'm doing all my processing from Node.js, so one option I see is to export the data created in some period such as [yesterday, today], import it into the read instance myself, and run the calculations after the import is done. I looked at replica sets and master/slave replication as possible setups, but I couldn't work out how to configure them to achieve the described scenario.
So maybe I am wrong and missing something here? Are there any other options to achieve this?

Your idea of using a replica-set is flawed for several reasons.
First, a replica-set always replicates the whole mongod instance. You can't enable it for individual collections, and certainly not only for specific documents of a collection.
Second, deactivating replication and enabling it before you start your report generation is not a good idea either. When you enable replication, the new slave will not be immediately up-to-date. It will take a while until it has processed the changes since its last contact with the master. There is no way to tell how long this will take (you can check how far a secondary is behind the primary using rs.status() and comparing the secondary's optimeDate with its lastHeartbeat date).
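A small sketch of such a lag check. It assumes a status object shaped like the output of rs.status(), with a `members` array whose entries carry `name`, `stateStr` and `optimeDate`. Note that it uses a common variant of the comparison, the primary's optimeDate against each secondary's optimeDate, and the sample data is made up.

```javascript
// Computes, per secondary, how many seconds its last applied operation
// trails the primary's last applied operation.
function replicationLagSeconds(status) {
  const primary = status.members.find((m) => m.stateStr === 'PRIMARY');
  if (!primary) throw new Error('no primary in replica set status');
  const lag = {};
  for (const member of status.members) {
    if (member.stateStr !== 'SECONDARY') continue;
    lag[member.name] =
      (primary.optimeDate.getTime() - member.optimeDate.getTime()) / 1000;
  }
  return lag;
}

// Made-up status with one primary, one secondary and one arbiter.
const sampleStatus = {
  members: [
    { name: 'db1:27017', stateStr: 'PRIMARY', optimeDate: new Date('2015-05-01T10:00:30Z') },
    { name: 'db2:27017', stateStr: 'SECONDARY', optimeDate: new Date('2015-05-01T10:00:00Z') },
    { name: 'db3:27017', stateStr: 'ARBITER' },
  ],
};
console.log(replicationLagSeconds(sampleStatus)); // db2:27017 trails by 30 seconds
```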
But when you want to perform data-mining on a subset of your documents selected by timespan, there is another solution.
Transfer the documents you want to analyze to a new collection. You can do this with an aggregation pipeline consisting only of a $match which matches the documents from the last month followed by an $out. The out-operator specifies that the results of the aggregation are not sent to the application/shell, but instead written to a new collection (which is automatically emptied before this happens). You can then perform your reporting on the new collection without locking the actual one. It also has the advantage that you are now operating on a much smaller collection, so queries will be faster, especially those which can't use indexes. Also, your data won't change between your aggregations, so your reports won't have any inconsistencies between them due to data changing between them.
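The snapshot step described above can be sketched as a two-stage pipeline. The timestamp field `createdAt`, the collection names and the 30-day window are assumptions for illustration, since the question does not show the log schema.

```javascript
// Builds a pipeline that copies documents from the last `days` days
// into `targetCollection` ($out replaces that collection's contents).
function buildSnapshotPipeline(now, targetCollection, days = 30) {
  const since = new Date(now.getTime() - days * 24 * 60 * 60 * 1000);
  return [
    { $match: { createdAt: { $gte: since } } }, // hypothetical timestamp field
    { $out: targetCollection },
  ];
}

const pipeline = buildSnapshotPipeline(new Date(), 'logs_last_month');
console.log(JSON.stringify(pipeline));

// With the official Node.js driver this would be run roughly as:
//   await db.collection('logs').aggregate(pipeline).toArray();
// after which the reports query 'logs_last_month' instead of 'logs'.
```

An index on the timestamp field keeps the $match stage from scanning the whole log collection.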
When you are certain that you will need a second server for report generation, you can still use replication and perform the aggregation on the secondary. However, I would really recommend you to build a proper replica-set (consisting of primary, secondary and an arbiter) and leave replication active at all times. Not only will that make sure that your data isn't outdated when you generate your reports, it also gives you the important benefit of automatic failover should your primary go down for some reason.

Node.js - Scaling with Redis atomic updates

I have a Node.js app that performs the following:
1. get data from Redis
2. perform a calculation on the data
3. write the new result back to Redis
This process may take place several times per second. The issue I now face is that I wish to run multiple instances of this process, and I am obviously seeing out-of-date data being written, because each node updates after another has already read the last value.
How would I make the above process atomic?
I cannot add the operation to a transaction within Redis as I need to get the data (which would force a commit) before I can process and update.
Can anyone advise?

Apologies for the lack of clarity with the question.
After further reading, I can indeed use transactions. The part I was struggling to understand was that I need to separate the read from the update, wrap only the update in the transaction, and use WATCH on the key being read. This causes the update transaction to fail if another update has taken place in the meantime.
So the workflow is:
WATCH key
GET key
MULTI
SET key
EXEC
Hopefully this is useful for anyone else looking for an atomic get and update.
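Because a failed EXEC means the whole read-compute-write cycle has to be repeated, the workflow is usually wrapped in a retry loop. The sketch below simulates the WATCH semantics with a tiny in-memory store so that it runs without a Redis server. `MiniStore` and `atomicUpdate` are hypothetical names and this is not a Redis client API.

```javascript
// In-memory stand-in that mimics WATCH / MULTI / SET / EXEC for one key.
class MiniStore {
  constructor() {
    this.data = new Map();
    this.versions = new Map();
  }
  get(key) {
    return this.data.get(key);
  }
  set(key, value) {
    this.data.set(key, value);
    this.versions.set(key, (this.versions.get(key) || 0) + 1);
  }
  // WATCH: remember the key's version at read time.
  watch(key) {
    return { key, version: this.versions.get(key) || 0 };
  }
  // MULTI + SET + EXEC: apply only if the key is unchanged since WATCH.
  execSet(watched, value) {
    if ((this.versions.get(watched.key) || 0) !== watched.version) return null;
    this.set(watched.key, value);
    return ['OK'];
  }
}

// Retries the read-compute-write cycle until EXEC succeeds.
function atomicUpdate(store, key, compute, maxRetries = 5) {
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    const watched = store.watch(key);
    const next = compute(store.get(key));
    if (store.execSet(watched, next) !== null) {
      return { value: next, attempts: attempt };
    }
  }
  throw new Error(`gave up updating ${key} after ${maxRetries} attempts`);
}

const store = new MiniStore();
store.set('counter', 1);
console.log(atomicUpdate(store, 'counter', (value) => value + 1));
```

The retry limit matters under heavy contention: with many writers on one key, most attempts fail, and a server-side script or an atomic command such as INCR may be a better fit when the calculation allows it.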
Redis supports atomic transactions http://redis.io/topics/transactions
