Processing a stream in Node where action depends on asynchronous calls

I am trying to write a node program that takes a stream of data (using xml-stream) and consolidates it and writes it to a database (using mongoose). I am having problems figuring out how to do the consolidation, since the data may not have hit the database by the time I am processing the next record. I am trying to do something like:
on order data being read from stream
    look to see if customer exists on mongodb collection
    if customer exists
        add the order to the document
    else
        create the customer record with just this order
    save the customer
My problem is that two 'nearby' orders for a customer cause duplicate customer records to be written, since the first one hasn't been written before the second one checks to see if it is there.
In theory I think I could get around the problem by pausing the xml-stream, but there is a bug preventing me from doing this.

I'm not sure this is the best option, but I ended up using an async queue.
While I was doing that, a pull request that allows pausing was added to xml-stream (which is what I was using to process the stream).
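For illustration, here is a minimal sketch of the queue idea. It is a hand-rolled serial queue rather than the async library's API, and the in-memory `customers` map, `findCustomer`, `saveCustomer` and `handleOrder` are hypothetical stand-ins for the mongoose calls. The point is only that each order is looked up and saved before the next one starts.

```javascript
// In-memory stand-in for the customers collection (hypothetical; a real
// program would call mongoose here instead).
const customers = new Map();
const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function findCustomer(id) {
  await delay(5); // simulate database latency
  return customers.get(id);
}

async function saveCustomer(customer) {
  await delay(5);
  customers.set(customer.id, customer);
}

// Find-or-create logic from the question; only safe when run one order at a time.
async function handleOrder(order) {
  const existing = await findCustomer(order.customerId);
  const customer = existing || { id: order.customerId, orders: [] };
  customer.orders.push(order.orderId);
  await saveCustomer(customer);
}

// Minimal serial queue: each task starts only after the previous one settles.
function createSerialQueue(worker) {
  let tail = Promise.resolve();
  return {
    push(item) {
      const result = tail.then(() => worker(item));
      tail = result.catch(() => {}); // keep the chain alive after a failure
      return result;
    },
    drain() {
      return tail; // resolves once everything pushed so far has finished
    },
  };
}

// Usage: push from the stream's data handler instead of processing inline.
const queue = createSerialQueue(handleOrder);
queue.push({ customerId: 'c1', orderId: 'o1' });
queue.push({ customerId: 'c1', orderId: 'o2' });
queue.drain().then(() => console.log(customers.get('c1')));
```

The trade-off is throughput: orders for unrelated customers also wait for each other. If that matters, one queue per customer id would be a possible refinement.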

Is there a unique field on the customer object in the data coming from the stream? You could add a unique constraint to your mongoose schema to prevent duplicates at the database level.
When creating new customers, add some fallback logic to handle the case where you try to create a customer but another save creates that same customer at the same time. When this happens, retry the save, but first fetch the other customer and add the order to the fetched customer document.
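A rough sketch of that fallback, under some assumptions: the `collection` object below is a hypothetical in-memory stand-in with a unique constraint on `id`, and it rejects duplicates with error code 11000, which is the code MongoDB reports for duplicate-key errors. `pushOrder` stands in for an atomic update of the stored document.

```javascript
const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));
const docs = new Map();

// Hypothetical collection with a unique constraint on `id`.
const collection = {
  async findById(id) {
    await delay(5);
    return docs.get(id);
  },
  async insert(doc) {
    await delay(5);
    if (docs.has(doc.id)) {
      const err = new Error('duplicate key');
      err.code = 11000; // MongoDB's duplicate-key error code
      throw err;
    }
    docs.set(doc.id, doc);
  },
  async pushOrder(id, orderId) {
    await delay(5);
    docs.get(id).orders.push(orderId);
  },
};

async function addOrder(order) {
  const existing = await collection.findById(order.customerId);
  if (existing) {
    return collection.pushOrder(order.customerId, order.orderId);
  }
  try {
    await collection.insert({ id: order.customerId, orders: [order.orderId] });
  } catch (err) {
    if (err.code !== 11000) throw err; // unrelated failure: let it surface
    // Another save created the customer first: add the order to that record.
    await collection.pushOrder(order.customerId, order.orderId);
  }
}

// Usage: two 'nearby' orders racing for the same new customer.
Promise.all([
  addOrder({ customerId: 'c1', orderId: 'o1' }),
  addOrder({ customerId: 'c1', orderId: 'o2' }),
]).then(() => console.log(docs.get('c1')));
```

Both calls miss on the initial lookup, one insert wins, and the loser falls back to updating the record the winner created, so only one customer document exists afterwards.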

Related

Best way to run a script for large userbase?

I have users stored in a PostgreSQL database (~10 M) and I want to send all of them emails.
Currently I have written a Node.js script which fetches users 1000 at a time (OFFSET and LIMIT in SQL) and queues the requests in RabbitMQ. This seems clumsy to me, because if the node process fails at any time I have to restart it (I am currently keeping track of the number of users skipped per query, and can restart from the last skip count found in the logs). This might lead to some users receiving duplicate emails and some not receiving any. I could create a new table with a column indicating whether the email has been sent to that person, but in my current situation I can't do so: I can neither create a new table nor add a new row to an existing table. (Seems to me like an idempotency problem?)
How would you approach this problem? Do you think compound indexes might help? Please explain.

The best way to handle this is indeed to store who received an email, so there's no chance of doing it twice.
If you can't add tables or columns to your existing database, just create a new database for this purpose. If you want to be able to recover from crashes, you will need to store who got the email somewhere, so if you are given hard restrictions on not storing this in your main database, get creative with another storage mechanism.
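As a sketch of what "store who got the email" could look like: `createLedger`, `sendBatch` and `sendEmail` below are hypothetical names, and the `Set` stands in for whatever separate store you end up using (another database, a file, and so on) that survives a crash.

```javascript
// Hypothetical ledger of delivered emails. A Set stands in for a separate
// database or file that survives a crash.
function createLedger() {
  const sent = new Set();
  return {
    async has(userId) { return sent.has(userId); },
    async add(userId) { sent.add(userId); },
  };
}

// Send to every user in the batch who is not already in the ledger.
// Returns how many emails were actually sent.
async function sendBatch(users, ledger, sendEmail) {
  let delivered = 0;
  for (const user of users) {
    if (await ledger.has(user.id)) continue; // already emailed before a restart
    await sendEmail(user);
    await ledger.add(user.id); // record only after the send succeeds
    delivered += 1;
  }
  return delivered;
}

// Usage: running the same batch twice sends nothing the second time.
const users = [{ id: 1 }, { id: 2 }, { id: 3 }];
const ledger = createLedger();
const sendEmail = async (user) => console.log(`sending to user ${user.id}`);
sendBatch(users, ledger, sendEmail)
  .then(() => sendBatch(users, ledger, sendEmail))
  .then((count) => console.log(`second run sent ${count} emails`));
```

With this in place the script can be restarted from the beginning after a crash. A crash between the send and the ledger write can still produce one duplicate for that user, so this narrows the window rather than closing it completely.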

How to seed data with CloudKit?

I need to create some records in CloudKit for each user when they start an app.
I can't just write a seed function that creates records, because when the user starts the app on two devices, each device will write its own seed records.
What I want instead is for the first device that writes to CloudKit to create the records. The second device would then simply update the values of those records, not recreate them.
How can I achieve this?

You have a few options available to you. All of them could potentially lead to race conditions when both devices attempt to write at the same time, but the chance of that actually happening is minimal.
No matter which approach is taken, you should always take the stance of query first. Check if the record exists, update it if needed, then write the new/updated values.
So, in your example:
The first app would query for the record, and create the record - because no record exists.
The second app to launch would query for the record, find it, then do nothing, because the record exists.
Each record in CloudKit maintains a modificationDate. So if you are really concerned about overwriting data that shouldn't be overwritten, you can add additional queries and date checks to determine whether the write should happen.

Handle Duplicate insertion node js

What is the best way to handle duplicate insertion?
Either we check before insertion whether the item already exists and notify the user of the duplicate entry, or we handle the error message and let the user know that it's a duplicate entry.
The first approach will cost us an extra database call.
If there is any other, better approach to handle this, please let me know.

Duplicate insertion happens at the database level.
Your call to the API must be coming from the front end, so you need to ensure that the duplicate call is avoided in the first place, e.g. you should disable the button as soon as the user clicks it the first time.
Or
You can add a database schema level check such as a primary key, so that if duplicate data comes in an error is thrown and can be forwarded to the user.
Or
Add the checks mentioned in
http://dev.mysql.com/doc/refman/5.7/en/insert-on-duplicate.html
Checking whether the data exists before insertion is an expensive call, and you will have to hit the master for it, so try to avoid that.
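To make the schema-level option concrete, here is a small sketch that attempts the insert and translates the duplicate-key failure into a message for the user, without a lookup beforehand. The in-memory `rows` table, `insertUser` and `registerUser` are hypothetical. The error code `ER_DUP_ENTRY` is MySQL's symbol for a duplicate-key error (1062); how your driver exposes it is an assumption to check.

```javascript
// Hypothetical table with a primary key on `email`, kept in memory.
const rows = new Map();

async function insertUser(user) {
  if (rows.has(user.email)) {
    const err = new Error(`Duplicate entry '${user.email}' for key 'PRIMARY'`);
    err.code = 'ER_DUP_ENTRY'; // MySQL's symbol for error 1062
    throw err;
  }
  rows.set(user.email, user);
}

// One round trip: attempt the insert and translate a duplicate-key failure.
async function registerUser(user) {
  try {
    await insertUser(user);
    return { ok: true };
  } catch (err) {
    if (err.code === 'ER_DUP_ENTRY') {
      return { ok: false, message: 'That entry already exists.' };
    }
    throw err; // anything else is a real failure
  }
}

// Usage: the second call reports the duplicate instead of throwing.
registerUser({ email: 'a@example.com' })
  .then(() => registerUser({ email: 'a@example.com' }))
  .then((result) => console.log(result));
```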

The best approach is to use a primary key based on the data. If this is not possible with your data then you'll have to query the database before insertion.

How to create an appropriate database model for the IM

Recently we have been developing an IM feature for our app, and we want to save the chat records with Core Data. The strategy we came up with is:
every account has a separate sqlite file.
every chat has a separate table (dynamically created, refer to this article); however, the table structure is the same, such as:
sender_id
msg_id
content
msg_send_time
...
If we put all the chat messages in one table, we would fetch the records by "fromid and toid" to get a specific dialog's records. However, if we have thousands upon thousands of messages in this table, we suspect the fetch request would be very slow, so we create a specific table for each dialog.
So, is there any better solution for this problem?

Creating "tables" for conversations dynamically is a very bad idea. It will create so much overhead that it will make your code completely inefficient.
Instead, use a single entity (not a table, mind you; Core Data is not a database) to capture the messages. Filter by user IDs.
This will perform without a glitch with hundreds of thousands of messages, far more than should be stored or displayed on a mobile device.

How to update fields automatically

In my CouchDB database I'd like all documents to have an 'updated_at' timestamp added when they're changed (and have this enforced).
I can't modify the document with validation functions
update functions won't run unless they're called specifically (so it'd be possible to update the document without calling the specific update function)
How should I go about implementing this?

There is currently no way to do this without triggering _update handlers. Tracking the time documents change is a nice idea, but it runs into problems with replication.
Replication works on top of the public API, which means that:
If such a trigger were enforced, replication would break, since it would be impossible to sync data as it is without modifying the document. Because the document gets modified, it receives a new revision, which may easily lead to an endless loop if you replicate data from database A to B and from B to A in continuous mode.
In the other case, where replication is made to work, there will always be a way to work around your trigger.

I can suggest one workaround: you can create a view which emits the current date as a key (or a part of it):
function (doc) {
  emit(new Date(), null);
}
This will assign current dates to all documents as soon as the view generation gets triggered (which happens after first request to it) and will reassign new dates on each update of a specific document.
Although the above should solve your issue, I would advise against using it for the reasons already explained by Kxepal: if you're on a replicated network, each node will assign its own dates. Taking this into account, the best I can recommend is to solve the issue on the client side and just post the documents with a date already embedded.
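A sketch of the client-side approach, with one optional addition that goes beyond what the answers say. `withTimestamp` is a hypothetical helper that embeds the date before the document is posted. `validateDocUpdate` has the argument list of a CouchDB `validate_doc_update` function, and its body could be stored under that field in a design document. It cannot modify the document, but it can reject writes that arrive without a timestamp, which is as close to "enforced" as this gets. Treat it as an assumption to test against your own setup.

```javascript
// Client-side helper (hypothetical name): stamp the document before saving it.
function withTimestamp(doc, now = new Date()) {
  return { ...doc, updated_at: now.toISOString() };
}

// Validation function for a design document. It cannot change the document,
// but it can reject writes that lack a parseable timestamp.
function validateDocUpdate(newDoc, oldDoc, userCtx, secObj) {
  if (newDoc._deleted) return; // deletions carry no body to check
  const stamp = newDoc.updated_at;
  if (typeof stamp !== 'string' || isNaN(Date.parse(stamp))) {
    throw { forbidden: 'updated_at must be a valid date string' };
  }
}

// Usage: stamp on the client, then send the result to the database.
const doc = withTimestamp({ _id: 'order-1', total: 42 });
validateDocUpdate(doc, null, {}, {}); // passes silently
console.log(doc.updated_at);
```

Because the timestamp travels inside the document, replicated copies carry the same value and no node has to modify the document on arrival.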
