How to seed data with CloudKit? - core-data

I need to create some records in CloudKit for each user when they start an app.
I can't just write a seed function that creates the records, because if the user starts the app on two devices, each device will write its own seed records.
What I want instead is for the first device that writes to CloudKit to create the records, and for the second device to simply update the values of those records rather than recreate them.
How can I achieve this?

You have a few options available to you, but all of them can lead to race conditions if both devices attempt to write at the same time; in practice, though, the chance of that happening is minimal.
No matter which approach you take, you should always query first: check whether the record exists, and only then either create it or write the new/updated values.
So, in your example:
The first app to launch would query for the record and, finding none, create it.
The second app to launch would query for the record, find it, and do nothing, because the record already exists.
Each record in CloudKit maintains a modificationDate, so if you are really concerned about overwriting data that shouldn't be overwritten, you can add additional queries and date checks to determine whether the write should happen.
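As a rough sketch of that query-first flow, here it is with CloudKit JS purely for illustration (a native app would do the same thing with CKQuery and CKModifyRecordsOperation); the record type "SeedSettings" and its field are assumptions:
// Minimal sketch, assuming Apple's cloudkit.js script has already been
// loaded and configured; "SeedSettings" and "launchCount" are made-up names.
declare const CloudKit: any; // global provided by cloudkit.js
async function seedIfNeeded(): Promise<void> {
  const db = CloudKit.getDefaultContainer().privateCloudDatabase;
  // 1. Query first: does a seed record already exist?
  const response = await db.performQuery({ recordType: "SeedSettings" });
  if (response.hasErrors) throw response.errors[0];
  if (response.records.length === 0) {
    // 2a. Nothing there yet -- this device gets to create the record.
    await db.saveRecords([
      { recordType: "SeedSettings", fields: { launchCount: { value: 1 } } },
    ]);
  } else {
    // 2b. Record exists -- update its values instead of recreating it.
    const record = response.records[0];
    record.fields.launchCount = { value: record.fields.launchCount.value + 1 };
    await db.saveRecords([record]); // recordChangeTag guards against races
  }
}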

Related

Best way to run a script for large userbase?

I have users stored in a PostgreSQL database (~10 M) and I want to send all of them an email.
Currently I have written a Node.js script which fetches users 1,000 at a time (OFFSET and LIMIT in SQL) and queues the requests in RabbitMQ. This seems clumsy to me: if the Node process fails at any point I have to restart it (I currently keep track of the number of users skipped per query and can resume from the last offset found in the logs). That might lead to some users receiving duplicate emails and some receiving none. I could create a new table with a column indicating whether an email has been sent to each person, but in my current situation I can't do so: I can neither create a new table nor add a new column to the existing one. (Seems like an idempotency problem to me?)
How would you approach this problem? Do you think compound indexes might help? Please explain.
The best way to handle this is indeed to store who received an email, so there's no chance of doing it twice.
If you can't add tables or columns to your existing database, just create a new database for this purpose. If you want to be able to recover from crashes, you will need to store who got the email somewhere so if you are given hard restrictions on not storing this in your main database, get creative with another storage mechanism.
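A hedged sketch of that, assuming a small separate Postgres database just for tracking, keyset pagination over the main users table, and RabbitMQ for the queue (connection strings, table and queue names are all made up):
// Sketch: page through users and skip anyone already recorded in a separate
// tracking database, so a crashed run can be restarted safely.
import { Pool } from "pg";
import amqplib from "amqplib";
const users = new Pool({ connectionString: process.env.USERS_DB_URL });       // main DB, read-only
const tracking = new Pool({ connectionString: process.env.TRACKING_DB_URL }); // separate writable DB
async function enqueueAllEmails(): Promise<void> {
  const conn = await amqplib.connect(process.env.RABBIT_URL ?? "amqp://localhost");
  const ch = await conn.createChannel();
  await ch.assertQueue("emails");
  await tracking.query("CREATE TABLE IF NOT EXISTS queued_emails (user_id bigint PRIMARY KEY)");
  let lastId = 0;
  for (;;) {
    // Keyset pagination stays fast on 10M rows, unlike a growing OFFSET.
    const { rows } = await users.query(
      "SELECT id, email FROM users WHERE id > $1 ORDER BY id LIMIT 1000",
      [lastId]
    );
    if (rows.length === 0) break;
    for (const row of rows) {
      // Record the user first; if the row already exists, a previous run handled them.
      const res = await tracking.query(
        "INSERT INTO queued_emails (user_id) VALUES ($1) ON CONFLICT DO NOTHING",
        [row.id]
      );
      if (res.rowCount === 1) ch.sendToQueue("emails", Buffer.from(JSON.stringify(row)));
      lastId = row.id;
    }
  }
  await ch.close();
  await conn.close();
}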

MongoDB, can I trigger secondary replication only at a given time or manually?

I'm not a MongoDB expert, so I'm a little unsure about the server setup right now.
I have a single instance running MongoDB 3.0.2 with WiredTiger, accepting both read and write ops. It collects logs from clients, so the write load is decent. Once a day I want to process these logs and calculate some metrics using the aggregation framework; the data set to process is roughly all logs from the last month, and the calculation takes about 5-6 hours.
I'm thinking about splitting writes and reads to avoid locks on my collections (the server keeps writing logs while I'm reading; newly written logs may match my queries, but I can skip them because I don't need 100% accuracy).
In other words, I want a setup with a secondary for reads, where replication does not run continuously but instead starts at a configured time, or better, is triggered just before the read operations start.
I'm doing all my processing from Node.js, so one option I see is to export the data created in some period like [yesterday, today], import it into the read instance myself, and run the calculations once the import is done. I looked at replica sets and master/slave replication as possible setups, but I couldn't figure out how to configure them for the scenario described.
So maybe I'm wrong and missing something here? Are there any other options to achieve this?
Your idea of using a replica-set is flawed for several reasons.
First, a replica-set always replicates the whole mongod instance. You can't enable it for individual collections, and certainly not only for specific documents of a collection.
Second, deactivating replication and re-enabling it just before you start your report generation is not a good idea either. When you enable replication, the new slave will not be immediately up to date; it will take a while until it has processed the changes since its last contact with the master, and there is no way to tell how long this will take (you can check how far a secondary is behind the primary using rs.status(), comparing the secondary's optimeDate with its lastHeartbeat date).
But when you want to perform data-mining on a subset of your documents selected by timespan, there is another solution.
Transfer the documents you want to analyze to a new collection. You can do this with an aggregation pipeline consisting only of a $match stage, which selects the documents from the last month, followed by an $out stage. The $out operator specifies that the results of the aggregation are not sent to the application/shell but are instead written to a new collection (which is automatically emptied before the results are written). You can then run your reporting on the new collection without locking the live one. It also has the advantage that you are now operating on a much smaller collection, so queries will be faster, especially those which can't use indexes. And since the data in the new collection doesn't change while you run your reports, they won't contain inconsistencies caused by writes arriving mid-analysis.
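As a rough illustration with the Node.js MongoDB driver (collection names and the date field are assumptions):
import { MongoClient } from "mongodb";
async function snapshotLastMonth(): Promise<void> {
  const client = await MongoClient.connect("mongodb://localhost:27017");
  const db = client.db("logging");
  const since = new Date(Date.now() - 30 * 24 * 60 * 60 * 1000);
  // $match selects last month's documents; $out writes them into a separate
  // collection that the reporting queries can use without touching "events".
  await db.collection("events")
    .aggregate([
      { $match: { createdAt: { $gte: since } } },
      { $out: "events_report" },
    ])
    .toArray(); // iterating the cursor is what actually runs the pipeline
  await client.close();
}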
When you are certain that you will need a second server for report generation, you can still use replication and run the aggregation on the secondary. However, I would really recommend building a proper replica set (consisting of a primary, a secondary and an arbiter) and leaving replication active at all times. Not only does that make sure your data isn't outdated when you generate your reports, it also gives you the important benefit of automatic failover should your primary go down for some reason.

Can ElasticSearch delete all and insert new documents in a single query?

I'd like to swap out all documents for a specific index's type. I'm thinking about this like a database transaction, where I'd:
Delete all documents inside of the type
Create new documents
Commit
It appears that this is possible with ElasticSearch's bulk API, but is there a more direct way?
Based on the following statement from the Elasticsearch Delete By Query API documentation:
Note, delete by query bypasses versioning support. Also, it is not recommended to delete "large chunks of the data in an index", many times, it’s better to simply reindex into a new index.
You might want to reconsider removing an entire type and recreating it within the same index; as this statement suggests, it is better to simply reindex. In fact, I have a scenario where we keep an index of manufacturer products, and when a manufacturer sends an updated product list we load the new data into our persistent store and then completely rebuild the index. I use index aliases to mask the actual index being used: when product changes occur, a process rebuilds a new index in the background (currently about 15 minutes), switches the alias to the new index once the data load is complete, and deletes the old index. The swap is completely seamless and causes no downtime for our users.
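A bare-bones sketch of that alias swap against the Elasticsearch REST API (Node 18+ global fetch; index and alias names are assumptions, and the bulk load of the new index is omitted):
const ES = "http://localhost:9200";
async function swapIndex(alias: string, oldIndex: string, newIndex: string): Promise<void> {
  // 1. Create and populate the new index in the background (mappings plus
  //    bulk indexing would happen here).
  await fetch(`${ES}/${newIndex}`, { method: "PUT" });
  // 2. Atomically repoint the alias at the new index.
  await fetch(`${ES}/_aliases`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      actions: [
        { remove: { index: oldIndex, alias } },
        { add: { index: newIndex, alias } },
      ],
    }),
  });
  // 3. Drop the old index once nothing points at it.
  await fetch(`${ES}/${oldIndex}`, { method: "DELETE" });
}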

Processing a stream in Node where action depends on asynchronous calls

I am trying to write a node program that takes a stream of data (using xml-stream) and consolidates it and writes it to a database (using mongoose). I am having problems figuring out how to do the consolidation, since the data may not have hit the database by the time I am processing the next record. I am trying to do something like:
on order data being read from stream
  look to see if customer exists on mongodb collection
  if customer exists
    add the order to the document
  else
    create the customer record with just this order
  save the customer
My problem is that two 'nearby' orders for a customer cause duplicate customer records to be written, since the first one hasn't been written by the time the second one checks to see if it is there.
In theory I think I could get around the problem by pausing the xml-stream, but there is a bug preventing me from doing this.
Not sure that this is the best option, but using an async queue is what I ended up doing.
Around the same time, a pull request that allows pausing was added to xml-stream (which is what I was using to process the stream).
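For reference, a minimal sketch of that queue approach using the async library with a concurrency of 1, so each order's lookup-and-save finishes before the next one starts (the Order shape and helper are assumptions):
import { queue } from "async";
interface Order { customerId: string; total: number; }
// Concurrency of 1 serializes the database work, one order at a time.
const orderQueue = queue<Order>((order, done) => {
  saveOrderForCustomer(order).then(() => done()).catch(done);
}, 1);
// Wherever xml-stream emits an order element:
//   orderQueue.push(order);
async function saveOrderForCustomer(order: Order): Promise<void> {
  // look up the customer, create it if missing, then append the order
}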
Is there a unique field on the customer object in the data coming from the stream? You could add a unique constraint to your mongoose schema to prevent duplicates at the database level.
When creating new customers, add some fallback logic to handle the case where you try to create a customer but that same customer has just been created by another save at the same time. When this happens, retry the save, but fetch the other customer document first and add the order to it.
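A rough sketch of that with mongoose, assuming a unique customerId field (schema and field names are made up): findOneAndUpdate with upsert turns the check-and-create into one atomic operation, and the unique index catches the rare race that still slips through.
import mongoose from "mongoose";
// await mongoose.connect("mongodb://localhost/shop") somewhere at startup
const customerSchema = new mongoose.Schema({
  customerId: { type: String, required: true, unique: true }, // unique index
  orders: [{ total: Number, placedAt: Date }],
});
const Customer = mongoose.model("Customer", customerSchema);
async function addOrder(customerId: string, order: { total: number; placedAt: Date }) {
  // Either creates the customer with this first order or pushes the order
  // onto the existing document -- no separate read-then-write race window.
  // If two upserts still race, one throws a duplicate-key error (E11000);
  // retrying it will then match and update the existing document.
  return Customer.findOneAndUpdate(
    { customerId },
    { $push: { orders: order } },
    { upsert: true, new: true }
  );
}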

Get Timestamp after Insert/Update

In Azure Table Storage, is there a way to get the new Timestamp value after an update or insert? I am writing a 3-phase commit protocol to get Table Storage to support distributed transactions, and it involves multiple writes to the same entity. The operation order goes like this: read entity, write entity (lock item), write entity (commit new values). I would like to get the new timestamp after the lock-item operation so I don't have to unnecessarily read the item again before doing the commit-new-values operation. So does anyone know how to efficiently get the new timestamp value after a SaveChanges operation?
I don't think you need to do anything special or extra. When you read your entity you get an ETag for it. When you save that entity (setting someLock=true), the save will only succeed if nobody else has updated the entity since your read, so you know you hold the lock. You can then do your second write as you please.
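For illustration, a sketch of that read/lock/commit flow with the @azure/data-tables SDK, where each write response carries the new ETag so no extra read is needed in between (account, table and property names are assumptions):
import { TableClient, AzureNamedKeyCredential } from "@azure/data-tables";
const client = new TableClient(
  "https://myaccount.table.core.windows.net",
  "Transactions",
  new AzureNamedKeyCredential("myaccount", process.env.AZURE_TABLE_KEY ?? "")
);
async function lockAndCommit(partitionKey: string, rowKey: string): Promise<void> {
  // Read: the entity comes back with its current ETag.
  const entity = await client.getEntity<{ locked: boolean; value: number }>(partitionKey, rowKey);
  // Write 1 (lock): only succeeds if nobody changed the entity since the read.
  const lockResult = await client.updateEntity(
    { partitionKey, rowKey, locked: true },
    "Merge",
    { etag: entity.etag }
  );
  // Write 2 (commit): reuse the ETag returned by the lock write -- no re-read.
  await client.updateEntity(
    { partitionKey, rowKey, locked: false, value: entity.value + 1 },
    "Merge",
    { etag: lockResult.etag }
  );
}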
I don't believe it is possible. I would use your own timestamp and/or guid to mark entries.
If you're willing to go back to the Update REST API call, it does return the time that the response was generated. It probably won't be exactly the same as the time stamp on the record, but it will be close I'm sure.
You may need to hack your Azure Table Storage drivers.
In the Azure Python lib (TableStorage), for example, the Timestamp is simply skipped over:
# exclude the Timestamp since it is auto added by azure when
# inserting entity. We don't want this to mix with real properties
if name in ['Timestamp']:
continue

Resources