I have an online SQL database that needs to sync with my app's Core Data persistent store. Is there a simple way of writing to Core Data only data that currently does not exist?
For example:
If the SQL database currently holds 3 records: A B and C.
Core Data currently holds only A and B.
I just want to add C to Core Data during the syncing process.
I could do a series of loops to check each record but is there an easier way? All the records will be unique and therefore maybe there is a method for setting the Core Data attribute to act like a primary key.
Give the online database records a "last updated" attribute. Keep track of when your Core Data store was last updated. When it's time to sync, process only the remote records changed since your last local update.
Looping through all records, using find-or-create in Core Data, will be slow, because you'll be creating an NSManagedObject instance for each object in your datastore. Keep the logic needed for sync-or-not outside of Core Data for speed.
Related
I am building data lake to integrate multiple data sources for advanced analytics.
In the begining, I select HDFS as data lake storage. But I have a requirement for updates and deletes in data sources which I have to synchronise with data lake.
To understand the immutable nature of Data Lake I will consider LastModifiedDate from Data source to detect that this record is updated and insert this record in Data Lake with a current date. The idea is to select the record with max(date).
However, I am not able to understand how
I will detect deleted records from sources and what I will do with Data Lake?
Should I use other data storage like Cassandra and execute a delete command? I am afraid it will lose the immutable property.
can you please suggest me good practice for this situation?
1. Question - Detecting deleted records from datasources
Detecting deleted records from data sources, requires that your data sources supports this. Best is that deletion is only done logically, e. g. with a change flag. For some databases it is possible to track also deleted rows (see for example for SQL-Server). Also some ETL solutions like Informatica offer CDC (Changed Data Capture) capabilities.
2. Question - Changed data handling in a big data solution
There are different approaches. Of cause you can use a key value store adding some kind of complexity to the overall solution. First you have to clarify, if it is also of interest to track changes and deletes. You could consider loading all data (new/changed/deleted) into daily partitions and finally build an actual image (data as it is in your data source). Also consider solutions like Databricks Delta addressing this topics, without the need of an additional store. For example you are able to do an upsert on parquet files with delta as follows:
MERGE INTO events
USING updates
ON events.eventId = updates.eventId
WHEN MATCHED THEN
UPDATE SET
events.data = updates.data
WHEN NOT MATCHED
THEN INSERT (date, eventId, data) VALUES (date, eventId, data)
If your solution also requires low latency access via a key (e. g. to support an API) then a key-values store like HBase, Cassandra, etc. would be helpfull.
Usually this is always a constraint while creating datalake in Hadoop, one can't just update or delete records in it. There is one approach that you can try is
When you are adding lastModifiedDate, you can also add one more column naming status. If a record is deleted, mark the status as Deleted. So the next time, when you want to query the latest active records, you will be able to filter it out.
You can also use cassandra or Hbase (any nosql database), if you are performing ACID operations on a daily basis. If not, first approach would be your ideal choice for creating datalake in Hadoop
Not really sure this is an explicit question or just a query for input. I'm looking at Azure Data Factory to implement a data migration operation. What I'm trying to do is the following:
I have a No SQL DB with two collections. These collections are associated via a common property.
I have a MS SQL Server DB which has data that is related to the data within the No SQL DB Collections via an attribute/column.
One of the NoSQL DB collections will be updated on a regular basis, the other one on a not so often basis.
What I want to do is be able to prepare a Data Factory pipline that will grab the data from all 3 DB locations combine them based on the common attributes, which will result in a new dataset. Then from this dataset push the data wihin the dataset to another SQL Server DB.
I'm a bit unclear on how this is to be done within the data factory. There is a copy activity, but only works on a single dataset input so I can't use that directly. I see that there is a concept of data transformation activities that look like they are specific to massaging input datasets to produce new datasets, but I'm not clear on what ones would be relevant to the activity I am wanting to do.
I did find that there is a special activity called a Custom Activity that is in effect a user defined definition that can be developed to do whatever you want. This looks the closest to being able to do what I need, but I'm not sure if this is the most optimal solution.
On top of that I am also unclear about how the merging of the 3 data sources would work if the need to join data from the 3 different sources is required but do not know how you would do this if the datasets are just snapshots of the originating source data, leading me to think that the possibility of missing data occurring. I'm not sure if a concept of publishing some of the data someplace someplace would be required, but seems like it would in effect be maintaining two stores for the same data.
Any input on this would be helpful.
There are a lot of things you are trying to do.
I don't know if you have experience with SSIS but what you are trying to do is fairly common for either of these integration tools.
Your ADF diagram should look something like:
1. You define your 3 Data Sources as ADF Datasets on top of a
corresponding Linked service
2. Then you build a pipeline that brings information from SQL Server into a
temporary Data Source (Azure Table for example)
3. Next you need to build 2 pipelines that will each take one of your NoSQL
Dataset and run a function to update the temporary Data Source which is the ouput
4. Finally you can build a pipeline that will bring all your data from the
temporary Data Source into your other SQL Server
Steps 2 and 3 could be switched depending on which source is the master.
ADF can run multiple tasks one after another or concurrently. Simply break down the task in logical jobs and you should have no problem coming up with a solution.
I am not sure if I understand the idea of reference tables correctly. In my mind it is a table that contains the same data in every shard. Am I wrong? I am asking because I have no idea how should I insert data to the reference table to make the data multiply in every shard. Or maybe it is impossible? Can anyone clarify this issue?
Yes, the idea of a Reference Table is that the same data is contained on every shard. If you have small numbers of shards and data changes are rare, you can open multiple connections in your application and apply the changes to multiple DBs simultaneously. Or you can construct a management script that iterates periodically across all shards to update reference data or performs a bulk-insert of a fresh image of rows.
The new feature previewing in Azure SQL Database called Elastic DB Jobs, which allows you to define a SQL script for operations that you want to take place on all shards, and then runs the script asynchronously with eventual completion guarantees. You can potentially use this to update reference tables. Details on the feature are here.
I have an application that uses two merged core data models mapped to two different data stores (both are Sqlite stores) via use of model configurations (each unique configuration within each model is mapped to its own data store). The persistent store coordinator does a good job in saving relevant data into a correct store. However, the problem is that when the stores initially created by core data on very first save operation their data schemas look absolutely identically and correspond to a union of the two merged models.
Is there any way to control core data so it creates the stores solely based on the configuration/model mapped into that store?
I guess not, because if core data can generate partially schemas into different persistent store then it might destroy the relationships between them and thus cause problems. At least in current stage i dont think Apple tends to complete this.
I wrote a Console Application that reads list of flat files
and Parse the data type on a row basis
and inserts records one after other in respective tables.
there are few Flat Files which contains about 63k records(rows).
for such files, my program is taking about 6 hours for one file of 63k
records to complete.
This is a test data file. In production i have to deal with 100 time more load.
I am worried, if i can do this any better to speed up.
Can any one suggest a best way to handle this job?
Work Flow is as below:
Read the FlatFile from Local Machine using File.ReadAllLines("location")
Create a Record Entity object after parsing each field of the row.
Insert this current row in to the Entity
Purpose of making this as console application is,
this application should be run(scheduled application) on weekly basis
and there is conditional logic in it, based on some variable there will be
full table replace or
update a existing table or
delete records in table.
You can try to use 'bulk insert' operation for inserting a huge data into database.