I could use a little guidance here. I get a daily update of a 15,000+ record database in XML format. This is the source data for my app, in which I am using Core Data. The contents of the XML dump change on a daily basis in the following ways:
1) Some records will be deleted.
2) New records will be added.
3) Existing records may be modified.
What is the best way to update Core Data with the daily changes from this XML file? My thinking is that I am going to have to iterate through the parsed XML and somehow compare it to what is already in Core Data, but I'm not sure how to do this.
I did a search on the site and found this article, but I'm not sure if it's what I need: Initialize Core Data With Default Data
Thank you in advance.
Darin
You didn't say specifically, but I'm guessing that your total database size is 15,000+ records, and that your XML update contains values for all of them. Here are some ideas to consider.
Do the XML records contain a date of last modification? If not, can you add that? Then note the last time your Core Data version was updated, and ignore all XML records older than that.
For the records that are deleted, you'll have to find them in Core Data and then delete them. You'll probably see better performance if you set your fetch request's result type to NSManagedObjectIDResultType. The NSManagedObjects don't need to be fully realized in order to delete them.
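Roughly, that deletion pass could look like the sketch below; "Record" and "pin" are placeholder names for your entity and identifier, and pinsStillInFeed is assumed to be a list of identifiers built from the parsed XML:

```swift
import CoreData

// Rough sketch, not drop-in code: delete every stored record whose identifier
// no longer appears in today's feed. "Record" and "pin" are placeholder names.
func deleteRecordsMissing(from pinsStillInFeed: [String],
                          in context: NSManagedObjectContext) throws {
    let request = NSFetchRequest<NSManagedObjectID>(entityName: "Record")
    request.resultType = .managedObjectIDResultType
    request.predicate = NSPredicate(format: "NOT (pin IN %@)", pinsStillInFeed)

    for objectID in try context.fetch(request) {
        context.delete(context.object(with: objectID))   // object(with:) hands back a fault
    }
    try context.save()
}
```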
If you're stuck with undated XML, try adding an entity just for change detection. Store the six-digit PIN and the -hash of the entire original XML string for the relevant record. Upon update, fetch the PIN/hash pairs and compare. If the hash values are the same, it's unlikely that the data has changed.
This is going to turn into an optimization problem. The best way to proceed will depend on the characteristics of your data: number of attributes, size of records, size of the delta in each daily update. Structure your fetch request predicates to minimize the number of fetch requests you perform (for instance, by using the "IN" operator to pass multiple PINs at once). Consider using NSDictionaryResultType if there's just one attribute you need. Measure first, optimize second.
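As a rough sketch of how the IN predicate, the dictionary result type, and the PIN/hash comparison fit together (entity and attribute names are placeholders, and the incoming batch is assumed to map each PIN to its raw XML string):

```swift
import CoreData

// Rough sketch: classify one batch of incoming feed records as new, changed,
// or unchanged. "ChangeMarker", "pin" and "xmlHash" are placeholder names.
func classify(_ incomingBatch: [String: String],
              in context: NSManagedObjectContext) throws -> (new: [String], changed: [String]) {
    let request = NSFetchRequest<NSDictionary>(entityName: "ChangeMarker")
    request.resultType = .dictionaryResultType
    request.propertiesToFetch = ["pin", "xmlHash"]
    request.predicate = NSPredicate(format: "pin IN %@", Array(incomingBatch.keys))

    var storedHashes = [String: Int]()
    for row in try context.fetch(request) {
        if let pin = row["pin"] as? String, let hash = row["xmlHash"] as? Int {
            storedHashes[pin] = hash
        }
    }

    var newPins: [String] = []
    var changedPins: [String] = []
    for (pin, xml) in incomingBatch {
        // Use -hash of the XML string, as suggested above; Swift's hashValue is
        // seeded per launch, so it isn't suitable for persisting.
        let incomingHash = (xml as NSString).hash
        switch storedHashes[pin] {
        case nil:                                       newPins.append(pin)
        case let stored? where stored != incomingHash:  changedPins.append(pin)
        default:                                        break   // hash matches; almost certainly unchanged
        }
    }
    return (newPins, changedPins)
}
```

You'd then insert the new PINs, re-parse and update the changed ones, and skip everything else.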
For example, your FRC fetches a news feed and groups the articles into sections by date of publication.
And then you want to limit each section to at most 10 articles.
One option I’ve considered is having separate NSFetchedResultsControllers for each day and setting a fetch limit. But that seems unnecessary as the UI only really needs a single FRC (not to mention that the number of days is unbounded).
Edit:
I’m using a diffable data source snapshot.
If it were me, I'd leave the NSFetchedResultsController alone for this and handle it in the table view. Implement tableView(_:numberOfRowsInSection:) so that it never returns a value greater than 10. Then the table will never ask for more than 10 rows in a section, and your UI will be as you want.
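Something like this, assuming a plain UITableViewController with a fetchedResultsController property configured elsewhere:

```swift
import UIKit
import CoreData

// Rough sketch: the FRC keeps fetching everything; the data source just caps
// what the table is told about. "FeedViewController" is a placeholder name.
class FeedViewController: UITableViewController {
    var fetchedResultsController: NSFetchedResultsController<NSManagedObject>!

    override func tableView(_ tableView: UITableView,
                            numberOfRowsInSection section: Int) -> Int {
        let actual = fetchedResultsController.sections?[section].numberOfObjects ?? 0
        return min(actual, 10)   // never report more than 10 rows per section
    }
}
```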
Since I’m using a diffable data source snapshot, I am able to take the snapshot I receive in the FRC delegate callback and use it to create a new snapshot, keeping only the first K items in each section.
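For anyone interested, a sketch of that trimming step; the section/item types (String and NSManagedObjectID) and the dataSource property are assumptions based on a typical Core Data + diffable setup:

```swift
import UIKit
import CoreData

// Rough sketch: inside your NSFetchedResultsControllerDelegate, rebuild the
// snapshot keeping only the first 10 items per section before applying it.
// "dataSource" is assumed to be a UITableViewDiffableDataSource<String, NSManagedObjectID>.
func controller(_ controller: NSFetchedResultsController<NSFetchRequestResult>,
                didChangeContentWith snapshot: NSDiffableDataSourceSnapshotReference) {
    let full = snapshot as NSDiffableDataSourceSnapshot<String, NSManagedObjectID>
    var trimmed = NSDiffableDataSourceSnapshot<String, NSManagedObjectID>()

    for section in full.sectionIdentifiers {
        trimmed.appendSections([section])
        trimmed.appendItems(Array(full.itemIdentifiers(inSection: section).prefix(10)),
                            toSection: section)
    }
    dataSource.apply(trimmed, animatingDifferences: true)
}
```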
I have an enormous dataset (over 300 million documents). It is a system for archiving data with rollback capability.
The rollback capability is a cursor which iterates through the whole dataset and performs a few POST requests to some external endpoints; it's a simple piece of code.
The data being iterated over needs to be sent ordered by the timestamp (a field in the document). The DB was down for some time, so a backup DB was used, but it received older data which had been archived manually, and later everything was merged back into the main DB.
The older data breaks the order. I need to sort this dataset, but the problem is the size; there is not enough RAM available to perform this operation at once. How can I achieve this sorting?
PS: The documents do not contain any indexed fields.
There's no way to do an efficient sort without an index. If you had an index on the date field then things would already be sorted (in a sense), so getting things in the desired order is very cheap (aside from the overhead of maintaining the index).
The only way to sort all entries without an index is to fetch the field you want to sort by for every single document and sort them all in memory.
The only good options I see are to either create an index on the date field (by far the best option) or increase the RAM on the database (expensive and not scalable).
Note: since you have a large number of documents it's possible that even your index wouldn't be super scalable -- in that case you'd need to look into sharding the database.
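You didn't name the database, but if it's MongoDB (or something similar), the index itself is a one-liner; the archive collection and timestamp field names below are placeholders:

```js
// Placeholder names: an "archive" collection with a "timestamp" field.
db.archive.createIndex({ timestamp: 1 })

// With the index in place, the rollback cursor can walk the data in timestamp
// order without pulling the whole collection into RAM:
db.archive.find().sort({ timestamp: 1 }).batchSize(1000).forEach(doc => {
    // POST doc to the external endpoint here
})
```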
Requirements:
A single ElasticSearch index needs to be constructed from a bunch of flat files that get dropped every week
Apart from this weekly feed, we also get intermittent diff files, providing additional data that was not part of the original feed (inserts or updates, no deletes)
The time to parse and load these files (the weekly full feed or the diff files) into ElasticSearch is not a concern
The weekly feeds received in two consecutive weeks are expected to have significant differences (deletes, additions, updates)
The index is critical for the apps to function and it needs to have close to zero downtime
We are not concerned about the exact changes made in a feed, but we need to have the ability to rollback to the previous version in case the current load fails for some reason
To state the obvious, searches need to be fast and responsive
Given these requirements, we are planning to do the following:
For incremental updates (diff) we can insert or update records as-is using the bulk API
For full updates we will reconstruct a new index and swap the alias as mentioned in this post. In case of a rollback, we can revert to the previous working index (backups are also maintained in case the rollback needs to go back a few versions)
Questions:
Is this the best approach, or is it better to CRUD documents in the previously created index (using the built-in versioning) when reconstructing an index?
What is the impact of modifying data (deletes, updates) on the underlying Lucene indices/shards? Can modifications cause fragmentation or inefficiency?
At first glance, I'd say that your overall approach is sound. Creating a new index every week with the new data and swapping an alias is a good approach if you need
zero downtime and
to be able to rollback to the previous indices for whatever reason
If you were to keep only one index and CRUD your documents in there, you wouldn't be able to roll back if anything went wrong, and you could end up in a mixed state with data from the current week and data from the week before.
Every time you update (even a single field) or delete a document, the previous version is flagged as deleted in the underlying Lucene segment. When the Lucene segments have grown sufficiently big, ES merges them and wipes out the deleted documents. However, in your case, since you're creating a new index every week (and eventually deleting the prior week's index), you won't land in a situation where you have space and/or fragmentation issues.
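For reference, the alias swap your plan relies on boils down to one call; the index and alias names here are made up:

```
POST /_aliases
{
  "actions": [
    { "remove": { "index": "products_2021_w14", "alias": "products" } },
    { "add":    { "index": "products_2021_w15", "alias": "products" } }
  ]
}
```

Both actions in a single request are applied atomically, so searches against the alias never see a half-built index, and pointing the alias back at last week's index is your rollback.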
I wrote a console application that reads a list of flat files, parses the data type of each field on a row basis, and inserts the records one after another into the respective tables.
There are a few flat files which contain about 63k records (rows). For such files, my program takes about 6 hours to complete one file of 63k records.
This is a test data file. In production I have to deal with 100 times the load.
I am wondering whether I can do this any better to speed it up. Can anyone suggest a good way to handle this job?
The workflow is as below:
Read the flat file from the local machine using File.ReadAllLines("location").
Create a record entity object after parsing each field of the row.
Insert the current row into the corresponding table via the entity.
The purpose of making this a console application is that it should be run (as a scheduled application) on a weekly basis, and there is conditional logic in it; based on some variable there will be either
a full table replace, or
an update of an existing table, or
a deletion of records in a table.
You can try using a 'bulk insert' operation for inserting huge amounts of data into the database, instead of inserting one row at a time.
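For example, if the target is SQL Server (the question doesn't say which database), SqlBulkCopy can stream the whole parsed file in a few round trips instead of 63k individual INSERTs; the table name, columns, delimiter and file path below are made up:

```csharp
using System;
using System.Data;
using System.Data.SqlClient;
using System.IO;

// Rough sketch, not drop-in code: parse the flat file into a DataTable first,
// then push it to SQL Server in bulk. "Records", the two columns, the '|'
// delimiter and the file path are all placeholders.
class BulkLoader
{
    static void Main()
    {
        var table = new DataTable();
        table.Columns.Add("Name", typeof(string));
        table.Columns.Add("Dob", typeof(string));

        foreach (var line in File.ReadLines(@"C:\data\feed.txt"))
        {
            var fields = line.Split('|');
            table.Rows.Add(fields[0], fields[1]);
        }

        using (var bulk = new SqlBulkCopy("your-connection-string"))
        {
            bulk.DestinationTableName = "Records";
            bulk.BatchSize = 5000;                    // tune for your environment
            bulk.WriteToServer(table);                // one bulk operation instead of 63k INSERTs
        }
    }
}
```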
I have been working with databases recently and before that I was developing standalone components that do not use databases.
With all the DB work, a few questions have sprung up.
Why is a database query faster than retrieving the data from a file with a programming language?
To elaborate on my question further:
Assume I have a table called Employee, with fields Name, ID, DOB, Email and Sex. For reasons of simplicity we will also assume they are all strings of fixed length and they do not have any indexes or primary keys or any other constraints.
Imagine we have 1 million rows of data in the table. At the end of the day this table is going to be stored somewhere on the disk. When I write a query Select Name,ID from Employee where DOB="12/12/1985", the DBMS picks up the data from the file, processes it, filters it and gives me a result which is a subset of the 1 million rows of data.
Now, assume I store the same 1 million rows in a flat file, each field similarly being fixed length string for simplicity. The data is available on a file in the disk.
When I write a program in C++ or C or C# or Java and do the same task of finding the Name and ID where DOB="12/12/1985", I will read the file record by record and check each row of data to see if DOB="12/12/1985"; if it matches, I present the row to the user.
This way of doing it with a program is too slow compared to the speed at which a SQL query returns the results.
I assume the DBMS is also written in some programming language and there is also an additional overhead of parsing the query and what not.
So what happens in a DBMS that makes it faster to retrieve data than through a programming language?
If this question is inappropriate on this forum, please delete but do provide me some pointers where I may find an answer.
I use SQL Server if that is of any help.
Why is a database query faster than a programming language data retrieval from a file
That depends on many things - network latency and disk seek speeds being two of the important ones. Sometimes it is faster to read from a file.
In your description of finding a row within a million rows, a database will normally be faster than seeking in a file because it employs indexing on the data.
If you pre-process your data file and provide index files for the different fields, you could speed up data lookup from the filesystem as well.
Note: databases are normally used not only for this feature, but because they are ACID compliant and therefore suitable for environments where you have multiple processes (normally many clients on many computers) querying the database at the same time.
There are lots of techniques to speed up various kinds of access. As @Oded says, indexing is the big solution to your specific example: if the database has been set up to maintain an index by date, it can go directly to the entries for that date, instead of reading through the entire file. (Note that maintaining an index does take up space and time, though -- it's not free!)
On the other hand, if such an index has not been set up, and the database has not been stored in date order, then a query by date will need to go through the entire database, just like your flat-file program.
Of course, you can write your own programs to maintain and use a date index for your file, which will speed up date queries just like a database. And, you might find that you want to add other indices, to speed up other kinds of queries -- or remove an index that turns out to use more resources than it is worth.
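A toy version of that hand-rolled date index might look like the sketch below; the pipe delimiter and the DOB-in-the-third-field layout are assumptions for the example, and a real one would persist the index and seek by byte offset rather than hold every row in memory:

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Toy sketch of a hand-rolled "date index" over a flat file.
// Assumed layout: pipe-delimited rows with the DOB in the third field.
class DateIndexDemo
{
    static void Main()
    {
        string[] rows = File.ReadAllLines("employees.dat");

        // One pass over the data builds the index: DOB -> row numbers.
        var byDob = new Dictionary<string, List<int>>();
        for (int i = 0; i < rows.Length; i++)
        {
            string dob = rows[i].Split('|')[2];
            if (!byDob.TryGetValue(dob, out var rowNumbers))
                byDob[dob] = rowNumbers = new List<int>();
            rowNumbers.Add(i);
        }

        // A "query by date" now touches only the matching rows.
        if (byDob.TryGetValue("12/12/1985", out var matches))
            foreach (int i in matches)
                Console.WriteLine(rows[i]);
    }
}
```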
Eventually, managing all the features you've added to your file manager may become a complex task; you may want to store this kind of configuration in its own file, rather than hard-coding it into your program. At the minimum, you'll need features to make sure that changing your configuration will not corrupt your file...
In other words, you will have written your own database.
...an old one, I know... just in case somebody finds this: the question contained "assume ... do not have any indexes"
...so the question was about the sequential data-read fight between the database and a flat file WITHOUT indexes, which the database wins...
And the answer is: if you read record by record from disk, you do lots of disk seeking, which is expensive performance-wise. A database always loads whole pages by design, so a bunch of records at once. Less disk seeking is definitely faster. If you did a memory-buffered read from a flat file, you could achieve the same or better read performance.
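To make that last point concrete, here is a sketch of a buffered sequential scan; the file name, the substring match and the 1 MB buffer size are arbitrary choices for the example:

```csharp
using System;
using System.IO;
using System.Text;

// Sketch: scan a flat file through a large read buffer so the disk is hit in
// big sequential chunks (roughly what a database does with pages) instead of
// one tiny read per record.
class BufferedScan
{
    static void Main()
    {
        const int bufferSize = 1 << 20;   // 1 MB buffer; tune as needed
        using var stream = new FileStream("employees.dat", FileMode.Open,
                                          FileAccess.Read, FileShare.Read,
                                          bufferSize, FileOptions.SequentialScan);
        using var reader = new StreamReader(stream, Encoding.UTF8, true, bufferSize);

        string line;
        int matches = 0;
        while ((line = reader.ReadLine()) != null)
            if (line.Contains("12/12/1985"))
                matches++;

        Console.WriteLine($"{matches} rows matched");
    }
}
```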