Every time I fetch data from Core Data, the results come back in a different order. I want to fetch the data in the same sequence in which it was stored.
I don't want to use an ascending sort or IDs.
Please help me if there is any other way to solve this.
Thanks
Core Data doesn't keep track of the order in which you added the data. The information's just not there. If you want to retain that order, your entity description needs to include some value you can use to sort the fetch results. That could be a unique integer ID, or a timestamp, or something else you think of. You say you don't want that, but with Core Data it's necessary.
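To make the idea concrete, here is a minimal sketch in Python using the standard sqlite3 module. It is an illustration of the technique, not Core Data code. The table name `item`, the column `seq` and the helper `add_item` are all made up for this example; in Core Data the equivalent would be an extra attribute on the entity plus a sort descriptor on the fetch request.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# `seq` is an explicit insertion counter (an assumption of this sketch);
# the store itself makes no promise about row order.
conn.execute("CREATE TABLE item (name TEXT, seq INTEGER)")

def add_item(name):
    """Insert a row, stamping it with the next sequence number."""
    (next_seq,) = conn.execute(
        "SELECT COALESCE(MAX(seq), 0) + 1 FROM item").fetchone()
    conn.execute("INSERT INTO item (name, seq) VALUES (?, ?)", (name, next_seq))

for name in ["pear", "apple", "mango"]:
    add_item(name)

# Sorting on the stored counter recovers the insertion order.
fetched = [row[0] for row in conn.execute("SELECT name FROM item ORDER BY seq")]
print(fetched)
```

A timestamp column would work the same way, as long as two inserts can never share the same value.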
I have a requirement that I can't figure out how to solve. I have datasets in the format below:
id, atime, grade
123, time1, A
241, time2, B
123, time3, C
or if I put in list format:
[[123,time1,A],[124,timeb,C],[123,timec,C],[143,timed,D],[423,timee,P].......]
Now my use case is to perform comparisons, aggregations and queries over multiple rows, such as:
time difference between the last 2 rows where id=123
time difference between the last 2 rows where id=123 and grade=A
time difference between the first, 3rd, 5th and latest rows
all data (or the last 10 records for a particular id) should be easily accessible.
I also need to do further computation on it. What format should I choose for the dataset, and what database/tools should I use?
I don't think a relational database is useful here. I have not been able to solve it with Solr/Elastic. If you have any ideas, please give a brief outline. Or any other tool such as Spark, Hadoop or Cassandra: any pointers?
I am trying out things but any help is appreciated.
Choosing the right technology depends heavily on your SLA and related constraints: How much latency can your queries tolerate? What are your query types? Does your data count as big data? Is the data updatable? Do we expect late events? Will we need historical data in the future, or can we use techniques like rollup? And so on. To make my answer concrete: you can probably solve your problems with window functions. For example, you can store your data in any of the tools you mentioned and use the Presto SQL engine to query it and get your desired result. But not all of them are optimal. Furthermore, these kinds of problems usually cannot be solved with a single tool; it takes a set of tools to cover all requirements.
tl;dr: the text below does not arrive at a solution. It introduces a way to think about data modeling and choosing tools.
Let me try to model the problem in order to choose a single tool. I assume your data is not updatable, you need low-latency responses, we don't expect any late events, and we face a large-volume data stream that must be saved as raw data.
Based on the first and second requirements, it's crucial to have random access (it seems you want to query on a particular ID), so solutions like Parquet or ORC files are not a good choice.
Based on the last requirement, data must be partitioned based on the ID. The first, second and last requirements all rely on ID as the identifying part, and there seems to be nothing like a join or a global ordering based on other fields like time. So we can choose ID as the partitioner (physical or logical) and atime as the clustering part: for each ID, events are ordered by time.
The third requirement is a bit vague. Do you want a result over all data, or for each ID?
For computing the first three conditions, we need a tool that supports window functions.
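As a tool-independent sketch of that data model, the Python below partitions rows by ID, keeps each partition ordered by atime, and answers the "last 2 rows" and "last 10 records" questions from the ordered partition. The sample rows, the use of integer epoch seconds for atime, and the helper names `last_two_gap` and `last_n` are assumptions made up for this illustration.

```python
from collections import defaultdict

# Hypothetical sample rows: (id, atime as epoch seconds, grade).
rows = [
    (123, 100, "A"),
    (241, 130, "B"),
    (123, 160, "C"),
    (123, 250, "A"),
    (241, 300, "B"),
    (123, 310, "C"),
]

# Partition by id; within each partition, events are ordered by atime.
by_id = defaultdict(list)
for rid, atime, grade in sorted(rows, key=lambda r: (r[0], r[1])):
    by_id[rid].append((atime, grade))

def last_two_gap(events):
    """Time difference between the two most recent events, or None."""
    if len(events) < 2:
        return None
    return events[-1][0] - events[-2][0]

def last_n(rid, n=10):
    """The n most recent events for one id, oldest first."""
    return by_id[rid][-n:]

# Last 2 rows where id=123.
gap_all = last_two_gap(by_id[123])
# Last 2 rows where id=123 and grade=A.
gap_grade_a = last_two_gap([e for e in by_id[123] if e[1] == "A"])
print(gap_all, gap_grade_a, last_n(123, 2))
```

In a SQL engine with window functions, the same partitioning and ordering would typically be expressed with a `PARTITION BY id ORDER BY atime` window instead of the explicit dictionary.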
Based on the notes above, it seems we should choose a tool that has good support for random access queries. Cassandra, Postgres, Druid, MongoDB and ElasticSearch are the ones I can think of right now. Let's check them:
Cassandra: It's great on response time on random access queries, can handle a huge amount of data easily, and does not have a single point of failure. But sadly it does not support window functions. Also, you should carefully design your data model and it seems it's not a good tool that we can choose (because of future need for raw data). We can bypass some of these limitations by using Spark alongside Cassandra, but for now, we prefer to avoid adding a new tool to our stack.
Postgres: It's great on random access queries and indexed columns. It supports window functions. We can shard data (horizontal partitioning) across multiple servers (and by choosing ID as the shard key, we can have data locality on computations). But there is a problem: ID is not unique; so we can not choose ID as the primary key and we face some problems with random access (We can choose the ID and atime columns (as a timestamp column) as a compound primary key, but it does not save us).
Druid: It's a great OLAP tool. Based on the storing manner (segment files) that Druid follows, by choosing the right data model, you can have analytic queries on a huge volume of data in sub-seconds. It does not support window functions, but with rollup and some other functions (like EARLIEST), we can answer our questions. But by using rollup, we lose raw data and we need them.
MongoDB: It supports random access queries and sharding. Also, we can have some type of window function on its computing framework and we can define some sort of pipelines for doing aggregations. It supports capped collections and we can use it to store the last 10 events for each ID if the cardinality of the ID column is not high. It seems this tool can cover all of our requirements.
ElasticSearch: It's great on random access, maybe the greatest. With some kinds of filter aggregations, we can get a type of window function. It can handle a large amount of data with sharding. But its query language is hard. I can imagine answering the first and second questions with ES, but I can't come up with the query off the top of my head. It takes time to find the right solution with it.
So it seems MongoDB and ElasticSearch can meet our requirements, but there are a lot of 'if's along the way. I think we can't find a straightforward solution with a single tool. Maybe we should choose multiple tools and use techniques like duplicating data to find an optimal solution.
I am a noob coming from a SQL background. Please forgive my SQLness in advance.
I'm currently working on a new client project in which I need to store large amounts of fairly simple data, but I'm not sure what the best method is for doing this in CouchDB.
This is for employee vehicle tracking. I need to store VIN (Vehicle Identification Number), GPSLat, GPSLong, DateTimeStamp.
The VIN will never change, but will have tons of GPS data associated with it.
In a SQL structure, it would be pretty simple - VIN table, then another relational table for the data.
In a CouchDB key-pair, would you recommend storing the VIN over and over, or do I create 1 VIN as a key-pair and store the GPS data some other way? If so, how should I keep storing the GPS data as a key-pair?
Thank you in advance to you experts on key-pair storing.
You can do it both ways. CouchDB supports linked documents: you create a document once, reference it in other documents, and then write a view which can emit both the original document and the contents of the linked document using include_docs=true.
But since you have very little data, just one JSON field, I suggest that you store the field within the document itself, since you will have to link the VIN with the document containing the rest of the vehicle details anyway.
The linked document strategy is useful when you have lots of fields in the parent document that you want to retrieve with the child document.
I think it depends on how you want to access the data.
If you regularly access all the linked data, but updates are infrequent, then you could amend the document. Keep in mind though that as you keep updating the same document you have to retrieve it each time, which may get slower as the document could get quite large.
I'd probably store a document for each timestamp. You can use this for very fast access. Your views can support lists so if you have:
[ {vin as key}, {datestamp}]
You would have very quick access to say all entries for a given vin/date using the startkey/endkey parameters to the view.
Similarly you could have a view for the maximum date stamp by vin.
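The effect of a view keyed on [vin, datestamp] can be imitated in memory to see why range queries are cheap. This is a sketch only: it does not talk to CouchDB, and the sample documents, the field names `vin` and `ts`, and the `query` helper are invented for the illustration. It relies on ISO 8601 timestamps sorting correctly as strings.

```python
import bisect

# Hypothetical documents, one per GPS reading.
docs = [
    {"vin": "VIN-A", "ts": "2020-01-01T08:00:00", "lat": 1.0, "lon": 2.0},
    {"vin": "VIN-B", "ts": "2020-01-01T09:00:00", "lat": 3.0, "lon": 4.0},
    {"vin": "VIN-A", "ts": "2020-01-03T08:00:00", "lat": 1.7, "lon": 2.7},
    {"vin": "VIN-A", "ts": "2020-01-02T08:00:00", "lat": 1.5, "lon": 2.5},
]

# Stand-in for a view whose map function emits [vin, timestamp] as the key:
# the rows are kept sorted by that compound key.
index = sorted(docs, key=lambda d: (d["vin"], d["ts"]))
keys = [(d["vin"], d["ts"]) for d in index]

def query(startkey, endkey):
    """Return docs whose key lies in [startkey, endkey], like startkey/endkey."""
    lo = bisect.bisect_left(keys, startkey)
    hi = bisect.bisect_right(keys, endkey)
    return index[lo:hi]

# All readings for one VIN on one day.
jan_2 = query(("VIN-A", "2020-01-02"), ("VIN-A", "2020-01-02T23:59:59"))

# Latest reading for one VIN: the last row in that VIN's key range.
# "\uffff" is used here as a high sentinel for the timestamp part.
latest = query(("VIN-A", ""), ("VIN-A", "\uffff"))[-1]
print(jan_2, latest)
```

Because the index is sorted by the compound key, both lookups are a binary search plus a contiguous slice, which is the property the startkey/endkey parameters exploit.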
Thanks guys, so it sounds like the best way for storage is related to how I want to retrieve the data.
Based on the answers above, it sounds like it would be best to store the VIN each time.
Once data is inserted, it won't be updated again.
From a SQL standpoint, my most common query would be something like:
SELECT * FROM emp_vehicledata ORDER BY datetimestamp DESC (sometimes I could have a between dates)
So I wouldn't need to go through a document's amended revision history to get a specific set of dates; it looks like a new key-pair with the VIN is best for every entry.
Am I understanding CouchDB storage correctly? Thanks again
I am working on a project that involves a lot of data, and at first I was doing it all in plist, and I realized it was getting out of hand and I would have to learn Core Data. I'm still not entirely sure whether I can do what I want in Core Data, but I think it should work out. I've set up a data model, but I'm not sure if it's the right way to do it. Please read on if you think you can help out and let me know if I'm on the right track. Please bear with me, because I am trying to explain it as thoroughly as I can.
I've got the basic object with attributes set up at the root level; say a person with attributes like a name, date of birth, etc. Pretty simple. You set up one entity like this "Person" in your model, and you can save as many of them as you want in your data and retrieve them as an array, right? It could be sorted based on an attribute in the Person, such as the date they were added to the database.
Now where I get a bit more confused is when I want to store several different collections of data with each person. For example a list of courses and associated test marks. In a plist I would have stored an array of dictionaries that stored this, sorted by the date assessed. The way I set this up in my data model was that I added an entity called "Tests" and a "to-many" relationship from Person to Tests, and then when I pull that I get an NSSet that I can order by a timestamp again? Is there a better way to do this?
Similarly the Person may have a set of arrays of numerical data (the kind that you could graph over time, e.g. Nike+ stores your running data like distance vs time, and a person would have multiple runs associated with them, hence a set of arrays, each with their own associated date of collection). The way I set this up is a little different, with a "Runs" entity with just a timestamp attribute, which is connected from Person via a to-many relationship, with inverse "forPerson". Then the Runs entity is connected to another entity via a to-many relationship that has attributes to store numerical data and the time. Once again I would use a time/order attribute to sort them.
So the main question I have is whether using an internal attribute like timestamp to sort a set would be the right way to load in an "array" from Core Data. The answers I find when searching forums/Stack Overflow about how to store NSArrays in Core Data seem overly complicated compared to this, giving me the sense that I'm misunderstanding something.
Thanks for your help. Sorry for all the text, but I'm new to Core Data and I figure setting up the data model properly is essential before starting to code methods for getting/saving data. If necessary, I can set up a sample model to demonstrate this and post a picture of it.
Core Data will give you NSSets by default. These are convertible to arrays by calling allObjects, or sortedArrayUsingDescriptors if you want a sorted array. The "ordered" property on the relationship description gives you an NSOrderedSet in the managed object instead. Hashed sets provide quicker adds, access and membership checks, with a penalty (relative to ordered sets) for the sort.
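The "unordered set, sorted on read" pattern looks roughly like this in Python. It is an analogy rather than Core Data code: the `Assessment` class and its fields are hypothetical stand-ins for the "Tests" entity described in the question.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Assessment:
    course: str
    mark: int
    timestamp: int  # when the test was assessed (hypothetical units)

# An unordered to-many relationship behaves like a set: no inherent order.
tests = {
    Assessment("Math", 81, 30),
    Assessment("Physics", 74, 10),
    Assessment("Art", 90, 20),
}

# Sort on the stored attribute each time an ordered list is needed.
by_date = sorted(tests, key=lambda t: t.timestamp)
courses = [t.course for t in by_date]
print(courses)
```

The order lives in the data (the timestamp), not in the container, which is why the same approach works whichever collection type the relationship hands back.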
Does Core Data allow one to generate a statement like select FirstName, LastName from Employee instead of the entire row?
In the example of Departments/Employees, let's say I want to write a navigation controller style application to display the Departments available, and when clicked, the Employees available in that Department. Let's suppose that the Employee object is huge for whatever reason. I don't see why I would need to retrieve a huge set of objects in the EmployeesViewController just to display their names in a list view. Is there any way I can just request the Name field (perhaps two: FirstName, LastName) for all Employees in a given Department?
Just as a side note:
Beware of the dangers of premature optimization. You shouldn't bother trying to tweak a fetch unless you've tested and found that the bare-bones fetch is actually a problem. Core Data has a large number of optimizations under the hood that make it far more efficient than it would appear at first glance. In most cases, it's an utter waste of time to tweak a fetch.
If you have a table that displays thousands of objects, you will usually only have a few dozen live objects in memory at any one time. By default, Core Data fetches objects as faults, i.e. ghosts of the objects without the values of their attributes loaded. Only when you access a specific attribute of a specific object will that object load completely into memory.
If you come from an SQL background, you may be used to having to manually juggle objects created from SQL. Core Data handles all that for you and does so much more efficiently than you can do manually. The intuitive assumptions you developed working with SQL about the degree of manual optimization you need to do will be way off when applied to Core Data.
Instead, pick the simplest and easiest method first and optimize only if you test and find a bottleneck. Most of the time, you will find that your fretting over optimization was completely unwarranted.
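For readers who have not met faulting before, the following Python sketch imitates the behaviour described above. It is a toy model, not how Core Data is implemented: the `Fault` class, the in-memory `store` dictionary and the `loaded` log are all invented for the illustration.

```python
class Fault:
    """Placeholder that loads its attributes from the store on first access."""

    def __init__(self, object_id, store, load_log):
        self._object_id = object_id
        self._store = store
        self._load_log = load_log
        self._data = None  # nothing loaded yet

    def __getattr__(self, name):
        # Called only when normal lookup fails, i.e. for the stored fields.
        if name.startswith("_"):
            raise AttributeError(name)
        if self._data is None:
            # First real access: load the full record and remember that we did.
            self._data = self._store[self._object_id]
            self._load_log.append(self._object_id)
        try:
            return self._data[name]
        except KeyError:
            raise AttributeError(name) from None

# A hypothetical store with a thousand employee records.
store = {i: {"first_name": f"First{i}", "last_name": f"Last{i}"}
         for i in range(1000)}
loaded = []
employees = [Fault(i, store, loaded) for i in store]

# Only the rows that are actually displayed get loaded.
visible = [f"{e.first_name} {e.last_name}" for e in employees[:3]]
print(visible, loaded)
```

A thousand placeholders exist, but only the three that were displayed ever touched the store, which is the reason a plain fetch is usually cheaper than it looks.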
If your employee object is very large, and loading it is too expensive, consider partitioning it with one-to-one relationships. If you have a list of fields like:
Employee:
    id
    name
    SSN
    DOB
    home_address
    home_phone
    department
    office_number
    office_phone
    manager
    job_title
    salary_class
    start_date
    ...
You could break this up into:
Employee
    id
    department
    manager
    ...
EmployeePersonalInfo
    employee_id
    SSN
    DOB
    home_address
    home_phone
EmployeeJobDescription
    employee_id
    job_title
    start_date
    salary_class
And so on. For your most common objects, limiting the fetched data to the most pertinent and commonly accessed fields is good practice.
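The split can be tried out with plain SQL tables. The sketch below uses Python's sqlite3 module and is only an illustration: the table and column names follow the lists above, but keeping `name` in the narrow `employee` table and all of the sample rows are assumptions of this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employee (
    id INTEGER PRIMARY KEY, name TEXT, department TEXT, manager TEXT);
-- One-to-one: the primary key doubles as the link back to employee.
CREATE TABLE employee_personal_info (
    employee_id INTEGER PRIMARY KEY REFERENCES employee(id),
    ssn TEXT, dob TEXT, home_address TEXT, home_phone TEXT);
INSERT INTO employee VALUES
    (1, 'Ada', 'R&D', 'Grace'), (2, 'Linus', 'R&D', 'Grace');
INSERT INTO employee_personal_info VALUES
    (1, '000-00-0000', '1990-01-01', '1 Main St', '555-0100');
""")

# List view: only the narrow table is read.
names = [row[0] for row in conn.execute(
    "SELECT name FROM employee WHERE department = ? ORDER BY name", ("R&D",))]

# Detail view: follow the one-to-one link for a single employee.
detail = conn.execute(
    "SELECT e.name, p.home_phone FROM employee e "
    "JOIN employee_personal_info p ON p.employee_id = e.id "
    "WHERE e.id = ?", (1,)).fetchone()
print(names, detail)
```

The list query never touches the wide personal-info rows; they are read only when one employee's details are opened.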
Yes. Use -[NSFetchRequest setPropertiesToFetch:]. Only available in iOS 3 & OS X 10.6 and later, though.
We have a decent sized object-oriented application. Whenever an object in the app is changed, the object changes are saved back to the DB. However, this has become less than ideal.
Currently, transactions are stored as a transaction and a set of transactionLI's.
The transaction table has fields for who, what, when, why, foreignKey, and foreignTable. The first four are self-explanatory. ForeignKey and foreignTable are used to determine which object changed.
TransactionLI has timestamp, key, val, oldVal, and a transactionID. This is basically a key/value/oldValue storage system.
The problem is that these two tables are used for every object in the application, so they're pretty big tables now. Using them for anything is slow. Indexes only help so much.
So we're thinking about other ways to do something like this. Things we've considered so far:
- Sharding these tables by something like the timestamp.
- Denormalizing the two tables and merging them into one.
- A combination of the two above.
- Doing something along the lines of serializing each object after a change and storing it in subversion.
- Probably something else, but I can't think of it right now.
The whole problem is that we'd like to have some mechanism for properly storing and searching through transactional data. Yeah you can force feed that into a relational database, but really, it's transactional data and should be stored accordingly.
What is everyone else doing?
We have taken the following approach:-
All objects are serialised (using the standard XmlSerializer) but we have decorated our classes with serialisation attributes so that the resultant XML is much smaller (storing elements as attributes and dropping vowels from field names, for example). This could be taken a stage further by compressing the XML if necessary.
The object repository is accessed via a SQL view. The view fronts a number of tables that are identical in structure but have a GUID appended to the table name. A new table is generated when the previous table has reached critical mass (a pre-determined number of rows).
We run a nightly archiving routine that generates the new tables and modifies the views accordingly so that calling applications do not see any differences.
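A rough sketch of the view-over-rolling-tables idea, written in Python with sqlite3. Everything named here is hypothetical: the view name `object_repository`, the `objects_` table prefix, the two-column layout and the tiny `MAX_ROWS` threshold. For brevity the rollover happens at write time rather than in a nightly routine.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
MAX_ROWS = 2   # hypothetical "critical mass", kept tiny for the demo
tables = []    # names of all generated tables, oldest first

def new_table():
    """Create the next identically structured table and rebuild the view."""
    name = "objects_" + uuid.uuid4().hex
    conn.execute(f'CREATE TABLE "{name}" (object_id INTEGER, xml TEXT)')
    tables.append(name)
    # Callers only ever query the view, so its name stays stable.
    conn.execute("DROP VIEW IF EXISTS object_repository")
    union = " UNION ALL ".join(
        f'SELECT object_id, xml FROM "{t}"' for t in tables)
    conn.execute(f"CREATE VIEW object_repository AS {union}")

def save(object_id, xml):
    """Store one serialised object, rolling over to a new table when full."""
    (count,) = conn.execute(
        f'SELECT COUNT(*) FROM "{tables[-1]}"').fetchone()
    if count >= MAX_ROWS:
        new_table()
    conn.execute(f'INSERT INTO "{tables[-1]}" VALUES (?, ?)', (object_id, xml))

new_table()
for i in range(5):
    save(i, f"<o id='{i}'/>")

total = conn.execute("SELECT COUNT(*) FROM object_repository").fetchone()[0]
print(len(tables), total)
```

Archiving an old table then amounts to dropping it from the list and rebuilding the view, without any change on the calling side.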
Finally, as part of the overnight routine we archive any old object instances that are no longer required to disk (and then tape).
I've never found a great end-all solution for this type of problem. One thing you can try, if your DB supports partitioning (or even if it doesn't, you can implement the same concept yourself), is to partition this log table by object type; you can then partition further by date/time or by object ID (a numeric ID works nicely; I'm not sure how a GUID would partition).
This will help keep the size of each table manageable and keep all the transactions for a single object instance together.
One idea you could explore is, instead of storing each field in a name/value pair table, storing the data as a blob (either text or binary). For example, serialize the object to XML and store it in a field.
The downside of this is that as your object changes you have to consider how this affects all historical data. If you're using XML, there are easy ways to update the historical XML structures; if you're using binary, there are ways, but you have to be more conscious of the effort.
I've had awesome success storing a rather complex object model that has tons of interrelations as a blob (the XML serializer in .NET didn't handle the relationships between the objects). I could very easily see myself storing the binary data. A huge downside of storing it as binary data is that to access it you have to take it out of the database, whereas with XML, if you're using a modern database like MSSQL, you can access the data directly.
One last approach is to split the two patterns: you could define a Difference Schema (I assume more than one property changes at a time), so for example imagine storing this XML:
<objectDiff>
  <field name="firstName" newValue="Josh" oldValue="joshua"/>
  <field name="lastName" newValue="Box" oldValue="boxer"/>
</objectDiff>
This will help reduce the number of rows, and if you're using MSSQL you can define an XML Schema and get some of the rich querying ability around the object. You can still partition the table.
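A diff document of that shape could be produced along these lines. This is a sketch in Python using the standard xml.etree.ElementTree module; the `object_diff` helper and the dictionary representation of an object's fields are assumptions made for the example, not part of any particular framework.

```python
import xml.etree.ElementTree as ET

def object_diff(old, new):
    """Build an <objectDiff> element listing every field whose value changed."""
    root = ET.Element("objectDiff")
    for name in sorted(old.keys() | new.keys()):
        if old.get(name) != new.get(name):
            ET.SubElement(root, "field", {
                "name": name,
                "newValue": str(new.get(name)),
                "oldValue": str(old.get(name)),
            })
    return root

# Hypothetical before/after snapshots of one object's fields.
old = {"firstName": "joshua", "lastName": "boxer", "age": 30}
new = {"firstName": "Josh", "lastName": "Box", "age": 30}

diff = object_diff(old, new)
xml_text = ET.tostring(diff, encoding="unicode")
print(xml_text)
```

Unchanged fields (here `age`) are left out, so one stored row carries the whole change set for a save instead of one row per field.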
Josh
Depending on the characteristics of your specific application an alternative approach is to keep revisions of the entities themselves in their respective tables, together with the who, what, why and when per revision. The who, what and when can still be foreign keys.
I would be very careful with this approach, though, since it is only viable for applications with a relatively small number of changes per entity/entity type.
If querying the data is important I would use true Partitioning in SQL Server 2005 and above if you have enterprise edition of SQL Server. We have millions of rows partitioned by year down to day for the current month - you can be as granular as your application demands with a maximum number of 1000 partitions.
Alternatively, if you are using SQL 2008 you could look into filtered indexes.
These are solutions that will enable you to retain the simplified structure you have whilst providing the performance you need to query that data.
Splitting/Archiving older changes obviously should be considered.