What is the best way to store and search through object transactions?

We have a decent-sized object-oriented application. Whenever an object in the app is changed, the object changes are saved back to the DB. However, this has become less than ideal.
Currently, changes are stored as a transaction and a set of transactionLIs.
The transaction table has fields for who, what, when, why, foreignKey, and foreignTable. The first four are self-explanatory. ForeignKey and foreignTable are used to determine which object changed.
TransactionLI has timestamp, key, val, oldVal, and a transactionID. This is basically a key/value/oldValue storage system.
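Roughly, the structure looks something like this (a sketch only; the column types here are illustrative, not our exact DDL):
CREATE TABLE [transaction] (
    transactionID   INT IDENTITY PRIMARY KEY,
    who             VARCHAR(100),   -- user who made the change
    what            VARCHAR(100),   -- kind of change
    [when]          DATETIME,       -- when the change happened
    why             VARCHAR(255),   -- reason for the change
    foreignKey      INT,            -- id of the object that changed
    foreignTable    VARCHAR(128)    -- table that object lives in
);

CREATE TABLE transactionLI (
    transactionLIID INT IDENTITY PRIMARY KEY,
    transactionID   INT REFERENCES [transaction] (transactionID),
    [timestamp]     DATETIME,
    [key]           VARCHAR(128),   -- name of the changed field
    val             VARCHAR(MAX),   -- new value
    oldVal          VARCHAR(MAX)    -- previous value
);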
The problem is that these two tables are used for every object in the application, so they're pretty big tables now. Using them for anything is slow. Indexes only help so much.
So we're thinking about other ways to do something like this. Things we've considered so far:
- Sharding these tables by something like the timestamp.
- Denormalizing the two tables and merging them into one.
- A combination of the two above.
- Doing something along the lines of serializing each object after a change and storing it in subversion.
- Probably something else, but I can't think of it right now.
The whole problem is that we'd like to have some mechanism for properly storing and searching through transactional data. Yeah you can force feed that into a relational database, but really, it's transactional data and should be stored accordingly.
What is everyone else doing?

We have taken the following approach:-
All objects are serialised (using the standard XmlSerializer), but we have decorated our classes with serialisation attributes so that the resultant XML is much smaller (storing elements as attributes and dropping vowels from field names, for example). This could be taken a stage further by compressing the XML if necessary.
The object repository is accessed via a SQL view. The view fronts a number of tables that are identical in structure, each with a GUID appended to the table name. A new table is generated when the previous table reaches critical mass (a pre-determined number of rows).
We run a nightly archiving routine that generates the new tables and modifies the views accordingly so that calling applications do not see any differences.
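Schematically, the view is just a UNION ALL over whatever generated tables currently exist, along these lines (the table names and GUID suffixes are made up for illustration):
CREATE VIEW ObjectRepository
AS
    SELECT * FROM ObjectRepository_8A1C2B3D   -- an earlier, full table
    UNION ALL
    SELECT * FROM ObjectRepository_4E5F6A7B;  -- the table currently being filled
-- The nightly routine creates ObjectRepository_<newGUID> and re-creates the
-- view to include it, so calling applications never notice the switch.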
Finally, as part of the overnight routine we archive any old object instances that are no longer required to disk (and then tape).

I've never found a great end-all solution for this type of problem. One thing you can try, if your DB supports partitioning (and even if it doesn't, you can implement the same concept yourself), is to partition this log table by object type and then further partition by date/time or by your object ID (if your ID is numeric this works nicely; I'm not sure how a GUID would partition).
This will help keep the size of the table down and keep all the transactions related to a single instance of an object together.
One idea you could explore: instead of storing each field in a name/value pair table, you could store the data as a blob (either text or binary). For example, serialize the object to XML and store it in a field.
The downside is that as your object changes, you have to consider how this affects all your historical data. If you're using XML, there are easy ways to update the historical XML structures; if you're using binary, there are ways too, but you have to be more conscious of the effort.
I've had awesome success storing a rather complex object model that has tons of interrelations as a blob (the XML serializer in .NET didn't handle the relationships between the objects). I could very easily see myself storing the binary data. A huge downside of storing it as binary data is that you have to pull it out of the database to access it; with XML, if you're using a modern database like MSSQL, you can query the data in place.
One last approach is to split the two patterns: you could define a Difference Schema (I assume more than one property changes at a time), so for example imagine storing this XML:
<objectDiff>
  <field name="firstName" newValue="Josh" oldValue="joshua"/>
  <field name="lastName" newValue="Box" oldValue="boxer"/>
</objectDiff>
This will help reduce the number of rows, and if you're using MSSQL you can define an XML Schema and get some rich querying ability over the object. You can still partition the table.
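For example (just a sketch; the table and column names are my assumptions), with the diff stored in an xml column in SQL Server you can pull individual field changes back out with the XML methods:
-- Assumes a table like: objectDiffLog(transactionID INT, diff XML)
SELECT  t.transactionID,
        f.value('@name',     'varchar(100)') AS fieldName,
        f.value('@oldValue', 'varchar(max)') AS oldValue,
        f.value('@newValue', 'varchar(max)') AS newValue
FROM    objectDiffLog AS t
CROSS APPLY t.diff.nodes('/objectDiff/field') AS d(f)
WHERE   f.value('@name', 'varchar(100)') = 'lastName';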
Josh

Depending on the characteristics of your specific application, an alternative approach is to keep revisions of the entities themselves in their respective tables, together with the who, what, why, and when per revision. The who, what, and when can still be foreign keys.
I would be careful with this approach, though, since it is only viable for applications with a relatively small number of changes per entity/entity type.
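A rough sketch of what that could look like for a single entity type (table and column names are purely illustrative):
CREATE TABLE customerRevision (
    customerID   INT          NOT NULL,
    revision     INT          NOT NULL,   -- incremented on every change
    who          INT,                     -- FK to the users table
    what         VARCHAR(100),
    why          VARCHAR(255),
    [when]       DATETIME,
    firstName    VARCHAR(100),            -- ...plus the normal entity columns,
    lastName     VARCHAR(100),            -- captured as of this revision
    PRIMARY KEY (customerID, revision)
);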

If querying the data is important, I would use true partitioning in SQL Server 2005 and above, if you have Enterprise Edition. We have millions of rows partitioned by year down to day for the current month; you can be as granular as your application demands, with a maximum of 1,000 partitions.
Alternatively, if you are using SQL 2008 you could look into filtered indexes.
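For example (boundary dates, names, and granularity below are illustrative only):
-- SQL Server 2005+ (Enterprise Edition): partition the log table by date
CREATE PARTITION FUNCTION pfTransactionDate (DATETIME)
    AS RANGE RIGHT FOR VALUES ('20090101', '20090201', '20090301');

CREATE PARTITION SCHEME psTransactionDate
    AS PARTITION pfTransactionDate ALL TO ([PRIMARY]);
-- The table (or its clustered index) is then built ON psTransactionDate([when]).

-- SQL Server 2008: a filtered index covering only recent, frequently queried rows
CREATE NONCLUSTERED INDEX ix_transaction_recent
    ON [transaction] (foreignTable, foreignKey, [when])
    WHERE [when] >= '20090101';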
These are solutions that will enable you to retain the simplified structure you have whilst providing the performance you need to query that data.
Splitting/Archiving older changes obviously should be considered.

Related

How do I find the right data design and the right tools/database/query for the below requirement?

I have a requirement but am not able to figure out how I can solve it. I have datasets in the below format:
id, atime, grade
123, time1, A
241, time2, B
123, time3, C
or, if I put it in list format:
[[123,time1,A],[124,timeb,C],[123,timec,C],[143,timed,D],[423,timee,P].......]
Now my use-case is to perform comparisons, aggregations, and queries over multiple rows, such as:
- time difference between the last 2 rows where id=123
- time difference between the last 2 rows where id=123 and grade=A
- time difference between the first, 3rd, 5th, and latest rows
- all data (or the last 10 records for a particular id) should be easily accessible
I also need to do further computation. What format should I choose for the dataset, and what database/tools should I use?
I don't think a relational database is useful here. I am not able to solve it with Solr/Elastic; if you have any ideas, please give a brief outline. Or any other tool: Spark, Hadoop, Cassandra, any leads?
I am trying out things but any help is appreciated.
Choosing the right technology is highly dependent on your SLA: things like how much latency your queries can tolerate, what your query types are, whether your data qualifies as big data, whether data is updatable, whether you expect late events, whether you need historical data in the future or can use techniques like rollup, and so on. To clarify my answer: you can probably solve your problems by using window functions. For example, you can store your data in any of the tools you mentioned and, by using the Presto SQL engine, query it and get your desired result. But not all of them are optimal. Furthermore, these kinds of problems usually can not be solved with a single tool; a set of tools can cover all requirements.
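For example, "time difference between the last 2 rows where id=123" can be expressed with a window function roughly like this (the table name is an assumption, and the exact date arithmetic depends on the engine):
SELECT id,
       atime,
       grade,
       atime - LAG(atime) OVER (PARTITION BY id ORDER BY atime) AS gap_from_previous
FROM   events
WHERE  id = 123            -- add AND grade = 'A' for the id=123 & grade A variant
ORDER  BY atime DESC
LIMIT  1;                  -- keep only the latest row, i.e. the gap between the last two rows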
tl;dr: the text below doesn't arrive at a single solution; it introduces a way to think about data modeling and choosing tools.
Let me try to model the problem to choose a single tool. I assume your data is not updatable, you need low-latency response times, we don't expect any late events, and we face a large-volume data stream that must be saved as raw data.
Based on the first and second requirements, it's crucial to have random access (it seems you want to query on a particular ID), so solutions like Parquet or ORC files are not a good choice.
Based on the last requirement, data must be partitioned based on the ID. The first, second, and last requirements all rely on ID as part of the identifier, and it seems there is nothing like a join or a global ordering based on other fields like time. So we can choose ID as the partitioner (physical or logical) and atime as the clustering part; for each ID, events are ordered by time.
The third requirement is a bit vague. Do you want the result over all data, or for each ID?
For computing the first three conditions, we need a tool that supports window functions.
Based on the notes above, it seems we should choose a tool with good support for random access queries. Tools like Cassandra, Postgres, Druid, MongoDB, and ElasticSearch are the ones I can think of right now. Let's check them:
Cassandra: It's great on response time for random access queries, can handle a huge amount of data easily, and has no single point of failure. But sadly it does not support window functions. Also, you have to design your data model carefully, and it seems it's not a good choice for us (because of the future need for raw data). We could bypass some of these limitations by using Spark alongside Cassandra, but for now we prefer to avoid adding a new tool to our stack.
Postgres: It's great on random access queries over indexed columns. It supports window functions. We can shard data (horizontal partitioning) across multiple servers (and by choosing ID as the shard key we get data locality for computations). But there is a problem: ID is not unique, so we can not choose ID as the primary key, and we face some problems with random access. (We can choose the ID and atime columns (as a timestamp column) as a compound primary key, but it does not save us.)
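A sketch of that Postgres layout (column names taken from the sample data, everything else assumed):
CREATE TABLE events (
    id     BIGINT    NOT NULL,
    atime  TIMESTAMP NOT NULL,
    grade  TEXT,
    PRIMARY KEY (id, atime)   -- compound key; still breaks down if the same id
                              -- can record two events at exactly the same atime
);
-- Window functions then work per id, e.g.:
-- SELECT atime - LAG(atime) OVER (ORDER BY atime) FROM events WHERE id = 123;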
Druid: It's a great OLAP tool. Based on the storage layout (segment files) that Druid uses, by choosing the right data model you can get analytic queries over a huge volume of data in sub-seconds. It does not support window functions, but with rollup and some other functions (like EARLIEST) we can answer our questions. But by using rollup we lose the raw data, and we need it.
MongoDB: It supports random access queries and sharding. We can also get some kinds of window-function behaviour from its aggregation framework by defining pipelines. It supports capped collections, which we could use to store the last 10 events for each ID if the cardinality of the ID column is not high. It seems this tool can cover all of our requirements.
ElasticSearch: It's great on random access, maybe the greatest. With some kinds of filter aggregations we can get a type of window function. It can handle a large amount of data with sharding. But its query language is hard. I can imagine we could answer the first and second questions with ES, but for now I can't form the query in my head. It takes time to find the right solution with it.
So it seems MongoDB and ElasticSearch can answer our requirements, but there are a lot of "if"s along the way. I think we can't find a straightforward solution with a single tool. Maybe we should choose multiple tools and use techniques like duplicating data to find an optimal solution.

Homogeneous vs heterogeneous in DocumentDB

I am using Azure DocumentDB and all my experience in NoSql has been in MongoDb. I looked at the pricing model and the cost is per collection. In MongoDb I would have created 3 collections for what I was using: Users, Firms, and Emails. I noted that this approach would cost $24 per collection per month.
I was told by the people I work with that I'm doing it wrong: I should have all three of those things stored in a single collection, with a field to describe what the data type is, and that each collection should be split by date or geographic area so one part of the world has a smaller portion to search.
and to:
"Combine different types of documents into a single collection and add a field across all to separate them in searching, like a type field or something."
I would never have dreamed of doing that in Mongo, as it would make indexing, shard keys, and other things hard to get right.
There might not be many fields that overlap between the objects (for example, the Email and Firm objects).
I can do it this way, but I can't seem to find a single example of anyone else doing it that way - which indicates to me that maybe it isn't right. Now, I don't need an example, but can someone point me to some location that describes which is the 'right' way to do it? Or, if you do create a single collection for all data - other than Azure's pricing model, what are the advantages / disadvantages in doing that?
Any good articles on DocumentDb schema design?
Yes. In order to leverage CosmosDB to its full potential, you need to think of a collection as an entire database system and not as a "table" designed to hold only one type of object.
Sharding in Cosmos is exceedingly simple. You just specify a field that all of your documents will populate and select that as your partition key. If you use a generic property such as key or partitionKey, you can easily separate the storage of your inbound emails from your users from anything else by picking appropriate values, for example:
class InboundEmail
{
    public string Key { get; set; } = "EmailsPartition";
    // other properties
}

class User
{
    public string Key { get; set; } = "UsersPartition";
    // other properties
}
What I'm showing is still only an example though. In reality your partition key values should be even more dynamic. It's important to understand that queries against a known partition are extremely quick. As soon as you need to scan across multiple partitions you'll see much slower and more costly results.
So, in an app that ingests a lot of user data, keeping a single user's activity together in one partition might make sense for that particular entity.
If you want evidence that this is the appropriate way to use CosmosDB, consider the addition of the new Gremlin Graph APIs. Graphs are inherently heterogeneous, as they contain many different entities and entity types as well as the relationships between them. The query boundary of Cosmos is at the collection level, so if you tried putting your entities all in different collections, none of the Graph API or its queries would work.
EDIT:
I noticed in the comments you made this statement: "And you would have an index on every field in both objects." CosmosDB does automatically index every field of every document. It uses a special proprietary path-based indexing mechanism that ensures every path in your JSON tree has indices on it. You have to specifically opt out of this auto-indexing feature.

On-demand loading in SchemaCrawler?

So I have a rather large database where I want to show its metadata (schemas, tables, and columns) in a tree browser (I use Schemacrawler to fetch all the DB data, and JTree for the tree). However, because there are so many tables and columns, the TableColumnRetriever class takes ages to fetch the columns which causes a bottleneck in my implementation.
My idea is now to do lazy loading on the columns so they will only be fetched when a user clicks on a table. Is there a way in Schemacrawler to only fetch schemas and tables at the beginning (maybe set the SchemaInfoLevel to minimum?) and then later fetch the columns based on an input table/schema?
ps: I implement everything in Java.
SchemaCrawler does not have exactly the functionality you are looking for, that is, to do incremental loads of metadata. The reason is that SchemaCrawler builds an interconnected object model graph, where you can get from, say, a table to a foreign key to another table using Java object references.
However, there may be a couple of ways to address your issue that you can consider. One is that you can ask your end-users to provide a description of the tables they are interested in, in the form of a regular expression. You can use SchemaCrawler's powerful grep functionality to quickly retrieve that information.
Another way is to take advantage of the fact that schemas do not change very often. You can cache the schema metadata when your application starts for the first time, using SchemaCrawler's built-in functionality. If your end-user wants to refresh the schema at any time, they can do that with the understanding that it may take time. Otherwise, your application's performance will be very good using cached data.
Sualeh Fatehi, SchemaCrawler

PouchDB structure

I am new to the NoSQL concept, so when I started to learn PouchDB, I found this conversion chart. My confusion is: how does PouchDB handle it if, let's say, I have multiple tables? Does that mean I need to create multiple databases? Because from my understanding, in PouchDB a database can store a lot of documents, but does a document mean a row in SQL, or have I misunderstood?
The answer to this question seems to be surprisingly under-documented. While #llabball clearly gave a decent answer, I don't think that views are always the way to go.
As you can read here in the section When not to use map/reduce, Nolan explains that for simpler applications, the key is to abuse _ids, and leverage the power of allDocs().
In other words, if you had two separate types (say artists and albums), then you could prefix the id of each type to obtain an easily searchable data set. For example _id: 'artist_name' and _id: 'album_title' would allow you to easily retrieve artists in name order.
Laying out the data this way will result in better performance due to not requiring extra indexes, and less code. Clearly however, if your data requirements are more complex, then views are the way to go.
... does it mean that i need to create multiple databases?
No.
... a document mean a row in sql or am i misunderstood?
That's right. The SQL table defines the column headers (name and type); those are the JSON property names of the doc.
So, all docs (rows) with the same properties (a so-called "schema") are the equivalent of your SQL table. You can have as many different schemata in one database as you want (visit json-schema.org for some inspiration).
How to request them separately? Create CouchDB views! You can get all/some "rows" of your tabular data (docs with the same schema) with one request as you know it from SQL.
To write such views easily, a type property is very common in CouchDB docs. The table name you know from SQL can be your type, e.g. doc.type: "animal".
Your view names might then be animalByName or animalByWeight, depending on your needs.
Sometimes a multiple-database plan is a good option, like a database per user or even a database per user feature. Take a look at this conversation on the CouchDB mailing list.

CouchDB map/reduce by any document property at runtime?

I come from a SQL world where lookups are done by several object properties (published = TRUE or user_id = X) and there are no joins anywhere (because of the 1:1 cache layer). It seems that a document database would be a good fit for my data.
I am trying to figure out if there is a way to pass one (or more) object properties to a CouchDB map/reduce function to find matching documents in a database without creating dozens of views for each document type.
Is it possible to pass the desired document property key(s) to match at run-time to CouchDB and have it return the objects that match (or the count of object that match for pagination)?
For example, on one page I want all posts with a doc.user_id of X that are doc.published. On another page I might want all documents with doc.tags[] with the tag "sport".
You could build a view that iterates over the keys in the document and emits a key of [propertyName, propertyValue]; that way you're building a single index with EVERY prop/value pair in it. It would be massive; no idea how performant it would be to build, and disk usage would probably be bad.
Map function would look something like:
// note - totally untested, my CouchDB fu is rusty
function(doc) {
for(prop in doc) {
emit([prop, doc[prop]], null);
}
}
Works for the basic case of simple properties, and can be extended to be smart about arrays, and emit a prop/value pair for each item in the array. That would let you handle the tags.
To query on it, use [propertyName, propertyValue] as your query key on the view (or just [propertyName] as a startkey to range over all values of that property).
Basically, no.
The key difference between something like Couch and a SQL DB is that the only way to query in CouchDB is essentially through the views/indexes. Indexes in SQL are optional. They exist (mostly) to boost performance. For example, if you have a small DB, your app will run just fine on SQL with 0 indexes. (Might be some issue with unique constraints, but that's a detail.)
The overall point is that the query processor in a SQL database includes other methods of data access beyond simple indexes, notably table scans, merge joins, etc.
Couch has no query processor. It has views (defined by JS) used to define B-Tree indexes.
And that's it. That's the hammer of Couch. It's a good hammer; it has served the data processing world well for basically 40 years.
Indexes are somewhat expensive to create in Couch (based on data volume) which is why "temporary views" are frowned upon. And they have a cost in maintenance as well, so views need to be a conscious design element in your database. At the same time, they're a bit more powerful than normal SQL indexes as well.
You can readily add your own query processing on top of Couch, but that will be more work for you. You can create a few select views on your most popular or selective criteria, and then filter the resulting documents by other criteria in your own code. Yes, you have to do it yourself, so you have to question whether the effort involved is worth more than whatever benefits you feel Couch is offering you (HTTP API, replication, a safe, always-consistent datastore, etc.) over a SQL solution.
I ran into a similar issue like this, and built a quick workaround using CouchDB-Python (which is a great library). It's not a pretty solution (goes against the principles of CouchDB), but it works.
CouchDB-Python gives you the function "Query", which allows you to "execute an ad-hoc temporary view against the database". You can read about it here
What I do is store the JavaScript function as a string in Python, and then concatenate it with variables that I define in Python.
In some_function.py:
variable = value

# Map function (in javascript)
map_fn = """function(doc) {
    <javascript code>
    var survey_match = """ + variable + """;
    <javascript code>
}"""

# Iterates through rows
for row in db.query(map_fn):
    <python code>
It sure isn't pretty, and probably breaks a bunch of CouchDB philosophies, but it works.
D
