Storing News in a Distributed DB vs RDBMS - couchdb

Hi all: If I am storing News articles in a DB with different categories such as "Tech", "Finance", and "Health", would a distributed database work well in this system vs a RDBMS? Each of the news items would have the news articles attached as well as a few other items. I am wondering if querying would be faster, though.
Let's say I never have more than a million rows, and I want to grab the latest (within 5 hours) tech articles. I imagine that would be a map-reduce of "Give me all tech articles" (possibly 10000), then weed out only the ones that have the latest timestamp.
Am I thinking about tackling the problem in the right way, and would a DDB even be the best solution? In a few years there might be 5 million items, but even then....

Whether to use a distributed database or key-value store depends more on your operational requirements than your domain problem.
When people ask how to do time-ordered queries in Riak, we usually suggest several strategies (although none of them are a silver-bullet as Riak lacks ordered range queries):
1) If you are frequently accessing a specifically-sized chunk of time, break your data into buckets that reflect that period. For example, all data for the day, hour or minute specified would be either stored or linked to from a bucket that contains the appropriate timestamp. If I wanted all the tech news from today, the bucket name might be "tech-20100616". As your data comes in, add appropriate links from the time-boxed bucket to the actual item.
2) If the data is more sequence-oriented and not related to a specific point in time, use links to create a chain of data, linking backward in time, forward, or both. (This works well for versioned data too, like wiki pages.) You might also have to keep an object that just points at the head of the list.
Those strategies aside, Riak is probably not the 100% solution for up-to-the-minute information, but might be better for the longer-term storage. You could combine it with something like Redis, memcached, or even MongoDB (which has great performance if your data is mildly transient and can fit in memory) to hold a rolling index of the latest stuff.

Related

How to structure relationships in Azure Cosmos DB?

I have two sets of data in the same collection in cosmos, one are 'posts' and the other are 'users', they are linked by the posts users create.
Currently my structure is as follows;
// user document
{
id: 123,
postIds: ['id1','id2']
}
// post document
{
id: 'id1',
ownerId: 123
}
{
id: 'id2',
ownerId: 123
}
My main issue with this setup is the fungible nature of it, code has to enforce the link and if there's a bug data will very easily be lost with no clear way to recover it.
I'm also concerned about performance, if a user has 10,000 posts that's 10,000 lookups I'll have to do to resolve all the posts..
Is this the correct method for modelling entity relationships?
As said by David, it's a long discussion but it is a very common one so, since I have on hour or so of "free" time, I'm more than glad to try to answer it, once for all, hopefully.
WHY NORMALIZE?
First thing I notice in your post: you are looking for some level of referential integrity (https://en.wikipedia.org/wiki/Referential_integrity) which is something that is needed when you decompose a bigger object into its constituent pieces. Also called normalization.
While this is normally done in a relational database, it is now also becoming popular in non-relational database since it helps a lot to avoid data duplication which usually creates more problem than what it solves.
https://docs.mongodb.com/manual/core/data-model-design/#normalized-data-models
But do you really need it? Since you have chosen to use JSON document database, you should leverage the fact that it's able to store the entire document and then just store the document ALONG WITH all the owner data: name, surname, or all the other data you have about the user who created the document. Yes, I’m saying that you may want to evaluate not to have post and user, but just posts, with user info inside it.This may be actually very correct, as you will be sure to get the EXACT data for the user existing at the moment of post creation. Say for example I create a post and I have biography "X". I then update my biography to "Y" and create a new post. The two post will have different author biographies and this is just right, as they have exactly captured reality.
Of course you may want to also display a biography in an author page. In this case you'll have a problem. Which one you'll use? Probably the last one.
If all authors, in order to exist in your system, MUST have blog post published, that may well be enough. But maybe you want to have an author write its biography and being listed in your system, even before he writes a blog post.
In such case you need to NORMALIZE the model and create a new document type, just for authors. If this is your case, then, you also need to figure out how to handler the situation described before. When the author will update its own biography, will you just update the author document, or create a new one? If you create a new one, so that you can keep track of all changes, will you also update all the previous post so that they will reference the new document, or not?
As you can see the answer is complex, and REALLY depends on what kind of information you want to capture from the real world.
So, first of all, figure out if you really need to keep posts and users separated.
CONSISTENCY
Let’s assume that you really want to have posts and users kept in separate documents, and thus you normalize your model. In this case, keep in mind that Cosmos DB (but NoSQL in general) databases DO NOT OFFER any kind of native support to enforce referential integrity, so you are pretty much on your own. Indexes can help, of course, so you may want to index the ownerId property, so that before deleting an author, for example, you can efficiently check if there are any blog post done by him/her that will remain orphans otherwise.
Another option is to manually create and keep updated ANOTHER document that, for each author, keeps track of the blog posts he/she has written. With this approach you can just look at this document to understand which blog posts belong to an author. You can try to keep this document automatically updated using triggers, or do it in your application. Just keep in mind, that when you normalize, in a NoSQL database, keep data consistent is YOUR responsibility. This is exactly the opposite of a relational database, where your responsibility is to keep data consistent when you de-normalize it.
PERFORMANCES
Performance COULD be an issue, but you don't usually model in order to support performances in first place. You model in order to make sure your model can represent and store the information you need from the real world and then you optimize it in order to have decent performance with the database you have chose to use. As different database will have different constraints, the model will then be adapted to deal with that constraints. This is nothing more and nothing less that the good old “logical” vs “physical” modeling discussion.
In Cosmos DB case, you should not have queries that go cross-partition as they are more expensive.
Unfortunately partitioning is something you chose once and for all, so you really need to have clear in your mind what are the most common use case you want to support at best. If the majority of your queries are done on per author basis, I would partition per author.
Now, while this may seems a clever choice, it will be only if you have A LOT of authors. If you have only one, for example, all data and queries will go into just one partition, limiting A LOT your performance. Remember, in fact, that Cosmos DB RU are split among all the available partitions: with 10.000 RU, for example, you usually get 5 partitions, which means that all your values will be spread across 5 partitions. Each partition will have a top limit of 2000 RU. If all your queries use just one partition, your real maximum performance is that 2000 and not 10000 RUs.
I really hope this help you to start to figure out the answer. And I really hope this help to foster and grow a discussion (how to model for a document database) that I think it is really due and mature now.

Real time multi threaded max-heap for top-N geohash

There is a requirement to keep a list of top-10 localities in a city from where demand for our food service is emanating at any given instant. The city could have tens of thousands of localities.
If one has to make a near real time (lag no more than 5 minutes) datastore in memory that would
- keep count of incoming demand by locality (geo hash)
- reads by hundreds of our suppliers every minute (the ajax refresh is every minute)
I was thinking of a multi threaded synchronized max-heap. This would be a complex solution as tree locking is by itself a complex implementation.
Any recommendations for the best in-memory (replicatable master slave) data structure that can be read and updated in multi threaded environment?
We expect 10K QPS and 100K updates per second. When we scale to other cities and regions, we will need per city implementation of top-10.
Are there any off the shelf solutions available?
Persistence is not a need so no mySQL based solutions. If you recommend redis or mongo DB solution, please realize that the queries are not pointed-queries by key but a top-N query instead.
Thanks in advance.
If you're looking for exactly what you're describing, there are a few approaches that might work nicely. There are several papers describing concurrent data structures that could work as priority queues; here is one option that I'm not super familiar with but which looks promising. You might also want to check out concurrent skip lists, which should also match your requirements.
If I'm interpreting your problem statement correctly, you're hoping to maintain a top-10 list of locations based on the number of hits you receive. If that's the case, I would suspect that while the number of updates would be huge, the number of times that two locations would switch positions would not actually be all that large. In other words, most updates wouldn't actually require the data structure to change shape. Consequently, you could consider using a standard binary heap where each element uses an atomic-compare-and-set integer key and where you have some kind of locking system that's used only in the case where you need to add, move, or delete an element from the heap.
Given the scale that you're working at, you may also want to consider approximate solutions to your problem. The count-min sketch data structure, for example, was specifically designed to estimate frequent elements in a data stream and does so extremely quickly. It can easily be distributed and linked up with a priority queue in a manner similar to what I described above. There are lots of good implementations out there, and if I remember correctly this data structure is actually deployed in situations like the one you're describing.
Hope this helps!

Is Cassandra a good candidate database for that must sustain over 100 read/write operations per second?

Currently our system uses PostgreSQL, however we seem to have pushed the limit of its capabilities. Some of our tables need to handle over 100 read/write operations per second so it is probably time to scale horizontally across multiple machines.
Have a lot of experience using GAE's Big Table. Big Table had rich options for querying. For example, queries were possible against list data fields. Cassandra is supposed to be based off of Big Table, but if I understand correctly, for Cassandra, we will actually have to custom-code a layer on top of Cassandra that uses and maintains index tables.
Would be great if there was an open source database available for which we did not have to build our own custom logic for maintaining index tables, zig-zag merge joins, etc...
Is Cassandra a good candidate here? Or are there ones that might be considered better?
Unless the operations are huge joins or return hundreds of thousands of rows, any database you choose will be able to sustain 100 ops/s. Cassandra will have no problems serving thousands if not tens of thousands of reads and writes per node.
Without knowing more about your particular use case it's impossible to give you meaningful advice. Cassandra is a great database, but if it's right for you I don't know. I'd suggest looking through the cassandra tag here on Stack Overflow and look at what people ask about and if it looks at all like what you're trying to do, and if the answers say that it's possible with Cassandra (I know I've answered quite a few questions where the answer was that Cassandra wasn't the best choice for that particular case).
Cassandra and GAE Big Table have big similarities, but also big differences. One thing that trips up new Cassandra users is that there really isn't any way of doing things like "add this thing only unless that other thing was there" or "add an item and remove all but the last N items".

Is NoSQL ideal to store stats?

I'm not terribly familiar with NoSQL systems, but I remember reading a while back that they are ideal to handle statistical data.
Since I'm about to start writing code that will record data like "how many users were registered on each day", I was thinking I could use this as an opportunity to learn more about NoSQL if it fits the bill.
If NoSQL is indeed ideal for this, could you provide me with some information as to why? And which specific systems are best suited for this particular need?
So, after the first answer, maybe it's helpful to clarify a bit more.
I currently have a PostgreSQL database from which I'll get the data. It will be very simple, and no calculations needed. For example, I'll just get a resultset with the amount of users registered each day for the past month (so it'll basically just be a set of value pairs for the date/users) and save that in another table/database.
Thanks!
It kind of depends on what sorts of analysis you are going to be doing on these stats. If you are going to be doing a lot of different operations (averaging, summing, joining...) you may find NoSQL solutions to be more of a pain then they are worth.
However, if you are storing stats mostly for a display purpose, or for very specific analysis routines, NoSQL solutions start to shine.
If your data is small enough, stick with a SQL solution, which will give the benefit of a full query engine to work with, but if you have lots of values (one value a day is nothing, even if you were running for a million years), and are worried about storage size and performance, NoSQL options once again may be worth it.
If your data is semi-structured, take a look at CouchDB, which offers some rudimentary indexing and querying support, which could provide some basis for analysis routines. If you are storing individual values with very little structure, my best advice would be to take a look at Tokyo Cabinet and Tokyo Tyrant, which are absolutely incredible options for key-value storage.
NoSQL systems tend to optimize the case where data is stored frequently, but accessed infrequently. In the case of statistics, you might gather lots of data from a (social) site frequently in small bits, which is optimized for. But retrieval and analysis might be slower... It of course depends on which "NoSql" System you decide to use.

Strategies for search across disparate data sources

I am building a tool that searches people based on a number of attributes. The values for these attributes are scattered across several systems.
As an example, dateOfBirth is stored in a SQL Server database as part of system ABC. That person's sales region assignment is stored in some horrible legacy database. Other attributes are stored in a system only accessible over an XML web service.
To make matters worse, the the legacy database and the web service can be really slow.
What strategies and tips should I consider for implementing a search across all these systems?
Note: Although I posted an answer, I'm not confident its a great answer. I don't intend to accept my own answer unless no one else gives better insight.
You could consider using an indexing mechanism to retrieve and locally index the data across all the systems, and then perform your searches against the index. Searches would be an awful lot faster and more reliable.
Of course, this just shifts the problem from one part of your system to another - now your indexing mechanism has to handle failures and heterogeneous systems, but that may be an easier problem to solve.
Another factor is how often the data changes. If you have to query data in real-time that goes stale very quickly, then indexing may not be practical.
If you can get away with a restrictive search, start by returning a list based on the search criteria corresponding to the fastest data source. Then join up those records with the other systems and remove records which don't match the search criteria.
If you have to implement OR logic, this approach is not going to work.
While not an actual answer, this might at least get you partway to a workable solution. We had a similar situation at a previous employer - lots of data sources, different ways of accessing those data sources, different access permissions, military/government/civilian sources, etc. We used Mule, which is built around the Enterprise Service Bus concept, to connect these data sources to our application. My details are a bit sketchy, as I wasn't the actual implementor, just an integrator, but what we did was define a channel in Mule. Then you write a simple integration piece to go between the channel and the data source, and the application and the channel. The integration piece does the work of making the actual query, and formatting the results, so we had a generic SQL integration piece for accessing a database, and for things like web services, we had some base classes that implemented common functionality, so the actual customization of the integration piecess was a lot less work than it sounds like. The application could then query the channel, which would handle accessing the various data sources, transforming them into a normalized bit of XML, and return the results to the application.
This had a lot of advantages for our situation. We could include new data sources for existing queries by simply connecting them to the channel - the application didn't have to know or care what data sources where there, as it only looked at the data from the channel. Since data can be pushed or pulled from the channel, we could have a data source update the application when, for example, it was updated.
It took a while to get it configured and working, but once we got it going, we were pretty successful with it. In our demo setup, we ended up with 4 or 5 applications acting as both producers and consumers of data, and connecting to maybe 10 data sources.
Have you thought of moving the data into a separate structure?
For example, Lucene stores data to be searched in a schema-less inverted indexed. You could have a separate program that retrieves data from all your different sources and puts them in a Lucene index. Your search could work against this index and the search results could contain a unique identifier and the system it came from.
http://lucene.apache.org/java/docs/
(There are implementations in other languages as well)
Have you taken a look at YQL? It may not be the perfect solution but I might give you starting point to work from.
Well, for starters I'd parallelize the queries to the different systems. That way we can minimize the query time.
You might also want to think about caching and aggregating the search attributes for subsequent queries in order to speed things up.
You have the option of creating an aggregation service or middleware that aggregates all the different systems so that you can provide a single interface for querying. If you do that, this is where I'd do the previously mentioned cache and parallize optimizations.
However, with all of that it you will need weighing up the development time/deployment time /long term benefits of the effort against migrating the old legacy database to a faster more modern one. You haven't said how tied into other systems those databases are so it may not be a very viable option in the short term.
EDIT: in response to data going out of date. You can consider caching if your data if you don't need the data to always match the database in real time. Also, if some data doesn't change very often (e.g. dates of birth) then you should cache them. If you employ caching then you could make your system configurable as to what tables/columns to include or exclude from the cache and you could give each table/column a personalizable cache timeout with an overall default.
Use Pentaho/Kettle to copy all of the data fields that you can search on and display into a local MySQL database
http://www.pentaho.com/products/data_integration/
Create a batch script to run nightly and update your local copy. Maybe even every hour. Then, write your query against your local MySQL database and display the results.

Resources