Keeping track of history in graph databases

I am investigating the use of a graph database (such as Neo4j, mainly because I need the Python bindings) for modeling a real physical network. However, one of the requirements is to be able to track history: where machines were, the state of network ports, and so on.
In a relational database, I can quite easily create an 'archive' table for historical queries, but I've been bitten many times by fixed table schemas and rather awkward left joins all over the place.
Does anyone have suggestions on the best way to maintain historical relationships and node properties in a graph database?

Depending on the number of nodes, you might be able to take snapshots of the whole graph. Then index each node so that you can query it in each revision of the network.
You could also try versioning each node. Each time a node or one of its edges changes, copy the node with references to the current version of each node it connects to, then increment the version number of the node you just modified.
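A minimal sketch of that copy-on-write versioning scheme in plain Python (the `VersionedGraph` class and its methods are invented for illustration; with Neo4j you would express the same pattern with Cypher and relationship properties):

```python
# Copy-on-write node versioning: every change creates a new revision of a
# node that points at the *current* revisions of its neighbours, so old
# revisions keep pointing at the historical state.

class VersionedGraph:
    def __init__(self):
        self.revisions = {}   # (node_id, version) -> {"props": ..., "links": [...]}
        self.current = {}     # node_id -> latest version number

    def add_node(self, node_id, props):
        self.current[node_id] = 1
        self.revisions[(node_id, 1)] = {"props": dict(props), "links": []}

    def update_node(self, node_id, props=None, links=None):
        """Create a new revision instead of mutating the old one."""
        old = self.revisions[(node_id, self.current[node_id])]
        if links is None:                       # keep the old neighbour set
            links = [n for n, _ in old["links"]]
        version = self.current[node_id] + 1
        self.revisions[(node_id, version)] = {
            "props": dict(props) if props is not None else dict(old["props"]),
            # Freeze each neighbour at its current version.
            "links": [(n, self.current[n]) for n in links],
        }
        self.current[node_id] = version

    def node_at(self, node_id, version):
        return self.revisions[(node_id, version)]

g = VersionedGraph()
g.add_node("switch1", {"port1": "up"})
g.add_node("server1", {"rack": "A"})
g.update_node("server1", links=["switch1"])     # server1 v2 -> switch1 v1
g.update_node("switch1", props={"port1": "down"})
# Historical query: server1 v2 still points at switch1 v1, where port1 was "up".
```

The price of this scheme is that every change to a heavily connected node fans out into new references, so it suits networks where history queries matter more than write volume.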

Since Neo4j stores its data as files on disk, you can keep versions of your graph database in Git (committing the store directory while the database is shut down). Then you can move back and forth between versions to see how the graph looked at any point.

I know that Sones provides version control within the database.
"... place them under version control and administer various editions ..." Link

Related

Shopware 6 partitioning

Has anyone had any experience with database partitioning? We already have a lot of data, and queries on it are starting to slow down. Maybe someone has some examples? These are tables related to orders.
Shopware, since version 6.4.12.0, allows the use of database clusters; see the relevant documentation. You will have to set up a number of read-only nodes first. The read load will then be distributed among the read-only nodes, while write operations are restricted to the primary node.
Note that in a cluster setup you should also use a lock storage that complements the setup.
Besides using a DB cluster, you can also try to reduce the load on the DB server.
The first thing you should do is enable the HTTP cache; better still, additionally set up a reverse proxy cache like Varnish. This will greatly decrease the number of requests that hit your web server, and thus your DB server as well.
Beyond that, the general performance measures explained here should improve the overall performance of your shop as well as decrease the load on the DB.
Additionally, you could use Elasticsearch so that costly search requests won't hit the database, use a "real" message queue so that messages are not stored in the database, and use Redis instead of the database for performance-critical information, as documented in the articles in this category of the official docs.
The impact of all these measures depends on your concrete project setup, so maybe you will see something in the DB locks that hints at one of the points I mentioned; that would be an indicator to start in that direction. E.g. if you see a lot of search-related queries, Elasticsearch would be a great start; but if a lot of the DB load comes from writing/reading/deleting messages, then the message queue might be the better starting point.
All in all, when you use a DB cluster with a primary and multiple replicas and add the additional services I mentioned here, your shop should scale quite well without the need for partitioning the actual DB.

Web real-time analytics dashboard: which technologies should we use? (Node/Django, Cassandra/MongoDB...)

We want to develop a dashboard to analyze geospatial data.
This is a small and close approach to what we want to do: http://adilmoujahid.com/images/data-viz-talkingdata.gif
Our main concerns are about the backend technologies to be used. (front will be D3.js, DC.js, leaflet.js...)
Between Django and Node.js, we think we will use Node.js, because we've read that it's faster than Django for this kind of task. But we are not sure, and we are open to ideas.
As for MongoDB versus Cassandra, though, we are quite confused. Our data is mostly structured, so storing it in tables, as Cassandra does, would make it easy to manage, and Cassandra also seems to have better performance. However, we also have IoT device data, with lots of real-time GPS locations...
Which suggestions can you give to us to achieve our goal?
TL;DR summary:
Dashboard with hundreds of simultaneous users.
Stored data will be mostly structured text/numbers, but will include also images, GPS-arrays, IoT sensors, geographical data (vector-polygons & rasters)
Databases will receive high write load coming from sensors.
Dashboard performance is very important. It's more important to read data in real time than to keep it uncorrupted/secure.
Most calculations will be done in the client's browser; the server will avoid mathematical operations where possible.
Disclaimer: I'm a DataStax employee so I'll comment on the Cassandra piece.
Cassandra is a good choice for this if your dashboard can be planned around a set of known queries. If those users will be doing ad-hoc queries directly to the database from the dashboard, you'll want something with a little more flexibility like ElasticSearch or (shameless plug) DataStax Search. Especially if you expect the queries/database to handle some of the geospatial logic.
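To illustrate what "planned around a set of known queries" means in practice: Cassandra data modelling is query-first, typically one table per dashboard query, with the partition key chosen so each query is a single-partition read. A sketch of two such CQL definitions for the sensor workload described above (all table and column names are invented; with the Python `cassandra-driver` package you would pass these strings to `session.execute`):

```python
# Query-first modelling: one table per known dashboard query, partitioned so
# that each query touches a single partition.

# Query 1: "latest readings for one device" -> partition by device,
# newest rows first.
READINGS_BY_DEVICE = """
CREATE TABLE IF NOT EXISTS readings_by_device (
    device_id text,
    ts        timestamp,
    lat       double,
    lon       double,
    value     double,
    PRIMARY KEY ((device_id), ts)
) WITH CLUSTERING ORDER BY (ts DESC)
"""

# Query 2: "recent readings in a region" -> partition by a coarse geohash
# prefix plus an hour bucket, so partitions stay bounded under the 500
# writes/second sensor load.
READINGS_BY_REGION = """
CREATE TABLE IF NOT EXISTS readings_by_region (
    geohash4  text,
    hour      timestamp,
    ts        timestamp,
    device_id text,
    value     double,
    PRIMARY KEY ((geohash4, hour), ts, device_id)
) WITH CLUSTERING ORDER BY (ts DESC, device_id ASC)
"""
```

The same reading is written twice, once per table; denormalized duplication in exchange for cheap reads is the usual Cassandra trade-off, which is exactly why ad-hoc queries are awkward without a search layer on top.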
JaguarDB has very strong support for geospatial data (2D and 3D). It allows you to store multiple measurements per point location, while other databases support only one measurement (PointM). Many complex queries, such as Voronoi polygons and convex hulls, are also supported. It is open source, distributed and sharded, supports multi-column indexes, etc.
Concerning PostgreSQL and Cassandra, is there much difference in RAM/CPU/disk usage between them?
Our use case does not require transactions, it will run on a single node, and we will have IoT devices writing data up to 500 times per second. However, I've read that geographical data works better with PostGIS than with Cassandra...
Given this use case, do you recommend Cassandra or PostGIS?

What is the best way to resolve CouchDB document conflicts across 2 DB instances?

I have an application running on Node.js and I am trying to make it a distributed app. All write requests go to the Node application, which writes to CouchDB A and, on success, writes to CouchDB B. We read data through an ELB (which reads from the two DBs). It's working fine.
But I recently faced a problem: CouchDB B went down, and after it came back up there is now a document _rev mismatch between the two instances.
What would be the best approach to resolve the above scenario without any downtime?
If your CouchDB A & CouchDB B are in the same data centre, then @Flimzy's suggestion of using CouchDB 2.0 in a clustered deployment is a good one. You can have n CouchDB nodes configured in a cluster with a load balancer sitting above the cluster, delivering HTTP(S) traffic to any node that is "up".
If A & B are geographically separated, you can use CouchDB Replication to move data from A-->B and B-->A which would keep both instances perfectly in sync. A & B could each be clusters of 3 or more CouchDB 2.0 nodes, or single instances of CouchDB 1.7.
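Continuous two-way sync between A and B amounts to two replication documents, one per direction, stored in the `_replicator` database on either side (hostnames and the database name below are placeholders):

```python
# Two continuous replications make A<->B bidirectional; CouchDB's replicator
# persists these docs and restarts the jobs after crashes or restarts, so a
# node that was down simply catches up when it returns.
repl_a_to_b = {
    "_id": "a_to_b",
    "source": "http://couch-a:5984/mydb",
    "target": "http://couch-b:5984/mydb",
    "continuous": True,
}
repl_b_to_a = {
    "_id": "b_to_a",
    "source": "http://couch-b:5984/mydb",
    "target": "http://couch-a:5984/mydb",
    "continuous": True,
}
# Each dict would be PUT as JSON into /_replicator on the respective server.
```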
None of these solutions will "fix" the problem you are seeing when two copies of the database are modified in different ways at the same time. This "conflict" state is CouchDB's way of preventing data loss when two writes clash. Your app can resolve the conflict by picking a winning revision or writing a new one. It's not a fault condition, it's helping your application recover from a data loss during concurrent writes in a distributed system.
You can read more about document conflicts in this blog post series.
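A minimal sketch of the "pick a winning revision" step in Python. The rule here (highest generation number, with the revision suffix as tie-break) mirrors CouchDB's own arbitrary-but-deterministic default; `resolve` is an invented helper, and in a real app you would fetch the conflicting leaves with `?conflicts=true`, then DELETE the losers and keep or rewrite the winner:

```python
def rev_key(rev):
    """Order CouchDB revision strings like '3-a1b2c3' deterministically:
    first by generation number, then by the hash suffix."""
    n, _, suffix = rev.partition("-")
    return (int(n), suffix)

def resolve(doc_revs):
    """Given {rev: doc_body} for all conflicting leaf revisions of one
    document, pick a winner and list the losing revisions to delete."""
    winner = max(doc_revs, key=rev_key)
    losers = [r for r in doc_revs if r != winner]
    return winner, losers

revs = {
    "2-aaa": {"state": "up"},
    "2-bbb": {"state": "down"},
    "1-ccc": {"state": "unknown"},
}
winner, losers = resolve(revs)
# Same generation (2) on two leaves, so the suffix breaks the tie.
```

Any deterministic rule works as long as every node applies the same one; application-specific rules (e.g. latest timestamp inside the doc, or a field-by-field merge) are usually better than this purely structural one.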
If both of your 1.6.x nodes are syncing databases using standard replication, turning off one node shouldn't be an issue. When the node comes back up, it receives all updates without conflicts, because there was no way to create them while the node was down.
If you experience conflicts during normal operation, unfortunately there exists no general way to resolve them automatically. However, in most cases you can find a strategy of marking the affected document subtrees in a way that lets you determine which version is the most recent (or the more important).
To detect docs that have conflicts, you can use standard views: a doc received by a view function has the _conflicts property if conflicting revisions exist. With an appropriate view you can detect conflicts and merge docs. Regardless of how you detect conflicts, you need external code to resolve them.
If your conflicting data is numeric by nature, consider using CRDT structures and standard map/reduce to obtain the final value. If your data is text-like, you may also try CRDTs, but to get reasonable performance you will need reducers written in Erlang.
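For the numeric case, here is a minimal G-Counter CRDT sketch in Python (function names invented): each node increments only its own slot, the merge takes per-slot maxima, and the value is the sum, so divergent copies always merge to the same result regardless of order:

```python
def increment(counter, node, amount=1):
    """Each node bumps only its own slot; counters are plain dicts."""
    c = dict(counter)
    c[node] = c.get(node, 0) + amount
    return c

def merge(a, b):
    """Commutative, associative, idempotent merge: per-node maximum."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}

def value(counter):
    """The observed total is the sum over all node slots."""
    return sum(counter.values())

# Two replicas diverge while disconnected...
a = increment({}, "node_a", 3)
b = increment({}, "node_b", 2)
# ...and merging in either order converges to the same state, so a
# replication conflict between the two copies can be resolved mechanically.
```

This only handles grow-only counts; decrements need a PN-Counter (a pair of G-Counters), and text needs sequence CRDTs, which is where the Erlang-reducer performance caveat above comes in.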
As for 2.x: I do not recommend using 2.x for your case (actually, for any real case except experiments). First, using 2.x will not remove conflicts, so it does not solve your problem. Also, taking into account that 2.x requires a lot of poorly documented manual operations across nodes and is unable to rebalance, you will get more pain than value.
BTW, using any cluster solution makes very little sense for two nodes.
As for the above-mentioned CVE-2017-12635 and CouchDB 1.6.x: you can use this patch https://markmail.org/message/kunbxk7ppzoehih6 to cover the vulnerability.

Do CouchDB views replicate?

I don't mean the view sources stored in _design docs (those replicate, since they're just docs). What I mean is: do the view results (the computed B-trees) replicate as well, or do only regular documents replicate (which is how I understand it right now)?
The problematic scenario is:
There's a spike in traffic and I want to bring up a temporary server and replicate a portion of the dataset onto that new server. The views for those (to-be-replicated) docs have already been computed on the old server, and so don't need to be recomputed on the new server... so I want those old computed results to be transferred along with that portion of the docs.
Another scenario is to use a backend cluster to compute complex views, and then replicate those results onto a bunch of front-end servers that are actually hit by user requests.
As Till said, the results are not replicated. More to the point, you actually don't want them to be replicated. The general CouchDB paradigm to remember is that each installation is treated as an independent node; that's why _id, _rev, and sequence numbers are so important. This allows each node to work without taking any other node into consideration: if one of your nodes goes down, all of the others will continue to crank away without a care in the world.
Of course, this introduces new considerations around consistency that you might not be used to. For example, if you have multiple web servers that each has its own CouchDB node on it, and those nodes run replication between themselves so that each instance stays up to date, there will be a lag between the nodes. Here's an example flow:
User writes a change to web server A.
User makes a read request to web server B, because your load balancer decided that B was the better choice. The user gets their result.
Web server A sends the updated doc to web server B via replication.
As you can see, the user got the previous version of their document because web server B didn't know about the change yet. This can be mitigated with...
Sticky sessions, so that all of a user's reads and writes go to the same server. This could just end up defeating the point of your load balancer, though.
Moving the CouchDB nodes off of the web servers and onto their own boxes. If you go with this then you probably want to take a look at the couchdb-lounge project (http://tilgovi.github.com/couchdb-lounge/).
Do your users really care if they get stale results? Your use case might be one where users won't notice that their results don't reflect the change they just made. Make sure you're really getting measurable value out of this work.
Cheers.
The computed result is not replicated.
Here are some additional thoughts though:
When you partition your dataset and bring up a second server with it, how do you distribute reads/writes and combine view results? This setup requires a proxy of some sort; I suggest you look into CouchDB-Lounge.
If you're doing master-master, you could keep the servers in sync using DRBD. It's been proven to work with MySQL master-master replication; I don't see why it would not work here. This would also mean that the computed results are automatically in sync on both servers.
Let me know if this helps!

What is the difference between Cassandra and CouchDB?

I'm looking at both projects and I can't really see the difference.
from Cassandra Site:
Cassandra is a highly scalable, eventually consistent, distributed, structured key-value store...Cassandra is eventually consistent. Like BigTable, Cassandra provides a ColumnFamily-based data model richer than typical key/value systems.
from CouchDB Site:
Apache CouchDB is a distributed, fault-tolerant and schema-free document-oriented database accessible via a RESTful HTTP/JSON API.
That said, I see the specific differences between each project (access methods, implementation languages, etc.), but to give AN EXAMPLE: when you talk about Solr or Sphinx, you know both are indexers with big differences, but at the end of the day both are indexers.
Can I say here that Cassandra and CouchDB are non-relational databases that, in some cases, can replace one another?
CouchDB is a document store. You put documents (JSON objects) in it and define views (indexes) over them. The objects can be arbitrarily complex with potentially deep structure. Further, they are not constrained to following some consistent schema.
Cassandra is a ragged-table key-value store. It just stores rows, each of which has a set of named columns grouped into families, with values. It is quite close to BigTable; like BigTable (and unlike an SQL database), it doesn't require each row to have the same structure. The values may have some structure, but this kind of store doesn't know anything about that; they're just strings/byte sequences.
Yes, they are both non-relational databases, and there is probably a fair amount of overlap in their applicability, but they do have distinctly different data organization models. Each can probably be forced into emulating the other, but each model will map best to a different set of problems.
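A toy Python sketch of that difference in data organization (the machine record is invented; the point is only the shape each store works with):

```python
# CouchDB's unit: one self-describing JSON document, arbitrarily nested,
# with no schema shared across documents.
couch_doc = {
    "_id": "machine42",
    "location": {"rack": "A", "room": 3},
    "ports": [{"name": "eth0", "state": "up"},
              {"name": "eth1", "state": "down"}],
}

# Cassandra's unit: a row key plus named columns grouped into families.
# Column values are opaque strings/bytes as far as the store is concerned,
# and different rows may have different column sets ("ragged").
cassandra_row = {
    ("machine42", "info"):  {"rack": "A", "room": "3"},
    ("machine42", "ports"): {"eth0": "up", "eth1": "down"},
}
```

The document keeps nesting intact and is queried through views; the column-family layout flattens the record but lets the store slice individual columns out of a row efficiently.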
CouchDB has a feature present in very few open source database technologies: offline replication. CouchDB is designed so that applications can be run at the edge of the network. These applications are available even when internet connectivity fails.
Offline replication can also be leveraged to build large clusters, but CouchDB is designed to be robust and simple whether it is running on a single server, a datacenter, or even a smartphone.
