I am facing a new type of problem that I haven't tried tackling before. So I would like some pointers in the right direction by someone more knowledgeable than I :-)
I have been asked by a friend to help him design a control system for a production line. The project sounds really interesting, and I can't stop thinking about it.
I have already found that I can control the system using a node.js server. So far so good (HTML5 interface here we come)! But where I really want this system to stand out is in the collection of system metrics. The system reports all kinds of things such as temperature, flow, etc., and these metrics are reported up to several hundred times per second per metric... and this runs 24/7.
My thought is to persist this in a MongoDB database, and do some real-time statistics on this. The "competition", if you will, seems to save this in a SQL Server database and allow the operators to export aggregated data to Excel, and do statistics in Excel.
What are the strategies for doing real-time statistics using MongoDB?
I would really like to provide instant feedback and monitoring based on these metrics, such as the average temperature over the last 24 hours, spikes, etc., and also enable alerts. There will not be much advanced statistics done on the server. If that is needed, I would enable export of data to a program such as SPSS.
Is MongoDB a good fit for that? I would love to use a Linux machine instead of a Windows machine with SQL Server and a WinForms control interface. The license fees alone are enough to put me off, although I know it probably isn't the case for the people buying the machinery.
This will not be placed in the cloud, but rather on a single server on the network. Next to the machine being operated, I will place a touch interface that through a browser will contact the node.js server to invoke PLC commands. There can be multiple machines that need controlling, and they would all be controlled by the same central node.js server.
The machinery is controlled by PLC controllers from http://beckhoff.com/.
I am not a complete novice when it comes to MongoDB, but I have never put anything I have made into production, and I wouldn't put MongoDB on my CV... yet!
EDIT: It seems that the $inc operator is the way to go. But what if I want both the daily and hourly averages as well as a continuous feed that updates a chart on screen with data every second using socket.io? Is it a good idea to update a document for each of the aggregates I need? I also really want to save every measurement, but maybe I could aggregate that on a per-second basis, so I don't store up to 1,000 records per second per metric?
MongoDB can definitely be used for your scenario. Look at http://www.slideshare.net/pstokes2/social-analytics-with-mongodb, http://docs.mongodb.org/manual/use-cases/pre-aggregated-reports/ or Real-time statistics: MySQL(/Drizzle) or MongoDB? for more on this topic.
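To make the pre-aggregated reports idea from those links concrete, here is a rough sketch (not a definitive recipe) of folding a single reading into a per-minute summary document with one upsert. The official mongodb node.js driver is assumed, and the metrics_minute collection name and field layout are made up for illustration:

// assumption: "db" is an already-connected database handle from the official
// mongodb node.js driver, and "metrics_minute" is a collection invented for this sketch
async function recordReading(db, metric, value) {
  const minute = new Date(Math.floor(Date.now() / 60000) * 60000);
  // one document per metric per minute: $inc keeps a running sum and count
  // (average = sum / count), while $max/$min expose spikes cheaply
  await db.collection('metrics_minute').updateOne(
    { metric: metric, minute: minute },
    { $inc: { sum: value, count: 1 }, $max: { max: value }, $min: { min: value } },
    { upsert: true }
  );
}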
What I am really looking for is the Aggregation Framework: http://docs.mongodb.org/manual/tutorial/aggregation-examples/
That gives me exactly the kind of stats that I would like to see. Use this to calculate sums and averages as I write, and then also allow for ad-hoc queries should they be needed.
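As a rough illustration only (the readings collection and its { metric, value, at } document shape are my assumptions, not anything prescribed by the docs), an hourly average/spike query with the node.js driver could look roughly like this:

// group readings by hour and compute average/max per hour - roughly the
// "average temperature over the last 24 hours, spikes" use case
async function hourlyStats(db, metric, since) {
  return db.collection('readings').aggregate([
    { $match: { metric: metric, at: { $gte: since } } },
    { $group: {
        _id: { day: { $dayOfYear: '$at' }, hour: { $hour: '$at' } },
        avgValue: { $avg: '$value' },
        maxValue: { $max: '$value' },
        samples: { $sum: 1 }
    } },
    { $sort: { '_id.day': 1, '_id.hour': 1 } }
  ]).toArray();
}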
For a little insight into performance, read this awesome blog post: http://devsmash.com/blog/mongodb-ad-hoc-analytics-aggregation-framework
Also, anyone else looking to do something like this should take a look at this post to see how to save the individual events. I don't need to keep data for longer than a week, for example, so a rolling log should be more than enough for me: http://blog.mongodb.org/post/172254834/mongodb-is-fantastic-for-logging
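Setting up such a rolling log is a one-liner with a capped collection; a minimal sketch, where the size is only a guess to be tuned against the real data rate:

// capped collections preserve insertion order and silently drop the oldest
// documents once the size cap is reached, giving the "rolling log" for free
async function createRollingLog(db) {
  await db.createCollection('raw_measurements', {
    capped: true,
    size: 4 * 1024 * 1024 * 1024 // assumption: ~4 GB holds roughly a week of readings
  });
}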
With this I am very close to having a really sweet setup, and I am beginning to feel confident that this is a good choice over MySQL or MSSQL.
Azure's SQL database feature for auto-tuning creates and drops indexes based on database usage. I've imported an old database into Azure which did not have comprehensive indexes defined, and it seems to have done a great job of reducing CPU & DTU usage over a relatively short period of time.
It feels wrong - but does this mean I can develop going forwards without defining indexes? Does anyone do this? The SSMS index editor is a pain and slow to work with. Not having to worry about or maintain indexes would speed up development time.
The auto-tuning is taking advantage of three things, Query Store, Missing Indexes and Machine Learning.
First, the last known good plan is a weaponization of the Query Store. This is in both Azure and SQL Server 2017. It will spot queries that have degraded performance after a plan change (and quite a few executions, not just one) and will revert back to that plan. If performance degrades, it turns that off. This is great. However, if you write crap code or have bad data structures or out of date statistics, it doesn't help very much.
The automatic indexes in Azure are using two things, missing index suggestions from the optimizer and machine learning on Azure. With those, if the missing index comes up a lot over a period of 12-18 hours (read this blog post on automating it), you'll get an index suggestion. It measures for another 12-18 hours and if that index helped, it stays, if not, it goes. This is also great. However, it suffers from two problems. First, same as before, if you have crap code, etc., this will only really help at the margins. Second, the missing index suggestions from the optimizer are not always the best index. For example, when I wrote the blog post, it identified a missing index appropriately, but it missed the fact that an INCLUDE column would have been even better than the index it suggested.
A human brain and eyeball are still going to be solving the more difficult problems. These automations take away a lot of the easier, low-hanging problems. Overall, that's a great thing. However, don't confuse it with a panacea for all things performance-related. It's absolutely not.
I am out of ideas and hope to get some useful input. I am using this question to compress my experiences and share them, hoping to inspire some database vendors to take the next step and treat modeling for graph databases as a first-class concern.
I've been evaluating graph database solutions usable from node.js for a few weeks. My use case is to save interactions between accounts on different social networks. The need is to use CPU and memory as efficiently as possible.
My most important requirements are:
in_memory (at least for indexing)
open source (and free to use)
JavaScript/Node.js performance as a first-class citizen
comfortable query and modeling language
Neo4J
I really like Cypher, so my best choice would be Neo4j.
But the major issue with Neo4j is that JavaScript access is non-native: it uses the REST API, which is about ten times (10x) slower than direct Java access. So I took a look at node-neo4j-embedded, but it has been inactive for more than two years, and its author doesn't seem to be active at all (a bad sign).
ArangoDB
The really nice core developers of ArangoDB answered my question about the internals. In short, JavaScript is a first-class citizen because native queries can be pushed out of JS. Looking at the open source benchmarks, I think that is fair. But I am afraid they didn't use node-neo4j-embedded for their benchmark; the benchmarks compare the REST APIs (edited because of #weinberger's comment). I wish they had compared the native APIs (maybe someone is curious enough to give it a try - let us know!). Update: as I noticed just now, OrientDB has answered the benchmark with a new node.js driver (using the command cache by starting the server with -Dcommand.cache.enabled=true -Dcommand.cache.minExecutionTime=3, which isn't fair, because it wasn't a query-cache benchmark!)
Because I like to use ArangoDB as a graph database I would have 3 choices (source: FAQ):
traverse JS objects
using AQL's graph functions
using the REST API
In general it isn't as comfortable as Cypher. And I am not sure how to compare the two or what the right way to model data is (something Neo4j explains very well). I'd love to have something like this for ArangoDB graphs. It feels like ArangoDB is focused on graph operations, while Neo4j better fits the need of using graphs when you have more relations than rows (the reason to use graphs instead of relations with joins).
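As a small concrete reference for option 2 (AQL's graph functions) driven from node.js with the arangojs driver - the graph name "social", the accounts collection and the start vertex are assumptions of mine for illustration, not something from the ArangoDB docs:

const { Database, aql } = require('arangojs');

// assumption: local server, default database, and an existing named graph "social"
const db = new Database({ url: 'http://127.0.0.1:8529' });

async function friendsOfFriends(startVertexId) {
  // interpolated values become bind parameters via the aql template tag
  const cursor = await db.query(aql`
    FOR v, e IN 1..2 OUTBOUND ${startVertexId} GRAPH "social"
      RETURN { account: v, via: e }
  `);
  return cursor.all();
}

// friendsOfFriends('accounts/alice').then(console.log);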
MongoDB
The document-based MongoDB isn't optimized for graph operations, but it has lately gotten an experimental in_memory storage engine. There are also some projects that are either in_memory or graph related, but nothing is really compelling. And judging by this discussion, it looks like MongoDB isn't what I'd like to use.
OrientDB
Because there is a comparison of OrientDB vs. MongoDB available (from OrientDB), I thought about using this one. "OrientDB has a hybrid Document-Graph engine" using SQL. I am a former PHP/MySQL expert, but where is the modeling part? Their chapter on working with graphs is not Cypher-like; it is like using SQL for graphs. There is nothing wrong with that, but having used Cypher before, I miss the modeling-like feeling.
If someone has gone through a modeling process with OrientDB and graphs, maybe you could write a tutorial like Neo4j has done.
Update: Regarding JavaScript access as a first-class citizen, there is news:
"In the next release the speed of this driver will be comparable to the native Java one" - the forked node.js driver has been fixed in the last few days.
Update: Before choosing OrientDB, one might want to read this article about some issues, and the discussions linked from there. The article touches a sensitive subject and should be approached with a critical mind.
LokiJS
Before looking at Neo4j, ArangoDB and MongoDB, I played around with a JavaScript-based in_memory database called LokiJS, which seems to follow the strategy of ignoring everything that slows down performance and efficiency. LokiJS is trying to replicate the Mongo style (see its roadmap). The major issue is its limited ability to scale. Of course it isn't a graph database, but it was an interesting solution at the beginning of my project. It also wasn't a great experience having to hunt down all the scattered documentation (maybe they should reboot with GitBook).
All in all, LokiJS is a very interesting project and I hope they keep moving forward!
LevelDB
Previously, when I wrote my degree paper, I looked at LevelDB. Remembering this while writing this post, I searched for LevelDB in_memory and got a promising result called MemDown (see also). I haven't tested this find, but maybe someone has experience working with and modeling on top of this solution. It might be the most efficient way if none of the others fit, because I would simply write a lightweight Cypher clone with the goal of staying as lightweight as I can.
Edit: Due to a comment, here is a link to LevelGraph. As an idea for implementing a Cypher parser on top of LevelGraph/LevelDB, your starting point would be to compare
Cypher:
CREATE (subject:Node {name: "a"})-[predicate:PREDICATE {name: "b"}]->(object:Node {name: "c"})
RETURN subject, predicate, object
LevelGraph:
// the same triple, expressed the LevelGraph way; assumes var db = levelgraph("yourdb")
var triple = { subject: "a", predicate: "b", object: "c" };
db.put(triple, function (err) {
  // stored; read it back with db.get({ subject: "a" }, callback)
});
Conclusion
As you have likely noticed, I am no superhero when it comes to graphs. But this is my initial dive into the topic and I'm trying to get an overview. I assume there are a lot of people out there who want to ask the same questions as me but don't have the time. I hope this post will help a lot of people and will evolve, through comments and answers, into a well-done overview of how to model data for graphs.
#editors: You are welcome.
#commenters: This is the result of my personal research - if you have been on a journey like mine, please answer with a short summary like I have done for each DB I've evaluated (and don't forget to address my 4 goals).
The idea of combining node-style performance through native features (e.g. streams) with a high-level query language like Cypher is actually quite neat.
What you likely won't get is any kind of low-level API, since that is rather rare among DB authors and, supposedly, not wanted in their designs. So long-running TCP connections should serve just fine.
cypher-stream seems to incorporate all of this, while (superficially judged) maintaining a good style.
Since you likely won't get any further with the search, I'd suggest sending him a pull request if any other features are needed :)
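A minimal usage sketch based on the cypher-stream README at the time; the URL and the query are placeholders, and the constructor has since moved toward Bolt URLs plus credentials, so check the current README before copying this:

// assumption: Neo4j running locally with the REST endpoint exposed on 7474
var cypher = require('cypher-stream')('http://localhost:7474');

cypher('MATCH (user:User) RETURN user LIMIT 10')
  .on('data', function (result) {
    console.log(result.user); // rows stream in as Neo4j produces them
  })
  .on('error', console.error)
  .on('end', function () {
    console.log('done');
  });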
You should take a look at GunDB: https://github.com/amark/gun
It's open source and has a very active and helpful lead developer.
Join us at https://gitter.im/amark/gun
We are currently using MongoDB as our primary store for a big online sales site, and right now we are focusing on scaling across multiple machines.
The site backend is written in node.js, and we are using Mongoose as the ODM.
I can see many blog posts about how awesome Cassandra is, and I am starting to think about switching to it. But I am still not sure whether this is a really good decision, because I haven't found any good ODM/ORM library for Cassandra and node.js (writing raw queries can be a pain, and writing a well-tested ORM/ODM can be a time-consuming task). So I am not sure how much benefit the switch would bring. We are using Elasticsearch as our search engine, and it works excellently in combination with MongoDB; I am asking myself whether it will do as well with Cassandra.
If you have any experience with this, it would be very helpful.
Thank you!
Cassandra is a very nicely designed database which can handle a lot of scenarios. MongoDB is also a really good DB engine. So let me just compare a couple of the main bullet points for you.
Always on system
Cassandra is really great when you need to provide 24x7 operations across multiple data centers. If you have more than one data center with multiple servers in each of them, then Cassandra is great for you. Cassandra can sync writes to more than one data center and maintain the desired data consistency across complex setups. Recovery and re-sync are also quite easy.
On the other hand, MongoDB is easy to operate. If you have one data center and only a couple of servers, it might be a perfect fit (although the global write lock might become a pain over time). In simple deployments it's easy to maintain and monitor.
Scalability
To continue the above statements - Cassandra is linearly scalable. There is literally no limit to how big the cluster can be. Your writes will always stay fast, while reads might become more complicated over time, depending on the structure of your data.
Denormalization of data
With Cassandra, your writes and reads can be extremely fast if you create a structure that reflects what you need to get out of your data. There is no query language (well, there is, but it's not exactly SQL) that you can use to reorganize your result set using aggregates, groupings, etc. Yes, some things are doable and some are not - that is very specific to the Cassandra data model. You will have to implement a lot of things on your own and write the results back to the DB - e.g. counters for aggregation, different groupings, etc.
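To make the "implement it yourself and write it back" point concrete, here is a hedged node.js sketch using the DataStax cassandra-driver and a hypothetical counter table (CREATE TABLE daily_views (page text, day text, views counter, PRIMARY KEY (page, day))):

const cassandra = require('cassandra-driver');

// assumption: a local single-node cluster and an existing "shop" keyspace
const client = new cassandra.Client({
  contactPoints: ['127.0.0.1'],
  localDataCenter: 'datacenter1',
  keyspace: 'shop'
});

async function recordPageView(page) {
  // counters are the closest thing to server-side aggregation: every other
  // grouping/rollup has to be computed in application code and written back
  const day = new Date().toISOString().slice(0, 10);
  await client.execute(
    'UPDATE daily_views SET views = views + 1 WHERE page = ? AND day = ?',
    [page, day],
    { prepare: true }
  );
}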
In comparison, MongoDB is easy to use, easier to learn and more flexible - both for development (as far as the learning curve/effort goes) and for implementing business logic (as far as time/effort is concerned). That is - kind of - the reason why there are ORM engines for MongoDB and only a couple of (very limited) ones for Cassandra.
To summarize - both DBs are really good... if you embrace their limitations. If you have only 100 GB of data and you need a flexible, easy-to-implement DB engine, I would stick with MongoDB; alternatively, take a look at RethinkDB, which has a very similar model and a much better (in my personal opinion) clustering/data-center replication implementation.
Cassandra is a great option if you will need to store TBs of data soon and deploy your apps across multiple data centers, while accepting the cost of the additional effort to implement the same features and maintain similar capabilities.
Don't take it personally that I have used the word only while describing your data set. Yes, it's not big - my company stores more than 20 TB these days... so yeah, 100GB is really not that much...
To stop everyone from pointing out that I should compare some other features or other differences between the two - this is just a rough, high-level overview of the things I consider relevant to the problem, not a full comparison or analysis. But feel free to point out what I have missed and I will be happy to include new stuff in this answer...
I now have a project involving an ad exchange service (something like Google DoubleClick) and I have to pick a highly scalable database. I'm thinking about MongoDB or Cassandra.
Cassandra:
fits our write-intensive system (+)
aggregations (very important for analytics) look hard to do (is there a good way? I just read the slides about Twitter's Rainbird, which seem good) (?)
I don't like Java much (-)
MongoDB:
seems easier to do analytics (has built-in aggregation functions) (+)
more RAM-consuming? (because it is document-oriented vs. Cassandra's key-value model) (?)
how does write performance compare to Cassandra? (?)
JavaScript shell and a natural fit with node.js (an important part of our project) (+)
http://pastebin.com/raw.php?i=FD3xe6Jt - this article makes me cautious (-)
Can you guys help me pick one, or answer some of my questions above?
Thanks.
I don't know about Cassandra, but MongoDB has some advantages for analytics: high concurrency, sharding, storing everything about an event in a single document, and features like upsert and $inc.
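As a rough illustration of the upsert/$inc point (the collection names and the per-day bucketing are my own assumptions, not a prescribed schema):

// assumption: "db" is a connected database handle from the official mongodb node.js driver
async function recordImpression(db, adId) {
  // keep the whole event in one document for later ad-hoc analysis...
  await db.collection('events').insertOne({ type: 'impression', adId: adId, at: new Date() });
  // ...and upsert a pre-aggregated per-ad, per-day counter that dashboards can read cheaply
  const day = new Date().toISOString().slice(0, 10);
  await db.collection('ad_stats').updateOne(
    { adId: adId, day: day },
    { $inc: { impressions: 1 } },
    { upsert: true }
  );
}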
For more detailed explanations check the following resources:
MongoDB Analytics - videos
http://blog.mongodb.org/post/171353301/using-mongodb-for-real-time-analytics
http://www.mongodb.org/display/DOCS/Use+Cases
http://www.slideshare.net/jrosoff/scalable-event-analytics-with-mongodb-ruby-on-rails
http://nosql.mypopescu.com/post/3508305955/fast-asynchronous-analytics-with-mongodb
http://blog.opengovernment.org/2011/02/24/fast-asynchronous-analytics-with-mongodb/
http://blog.10gen.com/post/4416876632/london-startup-ubervu-on-storing-5tb-of-data-in-mongodb
It depends a lot on your domain; in most cases one would probably choose Mongo.
For example http://square.github.com/cube/ is built on Mongo.
Cube is an open-source system for visualizing time series data, built on MongoDB, Node and D3. If you send Cube timestamped events (with optional structured data), you can easily build realtime visualizations of aggregate metrics for internal dashboards. For example, you might use Cube to monitor traffic to your website, counting the number of requests in 5-minute intervals:
Most use cases for Cassandra draw on the need for high availability, which is its main feature AFAIK. Your needs seem to be centered around having a cheap way to shove queryable data into a scale-out DB, and Mongo almost matches an RDBMS in regards to querying. Mongo is also probably easier to deal with.
I think Cassandra is a good fit for this problem.
You don't need to know much Java to get it running (other than installing Java), as long as there is a client library in your chosen language.
Cassandra 0.8+ now has atomic counter support - perfect for impressions/click tracking.
You could also run Hadoop on top of Cassandra, giving you a proven platform for writing map/reduce jobs to do analytics/aggregations and store the results back in Cassandra too.
Check out this slideshow about cassandra and hadoop: http://www.slideshare.net/jeromatron/cassandrahadoop-4399672
I hope that helps.
I am building a tool that searches people based on a number of attributes. The values for these attributes are scattered across several systems.
As an example, dateOfBirth is stored in a SQL Server database as part of system ABC. That person's sales region assignment is stored in some horrible legacy database. Other attributes are stored in a system only accessible over an XML web service.
To make matters worse, the legacy database and the web service can be really slow.
What strategies and tips should I consider for implementing a search across all these systems?
Note: Although I posted an answer, I'm not confident it's a great answer. I don't intend to accept my own answer unless no one else gives better insight.
You could consider using an indexing mechanism to retrieve and locally index the data across all the systems, and then perform your searches against the index. Searches would be an awful lot faster and more reliable.
Of course, this just shifts the problem from one part of your system to another - now your indexing mechanism has to handle failures and heterogeneous systems, but that may be an easier problem to solve.
Another factor is how often the data changes. If you have to query data in real-time that goes stale very quickly, then indexing may not be practical.
If you can get away with a restrictive search, start by returning a list based on the search criteria from the fastest data source. Then join those records up with the other systems and remove the records that don't match the search criteria.
If you have to implement OR logic, this approach is not going to work.
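For AND-style criteria, the flow could look roughly like the sketch below - written in JavaScript purely for illustration, with entirely hypothetical client functions standing in for the three systems named in the question:

// querySqlServer, getRegionAssignment and fetchWebServiceAttributes are placeholders
// for whatever clients actually talk to system ABC, the legacy DB and the XML service
async function search(criteria) {
  // 1. narrow the candidate set using the fastest source first
  const candidates = await querySqlServer({ dateOfBirth: criteria.dateOfBirth });

  // 2. enrich candidates from the slow systems and drop anything that no longer matches
  const results = [];
  for (const person of candidates) {
    const region = await getRegionAssignment(person.id); // slow legacy database
    if (criteria.salesRegion && region !== criteria.salesRegion) continue;
    const extra = await fetchWebServiceAttributes(person.id); // slow XML web service
    results.push({ ...person, region, ...extra });
  }
  return results;
}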
While not an actual answer, this might at least get you partway to a workable solution. We had a similar situation at a previous employer - lots of data sources, different ways of accessing those data sources, different access permissions, military/government/civilian sources, etc. We used Mule, which is built around the Enterprise Service Bus concept, to connect these data sources to our application. My details are a bit sketchy, as I wasn't the actual implementor, just an integrator, but what we did was define a channel in Mule. Then you write a simple integration piece to go between the channel and the data source, and between the application and the channel. The integration piece does the work of making the actual query and formatting the results, so we had a generic SQL integration piece for accessing a database, and for things like web services we had some base classes that implemented common functionality, so the actual customization of the integration pieces was a lot less work than it sounds like. The application could then query the channel, which would handle accessing the various data sources, transforming the results into a normalized bit of XML, and returning them to the application.
This had a lot of advantages for our situation. We could include new data sources for existing queries by simply connecting them to the channel - the application didn't have to know or care what data sources where there, as it only looked at the data from the channel. Since data can be pushed or pulled from the channel, we could have a data source update the application when, for example, it was updated.
It took a while to get it configured and working, but once we got it going, we were pretty successful with it. In our demo setup, we ended up with 4 or 5 applications acting as both producers and consumers of data, and connecting to maybe 10 data sources.
Have you thought of moving the data into a separate structure?
For example, Lucene stores the data to be searched in a schema-less inverted index. You could have a separate program that retrieves data from all your different sources and puts it in a Lucene index. Your search could work against this index, and the search results could contain a unique identifier and the system it came from.
http://lucene.apache.org/java/docs/
(There are implementations in other languages as well)
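The same idea works with any Lucene-based store. Purely as an illustration, here is a hedged node.js sketch that pushes merged records into Elasticsearch (which is built on Lucene) using a recent @elastic/elasticsearch client; the "people" index and the loadAllPeople() helper are invented for this example:

const { Client } = require('@elastic/elasticsearch');

const es = new Client({ node: 'http://localhost:9200' }); // assumption: local, unsecured cluster

async function rebuildIndex(loadAllPeople /* pulls merged records from all source systems */) {
  const people = await loadAllPeople();
  for (const person of people) {
    await es.index({
      index: 'people',
      id: `${person.sourceSystem}:${person.id}`, // remember which system the record came from
      document: person
    });
  }
}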
Have you taken a look at YQL? It may not be the perfect solution, but it might give you a starting point to work from.
Well, for starters I'd parallelize the queries to the different systems. That way we can minimize the query time.
You might also want to think about caching and aggregating the search attributes for subsequent queries in order to speed things up.
You have the option of creating an aggregation service or middleware that aggregates all the different systems so that you can provide a single interface for querying. If you do that, this is where I'd apply the previously mentioned caching and parallelization optimizations.
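A bare-bones sketch of such an aggregation layer - again with hypothetical per-system clients and a deliberately naive in-memory cache:

const cache = new Map(); // naive cache; real code needs expiry and invalidation

async function getPerson(personId, clients) {
  if (cache.has(personId)) return cache.get(personId);

  // hit all three systems at once instead of one after another
  const [core, region, extra] = await Promise.all([
    clients.abc.fetch(personId),        // SQL Server system
    clients.legacy.fetch(personId),     // slow legacy database
    clients.webService.fetch(personId)  // slow XML web service
  ]);

  const person = { ...core, region: region, ...extra };
  cache.set(personId, person);
  return person;
}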
However, with all of that, you will need to weigh up the development time/deployment time/long-term benefits of the effort against migrating the old legacy database to a faster, more modern one. You haven't said how tied into other systems those databases are, so it may not be a very viable option in the short term.
EDIT: In response to data going out of date: you can consider caching if you don't need the data to always match the database in real time. Also, if some data doesn't change very often (e.g. dates of birth) then you should cache it. If you employ caching, you could make your system configurable as to which tables/columns to include or exclude from the cache, and you could give each table/column a configurable cache timeout with an overall default.
Use Pentaho/Kettle to copy all of the data fields that you can search on and display into a local MySQL database
http://www.pentaho.com/products/data_integration/
Create a batch script to run nightly and update your local copy. Maybe even every hour. Then, write your query against your local MySQL database and display the results.