How to quickly build a large-scale analytics server? - node.js

I need to build an analytics server for large scale use (seven figures and up) quickly and on the cheap.
Piwik would be the easy choice, but from what I've gathered so far, Piwik is rather hard to scale and can require fairly hefty servers to handle the load.
My second idea would be to create a quick and dirty Node.js server which just pushes everything to Amazon DynamoDB, so that I can start gathering data from day one and build the UI later on. That would be quick to create and to scale (vertically and horizontally). However, I'm wondering whether DynamoDB is the right choice for such a use case (gather data, generate reports)?
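For illustration, a minimal sketch of what such a collector could look like (the table name, key schema and fields are assumptions, nothing is decided yet):

// quick-and-dirty event collector: accept a JSON POST, write it straight to DynamoDB
var http = require('http');
var AWS = require('aws-sdk');

var dynamo = new AWS.DynamoDB.DocumentClient({ region: 'us-east-1' });

http.createServer(function (req, res) {
  var body = '';
  req.on('data', function (chunk) { body += chunk; });
  req.on('end', function () {
    var event;
    try {
      event = JSON.parse(body); // e.g. { siteId: 'abc', path: '/home', ua: '...' }
    } catch (e) {
      res.writeHead(400);
      return res.end();
    }
    dynamo.put({
      TableName: 'events',      // hypothetical table
      Item: {
        siteId: event.siteId,   // hash key (assumption)
        ts: Date.now(),         // range key (assumption)
        payload: event
      }
    }, function (err) {
      res.writeHead(err ? 500 : 202);
      res.end();
    });
  });
}).listen(3000);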

I'm using DynamoDB professionally and would not use it for your application.
DynamoDB truly has tons of constraints. Among them: you can have only one hash_key and, optionally, one range_key.
You can do some "analytics" on items grouped under a given hash_key using query, but really nothing fancy. For complex queries, you would have to use scan or EMR, which are slow and expensive and have a couple of drawbacks due to throttling.
Nonetheless, NoSQL seems a good choice, at least for the prototyping stage of your application. But I would recommend MongoDB instead. You can index any field, run complex queries, and not worry about throughput throttling. Sharding and replication are not too hard to set up.
MongoDB has a strong ecosystem and community, which DynamoDB does not have (yet), as it is much younger. MongoDB also has hosted offerings, which would let you bootstrap your application as quickly as you would with DynamoDB.
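To give a rough idea, gathering data and building reports on top of it is straightforward (a sketch with hypothetical collection and field names, using the Node.js MongoDB driver; ts is assumed to be a Date):

var MongoClient = require('mongodb').MongoClient;

MongoClient.connect('mongodb://localhost:27017/analytics', function (err, db) {
  if (err) throw err;
  var events = db.collection('events');

  // index any field you plan to report on
  events.createIndex({ siteId: 1, ts: 1 }, function (err) {
    if (err) throw err;

    // example report: page views per day for one site
    events.aggregate([
      { $match: { siteId: 'abc', type: 'pageview' } },
      { $group: { _id: { $dayOfMonth: '$ts' }, views: { $sum: 1 } } },
      { $sort: { _id: 1 } }
    ]).toArray(function (err, report) {
      console.log(err || report);
      db.close();
    });
  });
});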

Piwik scales to millions of pages and tens of thousands of tracked websites per month. See their docs: http://piwik.org/docs/optimize/ and: http://piwik.org/blog/2012/07/piwik-high-scale-performance-report-as-of-july-2012/

Related

MongoDB Atlas database vs MongoDB local, which is best for SaaS in terms of transaction speed (Querying)

I have developed an automation web tool (SaaS app). Right now I'm using a MongoDB Atlas cloud database with an Amazon EC2 xlarge instance (quad-core, EBS-enabled, 16 GB RAM). Is Atlas the best option, or is a local MongoDB better, and if so, why? Which will give me better performance? I could use some serious help here.
MongoDB:
Because it is a non-relational database, it is much easier to model the database architecture, which shortens development time. When you work with JavaScript, or with JSON objects and collections, MongoDB keeps the service layer for queries much lighter and helps application performance. Also, if you do not know the console commands, you can manage the database graphically with a desktop administration tool. The learning curve is much shorter, which helps the project scale. On the development side, this shortens delivery times to clients, which makes projects much more feasible in terms of deadlines.
PROS:
Because documents and queries are JSON, query response handling is fast, and you can build the query logic directly in the same service.
You can install a local database environment, something that real-time non-relational databases such as Firebase do not allow; a local environment is paramount since you can work without relying on the internet.
Forming collections in Mongo is relatively simple; you do not need to know a query language to work with it, since it has a simple graphical environment that lets people who are not experts with the console manage databases.
CONS:
MongoDB seems to be one of the most complete tools in its field; I believe it has all the features a non-relational database should have.
Perhaps because it is a relatively new tool, there are still very few MongoDB experts in the field.
To Summarize:
MongoDB is a better fit for large projects with great scalability. It also lets you work quite comfortably on projects based on languages such as JavaScript, Angular, TypeScript, or C#. I believe its performance is much better with technologies that handle similar, logic-oriented programming concepts. If you use languages like Java or PHP, for example, it is better to work with relational databases like Postgres or MySQL.
MongoDB-atlas:
My department at the company I work at was using a MongoDB cluster that we set up on our own servers. It reached the point where it became hard to manage and to scale. MongoDB Atlas came to our rescue with the ability to scale, free of management, which saves a lot of effort for us.
PROS:
No infrastructure on our side. Free of management.
Easy to scale up and down.
It has strong authentication and encryption features that make sure that developers don't get lazy and leave out data in the open by leaving their servers unguarded.
CONS:
Billing could be more granular.
The alerting system could be more specific.
One of the drawbacks of MongoDB-Atlas is the cost. Hopefully more competition will bring down the costs over time.
To Summarize:
I would recommend MongoDB Atlas to any person or company that has a significant need for a NoSQL database and does not want to manage its own infrastructure. Using MongoDB Atlas can significantly reduce your management time and cost, which frees up valuable resources for other tasks. It also suits a smaller company, as MongoDB Atlas scales up and down very quickly.
Hopefully I answered your question, Good Luck!

web real-time analytics dashboard: which technologies should we use? (node/django, cassandra/mongodb...)

we want to develop a dashboard to analyze geospatial data.
This is a small and close approach to what we want to do: http://adilmoujahid.com/images/data-viz-talkingdata.gif
Our main concerns are about the backend technologies to be used. (front will be D3.js, DC.js, leaflet.js...)
Between Django and Node.js, we think we will use Node.js, because we've read that it's faster than Django for this kind of task. But we are not sure, and we are open to ideas.
About Mongo vs Cassandra, though, we are quite confused. Our data is mostly structured, so storing it in tables, as Cassandra does, would make it easy to manage, and Cassandra also seems to have better performance. However, we also have IoT device data, with lots of real-time GPS locations...
Which suggestions can you give to us to achieve our goal?
TL;DR Summary;
Dashboard with hundreds of simultaneous users.
Stored data will be mostly structured text/numbers, but will include also images, GPS-arrays, IoT sensors, geographical data (vector-polygons & rasters)
Databases will receive high write load coming from sensors.
Dashboard performance is very important. It's more important to read data in real time than to keep it uncorrupted/secure.
Most of the math will be calculated in the client's browser; the server will try to avoid mathematical operations.
Disclaimer: I'm a DataStax employee so I'll comment on the Cassandra piece.
Cassandra is a good choice for this if your dashboard can be planned around a set of known queries. If those users will be doing ad-hoc queries directly to the database from the dashboard, you'll want something with a little more flexibility like ElasticSearch or (shameless plug) DataStax Search. Especially if you expect the queries/database to handle some of the geospatial logic.
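To make the "set of known queries" point concrete, a typical pattern is to model one table per dashboard query and read it with a prepared statement (a sketch only; the keyspace, table and field names are illustrative, using the Node.js cassandra-driver):

var cassandra = require('cassandra-driver');

// assumes a table created elsewhere, shaped for one known query:
//   CREATE TABLE readings (device_id text, day text, ts timestamp, lat double, lon double, value double,
//     PRIMARY KEY ((device_id, day), ts)) WITH CLUSTERING ORDER BY (ts DESC);
var client = new cassandra.Client({
  contactPoints: ['127.0.0.1'],
  localDataCenter: 'datacenter1',
  keyspace: 'dashboard'
});

// the dashboard query is fixed in advance: "latest readings for one device on one day"
var query = 'SELECT ts, lat, lon, value FROM readings WHERE device_id = ? AND day = ? LIMIT 100';

client.execute(query, ['sensor-42', '2018-05-01'], { prepare: true })
  .then(function (result) { console.log(result.rows); })
  .catch(function (err) { console.error(err); });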
JaguarDB has very strong support for geospatial data (2D and 3D). It allows you to store multiple measurements per point location, while other databases support only one measurement (pointm). Many complex queries, such as Voronoi polygons and convex hulls, are also supported. It is open source, distributed and sharded, has multi-column indexes, etc.
Concerning PostgreSQL and Cassandra, is there much difference in RAM/CPU/disk usage between them?
Our use case does not require transactions, it will run on a single node, and we will have IoT devices writing data up to 500 times per second. However, I've read that geographical data works better with PostGIS than with Cassandra...
Given this use case, do you recommend Cassandra or PostGIS?

Benefits of using a hosted search service over building your own

I'm building a B2B Node app which has heavily related data models. We currently have our own search queries, but as we scale some of the queries appear to be becoming sluggish.
We will need to support multilingual search as well as content-based searches (searching matching content within related data).
The queries are growing more and more complicated (each has multiple joins on joins on joins) and I'm now considering a hosted search tool such as Algolia.
Given my concerns below, why should I use a hosted cloud search service rather than continue building my own queries?
Data privacy is important
Data is hosted in our own postgres DB - integrations with that are important (e.g.: will I now need to manually maintain our DB data and data in Algolia?)
Speed will be important, but not so much now
Must be able to do content-based searches across multiple languages
We are a tiny team of devs now, so dev resource time is vital
What other things should I be concerned about that can help make a decision in search capabilities?
Regarding maintenance of both the DB and the cloud data, it seems it's as simple as getting all the data, caching it, and storing it in the cloud:
// algoliasearch v3-style client (the app id and API key are placeholders)
var algoliasearch = require('algoliasearch');
var client = algoliasearch('YOUR_APP_ID', 'YOUR_ADMIN_API_KEY');
var index = client.initIndex('contacts');

var contactsJSON = require('./contacts.json');

// push the whole export to the hosted index
index.addObjects(contactsJSON, function (err, content) {
  if (err) {
    console.error(err);
  }
});
Search services like Algolia or self-hosted Elasticsearch/Solr operate as full-text search engines, not as relational DB queries.
But it sounds like the bottleneck is the continual re-joining. If you can make your relational data act like a full-text document store, that could be a more efficient kind of index (pre-joined, in a sense).
You might also look into views, or a data warehouse (maybe star schema).
But if you are going the search route maybe investigate hosting your own elasticsearch.
You could specify database, schema, sql, index, query details if you want more help.
Full Disclosure: I founded a company called SearchStax on the premise that companies and developers should not spend time setting up, managing, scaling or building tools for the search infrastructure (ops) - they are better off investing time of their employees into building value for the company, whether that be features, capabilities, product or customers.
Open source search solutions built on top of Lucene (Apache Solr / Elasticsearch) have what you need now, and what you might need in the near future, from a search-engine capability perspective. Find a mature service provider / as-a-service company that specializes in open source search and let them deal with it all. It may look like a small effort right now, but it's probably not worth your devs' time to spend it on the operations side.
For your concerns mentioned above:
Data privacy is important
Your concern around Privacy and Security are addressable. There are multiple ways you can secure your Solr environment and the right MSP or a Managed Solution provider should be able to address those.
a. Security at the transport layer can be addressed by SSL certificates. All the data going over the wire is encrypted.
b. IP Filtering and User Based Authentication should address who has access to what. Solr-as-a-Service offering by Measured Search supports both.
c. Security at rest can be addressed in multiple ways - OS level / File encryption, but you can even go further by ensuring not even your services provider has access to that data by using Searchable Encryption technology.
Privacy concerns are all addressed by Terms & Conditions - I am sure your legal department will address that from a Service Provider's perspective.
Data is hosted in our own postgres DB - integrations with that are important
Solr provides the ability to import data directly from a traditional relational database (MySQL, Postgres, Oracle, etc.) through its Data Import Handler (DIH). You can either use that, so Solr pulls data periodically, or write your own simple script to push data through the Solr APIs.
If you are hosted in the cloud (AWS), a tunnel can be created so only the Solr deployments have the ability to pull data from your servers and your database servers are not exposed to the world, if you choose to go the DIH route.
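As an illustration of the "write your own simple script" option (a sketch only; the pg and request modules, the Solr core name, and the table/fields are assumptions):

var pg = require('pg');
var request = require('request');

var pool = new pg.Pool({ connectionString: 'postgres://localhost/mydb' });

pool.query('SELECT id, name, description FROM products', function (err, result) {
  if (err) throw err;

  // Solr's JSON update handler accepts an array of documents
  request.post({
    url: 'http://localhost:8983/solr/products/update?commit=true',
    json: result.rows
  }, function (err, res, body) {
    console.log(err || body);
    pool.end();
  });
});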
Speed will be important, but not so much now
Solr is built for search speed - I don't think that's where your problems are going to be. With a service offering like Measured Search's, you can spin up a cluster in any data center supported by AWS or Azure and make sure your search deployments are close to your application servers, so the latency overhead is minimal.
Must be able to do content-based searches across multiple languages
Yes, Solr supports that. More than 30 languages.
We are a tiny team of devs now, so dev resource time is vital
I am biased here, but I would not have my developers spend much time on operations; I'd rather let them focus on what they do best - building great product capabilities that push the limits and deliver business value.
If you are interested in doing a comparison and ROI of doing it yourself vs using a solr-as-a-service like offered by SearchStax, check this paper out - https://www.searchstax.com/white-papers/why-measured-search-is-better-than-diy-solr-infrastructure/

How does AWS SimpleDB differ from Azure DocumentDB? How do both differ from ElasticSearch?

In terms of
scalability,
performance,
maintenance,
Ease of use / Learning curve
cost,
In order of significance, but I wouldn't mind a general answer, as I appreciate I'm probably asking for too much :)
Thanks
EDIT: I'm looking for a database that will serve as the single authoritative data store, and I need all attributes of the stored documents to be indexed for various business reasons. Therefore I know that other solutions won't do what I'm looking for.
tl;dr; If you are using JavaScript and building browser apps, node.js and DocumentDB are a match made in heaven. If you are using .NET and/or other Azure services, then DocumentDB is favored. If you are using other AWS services, then SimpleDB might be better.
I know that questions like this are not ideal for Stack Overflow, but I often see value in answers like this and my most popular answer on SO is essentially informed opinion backed by evidence. I have not used SimpleDB but I looked into it before deciding on DocumentDB. I rejected it pretty quickly... although I did give AWS Lambda a serious look before deciding on DocumentDB. So:
scalability. DocumentDB has a very straightforward and explicit scaling model -- add more collections if you need either more space or more operations per second. SimpleDB's scaling model is similar but less straightforward, since you add domains, which are overloaded to provide both type separation (think tables) and scalability. You can scale either to whatever you need.
performance. Since I never built anything on it, I can't say anything about SimpleDB's performance. However, I've been very impressed with the performance of DocumentDB. Latency is less than 10ms for simple id-based reads, and I get impressive latency and throughput for queries. The DocumentDB implementation of our current app returns complex n-dimensional aggregations (done in stored procedures on DocumentDB using documentdb-lumenize) in 1/4 the time of the functionally equivalent MongoDB/node.js implementation. You'd have to do your own performance testing on your actual application to get a definitive answer here.
maintenance. Both are much more hands off than traditional data stores. There just aren't that many knobs to turn maintaining either of them. SimpleDB geographically distributes your data by default. You'd have to do the equivalent manually in DocumentDB. Possible, but harder. DocumentDB has good import/export tools and their backup solution is about to be significantly upgraded.
ease of use / learning curve. If you are a JavaScript programmer, then DocumentDB has a lot to recommend it. DocumentDB uses JSON natively. SimpleDB uses XML. DocumentDB has ACID-enabling stored procedures written in JavaScript. You'd need to combine SimpleDB with something else (Lambda maybe, but the XML/JavaScript mismatch would make this less than ideal) to get the equivalent. Both allow you to use SQL, but DocumentDB also allows for JavaScript-native queries.
There is one huge mindset hurdle that you will have to get over in order to be successful with DocumentDB. Despite the fact that both scale by adding more domains/collections, SimpleDB domains are conceptually closer to tables. The word choice of "collection" by the DocumentDB team is unfortunate, since collections are more akin to partitions and should not be thought of as tables. The hard part is getting used to the idea that you store all of your different data types in the same collection. Once you get over that, I find DocumentDB's approach refreshing and incredibly flexible. I can efficiently model inheritance and type mix-ins. Collections, nay partitions, have one purpose -- scalability. Domains are used for both scalability and data-type separation, which is actually harder in practice.
cost. Not much to say here. Both allow you to scale your cost gradually. For really small implementations, DocumentDB is probably more expensive, since the smallest unit of usage is a single collection, which is $25/month minimum. You'd have to do your own modeling/what-if analysis to determine which would be less expensive for you. Note that Azure is being very aggressive in general and is even pushing AWS to lower prices in some cases. My gut is that they would be roughly equal in cost for most applications.
Other thoughts:
You wrote, "I need all attributes of the documents stored to be indexed". One really nice feature of DocumentDB is that you can specify the size of your indexes By default, every field is indexed into a 3-byte per field hash index, which is highly space efficient. I do not know if SimpleDB has the equivalent.
This is a bit like comparing apples to oranges. I consider DocumentDB to be like MongoDB or CouchDB in its data model and like VoltDB in its execution model (although VoltDB sprocs are written in Java). SimpleDB feels more like a simple XML object store. If you already have a big XML mindset, then it might be easier, but I think there are more folks using JSON today than XML.
Writing ACID-enabling stored procedures in JavaScript is a killer feature that only DocumentDB has. Some say the days of stored procedures are over and that you should put all such logic in your application server layer. If you are implementing a simple CRUD API, that may be, but almost every application requires some sort of transaction where more than one row is changed at a time. This is mind-bogglingly hard to do correctly without transaction support in your data store. Even if you do implement the equivalent of transactions with your NoSQL database, the overhead of the implementation eats away any development/performance/scalability advantage you got by choosing NoSQL over SQL.
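To make that concrete, here is a sketch of what such a stored procedure can look like (the "transfer" example and all names are illustrative); it runs server-side against a single collection, and throwing an error rolls the whole thing back:

function transferPoints(fromId, toId, amount) {
  var collection = getContext().getCollection();
  var link = collection.getSelfLink();
  var query = 'SELECT * FROM c WHERE c.id IN ("' + fromId + '", "' + toId + '")';

  var accepted = collection.queryDocuments(link, query, function (err, docs) {
    if (err) throw err;
    if (docs.length !== 2) throw new Error('document not found'); // aborts the transaction

    var from = docs[0].id === fromId ? docs[0] : docs[1];
    var to = docs[0].id === toId ? docs[0] : docs[1];
    from.points -= amount;
    to.points += amount;

    // both replaces commit together, or not at all
    collection.replaceDocument(from._self, from, function (err) {
      if (err) throw err;
      collection.replaceDocument(to._self, to, function (err) {
        if (err) throw err;
        getContext().getResponse().setBody('ok');
      });
    });
  });
  if (!accepted) throw new Error('query not accepted');
}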
DocumentDB's user defined functions and triggers (also written in JavaScript) might be useful, although I believe the trigger implementation is crippled at this moment in time and I haven't found a use for UDFs myself yet.
DocumentDB has built-in attachment support. You need to integrate manually with S3 for the equivalent on AWS.
DocumentDB has geo indexing and operators.
SimpleDB's 1K per document limit is a serious limitation. This tells me that it's designed mostly for logging or as an index to S3 and not a full-fledged document store. The limit for DocumentDB is 512K.
If the comparison to SimpleDB is apples to oranges, then the comparison to ElasticSearch is apples to fire engines. My impression of ElasticSearch is that it's all about full-text search and analytics. I don't think it's space/execution/API efficient enough to serve as a primary transactional store. Built on Lucene, it was not designed with the reliability/durability to be your primary store. Further, even when hosted, it's more of an IaaS offering, whereas DocumentDB and SimpleDB are true PaaS offerings. The maintenance burden will be much higher with ElasticSearch.

How does Azure DocumentDB scale? And do I need to worry about it?

I've got an application that's outgrowing SQL Azure - at the price I'm willing to pay, at any rate - and I'm interested in investigating Azure DocumentDB. The preview clearly has distinct scalability limits (as described here, for instance), but I think I could probably get away with those for the preview period, provided I'm using it correctly.
So here's the question I've got. How do I need to design my application to take advantage of the built-in scalability of Azure DocumentDB? For instance, I know that with Azure Table Storage - that cheap but awfully limited alternative - you need to structure all your data in a two-part hierarchy: PartitionKey and RowKey. Provided you do that (which is nigh on impossible in a real-world application), ATS (as I understand it) moves partitions around behind the scenes, from machine to machine, so that you get near-infinite scalability. Awesome, and you never have to think about it.
Scaling out with SQL Server is obviously much more complicated - you need to design your own sharding system, deal with figuring out which server the shard in question sits on, and so forth. Possible, and done right quite scalable, but complex and painful.
So how does scalability work with DocumentDB? It promises arbitrary scalability, but how does the storage engine work behind the scenes? I see that it has "Databases", and each database can have some number of "Collections", and so forth. But how does its arbitrary scalability map to these other concepts? If I have a SQL table that contains hundreds of millions of rows, am I going to get the scalability I need if I put all this data into one collection? Or do I need to manually spread it across multiple collections, sharded somehow? Or across multiple DB's? Or is DocumentDB somehow smart enough to coalesce queries in a performant way from across multiple machines, without me having to think about any of it? Or...?
I've been looking around, and haven't yet found any guidance on how to approach this. Very interested in what other people have found or what MS recommends.
Update: As of April 2016, DocumentDB has introduced the concept of a partitioned collection which allows you scale-out and take advantage of server-side partitioning.
A single DocumentDB database can scale practically to an unlimited amount of document storage partitioned by collections (in other words, you can scale out by adding more collections).
Each collection provides 10 GB of storage and a variable amount of throughput (based on the performance level). A collection also provides the scope for document storage and query execution, and is the transaction domain for all the documents contained within it.
Source: http://azure.microsoft.com/en-us/documentation/articles/documentdb-manage/
Here's a link to a blog post I wrote on scaling and partitioning data for a multi-tenant application on DocumentDB.
With the latest version of DocumentDB, things have changed. There is still the 10GB limit per collection but in the past, it was up to you to figure out how to split up your data into multiple collections to avoid hitting the 10 GB limit.
Instead, you can now specify a partition key and DocumentDB handles the partitioning for you. For example, if you have log data, you might partition on the date value in your JSON document, so that each day a new partition is created.
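For illustration, creating such a partitioned collection with the documentdb Node SDK looks roughly like this (the endpoint, key, names and throughput figure are placeholders):

var DocumentClient = require('documentdb').DocumentClient;
var client = new DocumentClient('https://myaccount.documents.azure.com:443/', { masterKey: 'KEY' });

client.createCollection('dbs/logsdb', {
  id: 'logs',
  partitionKey: { paths: ['/date'], kind: 'Hash' }   // e.g. one logical partition per day
}, { offerThroughput: 10100 }, function (err, coll) {
  if (err) throw err;
  // the partition key value travels inside the document itself
  client.createDocument(coll._self, { id: 'evt1', date: '2016-04-01', msg: 'hello' }, function (err) {
    console.log(err || 'stored');
  });
});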
You can fan out queries like this - http://stuartmcleantech.blogspot.co.uk/2016/03/scalable-querying-multiple-azure.html
