I have no experience with geolocation-based apps and want to build one with a backend written in Node.js and running on Google Cloud.
My main problem is how to design the database and which one to use (Bigtable or Datastore). The main query is to find places within a given radius of a given location. I have read a lot about geohashes, but the Node.js libraries for them aren't very good at the moment.
So what would you recommend for choosing and designing the database?
If you want to store the data in relational format, perform frequent
joins between location/co-ordinates and the amount of data being
processed is small (<50 GB), then go for Google Cloud SQL.
Cloud Bigtable is ideal for storing very large amounts of
single-keyed data with very low latency. It has great integration
services with most of the Apache projects.
If there is no requirement for the data to be in relational format,
and frequent inserts and updates are required on huge amounts of
data, go for Google Cloud Datastore. Its querying model is somewhat
different and can be difficult for a newcomer to understand.
You can also use Google BigQuery, which processes TBs of data within a
few seconds, if frequent inserts and updates are not required.
It is more of an analytical data store than a transactional database.
Have a look at the following URL for better insights: https://cloud.google.com/storage-options/
Google has also announced Cloud Spanner, which is a relational
database service that offers great consistency and speed (still in
beta). It is at an early stage, but could change how people think
about SQL vs NoSQL.
All of the above databases have querying libraries written for NodeJS.
GeoMesa, an Apache licensed open source suite of tools that enables large-scale geospatial analytics, works with Cloud Bigtable. I don't know how well this will interact with node.js, but it's worth considering a framework like GeoMesa since it will likely enable you to focus more on your core product.
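Whichever store is chosen, the "places within a radius" query from the question usually boils down to two steps: a cheap bounding-box range filter (the part a database index can serve) followed by an exact distance check. Here is a minimal sketch in plain Node.js, with no geohash library involved. The function names are hypothetical, coordinates are assumed to be decimal degrees, and the bounding box assumes the search circle does not reach a pole or cross the antimeridian.

```javascript
// Hypothetical helpers; coordinates are { lat, lng } in decimal degrees.
const EARTH_RADIUS_KM = 6371; // mean Earth radius

const toRad = (deg) => (deg * Math.PI) / 180;
const toDeg = (rad) => (rad * 180) / Math.PI;

// Great-circle distance between two points (haversine formula).
function haversineKm(a, b) {
  const dLat = toRad(b.lat - a.lat);
  const dLng = toRad(b.lng - a.lng);
  const h =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(a.lat)) * Math.cos(toRad(b.lat)) * Math.sin(dLng / 2) ** 2;
  return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(h));
}

// Bounding box enclosing the search circle. Assumes the circle stays away
// from the poles and the antimeridian.
function boundingBox(center, radiusKm) {
  const angular = radiusKm / EARTH_RADIUS_KM; // radius as an angle in radians
  const dLat = toDeg(angular);
  const dLng = toDeg(Math.asin(Math.sin(angular) / Math.cos(toRad(center.lat))));
  return {
    minLat: center.lat - dLat,
    maxLat: center.lat + dLat,
    minLng: center.lng - dLng,
    maxLng: center.lng + dLng,
  };
}

// Step 1: range filter on the box. Step 2: exact distance check.
function placesWithin(places, center, radiusKm) {
  const box = boundingBox(center, radiusKm);
  return places
    .filter(
      (p) =>
        p.lat >= box.minLat && p.lat <= box.maxLat &&
        p.lng >= box.minLng && p.lng <= box.maxLng
    )
    .filter((p) => haversineKm(center, p) <= radiusKm);
}
```

In a real backend the first filter would be pushed down to the database as a range query on indexed latitude/longitude (or on a geohash prefix), and only the second filter would run in Node.js.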
we want to develop a dashboard to analyze geospatial data.
This is a small and close approach to what we want to do: http://adilmoujahid.com/images/data-viz-talkingdata.gif
Our main concerns are about the backend technologies to be used. (front will be D3.js, DC.js, leaflet.js...)
Between Django and node.js, we think we will use node.js, because we've read that it's faster than Django for this kind of task. But we are not sure and we are open to ideas.
As for Mongo or Cassandra, we are quite confused. Our data is mostly structured, so storing it in tables as Cassandra does would make it easy to manage, and Cassandra also seems to have better performance. However, we also have IoT device data, with lots of real-time GPS locations...
Which suggestions can you give to us to achieve our goal?
TL;DR summary:
Dashboard with hundreds of simultaneous users.
Stored data will be mostly structured text/numbers, but will include also images, GPS-arrays, IoT sensors, geographical data (vector-polygons & rasters)
Databases will receive high write load coming from sensors.
Dashboard performance is very important. It's more important to read data in real time than to keep it uncorrupted/secure.
Most of the math will be done in the client's browser; the server will avoid mathematical operations where possible.
Disclaimer: I'm a DataStax employee so I'll comment on the Cassandra piece.
Cassandra is a good choice for this if your dashboard can be planned around a set of known queries. If those users will be doing ad-hoc queries directly to the database from the dashboard, you'll want something with a little more flexibility like ElasticSearch or (shameless plug) DataStax Search. Especially if you expect the queries/database to handle some of the geospatial logic.
JaguarDB has very strong support of geospatial data (2D and 3D). It allows you to store multi-measurements per point location while other databases support only one measurement (pointm). Many complex queries such as Voronoi polygon, convexhull are also supported. It is open source, distributed and sharded, multiple columns indexes, etc.
Concerning PostgreSQL and Cassandra, is there much difference in RAM/CPU/disk usage between them?
Our use case does not require transactions, it will run on a single node, and we will have IoT devices writing data up to 500 times per second. However, I've read that geographical data works better with PostGIS than with Cassandra...
Given this use case, do you recommend Cassandra or PostGIS?
I'm building a B2B Node app which has heavily related data models. We currently have our own search queries, but as we scale some of the queries appear to be becoming sluggish.
We will need to support multilingual search as well as content-based searches (searching matching content within related data).
The queries are growing more and more complicated (each has multiple joins on joins on joins) and I'm now considering a hosted search tool such as Algolia.
Given my concerns below, why should I use a hosted cloud search service rather than continue building my own queries?
Data privacy is important
Data is hosted in our own postgres DB - integrations with that are important (e.g.: will I now need to manually maintain our DB data and data in Algolia?)
Speed will be important, but not so much now
Must be able to do content-based searches across multiple languages
We are a tiny team of devs now, so dev resource time is vital
What other things should I be concerned about that can help make a decision in search capabilities?
Regarding maintenance of both DB and Cloud data, it seems it's as simple as getting all data, caching it, and storing it in the cloud:
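// `Algolia` is assumed to be an already-initialized Algolia client.
var index = Algolia.initIndex('contacts');
var contactsJSON = require('./contacts.json');

// Push every record from the JSON export to the 'contacts' index.
index.addObjects(contactsJSON, function(err, content) {
  if (err) {
    console.error(err);
  }
});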
Search services like Algolia or self-hosted Elasticsearch/solr operate as full text search, not relational db queries.
But it sounds like the bottleneck is the continual rejoining. If you can make your relational data act like a full-text document DB, that could be a more efficient type of index (pre-joined, in a sense).
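To make the "pre-joined" idea concrete, here is a rough sketch of flattening related rows into one self-contained document per record before indexing. The table shapes, field names and the `toSearchDocuments` helper are all made up for illustration; the real schema was not given in the question.

```javascript
// Hypothetical rows, shaped as they might come back from three Postgres tables.
const companies = [{ id: 1, name: 'Acme GmbH', country: 'DE' }];
const contacts = [
  { id: 10, companyId: 1, name: 'Ada Lovelace' },
  { id: 11, companyId: 1, name: 'Alan Turing' },
];
const notes = [
  { contactId: 10, lang: 'en', body: 'Asked about pricing' },
  { contactId: 10, lang: 'de', body: 'Rückruf vereinbart' },
];

// Build one flat document per contact so the search index never has to
// join at query time.
function toSearchDocuments(companies, contacts, notes) {
  const companyById = new Map(companies.map((c) => [c.id, c]));

  // Group notes by the contact they belong to.
  const notesByContact = new Map();
  for (const note of notes) {
    if (!notesByContact.has(note.contactId)) notesByContact.set(note.contactId, []);
    notesByContact.get(note.contactId).push(note);
  }

  return contacts.map((contact) => {
    const company = companyById.get(contact.companyId) || {};
    return {
      id: String(contact.id),
      name: contact.name,
      companyName: company.name,
      country: company.country,
      // Related text is embedded together with its language tag.
      notes: (notesByContact.get(contact.id) || []).map(({ lang, body }) => ({ lang, body })),
    };
  });
}

const docs = toSearchDocuments(companies, contacts, notes);
```

The trade-off is that a change to a company row now means re-indexing every contact document that embeds it, so this fits data that is read far more often than it is written.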
You might also look into views, or a data warehouse (maybe star schema).
But if you are going the search route maybe investigate hosting your own elasticsearch.
You could specify database, schema, sql, index, query details if you want more help.
Full Disclosure: I founded a company called SearchStax on the premise that companies and developers should not spend time setting up, managing, scaling or building tools for the search infrastructure (ops) - they are better off investing time of their employees into building value for the company, whether that be features, capabilities, product or customers.
Open-source search solutions built on top of Lucene (Apache Solr / Elasticsearch) have the capabilities you need now and are likely to need in the near future. Find a mature as-a-service provider that specializes in open-source search and let them deal with all of it. It may look like a small effort right now, but it's probably not worth your devs' time to handle the operations side.
For your concerns mentioned above:
Data privacy is important
Your concerns around privacy and security are addressable. There are multiple ways to secure your Solr environment, and the right managed service provider (MSP) should be able to address those.
a. Security at the transport layer can be addressed by SSL certificates. All the data going over the wire is encrypted.
b. IP Filtering and User Based Authentication should address who has access to what. Solr-as-a-Service offering by Measured Search supports both.
c. Security at rest can be addressed in multiple ways - OS level / File encryption, but you can even go further by ensuring not even your services provider has access to that data by using Searchable Encryption technology.
Privacy concerns are all addressed by Terms & Conditions - I am sure your legal department will address that from a Service Provider's perspective.
Data is hosted in our own postgres DB - integrations with that are important
Solr provides ability to import data directly (DIH) through a traditional relational database (MySQL, Postgres, Oracle, etc). You can either use that so Solr can pull data periodically or write your own simple script to push data through the Solr APIs.
If you are hosted in the cloud (AWS), a tunnel can be created so only the Solr deployments have the ability to pull data from your servers and your database servers are not exposed to the world, if you choose to go the DIH route.
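For the "write your own simple script" route, a push to Solr is essentially an HTTP POST of a JSON array of documents to the collection's `/update` handler. The sketch below only builds the request and does not send it. The helper name, host, collection name and document fields are assumptions for illustration.

```javascript
// Hypothetical helper: builds (but does not send) a Solr JSON update request.
function buildSolrUpdateRequest(baseUrl, collection, docs) {
  const root = baseUrl.replace(/\/+$/, ''); // drop any trailing slashes
  return {
    // commit=true makes the documents searchable as soon as the call returns.
    url: `${root}/solr/${encodeURIComponent(collection)}/update?commit=true`,
    options: {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(docs),
    },
  };
}

// Example rows as they might be read from Postgres (made-up fields).
const request = buildSolrUpdateRequest('http://localhost:8983', 'contacts', [
  { id: '10', name: 'Ada Lovelace', company_name: 'Acme GmbH' },
]);

// On Node 18+ this could be sent with: await fetch(request.url, request.options)
```

Committing on every request is convenient for a small script, but for frequent pushes it is usually better to batch documents and commit less often.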
Speed will be important, but not so much now
Solr is built for search speed - I don't think that's where your problems are going to be. With a service offering like Measured Search's, you can spin up a cluster in any data center supported by AWS or Azure and make sure your search deployments are close to your application servers so the latency overhead is minimal.
Must be able to do content-based searches across multiple languages
Yes, Solr supports that. More than 30 languages.
We are a tiny team of devs now, so dev resource time is vital
I am biased here, but I would not have my developers spend much time on operations and let them focus on what they do best - build great product capabilities to push the limits and deliver business value.
If you are interested in doing a comparison and ROI of doing it yourself vs using a solr-as-a-service like offered by SearchStax, check this paper out - https://www.searchstax.com/white-papers/why-measured-search-is-better-than-diy-solr-infrastructure/
Our team have just recently started using Application Insights to add telemetry data to our windows desktop application. This data is sent almost exclusively in the form of events (rather than page views etc). Application Insights is useful only up to a point; to answer anything other than basic questions we are exporting to Azure storage and then using Power BI.
My question is one of data structure. We are new to analytics in general and have just been reading about star/snowflake structures for data warehousing. This looks like it might help in providing the answers we need.
My question is quite simple: Is this the right approach? Have we over complicated things? My current feeling is that a better approach will be to pull the latest data and transform it into a SQL database of facts and dimensions for Power BI to query. Does this make sense? Is this what other people are doing? We have realised that this is more work than we initially thought.
Definitely pursue Michael Milirud's answer: if your source product has suitable analytics, you might not need a data warehouse.
Traditionally, a data warehouse has three advantages: it integrates information from different data sources, both internal and external; data is cleansed and standardised across sources; and the history of change over time ensures that data is available in its historic context.
What you are describing is becoming a very common case in data warehousing, where star schemas are created for access by tools like PowerBI, Qlik or Tableau. In smaller scenarios the entire warehouse might be held in the PowerBI data engine, but larger data might need pass through queries.
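As a rough illustration of what "facts and dimensions" means for event telemetry, the sketch below splits flat events into small dimension lookups plus a fact row per event that holds only keys and measures. The event fields and function names are invented for the example and are not the Application Insights export format.

```javascript
// Hypothetical flat telemetry events (not the real export schema).
const events = [
  { name: 'FileOpened', appVersion: '1.2.0', timestamp: '2016-05-01T10:15:00Z', durationMs: 120 },
  { name: 'FileSaved', appVersion: '1.2.0', timestamp: '2016-05-01T10:20:00Z', durationMs: 340 },
  { name: 'FileOpened', appVersion: '1.3.0', timestamp: '2016-05-02T09:00:00Z', durationMs: 95 },
];

// Return the surrogate key for a dimension value, assigning the next
// integer the first time the value is seen.
function keyFor(dimension, value) {
  if (!dimension.has(value)) dimension.set(value, dimension.size + 1);
  return dimension.get(value);
}

// Split flat events into dimension tables and a fact table.
function toStarSchema(events) {
  const dimEvent = new Map();   // event name  -> key
  const dimVersion = new Map(); // app version -> key
  const dimDate = new Map();    // YYYY-MM-DD  -> key

  const facts = events.map((e) => ({
    eventKey: keyFor(dimEvent, e.name),
    versionKey: keyFor(dimVersion, e.appVersion),
    dateKey: keyFor(dimDate, e.timestamp.slice(0, 10)),
    durationMs: e.durationMs, // the measure
  }));

  return { dimEvent, dimVersion, dimDate, facts };
}

const star = toStarSchema(events);
```

Each `Map` would become a dimension table and `facts` the fact table, whether that lives in a SQL database or directly in the Power BI model.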
In your scenario, you might be interested in some tools that appear to handle at least some of the migration of Application Insights data:
https://sesitai.codeplex.com/
https://github.com/Azure/azure-content/blob/master/articles/application-insights/app-insights-code-sample-export-telemetry-sql-database.md
Our product Ajilius automates the development of star schema data warehouses, speeding the development time to days or weeks. There are a number of other products doing a similar job, we maintain a complete list of industry competitors to help you choose.
I would continue with Power BI - it actually has a very sophisticated and powerful data integration and modeling engine built in. Historically I've worked with SQL Server Integration Services and Analysis Services for these tasks - Power BI Desktop is superior in many aspects. The design approaches remain consistent - star schemas etc, but you build them in-memory within PBI. It's way more flexible and agile.
Also are you aware that AI can be connected directly to PBI Web? This connects to your AI data in minutes and gives you PBI content ready to use (dashboards, reports, datasets). You can customize these and build new reports from the datasets.
https://powerbi.microsoft.com/en-us/documentation/powerbi-content-pack-application-insights/
What we ended up doing was sending events from our WinForms app not directly to AI but to the Azure EventHub.
We then created a job that reads from the eventhub and sends the data to:
AI using the SDK
Blob storage for later processing
Azure table storage to create powerbi reports
You can of course add more destinations.
So basically all events are sent to one destination and from there stored in many destinations, each for its own purpose. We definitely did not want to be restricted to 7 days of raw data, and storage is cheap and blob storage can be used in many of the analytics solutions of Azure and Microsoft.
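The one-in, many-out shape described above can be sketched independently of Azure. This is not the actual job (which would use the Event Hubs and Application Insights SDKs); the sinks below are in-memory stand-ins with made-up names, and the point is only that each destination is written independently so one failing sink does not block the others.

```javascript
// Hypothetical fan-out: every event is handed to every sink, and a failure
// in one sink is recorded instead of stopping the rest.
function createFanOut(sinks) {
  return function dispatch(event) {
    return sinks.map(({ name, write }) => {
      try {
        write(event);
        return { name, ok: true };
      } catch (err) {
        return { name, ok: false, error: err.message };
      }
    });
  };
}

// In-memory stand-ins for Blob storage and Table storage.
const blobLines = [];
const tableRows = [];

const dispatch = createFanOut([
  { name: 'blob', write: (e) => blobLines.push(JSON.stringify(e)) },
  { name: 'table', write: (e) => tableRows.push({ PartitionKey: e.app, ...e }) },
  { name: 'broken', write: () => { throw new Error('unavailable'); } },
]);

const report = dispatch({ app: 'desktop', name: 'ButtonClicked' });
```

Adding another destination is then just one more entry in the sinks array.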
The eventhub can be linked to stream analytics as well.
More information about eventhubs can be found at https://azure.microsoft.com/en-us/documentation/articles/event-hubs-csharp-ephcs-getstarted/
You can start using the recently released Application Insights Analytics' feature. In Application Insights we now let you write any query you would like so that you can get more insights out of your data. Analytics runs your queries in seconds, lets you filter / join / group by any possible property and you can also run these queries from Power BI.
More information can be found at https://azure.microsoft.com/en-us/documentation/articles/app-insights-analytics/
A brief summary of the project I'm working on:
I was hired as a web dev intern at a small company (part of a larger corporation) close to the state college I attend. For the past couple of months, two other interns and I have been working on the front-end as well as the back-end. The company is prototyping adding sensors to its products (oil/gas industry); we were tasked with building the portal that customers could log in to in order to see data from their machines even if they're not near them.
Basically, we're collecting sensor data (~ten sensors/machine) and it's sent back to us. Where we're stuck is determining the best way to store and analyze long-term data. We have a Redis Cache set up for fast access by the front-end, where only the latest set of data for each machine is stored. But for historical data, my coworkers and I are having a tough time deciding the best route to go. Our whole project is based in VS (C#/Razor) with Azure integration (which is amazing by the way), so I'd like to keep the long-term storage there as well. As far as I can tell, HDInsight + data in a BLOB seems to be the best option, but I'm fairly green when it comes to backend solutions. I would just like input from some older developers who may have more experience in this area, as we are the only developers here besides a couple of older members who are more involved in the engineering side of things vs. development.
So, professionals of stack overflow, what would be your recommendation for long-term data storage and analytics?
PS: I apologize if I have HDinsight confused. From what I understand, it maps data in BLOB storage into HBase for easier analytics? Hadoop/HBase confuses me.
My first recommendation would be Azure Table storage. It provides a highly scalable and low cost data archival solution. If designed properly, you can also get a very decent query performance. Refer to the Azure Storage Table Design Guide for more details.
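"Designed properly" in Table storage mostly comes down to the choice of PartitionKey and RowKey, since entities are sorted by those two values. One common layout for per-machine sensor history, sketched here under assumed field names, is to partition by machine and use a reversed timestamp as the RowKey so the newest readings sort first. The question's project is C#, so this JavaScript version only illustrates the key design.

```javascript
// Largest timestamp (in ms) a JavaScript Date can represent; 16 digits long.
const MAX_DATE_MS = 8640000000000000;

// Hypothetical mapping from one sensor reading to a table entity.
function toSensorEntity(machineId, reading) {
  const ms = Date.parse(reading.timestamp);
  return {
    PartitionKey: machineId,
    // Reversed, zero-padded timestamp: newer readings get smaller strings,
    // so they come first in the table's lexicographic ordering.
    RowKey: String(MAX_DATE_MS - ms).padStart(16, '0'),
    ...reading.values,
  };
}

const older = toSensorEntity('machine-42', {
  timestamp: '2016-01-01T00:00:00Z',
  values: { pressure: 101.2, temperature: 64 },
});
const newer = toSensorEntity('machine-42', {
  timestamp: '2016-01-02T00:00:00Z',
  values: { pressure: 100.8, temperature: 66 },
});
```

With this layout, "latest N readings for a machine" is a single-partition query that reads from the top, and a time range maps to a RowKey range.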
My second choice would be Azure DocumentDB service which is a NoSQL document database. It costs a bit more but querying data is much more flexible.
You should only go with HDInsight when you have a specific need as it's a resource-intensive and expensive service. Once you identify a specific requirement for a big-data analysis that's when you import your data and process it with HDInsight.
I am choosing database technology for my new project. I am wondering what are the key differences between Azure DocumentDB and Azure Table Storage?
It seems that main advantage of DocumentDB is full text search and rich query functionality. If I understand it correctly, I would not need separate search engine library such as Lucene/Elasticsearch.
On the other hand Table Storage is much cheaper.
What are the other differences that could influence my decision?
I consider Azure Search an alternative to Lucene. I used Lucene.net in a worker role and simply the idea of not having to deal with the infrastructure, ingestion, etc.. issues make the Azure Search service very appealing to me.
There is a scenario I approached with Azure storage for which I see DocumentDB
as a perfect fit, and it might explain my point of view.
I used Azure storage to prepare and keep daily summaries of user activity in my solution outside of Azure SQL Database, as the summaries are requested frequently by a large number of clients, with a good chance of spikes at certain times of the day. It is a simple write-once, read-many usage pattern that Azure SQL DB (with my schema) found difficult to cope with, while it fit perfectly within the capacity of storage (btw, daily summaries were not in cache because of their size).
This scenario evolved over time and now I happen to keep more aggregated and ready to use data in those summaries, and updates became more complex.
Keeping these daily summaries in DocumentDB would make the write-once part of the scenario more granular, updating only the relevant data in the complex summary, and would ease the read part, since, for example, fetching parts of several summaries becomes trivial.
I would consider DocumentDB in scenarios in which data is unstructured and rather complex and I need rich query capability (Table storage is lagging on this part).
I would consider Azure Search in scenarios in which a high throughput full-text search is required.
I did not find the quotas/expected perf to precisely compare DocumentDB to Search but I highly suspect Search is the best fit to replace Lucene.
HTH, Davide