I've been researching Big Data for the last couple of months and have started my final-year project (FYP), which is to analyze Big Data using MapReduce and HDInsight on Windows Azure.
I'm now stuck on one particular question: which platform (Amazon, Oracle, IBM, etc.) is better for Big Data analytics in terms of cost, performance, stability, and so on? This question may be too broad, but I just want a basic idea of how these platforms differ from Azure HDInsight.
In short: HDInsight vs. other Big Data solutions for Big Data analytics. Any help would be appreciated.
Comparison of different Hadoop distributions
You can find a reference about Microsoft's distribution in this article.
One important detail: it is not only about the platform. It is important to understand your options, but I humbly suggest you also take your (and your team's) skills into consideration.
One platform may be better than another, but if you are starting from scratch you may miss your goals in terms of timelines or budget, or even fail completely.
Combining Operational and Analytical Technologies Using Hadoop
New technologies like NoSQL, MPP databases, and Hadoop have emerged to address Big Data challenges and to enable new types of products and services to be delivered by the business.
One of the most common ways companies are leveraging the capabilities of both systems is by integrating a NoSQL database such as MongoDB with Hadoop. The connection is easily made by existing APIs and allows analysts and data scientists to perform complex, retroactive queries for Big Data analysis and insights while maintaining the efficiency and ease-of-use of a NoSQL database.
NoSQL, MPP databases and Hadoop are complementary: NoSQL systems should be used to capture Big Data and provide operational intelligence to users, and MPP databases and Hadoop should be used to provide analytical insight for analysts and data scientists. Together, NoSQL, MPP databases and Hadoop enable businesses to capitalize on Big Data.
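To make that integration concrete, here is a minimal sketch of reading an operational MongoDB collection into a distributed DataFrame for analytical work. It uses the MongoDB Spark connector rather than a raw MapReduce job; the connection URI, database, collection, and column names are placeholders, and the format/option names assume the 10.x connector (older versions use different names):

```python
from pyspark.sql import SparkSession

# Spark session; assumes the MongoDB Spark connector (10.x) has been added via
# --packages or the cluster's library configuration.
spark = SparkSession.builder.appName("mongo-analytics-sketch").getOrCreate()

# Read an operational MongoDB collection into a DataFrame for batch analytics.
orders = (
    spark.read.format("mongodb")
    .option("connection.uri", "mongodb://analytics-user:<password>@mongo-host:27017")
    .option("database", "shop")
    .option("collection", "orders")
    .load()
)

# A retroactive, analytical query that would be awkward to run against the
# operational store directly: total order value per customer.
revenue = orders.groupBy("customer_id").sum("amount")
revenue.write.mode("overwrite").parquet("hdfs:///analytics/revenue_by_customer")
```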
In my experience, each cloud service provider has its own strengths and weaknesses.
I know Google Cloud best, so I have tried to compare the others with respect to Google Cloud.
Below are two links that map products from other providers to their Google Cloud equivalents:
https://cloud.google.com/free/docs/map-azure-google-cloud-platform
https://cloud.google.com/free/docs/map-aws-google-cloud-platform
For example, Azure HDInsight maps to Google Cloud Dataproc and Google Cloud Dataflow. With Dataproc, we can run Hadoop MapReduce jobs; Dataflow can be used for both batch and stream data processing.
In AWS, Azure HDInsight maps to Amazon Elastic MapReduce (EMR).
Each service provider has a different pricing mechanism based on CPU type, number of cores, and storage options. In Google Cloud, there is also the option of preemptible instances, which are very cheap but can only be used for short-term work (a maximum of 24 hours).
You can compare pricing from below links:
https://cloud.google.com/dataproc/pricing
https://cloud.google.com/dataflow/pricing
https://azure.microsoft.com/en-us/pricing/details/hdinsight/
https://aws.amazon.com/emr/pricing/
There is a tool on the market for comparing different cloud services:
https://github.com/GoogleCloudPlatform/PerfKitBenchmarker
Hello Stack Overflow community.
Regarding Microsoft's Azure Synapse Analytics (ASA) cloud-based OLAP offering, there seems to be a critical oversight with ASA dedicated MPP SQL pools.
I'm rolling my own IDENTITY surrogate keys because Azure Synapse dedicated MPP SQL pools don't seem to support them. In doing so, data sets that could be loaded with O(N) performance (i.e., pre-sorted data sets) are degraded to O(N log N), because each time I insert a record I must calculate its surrogate key (e.g., surrogateKey <- MAX(table[surrogateKey]) + 1).
Initial loading, as well as incremental loading, of many large ODS datasets (sometimes exceeding billions of rows) is a requirement for the DW environments I support.
This currently does not appear to be supported. If this is an inherent challenge of using an ASA dedicated MPP SQL pool database, how do we overcome these very real limitations?
How do we best implement and support the necessary surrogate keys with ASA for OLAP data architectures, without sound assistance and support from the RDBMS, and still have it be performant?
Please also help me find answers to these important, related questions:
How do I best ingress and ingest source data into ASA dedicated MPP SQL pool analytics tables that:
0.) maintain their own auto-numbered [i.e., MSSQL IDENTITY(SEED, INCREMENT)] ENFORCED, UNIQUE, PRIMARY KEY surrogate key fields?
1.) also support NONCLUSTERED indexes?
2.) allow bulk inserts of pre-sorted source data sets to run as O(N) operations? How do I accomplish massive insert operations with O(N) performance in Azure Synapse SQL pool databases? It sure seems like this is currently O(N log N).
3.) support PK-FK referential integrity? How do we approach this and still have a performant analytics data model with Azure SQL pool databases?
Thank you.
There must be a better way to do this with Azure dedicated SQL pool databases.
Synapse dedicated SQL pools do support IDENTITY, and there shouldn't be any reason they won't handle tables with billions of rows. Large data loads should use COPY INTO or PolyBase for best performance.
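To illustrate, here is a minimal sketch of that approach (not your actual schema: the table, column, storage path, and connection details are hypothetical, and the exact COPY INTO options depend on your storage and authentication setup). It creates a fact table with an IDENTITY surrogate key, bulk-loads a staging table with COPY INTO, and then lets the engine assign keys in one set-based insert instead of a per-row MAX(key) + 1 lookup:

```python
import pyodbc

# Hypothetical connection to a Synapse dedicated SQL pool; server, database,
# and credentials are placeholders.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myworkspace.sql.azuresynapse.net;DATABASE=mypool;"
    "UID=loader;PWD=<password>"
)
conn.autocommit = True
cur = conn.cursor()

# Target fact table: IDENTITY is supported on dedicated SQL pool tables.
# Values are unique but not guaranteed to be contiguous.
cur.execute("""
CREATE TABLE dbo.FactSales
(
    SaleKey    BIGINT IDENTITY(1, 1) NOT NULL,
    SaleDate   DATE            NOT NULL,
    CustomerId INT             NOT NULL,
    Amount     DECIMAL(18, 2)  NOT NULL
)
WITH (DISTRIBUTION = HASH(CustomerId), CLUSTERED COLUMNSTORE INDEX);
""")

# Staging table without the surrogate key, bulk-loaded straight from storage.
cur.execute("""
CREATE TABLE dbo.StageSales
(
    SaleDate   DATE            NOT NULL,
    CustomerId INT             NOT NULL,
    Amount     DECIMAL(18, 2)  NOT NULL
)
WITH (DISTRIBUTION = ROUND_ROBIN, HEAP);
""")

cur.execute("""
COPY INTO dbo.StageSales
FROM 'https://myaccount.dfs.core.windows.net/staging/sales/*.parquet'
WITH (FILE_TYPE = 'PARQUET', CREDENTIAL = (IDENTITY = 'Managed Identity'));
""")

# One set-based insert: the engine assigns SaleKey values itself, so there is
# no per-row MAX(key) + 1 lookup.
cur.execute("""
INSERT INTO dbo.FactSales (SaleDate, CustomerId, Amount)
SELECT SaleDate, CustomerId, Amount FROM dbo.StageSales;
""")
```

Note that IDENTITY in a distributed pool guarantees uniqueness but not gap-free or ordered values, which is usually acceptable for a star-schema surrogate key.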
I am part of a team of 5 developers that gathers, transforms, analyzes, and builds predictions on data in Azure Databricks (basically a combination of data science and data engineering).
Up until now we have been working on relatively small data, so the team of 5 could easily work with a single cluster with 8 worker nodes in development. Even though we are 5 developers, usually we're at maximum 3 developers in Databricks at the same time.
Recently we started working with "Big Data", so we need to make use of Databricks' Apache Spark parallelization to improve the run-times of our code. However, a problem that quickly came to light is that with more than one developer running parallelized code on a single cluster, jobs queue up and slow us down. Because of this, we have been thinking about increasing the number of clusters in our dev environment so that multiple developers can work on code that uses Spark parallelization.
My question is this: what is the industry standard for the number of clusters in a development environment? Do teams usually have one cluster per developer? That sounds like it could easily become quite expensive.
Usually I see the following pattern:
There is a shared cluster used by many people for ad-hoc experimenting, "small" data processing, etc. Note that current Databricks runtime versions try to split resources fairly between all users.
If some people need to run something "heavyweight", like integration tests or other jobs closer to production workloads, they are allowed to create their own clusters. But to control costs, it's recommended to use cluster policies to limit the size of the clusters they can create, the allowed node types, auto-termination times, etc. (a minimal policy sketch is included at the end of this answer).
For development clusters it's OK to use spot instances, because the Databricks cluster manager will pull in new instances if existing ones are evicted.
SQL queries can be more efficient to run on SQL warehouses, which are optimized for BI workloads.
P.S. Integration tests and similar things can easily be run as jobs, which are less expensive.
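As a rough illustration of such a policy, here is a minimal sketch. The attribute paths follow the Databricks cluster-policy definition format as I understand it; the node types, limits, and defaults are placeholder examples, not recommendations for your workspace:

```python
import json

# A small cluster-policy definition: caps the number of workers, restricts node
# types, forces auto-termination, and prefers spot instances with fallback.
# The resulting JSON can be pasted into the Cluster Policies UI or sent to the
# cluster policies REST API.
dev_policy = {
    "num_workers": {"type": "range", "maxValue": 4, "defaultValue": 2},
    "node_type_id": {
        "type": "allowlist",
        "values": ["Standard_DS3_v2", "Standard_DS4_v2"],
    },
    "autotermination_minutes": {
        "type": "range",
        "minValue": 10,
        "maxValue": 60,
        "defaultValue": 30,
    },
    "azure_attributes.availability": {
        "type": "fixed",
        "value": "SPOT_WITH_FALLBACK_AZURE",
    },
}

print(json.dumps(dev_policy, indent=2))
```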
DataStax seems expensive. Is there a best-practice configuration available for using Apache Cassandra in production? I am trying to set up Cassandra on EC2.
Thanks
Instead of giving you a commercial for some other product, let me give you some practical advice on choosing between OSS and commercially licensed products.
You have two things to spend when using any technology. Time or money. Ultimately time is money, but for the sake of this let's say they are different. By your question, you have more time so let's focus on that.
Spend the time to learn the fundamentals. The term black magic is FUD. Some of the world's largest workloads are running on Cassandra. You can do it too.
Seek out peers and learn from those who have been successful. There are organizations that have been running Cassandra in prod for years.
Focus on a single use case/project. Nothing worse than trying to replace all of your infrastructure with a new technology when you are learning. Pick one thing and become proficient. Use that experience for the next projects.
You can get some free training at DataStax Academy. http://academy.datastax.com
Learn from peers by watching talks from the community of awesome users.
You can find something in these 135 talks here: https://www.youtube.com/playlist?list=PLm-EPIkBI3YoiA-02vufoEj4CgYvIQgIk
If you need to ask questions, Stack Overflow, the Cassandra mailing list, and DataStax Academy Slack are all good resources.
Using a commercial product or spending the time is up to you, but don't let anyone try to convince you that it's too hard and you should use something else. We are all here to help if we can.
Disclaimer: I'm a ScyllaDB employee.
There are several alternatives for operating Cassandra/Scylla-like workloads.
Use open-source Cassandra, with best practices. Most of them, unfortunately, were created a couple of years ago, so you'll need to learn the black magic of tuning the JVM and Cassandra workloads.
https://tobert.github.io/pages/als-cassandra-21-tuning-guide.html
There are no "official" AMIs on AWS for recent releases of Cassandra.
Use Scylla open source. It is a drop-in replacement for Cassandra. Scylla autotunes itself to minimize operator intervention in day-to-day operations. Also, Scylla provides open-source AMIs for EC2 deployment, so all you need is an AWS account.
Scylla is a C++ implementation of Cassandra that takes full advantage of the powerful (and costly) resources on AWS, and thus offers a better ops/$$ ratio. Scylla highly recommends using i3 instances: you get contemporary CPU technology, excellent NVMe-based I/O, and lots of memory at a fraction of the cost of other EC2 instances.
You can read more about it here:
http://www.scylladb.com/2017/05/10/faster-and-better-what-to-expect-running-scylla-on-aws-i3-instances/
ScyllaDB is committed to providing open-source, optimized AMI versions.
Buy enterprise licenses from DataStax or Scylla.
Hire consultants to help you install a Cassandra setup.
Companies like "the last pickle" or Pythian can help you in that regard.
Use DBaaS offerings from the following companies:
Scylla:
IBM Compose: https://www.compose.com/databases/scylladb
Joyent Triton: https://www.joyent.com/blog/free-trial-managed-scylladb-beta-on-triton
Scylla and Cassandra:
Instaclustr: https://www.instaclustr.com/
Hope this helps.
A brief summary of the project I'm working on:
I was hired as a web dev intern at a small company (part of a larger corporation) close to the state college I attend. For the past couple of months, two other interns and I have been working on both the front-end and the back-end. The company is prototyping adding sensors to its products (oil/gas industry); we were tasked with building the portal that customers can log in to in order to see data from their machines even when they're not near them.
Basically, we're collecting sensor data (~ten sensors/machine), which is sent back to us. Where we're stuck is determining the best way to store and analyze long-term data. We have a Redis Cache set up for fast access by the front-end, where only the latest set of data for each machine is stored. But for historical data, my coworkers and I are having a tough time deciding the best route to go. Our whole project is based in VS (C#/Razor) with Azure integration (which is amazing, by the way), so I'd like to keep the long-term storage there as well. As far as I can tell, HDInsight + data in blob storage seems to be the best option, but I'm fairly green when it comes to back-end solutions. I would just like input from some older developers who may have more experience in this area, as we are the only developers here besides a couple of older members who are more involved in the engineering side of things than in development.
So, professionals of stack overflow, what would be your recommendation for long-term data storage and analytics?
PS: I apologize if I have HDInsight confused. From what I understand, it maps data in blob storage into HBase for easier analytics? Hadoop/HBase confuses me.
My first recommendation would be Azure Table storage. It provides a highly scalable and low-cost data archival solution. If your tables are designed properly, you can also get very decent query performance. Refer to the Azure Storage Table Design Guide for more details.
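For example, sensor readings could be stored with the machine ID as the PartitionKey and a reverse-ordered timestamp as the RowKey, so the newest readings for a machine sort first. Below is a minimal sketch using the azure-data-tables Python package (shown in Python for brevity; the same PartitionKey/RowKey design applies from the .NET SDK, and the connection string, table name, and property names are placeholders):

```python
from datetime import datetime, timezone
from itertools import islice
from azure.data.tables import TableServiceClient

service = TableServiceClient.from_connection_string("<storage-connection-string>")
table = service.create_table_if_not_exists("SensorReadings")

# PartitionKey = machine ID, RowKey = reverse-ordered timestamp so the newest
# reading for a machine always sorts first within its partition.
now = datetime.now(timezone.utc)
ms_remaining = int((datetime.max.replace(tzinfo=timezone.utc) - now).total_seconds() * 1000)
entity = {
    "PartitionKey": "machine-0042",
    "RowKey": str(ms_remaining).zfill(19),
    "ReadingTimeUtc": now.isoformat(),
    "Temperature": 78.4,
    "Pressure": 101.3,
}
table.create_entity(entity)

# The ten most recent readings for one machine: a single-partition query that
# comes back already sorted by RowKey (newest first).
recent = list(islice(table.query_entities("PartitionKey eq 'machine-0042'"), 10))
```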
My second choice would be the Azure DocumentDB service, which is a NoSQL document database. It costs a bit more, but querying data is much more flexible.
You should only go with HDInsight when you have a specific need, as it's a resource-intensive and expensive service. Once you identify a specific requirement for big-data analysis, that's when you import your data and process it with HDInsight.
I am looking for an eventually consistent data store, and it looks like it may be coming down to Riak or Cassandra. Does anyone have experience of, or a view on, this?
As you probably know, they are both architecturally strongly influenced by Dynamo (eventually consistent, no single points of failure, etc.). Both also go beyond Dynamo in providing a "richer than pure K/V" data model -- in Cassandra's case a Bigtable-like ColumnFamily model, in Riak's a document-oriented one. I have seen sane people choose both.
I believe points that favor Cassandra include
speed
support for clusters spanning multiple data centers
big names using it (digg, twitter, facebook, webex, ... -- http://n2.nabble.com/Cassandra-users-survey-tp4040068p4040393.html)
Points that favor Riak include
map/reduce support out of the box
/Cassandra dev, fwiw
Riak is used by
Mozilla Foundation
Ask.com sponsored listings
Comcast
Citigroup
Bet365
I think they both pass the test of credible reference customers/users.
Cassandra seems more mature, and is currently doing better in benchmarks. Riak seems easier to add a node to as your cluster grows.
For completeness: A good (probably biased) comparison between the two can be found at http://docs.basho.com/riak/1.3.2/references/appendices/comparisons/Riak-Compared-to-Cassandra/
Use and download are different. Best to get references.
Perhaps a private conversation could be had where Riak references at these companies could be shared? I'm not sure how to get that with Cassandra, but there is a community of companies that support Cassandra, which seems like a good place to start. As these companies probably have participants in Cassandra development, it may be a really reasonable place to start.
I would like to hear Riak's answer to recent and large deployments where customers are happy.
I also would like to see the roadmap for each product. Cassandra is a bit easier to track (http://wiki.apache.org/cassandra/) than Riak in my view, as Cassandra's wiki discusses limitations and things that are probably going to change going forward, but neither outlines its future well. I could understand that of an open source community ... perhaps ... but I cannot for a product for which I must pay.
I would also suggest researching Cloudant, which appears to have a very nice layering of capabilities. It also looks like it is bringing to bear capabilities from elsewhere in Apache land. CouchDB is the Apache platform on which Cloudant is based, but the indexing with Lucene seems to be just the tip of the iceberg when it comes to where Cloudant could go. Creating and managing an index is a very systematic process, a kind of data pipeline, that could be scripted using other Apache community assets. And capabilities like NLP could also be added through Lucene indirectly, or maybe directly into what is persisted.
It would be nice to see a proposed Cloudant roadmap, especially since the team could mine the riches of the Apache community and integrate such into Cloudant. Such probably exists as there is an operational component to the Cloudant revenue model that will require it, if for no other reason.
Another area of interest ... Cloudant's pricing model ... it is clear their revenue model is not based on software, but around service. That is quite attractive, and it seems consistent with the ecosystem surrounding Cassandra too. I don't know if the Basho folks have won over enough of the nosql community as yet ... don't see such from any buzz around their web site or product.
I like this Cloudant web page (https://cloudant.com/the-data-layer/). I was surprised to see the embedded Erlang capability ... I did not know CouchDB was written in Erlang, as this seems unusual to me in the Apache community (my ignorance); CouchDB appears to be older than other NoSQL products I know (now) to be written in Erlang. Whatever their strategy, they at least count Amazon EC2 and Microsoft Azure as hosting partners, indicating an appreciation of both Microsoft and non-Microsoft worlds - all very important if they properly recognize the middleware value potential (beyond cache or hash table applications) that these types of data stores could have.
Finally, while I don't know the board well, Andy Palmer's guidance looks like it will be valuable. He can bring some guidance vis-a-vis structured data (through VoltDB) to a world that rightly or wrongly may be unfairly branded as KVP hash tables of unstructured data. The need for structure and ecosystem surrounding nosql "databases" is being recognized ... witness Google's efforts with Spanner ... KVP/little structure/need for search-ability motivated Google's investment in the Spanner space. While we all may not need something like Spanner, we probably do need an improving and robust "enterprise" management and interoperability capability in these nosql databases to make it reasonable to incorporate them into modern cloud architectures. The needed structure can come from ease of interoperability and functional richness. It can also come from new capabilities that support conversion of unstructured data to structured data (e.g. indexes, use of NLP to create structured and parsed renderings of things inside of a KVP blob, and plenty of other things that, if put into a roadmap and published, could entice and grow a user base). Cloudant looks like it has a good chance of success ... I will take a closer look at it ...
And look what I found about CouchDB ...
CouchDB comes with a suite of features, such as on-the-fly document transformation and real-time change notifications, that makes web app development a breeze. It even comes with an easy to use web administration console. You guessed it, served up directly out of CouchDB! We care a lot about distributed scaling. CouchDB is highly available and partition tolerant, but is also eventually consistent. And we care a lot about your data. CouchDB has a fault-tolerant storage engine that puts the safety of your data first.