Querying big data [closed]

Querying big data [closed] - cassandra

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 9 years ago.
Improve this question
I am working with a system that takes in a 50/s 10kb write stream which runs 24 hours a day. The data is ingested via a messaging system in to a sql database, and then used in an overnight aggregation that takes around 15 hours to produce queryable data for an application.
This is currently all in sql, but we are moving to a new architecture.
The plan is to move the ingested writes in to a distributed database like Cassandra or dynamodb, and then perform the aggregation in hadoop. This makes those parts of the system scalable.
My question is, when people have this architecture, where do they put the data after the writes and aggregation have been performed so that it can be queried.
In more detail:
The query model our application uses is quite complicated, to make the data queryable in cassandra, we would have to denormalise it for all queries, this is possible, but would mean a massive growth in data size. Is this normal practice? Or would you prefer to move the data back in to sql?
We could move the data in to redshift, but this seems to be more for ad hoc data analytics, and its purpose is not to be the backend for a data analytics application. I also think the queries are too complicated in their current form to be written in an orm which is what is required for redshift.
Does this mean that I still need to put the data in to sql server?
I am looking for examples of what people are doing at the moment.
I am sorry this question is a bit abstract, please do not close it, I will add more detail. I have read lots on big data, but most articles are about the ingestion of data using messaging / workers and distributed databases, but I have not found any that show what they do with this ingested data and how it is queried from the application.
*answer to JosefN's comment: Yes, we are not planning to denormalise in to a sql db. The choice is, denormalise in to cassandra, for all clients and queries, this would probably mean 100x the current data size, as there will be so much duplication in the denormalised model. The other option is to store it as it is now, so that it is queryable, but then, is my only option a sql db?
*after more research I have more information. The best options at the moment seems to be:
store back in sql
denormalise in cassandra
use one of the real time sql engines on top of hadoop / hdfs like impala
drpc with storm
I do not have any experience with Impala or DRPC with storm, so if anyone has any info on latency and the type of queries that can be performed with these that would be great.
Please do not refer to documentation or blog posts, I know how these technologies work, I only want to know if someone has used them in production and has their own information on this subject. thanks

I would suggest moving the aggregated data into HDFS. Using Hive, which provides a relational view over data stored inside HDFS, you can very well use adhoc sql like queries. At the same time you will be benefitted from parallelism of MapReduce jobs that gets invoked when you use Hive. This would help you to decrease query latencies that you would be having while using a RDBMS. Also think about doing the aggregation jobs in Hadoop itself.

Since the data after aggregation is small and you are looking for good latency keeping it in hdfs and query it using hive is not preferable.
I have seen people using hbase to store aggregated data and query it but as you mentioned earlier you will have to denormalize the data. For this case I will recommend writing aggregated data back to mysql and query it there if aggregated data is not big.

I think one traditional approach is to run your Hadoop/Hive jobs to aggregate across all possible dimensions, and then store in a key/value store like HBase, and look up at runtime with a key based on the aggregation done ( ie. /state=NJ/dt=20131225/ ) This can cause an explosion in size, especially if there are many dimensions to roll up
If you want/need a more realtime solution as well, take a look at Twitter's summingbird.

Related

What is the point of a table in a data lake? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 2 years ago.
Improve this question
I thought the whole point of using a Data Lake versus a Data Warehouse was to invert the ETL (Extract, Transform, Load) process to LET (Load, Extract, Transform). Doesn't extracting this data, transforming and loading it into a table get us right back where we started?

IMHO the point of a data lake is to store all types of data: unstructured, semi-structured and structured. The Azure version of this is Azure Data Lake Store (ADLS) and its primary function is scalable, high-volume storage.
Separately, there is a product Azure Data Lake Analytics (ADLA). This analytics product can interact with ADLS, but also blob storage, SQL on a VM (IaaS) and the two PaaS database offerings, SQL Database and SQL Data Warehouse and HDInsight. It has a powerful batch language called U-SQL, a combination of SQL and .net for interrogating and manipulating these data stores. It also has a database option which, where appropriate, allows you to store data you have processed in table format.
One example might be where you have some unstructured data in your lake, you run your batch output and want to store the structured intermediate output. This is where you might store the output in an ADLA database table. I tend to use them where I can prove I can get a performance improvement out of them and/or want to take advantage of the different indexing options.
I do not tend to think of these as warehouse tables because they don't interact well with other products yet, ie they don't as yet have endpoints / aren't visible, eg Azure Data Factory can't move tables from there yet.
Finally I tend to think of ADLS as analogous to HDFS and U-SQL/ADLA as analogous to Spark.
HTH

By definition a data lake is a huge repository storing raw data in it's native format until needed. Lakes use a flat architecture rather than nested (http://searchaws.techtarget.com/definition/data-lake). Data in the lake has a unique ID and metadata tags, which are used in queries.
So data lakes can store structured, semi-structured and unstructured data. Structured data would include SQL database type data in tables with rows and columns. Semi-structured would be CSV files and the like. And unstructured data is anything and everything -- emails, PDFs, video, binary. It's that ID and the metadata tags that help users find data inside the lake.
To keep a data lake manageable, successful implementers rotate, archive or purge data from the lake on a regular basis. Otherwise it becomes what some have called a "data swamp", basically a graveyard of data.
The traditional ELT process is better suited to data warehouses because they are more structured and data in a warehouse is there for a purpose. Data lakes, being less structured, are more suited to other approaches such as ELT (Extract, Load, Transform), because they store raw data that is only categorized by each query. (See this article by Panopoly for a discussion of ELT vs ETL.) For example, you want to see customer data from 2010. When you query a data lake for that you will get everything from accounting data, CRM records and even emails from 2010. You cannot analyze that data until it has been transformed into usable formats where the common denominators are customers + 2010.

To me, the answer is "money" and "resources"
(and probably correlated to use of Excel to consume data :) )
I've been through a few migrations from RDBMS to Hadoop/Azure platforms and it comes down to the cost/budget and use-cases:
1) Port legacy reporting systems to new architectures
2) Skillset of end-users who will consume the data to drive business value
3) The type of data being processed by the end user
4) Skillset of support staff who will support the end users
5) Whether the purpose of migration is to reduce infrastructure support costs, or enable new capabilities.
Some more details for a few of the above:
Legacy reporting systems often are based either on some analytics software or homegrown system that, over time, has a deeply embedded expectation for clean, governed, structured, strongly-typed data. Switching out the backend system often requires publishing the exact same structures to avoid replacing the entire analytics solution and code base.
Skillsets are a primary concern as well, because your often talking about hundreds to thousands of folks who are used to using Excel, with some knowing SQL. Few end-users, in my experience, and few Analysts I've worked with know how to program. Statisticians and Data Engineers tend towards R/Python. And developers with Java/C# experience tend towards Scala/Python.
Data Types are a clincher for what tool is right for the job... but here you have a big conflict, because there are folks who understand how to work with "Data Rectangles" (e.g. dataframes/tabular data), and those who know how to work with other formats. However, I still find folks consistently turning semi-structured/binary/unstructured data into a table as soon as they need to get a result operationalized... because support is hard to find for Spark.

Cassandra store data in BLOB

We are using Cassandra 3 and have come up with a modelling based on the initial requirements. Since there have been very frequent requirements changes, this model has subsequently changed many times as well. Hence considering these requirements and model changes, there has been no major improvement in terms of development. The team have decided to go with the BLOB data type and store the entire data in the BLOB. Can you please share the drawback to use BLOB such a scenario. Thanks in Advance.

We migrated from Astyanax Cassandra 1.1 to CQL Cassandra 3.0 directly, so we still have a lot of column families which have value as BLOB.
Major issues we face right now are:
1) Difficult to visualize data directly from database: Biggest advantage of CQL is it supports SQL like queries, hence logging into cql terminal and getting results directly from there is saves a lot of time normally. If you use BLOB you will not be able to do all such things.
2) CQL performs better when your table has a well defined schema instead of using blob to store big chunk of data together.
If you are creating a new table, I will suggest to use Collections for your use case. You will be able to store different type of data and performance will also be good.
Nice slides comparing performance of schemaless tables and tables with scehma and collections. You can skip to slide 26 if you just want the summary.
https://www.slideshare.net/DataStax/migration-from-thrift-to-cql-brij-bhushan-ravat-ericsson-cassandra-summit-2016

Query-Driven Modelling and Big Data

I was watching one of the Cassandra videos on DataSax Academy. One concept they talk a lot about is query driven modelling. This makes sense when you know your queries upfront like in the KillrVideo example.
However, in big data cases, I hope I am not the only one to think that we barely know what kind of queries analysts will perform on the data 5 months or one year down the road.
If this is the case, what are the best practices for storing your data? My guess is that for advanced querying of such data, you likely will end up loading your data into Spark. But what do I have to consider at storage time to avoid operational troubles and troubles at retrieval time? What retrieval approaches are less problematic?

Cassandra is also a database for analytics use cases, but not always for Ad-Hoc Analaytics (Only one report and this query will never perform again stuff).
For this use cases is a hadoop cluster a better option for your. (Maybe parquete on hadoop) If you see that queries will perform over and over again, Cassandra is your friend. Generally you can use Cassandra for 50 to 70% of your use cases. With column keys and secondary indizies you can perform really a wide spectrum of queries. Go to your Analytics Guys and ask them what they need. Then: Create your tables :)

Datastax has a course on doing analysis on Cassandra with Apache Spark.

Bigdata analysis in nosql

I'm trying to migrate our postgres database containing millions of clicks (few years click history) to more performing system. Our current analytic queries, which are running on postgres are taking forever to complete and it degrades performance of the whole database. I've been investigating possible solutions and I've decided to closely investigate 2 options:
HBase with Hadoop (mapreduce)
Cassandra with Spark
I was working with NoSQL before, however never used it for analytical purposes. At first I was a bit disapointed how little analytical query options those databases provide (missing groupBy, count, ...). After reading many articles and presentations I've found out, that I need to design my schema according how I intend to read my data and that storage layer is separated from query layer. Which adds more redundant data, however in the world of NoSQL this is not an issue.
Eventually I've found one nice grails plugin cassandra-orm, which internally encapsulates orderBy feature in cassandra counters counters. However I'm still worried about howto make this design extendable. What about the queries, that will come in the future, which I have no clue about today, how can I design my schema prepared for that ?
One option would be to use Spark, but Spark doesn't provide data in real time.
Could you give me some insight or advice what are the best possible options for bigdata analysis. Should I use combination of real time queries vs. pre-aggregated ones?
Thanks,

If you are looking at near real time data analysis, Spark + HBase combination is one of the solutions.
If you want to compromise on throughput, Solr + Cassandra combination from Datastax can be used.
I am using Solr + Cassandra from Datastax for my use case, which does not require real time processing. The performance of search option is not that great with this combo but I am OK with the throughput.
Spark+HBase combination seems to be promising. Depending on your business requirement & expertise, you can chose the right combination.

If you want the ability to analyse data in near-real-time with complete flexibility in query structure, I think your best bet would be to throw a scalable indexing engine such as Elasticsearch or Solr into your polyglot persistence mix. You could still use Cassanra as the primary data store and then index those fields you're interested in querying and/or aggregating.
Have a look at Datastax Enterprise which bundles together Cassandra and Solr. Also have a look at Solr's Stats component and its faceting capabilities. These, combined with the indexing engine's rich query language, are handy for implementing many analytics use cases.
If your data set consists of a few million records 'only', I think you'll be able to get some good response times from Solr or ES on a reasonably spec'ed cluster.

Can Druid replace Cassandra? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 4 years ago.
Improve this question
I cant help think that there aren't many use case that can be effectively served by Cassandra better than Druid. As a time series store or key value, queries can be written in Druid to extract data however needed.
The argument here is more around justifying Druid than Cassandra.
Apart from the Fast writes in Cassandra, is there really anything else ? Esp given the real time aggregations/and querying capabilities of Druid, does it not outweigh Cassandra.
For a more straight question that can be answered - doesnt Druid provide a superset of features as comapred to Cassandra and wouldn't one be better off in using druid rightaway? For all use cases?

For a more straight question that can be answered - doesnt Druid provide a superset of features as comapred to Cassandra and wouldn't one be better off in using druid rightaway? For all use cases?
Not at all, they are not comparable. We are talking about two very different technologies here. Easy way is to see Cassandra as a distributed storage solution, but Druid a distributed aggregator (i.e. an awesome open-source OLAP-like tool (: ). The post you are referring to, in my opinion, is a bit misleading in the sense that it compares the two projects in the world of data mining, which is not cassandra's focus.
Druid is not good at point lookup, at all. It loves time series and its partitioning is mainly based on date-based segments (e.g. hourly/monthly etc. segments that may be furthered sharded based on size).
Druid pre-aggregates your data based on pre-defined aggregators -- which are numbers (e.g. Sum the number of click events in your website with a daily granularity, etc.). If one wants to store a key lookup from a string to say another string or an exact number, Druid is the worst solution s/he can look for.

Not sure this is really a SO type of question, but the easy answer is that it's a matter of use case. Simply put, Druid shines when it facilitates very fast ad-hoc queries to data that has been ingested in real time. It's read consistent now and you are not limited by pre-computed queries to get speed. On the other hand, you can't write to the data it holds, you can only overwrite.
Cassandra (from what I've read; haven't used it) is more of an eventually consistent data store that supports writes and does very nicely with pre-compute. It's not intended to continuously ingest data while providing real-time access to ad-hoc queries to that same data.
In fact, the two could work together, as has been proposed on planetcassandra.org in "Cassandra as a Deep Storage Mechanism for Druid Real-Time Analytics Engine!".

It depends on the use case . For example I was using Cassandra for aggregation purpose i.e. stats like aggregated number of domains w.r.t. users ,departments etc . Events trends (bandwidth,users,apps etc ) with configurable time windows . Replacing Cassandra with Druid worked out very well for me because druid is super efficient with aggregations .On the other hand if you need timeseries data with eventual consistency Cassandra is better ,Where you can get details of the events .
Combination of Druid and Elasticsearch worked out very well to remove Cassandra from our Big Dada infrastructure
.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string