Presto's support for approx_distinct - presto

I am evaluating distributed query engines for analytical queries (both interactive and batch) on large-scale data (~100GB). One of the requirements is low latency (<= 1s) for count-distinct queries, where approximate results (with up to 5% error) are acceptable.
Presto seems to support this with its approx_distinct(). As far as my understanding goes, it uses HyperLogLog for that. However, unless the data is persisted in rolled-up form, along with the HyperLogLog values, it would have to be computed on the fly. I do not think my queries would finish within a second for large datasets.
Does it support rollup with HyperLogLog computation at ingestion time (similar to Druid)? Given that unlike Druid, Presto queries the data from external stores (Hive/Cassandra/RDBMS etc.), I am not sure that ingestion time rollups are supported, unless Presto's native store supports them. Can someone please confirm?

There isn't such a thing as "Presto's native store". Presto is a query execution engine with a connector architecture that allows plugging in multiple storage layers.
If you want an approximate count-distinct for a whole data set, you can compute table stats (when using Presto with Hive, this currently needs to be done in Hive).
If you want an approximate count-distinct for a dynamic selection of data, you still need to read the data, so you won't get down to one-second latency with such a big data set. However, you can combine approx_distinct (or plain count(distinct ..)) with TABLESAMPLE to limit the amount of data read.
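To make the rollup idea from the question concrete, here is a minimal sketch of a HyperLogLog-style estimator in Python. It is an illustration only, not Presto's implementation: the class name `HLL`, the register count (`p=12`), the SHA-1 based hash and the sample user IDs are all assumptions chosen for the example. What it tries to show is that such sketches are mergeable, so per-day (or per-partition) sketches computed ahead of time could be combined at query time without rereading the raw rows.

```python
import hashlib
import math


class HLL:
    """Toy HyperLogLog estimator (illustrative, not Presto's implementation)."""

    def __init__(self, p=12):
        self.p = p                      # number of index bits (assumed value)
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m

    def add(self, value):
        # 64-bit hash taken from SHA-1 (an arbitrary choice for this sketch)
        digest = hashlib.sha1(str(value).encode()).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                  # top p bits select a register
        rest = h & ((1 << (64 - self.p)) - 1)     # remaining 64 - p bits
        # rank = position of the leftmost 1-bit within the remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        # Merging two sketches is a register-wise maximum
        assert self.m == other.m
        merged = HLL(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)     # bias constant for m >= 128
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)   # small-range correction
        return raw


# Two hypothetical daily rollups with overlapping users
day1, day2 = HLL(), HLL()
for user_id in range(0, 60_000):
    day1.add(user_id)
for user_id in range(40_000, 100_000):
    day2.add(user_id)

combined = day1.merge(day2)
print(round(combined.estimate()))   # approximate number of distinct users across both days
```

With 4096 registers the expected relative error is roughly 1.04 / sqrt(4096), or about 1.6%, which is within the 5% tolerance mentioned in the question. Whether a given store can persist and merge such sketches is a separate matter that depends on the connector and storage layer.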

You can try Verdict, which can significantly reduce query processing cost by applying statistics and approximate query processing, yielding 99.9% accuracy. It runs on all SQL-based engines including Apache Hive, Apache Impala, Apache Spark, Amazon Redshift, etc.
You can download the source code from here. After downloading it and doing some simple setup, you can issue queries as you normally do and get results in a much shorter time.

Related

What is the performance difference between stream.filter instead of CQL ALLOW FILTERING?

My Cassandra table doesn't hold much data right now.
However, since data accumulates in this table continuously, I am interested in performance issues.
First of all, please set aside the question of whether the table needs to be redesigned.
Think of it as a general RDBMS date-based lookup (startDate ~ endDate).
Option 1: Query Cassandra with ALLOW FILTERING to force the query. This returns exactly the data you want.
Option 2: Query "all data" from Cassandra (no WHERE clause); this query only needs to be done once. After that, extract only the data within the desired date range with the stream().filter() function.
Which method would you choose?
In general, which one has more performance issues?
Summary: the lookup is needed in about 6 methods.
Option 1: Execute the ALLOW FILTERING query 6 times / no stream filter
Option 2: Execute the findAll query once / execute the stream filter 6 times
The challenge with both options is that neither will scale. It may work with very small data sets, say less than 1000 partitions, but you will quickly find that neither will work once your tables grow.
Cassandra is designed for real-time OLTP workloads where you are retrieving a single partition for real-time applications.
For analytics workloads, you should instead use Spark with the spark-cassandra-connector because it optimises analytics queries. Cheers!

Getting data OUT of Cassandra?

How can I export data, over a period of time (like hourly or daily) or updated records from a Cassandra database? It seems like using an index with a date field might work, but I definitely get timeouts in my cqlsh when I try that by hand, so I'm concerned that it's not reliable to do that.
If that's not the right way, then how do people get their data out of Cassandra and into a traditional database (for analysis, querying with JOINs, etc.)? It's not a Java shop, so using Spark is non-trivial (and we don't want to change our whole system to use Spark instead of Cassandra directly). Do I have to read sstables and try to keep track of them that way? Is there a way to say "get me all records affected after point in time X" or "get me all changes after timestamp X" or something similar?
It looks like Cassandra is really awesome at rapidly reading and writing individual records, but beyond that Cassandra seems to not be the right tool if you want to pull its data into anything else for analysis or warehousing or querying...
Spark is the most typical way to do exactly that (as you say). It does it efficiently and is used often, so it is pretty reliable. Cassandra is not really designed for OLAP workloads, but things like the Spark connector help bridge the gap. DataStax Enterprise might have some more options available to you, but I am not sure of their current offerings.
You can still just query and page through the whole data set with normal CQL queries; it's just not as fast. You can even use ALLOW FILTERING, but be wary, as it's very expensive and can impact your cluster (creating a separate DC for the workload and querying it with a LOCAL_* consistency level such as LOCAL_QUORUM helps). In that scenario you will probably also add < token() and > token() bounds to the WHERE clause to split up the query and prevent too much work on any one coordinator. Organizing your data so that this query is more efficient is strongly recommended (i.e. if doing time slices, put things in a partition bucketed by time, with timeuuid clustering keys, so it's a sequential read for each slice of time).
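The token-range splitting mentioned above can be sketched roughly as follows. This is only an illustration under assumptions: it assumes the default Murmur3Partitioner (whose tokens span -2^63 to 2^63 - 1), and the table name `events`, the partition key `user_id` and the helper names `token_ranges` and `range_queries` are hypothetical. It only builds the CQL strings; running them with paging is left to whatever driver you use.

```python
MIN_TOKEN = -(2 ** 63)       # lower bound of the Murmur3Partitioner token range
MAX_TOKEN = 2 ** 63 - 1      # upper bound of the Murmur3Partitioner token range


def token_ranges(splits):
    """Return (start, end) pairs that cover the whole token ring without gaps."""
    span = MAX_TOKEN - MIN_TOKEN
    bounds = [MIN_TOKEN + span * i // splits for i in range(splits)] + [MAX_TOKEN]
    return list(zip(bounds[:-1], bounds[1:]))


def range_queries(table, partition_key, splits):
    """Build one CQL query per token sub-range (start exclusive, end inclusive)."""
    queries = []
    for i, (start, end) in enumerate(token_ranges(splits)):
        # The first range uses >= so the minimum token is not skipped
        op = ">=" if i == 0 else ">"
        queries.append(
            f"SELECT * FROM {table} "
            f"WHERE token({partition_key}) {op} {start} "
            f"AND token({partition_key}) <= {end}"
        )
    return queries


# Hypothetical table and key names, split into 4 sub-queries
queries = range_queries("events", "user_id", 4)
for q in queries:
    print(q)
```

Each sub-query can then be run and paged independently, ideally with more splits than nodes so that no single coordinator handles a large share of the ring at once.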
Kinda cheesy sounding but the CSV dump from cqlsh is actually fast and might work for you if your data set is small enough.
I would not recommend going to the sstables directly unless you are familiar with the internals and are using Hadoop or Spark.

What timeseries database to select for large number of records?

I have a scenario where I need to store about 100,000 input records per second. The records are time series data.
I need to run aggregations, other analytics and also some machine learning algorithms over the data continuously. Performance is the key factor here, as I am looking for near real-time results.
What would you recommend as a database engine?
Take a look at the ClickHouse analytical database. It can accept millions of rows per second. It can scan billions of rows per second on a single computer. It scales horizontally to multiple nodes. It fits time series workloads.
If you still need a time series database, then try VictoriaMetrics. It is built on ClickHouse ideas, so it is fast and resource-efficient.
I am adding my own solution...
ClickHouse is definitely a nice killer. But for a new project I am now evaluating the open source GPU database OmniSci. Its open source version is limited to a single GPU node (up to 16 GPU devices; with OEM Tesla cards having 64GB per device you can get 1TB of VRAM, though of course that is not as cheap as ClickHouse). It's simply a SQL database on steroids (a JDBC driver exists) with a Kafka data source.
OmniSci also has a cross-dashboarding solution, which is commercially licensed, but with it you can have real-time dashboarding over, let's say, 20-50 billion time series records (8-16 GPUs) and multi-dashboard real-time analytics without any kind of preaggregation required, etc.
But it will cost money...
If you want to go purely open source, my second candidate is NVIDIA's RAPIDS framework, which implements cuDF (CUDA DataFrame, a data structure similar to Spark's); you can use it to keep your data window (append new, delete obsolete). There is also the cuxfilter solution, which is similar to OmniSci but is more of a framework; with a skilled frontend coder you can achieve something very similar to, or the same as, OmniSci.
Of course you can go and implement your own on top of Cassandra with an appropriate data model for your use case. This will maybe get you the best results tailored to your needs.
You could look at KairosDB (https://kairosdb.github.io/), which is a time series database on top of Apache Cassandra; I got 50k writes per second on a medium-sized single (but bare metal) node.
It's quite well documented (https://kairosdb.github.io/docs/build/html/CassandraSchema.html) and it has aggregators out of the box (https://kairosdb.github.io/docs/build/html/restapi/QueryMetrics.html).
OpenTSDB was slower in my tests. Influx looks promising but I have no experience with it myself: https://github.com/influxdata/influxdb

What would be the proper way to tune Apache Spark for responsive web applications?

I have previously used Apache Spark for streaming applications where it does a wonderful job for ETL pipelines and predictions using Machine Learning.
However, Spark for EDA may not be as fast as one may want. For example, if you would like to do basic mathematical operations on data coming from Postgres or ElasticSearch using the data frames in Spark, the time it takes to fetch data from the host system and do the analysis is much higher than that taken by the SQL query on Postgres to run.
Even simple aggregations such as sum, average, and count can be done much faster using SQL than doing them on top of Spark-SQL.
From what I understand, this is not because of latency in fetching the data from the host system. If you call the show method on a data frame, you can quickly get the top rows of the data set. However, if you limit the response in SQL and then call collect, the time taken is huge.
This means that the data is there, but the processing being done while calling collect is what takes the time.
Regardless of the data source (CSV file, JSON file, ElasticSearch, Parquet, etc.), the behavior remains the same.
What is the reason for this latency on collect and is there any way to reduce it to the point where it can work with responsive applications to make real-time or near real-time queries?

Is it bad to use INDEX in Cassandra if performance is not important?

Background
We have recently started a "Big Data" project where we want to track what users are doing with our product - how often they are logging in, which features they are clicking on, etc - your basic user analytics stuff. We still don't know exactly what questions we will be asking, but most of it will be "how often did X occur over the last Y months?" type of thing, so we started storing the data sooner rather than later thinking we can always migrate, re-shape etc when we need to but if we don't store it it is gone forever.
We are now looking at what sorts of questions we can ask. In a typical RDBMS, this stage would consist of slicing and dicing the data in many different dimensions, exporting to Excel, producing graphs, looking for trends etc - it seems that for Cassandra, this is rather difficult to do.
Currently we are using Apache Spark, and submitting Spark SQL jobs to slice and dice the data. This actually works really well, and we are getting the data we need, but it is rather cumbersome as there doesn't seem to be any native API for Spark that we can connect to from our workstations, so we are stuck using the spark-submit script and a Spark app that wraps some SQL from the command line and outputs to a file which we then have to read.
The question
In a table (or Column Family) with ~30 columns running on 3 nodes with RF 2, how bad would it be to add an INDEX to every non-PK column, so that we could simply query it using CQL across any column? Would there be a horrendous impact on the performance of writes? Would there be a large increase in disk space usage?
The other option I have been investigating is using Triggers, so that for each row inserted, we populated another handful of tables (essentially, custom secondary index tables) - is this a more acceptable approach? Does anyone have any experience of the performance impact of Triggers?
Impact of adding more indexes:
This really depends on your data structure, distribution and how you access it; you were right before when you compared this process to an RDBMS. For Cassandra, it's best to define your queries first and then build the data model.
These guys have a nice write-up on the performance impacts of secondary indexes:
https://pantheon.io/blog/cassandra-scale-problem-secondary-indexes
The main impact (from the post) is that secondary indexes are local to each node, so to satisfy a query by indexed value, each node has to query its own records to build the final result set (as opposed to a primary key query, where it is known exactly which node needs to be queried). So there's not just an impact on writes, but on read performance as well.
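The difference between the two read paths can be illustrated with a toy model in Python. This is not how Cassandra is implemented: the `Node` and `Cluster` classes, the modulo-based placement (a stand-in for the partitioner) and the sample rows are all assumptions made for the example, and replication and row updates are ignored. It only counts how many nodes each kind of query has to contact.

```python
from collections import defaultdict


class Node:
    """One node holding its own rows plus a local index over one column."""

    def __init__(self):
        self.rows = {}                     # primary key -> row
        self.index = defaultdict(set)      # indexed value -> local primary keys

    def write(self, pk, row, indexed_col):
        self.rows[pk] = row
        self.index[row[indexed_col]].add(pk)   # extra index work on every write


class Cluster:
    """Toy cluster: no replication, rows are never updated after insert."""

    def __init__(self, n_nodes, indexed_col):
        self.nodes = [Node() for _ in range(n_nodes)]
        self.indexed_col = indexed_col

    def _owner(self, pk):
        # Stand-in for the partitioner: the key alone determines the node
        return self.nodes[hash(pk) % len(self.nodes)]

    def write(self, pk, row):
        self._owner(pk).write(pk, row, self.indexed_col)

    def get_by_pk(self, pk):
        # Primary key lookup: exactly one node is contacted
        return self._owner(pk).rows.get(pk), 1

    def get_by_index(self, value):
        # Indexed lookup: every node must consult its local index
        results, contacted = [], 0
        for node in self.nodes:
            contacted += 1
            results.extend(node.rows[pk] for pk in node.index.get(value, ()))
        return results, contacted


cluster = Cluster(n_nodes=3, indexed_col="feature")
for user_id in range(9):
    feature = "login" if user_id % 2 == 0 else "export"
    cluster.write(user_id, {"user_id": user_id, "feature": feature})

row, nodes_for_pk = cluster.get_by_pk(4)
rows, nodes_for_index = cluster.get_by_index("login")
print(nodes_for_pk, nodes_for_index, len(rows))
```

In this model the primary key read touches a single node, while the indexed read fans out to every node regardless of how many rows match, which is the scaling concern the post describes.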
In terms of working out the performance on your data model, I'd recommend using the cassandra-stress tool; you can combine it with a data modeler tool that DataStax has built to quickly generate profile YAMLs:
http://www.datastax.com/dev/blog/data-modeler
For example, I ran the basic stress profile without and then with secondary indexes on the default table, and the "with indexes" batch of writes took a little over 40% longer to complete. There was also an increase in GC operations / duration etc.

Resources