Streaming big data while sorting - node.js

I have a huge amount of data, so I cannot hold all of it in memory and I keep getting out-of-memory errors. One obvious solution would be streaming in Node.js; but streaming (as far as I know) is not possible together with sorting, which is one of the operations I need to apply to my data. Is there an algorithm, maybe a divide-and-conquer algorithm, that I can use to combine streaming and sorting?

You can stream the data using Kinesis and use the Kinesis Client Library, or subscribe a Lambda function to your Kinesis stream, and incrementally maintain sorted materialized views. Where you store your sorted materialized views and how you divide your data will depend on your application. If you cannot store the entire sorted materialized view, you could have rolling views. If your data is time-series, or has some other natural order, you could divide the range of your ordered attribute into chunks. Then you could have, for example, 1-day or 1-hour sorted chunks of your data. In other words, choose the sorted subdivision that allows you to keep the information in memory as needed.
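Whatever the streaming layer, the divide-and-conquer technique the question hints at is essentially an external merge sort: sort chunks that fit in memory, spill each sorted chunk to disk, then merge the chunks as streams. A minimal sketch of the idea in Python (the file paths, chunk size and the assumption of newline-terminated lines are all illustrative):

import heapq
import tempfile
from itertools import islice

def external_sort(input_path, output_path, lines_per_chunk=1_000_000):
    # Sort a large line-oriented file without ever holding it all in memory.
    chunk_files = []
    with open(input_path) as src:
        while True:
            chunk = list(islice(src, lines_per_chunk))  # read one chunk
            if not chunk:
                break
            chunk.sort()                                # in-memory sort of this chunk only
            tmp = tempfile.TemporaryFile(mode="w+")
            tmp.writelines(chunk)                       # spill the sorted chunk to disk
            tmp.seek(0)
            chunk_files.append(tmp)
    with open(output_path, "w") as out:
        # heapq.merge consumes the sorted chunks as streams, never all at once
        out.writelines(heapq.merge(*chunk_files))
    for tmp in chunk_files:
        tmp.close()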

Related

Does Spark guarantee consistency when reading data from S3?

I have a Spark Job that reads data from S3. I apply some transformations and write 2 datasets back to S3. Each write action is treated as a separate job.
Question: Does Spark guarantee that I read the data in the same order each time? For example, if I apply the function:
.withColumn('id', f.monotonically_increasing_id())
Will the id column have the same values for the same records each time?
You state very little, but the following is easily testable and should serve as a guideline:
If you re-read the same files again with the same content, you will get the same blocks / partitions again and the same id from f.monotonically_increasing_id().
If the total number of rows differs on the successive read(s), with different partitioning applied before this function, then you will typically get different ids.
If you have more data the second time around and apply coalesce(1), then the prior entries will still have the same ids and the newer rows will get other ids. A less-than-realistic scenario, of course.
Blocks for files at rest remain static (in general) on HDFS, so partitions 0..N will be the same upon reading from rest. Otherwise zipWithIndex would not be usable either.
That said, I would never rely on the same data being in the same place when read twice unless there are no updates (you could cache as well).
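If you want to verify this against your own files, a quick sketch (the paths and the join key record_key are placeholders, assuming a Spark 2.x SparkSession) is to read the same data twice, assign ids, and compare:

from pyspark.sql import SparkSession
import pyspark.sql.functions as f

spark = SparkSession.builder.getOrCreate()

# Read the same S3 prefix twice and tag each row with an id.
df1 = spark.read.parquet("s3://bucket/path").withColumn("id", f.monotonically_increasing_id())
df2 = spark.read.parquet("s3://bucket/path").withColumn("id", f.monotonically_increasing_id())

a = df1.select(f.col("record_key"), f.col("id").alias("id_first"))
b = df2.select(f.col("record_key"), f.col("id").alias("id_second"))

# If the id assignment is reproducible, no record gets two different ids across the reads.
mismatches = a.join(b, "record_key").filter(f.col("id_first") != f.col("id_second"))
print(mismatches.count())  # 0 when both reads produced identical ids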

Ways to store key-value pairs with optimized read, to be used along with a stream processing engine

We have static data with (approx.) 20M rows and 50,000 columns. It is sparse data, and we need fast reads of a single cell value or of a given column (all the rows for that column). The input is streaming data, and we want to aggregate the input (the last x minutes) depending on the values from the DB (the values mentioned above).
We need some suggestion on how should we proceed to have the lowest latency:
1. We store the values in Apache Spark's in-memory storage (on-heap or off-heap) and also process the data with Spark.
2. We store the values in Redis/RocksDB and process the data in Apache Spark.
Apache Flink is out of consideration due to resistance to adding a new framework to the stack, and we are looking for something more stable (as this problem is just one part of a larger project).
With Flink, assuming you use the row id as the key, then you could store this data as state via a Map<column id, cell value>. Under the hood if you've configured Flink to use RocksDB as the state backend, then looking up a single cell is fast, as the key into RocksDB is the <row id> + <column id>.
You could also separately key by column, and iterate over all the rows, though that's obviously going to be slower - not sure what the definition of a "fast read" is for 1M rows of a given column.
With this approach you could then use Flink's support for queryable state to make the lookups very simple to implement.
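To illustrate the layout being described, here is a sketch using PyFlink's keyed MapState (the stream wiring is omitted, and the key/value types, record shape and update logic are assumptions rather than a definitive implementation):

from pyflink.common.typeinfo import Types
from pyflink.datastream.functions import KeyedProcessFunction, RuntimeContext
from pyflink.datastream.state import MapStateDescriptor

class CellStore(KeyedProcessFunction):
    # Keyed by row id; keeps one map of column id -> cell value per row.

    def open(self, runtime_context: RuntimeContext):
        descriptor = MapStateDescriptor("cells", Types.STRING(), Types.FLOAT())
        # With the RocksDB state backend configured, each get/put touches only
        # the single <row id, column id> entry, so single-cell lookups stay cheap.
        self.cells = runtime_context.get_map_state(descriptor)

    def process_element(self, value, ctx):
        # value is assumed to be a (row_id, column_id, cell_value) record
        _row_id, column_id, cell_value = value
        self.cells.put(column_id, cell_value)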

PySpark Cassandra Connector efficiently querying across partition keys

I'm faced with the following problem using PySpark and dataframes with the cassandra-connector. My Cassandra data lake consists of metric measurements across (network) devices, and the entries are of type (device,interface,metric,time,value).
My cassandra table for the raw data has:
PRIMARY KEY ((device,interface,metric),time)
for supposedly efficient fetching of time ranges for a given measurement.
Now for reporting purposes, users can query any set of device/interface/metric combinations (i.e. give me a specific metric for all interfaces of a device). I know the list of each, so I'm not looking to do wildcard searches, but rather IN queries.
I'm using Spark 1.4, so I'm adding filters like the following to obtain dataframes on which to calculate min/max/percentile/etc. of the recorded metric values.
metrics_raw_sub = metrics_raw\
    .filter(metrics_raw.device.inSet(device_list))\
    .filter(metrics_raw.interface.inSet(interface_list))\
    .filter(metrics_raw.metric.inSet(metric_list))
This isn't very efficient as these predicates do not get pushed down to CQL (only the last predicate can be an IN query), so I'm pulling in tons of data and filtering on the client side. (not good)
Why doesn't cassandra-connector allow multiple IN predicates across partition columns? Doing this in a native CQL shell appears to work?
Another approach to my problem above would be to (and this yields efficient individual queries as predicates are pushed down to Cassandra):
for device in device_list:
    for interface in interface_list:
        metrics_raw_sub = metrics_raw\
            .filter(metrics_raw.device == device)\
            .filter(metrics_raw.interface == interface)\
            .filter(metrics_raw.metric.inSet(metric_list))
And then run the aggregation logic for each subquery, but I feel like this is largely serialising what should be a parallel computation across all requested device/interface/metric values... Can I batch the Cassandra queries so I can run my analytics on one large distributed dataframe?
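For example, what I mean by one large distributed dataframe would be something like unioning the per-combination subqueries back together (just a sketch; unionAll is the Spark 1.4 name, later renamed union):

from functools import reduce

frames = []
for device in device_list:
    for interface in interface_list:
        frames.append(metrics_raw
                      .filter(metrics_raw.device == device)
                      .filter(metrics_raw.interface == interface)
                      .filter(metrics_raw.metric.inSet(metric_list)))

# One logical dataframe over all the pushed-down subqueries; aggregations on it
# would still run distributed across the cluster.
metrics_raw_sub = reduce(lambda a, b: a.unionAll(b), frames)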
Bottom line, I'm looking to do this very efficiently. If the turn-around times are short enough, we'll run these on demand. If not, we'll need to look into pre-computing them and storing the results in tables (which sacrifices flexibility for doing custom time-range reporting).
Any insights would be much appreciated!!
Nik.

Perform queries over the time-series stream of data

I'm trying to design an architecture of my streaming application and choose the right tools for the job.
This is how it works currently:
Messages from "application-producer" part have a form of (address_of_sensor, timestamp, content) tuples.
I've already implemented all the functionality before Kafka, and now I've encountered a major flaw in the design. In the "Spark Streaming" part, the consolidated stream of messages is translated into a stream of events. The problem is that events are for the most part composite: they consist of multiple messages that occurred at the same time at different sensors.
I can't rely on "time of arrival to Kafka" as a means of detecting simultaneity, so I have to somehow sort messages in Kafka before extracting them with Spark. Or, more precisely, make queries over Kafka messages.
Maybe Cassandra is the right replacement for Kafka here? I have really simple data model, and only two possible types of queries to perform: query by address, and range query by timestamp. Maybe this is the right choice?
Does anybody have any numbers on Cassandra's throughput?
If you want to run queries on your time series, Cassandra may be the best fit: it is very write-optimized, and you can build 'wide' rows for your series. It is possible to take slices of your wide rows, so you can select a time range with only one query.
On the other hand, Kafka can be considered a raw data flow: you don't have queries, only recently produced data. In order to collect data based on some key in the same partition, you have to select this key carefully. All data within the same partition is time-sorted.
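To make the wide-row slice idea concrete, here is a sketch with the Python cassandra-driver, assuming a table where the sensor address is the partition key and the timestamp is a clustering column (keyspace, table and column names are placeholders):

from datetime import datetime
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("sensors")

# Assumed table: PRIMARY KEY ((address), ts) WITH CLUSTERING ORDER BY (ts DESC)
# One address = one partition, so a time-range slice hits a single partition.
rows = session.execute(
    "SELECT ts, content FROM messages "
    "WHERE address = %s AND ts >= %s AND ts < %s",
    ("sensor-42", datetime(2015, 6, 1), datetime(2015, 6, 2)),
)
for row in rows:
    print(row.ts, row.content)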
Range queries on a timestamp are the classic Cassandra use case. If you need address-based queries as well, you would have to make the address a clustering column if using Cassandra. As far as Cassandra throughput is concerned, if you can invest in proper performance analysis of your Cassandra cluster, you can achieve very high write throughput. But I have used Spark SQL, the Cassandra driver and the Spark Cassandra connector, and they don't really give high query throughput until you have a big cluster with a high-CPU configuration; they do not work well with small datasets.
Kafka should not be used as a data source for queries; it is more of a commit log.

What is the best data model for timeseries in Cassandra when *fast sequential reads* are required

I want to store streaming financial data into Cassandra and read it back fast. I will have up to 20000 instruments ("tickers") each containing up to 3 million 1-minute data points. I have to be able to read large ranges of each of these series as speedily as possible (indeed it is the reason I have moved to a columnar-type database as MongoDB was suffocating on this use case). Sometimes I'll have to read the whole series. Sometimes I'll need less but typically the most recent data first. I also want to keep things really simple.
Is this model, which I picked up in a Datastax tutorial, the most effective? Not everyone seems to agree.
CREATE TABLE minutedata (
    ticker text,
    time timestamp,
    value float,
    PRIMARY KEY (ticker, time))
WITH CLUSTERING ORDER BY (time DESC);
I like this because there are up to 20,000 tickers, so the partitioning should be efficient, and there are only up to 3 million minutes in a row, while Cassandra can handle up to 2 billion. Also, with the descending time order I get the most recent data when using a limit on the query.
However, the book Cassandra High Availability by Robbie Strickland mentions the above as an anti-pattern (using sensor-data analogy), and I quote the problems he cites from page 144:
Data will be collected for a given sensor indefinitely, and in many cases at a very high frequency
With sensorID as the partition key, the row will grow by two columns for every reading (one marker and one reading).
I understand point one would be a problem but it's not in my case due to the 3 million data point limit. But point 2 is interesting. What are these "markers" between each reading? I clearly want to avoid anything that breaks contiguous data storage.
If point 2 is a problem, what is a better way to model timeseries so that they can efficiently be read in large ranges, fast? I'm not particularly keen to break the timeseries into smaller sub-periods.
If your query pattern was to find a few rows for a ticker using a range query, then I would say having all the data for a ticker in one partition would be a good approach since Cassandra is optimized to access partitions efficiently.
But if everything is in one partition, then that means the query is happening on only one node. Since you say you often want to read large ranges of rows, you may want more parallelism.
If you split that same data across many nodes and read it in parallel, you may be able to get better performance. For example, if you partitioned your data by ticker and by year, and you had ten nodes, you could theoretically issue ten async queries and have each year queried in parallel.
Now 3 million rows is a lot, but not really that big, so you'd probably have to run some tests to see which approach was actually faster for your situation.
If you're doing more than just retrieving all these rows and are doing some kind of analytics on them, then parallelism will become more attractive, and you might want to look into pairing Cassandra with Spark so that the data can be read and processed in parallel on many nodes.
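A sketch of the split-by-year idea with the Python cassandra-driver, issuing the per-partition queries asynchronously and merging the results client-side (the (ticker, year) partitioning, table and column names here are assumptions, not the model from the question):

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("marketdata")

# Assumed table: PRIMARY KEY ((ticker, year), time) WITH CLUSTERING ORDER BY (time DESC)
query = "SELECT time, value FROM minutedata_by_year WHERE ticker = %s AND year = %s"

# One partition per (ticker, year); the futures can be served by different
# nodes in parallel instead of walking a single huge ticker partition.
futures = [session.execute_async(query, ("AAPL", year)) for year in range(2010, 2016)]

rows = []
for future in futures:
    rows.extend(future.result())                 # gather each year's slice as it completes
rows.sort(key=lambda r: r.time, reverse=True)    # most recent first, matching the query pattern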
