Spark data volume, parallelisation trade-off - apache-spark

I have data that is grouped on three columns. Two of the three columns have very high cardinality (up to 500 unique values per column), but each group will have at most 400 rows.
I need to perform some computation on the grouped data. The computation takes a couple of seconds for each group. Would using Spark be overkill here? Will the process of parallelizing and distributing the operation add more time than doing it on one machine (and maybe using multiprocessing)?
Also, will adding more levels of parallelisation (on the high-cardinality columns) using Spark increase the net time taken to process the data for the same cluster configuration?

Related

How to obtain row count estimates in Cassandra using the Java client driver

If the only thing I have available is a com.datastax.driver.core.Session, is there a way to get a rough estimate of the row count in a Cassandra table from a remote server? Performing a count is too expensive. I understand I can get a partition count estimate through JMX, but I'd rather not assume JMX has been configured. (I think that result must be multiplied by the number of nodes and divided by the replication factor.) Ideally the estimate would include clustering keys too, but everything is on the table.
I also see there's a size_estimates table in the system keyspace, but I don't see much documentation on it. Is it periodically refreshed, or do the admins need to run something like nodetool flush?
Aside from not including clustering keys, what's wrong with using this as a very rough estimate?
select sum(partitions_count)
from system.size_estimates
where keyspace_name='keyspace' and table_name='table';
The size_estimates table is updated on a timer every 5 minutes (overridable with -Dcassandra.size_recorder_interval).
This is a very rough estimate, but from the token of the partition key you could find the range it belongs to and, on each of the replicas, read from this table (it uses local replication and is unique to each node, not global), then use the recorded size and number of partitions for a very vague estimate of the partition size. Many assumptions and a lot of averaging go into this path even before anything is written to the table. Cassandra errs on the side of efficiency at the cost of accuracy, and the table is meant more for general uses like Spark bulk reading, so take it with a grain of salt.
It's not useful now, but looking towards the future, after the 4.0 freeze there will be many new virtual tables, possibly including ones that give accurate statistics on specific partitions and ranges of partitions on demand.
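As a minimal sketch of how the per-node rows could be combined once they have been read from each node's system.size_estimates: the node names and counts below are made up, and overlap_factor is a hypothetical knob. Whether any correction for replication is needed depends on which token ranges each node records, so it is left as an explicit parameter rather than asserted.

```python
from typing import Dict, Iterable, Mapping


def node_partition_total(rows: Iterable[Mapping[str, int]]) -> int:
    """Sum partitions_count over the size_estimates rows read from one node."""
    return sum(row["partitions_count"] for row in rows)


def cluster_partition_estimate(
    rows_by_node: Dict[str, Iterable[Mapping[str, int]]],
    overlap_factor: float = 1.0,
) -> float:
    """Add up every node's total, then divide by an assumed overlap factor.

    overlap_factor is an assumption: use 1.0 if each node only reports
    ranges that no other node reports, or the replication factor if the
    same ranges are counted once per replica.
    """
    total = sum(node_partition_total(rows) for rows in rows_by_node.values())
    return total / overlap_factor


# Hypothetical rows, one list per node, as they might come back from
# "select partitions_count from system.size_estimates where ...".
sample = {
    "node1": [{"partitions_count": 1200}, {"partitions_count": 800}],
    "node2": [{"partitions_count": 1500}],
    "node3": [{"partitions_count": 1000}],
}

print(cluster_partition_estimate(sample))
```

Because the table is local to each node, a query sent through a single Session only sees the coordinator's rows, so each node would have to be queried separately to fill in a structure like sample.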

Does joining additional columns in Spark scale horizontally?

I have a dataset with about 2.4M rows, with a unique key for each row. I have performed some complex SQL queries on some other tables, producing a dataset with two columns, a key and the value true. This dataset is about 500 rows. Now I would like to (outer) join this dataset with my original table.
This produces a new table with a very sparse set of values (true in about 500 rows, null elsewhere).
Finally, I would like to do this about 200 times, giving me a final table of about 201 columns (the key, plus the 200 sparse columns).
When I run this, I notice that as it runs it gets considerably slower. The first join takes 2 seconds, then 4s, then 6s, then 10s, then 20s and after about 30 joins the system never recovers. Of course, the actual numbers are irrelevant as that depends on the cluster I'm running, but I'm wondering:
Is this slowdown expected?
I am using parquet as a data storage format (columnar storage) so I was hopeful that adding more columns would scale horizontally, is that a correct assumption?
All the columns I've joined so far are not needed for the Nth join, can they be unloaded from memory?
Are there other things I can do when combining lots of columns in spark?
Calling explain on each join in the loop shows that each join is getting more complex (appears to include all previous joins and it also includes the complex sql queries, even though those have been checkpointed). Is there a way to really checkpoint so each join is just a join? I am actually calling show() after each join, so I assumed the join is actually happening at that point.
Is this slowdown expected?
Yes, to some extent it is. Joins are among the most expensive operations in data-intensive systems (it is no coincidence that products which claim linear scalability usually take joins off the table). Join-like operations in a distributed system typically require data exchange between nodes, which hits a bunch of the high latency numbers.
In Spark SQL there is also the additional cost of computing the execution plan, which has larger than linear complexity.
I am using parquet as a data storage format (columnar storage) so I was hopeful that adding more columns would scale horizontally, is that a correct assumption?
No. Input format doesn't affect join logic at all.
All the columns I've joined so far are not needed for the Nth join, can they be unloaded from memory?
If they are truly excluded from the final output, they will be pruned from the execution plan. But since you join them for a reason, I assume that is not the case and they are required for the final output.
Is there a way to really checkpoint so each join is just a join? I am actually calling show() after each join, so I assumed the join is actually happening at that point.
show computes only a small subset of data required for the output. It doesn't cache, although shuffle files might be reused.
(appears to include all previous joins and it also includes the complex sql queries, even though those have been checkpointed).
Checkpoints are created only if the data is fully computed, and they don't remove stages from the execution plan. If you want to do it explicitly, write the partial result to persistent storage and read it back at the beginning of each iteration (this is probably overkill).
Are there other things I can do when combining lots of columns in spark?
The best thing you can do is to find a way to avoid joins completely. If the key is always the same, then a single shuffle and an operation on groups / partitions (with byKey methods or window functions) might be a better choice.
However, if you "have a dataset with about 2.4M rows", then using a non-distributed system that supports in-place modification might be a much better choice.
In the most naive implementation you can compute each aggregate separately, sort by key and write to disk. Then data can be merged together line by line with negligible memory footprint.
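A minimal sketch of that naive merge, assuming every input is already sorted by key with the same ordering. The function name and the in-memory lists are hypothetical stand-ins; in practice each stream would be a file read line by line, so only one pending key per column is held in memory at a time.

```python
from typing import Iterable, Iterator, List, Optional, Tuple


def merge_sparse_flags(
    all_keys: Iterable[str],
    flag_streams: List[Iterable[str]],
) -> Iterator[Tuple[str, List[Optional[bool]]]]:
    """Left-join sorted key streams onto a sorted master key stream.

    all_keys: every key of the original table, sorted ascending.
    flag_streams: one sorted stream of keys per sparse column; a key
    being present means the value is True, absent means null (None).
    """
    iterators = [iter(stream) for stream in flag_streams]
    # Keep exactly one pending key per column.
    heads = [next(it, None) for it in iterators]
    for key in all_keys:
        row: List[Optional[bool]] = []
        for i, it in enumerate(iterators):
            # Drop flag keys that sort before the current master key.
            while heads[i] is not None and heads[i] < key:
                heads[i] = next(it, None)
            if heads[i] == key:
                row.append(True)
                heads[i] = next(it, None)
            else:
                row.append(None)
        yield key, row


# Toy data: four keys and three sparse columns.
keys = ["a", "b", "c", "d"]
columns = [["b", "d"], ["a"], []]
for key, row in merge_sparse_flags(keys, columns):
    print(key, row)
```

Each input is read once from start to end, so the cost grows with the total number of lines rather than with the number of joins already performed.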

What is the best data model for timeseries in Cassandra when *fast sequential reads* are required

I want to store streaming financial data into Cassandra and read it back fast. I will have up to 20000 instruments ("tickers") each containing up to 3 million 1-minute data points. I have to be able to read large ranges of each of these series as speedily as possible (indeed it is the reason I have moved to a columnar-type database as MongoDB was suffocating on this use case). Sometimes I'll have to read the whole series. Sometimes I'll need less but typically the most recent data first. I also want to keep things really simple.
Is this model, which I picked up in a Datastax tutorial, the most effective? Not everyone seems to agree.
CREATE TABLE minutedata (
ticker text,
time timestamp,
value float,
PRIMARY KEY (ticker, time))
WITH CLUSTERING ORDER BY (time DESC);
I like this because there are up to 20 000 tickers so the partitioning should be efficient, and there are only up to 3 million minutes in a row, and Cassandra can handle up to 2 billion. Also with the time descending order I get most recent data when using a limit on the query.
However, the book Cassandra High Availability by Robbie Strickland mentions the above as an anti-pattern (using sensor-data analogy), and I quote the problems he cites from page 144:
Data will be collected for a given sensor indefinitely, and in many
cases at a very high frequency
With sensorID as the partition key, the row will grow by two
columns for every reading (one marker and one reading).
I understand point one would be a problem but it's not in my case due to the 3 million data point limit. But point 2 is interesting. What are these "markers" between each reading? I clearly want to avoid anything that breaks contiguous data storage.
If point 2 is a problem, what is a better way to model timeseries so that they can efficiently be read in large ranges, fast? I'm not particularly keen to break the timeseries into smaller sub-periods.
If your query pattern was to find a few rows for a ticker using a range query, then I would say having all the data for a ticker in one partition would be a good approach since Cassandra is optimized to access partitions efficiently.
But if everything is in one partition, then the query is happening on only one node. Since you say you often want to read large ranges of rows, you may want more parallelism.
If you split that same data across many nodes and read it in parallel, you may be able to get better performance. For example, if you partitioned your data by ticker and by year, and you had ten nodes, you could theoretically issue ten async queries and have each year queried in parallel.
Now 3 million rows is a lot, but not really that big, so you'd probably have to run some tests to see which approach was actually faster for your situation.
If you're doing more than just retrieving all these rows and are doing some kind of analytics on them, then parallelism will become more attractive, and you might want to look into pairing Cassandra with Spark so that the data can be read and processed in parallel on many nodes.
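To make the ticker-and-year idea concrete, here is a rough sketch that only works out which partitions a time range touches. It assumes a hypothetical table minutedata_by_year with PRIMARY KEY ((ticker, year), time); that table name, the year column and the query text are illustrations, not something from the tutorial or the book.

```python
from datetime import datetime
from typing import List, Tuple

# Hypothetical statement for the assumed table layout; one execution per bucket.
BUCKET_QUERY = (
    "SELECT time, value FROM minutedata_by_year "
    "WHERE ticker = ? AND year = ? AND time >= ? AND time <= ?"
)


def year_buckets(ticker: str, start: datetime, end: datetime) -> List[Tuple[str, int]]:
    """Return the (ticker, year) partition keys covering [start, end].

    Buckets are listed most recent first, matching the
    "most recent data first" access pattern in the question.
    """
    if end < start:
        raise ValueError("end must not be earlier than start")
    return [(ticker, year) for year in range(end.year, start.year - 1, -1)]


buckets = year_buckets("ABC", datetime(2019, 11, 1), datetime(2021, 2, 1))
print(buckets)
```

Each bucket could then be bound to the statement and submitted through the driver's asynchronous execute call, with the results concatenated in bucket order. Whether that beats a single-partition read is something only a test on real data will show.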

Why is the HBase count operation so slow?

The command is:
count 'tableName'.
It's very slow to get the total row number of the whole table.
My situation is:
I have one master and two slaves, each node with 16 cpus and 16G memory.
My table only has one column family with two columns: title and Content.
The title column holds at most 100 bytes; the content may hold up to 5M bytes.
Right now the table has 1550 rows, and every time I count the rows it takes about 2 minutes.
I'm very curious why HBase is so slow on this operation; I guess it's even slower than MySQL. Is Cassandra faster than HBase on these operations?
First of all, you have a very small amount of data. At that kind of volume, IMO using NoSQL gives you no advantage.
Your test is not appropriate to judge performance of HBase and Cassandra. Both have their own use cases and sweet spots.
The count command in the HBase shell runs a single-threaded Java program to count rows. Still, I am surprised that it's taking 2 minutes to count 1550 rows.
If you would like to count faster (for a bigger dataset), you should run the HBase RowCounter MapReduce job.
Run the MapReduce job like this:
bin/hbase org.apache.hadoop.hbase.mapreduce.RowCounter tableName
First of all, please keep in mind that, to make use of data locality, your "slaves" (better known as RegionServers) must also have the DataNode role; not doing so is a performance killer.
For performance reasons, HBase does not maintain a live counter of rows. To perform a count, the HBase shell client needs to retrieve all the data, which means that if your average row has 5M of data, the client retrieves 5M * 1550 from the RegionServers just to count, which is a lot.
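To put a number on "a lot", here is a quick back-of-the-envelope calculation using the figures from the question. It assumes every row really carries the full 5M bytes of content, which the question only gives as a maximum, so it is an upper bound rather than a measurement.

```python
ROWS = 1550                      # row count from the question
MAX_ROW_BYTES = 5 * 1000 * 1000  # "5M" read as 5 MB per row (worst-case assumption)
COUNT_SECONDS = 120              # "about 2 minutes" from the question

# Total data the client would pull if every row were at its maximum size.
worst_case_bytes = ROWS * MAX_ROW_BYTES

# Transfer rate that would be needed to move that much in the observed time.
implied_throughput = worst_case_bytes / COUNT_SECONDS

print(f"worst case transfer: {worst_case_bytes / 1e9:.2f} GB")
print(f"implied throughput: {implied_throughput / 1e6:.1f} MB/s")
```

Under that assumption the count moves about 7.75 GB, which works out to roughly 65 MB/s over two minutes, so the observed time is plausible if most rows are near the maximum size.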
To speed it up you have 2 options:
If you need realtime responses, you can maintain your own live counter of rows using HBase atomic counters: each time you insert, you increment the counter, and each time you delete, you decrement it. It can even live in the same table; just use another column family to store it.
If you don't need realtime responses, run a distributed row counter MapReduce job, forcing the scan to use only the smallest column family and column available to avoid reading big rows. Each RegionServer will read the locally stored data, and no network I/O will be required. In this case you may need to add a new column with a small value to your rows if you don't have one yet (a boolean is your best option).

Is a single partition in an Azure Storage Table a good design?

I have a system that needs to process ~150 jobs per day.
Additionally, I need to query past jobs efficiently, usually by time-range, but sometimes by other properties like job owner or resource used.
I know that running queries across table partitions can slow down my application, but what if I just put every row into one partition? If I use datetime.ticks as my rowkey and my query ranges are always small, will this scale well?
I tried putting data into separate partitions by time, but it seems like my queries get slower as more partitions are included in the query.
The partition is a scale unit. Each partition can receive up to 2000 tps before you start receiving throttling errors. As such, as long as you don't foresee exceeding that volume, you should be fine keeping a single partition.
However, as the size of the partition grows, so will query times. So you may want to factor that in as well.
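For the single-partition layout with datetime.ticks as the row key, a small sketch of how the keys and a range filter might be built. The partition name "jobs" and the zero-padding to 19 digits are assumptions made here so that string ordering of row keys matches time ordering; the helpers only build strings and do not call any storage SDK.

```python
from datetime import datetime, timedelta

# .NET ticks are 100-nanosecond intervals counted from 0001-01-01 00:00:00.
_TICKS_EPOCH = datetime(1, 1, 1)
_TICKS_PER_MICROSECOND = 10


def to_ticks(moment: datetime) -> int:
    """Convert a naive datetime to .NET-style ticks using integer arithmetic."""
    delta = moment - _TICKS_EPOCH
    return (delta // timedelta(microseconds=1)) * _TICKS_PER_MICROSECOND


def row_key(moment: datetime) -> str:
    """Zero-pad ticks to a fixed width so keys sort lexicographically by time."""
    return f"{to_ticks(moment):019d}"


def range_filter(partition: str, start: datetime, end: datetime) -> str:
    """Build a filter string for rows with start <= time < end in one partition."""
    return (
        f"PartitionKey eq '{partition}' "
        f"and RowKey ge '{row_key(start)}' "
        f"and RowKey lt '{row_key(end)}'"
    )


print(range_filter("jobs", datetime(2000, 1, 1), datetime(2000, 1, 2)))
```

A filter that pins both the partition key and a row key range lets the service seek into the partition instead of scanning all of it, which is what keeps small time-range queries cheap as the table grows.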
