Cassandra Fast Read Configuration

I have 4 Cassandra nodes with 1 seed in a single data center, and about 5M records. Cassandra takes around 4 minutes to read them, whereas MySQL takes only 17 seconds, so my guess is that something is wrong in my configuration. Could anyone tell me which configuration attributes I should check in cassandra.yaml?

You may be doing an apples to oranges comparison if you are reading all 5M records from one client.
With MySQL all the data is local and optimized for reads since data is updated in place.
Cassandra is distributed and optimized for writes. Writes are simple appends, but reads are expensive since all the appends need to be read and merged to get the current value of each column.
Since the data is distributed across multiple nodes, there is a lot of overhead of accessing and retrieving the data over the network.
If you were using Spark with Cassandra and loading the data into Spark workers in parallel without shuffling it across the network to a single client, then it would be a more similar comparison.
Cassandra is generally good at ingesting large amounts of data and then working on small slices of it (i.e. partitions) rather than doing table scan operations such as reading the entire table.
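To make the contrast concrete, here is a minimal sketch using the DataStax Java driver against a hypothetical events table (the keyspace, table, and partition key are illustrative, not from the question): a query restricted to a single partition is the access pattern Cassandra is built for, while an unrestricted scan is essentially the 5M-row read described above.

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.ResultSet;
import com.datastax.oss.driver.api.core.cql.Row;

public class PartitionVsScan {
    public static void main(String[] args) {
        // Hypothetical schema: events(user_id text, event_time timestamp, payload text,
        // PRIMARY KEY (user_id, event_time)) in keyspace my_ks.
        try (CqlSession session = CqlSession.builder().withKeyspace("my_ks").build()) {
            // Fast path: the query names the partition key, so only the replicas
            // owning that one partition are contacted.
            ResultSet slice = session.execute(
                    "SELECT * FROM events WHERE user_id = ?", "user-42");
            for (Row row : slice) {
                System.out.println(row.getFormattedContents());
            }

            // Slow path: no partition key restriction means every node and every
            // SSTable is touched -- this is the full-scan pattern to avoid.
            // session.execute("SELECT * FROM events");
        }
    }
}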

Related

How to determine the number of executors to read a delta table?

I have a delta table which is partitioned by multiple keys, one of which is a date truncated to the hour (no minute detail), for example: Fri, 15 Jul 2022 07.
With data continuously being ingested via both batch and streaming workflows, what would be the best strategy to evaluate the number of executors needed to read all the data from the delta table?
A very naive way would be to just let Spark autoscale, but we may still need to tune shuffle partitions etc. Looking for hints or best practices around this. Thanks!
If you want to "read all the data from delta table" it does not really matter whether this table is partitioned or not since the query reads all the data and hence loads the whole table.
This is the worst possible query - the dreaded full scan. If it's inevitable, just know that this is the kind of query where Spark SQL shines brightest, utilising the full power of a Spark cluster. You've been warned :)
Executors are simply machines with CPU cores and memory. You're probably more interested in the number of CPU cores for all the tasks to load the delta table.
I'd start this calculation with the number of files for a given version of the delta table. Files are of different sizes and (I might be wrong here) they are usually chunked (I don't want to use the overloaded term partitioned here, but that's what springs to mind) into 512MB splits.
The number of splits (512MB blocks) for all the files of a given version of the delta table would be the number of tasks. That would give you the number of CPU cores and hence their "containers", i.e. Spark executors (to evenly saturate available physical resources for the best performance).
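As a rough illustration of that calculation, here is a sketch (not from the answer) that walks the data files of a delta table on a Hadoop-compatible filesystem and estimates the task count from a chosen split size; the path, the 512 MB figure, and the 5 cores per executor are assumptions to adjust for your setup (Spark's own default spark.sql.files.maxPartitionBytes is 128 MB).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class DeltaSplitEstimate {
    public static void main(String[] args) throws Exception {
        String tablePath = "/data/my_delta_table";        // hypothetical table location
        long splitBytes = 512L * 1024 * 1024;             // assumed chunk size from the answer above

        FileSystem fs = FileSystem.get(new Configuration());
        long splits = 0;
        // Walk the parquet data files; a precise count would only include files
        // referenced by the latest _delta_log snapshot, so this is an upper bound.
        RemoteIterator<LocatedFileStatus> it = fs.listFiles(new Path(tablePath), true);
        while (it.hasNext()) {
            LocatedFileStatus f = it.next();
            if (f.isFile() && f.getPath().getName().endsWith(".parquet")) {
                splits += Math.max(1, (f.getLen() + splitBytes - 1) / splitBytes);
            }
        }
        System.out.println("Estimated scan tasks (splits): " + splits);

        int coresPerExecutor = 5;                         // assumed executor sizing
        System.out.println("Executors for one task wave: "
                + (splits + coresPerExecutor - 1) / coresPerExecutor);
    }
}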

Glue Spark write data one partition at a time

I need help understanding how this works: I have 2 TB of data which I am writing with Glue Spark, partitioned on a certain date column. I am using G.2X with 40 worker nodes.
These are a few observations:
The job writes one partition at a time, i.e. only the data for one day is loaded. (Shouldn't it write multiple partitions in parallel?)
It creates very small files within partitions.
For the above reasons, writing the data is very slow. Are there any settings that can be changed to improve this?
To avoid creating very small files, you can use coalesce(k) where k is the number of partitions that you want to have, probably 40.
More about coalesce
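A minimal sketch of what that looks like with the Spark Java API (the paths, the event_date column, and the value 40 are placeholders; in a Glue job you would normally obtain the session from the GlueContext rather than building it directly):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class CoalescedPartitionedWrite {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("coalesced-write").getOrCreate();

        Dataset<Row> events = spark.read().parquet("s3://my-bucket/raw/events/"); // hypothetical input

        events.coalesce(40)                  // cap the number of output tasks -> fewer, larger files
              .write()
              .partitionBy("event_date")     // still one directory per day
              .mode(SaveMode.Append)
              .parquet("s3://my-bucket/curated/events/");

        spark.stop();
    }
}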

Concept for temporary data in Apache Cassandra

I have a question regarding the usage of Cassandra for temporary data (Data which is written once to the database, which is read once from the database and then deleted).
We are using Cassandra to exchange data between processes which are running on different machines / different containers. Process 1 writes some data to Cassandra, and Process 2 reads this data. After that, the data can be deleted.
Since we learned that Cassandra doesn't cope well with frequent writes and deletes in one table (because of tombstones and the resulting performance issues), we create temporary tables for this instead.
Process1 : Create table, write data to table.
Process2 : Read data from table, drop table.
But doing this at a very high rate (500-1000 table creates and drops per hour), we are facing problems with schema synchronization between our nodes (we have a cluster with 6 nodes).
The Cassandra cluster got very slow, we got a lot of timeout warnings, we got errors about different schemas on the nodes, the CPU load on the cluster nodes grew to 100%, and then the cluster was dead :-).
Is Cassandra the right database for this use case?
Is it a problem with how we configured our cluster?
Would it be a better solution to create temporary keyspaces for this?
Does anyone have experience handling such a use case with Cassandra?
You don't need any database here. Your use case is to enable your applications to handshake with each other to share data asynchronously. There are two possible solutions:
1) For batch-based writes and reads, consider something like HDFS for intermediate storage. Process 1 writes data files into HDFS directories and Process 2 reads them from HDFS.
2) For a message-based system, consider something like Kafka. Process 1 processes the data stream and writes into Kafka topics, and Process 2's consumers read the data from those topics. Kafka also provides ack/nack features (a rough sketch of this handoff follows below).
Continuously creating and dropping a large number of tables in Cassandra is not a good practice and is never recommended.
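Here is that Kafka sketch with the standard Java client; the broker address, topic name, and group id are placeholders, and the consumer's committed offset takes the place of the create-table/drop-table dance:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class HandoffExample {
    // Process 1: publish each piece of data once.
    static void produce(String payload) {
        Properties p = new Properties();
        p.put("bootstrap.servers", "kafka:9092");          // hypothetical broker address
        p.put("key.serializer", StringSerializer.class.getName());
        p.put("value.serializer", StringSerializer.class.getName());
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(p)) {
            producer.send(new ProducerRecord<>("handoff-topic", payload));
        }
    }

    // Process 2: read it once; the committed offset marks it consumed, no delete needed.
    static void consume() {
        Properties p = new Properties();
        p.put("bootstrap.servers", "kafka:9092");
        p.put("group.id", "process2");
        p.put("key.deserializer", StringDeserializer.class.getName());
        p.put("value.deserializer", StringDeserializer.class.getName());
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(p)) {
            consumer.subscribe(Collections.singletonList("handoff-topic"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
            for (ConsumerRecord<String, String> r : records) {
                System.out.println("received: " + r.value());
            }
            consumer.commitSync();  // acknowledge the data instead of deleting it
        }
    }
}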

Mysql or Spark Processing of 400gb data

If I use Spark in my case, will it be useful, given the blocks and cores available?
I have 400 GB of data in a single table, user_events, with multiple columns in MySQL. This table stores all user events from the application, and indexes exist on the required columns. I have a user interface where users can try different permutations and combinations of fields in user_events.
Currently I am facing performance issues where a query either takes 15-20 seconds, or even longer, or times out.
I have gone through a couple of Spark tutorials but I am not sure if it can help here. Per my understanding of Spark:
1) First, Spark has to bring all the data into memory. Bringing 100M records over the network will be a costly operation, and I will need a lot of memory for it. Isn't that so?
2) Once the data is in memory, Spark can distribute it among partitions based on the cores and the input data size, and then filter the data on each partition in parallel. Here Spark can be beneficial, as it does the operation in parallel while MySQL will be sequential. Is that correct?
Is my understanding correct?
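For reference on point 2), this is roughly how a parallel read from MySQL is expressed with Spark's JDBC source; the connection URL, credentials, the numeric id column used for range-splitting, and its bounds are hypothetical, not taken from the question:

import java.util.Properties;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class UserEventsJdbcScan {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("user-events").getOrCreate();

        Properties props = new Properties();
        props.put("user", "app_user");        // hypothetical credentials
        props.put("password", "secret");

        // Spark issues 64 range queries over the id column instead of one sequential
        // scan, so the fetch itself runs in parallel across executors.
        Dataset<Row> events = spark.read().jdbc(
                "jdbc:mysql://mysql-host:3306/app",   // hypothetical connection URL
                "user_events",
                "id",                                  // hypothetical numeric, indexed column
                1L, 100_000_000L,                      // assumed id range
                64,                                    // number of partitions / parallel queries
                props);

        // Filters and aggregations then run partition-by-partition in parallel.
        events.filter("event_type = 'click'").groupBy("event_date").count().show();
    }
}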

spark datastax cassandra connector slow to read from heavy cassandra table

I am new to Spark and the Spark Cassandra Connector. We are trying Spark for the first time on our team, and we are using the Spark Cassandra Connector to connect to our Cassandra database.
I wrote a query that uses a heavy table of the database, and I saw that the Spark tasks didn't start until the query had fetched all the records from the table.
It is taking more than 3 hours just to fetch all the records from the database.
To get the data from the DB we use:
CassandraJavaUtil.javaFunctions(sparkContextManager.getJavaSparkContext(SOURCE).sc())
.cassandraTable(keyspaceName, tableName);
Is there a way to tell Spark to start working even before all the data has finished downloading?
Is there an option to tell the spark-cassandra-connector to use more threads for the fetch?
thanks,
kokou.
If you look at the Spark UI, how many partitions is your table scan creating? I just did something like this and found that Spark was creating too many partitions for the scan, and it was taking much longer as a result. The way I decreased the time on my job was by setting the configuration parameter spark.cassandra.input.split.size_in_mb to a value higher than the default. In my case it took a 20-minute job down to about four minutes. There are also a couple more Cassandra-read-specific Spark variables that you can set, found here.
These stackoverflow questions are what I referenced originally, I hope they help you out as well.
Iterate large Cassandra table in small chunks
Set number of tasks on Cassandra table scan
EDIT:
After doing some performance testing with regard to fiddling with some Spark configuration parameters, I found that Spark was creating far too many table partitions when I wasn't giving the Spark executors enough memory. In my case, upping the memory by a gigabyte was enough to render the input split size parameter unnecessary. If you can't give the executors more memory, you may still need to set spark.cassandra.input.split.size_in_mb higher as a workaround.
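For completeness, a sketch of setting that parameter on the Java side; the contact point, keyspace, table, and the 256 MB value are placeholders (in many connector versions the default split size is 64 MB):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import com.datastax.spark.connector.japi.CassandraJavaUtil;

public class SplitSizeExample {
    public static void main(String[] args) {
        // Raising the split size means fewer, larger Spark partitions per table scan.
        SparkConf conf = new SparkConf()
                .setAppName("cassandra-scan")
                .set("spark.cassandra.connection.host", "10.0.0.1")           // hypothetical contact point
                .set("spark.cassandra.input.split.size_in_mb", "256");        // illustrative value

        JavaSparkContext sc = new JavaSparkContext(conf);
        long rows = CassandraJavaUtil.javaFunctions(sc)
                .cassandraTable("my_keyspace", "my_table")                    // hypothetical keyspace/table
                .count();
        System.out.println("rows: " + rows);
        sc.stop();
    }
}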
