Cassandra Data Read Speed Slows Down

I have a problem that I can't understand. I have 3 nodes (RF: 3) in my cluster, and the node hardware is pretty good. There are currently 60-70 million rows and 3,000 columns of data in my cluster. I want to query a specific slice of roughly 265,000 rows and 4 columns, using the default fetch size. I can read about 5,000 rows per second up to around 55,000 rows; after that my retrieval speed drops.
I think this can be fixed through the cassandra.yaml file; do you have any idea what I should check?
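
For reference, here is a minimal sketch (not from the original post; the keyspace, table, and column names are made up) of raising the fetch size with the DataStax Python driver, in case client-side paging is part of the slowdown. Larger pages mean fewer round trips for a ~265,000-row read, at the cost of more memory per page on the client and coordinator.

from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["10.0.0.1", "10.0.0.2", "10.0.0.3"])   # assumed contact points
session = cluster.connect("my_keyspace")                  # hypothetical keyspace

# The driver's default fetch size is 5000 rows per page; a larger page reduces
# the number of round trips needed to pull the full result.
query = SimpleStatement(
    "SELECT col_a, col_b, col_c, col_d FROM my_table WHERE partition_key = %s",
    fetch_size=20000,
)

rows_read = 0
for row in session.execute(query, ["some-key"]):          # pages are fetched transparently
    rows_read += 1

print(rows_read)
cluster.shutdown()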

Related

How to decide the number of executors for 1 billion rows in Spark

We have a table with one billion, three hundred and fifty-five million rows.
The table has 20 columns.
We want to join this table with another table that has more or less the same number of rows.
How do we decide the value for spark.conf.set("spark.sql.shuffle.partitions", ?)?
How do we decide the number of executors and their resource allocation?
How do we find the amount of storage those one billion, three hundred and fifty-five million rows will take in memory?
As #samkart says, you have to experiment to figure out the best parameters, since they depend on the size and nature of your data. The Spark tuning guide would be helpful.
Here are some things that you may want to tweak:
spark.executor.cores is 1 by default but you should look to increase this to improve parallelism. A rule of thumb is to set this to 5.
spark.files.maxPartitionBytes determines the amount of data per partition while reading, and hence determines the initial number of partitions. You could tweak this depending on the data size. Default is 128 MB blocks in HDFS.
spark.sql.shuffle.partitions is 200 by default but tweak it depending on the data size and number of cores. This blog would be helpful.
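
A short sketch of where these knobs are set when building a PySpark session; the values are illustrative only and the app name is made up.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("large-join")                                   # hypothetical app name
    .config("spark.executor.cores", "5")                     # rule-of-thumb value from the answer
    .config("spark.executor.memory", "20g")                  # illustrative, size to your cluster
    .config("spark.files.maxPartitionBytes", str(128 * 1024 * 1024))      # 128 MB read splits
    .config("spark.sql.files.maxPartitionBytes", str(128 * 1024 * 1024))  # same idea for DataFrame reads
    .config("spark.sql.shuffle.partitions", "2000")          # tune to data size and total cores
    .getOrCreate()
)

# Shuffle partitions can also be adjusted at runtime before the join:
spark.conf.set("spark.sql.shuffle.partitions", "4000")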

Interpolating a huge set of data

So I have a very large set of data (4 million+ rows) with journey times between pairs of location nodes for two separate years (2015 and 2024). These are stored in .dat files in a format of:
Node A | Node B | Journey Time (s)
123    | 124    | 51.4
So I have one long file of over 4 million rows for each year. I need to interpolate journey times for a year between the two for which I have data. I've tried Power Query in Excel as well as Power BI Desktop, but have found no reasonable solution beyond cutting the files into pieces of under 1 million rows so that Excel can cope.
Any ideas?
What type of output are you looking for? Power BI can easily handle this amount of data, but it depends on what you expect your result to be. If you're looking for the average % change in node-to-node travel time between the two years, then Power BI could be utilised, as it is great at aggregating and comparing large datasets.
However, if you want an output of every single node-to-node delta between those two years, i.e. a 4M-row output, then Power BI will calculate this, but then what do you do with it... a 4M-row table?
If you're looking to export a result of >150K rows (the Power BI limit) or >1M rows (the Excel limit), then I would use Python for that (as mentioned above)
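
As a rough illustration of the Python route, here is a pandas sketch that linearly interpolates between the two years; the file names and the whitespace-delimited column layout are assumed. At 4 million rows per year this comfortably fits in memory, and the interpolation itself is a single vectorised expression.

import pandas as pd

cols = ["node_a", "node_b", "time_s"]
t2015 = pd.read_csv("journeys_2015.dat", sep=r"\s+", names=cols)  # hypothetical file names
t2024 = pd.read_csv("journeys_2024.dat", sep=r"\s+", names=cols)

# Pair up the two years on the node pair.
merged = t2015.merge(t2024, on=["node_a", "node_b"], suffixes=("_2015", "_2024"))

# Linear interpolation for an in-between year (2019 used here as an example).
target_year = 2019
frac = (target_year - 2015) / (2024 - 2015)
merged["time_s_interp"] = merged["time_s_2015"] + frac * (merged["time_s_2024"] - merged["time_s_2015"])

merged[["node_a", "node_b", "time_s_interp"]].to_csv("journeys_2019.csv", index=False)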

Load Large Amount of Rows in Cassandra / Datastax

I have a table that has about 3 million rows under one partition key. I need to load all of that data and save it to a file.
It is data recorded from one day of sensor input, and what I want to build is a playback service to replay the sensor events for at least the past 3 months, so I am expecting a really big volume of data.
I am new to NoSQL databases; is there any recommended approach to achieve this goal?
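
One common approach is to stream the partition through the driver's automatic paging and write rows out as you go, so the full result never sits in memory. A hedged sketch with made-up keyspace, table, and column names:

import csv
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])                  # assumed contact point
session = cluster.connect("sensors")              # hypothetical keyspace

stmt = SimpleStatement(
    "SELECT event_time, sensor_id, value FROM readings WHERE day = %s",
    fetch_size=10000,                             # rows pulled from the cluster per page
)

with open("readings_2024-01-01.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["event_time", "sensor_id", "value"])
    # The driver pages through the partition lazily, so the ~3 million rows
    # are never held in memory all at once.
    for row in session.execute(stmt, ["2024-01-01"]):
        writer.writerow([row.event_time, row.sensor_id, row.value])

cluster.shutdown()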

Pyspark job being stuck at the final task

The flow of my program is something like this:
1. Read 4 billion rows (~700 GB) of data from a Parquet file into a dataframe, using 2296 partitions.
2. Clean it and filter out 2.5 billion rows
3. Transform the remaining 1.5 billion rows using a pipeline model and then a trained model. The model is a logistic regression that predicts 0 or 1, and 30% of the data is filtered out of the transformed dataframe.
4. The above dataframe is left-outer joined with another dataset of ~1 TB (also read from a Parquet file), using 4000 partitions.
5. Join it with another dataset of around 100 MB like
joined_data = data1.join(broadcast(small_dataset_100MB), data1.field == small_dataset_100MB.field, "left_outer")
6. The above dataframe is then exploded by a factor of ~2000: exploded_data = joined_data.withColumn('field', explode('field_list'))
7. An aggregate is performed: aggregate = exploded_data.groupBy(*cols_to_select).agg(F.countDistinct(exploded_data.field1).alias('distincts'), F.count("*").alias('count_all')). There are a total of 10 columns in the cols_to_select list.
8. And finally an action, aggregate.count() is performed.
The problem is that the third-to-last count stage (200 tasks) gets stuck at task 199 forever. In spite of allocating 4 cores and 56 executors, the count uses only one core and one executor to run the job. I tried breaking the input down from 4 billion rows to 700 million rows (a 1/6th part), and it took four hours. I would really appreciate some help in speeding this process up. Thanks
The operation was getting stuck at the final task because skewed data was being joined to a huge dataset. The key joining the two dataframes was heavily skewed. The problem was solved for now by removing the skewed data from the dataframe. If you must include the skewed data, you can use iterative broadcast joins (https://github.com/godatadriven/iterative-broadcast-join). See this informative video for more details: https://www.youtube.com/watch?v=6zg7NTw-kTQ
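
If dropping the skewed keys is not acceptable and an iterative broadcast join is too involved, another commonly used workaround (not what the answer above did) is to salt the skewed join key. A rough PySpark sketch with made-up paths and column names:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

SALT_BUCKETS = 32                                        # number of ways to split each hot key

big = spark.read.parquet("s3://bucket/big_table")        # hypothetical paths
small = spark.read.parquet("s3://bucket/other_table")

# Each row of the large, skewed side gets a random salt in [0, SALT_BUCKETS).
big_salted = big.withColumn("salt", (F.rand() * SALT_BUCKETS).cast("int"))

# The smaller side is replicated once per salt value so every salted key finds a match.
small_salted = small.withColumn(
    "salt", F.explode(F.array([F.lit(i) for i in range(SALT_BUCKETS)]))
)

# Join on the original key plus the salt; each hot key now spreads over
# SALT_BUCKETS shuffle partitions instead of landing on a single task.
joined = big_salted.join(
    small_salted,
    on=[big_salted.field == small_salted.field, big_salted.salt == small_salted.salt],
    how="left_outer",
)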

Max. size of wide rows?

Theoretically, Cassandra allows up to 2 billion columns in a wide row.
I have heard that in reality up to 50,000 columns/50 MB are fine; 50,000-100,000 columns/100 MB are OK but require some tuning; and that one should never go above 100,000 columns/100 MB per row. The reason given is that this puts pressure on the heap.
Is there some truth to this?
In Cassandra, the maximum number of cells (rows x columns) in a single partition is 2 billion.
Additionally, a single column value may not be larger than 2GB, but in practice, "single digits of MB" is a more reasonable limit, since there is no streaming or random access of blob values.
Partitions greater than 100 MB can cause significant pressure on the heap.
One of our tables on Cassandra 1.2 went past the 100 MB-per-row limit due to new write patterns we experienced, and we have seen significant pressure on both compactions and our caches. Btw, we had rows with several hundred MB.
One approach is to just redesign and migrate the table to better-designed table(s) that will keep your wide rows under that limit (see the sketch after the links below). If that is not an option, then I suggest tuning your Cassandra so that both compaction and cache configs can deal with your wide rows effectively.
Some interesting links to things to tune:
Cassandra Performance Tuning
in_memory_compaction_limit_in_mb
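
To illustrate the redesign option mentioned above, here is a hedged sketch (all keyspace, table, and column names are hypothetical) of splitting a wide row by adding a time bucket to the partition key, so each partition stays well under the ~100 MB guideline:

from datetime import datetime, timezone
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])        # assumed contact point
session = cluster.connect("sensors")    # hypothetical keyspace

# One partition per (sensor, hour) instead of one ever-growing partition per sensor.
session.execute("""
    CREATE TABLE IF NOT EXISTS readings_by_hour (
        sensor_id   text,
        hour_bucket text,               -- e.g. '2024-01-01T13'
        event_time  timestamp,
        value       double,
        PRIMARY KEY ((sensor_id, hour_bucket), event_time)
    )
""")

# Writers derive the bucket from the event timestamp, so rows for one sensor
# are spread across many bounded partitions.
event_time = datetime.now(timezone.utc)
bucket = event_time.strftime("%Y-%m-%dT%H")
session.execute(
    "INSERT INTO readings_by_hour (sensor_id, hour_bucket, event_time, value) "
    "VALUES (%s, %s, %s, %s)",
    ["sensor-42", bucket, event_time, 21.5],
)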
