Max. size of wide rows? - cassandra

Theoretically, Cassandra allows up to 2 billion columns in a wide row.
I have heard that in practice up to 50,000 columns / 50 MB per row is fine; 50,000-100,000 columns / 100 MB is OK but requires some tuning; and that one should never go above 100,000 columns / 100 MB per row, the reason being that this puts pressure on the heap.
Is there some truth to this?

In Cassandra, the maximum number of cells (rows x columns) in a single partition is 2 billion.
Additionally, a single column value may not be larger than 2GB, but in practice, "single digits of MB" is a more reasonable limit, since there is no streaming or random access of blob values.
Partitions greater than 100 MB can cause significant pressure on the heap.

One of our tables on Cassandra 1.2 went past the 100 MB per-row limit due to new write patterns we experienced, and we saw significant pressure on both compactions and our caches. By the way, we had rows of several hundred MB.
One approach is to redesign and migrate the data to one or more better-designed tables that keep your wide rows under that limit (see the sketch below). If that is not an option, then I suggest tuning your Cassandra configuration so that both the compaction and cache settings can deal with your wide rows effectively.
Some interesting links to things to tune:
Cassandra Performance Tuning
in_memory_compaction_limit_in_mb
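One common redesign is to add a synthetic bucket to the partition key so that each logical wide row is split across several smaller partitions. A minimal sketch using the DataStax Python driver; the table, column names, and bucket count are hypothetical placeholders, not taken from the question:

    # Minimal sketch: split a wide row across N buckets so no single partition
    # grows without bound. Table, columns, and BUCKETS are hypothetical.
    from datetime import datetime
    from cassandra.cluster import Cluster

    BUCKETS = 16  # number of sub-partitions per logical key

    cluster = Cluster(["127.0.0.1"])
    session = cluster.connect("my_keyspace")

    session.execute("""
        CREATE TABLE IF NOT EXISTS events_bucketed (
            sensor_id text,
            bucket    int,
            event_ts  timestamp,
            payload   blob,
            PRIMARY KEY ((sensor_id, bucket), event_ts)
        )
    """)

    def bucket_for(event_ts: datetime) -> int:
        # Derive the bucket from the day so each bucket is also time-bounded;
        # hashing part of the clustering key is another option.
        return event_ts.toordinal() % BUCKETS

    def insert_event(sensor_id: str, event_ts: datetime, payload: bytes) -> None:
        session.execute(
            "INSERT INTO events_bucketed (sensor_id, bucket, event_ts, payload) "
            "VALUES (%s, %s, %s, %s)",
            (sensor_id, bucket_for(event_ts), event_ts, payload),
        )

Reads then have to fan out over the buckets for a logical key (or compute the relevant buckets from a time range), which is the usual trade-off of this pattern.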

Related

Cassandra Data Read Speed Slows Down

I have a problem that I can't understand. I have 3 nodes (RF: 3) in my cluster, and the node hardware is pretty good. There are currently 60-70 million rows with 3,000 columns of data in my cluster, and I want to query a specific slice of roughly 265,000 rows and 4 columns. With the default fetch size I can read about 5,000 rows per second until I reach around 55,000 rows, after which my retrieval speed drops.
I think this can be fixed in the cassandra.yaml file; do you have any idea what I could check?
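Since the question mentions the default fetch size, the driver-side page size is worth checking before touching cassandra.yaml. A minimal, hedged sketch using the DataStax Python driver; the contact point, keyspace, table, and query are placeholders:

    # Minimal sketch: set an explicit fetch size instead of the driver default
    # (5000 rows per page in the Python driver). Host, keyspace, table, and
    # query below are placeholders.
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    cluster = Cluster(["10.0.0.1"])
    session = cluster.connect("my_keyspace")

    # Larger pages mean fewer round trips, but each page puts more load on the
    # coordinator, so the right value has to be measured rather than guessed.
    statement = SimpleStatement(
        "SELECT col_a, col_b, col_c, col_d FROM my_table WHERE pk = %s",
        fetch_size=20000,
    )

    rows_read = 0
    for row in session.execute(statement, ("some_key",)):
        rows_read += 1  # the driver fetches subsequent pages transparently
    print(rows_read)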

how to decide number of executors for 1 billion rows in spark

We have a table which has one billion three hundred and fifty-five million rows.
The table has 20 columns.
We want to join this table with another table which has more or less the same number of rows.
How do we decide the value for spark.conf.set("spark.sql.shuffle.partitions", ?)?
How do we decide the number of executors and their resource allocation details?
How do we find out how much storage those 1.355 billion rows will take in memory?
As #samkart says, you have to experiment to figure out the best parameters, since they depend on the size and nature of your data. The Spark tuning guide would be helpful.
Here are some things that you may want to tweak:
spark.executor.cores is 1 by default, but you should look to increase it to improve parallelism. A common rule of thumb is to set it to 5.
spark.files.maxPartitionBytes determines the amount of data per partition while reading files, and hence the initial number of partitions. You could tweak this depending on the data size; the default is 128 MB (the HDFS block size).
spark.sql.shuffle.partitions is 200 by default, but tweak it depending on the data size and the number of cores; see the sketch below. This blog would be helpful.
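For a very rough starting point only (all numbers here are illustrative assumptions, not recommendations): 1.355 billion rows with 20 mostly numeric columns is on the order of 1.355e9 × 20 × 8 bytes ≈ 200+ GB uncompressed, so aiming at roughly 128-200 MB per shuffle partition suggests somewhere in the low thousands of shuffle partitions. A minimal PySpark sketch under those assumptions; paths, column names, and all sizes are placeholders:

    # Minimal PySpark sketch for sizing a large join. All paths, columns, and
    # numbers are illustrative placeholders to be tuned against the Spark UI.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("large-join-sketch")
        # ~5 cores per executor is the rule of thumb mentioned above.
        .config("spark.executor.cores", "5")
        .config("spark.executor.memory", "20g")
        # ~200+ GB of shuffle data / ~150 MB per partition ≈ 1500 partitions.
        .config("spark.sql.shuffle.partitions", "1500")
        .getOrCreate()
    )

    left = spark.read.parquet("/data/table_a")    # ~1.355 billion rows, 20 columns
    right = spark.read.parquet("/data/table_b")   # roughly the same size

    joined = left.join(right, on="join_key", how="inner")
    joined.write.mode("overwrite").parquet("/data/joined_out")

The number of executors then follows from how much of the cluster you want busy at once: concurrent tasks ≈ executors × cores per executor, and you generally want at least as many shuffle partitions as concurrent tasks.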

Problem with high Maximum tombstones per slice?

I'm seeing the stats below for one of my tables running nodetool cfstats
Maximum tombstones per slice (last five minutes): 23571
Per the DataStax doc: "Maximum number of tombstones scanned by single key queries during the last five minutes."
All my other tables have low numbers like 1 or 2. Should I be worried? Should I try to reduce tombstone creation?
Tombstones can impact read performance if they reside in frequently read tables, so you should rework your data modelling. You can also lower the value of gc_grace_seconds so that tombstones are cleared sooner instead of waiting for the default of 10 days.
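Lowering gc_grace_seconds is a per-table setting. A minimal sketch using the DataStax Python driver; the keyspace, table, and chosen value are placeholders, and the value must stay longer than your repair interval so deleted data cannot resurrect:

    # Minimal sketch: lower gc_grace_seconds on a single table.
    # Keyspace/table and the value are placeholders; gc_grace_seconds must stay
    # longer than your repair interval, or deleted data can reappear.
    from cassandra.cluster import Cluster

    cluster = Cluster(["127.0.0.1"])
    session = cluster.connect()

    # 259200 s = 3 days, down from the default 864000 s (10 days).
    session.execute(
        "ALTER TABLE my_keyspace.my_table WITH gc_grace_seconds = 259200"
    )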

What is the best range of numbers to use for ID number relationships in PowerPivot: 1 to 500,000 or 1,000,000 to 1,500,000?

Using PowerPivot, I have a cost table with 300,000 different cost types and a calculation table with about 700,000 records/types. I change the product strings (which can be quite long) to integers in order to make them shorter and to get the RELATED formula to work faster.
With this many records and cost types, would it be better to have all the ID numbers be the same number of digits?
So, for example, should I start at 1,000,000 and go up to 1,500,000, or just go from 1 to 500,000?
Try saving files with 1-500,000 and 1,000,001-1,500,000 and compare their properties; the difference isn't worth it.
1 to 500,000 is the better option because it takes fewer bytes to store. Having the same length has no advantage whatsoever.
You will not notice a difference in allocated memory. If you save 1; 2; ... or 1000001; 1000002; ... or 1 abcdefgh; 2 abcdefgh; ... you will find that:
2.14 MB for both 1-64000 and 1000001-1064000 in xls format*
3.02 MB for 1 abcdefgh; 2 abcdefgh; ...
584 KB on disk (much smaller) for both 1-100000 and 1000001-1100000 in .ods format (you cannot save more). There is a small difference (596,069 vs 597,486 bytes), but it is negated by the 4 KB cluster size.
From a usability standpoint, go for 1,000,000 to 1,500,000: you are guaranteed the same number of digits, and otherwise it is easy to mix up 1234 and 11234. Strongly consider SQLite or a similar database, because 0.5 million rows is pushing the limits of the Excel format.
xls format can store a maximum of 65536 rows and 256 columns
1 and 1000000 take the same amount of space because the data is not compressed and space sufficient for an int (numbers up to about 4 billion) is allocated either way.
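The fixed-width point is easy to check outside of the spreadsheet as well. A small Python sketch, using NumPy arrays only as an illustration of fixed-width integer storage (it says nothing about PowerPivot's internal columnar format):

    # Illustration: with fixed-width integers, 1 and 1,000,000 occupy the same
    # space, so the magnitude of the ID range does not change storage size.
    # NumPy is a stand-in here, not a model of PowerPivot's engine.
    import numpy as np

    small_ids = np.arange(1, 500_001, dtype=np.int32)            # 1 .. 500,000
    large_ids = np.arange(1_000_001, 1_500_001, dtype=np.int32)  # 1,000,001 .. 1,500,000

    print(small_ids.nbytes, large_ids.nbytes)  # both 2000000 bytes (4 bytes per ID)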

Does Cassandra's limitation of 2 billion cells per partition includes the replicated rows (cells) from other partitions?

Cassandra allows up to 2 billion cells per partition. If I have a 2-node cluster with a replication factor of 2, does that mean the 2 billion cells include the rows redundantly saved from the other node?
No, the replication factor does not affect this limit. The limitation is not 2 billion/RF.
HTH, Cheers,
Carlo
