Cassandra table needs a lot of storage space

I just stored a 3.1 GB CSV file via the Spark-Cassandra-Connector to a table in a Cassandra cluster (5 nodes, 30 GB each, 7.5 GB RAM per instance, of which Cassandra uses ~1.8 GB).
I just saw in DataStax OpsCenter that my cluster holds 16 GB of data (~3.x GB per node) and that my storage usage has grown from 14 GB (before) to 64 GB (after the write).
My keyspace has the following settings:
replica_placement_strategy org.apache.cassandra.locator.SimpleStrategy
replication_factor 2
CREATE TABLE debs.energydata10m (
id int PRIMARY KEY,
house_id int,
household_id int,
plug_id int,
ts timestamp,
type int,
val float
) WITH
bloom_filter_fp_chance=0.010000 AND
caching='{"keys":"ALL", "rows_per_partition":"NONE"}' AND
comment='' AND
dclocal_read_repair_chance=0.100000 AND
gc_grace_seconds=864000 AND
read_repair_chance=0.000000 AND
compaction={'class': 'SizeTieredCompactionStrategy'} AND
compression={'sstable_compression': 'LZ4Compressor'};
Why does Cassandra need that much storage for this 3.1 GB CSV?
Edit: Here is the output of the ls -lR /var/lib/cassandra/data/debs/ command:
ubuntu@ip-xx-xx-xx-xx:~$ ls -lR /var/lib/cassandra/data/debs/
/var/lib/cassandra/data/debs/:
total 24
drwxr-xr-x 2 cassandra cassandra 6 Jun 16 12:43 energydata1000m-52502e00142511e5b5ddabd6d8b6d1d3
drwxr-xr-x 2 cassandra cassandra 16384 Jun 17 13:39 energydata100m-4cb23100142511e5b5ddabd6d8b6d1d3
drwxr-xr-x 2 cassandra cassandra 6 Jun 17 08:41 energydata10m-46487f90142511e5b5ddabd6d8b6d1d3
drwxr-xr-x 2 cassandra cassandra 4096 Jun 17 10:58 energydata10m-f17f204014d811e5b5ddabd6d8b6d1d3
drwxr-xr-x 3 cassandra cassandra 22 Jun 17 10:07 energydata10m-fa83059014cd11e5b5ddabd6d8b6d1d3
drwxr-xr-x 2 cassandra cassandra 6 Jun 16 12:40 energydata-d615ace0141d11e5b5ddabd6d8b6d1d3
/var/lib/cassandra/data/debs/energydata1000m-52502e00142511e5b5ddabd6d8b6d1d3:
total 0
/var/lib/cassandra/data/debs/energydata100m-4cb23100142511e5b5ddabd6d8b6d1d3:
total 3294336
-rw-r--r-- 1 cassandra cassandra 361779 Jun 17 12:36 debs-energydata100m-ka-187-CompressionInfo.db
-rw-r--r-- 1 cassandra cassandra 943405306 Jun 17 12:36 debs-energydata100m-ka-187-Data.db
-rw-r--r-- 1 cassandra cassandra 10 Jun 17 12:36 debs-energydata100m-ka-187-Digest.sha1
-rw-r--r-- 1 cassandra cassandra 17615016 Jun 17 12:36 debs-energydata100m-ka-187-Filter.db
-rw-r--r-- 1 cassandra cassandra 254001924 Jun 17 12:36 debs-energydata100m-ka-187-Index.db
-rw-r--r-- 1 cassandra cassandra 9911 Jun 17 12:36 debs-energydata100m-ka-187-Statistics.db
-rw-r--r-- 1 cassandra cassandra 1763968 Jun 17 12:36 debs-energydata100m-ka-187-Summary.db
-rw-r--r-- 1 cassandra cassandra 91 Jun 17 12:36 debs-energydata100m-ka-187-TOC.txt
-rw-r--r-- 1 cassandra cassandra 46747 Jun 17 12:25 debs-energydata100m-ka-211-CompressionInfo.db
-rw-r--r-- 1 cassandra cassandra 120719760 Jun 17 12:25 debs-energydata100m-ka-211-Data.db
-rw-r--r-- 1 cassandra cassandra 10 Jun 17 12:25 debs-energydata100m-ka-211-Digest.sha1
-rw-r--r-- 1 cassandra cassandra 2266552 Jun 17 12:25 debs-energydata100m-ka-211-Filter.db
-rw-r--r-- 1 cassandra cassandra 32799168 Jun 17 12:25 debs-energydata100m-ka-211-Index.db
-rw-r--r-- 1 cassandra cassandra 9955 Jun 17 12:25 debs-energydata100m-ka-211-Statistics.db
-rw-r--r-- 1 cassandra cassandra 227840 Jun 17 12:25 debs-energydata100m-ka-211-Summary.db
-rw-r--r-- 1 cassandra cassandra 91 Jun 17 12:25 debs-energydata100m-ka-211-TOC.txt
-rw-r--r-- 1 cassandra cassandra 400275 Jun 17 13:39 debs-energydata100m-ka-353-CompressionInfo.db
-rw-r--r-- 1 cassandra cassandra 1053658168 Jun 17 13:39 debs-energydata100m-ka-353-Data.db
-rw-r--r-- 1 cassandra cassandra 9 Jun 17 13:39 debs-energydata100m-ka-353-Digest.sha1
-rw-r--r-- 1 cassandra cassandra 19254504 Jun 17 13:39 debs-energydata100m-ka-353-Filter.db
-rw-r--r-- 1 cassandra cassandra 281034756 Jun 17 13:39 debs-energydata100m-ka-353-Index.db
-rw-r--r-- 1 cassandra cassandra 9911 Jun 17 13:39 debs-energydata100m-ka-353-Statistics.db
-rw-r--r-- 1 cassandra cassandra 1951696 Jun 17 13:39 debs-energydata100m-ka-353-Summary.db
-rw-r--r-- 1 cassandra cassandra 91 Jun 17 13:39 debs-energydata100m-ka-353-TOC.txt
-rw-r--r-- 1 cassandra cassandra 106147 Jun 17 13:32 debs-energydata100m-ka-377-CompressionInfo.db
-rw-r--r-- 1 cassandra cassandra 275239666 Jun 17 13:32 debs-energydata100m-ka-377-Data.db
-rw-r--r-- 1 cassandra cassandra 10 Jun 17 13:32 debs-energydata100m-ka-377-Digest.sha1
-rw-r--r-- 1 cassandra cassandra 5209632 Jun 17 13:32 debs-energydata100m-ka-377-Filter.db
-rw-r--r-- 1 cassandra cassandra 74503386 Jun 17 13:32 debs-energydata100m-ka-377-Index.db
-rw-r--r-- 1 cassandra cassandra 9935 Jun 17 13:32 debs-energydata100m-ka-377-Statistics.db
-rw-r--r-- 1 cassandra cassandra 517456 Jun 17 13:32 debs-energydata100m-ka-377-Summary.db
-rw-r--r-- 1 cassandra cassandra 91 Jun 17 13:32 debs-energydata100m-ka-377-TOC.txt
-rw-r--r-- 1 cassandra cassandra 63267 Jun 17 13:36 debs-energydata100m-ka-392-CompressionInfo.db
-rw-r--r-- 1 cassandra cassandra 163610575 Jun 17 13:36 debs-energydata100m-ka-392-Data.db
-rw-r--r-- 1 cassandra cassandra 10 Jun 17 13:36 debs-energydata100m-ka-392-Digest.sha1
-rw-r--r-- 1 cassandra cassandra 3146928 Jun 17 13:36 debs-energydata100m-ka-392-Filter.db
-rw-r--r-- 1 cassandra cassandra 44398512 Jun 17 13:36 debs-energydata100m-ka-392-Index.db
-rw-r--r-- 1 cassandra cassandra 9971 Jun 17 13:36 debs-energydata100m-ka-392-Statistics.db
-rw-r--r-- 1 cassandra cassandra 308400 Jun 17 13:36 debs-energydata100m-ka-392-Summary.db
-rw-r--r-- 1 cassandra cassandra 91 Jun 17 13:36 debs-energydata100m-ka-392-TOC.txt
-rw-r--r-- 1 cassandra cassandra 16475 Jun 17 13:37 debs-energydata100m-ka-398-CompressionInfo.db
-rw-r--r-- 1 cassandra cassandra 42447012 Jun 17 13:37 debs-energydata100m-ka-398-Data.db
-rw-r--r-- 1 cassandra cassandra 10 Jun 17 13:37 debs-energydata100m-ka-398-Digest.sha1
-rw-r--r-- 1 cassandra cassandra 819112 Jun 17 13:37 debs-energydata100m-ka-398-Filter.db
-rw-r--r-- 1 cassandra cassandra 11540160 Jun 17 13:37 debs-energydata100m-ka-398-Index.db
-rw-r--r-- 1 cassandra cassandra 9915 Jun 17 13:37 debs-energydata100m-ka-398-Statistics.db
-rw-r--r-- 1 cassandra cassandra 80208 Jun 17 13:37 debs-energydata100m-ka-398-Summary.db
-rw-r--r-- 1 cassandra cassandra 91 Jun 17 13:37 debs-energydata100m-ka-398-TOC.txt
-rw-r--r-- 1 cassandra cassandra 3307 Jun 17 13:37 debs-energydata100m-ka-399-CompressionInfo.db
-rw-r--r-- 1 cassandra cassandra 8375321 Jun 17 13:37 debs-energydata100m-ka-399-Data.db
-rw-r--r-- 1 cassandra cassandra 10 Jun 17 13:37 debs-energydata100m-ka-399-Digest.sha1
-rw-r--r-- 1 cassandra cassandra 159248 Jun 17 13:37 debs-energydata100m-ka-399-Filter.db
-rw-r--r-- 1 cassandra cassandra 2292966 Jun 17 13:37 debs-energydata100m-ka-399-Index.db
-rw-r--r-- 1 cassandra cassandra 9895 Jun 17 13:37 debs-energydata100m-ka-399-Statistics.db
-rw-r--r-- 1 cassandra cassandra 16000 Jun 17 13:37 debs-energydata100m-ka-399-Summary.db
-rw-r--r-- 1 cassandra cassandra 91 Jun 17 13:37 debs-energydata100m-ka-399-TOC.txt
-rw-r--r-- 1 cassandra cassandra 3299 Jun 17 13:39 debs-energydata100m-ka-400-CompressionInfo.db
-rw-r--r-- 1 cassandra cassandra 8332947 Jun 17 13:39 debs-energydata100m-ka-400-Data.db
-rw-r--r-- 1 cassandra cassandra 10 Jun 17 13:39 debs-energydata100m-ka-400-Digest.sha1
-rw-r--r-- 1 cassandra cassandra 159088 Jun 17 13:39 debs-energydata100m-ka-400-Filter.db
-rw-r--r-- 1 cassandra cassandra 2290716 Jun 17 13:39 debs-energydata100m-ka-400-Index.db
-rw-r--r-- 1 cassandra cassandra 9895 Jun 17 13:39 debs-energydata100m-ka-400-Statistics.db
-rw-r--r-- 1 cassandra cassandra 15984 Jun 17 13:39 debs-energydata100m-ka-400-Summary.db
-rw-r--r-- 1 cassandra cassandra 91 Jun 17 13:39 debs-energydata100m-ka-400-TOC.txt
/var/lib/cassandra/data/debs/energydata10m-46487f90142511e5b5ddabd6d8b6d1d3:
total 0
/var/lib/cassandra/data/debs/energydata10m-f17f204014d811e5b5ddabd6d8b6d1d3:
total 326684
-rw-r--r-- 1 cassandra cassandra 95051 Jun 17 10:30 debs-energydata10m-ka-37-CompressionInfo.db
-rw-r--r-- 1 cassandra cassandra 245687780 Jun 17 10:30 debs-energydata10m-ka-37-Data.db
-rw-r--r-- 1 cassandra cassandra 10 Jun 17 10:30 debs-energydata10m-ka-37-Digest.sha1
-rw-r--r-- 1 cassandra cassandra 4617168 Jun 17 10:30 debs-energydata10m-ka-37-Filter.db
-rw-r--r-- 1 cassandra cassandra 66716856 Jun 17 10:30 debs-energydata10m-ka-37-Index.db
-rw-r--r-- 1 cassandra cassandra 9923 Jun 17 10:30 debs-energydata10m-ka-37-Statistics.db
-rw-r--r-- 1 cassandra cassandra 463376 Jun 17 10:30 debs-energydata10m-ka-37-Summary.db
-rw-r--r-- 1 cassandra cassandra 91 Jun 17 10:30 debs-energydata10m-ka-37-TOC.txt
-rw-r--r-- 1 cassandra cassandra 3379 Jun 17 10:28 debs-energydata10m-ka-38-CompressionInfo.db
-rw-r--r-- 1 cassandra cassandra 8505046 Jun 17 10:28 debs-energydata10m-ka-38-Data.db
-rw-r--r-- 1 cassandra cassandra 9 Jun 17 10:28 debs-energydata10m-ka-38-Digest.sha1
-rw-r--r-- 1 cassandra cassandra 162984 Jun 17 10:28 debs-energydata10m-ka-38-Filter.db
-rw-r--r-- 1 cassandra cassandra 2346732 Jun 17 10:28 debs-energydata10m-ka-38-Index.db
-rw-r--r-- 1 cassandra cassandra 9895 Jun 17 10:28 debs-energydata10m-ka-38-Statistics.db
-rw-r--r-- 1 cassandra cassandra 16368 Jun 17 10:28 debs-energydata10m-ka-38-Summary.db
-rw-r--r-- 1 cassandra cassandra 91 Jun 17 10:28 debs-energydata10m-ka-38-TOC.txt
-rw-r--r-- 1 cassandra cassandra 1811 Jun 17 10:58 debs-energydata10m-ka-39-CompressionInfo.db
-rw-r--r-- 1 cassandra cassandra 4475513 Jun 17 10:58 debs-energydata10m-ka-39-Data.db
-rw-r--r-- 1 cassandra cassandra 10 Jun 17 10:58 debs-energydata10m-ka-39-Digest.sha1
-rw-r--r-- 1 cassandra cassandra 86392 Jun 17 10:58 debs-energydata10m-ka-39-Filter.db
-rw-r--r-- 1 cassandra cassandra 1243818 Jun 17 10:58 debs-energydata10m-ka-39-Index.db
-rw-r--r-- 1 cassandra cassandra 9895 Jun 17 10:58 debs-energydata10m-ka-39-Statistics.db
-rw-r--r-- 1 cassandra cassandra 8704 Jun 17 10:58 debs-energydata10m-ka-39-Summary.db
-rw-r--r-- 1 cassandra cassandra 91 Jun 17 10:58 debs-energydata10m-ka-39-TOC.txt
/var/lib/cassandra/data/debs/energydata10m-fa83059014cd11e5b5ddabd6d8b6d1d3:
total 0
drwxr-xr-x 3 cassandra cassandra 40 Jun 17 10:07 snapshots
/var/lib/cassandra/data/debs/energydata10m-fa83059014cd11e5b5ddabd6d8b6d1d3/snapshots:
total 4
drwxr-xr-x 2 cassandra cassandra 4096 Jun 17 10:07 1434535647574-energydata10m
/var/lib/cassandra/data/debs/energydata10m-fa83059014cd11e5b5ddabd6d8b6d1d3/snapshots/1434535647574-energydata10m:
total 326784
-rw-r--r-- 1 cassandra cassandra 92923 Jun 17 09:15 debs-energydata10m-ka-37-CompressionInfo.db
-rw-r--r-- 1 cassandra cassandra 240323836 Jun 17 09:15 debs-energydata10m-ka-37-Data.db
-rw-r--r-- 1 cassandra cassandra 10 Jun 17 09:15 debs-energydata10m-ka-37-Digest.sha1
-rw-r--r-- 1 cassandra cassandra 4520064 Jun 17 09:15 debs-energydata10m-ka-37-Filter.db
-rw-r--r-- 1 cassandra cassandra 65218608 Jun 17 09:15 debs-energydata10m-ka-37-Index.db
-rw-r--r-- 1 cassandra cassandra 9919 Jun 17 09:15 debs-energydata10m-ka-37-Statistics.db
-rw-r--r-- 1 cassandra cassandra 452976 Jun 17 09:15 debs-energydata10m-ka-37-Summary.db
-rw-r--r-- 1 cassandra cassandra 91 Jun 17 09:15 debs-energydata10m-ka-37-TOC.txt
-rw-r--r-- 1 cassandra cassandra 3307 Jun 17 09:14 debs-energydata10m-ka-38-CompressionInfo.db
-rw-r--r-- 1 cassandra cassandra 8321541 Jun 17 09:14 debs-energydata10m-ka-38-Data.db
-rw-r--r-- 1 cassandra cassandra 10 Jun 17 09:14 debs-energydata10m-ka-38-Digest.sha1
-rw-r--r-- 1 cassandra cassandra 159384 Jun 17 09:14 debs-energydata10m-ka-38-Filter.db
-rw-r--r-- 1 cassandra cassandra 2294964 Jun 17 09:14 debs-energydata10m-ka-38-Index.db
-rw-r--r-- 1 cassandra cassandra 9895 Jun 17 09:14 debs-energydata10m-ka-38-Statistics.db
-rw-r--r-- 1 cassandra cassandra 16016 Jun 17 09:14 debs-energydata10m-ka-38-Summary.db
-rw-r--r-- 1 cassandra cassandra 91 Jun 17 09:14 debs-energydata10m-ka-38-TOC.txt
-rw-r--r-- 1 cassandra cassandra 3307 Jun 17 09:15 debs-energydata10m-ka-39-CompressionInfo.db
-rw-r--r-- 1 cassandra cassandra 8316992 Jun 17 09:15 debs-energydata10m-ka-39-Data.db
-rw-r--r-- 1 cassandra cassandra 10 Jun 17 09:15 debs-energydata10m-ka-39-Digest.sha1
-rw-r--r-- 1 cassandra cassandra 159296 Jun 17 09:15 debs-energydata10m-ka-39-Filter.db
-rw-r--r-- 1 cassandra cassandra 2293614 Jun 17 09:15 debs-energydata10m-ka-39-Index.db
-rw-r--r-- 1 cassandra cassandra 9895 Jun 17 09:15 debs-energydata10m-ka-39-Statistics.db
-rw-r--r-- 1 cassandra cassandra 16000 Jun 17 09:15 debs-energydata10m-ka-39-Summary.db
-rw-r--r-- 1 cassandra cassandra 91 Jun 17 09:15 debs-energydata10m-ka-39-TOC.txt
-rw-r--r-- 1 cassandra cassandra 755 Jun 17 10:07 debs-energydata10m-ka-40-CompressionInfo.db
-rw-r--r-- 1 cassandra cassandra 1781300 Jun 17 10:07 debs-energydata10m-ka-40-Data.db
-rw-r--r-- 1 cassandra cassandra 10 Jun 17 10:07 debs-energydata10m-ka-40-Digest.sha1
-rw-r--r-- 1 cassandra cassandra 34752 Jun 17 10:07 debs-energydata10m-ka-40-Filter.db
-rw-r--r-- 1 cassandra cassandra 500220 Jun 17 10:07 debs-energydata10m-ka-40-Index.db
-rw-r--r-- 1 cassandra cassandra 9895 Jun 17 10:07 debs-energydata10m-ka-40-Statistics.db
-rw-r--r-- 1 cassandra cassandra 3552 Jun 17 10:07 debs-energydata10m-ka-40-Summary.db
-rw-r--r-- 1 cassandra cassandra 91 Jun 17 10:07 debs-energydata10m-ka-40-TOC.txt
-rw-r--r-- 1 cassandra cassandra 152 Jun 17 10:07 manifest.json
/var/lib/cassandra/data/debs/energydata-d615ace0141d11e5b5ddabd6d8b6d1d3:
total 0
Information: the data of energydata10m and energydata1000m already existed before energydata100m was written (that is the 14 GB of disk space in use before launching).
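To see where the bytes in a listing like the one above go, it can help to total the file sizes per SSTable component (Data, Index, Filter, and so on). The following is a minimal sketch, not an official tool: the function name `bytes_per_component`, the regular expression and the `SAMPLE` constant are illustrative, and the parser assumes the plain `ls -l` layout shown in the listing. `SAMPLE` holds four lines copied from that listing.

```python
import re
from collections import defaultdict

# Captures the size column and the component name of an SSTable file in
# `ls -l` output, e.g. "... 8375321 Jun 17 13:37 debs-energydata100m-ka-399-Data.db".
# Assumption: the listing uses the same column layout as in the question.
LS_LINE = re.compile(
    r"^-\S+\s+\d+\s+\S+\s+\S+\s+(\d+)\s+.*-([A-Za-z]+)\.(?:db|sha1|txt)$"
)

def bytes_per_component(ls_output: str) -> dict:
    """Sum file sizes in bytes, grouped by SSTable component name."""
    totals = defaultdict(int)
    for line in ls_output.splitlines():
        match = LS_LINE.match(line.strip())
        if match:  # directories, "total N" lines and manifest.json are skipped
            size, component = match.groups()
            totals[component] += int(size)
    return dict(totals)

# Four lines taken from the listing in the question.
SAMPLE = """\
-rw-r--r-- 1 cassandra cassandra 8375321 Jun 17 13:37 debs-energydata100m-ka-399-Data.db
-rw-r--r-- 1 cassandra cassandra 2292966 Jun 17 13:37 debs-energydata100m-ka-399-Index.db
-rw-r--r-- 1 cassandra cassandra 8332947 Jun 17 13:39 debs-energydata100m-ka-400-Data.db
-rw-r--r-- 1 cassandra cassandra 2290716 Jun 17 13:39 debs-energydata100m-ka-400-Index.db
"""

if __name__ == "__main__":
    for component, size in sorted(bytes_per_component(SAMPLE).items()):
        print(f"{component:>16}: {size} bytes")
```

Running the same function over the full `ls -lR` output of one node shows how much of the footprint is index and bloom filter data rather than row data, and how much sits in snapshots.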
************** EDIT **************
I found calculation formulas here: http://docs.datastax.com/en/cassandra/1.2/cassandra/architecture/architecturePlanningUserData_t.html They say that the data on disk can be much larger than the original dataset. Can someone explain how to calculate the values from the link above? I don't know the data sizes that are needed.

The following documentation explains the data sizes and their calculation:
http://docs.datastax.com/en/cassandra/1.2/cassandra/architecture/architecturePlanningUserData_t.html
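As a rough illustration of that kind of calculation, here is a minimal sketch. The three overhead constants are the per-column, per-row and per-index-entry figures as I recall them from the linked Cassandra 1.2 page, so treat them as assumptions and check them against the documentation. The function name `estimate_bytes` and all example inputs are hypothetical, and the result ignores compression, compaction leftovers and snapshots.

```python
# Assumptions: overhead constants recalled from the linked Cassandra 1.2
# planning page; verify them there before relying on the numbers.
COLUMN_OVERHEAD_BYTES = 15       # added to every column's name + value size
ROW_OVERHEAD_BYTES = 23          # added to every row
INDEX_ENTRY_OVERHEAD_BYTES = 32  # added to every primary key index entry

def estimate_bytes(num_rows, avg_key_size, column_sizes, replication_factor):
    """Estimate the uncompressed on-disk size across all replicas.

    column_sizes is a list of (name_bytes, value_bytes) pairs, one per column.
    """
    columns = sum(name + value + COLUMN_OVERHEAD_BYTES
                  for name, value in column_sizes)
    row_size = avg_key_size + ROW_OVERHEAD_BYTES + columns
    data_size = num_rows * row_size
    index_size = num_rows * (INDEX_ENTRY_OVERHEAD_BYTES + avg_key_size)
    # Each replica stores its own full copy of data and index.
    return (data_size + index_size) * replication_factor

if __name__ == "__main__":
    # Hypothetical inputs: 1 million rows, a 4-byte int key, and six columns
    # sized after the schema in the question (int = 4, timestamp = 8, float = 4).
    columns = [(8, 4), (12, 4), (7, 4), (2, 8), (4, 4), (3, 4)]
    total = estimate_bytes(1_000_000, 4, columns, replication_factor=2)
    print(f"{total / 1e6:.0f} MB before compression")
```

The point of the sketch is that per-column overhead can easily exceed the size of small values such as ints and floats, and that the total is then multiplied by the replication factor.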

Related

How to get the vendor and product ID of a USB device from /sys/bus/usb/devices

I have a WiFi adapter plugged into my computer. I can find its ID numbers by looking through the output of lsusb.
Bus 005 Device 009: ID 1737:0071 Linksys WUSB600N v1 Dual-Band Wireless-N Network Adapter [Ralink RT2870]
This is the only WiFi adapter I currently have plugged in, so this is obviously it. I searched around in /sys/bus/usb/devices/ until I found this path on my machine:
# ls -l /sys/bus/usb/devices/5-3.1:1.0/
total 0
-rw-r--r-- 1 root root 4096 Dec 28 18:11 authorized
-r--r--r-- 1 root root 4096 Dec 28 18:11 bAlternateSetting
-r--r--r-- 1 root root 4096 Dec 28 18:06 bInterfaceClass
-r--r--r-- 1 root root 4096 Dec 28 18:06 bInterfaceNumber
-r--r--r-- 1 root root 4096 Dec 28 18:06 bInterfaceProtocol
-r--r--r-- 1 root root 4096 Dec 28 18:06 bInterfaceSubClass
-r--r--r-- 1 root root 4096 Dec 28 18:11 bNumEndpoints
lrwxrwxrwx 1 root root 0 Dec 28 18:06 driver -> ../../../../../../../../bus/usb/drivers/rt2800usb
drwxr-xr-x 3 root root 0 Dec 28 18:11 ep_01
drwxr-xr-x 3 root root 0 Dec 28 18:11 ep_02
drwxr-xr-x 3 root root 0 Dec 28 18:11 ep_03
drwxr-xr-x 3 root root 0 Dec 28 18:11 ep_04
drwxr-xr-x 3 root root 0 Dec 28 18:11 ep_05
drwxr-xr-x 3 root root 0 Dec 28 18:11 ep_06
drwxr-xr-x 3 root root 0 Dec 28 18:11 ep_81
drwxr-xr-x 3 root root 0 Dec 28 18:06 ieee80211
drwxr-xr-x 5 root root 0 Dec 28 18:06 leds
-r--r--r-- 1 root root 4096 Dec 28 18:11 modalias
drwxr-xr-x 3 root root 0 Dec 28 18:06 net
drwxr-xr-x 2 root root 0 Dec 28 18:11 power
lrwxrwxrwx 1 root root 0 Dec 28 18:06 subsystem -> ../../../../../../../../bus/usb
-r--r--r-- 1 root root 4096 Dec 28 18:11 supports_autosuspend
-rw-r--r-- 1 root root 4096 Dec 28 18:06 uevent
By looking at the driver symbolic link I see this is using the rt2800usb driver, so this has to be the correct entry for my WiFi adapter. But identifying a device by its kernel driver name is inexact, and I would prefer not to do it that way. Is there a file under /sys/bus/usb/devices/5-3.1:1.0/ that can tell me the vendor ID and the product ID of the entry I am looking at?
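One possible approach, sketched here under the assumption of the usual sysfs layout: an entry such as 5-3.1:1.0 is a USB interface, and its parent directory (5-3.1) is the device, which exposes the idVendor and idProduct attribute files. The helper name `usb_ids` is hypothetical.

```python
import os

def usb_ids(interface_path: str) -> tuple:
    """Return (idVendor, idProduct) for a USB interface directory in sysfs.

    Assumption: interface_path points at an interface entry such as
    /sys/bus/usb/devices/5-3.1:1.0, whose real parent directory is the
    device directory holding the idVendor and idProduct files.
    """
    # Entries under /sys/bus/usb/devices are symlinks, so resolve the link
    # first and then step up from the interface to the device directory.
    device_dir = os.path.dirname(os.path.realpath(interface_path))

    def read_attr(name: str) -> str:
        with open(os.path.join(device_dir, name)) as handle:
            return handle.read().strip()

    return read_attr("idVendor"), read_attr("idProduct")

# Usage on a real system (path taken from the question):
#   vendor, product = usb_ids("/sys/bus/usb/devices/5-3.1:1.0")
```

For the adapter in the question, the returned pair would be expected to match the 1737:0071 shown by lsusb.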

Can I safely delete Hive jars from Spark dependencies?

I'm using Spark standalone without Hive enabled. According to the docs:
since Hive has a large number of dependencies, these dependencies are not included in the default Spark distribution
Yet, when I look at the jars, they are all there.
user@mymachine:~/spark-3.0.0-bin-hadoop3.2/jars$ ls -l | grep hive
-rw-r--r-- 1 user user 183472 Jun 6 2020 hive-beeline-2.3.7.jar
-rw-r--r-- 1 user user 43387 Jun 6 2020 hive-cli-2.3.7.jar
-rw-r--r-- 1 user user 436980 Jun 6 2020 hive-common-2.3.7.jar
-rw-r--r-- 1 user user 10839104 Jun 6 2020 hive-exec-2.3.7-core.jar
-rw-r--r-- 1 user user 116311 Jun 6 2020 hive-jdbc-2.3.7.jar
-rw-r--r-- 1 user user 326419 Jun 6 2020 hive-llap-common-2.3.7.jar
-rw-r--r-- 1 user user 8194428 Jun 6 2020 hive-metastore-2.3.7.jar
-rw-r--r-- 1 user user 916206 Jun 6 2020 hive-serde-2.3.7.jar
-rw-r--r-- 1 user user 54116 Jun 6 2020 hive-shims-0.23-2.3.7.jar
-rw-r--r-- 1 user user 8785 Jun 6 2020 hive-shims-2.3.7.jar
-rw-r--r-- 1 user user 120305 Jun 6 2020 hive-shims-common-2.3.7.jar
-rw-r--r-- 1 user user 12984 Jun 6 2020 hive-shims-scheduler-2.3.7.jar
-rw-r--r-- 1 user user 234942 Jun 6 2020 hive-storage-api-2.7.1.jar
-rw-r--r-- 1 user user 38352 Jun 6 2020 hive-vector-code-gen-2.3.7.jar
-rw-r--r-- 1 user user 692922 Jun 6 2020 spark-hive_2.12-3.0.0.jar
-rw-r--r-- 1 user user 2079585 Jun 6 2020 spark-hive-thriftserver_2.12-3.0.0.jar
Is it safe to remove these?

How to split a file based on 2 or more column values

I am trying to split a file based on 2 column values.
I am able to split the file based on one column:
awk -F\| '{print>$1}' file1
The data needs to be split based on the 2nd and 5th columns (column 2 = 3 AND column 5 = M):
A1|3|100|20|M
A1|5|101|20|N
A1|5|101|30|M
A1|3|105|20|O
B1|3|150|5|M
A1|3|106|20|Q
A1|5|101|20|N
A1|5|101|30|Q
A1|5108|20|O
B1|3|150|5|M
Output : File 1
A1|5|101|20|N
A1|5|101|30|M
A1|3|105|20|O
A1|3|106|20|Q
A1|5|101|20|N
A1|5|101|30|Q
A1|5108|20|O
Output: File 2
A1|3|100|20|M
B1|3|150|5|M
B1|3|150|5|M
cat file1
A1|3|100|20|M
A1|5|101|20|N
A1|5|101|30|M
A1|3|105|20|O
B1|3|150|5|M
A1|3|106|20|Q
A1|5|101|20|N
A1|5|101|30|Q
A1|5108|20|O
B1|3|150|5|M
Try this:
awk -F\| '{print>$2$5}' file1
Which gives me:
ls -l
total 120
-rw-rw-r-- 1 tink tink 26 Jun 21 12:22 3M
-rw-rw-r-- 1 tink tink 15 Jun 21 12:22 3M
-rw-rw-r-- 1 tink tink 14 Jun 21 12:22 3O
-rw-rw-r-- 1 tink tink 14 Jun 21 12:22 3Q
-rw-rw-r-- 1 tink tink 13 Jun 21 12:22 5108
-rw-rw-r-- 1 tink tink 15 Jun 21 12:22 5M
-rw-rw-r-- 1 tink tink 14 Jun 21 12:22 5N
-rw-rw-r-- 1 tink tink 15 Jun 21 12:22 5N
-rw-rw-r-- 1 tink tink 14 Jun 21 12:22 5Q
-rw-rw-r-- 1 tink tink 140 Jun 21 12:16 file1
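The awk one-liner above writes one file per combination of column 2 and column 5. If exactly two outputs are wanted, as in the question (rows where column 2 is 3 and column 5 is M, and everything else), the same idea can be sketched in Python. This is an illustrative alternative rather than the answerer's method; the function name `split_rows` is hypothetical.

```python
def split_rows(lines, col2="3", col5="M"):
    """Partition pipe-delimited lines into (matching, others).

    A line matches when its 2nd field equals col2 and its 5th field equals col5.
    """
    matching, others = [], []
    for line in lines:
        # Strip each field so stray whitespace around a value does not
        # prevent a match.
        fields = [field.strip() for field in line.rstrip("\n").split("|")]
        # Lines with fewer than 5 fields (such as "A1|5108|20|O") never match.
        if len(fields) >= 5 and fields[1] == col2 and fields[4] == col5:
            matching.append(line)
        else:
            others.append(line)
    return matching, others

if __name__ == "__main__":
    # Assumes file1 from the question is in the current directory.
    with open("file1") as source:
        matching, others = split_rows(source.read().splitlines())
    with open("file2_matching", "w") as out:
        out.write("\n".join(matching) + "\n")
    with open("file1_others", "w") as out:
        out.write("\n".join(others) + "\n")
```

The output file names in the usage part are placeholders; the relative order of lines within each output is preserved.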

Restore not restoring data all the time

I have Cassandra 3.7 installed in a container and managed by Kubernetes.
I created a keyspace cathy1 with replication factor 3.
Inside the Cassandra container on node1, I created the keyspace as follows:
CREATE KEYSPACE cathy1 WITH replication = {'class':'SimpleStrategy', 'replication_factor' : 3};
CREATE TABLE cathy1.employees(emp_id int PRIMARY KEY,emp_name text);
INSERT INTO cathy1.employees(emp_id,emp_name) VALUES (1,'cathy');
INSERT INTO cathy1.employees(emp_id,emp_name) VALUES (2,'jon');
So each node owns 100% of the data.
I run a cqlsh -f list_tables on each node:
emp_id | emp_name
--------+----------
1 | cathy
2 | jon
(2 rows)
I run on node 2 :
nodetool snapshot -t mycathy1-node2 cathy1
I see a directory mycathy1-node2 under cassandra/data/cathy1/employees*/snapshots containing this:
-rw-r--r-- 1 root root 32 Oct 18 20:27 manifest.json
-rw-r--r-- 2 root root 43 Oct 18 20:22 mb-12-big-CompressionInfo.db
-rw-r--r-- 2 root root 96 Oct 18 20:22 mb-12-big-Data.db
-rw-r--r-- 2 root root 9 Oct 18 20:22 mb-12-big-Digest.crc32
-rw-r--r-- 2 root root 16 Oct 18 20:22 mb-12-big-Filter.db
-rw-r--r-- 2 root root 32 Oct 18 20:22 mb-12-big-Index.db
-rw-r--r-- 2 root root 4610 Oct 18 20:23 mb-12-big-Statistics.db
-rw-r--r-- 2 root root 56 Oct 18 20:22 mb-12-big-Summary.db
-rw-r--r-- 2 root root 92 Oct 18 20:22 mb-12-big-TOC.txt
Then I truncate the table
cqlsh -e "truncate cathy1.employees"
At that moment there are no files under cassandra/data/cathy1/employees* on any nodes
Only the snapshots directory remains
I run a cqlsh -f list_tables on each node:
emp_id | emp_name
--------+----------
(0 rows)
I run a repair on node 2:
nodetool repair cathy1
it finishes successfully
then still on node 2
cd cassandra/data/employees*
cp ./snapshots/mycathy1-node2/* .
-rw-r--r-- 1 root root 32 Oct 18 20:34 manifest.json
-rw-r--r-- 1 root root 43 Oct 18 20:34 mb-12-big-CompressionInfo.db
-rw-r--r-- 1 root root 96 Oct 18 20:34 mb-12-big-Data.db
-rw-r--r-- 1 root root 9 Oct 18 20:34 mb-12-big-Digest.crc32
-rw-r--r-- 1 root root 16 Oct 18 20:34 mb-12-big-Filter.db
-rw-r--r-- 1 root root 32 Oct 18 20:34 mb-12-big-Index.db
-rw-r--r-- 1 root root 4610 Oct 18 20:34 mb-12-big-Statistics.db
-rw-r--r-- 1 root root 56 Oct 18 20:34 mb-12-big-Summary.db
-rw-r--r-- 1 root root 92 Oct 18 20:34 mb-12-big-TOC.txt
drwxr-xr-x 16 root root 4096 Oct 18 20:29 snapshots
Then I run nodetool refresh employees
I run a cqlsh -f list_tables on each node:
emp_id | emp_name
--------+----------
(0 rows)
I run nodetool repair cathy1
and there is still no data visible.
Pending Flushes: 0
Table: employees
Space used (live): 4954
Space used (total): 4954
Space used by snapshots (total): 59873
Off heap memory used (total): 32
SSTable Compression Ratio: 0.75
Number of keys (estimate): 4
Even though the statistics say there are 4 keys in table cathy1.employees, no data is visible. I also ran:
nodetool flush cathy1
There is still no data visible with cqlsh.
Why is that?
You need to run sstableloader on the directory into which you copied your snapshot files:
sstableloader -d <node_ip> -u cassandra -pw cassandra <directory_location>
Note: If your current directory is the directory into which you copied your snapshot files, then you don't need to put anything in the directory_location field.
For more details about sstableloader: https://docs.datastax.com/en/cassandra/2.0/cassandra/tools/toolsBulkloader_t.html

Page cache limit

I'm using Ubuntu with kernel 3.2.1, x86_64. I'm trying to benchmark a file system, and I want to limit the page cache size to avoid the file system cache taking up too much RAM, which would obviously improve performance (but would not reflect the results on systems with less memory).
Is there a way to do that? I've seen an option in some RHEL distributions for tuning /proc/sys/vm/pagecaches which seems to satisfy this, but I don't see anything useful in Ubuntu except dirty_background_ratio, which will only trigger flushing to disk, not limit caching (so I can get a lot more sync I/O etc.).
Thank you
Ubuntu does not seem to have vm.pagecache settings:
ls -l /proc/sys/vm/
total 0
-rw-r--r-- 1 root root 0 Jun 17 14:13 block_dump
--w------- 1 root root 0 Jun 17 14:13 compact_memory
-rw-r--r-- 1 root root 0 Jun 17 14:13 dirty_background_bytes
-rw-r--r-- 1 root root 0 Jun 17 09:16 dirty_background_ratio
-rw-r--r-- 1 root root 0 Jun 17 14:13 dirty_bytes
-rw-r--r-- 1 root root 0 Jun 17 14:13 dirty_expire_centisecs
-rw-r--r-- 1 root root 0 Jun 17 09:16 dirty_ratio
-rw-r--r-- 1 root root 0 Jun 17 09:16 dirty_writeback_centisecs
-rw-r--r-- 1 root root 0 Jun 17 14:13 drop_caches
-rw-r--r-- 1 root root 0 Jun 17 14:13 extfrag_threshold
-rw-r--r-- 1 root root 0 Jun 17 14:13 hugepages_treat_as_movable
-rw-r--r-- 1 root root 0 Jun 17 14:13 hugetlb_shm_group
-rw-r--r-- 1 root root 0 Jun 17 09:16 laptop_mode
-rw-r--r-- 1 root root 0 Jun 17 14:13 legacy_va_layout
-rw-r--r-- 1 root root 0 Jun 17 14:13 lowmem_reserve_ratio
-rw-r--r-- 1 root root 0 Jun 17 14:13 max_map_count
-rw-r--r-- 1 root root 0 Jun 17 14:13 memory_failure_early_kill
-rw-r--r-- 1 root root 0 Jun 17 14:13 memory_failure_recovery
-rw-r--r-- 1 root root 0 Jun 17 14:13 min_free_kbytes
-rw-r--r-- 1 root root 0 Jun 17 14:13 min_slab_ratio
-rw-r--r-- 1 root root 0 Jun 17 14:13 min_unmapped_ratio
-rw-r--r-- 1 root root 0 Jun 17 09:15 mmap_min_addr
-rw-r--r-- 1 root root 0 Jun 17 14:13 nr_hugepages
-rw-r--r-- 1 root root 0 Jun 17 14:13 nr_hugepages_mempolicy
-rw-r--r-- 1 root root 0 Jun 17 14:13 nr_overcommit_hugepages
-r--r--r-- 1 root root 0 Jun 17 14:13 nr_pdflush_threads
-rw-r--r-- 1 root root 0 Jun 17 14:13 numa_zonelist_order
-rw-r--r-- 1 root root 0 Jun 17 14:13 oom_dump_tasks
-rw-r--r-- 1 root root 0 Jun 17 14:13 oom_kill_allocating_task
-rw-r--r-- 1 root root 0 Jun 17 09:15 overcommit_memory
-rw-r--r-- 1 root root 0 Jun 17 14:13 overcommit_ratio
-rw-r--r-- 1 root root 0 Jun 17 14:13 page-cluster
-rw-r--r-- 1 root root 0 Jun 17 14:13 panic_on_oom
-rw-r--r-- 1 root root 0 Jun 17 14:13 percpu_pagelist_fraction
-rw-r--r-- 1 root root 0 Jun 17 14:13 scan_unevictable_pages
-rw-r--r-- 1 root root 0 Jun 17 14:13 stat_interval
-rw-r--r-- 1 root root 0 Jun 17 14:13 swappiness
-rw-r--r-- 1 root root 0 Jun 17 14:13 vfs_cache_pressure
-rw-r--r-- 1 root root 0 Jun 17 14:13 zone_reclaim_mode
You could try the following:
vi /etc/sysctl.conf
vm.min_free_kbytes=1024
vm.swappiness = 100
then run
sysctl -p
vm.min_free_kbytes = 1024
vm.swappiness = 100
I am unsure whether it is of any help.
Setting swappiness to 100 makes the kernel swap more application data to disk when RAM is exhausted.
Set the parameter with echo 3019053 > /proc/sys/vm/min_free_kbytes; this value is in kB and corresponds to about 2.8 GB. Then check the value with: sysctl -a | grep min_free_kbytes.
To answer the question: if you want to avoid too much cache accumulating, set this value to about 70-90% of the total RAM.
To avoid swap accumulating, turn swap off with swapoff -a, and comment out the swap line in /etc/fstab to keep it off permanently across reboots. This should give the behaviour the questioner wants.
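The arithmetic behind those numbers can be sketched as follows. The helper names are hypothetical; the code only computes a candidate value from the text of /proc/meminfo and does not write any sysctl. Note that the kernel documentation warns that setting min_free_kbytes too high can OOM the machine instantly, so a value in the 70-90% range is best tried only on a disposable benchmark system.

```python
def mem_total_kb(meminfo_text: str) -> int:
    """Extract MemTotal (in kB) from the text of /proc/meminfo."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            return int(line.split()[1])
    raise ValueError("MemTotal not found")

def kb_to_gib(kilobytes: int) -> float:
    """Convert kB (as used by min_free_kbytes) to GiB."""
    return kilobytes / (1024 * 1024)

def min_free_kbytes_for(total_kb: int, fraction: float) -> int:
    """Candidate min_free_kbytes value for a given fraction of total RAM."""
    if not 0 < fraction < 1:
        raise ValueError("fraction must be between 0 and 1")
    return int(total_kb * fraction)

if __name__ == "__main__":
    with open("/proc/meminfo") as handle:
        total = mem_total_kb(handle.read())
    candidate = min_free_kbytes_for(total, 0.75)  # 0.75 is an arbitrary example
    print(f"MemTotal: {kb_to_gib(total):.2f} GiB")
    print(f"echo {candidate} > /proc/sys/vm/min_free_kbytes")
```

As a check on the figure quoted above, `kb_to_gib(3019053)` comes out at roughly 2.88, which matches the "about 2.8 GB" in the answer.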

Resources