Cassandra CommitLog Directory Forgetting To Remove Files

Version: DSE 6.7.5, CQL spec 3.4.5.
I have commitlog_total_space_in_mb set to 8 GB.
The commitlog directory is currently at 13 GB.
Looking at the timestamps in the directory, it appears that Cassandra either forgets about commitlog segments or fails to delete them when it flushes memtables.
This happens on multiple nodes.
-rw-r--r--. 1 cassandra cassandra 33554338 Sep 20 02:00 CommitLog-600-1568892978830.log
-rw-r--r--. 1 cassandra cassandra 33554227 Sep 20 02:02 CommitLog-600-1568892978853.log
-rw-r--r--. 1 cassandra cassandra 33554217 Sep 20 02:02 CommitLog-600-1568892978862.log
-rw-r--r--. 1 cassandra cassandra 33554337 Sep 20 02:03 CommitLog-600-1568892978863.log
-rw-r--r--. 1 cassandra cassandra 33554169 Sep 20 02:04 CommitLog-600-1568892978864.log
-rw-r--r--. 1 cassandra cassandra 33554412 Sep 20 08:19 CommitLog-600-1568892954896.log
-rw-r--r--. 1 cassandra cassandra 33554326 Sep 20 08:19 CommitLog-600-1568892954901.log
-rw-r--r--. 1 cassandra cassandra 33554133 Sep 20 08:20 CommitLog-600-1568892954904.log
-rw-r--r--. 1 cassandra cassandra 33554281 Sep 20 08:20 CommitLog-600-1568892954905.log
-rw-r--r--. 1 cassandra cassandra 33553885 Sep 20 08:20 CommitLog-600-1568892954906.log
When I perform a nodetool flush or nodetool drain, none of the old files are removed.
-rw-r--r--. 1 cassandra cassandra 33554338 Sep 20 02:00 CommitLog-600-1568892978830.log
-rw-r--r--. 1 cassandra cassandra 33554227 Sep 20 02:02 CommitLog-600-1568892978853.log
-rw-r--r--. 1 cassandra cassandra 33554217 Sep 20 02:02 CommitLog-600-1568892978862.log
-rw-r--r--. 1 cassandra cassandra 33554337 Sep 20 02:03 CommitLog-600-1568892978863.log
-rw-r--r--. 1 cassandra cassandra 33554169 Sep 20 02:04 CommitLog-600-1568892978864.log
-rw-r--r--. 1 cassandra cassandra 28 Sep 20 08:46 CommitLog-600-1568892981041.log
When I start the node back up, it replays them and crashes around the final commitlog (stack trace: https://pastebin.com/Kw9Kee5C).
CassandraDaemon.java:129 - Exception in thread Thread[PerDiskMemtableFlushWriter_0:11,5,main] java.lang.AssertionError: null
It won't start back up unless I move some or all of the last commitlog segments out of the directory.
What can I do to fix this problem?

I have resolved this for the time being by changing the table's compaction strategy to
compaction = {'class': 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'}
For some reason, having a map-typed column combined with the following compaction settings was causing the errors:
{'class': 'org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy', 'compaction_window_size': '30', 'compaction_window_unit': 'DAYS', 'max_threshold': '32', 'min_threshold': '4', 'split_during_flush': 'true'}
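For reference, a minimal CQL sketch of applying that change is below; the keyspace/table name ks.sensor_data is a placeholder for the affected table, and the options simply mirror the ones quoted above.
-- Placeholder table name; point this at the table that held the map column.
ALTER TABLE ks.sensor_data
  WITH compaction = {'class': 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'};

-- The previous (problematic) settings on the same table looked like:
-- compaction = {'class': 'org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy',
--               'compaction_window_size': '30', 'compaction_window_unit': 'DAYS',
--               'max_threshold': '32', 'min_threshold': '4', 'split_during_flush': 'true'}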

Related

Duplicated Cassandra SSTable files

After an unsuccessful nodetool repair operation, I got two big SSTable files (the last two in the listing below) instead of one, each having the same size as the single file before. These files now cannot be merged back by the common tools (nodetool cleanup, nodetool compact, nodetool repair). The tables are replicated to another Cassandra node (replication_factor: 2), and there are two big SSTable files there as well now.
-rw-r--r-- 1 cassandra cassandra 16M Mar 5 12:36 mc-116413-big-Data.db
-rw-r--r-- 1 cassandra cassandra 34M Mar 5 01:21 mc-116320-big-Index.db
-rw-r--r-- 1 cassandra cassandra 39M Mar 3 22:46 mc-116125-big-Index.db
-rw-r--r-- 1 cassandra cassandra 66M Mar 5 12:25 mc-116412-big-Data.db
-rw-r--r-- 1 cassandra cassandra 262M Mar 5 05:51 mc-116365-big-Data.db
-rw-r--r-- 1 cassandra cassandra 263M Mar 5 08:46 mc-116386-big-Data.db
-rw-r--r-- 1 cassandra cassandra 263M Mar 5 11:42 mc-116407-big-Data.db
-rw-r--r-- 1 cassandra cassandra 7.2G Mar 5 03:18 mc-116345-big-Data.db
-rw-r--r-- 1 cassandra cassandra 43G Mar 3 22:46 mc-116125-big-Data.db
-rw-r--r-- 1 cassandra cassandra 48G Mar 5 01:21 mc-116320-big-Data.db
I suppose that one of these files contains duplicated data. How can I compact the files back into a single file?
Maybe I'm not looking properly but I don't see any duplicate SSTable files in the file listing you posted.
If you're referring to these 2:
-rw-r--r-- 1 cassandra cassandra 43G Mar 3 22:46 mc-116125-big-Data.db
-rw-r--r-- 1 cassandra cassandra 48G Mar 5 01:21 mc-116320-big-Data.db
They're not duplicates because they have 2 different generation IDs -- 116125 and 116320. This means they also have different ancestors.
If you're referring to these:
-rw-r--r-- 1 cassandra cassandra 39M Mar 3 22:46 mc-116125-big-Index.db
-rw-r--r-- 1 cassandra cassandra 43G Mar 3 22:46 mc-116125-big-Data.db
-rw-r--r-- 1 cassandra cassandra 34M Mar 5 01:21 mc-116320-big-Index.db
-rw-r--r-- 1 cassandra cassandra 48G Mar 5 01:21 mc-116320-big-Data.db
Again, they're not duplicates of each other. The *-Data.db files contain the actual data. The *-Index.db files are component files which contain the partition index, i.e. the index of the partitions within the data files which are used for fast retrieval.
If you're interested, I've explained it in a bit more detail in this post -- https://community.datastax.com/questions/5219/. Cheers!
[UPDATE] To respond to this follow-up question:
Could you explain why these two files don't get compacted into a single file, as they usually would?
Assuming the table is configured with SizeTieredCompactionStrategy, it will require similar-sized sstables as candidates before they get compacted together.
The default minimum number of sstable candidates is min_threshold: 4, so you need 4 similarly-sized sstables before a compaction is triggered.
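As an illustration (the table name is a placeholder), those thresholds live in the table's compaction options and can be adjusted with CQL if needed:
-- Placeholder table name; 4 and 32 are the STCS defaults mentioned above.
ALTER TABLE ks.my_table
  WITH compaction = {'class': 'SizeTieredCompactionStrategy',
                     'min_threshold': '4',
                     'max_threshold': '32'};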

How to print change data logs to stdout in YugabyteDB?

I am using YugabyteDB 1.3.0.0 and following https://docs.yugabyte.com/latest/deploy/cdc/use-cdc/ to learn YugabyteDB CDC.
Procedure Followed:
a) Generated users dynamically and inserted them into the table yugastore.users continuously through a script.
b) Downloaded yb-cdc-connector.jar using the command: wget -O yb-cdc-connector.jar https://github.com/yugabyte/yb-kafka-connector/blob/master/yb-cdc/yb-cdc-connector.jar?raw=true
c) Copied yb-cdc-connector.jar to jre/lib/ext using the command: cp -a /root/yb-cdc-connector.jar /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.232.b09-0.el7_7.x86_64/jre/lib/ext/
d) java -jar /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.232.b09-0.el7_7.x86_64/jre/lib/ext/yb_cdc_connector.jar --table_name yugastore.users --master_addrs 127.0.0.1 --stream_id 1 --log_only
Error Logs:
[root@srvr0 ~]# java -jar /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.232.b09-0.el7_7.x86_64/jre/lib/ext/yb_cdc_connector.jar --table_name yugastore.users --master_addrs 127.0.0.1 --stream_id 1 --log_only
Error: Unable to access jarfile /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.232.b09-0.el7_7.x86_64/jre/lib/ext/yb_cdc_connector.jar
Records Insertion:
yugastore=> select count(*) from yugastore.users;
count
-------
59414
(1 row)
yugastore=> select count(*) from yugastore.users;
count
-------
60066
(1 row)
yugastore=> select count(*) from yugastore.users;
count
-------
79341
(1 row)
yugastore=>
java:
[root@srvr0 ~]# java -version
openjdk version "1.8.0_232"
OpenJDK Runtime Environment (build 1.8.0_232-b09)
OpenJDK 64-Bit Server VM (build 25.232-b09, mixed mode)
Availability of yb-cdc-connector.jar:
[root@srvr0 ~]# ll /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.232.b09-0.el7_7.x86_64/jre/lib/ext/
total 54756
-rw-r--r--. 1 root root 4003855 Oct 22 12:18 cldrdata.jar
-rw-r--r--. 1 root root 9445 Oct 22 12:18 dnsns.jar
-rw-r--r--. 1 root root 48733 Oct 22 12:18 jaccess.jar
-rw-r--r--. 1 root root 1204895 Oct 22 12:18 localedata.jar
-rw-r--r--. 1 root root 617 Oct 22 12:18 meta-index
-rw-r--r--. 1 root root 2033680 Oct 22 12:18 nashorn.jar
-rw-r--r--. 1 root root 52079 Oct 22 12:18 sunec.jar
-rw-r--r--. 1 root root 304504 Oct 22 12:18 sunjce_provider.jar
-rw-r--r--. 1 root root 279788 Oct 22 12:18 sunpkcs11.jar
-rw-r--r--. 1 root root 48026360 Jan 16 08:31 yb-cdc-connector.jar
-rw-r--r--. 1 root root 78006 Oct 22 12:18 zipfs.jar
Please help me print the change logs to stdout!
Update 1:
A small correction is required in the doc: edit yb_cdc_connector.jar to yb-cdc-connector.jar in the documented command, as wget downloads it as yb_cdc_connector.jar.
Though there are changes happening to the table (inserts and deletes), it shows polling but no CDC output is printed.
Logs:
[root@srvr0 ~]# java -jar ./yb_cdc_connector.jar --table_name yugastore.users --master_addrs 127.0.0.1 --stream_id 1 --log_only
[2020-01-22 01:37:24,221] INFO Starting CDC Kafka Connector... (org.yb.cdc.Main:28)
2020-01-22 01:37:24,393 [INFO|org.yb.cdc.KafkaConnector|KafkaConnector] Creating new YB client...
[2020-01-22 01:37:28,344] INFO Discovered tablet YB Master for table YB Master with partition ["", "") (org.yb.client.AsyncYBClient:1593)
2020-01-22 01:37:28,839 [INFO|org.yb.cdc.KafkaConnector|KafkaConnector] Polling for new tablet c6f3d759202341ecad87e2617579371c
2020-01-22 01:37:28,842 [INFO|org.yb.cdc.KafkaConnector|KafkaConnector] Polling for new tablet e94b1748d1a742289500a30a38ff9eda
Insertion:
time: 01:37:49.980 cumulative records: 10
time: 01:37:52.169 cumulative records: 20
time: 01:37:52.410 cumulative records: 30
time: 01:37:52.425 cumulative records: 40
...
time: 01:39:28.139 cumulative records: 970
time: 01:39:28.171 cumulative records: 980
time: 01:39:28.208 cumulative records: 990
time: 01:39:28.246 cumulative records: 1000
Deletion:
yugastore=# delete from yugastore.users;
DELETE 21314
yugastore=# select count(*) from yugastore.users;
count
-------
0
(1 row)
The error shows that the jar file cannot be accessed in the system directory you moved it to. This is most likely a permission issue. Note that you don't even need to move this jar into a system directory, so keeping the jar in a normal user directory is the simplest option.
Also, you don't need the stream_id, since you want the default behavior most of the time. So try the following command from the user directory where you have the jar file:
java -jar ./yb_cdc_connector.jar \
--table_name yugastore.users \
--master_addrs 127.0.0.1:7100 \
--log_only

Write DataFrame to CSV in Spark

I am writing a Spark DataFrame to a CSV file using the code below:
println("Total number of reports: " + reportDf.count())
reportDf
.coalesce(1)
.write.format("com.databricks.spark.csv")
.csv("output/cluster.csv")
And the output is:
Total number of reports: 48720
spark@monikatest:~/output/cluster.csv$ ll
total 12
drwxrwxr-x 2 spark spark 4096 Mar 27 20:56 ./
drwxrwxr-x 3 spark spark 4096 Mar 27 20:56 ../
-rw-r--r-- 1 spark spark 0 Mar 27 20:56 _SUCCESS
-rw-r--r-- 1 spark spark 8 Mar 27 20:56 ._SUCCESS.crc
No data is written to the file; only the _SUCCESS file is present.
Can anyone please suggest how to overcome this?
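For reference, here is a minimal, self-contained Scala sketch of the same write path using Spark's built-in CSV writer; the DataFrame contents, output path, and header option are placeholders, and this is an illustration of the API rather than a confirmed fix for the empty output above (on a cluster, a path without a scheme is resolved against the job's default filesystem, not necessarily the driver's local disk).
// Minimal sketch (placeholders throughout); illustrates DataFrameWriter.csv, not a confirmed fix.
import org.apache.spark.sql.SparkSession

object WriteCsvSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("write-csv-sketch").getOrCreate()
    import spark.implicits._

    // Stand-in for the real reportDf.
    val reportDf = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "value")

    println("Total number of reports: " + reportDf.count())

    reportDf
      .coalesce(1)                     // one part file
      .write
      .mode("overwrite")
      .option("header", "true")
      .csv("output/cluster_csv")       // a directory; part-xxxx files are written inside it

    spark.stop()
  }
}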

How to write records of the same key to multiple files (custom partitioner)

I want to write data from a directory to partitions dynamically using Spark.
Here is the sample code.
val input_DF = spark.read.parquet("input path")
input_DF.write.mode("overwrite").partitionBy("colname").parquet("output path...")
As shown below, the number of records per key varies, and there is skew for one key.
input_DF.groupBy($"colname").agg(count("colname")).show()
+-----------------+------------------------+
|colname |count(colname) |
+-----------------+------------------------+
| NA| 14859816| --> More no of records
| A| 2907930|
| D| 1118504|
| B| 485151|
| C| 435305|
| F| 370095|
| G| 170060|
+-----------------+------------------------+
Because of this, the job fails when a reasonable amount of memory (8 GB) is given to each executor. The job completes successfully when high memory (15 GB) per executor is given, but it takes too long to complete.
I have tried using repartition, expecting it to distribute data evenly across partitions. But, as it uses the default HashPartitioner, all records for a key go to a single partition.
repartition(numPartitions, $"colname") --> creates a HashPartitioner on colname
But this creates as many part files as specified in repartition while moving all records of a key to one partition (all records with column value NA go to a single partition). The remaining part files have no records (only Parquet metadata, 38634 bytes).
-rw-r--r-- 2 hadoop hadoop 0 2017-11-20 14:29 /user/hadoop/table/_SUCCESS
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00000-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00001-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00002-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00003-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:07 /user/hadoop/table/part-r-00004-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00005-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00006-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 1038264502 2017-11-20 13:20 /user/hadoop/table/part-r-00007-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00008-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00009-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00010-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00011-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00012-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00013-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00014-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 128212247 2017-11-20 13:09 /user/hadoop/table/part-r-00015-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00016-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00017-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00018-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 117142244 2017-11-20 13:08 /user/hadoop/table/part-r-00019-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00020-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 347033731 2017-11-20 13:11 /user/hadoop/table/part-r-00021-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00022-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00023-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00024-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 100306686 2017-11-20 13:08 /user/hadoop/table/part-r-00025-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 36961707 2017-11-20 13:07 /user/hadoop/table/part-r-00026-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00027-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00028-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00029-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00030-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00031-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00032-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00033-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00034-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00035-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00036-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:07 /user/hadoop/table/part-r-00037-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00038-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00039-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00040-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00041-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 68859 2017-11-20 13:06 /user/hadoop/table/part-r-00042-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 4031720288 2017-11-20 14:29 /user/hadoop/table/part-r-00043-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00044-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00045-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00046-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00047-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00048-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00049-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00050-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
I would like to know:
Is there a way to write records of the same key to different partitions of a DataFrame/RDD? Perhaps a custom partitioner that writes every Nth record to the Nth partition:
(1st rec to partition 1)
(2nd rec to partition 2)
(3rd rec to partition 3)
(4th rec to partition 4)
(5th rec to partition 1)
(6th rec to partition 2)
(7th rec to partition 3)
(8th rec to partition 4)
If yes, can it be controlled using parameters like a maximum number of bytes per DataFrame/RDD partition?
As the expected result is just writing data into different sub-directories (Hive partitions) based on a key, I would like to distribute the records of a key across multiple tasks, with each task writing a part file under the key's sub-directory.
The problem was resolved by repartitioning on a unique key instead of the key used in partitionBy. If the DataFrame is missing a unique column for some reason, one can add a surrogate column using
df.withColumn("Unique_ID", monotonically_increasing_id())
and then repartition on "Unique_ID". This way we can distribute the data evenly across multiple partitions. To increase performance further, the data can be sorted within each DataFrame partition on a key used for join/group/partitionBy (see the sketch below).
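A hedged Scala sketch of that flow (column names, partition count, and paths are placeholders): add a surrogate ID, repartition on it so each key's rows are spread across many tasks, optionally sort within partitions, and let partitionBy still produce one sub-directory per key on write.
// Sketch only; assumes Spark 2.x+ and placeholder paths/column names.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, monotonically_increasing_id}

object SkewedPartitionWriteSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("skewed-partition-write").getOrCreate()

    val inputDF = spark.read.parquet("input path")

    inputDF
      .withColumn("Unique_ID", monotonically_increasing_id())  // surrogate key, unique per row
      .repartition(200, col("Unique_ID"))                      // spreads every colname value across ~200 tasks
      .sortWithinPartitions("colname")                         // optional: groups rows per key inside each task
      .drop("Unique_ID")
      .write
      .mode("overwrite")
      .partitionBy("colname")                                  // still one sub-directory per key value
      .parquet("output path...")

    spark.stop()
  }
}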

How is data in an HDFS block stored?

I was reading about HDFS and was wondering if there is any specific format in which the data in a block is arranged.
Suppose there is a 265 MB file that is copied to a Hadoop cluster and the HDFS block size is 64 MB. So the file is broken into 5 parts (64 MB + 64 MB + 64 MB + 64 MB + 9 MB) and distributed among the data nodes. Correct?
Is there any format within the 64 MB block in which the data is stored?
If there is any format/structure in which the data is stored within the block, then the stored data should be less than 64 MB, since the structure/header itself would take some space.
Since an HDFS DataNode is a logical filesystem (it runs on top of Linux and there is no separate partition for HDFS), all the blocks should be stored as files in the Linux partition. Correct?
How can I find the name of the file on Linux that actually stores the 64 MB HDFS block?
If anyone can answer these doubts/questions, that would be great. Thanks in advance.
No, the data is just split on 64 MB boundaries. Metadata is stored in a small separate file and on the NameNode.
No, a block is exactly the size you specified, and the data is split on exact 64 MB boundaries. If you have 5 parts (64 MB + 64 MB + 64 MB + 64 MB + 9 MB), then the last file would be 9 MB and all the others 64 MB.
Yes, the blocks are stored as files; each block is represented as a separate file, along with a small amount of metadata stored in an accompanying .meta file.
To find which file on Linux backs a given HDFS block, you can list the files, blocks, and locations with:
hdfs fsck / -files -blocks -locations
Here's an example of how the block files are stored with a 128 MB block size:
-rw-r--r--. 1 hdfs hadoop 134217728 Jan 12 09:17 blk_1073741825
-rw-r--r--. 1 hdfs hadoop 1048583 Jan 12 09:17 blk_1073741825_1001.meta
-rw-r--r--. 1 hdfs hadoop 134217728 Jan 12 09:18 blk_1073741826
-rw-r--r--. 1 hdfs hadoop 1048583 Jan 12 09:18 blk_1073741826_1002.meta
-rw-r--r--. 1 hdfs hadoop 134217728 Jan 12 09:18 blk_1073741827
-rw-r--r--. 1 hdfs hadoop 1048583 Jan 12 09:18 blk_1073741827_1003.meta
-rw-r--r--. 1 hdfs hadoop 134217728 Jan 12 09:18 blk_1073741828
-rw-r--r--. 1 hdfs hadoop 1048583 Jan 12 09:18 blk_1073741828_1004.meta
-rw-r--r--. 1 hdfs hadoop 134217728 Jan 12 09:19 blk_1073741829
-rw-r--r--. 1 hdfs hadoop 1048583 Jan 12 09:19 blk_1073741829_1005.meta
-rw-r--r--. 1 hdfs hadoop 134217728 Jan 12 09:19 blk_1073741830
-rw-r--r--. 1 hdfs hadoop 1048583 Jan 12 09:19 blk_1073741830_1006.meta
-rw-r--r--. 1 hdfs hadoop 87776064 Jan 12 09:19 blk_1073741831
-rw-r--r--. 1 hdfs hadoop 685759 Jan 12 09:19 blk_1073741831_1007.meta
