How is data in an HDFS block stored? - linux

I was reading about HDFS and was wondering whether there is any specific format in which data in a block is arranged.
Suppose a 265 MB file is copied to a Hadoop cluster and the HDFS block size is 64 MB. The file is then broken into 5 parts (64 MB + 64 MB + 64 MB + 64 MB + 9 MB) and distributed among the data nodes. Correct?
Is there any format within the 64 MB block in which the data is stored?
If there is a format or structure in which the data is stored within the block, then the stored data should be less than 64 MB, since the data structure, header, etc. would itself take some space.
Since an HDFS data node is a logical filesystem (it runs on top of Linux and there is no separate partition for HDFS), all the blocks should be stored as files in the Linux partition. Correct?
How can I find the name of the file on Linux that actually stores the 64 MB HDFS block?
If anyone can answer these questions, that would be great. Thanks in advance.
(*Vipul)() ;

No, the data is just split on 64 MB boundaries. Metadata is stored in a small separate file and on the NameNode.
No, each block is exactly the size you specified, and the data is split on exact 64 MB boundaries. If you have 5 parts (64 MB + 64 MB + 64 MB + 64 MB + 9 MB), then the last file would be 9 MB and all the others 64 MB.
Yes, the blocks are stored as files: each block is represented as a separate file, with a small amount of metadata stored in a separate file.
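To see which blocks make up each file and where they are located, run the command below. The block IDs it reports (blk_...) are the names of the block files on the DataNode's local disk:
hdfs fsck / -files -blocks -locations
Here's an example of how the block files are stored with a 128 MB block size: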
-rw-r--r--. 1 hdfs hadoop 134217728 Jan 12 09:17 blk_1073741825
-rw-r--r--. 1 hdfs hadoop 1048583 Jan 12 09:17 blk_1073741825_1001.meta
-rw-r--r--. 1 hdfs hadoop 134217728 Jan 12 09:18 blk_1073741826
-rw-r--r--. 1 hdfs hadoop 1048583 Jan 12 09:18 blk_1073741826_1002.meta
-rw-r--r--. 1 hdfs hadoop 134217728 Jan 12 09:18 blk_1073741827
-rw-r--r--. 1 hdfs hadoop 1048583 Jan 12 09:18 blk_1073741827_1003.meta
-rw-r--r--. 1 hdfs hadoop 134217728 Jan 12 09:18 blk_1073741828
-rw-r--r--. 1 hdfs hadoop 1048583 Jan 12 09:18 blk_1073741828_1004.meta
-rw-r--r--. 1 hdfs hadoop 134217728 Jan 12 09:19 blk_1073741829
-rw-r--r--. 1 hdfs hadoop 1048583 Jan 12 09:19 blk_1073741829_1005.meta
-rw-r--r--. 1 hdfs hadoop 134217728 Jan 12 09:19 blk_1073741830
-rw-r--r--. 1 hdfs hadoop 1048583 Jan 12 09:19 blk_1073741830_1006.meta
-rw-r--r--. 1 hdfs hadoop 87776064 Jan 12 09:19 blk_1073741831
-rw-r--r--. 1 hdfs hadoop 685759 Jan 12 09:19 blk_1073741831_1007.meta
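
The sizes in the listing above can be reproduced with a little arithmetic. The following is a minimal sketch, not HDFS code: the function name `hdfs_block_layout` is made up for illustration, and it assumes one 4-byte CRC per 512-byte chunk plus a 7-byte header in each .meta file. Those values are assumptions that happen to match the listing; a cluster configured differently would produce other .meta sizes.

```python
def hdfs_block_layout(file_size, block_size, bytes_per_checksum=512,
                      checksum_size=4, meta_header_size=7):
    """Return a list of (block_bytes, meta_bytes) pairs for one file.

    Assumed, not taken from the answer: one 4-byte checksum per 512-byte
    chunk and a 7-byte header in every .meta file.
    """
    layout = []
    remaining = file_size
    while remaining > 0:
        # Every block is full-sized except possibly the last one.
        block = min(block_size, remaining)
        # Integer ceiling division: a partial trailing chunk still gets a checksum.
        chunks = (block + bytes_per_checksum - 1) // bytes_per_checksum
        layout.append((block, meta_header_size + chunks * checksum_size))
        remaining -= block
    return layout


MB = 1024 * 1024

# The 265 MB file from the question, split with a 64 MB block size.
print([block // MB for block, _ in hdfs_block_layout(265 * MB, 64 * MB)])

# The file from the listing: six full 128 MB blocks plus one 87776064-byte block.
for block, meta in hdfs_block_layout(6 * 134217728 + 87776064, 128 * MB):
    print(block, meta)
```

Under these assumptions the data file of a block holds only the raw bytes, so a full block is exactly the block size, and the checksums live in the separate .meta file.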


Duplicated Cassandra SSTable files

After an unsuccessful nodetool repair operation I got two big SSTable files (the last two in the listing below) instead of one, each about the same size as the single file before. These files cannot be merged back by the usual tools (nodetool cleanup, nodetool compact, nodetool repair). The tables are replicated to another Cassandra node (replication_factor: 2), and that node now has two big SSTable files as well.
-rw-r--r-- 1 cassandra cassandra 16M Mar 5 12:36 mc-116413-big-Data.db
-rw-r--r-- 1 cassandra cassandra 34M Mar 5 01:21 mc-116320-big-Index.db
-rw-r--r-- 1 cassandra cassandra 39M Mar 3 22:46 mc-116125-big-Index.db
-rw-r--r-- 1 cassandra cassandra 66M Mar 5 12:25 mc-116412-big-Data.db
-rw-r--r-- 1 cassandra cassandra 262M Mar 5 05:51 mc-116365-big-Data.db
-rw-r--r-- 1 cassandra cassandra 263M Mar 5 08:46 mc-116386-big-Data.db
-rw-r--r-- 1 cassandra cassandra 263M Mar 5 11:42 mc-116407-big-Data.db
-rw-r--r-- 1 cassandra cassandra 7.2G Mar 5 03:18 mc-116345-big-Data.db
-rw-r--r-- 1 cassandra cassandra 43G Mar 3 22:46 mc-116125-big-Data.db
-rw-r--r-- 1 cassandra cassandra 48G Mar 5 01:21 mc-116320-big-Data.db
I suppose that one of these files contains duplicated data. How can I compact the files back into a single file?
Maybe I'm not looking properly but I don't see any duplicate SSTable files in the file listing you posted.
If you're referring to these 2:
-rw-r--r-- 1 cassandra cassandra 43G Mar 3 22:46 mc-116125-big-Data.db
-rw-r--r-- 1 cassandra cassandra 48G Mar 5 01:21 mc-116320-big-Data.db
They're not duplicates because they have 2 different generation IDs -- 116125 and 116320. This means they also have different ancestors.
If you're referring to these:
-rw-r--r-- 1 cassandra cassandra 39M Mar 3 22:46 mc-116125-big-Index.db
-rw-r--r-- 1 cassandra cassandra 43G Mar 3 22:46 mc-116125-big-Data.db
-rw-r--r-- 1 cassandra cassandra 34M Mar 5 01:21 mc-116320-big-Index.db
-rw-r--r-- 1 cassandra cassandra 48G Mar 5 01:21 mc-116320-big-Data.db
Again, they're not duplicates of each other. The *-Data.db files contain the actual data. The *-Index.db files are component files which contain the partition index, i.e. the index of the partitions within the data files which are used for fast retrieval.
If you're interested, I've explained it in a bit more detail in this post -- Cheers!
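
To make the naming described above concrete, here is a small sketch that splits SSTable file names into their parts and groups the component files by generation ID. The helper names are hypothetical, and the layout `<version>-<generation>-<format>-<component>` is assumed from the names in the listing (for example `mc-116125-big-Data.db`), so other Cassandra versions may name files differently.

```python
from collections import defaultdict


def parse_sstable_name(filename):
    """Split a name such as 'mc-116125-big-Data.db' into its four parts."""
    version, generation, fmt, component = filename.split("-", 3)
    return {
        "version": version,             # e.g. 'mc'
        "generation": int(generation),  # e.g. 116125
        "format": fmt,                  # e.g. 'big'
        "component": component,         # e.g. 'Data.db' or 'Index.db'
    }


def group_by_generation(filenames):
    """Map each generation ID to the list of its component files."""
    groups = defaultdict(list)
    for name in filenames:
        parts = parse_sstable_name(name)
        groups[parts["generation"]].append(parts["component"])
    return dict(groups)


files = [
    "mc-116125-big-Index.db",
    "mc-116125-big-Data.db",
    "mc-116320-big-Index.db",
    "mc-116320-big-Data.db",
]
print(group_by_generation(files))
```

Files that share a generation ID are components of the same SSTable, while files with different generation IDs belong to different SSTables.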
[UPDATE] To respond to this follow-up question:
Could you suggest why these two files aren't compacted into a single file, as usually happens?
Assuming the table is configured with SizeTieredCompactionStrategy, compaction requires similar-sized sstables as candidates before they get compacted together.
The default minimum number of candidate sstables is min_threshold: 4, so you need 4 similarly-sized sstables for a compaction to be triggered.
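
The following is a simplified sketch of the size-tiered idea, not Cassandra's actual implementation. It assumes the default `bucket_low` of 0.5 and `bucket_high` of 1.5, ignores options such as `min_sstable_size`, and uses made-up function names. An sstable joins a bucket when its size lies between `bucket_low` and `bucket_high` times the bucket's average size, and only buckets with at least `min_threshold` members are compacted.

```python
def size_tiered_buckets(sizes, bucket_low=0.5, bucket_high=1.5):
    """Group sstable sizes into buckets of similar size (simplified)."""
    buckets = []
    for size in sorted(sizes):
        for bucket in buckets:
            average = sum(bucket) / len(bucket)
            if bucket_low * average < size < bucket_high * average:
                bucket.append(size)
                break
        else:
            # No existing bucket is close enough in size: start a new one.
            buckets.append([size])
    return buckets


def compaction_candidates(sizes, min_threshold=4, **bucket_options):
    """Return only the buckets large enough to trigger a compaction."""
    buckets = size_tiered_buckets(sizes, **bucket_options)
    return [bucket for bucket in buckets if len(bucket) >= min_threshold]


# Data.db sizes from the listing in the question, in MB.
data_files_mb = [16, 66, 262, 263, 263, 7.2 * 1024, 43 * 1024, 48 * 1024]
print(size_tiered_buckets(data_files_mb))
print(compaction_candidates(data_files_mb))
```

With these sizes the 43G and 48G files land in the same bucket, but that bucket has only two members, so under the assumed defaults it would not be picked for compaction.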

Cassandra CommitLog Directory Forgetting To Remove Files

Version: DSE 6.7.5, CQL spec 3.4.5.
commitlog_total_space_in_mb is set to 8 GB.
Folder is currently at 13GB.
Looking at the date stamps in the folder it appears that it forgets about commitlogs or it may be failing to delete the commitlogs when it flushes memtables.
Happens on multiple nodes.
-rw-r--r--. 1 cassandra cassandra 33554338 Sep 20 02:00 CommitLog-600-1568892978830.log
-rw-r--r--. 1 cassandra cassandra 33554227 Sep 20 02:02 CommitLog-600-1568892978853.log
-rw-r--r--. 1 cassandra cassandra 33554217 Sep 20 02:02 CommitLog-600-1568892978862.log
-rw-r--r--. 1 cassandra cassandra 33554337 Sep 20 02:03 CommitLog-600-1568892978863.log
-rw-r--r--. 1 cassandra cassandra 33554169 Sep 20 02:04 CommitLog-600-1568892978864.log
-rw-r--r--. 1 cassandra cassandra 33554412 Sep 20 08:19 CommitLog-600-1568892954896.log
-rw-r--r--. 1 cassandra cassandra 33554326 Sep 20 08:19 CommitLog-600-1568892954901.log
-rw-r--r--. 1 cassandra cassandra 33554133 Sep 20 08:20 CommitLog-600-1568892954904.log
-rw-r--r--. 1 cassandra cassandra 33554281 Sep 20 08:20 CommitLog-600-1568892954905.log
-rw-r--r--. 1 cassandra cassandra 33553885 Sep 20 08:20 CommitLog-600-1568892954906.log
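
A quick way to compare the folder against the configured limit is to add up the segment files. This is a minimal sketch with a hypothetical helper name. It only assumes that segments match the `CommitLog-*.log` pattern seen in the listing above, and it is demonstrated on a throwaway directory rather than a real node.

```python
import glob
import os
import tempfile


def commitlog_usage(directory, limit_mb):
    """Return (segment_count, total_bytes, over_limit) for CommitLog-*.log files."""
    paths = glob.glob(os.path.join(directory, "CommitLog-*.log"))
    total_bytes = sum(os.path.getsize(path) for path in paths)
    limit_bytes = limit_mb * 1024 * 1024
    return len(paths), total_bytes, total_bytes > limit_bytes


# Demo: three fake 1 MiB segments checked against a 2 MB limit.
with tempfile.TemporaryDirectory() as demo_dir:
    for segment_id in (1568892978830, 1568892978853, 1568892978862):
        path = os.path.join(demo_dir, f"CommitLog-600-{segment_id}.log")
        with open(path, "wb") as handle:
            handle.write(b"\0" * 1024 * 1024)
    print(commitlog_usage(demo_dir, limit_mb=2))
```

On a real node the directory would be the commitlog location configured for that node, and `limit_mb` would be 8192 for the 8 GB setting described above.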
When I perform a nodetool flush/drain, it does not remove any of the old files.
-rw-r--r--. 1 cassandra cassandra 33554338 Sep 20 02:00 CommitLog-600-1568892978830.log
-rw-r--r--. 1 cassandra cassandra 33554227 Sep 20 02:02 CommitLog-600-1568892978853.log
-rw-r--r--. 1 cassandra cassandra 33554217 Sep 20 02:02 CommitLog-600-1568892978862.log
-rw-r--r--. 1 cassandra cassandra 33554337 Sep 20 02:03 CommitLog-600-1568892978863.log
-rw-r--r--. 1 cassandra cassandra 33554169 Sep 20 02:04 CommitLog-600-1568892978864.log
-rw-r--r--. 1 cassandra cassandra 28 Sep 20 08:46 CommitLog-600-1568892981041.log
When I start the node back up, it goes through them and crashes around the final commitlog: Exception in thread Thread[PerDiskMemtableFlushWriter_0:11,5,main] java.lang.AssertionError: null
It won't start back up unless I move some of the last commitlogs, or all of them, out of the directory.
What can I do to fix this problem?
I have resolved this issue for the time being by changing the compaction strategy to
compaction = {'class': 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'}
For some reason, having a column of type map together with the following compaction settings was causing the errors.
{'class': 'org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy', 'compaction_window_size': '30', 'compaction_window_unit': 'DAYS', 'max_threshold': '32', 'min_threshold': '4', 'split_during_flush': 'true'}

Spark - Get part-file suffix

When Spark uses the Hadoop writer to write part files (using saveAsTextFile()), it names them in the general format "part-NNNNN". How can I retrieve the suffix "NNNNN" in Spark at runtime?
P.S. I do not want to list the files and then retrieve the suffix.
The files are named part-00000, part-00001, and so on. Each of the RDD partitions is written to one part- file. So, the number of output files will depend upon the partitions in the RDD being written out.
You could check the RDD being written for the number of partitions (say 5), and then access the files part-00000 to part-00004.
Build a DataFrame by querying a Hive table
scala> val df1=sqlContext.sql("select * from default.hive_table");
Get number of RDD partitions
scala> df1.rdd.partitions.size
res4: Int = 11
Save DataFrame to HDFS
scala> df1.rdd.saveAsTextFile("/process_output")
Check HDFS output location
hadoop fs -ls /process_output
Found 12 items
-rw-r--r-- 3 root hdfs 0 2018-05-01 08:51 /process_output/_SUCCESS
-rw-r--r-- 3 root hdfs 190 2018-05-01 08:51 /process_output/part-00000
-rw-r--r-- 3 root hdfs 190 2018-05-01 08:51 /process_output/part-00001
-rw-r--r-- 3 root hdfs 182 2018-05-01 08:51 /process_output/part-00002
-rw-r--r-- 3 root hdfs 190 2018-05-01 08:51 /process_output/part-00003
-rw-r--r-- 3 root hdfs 180 2018-05-01 08:51 /process_output/part-00004
-rw-r--r-- 3 root hdfs 190 2018-05-01 08:51 /process_output/part-00005
-rw-r--r-- 3 root hdfs 190 2018-05-01 08:51 /process_output/part-00006
-rw-r--r-- 3 root hdfs 190 2018-05-01 08:51 /process_output/part-00007
-rw-r--r-- 3 root hdfs 190 2018-05-01 08:51 /process_output/part-00008
-rw-r--r-- 3 root hdfs 190 2018-05-01 08:51 /process_output/part-00009
-rw-r--r-- 3 root hdfs 190 2018-05-01 08:51 /process_output/part-00010
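
Judging by the listing above, the suffix is simply the zero-based partition index padded to five digits, so it can be derived from the index without listing the output directory. The sketch below is plain Python with hypothetical helper names. It assumes the default `part-NNNNN` naming and that the partition index is available inside the job, for example through `mapPartitionsWithIndex` or the task context.

```python
def part_file_name(partition_index):
    """Build the part-file name for a zero-based partition index."""
    return f"part-{partition_index:05d}"


def part_file_suffix(partition_index):
    """Return only the NNNNN suffix for a zero-based partition index."""
    return f"{partition_index:05d}"


# An RDD with 11 partitions, as in the example above, yields part-00000 to part-00010.
num_partitions = 11
print([part_file_name(index) for index in range(num_partitions)])
```

This only covers the naming convention; a writer that uses a different output format may name its files differently.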

Write dataframe to CSV in spark

I am writing a Spark dataframe to a CSV file using the code below
println("Total number of reports: " + reportDf.count())
And the output is:
Total number of reports: 48720
spark#monikatest:~/output/cluster.csv$ ll
total 12
drwxrwxr-x 2 spark spark 4096 Mar 27 20:56 ./
drwxrwxr-x 3 spark spark 4096 Mar 27 20:56 ../
-rw-r--r-- 1 spark spark 0 Mar 27 20:56 _SUCCESS
-rw-r--r-- 1 spark spark 8 Mar 27 20:56 ._SUCCESS.crc
No data is written to the file; only the _SUCCESS file is present.
Can anyone please suggest how to overcome this error?

Database backups not writing to disc, not enough space?

I just inherited an AIX project which I know very little about. I have a cron job that does a full backup of my database (DB2), and it has been failing for a few days now. Looking at the logs, I'm seeing this:
SQL2419N The target disk "/home/dbtmp/backups" has become full.
When checking out this directory:
(/var/spool/cron)> df -g /home/dbtmp
Filesystem GB blocks Free %Used Iused %Iused Mounted on
/dev/dbtmplv 10.00 0.96 91% 85 1% /home/dbtmp
The size of the previous backups:
(/var/spool/cron)> ll /home/dbtmp/backups
total 18365248
-rw------- 1 hsprd cics 4411498496 Feb 12 18:01 HSPRD.0.hsprd.NODE0000.CATN0000.20130212180036.001
-rw------- 1 hstrn cics 874287104 Feb 12 18:08 HSTRN.0.hstrn.NODE0000.CATN0000.20130212180747.001
-rw------- 1 hstst cics 3242835968 Feb 12 18:05 HSTST.0.hstst.NODE0000.CATN0000.20130212180443.001
What options do I have to fix this? Thank you.
As you can see, the size of your backup files exceeds the free space on the device: only 0.96 GB is free, while the previous HSPRD backup alone is 4411498496 bytes (more than 4 GB). You need a larger device.
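
The comparison can be written out as a small check. This is only a sketch with a hypothetical helper name. It assumes that `df -g` reports sizes in units of 2^30 bytes and that the next backup of each database will be about as large as the previous one.

```python
GIB = 1024 ** 3


def backups_that_fit(free_gb, backup_sizes):
    """Map each backup name to True if a backup of that size fits in the free space."""
    free_bytes = free_gb * GIB
    return {name: size <= free_bytes for name, size in backup_sizes.items()}


# Sizes in bytes of the previous backups, taken from the listing in the question.
previous_backups = {
    "HSPRD": 4411498496,
    "HSTRN": 874287104,
    "HSTST": 3242835968,
}
print(backups_that_fit(0.96, previous_backups))
print(sum(previous_backups.values()) / GIB)  # total size of the existing backups in GB
```

Under these assumptions, new backups the size of HSPRD or HSTST would not fit in the remaining 0.96 GB.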
