How to write records of a same key to multiple files (custom partitioner) - apache-spark

I want to write data from a directory to partitions dynamically using Spark.
Here is the sample code.
val input_DF = spark.read.parquet("input path")
input_DF.write.mode("overwrite").partitionBy("colname").parquet("output path...")
As shown below, No of records per each key are different and there is skew for a key.
input_DF.groupBy($"colname").agg(count("colname")).show()
+-----------------+------------------------+
|colname |count(colname) |
+-----------------+------------------------+
| NA| 14859816| --> More no of records
| A| 2907930|
| D| 1118504|
| B| 485151|
| C| 435305|
| F| 370095|
| G| 170060|
+-----------------+------------------------+
Because of this, job is failing when reasonable memory (8GB) is given for each executor. Job is completing successfully when high memory (15GB) per each executor is given, but taking too long to complete.
I have tried using repartition expecting it will distribute data evenly across partitions. But, as it uses default HashPartitioner, records of a key goes to a single partition.
repartition(num partition,$"colname") --> Creates HashPartition
But this is creating num part files as mentioned in repartiton, but moving all records of a key to a partition (all records with col value NA goes to a partition). Remaining part files have no records (only Parquet metadata,38364 bytes).
-rw-r--r-- 2 hadoop hadoop 0 2017-11-20 14:29 /user/hadoop/table/_SUCCESS
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00000-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00001-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00002-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00003-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:07 /user/hadoop/table/part-r-00004-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00005-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00006-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 1038264502 2017-11-20 13:20 /user/hadoop/table/part-r-00007-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00008-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00009-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00010-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00011-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00012-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00013-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00014-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 128212247 2017-11-20 13:09 /user/hadoop/table/part-r-00015-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00016-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00017-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00018-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 117142244 2017-11-20 13:08 /user/hadoop/table/part-r-00019-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00020-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 347033731 2017-11-20 13:11 /user/hadoop/table/part-r-00021-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00022-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00023-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00024-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 100306686 2017-11-20 13:08 /user/hadoop/table/part-r-00025-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 36961707 2017-11-20 13:07 /user/hadoop/table/part-r-00026-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00027-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00028-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00029-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00030-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00031-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00032-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00033-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00034-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00035-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00036-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:07 /user/hadoop/table/part-r-00037-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00038-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00039-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00040-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00041-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 68859 2017-11-20 13:06 /user/hadoop/table/part-r-00042-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 4031720288 2017-11-20 14:29 /user/hadoop/table/part-r-00043-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00044-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00045-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00046-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00047-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00048-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00049-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00050-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
I would like to know
Is there a way to write same key records to different partitions of DataFrame/RDD? Probably custom partitioner to write every Nth record to Nth partition
(1st rec to partition 1)
(2nd rec to partition 2)
(3rd rec to partition 3)
(4th rec to partition 4)
(5th rec to partition 1)
(6th rec to partition 2)
(7th rec to partition 3)
(8th rec to partition 4)
If yes, can it be controlled using parameters like max no of bytes per partition of DataFrame/RDD.
As the expected result is just writing data into different sub-directories (partitions for Hive) based on a key, I would like to write data by distributing records of a key to multiple tasks, each writing a part file under sub-directory.

The problem was resolved when reparation was done on unique key, instead of the key which was used in "partitionBy". If the dataFrame is missing a unique for some reason, one can add a sudo column using
df.withColumn("Unique_ID", monotonicallyIncreasingId)
and then reparation on "Unique_ID", this way we can distribute data evenly to multiple partitions. To increase performance further, data can be sorted with in DataFrame partition on a key used for join/group/partition

Related

Duplicated Cassandra SSTable files

After unsuccessful nodetool repair operation I got two big sstable files (last two in the listing below) instead of one, each having the same size as a single file before. And now this files cannot be merged back by common tools (nodetool clean, nodetool compact, nodetool repair). Tables are replicated to another cassandra node (replication_factor: 2), and there are two big sstable files as well now.
-rw-r--r-- 1 cassandra cassandra 16M Mar 5 12:36 mc-116413-big-Data.db
-rw-r--r-- 1 cassandra cassandra 34M Mar 5 01:21 mc-116320-big-Index.db
-rw-r--r-- 1 cassandra cassandra 39M Mar 3 22:46 mc-116125-big-Index.db
-rw-r--r-- 1 cassandra cassandra 66M Mar 5 12:25 mc-116412-big-Data.db
-rw-r--r-- 1 cassandra cassandra 262M Mar 5 05:51 mc-116365-big-Data.db
-rw-r--r-- 1 cassandra cassandra 263M Mar 5 08:46 mc-116386-big-Data.db
-rw-r--r-- 1 cassandra cassandra 263M Mar 5 11:42 mc-116407-big-Data.db
-rw-r--r-- 1 cassandra cassandra 7.2G Mar 5 03:18 mc-116345-big-Data.db
-rw-r--r-- 1 cassandra cassandra 43G Mar 3 22:46 mc-116125-big-Data.db
-rw-r--r-- 1 cassandra cassandra 48G Mar 5 01:21 mc-116320-big-Data.db```
I suppose that one of this files contains duplicated data. How can I compact files back to a single file?
Maybe I'm not looking properly but I don't see any duplicate SSTable files in the file listing you posted.
If you're referring to these 2:
-rw-r--r-- 1 cassandra cassandra 43G Mar 3 22:46 mc-116125-big-Data.db
-rw-r--r-- 1 cassandra cassandra 48G Mar 5 01:21 mc-116320-big-Data.db
They're not duplicates because they have 2 different generation IDs -- 116125 and 116320. This means they also have different ancestors.
If you're referring to these:
-rw-r--r-- 1 cassandra cassandra 39M Mar 3 22:46 mc-116125-big-Index.db
-rw-r--r-- 1 cassandra cassandra 43G Mar 3 22:46 mc-116125-big-Data.db
-rw-r--r-- 1 cassandra cassandra 34M Mar 5 01:21 mc-116320-big-Index.db
-rw-r--r-- 1 cassandra cassandra 48G Mar 5 01:21 mc-116320-big-Data.db
Again, they're not duplicates of each other. The *-Data.db files contain the actual data. The *-Index.db files are component files which contain the partition index, i.e. the index of the partitions within the data files which are used for fast retrieval.
If you're interested, I've explained it in a bit more detail in this post -- https://community.datastax.com/questions/5219/. Cheers!
[UPDATE] To respond to this follow-up question:
Could you suppose why this two files don`t compacted in a single file,
as usual do?
Assuming the table is configured with SizeTieredCompactionStrategy, it will require similar-sized sstables as candidates before they get compacted together.
The default minimum sstable candidates is min_threshold: 4 so you need 4 similarly-sized sstables for a compaction to be triggered.

Cassandra CommitLog Directory Forgetting To Remove Files

Version: DSE 6.7.5, CQL spec 3.4.5.
I have 8GB commitlog_total_space_in_mb.
Folder is currently at 13GB.
Looking at the date stamps in the folder it appears that it forgets about commitlogs or it may be failing to delete the commitlogs when it flushes memtables.
Happens on multiple nodes.
-rw-r--r--. 1 cassandra cassandra 33554338 Sep 20 02:00 CommitLog-600-1568892978830.log
-rw-r--r--. 1 cassandra cassandra 33554227 Sep 20 02:02 CommitLog-600-1568892978853.log
-rw-r--r--. 1 cassandra cassandra 33554217 Sep 20 02:02 CommitLog-600-1568892978862.log
-rw-r--r--. 1 cassandra cassandra 33554337 Sep 20 02:03 CommitLog-600-1568892978863.log
-rw-r--r--. 1 cassandra cassandra 33554169 Sep 20 02:04 CommitLog-600-1568892978864.log
-rw-r--r--. 1 cassandra cassandra 33554412 Sep 20 08:19 CommitLog-600-1568892954896.log
-rw-r--r--. 1 cassandra cassandra 33554326 Sep 20 08:19 CommitLog-600-1568892954901.log
-rw-r--r--. 1 cassandra cassandra 33554133 Sep 20 08:20 CommitLog-600-1568892954904.log
-rw-r--r--. 1 cassandra cassandra 33554281 Sep 20 08:20 CommitLog-600-1568892954905.log
-rw-r--r--. 1 cassandra cassandra 33553885 Sep 20 08:20 CommitLog-600-1568892954906.log
When i perform a nodetool flush/drain it will not remove any of the old files.
-rw-r--r--. 1 cassandra cassandra 33554338 Sep 20 02:00 CommitLog-600-1568892978830.log
-rw-r--r--. 1 cassandra cassandra 33554227 Sep 20 02:02 CommitLog-600-1568892978853.log
-rw-r--r--. 1 cassandra cassandra 33554217 Sep 20 02:02 CommitLog-600-1568892978862.log
-rw-r--r--. 1 cassandra cassandra 33554337 Sep 20 02:03 CommitLog-600-1568892978863.log
-rw-r--r--. 1 cassandra cassandra 33554169 Sep 20 02:04 CommitLog-600-1568892978864.log
-rw-r--r--. 1 cassandra cassandra 28 Sep 20 08:46 CommitLog-600-1568892981041.log
When I start the node back up it goes through them and crashes around the final commitlog. https://pastebin.com/Kw9Kee5C
CassandraDaemon.java:129 - Exception in thread Thread[PerDiskMemtableFlushWriter_0:11,5,main] java.lang.AssertionError: null
It wont start back up unless I move some of the last commitlogs out or all of them out.
What can I do to fix this problem
I have resolved this for my issue for the time being by changing compaction to
compaction = {'class': 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'}
For some reason having cell type of map with the following compaction was causing me the errors.
{'class': 'org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy', 'compaction_window_size': '30', 'compaction_window_unit': 'DAYS', 'max_threshold': '32', 'min_threshold': '4', 'split_during_flush': 'true'}

Spark - Get part-file suffix

When Spark uses Hadoop writer to write part-file (using saveAsTextFile()), this is the general format "part-NNNNN" it saves the file in. How can I retrieve this suffix "NNNNN" in Spark at runtime?
Ps. I do not want to list the files and then retrieve the suffix.
The files are named part-00000, part-00001, and so on. Each of the RDD partitions is written to one part- file. So, the number of output files will depend upon the partitions in the RDD being written out.
You could check the RDD being written for the number of partitions (say 5), and then access the files part-00000 to part-00004.
Illustration
Build a DataFrame by querying a Hive table
scala> val df1=sqlContext.sql("select * from default.hive_table");
Get number of RDD partitions
scala> df1.rdd.partitions.size
res4: Int = 11
Save DataFrame to HDFS
scala> df1.rdd.saveAsTextFile("/process_output")
Check HDFS output location
hadoop fs -ls /process_output
Found 12 items
-rw-r--r-- 3 root hdfs 0 2018-05-01 08:51 /process_output/_SUCCESS
-rw-r--r-- 3 root hdfs 190 2018-05-01 08:51 /process_output/part-00000
-rw-r--r-- 3 root hdfs 190 2018-05-01 08:51 /process_output/part-00001
-rw-r--r-- 3 root hdfs 182 2018-05-01 08:51 /process_output/part-00002
-rw-r--r-- 3 root hdfs 190 2018-05-01 08:51 /process_output/part-00003
-rw-r--r-- 3 root hdfs 180 2018-05-01 08:51 /process_output/part-00004
-rw-r--r-- 3 root hdfs 190 2018-05-01 08:51 /process_output/part-00005
-rw-r--r-- 3 root hdfs 190 2018-05-01 08:51 /process_output/part-00006
-rw-r--r-- 3 root hdfs 190 2018-05-01 08:51 /process_output/part-00007
-rw-r--r-- 3 root hdfs 190 2018-05-01 08:51 /process_output/part-00008
-rw-r--r-- 3 root hdfs 190 2018-05-01 08:51 /process_output/part-00009
-rw-r--r-- 3 root hdfs 190 2018-05-01 08:51 /process_output/part-00010

Write dataframe to CSV in spark

I am writing a spark dataframe to CSV file using below code
println("Total number of reports: " + reportDf.count())
reportDf
.coalesce(1)
.write.format("com.databricks.spark.csv")
.csv("output/cluster.csv")
And o/p is:
Total number of reports: 48720
spark#monikatest:~/output/cluster.csv$ ll
total 12
drwxrwxr-x 2 spark spark 4096 Mar 27 20:56 ./
drwxrwxr-x 3 spark spark 4096 Mar 27 20:56 ../
-rw-r--r-- 1 spark spark 0 Mar 27 20:56 _SUCCESS
-rw-r--r-- 1 spark spark 8 Mar 27 20:56 ._SUCCESS.crc
No data written to file, only success file present.
Can anyone please suggest how to overcome this error.

How data in an HDFS block is stored?

I was reading about HDFS and was wondering, if there is any specific format in which data in a block is arranged.
Suppose there is a file of 265 MB that is copied to a Hadoop cluster and the HDFS block size is 64 MB. So the file is broken into 5 parts- 64 MB + 64 MB + 64 MB + 64 MB + 9 MB, and distributed among data nodes. Correct ?
I have a doubt that is there any format within the 64 MB block in which data is stored ?
If there is any format/structure in which the data is stored within the block, then the stored data should be less than 64 MB, since the data structure/header etc, itself may take some space.
Since HDFS data node is a logical filesystem (It runs on top of linux and there is no separate partition for HDFS), all the blocks should be stored as files in the linux partition. Correct ?
How to know the name of the file on linux that actually stores the 64 MB HDFS block ?
Anyone, if can answer these doubts/questions, that would be great. Thanks in advance.
Regards,
(*Vipul)() ;
No, the data is just split on 64MB boundary. Metadata is stored in a small separate file and on the Namenode
No, it is exactly the size you specified, and the data is split on exact boundaries of 64MB. If you have 5 parts - 64 MB + 64 MB + 64 MB + 64 MB + 9 MB, then the last file would be 9MB, all the others are 64MB
Yes, the blocks are stored as a files, each block is represented as a separate file with some small amount of metadata stored in a separate file
hdfs fsck / -files -blocks -locations
Here's an example of how the block files are stored with 128MB block size:
-rw-r--r--. 1 hdfs hadoop 134217728 Jan 12 09:17 blk_1073741825
-rw-r--r--. 1 hdfs hadoop 1048583 Jan 12 09:17 blk_1073741825_1001.meta
-rw-r--r--. 1 hdfs hadoop 134217728 Jan 12 09:18 blk_1073741826
-rw-r--r--. 1 hdfs hadoop 1048583 Jan 12 09:18 blk_1073741826_1002.meta
-rw-r--r--. 1 hdfs hadoop 134217728 Jan 12 09:18 blk_1073741827
-rw-r--r--. 1 hdfs hadoop 1048583 Jan 12 09:18 blk_1073741827_1003.meta
-rw-r--r--. 1 hdfs hadoop 134217728 Jan 12 09:18 blk_1073741828
-rw-r--r--. 1 hdfs hadoop 1048583 Jan 12 09:18 blk_1073741828_1004.meta
-rw-r--r--. 1 hdfs hadoop 134217728 Jan 12 09:19 blk_1073741829
-rw-r--r--. 1 hdfs hadoop 1048583 Jan 12 09:19 blk_1073741829_1005.meta
-rw-r--r--. 1 hdfs hadoop 134217728 Jan 12 09:19 blk_1073741830
-rw-r--r--. 1 hdfs hadoop 1048583 Jan 12 09:19 blk_1073741830_1006.meta
-rw-r--r--. 1 hdfs hadoop 87776064 Jan 12 09:19 blk_1073741831
-rw-r--r--. 1 hdfs hadoop 685759 Jan 12 09:19 blk_1073741831_1007.meta

Resources