Spark - Get part-file suffix - apache-spark

When Spark uses the Hadoop writer to write part-files (via saveAsTextFile()), it saves them in the general format "part-NNNNN". How can I retrieve this suffix "NNNNN" in Spark at runtime?
P.S. I do not want to list the files and then retrieve the suffix.

The files are named part-00000, part-00001, and so on. Each RDD partition is written to one part-file, so the number of output files depends on the number of partitions in the RDD being written out.
You could check the RDD being written for its number of partitions (say 5), and then access the files part-00000 to part-00004.
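If you only need the suffix values at runtime, here is a minimal sketch of deriving them from the partition count, assuming the default Hadoop "part-NNNNN" naming; the input and output paths are hypothetical:
// Sketch: derive the expected part-file suffixes from the RDD's partition count.
// Assumes the default Hadoop "part-NNNNN" naming used by saveAsTextFile().
val rdd = sc.textFile("/some/input/path")       // hypothetical input
val outputPath = "/process_output"              // hypothetical output location
rdd.saveAsTextFile(outputPath)
// One part file per partition, suffixed 00000, 00001, ...
val suffixes = (0 until rdd.partitions.length).map(i => f"$i%05d")
val partFiles = suffixes.map(s => s"$outputPath/part-$s")
partFiles.foreach(println)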
Illustration
Build a DataFrame by querying a Hive table
scala> val df1 = sqlContext.sql("select * from default.hive_table")
Get number of RDD partitions
scala> df1.rdd.partitions.size
res4: Int = 11
Save DataFrame to HDFS
scala> df1.rdd.saveAsTextFile("/process_output")
Check HDFS output location
hadoop fs -ls /process_output
Found 12 items
-rw-r--r-- 3 root hdfs 0 2018-05-01 08:51 /process_output/_SUCCESS
-rw-r--r-- 3 root hdfs 190 2018-05-01 08:51 /process_output/part-00000
-rw-r--r-- 3 root hdfs 190 2018-05-01 08:51 /process_output/part-00001
-rw-r--r-- 3 root hdfs 182 2018-05-01 08:51 /process_output/part-00002
-rw-r--r-- 3 root hdfs 190 2018-05-01 08:51 /process_output/part-00003
-rw-r--r-- 3 root hdfs 180 2018-05-01 08:51 /process_output/part-00004
-rw-r--r-- 3 root hdfs 190 2018-05-01 08:51 /process_output/part-00005
-rw-r--r-- 3 root hdfs 190 2018-05-01 08:51 /process_output/part-00006
-rw-r--r-- 3 root hdfs 190 2018-05-01 08:51 /process_output/part-00007
-rw-r--r-- 3 root hdfs 190 2018-05-01 08:51 /process_output/part-00008
-rw-r--r-- 3 root hdfs 190 2018-05-01 08:51 /process_output/part-00009
-rw-r--r-- 3 root hdfs 190 2018-05-01 08:51 /process_output/part-00010

Related

Kafka logs grow too large

I can see that Kafka logs are growing rapidly and flooding the filesystem.
How can I change settings for Kafka so that it writes fewer logs and rotates them frequently?
The location of the files is /opt/kafka/kafka_2.12-2.2.2/logs and their sizes are:
5.9G server.log.2020-11-24-14
5.9G server.log.2020-11-24-15
5.9G server.log.2020-11-24-16
5.7G server.log.2020-11-24-17
Sample logs from the above files:
[2020-11-24 14:59:59,999] WARN Exception when following the leader (org.apache.zookeeper.server.quorum.Learner)
java.io.IOException: No space left on device
at java.io.FileOutputStream.writeBytes(Native Method)
at java.io.FileOutputStream.write(FileOutputStream.java:326)
at org.apache.zookeeper.common.AtomicFileOutputStream.write(AtomicFileOutputStream.java:74)
at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)
at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:291)
at sun.nio.cs.StreamEncoder.implFlush(StreamEncoder.java:295)
at sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:141)
at java.io.OutputStreamWriter.flush(OutputStreamWriter.java:229)
at java.io.BufferedWriter.flush(BufferedWriter.java:254)
at org.apache.zookeeper.server.quorum.QuorumPeer.writeLongToFile(QuorumPeer.java:1391)
at org.apache.zookeeper.server.quorum.QuorumPeer.setCurrentEpoch(QuorumPeer.java:1426)
at org.apache.zookeeper.server.quorum.Learner.syncWithLeader(Learner.java:454)
at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:83)
at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:981)
[2020-11-24 14:59:59,999] INFO shutdown called (org.apache.zookeeper.server.quorum.Learner)
java.lang.Exception: shutdown Follower
at org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:169)
at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:985)
[2020-11-24 14:59:59,999] INFO Shutting down (org.apache.zookeeper.server.quorum.FollowerZooKeeperServer)
[2020-11-24 14:59:59,999] INFO LOOKING (org.apache.zookeeper.server.quorum.QuorumPeer)
[2020-11-24 14:59:59,999] INFO New election. My id = 1, proposed zxid=0x1000001d2 (org.apache.zookeeper.server.quorum.FastLeaderElection)
[2020-11-24 14:59:59,999] INFO Notification: 1 (message format version), 1 (n.leader), 0x1000001d2 (n.zxid), 0x2 (n.round), LOOKING (n.state), 1 (n.sid), 0x1 (n.peerEpoch) LOOKING (my state) (org.apache.zookeeper.server.quorum.FastLeaderElection)
It also writes to /opt/kafka/kafka_2.12-2.2.2/kafka.log:
[2020-12-05 16:51:10,109] INFO [GroupMetadataManager brokerId=1] Finished loading offsets and group metadata from __consumer_offsets-30 in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
[2020-12-05 16:51:10,109] INFO [GroupMetadataManager brokerId=1] Finished loading offsets and group metadata from __consumer_offsets-36 in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
[2020-12-05 16:51:10,109] INFO [GroupMetadataManager brokerId=1] Finished loading offsets and group metadata from __consumer_offsets-42 in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
[2020-12-05 16:51:10,110] INFO [GroupMetadataManager brokerId=1] Finished loading offsets and group metadata from __consumer_offsets-48 in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
[2020-12-05 17:01:09,528] INFO [GroupMetadataManager brokerId=1] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
[2020-12-05 17:11:09,528] INFO [GroupMetadataManager brokerId=1] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
Kafka is used for the Elastic Stack.
Below is the relevant entry from the server.properties file:
# A comma separated list of directories under which to store log files
log.dirs=/var/log/kafka
It contains log files such as:
/var/log/kafka
drwxr-xr-x 2 kafka users 4.0K Dec 5 16:51 heartbeat-1
drwxr-xr-x 2 kafka users 4.0K Dec 5 16:51 __consumer_offsets-12
drwxr-xr-x 2 kafka users 4.0K Dec 5 16:51 auditbeat-0
drwxr-xr-x 2 kafka users 4.0K Dec 5 16:51 apm-2
drwxr-xr-x 2 kafka users 4.0K Dec 5 16:51 __consumer_offsets-28
drwxr-xr-x 2 kafka users 4.0K Dec 5 16:51 filebeat-2
drwxr-xr-x 2 kafka users 4.0K Dec 5 16:51 __consumer_offsets-38
drwxr-xr-x 2 kafka users 4.0K Dec 5 16:51 __consumer_offsets-44
drwxr-xr-x 2 kafka users 4.0K Dec 5 16:51 __consumer_offsets-6
drwxr-xr-x 2 kafka users 4.0K Dec 5 16:51 __consumer_offsets-16
drwxr-xr-x 2 kafka users 4.0K Dec 5 16:51 metricbeat-0
drwxr-xr-x 2 kafka users 4.0K Dec 5 16:51 __consumer_offsets-22
drwxr-xr-x 2 kafka users 4.0K Dec 5 16:51 __consumer_offsets-32
-rw-r--r-- 1 kafka users 747 Dec 5 18:02 recovery-point-offset-checkpoint
-rw-r--r-- 1 kafka users 4 Dec 5 18:02 log-start-offset-checkpoint
-rw-r--r-- 1 kafka users 749 Dec 5 18:03 replication-offset-checkpoint
No DEBUG-level logging is enabled in any of the files under /opt/kafka/kafka_2.12-2.2.2/config.
How can I make sure it doesn't produce such huge files in /opt/kafka/kafka_2.12-2.2.2/logs, and how can I rotate them regularly with compression?
Thanks,
log.dirs is the actual broker storage, not the process logs, and therefore should not be in /var/log with other process logs.
Almost 6G a day is not unreasonable, but you can modify the log4j.properties file so that the rolling file appender only keeps around 1 or 2 days of logs.
Generally, as with any Linux administration task, you'd have separate disk volumes for /var/log, your OS storage, and any dedicated disks for server data, say a mount at /kafka.
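For example, here is a minimal sketch of such a change in config/log4j.properties, assuming the stock kafkaAppender name shipped with Kafka 2.2; note that log4j 1.x's RollingFileAppender caps the size and number of files but does not compress, so compression would need an external tool such as logrotate:
# Sketch: rotate server.log by size and keep a bounded number of files
log4j.appender.kafkaAppender=org.apache.log4j.RollingFileAppender
log4j.appender.kafkaAppender.File=${kafka.logs.dir}/server.log
log4j.appender.kafkaAppender.MaxFileSize=100MB
log4j.appender.kafkaAppender.MaxBackupIndex=5
log4j.appender.kafkaAppender.layout=org.apache.log4j.PatternLayout
log4j.appender.kafkaAppender.layout.ConversionPattern=[%d] %p %m (%c)%n
If INFO is still too chatty, raising the root logger level to WARN in the same file also reduces the volume considerably.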

Does Spark support multiple users?

I have a 3-node Spark 2.3.1 cluster running at the moment, and I'm also running a Zeppelin server as a normal user, ulab.
From Zeppelin, I ran the commands:
%spark
val file = sc.textFile("file:///mnt/glusterfs/test/testfile")
file.saveAsTextFile("/mnt/glusterfs/test/testfile2")
It reports a lot of error messages, something like:
WARN [2018-09-14 05:44:50,540] ({pool-2-thread-8} NotebookServer.java[afterStatusChange]:2302) - Job 20180907-130718_39068508 is finished, status: ERROR, exception: null, result: %text file: org.apache.spark.rdd.RDD[String] = file:///mnt/glusterfs/test/testfile MapPartitionsRDD[49] at textFile at <console>:51
org.apache.spark.SparkException: Job aborted.
...
... 64 elided
Caused by: java.io.IOException: Failed to rename DeprecatedRawLocalFileStatus{path=file:/mnt/glusterfs/test/testfile2/_temporary/0/task_20180914054253_0050_m_000018/part-00018; isDirectory=false; length=33554979; replication=1; blocksize=33554432; modification_time=1536903780000; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false} to file:/mnt/glusterfs/test/testfile2/part-00018
And I found that some temporary files are owned by user root, while others are owned by ulab, like the following:
bash-4.4# ls -l testfile2
total 32773
drwxr-xr-x 3 ulab ulab 4096 Sep 14 05:42 _temporary
-rw-r--r-- 1 ulab ulab 33554979 Sep 14 05:44 part-00018
bash-4.4# ls -l testfile2/_temporary/
total 4
drwxr-xr-x 210 ulab ulab 4096 Sep 14 05:44 0
bash-4.4# ls -l testfile2/_temporary/0
total 832
drwxr-xr-x 2 root root 4096 Sep 14 05:42 task_20180914054253_0050_m_000000
drwxr-xr-x 2 root root 4096 Sep 14 05:42 task_20180914054253_0050_m_000001
drwxr-xr-x 2 root root 4096 Sep 14 05:42 task_20180914054253_0050_m_000002
drwxr-xr-x 2 root root 4096 Sep 14 05:42 task_20180914054253_0050_m_000003
....
Is there any setup to have all these temporary files created by ulab, so that we can use multiple users in the Spark driver to isolate their privileges?
You can enable the 'User Impersonate' option for the Spark interpreter, which will start the Spark job as the logged-in user.
Refer to this link for more info.
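As a sketch of what that setup looks like (assuming a Zeppelin service account with passwordless sudo; exact UI labels may differ between Zeppelin versions), the impersonation command is configured in conf/zeppelin-env.sh and the option is then ticked in the Spark interpreter settings:
# conf/zeppelin-env.sh (sketch; requires passwordless sudo for the zeppelin user)
export ZEPPELIN_IMPERSONATE_CMD='sudo -H -u ${ZEPPELIN_IMPERSONATE_USER} bash -c '
Then, in the interpreter settings, set the Spark interpreter to be instantiated "Per User" in isolated mode and enable "User Impersonate", so each notebook user's job (and its _temporary task files) is created as that user instead of the user running the Zeppelin or Spark daemons.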

Write dataframe to CSV in spark

I am writing a Spark DataFrame to a CSV file using the code below:
println("Total number of reports: " + reportDf.count())
reportDf
.coalesce(1)
.write.format("com.databricks.spark.csv")
.csv("output/cluster.csv")
And the output is:
Total number of reports: 48720
spark#monikatest:~/output/cluster.csv$ ll
total 12
drwxrwxr-x 2 spark spark 4096 Mar 27 20:56 ./
drwxrwxr-x 3 spark spark 4096 Mar 27 20:56 ../
-rw-r--r-- 1 spark spark 0 Mar 27 20:56 _SUCCESS
-rw-r--r-- 1 spark spark 8 Mar 27 20:56 ._SUCCESS.crc
No data is written to the file; only the _SUCCESS file is present.
Can anyone please suggest how to overcome this error?
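For reference, here is a minimal sketch of the usual write-and-verify pattern, assuming Spark 2.x (where the built-in csv source replaces com.databricks.spark.csv) and an explicit URI so the driver and executors resolve the same filesystem; the output path is hypothetical:
// Sketch: write the DataFrame as CSV to an explicit location, then read it back to verify.
reportDf
  .coalesce(1)                                    // collapse to a single part file
  .write
  .option("header", "true")
  .mode("overwrite")
  .csv("hdfs:///user/spark/output/cluster_csv")   // hypothetical path; csv() writes a directory of part files

val written = spark.read.option("header", "true").csv("hdfs:///user/spark/output/cluster_csv")
println("Rows written: " + written.count())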

How to write records of a same key to multiple files (custom partitioner)

I want to write data from a directory to partitions dynamically using Spark.
Here is the sample code.
val input_DF = spark.read.parquet("input path")
input_DF.write.mode("overwrite").partitionBy("colname").parquet("output path...")
As shown below, the number of records per key differs, and one key is heavily skewed.
input_DF.groupBy($"colname").agg(count("colname")).show()
+-----------------+------------------------+
|colname |count(colname) |
+-----------------+------------------------+
| NA| 14859816| --> More no of records
| A| 2907930|
| D| 1118504|
| B| 485151|
| C| 435305|
| F| 370095|
| G| 170060|
+-----------------+------------------------+
Because of this, the job fails when reasonable memory (8 GB) is given to each executor. The job completes successfully when high memory (15 GB) per executor is given, but takes too long to complete.
I have tried using repartition, expecting it to distribute data evenly across partitions. But, as it uses the default HashPartitioner, all records of a key go to a single partition.
repartition(numPartitions, $"colname") --> creates a HashPartitioner on colname
But this creates as many part files as the repartition count, while moving all records of a key to a single partition (all records with column value NA go to one partition). The remaining part files have no records (only Parquet metadata, 38634 bytes).
-rw-r--r-- 2 hadoop hadoop 0 2017-11-20 14:29 /user/hadoop/table/_SUCCESS
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00000-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00001-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00002-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00003-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:07 /user/hadoop/table/part-r-00004-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00005-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00006-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 1038264502 2017-11-20 13:20 /user/hadoop/table/part-r-00007-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00008-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00009-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00010-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00011-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00012-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00013-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00014-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 128212247 2017-11-20 13:09 /user/hadoop/table/part-r-00015-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00016-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00017-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00018-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 117142244 2017-11-20 13:08 /user/hadoop/table/part-r-00019-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00020-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 347033731 2017-11-20 13:11 /user/hadoop/table/part-r-00021-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00022-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00023-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00024-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 100306686 2017-11-20 13:08 /user/hadoop/table/part-r-00025-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 36961707 2017-11-20 13:07 /user/hadoop/table/part-r-00026-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00027-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00028-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00029-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00030-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00031-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00032-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00033-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00034-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00035-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00036-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:07 /user/hadoop/table/part-r-00037-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00038-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00039-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00040-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00041-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 68859 2017-11-20 13:06 /user/hadoop/table/part-r-00042-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 4031720288 2017-11-20 14:29 /user/hadoop/table/part-r-00043-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00044-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00045-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00046-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00047-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00048-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00049-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
-rw-r--r-- 2 hadoop hadoop 38634 2017-11-20 13:06 /user/hadoop/table/part-r-00050-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
I would like to know:
Is there a way to write records of the same key to different partitions of a DataFrame/RDD? Perhaps a custom partitioner that writes every Nth record to the Nth partition:
(1st rec to partition 1)
(2nd rec to partition 2)
(3rd rec to partition 3)
(4th rec to partition 4)
(5th rec to partition 1)
(6th rec to partition 2)
(7th rec to partition 3)
(8th rec to partition 4)
If yes, can this be controlled using parameters like the maximum number of bytes per DataFrame/RDD partition?
As the expected result is just writing data into different sub-directories (partitions for Hive) based on a key, I would like to write the data by distributing records of a key across multiple tasks, each writing a part file under the sub-directory.
The problem was resolved by repartitioning on a unique key instead of the key used in partitionBy. If the DataFrame is missing a unique key for some reason, one can add a pseudo column using
df.withColumn("Unique_ID", monotonicallyIncreasingId)
and then repartition on "Unique_ID"; this way we can distribute the data evenly across multiple partitions. To increase performance further, the data can be sorted within each DataFrame partition on a key used for join/group/partition.
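A minimal sketch of that approach (the partition count of 200 is illustrative, and monotonically_increasing_id() is the current spelling of the function in newer Spark versions):
// Sketch: spread records of a skewed key across many tasks by repartitioning
// on a synthetic unique column, while partitionBy still groups the output by colname.
import org.apache.spark.sql.functions.monotonically_increasing_id
import spark.implicits._

val input_DF = spark.read.parquet("input path")

val balanced = input_DF
  .withColumn("Unique_ID", monotonically_increasing_id())
  .repartition(200, $"Unique_ID")                // 200 tasks is an illustrative choice

balanced
  .drop("Unique_ID")                             // the helper column is not needed in the output
  .write
  .mode("overwrite")
  .partitionBy("colname")
  .parquet("output path...")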

How data in an HDFS block is stored?

I was reading about HDFS and was wondering whether there is any specific format in which the data in a block is arranged.
Suppose a 265 MB file is copied to a Hadoop cluster and the HDFS block size is 64 MB. The file is then broken into 5 parts: 64 MB + 64 MB + 64 MB + 64 MB + 9 MB, and distributed among the data nodes. Correct?
Is there any format in which the data is stored within the 64 MB block?
If there is any format/structure in which the data is stored within the block, then the stored data should be less than 64 MB, since the data structure/header etc. itself may take some space.
Since the HDFS DataNode is a logical filesystem (it runs on top of Linux and there is no separate partition for HDFS), all the blocks should be stored as files in the Linux partition. Correct?
How can I find the name of the file on Linux that actually stores a 64 MB HDFS block?
If anyone can answer these doubts/questions, that would be great. Thanks in advance.
No, the data is just split on the 64 MB boundary. Metadata is stored in a small separate file and on the NameNode.
No, a block is exactly the size you specified, and the data is split on exact 64 MB boundaries. If you have 5 parts of 64 MB + 64 MB + 64 MB + 64 MB + 9 MB, then the last file is 9 MB and all the others are 64 MB.
Yes, the blocks are stored as files; each block is represented as a separate file, with a small amount of metadata stored in another separate file. You can map a file to its blocks and their locations with:
hdfs fsck / -files -blocks -locations
Here's an example of how the block files are stored with 128MB block size:
-rw-r--r--. 1 hdfs hadoop 134217728 Jan 12 09:17 blk_1073741825
-rw-r--r--. 1 hdfs hadoop 1048583 Jan 12 09:17 blk_1073741825_1001.meta
-rw-r--r--. 1 hdfs hadoop 134217728 Jan 12 09:18 blk_1073741826
-rw-r--r--. 1 hdfs hadoop 1048583 Jan 12 09:18 blk_1073741826_1002.meta
-rw-r--r--. 1 hdfs hadoop 134217728 Jan 12 09:18 blk_1073741827
-rw-r--r--. 1 hdfs hadoop 1048583 Jan 12 09:18 blk_1073741827_1003.meta
-rw-r--r--. 1 hdfs hadoop 134217728 Jan 12 09:18 blk_1073741828
-rw-r--r--. 1 hdfs hadoop 1048583 Jan 12 09:18 blk_1073741828_1004.meta
-rw-r--r--. 1 hdfs hadoop 134217728 Jan 12 09:19 blk_1073741829
-rw-r--r--. 1 hdfs hadoop 1048583 Jan 12 09:19 blk_1073741829_1005.meta
-rw-r--r--. 1 hdfs hadoop 134217728 Jan 12 09:19 blk_1073741830
-rw-r--r--. 1 hdfs hadoop 1048583 Jan 12 09:19 blk_1073741830_1006.meta
-rw-r--r--. 1 hdfs hadoop 87776064 Jan 12 09:19 blk_1073741831
-rw-r--r--. 1 hdfs hadoop 685759 Jan 12 09:19 blk_1073741831_1007.meta
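To answer the last question (finding the Linux file that stores a given HDFS block), a hedged sketch using the fsck command above; the file path, block ID, and data directory are illustrative, since the exact layout under dfs.datanode.data.dir varies by Hadoop version:
# 1. List the blocks (and the DataNodes holding them) that back a file
hdfs fsck /user/hadoop/somefile -files -blocks -locations
# 2. On one of the reported DataNodes, locate the block file by its ID
find /hadoop/hdfs/data -name "blk_1073741825*"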
