Write dataframe to CSV in spark - apache-spark

I am writing a spark dataframe to CSV file using below code
println("Total number of reports: " + reportDf.count())
reportDf
.coalesce(1)
.write.format("com.databricks.spark.csv")
.csv("output/cluster.csv")
And o/p is:
Total number of reports: 48720
spark#monikatest:~/output/cluster.csv$ ll
total 12
drwxrwxr-x 2 spark spark 4096 Mar 27 20:56 ./
drwxrwxr-x 3 spark spark 4096 Mar 27 20:56 ../
-rw-r--r-- 1 spark spark 0 Mar 27 20:56 _SUCCESS
-rw-r--r-- 1 spark spark 8 Mar 27 20:56 ._SUCCESS.crc
No data written to file, only success file present.
Can anyone please suggest how to overcome this error.

Related

Split fastq read into 10G mini files, assembler not accepting as fastq format

I split a 52G fastq file into 10G chunks with the following code:
split -b 10G /home/bilalm/H_glaber_quality_filtering/AfterQC/good_reads/SRR530529.good.fq outputfile
This produced the following files:
-rw-rw-r-- 1 bilalm bilalm 10G Aug 11 13:48 outputfileaa
-rw-rw-r-- 1 bilalm bilalm 10G Aug 11 13:49 outputfileab
-rw-rw-r-- 1 bilalm bilalm 10G Aug 11 13:50 outputfileac
-rw-rw-r-- 1 bilalm bilalm 10G Aug 11 13:51 outputfilead
-rw-rw-r-- 1 bilalm bilalm 10G Aug 11 13:52 outputfileae
-rw-rw-r-- 1 bilalm bilalm 1.6G Aug 11 13:53 outputfileaf
When I was attempting to assemble "outputfileab", with Velvet, I get the following error message:
velveth: /home/bilalm/H_glaber_quality_filtering/AfterQC/good_reads/split_SRR530529_file/outputfileab does not seem to be in FastQ format
Strangely, both velveth and velvetg was used normally to assemble the first 10G read i.e. "outputfileaa".
Anybody know what's going on?
split by file size rather than line counts does just that, and will split in the middle of a line if the byte limit is reached. velvet has a check to assert if every fourth line starts with #, so this check will fail considering the split method, which is why we are seeing this happen on the second file and not the first. I would suggest you split this file by line count passing the -l xxxx flag.

Cassandra CommitLog Directory Forgetting To Remove Files

Version: DSE 6.7.5, CQL spec 3.4.5.
I have 8GB commitlog_total_space_in_mb.
Folder is currently at 13GB.
Looking at the date stamps in the folder it appears that it forgets about commitlogs or it may be failing to delete the commitlogs when it flushes memtables.
Happens on multiple nodes.
-rw-r--r--. 1 cassandra cassandra 33554338 Sep 20 02:00 CommitLog-600-1568892978830.log
-rw-r--r--. 1 cassandra cassandra 33554227 Sep 20 02:02 CommitLog-600-1568892978853.log
-rw-r--r--. 1 cassandra cassandra 33554217 Sep 20 02:02 CommitLog-600-1568892978862.log
-rw-r--r--. 1 cassandra cassandra 33554337 Sep 20 02:03 CommitLog-600-1568892978863.log
-rw-r--r--. 1 cassandra cassandra 33554169 Sep 20 02:04 CommitLog-600-1568892978864.log
-rw-r--r--. 1 cassandra cassandra 33554412 Sep 20 08:19 CommitLog-600-1568892954896.log
-rw-r--r--. 1 cassandra cassandra 33554326 Sep 20 08:19 CommitLog-600-1568892954901.log
-rw-r--r--. 1 cassandra cassandra 33554133 Sep 20 08:20 CommitLog-600-1568892954904.log
-rw-r--r--. 1 cassandra cassandra 33554281 Sep 20 08:20 CommitLog-600-1568892954905.log
-rw-r--r--. 1 cassandra cassandra 33553885 Sep 20 08:20 CommitLog-600-1568892954906.log
When i perform a nodetool flush/drain it will not remove any of the old files.
-rw-r--r--. 1 cassandra cassandra 33554338 Sep 20 02:00 CommitLog-600-1568892978830.log
-rw-r--r--. 1 cassandra cassandra 33554227 Sep 20 02:02 CommitLog-600-1568892978853.log
-rw-r--r--. 1 cassandra cassandra 33554217 Sep 20 02:02 CommitLog-600-1568892978862.log
-rw-r--r--. 1 cassandra cassandra 33554337 Sep 20 02:03 CommitLog-600-1568892978863.log
-rw-r--r--. 1 cassandra cassandra 33554169 Sep 20 02:04 CommitLog-600-1568892978864.log
-rw-r--r--. 1 cassandra cassandra 28 Sep 20 08:46 CommitLog-600-1568892981041.log
When I start the node back up it goes through them and crashes around the final commitlog. https://pastebin.com/Kw9Kee5C
CassandraDaemon.java:129 - Exception in thread Thread[PerDiskMemtableFlushWriter_0:11,5,main] java.lang.AssertionError: null
It wont start back up unless I move some of the last commitlogs out or all of them out.
What can I do to fix this problem
I have resolved this for my issue for the time being by changing compaction to
compaction = {'class': 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'}
For some reason having cell type of map with the following compaction was causing me the errors.
{'class': 'org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy', 'compaction_window_size': '30', 'compaction_window_unit': 'DAYS', 'max_threshold': '32', 'min_threshold': '4', 'split_during_flush': 'true'}

Does Spark support multiple users?

I have a 3-node spark 2.3.1 cluster running at the moment, and I'm also running a zeppelin server using a normal user, like ulab.
From zeppelin, I ran the commands:
%spark
val file = sc.textFile("file:///mnt/glusterfs/test/testfile")
file.saveAsTextFile("/mnt/glusterfs/test/testfile2")
It report a lot of error messages, something like:
WARN [2018-09-14 05:44:50,540] ({pool-2-thread-8} NotebookServer.java[afterStatusChange]:2302) - Job 20180907-130718_39068508 is finished, status: ERROR, exception: null, result: %text file: org.apache.spark.rdd.RDD[String] = file:///mnt/glusterfs/test/testfile MapPartitionsRDD[49] at textFile at <console>:51
org.apache.spark.SparkException: Job aborted.
...
... 64 elided
Caused by: java.io.IOException: Failed to rename DeprecatedRawLocalFileStatus{path=file:/mnt/glusterfs/test/testfile2/_temporary/0/task_20180914054253_0050_m_000018/part-00018; isDirectory=false; length=33554979; replication=1; blocksize=33554432; modification_time=1536903780000; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false} to file:/mnt/glusterfs/test/testfile2/part-00018
And I found that some temporary files owned by user root, while some owned by ulab, like the following:
bash-4.4# ls -l testfile2
total 32773
drwxr-xr-x 3 ulab ulab 4096 Sep 14 05:42 _temporary
-rw-r--r-- 1 ulab ulab 33554979 Sep 14 05:44 part-00018
bash-4.4# ls -l testfile2/_temporary/
total 4
drwxr-xr-x 210 ulab ulab 4096 Sep 14 05:44 0
bash-4.4# ls -l testfile2/_temporary/0
total 832
drwxr-xr-x 2 root root 4096 Sep 14 05:42 task_20180914054253_0050_m_000000
drwxr-xr-x 2 root root 4096 Sep 14 05:42 task_20180914054253_0050_m_000001
drwxr-xr-x 2 root root 4096 Sep 14 05:42 task_20180914054253_0050_m_000002
drwxr-xr-x 2 root root 4096 Sep 14 05:42 task_20180914054253_0050_m_000003
....
Is there any setup to let all these temporary files created by ulab? so we can use multiple users in spark driver to isolate the priviledges.
You can enable 'User Impersonate' option for spark interpreter which will start the spark job as logged-in user.
Refer this link for more info

Spark - Get part-file suffix

When Spark uses Hadoop writer to write part-file (using saveAsTextFile()), this is the general format "part-NNNNN" it saves the file in. How can I retrieve this suffix "NNNNN" in Spark at runtime?
Ps. I do not want to list the files and then retrieve the suffix.
The files are named part-00000, part-00001, and so on. Each of the RDD partitions is written to one part- file. So, the number of output files will depend upon the partitions in the RDD being written out.
You could check the RDD being written for the number of partitions (say 5), and then access the files part-00000 to part-00004.
Illustration
Build a DataFrame by querying a Hive table
scala> val df1=sqlContext.sql("select * from default.hive_table");
Get number of RDD partitions
scala> df1.rdd.partitions.size
res4: Int = 11
Save DataFrame to HDFS
scala> df1.rdd.saveAsTextFile("/process_output")
Check HDFS output location
hadoop fs -ls /process_output
Found 12 items
-rw-r--r-- 3 root hdfs 0 2018-05-01 08:51 /process_output/_SUCCESS
-rw-r--r-- 3 root hdfs 190 2018-05-01 08:51 /process_output/part-00000
-rw-r--r-- 3 root hdfs 190 2018-05-01 08:51 /process_output/part-00001
-rw-r--r-- 3 root hdfs 182 2018-05-01 08:51 /process_output/part-00002
-rw-r--r-- 3 root hdfs 190 2018-05-01 08:51 /process_output/part-00003
-rw-r--r-- 3 root hdfs 180 2018-05-01 08:51 /process_output/part-00004
-rw-r--r-- 3 root hdfs 190 2018-05-01 08:51 /process_output/part-00005
-rw-r--r-- 3 root hdfs 190 2018-05-01 08:51 /process_output/part-00006
-rw-r--r-- 3 root hdfs 190 2018-05-01 08:51 /process_output/part-00007
-rw-r--r-- 3 root hdfs 190 2018-05-01 08:51 /process_output/part-00008
-rw-r--r-- 3 root hdfs 190 2018-05-01 08:51 /process_output/part-00009
-rw-r--r-- 3 root hdfs 190 2018-05-01 08:51 /process_output/part-00010

How data in an HDFS block is stored?

I was reading about HDFS and was wondering, if there is any specific format in which data in a block is arranged.
Suppose there is a file of 265 MB that is copied to a Hadoop cluster and the HDFS block size is 64 MB. So the file is broken into 5 parts- 64 MB + 64 MB + 64 MB + 64 MB + 9 MB, and distributed among data nodes. Correct ?
I have a doubt that is there any format within the 64 MB block in which data is stored ?
If there is any format/structure in which the data is stored within the block, then the stored data should be less than 64 MB, since the data structure/header etc, itself may take some space.
Since HDFS data node is a logical filesystem (It runs on top of linux and there is no separate partition for HDFS), all the blocks should be stored as files in the linux partition. Correct ?
How to know the name of the file on linux that actually stores the 64 MB HDFS block ?
Anyone, if can answer these doubts/questions, that would be great. Thanks in advance.
Regards,
(*Vipul)() ;
No, the data is just split on 64MB boundary. Metadata is stored in a small separate file and on the Namenode
No, it is exactly the size you specified, and the data is split on exact boundaries of 64MB. If you have 5 parts - 64 MB + 64 MB + 64 MB + 64 MB + 9 MB, then the last file would be 9MB, all the others are 64MB
Yes, the blocks are stored as a files, each block is represented as a separate file with some small amount of metadata stored in a separate file
hdfs fsck / -files -blocks -locations
Here's an example of how the block files are stored with 128MB block size:
-rw-r--r--. 1 hdfs hadoop 134217728 Jan 12 09:17 blk_1073741825
-rw-r--r--. 1 hdfs hadoop 1048583 Jan 12 09:17 blk_1073741825_1001.meta
-rw-r--r--. 1 hdfs hadoop 134217728 Jan 12 09:18 blk_1073741826
-rw-r--r--. 1 hdfs hadoop 1048583 Jan 12 09:18 blk_1073741826_1002.meta
-rw-r--r--. 1 hdfs hadoop 134217728 Jan 12 09:18 blk_1073741827
-rw-r--r--. 1 hdfs hadoop 1048583 Jan 12 09:18 blk_1073741827_1003.meta
-rw-r--r--. 1 hdfs hadoop 134217728 Jan 12 09:18 blk_1073741828
-rw-r--r--. 1 hdfs hadoop 1048583 Jan 12 09:18 blk_1073741828_1004.meta
-rw-r--r--. 1 hdfs hadoop 134217728 Jan 12 09:19 blk_1073741829
-rw-r--r--. 1 hdfs hadoop 1048583 Jan 12 09:19 blk_1073741829_1005.meta
-rw-r--r--. 1 hdfs hadoop 134217728 Jan 12 09:19 blk_1073741830
-rw-r--r--. 1 hdfs hadoop 1048583 Jan 12 09:19 blk_1073741830_1006.meta
-rw-r--r--. 1 hdfs hadoop 87776064 Jan 12 09:19 blk_1073741831
-rw-r--r--. 1 hdfs hadoop 685759 Jan 12 09:19 blk_1073741831_1007.meta

Resources