Will spark wholetextfiles pick partially created file? - apache-spark

I am using the Spark wholeTextFiles API to read files from a source folder and load them into a Hive table.
Files arrive in the source folder from a remote server. They are large (1GB-3GB each), so the SCP transfer takes quite a while.
If I launch the Spark job while a file is still being copied into the source folder and the transfer is only halfway done, will Spark pick up the file?
If Spark picks up the file at that point, it would be a problem, since the rest of the file's content would be missing.

Possible way to resolve:
At the end of each file copy, SCP a zero-byte file to indicate that the copy is complete.
In the Spark job, after calling sc.wholeTextFiles(...), keep only those files that have a corresponding zero-byte marker file, using map.
Here is code that checks whether the corresponding .ctl files are present in the source folder:
val fr = sc.wholeTextFiles("D:\\DATA\\TEST\\tempstatus")
// Keep only the paths of the .ctl marker files
val temp1 = fr.map(x => x._1).filter(x => x.endsWith(".ctl"))
// Derive the corresponding real file paths by dropping the .ctl suffix
val temp2 = temp1.map(x => (x.stripSuffix(".ctl"), x.stripSuffix(".ctl")))
// Join on the file path so that only files with a marker remain
val result = fr
  .join(temp2)
  .map {
    case (_, (entry, x)) => (x, entry)
  }
// ... process the RDD result as required.
The RDD temp2 is changed from RDD[String] to RDD[(String, String)] so that it can be used in the join.
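A variation on the same marker-file idea, as a minimal Python sketch: select the completed paths from a directory listing first, so that only those are handed to Spark. The function name completed_files and the sample file names are hypothetical; the .ctl suffix follows the convention above.

```python
def completed_files(names, marker_suffix=".ctl"):
    """Return the data files whose marker file is also present in the listing."""
    present = set(names)
    markers = [n for n in names if n.endswith(marker_suffix)]
    # Strip only the trailing suffix, so ".ctl" elsewhere in a name is untouched.
    candidates = [m[: -len(marker_suffix)] for m in markers]
    return sorted(c for c in candidates if c in present)


# Hypothetical listing: b.txt has no marker yet, c.txt has a marker but no data file.
listing = ["a.txt", "a.txt.ctl", "b.txt", "c.txt.ctl"]
ready = completed_files(listing)
print(ready)  # ['a.txt']
```

With PySpark, the selected paths could then be joined with commas and passed to sc.wholeTextFiles, which accepts a comma-separated list of paths.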

If you are SCPing files into the source folder while Spark reads from that folder, Spark can pick up half-written files, because SCP takes some time to copy them.
That will happen for sure.
Your task is to avoid writing directly into that source folder, so that Spark doesn't pick up incomplete files.
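One common way to avoid writing directly into the source folder (an assumption here, not something this answer spells out) is to let SCP write into a separate staging folder on the same filesystem and move each file into the source folder only after the copy has finished. A minimal Python sketch; the folder names and the publish helper are hypothetical:

```python
import os
import tempfile


def publish(staging_path, source_dir):
    """Move a fully written file from staging into the folder Spark reads."""
    target = os.path.join(source_dir, os.path.basename(staging_path))
    # os.replace is atomic when both paths are on the same filesystem,
    # so a reader sees either no file or the complete file.
    os.replace(staging_path, target)
    return target


# Hypothetical layout: <root>/staging receives the SCP, <root>/source is read by Spark.
root = tempfile.mkdtemp()
staging = os.path.join(root, "staging")
source = os.path.join(root, "source")
os.makedirs(staging)
os.makedirs(source)

# Simulate a finished transfer landing in the staging folder.
incoming = os.path.join(staging, "data.txt")
with open(incoming, "w") as f:
    f.write("complete content\n")

published = publish(incoming, source)
print(os.listdir(source))  # ['data.txt']
```

The rename is only atomic within one filesystem; if staging and source live on different mounts, the move becomes a copy and the original problem returns.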

Related

HDFS and Spark: Best way to write a file and reuse it from another program

I have some results from a Spark application saved in the HDFS as files called part-r-0000X (X= 0, 1, etc.). And, because I want to join the whole content in a file, I'm using the following command:
hdfs dfs -getmerge srcDir destLocalFile
The previous command is used in a bash script which empties the output directory (where the part-r-... files are saved) and, inside a loop, executes the above getmerge command.
The thing is, I need to use the resulting file in another Spark program which needs that merged file as input in HDFS. So I'm saving it locally and then uploading it to HDFS.
I've thought of another option, which is to write the file from the Spark program in this way:
outputData.coalesce(1, false).saveAsTextFile(outPathHDFS)
But I've read coalesce() doesn't help with the performance.
Any other ideas? suggestions? Thanks!
My guess is that you want to merge all the files into a single one so that you can load them all at once into a Spark RDD.
Leave the files as parts (0, 1, ...) in HDFS.
Why not load them with wholeTextFiles, which does what you need?
wholeTextFiles(path, minPartitions=None, use_unicode=True)
Read a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. Each file is read as a single record and returned in a key-value pair, where the key is the path of each file, the value is the content of each file.
If use_unicode is False, the strings will be kept as str (encoding as utf-8), which is faster and smaller than unicode. (Added in Spark 1.2)
For example, if you have the following files:
hdfs://a-hdfs-path/part-00000
hdfs://a-hdfs-path/part-00001
...
hdfs://a-hdfs-path/part-nnnnn
Do rdd = sparkContext.wholeTextFiles("hdfs://a-hdfs-path"), then rdd contains:
(a-hdfs-path/part-00000, its content)
(a-hdfs-path/part-00001, its content)
...
(a-hdfs-path/part-nnnnn, its content)
Try Spark's bucketBy.
This is a nice feature available via df.write.saveAsTable(), but the resulting format can only be read by Spark. The data shows up in the Hive metastore but cannot be read by Hive or Impala.
The best solution that I've found so far was:
outputData.saveAsTextFile(outPath, classOf[org.apache.hadoop.io.compress.GzipCodec])
This saves the outputData in compressed part-0000X.gz files under the outPath directory.
The other Spark app then reads those files using this:
val inputData = sc.textFile(inDir + "part-00*", numPartition)
Where inDir corresponds to the outPath.
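To see which names a pattern like part-00* selects, here is a small Python sketch that uses fnmatch as a stand-in for Hadoop's glob matching (for a simple trailing *, the two behave the same way). The file names in the listing are hypothetical.

```python
from fnmatch import fnmatchcase

# Hypothetical contents of the output directory.
names = ["_SUCCESS", "part-00000.gz", "part-00001.gz", ".part-00000.gz.crc"]

selected = [n for n in names if fnmatchcase(n, "part-00*")]
print(selected)  # ['part-00000.gz', 'part-00001.gz']
```

Note that the pattern only covers part-00000 through part-00999; a name such as part-01000.gz would not match it.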

Parquet file format on S3: which is the actual Parquet file?

Scala 2.12 and Spark 2.2.1 here. I used the following code to write the contents of a DataFrame to S3:
myDF.write.mode(SaveMode.Overwrite)
.parquet("s3n://com.example.mybucket/mydata.parquet")
When I go to com.example.mybucket on S3 I actually see a directory called "mydata.parquet", as well as a file called "mydata.parquet_$folder$"!!! If I go into the mydata.parquet directory I see two files under it:
_SUCCESS; and
part-<big-UUID>.snappy.parquet
Whereas I was just expecting to see a single file called mydata.parquet living in the root of the bucket.
Is something wrong here (if so, what?!?) or is this expected with the Parquet file format? If it's expected, which is the actual Parquet file that I should read from:
mydata.parquet directory?; or
mydata.parquet_$folder$ file?; or
mydata.parquet/part-<big-UUID>.snappy.parquet?
Thanks!
The mydata.parquet/part-<big-UUID>.snappy.parquet is the actual parquet data file. However, often tools like Spark break data sets into multiple part files, and expect to be pointed to a directory that contains multiple files. The _SUCCESS file is a simple flag indicating that the write operation has completed.
According to the API, saving a Parquet file writes it inside the folder you provide. The _SUCCESS file is an indication that the process completed successfully.
S3 creates those $folder$ entries if you commit directly to S3. What happens is that the output is written to temporary folders and then copied to the final destination inside S3. The reason is that S3 has no concept of rename.
Look at S3DistCp (s3-dist-cp) and also the DirectCommitter for performance issues.
The $folder$ marker is used by s3n/Amazon's EMRFS to indicate an "empty directory". Ignore it.
The _SUCCESS file is, as the others note, a 0-byte file. Ignore it too.
All other .parquet files in the directory are the output; the number you end up with depends on the number of tasks executed on the input.
When Spark uses a directory (tree) as a source of data, all files beginning with _ or . are ignored; s3n will strip out those $folder$ things too. So if you use the path for a new query, it will only pick up that Parquet file.
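The filtering described above can be sketched in Python as a predicate over a directory listing. This only illustrates the rule as stated in this answer, not Spark's actual implementation; is_data_file and the sample names are hypothetical.

```python
def is_data_file(name):
    """Apply the filtering described above to one entry of a directory listing."""
    if name.startswith("_") or name.startswith("."):
        return False  # _SUCCESS, hidden files, and similar
    if name.endswith("_$folder$"):
        return False  # s3n / EMRFS empty-directory marker
    return True


# Hypothetical listing of the bucket entries discussed above.
listing = [
    "_SUCCESS",
    "part-0001.snappy.parquet",
    ".part-0001.crc",
    "mydata.parquet_$folder$",
]
data_files = [n for n in listing if is_data_file(n)]
print(data_files)  # ['part-0001.snappy.parquet']
```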

Recursively Read Files Spark wholeTextFiles

I have a directory in an azure data lake that has the following path:
'adl://home/../psgdata/clusters/iptiqadata-prod-cluster-eus2-01/psgdata/mib'
Within this directory there are a number of other directories (50) that have the format 20190404.
The directory 'adl://home/../psgdata/clusters/iptiqadata-prod-cluster-eus2-01/psgdata/mib/20180404' contains 100 or so xml files which I am working with.
I can create an rdd for each of the sub-folders which works fine, but ideally I want to pass only the top path, and have spark recursively find the files. I have read other SO posts and tried using a wildcard thus:
pathWild = 'adl://home/../psgdata/clusters/iptiqadata-prod-cluster-eus2-01/psgdata/mib/*'
rdd = sc.wholeTextFiles(pathWild)
rdd.count()
But it just freezes and does nothing at all, and seems to completely kill the kernel. I am working in Jupyter on Spark 2.x and am new to Spark. Thanks!
Try this:
pathWild = 'adl://home/../psgdata/clusters/iptiqadata-prod-cluster-eus2-01/psgdata/mib/*/*'
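Why the extra /* matters can be illustrated on a local directory tree with Python's glob module, used here as an analogue of the glob matching applied to Hadoop paths: a single * stops at the dated sub-directories, while */* reaches the files inside them. The directory and file names below are hypothetical, and the sketch does not reproduce the freeze itself.

```python
import glob
import os
import tempfile

# Build a small tree: <root>/<date>/<file>.xml
root = tempfile.mkdtemp()
for day in ("20190403", "20190404"):
    os.makedirs(os.path.join(root, day))
    for name in ("a.xml", "b.xml"):
        open(os.path.join(root, day, name), "w").close()

one_level = glob.glob(os.path.join(root, "*"))        # matches the date directories
two_levels = glob.glob(os.path.join(root, "*", "*"))  # matches the XML files

print(sorted(os.path.basename(p) for p in one_level))  # ['20190403', '20190404']
print(len(two_levels))  # 4
```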

Naming the csv file in write_df

I am writing a file in SparkR using write.df, but I am unable to specify the file name:
Code:
write.df(user_log0, path = "Output/output.csv",
source = "com.databricks.spark.csv",
mode = "overwrite",
header = "true")
Problem:
I expect a file called 'output.csv' inside the 'Output' folder, but what I get is a folder called 'output.csv' containing a file called 'part-00000-6859b39b-544b-4a72-807b-1b8b55ac3f09.csv'
What am I doing wrong?
P.S: R 3.3.2, Spark 2.1.0 on OSX
Because of the distributed nature of Spark, you can only define the directory into which the files will be saved; each executor writes its own file using Spark's internal naming convention.
If you see only a single file, it means that you are working with a single partition, so only one executor is writing. This is not the normal Spark behavior. However, if it fits your use case, you can collect the result to an R data frame and write the CSV from that.
In the more general case, where the data is parallelized between multiple executors, you cannot set a specific name for the files.
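For the single-partition case, another option is to post-process the output folder and copy its only part file to the name you want. The sketch below is in Python for illustration (the question uses SparkR) and assumes the output is on a local filesystem; promote_single_part and the file names are hypothetical.

```python
import glob
import os
import shutil
import tempfile


def promote_single_part(output_dir, final_path):
    """Copy the only part-*.csv in output_dir to final_path."""
    parts = glob.glob(os.path.join(output_dir, "part-*.csv"))
    if len(parts) != 1:
        raise ValueError(f"expected exactly one part file, found {len(parts)}")
    shutil.copyfile(parts[0], final_path)
    return final_path


# Simulate the folder Spark produced for path = "Output/output.csv".
root = tempfile.mkdtemp()
out_dir = os.path.join(root, "output.csv")
os.makedirs(out_dir)
with open(os.path.join(out_dir, "part-00000-abc.csv"), "w") as f:
    f.write("id,value\n1,x\n")
open(os.path.join(out_dir, "_SUCCESS"), "w").close()

final = promote_single_part(out_dir, os.path.join(root, "result.csv"))
with open(final) as f:
    print(f.read())
```

The helper deliberately raises when it finds more than one part file, since in that case there is no single file to rename.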

how to read a dir containing sub-dirs using spark's textFile()

I'm using Spark's textFile to read files from HDFS.
The directories in HDFS look like:
/user/root/kjyw.txt
/user/root/vjwy.txt
/user/root/byeq.txt
/user/root/dira/xxx.txt
When I use sc.textFile("/user/root/"),
the job fails because the directory contains sub-directories.
How can I make Spark read only the files in the directory?
Please don't suggest sc.textFile("/user/root/*.txt"), because not all of the file names end with .txt.
val rdd = sc.wholeTextFiles("/user/root/*/*")
Add one /* for each directory level you have. The above will work for the directory structure you have shown.
It returns a pair RDD.
