How to read a directory containing sub-directories using Spark's textFile()

I'm using Spark's textFile to read files from HDFS.
The directory layout in HDFS looks like this:
/user/root/kjyw.txt
/user/root/vjwy.txt
/user/root/byeq.txt
/user/root/dira/xxx.txt
When I call sc.textFile("/user/root/"), the job fails because the directory contains sub-directories.
How can I make Spark read only the files in the directory?
Please don't suggest sc.textFile("/user/root/*.txt"), because not all of the file names end with .txt.

val rdd = sc.wholeTextFiles("/user/root/*/*")
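Add one /* for each directory level you need to descend. The pattern above matches the files one level below /user/root, such as /user/root/dira/xxx.txt.
Note that wholeTextFiles returns a pair RDD of (file path, file content) rather than an RDD of lines.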
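If the goal is to read only the files that sit directly in the directory, another option is to list the directory yourself, keep the regular files, and hand Spark an explicit comma-separated list of paths, which sc.textFile accepts. The sketch below is only an illustration under stated assumptions: it lists a local directory with os.scandir (for HDFS the listing would have to come from the Hadoop FileSystem API or hdfs dfs -ls instead), and files_only and as_textfile_arg are hypothetical helper names, not part of Spark.

```python
import os


def files_only(directory):
    """Return sorted paths of regular files directly inside `directory`.

    Sub-directories (and anything inside them) are skipped.
    """
    with os.scandir(directory) as entries:
        return sorted(entry.path for entry in entries if entry.is_file())


def as_textfile_arg(paths):
    """Join paths with commas, the multi-path form that sc.textFile accepts.

    Assumption: none of the paths contains a comma itself.
    """
    return ",".join(paths)


# Hypothetical usage, assuming an existing SparkContext named `sc`:
# rdd = sc.textFile(as_textfile_arg(files_only("/some/local/dir")))
```

Because the list is built from a directory listing rather than a glob, it does not depend on the files sharing an extension such as .txt.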

Related

Spark - Read hidden files from hdfs

I am working in the PySpark shell to analyze data in HDFS. There are hidden files in the HDFS path and I want to read them through the shell. However, Spark ignores the dot files. How can I read them?
# This is not loading hidden files into data-frame
dir="/abc/xyz"
df=spark.read.text(dir)
# This is not loading hidden files into data-frame
dir="/abc/xyz/*"
df=spark.read.text(dir)
# This is not loading hidden files into data-frame
dir="/abc/xyz/.*"
df=spark.read.text(dir)
Any suggestions would be appreciated.
Spark uses the Hadoop APIs to read data from HDFS. Hadoop input formats apply a path filter that excludes files whose names start with "_" or ".".
Try setting your own filter with FileInputFormat.setInputPathFilter in the job configuration, and then use newAPIHadoopFile to create the RDD.
Try changing your path.
# This is not loading hidden files into data-frame
# dir="/abc/xyz/.*"
dir = "hdfs://yourhost:yourport/abc/xyz/"
df=spark.read.text(dir)

How can Spark load only some of the files within a directory?

I have a directory that contains more than 10,000 files with the same schema.
Because loading and scanning all of the files is very time-consuming, I would like to load only an arbitrary subset of them.
For example, the file list is 1.csv, 2.csv, ..., 1000.csv.
I wonder if there is a way to load only 1.csv, 10.csv, 97.csv, ... (the files are picked randomly) so that I can avoid scanning all of the files.
Thanks!
You can pass a list of file names to the CSV reader, for example:
# you'll need full paths here unless the files are in your working directory
filelist = ['1.csv', '10.csv', '97.csv']
df = spark.read.csv(filelist)
In Scala it would be:
val filelist = Seq("1.csv", "10.csv", "97.csv")
val df = spark.read.csv(filelist: _*)
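To pick the files at random rather than hard-coding them, the list can be sampled before it is handed to the reader. The following is a rough sketch, assuming the file names are already known without scanning the files themselves; pick_random_files and the /data/csvs directory are hypothetical and only serve the illustration.

```python
import random


def pick_random_files(directory, names, k, seed=None):
    """Choose k distinct file names at random and return their full paths.

    Passing a seed makes the selection repeatable between runs.
    """
    rng = random.Random(seed)
    chosen = rng.sample(names, k)
    prefix = directory.rstrip("/")
    return [f"{prefix}/{name}" for name in chosen]


# Assumed naming scheme from the question: 1.csv, 2.csv, ..., 1000.csv
all_names = [f"{i}.csv" for i in range(1, 1001)]
filelist = pick_random_files("/data/csvs", all_names, 3, seed=42)

# Hypothetical usage, assuming an existing SparkSession named `spark`:
# df = spark.read.csv(filelist)
```

Only the sampled paths are passed to spark.read.csv, so the remaining files are never opened.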

HDFS and Spark: Best way to write a file and reuse it from another program

I have some results from a Spark application saved in HDFS as files called part-r-0000X (X = 0, 1, etc.). Because I want to join the whole content into a single file, I'm using the following command:
hdfs dfs -getmerge srcDir destLocalFile
The previous command is used in a bash script that empties the output directory (where the part-r-... files are saved) and, inside a loop, executes the above getmerge command.
The thing is, I need to use the resulting file in another Spark program, which needs that merged file as input in HDFS. So I'm saving it locally and then uploading it to HDFS.
I've thought of another option, which is to write the file from the Spark program like this:
outputData.coalesce(1, false).saveAsTextFile(outPathHDFS)
But I've read that coalesce() doesn't help with performance.
Any other ideas? suggestions? Thanks!
My guess is that you want to merge all the files into a single one so that you can load them all at once into a Spark RDD.
Leave the files as parts (0, 1, ...) in HDFS.
Why not load them with wholeTextFiles, which does what you need?
wholeTextFiles(path, minPartitions=None, use_unicode=True)
Read a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. Each file is read as a single record and returned in a key-value pair, where the key is the path of each file, the value is the content of each file.
If use_unicode is False, the strings will be kept as str (encoding as utf-8), which is faster and smaller than unicode. (Added in Spark 1.2)
For example, if you have the following files:
hdfs://a-hdfs-path/part-00000
hdfs://a-hdfs-path/part-00001
...
hdfs://a-hdfs-path/part-nnnnn
Do rdd = sparkContext.wholeTextFiles("hdfs://a-hdfs-path"), then rdd contains:
(a-hdfs-path/part-00000, its content)
(a-hdfs-path/part-00001, its content)
...
(a-hdfs-path/part-nnnnn, its content)
Try Spark's bucketBy.
It is a nice feature available through df.write.saveAsTable(), but the resulting format can only be read by Spark. The data shows up in the Hive metastore but cannot be read by Hive or Impala.
The best solution that I've found so far was:
outputData.saveAsTextFile(outPath, classOf[org.apache.hadoop.io.compress.GzipCodec])
This saves outputData as compressed part-0000X.gz files under the outPath directory.
The other Spark app then reads those files with:
val inputData = sc.textFile(inDir + "part-00*", numPartition)
Here inDir corresponds to outPath and, because the pattern is appended directly, must end with a path separator.
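To check which file names a pattern like part-00* would select before pointing Spark at it, the matching can be approximated locally with Python's fnmatch. This is only a rough stand-in for Hadoop's glob matching, and the file names below are made up for illustration; matching_names is a hypothetical helper.

```python
from fnmatch import fnmatchcase


def matching_names(names, pattern="part-00*"):
    """Return the names that match a shell-style pattern, case-sensitively."""
    return [name for name in names if fnmatchcase(name, pattern)]


# Illustrative output listing: three part files plus the _SUCCESS marker
names = ["part-00000.gz", "part-00001.gz", "part-01000.gz", "_SUCCESS"]
selected = matching_names(names)
# selected == ["part-00000.gz", "part-00001.gz"]
```

Note that part-01000.gz is not selected: a part-00* pattern only covers the first thousand part files, so a job that writes more partitions than that would need a broader pattern such as part-*.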

Recursively Read Files Spark wholeTextFiles

I have a directory in an azure data lake that has the following path:
'adl://home/../psgdata/clusters/iptiqadata-prod-cluster-eus2-01/psgdata/mib'
Within this directory there are a number of other directories (50) whose names have the format 20190404.
The directory 'adl://home/../psgdata/clusters/iptiqadata-prod-cluster-eus2-01/psgdata/mib/20180404' contains 100 or so xml files which I am working with.
I can create an rdd for each of the sub-folders which works fine, but ideally I want to pass only the top path, and have spark recursively find the files. I have read other SO posts and tried using a wildcard thus:
pathWild = 'adl://home/../psgdata/clusters/iptiqadata-prod-cluster-eus2-01/psgdata/mib/*'
rdd = sc.wholeTextFiles(pathWild)
rdd.count()
But it just freezes and does nothing at all; it seems to kill the kernel completely. I am working in Jupyter on Spark 2.x and am new to Spark. Thanks!
Try this:
pathWild = 'adl://home/../psgdata/clusters/iptiqadata-prod-cluster-eus2-01/psgdata/mib/*/*'

Will Spark wholeTextFiles pick up a partially created file?

I am using the Spark wholeTextFiles API to read files from a source folder and load them into a Hive table.
Files arrive in the source folder from a remote server. The files are huge, around 1 GB to 3 GB each, so the SCP of each file takes quite a while.
If I launch the Spark job while a file is halfway through being SCPed to the source folder, will Spark pick up the file?
If Spark picks up the file when it is only half copied, that would be a problem, since the rest of the file's content would be ignored.
Possible way to resolve:
At the end of each file copy, SCP a zero-KB file to indicate that the SCP is complete.
In the Spark job, when you call sc.wholeTextFiles(...), pick only those file names that have a corresponding zero-KB file, using map.
So, here's code to check whether the corresponding .ctl files are present in the source folder.
val fr = sc.wholeTextFiles("D:\\DATA\\TEST\\tempstatus")
// Keep only the paths of the .ctl marker files
val temp1 = fr.map(x => x._1).filter(x => x.endsWith(".ctl"))
// Derive the corresponding real file paths (without the .ctl suffix), keyed for the join
val temp2 = temp1.map(x => (x.stripSuffix(".ctl"), x.stripSuffix(".ctl")))
// Keep only the files that have a marker, as (path, content) pairs
val result = fr
  .join(temp2)
  .map {
    case (_, (entry, x)) => (x, entry)
  }
// ... process the result RDD as required
temp2 is built as an RDD[(String, String)] rather than an RDD[String] so that it can be used in the join.
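The same marker-file check can also be done on the file listing alone, before any file content is read, so that incomplete files are never opened. The sketch below is one way to do that in plain Python, assuming each marker is named after its data file with .ctl appended; completed_files and the /src paths are hypothetical.

```python
def completed_files(paths, marker_suffix=".ctl"):
    """Return the data-file paths whose marker file is also present.

    Assumption: the marker for a data file is the same path with
    `marker_suffix` appended, e.g. /src/a.dat and /src/a.dat.ctl.
    """
    present = set(paths)
    return sorted(
        path
        for path in present
        if not path.endswith(marker_suffix) and path + marker_suffix in present
    )


# Illustrative listing: a.dat has finished copying, b.dat has not
listing = ["/src/a.dat", "/src/a.dat.ctl", "/src/b.dat"]
ready = completed_files(listing)
# ready == ["/src/a.dat"]

# Hypothetical usage, assuming an existing SparkContext named `sc`:
# rdd = sc.wholeTextFiles(",".join(ready))
```

Passing only the completed paths as a comma-separated list means the half-copied files are not part of the input at all.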
If you are SCPing the files into the source folder and Spark is reading from that folder, Spark may pick up half-written files, because SCP can take some time to copy.
That will happen for sure.
Your task is to find a way not to write directly into that source folder, so that Spark doesn't pick up incomplete files.
