pyspark read multiple csv files at once - apache-spark

I'm using SPARK to read files in hdfs. There is a scenario, where we are getting files as chunks from legacy system in csv format.
ID1_FILENAMEA_1.csv
ID1_FILENAMEA_2.csv
ID1_FILENAMEA_3.csv
ID1_FILENAMEA_4.csv
ID2_FILENAMEA_1.csv
ID2_FILENAMEA_2.csv
ID2_FILENAMEA_3.csv
This files are loaded to FILENAMEA in HIVE using HiveWareHouse Connector, with few transformation like adding default values. Similarly we have around 70 tables. Hive tables are created in ORC format. Tables are partitioned on ID. Right now, I'm processing all these files one by one. It's taking much time.
I want to make this process much faster. Files will be in GBs.
Is there is any way to read all the FILENAMEA files at the same time and load it to HIVE tables.

You have two methods to read several CSV files in pyspark. If all CSV files are in the same directory and all have the same schema, you can read then at once by directly passing the path of directory as argument, as follow:
spark.read.csv('hdfs://path/to/directory')
If you have CSV files in different locations or CSV files in same directory but with other CSV/text files in it, you can pass them as string representing a list of path in .csv() method argument, as follow:
spark.read.csv('hdfs://path/to/filename1,hdfs://path/to/filename2')
You can have more information about how to read a CSV file with Spark here
If you need to build this list of paths from the list of files in HDFS directory, you can look at this answer, once you've created your list of paths, you can transform it to a string to pass to .csv() method with ','.join(your_file_list)

Using: spark.read.csv(["path1","path2","path3"...]) you can read multiple files from different paths. But that means you have first to make a list of the paths. A list not a string of comma-separated file paths

Related

Why can't I merge multiple parquet files using "cat file1.parquet file2. parquet > result.parquet"?

I have created multiple parquet files using pyspark and now I'm trying to merge all the parquet files to 1. I'm able to merge the files, but while reading in the resulting file, I'm getting an error. Have anyone faced this issue before?
You cannot simply concatenate Parquet files using cat as they are binary files with a "table of contents" in the footer. To merge two files, you would have to read them both in and write out a completely new file. This could be done easily using the merge command in the parquet-tools.
The technical background that merging two Parquet files using cat isn't working comes down to the fact that a Parquet file is useless without a footer. Every Parquet file is made up roughly by the following structure:
RowGroup(nrows=..)
Column with nrows
Column with nrows
..
RowGroup(nrows=..)
..
..
Footer
Schema (tells you the type of the columns)
total_nrows
Location of RowGroups in the file
If you cat two files together, you would only see the last footer of the two files. Additionally, if the library used to read the files does some integrity checks, it will realise that your file is broken in some fashion and completely error out.

How to read specific files from a directory based on a file name in spark?

I have a directory of CSV files. The files are named based on date similar to the image below:
I have many CSV files that go back to 2012.
So, I would like to read the CSV files that correspond to a certain date only. How is that could be possible in spark? In other words, I don't want my spark engine to bother and read all CSV files because my data is huge (TBs).
Any help is much appreciated!
You can specify a list of files to be processed when calling the load(paths) or csv(paths) methods from DataFrameReader.
So an option would be to list and filter files on the driver, then load only the "recent" files :
val files: Seq[String] = ???
spark.read.option("header","true").csv(files:_*)
Edit :
You can use this python code (not tested yet)
files=['foo','bar']
df=spark.read.csv(*files)

HDFS and Spark: Best way to write a file and reuse it from another program

I have some results from a Spark application saved in the HDFS as files called part-r-0000X (X= 0, 1, etc.). And, because I want to join the whole content in a file, I'm using the following command:
hdfs dfs -getmerge srcDir destLocalFile
The previous command is used in a bash script which makes empty the output directory (where the part-r-... files are saved) and, inside a loop, executes the above getmerge command.
The thing is I need to use the resultant file in another Spark program which need that merged file as input in the HDFS. So I'm saving it as local and then I upload it to the HDFS.
I've thought another option which is write the file from the Spark program in this way:
outputData.coalesce(1, false).saveAsTextFile(outPathHDFS)
But I've read coalesce() doesn't help with the performance.
Any other ideas? suggestions? Thanks!
You wish to merge all the files into a single one so that you can load all the files at once into a Spark rdd, is my guess.
Let the files be in Parts(0,1,....) in HDFS.
Why not load it with wholetextFiles, which actually does what you need.
wholeTextFiles(path, minPartitions=None, use_unicode=True)[source]
Read a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. Each file is read as a single record and returned in a key-value pair, where the key is the path of each file, the value is the content of each file.
If use_unicode is False, the strings will be kept as str (encoding as utf-8), which is faster and smaller than unicode. (Added in Spark 1.2)
For example, if you have the following files:
hdfs://a-hdfs-path/part-00000 hdfs://a-hdfs-path/part-00001 ... hdfs://a-hdfs-path/part-nnnnn
Do rdd = sparkContext.wholeTextFiles(“hdfs://a-hdfs-path”), then rdd contains:
(a-hdfs-path/part-00000, its content) (a-hdfs-path/part-00001, its content) ... (a-hdfs-path/part-nnnnn, its content)
Try SPARK BucketBy.
This is a nice feature via df.write.saveAsTable(), but this format can only be read by SPARK. Data shows up in Hive metastore but cannot be read by Hive, IMPALA.
The best solution that I've found so far was:
outputData.saveAsTextFile(outPath, classOf[org.apache.hadoop.io.compress.GzipCodec])
Which saves the outputData in compressed part-0000X.gz files under the outPath directory.
And, from the other Spark app, it reads those files using this:
val inputData = sc.textFile(inDir + "part-00*", numPartition)
Where inDir corresponds to the outPath.

Parquet file format on S3: which is the actual Parquet file?

Scala 2.12 and Spark 2.2.1 here. I used the following code to write the contents of a DataFrame to S3:
myDF.write.mode(SaveMode.Overwrite)
.parquet("s3n://com.example.mybucket/mydata.parquet")
When I go to com.example.mybucket on S3 I actually see a directory called "mydata.parquet", as well as file called "mydata.parquet_$folder$"!!! If I go into the mydata.parquet directory I see two files under it:
_SUCCESS; and
part-<big-UUID>.snappy.parquet
Whereas I was just expecting to see a single file called mydata.parquet living in the root of the bucket.
Is something wrong here (if so, what?!?) or is this expected with the Parquet file format? If its expected, which is the actual Parquet file that I should read from:
mydata.parquet directory?; or
mydata.parquet_$folder$ file?; or
mydata.parquet/part-<big-UUID>.snappy.parquet?
Thanks!
The mydata.parquet/part-<big-UUID>.snappy.parquet is the actual parquet data file. However, often tools like Spark break data sets into multiple part files, and expect to be pointed to a directory that contains multiple files. The _SUCCESS file is a simple flag indicating that the write operation has completed.
According to the api to save the parqueat file it saves inside the folder you provide. Sucess is incidation that the process is completed scuesffuly.
S3 create those $folder if you write directly commit to s3. What happens is it writes to temporory folders and copies to the final destination inside the s3. The reason is there no concept of rename.
Look at the s3-distcp and also DirectCommiter for performance issue.
The $folder$ marker is used by s3n/amazon's emrfs to indicate "empty directory". ignore.
The _SUCCESS file is, as the others note, a 0-byte file. ignore
all other .parquet files in the directory are the output; the number you end up with depends on the number of tasks executed on the input
When spark uses a directory (tree) as a source of data, all files beginning with _ or . are ignored; s3n will strip out those $folder$ things too. So if you use the path for a new query, it will only pick up that parquet file.

how to access multiple json files using dataframe from S3

I am using apapche spark. I want to access multiple json files from spark on date basis. How can i pick multiple files i.e. i want to provide range that files ending with 1034.json up to files ending with 1434.json. I am trying this.
DataFrame df = sql.read().json("s3://..../..../.....-.....[1034*-1434*]");
But i am getting the following error
at java.util.regex.Pattern.error(Pattern.java:1924)
at java.util.regex.Pattern.range(Pattern.java:2594)
at java.util.regex.Pattern.clazz(Pattern.java:2507)
at java.util.regex.Pattern.sequence(Pattern.java:2030)
at java.util.regex.Pattern.expr(Pattern.java:1964)
at java.util.regex.Pattern.compile(Pattern.java:1665)
at java.util.regex.Pattern.<init>(Pattern.java:1337)
at java.util.regex.Pattern.compile(Pattern.java:1022)
at org.apache.hadoop.fs.GlobPattern.set(GlobPattern.java:156)
at org.apache.hadoop.fs.GlobPattern.<init>(GlobPattern.java:42)
at org.apache.hadoop.fs.GlobFilter.init(GlobFilter.java:67)
Please specify a way out.
You can read something like this.
sqlContext.read().json("s3n://bucket/filepath/*.json")
Also, you can use wildcards in the file path.
For example:
sqlContext.read().json("s3n://*/*/*-*[1034*-1434*]")

Resources