How to read specific files from a directory based on a file name in Spark?

I have a directory of CSV files. The files are named by date, and there are many of them going back to 2012.
I would like to read only the CSV files that correspond to a certain date. How can that be done in Spark? In other words, I don't want the Spark engine to read all the CSV files, because my data is huge (TBs).
Any help is much appreciated!

You can specify a list of files to be processed when calling the load(paths) or csv(paths) methods from DataFrameReader.
So an option would be to list and filter files on the driver, then load only the "recent" files:
val files: Seq[String] = ???
spark.read.option("header","true").csv(files:_*)
Edit:
You can use this Python code (not tested):
files = ['foo', 'bar']
# pass the list of paths directly; csv() accepts a list of paths
df = spark.read.csv(files)
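For example, a minimal PySpark sketch of the same idea, building the list of paths for a date range on the driver and reading only those files (the base_path and the YYYY-MM-DD.csv naming are assumptions for illustration):
from datetime import date, timedelta
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

base_path = "hdfs:///data/csv"                    # hypothetical location of the CSV files
start, end = date(2021, 1, 1), date(2021, 1, 7)   # the date range you actually need

# Build the explicit list of paths; only these files will be read.
files = []
d = start
while d <= end:
    files.append(f"{base_path}/{d.isoformat()}.csv")
    d += timedelta(days=1)

df = spark.read.option("header", "true").csv(files)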

Related

pyspark read multiple csv files at once

I'm using Spark to read files from HDFS. In one scenario, we receive files as chunks from a legacy system in CSV format:
ID1_FILENAMEA_1.csv
ID1_FILENAMEA_2.csv
ID1_FILENAMEA_3.csv
ID1_FILENAMEA_4.csv
ID2_FILENAMEA_1.csv
ID2_FILENAMEA_2.csv
ID2_FILENAMEA_3.csv
These files are loaded into FILENAMEA in Hive using the Hive Warehouse Connector, with a few transformations such as adding default values. We have around 70 such tables. The Hive tables are created in ORC format and partitioned on ID. Right now I'm processing all these files one by one, which takes a lot of time.
I want to make this process much faster. The files will be in GBs.
Is there any way to read all the FILENAMEA files at the same time and load them into the Hive table?
You have two ways to read several CSV files in PySpark. If all the CSV files are in the same directory and share the same schema, you can read them at once by passing the directory path directly as the argument, as follows:
spark.read.csv('hdfs://path/to/directory')
If the CSV files are in different locations, or they share a directory with other CSV/text files you don't want, you can pass them as a single comma-separated string of paths to the .csv() method, as follows:
spark.read.csv('hdfs://path/to/filename1,hdfs://path/to/filename2')
You can find more information about reading CSV files with Spark in the Spark documentation.
If you need to build this list of paths from the files in an HDFS directory, you can list the directory contents first (see the sketch below). Once you've created your list of paths, you can turn it into a single string to pass to the .csv() method with ','.join(your_file_list).
Using spark.read.csv(["path1", "path2", "path3", ...]) you can read multiple files from different paths, but that means you first have to build a list of the paths: a list, not a string of comma-separated file paths.
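If that list has to come from an HDFS directory listing, one way (a sketch; the landing directory path is hypothetical) is to go through the Hadoop FileSystem API that py4j exposes on the SparkSession:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# List the landing directory through the JVM Hadoop FileSystem API.
hadoop = spark._jvm.org.apache.hadoop
fs = hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())
statuses = fs.listStatus(hadoop.fs.Path("hdfs:///landing/legacy"))  # hypothetical directory

# Keep only the FILENAMEA chunks, whatever their ID prefix.
paths = [s.getPath().toString() for s in statuses
         if "_FILENAMEA_" in s.getPath().getName()]

# One read over all matching chunks instead of one job per file.
df = spark.read.option("header", "true").csv(paths)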

Why can't I merge multiple parquet files using "cat file1.parquet file2. parquet > result.parquet"?

I have created multiple Parquet files using PySpark and now I'm trying to merge them all into one. I'm able to merge the files, but when reading the resulting file back, I get an error. Has anyone faced this issue before?
You cannot simply concatenate Parquet files using cat, as they are binary files with a "table of contents" in the footer. To merge two files, you would have to read them both in and write out a completely new file. This can be done easily using the merge command in parquet-tools.
The technical reason that merging two Parquet files with cat doesn't work comes down to the fact that a Parquet file is useless without its footer. Every Parquet file is made up roughly of the following structure:
RowGroup(nrows=..)
    Column with nrows
    Column with nrows
    ..
RowGroup(nrows=..)
    ..
..
Footer
    Schema (tells you the type of the columns)
    total_nrows
    Location of RowGroups in the file
If you cat two files together, you would only see the last footer of the two files. Additionally, if the library used to read the files does some integrity checks, it will realise that your file is broken in some fashion and completely error out.
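For completeness, the read-and-rewrite approach described above is straightforward in PySpark (a sketch; the input and output paths are placeholders):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read all the part files, then write one new Parquet file with a single, valid footer.
df = spark.read.parquet("hdfs:///data/parts/")                          # placeholder input directory
df.coalesce(1).write.mode("overwrite").parquet("hdfs:///data/merged/")  # placeholder output path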

Recursively Read Files Spark wholeTextFiles

I have a directory in an azure data lake that has the following path:
'adl://home/../psgdata/clusters/iptiqadata-prod-cluster-eus2-01/psgdata/mib'
Within this directory there are around 50 other directories whose names have the format 20190404.
The directory 'adl://home/../psgdata/clusters/iptiqadata-prod-cluster-eus2-01/psgdata/mib/20180404' contains 100 or so xml files which I am working with.
I can create an RDD for each of the sub-folders, which works fine, but ideally I want to pass only the top path and have Spark recursively find the files. I have read other SO posts and tried using a wildcard like this:
pathWild = 'adl://home/../psgdata/clusters/iptiqadata-prod-cluster-eus2-01/psgdata/mib/*'
rdd = sc.wholeTextFiles(pathWild)
rdd.count()
But it just freezes and does nothing at all; it seems to completely kill the kernel. I am working in Jupyter on Spark 2.x and am new to Spark. Thanks!
Try this:
pathWild = 'adl://home/../psgdata/clusters/iptiqadata-prod-cluster-eus2-01/psgdata/mib/*/*'
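For example, dropped into the original snippet (a sketch; the extra /* makes the glob match the files inside each date-named subdirectory directly):
# The second wildcard level expands to the files inside each yyyymmdd subdirectory.
pathWild = 'adl://home/../psgdata/clusters/iptiqadata-prod-cluster-eus2-01/psgdata/mib/*/*'
rdd = sc.wholeTextFiles(pathWild)
rdd.count()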

how to access multiple json files using dataframe from S3

I am using Apache Spark. I want to access multiple JSON files from Spark on a date basis. How can I pick multiple files, i.e. provide a range from files ending with 1034.json up to files ending with 1434.json? I am trying this:
DataFrame df = sql.read().json("s3://..../..../.....-.....[1034*-1434*]");
But I am getting the following error:
at java.util.regex.Pattern.error(Pattern.java:1924)
at java.util.regex.Pattern.range(Pattern.java:2594)
at java.util.regex.Pattern.clazz(Pattern.java:2507)
at java.util.regex.Pattern.sequence(Pattern.java:2030)
at java.util.regex.Pattern.expr(Pattern.java:1964)
at java.util.regex.Pattern.compile(Pattern.java:1665)
at java.util.regex.Pattern.<init>(Pattern.java:1337)
at java.util.regex.Pattern.compile(Pattern.java:1022)
at org.apache.hadoop.fs.GlobPattern.set(GlobPattern.java:156)
at org.apache.hadoop.fs.GlobPattern.<init>(GlobPattern.java:42)
at org.apache.hadoop.fs.GlobFilter.init(GlobFilter.java:67)
Please suggest a way forward.
You can read the files like this:
sqlContext.read().json("s3n://bucket/filepath/*.json")
Also, you can use wildcards in the file path.
For example:
sqlContext.read().json("s3n://*/*/*-*[1034*-1434*]")

Appending filename information to RDD initialized by sc.textFile

I have a set of log files I would like to read into an RDD.
These files are all gzip-compressed (.gz) and the filenames are date-stamped.
The source of these files is the page view statistics data for Wikipedia:
http://dumps.wikimedia.org/other/pagecounts-raw/
The file names look like this:
pagecounts-20090501-000000.gz
pagecounts-20090501-010000.gz
pagecounts-20090501-020000.gz
What I would like to do is read in all such files in a directory and prepend the date from the filename (e.g. 20090501) to each row of the resulting RDD.
I first thought of using sc.wholeTextFiles(..) instead of sc.textFile(..), since it creates a PairRDD whose key is the file name with its path,
but sc.wholeTextFiles() doesn't handle compressed .gz files.
Any suggestions would be welcome.
The following seems to work fine for me in Spark 1.6.0:
sc.wholeTextFiles("file:///tmp/*.gz").flatMapValues(y => y.split("\n")).take(10).foreach(println)
Sample output:
(file:/C:/tmp/pagecounts-20160101-000000.gz,aa 271_a.C 1 4675)
(file:/C:/tmp/pagecounts-20160101-000000.gz,aa Battaglia_di_Qade%C5%A1/it/Battaglia_dell%27Oronte 1 4765)
(file:/C:/tmp/pagecounts-20160101-000000.gz,aa Category:User_th 1 4770)
(file:/C:/tmp/pagecounts-20160101-000000.gz,aa Chiron_Elias_Krase 1 4694)
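To get the date from the filename onto each row, as the question asks, one option is to keep using wholeTextFiles and parse its key (a PySpark sketch; the local path and the pagecounts-YYYYMMDD-HHMMSS.gz naming follow the example above):
import re

def tag_lines(pair):
    # wholeTextFiles yields (path, whole file content); pull the yyyymmdd out of the path
    # and prepend it to every non-empty line of the file.
    path, content = pair
    m = re.search(r"pagecounts-(\d{8})-", path)
    stamp = m.group(1) if m else "unknown"
    return ["{} {}".format(stamp, line) for line in content.split("\n") if line]

rdd = sc.wholeTextFiles("file:///tmp/*.gz").flatMap(tag_lines)
rdd.take(10)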
