Reading multiple parquet files from S3 Bucket [duplicate] - apache-spark

I need to read parquet files from multiple paths that are not parent or child directories.
For example:
dir1 ---
       |
       ------- dir1_1
       |
       ------- dir1_2
dir2 ---
       |
       ------- dir2_1
       |
       ------- dir2_2
sqlContext.read.parquet(dir1) reads parquet files from dir1_1 and dir1_2
Right now I'm reading each dir and merging dataframes using "unionAll".
Is there a way to read parquet files from dir1_2 and dir2_1 without using unionAll, or is there some fancier way of using unionAll?
Thanks

A little late, but I found this while I was searching and it may help someone else...
You might also try unpacking the argument list to spark.read.parquet():
paths = ['foo', 'bar']
df = spark.read.parquet(*paths)
This is convenient if you want to pass a few blobs into the path argument:
basePath='s3://bucket/'
paths=['s3://bucket/partition_value1=*/partition_value2=2017-04-*',
's3://bucket/partition_value1=*/partition_value2=2017-05-*'
]
df=spark.read.option("basePath",basePath).parquet(*paths)
This is cool because you don't need to list all the files under the basePath, and you still get partition inference.
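For example (a small sketch, assuming the layout implied by the globs above), the inferred partition columns can be used right away:
# partition_value1 and partition_value2 are inferred from the directory names
df.printSchema()
df.groupBy("partition_value1", "partition_value2").count().show()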

Both the parquetFile method of SQLContext and the parquet method of DataFrameReader take multiple paths. So either of these works:
df = sqlContext.parquetFile('/dir1/dir1_2', '/dir2/dir2_1')
or
df = sqlContext.read.parquet('/dir1/dir1_2', '/dir2/dir2_1')

In case you have a list of files you can do:
files = ['file1', 'file2',...]
df = spark.read.parquet(*files)

For ORC
spark.read.orc("/dir1/*","/dir2/*")
Spark goes inside the dir1/ and dir2/ folders and loads all the ORC files.
For Parquet:
spark.read.parquet("/dir1/*","/dir2/*")

Just taking John Conley's answer and embellishing it a bit, providing the full code (used in Jupyter PySpark), as I found his answer extremely useful.
from hdfs import InsecureClient
client = InsecureClient('http://localhost:50070')
import posixpath as psp
fpaths = [
    psp.join("hdfs://localhost:9000" + dpath, fname)
    for dpath, _, fnames in client.walk('/eta/myHdfsPath')
    for fname in fnames
]
# At this point fpaths contains all HDFS files
parquetFile = sqlContext.read.parquet(*fpaths)
import pandas
pdf = parquetFile.toPandas()
# display the contents nicely formatted.
pdf

In Spark-Scala you can do this.
val df = spark.read.option("header","true").option("basePath", "s3://bucket/").csv("s3://bucket/{sub-dir1,sub-dir2}/")
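The same curly-brace glob works per path component for parquet in PySpark as well; a small sketch, assuming the dir1/dir2 layout from the question under a hypothetical s3://bucket prefix:
# Matches only the existing combinations dir1/dir1_2 and dir2/dir2_1, no union needed
df = spark.read.parquet("s3://bucket/{dir1,dir2}/{dir1_2,dir2_1}/")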

Related

Filter threshold size data to read using Pyspark or Python

I have n ORC files in a path; around 150 of them are null or of incomplete size, and I want to ignore all of those while reading through pyspark.
I have written the following, but I need some help as it's not working.
import os

path = "/home/data/raw_data/"
file_list = os.listdir(path)
for file in file_list:
    size = os.path.getsize(os.path.join(path, file))
    if size > 6500:  # want to import only files greater than 6.5 Mb
        file_list.append(size)
raw_df = spark.read.format("orc").load(path)
The issues in your code above are that file_list.append(size) is not required, and the Spark read should happen inside the loop:
import os
from functools import reduce
from pyspark.sql import DataFrame

df_list = []
path = "/home/data/raw_data/"
file_list = os.listdir(path)
for file in file_list:
    size = os.path.getsize(os.path.join(path, file))
    if size > 6500:
        raw_df = spark.read.format("orc").load(path + file)
        df_list.append(raw_df)
df_fnl = reduce(DataFrame.unionByName, df_list)
The number of files can be very large, making the loop inefficient.
An alternative is to load all the files and then keep only the rows coming from the files you need.
You can see which file each row came from with the input_file_name() function.
If you build a helper dataframe with all the filenames you need, you can inner join it on input_file_name, so only entries from the required files are kept.
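A rough sketch of that idea in PySpark (the local path, the 6500-byte threshold, and the file:// prefix are assumptions carried over from the question, not a tested recipe):
import os
from pyspark.sql import functions as F

path = "/home/data/raw_data/"
# Helper dataframe with the fully qualified names of the files to keep.
# input_file_name() returns a full URI, so the prefix must match (assumed file:// here).
wanted = [("file://" + os.path.join(path, f),)
          for f in os.listdir(path)
          if os.path.getsize(os.path.join(path, f)) > 6500]
files_df = spark.createDataFrame(wanted, ["file_name"])

raw_df = spark.read.format("orc").load(path)
filtered_df = (raw_df
               .withColumn("file_name", F.input_file_name())
               .join(files_df, on="file_name", how="inner"))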

Spark s3 csv files read order

Let's say there are three files in an S3 folder; does reading them through spark.read.csv(s3:bucketname/folder1/*.csv) read the files in order or not?
If not, is there a way to order the files while reading the whole folder, when multiple files are received at different time intervals?
File name                          s3 file uploaded / Last modified time
s3:bucketname/folder1/file1.csv    01:00:00
s3:bucketname/folder1/file2.csv    01:10:00
s3:bucketname/folder1/file3.csv    01:20:00
You can achieve this using the following approach.
Iterate over all the files in the bucket and load each CSV, adding a new column with the last-modified time. Keep all the dataframes that will be loaded in dfs_list. Since pyspark does lazy evaluation, it will not load the data instantly.
import boto3
from pyspark.sql.functions import lit

s3 = boto3.resource('s3')
my_bucket = s3.Bucket('bucketname')
dfs_list = []
for file_object in my_bucket.objects.filter(Prefix="folder1/"):
    df = (spark.read.csv('s3a://' + my_bucket.name + '/' + file_object.key, header=True)
          .withColumn("modified_date", lit(str(file_object.last_modified))))
    dfs_list.append(df)
Now take the union of all the dataframes using pyspark's unionAll function, and then sort the data by modified_date.
from functools import reduce
from pyspark.sql import DataFrame
df_combined = reduce(DataFrame.unionAll, dfs_list)
df_combined = df_combined.orderBy('modified_date')

How to iterate in Databricks to read hundreds of files stored in different subdirectories in a Data Lake?

I have to read hundreds of avro files in Databricks from an Azure Data Lake Gen2, extract data from the Body field inside every file, and concatenate all the extracted data in a unique dataframe. The point is that all avro files to read are stored in different subdirectories in the lake, following the pattern:
root/YYYY/MM/DD/HH/mm/ss.avro
This forces me to loop the ingestion and selection of data. I'm using this Python code, in which list_avro_files is the list of paths to all files:
from functools import reduce
from pyspark.sql import DataFrame

list_data = []
for file_avro in list_avro_files:
    df = spark.read.format('avro').load(file_avro)
    data1 = spark.read.json(df.select(df.Body.cast('string')).rdd.map(lambda x: x[0]))
    list_data.append(data1)
data = reduce(DataFrame.unionAll, list_data)
Is there any way to do this more efficiently? How can I parallelize/speed up this process?
As long as your list_avro_files can be expressed through standard wildcard syntax, you can probably use Spark's own ability to parallelize the read operation. All you'd need is to specify a basePath and a filename pattern for your avro files:
scala> var df = spark.read
           .option("basePath", "/user/hive/warehouse/root")
           .format("avro")
           .load("/user/hive/warehouse/root/*/*/*/*/*/*.avro")
And, in case you find that you need to know exactly which file any given row came from, use input_file_name() built-in function to enrich your dataframe:
scala> df = df.withColumn("source",input_file_name())
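To come back to the question's goal of extracting the JSON in Body, the per-file loop can then collapse into a single pass over the wildcard-loaded dataframe; a PySpark sketch, with the path pattern assumed from the layout above and the Body handling taken from the question:
from pyspark.sql.functions import input_file_name

# One wildcard read instead of a loop; "source" records which file each row came from.
df = (spark.read.format("avro")
      .load("/user/hive/warehouse/root/*/*/*/*/*/*.avro")
      .withColumn("source", input_file_name()))
data = spark.read.json(df.select(df.Body.cast("string")).rdd.map(lambda x: x[0]))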

Rename written CSV file Spark

I'm running spark 2.1 and I want to write a csv with results into Amazon S3.
After repartitioning, the csv file has a rather long, cryptic name and I want to change that to a specific filename.
I'm using the databricks lib for writing into S3.
dataframe
.repartition(1)
.write
.format("com.databricks.spark.csv")
.option("header", "true")
.save("folder/dataframe/")
Is there a way to rename the file afterwards, or even save it directly with the correct name? I've already looked for solutions and haven't found much.
Thanks
You can use the code below to rename the output file.
dataframe.repartition(1).write.format("com.databricks.spark.csv").option("header", "true").save("folder/dataframe/")
import org.apache.hadoop.fs._
val fs = FileSystem.get(sc.hadoopConfiguration)
val filePath = "folder/dataframe/"
val fileName = fs.globStatus(new Path(filePath+"part*"))(0).getPath.getName
fs.rename(new Path(filePath+fileName), new Path(filePath+"file.csv"))
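If you are on PySpark rather than Scala, roughly the same rename can be done through the JVM gateway; this sketch relies on Spark's internal _jvm/_jsc handles (not a public API), so treat it as an assumption:
# Rename the single part file written above, from PySpark.
sc = spark.sparkContext
Path = sc._jvm.org.apache.hadoop.fs.Path
fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())

file_path = "folder/dataframe/"
part_file = fs.globStatus(Path(file_path + "part*"))[0].getPath().getName()
fs.rename(Path(file_path + part_file), Path(file_path + "file.csv"))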
The code you mentioned here returns Unit. You would need to confirm that your Spark application has completed its run (assuming this is a batch job) and then rename:
dataframe
.repartition(1)
.write
.format("com.databricks.spark.csv")
.option("header", "true")
.save("folder/dataframe/")
You can rename the part files to any specific name using the dbutils commands. Use the code below to rename the part-generated CSV file; this works fine for pyspark.
x = 'dbfs:/mnt/source_path/'       # your source path
y = 'dbfs:/mnt/destination_path/'  # your destination path
files = dbutils.fs.ls(x)

# moving or renaming the part-generated CSV files to a normal or specific name
i = 0
for file in files:
    print(file.name)
    i = i + 1
    if file.name[-4:] == '.csv':  # you can use any file extension like parquet, JSON, etc.
        dbutils.fs.mv(x + file.name, y + 'OutputData-' + str(i) + '.csv')  # you can provide any specific name here

dbutils.fs.rm(x, True)  # later remove the source path after renaming all the part-generated files, if you want

How to import multiple csv files in a single load?

Consider I have a defined schema for loading 10 csv files in a folder. Is there a way to automatically load the tables using Spark SQL? I know this can be done by creating an individual dataframe for each file [given below], but can it be automated with a single command? Rather than pointing to a file, can I point to a folder?
df = (sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .load("../Downloads/2008.csv"))
Use wildcard, e.g. replace 2008 with *:
df = (sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .load("../Downloads/*.csv"))  # <-- note the star (*)
Spark 2.0
// these lines are equivalent in Spark 2.0
spark.read.format("csv").option("header", "true").load("../Downloads/*.csv")
spark.read.option("header", "true").csv("../Downloads/*.csv")
Notes:
Replace format("com.databricks.spark.csv") with format("csv") or the csv method; the com.databricks.spark.csv format has been integrated into Spark 2.0.
Use spark, not sqlContext.
Ex1:
Reading a single CSV file. Provide the complete file path:
val df = spark.read.option("header", "true").csv("C:\\spark\\sample_data\\tmp\\cars1.csv")
Ex2:
Reading multiple CSV files, passing names:
val df = spark.read.option("header", "true").csv("C:\\spark\\sample_data\\tmp\\cars1.csv", "C:\\spark\\sample_data\\tmp\\cars2.csv")
Ex3:
Reading multiple CSV files, passing a list of names:
val paths = List("C:\\spark\\sample_data\\tmp\\cars1.csv", "C:\\spark\\sample_data\\tmp\\cars2.csv")
val df = spark.read.option("header", "true").csv(paths: _*)
Ex4:
Reading multiple CSV files in a folder, ignoring other files:
val df = spark.read.option("header", "true").csv("C:\\spark\\sample_data\\tmp\\*.csv")
Ex5:
Reading multiple CSV files from multiple folders:
val folders = List("C:\\spark\\sample_data\\tmp", "C:\\spark\\sample_data\\tmp1")
val df = spark.read.option("header", "true").csv(folders: _*)
Note that you can use other tricks like:
-- one or more wildcards:
.../Downloads20*/*.csv
-- braces and brackets:
.../Downloads201[1-5]/book.csv
.../Downloads201{11,15,19,99}/book.csv
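For example (illustrative paths only), those patterns go straight into the load path:
# character-class pattern
df1 = spark.read.option("header", "true").csv("../Downloads201[1-5]/book.csv")
# brace alternatives
df2 = spark.read.option("header", "true").csv("../Downloads201{11,15,19,99}/book.csv")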
Reader's Digest: (Spark 2.x)
For Example, if you have 3 directories holding csv files:
dir1, dir2, dir3
You then define paths as a comma-delimited string of paths, as follows:
paths = "dir1/*,dir2/*,dir3/*"
Then use the following function and pass it this paths variable
def get_df_from_csv_paths(paths):
    df = spark.read.format("csv").option("header", "false").\
        schema(custom_schema).\
        option('delimiter', '\t').\
        option('mode', 'DROPMALFORMED').\
        load(paths.split(','))
    return df
By then running:
df = get_df_from_csv_paths(paths)
You will obtain in df a single spark dataframe containing the data from all the csvs found in these 3 directories.
===========================================================================
Full Version:
In case you want to ingest multiple CSVs from multiple directories you simply need to pass a list and use wildcards.
For Example:
if your data_path looks like this:
's3://bucket_name/subbucket_name/2016-09-*/184/*,
s3://bucket_name/subbucket_name/2016-10-*/184/*,
s3://bucket_name/subbucket_name/2016-11-*/184/*,
s3://bucket_name/subbucket_name/2016-12-*/184/*, ... '
you can use the above function to ingest all the csvs in all these directories and subdirectories at once:
This would ingest all directories in s3 bucket_name/subbucket_name/ according to the wildcard patterns specified. e.g. the first pattern would look in
bucket_name/subbucket_name/
for all directories with names starting with
2016-09-
and for each of those take only the directory named
184
and within that subdirectory look for all csv files.
And this would be executed for each of the patterns in the comma delimited list.
This works way better than union..
Using Spark 2.0+, we can load multiple CSV files from different directories using
df = spark.read.csv(['directory_1','directory_2','directory_3'.....], header=True). For more information, refer to the documentation
here
val df = spark.read.option("header", "true").csv("C:\\spark\\sample_data\\*.csv")
will consider files from tmp, tmp1, tmp2, ....
