I have S3 folders as below, each with parquet files:
s3://bucket/folder1/folder2/2020-02-26-12/key=Boston_20200226/
s3://bucket/folder1/folder2/2020-02-26-12/key=Springfield_20200223/
s3://bucket/folder1/folder2/2020-02-26-12/key=Toledo_20200226/
s3://bucket/folder1/folder2/2020-02-26-12/key=Philadelphia_20191203/
My goal is to be able to open the parquet files from '*_20200226' folders only.
I use a loop to first gather a list of path globs and then pass it to the read operation to build a DataFrame in Spark 2.4.
s3_files = []
PREFIX = "folder1/folder2/"
min_datetime = current_datetime - timedelta(hours=72)
while current_datetime >= min_datetime:
    each_hour_prefix = min_datetime.strftime('%Y-%m-%d-%H')
    if any(fname.key.endswith('.parquet') for fname in s3_bucket.objects.filter(Prefix=(PREFIX + each_hour_prefix))):
        s3_files.append('s3://{bucket}/{prefix}'.format(bucket=INPUT_BUCKET_NAME, prefix=(PREFIX + each_hour_prefix + '/*')))
    min_datetime = min_datetime + timedelta(hours=1)

spark.read.option('basePath', ('s3://' + INPUT_BUCKET_NAME)).schema(fileSchema).parquet(*s3_files)
where fileSchema is the schema struct of the parquet files and s3_files is an array of path globs I picked up by walking the S3 folders above. The loop above works, but my goal is to read the Boston_20200226 and Toledo_20200226 folders only. Is it possible to do wildcard searches like "folder1/folder2/2020-02-26-12/key=**_20200226*", or perhaps to change the 'read.parquet' command in some way?
Thanks in advance.
Update:
I resorted to a rudimentary approach: walking through all the folders and keeping only files that match the pattern '20200226' (not the most efficient way). I collect the keys in a list, read each parquet file into a DataFrame, and perform a union at the end. Everything works fine except that the 'key' column does not appear in the final DataFrame. It comes from the partitionBy() call that created these parquet files. Any idea how I can capture the 'key'?
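For reference, a minimal sketch of the glob-style read being asked about, using the example bucket and prefix from the listing above (the exact pattern is an assumption, and .schema(fileSchema) is omitted for brevity). Spark resolves read paths with Hadoop-style globbing (*, ?, [...], {...}), and pointing basePath at the directory above the key=... folders lets partition discovery add the key column back to the DataFrame:
# Hedged sketch: read only the key=*_20200226 partitions via path globbing.
# Bucket and prefix names are the examples from the question.
base_path = "s3://bucket/folder1/folder2/"

df = (spark.read
      .option("basePath", base_path)   # partition discovery keeps 'key' as a column
      .parquet(base_path + "2020-02-26-12/key=*_20200226/"))

df.select("key").distinct().show()     # expect Boston_20200226 and Toledo_20200226
The same idea extends across the hourly folders with a pattern such as base_path + "*/key=*_20200226/".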
I am trying to use the COPY INTO statement in Databricks to ingest CSV files from Cloud Storage.
The problem is that the folder name has a space in it (/AP Posted/), and when I try to refer to the path, the code execution raises the error below:
Error in SQL statement: URISyntaxException: Illegal character in path at index 70: abfss://gpdi-files#hgbsprodgbsflastorage01.dfs.core.windows.net/RAW/AP Posted/
I googled the error and found articles advising me to replace the space with "%20", but that solution did not work.
So, does anyone know how to solve this? Or is the only solution really to avoid spaces in folder names?
This is my current Databricks SQL Code:
COPY INTO prod_gbs_gpdi.bronze_data.my_table
FROM 'abfss://gpdi-files#hgbsprodgbsflastorage01.dfs.core.windows.net/RAW/AP Posted/'
FILEFORMAT = CSV
VALIDATE 500 ROWS
PATTERN = 'AP_SAPEX_KPI_001 - Posted Invoices in 2021_3.CSV'
FORMAT_OPTIONS(
  'header'='true',
  'delimiter'=';',
  'skipRows'='8',
  'mergeSchema'='true', --Whether to infer the schema across multiple files and to merge the schema of each file
  'encoding'='UTF-8',
  'enforceSchema'='true', --Whether to forcibly apply the specified or inferred schema to the CSV files
  'ignoreLeadingWhiteSpace'='true',
  'ignoreTrailingWhiteSpace'='true',
  'mode'='PERMISSIVE' --Parser mode around handling malformed records
)
COPY_OPTIONS (
  'force' = 'true', --If set to true, idempotency is disabled and files are loaded regardless of whether they’ve been loaded before.
  'mergeSchema'= 'true' --If set to true, the schema can be evolved according to the incoming data.
)
Trying to use a path where one of the folders has a space gave the same error.
To overcome this, you can specify the folder in the PATTERN parameter as follows:
%sql
COPY INTO table1
FROM '/mnt/repro/op/'
FILEFORMAT = csv
PATTERN='has space/sample1.csv'
FORMAT_OPTIONS ('mergeSchema' = 'true','header'='true')
COPY_OPTIONS ('mergeSchema' = 'true');
Or, giving the path as path/has?space/ also works, since ? matches any single character. But if there are multiple folders such as has space, hasAspace, hasBspace, etc., this would not work as expected.
%sql
COPY INTO table2
FROM '/mnt/repro/op/has?space/'
FILEFORMAT = csv
PATTERN='sample1.csv'
FORMAT_OPTIONS ('mergeSchema' = 'true','header'='true')
COPY_OPTIONS ('mergeSchema' = 'true');
Another alternative is to copy the file to DBFS using dbutils.fs.cp() and then use the DBFS path in COPY INTO.
dbutils.fs.cp('/mnt/repro/op/has space/sample1.csv','/FileStore/tables/mycsv.csv')
%sql
COPY INTO table3
FROM '/FileStore/tables/'
FILEFORMAT = csv
PATTERN='mycsv.csv'
FORMAT_OPTIONS ('mergeSchema' = 'true','header'='true')
COPY_OPTIONS ('mergeSchema' = 'true');
I am storing Excel files in Azure Data Lake (Gen1). The filenames follow the same pattern, e.g. "2021-06-18T09_00_07ONR_Usage_Dataset", "2021-06-18T09_00_07DSS_Usage_Dataset", etc., depending on the date and time. I want to read all the files in a folder in Azure Data Lake into Databricks without having to name a specific file, so that in the future new files are picked up and appended to make one big data set. The files all have the same schema, columns in the same order, etc.
So far I have tried for loops with regex expressions:
path = dbutils.fs.ls('/mnt/adls/40_project/UBC/WIP/Mercury/UUR_PS_raw_temp/')
for fi in path:
    print(fi)
    read = spark.read.format("com.crealytics.spark.excel").option("header", "True").option("inferSchema", "true").option("dataAddress", "'Usage Dataset'!A2").load(fi.path)
    display(read)
    print(read.count())
The output prints all the paths and the count of each dataset being read, but it only displays the last one. I understand this is because I'm not storing or appending inside the for loop, but when I add an append it breaks.
appended_data = []
path = dbutils.fs.ls('/mnt/adls/40_project/UBC/WIP/Mercury/UUR_PS_raw_temp/')
for fi in path:
    print(fi)
    read = spark.read.format("com.crealytics.spark.excel").option("header", "True").option("inferSchema", "true").option("dataAddress", "'Usage Dataset'!A2").load(fi.path)
    display(read)
    print(read.count())
    appended_data.append(read)
But I get this error:
FileInfo(path='dbfs:/mnt/adls/40_project/UBC/WIP/Mercury/UUR_PS_raw_temp/Initialization_DSS.xlsx', name='Initialization_DSS.xlsx', size=39781)
TypeError: not supported type: <class 'py4j.java_gateway.JavaObject'>
The final way I tried:
li = []
for f in glob.glob('/mnt/adls/40_project/UBC/WIP/Mercury/UUR_PS_raw_temp/*_Usage_Dataset.xlsx'):
    df = pd.read_xlsx(f)
    li.append(df)
frame = pd.concat(li, axis=0, ignore_index=True)
This says that there are no objects to concatenate. I have been researching everywhere and trying everything. Please help.
If you want to use pandas to read Excel files in Databricks, the path should be like /dbfs/mnt/....
For example
import os
import glob
import pandas as pd
li = []
os.chdir(r'/dbfs/mnt/<mount-name>/<>')
allFiles = glob.glob("*.xlsx")  # match your Excel files
for file in allFiles:
    df = pd.read_excel(file)
    li.append(df)
frame = pd.concat(li, axis=0, ignore_index=True)
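If you would rather stay in Spark (as in your original loop) instead of pandas, here is a hedged sketch of combining the per-file DataFrames into one, reusing the com.crealytics.spark.excel options and mount path from the question:
from functools import reduce
from pyspark.sql import DataFrame

# Hedged sketch: read each matching .xlsx into a Spark DataFrame and union them.
# The mount path and spark-excel options are taken from the question.
dfs = []
for fi in dbutils.fs.ls('/mnt/adls/40_project/UBC/WIP/Mercury/UUR_PS_raw_temp/'):
    if fi.name.endswith('_Usage_Dataset.xlsx'):
        df = (spark.read.format("com.crealytics.spark.excel")
              .option("header", "true")
              .option("inferSchema", "true")
              .option("dataAddress", "'Usage Dataset'!A2")
              .load(fi.path))
        dfs.append(df)

# All files share the same schema, so stack them into one big DataFrame.
big_df = reduce(DataFrame.unionByName, dfs)
display(big_df)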
I am trying to save a filtered DataFrame back to the same source file.
I wrote the code below to transform the content of each file in a directory into a separate DataFrame, filter it, and save it back to the same file:
rdd = sparkSession.sparkContext.wholeTextFiles("/content/sample_data/test_data")
# collect the RDD to a list
list_elements = rdd.collect()
for element in list_elements:
    path, data = element
    df = spark.read.json(spark.sparkContext.parallelize([data]))
    df = df.filter('d != 721')
    df.write.save(path, format="json", mode="overwrite")
I was expecting it to overwrite each file with the updated data, but instead it creates a folder named after the file and writes part files inside it.
How can I save each updated DataFrame back to its same source file (.txt)?
Thanks in Advance.
To save it as one file, use .coalesce(1) or .repartition(1) before .save(). That will still produce the same folder-like structure, but there will be only one JSON part file inside.
To save it with a "normal" name, after saving you need to move that single JSON file out of the folder and rename it to the desired name. You can see what the code could look like for CSV files here.
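As a rough sketch of that move-and-rename step, assuming local file paths like the /content/... directory in the question (on HDFS or DBFS you would use the corresponding filesystem API instead):
import glob
import os
import shutil

def write_single_json(df, target_file):
    """Write df as a single JSON part file, then rename it to target_file."""
    tmp_dir = target_file + "_tmp"
    df.coalesce(1).write.save(tmp_dir, format="json", mode="overwrite")

    # Spark names the single output file part-00000-*.json inside tmp_dir.
    part_file = glob.glob(os.path.join(tmp_dir, "part-*.json"))[0]
    shutil.move(part_file, target_file)
    shutil.rmtree(tmp_dir)  # drop the leftover folder (_SUCCESS, .crc files)
In the loop from the question you would then call write_single_json(df, path) in place of df.write.save(path, format="json", mode="overwrite"), bearing in mind that paths returned by wholeTextFiles may carry a file: scheme prefix that needs stripping before the standard-library calls.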
I have many (relatively small) AVRO files with different schemas, each set in its own location like this:
Object Name: A
/path/to/A
A_1.avro
A_2.avro
...
A_N.avro
Object Name: B
/path/to/B
B_1.avro
B_2.avro
...
B_N.avro
Object Name: C
/path/to/C
C_1.avro
C_2.avro
...
C_N.avro
...
and my goal is to read them in parallel via Spark and store each row as a blob in one column of the output. As a result my output data will have a consistent schema, something like the following columns:
ID, objectName, RecordDate, Data
Where the 'Data' field contains a string JSON of the original record.
My initial thought was to put the spark read statements in a loop, create the fields shown above for each dataframe, and then apply a union operation to get my final dataframe, like this:
all_df = []
for obj_name in all_object_names:
    file_path = get_file_path(obj_name)
    df = spark.read.format(DATABRIKS_FORMAT).load(file_path)
    all_df.append(df)

df_o = all_df[0]
for df in all_df[1:]:
    df_o = df_o.union(df)
# write df_o to the output
However, I'm not sure whether the read operations will be parallelized.
I also came across the sc.textFile() function to read all the AVRO files in one shot as strings, but I couldn't make it work.
So I have 2 questions:
Would the multiple read statements in a loop be parallelized by Spark? Or is there a more efficient way to achieve this?
Can sc.textFile() be used to read the AVRO files as a string JSON in one column?
I'd appreciate your thoughts and suggestions.
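For illustration, a hedged sketch of the per-object transform described above (the "create the fields shown above for each dataframe" step), assuming hypothetical id and record_date column names in the source data; to_json(struct(...)) packs the whole original record into the string Data column:
from pyspark.sql import functions as F

def to_blob_df(df, object_name):
    # 'id' and 'record_date' are hypothetical source column names; adjust to the
    # real fields in each AVRO schema.
    return df.select(
        F.col("id").alias("ID"),
        F.lit(object_name).alias("objectName"),
        F.col("record_date").alias("RecordDate"),
        F.to_json(F.struct(*df.columns)).alias("Data"),  # whole record as a JSON string
    )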
Currently, when I use partitionBy() to write to HDFS:
DF.write.partitionBy("id")
I will get output structure looking like (which is the default behaviour):
../id=1/
../id=2/
../id=3/
I would like a structure looking like:
../a/
../b/
../c/
such that:
if id = 1, then a
if id = 2, then b
.. etc
Is there a way to change the filename output? If not, what is the best way to do this?
You won't be able to use Spark's partitionBy to achieve this.
Instead, you have to break your DataFrame into its component partitions, and save them one by one, like so:
base = ord('a') - 1
for id in range(1, 4):
    DF.filter(DF['id'] == id).write.save("..." + chr(base + id))
Alternatively, you can write the entire DataFrame using Spark's partitionBy facility and then manually rename the partition directories using the HDFS APIs.
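For the rename route, here is a hedged sketch using the JVM Hadoop FileSystem API that PySpark exposes through py4j; /data/out is a hypothetical output directory produced by DF.write.partitionBy("id"):
# Hedged sketch: rename id=1 -> a, id=2 -> b, id=3 -> c under a hypothetical output dir.
base_dir = "/data/out"  # assumed output path of DF.write.partitionBy("id")

jvm = spark.sparkContext._jvm
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
Path = jvm.org.apache.hadoop.fs.Path
fs = Path(base_dir).getFileSystem(hadoop_conf)

for i in range(1, 4):
    src = Path("{}/id={}".format(base_dir, i))
    dst = Path("{}/{}".format(base_dir, chr(ord('a') - 1 + i)))
    if fs.exists(src):
        fs.rename(src, dst)  # returns False if the rename fails
Note that once the directories no longer follow the id=... convention, Spark will not recover id as a partition column when reading the data back.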