Can we exclude or include only particular file extensions from Databricks Autoloader?

Right now Databricks Auto Loader requires a directory path from which all the files will be loaded. But if some other kind of log files also start arriving in that directory, is there a way to tell Auto Loader to exclude those files while preparing the DataFrame?
df = spark.readStream.format("cloudFiles") \
    .option(<cloudFiles-option>, <option-value>) \
    .schema(<schema>) \
    .load(<input-path>)

Auto Loader supports specifying a glob string as <input-path>. From the documentation:
<input-path> can contain file glob patterns
Glob syntax supports different options, such as * for any sequence of characters. So you can specify the input path as, for example, path/*.json. You can exclude files as well; building that pattern can be slightly more complicated than an inclusion pattern, but it's still possible. For example, *.[^l][^o][^g] should exclude files with the .log extension.
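A minimal sketch of both variants (the directory path, file format, and schema below are placeholders, not taken from the question):

# Include only .json files via a glob in the load path; schema is a placeholder StructType
df = (spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "json")
      .schema(schema)
      .load("/mnt/landing/events/*.json"))

# Or exclude .log files with the negated character classes described above
# df = ... .load("/mnt/landing/events/*.[^l][^o][^g]")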

Use pathGlobFilter as one of the options and provide a glob pattern to filter for a file type or for files with a specific name.
For instance, to load only the files named A1.csv, A2.csv, ..., A9.csv from the load location, the value for pathGlobFilter would look like:
df = spark.read.load("/file/load/location",
                     format="csv",
                     schema=schema,
                     pathGlobFilter="A[0-9].csv")

Related

python 3.x: how to combine dictionary zipped log files that exist in two different directories into another directory

I have the below directory structure containing .log.gz files. These log files hold dictionaries serialized as strings. Either of these directories may not exist.
xyz1/
2022-08-08T01:31Z.log.gz
2022-08-08T01:33Z.log.gz
xyz2/
2022-08-08T01:30Z.log.gz
2022-08-08T01:33Z.log.gz
I want to create another directory and combine above files
xyz/
2022-08-08T01:30Z.log.gz
2022-08-08T01:31Z.log.gz
2022-08-08T01:33Z.log.gz
Conditions:
Either of xyz1 and xyz2 may exist, or both can exist.
If a file with the same name exists in both directories, combine the two into a third directory, "xyz".
While combining, the string dictionary format should be retained.
Solution I opted for:
Check whether one directory exists and iterate over its files.
For each file, check whether the same file exists in the other directory.
If yes, decompress both files, combine them, and gzip the result into the xyz directory.
If not, copy the file into xyz as-is.
Is there any better way to perform the above operation? Below is how I combine two log files.
import gzip, json

combinefile = {}
for path in ("xyz1/file1.log.gz", "xyz2/file1.log.gz"):
    with gzip.open(path, "rt") as f:
        combinefile.update(json.load(f))
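A minimal sketch of the whole approach described above (directory names come from the question; it assumes each .log.gz file holds a single JSON object):

import gzip
import json
from pathlib import Path

def combine_logs(src_dirs=("xyz1", "xyz2"), dst_dir="xyz"):
    # Merge same-named .log.gz files from whichever source directories exist.
    out = Path(dst_dir)
    out.mkdir(exist_ok=True)
    names = set()
    for d in src_dirs:
        if Path(d).is_dir():
            names.update(p.name for p in Path(d).glob("*.log.gz"))
    for name in names:
        merged = {}
        for d in src_dirs:
            path = Path(d) / name
            if path.exists():
                with gzip.open(path, "rt") as f:
                    merged.update(json.load(f))  # later directory wins on duplicate keys
        with gzip.open(out / name, "wt") as f:
            json.dump(merged, f)

combine_logs()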

Copy files in subdirs to azure storage with ADF

I have a folder structures like this:
folder1/folder2
/YearNumber1
/monthYear1
/somefile.csv, tbFiles.csv
/monthYear2
/somefile2.csv, tbFiles2.csv
...(many folders as above)
/YearNumber2
/monthYear11
/somefileXXYYZz.csv, otherFile.csv
/monthYear12
/someFileRandom.csv, dedFile.csv
...(many folders as above)
Source:
Binary, linked via fileshare linked service
Destination:
Binary, on azure blob storage
I don't want to retain the structure; I just need to copy all the csv files.
Using CopyActivity:
Wildcard Path: #concat('folder1/folder2/','*/','*/',) / '*.csv'
with recursive
But it copies nothing, 0 Bytes.
You can use the below options in the Copy activity source settings:
1. File path type
Allowed wildcards are: * (matches zero or more characters) and ? (matches zero or a single character); use ^ to escape if your actual folder name has a wildcard or this escape character inside it.
See the official MS docs for more examples under Folder and file filter examples.
wildcardFolderPath - The folder path with wildcard characters under the file system configured in your dataset, used to filter source folders.
wildcardFileName - The file name with wildcard characters under the configured file system + folderPath/wildcardFolderPath, used to filter source files.
2. recursive - When set to true, the data is read recursively from the subfolders.
Example:
For the structure above, wildcardFolderPath would be something like folder1/folder2/*/*; and if there are only .csv files in your source directories, you can simply specify wildcardFileName as just *.

Copy a set of files using ADF

I have 10 files in a folder and want to move 4 of them to a different location.
I tried 2 approaches to achieve this:
using a Lookup to retrieve the filenames from a JSON file, then feeding it to a ForEach iterator
using Get Metadata to get the file names from the source folder and then adding an If Condition inside a ForEach to copy the files.
But in both cases, all the files in the source folder get copied.
Any help would be appreciated.
Thanks!!
There are 3 ways you can consider for selecting your files, depending on the requirement or blockers.
Check out the official MS doc: Copy activity properties
1. Dynamic content for the FilePath property in the source dataset.
2. You can use wildcard characters in the source folder and file path in the source dataset.
Allowed wildcards are: * (matches zero or more characters) and ? (matches zero or a single character); use ^ to escape if your actual folder name has a wildcard or this escape character inside it. See more examples in Folder and file filter examples.
3. List of Files
Point to a text file that includes a list of files you want to copy, one file per line, as relative paths to the path configured in the dataset. When using this option, do not specify a file name in the dataset. See more examples in File list examples.
Example:
Parameterize the source dataset and set the source file name to the one that passes the expression evaluation in the If Condition activity.

How to read multiple CSV (leaving out specific ones) from a nested directory in PySpark?

Let's say I have a directory called 'all_data', and inside it I have several other directories based on the date of the data they contain. These directories are named date_2020_11_01 to date_2020_11_30, and each of them contains csv files which I intend to read into a single dataframe.
But I don't want to read the data for date_2020_11_15 and date_2020_11_16. How do I do it?
I'm not sure how to exclude certain files, but you can specify a range of directory names using brackets. The code below would select all dates except 11_15 and 11_16:
spark.read.csv("all_data/date_2020_11_{1[0-4,7-9],[0,2-3][0-9]}/*.csv")
df = spark.read.format("csv").option("header", "true").load(paths)
where paths is a list of all the paths where the data is present; this worked for me.
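A small sketch of how that paths list could be built for the layout in the question (the all_data prefix and an existing spark session are assumed):

# Generate the wanted date directories, skipping the 15th and 16th
dates = [f"date_2020_11_{d:02d}" for d in range(1, 31) if d not in (15, 16)]
paths = [f"all_data/{name}" for name in dates]
df = spark.read.format("csv").option("header", "true").load(paths)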
A simple method is to read the whole data directory as it is and apply a filter condition:
df.filter("dataColumn NOT IN ('date_2020_11_15', 'date_2020_11_16')")
Otherwise, you can use the os module to read the directory and iterate over that list, eliminating those date directories with a condition.
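A sketch of that os-module variant (it assumes the all_data directory is visible from the driver; the names come from the question):

import os

# List the date directories and drop the two unwanted ones
excluded = {"date_2020_11_15", "date_2020_11_16"}
paths = [os.path.join("all_data", d) for d in os.listdir("all_data") if d not in excluded]
df = spark.read.option("header", "true").csv(paths)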

Spark 2.3 - How to read subdirectories without asterisks?

String folder = "/Users/test/data/*/*";
sparkContext.textFile(folder, 1).toJavaRDD()
Are asterisks mandatory to read a folder? Yes, otherwise it does not read the files in the subdirectories.
What if I get a folder that has more levels of subdirectories than the number of asterisks mentioned? How do I handle this scenario?
For example:
1) /Users/test/data/*/*
This would work ONLY if I get data such as /Users/test/data/folder1/file.txt
2) How do I make this expression generic? It should still work if I get a folder such as /Users/test/data/folder1/folder2/folder3/folder4
My input folder structure is not the same all the time.
Is there anything in Spark to handle this kind of scenario?
On Hadoop you could try sparkContext.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive", "true")
But TBH I don't think this will work in your case.
I would write a small function that returns the nested file structure as a list of paths and pass them to Spark, like:
val filePaths = List("rec1/subrec1.1/", "rec1/subrec1.2/", "rec1/subrec1.1/subsubrec1.1.1/", "rec2/subrec2.1/")
val files = spark.read.text(filePaths: _*)
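Such a helper could look roughly like this in PySpark (the question uses the Java API, but the idea is the same; this assumes a filesystem the driver can walk, such as the local /Users/test/data path above, and an existing spark session; for HDFS or object storage you would list paths with the Hadoop FileSystem API instead):

import os

def list_leaf_files(root):
    # Recursively collect the full path of every file under root, at any nesting depth
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        paths.extend(os.path.join(dirpath, name) for name in filenames)
    return paths

file_paths = list_leaf_files("/Users/test/data")
files = spark.read.text(file_paths)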
