String folder = "/Users/test/data/*/*";
sparkContext.textFile(folder, 1).toJavaRDD()
Are asterisks mandatory to read a folder? Yes, otherwise it does not read the files in the subdirectories.
What if I get a folder that has more levels of subdirectories than the number of asterisks mentioned? How do I handle this scenario?
For example:
1) /Users/test/data/*/*
This would work ONLY if I get data as /Users/test/data/folder1/file.txt
2) How do I make this expression generic? It should still work if I get a folder such as: /Users/test/data/folder1/folder2/folder3/folder4
My input folder structure is not the same all the time.
Does anything exist in Spark to handle this kind of scenario?
On Hadoop you could try sparkContext.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive","true")
But TBH I don't think this will work in your case.
I would write a small function that returns the nested file structure as a list of paths and pass it to Spark, like:
val filePaths = List("rec1/subrec1.1/", "rec1/subrec1.2/", "rec1/subrec1.1/subsubrec1.1.1/", "rec2/subrec2.1/")
val files = spark.read.text(filePaths: _*)
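A minimal sketch of the same idea in PySpark, assuming the local filesystem root from the question and an active SparkSession; the helper name list_files is just a placeholder:

import os
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def list_files(root):
    # Walk the directory tree and collect every file path, however deeply nested.
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        paths.extend(os.path.join(dirpath, name) for name in filenames)
    return paths

# Read every file found under the root, regardless of nesting depth.
df = spark.read.text(list_files("/Users/test/data"))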
I have 10 files in a folder and want to move 4 of them to a different location.
I tried 2 approaches to achieve this:
1. Using a Lookup to retrieve the filenames from a JSON file, then feeding it to a ForEach iterator.
2. Using Get Metadata to get the file names from the source folder and then adding an If Condition inside a ForEach to copy the files.
But in both cases, all the files in the source folder get copied.
Any help would be appreciated.
Thanks!!
There are 3 ways you can consider for selecting your files, depending on the requirement or blockers.
Check out the official MS doc: Copy activity properties
1. Dynamic content for FilePath property in Source Dataset.
2. You can use wildcard characters in the source folder and file path in the source Dataset.
Allowed wildcards are: * (matches zero or more characters) and ? (matches zero or single character); use ^ to escape if your actual folder name has a wildcard or this escape char inside. See more examples in Folder and file filter examples.
3. List of Files
Point to a text file that includes a list of files you want to copy, one file per line, which is the relative path to the path configured in the dataset. When using this option, do not specify a file name in the dataset. See more examples in File list examples.
Example:
Parameterize the source dataset and set the source file name to the one that passes the expression evaluation in the If Condition activity.
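For the List of Files option (3) above, the text file contains nothing but relative paths, one per line. A purely illustrative example (the file names are hypothetical) listing the 4 files to be copied:

subfolder1/file1.csv
subfolder1/file3.csv
subfolder2/file6.csv
subfolder2/file9.csv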
Below is the S3 folder:
s3://bucket-name/20210802-123429/DM/US/2021/08/02/12/test.json
20210802-123429 is the archive job which puts the files there.
What I could achieve:
cred_obj = cred_conn.list_objects_v2(Bucket=cfg.Bucket_Details['extractjson'], Prefix="DM"+'/'+"US"+'/'+self.yr+'/'+self.mth+'/'+self.day+'/'+self.hr+'/')
Problem statement:
But in the above line, I'm not sure how to match the criteria for 20210802 and parse out "test.json".
list_objects_v2 does not support RegEx match. The only way to search is using the prefix. Therefore, you must know the prefix or part of the prefix in order to search.
timestr_arc = todays_dt.strftime("%Y%m%d")
cred_obj = cred_conn.list_objects_v2(Bucket=cfg.Bucket_Details['extractjson'], Prefix="DM"+'/'+"US"+'/'+str(self.timestr_arc))
This restricts the listing to objects for the specific date.
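A minimal sketch of that approach, assuming the key layout from the question (the archive-job folder such as 20210802-123429 starts the key, so the date string is used as the leading prefix) and then filtering the listed keys for test.json in Python; the bucket name is a placeholder:

import boto3
from datetime import datetime

s3 = boto3.client("s3")
bucket = "bucket-name"  # placeholder

# Prefix on today's date so only keys under e.g. "20210802-..." are listed.
date_prefix = datetime.today().strftime("%Y%m%d")
response = s3.list_objects_v2(Bucket=bucket, Prefix=date_prefix)

# list_objects_v2 cannot pattern-match, so filter the returned keys afterwards.
matching_keys = [obj["Key"] for obj in response.get("Contents", [])
                 if obj["Key"].endswith("/test.json")]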
Right now the Databricks Autoloader requires a directory path from which all the files will be loaded. But in case some other kind of log files also start coming into that directory, is there a way to ask Autoloader to exclude those files while preparing the dataframe?
df = spark.readStream.format("cloudFiles") \
.option(<cloudFiles-option>, <option-value>) \
.schema(<schema>) \
.load(<input-path>)
Autoloader supports specification of the glob string as <input-path> - from documentation:
<input-path> can contain file glob patterns
Glob syntax supports different options, like * for any sequence of characters, etc. So you can specify input-path as path/*.json, for example. You can exclude files as well; building that pattern can be slightly more complicated than an inclusion pattern, but it's still possible - for example, *.[^l][^o][^g] should exclude files with a .log extension.
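A minimal sketch of the inclusion-pattern approach, assuming JSON input files; the mount path and the schema variable are placeholders:

# Only *.json files under the directory are picked up; other files
# (e.g. *.log) landing there are ignored by the glob.
df = spark.readStream.format("cloudFiles") \
    .option("cloudFiles.format", "json") \
    .schema(schema) \
    .load("/mnt/raw/events/*.json")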
Use pathGlobFilter as one of the options and provide a glob pattern to filter for a file type or a file with a specific name.
For instance, to load only the files named A1.csv, A2.csv, ..., A9.csv from the load location (skipping everything else), the value for pathGlobFilter will look like:
df = spark.read.load("/file/load/location",
    format="csv",
    schema=schema,
    pathGlobFilter="A[0-9].csv")
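The same option can be combined with the Autoloader stream from the question; a sketch assuming pathGlobFilter is accepted by the streaming reader just as by the batch reader, and that only .json files should be kept (the path and schema are placeholders):

# Keep only *.json files; anything else landing in the directory is ignored.
df = spark.readStream.format("cloudFiles") \
    .option("cloudFiles.format", "json") \
    .option("pathGlobFilter", "*.json") \
    .schema(schema) \
    .load("/mnt/raw/events")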
I'm trying to loop over different country folders that have a fixed subfolder named survey (i.e. Spain/survey, USA/survey).
Where and how do I need to define a wildcard / parameter for the countries so I can loop over all the files in the survey folder?
What is the right wildcard syntax (the equivalent of LIKE 'survey%' in SQL)?
I tried several ways to define it with no success and I would be happy to get some help on this - Thanks!
In case the list of paths is static, you can create a parameter or add it in a SQL database and get that result from a Lookup activity.
Pass the output to a ForEach activity and within the ForEach activity use a Copy activity.
You can parameterize the input dataset to get the file paths, so you need not think of any wildcard characters but can use the actual paths themselves.
Hope this is helpful.
Let's say I have a directory called 'all_data', and inside this, I have several other directories based on the date of the data that they contain. These directories are named date_2020_11_01 to date_2020_11_30 and each one of them contains csv files which I intend to read into a single dataframe.
But I don't want to read the data for date_2020_11_15 and date_2020_11_16. How do I do it?
I'm not sure how to exclude certain files, but you can specify a range of names using brackets. The code below would select all the date directories except 11_15 and 11_16:
spark.read.csv("all_data/date_2020_11_{0[1-9],1[0-4],1[7-9],2[0-9],30}")
df = spark.read.format("parquet").option("header", "true").load(paths)
where paths is a list of all the paths where the data is present; this worked for me.
A simple method is to read the whole data directory as it is and apply a filter condition:
df.filter("dataColumn NOT IN ('date_2020_11_15', 'date_2020_11_16')")
Otherwise you can use the os module to read the directory and iterate over that list to eliminate those date directories with a condition, as in the sketch below.
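A minimal sketch of that second approach, assuming the directory layout from the question and an active SparkSession (the exclusion set is illustrative):

import os

root = "all_data"
excluded = {"date_2020_11_15", "date_2020_11_16"}

# Keep every date directory except the excluded ones and read them together.
paths = [os.path.join(root, name) for name in os.listdir(root)
         if name.startswith("date_") and name not in excluded]
df = spark.read.csv(paths, header=True)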