Spark glob filter to match a specific nested partition - apache-spark

I'm using PySpark, but I guess this is valid for Scala as well.
My data is stored on S3 in the following structure:
main_folder
└── year=2022
    └── month=03
        ├── day=01
        │   ├── valid=false
        │   │   └── example1.parquet
        │   └── valid=true
        │       └── example2.parquet
        └── day=02
            ├── valid=false
            │   └── example3.parquet
            └── valid=true
                └── example4.parquet
(For simplicity there is only one file in each folder and only two days; in reality there can be thousands of files and many days/months/years.)
The files under the valid=true and valid=false partitions have completely different schemas, and I only want to read the files in the valid=true partition.
I tried using the glob filter, but it fails with AnalysisException: Unable to infer schema for Parquet. It must be specified manually., which is a symptom of having no data (so no files matched):
spark.read.parquet('s3://main_folder', pathGlobFilter='*valid=true*')
I noticed that something like this works:
spark.read.parquet('s3://main_folder', pathGlobFilter='*example4*')
However, as soon as I try to use a slash or match anything above the bottom level, it fails:
spark.read.parquet('s3://main_folder', pathGlobFilter='*/example4*')
spark.read.parquet('s3://main_folder', pathGlobFilter='*valid=true*example4*')
I also tried replacing the * with ** in all locations, but it didn't work.

pathGlobFilter seems to work only on the final file name. For subdirectories you can try the approach below, though note it may skip partition discovery; to keep partition discovery, add the basePath property as a load option:
spark.read.format("parquet")\
     .option("basePath", "s3://main_folder")\
     .load("s3://main_folder/*/*/*/valid=true/*")
However, I am not sure whether you can combine wildcarding and pathGlobFilter if you want to match on both subdirectories and end filenames.
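If you do want to try combining the two, here is a minimal, untested sketch (assuming Spark 3.x, where pathGlobFilter is applied only to the leaf file name, not to directories):
# Sketch only: the directory-level wildcards select the valid=true partitions,
# while pathGlobFilter additionally restricts the leaf file names.
df = (spark.read.format("parquet")
      .option("basePath", "s3://main_folder")      # keep partition discovery
      .option("pathGlobFilter", "*.parquet")       # leaf file-name filter only
      .load("s3://main_folder/*/*/*/valid=true/*"))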
Reference:
https://simplernerd.com/java-spark-read-multiple-files-with-glob/
https://spark.apache.org/docs/latest/sql-data-sources-parquet.html

Related

What does the "/" mean in VSCode file formatting?

I have started using VSCode and was wondering what the / meant when I click on a file (see attached screenshot). Is it simply the full path of the file that I've clicked on? Thanks!
It means a subfolder inside the main folder you have created.
I think you may be referring to the feature where a folder with a single subfolder is shown on the same line.
For example, let's consider the following structure:
X/
├─ a/
│  ├─ a1/
│  │  └─ a1.txt
│  └─ a2/
│     └─ a2.txt
└─ b/
   └─ b1/
      └─ b1.txt
Since b1 is the only folder under b, the explorer window will show b and b1 on a single line.
In your screenshot, eth-sig-util is a child folder of the metamask folder, which is its parent. To access the child we use '/' after the parent folder.

Is "root" the part of "directory"?

Analyzing a path, Node.js considers the root to be part of the directory:
/home/user/dir/file.txt
┌─────────────────────┬────────────┐
│          dir        │    base    │
├──────┬              ├──────┬─────┤
│ root │              │ name │ ext │
" /    home/user/dir / file  .txt "
└──────┴──────────────┴──────┴─────┘
C:\\path\\dir\\file.txt
┌─────────────────────┬────────────┐
│          dir        │    base    │
├──────┬              ├──────┬─────┤
│ root │              │ name │ ext │
" C:\      path\dir   \ file  .txt "
└──────┴──────────────┴──────┴─────┘
Is it actually so? While developing a new library, I am wondering whether I must consider the root to be part of the directory or not.
Well, actually "directory" is a fuzzy term. According to the definition,
In computing, a directory is a file system cataloging structure which
contains references to other computer files, and possibly other
directories.
Wikipedia
Nothing there answers my question. What else do we know?
When we use the cd (abbreviation of "change directory") command, we specify a path relative to the current location. Nothing there involves the root.
The cd command works inside a specific drive (at least on Windows). This indirectly suggests that the root and the directory can be combined, but are initially separate.
More exact terms are the "absolute path of the directory" and the "relative path of the directory". But what is the "directory" itself?
In the Windows case, the drive name can differ between computers, but that does not affect the file structure inside the drive. Again, the root and the directory are separate in this case.
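For comparison only, and purely as an illustrative sketch in Python rather than Node.js: Python's standard library likewise keeps the root inside the directory component while also exposing it separately:
import os
from pathlib import PurePosixPath

p = "/home/user/dir/file.txt"
print(os.path.dirname(p))        # /home/user/dir  (the root is included in the directory part)
print(PurePosixPath(p).parent)   # /home/user/dir
print(PurePosixPath(p).anchor)   # /  (the root is also available on its own)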

Terraform modules: correct references of variables?

I'm writing a terraform script to create an EKS cluster with its worker nodes on AWS. First time doing it so I'm a bit confused.
Here is the folder organisation:
├─── Int AWS Account
│    ├─── variables.tf
│    ├─── eks-cluster.tf (refers to the modules)
│    ├─── others
│
├─── Prod AWS Account
│    ├─── (will be the same as Int with different settings in variables)
│
├─── ReadMe.md
│
├─── data sources
│
├─── Modules
│    ├─── cluster.tf
│    ├─── worker-nodes.tf
│    ├─── worker-nodes-sg.tf
I am a bit confused about how to use and pass variables. Right now, I refer to ${var.name} in the module folder; in eks-cluster.tf I either put a direct value (name = blabla, which I mostly avoid) or refer to the variable again and keep a variables file in the account folder.
Is that correct?
I'm not sure if I understand your question correctly, but in general you would want to keep your module files parameterized with variables only, as modules are intended to be generic so you can easily include them in different environments.
When including the module in eks_cluster_int.tf or eks_cluster_prod.tf you would then pass the values for all variables defined in the module itself. This way you can use the environment specific values in the same module.
module "cluster" {
source = "..."
var1 = value1 # directly passing value
var2 = ${var.int_specific_var} # can be defined in variables.tf of environment
...
}
Does this answer your question?

Write dataframe to path outside current directory with a function?

I got a question that relates to (maybe is a duplicate of) this question here.
I am trying to write a pandas dataframe to an Excel file (which does not exist beforehand) at a given path. Since I have to do this quite a few times, I am wrapping it in a function. Here is what I do:
import pandas as pd

df = pd.DataFrame({'Data': [10, 20, 30, 20, 15, 30, 45]})

def excel_to_path(frame, path):
    writer = pd.ExcelWriter(path, engine='xlsxwriter')
    frame.to_excel(writer, sheet_name='Output')
    writer.save()

excel_to_path(df, "../foo/bar/myfile.xlsx")
I get the error [Errno 2] No such file or directory: '../foo/bar/myfile.xlsx'. How come, and how can I fix it?
EDIT: It works as long as the defined path is inside the current working directory. But I'd like to specify any given path instead. Ideas?
I usually get bitten by forgetting to create the directories. Perhaps the path ../foo/bar/ doesn't exist yet? Pandas will create the file for you, but not the parent directories.
To elaborate, I'm guessing that your setup looks like this:
.
└── src
    ├── foo
    │   └── bar
    └── your_script.py
with src being your working directory, so that foo/bar exists relative to you, but ../foo/bar does not - yet!
So you should add the foo/bar directories one level up:
.
├── foo_should_go_here
│   └── bar_should_go_here
└── src
    ├── foo
    │   └── bar
    └── your_script.py
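As a side note (not part of the original answer, just a hedged sketch), you could also have the function create any missing parent directories itself before writing:
import os
import pandas as pd

def excel_to_path(frame, path):
    # Create ../foo/bar (and any other missing parents) first, then write as before.
    os.makedirs(os.path.dirname(path), exist_ok=True)
    writer = pd.ExcelWriter(path, engine='xlsxwriter')
    frame.to_excel(writer, sheet_name='Output')
    writer.save()  # as in the question; newer pandas versions use writer.close()

excel_to_path(df, "../foo/bar/myfile.xlsx")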

Read all files in a nested folder in Spark

If we have a folder folder having all .txt files, we can read them all using sc.textFile("folder/*.txt"). But what if I have a folder folder containing even more folders named datewise, like, 03, 04, ..., which further contain some .log files. How do I read these in Spark?
In my case, the structure is even more nested & complex, so a general answer is preferred.
If the directory structure is regular, let's say something like this:
folder
├── a
│   ├── a
│   │   └── aa.txt
│   └── b
│       └── ab.txt
└── b
    ├── a
    │   └── ba.txt
    └── b
        └── bb.txt
you can use * wildcard for each level of nesting as shown below:
>>> sc.wholeTextFiles("/folder/*/*/*.txt").map(lambda x: x[0]).collect()
[u'file:/folder/a/a/aa.txt',
u'file:/folder/a/b/ab.txt',
u'file:/folder/b/a/ba.txt',
u'file:/folder/b/b/bb.txt']
Spark 3.0 provides an option recursiveFileLookup to load files from recursive subfolders.
val df = sparkSession.read
  .option("recursiveFileLookup", "true")
  .option("header", "true")
  .csv("src/main/resources/nested")
This recursively loads the files from src/main/resources/nested and its subfolders.
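Since the original question uses PySpark, a roughly equivalent (untested) PySpark sketch would be:
# Assumes Spark 3.0+; note that recursiveFileLookup disables partition inference.
df = (spark.read
      .option("recursiveFileLookup", "true")
      .option("header", "true")
      .csv("src/main/resources/nested"))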
If you only want to use files under directories whose names start with "a", you can use
sc.wholeTextFiles("/folder/a*/*/*.txt") or sc.wholeTextFiles("/folder/a*/a*/*.txt")
as well. We can use * as a wildcard.
sc.wholeTextFiles("/directory/201910*/part-*.lzo") get all match files name, not files content.
if you want to load the contents of all matched files in a directory, you should use
sc.textFile("/directory/201910*/part-*.lzo")
and setting reading directory recursive!
sc._jsc.hadoopConfiguration().set("mapreduce.input.fileinputformat.input.dir.recursive", "true")
TIPS: scala differ with python, below set use to scala!
sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive", "true")
