Understanding --archive in dataproc pyspark - apache-spark

This is what the command help says:
--archives=[ARCHIVE,...]
Comma separated list of archives to be extracted into the working
directory of each executor. Must be one of the following file formats:
.zip, .tar, .tar.gz, or .tgz.
and this answer here tells me that --archives will only be extracted on worker nodes.
I am testing the --archives behavior the following way:
tl;dr - 1. I create an archive directory and zip it. 2. I create a simple RDD and map its elements to os.walk('./'). 3. The archive.zip gets listed as a directory, but os.walk does not traverse down this branch.
My archive directory:
.
├── archive
│   ├── a1.py
│   ├── a1.txt
│   └── archive1
│       ├── a1_in.py
│       └── a1_in.txt
├── archive.zip
└── main.py
2 directories, 6 files
Testing code:
import os
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
rdd = sc.parallelize(range(1))
walk_worker = rdd.map(lambda x: str(list(os.walk('./')))).distinct().collect()
walk_driver = list(os.walk('./'))
print('driver walk:', walk_driver)
print('worker walk:', walk_worker)
Dataproc run command:
gcloud dataproc jobs submit pyspark main.py --cluster pyspark-monsoon31 --region us-central1 --archives archive.zip
output:
driver walk: [('./', [], ['.main.py.crc', 'archive.zip', 'main.py', '.archive.zip.crc'])]
worker walk: ["[('./', ['archive.zip', '__spark_conf__', 'tmp'], ['pyspark.zip', '.default_container_executor.sh.crc', '.container_tokens.crc', 'default_container_executor.sh', 'launch_container.sh', '.launch_container.sh.crc', 'default_container_executor_session.sh', '.default_container_executor_session.sh.crc', 'py4j-0.10.9-src.zip', 'container_tokens']), ('./tmp', [], ['liblz4-java-5701923559211144129.so.lck', 'liblz4-java-5701923559211144129.so'])]"]
The output for driver node: The archive.zip is available but not extracted - EXPECTED
The output for the worker node: os.walk lists archive.zip as an extracted directory. The 3 directories available are ['archive.zip', '__spark_conf__', 'tmp']. But, to my surprise, only ./tmp is traversed further, and that is it.
I have checked using os.listdir that archive.zip actually is a directory and not a zip. Its structure is:
└── archive.zip
    └── archive
        ├── a1.py
        ├── a1.txt
        └── archive1
            ├── a1_in.py
            └── a1_in.txt
So, why is os.walk not walking down the archive.zip directory?

archive.zip is added as a symlink to worker nodes. Symlinks are not traversed by default.
If you change to walk_worker = rdd.map(lambda x: str(list(os.walk('./', followlinks=True)))).distinct().collect() you will get the output you are looking for:
worker walk: ["[('./', ['__spark_conf__', 'tmp', 'archive.zip'], ...
('./archive.zip', ['archive'], []), ('./archive.zip/archive', ['archive1'], ['a1.txt', 'a1.py']), ...."]
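Once the archive has been extracted on the executors, its files can be opened through the symlinked directory. Below is a minimal sketch of that, assuming the same archive.zip layout as in the question (the a1.txt path comes from that layout):
import os
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

def read_from_archive(_):
    # Each executor's working directory contains an 'archive.zip' symlink
    # that points to the extracted archive.
    path = os.path.join('./archive.zip', 'archive', 'a1.txt')
    with open(path) as f:
        return f.read()

print(sc.parallelize(range(1)).map(read_from_archive).collect())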

Related

Spark glob filter to match a specific nested partition

I'm using PySpark, but I guess this is valid for Scala as well.
My data is stored on s3 in the following structure
main_folder
└── year=2022
    └── month=03
        ├── day=01
        │   ├── valid=false
        │   │   └── example1.parquet
        │   └── valid=true
        │       └── example2.parquet
        └── day=02
            ├── valid=false
            │   └── example3.parquet
            └── valid=true
                └── example4.parquet
(For simplicity there is only one file in any folder, and only two days, in reality, there can be thousands of files and many days/months/years)
The files that are under the valid=true and valid=false partitions have a completely different schema, and I only want to read the files in the valid=true partition
I tried using the glob filter, but it fails with AnalysisException: Unable to infer schema for Parquet. It must be specified manually. which is a symptom of having no data (so no files matched)
spark.read.parquet('s3://main_folder', pathGlobFilter='*valid=true*')
I noticed that something like this works
spark.read.parquet('s3://main_folder', pathGlobFilter='*example4*')
however, as soon as I try to use a slash or do something above the bottom level it fails.
spark.read.parquet('s3://main_folder', pathGlobFilter='*/example4*')
spark.read.parquet('s3://main_folder', pathGlobFilter='*valid=true*example4*')
I did try to replace the * with ** in all locations, but it didn't work
pathGlobFilter seems to work only on the final file name; for subdirectories you can try the approach below, though it may skip partition discovery. To keep partition discovery, add the basePath property in the load options:
spark.read.format("parquet")\
    .option("basePath", "s3://main_folder")\
    .load("s3://main_folder/*/*/*/valid=true/*")
However, I am not sure whether you can combine wildcarding and pathGlobFilter when you want to match on both subdirectories and end file names; a sketch of what that combination might look like follows.
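A possible (untested) sketch of combining the two, assuming the same bucket layout as above; pathGlobFilter would then only constrain the final file name:
df = (
    spark.read.format("parquet")
    .option("basePath", "s3://main_folder")
    .option("pathGlobFilter", "*.parquet")  # matches only against the file name
    .load("s3://main_folder/*/*/*/valid=true/*")
)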
Reference:
https://simplernerd.com/java-spark-read-multiple-files-with-glob/
https://spark.apache.org/docs/latest/sql-data-sources-parquet.html

How to use custom config file for SparkSession (without using spark-submit to submit application)?

I have an independent Python script that creates a SparkSession by invoking the following line of code, and I can see that it configures the Spark session perfectly, as specified in the spark-defaults.conf file.
spark = SparkSession.builder.appName("Tester").enableHiveSupport().getOrCreate()
If I want to pass, as a parameter, another file containing the Spark configuration to be used instead of spark-defaults.conf, how can I specify this while creating a SparkSession?
I can see that I can pass a SparkConf object but is there a way to create one automatically from a file containing all the configurations?
Do I have to manually parse the input file and set the appropriate configuration manually?
If you don't use spark-submit, your best option here is overriding SPARK_CONF_DIR. Create a separate directory for each configuration set:
$ configs tree
.
├── conf1
│   ├── docker.properties
│   ├── fairscheduler.xml
│   ├── log4j.properties
│   ├── metrics.properties
│   ├── spark-defaults.conf
│   ├── spark-defaults.conf.template
│   └── spark-env.sh
└── conf2
    ├── docker.properties
    ├── fairscheduler.xml
    ├── log4j.properties
    ├── metrics.properties
    ├── spark-defaults.conf
    ├── spark-defaults.conf.template
    └── spark-env.sh
And set the environment variable before you initialize any JVM-dependent objects:
import os
from pyspark.sql import SparkSession
os.environ["SPARK_CONF_DIR"] = "/path/to/configs/conf1"
spark = SparkSession.builder.getOrCreate()
or
import os
from pyspark.sql import SparkSession
os.environ["SPARK_CONF_DIR"] = "/path/to/configs/conf2"
spark = SparkSession.builder.getOrCreate()
This is a workaround and might not work in complex scenarios.
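To confirm which defaults were actually picked up, you can inspect the resulting configuration; a minimal sketch:
import os
from pyspark.sql import SparkSession

os.environ["SPARK_CONF_DIR"] = "/path/to/configs/conf1"
spark = SparkSession.builder.getOrCreate()

# Lists every setting the session ended up with, including values
# loaded from conf1/spark-defaults.conf.
for key, value in spark.sparkContext.getConf().getAll():
    print(key, "=", value)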

Write dataframe to path outside current directory with a function?

I got a question that relates to (maybe is a duplicate of) this question here.
I try to write a pandas dataframe to an Excel file (non-existing before) in a given path. Since I have to do it quite a few times, I try to wrap it in a function. Here is what I do:
import pandas as pd

df = pd.DataFrame({'Data': [10, 20, 30, 20, 15, 30, 45]})

def excel_to_path(frame, path):
    writer = pd.ExcelWriter(path, engine='xlsxwriter')
    frame.to_excel(writer, sheet_name='Output')
    writer.save()

excel_to_path(df, "../foo/bar/myfile.xlsx")
I get thrown the error [Errno 2] No such file or directory: '../foo/bar/myfile.xlsx'. How come and how can I fix it?
EDIT: It works as long as the defined path is inside the current working directory. But I'd like to specify any given path instead. Ideas?
I usually get bitten by forgetting to create the directories. Perhaps the path ../foo/bar/ doesn't exist yet? Pandas will create the file for you, but not the parent directories.
To elaborate, I'm guessing that your setup looks like this:
.
└── src
    ├── foo
    │   └── bar
    └── your_script.py
with src being your working directory, so that foo/bar exists relative to you, but ../foo/bar does not - yet!
So you should add the foo/bar directories one level up:
.
├── foo_should_go_here
│   └── bar_should_go_here
└── src
    ├── foo
    │   └── bar
    └── your_script.py
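Alternatively, you can have the function create any missing parent directories before writing; a minimal sketch of that approach, reusing the df and path from the question:
import os
import pandas as pd

df = pd.DataFrame({'Data': [10, 20, 30, 20, 15, 30, 45]})

def excel_to_path(frame, path):
    # Create ../foo/bar (and any other missing parents) first.
    os.makedirs(os.path.dirname(path), exist_ok=True)
    writer = pd.ExcelWriter(path, engine='xlsxwriter')
    frame.to_excel(writer, sheet_name='Output')
    writer.save()

excel_to_path(df, "../foo/bar/myfile.xlsx")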

ZipFile creating zip with all the folders in zip

From the Python docs, I picked the following snippet to zip a single file (for a Flask project).
I have to create a zip file in temploc here:
/home/workspace/project/temploc/zipfile.zip
And here is my file to be zipped:
/home/workspace/project/temploc/file_to_be_zipped.csv
from zipfile import ZipFile

def zip_file(self, output, file_to_zip):
    try:
        with ZipFile(output, 'w') as myzip:
            myzip.write(file_to_zip)
    except:
        return None
    return output
This code creates a zip file in temploc, but with the full directory structure of the zipped file's path.
def prepare_zip(self):
    cache_dir = app.config["CACHE_DIR"]  # /home/workspace/project/temploc
    zip_file_path = os.path.join(cache_dir, "zipfile.zip")
    input_file = '/home/workspace/project/temploc/file_to_be_zipped.csv'
    self.zip_file(zip_file_path, input_file)
But the above code creates a zip file that mirrors the directory structure of the given path:
zipfile.zip
└── home
    └── workspace
        └── project
            └── temploc
                └── file_to_be_zipped.csv
But I want only this structure:
zipfile.zip
└── file_to_be_zipped.csv
I can't figure out what I'm missing.
You should use the second argument of ZipFile.write (arcname) to set the proper name of the file inside the archive:
import os.path
...
myzip.write(file_to_zip, os.path.basename(file_to_zip))
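Applied to the zip_file method from the question, a minimal sketch looks like this (same signature and error handling as the original):
import os.path
from zipfile import ZipFile

def zip_file(self, output, file_to_zip):
    try:
        with ZipFile(output, 'w') as myzip:
            # arcname drops the leading directories, so only the bare
            # file name is stored inside the archive.
            myzip.write(file_to_zip, arcname=os.path.basename(file_to_zip))
    except:
        return None
    return output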

Read all files in a nested folder in Spark

If we have a folder folder containing all .txt files, we can read them all using sc.textFile("folder/*.txt"). But what if I have a folder folder containing even more folders named by date, like 03, 04, ..., which in turn contain some .log files? How do I read these in Spark?
In my case, the structure is even more nested & complex, so a general answer is preferred.
If the directory structure is regular, let's say something like this:
folder
├── a
│   ├── a
│   │   └── aa.txt
│   └── b
│       └── ab.txt
└── b
    ├── a
    │   └── ba.txt
    └── b
        └── bb.txt
you can use the * wildcard for each level of nesting, as shown below:
>>> sc.wholeTextFiles("/folder/*/*/*.txt").map(lambda x: x[0]).collect()
[u'file:/folder/a/a/aa.txt',
u'file:/folder/a/b/ab.txt',
u'file:/folder/b/a/ba.txt',
u'file:/folder/b/b/bb.txt']
Spark 3.0 provides the option recursiveFileLookup to load files recursively from subfolders.
val df = sparkSession.read
  .option("recursiveFileLookup", "true")
  .option("header", "true")
  .csv("src/main/resources/nested")
This recursively loads the files from src/main/resources/nested and its subfolders.
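Since the rest of this thread is PySpark, here is a minimal sketch of the Python equivalent, assuming Spark 3.0+ and an existing spark session, with the same example path:
df = (
    spark.read
    .option("recursiveFileLookup", "true")
    .option("header", "true")
    .csv("src/main/resources/nested")
)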
If you want to use only files whose names start with "a", you can use
sc.wholeTextFiles("/folder/a*/*/*.txt") or sc.wholeTextFiles("/folder/a*/a*/*.txt")
as well; * works as a wildcard at any level.
sc.wholeTextFiles("/directory/201910*/part-*.lzo") get all match files name, not files content.
if you want to load the contents of all matched files in a directory, you should use
sc.textFile("/directory/201910*/part-*.lzo")
and setting reading directory recursive!
sc._jsc.hadoopConfiguration().set("mapreduce.input.fileinputformat.input.dir.recursive", "true")
TIPS: scala differ with python, below set use to scala!
sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive", "true")

Resources