Write dataframe to path outside current directory with a function? - python-3.x

I have a question that relates to (and may be a duplicate of) this question here.
I am trying to write a pandas DataFrame to an Excel file (which does not exist beforehand) at a given path. Since I have to do this quite a few times, I am wrapping it in a function. Here is what I do:
df = pd.DataFrame({'Data': [10, 20, 30, 20, 15, 30, 45]})

def excel_to_path(frame, path):
    writer = pd.ExcelWriter(path, engine='xlsxwriter')
    frame.to_excel(writer, sheet_name='Output')
    writer.save()

excel_to_path(df, "../foo/bar/myfile.xlsx")
I get the error [Errno 2] No such file or directory: '../foo/bar/myfile.xlsx'. Why does this happen, and how can I fix it?
EDIT: It works as long as the defined path is inside the current working directory. But I'd like to specify any given path instead. Ideas?

I usually get bitten by forgetting to create the directories. Perhaps the path ../foo/bar/ doesn't exist yet? Pandas will create the file for you, but not the parent directories.
To elaborate, I'm guessing that your setup looks like this:
.
└── src
    ├── foo
    │   └── bar
    └── your_script.py
with src being your working directory, so that foo/bar exists relative to you, but ../foo/bar does not - yet!
So you should add the foo/bar directories one level up:
.
├── foo_should_go_here
│   └── bar_should_go_here
└── src
    ├── foo
    │   └── bar
    └── your_script.py
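Alternatively, you can have the function create any missing parent directories itself before writing. A minimal sketch (the Path-based mkdir call is my addition, not part of the original function):
import pandas as pd
from pathlib import Path

def excel_to_path(frame, path):
    # Create any missing parent directories first (my addition)
    Path(path).parent.mkdir(parents=True, exist_ok=True)
    writer = pd.ExcelWriter(path, engine='xlsxwriter')
    frame.to_excel(writer, sheet_name='Output')
    writer.save()  # note: newer pandas versions use writer.close() instead

df = pd.DataFrame({'Data': [10, 20, 30, 20, 15, 30, 45]})
excel_to_path(df, "../foo/bar/myfile.xlsx")  # works even if ../foo/bar/ did not exist yet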

Related

Spark glob filter to match a specific nested partition

I'm using PySpark, but I guess this is valid for Scala as well.
My data is stored on S3 in the following structure:
main_folder
└── year=2022
    └── month=03
        ├── day=01
        │   ├── valid=false
        │   │   └── example1.parquet
        │   └── valid=true
        │       └── example2.parquet
        └── day=02
            ├── valid=false
            │   └── example3.parquet
            └── valid=true
                └── example4.parquet
(For simplicity there is only one file in each folder and only two days; in reality there can be thousands of files and many days/months/years.)
The files under the valid=true and valid=false partitions have completely different schemas, and I only want to read the files in the valid=true partition.
I tried using the glob filter, but it fails with AnalysisException: Unable to infer schema for Parquet. It must be specified manually., which is a symptom of having no data (so no files matched):
spark.read.parquet('s3://main_folder', pathGlobFilter='*valid=true*')
I noticed that something like this works
spark.read.parquet('s3://main_folder', pathGlobFilter='*example4*')
however, as soon as I try to use a slash or match anything above the bottom level, it fails:
spark.read.parquet('s3://main_folder', pathGlobFilter='*/example4*')
spark.read.parquet('s3://main_folder', pathGlobFilter='*valid=true*example4*')
I did try to replace the * with ** in all locations, but it didn't work
pathGlobFilter seems to apply only to the final file name, so for subdirectories you can try the approach below. Note that it may skip partition discovery; to keep partition discovery, add the basePath property as a load option:
spark.read.format("parquet")\
    .option("basePath", "s3://main_folder")\
    .load("s3://main_folder/*/*/*/valid=true/*")
However, I am not sure whether you can combine wildcarding and pathGlobFilter if you want to match on both subdirectories and end file names.
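If you want to experiment with combining the two anyway, a sketch could look like the following (untested assumption: pathGlobFilter only filters the leaf file names matched by the wildcarded load path):
df = spark.read.format("parquet")\
    .option("basePath", "s3://main_folder")\
    .option("pathGlobFilter", "*.parquet")\
    .load("s3://main_folder/*/*/*/valid=true/*")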
Reference:
https://simplernerd.com/java-spark-read-multiple-files-with-glob/
https://spark.apache.org/docs/latest/sql-data-sources-parquet.html

Understanding --archives in dataproc pyspark

This is what the command help says:
--archives=[ARCHIVE,...]
Comma separated list of archives to be extracted into the working
directory of each executor. Must be one of the following file formats:
.zip, .tar, .tar.gz, or .tgz.
and this answer here tells me that --archives will only be extracted on worker nodes.
I am testing the --archives behavior in the following way:
tl;dr - 1. I create an archive and zip it. 2. I create a simple RDD and map its elements to os.walk('./'). 3. The archive.zip gets listed as a directory, but os.walk does not traverse down this branch.
My archive directory:
.
├── archive
│   ├── a1.py
│   ├── a1.txt
│   └── archive1
│       ├── a1_in.py
│       └── a1_in.txt
├── archive.zip
└── main.py

2 directories, 6 files
Testing code:
import os
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
rdd = sc.parallelize(range(1))
walk_worker = rdd.map(lambda x: str(list(os.walk('./')))).distinct().collect()
walk_driver = list(os.walk('./'))
print('driver walk:', walk_driver)
print('worker walk:',walk_worker)
Dataproc run command:
gcloud dataproc jobs submit pyspark main.py --cluster pyspark-monsoon31 --region us-central1 --archives archive.zip
output:
driver walk: [('./', [], ['.main.py.crc', 'archive.zip', 'main.py', '.archive.zip.crc'])]
worker walk: ["[('./', ['archive.zip', '__spark_conf__', 'tmp'], ['pyspark.zip', '.default_container_executor.sh.crc', '.container_tokens.crc', 'default_container_executor.sh', 'launch_container.sh', '.launch_container.sh.crc', 'default_container_executor_session.sh', '.default_container_executor_session.sh.crc', 'py4j-0.10.9-src.zip', 'container_tokens']), ('./tmp', [], ['liblz4-java-5701923559211144129.so.lck', 'liblz4-java-5701923559211144129.so'])]"]
The output for the driver node: archive.zip is available but not extracted - EXPECTED.
The output for the worker node: os.walk lists archive.zip as an extracted directory. The 3 directories available are ['archive.zip', '__spark_conf__', 'tmp']. But, to my surprise, only ./tmp is traversed further, and that is it.
I have checked using os.listdir that archive.zip actually is a directory and not a zip. Its structure is:
└── archive.zip
    └── archive
        ├── a1.py
        ├── a1.txt
        └── archive1
            ├── a1_in.py
            └── a1_in.txt
So, why is os.walk not walking down the archive.zip directory?
archive.zip is added as a symlink to worker nodes. Symlinks are not traversed by default.
If you change to walk_worker = rdd.map(lambda x: str(list(os.walk('./', followlinks=True)))).distinct().collect() you will get the output you are looking for:
worker walk: ["[('./', ['__spark_conf__', 'tmp', 'archive.zip'], ...
('./archive.zip', ['archive'], []), ('./archive.zip/archive', ['archive1'], ['a1.txt', 'a1.py']), ...."]
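As a follow-up, once the archive is extracted and symlinked on the workers, code running there can open files through that path directly. A minimal sketch (the './archive.zip/archive/a1.txt' path is assumed from the structure shown above):
def read_from_archive(_):
    # Read a file out of the extracted archive on the worker node
    with open('./archive.zip/archive/a1.txt') as f:
        return f.read()

print(rdd.map(read_from_archive).distinct().collect())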

Import Scripts From Folders Dynamically in For Loop

I have my main.py script in a folder, along with about 10 other folders with varying names. These names change from time to time, so I can't just import from a specific folder name each time. Instead, I thought I could create a for loop that first loads all the folder names into a list and then iterates over them to import the template.py script inside each folder. And yes, they are all named template.py, but each folder's copy is unique to that folder.
My main.py script looks like this:
import os
import sys

# All items in the current directory that do not have a dot extension, and isn't the pycache folder,
# are considered folders to iterate through
pipeline_folder_names = [name for name in os.listdir("./") if not '.' in name and not 'pycache' in name]

for i in pipeline_folder_names:
    print(i)
    path = sys.path.insert(0, './' + i)
    import template
It works on the first folder just fine, but then doesn't change into the next directory to import the next template script. I've tried adding both:
os.chdir('../')
and
sys.path.remove('./' + i)
to the end to "reset" the directory but neither of them work. Any ideas? Thanks!
When you import a module in Python, it is cached in sys.modules. The second time you execute import template, it's not the new file that gets imported; Python just reuses the module it loaded the first time.
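A quick illustration of that caching behaviour (minimal sketch, run from a directory where a template.py is importable):
import sys

import template                   # executes template.py and caches the module
print('template' in sys.modules)  # True: a second `import template` is a no-op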
This is what worked for me.
The directory structure and content:
.
├── 1
│   ├── __pycache__
│   │   └── template.cpython-38.pyc
│   └── template.py
├── 2
│   ├── __pycache__
│   │   └── template.cpython-38.pyc
│   └── template.py
└── temp.py
$ cat 1/template.py
print("1")
$ cat 2/template.py
print("2")
Load the first one manually, then use the reload function from importlib to load the new template.py file.
import os
import sys
import importlib

# All items in the current directory that do not have a dot extension, and isn't the pycache folder,
# are considered folders to iterate through
pipeline_folder_names = [name for name in os.listdir("./") if not '.' in name and not 'pycache' in name]

sys.path.insert(1, './' + pipeline_folder_names[0])
import template
sys.path.remove('./' + pipeline_folder_names[0])

for i in pipeline_folder_names[1:]:
    path = sys.path.insert(0, './' + i)
    importlib.reload(template)
    sys.path.remove('./' + i)
Running this gives the output:
$ python temp.py
1
2
Considering the above folder structure, you need to make each folder a package, which can be done by creating an empty __init__.py file in each folder, next to template.py. Then the code below in temp.py will solve your issue:
import os
import sys
import importlib

pipeline_folder_names = [name for name in os.listdir("./") if not '.' in name and not 'pycache' in name]

def import_template(directory):
    importlib.import_module(directory + '.template')

for i in pipeline_folder_names:
    import_template(i)
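If you also need to use the imported scripts later, importlib.import_module returns the module object, so you can keep the references. A small sketch extending the answer above (the templates dict and the run() call are hypothetical additions):
templates = {}
for i in pipeline_folder_names:
    templates[i] = importlib.import_module(i + '.template')

# e.g. call a hypothetical run() function defined in each template.py:
# templates['1'].run()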

ZipFile creating zip with all the folders in zip

From the Python docs, I picked the following snippet to zip a single file (for a Flask project).
I have to create a zip file in temploc, here:
/home/workspace/project/temploc/zipfile.zip
And here is my file to be zipped:
/home/workspace/project/temploc/file_to_be_zipped.csv
from zipfile import ZipFile

def zip_file(self, output, file_to_zip):
    try:
        with ZipFile(output, 'w') as myzip:
            myzip.write(file_to_zip)
    except:
        return None
    return output
This code creates a zip file in temploc, but with the full directory structure of the zipped file's path.
def prepare_zip(self):
    cache_dir = app.config["CACHE_DIR"]  #-- /home/workspace/project/temploc
    zip_file_path = os.path.join(cache_dir, "zipfile.zip")
    input_file = '/home/workspace/project/temploc/file_to_be_zipped.csv'
    self.zip_file(zip_file_path, input_file)
But the above code is creating a zip file with the given path's directory structure:
zipfile.zip
└── home
    └── workspace
        └── project
            └── temploc
                └── file_to_be_zipped.csv
But I want only this structure:
zipfile.zip
└── file_to_be_zipped.csv
I'm not sure what I'm missing.
You should use the second argument of ZipFile.write (arcname) to set the proper name of the file inside the archive:
import os.path
...
myzip.write(file_to_zip, os.path.basename(file_to_zip))
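Folded back into the original zip_file function, a minimal sketch would be (narrowing the bare except to OSError is my own tweak):
import os.path
from zipfile import ZipFile

def zip_file(self, output, file_to_zip):
    try:
        with ZipFile(output, 'w') as myzip:
            # arcname controls the path stored inside the archive:
            # keep only the bare file name, drop the leading directories
            myzip.write(file_to_zip, arcname=os.path.basename(file_to_zip))
    except OSError:
        return None
    return output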

Read all files in a nested folder in Spark

If we have a folder folder containing all .txt files, we can read them all using sc.textFile("folder/*.txt"). But what if I have a folder folder containing even more folders named datewise, like 03, 04, ..., which further contain some .log files? How do I read these in Spark?
In my case, the structure is even more nested & complex, so a general answer is preferred.
If the directory structure is regular, let's say something like this:
folder
├── a
│   ├── a
│   │   └── aa.txt
│   └── b
│       └── ab.txt
└── b
    ├── a
    │   └── ba.txt
    └── b
        └── bb.txt
you can use the * wildcard for each level of nesting, as shown below:
>>> sc.wholeTextFiles("/folder/*/*/*.txt").map(lambda x: x[0]).collect()
[u'file:/folder/a/a/aa.txt',
u'file:/folder/a/b/ab.txt',
u'file:/folder/b/a/ba.txt',
u'file:/folder/b/b/bb.txt']
Spark 3.0 provides the recursiveFileLookup option to load files recursively from subfolders:
val df = sparkSession.read
    .option("recursiveFileLookup", "true")
    .option("header", "true")
    .csv("src/main/resources/nested")
This recursively loads the files from src/main/resources/nested and its subfolders.
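Since the original question uses sc/PySpark, the equivalent in Python would look roughly like this (same options and path as the Scala snippet above):
df = spark.read \
    .option("recursiveFileLookup", "true") \
    .option("header", "true") \
    .csv("src/main/resources/nested")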
If you want to use only files whose names start with "a", you can use
sc.wholeTextFiles("/folder/a*/*/*.txt") or sc.wholeTextFiles("/folder/a*/a*/*.txt")
as well. We can use * as a wildcard.
sc.wholeTextFiles("/directory/201910*/part-*.lzo") get all match files name, not files content.
if you want to load the contents of all matched files in a directory, you should use
sc.textFile("/directory/201910*/part-*.lzo")
and setting reading directory recursive!
sc._jsc.hadoopConfiguration().set("mapreduce.input.fileinputformat.input.dir.recursive", "true")
TIPS: scala differ with python, below set use to scala!
sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive", "true")
