How to ignore non-existent paths in PySpark - apache-spark

I am looking for a way to read a bunch of files from S3, but some of the paths may not exist. I would like to simply ignore any path that does not exist and process everything that is available. For example, I want to read in these files:
files_to_read = []
for id in ids_to_process:
    for date in dates_to_process:
        files_to_read.append('s3://bucket/date=' + date + '/id=' + id + '/*.parquet')

sqlContext.read.parquet(*files_to_read)
The issue is that some ids may not have started until a certain date, and while I can figure that out, it is not easy to do programmatically. What would be the easiest way to either a) ignore a path if it does not exist, or b) check whether a path exists?
I have tried sqlContext.sql("spark.sql.files.ignoreMissingFiles=true"), which does not seem to work. Would there be any similar option that I am missing?

For spark.sql.files.ignoreMissingFiles, a "missing file" really means a file deleted from the directory after you construct the DataFrame, so that option will not help here.
It is better to check in Python beforehand whether the target paths exist, rather than handing that over to Spark.
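As a minimal sketch of that pre-check (assuming boto3 is available and the bucket/prefix layout from the question), you could keep only the prefixes that actually contain objects:
import boto3

s3 = boto3.client("s3")

def prefix_exists(bucket, prefix):
    # list_objects_v2 reports KeyCount == 0 when nothing matches the prefix
    resp = s3.list_objects_v2(Bucket=bucket, Prefix=prefix, MaxKeys=1)
    return resp.get("KeyCount", 0) > 0

files_to_read = []
for id in ids_to_process:
    for date in dates_to_process:
        prefix = 'date=' + date + '/id=' + id + '/'
        if prefix_exists('bucket', prefix):
            files_to_read.append('s3://bucket/' + prefix + '*.parquet')

df = sqlContext.read.parquet(*files_to_read)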

You could try something like this, catching the specific exception that is thrown when a path does not exist (in PySpark it surfaces as pyspark.sql.utils.AnalysisException):
from pyspark.sql.utils import AnalysisException

df = None
for path in paths_to_read:
    try:
        temp_df = sqlContext.read.parquet(path)
        if df is None:
            df = temp_df
        else:
            df = df.union(temp_df)
    except AnalysisException:
        # Path cannot be read (e.g. it does not exist), so ignore it
        # print("Path " + path + " cannot be read. Ignoring.")
        pass
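Note that union matches columns by position, so this assumes every path yields the same schema. Once you know which paths are readable, a single sqlContext.read.parquet(*readable_paths) call over that list also avoids the repeated unions (readable_paths is just a hypothetical name for the surviving paths).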

I haven't seen anything in PySpark that can do this out of the box. I also faced this problem, and this is what I did:
Have a list of S3 addresses that you want to read.
addrs = ["s3a://abc", "s3a://xyz", ... ]
Test the paths beforehand and drop the ones that are not accessible. Iterate over a copy of the list, because removing items from a list while looping over it skips elements:
for add in list(addrs):
    try:
        spark.read.format("parquet").load(add)
    except:
        print(add)
        addrs.remove(add)
Read the updated list using spark
sdf_a = spark \
    .read \
    .format("parquet") \
    .load(addrs)

Related

Azure Databricks: programmatically specify a notebook path that contains the special character '$' in dbutils.notebook.run

I have a databricks notebook with the following line:
dbutils.notebook.run(f"{notebooks_base_path}/test_notebook", 60, {})
The value of the "notebooks_base_path" parameter is an existing root path in the workspace: "/base/oracle/dim/ops$test"
When I execute it, I receive an exception related to parsing the notebook path:
com.databricks.WorkflowException: com.databricks.NotebookExecutionException: FAILED: Notebook not found: /base/oracle/dim/ops$test/test_notebook
I suppose there is an issue parsing "$" in the path. Any suggestions?
Thanks
One workaround is to build the path from Python variables, like this:
var_1 = 'Users'
var_2 = 'r183532#heb.com'
var_3 = 'pull'
var = '/' + var_1 + '/' + var_2 + '/' + var_3
print(var)
dbutils.notebook.run(var, 10)
It looks like this is not possible right now because of some internal implementation details. Escaping the $ character with \ or $ doesn't help.
I recommend opening a support ticket to get it resolved.
P.S. You may also look into multi-task jobs instead, and/or arbitrary files in Repos (if your code is in Python).

One person's name keeps showing up with \ufeff in my dataframe when I print to the console

I have Python code that loads a group of exam results. Each exam is saved in its own CSV file.
files = glob.glob('Exam *.csv')
frame = []
files1 = glob.glob('Exam 1*.csv')
for file in files:
    frame.append(pd.read_csv(file, index_col=[0], encoding='utf-8-sig'))
for file in files1:
    frame.append(pd.read_csv(file, index_col=[0], encoding='utf-8-sig'))
There is one person in the whole dataframe whose name column shows up as
\ufeffStudents Name
It happens for every single exam. I tried using the encoding argument, but that doesn't fix the issue. I am out of ideas. Does anyone have any suggestions?
That character is the BOM or "Byte Order Mark."
There are several ways to resolve it.
First, I suggest adding the engine parameter (for example, engine='python') to pd.read_csv() when reading the CSV files.
pd.read_csv(file, index_col=[0], engine='python', encoding='utf-8-sig')
Secondly, you can simply remove it by replacing it with an empty string ('').
df['student_name'] = df['student_name'].apply(lambda x: x.replace("\ufeff", ""))
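Note that encoding='utf-8-sig' only strips a BOM at the very start of each file, so a BOM appearing elsewhere in the data survives the read. If the stray character can show up in more than one column, a hypothetical sweep over all string columns of the same df looks like this (the column selection is an assumption, adjust it to your frame):
# strip a stray BOM from every string-typed column
for col in df.select_dtypes(include="object").columns:
    df[col] = df[col].str.replace("\ufeff", "", regex=False)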

How to add ( and ) to a Python result

I am new to Python 3, so forgive me for asking such a question, but I couldn't find an answer on Google. I have Python scanning a file directory, and I need to add an opening and closing bracket and a new variable to each item of the result so that I can insert it into the database.
MySQL requires inserts to be wrapped in brackets, as in val = [('test.mp4', newip)]. This works, as I get "1 was inserted" when I run the hard-coded script.
So what I am trying to achieve is to modify the result of the scan by adding the open/close brackets and the new newip, like the following example.
Scan result
['test.mp4', 'test_2.mp4', 'test_3.mp4', 'test_4.mp4']
Insert new result (modified)
[('test.mp4',newip), ('test_2.mp4',newip), ('test_3.mp4',newip), ('test_4.mp4',newip)]
When hard-coded, it works:
root#ubuntu:~# python3 testscan.py
['test.mp4', 'test_2.mp4', 'test_3.mp4', 'test_4.mp4']
1 was inserted.
Can anyone please advise how to achieve this? Below is the full code:
import os, mysql.connector, re, uuid

files = [f.name for f in os.scandir('/var/www/html/media/usb1') if f.name.endswith('.mp4')]
print(files)

newip = ':'.join(re.findall('..', '%012x' % uuid.getnode()))

mydb = mysql.connector.connect(
    host="127.0.0.1",
    user="user",
    password="password",
    database="database"
)

mycursor = mydb.cursor()
sql = "INSERT IGNORE INTO files (file,ip) VALUES (%s,%s)"
val = [('test.mp4', newip)]
mycursor.executemany(sql, val)
mydb.commit()
print(mycursor.rowcount, "was inserted.")
If you want to add newip to each entry of the scan result, you can use a list comprehension:
files = ['test.mp4', 'test_2.mp4', 'test_3.mp4', 'test_4.mp4']
sql_values = [(file, newip) for file in files]
The result looks like this:
[('test.mp4', newip), ('test_2.mp4', newip), ('test_3.mp4', newip), ('test_4.mp4', newip)]
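Plugging that straight into the insert from the question (same cursor, table, and newip as the hard-coded version) would look roughly like this:
sql = "INSERT IGNORE INTO files (file,ip) VALUES (%s,%s)"
val = [(file, newip) for file in files]  # files comes from the os.scandir listing
mycursor.executemany(sql, val)
mydb.commit()
print(mycursor.rowcount, "was inserted.")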

PySpark load files between timestamps

I have a list of xml files containing a timestamp in the filename. I need to conditionally load those files based on the timestamp value. For this I am using wildcards.
Here is the code I am using, which is not working:
spark.read \
    .format("com.databricks.spark.xml") \
    .load("/path/file_[1533804409548-1533873609934]*")
I don't think you can do this with wildcards, because a bracket pattern is a character class and cannot express a numeric range. Since it is possible to load a DataFrame from multiple locations, you can instead build a list of the file paths whose timestamps fall within the range and load those paths. Here is the sample code I've tried:
import os

target_files = []
st = 123
et = 321
path = "<files_base_path>"
for file in os.listdir(path):
    try:
        ts = int(file[5:8])
        if ts >= st and ts <= et:
            target_files.append(path + file)
    except Exception as ex:
        continue

spark.read.parquet(*target_files)
Change the constant values based on your input. Hopefully it helps.
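Adapted to the question's XML filenames, a rough sketch might look like this (it assumes the files sit directly under the base path with names of the form file_<epoch_millis>..., and the rowTag value is only a placeholder that depends on your XML structure):
import os, re

st, et = 1533804409548, 1533873609934  # range taken from the question
base_path = "/path"                    # base directory from the question

target_files = []
for name in os.listdir(base_path):
    m = re.match(r"file_(\d+)", name)  # pull the timestamp out of the filename
    if m and st <= int(m.group(1)) <= et:
        target_files.append(os.path.join(base_path, name))

df = spark.read \
    .format("com.databricks.spark.xml") \
    .option("rowTag", "row") \
    .load(target_files)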

The code seems correct but my files aren't getting deleted

I heard that Python can make life easier. I wanted to remove duplicates in folderA by comparing folderB with folderA, so I decided to download Python and try writing the code myself. My code seems correct, but the files are not being deleted. What's wrong with it?
I tried unlink, but it doesn't work either.
import os
with open(r"C:\pathto\output.txt", "w") as a:
    for path, subdirs, files in os.walk(r'C:\pathto\directoryb'):
        for filename in files:
            #f = os.path.join(path, filename)
            #a.write(str(f) + os.linesep)
            a.write(str(filename) + '\n')
textFile = open(r'C:\output.txt', 'r')
line = textFile.readline()
while line:
    target = str(line)
    todelete = 'C:\directorya' + target
    if (os.path.exists(todelete)):
        os.remove(todelete)
    else:
        print("failed")
    line = textFile.readline()
textFile.close()
I want those files deleted: basically folderA contains some of the files that are in folderB, and I'm trying to delete them from folderA.
The problem is not os.remove itself; it's the path you pass to it. Every line that readline() returns still ends with a newline character, and 'C:\directorya' + target has no separator between the directory and the filename, so the path you build never matches a real file:
todelete = 'C:\directorya' + target
if (os.path.exists(todelete)):
    os.remove(todelete)  # never runs, because os.path.exists(todelete) is False
That is why the else branch prints "failed" instead of removing anything. (Also note that the script writes to C:\pathto\output.txt but reads back from C:\output.txt.) Strip the newline from each line and build the path with os.path.join, as in the sketch below, and os.remove will delete the files as expected.
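A minimal sketch of that fix, assuming the same output.txt and directory layout as in the question:
import os

# iterate over the same file the first loop writes to
with open(r'C:\pathto\output.txt', 'r') as textFile:
    for line in textFile:
        target = line.strip()                               # drop the trailing newline
        todelete = os.path.join(r'C:\directorya', target)   # join with a proper separator
        if os.path.exists(todelete):
            os.remove(todelete)
        else:
            print("failed: " + todelete)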
