Issue while trying to read a text file in Databricks using local file APIs rather than the Spark API - apache-spark

I'm trying to read a small txt file that was added as a table to the default db on Databricks. While trying to read the file via the local file API, I get a FileNotFoundError, but I'm able to read the same file as a Spark RDD using SparkContext.
Please find the code below:
with open("/FileStore/tables/boringwords.txt", "r") as f_read:
for line in f_read:
print(line)
This gives me the error:
FileNotFoundError Traceback (most recent call last)
<command-2618449717515592> in <module>
----> 1 with open("dbfs:/FileStore/tables/boringwords.txt", "r") as f_read:
2 for line in f_read:
3 print(line)
FileNotFoundError: [Errno 2] No such file or directory: 'dbfs:/FileStore/tables/boringwords.txt'
Whereas I have no problem reading the file using SparkContext:
boring_words = sc.textFile("/FileStore/tables/boringwords.txt")
set(i.strip() for i in boring_words.collect())
And as expected, I get the result for the above block of code:
Out[4]: {'mad',
'mobile',
'filename',
'circle',
'cookies',
'immigration',
'anticipated',
'editorials',
'review'}
I was also referring to the DBFS documentation here to understand the local file API's limitations, but it gave me no lead on the issue.
Any help would be greatly appreciated. Thanks!

The problem is that you're using the open function, which works only with local files and doesn't know anything about DBFS or other file systems. To get this working, you need to use the DBFS local file API and prepend the /dbfs prefix to the file path: /dbfs/FileStore/...:
with open("/dbfs/FileStore/tables/boringwords.txt", "r") as f_read:
for line in f_read:
print(line)

Alternatively, you can simply use Spark's built-in csv reader:
df = spark.read.csv("dbfs:/FileStore/tables/boringwords.txt")
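If you want the same set of distinct words that the RDD example produced, a minimal sketch (assuming one word per line, so everything lands in the default _c0 column) would be:
# collect the single default column into a Python set of trimmed words
boring_words = {row[0].strip() for row in df.collect()}
print(boring_words)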

Alternatively, we can use dbutils:
files = dbutils.fs.ls('/FileStore/tables/')
li = []
for fi in files:
    li.append(fi.path)
    print(fi.path)
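As a quick sanity check you can also preview the file contents directly with dbutils.fs.head (path assumed to be the same as above):
# print the first bytes of the file straight from DBFS
print(dbutils.fs.head("/FileStore/tables/boringwords.txt"))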

Related

FastText Error! ValueError: (file-name) cannot be opened for training

I have installed the fasttext module in Python and loaded the model ['cc.en.300.bin'].
I already made the data frame in the format fasttext expects, and then generated the files:
train.to_csv(" ecomm.train",columns=['Category_description'], index= False, header= False)
test.to_csv("ecom.test", columns=['Category_description'], index= False, header= False)
The files were created successfully! Then when I run this code
import fasttext
mod= fasttext.train_supervised(input='ecomm.train')
I get this error:
Traceback (most recent call last):
File "/Users/rosie/Documents/ProGraMinG/Python/pythonProject/FastText/FastText_overview.py", line 97, in <module>
mod= fasttext.train_supervised(input='ecomm.train')
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/fasttext/FastText.py", line 533, in train_supervised
fasttext.train(ft.f, a)
ValueError: ecomm.train cannot be opened for training!
{ UPDATE } !!!
Used both isfile() and exists() functions to check if the file exists:
path = 'Users/rosie/Documents/ProGraMinG/Python/pythonProject/FastText/ecomm.train'
check_file = os.path.isfile(path)
print("isfile method ",check_file)
check_file = os.path.exists(path)
print("exists method ",check_file)
Both methods return False.
I also checked whether the file is readable:
doc= open(' ecomm.train', 'r')
print('checking if the file is readable', doc.readable())
However, it returned True, so now I'm confused. As for the size of 'ecomm.train', it is 29.4 MB.
Are you sure the file is readable, at the simple (local) path 'ecomm.train', from your Python process, given its current working directory?
For example, try specifying the file as its full absolute path; on macOS probably something like /Users/yourusername/yourdirectory/etc/etc/ecomm.train. If that works, the problem was that your Python code's effective working directory wasn't what you expected.
Alternatively, if the process that wrote the file was in some way a different user than the later process trying to read it, there might be permission errors.
Totally separate from fasttext, you could check, from the same code that's about to try the fasttext operations, whether the file is readable (via either the local path or the absolute path) using a recipe like the one in this other answer: https://stackoverflow.com/a/44213239/130288
Even if it fails, it might give a more-explanatory error.
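For example, a minimal sketch of those checks (the absolute path below is only a guess based on the directory shown in your traceback):
import os

print("cwd:", os.getcwd())             # relative paths like 'ecomm.train' resolve against this
print("files here:", os.listdir("."))  # is 'ecomm.train' (without a leading space) really here?

path = "/Users/rosie/Documents/ProGraMinG/Python/pythonProject/FastText/ecomm.train"  # guessed from the traceback
print("exists:", os.path.exists(path))
print("readable:", os.access(path, os.R_OK))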

Cannot find '/dbfs/databricks-datasets' in my notebook [duplicate]

Trying to read a delta log file in a Databricks Community Edition cluster (databricks-7.2 version).
df=spark.range(100).toDF("id")
df.show()
df.repartition(1).write.mode("append").format("delta").save("/user/delta_test")
with open('/user/delta_test/_delta_log/00000000000000000000.json','r') as f:
    for l in f:
        print(l)
Getting file not found error:
FileNotFoundError: [Errno 2] No such file or directory: '/user/delta_test/_delta_log/00000000000000000000.json'
---------------------------------------------------------------------------
FileNotFoundError Traceback (most recent call last)
<command-1759925981994211> in <module>
----> 1 with open('/user/delta_test/_delta_log/00000000000000000000.json','r') as f:
2 for l in f:
3 print(l)
FileNotFoundError: [Errno 2] No such file or directory: '/user/delta_test/_delta_log/00000000000000000000.json'
I have tried adding /dbfs/ and dbfs:/, but nothing worked out; I'm still getting the same error.
with open('/dbfs/user/delta_test/_delta_log/00000000000000000000.json','r') as f:
    for l in f:
        print(l)
But using dbutils.fs.head I was able to read the file:
dbutils.fs.head("/user/delta_test/_delta_log/00000000000000000000.json")
'{"commitInfo":{"timestamp":1598224183331,"userId":"284520831744638","userName":"","operation":"WRITE","operationParameters":{"mode":"Append","partitionBy":"[]"},"notebook":{"","isolationLevel":"WriteSerializable","isBlindAppend":true,"operationMetrics":{"numFiles":"1","numOutputBytes":"1171","numOutputRows":"100"}}}\n{"protocol":{"minReaderVersi...etc
How can we read/cat a DBFS file in Databricks with the Python open method?
By default, this data is on DBFS, and your code needs to understand how to access it. Python doesn't know about it - that's why it's failing.
But there is a workaround - DBFS is mounted to the nodes at /dbfs, so you just need to prepend it to your file name: instead of /user/delta_test/_delta_log/00000000000000000000.json, use /dbfs/user/delta_test/_delta_log/00000000000000000000.json
Update: on Community Edition, in DBR 7+, this mount is disabled. The workaround is to use the dbutils.fs.cp command to copy the file from DBFS to a local directory, like /tmp or /var/tmp, and then read from it:
dbutils.fs.cp("/file_on_dbfs", "file:///tmp/local_file")
Please note that if you don't specify a URI scheme, then the path refers to DBFS by default, and to refer to a local file you need to use the file:// prefix (see docs).
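For example, a minimal sketch of that workaround for the file from your snippet (the local name /tmp/delta_log.json is just an example):
# copy the delta log from DBFS to the driver's local disk, then read it with plain Python
dbutils.fs.cp("/user/delta_test/_delta_log/00000000000000000000.json", "file:///tmp/delta_log.json")
with open("/tmp/delta_log.json", "r") as f:
    for line in f:
        print(line)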

How to import text file in Data bricks

I am trying to write a text file with some text and load the same text file in Databricks, but I am getting an error.
Code
#write a file to DBFS using Python I/O APIs
with open("/dbfs/FileStore/tables/test_dbfs.txt", 'w') as f:
    f.write("Apache Spark is awesome!\n")
    f.write("End of example!")
# read the file
with open("/dbfs/tmp/test_dbfs.txt", "r") as f_read:
    for line in f_read:
        print(line)
Error
FileNotFoundError: [Errno 2] No such file or directory: '/dbfs/FileStore/tables/test_dbfs.txt'
The /dbfs mount doesn't work on Community Edition with DBR >= 7.x - it's a known limitation.
To work around this limitation you need to work with files on the driver node and upload or download files using the dbutils.fs.cp command (docs). So your writing will look like the following:
#write a file to the local filesystem using Python I/O APIs
with open('/tmp/local-path', 'w') as f:
    f.write("Apache Spark is awesome!\n")
    f.write("End of example!")
# upload file to DBFS
dbutils.fs.cp('file:/tmp/local-path', 'dbfs:/FileStore/tables/test_dbfs.txt')
and reading from DBFS will look like the following:
# copy file from DBFS to the local filesystem
dbutils.fs.cp('dbfs:/tmp/test_dbfs.txt', 'file:/tmp/local-path')
# read the file locally
with open("/tmp/local-path", "r") as f_read:
    for line in f_read:
        print(line)

FileNotFoundError: [WinError 2] The system cannot find the file specified while loading model from s3

I have recently saved a model to S3 using joblib.
model_doc is the model object:
import subprocess
import joblib

def save_d2v_to_s3_current_doc2vec_model(model, fname):
    model_name = fname
    joblib.dump(model, model_name)
    s3_base_path = 's3://sd-flikku/datalake/current_doc2vec_model'
    path = s3_base_path + '/' + model_name
    command = "aws s3 cp {} {}".format(model_name, path).split()
    print('saving...' + model_name)
    subprocess.call(command)

save_d2v_to_s3_current_doc2vec_model(model_doc, "doc2vec_model")
It was successful, but after that, when I try to load the model back from S3, it gives me an error:
def load_d2v(fname):
    model_name = fname
    s3_base_path = 's3://sd-flikku/datalake/current_doc2vec_model'
    path = s3_base_path + '/' + model_name
    command = "aws s3 cp {} {}".format(path, model_name).split()
    print('loading...' + model_name)
    subprocess.call(command)
    model = joblib.load(model_name)
    return model

model = load_d2v("doc2vec_model")
This is the error I get:
loading...doc2vec_model
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 7, in load_d2v
File "C:\Users\prane\AppData\Local\Programs\Python\Python37\lib\subprocess.py", line 339, in call
with Popen(*popenargs, **kwargs) as p:
File "C:\Users\prane\AppData\Local\Programs\Python\Python37\lib\subprocess.py", line 800, in __init__
restore_signals, start_new_session)
File "C:\Users\prane\AppData\Local\Programs\Python\Python37\lib\subprocess.py", line 1207, in _execute_child
startupinfo)
FileNotFoundError: [WinError 2] The system cannot find the file specified
I don't even understand why it is saying the file was not found; this was the path I used to save the model, but now I'm unable to get the model back from S3. Please help me!!
I suggest that rather than the generic print() lines showing your intent, you should print the actual command you've composed, to verify that it makes sense upon inspection.
If it does, then also try that exact same aws ... command directly, at the command prompt where you had been launching your Python code, to make sure it runs that way. If it doesn't, you may get a clearer error.
Note that the error you're getting doesn't particularly look like it's coming from the aws command or from the S3 service - which might talk about 'paths' or 'objects'. Rather, it's from the Python subprocess system and its Popen call. I think those are reached via your call to subprocess.call(), but for some reason your line of code isn't shown. (How are you running the block of code with load_d2v()?)
That suggests the file that's not found might be the aws command itself. Are you sure it's installed and runnable from the exact working directory/environment that your Python code is running in and invoking via subprocess.call()?
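As a rough sketch of both checks (printing the composed command and confirming that the aws executable is even visible to Python; the S3 path is taken from your snippet), you could run something like this before the subprocess call:
import shutil
import subprocess

model_name = "doc2vec_model"
path = "s3://sd-flikku/datalake/current_doc2vec_model/" + model_name
command = "aws s3 cp {} {}".format(path, model_name).split()

print("about to run:", command)              # verify the composed command looks right
print("aws found at:", shutil.which("aws"))  # None means subprocess cannot find the aws executable

result = subprocess.run(command)             # like subprocess.call(), but returns a CompletedProcess
print("return code:", result.returncode)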
(BTW, if my previous answer got you over your sklearn.externals.joblib problem, it'd be good for you to mark the answer as accepted, to save other potential answerers from thinking that's still an unsolved question that's blocking you.)
Try adding the extension of your model file to your fname if you are confident the model file is there, e.g. doc2vec_model.h3.

Python: Cannot Create Zip File

I simply copied and pasted this code from a Python tutorial website, but the code won't work. What's missing? I am using Python 3.4.3. Thank you.
import zipfile

# Create zip file
print("Creating zip archive")
zf = zipfile.ZipFile("python_zip_file.zip", mode = "w")
try:
    # Add file to our zip
    zf.write("zippy2.py")
finally:
    print("closing")
    zf.close()
Traceback (most recent call last):
File "/Users/Cindy/Documents/Python/Zip.py", line 9, in <module>
zf.write("zippy2.py")
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/zipfile.py", line 1326, in write
st = os.stat(filename)
FileNotFoundError: [Errno 2] No such file or directory: 'zippy2.py'
# Add file to our zip
zf.write("zippy2.py")
You should have a file named zippy2.py in the folder.
Since you just copied the code, you might not have the file that was mentioned in it. Create a file named zippy2.py in the same folder and check.
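A minimal check you could run first (nothing zipfile-specific, just confirming the file is where Python is looking):
import os

print(os.getcwd())                    # the folder the script runs in
print(os.path.exists("zippy2.py"))    # must be True, otherwise zf.write("zippy2.py") raises FileNotFoundError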
Try learning with this:
#!/usr/bin/env python
import zipfile

print("Creating zip archive")
zip = zipfile.ZipFile('Archive.zip', 'w')  # Archive.zip is the name of the zip file
zip.write('file.txt')   # file.txt should be in the current working directory
zip.write('file1.txt')  # file1.txt too
zip.close()
