Cannot save Python pickled file in Spark YARN cluster - apache-spark

I submit Spark code to YARN, and in the code I want to save a Python pickled file, but it fails with FileNotFoundError: No such file or directory. My code:
pdf = spark_df.select("a", "b").toPandas()
with open('./myfile', 'wb') as f:
    pickle.dump(pdf, f)
I substituted ./myfile with file:///myfile and with an HDFS path, but it still shows that error. How can I save this pickled file locally or to HDFS?
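A rough sketch of one way to do this, following the same pattern as the answers below (write to a path the local Python process can actually see, then copy): it assumes /tmp on the driver is writable, and both target paths here are hypothetical.
import pickle
import subprocess

# toPandas() collects the query result onto the driver, so write the pickle to an
# absolute path on the driver's local filesystem instead of a relative one
pdf = spark_df.select("a", "b").toPandas()

local_path = "/tmp/myfile.pkl"  # hypothetical driver-local path
with open(local_path, "wb") as f:
    pickle.dump(pdf, f)

# optionally push the pickle to HDFS afterwards (hypothetical HDFS target path)
subprocess.run(["hadoop", "fs", "-put", "-f", local_path, "/user/me/myfile.pkl"], check=True)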

Related

Issue while trying to read a text file in Databricks using local file APIs rather than the Spark API

I'm trying to read a small .txt file which is added as a table to the default database on Databricks. While trying to read the file via the local file API, I get a FileNotFoundError, but I'm able to read the same file as a Spark RDD using SparkContext.
Please find the code below:
with open("/FileStore/tables/boringwords.txt", "r") as f_read:
for line in f_read:
print(line)
This gives me the error:
FileNotFoundError                         Traceback (most recent call last)
<command-2618449717515592> in <module>
----> 1 with open("dbfs:/FileStore/tables/boringwords.txt", "r") as f_read:
      2     for line in f_read:
      3         print(line)

FileNotFoundError: [Errno 2] No such file or directory: 'dbfs:/FileStore/tables/boringwords.txt'
Whereas I have no problem reading the file using SparkContext:
boring_words = sc.textFile("/FileStore/tables/boringwords.txt")
set(i.strip() for i in boring_words.collect())
And as expected, I get the result for the above block of code:
Out[4]: {'mad',
'mobile',
'filename',
'circle',
'cookies',
'immigration',
'anticipated',
'editorials',
'review'}
I was also referring to the DBFS documentation here to understand the local file API's limitations, but it did not lead anywhere on the issue.
Any help would be greatly appreciated. Thanks!
The problem is that you're using the open function, which works only with local files and doesn't know anything about DBFS or other file systems. To get this working, you need to use the DBFS local file API and prepend the /dbfs prefix to the file path: /dbfs/FileStore/....:
with open("/dbfs/FileStore/tables/boringwords.txt", "r") as f_read:
for line in f_read:
print(line)
Alternatively, you can simply use the built-in spark.read.csv method:
df = spark.read.csv("dbfs:/FileStore/tables/boringwords.txt")
Alternatively, we can list the files with dbutils:
files = dbutils.fs.ls('/FileStore/tables/')
for fi in files:
    print(fi.path)

Cannot find '/dbfs/databricks-datasets' in my notebook [duplicate]

Trying to read a Delta log file in a Databricks Community Edition cluster (Databricks 7.2).
df = spark.range(100).toDF("id")
df.show()
df.repartition(1).write.mode("append").format("delta").save("/user/delta_test")

with open('/user/delta_test/_delta_log/00000000000000000000.json', 'r') as f:
    for l in f:
        print(l)
Getting file not found error:
FileNotFoundError: [Errno 2] No such file or directory: '/user/delta_test/_delta_log/00000000000000000000.json'
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<command-1759925981994211> in <module>
----> 1 with open('/user/delta_test/_delta_log/00000000000000000000.json','r') as f:
      2     for l in f:
      3         print(l)

FileNotFoundError: [Errno 2] No such file or directory: '/user/delta_test/_delta_log/00000000000000000000.json'
I have tried adding /dbfs/ and dbfs:/, but nothing worked out; I'm still getting the same error.
with open('/dbfs/user/delta_test/_delta_log/00000000000000000000.json', 'r') as f:
    for l in f:
        print(l)
But using dbutils.fs.head I was able to read the file.
dbutils.fs.head("/user/delta_test/_delta_log/00000000000000000000.json")
'{"commitInfo":{"timestamp":1598224183331,"userId":"284520831744638","userName":"","operation":"WRITE","operationParameters":{"mode":"Append","partitionBy":"[]"},"notebook":{"","isolationLevel":"WriteSerializable","isBlindAppend":true,"operationMetrics":{"numFiles":"1","numOutputBytes":"1171","numOutputRows":"100"}}}\n{"protocol":{"minReaderVersi...etc
How can we read/cat a DBFS file in Databricks with the Python open method?
By default, this data is on DBFS, and your code needs to understand how to access it. Python's open doesn't know about it - that's why it's failing.
But there is a workaround - DBFS is mounted on the nodes at /dbfs, so you just need to prepend it to your file name: instead of /user/delta_test/_delta_log/00000000000000000000.json, use /dbfs/user/delta_test/_delta_log/00000000000000000000.json.
Update: on Community Edition, in DBR 7+, this mount is disabled. The workaround is to use the dbutils.fs.cp command to copy the file from DBFS to a local directory, such as /tmp or /var/tmp, and then read from it:
dbutils.fs.cp("/file_on_dbfs", "file:///tmp/local_file")
Please note that if you don't specify a URI scheme, the path by default refers to DBFS; to refer to a local file you need to use the file:// prefix (see docs).
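Putting the copy and the read together for the file from this question (a minimal sketch; /tmp/delta_log.json is just an arbitrary local name):
# copy the file from DBFS to the driver's local /tmp, then read it with plain open()
dbutils.fs.cp("/user/delta_test/_delta_log/00000000000000000000.json", "file:///tmp/delta_log.json")
with open("/tmp/delta_log.json", "r") as f:
    for line in f:
        print(line)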

How to import a text file in Databricks

I am trying to write a text file with some text and load the same text file in Databricks, but I am getting an error.
Code
# write a file to DBFS using Python I/O APIs
with open("/dbfs/FileStore/tables/test_dbfs.txt", 'w') as f:
    f.write("Apache Spark is awesome!\n")
    f.write("End of example!")

# read the file
with open("/dbfs/tmp/test_dbfs.txt", "r") as f_read:
    for line in f_read:
        print(line)
Error
FileNotFoundError: [Errno 2] No such file or directory: '/dbfs/FileStore/tables/test_dbfs.txt'
The /dbfs mount doesn't work on Community Edition with DBR >= 7.x - it's a known limitation.
To work around this limitation, you need to work with files on the driver node and upload or download them using the dbutils.fs.cp command (docs). So your write will look as follows:
# write a file to the local filesystem using Python I/O APIs
with open('/tmp/local-path', 'w') as f:
    f.write("Apache Spark is awesome!\n")
    f.write("End of example!")

# upload the file to DBFS
dbutils.fs.cp('file:/tmp/local-path', 'dbfs:/FileStore/tables/test_dbfs.txt')
and reading from DBFS will look as follows:
# copy the file from DBFS to the local filesystem
dbutils.fs.cp('dbfs:/FileStore/tables/test_dbfs.txt', 'file:/tmp/local-path')

# read the file locally
with open("/tmp/local-path", "r") as f_read:
    for line in f_read:
        print(line)

PySpark: IOError: [Errno 2] No such file or directory

My PySpark code runs directly on a Hadoop cluster. But when I open this file it gives me this error: IOError: [Errno 2] No such file or directory:
with open("/tmp/CIP_UTILITIES/newjsonfile.json", "w") as fp:
json.dump("json_output", fp)
For cases like this, when you are working with files, it is better to use the pathlib module.
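For example, a rough sketch (assuming the error comes from the /tmp/CIP_UTILITIES directory not existing on the node where the code runs; the path is the one from the question):
import json
from pathlib import Path

# make sure the parent directory exists on this node before opening the file for writing
out_path = Path("/tmp/CIP_UTILITIES/newjsonfile.json")
out_path.parent.mkdir(parents=True, exist_ok=True)
with out_path.open("w") as fp:
    json.dump("json_output", fp)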
Can you run a debugger to see where this path is actually pointing?
Greetings

MLlib not saving the model data in Spark 2.1

We have a machine learning model that looks roughly like this:
sc = SparkContext(appName="MLModel")
sqlCtx = SQLContext(sc)
df = sqlCtx.createDataFrame(data_res_promo)
# where data_res_promo comes from a pandas dataframe
indexer = StringIndexer(inputCol="Fecha_Code", outputCol="Fecha_Index")
train_indexer = indexer.fit(df)
train_indexer.save('ALSIndexer')  # this saves the indexer architecture
On my machine, when I run it locally, it generates a folder ALSIndexer/ that has the parquet files and all the information on the model.
When I run it in our Azure Spark cluster, it does not generate the folder on the main node (nor on the workers). However, if we try to rewrite it, it says:
cannot overwrite folder
Which means it is somewhere, but we can't find it.
Would you have any pointers?
Spark will by default save files to the distributed filesystem (probably HDFS). The files will therefore not be visible on the nodes' local filesystems, but since they are present on the distributed filesystem, you get the "cannot overwrite folder" error message.
You can easily access the files through HDFS and copy them to the main node. This can be done on the command line with one of these commands:
1. hadoop fs -get <HDFS file path> <Local system directory path>
2. hadoop fs -copyToLocal <HDFS file path> <Local system directory path>
It can also be done by importing org.apache.hadoop.fs.FileSystem and using the methods available there.
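From PySpark, that same API can be reached through the JVM gateway; a minimal sketch (the ALSIndexer path comes from the question, the local target /tmp/ALSIndexer is hypothetical, and an active SparkContext sc is assumed):
# access the Hadoop FileSystem API through PySpark's JVM gateway
hadoop = sc._jvm.org.apache.hadoop.fs
fs = hadoop.FileSystem.get(sc._jsc.hadoopConfiguration())
# copy the saved indexer from the distributed filesystem to the driver's local disk
fs.copyToLocalFile(hadoop.Path("ALSIndexer"), hadoop.Path("/tmp/ALSIndexer"))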
