Saving a file locally in Databricks PySpark

I am sure there is documentation for this somewhere and/or the solution is obvious, but I've come up dry in all of my searching.
I have a dataframe that I want to export to a text file to my local machine. The dataframe contains strings with commas, so just display -> download full results ends up with a distorted export. I'd like to export out with a tab-delimiter, but I cannot figure out for the life of me how to download it locally.
I have
match1.write.format("com.databricks.spark.csv")
.option("delimiter", "\t")
.save("file:\\\C:\\Users\\user\\Desktop\\NewsArticle.txt")
but clearly this isn't right. I suspect it is writing somewhere else (somewhere I don't want it to be...) because running it again gives me the error that the path already exists. So... what is the correct way?

cricket_007 pointed me along the right path. Ultimately, I needed to save the file to the FileStore of Databricks (not just anywhere on DBFS), and then download the output through the xxxxx.databricks.com/files/[insert file path here] link.
My resulting code was:
# Parentheses replace the backslash continuations so the inline comments are valid Python.
(df.repartition(1)                  # repartitioned to save as one collective file
    .write.format('csv')            # in csv format
    .option("header", True)         # with header
    .option("quote", "")            # get rid of quote escaping
    .option("delimiter", "\t")      # delimiter of choice
    .save('dbfs:/FileStore/df/'))   # saved to the FileStore
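To make the download step concrete, here is a minimal sketch of how a path under dbfs:/FileStore/ maps to the link you open in the browser. The function name filestore_download_url, the host, and the part-file name are all hypothetical, and the /files/ prefix is my understanding of how Databricks serves the FileStore, so check it against your own workspace.

```python
def filestore_download_url(workspace_host, dbfs_path):
    """Map a dbfs:/FileStore/... path to its browser URL (assumes the /files/ prefix)."""
    prefix = "dbfs:/FileStore/"
    if not dbfs_path.startswith(prefix):
        raise ValueError("only paths under dbfs:/FileStore/ can be downloaded this way")
    relative = dbfs_path[len(prefix):]
    return "https://{}/files/{}".format(workspace_host.strip("/"), relative)


# Hypothetical host and part-file name, for illustration only.
url = filestore_download_url("xxxxx.databricks.com", "dbfs:/FileStore/df/part-00000.csv")
print(url)
```

The real part-file name is generated by Spark, so list the output folder first and paste the actual name into the path.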

Check whether the output is present at the location below. There should be multiple part files in that folder.
import os
print(os.getcwd())  # current working directory on the driver
If you want a single file rather than multiple part files, you can use coalesce(). Note that this forces one worker to fetch all the data and write it sequentially, so it is not advisable for large datasets.
df.coalesce(1).write.format("csv").\
option("delimiter", "\t").\
save("<file path>")
Hope this helps!
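If the data is small enough to collect to the driver, a plain-Python alternative is also possible. The sketch below is not Spark code: the rows are hypothetical tuples standing in for what df.collect() would return, and rows_to_tsv is a name I made up. It shows that with a tab delimiter the commas inside the strings come through untouched.

```python
import csv
import io


def rows_to_tsv(header, rows):
    """Serialize rows as tab-delimited text; commas in values need no quoting."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical sample rows whose text contains commas.
rows = [("1", "Markets rally, then fade"), ("2", "Rain, wind, and more rain")]
tsv_text = rows_to_tsv(["id", "headline"], rows)
print(tsv_text)
```

This only makes sense for small results, since everything has to fit in the driver's memory.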

Related

Spark Streaming finds file but claims it can't find the file

I have the below - which monitors a directory & pulls in the logs every X seconds.
The issue I have is this:
I set the script running
I then create a file in the directory (let's say testfile.txt)
The script then errors saying testfile.txt does not exist
It found the file and the filename, so it does exist and it finds it.
What I can see is that I define the path with a file:/// and it returns an error that it can't find file:/. So it seems to be missing two // for some reason:
Thanks for any help!!!!
Code
#only files after stream starts
df = spark_session\
.readStream\
.option('newFilesOnly', 'true')\
.option('header', 'true')\
.schema(myschema)\
.text('file:///home/keenek1/analytics/logs/')\
.withColumn("FileName", input_file_name())
Error
FileNotFoundException: File file:/home/keenek1/analytics/logs/loggywoggywoo.txt does not exist
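Please change file:/// to hdfs:///. A file:/// path is resolved against the local filesystem of whichever node runs the task, so on a cluster a file that exists on only one machine will not be found by the others.
from pyspark.sql.functions import input_file_name

# changed file:/// to hdfs:/// (three slashes, so the default namenode is used;
# with hdfs://home/... the word "home" would be read as the namenode host)
df = spark_session\
    .readStream\
    .option('newFilesOnly', 'true')\
    .option('header', 'true')\
    .schema(myschema)\
    .text('hdfs:///home/keenek1/analytics/logs/')\
    .withColumn("FileName", input_file_name())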
Regarding the follow-up question:
If the same log file is overwritten, let's say hourly, the checkpoint doesn't reprocess the file. I need it to say 'if modified time changes, reprocess' - is that possible?
A workaround is to point your Spark stream at a different directory and use Spark listeners to check the file timestamps in the actual directory. If a file's timestamp changes, move that file to your streaming directory under a new name.
Let me know if you want code. I can give it to you in Scala; you may need to convert it to Python.
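As a rough illustration of that workaround, here is a sketch in plain Python rather than Scala, and without Spark listeners. It polls the real log directory, remembers each file's modification time, and copies any changed file into the streaming directory under a new name. The function name stage_modified_files and the demo paths are hypothetical, and it copies instead of moving because the original file keeps being overwritten.

```python
import os
import shutil
import tempfile


def stage_modified_files(source_dir, stream_dir, seen_mtimes):
    """Copy files whose modification time changed into stream_dir under a new name."""
    staged = []
    for name in sorted(os.listdir(source_dir)):
        src = os.path.join(source_dir, name)
        if not os.path.isfile(src):
            continue
        mtime = os.stat(src).st_mtime_ns
        if seen_mtimes.get(name) == mtime:
            continue  # unchanged since the last check
        seen_mtimes[name] = mtime
        base, ext = os.path.splitext(name)
        # A fresh name per version lets the stream treat it as a new file.
        target = os.path.join(stream_dir, "{}_{}{}".format(base, mtime, ext))
        shutil.copy2(src, target)
        staged.append(target)
    return staged


# Demo on temporary directories (stand-ins for the real log and stream folders).
source_dir = tempfile.mkdtemp()
stream_dir = tempfile.mkdtemp()
log_path = os.path.join(source_dir, "app.log")
with open(log_path, "w") as handle:
    handle.write("first version\n")

seen = {}
first_pass = stage_modified_files(source_dir, stream_dir, seen)
second_pass = stage_modified_files(source_dir, stream_dir, seen)  # nothing changed

# Simulate the hourly overwrite: rewrite the file and push its mtime forward.
with open(log_path, "w") as handle:
    handle.write("second version\n")
info = os.stat(log_path)
os.utime(log_path, ns=(info.st_atime_ns, info.st_mtime_ns + 5 * 10**9))
third_pass = stage_modified_files(source_dir, stream_dir, seen)
print(len(first_pass), len(second_pass), len(third_pass))
```

In a real job you would call this on a timer and probably write to a scratch folder first, then rename into the streaming directory, so the stream never picks up a half-copied file.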

Databricks - Creating output file

I'm pretty new to Databricks, so excuse my ignorance.
I have a Databricks notebook that creates a table to hold data. I'm trying to output the data to a pipe-delimited file using another notebook, which uses Python. If I use the 'Order By' clause, each record is created in a separate file. If I leave the clause out of the code I get 1 file, but it's not in order.
The code from the notebook is as follows
%python
try:
    dfsql = spark.sql("select field_1, field_2, field_3, field_4, field_5, field_6, field_7, field_8, field_9, field_10, field_11, field_12, field_13, field_14, field_15, field_16 from dbsmets1mig02_technical_build.tbl_tech_output_bsmart_update ORDER BY MSN,Sort_Order") #Replace with your SQL
except:
    print("Exception occurred")
if dfsql.count() == 0:
    print("No data rows")
else:
    dfsql.write.format("com.databricks.spark.csv").option("header","false").option("delimiter", "|").mode("overwrite").save("/mnt/publisheddatasmets1mig/smetsmig1/mmt/bsmart")
Spark writes one file per partition, and your ORDER BY triggers a shuffle that produces many partitions, hence many files. Generally you want multiple files, because that gives you more throughput. With a single file/partition only one thread does the writing, so only one CPU on your workers is active while the others sit idle, which makes it a very expensive way of solving your problem.
You could leave the order by in and coalesce back into a single partition:
dfsql.coalesce(1).write.format("com.databricks.spark.csv").option("header","false").option("delimiter", "|").mode("overwrite").save("/mnt/publisheddatasmets1mig/smetsmig1/mmt/bsmart")
Even if you have multiple files you can point your other notebook at the folder and it will read all files in the folder.
To accomplish this I have done something similar to what simon_dmorias suggested. I am not sure if there is a better way to do so, as this doesn't scale very well but if you are working with a small dataset it will work.
simon_dmorias suggested: df.coalesce(1).write.format("com.databricks.spark.csv").option("header","false").option("delimiter", "|").mode("overwrite").save("/mnt/mountone/data/")
This writes a single part file inside a directory, i.e. /mnt/mountone/data/data-<guid>-.csv, which I believe is not what you are looking for, right? You just want /mnt/mountone/data.csv, similar to the pandas .to_csv function.
Therefore, I first write it to a temporary location on the cluster (not on the mount).
df.coalesce(1).write.format("com.databricks.spark.csv").option("header","false").option("delimiter", "|").mode("overwrite").save("/tmpdir/data")
I then use the dbutils.fs.ls("/tmpdir/data") command to list the directory contents and identify the name of the CSV file that was written there, i.e. /tmpdir/data/data-<guid>-.csv.
Once I have the CSV file name, I use the dbutils.fs.cp function to copy the file to a mount location and rename it. This gives you a single file without the directory, which is what I believe you were looking for.
dbutils.fs.cp("/tmpdir/data/data-<guid>-.csv", "/mnt/mountone/data.csv")
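The "identify the name of the CSV file" step can be automated. The sketch below is plain Python, so it runs outside a notebook; the helper name find_single_csv and the listing entries are hypothetical. In a notebook the list would come from something like [f.path for f in dbutils.fs.ls("/tmpdir/data")].

```python
def find_single_csv(paths):
    """Return the one path ending in .csv, ignoring marker files such as _SUCCESS."""
    matches = [p for p in paths if p.endswith(".csv")]
    if len(matches) != 1:
        raise ValueError("expected exactly one CSV file, found %d" % len(matches))
    return matches[0]


# Hypothetical directory listing after a coalesce(1) write.
listing = [
    "dbfs:/tmpdir/data/_SUCCESS",
    "dbfs:/tmpdir/data/_started_123",
    "dbfs:/tmpdir/data/data-abc123-.csv",
]
csv_path = find_single_csv(listing)
print(csv_path)
```

The returned path is what you would pass as the first argument to dbutils.fs.cp. Raising on zero or several matches guards against copying the wrong file when the write did not end up in a single partition.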

HDFS and Spark: Best way to write a file and reuse it from another program

I have some results from a Spark application saved in the HDFS as files called part-r-0000X (X= 0, 1, etc.). And, because I want to join the whole content in a file, I'm using the following command:
hdfs dfs -getmerge srcDir destLocalFile
The previous command is used in a bash script that empties the output directory (where the part-r-... files are saved) and, inside a loop, executes the above getmerge command.
The thing is, I need to use the resulting file in another Spark program, which needs that merged file as input in HDFS. So I'm saving it locally and then uploading it to HDFS.
Another option I've thought of is to write the file from the Spark program this way:
outputData.coalesce(1, false).saveAsTextFile(outPathHDFS)
But I've read that coalesce() doesn't help with performance.
Any other ideas? suggestions? Thanks!
My guess is that you want to merge all the files into a single one so that you can load them all at once into a Spark RDD.
Let the files be in parts (0, 1, ...) in HDFS.
Why not load them with wholeTextFiles, which actually does what you need?
wholeTextFiles(path, minPartitions=None, use_unicode=True)
Read a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. Each file is read as a single record and returned in a key-value pair, where the key is the path of each file, the value is the content of each file.
If use_unicode is False, the strings will be kept as str (encoding as utf-8), which is faster and smaller than unicode. (Added in Spark 1.2)
For example, if you have the following files:
hdfs://a-hdfs-path/part-00000
hdfs://a-hdfs-path/part-00001
...
hdfs://a-hdfs-path/part-nnnnn
Do rdd = sparkContext.wholeTextFiles("hdfs://a-hdfs-path"), then rdd contains:
(a-hdfs-path/part-00000, its content)
(a-hdfs-path/part-00001, its content)
...
(a-hdfs-path/part-nnnnn, its content)
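To make the shape of that result concrete without a cluster, here is a plain-Python stand-in that builds the same kind of (path, content) pairs from a local directory. It is not the Spark implementation; whole_text_files_local and the demo files are made up for illustration.

```python
import os
import tempfile


def whole_text_files_local(directory):
    """Plain-Python stand-in for the (path, content) pairs wholeTextFiles returns."""
    pairs = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            with open(path, encoding="utf-8") as handle:
                pairs.append((path, handle.read()))
    return pairs


# Create two small part files in a temporary directory.
demo_dir = tempfile.mkdtemp()
for index, text in enumerate(["alpha\nbeta\n", "gamma\n"]):
    part_path = os.path.join(demo_dir, "part-%05d" % index)
    with open(part_path, "w", encoding="utf-8") as handle:
        handle.write(text)

pairs = whole_text_files_local(demo_dir)
# Joining the values in key order gives the merged content getmerge would produce.
merged = "".join(content for _, content in pairs)
print(merged)
```

Each file becomes one record, so this approach suits many small files better than a few very large ones.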
Try Spark's bucketBy.
This is a nice feature via df.write.saveAsTable(), but this format can only be read by Spark. The data shows up in the Hive metastore but cannot be read by Hive or Impala.
The best solution that I've found so far was:
outputData.saveAsTextFile(outPath, classOf[org.apache.hadoop.io.compress.GzipCodec])
Which saves the outputData in compressed part-0000X.gz files under the outPath directory.
And, from the other Spark app, it reads those files using this:
val inputData = sc.textFile(inDir + "part-00*", numPartition)
Where inDir corresponds to the outPath.

Difficulty with encoding while reading data in Spark

In connection with my earlier question, when I give the command,
filePath = sc.textFile("/user/cloudera/input/Hin*/datafile.txt")
filePath.collect()
some part of the data has '\xa0' prefixed to every word, and other part of the data doesn't have that special character. I am attaching 2 pictures, one with '\xa0', and another without '\xa0'. The content shown in 2 pictures belong to same file. Only some part of the data from same file is read that way by Spark. I have checked the original data file present in HDFS, and it was problem free.
I feel that it has something to do with encoding. I tried using replace inside flatMap, e.g. flatMap(lambda line: line.replace(u'\xa0', ' ').split(" ")) and flatMap(lambda line: line.replace(u'\xa0', u' ').split(" ")), but neither worked for me. This question might sound dumb, but I am a newbie with Apache Spark, and I need some help to overcome this problem.
Can anyone please help me? Thanks in advance.
Check the encoding of your file. When you use sc.textFile, Spark expects a UTF-8 encoded file.
One solution is to read your file with sc.binaryFiles and then apply the expected encoding.
sc.binaryFiles creates a key/value RDD where the key is the path to the file and the value is its content as bytes.
If you need to keep only the text and apply a decoding function:
filePath = sc.binaryFiles("/user/cloudera/input/Hin*/datafile.txt")
decoded = filePath.map(lambda x: x[1].decode('utf-8'))  # or another encoding, depending on your file
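The function you pass to map could look something like the sketch below, which runs as plain Python so you can try it on a few bytes first. The name decode_and_tokenize and the sample bytes are hypothetical, and which encoding your file really uses is something you need to check.

```python
def decode_and_tokenize(raw_bytes, encoding="utf-8"):
    """Decode raw file bytes, turn non-breaking spaces into plain spaces, split into words."""
    text = raw_bytes.decode(encoding)
    return text.replace(u"\xa0", u" ").split(" ")


# The same text encoded two ways: the non-breaking space is b'\xc2\xa0' in UTF-8
# and the single byte b'\xa0' in Latin-1.
utf8_sample = u"foo\xa0bar baz".encode("utf-8")
latin1_sample = u"foo\xa0bar baz".encode("latin-1")

utf8_tokens = decode_and_tokenize(utf8_sample)
latin1_tokens = decode_and_tokenize(latin1_sample, encoding="latin-1")
print(utf8_tokens, latin1_tokens)
```

With binaryFiles the value of each pair is the raw bytes, so something like filePath.flatMap(lambda kv: decode_and_tokenize(kv[1])) should give one record per word.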

Spark CSV 2.1 File Names

I'm trying to save a DataFrame to CSV using the new Spark 2.1 csv option
df.select(myColumns: _*).write
.mode(SaveMode.Overwrite)
.option("header", "true")
.option("codec", "org.apache.hadoop.io.compress.GzipCodec")
.csv(absolutePath)
Everything works fine and I don't mind having the part-000XX prefix,
but now it seems some UUID is added as a suffix,
i.e.
part-00032-10309cf5-a373-4233-8b28-9e10ed279d2b.csv.gz ==> part-00032.csv.gz
Does anyone know how I can remove this suffix and keep only the part-000XX convention?
Thanks
You can remove the UUID by overriding the configuration option "spark.sql.sources.writeJobUUID":
https://github.com/apache/spark/commit/0818fdec3733ec5c0a9caa48a9c0f2cd25f84d13#diff-c69b9e667e93b7e4693812cc72abb65fR75
Unfortunately, this solution will not fully mirror the old saveAsTextFile style (i.e. part-00000), but it can make the output file name more sane, such as part-00000-output.csv.gz, where "output" is the value you pass to spark.sql.sources.writeJobUUID. The "-" is automatically appended.
SPARK-8406 is the relevant Spark issue and here's the actual Pull Request: https://github.com/apache/spark/pull/6864
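If overriding the configuration is not an option, a different approach is to rename the files after the job has finished. The sketch below only computes the new name; the helper strip_job_uuid and its regular expression are my own assumptions about the naming pattern shown in the question, not part of Spark.

```python
import re

# part-<number>-<uuid><extensions>, as in the file name from the question.
UUID_SUFFIX = re.compile(
    r"^(part-\d+)-[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}(\..+)?$"
)


def strip_job_uuid(file_name):
    """Drop the UUID from a part-file name; leave any other name unchanged."""
    match = UUID_SUFFIX.match(file_name)
    if match is None:
        return file_name
    return match.group(1) + (match.group(2) or "")


renamed = strip_job_uuid("part-00032-10309cf5-a373-4233-8b28-9e10ed279d2b.csv.gz")
print(renamed)
```

You would still need to do the actual rename with your filesystem's API, and keep in mind that without the UUID two jobs writing to the same directory can overwrite each other's files.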

Resources