Spark: Local file system as default filesystem for spark application - apache-spark

I wrote a Spark application in which I want to save a DataFrame to the local filesystem. I then use java.io.FileReader and FileWriter to read the local file written by Spark, make some modifications, and write it back to the local filesystem. So the file path I need to use is constant, for example file:////name.txt. It will be used for both dataframe.save and the Java FileReader and FileWriter.
I used the API like this:
dataframe.save("/abc/name.txt")
But Spark is saving this file to HDFS. Is there an environment variable that needs to be set to make Spark save the file to the local filesystem?
Thanks

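Try dataframe.save("file:///<LOCAL_PATH>/name.txt"). A path without a scheme is resolved against the default filesystem (fs.defaultFS), which on your cluster is HDFS; the file:// scheme selects the local filesystem explicitly.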
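As a rough sketch of why the scheme matters for the "one constant path" requirement: a string with no scheme leaves the choice of filesystem to the cluster configuration, while a file:// URI names the local filesystem and can also be converted into a path for plain Java I/O. The class name and the /tmp/name.txt location are placeholders that do not come from the question, and the conversion assumes a Unix-like machine.

```java
import java.net.URI;
import java.nio.file.Path;
import java.nio.file.Paths;

final class LocalPathSketch {
    // Hypothetical constant shared by the Spark save call and the Java reader/writer.
    static final String OUTPUT_URI = "file:///tmp/name.txt";

    // Returns the URI scheme, or null when the string carries no scheme at all.
    static String schemeOf(String location) {
        return URI.create(location).getScheme();
    }

    // Converts a file:// URI into a local path usable with FileReader/FileWriter.
    static Path toLocalPath(String fileUri) {
        return Paths.get(URI.create(fileUri));
    }

    public static void main(String[] args) {
        System.out.println(schemeOf("/abc/name.txt")); // no scheme: default filesystem decides
        System.out.println(schemeOf(OUTPUT_URI));      // explicit local filesystem
        System.out.println(toLocalPath(OUTPUT_URI));   // path for java.io / java.nio
    }
}
```

The same OUTPUT_URI string could then be passed to the save call, and toLocalPath(OUTPUT_URI).toFile() to FileReader or FileWriter.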

Related

How to save files in same directory using saveAsNewAPIHadoopFile spark scala

I am using Spark Streaming and I want to save each batch to my local filesystem in Avro format. I have used saveAsNewAPIHadoopFile to save the data in Avro format, and this works well, but it overwrites the existing file: the next batch overwrites the old data. Is there any way to save the Avro files in a common directory? I tried setting some properties on the Hadoop job conf to add a prefix to the file name, but none of them worked.
dstream.foreachRDD { rdd =>
rdd.saveAsNewAPIHadoopFile(
path,
classOf[AvroKey[T]],
classOf[NullWritable],
classOf[AvroKeyOutputFormat[T]],
job.getConfiguration()
)
}
Try this:
You can split your process into two steps:
Step 1: Write the Avro file using saveAsNewAPIHadoopFile to <temp-path>
Step 2: Move the file from <temp-path> to <actual-target-path>
This should solve your problem for now. I will share my thoughts if I find a way to handle this scenario in one step instead of two.
Hope this is helpful.
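Step 2 could look roughly like the sketch below. It assumes the output lands on the local filesystem (as in the question) and that the batch time in milliseconds is available to build a unique file name; BatchFileMover and moveBatch are hypothetical names, and only the JDK's java.nio.file API is used. For output on HDFS, the Hadoop FileSystem API would be needed instead.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class BatchFileMover {
    // Moves every part-* file from tempDir into targetDir, prefixing each name
    // with the batch time so files from earlier batches are not overwritten.
    // Marker files such as _SUCCESS and hidden .crc files are left behind.
    static List<Path> moveBatch(Path tempDir, Path targetDir, long batchTimeMs) throws IOException {
        Files.createDirectories(targetDir);
        List<Path> parts;
        try (Stream<Path> files = Files.list(tempDir)) {
            parts = files
                .filter(p -> p.getFileName().toString().startsWith("part-"))
                .sorted()
                .collect(Collectors.toList());
        }
        List<Path> moved = new ArrayList<>();
        for (Path part : parts) {
            Path dest = targetDir.resolve(batchTimeMs + "-" + part.getFileName());
            // No REPLACE_EXISTING option: a name collision fails instead of overwriting.
            moved.add(Files.move(part, dest));
        }
        return moved;
    }
}
```

In a streaming job, the batch time is available through the two-argument form of foreachRDD, which passes the RDD together with its batch Time.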

HDFS and Spark: Best way to write a file and reuse it from another program

I have some results from a Spark application saved in HDFS as files called part-r-0000X (X = 0, 1, etc.). Because I want to join the whole content into one file, I'm using the following command:
hdfs dfs -getmerge srcDir destLocalFile
The previous command is used in a bash script that empties the output directory (where the part-r-... files are saved) and, inside a loop, executes the above getmerge command.
The thing is, I need to use the resulting file in another Spark program, which needs that merged file as input in HDFS. So I'm saving it locally and then uploading it to HDFS.
I've thought of another option, which is to write the file from the Spark program this way:
outputData.coalesce(1, false).saveAsTextFile(outPathHDFS)
But I've read that coalesce() doesn't help with performance.
Any other ideas or suggestions? Thanks!
My guess is that you want to merge all the files into a single one so that you can load them all at once into a Spark RDD.
Let the files stay as parts (0, 1, ...) in HDFS.
Why not load them with wholeTextFiles, which does what you need?
wholeTextFiles(path, minPartitions=None, use_unicode=True)
Read a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. Each file is read as a single record and returned in a key-value pair, where the key is the path of each file, the value is the content of each file.
If use_unicode is False, the strings will be kept as str (encoding as utf-8), which is faster and smaller than unicode. (Added in Spark 1.2)
For example, if you have the following files:
hdfs://a-hdfs-path/part-00000
hdfs://a-hdfs-path/part-00001
...
hdfs://a-hdfs-path/part-nnnnn
Do rdd = sparkContext.wholeTextFiles("hdfs://a-hdfs-path"), then rdd contains:
(a-hdfs-path/part-00000, its content)
(a-hdfs-path/part-00001, its content)
...
(a-hdfs-path/part-nnnnn, its content)
Try Spark's bucketBy.
This is a nice feature available via df.write.saveAsTable(), but the resulting format can only be read by Spark. The data shows up in the Hive metastore but cannot be read by Hive or Impala.
The best solution that I've found so far was:
outputData.saveAsTextFile(outPath, classOf[org.apache.hadoop.io.compress.GzipCodec])
This saves outputData in compressed part-0000X.gz files under the outPath directory.
The other Spark app then reads those files using this:
val inputData = sc.textFile(inDir + "part-00*", numPartition)
Where inDir corresponds to the outPath.
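For comparison, the merge step from the question (getmerge into a local file) amounts roughly to concatenating the part files in name order. The sketch below shows that idea for a directory on the local filesystem using only the JDK; PartFileMerger and merge are hypothetical names, and it is not a replacement for the HDFS command.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class PartFileMerger {
    // Concatenates all regular part-* files found directly under srcDir,
    // in name order, into destFile. Returns how many part files were merged.
    static int merge(Path srcDir, Path destFile) throws IOException {
        List<Path> parts;
        try (Stream<Path> files = Files.list(srcDir)) {
            parts = files
                .filter(Files::isRegularFile)
                .filter(p -> p.getFileName().toString().startsWith("part-"))
                .sorted()
                .collect(Collectors.toList());
        }
        // Creates destFile, or truncates it if it already exists.
        try (OutputStream out = Files.newOutputStream(destFile)) {
            for (Path part : parts) {
                Files.copy(part, out); // appends this part's bytes to the output stream
            }
        }
        return parts.size();
    }
}
```

Reading the part files directly with a glob, as in the answer above, avoids this extra pass over the data altogether.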

Naming the csv file in write_df

I am writing a file in SparkR using write.df, and I am unable to specify the file name:
Code:
write.df(user_log0, path = "Output/output.csv",
source = "com.databricks.spark.csv",
mode = "overwrite",
header = "true")
Problem:
I expect a file called 'output.csv' inside the 'Output' folder, but what I get is a folder called 'output.csv' containing a file called 'part-00000-6859b39b-544b-4a72-807b-1b8b55ac3f09.csv'
What am I doing wrong?
P.S: R 3.3.2, Spark 2.1.0 on OSX
Because of the distributed nature of Spark, you can only define the directory into which the files will be saved; each task (one per partition) writes its own file using Spark's internal naming convention.
If you see only a single file, it means that your data is in a single partition, so only one task is writing. This is not the typical Spark behavior; however, if it fits your use case, you can collect the result to an R data frame and write the CSV from that.
In the more general case, where the data is spread across multiple executors, you cannot set a specific name for the files.

How to read a spark saved file in java code

I am new to Spark. I have a file TrainDataSpark.java in which I process some data, and at the end of it I save my Spark-processed data to a directory called Predictions with the code below:
predictions.saveAsTextFile("Predictions");
In the same TrainDataSpark.java I am adding the code below just after the above line.
OutputGeneratorOptimized ouputGenerator = new OutputGeneratorOptimized();
final Path predictionFilePath = Paths.get("/Predictions/part-00000");
final Path outputHtml = Paths.get("/outputHtml.html");
ouputGenerator.getFormattedHtml(input,predictionFilePath,outputHtml);
I am getting a NoSuchFile exception for /Predictions/part-00000. I have tried all possible paths, but it fails. I think the Java code searches for the file on my local system and not on the HDFS cluster. Is there a way to get the file path from the cluster so I can pass it further? Or is there a way to load my Predictions file to the local filesystem instead of the cluster, so that the Java part runs without error?
This can happen if you are running Spark on a cluster. Paths.get looks for the file in the local file system of each node separately, while the file exists on HDFS. You can probably load the file using sc.textFile("hdfs:/Predictions") (or sc.textFile("Predictions")).
If, on the other hand, you'd like to save to the local file system, you'll need to collect the RDD first and save it using regular Java IO.
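The "collect, then regular Java IO" route might look like the sketch below. It assumes the result is small enough to fit in driver memory and that collecting the RDD yields a List<String> with one entry per line; LocalPredictionWriter and writeLines are hypothetical names.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

final class LocalPredictionWriter {
    // Writes one line per collected record to a file on the driver's local
    // filesystem and returns the path that was written.
    static Path writeLines(List<String> collected, Path localFile) throws IOException {
        return Files.write(localFile, collected, StandardCharsets.UTF_8);
    }
}
```

The returned path points at an ordinary local file, so it can be handed to code that expects a java.nio.file.Path.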
I figured it out this way...
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;

String predictionFilePath = "hdfs://pathToHDFS/user/username/Predictions/part-00000";
String outputHtml = "hdfs://pathToHDFS/user/username/outputHtml.html";
URI uriRead = URI.create(predictionFilePath);
URI uriOut = URI.create(outputHtml);
Configuration conf = new Configuration();
FileSystem fileRead = FileSystem.get(uriRead, conf);
FileSystem fileWrite = FileSystem.get(uriOut, conf);
FSDataInputStream in = fileRead.open(new org.apache.hadoop.fs.Path(uriRead));
FSDataOutputStream out = fileWrite.append(new org.apache.hadoop.fs.Path(uriOut));
/*Java code that uses stream objects to write and read*/
OutputGeneratorOptimized ouputGenerator = new OutputGeneratorOptimized();
ouputGenerator.getFormattedHtml(input,in,out);

Saving DStream on HDFS custom location

Spark DStream has the method saveAsTextFiles(prefix, [suffix]), which can be used to save data to HDFS, but this function does not accept a path parameter.
myDStream.saveAsTextFiles("prefix_","_suffix")
By default, it saves data into the HDFS home directory of the currently logged-in user, i.e. if you are running the application as the root user, the data is stored in
/user/root/prefix_TIMESTAMP_suffix
How do I change output directory?
Thanks
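Give it a path to the desired HDFS directory as part of the prefix argument. Note that the first segment after hdfs:// is read as the NameNode host, so use three slashes to resolve the path against the default NameNode:
myDStream.saveAsTextFiles("hdfs:///my/custom/path/prefix_","_suffix")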
