How many times does the script used in Spark pipe() get executed? - apache-spark

I tried the Spark Scala code below and got the output mentioned further down.
I tried to pass the input to the script, but it didn't receive it, and when I used collect, the print statement I used in the script appeared twice.
My simple and very basic Perl script first:
#!/usr/bin/perl
print("arguments $ARGV[0] \n"); // Just print the arguments.
My Spark code:
object PipesExample {
  def main(args: Array[String]) {
    val conf = new SparkConf();
    val sc = new SparkContext(conf);
    val distScript = "/home/srinivas/test.pl"
    sc.addFile(distScript)
    val rdd = sc.parallelize(Array("srini"))
    val piped = rdd.pipe(Seq(SparkFiles.get("test.pl")))
    println(" output " + piped.collect().mkString(" "));
  }
}
The output looked like this:
output arguments arguments
1) What mistake did I make that prevents the script from receiving the arguments?
2) Why was it executed twice?
If this looks too basic, please excuse me. I am trying to understand this as best I can and want to clear up my doubts.

From my experience, it is executed twice because Spark divides your RDD into two partitions, and each partition is passed to your external script.

The reason your application couldn't pick up the test.pl file is that the file lives on one particular node, but the application master is created on some node of the cluster; if the file isn't on that node, it can't be picked up.
You should always store such external files in HDFS or S3, or pass the HDFS file location through the spark-submit command options.
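As a side note, here is a minimal sketch (untested, adapted from the question's code) that forces the RDD into a single partition so the external script is launched only once. Also worth noting: pipe() streams each record to the script's standard input rather than passing it as a command-line argument, which is why $ARGV[0] stays empty.
import org.apache.spark.{SparkConf, SparkContext, SparkFiles}

object PipesExampleSinglePartition {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("PipesExample"))

    val distScript = "/home/srinivas/test.pl"
    sc.addFile(distScript)

    // One partition => the external script is started exactly once.
    val rdd = sc.parallelize(Array("srini"), 1)

    // pipe() writes each record to the script's stdin, one line per record,
    // so the script should read STDIN instead of @ARGV to see the data.
    val piped = rdd.pipe(Seq(SparkFiles.get("test.pl")))
    println(" output " + piped.collect().mkString(" "))

    sc.stop()
  }
}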

Related

Spark unzip and write CSV as parquet in executor

My issue is that my CSV files come zipped, in .csv.zip format, so I cannot just use Spark's .csv reader directly as I could with .csv.gzip / .csv.gz. That means I need to decompress each file, read the contents (the files are quite big, ~5 GB), and write them out as Parquet files.
My approach is as such:
String paths = "s3a://...,s3a://...,...";
JavaRDD<Tuple2<String, PortableDataStream>> zipRDD = context.binaryFiles(paths, sparkContext.context.defaultParallelism()).toJavaRDD();
JavaRDD<Tuple2<String, List<Row>>> filenameRowsRDD = zipRDD.flatMap(new ConvertLinesToRows());
The first JavaRDD returns pairs of (filename, input stream). These are then passed to the ConvertLinesToRows class, which wraps the stream in a ZipInputStream, reads the contents of the CSV file, creates a new Spark Row for each line, and finally returns a (filename, List<Row>) pair, where the list contains all lines of the CSV converted to Rows.
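For reference, a rough Scala sketch of what ConvertLinesToRows does (the function name and the naive comma splitting are assumptions for illustration, not the actual implementation):
import java.util.zip.ZipInputStream
import scala.io.Source
import org.apache.spark.input.PortableDataStream
import org.apache.spark.sql.Row

// Takes one (filename, stream) pair from binaryFiles and turns the zipped CSV
// inside it into (filename, rows). Assumes a single CSV entry per archive and
// plain comma splitting, purely for illustration.
def convertLinesToRows(pair: (String, PortableDataStream)): (String, List[Row]) = {
  val (fileName, stream) = pair
  val zis = new ZipInputStream(stream.open())
  try {
    zis.getNextEntry()  // position at the first (and assumed only) CSV entry
    val rows = Source.fromInputStream(zis)
      .getLines()
      .map(line => Row.fromSeq(line.split(",").toSeq))
      .toList
    (fileName, rows)
  } finally {
    zis.close()
  }
}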
I now want to save each read CSV as a Parquet file.
filenameRowsRDD.foreach(tuple -> {
    SparkContext newContext = MySparkConfig.createNewContext();
    SparkSession newSpark = SparkSession.builder()
        .sparkContext(newContext)
        .getOrCreate();
    Dataset<Row> dataset = newSpark.createDataFrame(tuple._2, datasetSchema.value());
    dataset.write().parquet("s3a://...");
});
I recreate the SparkSession in my executor so that I can use SparkSession's write support.
The idea is that this will all run in an executor (I'm hoping). However, with this approach reading the files is not an issue; the problem comes when an executor wants to write the output file, throwing an exception: A master URL must be set in your configuration.
It seems like I'm doing something in an anti-Spark way, and it does not work. I have also tried broadcasting my SparkSession, but that throws an NPE inside SparkSession before it even tries to write.
What would be the correct way to approach my problem here?
What would be the Spark way of doing this?
All of the above code is in my main() method. Am I correct in assuming that the first zipRDD is run on the master node, and that the second filenameRowsRDD, as well as the .foreach, runs on the executor nodes?

pyspark textFileStream cannot detect txt file while textFile works

To explain why my question is different: this question is different from the one marked as a duplicate. First, the input parameter is already a directory (which is correct, whereas the marked question's was not). Second, I copied the txt file into the directory while streaming was running, to simulate a new txt file arriving (so new files are generated instead of the same files already existing in the directory).
My question is below.
I have a directory and a txt file /tmp/a.txt; the contents of the file are
aaa
bbb
I use pyspark and manually copy this file into the same directory, continuously (the files are created while the streaming job is running):
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

def count(x):
    if x.isEmpty():
        print("empty")
        return
    print(x.count())

sc = SparkContext()
ssc = StreamingContext(sc, 3)
ssc.textFileStream("/tmp/").foreachRDD(count)
The output shows the RDD is empty
However, when I use
c = sc.textFile("/tmp/").count()
print(c)
it shows c is 2 (consistent with the txt file contents)
Why does streaming not work?
Are you trying to pick up new lines being appended to the /tmp/a.txt file or are you trying to pick up new files being added to the tmp directory?
If it's the latter, try replacing your last line with this:
ssc.textFileStream("/tmp/*").foreachRDD(count)
I have found the solution in Scala (it still cannot pick up new files in Python).
First, sc.textFile and ssc.textFileStream take the same kind of parameter, a directory name, so the code above is right.
However, the difference is this: sc.textFile happily picks up the files as long as the directory exists (and it must exist, otherwise an InvalidInputException is raised), but in streaming mode on the local file system, ssc.textFileStream demands that the directory not exist beforehand and be created by the streaming program itself; otherwise the new files are not picked up (this seems to be a bug that only affects the local file system; on HDFS it seems to work well according to others' experience).
What's more, according to some others' experience, if you delete the directory and re-run the program, the recycle bin must also be emptied.
However, in Python this issue still exists, and while no files exist in the directory the Scala program just prints 0, whereas the Python program raises a warning:
WARN FileInputDStream:87 - Error finding new files
java.lang.NullPointerException
Here is my code in Python and Scala; the way of writing the new files is the same in both, so I do not post it here.
python code:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

if __name__ == "__main__":
    sc = SparkContext()
    ssc = StreamingContext(sc, 3)
    ssc.textFileStream(path).foreachRDD(lambda x: print(x.count()))
    ssc.start()
    ssc.awaitTermination()
scala code:
def main(args: Array[String]): Unit = {
  val sc = new SparkContext()
  val ssc = new StreamingContext(sc, Seconds(3))
  ssc.textFileStream(params.inputPath).foreachRDD { x =>
    print(x.count())
  }
  ssc.start()
  ssc.awaitTermination()
}

trigger.Once() metadata needed

Hi, a simple question for the experienced folks here.
I have a Spark job reading files under a path.
I wanted to use Structured Streaming even though the source is not really a stream but just a folder with a bunch of files in it.
My question: can I use Trigger.Once() for this? And if yes, how do I make Trigger.Once() recognize new files as such?
I tried it out on my laptop; the first run reads everything, but when I start the job again, files written in the meantime are not recognized or processed at all.
my method looks like this:
def executeSql(spark: SparkSession): Unit = {
  val file = "home/hansherrlich/input_event/"
  val df = spark.readStream.format("json").schema(getStruct).load("home/hansherrlich/some_event/")
  val out = df.writeStream.trigger(Trigger.Once()).format("json").option("path", "home/hansherrlich/some_event_processed/").start()
  out.processAllAvailable()
  out.stop()
  //out.awaitTermination()
  println("done writing")
}
If reading from files, this only seems to work if the files were written as Delta by Databricks.
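For what it's worth, here is a minimal sketch of the same kind of job with an explicit checkpointLocation (all paths are hypothetical, and getStruct is the schema helper from the question). In my understanding, the file source records already-processed files in the checkpoint log, so without a checkpoint directory a restarted Trigger.Once() run has no memory of what the previous run read:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

def processOnce(spark: SparkSession): Unit = {
  val df = spark.readStream
    .format("json")
    .schema(getStruct)  // schema helper from the question's code
    .load("home/hansherrlich/some_event/")

  val query = df.writeStream
    .trigger(Trigger.Once())
    .format("json")
    .option("path", "home/hansherrlich/some_event_processed/")
    .option("checkpointLocation", "home/hansherrlich/some_event_checkpoint/")  // hypothetical path
    .start()

  query.awaitTermination()  // with Trigger.Once() the query stops after one batch
  println("done writing")
}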

How to read a spark saved file in java code

I am new to Spark. I have a file TrainDataSpark.java in which I am processing some data, and at the end of it I am saving my Spark-processed data to a directory called Predictions with the code below:
predictions.saveAsTextFile("Predictions");
In the same TrainDataSpark.java I am adding the code below, just after the line above:
OutputGeneratorOptimized ouputGenerator = new OutputGeneratorOptimized();
final Path predictionFilePath = Paths.get("/Predictions/part-00000");
final Path outputHtml = Paths.get("/outputHtml.html");
ouputGenerator.getFormattedHtml(input,predictionFilePath,outputHtml);
I am getting a NoSuchFile exception for /Predictions/part-00000. I have tried all possible paths, but it fails. I think the Java code searches for the file on my local system and not on the HDFS cluster. Is there a way to get the file path from the cluster so I can pass it further? Or is there a way to save my Predictions output locally instead of to the cluster, so that the Java part runs without error?
This can happen if you are running Spark on a cluster. Paths.get looks for the file in the local file system of each node separately, while the file actually exists on HDFS. You can probably load it using sc.textFile("hdfs:/Predictions") (or sc.textFile("Predictions")).
If, on the other hand, you'd like to save to the local file system, you're going to need to collect the RDD first and save it using regular Java IO.
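A minimal sketch of that second option, collecting to the driver and writing with plain Java IO (shown here in Scala; the local output path is just a placeholder):
import java.nio.file.{Files, Paths}
import scala.collection.JavaConverters._

// sc is the existing SparkContext of the job.
// collect() pulls the whole RDD to the driver, so only do this for output
// that comfortably fits in driver memory.
val lines = sc.textFile("hdfs:/Predictions").collect()
Files.write(Paths.get("/tmp/predictions.txt"), lines.toSeq.asJava)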
I figured it out this way...
String predictionFilePath ="hdfs://pathToHDFS/user/username/Predictions/part-00000";
String outputHtml = "hdfs://pathToHDFS/user/username/outputHtml.html";
URI uriRead = URI.create(predictionFilePath);
URI uriOut = URI.create(outputHtml);
Configuration conf = new Configuration();
FileSystem fileRead = FileSystem.get(uriRead, conf);
FileSystem fileWrite = FileSystem.get(uriOut, conf);
FSDataInputStream in = fileRead.open(new org.apache.hadoop.fs.Path(uriRead));
FSDataOutputStream out = fileWrite.append(new org.apache.hadoop.fs.Path(uriOut));
/*Java code that uses stream objects to write and read*/
OutputGeneratorOptimized ouputGenerator = new OutputGeneratorOptimized();
ouputGenerator.getFormattedHtml(input,in,out);

How to overwrite the output directory in spark

I have a spark streaming application which produces a dataset for every minute.
I need to save/overwrite the results of the processed data.
When I try to overwrite the dataset, an org.apache.hadoop.mapred.FileAlreadyExistsException stops the execution.
I set the Spark property set("spark.files.overwrite", "true"), but no luck.
How do I overwrite or pre-delete the files from Spark?
UPDATE: I suggest using DataFrames, plus something like ... .write.mode(SaveMode.Overwrite) ....
Handy pimp:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{SaveMode, SparkSession}

implicit class PimpedStringRDD(rdd: RDD[String]) {
  def write(p: String)(implicit ss: SparkSession): Unit = {
    import ss.implicits._
    rdd.toDF().as[String].write.mode(SaveMode.Overwrite).text(p)
  }
}
For older versions, try
yourSparkConf.set("spark.hadoop.validateOutputSpecs", "false")
val sc = new SparkContext(yourSparkConf)
In 1.1.0 you can set conf settings using the spark-submit script with the --conf flag.
WARNING (older versions): According to @piggybox, there is a bug in Spark where it will only overwrite the files it needs in order to write its part- files; any other files will be left in place.
Since df.save(path, source, mode) is deprecated (http://spark.apache.org/docs/1.5.0/api/scala/index.html#org.apache.spark.sql.DataFrame),
use df.write.format(source).mode("overwrite").save(path),
where df.write is a DataFrameWriter.
'source' can be ("com.databricks.spark.avro" | "parquet" | "json")
From the pyspark.sql.DataFrame.save documentation (currently at 1.3.1), you can specify mode='overwrite' when saving a DataFrame:
myDataFrame.save(path='myPath', source='parquet', mode='overwrite')
I've verified that this will even remove left over partition files. So if you had say 10 partitions/files originally, but then overwrote the folder with a DataFrame that only had 6 partitions, the resulting folder will have the 6 partitions/files.
See the Spark SQL documentation for more information about the mode options.
The documentation for the parameter spark.files.overwrite says this: "Whether to overwrite files added through SparkContext.addFile() when the target file exists and its contents do not match those of the source." So it has no effect on the saveAsTextFile method.
You could do this before saving the file:
val hadoopConf = new org.apache.hadoop.conf.Configuration()
val hdfs = org.apache.hadoop.fs.FileSystem.get(new java.net.URI("hdfs://localhost:9000"), hadoopConf)
try { hdfs.delete(new org.apache.hadoop.fs.Path(filepath), true) } catch { case _ : Throwable => { } }
As explained here:
http://apache-spark-user-list.1001560.n3.nabble.com/How-can-I-make-Spark-1-0-saveAsTextFile-to-overwrite-existing-file-td6696.html
df.write.mode('overwrite').parquet("/output/folder/path") works if you want to overwrite a Parquet file using Python. This is on Spark 1.6.2; the API may be different in later versions.
val jobName = "WordCount";
// overwrite the output directory in spark: set("spark.hadoop.validateOutputSpecs", "false")
val conf = new SparkConf().setAppName(jobName).set("spark.hadoop.validateOutputSpecs", "false");
val sc = new SparkContext(conf)
This overloaded version of the save function works for me:
yourDF.save(outputPath, org.apache.spark.sql.SaveMode.valueOf("Overwrite"))
The example above overwrites an existing folder. The SaveMode can take these values as well (https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/sql/SaveMode.html):
Append: Append mode means that when saving a DataFrame to a data source, if data/table already exists, contents of the DataFrame are expected to be appended to existing data.
ErrorIfExists: ErrorIfExists mode means that when saving a DataFrame to a data source, if data already exists, an exception is expected to be thrown.
Ignore: Ignore mode means that when saving a DataFrame to a data source, if data already exists, the save operation is expected to not save the contents of the DataFrame and to not change the existing data.
Spark – Overwrite the output directory:
By default Spark doesn't overwrite the output directory on S3, HDFS, or any other file system; when you try to write the DataFrame contents to an existing directory, Spark throws a runtime error. To overcome this, Spark provides the enumeration value org.apache.spark.sql.SaveMode.Overwrite to overwrite the existing folder.
We need to pass Overwrite as an argument to the mode() function of the DataFrameWriter class, for example:
df.write.mode(SaveMode.Overwrite).csv("/tmp/out/foldername")
Or you can use the "overwrite" string:
df.write.mode("overwrite").csv("/tmp/out/foldername")
Besides Overwrite, SaveMode also offers other modes such as SaveMode.Append, SaveMode.ErrorIfExists and SaveMode.Ignore.
For older versions of Spark, you can use the following to overwrite the output directory with the RDD contents.
sparkConf.set("spark.hadoop.validateOutputSpecs", "false")
val sparkContext = new SparkContext(sparkConf)
If you are willing to use your own custom output format, you would be able to get the desired behaviour with RDD as well.
Have a look at the following classes:
FileOutputFormat,
FileOutputCommitter
In FileOutputFormat there is a method named checkOutputSpecs, which checks whether the output directory exists.
In FileOutputCommitter there is commitJob, which usually transfers data from the temporary directory to its final place.
I haven't been able to verify this yet (I will as soon as I have a few free minutes), but in theory: if I extend FileOutputFormat and override checkOutputSpecs so it doesn't throw an exception when the directory already exists, and adjust the commitJob method of my custom output committer to perform whichever logic I want (e.g. overwrite some of the files, append to others), then I may be able to achieve the desired behaviour with RDDs as well; a sketch follows below.
The output format is passed to saveAsNewAPIHadoopFile (which is the method saveAsTextFile calls as well to actually save the files), and the output committer is configured at the application level.
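A rough, untested sketch of the checkOutputSpecs part of that idea (the subclass name is hypothetical; the committer adjustments are left out):
import org.apache.hadoop.mapreduce.JobContext
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat

// An output format that skips the "output directory already exists" check.
// Everything else (record writing, committing) is inherited from TextOutputFormat.
class OverwritingTextOutputFormat[K, V] extends TextOutputFormat[K, V] {
  override def checkOutputSpecs(context: JobContext): Unit = {
    // Intentionally do nothing: the stock implementation throws
    // FileAlreadyExistsException when the output path already exists.
  }
}

// Rough usage with the new Hadoop API:
// rdd.map(x => (org.apache.hadoop.io.NullWritable.get(), new org.apache.hadoop.io.Text(x)))
//    .saveAsNewAPIHadoopFile(
//      "/output/folder/path",
//      classOf[org.apache.hadoop.io.NullWritable],
//      classOf[org.apache.hadoop.io.Text],
//      classOf[OverwritingTextOutputFormat[org.apache.hadoop.io.NullWritable, org.apache.hadoop.io.Text]])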
