Read multiple files with SparkSession in Spark 2.0 - apache-spark

In Spark 1.6, to read multiple files I used:
JavaSparkContext ctx;
ctx.textFile(filePaths);
where filePaths is a comma-separated list of file paths. For example:
/home/user/folderA/0.log,/home/user/folderB/0.log
But after upgrading to Spark 2.0, the method
SparkSession sparkSession;
sparkSession.read().textFile(filePaths);
doesn't work. The code throws an exception: Path does not exist:
Question: Is there a way to read multiple files from multiple paths in Spark 2.0, just as in Spark 1.6?
Edit: I tried calling the method the Spark 1.6 way:
sparkSession.sparkContext().textFile(filePaths, 1).toJavaRDD();
That solves the problem. But is there another solution?
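One option that stays on the SparkSession API: in Spark 2.x, DataFrameReader.textFile accepts a variable number of paths rather than a single comma-separated string, so the string can be split first. Below is a minimal sketch of the splitting step in plain Java. The names PathLists and splitPaths are made up for illustration, and the Spark call itself appears only in a comment.

```java
import java.util.Arrays;

final class PathLists {
    // Splits "a,b,c" into {"a", "b", "c"}, trimming whitespace and dropping empty entries.
    // Assumption: no individual path contains a comma.
    static String[] splitPaths(String commaSeparated) {
        return Arrays.stream(commaSeparated.split(","))
                .map(String::trim)
                .filter(p -> !p.isEmpty())
                .toArray(String[]::new);
    }
}

// Intended use with Spark (not executed here):
//   Dataset<String> lines = sparkSession.read().textFile(PathLists.splitPaths(filePaths));
```

The result is a Dataset<String> rather than a JavaRDD<String>; if an RDD is needed, the sparkContext().textFile(...) call from the edit above remains an option.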

Related

Sharing a spark session

Let's say I have a Python file my_python.py in which I have created a SparkSession 'spark'. I have a jar, say my_jar.jar, in which some Spark logic is written. I am not creating a SparkSession in my jar; rather, I want to use the same session created in my_python.py. How do I write a spark-submit command which takes my Python file, my jar, and my SparkSession 'spark' as an argument to my jar file?
Is it possible?
If not, please share an alternative.
So I feel there are two questions:
Q1. How can you reuse an already created Spark session in a Scala file?
Ans: Inside your Scala code, use the builder to get the existing session:
SparkSession.builder().getOrCreate()
Please check the Spark doc
https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/sql/SparkSession.html
Q2: How do you run spark-submit with a .py file as the driver and Scala jar(s) as supporting jars?
Ans: It should be something like this:
./spark-submit --jars myjar.jar,otherjar.jar --py-files path/to/myegg.egg path/to/my_python.py arg1 arg2 arg3
Note the method name, getOrCreate(): if a Spark session has already been created, no new session is created; the existing session is used instead.
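The get-or-create behaviour can be pictured with a toy model in plain Java. This is not Spark's implementation; the Session class below is hypothetical and only mirrors the idea that the second caller receives whatever the first caller created.

```java
final class Session {
    // Stands in for the session already held by the JVM (hypothetical model, not Spark code).
    private static Session active;
    final String appName;

    private Session(String appName) {
        this.appName = appName;
    }

    // Returns the existing session if there is one; otherwise creates and remembers a new one.
    static synchronized Session getOrCreate(String appName) {
        if (active == null) {
            active = new Session(appName);
        }
        return active;
    }
}
```

In this model, a first call made on behalf of my_python.py creates the session, and a later call made from the jar gets the same object back, whatever name it asks for.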
Check this link for full implementation example:
https://www.crowdstrike.com/blog/spark-hot-potato-passing-dataframes-between-scala-spark-and-pyspark/

How to specify Hadoop Configuration when reading CSV

I am using Spark 2.0.2. How can I specify the Hadoop configuration item textinputformat.record.delimiter for the TextInputFormat class when reading a CSV file into a Dataset?
In Java I can code: spark.read().csv(<path>); However, there doesn't seem to be a way to provide a Hadoop configuration specific to the read.
It is possible to set the item using the spark.sparkContext().hadoopConfiguration() but that is global.
Thanks,
You cannot. The Data Source API uses its own configuration which, as of 2.0, is not even compatible with Hadoop configuration.
If you want to use a custom input format or other Hadoop configuration, use SparkContext.hadoopFile, SparkContext.newAPIHadoopRDD, or related methods.
The field delimiter can be set using option() in Spark 2.0:
var df = spark.read.option("header", "true").option("delimiter", "\t").csv("/hdfs/file/locaton")

NoClassDefFoundError when using avro in spark-shell

I keep getting
java.lang.NoClassDefFoundError: org/apache/avro/mapred/AvroWrapper
when calling show() on a DataFrame object. I'm attempting to do this through the shell (spark-shell --master yarn). I can see that the shell recognizes the schema when creating the DataFrame object, but if I execute any actions on the data it will always throw the NoClassDefFoundError when trying to instantiate the AvroWrapper. I've tried adding avro-mapred-1.8.0.jar in my $HDFS_USER/lib directory on the cluster and even included it using the --jar option when launching the shell. Neither of these options worked. Any advice would be greatly appreciated. Below is example code:
scala> import org.apache.spark.sql._
scala> import com.databricks.spark.avro._
scala> val sqc = new SQLContext(sc)
scala> val df = sqc.read.avro("my_avro_file") // recognizes the schema and creates the DataFrame object
scala> df.show // this is where I get NoClassDefFoundError
The DataFrame object itself is created at the val df =... line, but the data is not read yet. Spark only starts reading and processing the data when you ask for some kind of output (like df.count() or df.show()).
So the original issue is that the avro-mapred package is missing.
Try launching your Spark Shell like this:
spark-shell --packages org.apache.avro:avro-mapred:1.7.7,com.databricks:spark-avro_2.10:2.0.1
The Spark Avro package marks the Avro Mapred package as provided, but it is not available on your system (or classpath) for one reason or another.
If anyone else runs into this problem, I finally solved it. I removed the CDH Spark package and downloaded Spark from http://spark.apache.org/downloads.html. After that everything worked fine. Not sure what the issue was with the CDH version, but I'm not going to waste any more time trying to figure it out.

Spark not using spark.sql.parquet.compression.codec

I'm comparing Spark's Parquet files with Apache Drill's.
Drill's Parquet files are way more lightweight than Spark's. Spark uses GZIP as the default compression codec; to experiment, I tried changing it to
snappy : same size
uncompressed: same size
lzo : exception
I tried both ways:
sqlContext.sql("SET spark.sql.parquet.compression.codec=uncompressed")
sqlContext.setConf("spark.sql.parquet.compression.codec.", "uncompressed")
But it doesn't seem to change its settings.
Worked for me in 2.1.1
df.write.option("compression","snappy").parquet(filename)
In Spark 1.3, the spark.sql.parquet.compression.codec parameter did not compress the output, but the following did work:
sqlContext.sql("SET parquet.compression=SNAPPY")
Try this. Seems to work for me in 1.6.0
val sc = new SparkContext(sparkConf)
val sqlContext = new HiveContext(sc)
sqlContext.setConf("spark.sql.parquet.compression.codec", "uncompressed")
For Spark 1.6 :
You can use different compression codecs. Try :
sqlContext.setConf("spark.sql.parquet.compression.codec","gzip")
sqlContext.setConf("spark.sql.parquet.compression.codec","lzo")
sqlContext.setConf("spark.sql.parquet.compression.codec","snappy")
sqlContext.setConf("spark.sql.parquet.compression.codec","uncompressed")
Try:
sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")
I see you already did this, but I'm unable to delete my answer on mobile. Try setting this before the sqlcontext as suggested in the comment.
If you run into issues while storing into Hive via a HiveContext, use:
hc.sql("set parquet.compression=snappy")

How to overwrite the output directory in spark

I have a Spark Streaming application which produces a dataset every minute.
I need to save/overwrite the results of the processed data.
When I tried to overwrite the dataset, org.apache.hadoop.mapred.FileAlreadyExistsException stopped the execution.
I set the Spark property set("spark.files.overwrite","true"), but had no luck.
How can I overwrite or pre-delete the files from Spark?
UPDATE: Suggest using Dataframes, plus something like ... .write.mode(SaveMode.Overwrite) ....
Handy pimp:
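import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{SaveMode, SparkSession}

// Adds a write(path) method to RDD[String] that overwrites any existing output.
implicit class PimpedStringRDD(rdd: RDD[String]) {
  def write(p: String)(implicit ss: SparkSession): Unit = {
    import ss.implicits._
    rdd.toDF().as[String].write.mode(SaveMode.Overwrite).text(p)
  }
}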
For older versions try
yourSparkConf.set("spark.hadoop.validateOutputSpecs", "false")
val sc = new SparkContext(yourSparkConf)
In 1.1.0 you can set conf settings using the spark-submit script with the --conf flag.
WARNING (older versions): According to @piggybox, there is a bug in Spark where it will only overwrite the files it needs in order to write its part- files; any other files will be left unremoved.
Since df.save(path, source, mode) is deprecated (http://spark.apache.org/docs/1.5.0/api/scala/index.html#org.apache.spark.sql.DataFrame), use
df.write.format(source).mode("overwrite").save(path)
where df.write is a DataFrameWriter.
'source' can be ("com.databricks.spark.avro" | "parquet" | "json")
From the pyspark.sql.DataFrame.save documentation (currently at 1.3.1), you can specify mode='overwrite' when saving a DataFrame:
myDataFrame.save(path='myPath', source='parquet', mode='overwrite')
I've verified that this will even remove left over partition files. So if you had say 10 partitions/files originally, but then overwrote the folder with a DataFrame that only had 6 partitions, the resulting folder will have the 6 partitions/files.
See the Spark SQL documentation for more information about the mode options.
The documentation for the parameter spark.files.overwrite says this: "Whether to overwrite files added through SparkContext.addFile() when the target file exists and its contents do not match those of the source." So it has no effect on the saveAsTextFiles method.
You could do this before saving the file:
val hadoopConf = new org.apache.hadoop.conf.Configuration()
val hdfs = org.apache.hadoop.fs.FileSystem.get(new java.net.URI("hdfs://localhost:9000"), hadoopConf)
try { hdfs.delete(new org.apache.hadoop.fs.Path(filepath), true) } catch { case _ : Throwable => { } }
As explained here:
http://apache-spark-user-list.1001560.n3.nabble.com/How-can-I-make-Spark-1-0-saveAsTextFile-to-overwrite-existing-file-td6696.html
df.write.mode('overwrite').parquet("/output/folder/path") works if you want to overwrite a parquet file using python. This is in spark 1.6.2. API may be different in later versions
val jobName = "WordCount"
// Overwrite the output directory by setting spark.hadoop.validateOutputSpecs to "false".
val conf = new SparkConf().setAppName(jobName).set("spark.hadoop.validateOutputSpecs", "false")
val sc = new SparkContext(conf)
This overloaded version of the save function works for me:
yourDF.save(outputPath, org.apache.spark.sql.SaveMode.valueOf("Overwrite"))
The example above would overwrite an existing folder. SaveMode can take these values as well (https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/sql/SaveMode.html):
Append: Append mode means that when saving a DataFrame to a data source, if data/table already exists, contents of the DataFrame are expected to be appended to existing data.
ErrorIfExists: ErrorIfExists mode means that when saving a DataFrame to a data source, if data already exists, an exception is expected to be thrown.
Ignore: Ignore mode means that when saving a DataFrame to a data source, if data already exists, the save operation is expected to not save the contents of the DataFrame and to not change the existing data.
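To make the difference between the modes concrete, here is a toy model in plain Java that applies the same four rules to a single local file. It is only an illustration under stated assumptions: SaveModeModel and its Mode enum are hypothetical names, and the code uses java.nio rather than Spark or any Hadoop file system.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class SaveModeModel {
    // Hypothetical stand-ins for Append, Overwrite, ErrorIfExists and Ignore.
    enum Mode { APPEND, OVERWRITE, ERROR_IF_EXISTS, IGNORE }

    static void save(Path target, String data, Mode mode) throws IOException {
        byte[] bytes = data.getBytes(StandardCharsets.UTF_8);
        boolean exists = Files.exists(target);
        switch (mode) {
            case APPEND:
                // Keep existing content and add the new data at the end.
                Files.write(target, bytes, StandardOpenOption.CREATE, StandardOpenOption.APPEND);
                break;
            case OVERWRITE:
                // Default options create the file or truncate it if it exists.
                Files.write(target, bytes);
                break;
            case ERROR_IF_EXISTS:
                if (exists) {
                    throw new FileAlreadyExistsException(target.toString());
                }
                Files.write(target, bytes);
                break;
            case IGNORE:
                // Leave existing data untouched; write only when nothing is there yet.
                if (!exists) {
                    Files.write(target, bytes);
                }
                break;
        }
    }
}
```

Saving "a" and then "b" in APPEND mode leaves "ab"; a following IGNORE save changes nothing, and an OVERWRITE save replaces the whole content.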
Spark – Overwrite the output directory:
By default, Spark doesn't overwrite the output directory on S3, HDFS, or any other file system; when you try to write the DataFrame contents to an existing directory, Spark returns a runtime error. To overcome this, Spark provides the enumeration value org.apache.spark.sql.SaveMode.Overwrite to overwrite the existing folder.
We need to pass Overwrite as an argument to the mode() function of the DataFrameWriter class, for example:
df.write.mode(SaveMode.Overwrite).csv("/tmp/out/foldername")
or you can use the overwrite string:
df.write.mode("overwrite").csv("/tmp/out/foldername")
Besides Overwrite, SaveMode also offers other modes like SaveMode.Append, SaveMode.ErrorIfExists and SaveMode.Ignore
For older versions of Spark, you can use the following to overwrite the output directory with the RDD contents.
sparkConf.set("spark.hadoop.validateOutputSpecs", "false")
val sparkContext = new SparkContext(sparkConf)
If you are willing to use your own custom output format, you would be able to get the desired behaviour with RDD as well.
Have a look at the following classes:
FileOutputFormat,
FileOutputCommitter
FileOutputFormat has a method named checkOutputSpecs, which checks whether the output directory exists.
FileOutputCommitter has the commitJob method, which usually transfers data from the temporary directory to its final place.
I haven't been able to verify it yet (I will as soon as I have a few free minutes), but theoretically: if I extend FileOutputFormat and override checkOutputSpecs with a method that doesn't throw an exception when the directory already exists, and adjust the commitJob method of my custom output committer to perform whichever logic I want (e.g. overwrite some of the files, append to others), then I may be able to achieve the desired behaviour with RDDs as well.
The output format is passed to saveAsNewAPIHadoopFile (saveAsTextFile goes through the analogous saveAsHadoopFile method to actually save the files), and the output committer is configured at the application level.

Resources