How to read a file from classpath without external dependencies?

Is there a one-liner in Scala to read a file from the classpath without using external dependencies such as commons-io? The commons-io equivalent would be:
IOUtils.toString(getClass.getClassLoader.getResourceAsStream("file.xml"), "UTF-8")

val text = io.Source.fromInputStream(getClass.getResourceAsStream("file.xml")).mkString
If you want to ensure that the file is closed:
val source = io.Source.fromInputStream(getClass.getResourceAsStream("file.xml"))
val text = try source.mkString finally source.close()

If the file is in the resources folder (so it ends up at the root of the classpath), you should load it through the class loader, which also resolves paths from the root of the classpath.
This is the line of code if you want to get the content (in Scala 2.11):
val content: String = scala.io.Source.fromInputStream(getClass.getClassLoader.getResourceAsStream("file.xml")).mkString
In other Scala versions, the Source class may live in a different package.
If you only want to get the Resource:
val resource = getClass.getClassLoader.getResource("file.xml")
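As an aside on the lookup rules: Class.getResourceAsStream resolves a relative path against the class's own package unless the path starts with a slash, while ClassLoader.getResourceAsStream always resolves from the classpath root. A minimal sketch, assuming file.xml sits at the root of the classpath:
// Resolved relative to this class's package (e.g. com/example/file.xml):
val relative = getClass.getResourceAsStream("file.xml")
// A leading slash makes it absolute from the classpath root:
val absolute = getClass.getResourceAsStream("/file.xml")
// The class loader always starts from the classpath root (no leading slash):
val viaLoader = getClass.getClassLoader.getResourceAsStream("file.xml")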

Just an update, with Scala 2.13 it is possible to do something like this:
import scala.io.Source
import scala.util.Using
Using.resource(getClass.getResourceAsStream("file.xml")) { stream =>
  Source.fromInputStream(stream).mkString
}
Hope it might help someone.

In Read entire file in Scala?, @daniel-spiewak proposed a slightly different approach, which I personally like better than @dacwe's answer.
// scala is imported implicitly
import io.Source._
val content = fromInputStream(getClass.getResourceAsStream("file.xml")).mkString
However, I wonder whether it still counts as a one-liner.

Unable to infer schema when loading Parquet file

response = "mi_or_chd_5"
outcome = sqlc.sql("""select eid,{response} as response
from outcomes
where {response} IS NOT NULL""".format(response=response))
outcome.write.parquet(response, mode="overwrite") # Success
print outcome.schema
StructType(List(StructField(eid,IntegerType,true),StructField(response,ShortType,true)))
But then:
outcome2 = sqlc.read.parquet(response) # fail
fails with:
AnalysisException: u'Unable to infer schema for Parquet. It must be specified manually.;'
in
/usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc in deco(*a, **kw)
The documentation for parquet says the format is self-describing, and the full schema was available when the parquet file was saved. What gives?
Using Spark 2.1.1. Also fails in 2.2.0.
Found this bug report, but it was fixed in
2.0.1, 2.1.0.
UPDATE: This works when connected with master="local", and fails when connected to master="mysparkcluster".
This error usually occurs when you try to read an empty directory as parquet.
Probably your outcome DataFrame is empty.
You could check if the DataFrame is empty with outcome.rdd.isEmpty() before writing it.
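A minimal sketch of that guard in Scala (assuming a spark-shell session where outcome is the DataFrame in question; the output path is a stand-in):
// Hypothetical output path for illustration.
val outPath = "mi_or_chd_5"
// Only write when there is at least one row, so the directory never
// ends up empty and unreadable for a later parquet load.
if (!outcome.rdd.isEmpty()) {
  outcome.write.mode("overwrite").parquet(outPath)
} else {
  println(s"outcome is empty; skipping write to $outPath")
}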
In my case, the error occurred because I was trying to read a parquet file which started with an underscore (e.g. _lots_of_data.parquet). Not sure why this was an issue, but removing the leading underscore solved the problem.
See also:
Re: Spark-2.0.0 fails reading a parquet dataset generated by Spark-1.6.2
I'm using AWS Glue and I received this error while reading data from a data catalog table (location: s3 bucket).
After a bit of analysis I realised that this was due to the file not being available at the file location (in my case the S3 bucket path).
Glue was trying to apply data catalog table schema on a file which doesn't exist.
After copying file into s3 bucket file location, issue got resolved.
Hope this helps someone who encounters this error in AWS Glue.
This case occurs when you try to read a table that is empty. If the table has correctly inserted data, there should be no problem.
Besides parquet, the same thing happens with ORC.
I see there are already many answers here. The issue I faced was that my Spark job was trying to read a file which was being overwritten by another Spark job that had been started earlier. It sounds bad, but I made that mistake.
Just to emphasize @Davos's answer in a comment: you will encounter this exact exception if your filename has a dot . or an underscore _ at the start of the filename
val df = spark.read.format("csv").option("delimiter", "|").option("header", "false")
.load("/Users/myuser/_HEADER_0")
org.apache.spark.sql.AnalysisException:
Unable to infer schema for CSV. It must be specified manually.;
The solution is to rename the file and try again (e.g. rename _HEADER to HEADER).
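If the file sits on HDFS or S3A, the rename can also be done programmatically with the Hadoop FileSystem API; a hedged sketch, with the paths as assumptions:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Rename _HEADER_0 to HEADER_0 so Spark no longer treats it as a hidden/metadata file.
val fs = FileSystem.get(new Configuration())
fs.rename(new Path("/Users/myuser/_HEADER_0"), new Path("/Users/myuser/HEADER_0"))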
Happened to me for a parquet file that was in the process of being written to.
Just need to wait for it to be completely written.
In my case, the error occurred because the filename contained underscores. Rewriting / reading the file without underscores (hyphens were OK) solved the problem...
For me this happened when I thought I was loading the correct file path but was instead pointing at an incorrect folder.
I ran into a similar problem with reading a csv
spark.read.csv("s3a://bucket/spark/csv_dir/.")
gave an error of:
org.apache.spark.sql.AnalysisException: Unable to infer schema for CSV. It must be specified manually.;
I found that if I removed the trailing . then it works, i.e.:
spark.read.csv("s3a://bucket/spark/csv_dir/")
I tested this for parquet by adding a trailing . and you get an error of:
org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;
I just encountered the same problem, but none of the solutions here worked for me. I tried to merge the row groups of my parquet files on HDFS by first reading them and writing them to another place using:
df = spark.read.parquet('somewhere')
df.write.parquet('somewhere else')
But later when I query it with
spark.sql('SELECT sth FROM parquet.`hdfs://host:port/parquetfolder/` WHERE .. ')
It showed the same problem.
I finally solved this by using pyarrow:
import pyarrow as pa
import pyarrow.parquet as pq

df = spark.read.parquet('somewhere')
pdf = df.toPandas()
adf = pa.Table.from_pandas(pdf)   # convert the pandas DataFrame to an Arrow table
fs = pa.hdfs.connect()
fw = fs.open(path, 'wb')          # path: the HDFS destination for the merged file
pq.write_table(adf, fw)           # write a single parquet file with merged row groups
fw.close()
Check whether .parquet files are available at the response path. I am assuming that either the files don't exist or that they exist in some internal (partitioned) folders. If the files sit under multiple levels of folders, then append /* for each level.
In my case the .parquet files were three folders below base_path, so I gave the path as base_path/*/*/*
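For illustration only (base_path and the folder depth are assumptions, and spark is the active SparkSession):
// Three levels of wildcards because the .parquet files sit three folders below base_path.
val df = spark.read.parquet("hdfs:///base_path/*/*/*")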
As others mentioned, in my case this error appeared when I was reading S3 keys that did not exist.
A solution is to filter down to the keys that do exist:
import com.amazonaws.services.s3.AmazonS3URI
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession
import java.net.URI
def addEndpointToUrl(url: String, domain: String = "s3.amazonaws.com"): String = {
  val uri = new URI(url)
  val hostWithEndpoint = uri.getHost + "." + domain
  new URI(uri.getScheme, uri.getUserInfo, hostWithEndpoint, uri.getPort, uri.getPath, uri.getQuery, uri.getFragment).toString
}

def createS3URI(url: String): AmazonS3URI = {
  try {
    // try to instantiate AmazonS3URI with url
    new AmazonS3URI(url)
  } catch {
    case e: IllegalArgumentException if e.getMessage.
        startsWith("Invalid S3 URI: hostname does not appear to be a valid S3 endpoint") => {
      new AmazonS3URI(addEndpointToUrl(url))
    }
  }
}

def s3FileExists(spark: SparkSession, url: String): Boolean = {
  val amazonS3Uri: AmazonS3URI = createS3URI(url)
  val s3BucketUri = new URI(s"${amazonS3Uri.getURI().getScheme}://${amazonS3Uri.getBucket}")

  FileSystem
    .get(s3BucketUri, spark.sparkContext.hadoopConfiguration)
    .exists(new Path(url))
}
and you can use it as:
val partitions = List(yesterday, today, tomorrow)
  .map(f => somepath + "/date=" + f)
  .filter(f => s3FileExists(spark, f))

val df = spark.read.parquet(partitions: _*)
For that solution I took some code out of spark-redshift project, here.
You are just loading a parquet file, and of course parquet has a valid schema, otherwise it would not have been saved as parquet. This error means one of the following:
Either the parquet file does not exist (in 99.99% of cases this is the issue; Spark error messages are often less obvious),
or the parquet file got corrupted, or it's not a parquet file at all.
I ran into this issue because of a folder-in-folder problem.
For example, folderA.parquet was supposed to contain the partitions, but instead it contained folderB.parquet, which in turn held the partitions.
Resolution:
transfer the files to the parent folder and delete the subfolder.
Alternatively, you can read with /*:
outcome2 = sqlc.read.parquet(f"{response}/*")  # works for me
My problem was that I had moved the files to another location without changing the paths stored in the files inside the _spark_metadata folder.
To fix it, do something like:
sed -i "s|/old/path|/new/path|g" ** .*
Seems this issue can be caused by a lot of things, so I am providing another scenario:
By default the Spark parquet source uses "partition inferring", which means it expects the file paths to be partitioned into key=value pairs and the load to happen at the root.
To avoid this, if we make sure all the leaf files have an identical schema, then we can use
df = spark.read.format("parquet")\
    .option("recursiveFileLookup", "true")\
    .load(path)  # path = the root folder of the parquet files
to disable the "partition inferring" manually. It basically loads the files one by one using parquet's logical_type.
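For reference, a hedged Scala equivalent (recursiveFileLookup is available from Spark 3.0 on; the path is an assumption and spark is the active SparkSession):
// Walks all subfolders and skips partition inference; assumes the leaf files share one schema.
val df = spark.read
  .format("parquet")
  .option("recursiveFileLookup", "true")
  .load("/path/to/parquetfolder")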

Spark workflow with jar

I'm trying to understand the extent to which one must compile a jar to use Spark.
I'd normally write ad-hoc analysis code in an IDE, then run it locally against data with a single click (in the IDE). If my experiments with Spark are giving me the right indication, then I have to compile my script into a jar and send it to all the Spark nodes, i.e. my workflow would be:
Write the analysis script, which will upload a jar of itself (created in the next step).
Build the jar.
Run the script.
For ad-hoc iterative work this seems a bit much, and I don't understand how the REPL gets away without it.
Update:
Here's an example, which I couldn't get to work unless I compiled it into a jar and did sc.addJar. But the fact that I must do this seems odd, since there is only plain Scala and Spark code.
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.SparkFiles
import org.apache.spark.rdd.RDD

object Runner {
  def main(args: Array[String]) {
    val logFile = "myData.txt"
    val conf = new SparkConf()
      .setAppName("MyFirstSpark")
      .setMaster("spark://Spark-Master:7077")

    val sc = new SparkContext(conf)

    sc.addJar("Analysis.jar")
    sc.addFile(logFile)

    val logData = sc.textFile(SparkFiles.get(logFile), 2).cache()

    Analysis.run(logData)
  }
}

object Analysis {
  def run(logData: RDD[String]) {
    val numA = logData.filter(line => line.contains("a")).count()
    val numB = logData.filter(line => line.contains("b")).count()
    println("Lines with 'a': %s, Lines with 'b': %s".format(numA, numB))
  }
}
You are creating an anonymous function in the use of 'filter':
scala> (line: String) => line.contains("a")
res0: String => Boolean = <function1>
The class generated for that function is not available on the workers unless the jar is distributed to them. Did the stack trace on the worker highlight a missing symbol?
If you just want to debug locally without having to distribute the jar you could use the 'local' master:
val conf = new SparkConf().setAppName("myApp").setMaster("local")
While creating JARs is the most common way of handling long-running Spark jobs, for interactive development work Spark has shells available directly in Scala, Python & R. The current quick start guide ( https://spark.apache.org/docs/latest/quick-start.html ) only mentions the Scala & Python shells, but the SparkR guide discusses how to work with SparkR interactively as well (see https://spark.apache.org/docs/latest/sparkr.html ). Best of luck with your journeys into Spark as you find yourself working with larger datasets :)
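For example, in the Scala shell (started with bin/spark-shell) the core of the Analysis logic above can be pasted in directly, with no jar to build; a minimal sketch, assuming myData.txt is readable from wherever the shell runs:
// Typed straight into spark-shell; sc is provided by the shell.
val logData = sc.textFile("myData.txt").cache()
val numA = logData.filter(_.contains("a")).count()
val numB = logData.filter(_.contains("b")).count()
println(s"Lines with 'a': $numA, Lines with 'b': $numB")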
You can use SparkContext.jarOfObject(Analysis) (or SparkContext.jarOfClass(Analysis.getClass)) to automatically include the jar that you want to distribute without packaging it yourself.
Find the JAR from which a given class was loaded, to make it easy for
users to pass their JARs to SparkContext.
def jarOfClass(cls: Class[_]): Option[String]
def jarOfObject(obj: AnyRef): Option[String]
You want to do something like:
sc.addJar(SparkContext.jarOfObject(Analysis).get)
HTH!

How to overwrite the output directory in spark

I have a Spark Streaming application which produces a dataset every minute.
I need to save/overwrite the results of the processed data.
When I try to overwrite the dataset, org.apache.hadoop.mapred.FileAlreadyExistsException stops the execution.
I set the Spark property set("spark.files.overwrite","true"), but there was no luck.
How can I overwrite or pre-delete the files from Spark?
UPDATE: Suggest using Dataframes, plus something like ... .write.mode(SaveMode.Overwrite) ....
Handy pimp:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{SaveMode, SparkSession}

implicit class PimpedStringRDD(rdd: RDD[String]) {
  def write(p: String)(implicit ss: SparkSession): Unit = {
    import ss.implicits._
    rdd.toDF().as[String].write.mode(SaveMode.Overwrite).text(p)
  }
}
For older versions, try
yourSparkConf.set("spark.hadoop.validateOutputSpecs", "false")
val sc = new SparkContext(yourSparkConf)
In 1.1.0 you can set conf settings using the spark-submit script with the --conf flag.
WARNING (older versions): According to @piggybox there is a bug in Spark where it only overwrites the files it needs for its own part- files; any other files will be left in place.
Since df.save(path, source, mode) is deprecated (http://spark.apache.org/docs/1.5.0/api/scala/index.html#org.apache.spark.sql.DataFrame),
use df.write.format(source).mode("overwrite").save(path)
where df.write is a DataFrameWriter.
'source' can be ("com.databricks.spark.avro" | "parquet" | "json")
From the pyspark.sql.DataFrame.save documentation (currently at 1.3.1), you can specify mode='overwrite' when saving a DataFrame:
myDataFrame.save(path='myPath', source='parquet', mode='overwrite')
I've verified that this will even remove left over partition files. So if you had say 10 partitions/files originally, but then overwrote the folder with a DataFrame that only had 6 partitions, the resulting folder will have the 6 partitions/files.
See the Spark SQL documentation for more information about the mode options.
The documentation for the parameter spark.files.overwrite says this: "Whether to overwrite files added through SparkContext.addFile() when the target file exists and its contents do not match those of the source." So it has no effect on saveAsTextFiles method.
You could do this before saving the file:
val hadoopConf = new org.apache.hadoop.conf.Configuration()
val hdfs = org.apache.hadoop.fs.FileSystem.get(new java.net.URI("hdfs://localhost:9000"), hadoopConf)
try { hdfs.delete(new org.apache.hadoop.fs.Path(filepath), true) } catch { case _ : Throwable => { } }
As explained here:
http://apache-spark-user-list.1001560.n3.nabble.com/How-can-I-make-Spark-1-0-saveAsTextFile-to-overwrite-existing-file-td6696.html
df.write.mode('overwrite').parquet("/output/folder/path") works if you want to overwrite a parquet file using Python. This is in Spark 1.6.2. The API may be different in later versions.
val jobName = "WordCount";
//overwrite the output directory in spark set("spark.hadoop.validateOutputSpecs", "false")
val conf = new
SparkConf().setAppName(jobName).set("spark.hadoop.validateOutputSpecs", "false");
val sc = new SparkContext(conf)
This overloaded version of the save function works for me:
yourDF.save(outputPath, org.apache.spark.sql.SaveMode.valueOf("Overwrite"))
The example above would overwrite an existing folder. SaveMode can take these parameters as well (https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/sql/SaveMode.html):
Append: Append mode means that when saving a DataFrame to a data source, if data/table already exists, contents of the DataFrame are expected to be appended to existing data.
ErrorIfExists: ErrorIfExists mode means that when saving a DataFrame to a data source, if data already exists, an exception is expected to be thrown.
Ignore: Ignore mode means that when saving a DataFrame to a data source, if data already exists, the save operation is expected to not save the contents of the DataFrame and to not change the existing data.
Spark – Overwrite the output directory:
Spark by default doesn't overwrite the output directory on S3, HDFS, or other file systems; when you try to write the DataFrame contents to an existing directory, Spark returns a runtime error. To overcome this, Spark provides the enumeration org.apache.spark.sql.SaveMode.Overwrite to overwrite the existing folder.
We need to pass this Overwrite as an argument to the mode() function of the DataFrameWriter class, for example:
df.write.mode(SaveMode.Overwrite).csv("/tmp/out/foldername")
or you can use the overwrite string.
df.write.mode("overwrite").csv("/tmp/out/foldername")
Besides Overwrite, SaveMode also offers other modes like SaveMode.Append, SaveMode.ErrorIfExists and SaveMode.Ignore
For older versions of Spark, you can use the following to overwrite the output directory with the RDD contents:
sparkConf.set("spark.hadoop.validateOutputSpecs", "false")
val sparkContext = new SparkContext(sparkConf)
If you are willing to use your own custom output format, you can get the desired behaviour with RDDs as well.
Have a look at the following classes:
FileOutputFormat,
FileOutputCommitter
In FileOutputFormat you have a method named checkOutputSpecs, which checks whether the output directory exists.
In FileOutputCommitter you have commitJob, which usually transfers data from the temporary directory to its final place.
I wasn't able to verify it yet (I will, as soon as I have a few free minutes), but theoretically: if I extend FileOutputFormat and override checkOutputSpecs with a method that doesn't throw an exception when the directory already exists, and adjust the commitJob method of my custom output committer to perform whichever logic I want (e.g. overwrite some of the files, append others), then I may be able to achieve the desired behaviour with RDDs as well.
The output format is passed to saveAsNewAPIHadoopFile (which is the method saveAsTextFile calls as well to actually save the files), and the output committer is configured at the application level.
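A hedged sketch of the first half of that idea, i.e. an output format whose checkOutputSpecs tolerates an existing directory (the class name is an assumption; the commitJob customisation is left out):
import org.apache.hadoop.mapreduce.JobContext
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat

// Output format that skips the "output directory already exists" check.
// Pass it to saveAsNewAPIHadoopFile on a pair RDD in place of the default TextOutputFormat.
class OverwritingTextOutputFormat[K, V] extends TextOutputFormat[K, V] {
  override def checkOutputSpecs(context: JobContext): Unit = {
    // The parent implementation throws FileAlreadyExistsException here;
    // doing nothing lets the job proceed into the existing directory.
  }
}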

Submit Spark with additional input

I have used Spark to build a machine learning pipeline, which takes a job XML file as input where users can specify data, features, models and their parameters. The reason for using this job XML input file is that users can simply modify their XML file to configure the pipeline and do not need to re-compile the source code. However, currently the Spark job is typically packaged into an uber-jar, and it seems there is no way to provide additional XML inputs when the job is submitted to YARN.
I wonder if there are any solutions or alternatives?
I'd look into Spark-JobServer. You can use it to submit your job to a Spark cluster together with a configuration. You might have to adapt your XML to the JSON format used by the config, or maybe encapsulate it somehow.
Here's an example on how to submit a job + config:
curl -d "input.string = a b c a b see" 'localhost:8090/jobs?appName=test&classPath=spark.jobserver.WordCountExample'
{
  "status": "STARTED",
  "result": {
    "jobId": "5453779a-f004-45fc-a11d-a39dae0f9bf4",
    "context": "b7ea0eb5-spark.jobserver.WordCountExample"
  }
}
You should use the resources directory to place the xml file if you want it to be bundled with the jar. This is a basic Java/Scala thing.
Suggest reading: Get a resource using getResource()
To replace the xml in the jar without rebuilding the jar: How do I update one file in a jar without repackaging the whole jar?
The final solution that I used to solve this problem is:
Store the XML file in HDFS,
Pass in the file location of the XML file,
Use an inputStreamHDFS helper to read directly from HDFS (a possible implementation is sketched after the snippet):
val hadoopConf = sc.hadoopConfiguration
val jobfileIn: Option[InputStream] = inputStreamHDFS(hadoopConf, filename)
if (jobfileIn.isDefined) {
  logger.info("Job file found in file system: " + filename)
  xml = Some(XML.load(jobfileIn.get))
}
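The inputStreamHDFS helper itself isn't shown above; a minimal sketch of what it could look like with the Hadoop FileSystem API (the name and signature follow the snippet, the body is an assumption):
import java.io.InputStream
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Returns an open stream for the HDFS path if it exists, None otherwise.
def inputStreamHDFS(conf: Configuration, path: String): Option[InputStream] = {
  val fs = FileSystem.get(conf)
  val p = new Path(path)
  if (fs.exists(p)) Some(fs.open(p)) else None
}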

JAXB API to generate Java source files directly to OutputStream

I have a schema file and I want to generate the class files directly into memory instead of the file system. I have searched a lot, but everywhere I find only APIs that generate Java files into the filesystem.
Can anyone please provide links to an API that generates the Java source files directly into memory?
Thanks,
Harish
I haven't leveraged this code in the way you described, but this fragment might point you in the right direction:
import com.sun.codemodel.*;
import com.sun.tools.xjc.*;
import com.sun.tools.xjc.api.*;
SchemaCompiler sc = XJC.createSchemaCompiler();
sc.setEntityResolver(new YourEntityResolver());
sc.setErrorListener(new YourErrorListener());
sc.parseSchema(SYSTEM_ID, element);
S2JJAXBModel model = sc.bind();
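To capture the generated sources in memory rather than on disk, one option is to hand JCodeModel.build an in-memory CodeWriter; a hedged sketch in Scala against the codemodel API (the class and field names are assumptions):
import java.io.{ByteArrayOutputStream, OutputStream}
import scala.collection.mutable
import com.sun.codemodel.{CodeWriter, JCodeModel, JPackage}

// CodeWriter that keeps every generated source in a map instead of writing files.
class InMemoryCodeWriter extends CodeWriter {
  val sources = mutable.Map[String, ByteArrayOutputStream]()

  override def openBinary(pkg: JPackage, fileName: String): OutputStream = {
    val out = new ByteArrayOutputStream()
    sources(pkg.name() + "/" + fileName) = out
    out
  }

  override def close(): Unit = ()
}

// After sc.bind() from the fragment above (model is the S2JJAXBModel):
// val codeModel: JCodeModel = model.generateCode(null, new YourErrorListener())
// val writer = new InMemoryCodeWriter()
// codeModel.build(writer)
// writer.sources now maps "package/File.java" -> the generated source bytes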
