How can I use a JSR-203 file system provider with Apache Spark? - apache-spark

We'd like to use the HDFS NIO.2 file system provider in a Spark job. However, we've run into the classpath issue with file system providers: they have to be visible to the system class loader to be used through the Paths.get(URI) API. As a result, the provider is not found even when it is included in the jar files supplied to spark-submit.
Here's the spark-submit command:
spark-submit --master 'local[*]' \
--jars target/dependency/jimfs-1.1.jar,target/dependency/guava-16.0.1.jar \
--class com.basistech.tc.SparkFsTc \
target/spark-fs-tc-0.0.1-SNAPSHOT.jar
And here's the job class, which fails with 'file system not found.'
public final class SparkFsTc {
    private SparkFsTc() {
        //
    }

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("File System Test Case");
        JavaSparkContext sc = new JavaSparkContext(conf);
        JavaRDD<String> logData = sc.parallelize(Collections.singletonList("foo"));
        System.out.println(logData.getNumPartitions());
        logData.mapPartitions(itr -> {
            FileSystem fs = Jimfs.newFileSystem();
            Path path = fs.getPath("/root");
            URI uri = path.toUri();
            Paths.get(uri); // expect this to go splat.
            return Collections.<String>emptyIterator();
        }).collect();
    }
}
Is there some mechanism to persuade spark to add the FS provider to the appropriate classpath?
Readers should note that file system providers are special. If you read the code in the JRE, you will see:
ServiceLoader<FileSystemProvider> sl = ServiceLoader
    .load(FileSystemProvider.class, ClassLoader.getSystemClassLoader());
They have to be visible to 'the system class loader'; providers on the job's local classpath are not found. This would work fine if I acquired the FileSystem object reference myself instead of going through Paths.get(URI).
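That workaround is worth spelling out: hold the FileSystem reference yourself and resolve paths through it, bypassing the Paths.get(URI) provider lookup entirely. A minimal JDK-only sketch of the idea, using the built-in zip provider as a stand-in for Jimfs or HDFS (any non-default provider illustrates the point; the class name here is just for illustration):

```java
import java.net.URI;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

public class DirectFsAccess {
    public static void main(String[] args) throws Exception {
        // A throwaway zip stands in for any non-default file system.
        Path zip = Files.createTempFile("demo", ".zip");
        Files.delete(zip); // the zip provider will create it fresh
        Map<String, String> env = new HashMap<>();
        env.put("create", "true");
        URI uri = URI.create("jar:" + zip.toUri());
        // Keep the FileSystem reference and resolve paths through it directly,
        // instead of round-tripping through Paths.get(URI), which consults the
        // system class loader's provider list.
        try (FileSystem zipFs = FileSystems.newFileSystem(uri, env)) {
            Path inside = zipFs.getPath("/hello.txt");
            Files.write(inside, "works".getBytes());
            System.out.println(Files.readAllLines(inside).get(0));
        }
        Files.deleteIfExists(zip);
    }
}
```

Inside the Spark closure, the same pattern means keeping the Jimfs (or HDFS) FileSystem handle and calling fs.getPath(...) on it, rather than converting to a URI and back.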

Related

Has anyone been able to connect Spark Job Server to HDInsight?

HDInsight used to support Spark Job Server (https://github.com/Huachao/azure-content/blob/master/articles/hdinsight/hdinsight-apache-spark-job-server.md), but it is no longer supported. Does anyone still connect to HDInsight via SJS? I am trying to connect by embedding SJS in an identical Spark distribution, running it as a service, and pointing it at the HDInsight cluster via the following configuration:
spark {
  master = yarn
  # Default # of CPUs for jobs to use for Spark standalone cluster
  job-number-cpus = 4

  jobserver {
    port = 8090
    context-per-jvm = true
    daorootdir = "/tmp/spark-jobserver"
    binarydao {
      class = spark.jobserver.io.HdfsBinaryDAO
    }
    metadatadao {
      class = spark.jobserver.io.MetaDataSqlDAO
    }
    sqldao {
      # Slick database driver, full classpath
      slick-driver = slick.jdbc.MySQLProfile
      # JDBC driver, full classpath
      jdbc-driver = com.mysql.jdbc.Driver
      jdbc {
        url = "jdbc:mysql://x.x.x.x/spark_jobserver"
        user = "xxxxx"
        password = "xxxx"
      }
      dbcp {
        maxactive = 20
        maxidle = 10
        initialsize = 10
      }
    }
  }

  flyway.locations = "db/mysql/migration"

  # predefined Spark contexts
  contexts {
    my-low-latency-context {
      num-cpu-cores = 1 # Number of cores to allocate. Required.
      memory-per-node = 512m # Executor memory per node, -Xmx style, e.g. 512m, 1G, etc.
    }
    # define additional contexts here
  }

  # universal context configuration. These settings can be overridden, see README.md
  context-settings {
    # choose a port that is free on your system; the next 16 port numbers (depends on max retries for submitting the job) should also be free
    spark.driver.port = 32456 # important
    # defines the place where your spark-assembly jar is located in your hdfs
    spark.yarn.jar = "hdfs://hadoopHDFSCluster/spark_jars/spark-assembly-1.6.0-hadoop2.6.0.jar" # important
    # defines the YARN queue the job is submitted to
    #spark.yarn.queue = root.myYarnQueue
    num-cpu-cores = 2 # Number of cores to allocate. Required.
    memory-per-node = 512m # Executor memory per node, -Xmx style, e.g. 512m, 1G, etc.
    # in case the spark distribution should be accessed from HDFS (as opposed to being installed on every mesos slave)
    # spark.executor.uri = "hdfs://namenode:8020/apps/spark/spark.tgz"
    # uris of jars to be loaded into the classpath for this context. Uris is a string list, or a string separated by commas ','
    # dependent-jar-uris = ["file:///some/path/present/in/each/mesos/slave/somepackage.jar"]
    # If you wish to pass any settings directly to the sparkConf as-is, add them here in passthrough,
    # such as hadoop connection settings that don't use the "spark." prefix
    passthrough {
      #es.nodes = "192.1.1.1"
    }
  }

  # This needs to match SPARK_HOME for cluster SparkContexts to be created successfully
  home = "/usr/hdp/current/spark2-client"
}

Add Sentry Log4j2 appender at runtime

I've been browsing previous threads about adding Log4j2 appenders at runtime, but none of them really fits my scenario.
We have a long-running Flink job packaged into a fat jar that we submit to a running Flink cluster. We want to forward error logs to Sentry. Conveniently, Sentry provides a Log4j2 appender, but all my attempts to get it working have failed; I'm going a bit crazy over this (I've spent days on it).
Since Flink (which also uses Log4j2) ships a set of default logging configurations that takes precedence over any configuration files bundled in our jar, I'm essentially left with configuring the appender at runtime, in the hope that this registers it and forwards LogEvents to it.
As a side note: I attempted to override the Flink-provided configuration file (to add the appender directly in the log4j2.properties file), but Flink fails to load the plugin due to a missing dependency, io.sentry.IHub, which doesn't make sense, since the Sentry examples and docs don't mention any dependencies beyond the Log4j-related ones, which already exist on the classpath.
I've followed the example in the Log4j docs ("Programmatically Modifying the Current Configuration after Initialization"), but the logs are not getting through to Sentry.
SentryLog4j.scala
package com.REDACTED.thoros.config

import io.sentry.log4j2.SentryAppender
import org.apache.logging.log4j.Level
import org.apache.logging.log4j.LogManager
import org.apache.logging.log4j.core.LoggerContext
import org.apache.logging.log4j.core.config.AppenderRef
import org.apache.logging.log4j.core.config.Configuration
import org.apache.logging.log4j.core.config.LoggerConfig

object SentryLog4j2 {
  val SENTRY_LOGGER_NAME = "Sentry"
  val SENTRY_BREADCRUMBS_LEVEL: Level = Level.ALL
  val SENTRY_MINIMUM_EVENT_LEVEL: Level = Level.ERROR
  val SENTRY_DSN = "REDACTED"

  def init(): Unit = {
    // scalafix:off
    val loggerContext: LoggerContext =
      LogManager.getContext(false).asInstanceOf[LoggerContext]
    val configuration: Configuration = loggerContext.getConfiguration

    val sentryAppender: SentryAppender = SentryAppender.createAppender(
      SENTRY_LOGGER_NAME,
      SENTRY_BREADCRUMBS_LEVEL,
      SENTRY_MINIMUM_EVENT_LEVEL,
      SENTRY_DSN,
      false,
      null
    )
    sentryAppender.start()
    configuration.addAppender(sentryAppender)

    // Creating a new dedicated logger for Sentry
    val ref: AppenderRef = AppenderRef.createAppenderRef("Sentry", null, null)
    val refs: Array[AppenderRef] = Array(ref)
    val loggerConfig: LoggerConfig = LoggerConfig.createLogger(
      false,
      Level.ERROR,
      "org.apache.logging.log4j",
      "true",
      refs,
      null,
      configuration,
      null
    )
    loggerConfig.addAppender(sentryAppender, null, null)
    configuration.addLogger("org.apache.logging.log4j", loggerConfig)

    println(configuration.getAppenders)
    loggerContext.updateLoggers()
    // scalafix:on
  }
}
Then I invoke SentryLog4j2.init() in the Main module:
import org.apache.logging.log4j.LogManager
import org.apache.logging.log4j.Logger
import org.apache.logging.log4j.core.LoggerContext
import org.apache.logging.log4j.core.config.Configuration

object Main {
  val logger: Logger = LogManager.getLogger()

  sys.env.get("ENVIRONMENT") match {
    case Some("dev") | Some("staging") | Some("production") =>
      SentryLog4j2.init()
    case _ => SentryLog4j2.init() // <-- this was only added during debugging
  }

  def main(args: Array[String]): Unit = {
    logger.error("test") // this does not forward the logevent to the appender
  }
}
I think I somehow need to register the appender with the LoggerConfig that the root logger uses, so that all logger.error statements are propagated to the configured Sentry appender?
Greatly appreciate any guidance with this!
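To make the root-logger idea from the question concrete, here is a minimal sketch of attaching an already-started appender to the root LoggerConfig instead of creating a new named logger. The class and method names are hypothetical, it assumes log4j-core on the classpath, and I have not verified it against Flink's bundled configuration:

```java
import org.apache.logging.log4j.Level;
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.core.Appender;
import org.apache.logging.log4j.core.LoggerContext;
import org.apache.logging.log4j.core.config.Configuration;
import org.apache.logging.log4j.core.config.LoggerConfig;

public final class RootAppenderSketch {
    // Attach a started appender to the root logger's config so that every
    // logger that bubbles up to root (all of them, with default additivity)
    // hands ERROR events to it. "appender" stands in for the started
    // SentryAppender from the question.
    public static void register(Appender appender) {
        LoggerContext ctx = (LoggerContext) LogManager.getContext(false);
        Configuration config = ctx.getConfiguration();
        LoggerConfig root = config.getRootLogger();
        root.addAppender(appender, Level.ERROR, null);
        ctx.updateLoggers(); // apply the changed configuration
    }
}
```

The difference from the question's code is that nothing needs to log through the new "org.apache.logging.log4j" logger name; events reaching the root logger are enough.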
Although this is not an answer to how you get Log4j2 and the SentryAppender working together, for anyone else stumbling on this problem I'll briefly explain what I did to get the Sentry integration working.
What I eventually decided to do was drop the SentryAppender and use the raw Sentry client instead, adding a wrapper class that exposes the typical debug, info, warn and error methods. For the warn-and-above methods, I also send the log event to Sentry.
This is essentially the only way I got this to work within a Flink cluster.
See example below:
sealed trait LoggerLike {
  type LoggerFn = (String, Option[Object]) => Unit

  val debug: LoggerFn
  val info: LoggerFn
  val warn: LoggerFn
  val error: LoggerFn
}

trait LazyLogging {
  @transient
  protected lazy val logger: CustomLogger =
    CustomLogger.getLogger(getClass.getName, enableSentry = true)
}

final class CustomLogger(slf4JLogger: Logger) extends LoggerLike { ...your implementation... }
Then, for each class or object (in Scala at least), you just extend the LazyLogging trait to get a logger instance.

Spark-Submit failing in cluster mode when passing files using --files

I have Java Spark code that reads certain properties files. These are passed with spark-submit like this:
spark-submit \
--master yarn \
--deploy-mode cluster \
--files /home/aiman/SalesforceConn.properties,/home/aiman/columnMapping.prop,/home/aiman/sourceTableColumns.prop \
--class com.sfdc.SaleforceReader \
--verbose \
--jars /home/ebdpbuss/aiman/Salesforce/ojdbc-7.jar \
/home/aiman/spark-salesforce-0.0.1-SNAPSHOT-jar-with-dependencies.jar SalesforceConn.properties columnMapping.prop sourceTableColumns.prop
The code that I have written is:
SparkSession spark = SparkSession.builder().master("yarn").config("spark.submit.deployMode","cluster").getOrCreate();
JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());
Configuration config = jsc.hadoopConfiguration();
FileSystem fs = FileSystem.get(config);
//args[] is the file names that is passed as arguments.
String connDetailsFile = args[0];
String mapFile = args[1];
String sourceColumnsFile = args[2];
String connFile = SparkFiles.get(connDetailsFile);
String mappingFile = SparkFiles.get(mapFile);
String srcColsFile = SparkFiles.get(sourceColumnsFile);
Properties prop = loadProperties(fs,connFile);
Properties mappings = loadProperties(fs,mappingFile);
Properties srcColProp = loadProperties(fs,srcColsFile);
The loadProperties() method I used above:
private static Properties loadProperties(FileSystem fs, String path)
{
    Properties prop = new Properties();
    // try-with-resources closes the stream even if load() fails
    try (FSDataInputStream is = fs.open(new Path(path))) {
        prop.load(is);
    } catch (Exception e) {
        e.printStackTrace();
        System.exit(1);
    }
    return prop;
}
And it's giving me this exception:
Exception in thread "main" org.apache.spark.SparkException: Application application_1550650871366_125913 finished with failed status
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1187)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1233)
at org.apache.spark.deploy.yarn.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:782)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
19/03/01 14:34:00 INFO ShutdownHookManager: Shutdown hook called
When you pass files using --files, they are copied into a local (temporary) working directory on each executor. So if the file names do not change, you can just reference them by bare name as follows, instead of using the full path provided in the arguments.
String connDetailsFile = "SalesforceConn.properties";
String mapFile = "columnMapping.prop";
String sourceColumnsFile = "sourceTableColumns.prop";
If the file names change each time, then you have to strip off the directory part and use just the file name. This is because Spark doesn't recognize the argument as a path; it considers the whole string to be a file name.
For example, /home/aiman/SalesforceConn.properties would be treated as a file name, and Spark would give you an exception saying it can't find a file named /home/aiman/SalesforceConn.properties.
So your code should be something like this:
String connDetailsFile = new File(args[0]).getName();
String mapFile = new File(args[1]).getName();
String sourceColumnsFile = new File(args[2]).getName();

How to use createTempFile in groovy/Jenkins to create a file in non-default directory?

What I am trying to achieve is to create a temporary file from Groovy in the workspace directory, but as an example /tmp/foo is good enough.
So, here is perfectly working Java code:
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.Files;
class foo {
    public static void main(String[] args) {
        try {
            String s = "/tmp/foo";
            Path p = Paths.get(s);
            Path tmp = Files.createTempFile(p, "pref", ".suf");
            System.out.println(tmp.toString());
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
However, when used in the context of a Jenkins pipeline, it simply does not work:
def mktemp() {
    //String s=pwd(tmp:true)
    String s = "/tmp/foo"
    Path p = Paths.get(s)
    Path tmp = Files.createTempFile(p, "pref", ".suf")
    return tmp
}
The result is an array element type mismatch message, with nothing helpful in the pipeline log:
java.lang.IllegalArgumentException: array element type mismatch
at java.lang.reflect.Array.set(Native Method)
at org.jenkinsci.plugins.scriptsecurity.sandbox.groovy.GroovyCallSiteSelector.parametersForVarargs(GroovyCallSiteSelector.java:104)
at org.jenkinsci.plugins.scriptsecurity.sandbox.groovy.GroovyCallSiteSelector.matches(GroovyCallSiteSelector.java:51)
at org.jenkinsci.plugins.scriptsecurity.sandbox.groovy.GroovyCallSiteSelector.findMatchingMethod(GroovyCallSiteSelector.java:197)
at org.jenkinsci.plugins.scriptsecurity.sandbox.groovy.GroovyCallSiteSelector.staticMethod(GroovyCallSiteSelector.java:191)
at org.jenkinsci.plugins.scriptsecurity.sandbox.groovy.SandboxInterceptor.onStaticCall(SandboxInterceptor.java:153)
at org.kohsuke.groovy.sandbox.impl.Checker$2.call(Checker.java:184)
at org.kohsuke.groovy.sandbox.impl.Checker.checkedStaticCall(Checker.java:188)
at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:95)
at com.cloudbees.groovy.cps.sandbox.SandboxInvoker.methodCall(SandboxInvoker.java:17)
at WorkflowScript.mktemp(WorkflowScript:16)
java.io.File.createTempFile() is no better: in plain Java code it works perfectly, but in Groovy it throws java.io.IOException: No such file or directory.
BTW, the /tmp/foo directory exists, and the methods have been approved on the script approval screen.
From the IOException I suspect you're calling mktemp from within a node {} block and expecting the temporary file to be created on that node. Pipeline scripts run entirely on the Jenkins master. Pipeline steps that interact with the filesystem (e.g. writeFile) are aware of node {} blocks and will be sent over to the node to be executed there, but pure Java methods know nothing about remote nodes and will interact with the master's filesystem.
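For what it's worth, the "array element type mismatch" in the stack trace comes from the sandbox's call-site selector trying to match the varargs overload Files.createTempFile(Path, String, String, FileAttribute...). In plain Java, spelling out the varargs parameter as an explicit empty array makes the chosen overload unambiguous; whether this also pacifies the Jenkins sandbox is an assumption I have not verified. A JDK-only sketch (class name is illustrative):

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileAttribute;

public class TempInDir {
    public static void main(String[] args) throws Exception {
        // Stand-in for /tmp/foo: a directory we know exists.
        Path dir = Files.createTempDirectory("foo");
        // Passing the varargs parameter as an explicit empty array removes
        // any ambiguity about which overload is meant.
        Path tmp = Files.createTempFile(dir, "pref", ".suf", new FileAttribute<?>[0]);
        System.out.println(tmp);
        Files.delete(tmp);
        Files.delete(dir);
    }
}
```

Even if that works, it only helps with the selector error; the file will still be created on the master, per the answer above.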

Apache Spark: saveAsTextFile not working correctly in Stand Alone Mode

I wrote a simple Apache Spark (1.2.0) Java program to import a text file and then write it to disk using saveAsTextFile. But the output folder either has no content (just the _SUCCESS file) or at times has incomplete data (data from just half of the tasks).
When I do rdd.count() on the RDD, it shows the correct number, so I know the RDD was constructed correctly; it is just the saveAsTextFile method that is not working.
Here is the code:
/* SimpleApp.java */
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.*;
import org.apache.spark.api.java.function.Function;

public class SimpleApp {
    public static void main(String[] args) {
        String logFile = "/tmp/READ_ME.txt"; // Should be some file on your system
        SparkConf conf = new SparkConf().setAppName("Simple Application");
        JavaSparkContext sc = new JavaSparkContext(conf);
        JavaRDD<String> logData = sc.textFile(logFile);
        logData.saveAsTextFile("/tmp/simple-output");
        System.out.println("Lines -> " + logData.count());
    }
}
This is because you're saving to a local path. Are you running multiple machines? If so, each worker is saving to its own local /tmp directory. Sometimes the driver executes a task, which is why you get part of the result locally. Really, you don't want to mix distributed mode and local file systems.
You can try code like the below (for example):
JavaSparkContext sc = new JavaSparkContext("local or your network IP", "Application name");
JavaRDD<String> lines = sc.textFile("Path Of Your File", numberOfPartitions);
System.out.println("Lines -> " + lines.count());
And then you print the number of lines contained in the file.
