Spark Structured Streaming - - apache-spark

I'm trying to run the following code from IntelliJ IDEA to print messages from Kafka to the console, but it throws the following error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
The stack trace starts at Dataset.checkpoint and goes up from there. If I remove .checkpoint(), I get a different error, related to permissions:
17/08/02 12:10:52 ERROR StreamMetadata: Error writing stream metadata StreamMetadata(4e612f22-efff-4c9a-a47a-a36eb533e9d6) to C:/Users/rp/AppData/Local/Temp/temporary-2f570b97-ad16-4f00-8356-d43ccb7660db/metadata
java.io.IOException: (null) entry in command string: null chmod 0644 C:\Users\rp\AppData\Local\Temp\temporary-2f570b97-ad16-4f00-8356-d43ccb7660db\metadata
Source:
def main(args : Array[String]) = {
val spark = SparkSession.builder().appName("SparkStreaming").master("local[*]").getOrCreate()
val canonicalSchema = new StructType()
.add("cid",StringType)
.add("uid",StringType)
.add("sourceSystem",
new StructType().add("id",StringType)
.add("name",StringType))
.add("name", new StructType()
.add("firstname",StringType)
.add("lastname",StringType))
val messages = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers","localhost:9092")
.option("subscribe","c_canonical")
.option("startingOffsets","earliest")
.load()
.checkpoint()
.select(from_json(col("value").cast("string"),canonicalSchema))
.writeStream.outputMode("append").format("console").start.awaitTermination
}
Can anyone please help me understand what I'm doing wrong?

Structured Streaming doesn't support Dataset.checkpoint(). There is an open ticket to provide a better message or just ignore it: https://issues.apache.org/jira/browse/SPARK-20927
The IOException is probably because you haven't installed Cygwin on Windows.
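As a rough sketch of what the answer implies, the query from the question can be written without Dataset.checkpoint(), leaving checkpointing to the streaming sink. The object name KafkaToConsole, the column alias "message", and the checkpoint path are assumptions made for illustration; the broker address, topic, and schema are taken from the question. The sketch also assumes the spark-sql-kafka connector is on the classpath.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{StringType, StructType}

object KafkaToConsole {
  // Same schema as in the question.
  val canonicalSchema: StructType = new StructType()
    .add("cid", StringType)
    .add("uid", StringType)
    .add("sourceSystem", new StructType().add("id", StringType).add("name", StringType))
    .add("name", new StructType().add("firstname", StringType).add("lastname", StringType))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SparkStreaming")
      .master("local[*]")
      .getOrCreate()

    val query = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "c_canonical")
      .option("startingOffsets", "earliest")
      .load()
      // No Dataset.checkpoint() call here: it is not supported on streaming Datasets.
      .select(from_json(col("value").cast("string"), canonicalSchema).as("message"))
      .writeStream
      .outputMode("append")
      .format("console")
      // Hypothetical location; streaming state is tracked through this sink option instead.
      .option("checkpointLocation", "/tmp/checkpoints/kafka-to-console")
      .start()

    query.awaitTermination()
  }
}
```

Note the option name startingOffsets (plural), which is the spelling the Kafka source documents. On Windows the checkpoint path would need to point at a directory the process can write to.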

Related

Too many open files in spark aborting spark job

In my application I am reading 40 GB of text spread across 188 files.
I split these files and create one XML file per line in Spark, using a pair RDD.
For 40 GB of input this creates many millions of small XML files, which is my requirement.
Everything works fine, but when Spark saves the files to S3 it throws an error and the job fails.
Here is the exception I get:
Caused by: java.nio.file.FileSystemException: /mnt/s3/emrfs-2408623010549537848/0000000000: Too many open files
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:91)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
at sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:214)
at java.nio.file.Files.newByteChannel(Files.java:361)
at java.nio.file.Files.createFile(Files.java:632)
at com.amazon.ws.emr.hadoop.fs.files.TemporaryFiles.create(TemporaryFiles.java:70)
at com.amazon.ws.emr.hadoop.fs.s3n.MultipartUploadOutputStream.openNewPart(MultipartUploadOutputStream.java:493)
... 21 more
ApplicationMaster host: 10.97.57.198
ApplicationMaster RPC port: 0
queue: default
start time: 1542344243252
final status: FAILED
tracking URL: http://ip-10-97-57-234.tr-fr-nonprod.aws-int.thomsonreuters.com:20888/proxy/application_1542343091900_0001/
user: hadoop
Exception in thread "main" org.apache.spark.SparkException: Application application_1542343091900_0001 finished with failed status
And this as well
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Please reduce your request rate. (Service: Amazon S3; Status Code: 503; Error Code: SlowDown; Request ID: D33581CA9A799F64; S3 Extended Request ID: /SlEplo+lCKQRVVH+zHiop0oh8q8WqwnNykK3Ga6/VM2HENl/eKizbd1rg4vZD1BZIpp8lk6zwA=), S3 Extended Request ID: /SlEplo+lCKQRVVH+zHiop0oh8q8WqwnNykK3Ga6/VM2HENl/eKizbd1rg4vZD1BZIpp8lk6zwA=
Here is my code:
import org.apache.spark.{SparkConf, SparkContext}

object TestAudit {
  def main(args: Array[String]): Unit = {
    val inputPath = args(0)
    val output = args(1)
    val noOfHashPartitioner = args(2).toInt
    //val conf = new SparkConf().setAppName("AuditXML").setMaster("local");
    val conf = new SparkConf().setAppName("AuditXML")
    val sc = new SparkContext(conf)
    val input = sc.textFile(inputPath)
    // Each input line has the form fileName|fileContent.
    val pairedRDD = input.map(row => {
      val split = row.split("\\|")
      val fileName = split(0)
      val fileContent = split(1)
      (fileName, fileContent)
    })
    import org.apache.hadoop.io.NullWritable
    import org.apache.hadoop.io.compress.GzipCodec
    import org.apache.spark.HashPartitioner
    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
    // Writes each record to a file named after its key and drops the key from the output.
    class RddMultiTextOutputFormat extends MultipleTextOutputFormat[Any, Any] {
      override def generateActualKey(key: Any, value: Any): Any = NullWritable.get()
      override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String = key.asInstanceOf[String]
    }
    pairedRDD.partitionBy(new HashPartitioner(10000)).saveAsHadoopFile("s3://a205381-tr-fr-development-us-east-1-trf-auditabilty//AUDITOUTPUT", classOf[String], classOf[String], classOf[RddMultiTextOutputFormat], classOf[GzipCodec])
  }
}
Even when I reduce the number of partitions in the HashPartitioner, it still does not work.
Every process on Unix systems has a limit on the number of open files (file descriptors). Because your data is large and Spark splits it into many partition files internally, your process hits that limit and fails.
You can raise the file descriptor limit for each user as follows:
edit /etc/security/limits.conf and add (or modify) these lines:
* hard nofile 500000
* soft nofile 500000
root hard nofile 500000
root soft nofile 500000
This sets nofile (the maximum number of file descriptors) to 500000 for every user, including root.
The changes are applied after a restart.
You can also set the limit for a specific service with LimitNOFILE. For example, if you run Spark jobs on YARN and the YARN daemons are started by systemd, you can add LimitNOFILE=128000 to the YARN systemd unit files (ResourceManager and NodeManager) to set the file descriptor limit of those processes to 128000.
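To check whether a changed limit actually reached a JVM process, the process can report its own numbers. The following is a minimal sketch, assuming a HotSpot/OpenJDK JVM on a Unix-like system where the com.sun.management.UnixOperatingSystemMXBean interface is available; the object name FdLimits is made up for illustration.

```scala
import java.lang.management.ManagementFactory
import com.sun.management.UnixOperatingSystemMXBean

object FdLimits {
  // Returns (currently open, maximum allowed) file descriptor counts for this JVM,
  // or None when the platform does not expose them.
  def current(): Option[(Long, Long)] =
    ManagementFactory.getOperatingSystemMXBean match {
      case unix: UnixOperatingSystemMXBean =>
        Some((unix.getOpenFileDescriptorCount, unix.getMaxFileDescriptorCount))
      case _ => None
    }

  def main(args: Array[String]): Unit =
    current() match {
      case Some((open, max)) => println(s"open file descriptors: $open, limit: $max")
      case None => println("file descriptor counts are not available on this platform")
    }
}
```

Calling FdLimits.current() from inside a Spark task would report the limit seen by the executor process, which may differ from the limit of the shell used to submit the job.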
related articles:
3 Methods to Change the Number of Open File Limit in Linux
Limits on the number of file descriptors

log4j in spark main code

I want to collect logs with log4j. I can get the Hadoop logger output, but not the output of the logger in my main code.
Attachment 1: log4j.properties
log4j.rootLogger=INFO, rolling
log4j.appender.rolling=org.apache.log4j.RollingFileAppender
log4j.appender.rolling.layout=org.apache.log4j.PatternLayout
log4j.appender.rolling.layout.conversionPattern=[%d] %p %m (%c)%n
log4j.appender.rolling.maxFileSize=50MB
log4j.appender.rolling.maxBackupIndex=5
log4j.appender.rolling.file=/home/spark.log
log4j.appender.rolling.encoding=UTF-8
log4j.logger.org.eclipse.jetty=WARN
log4j.logger.org.eclipse.jetty.util.component.AbstractLifeCycle=WARN
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=WARN
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=WARN
log4j.logger.com.test.main.Main=WARN
Attachment 2: com.test.main.Main
object Main {
  def main(args: Array[String]): Unit = {
    val logger = LogManager.getLogger(Main.getClass)
    logger.info("info\n")
    logger.warn("warn\n")
    logger.debug("DEBUG\n")
    logger.error("EEOR\n")
  }
}
Now I can see the Spark log output in /home/spark.log, such as
[2018-08-14 18:04:28,852] INFO Running Spark version 2.2.0.cloudera1 (org.apache.spark.SparkContext)
[2018-08-14 18:04:29,705] INFO Submitted application: steaming_Test (org.apache.spark.SparkContext)
but none of the log output from my main code, such as
18:05:17.250 [main] ERROR com.sgm.bgdt.main.Main$ - EEOR
Is the setting "log4j.logger.com.test.main.Main=WARN" wrong, or is something wrong in my main code?
PS.
These are the options I pass to spark-submit:
--driver-java-options "-Dlog4j.configuration=file:/path/log4j.properties"
--conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:/path/log4j.properties"

User class threw exception: java.util.NoSuchElementException: spark.driver.memory

I am getting the error below while running the application from IntelliJ. Does anybody know why this is failing?
User class threw exception: java.util.NoSuchElementException: spark.driver.memory
The error is thrown by this method:
// SparkConf.scala
/** Get a parameter; throws a NoSuchElementException if it's not set */
def get(key: String): String = {
  getOption(key).getOrElse(throw new NoSuchElementException(key))
}
So you probably did not set the spark.driver.memory configuration entry explicitly. How do you submit the job? Which options do you pass to the spark-submit command?
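A small sketch of the difference between the throwing lookup and its non-throwing alternatives on SparkConf may help when the key is optional in your own code. The object name DriverMemoryLookup and the fallback value "1g" are assumptions chosen for illustration; get(key), get(key, defaultValue) and getOption(key) are existing SparkConf methods.

```scala
import org.apache.spark.SparkConf
import scala.util.Try

object DriverMemoryLookup {
  val key = "spark.driver.memory"

  // loadDefaults = false: ignore spark.* system properties so the conf starts empty.
  def emptyConf(): SparkConf = new SparkConf(false)

  // Throws NoSuchElementException(key) when the entry is missing, captured here in a Try.
  def strict(conf: SparkConf): Try[String] = Try(conf.get(key))

  // Returns None instead of throwing when the entry is missing.
  def optional(conf: SparkConf): Option[String] = conf.getOption(key)

  // Falls back to an assumed default when the entry is missing.
  def withDefault(conf: SparkConf): String = conf.get(key, "1g")
}
```

If the lookup happens in code you do not control, passing the value explicitly at submit time (for example with --driver-memory or --conf spark.driver.memory=...) is the other way to make the key present.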

Error in ignite context creation

Hi, I have just started exploring Apache Ignite and am hitting an error when creating the Ignite context:
import org.apache.ignite.spark._
import org.apache.ignite.configuration._
val ic = new IgniteContext[Integer, Integer](sc, () => new IgniteConfiguration())
Error:
<console>:30: error: org.apache.ignite.spark.IgniteContext does not take type parameters
val ic = new IgniteContext[Integer, Integer](sc, () => new IgniteConfiguration())
But everywhere on the internet this line works (as shown in the published examples).
Versions: apache-ignite-1.8.0-src, spark-2.0.2-bin-hadoop2.7
I start the shell with:
./bin/spark-shell --packages org.apache.ignite:ignite-spark:1.7.0,org.apache.ignite:ignite-spring:1.8.0 --master local --repositories http://repo.maven.apache.org/maven2/org/apache/ignite
Can somebody help me with this error? Thanks.
The type parameters were removed in Ignite 1.7, according to that task.
Just change it to:
val ic = new IgniteContext(sc, () => new IgniteConfiguration())

How can I use a JSR-203 file system provider with Apache Spark?

We'd like to use the HDFS NIO.2 file system provider in a spark job. However, we've run into the classpath issues with file system providers: they have to be in the system classpath to be used through the Paths.get(URI) API. As a result, the provider is not found even when it is provided in the jar files supplied to spark-submit.
Here's the spark-submit command:
spark-submit --master local["*"] \
--jars target/dependency/jimfs-1.1.jar,target/dependency/guava-16.0.1.jar \
--class com.basistech.tc.SparkFsTc \
target/spark-fs-tc-0.0.1-SNAPSHOT.jar
And here's the job class, which fails with 'file system not found.'
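import com.google.common.jimfs.Jimfs;
import java.net.URI;
import java.nio.file.FileSystem;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Collections;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public final class SparkFsTc {
    private SparkFsTc() {
        //
    }

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("File System Test Case");
        JavaSparkContext sc = new JavaSparkContext(conf);
        JavaRDD<String> logData = sc.parallelize(Collections.singletonList("foo"));
        System.out.println(logData.getNumPartitions());
        logData.mapPartitions(itr -> {
            FileSystem fs = Jimfs.newFileSystem();
            Path path = fs.getPath("/root");
            URI uri = path.toUri();
            Paths.get(uri); // expect this to go splat.
            return null;
        }).collect();
    }
}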
Is there some mechanism to persuade Spark to add the FS provider to the appropriate classpath?
Readers should note that file system providers are special. If you read the code in the JRE, you will see:
ServiceLoader<FileSystemProvider> sl = ServiceLoader
    .load(FileSystemProvider.class, ClassLoader.getSystemClassLoader());
Providers have to be visible to the system class loader; they are not found through the application's own class loader.
This would work fine if I acquired the FileSystem object reference myself instead of using Paths.get(URI).
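One way to acquire the FileSystem reference directly is the FileSystems.newFileSystem(URI, Map, ClassLoader) overload, which also searches the given class loader for providers that are not installed on the system class path. The sketch below is only an illustration under assumptions: it uses the JDK's built-in zip ("jar:") provider so that it runs without extra jars, and the names FsViaClassLoader, open and roundTrip are made up. Whether a provider shipped with --jars is visible to the thread context class loader on a Spark executor is assumed here, not verified.

```scala
import java.net.URI
import java.nio.charset.StandardCharsets
import java.nio.file.{FileSystem, FileSystems, Files}
import java.util.Collections

object FsViaClassLoader {
  // Opens a file system for `uri`; providers that are not installed are looked up via `loader`.
  def open(uri: URI, loader: ClassLoader): FileSystem =
    FileSystems.newFileSystem(uri, Collections.singletonMap("create", "true"), loader)

  // Writes and reads back a small file using paths obtained from the FileSystem itself.
  def roundTrip(): String = {
    val zip = Files.createTempFile("fs-demo", ".zip")
    Files.delete(zip) // the zip provider creates the archive itself when create=true
    val uri = URI.create("jar:" + zip.toUri)
    val fs = open(uri, Thread.currentThread().getContextClassLoader)
    try {
      val path = fs.getPath("/hello.txt") // fs.getPath instead of Paths.get(uri)
      Files.write(path, "foo".getBytes(StandardCharsets.UTF_8))
      new String(Files.readAllBytes(path), StandardCharsets.UTF_8)
    } finally {
      fs.close()
      Files.deleteIfExists(zip)
    }
  }
}
```

Inside a task, the same pattern would keep every path derived from the FileSystem object, so the Paths.get(URI) lookup through the system class loader is never triggered.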

Resources