Connecting Spark and elasticsearch - apache-spark

I'm trying to run a simple Spark program that copies the contents of an RDD into Elasticsearch. Both Spark and Elasticsearch are installed on my local machine.
import org.elasticsearch.spark.sql._
import org.apache.spark.sql.SparkSession
object ES {
  case class Person(ID: Int, name: String, age: Int, numFriends: Int)
  // Parses one CSV line of the form "id,name,age,numFriends"
  def mapper(line: String): Person = {
    val fields = line.split(',')
    Person(fields(0).toInt, fields(1), fields(2).toInt, fields(3).toInt)
  }
  def main(args: Array[String]): Unit = {
    val spark: SparkSession =
      SparkSession
        .builder().master("local[*]")
        .appName("SparkEs")
        .config("es.index.auto.create", "true")
        .config("es.nodes", "localhost:9200")
        .getOrCreate()
    import spark.implicits._
    val lines = spark.sparkContext.textFile("/home/herch/fakefriends.csv")
    val people = lines.map(mapper).toDF()
    people.saveToEs("spark/people")
  }
}
I'm getting this error after multiple retries:
INFO HttpMethodDirector: I/O exception (java.net.ConnectException) caught when processing request:Connection timed out (Connection timed out)
INFO HttpMethodDirector: Retrying request
INFO DAGScheduler: ResultStage 0 (runJob at EsSparkSQL.scala:97) failed in 525.902 s due to Job aborted due to stage failure: Task 1 in stage 0.0 failed 1 times, most recent failure: Lost task 1.0 in stage 0.0 (TID 1, localhost, executor driver): org.elasticsearch.hadoop.rest.EsHadoopNoNodesLeftException: Connection error (check network and/or proxy settings)- all nodes failed; tried [[192.168.0.22:9200]]
It seems to be a connection problem, but I cannot identify its cause. Elasticsearch is running on my local machine at localhost:9200 and I am able to query it from the terminal.

Since you are running both locally, you need to set es.nodes.wan.only to true (default false) in your SparkConf. I ran into the same exact problem and that fixed it.
See: https://www.elastic.co/guide/en/elasticsearch/hadoop/current/configuration.html
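A minimal sketch of the setting described above, written as plain Scala so it runs without Spark. The error in the question lists 192.168.0.22:9200 rather than localhost, which is consistent with the connector discovering the node's published address; with es.nodes.wan.only enabled, the connector skips discovery and only uses the nodes declared in es.nodes. The object and method names are hypothetical; only the setting keys come from the connector.

```scala
// Hypothetical helper (not part of the connector): adds the WAN-only flag
// to whatever connector settings are already in use.
object EsWanOnly {
  def withWanOnly(settings: Map[String, String]): Map[String, String] =
    settings + ("es.nodes.wan.only" -> "true")

  // The settings from the question, plus the flag from the answer.
  val questionSettings: Map[String, String] = withWanOnly(Map(
    "es.index.auto.create" -> "true",
    "es.nodes" -> "localhost:9200"
  ))
}

// Usage with Spark on the classpath (kept as a comment so the sketch stays dependency-free):
//   val builder = EsWanOnly.questionSettings
//     .foldLeft(SparkSession.builder().master("local[*]").appName("SparkEs")) {
//       case (b, (key, value)) => b.config(key, value)
//     }
//   val spark = builder.getOrCreate()
```

Folding the map into the builder keeps all connector settings in one place, so the flag can be toggled without touching the session setup.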

As shown on the Elasticsearch Spark connector documentation page, you need to pass the host and port as separate settings in the configuration:
val options13 = Map("path" -> "spark/index",
"pushdown" -> "true",
"es.nodes" -> "someNode", "es.port" -> "9200")
See how es.nodes only contains the host name, and es.port contains the HTTP port.
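If the address is only available as a single "host:port" string, as in the question, it could be split into the two settings along these lines. This is a sketch with hypothetical names (EsAddress, toSettings); it does not validate that the port is numeric and does not handle IPv6 literals.

```scala
// Hypothetical helper: turns "host:port" (or just "host") into separate
// es.nodes / es.port settings.
object EsAddress {
  def toSettings(address: String, defaultPort: Int = 9200): Map[String, String] =
    address.split(":", 2) match {
      // "host:port" form: keep both parts as given
      case Array(host, port) => Map("es.nodes" -> host, "es.port" -> port)
      // no colon: fall back to the default HTTP port
      case _ => Map("es.nodes" -> address, "es.port" -> defaultPort.toString)
    }
}

object EsAddressDemo {
  def main(args: Array[String]): Unit = {
    println(EsAddress.toSettings("localhost:9200"))
    println(EsAddress.toSettings("someNode"))
  }
}
```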

Related

How to get cluster information to call REST API (from the driver)?

I want to use the Spark REST API to get metrics and publish them to CloudWatch. But the REST API URL looks like this:
val url = "http://<host>:4040/api/v1/applications/<app-name>/stages"
If I give the master host and app ID it works, but how can I use this in a job and figure out the master host and app name dynamically? Is there any way to get this information?
Using Spark 2.1.
Tried:
import org.apache.spark.sql.SparkSession
import org.json4s._
import org.json4s.jackson.JsonMethods._
import scala.io.Source.fromURL
// spark is an existing SparkSession (for example, the one provided by spark-shell)
val id = spark.sparkContext.applicationId
val url = spark.sparkContext.uiWebUrl.get
case class SparkStage(name: String, shuffleWriteBytes: Long, memoryBytesSpilled: Long, diskBytesSpilled: Long)
val path = url + "/api/v1/applications/" + id + "/stages"
implicit val formats = DefaultFormats
val json = fromURL(path).mkString
val stages: List[SparkStage] = parse(json).extract[List[SparkStage]]
I am getting :
java.io.IOException: Server returned HTTP response code: 500 for URL: http://112.21.2.151:4040/api/v1/applications/application_1515337161733_0001
at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1876)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1474)
at java.net.URL.openStream(URL.java:1045)
at scala.io.Source$.fromURL(Source.scala:141)
at scala.io.Source$.fromURL(Source.scala:131)
... 64 elided
If you know the host, you can query the applications endpoint:
http://localhost:4040/api/v1/applications
and parse the result to get the application ID.
To get the application ID and host from within the application, use the respective SparkContext methods:
val spark: SparkSession = SparkSession.builder().getOrCreate() // or your existing session
spark.sparkContext.applicationId // String
spark.sparkContext.uiWebUrl // Option[String]
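The two values above can be combined into the stages URL without calling .get on the Option. The sketch below is plain Scala with hypothetical names (SparkRestUrls, stagesUrl); in a real job the arguments would come from spark.sparkContext.uiWebUrl and spark.sparkContext.applicationId. uiWebUrl is an Option because the UI can be disabled, in which case there is no URL to query.

```scala
// Hypothetical helper: builds the REST URL for the stages of one application.
object SparkRestUrls {
  def stagesUrl(uiWebUrl: Option[String], applicationId: String): Option[String] =
    uiWebUrl.map { base =>
      // stripSuffix avoids a double slash if the base URL ends with "/"
      s"${base.stripSuffix("/")}/api/v1/applications/$applicationId/stages"
    }
}

object SparkRestUrlsDemo {
  def main(args: Array[String]): Unit = {
    // Placeholder values; inside a job, pass the SparkContext values instead
    println(SparkRestUrls.stagesUrl(Some("http://localhost:4040"), "app-123"))
    println(SparkRestUrls.stagesUrl(None, "app-123"))
  }
}
```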

Spark always using an existing SparkContext every time i run job with spark-submit

I have deployed the Spark jar on the cluster. I submit the Spark job using the spark-submit command followed by my project jar.
I have many Spark confs in my project. The conf is chosen based on which class I am running, but every time I run the Spark job I get this warning:
7/01/09 07:32:51 WARN SparkContext: Use an existing SparkContext, some configuration may not take effect.
Question: Does this mean a SparkContext already exists and my job is picking it up?
Question: Why are some configurations not taking effect?
Code:
private val conf = new SparkConf()
  .setAppName("ELSSIE_Ingest_Cassandra")
  .setMaster(sparkIp)
  .set("spark.sql.shuffle.partitions", "8")
  .set("spark.cassandra.connection.host", cassandraIp)
  .set("spark.sql.crossJoin.enabled", "true")
object SparkJob extends Enumeration {
  val Program1, Program2, Program3, Program4, Program5 = Value
}
object ElssieCoreContext {
  def getSparkSession(sparkJob: SparkJob.Value = SparkJob.RnfIngest): SparkSession = {
    val sparkSession = sparkJob match {
      case SparkJob.Program1 =>
        val updatedConf = conf.set("spark.cassandra.output.batch.size.bytes", "2048").set("spark.sql.broadcastTimeout", "2000")
        SparkSession.builder().config(updatedConf).getOrCreate()
      case SparkJob.Program2 =>
        val updatedConf = conf.set("spark.sql.broadcastTimeout", "2000")
        SparkSession.builder().config(updatedConf).getOrCreate()
    }
    sparkSession
  }
}
And in Program1.scala I call it with:
val spark = ElssieCoreContext.getSparkSession()
val sc = spark.sparkContext

Save Kafka messages into HBase through Spark. Session never closes

I am trying to use Spark Streaming to receive messages from Kafka, convert them to Put objects, and insert them into HBase.
I create an input DStream to receive messages from Kafka, then create a JobConf, and finally use saveAsHadoopDataset(JobConf) to save the records into HBase.
Every time a record is inserted into HBase, a session from HBase to ZooKeeper is set up but never closed. Once the number of connections exceeds ZooKeeper's maximum number of client connections, Spark Streaming crashes.
My code is shown below:
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.spark.SparkConf
import org.apache.spark.streaming.StreamingContext
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapred.TableOutputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapred.JobConf
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.kafka._
import kafka.serializer.StringDecoder
object ReceiveKafkaAsDstream {
def main(args: Array[String]): Unit = {
val sparkConf = new SparkConf().setAppName("ReceiveKafkaAsDstream")
val ssc = new StreamingContext(sparkConf, Seconds(1))
val topics = "test"
val brokers = "10.0.2.15:6667"
val topicSet = topics.split(",").toSet
val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicSet)
val tableName = "KafkaTable"
val conf = HBaseConfiguration.create()
conf.set("zookeeper.znode.parent", "/hbase-unsecure")
conf.set("hbase.zookeeper.property.clientPort", "2181")
conf.set(TableOutputFormat.OUTPUT_TABLE, tableName)
val jobConfig: JobConf = new JobConf(conf, this.getClass)
jobConfig.set("mapreduce.output.fileoutputformat", "/user/root/out")
jobConfig.setOutputFormat(classOf[TableOutputFormat])
jobConfig.set(TableOutputFormat.OUTPUT_TABLE, tableName)
val records = messages
.map(_._2)
.map(SampleKafkaRecord.parseToSampleRecord)
records.print()
records.foreachRDD{ stream => stream.map(SampleKafkaRecord.SampleToHbasePut).saveAsHadoopDataset(jobConfig) }
ssc.start()
ssc.awaitTermination()
}
case class SampleKafkaRecord(id: String, name: String)
object SampleKafkaRecord extends Serializable {
def parseToSampleRecord(line: String): SampleKafkaRecord = {
val values = line.split(";")
SampleKafkaRecord(values(0), values(1))
}
def SampleToHbasePut(CSVData: SampleKafkaRecord): (ImmutableBytesWritable, Put) = {
val rowKey = CSVData.id
val putOnce = new Put(rowKey.getBytes)
putOnce.addColumn("cf1".getBytes, "column-Name".getBytes, CSVData.name.getBytes)
(new ImmutableBytesWritable(rowKey.getBytes), putOnce)
}
}
}
I set the batch duration of the StreamingContext to 1s and set maxClientCnxns to 10 in the ZooKeeper configuration file zoo.cfg, so at most 10 connections are allowed from one client to ZooKeeper.
After 10 seconds (10 sessions set up from HBase to ZooKeeper), I got the error shown below:
16/08/24 14:59:30 WARN RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=localhost:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase-unsecure/hbaseid
16/08/24 14:59:31 INFO ClientCnxn: Opening socket connection to server localhost.localdomain/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
16/08/24 14:59:31 INFO ClientCnxn: Socket connection established to localhost.localdomain/127.0.0.1:2181, initiating session
16/08/24 14:59:31 WARN ClientCnxn: Session 0x0 for server localhost.localdomain/127.0.0.1:2181, unexpected error, closing socket connection and attempting reconnect
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:384)
at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:68)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125)
As I understand it, this error occurs because the number of connections exceeds ZooKeeper's maximum. If I set maxClientCnxns to 20, the streaming job lasts 20s. I know I can set maxClientCnxns to unlimited, but I really don't think that is a good way to solve this problem.
Another thing: if I use textFileStream to read text files as a DStream and save them into HBase using saveAsHadoopDataset(jobConf), it runs fine. If I just read data from Kafka using val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers) and simply print the records, there is no issue either. The problem appears only when I receive Kafka messages and then save them into HBase in the same application.
My environment is the HDP 2.4 sandbox. Versions: Spark 1.6, HBase 1.1.2, Kafka 2.10.0, ZooKeeper 3.4.6.
Any help is appreciated.
Well, I finally got it to work.
Set an attribute:
There is an attribute called "zookeeper.connection.timeout.ms". It should be set to 1s.
Change to the new API:
Change the method saveAsHadoopDataset(JobConf) to saveAsNewAPIHadoopDataset(JobConf). I still don't know why the old API does not work.
Change import org.apache.hadoop.hbase.mapred.TableOutputFormat to import org.apache.hadoop.hbase.mapreduce.TableOutputFormat

How to run this code on Spark Cluster mode

I want to run my code on a cluster.
My code:
import java.util.Properties
import edu.stanford.nlp.ling.CoreAnnotations._
import edu.stanford.nlp.pipeline._
import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.JavaConversions._
import scala.collection.mutable.ArrayBuffer
object Pre2 {
def plainTextToLemmas(text: String, pipeline: StanfordCoreNLP): Seq[String] = {
val doc = new Annotation(text)
pipeline.annotate(doc)
val lemmas = new ArrayBuffer[String]()
val sentences = doc.get(classOf[SentencesAnnotation])
for (sentence <- sentences; token <- sentence.get(classOf[TokensAnnotation])) {
val lemma = token.get(classOf[LemmaAnnotation])
if (lemma.length > 0 ) {
lemmas += lemma.toLowerCase
}
}
lemmas
}
def main(args: Array[String]): Unit = {
val conf = new SparkConf()
.setMaster("local")
.setAppName("pre2")
val sc = new SparkContext(conf)
val plainText = sc.textFile("data/in.txt")
val lemmatized = plainText.mapPartitions(p => {
val props = new Properties()
props.put("annotators", "tokenize, ssplit, pos, lemma")
val pipeline = new StanfordCoreNLP(props)
p.map(q => plainTextToLemmas(q, pipeline))
})
val lemmatized1 = lemmatized.map(l => l.head + l.tail.mkString(" "))
val lemmatized2 = lemmatized1.filter(_.nonEmpty)
lemmatized2.coalesce(1).saveAsTextFile("data/out.txt")
}
}
Cluster specification:
2 nodes
Each node has 60 GB of RAM
Each node has 48 cores
Shared disk
I installed Spark on this cluster; one node acts as both master and worker, and the other node is a worker.
When I run my code with this command in the terminal:
./bin/spark-submit --master spark://192.168.1.20:7077 --class Main --deploy-mode cluster code/Pre2.jar
it shows :
15/08/19 15:27:21 WARN RestSubmissionClient: Unable to connect to server spark://192.168.1.20:7077.
Warning: Master endpoint spark://192.168.1.20:7077 was not a REST server. Falling back to legacy submission gateway instead.
15/08/19 15:27:22 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Driver successfully submitted as driver-20150819152724-0002
... waiting before polling master for driver state
... polling master for driver state
State of driver-20150819152724-0002 is RUNNING
Driver running on 192.168.1.19:33485 (worker-20150819115013-192.168.1.19-33485)
How can I run the above code on a Spark standalone cluster?
Make sure you check the web UI on port 8080. In your example it would be 192.168.1.20:8080.
If you are running in Spark standalone cluster mode, try it without --deploy-mode cluster and set the executor memory explicitly by adding --executor-memory 60g.
"Warning: Master endpoint spark://192.168.1.20:7077 was not a REST server"
Judging from that message, it also looks like the master REST URL is different.
The REST URL can be found in the master web UI at master_url:8080.

Spark broadcasted variable returns NullPointerException when run in Amazon EMR cluster

The variables I share via broadcast are null in the cluster.
My application is quite complex, but I have written this small example that works flawlessly when I run it locally, but it fails in the cluster:
package com.gonzalopezzi.bigdata.bicing
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkContext, SparkConf}
object PruebaBroadcast2 extends App {
val conf = new SparkConf().setAppName("PruebaBroadcast2")
val sc = new SparkContext(conf)
val arr : Array[Int] = (6 to 9).toArray
val broadcasted = sc.broadcast(arr)
val rdd : RDD[Int] = sc.parallelize((1 to 4).toSeq, 2) // a small integer sequence [1, 2, 3, 4] is parallelized into two partitions
rdd.flatMap((a : Int) => List((a, broadcasted.value(0)))).reduceByKey(_+_).collect().foreach(println) // NullPointerException in the flatmap. broadcasted is null
}
I don't know if the problem is a coding error or a configuration issue.
This is the stacktrace I get:
15/07/07 20:55:13 INFO scheduler.DAGScheduler: Job 0 failed: collect at PruebaBroadcast2.scala:24, took 0.992297 s
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 6, ip-172-31-36-49.ec2.internal): java.lang.NullPointerException
at com.gonzalopezzi.bigdata.bicing.PruebaBroadcast2$$anonfun$2.apply(PruebaBroadcast2.scala:24)
at com.gonzalopezzi.bigdata.bicing.PruebaBroadcast2$$anonfun$2.apply(PruebaBroadcast2.scala:24)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:202)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:56)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
Command exiting with ret '1'
Can anyone help me fix this?
At least, can you tell me if you see something strange in the code?
If you think the code is ok, please tell me, as it would mean that the problem is in the configuration of the cluster.
Finally I got it working.
It doesn't work declaring the object like this:
object MyObject extends App {
But it works, if you declare an object with a main function:
object MyObject {
def main (args : Array[String]) {
/* ... */
}
}
So, the short example in the question works if I rewrite it this way:
object PruebaBroadcast2 {
def main (args: Array[String]) {
val conf = new SparkConf().setAppName("PruebaBroadcast2")
val sc = new SparkContext(conf)
val arr : Array[Int] = (6 to 9).toArray
val broadcasted = sc.broadcast(arr)
val rdd : RDD[Int] = sc.parallelize((1 to 4).toSeq, 2)
rdd.flatMap((a : Int) => List((a, broadcasted.value(0)))).reduceByKey(_+_).collect().foreach(println)
}
}
This problem seems related to this bug:
https://issues.apache.org/jira/browse/SPARK-4170
I had a similar issue: I used a variable inside an RDD map function and got a null value for it. This is my original code:
object MyClass extends App {
...
val prefix = "prefix"
val newRDD = inputRDD.map(s => prefix + s) // got null for prefix
...
}
And I found that it works inside any function, not just main():
object MyClass extends App {
  ...
  val prefix = "prefix"
  val newRDD = addPrefix(inputRDD, prefix)
  // prefix is now a method parameter, so the closure captures a local value instead of a field of the App object
  def addPrefix(input: RDD[String], prefix: String): RDD[String] = {
    input.map(s => prefix + s)
  }
}
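The behaviour in both answers can be reproduced without Spark. The sketch below assumes Scala 2.x, where scala.App is built on DelayedInit: the body of an object that extends App runs only when its main method is called, so its fields keep their default values until then. On an executor nothing calls that main, which is consistent with the null values reported above. All object names are hypothetical.

```scala
// Assumes Scala 2.x (scala.App implemented with DelayedInit).
object DelayedHolder extends App {
  val prefix = "prefix" // assigned only when DelayedHolder.main runs
}

object EagerHolder {
  val prefix = "prefix" // assigned when the object is first accessed
}

object DelayedInitDemo {
  def main(args: Array[String]): Unit = {
    // Nothing has called DelayedHolder.main here, which mirrors an executor JVM
    println(s"field of an App object:  ${DelayedHolder.prefix}")
    println(s"field of a plain object: ${EagerHolder.prefix}")
  }
}
```

On Scala 2.x the first line is expected to show null and the second to show prefix. Moving the code into an explicit main method, or passing the value as a method parameter, turns the field into a local value that is serialized with the closure.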

Resources