Error calling `JValue.extract` from distributed operations in spark-shell

I am trying to use the case class extraction feature of json4s in Spark,
i.e. calling jvalue.extract[MyCaseClass]. It works fine if I bring the JValue objects into the master and do the extraction there, but the same calls fail in the workers:
import org.json4s._
import org.json4s.jackson.JsonMethods._
import scala.util.{Try, Success, Failure}
val sqx = sqlContext
val data = sc.textFile(inpath).coalesce(2000)
case class PageView(
  client: Option[String]
)

def extract(json: JValue) = {
  implicit def formats = org.json4s.DefaultFormats
  Try(json.extract[PageView]).toOption
}
val json = data.map(parse(_)).sample(false, 1e-6).cache()
// count initial inputs
val raw = json.count
// count successful extractions locally -- same value as above
val loc = json.toLocalIterator.flatMap(extract).size
// distributed count -- always zero
val dist = json.flatMap(extract).count // always returns zero
// this throws "org.json4s.package$MappingException: Parsed JSON values do not match with class constructor"
json.map(x => {implicit def formats = org.json4s.DefaultFormats; x.extract[PageView]}).count
The implicit for Formats is defined locally in the extract function since DefaultFormats is not serializable and defining it at top level caused it to be serialized for transmission to the workers rather than constructed there. I think the problem still has something to do with the remote initialization of DefaultFormats, but I am not sure what it is.
When I call the extract method directly, instead of my extract function, as in the last example, it no longer complains about serialization but just throws an error that the JSON does not match the expected structure.
How can I get the extraction to work when distributed to the workers?
Edit
@WesleyMiao has reproduced the problem and found that it is specific to spark-shell. He reports that this code works as a standalone application.

I got the same exception as yours when running your code in spark-shell. However when I turn your code into a real spark app and submit it to a standalone spark cluster, I got expected results with no exception.
Below is the code I put in a simple spark app.
val data = sc.parallelize(Seq("""{"client":"Michael"}""", """{"client":"Wesley"}"""))
val json = data.map(parse(_))
val dist = json.mapPartitions { jsons =>
  implicit val formats = org.json4s.DefaultFormats
  jsons.map(_.extract[PageView])
}
dist.collect() foreach println
And when I ran it using spark-submit, I got the following result.
PageView(Some(Michael))
PageView(Some(Wesley))
And I am also sure that it is not running in "local[*]" mode.
Now I suspect the reason we got exceptions while running in spark-shell has something to do with how the case class PageView is defined in spark-shell and how spark-shell serializes / distributes it to the executors.

As suggested here, I would move object creation into the map. I.e. I would have a function createPageViews that has extract as an internal function, and pass createPageViews to the workers.
More precisely, I would use mapPartitions instead of map, so that createPageViews (and its internal function definition) is called only once per partition rather than once per record.
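For illustration, a minimal sketch of that suggestion as it would look in a compiled application (createPageViews is a hypothetical name; it assumes the PageView case class and the json RDD from the question):

import org.json4s._
import scala.util.Try

def createPageViews(jsons: Iterator[JValue]): Iterator[PageView] = {
  // Formats is constructed on the worker, once per partition, so nothing
  // non-serializable needs to be shipped from the driver
  implicit val formats = org.json4s.DefaultFormats
  // extract is defined inside createPageViews, mirroring the suggestion above
  def extract(json: JValue): Option[PageView] = Try(json.extract[PageView]).toOption
  jsons.flatMap(extract)
}

val dist = json.mapPartitions(createPageViews).count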

Related

Spark - SparkSession access issue

I have a problem similar to the one in
Spark java.lang.NullPointerException Error when filter spark data frame on inside foreach iterator
String_Lines.foreachRDD { line ->
    line.foreach { x ->
        // JSON to DF Example
        val sparkConfig = SparkConf().setAppName("JavaKinesisWordCountASL").setMaster("local[*]")
            .set("spark.sql.warehouse.dir", "file:///C:/tmp")
        val spark = SparkSession.builder().config(sparkConfig).orCreate
        val outer_jsonData = Arrays.asList(x)
        val outer_anotherPeopleDataset = spark.createDataset(outer_jsonData, Encoders.STRING())
        spark.read().json(outer_anotherPeopleDataset).createOrReplaceTempView("jsonInnerView")
        spark.sql("select name, address.city, address.state from jsonInnerView").show(false)
        println("Current String #" + x)
    }
}
@thebluephantom did explain it to the point. I have my code in foreachRDD now, but it still doesn't work. This is Kotlin and I am running it on my local laptop with IntelliJ. Somehow it's not picking up the SparkSession, as I understand after reading all the blogs. If I delete "spark.read and spark.sql", everything else works OK. What should I do to fix this?
If I delete "spark.read and spark.sql", everything else works OK
If you delete those, you're not actually making Spark do anything, only defining what Spark should do (Spark evaluation is lazy)
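As a quick illustration of that laziness (a sketch in Scala, assuming a running SparkSession named spark):

import org.apache.spark.sql.functions.col

val nums  = spark.range(1000000)              // transformation: nothing executes yet
val evens = nums.filter(col("id") % 2 === 0)  // still nothing executes
evens.count()                                 // action: this triggers the actual job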
Somehow it's not picking sparksession as I understand
It's "picking it up" just fine. The error is happening because it's picking up a brand new SparkSession. You should already have defined one of these outside of the forEachRDD method, but if you try to reuse it, you might run into different issues
Assuming String_Lines is already a Dataframe. There's no point in looping over all of its RDD data and trying to create brand new SparkSession. Or if it's a DStream, convert it to Streaming Dataframe instead...
That being said, you should be able to immediately select data from it
// unclear what the schema of this is
val selected = String_Lines.selectExpr("name", "address.city", "address.state")
selected.show(false)
You may need to add a get_json_object function in there if you're trying to parse strings to JSON
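For example, a sketch of that in Scala (the DataFrame name jsonDF and the column/field names here are only illustrative):

import org.apache.spark.sql.functions.{col, get_json_object}

// jsonDF is assumed to have a single string column "value" holding raw JSON
val selected = jsonDF.select(
  get_json_object(col("value"), "$.name").alias("name"),
  get_json_object(col("value"), "$.address.city").alias("city"),
  get_json_object(col("value"), "$.address.state").alias("state"))
selected.show(false)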
I was finally able to solve it.
I modified the code like this... it's clean and working.
This is the String_Lines data type:
val String_Lines: JavaDStream<String>
String_Lines.foreachRDD { x ->
    val df = spark.read().json(x)
    df.printSchema()
    df.show(2, false)
}
Thanks,
Chandra

Task not serializable in Spark REPL

Another basic thing going wrong here.
I am trying to read a file in a Spark context and skip the header of the file by doing this:
scala> val read = sc.textFile("/user/edureka/data/ls2014.tsv")
scala> val header = read.first
scala> val data = read.filter(row => (row != header))
With these I get the error "org.apache.spark.SparkException: Task not serializable".
How does serialization work in this scenario? I want to understand the basics and why it errors here.
Note: I know there are other methods to skip the header of the file. However, I would like to understand the concept of serialization in this context. Please share your views.
You have row, where I assume you mean rec:
val restfile = read.map(rec => row != header)
Should be this:
val restfile = read.map(rec => rec != header)
Probably row is something you defined earlier in the REPL that isn't serializable, and therefore cannot be automatically passed to the executors via the closure.
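As a rough illustration of the closure issue (the names below are made up): everything a lambda refers to is serialized along with it, so copying just the value you need into a plain local val keeps the closure serializable:

// Something non-serializable sitting in the REPL session
class Helper { val marker = "header" }   // not Serializable
val helper = new Helper

// This would fail with "Task not serializable": the closure captures `helper`
// read.filter(line => line != helper.marker)

// Copy the needed value into a local val first; only the String is captured
val marker = helper.marker
val rows = read.filter(line => line != marker)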

Method to get number of cores for an executor on a task node?

E.g. I need to get a list of all available executors and their respective multithreading capacity (NOT the total multithreading capacity; sc.defaultParallelism already handles that).
Since this parameter is implementation-dependent (YARN and spark-standalone have different strategies for allocating cores) and situational (it may fluctuate because of dynamic allocation and long-running jobs), I cannot use other methods to estimate it. Is there a way to retrieve this information using the Spark API in a distributed transformation? (E.g. TaskContext, SparkEnv)
UPDATE: As of Spark 1.6, I have tried the following methods:
1) Run a single-stage job with many partitions ( >> defaultParallelism ) and count the number of distinct thread IDs per executor ID:
import org.apache.spark.SparkEnv

val n = sc.defaultParallelism * 16
sc.parallelize(1 to n, n)
  .map(v => SparkEnv.get.executorId -> Thread.currentThread().getId)
  .groupByKey()
  .mapValues(_.toSeq.distinct)
  .collect()
This however leads to an estimate higher than the actual multithreading capacity, because each Spark executor uses an over-provisioned thread pool.
2) Similar to 1), except that n = defaultParallelism, and in every task I add a delay to prevent the resource negotiator from imbalanced sharding (a fast node completes its task and asks for more before slow nodes can start running):
val n = sc.defaultParallelism
sc.parallelize(1 to n, n)
  .map { v =>
    Thread.sleep(5000)
    SparkEnv.get.executorId -> Thread.currentThread().getId
  }
  .groupByKey()
  .mapValues(_.toSeq.distinct)
  .collect()
This works most of the time, but it is much slower than necessary and may be broken by a very imbalanced cluster or by task speculation.
3) I haven't tried this: use Java reflection to read BlockManager.numUsableCores. This is obviously not a stable solution; the internal implementation may change at any time.
Please tell me if you have found something better.
It is pretty easy with the Spark REST API. You have to get the application id:
val applicationId = spark.sparkContext.applicationId
the UI URL:
val baseUrl = spark.sparkContext.uiWebUrl
and query:
val url = baseUrl.map { url =>
s"${url}/api/v1/applications/${applicationId}/executors"
}
With Apache HTTP library (already in Spark dependencies, adapted from https://alvinalexander.com/scala/scala-rest-client-apache-httpclient-restful-clients):
import org.apache.http.impl.client.DefaultHttpClient
import org.apache.http.client.methods.HttpGet
import scala.util.Try
val client = new DefaultHttpClient()
val response = url
  .flatMap(url => Try{ client.execute(new HttpGet(url)) }.toOption)
  .flatMap(response => Try{
    val s = response.getEntity().getContent()
    val json = scala.io.Source.fromInputStream(s).getLines.mkString
    s.close
    json
  }.toOption)
and json4s:
import org.json4s._
import org.json4s.jackson.JsonMethods._
implicit val formats = DefaultFormats
case class ExecutorInfo(hostPort: String, totalCores: Int)
val executors: Option[List[ExecutorInfo]] = response.flatMap(json => Try {
parse(json).extract[List[ExecutorInfo]]
}.toOption)
As long as you keep the application id and UI URL at hand and open the UI port to external connections, you can do the same thing from any task.
I would try to implement a SparkListener, in a way similar to what the web UI does. This code might be helpful as an example.
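As a rough sketch of that idea (a listener only sees executors added after it is registered, so register it early on the driver):

import org.apache.spark.scheduler.{SparkListener, SparkListenerExecutorAdded, SparkListenerExecutorRemoved}
import scala.collection.concurrent.TrieMap

// Keeps a live map of executorId -> core count from executor add/remove events
class ExecutorCoreListener extends SparkListener {
  val coresPerExecutor = TrieMap.empty[String, Int]

  override def onExecutorAdded(e: SparkListenerExecutorAdded): Unit =
    coresPerExecutor(e.executorId) = e.executorInfo.totalCores

  override def onExecutorRemoved(e: SparkListenerExecutorRemoved): Unit =
    coresPerExecutor -= e.executorId
}

val listener = new ExecutorCoreListener
sc.addSparkListener(listener)
// later: listener.coresPerExecutor gives the per-executor core counts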

Transforming PySpark RDD with Scala

TL;DR - I have what looks like a DStream of Strings in a PySpark application. I want to send it as a DStream[String] to a Scala library. Strings are not converted by Py4j, though.
I'm working on a PySpark application that pulls data from Kafka using Spark Streaming. My messages are strings and I would like to call a method in Scala code, passing it a DStream[String] instance. However, I'm unable to receive proper JVM strings in the Scala code. It looks to me like the Python strings are not converted into Java strings but, instead, are serialized.
My question would be: how to get Java strings out of the DStream object?
Here is the simplest Python code I came up with:
from pyspark.streaming import StreamingContext
ssc = StreamingContext(sparkContext=sc, batchDuration=int(1))
from pyspark.streaming.kafka import KafkaUtils
stream = KafkaUtils.createDirectStream(ssc, ["IN"], {"metadata.broker.list": "localhost:9092"})
values = stream.map(lambda tuple: tuple[1])
ssc._jvm.com.seigneurin.MyPythonHelper.doSomething(values._jdstream)
ssc.start()
I'm running this code in PySpark, passing it the path to my JAR:
pyspark --driver-class-path ~/path/to/my/lib-0.1.1-SNAPSHOT.jar
On the Scala side, I have:
package com.seigneurin
import org.apache.spark.streaming.api.java.JavaDStream
object MyPythonHelper {
  def doSomething(jdstream: JavaDStream[String]) = {
    val dstream = jdstream.dstream
    dstream.foreachRDD(rdd => {
      rdd.foreach(println)
    })
  }
}
Now, let's say I send some data into Kafka:
echo 'foo bar' | $KAFKA_HOME/bin/kafka-console-producer.sh --broker-list localhost:9092 --topic IN
The println statement in the Scala code prints something that looks like:
[B@758aa4d9
I expected to get foo bar instead.
Now, if I replace the simple println statement in the Scala code with the following:
rdd.foreach(v => println(v.getClass.getCanonicalName))
I get:
java.lang.ClassCastException: [B cannot be cast to java.lang.String
This suggests that the strings are actually passed as arrays of bytes.
If I simply try to convert this array of bytes into a string (I know I'm not even specifying the encoding):
def doSomething(jdstream: JavaDStream[Array[Byte]]) = {
  val dstream = jdstream.dstream
  dstream.foreachRDD(rdd => {
    rdd.foreach(bytes => println(new String(bytes)))
  })
}
I get something that looks like (special characters might be stripped off):
�]qXfoo barqa.
This suggests the Python string was serialized (pickled?). How could I retrieve a proper Java string instead?
Long story short, there is no supported way to do something like this. Don't try this in production. You've been warned.
In general Spark doesn't use Py4j for anything other than some basic RPC calls on the driver, and it doesn't start a Py4j gateway on any other machine. When it is required (mostly MLlib and some parts of SQL), Spark uses Pyrolite to serialize objects passed between the JVM and Python.
This part of the API is either private (Scala) or internal (Python) and as such is not intended for general usage. While theoretically you can access it anyway, either per batch:
package dummy
import org.apache.spark.api.java.JavaRDD
import org.apache.spark.streaming.api.java.JavaDStream
import org.apache.spark.sql.DataFrame
object PythonRDDHelper {
  def go(rdd: JavaRDD[Any]) = {
    rdd.rdd.collect {
      case s: String => s
    }.take(5).foreach(println)
  }
}
complete stream:
object PythonDStreamHelper {
  def go(stream: JavaDStream[Any]) = {
    stream.dstream.transform(_.collect {
      case s: String => s
    }).print
  }
}
or exposing individual batches as DataFrames (probably the least evil option):
object PythonDataFrameHelper {
  def go(df: DataFrame) = {
    df.show
  }
}
and use these wrappers as follows:
from pyspark.streaming import StreamingContext
from pyspark.mllib.common import _to_java_object_rdd
from pyspark.rdd import RDD

ssc = StreamingContext(spark.sparkContext, 10)
spark.catalog.listTables()

q = ssc.queueStream([sc.parallelize(["foo", "bar"]) for _ in range(10)])

# Reserialize RDD as Java RDD<Object> and pass
# to Scala sink (only for output)
q.foreachRDD(lambda rdd: ssc._jvm.dummy.PythonRDDHelper.go(
    _to_java_object_rdd(rdd)
))

# Reserialize and convert to JavaDStream<Object>
# This is the only option which allows further transformations
# on DStream
ssc._jvm.dummy.PythonDStreamHelper.go(
    q.transform(lambda rdd: RDD(  # Reserialize but keep as Python RDD
        _to_java_object_rdd(rdd), ssc.sparkContext
    ))._jdstream
)

# Convert to DataFrame and pass to Scala sink.
# Arguably there are relatively few moving parts here.
q.foreachRDD(lambda rdd:
    ssc._jvm.dummy.PythonDataFrameHelper.go(
        rdd.map(lambda x: (x, )).toDF()._jdf
    )
)

ssc.start()
ssc.awaitTerminationOrTimeout(30)
ssc.stop()
This is not supported and untested, and as such it is rather useless for anything other than experiments with the Spark API.

I don't get any result from notebook in Bluemix Spark

I tried to execute my Scala code in the Bluemix Spark service; I can run it and get the right result on my local virtual machine. When I run it in Bluemix Spark, I cannot get any response in the notebook.
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.linalg.Matrix
val input = sc.textFile("swift://notebooks.spark/pca.csv")
val header = input.first()
val inputData = input.filter(x => x != header).map(line => line.split(','))
val inputVector = inputData.map { d =>
  Vectors.dense(
    d(1).toDouble, d(2).toDouble, d(3).toDouble, d(4).toDouble, d(5).toDouble, d(6).toDouble,
    d(7).toDouble, d(8).toDouble, d(9).toDouble, d(10).toDouble, d(11).toDouble)
}
val rowMatrix = new RowMatrix(inputVector)
val pca: Matrix = rowMatrix.computePrincipalComponents(5)
When I execute input.take(2), I get the result fine, but there is no result for input.foreach(println). It's strange. How can I get the result?
I have tested it on Bluemix in a Scala notebook.
val input = sc.textFile("swift://notebooks.spark/test.csv")
input.take(1) /** shows the first line */
input.foreach(println) /** nothing is displayed */
If you want to display the content of a RDD, then you can use the following code.
input.take(5).foreach(println) /** shows the first 5 lines */
input.collect().foreach(println) /** shows all lines */
I do not know how your local VM is set up, but I think you have to distinguish between running your code locally and running it on a cluster.
Have look at this answer for more information: How to print the contents of RDD?
