Spark mapPartitions correct usage with DataFrames

I'm struggling with the correct usage of mapPartitions.
I've successfully run my code with map, but since I don't want the resources to be loaded for every row, I'd like to switch to mapPartitions.
Here's some simple example code:
import spark.implicits._
val dataDF = spark.read.format("json").load("basefile")
val newDF = dataDF.mapPartitions { iterator =>
  iterator.map(p => Seq(1, "1"))
}.toDF("id", "newContent")
newDF.write.json("newfile")
This causes the exception
Exception in thread "main" java.lang.ClassNotFoundException: scala.Any
I'm guessing this has something to do with typing. What could the problem be?

The problem is that Seq(1, "1") is of type Seq[Any], which can't be returned from mapPartitions because Spark cannot derive an encoder for Any. Try Seq(1, 2) instead and see if that works.
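A minimal sketch of one way to keep the two-column shape from the question, assuming a tuple (Int, String) is acceptable as the row type. The object name, the sample input, and the prefix "resource" are hypothetical stand-ins, not part of the original code:

```scala
import org.apache.spark.sql.{Row, SparkSession}

object MapPartitionsSketch {
  // Hypothetical per-partition function: the "resource" is built once per
  // partition, then reused for every row in that partition.
  def tagRows(rows: Iterator[Row]): Iterator[(Int, String)] = {
    val prefix = "content-" // stand-in for something expensive to load
    rows.map(row => (1, prefix + row.getString(0)))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: local run, for illustration only
      .appName("mapPartitions-sketch")
      .getOrCreate()
    import spark.implicits._

    // Stand-in for spark.read.format("json").load("basefile")
    val dataDF = Seq("a", "b", "c").toDF("value")

    // (Int, String) has an encoder via spark.implicits._, unlike Seq[Any]
    val newDF = dataDF.mapPartitions(tagRows _).toDF("id", "newContent")
    newDF.show()

    spark.stop()
  }
}
```

Returning a tuple (or a case class) gives each column its own concrete type, so toDF("id", "newContent") can name both columns.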

Related

Spark - SparkSession access issue

I have a problem similar to one in
Spark java.lang.NullPointerException Error when filter spark data frame on inside foreach iterator
String_Lines.foreachRDD { line ->
    line.foreach { x ->
        // JSON to DF Example
        val sparkConfig = SparkConf().setAppName("JavaKinesisWordCountASL").setMaster("local[*]").
            set("spark.sql.warehouse.dir", "file:///C:/tmp")
        val spark = SparkSession.builder().config(sparkConfig).orCreate
        val outer_jsonData = Arrays.asList(x)
        val outer_anotherPeopleDataset = spark.createDataset(outer_jsonData, Encoders.STRING())
        spark.read().json(outer_anotherPeopleDataset).createOrReplaceTempView("jsonInnerView")
        spark.sql("select name, address.city, address.state from jsonInnerView").show(false)
        println("Current String #" + x)
    }
}
@thebluephantom explained it to the point. I have my code in foreachRDD now, but it still doesn't work. This is Kotlin, and I am running it on my local laptop with IntelliJ. From everything I've read, it seems it somehow isn't picking up the SparkSession. If I delete "spark.read and spark.sql", everything else works OK. What should I do to fix this?

If I delete "spark.read and spark.sql", everything else works OK
If you delete those, you're not actually making Spark do anything, only defining what should happen (Spark transformations are lazy; nothing runs until an action is called).
Somehow it's not picking sparksession as I understand
It's "picking it up" just fine. The error is happening because it's picking up a brand new SparkSession. You should already have defined one of these outside of the foreachRDD method, but if you try to reuse it, you might run into different issues.
Assuming String_Lines is already a DataFrame, there's no point in looping over all of its RDD data and trying to create a brand new SparkSession. Or, if it's a DStream, convert it to a streaming DataFrame instead.
That being said, you should be able to immediately select data from it
// unclear what the schema of this is
val selected = String_Lines.selectExpr("name", "address.city", "address.state")
selected.show(false)
You may need to add a get_json_object function in there if you're trying to parse strings to JSON
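As a hedged illustration of the get_json_object suggestion, here is a small Scala sketch. It assumes the JSON strings sit in a single string column named value (a hypothetical name), and the sample record is made up to match the fields used in the question:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, get_json_object}

object JsonFieldsSketch {
  // Pulls three fields out of a string column holding one JSON document per row.
  def selectFields(lines: DataFrame): DataFrame =
    lines.select(
      get_json_object(col("value"), "$.name").as("name"),
      get_json_object(col("value"), "$.address.city").as("city"),
      get_json_object(col("value"), "$.address.state").as("state")
    )

  def main(args: Array[String]): Unit = {
    // One session for the whole application, created outside any per-record loop
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: local run, for illustration only
      .appName("json-fields-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample record standing in for the stream contents
    val lines = Seq(
      """{"name":"Ann","address":{"city":"Austin","state":"TX"}}"""
    ).toDF("value")

    selectFields(lines).show(false)
    spark.stop()
  }
}
```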

I am able to solve it finally.
I modified the code like this. It's clean and working.
This is the data type of String_Lines:
val String_Lines: JavaDStream<String>
String_Lines.foreachRDD { x ->
    val df = spark.read().json(x)
    df.printSchema()
    df.show(2, false)
}
Thanks,
Chandra

Spark - chained transformations lead to exception

I'm working with data that has the following schema
Array(Struct(field1, field2)) -> let's call it arr
Performing the following operation - chained withColumn:
df = df.withColumn("arr_exploded", df.col("arr")).withColumn("field1", df.col("arr_exploded.field1"))
Leads to a crash with the following error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot resolve column name "arr_exploded.field1" among (arr);
So that means the second withColumn is executing first. Why does this happen and how to prevent it?
Note, I found out that the following solutions work, which one is better?
/* Two Line approach */
df = df.withColumn("arr_exploded", df.col("arr"))
df = df.withColumn("field1", df.col("arr_exploded.field1"))
/* Checkpoint approach */
df = df.withColumn("arr_exploded", df.col("arr")).checkpoint().withColumn("field1", df.col("arr_exploded.field1"))

DataFrames are immutable by nature; each method, withColumn included, returns a new instance.
When you call df.col("arr_exploded.field1"), your df reference still points to the old instance, which has no arr_exploded column, so the name cannot be resolved. The second withColumn is not executing first.
The first approach is better. You could also do it in one line by using column references that are not bound to the old df:
import spark.implicits._
df.withColumn("arr_exploded", $"arr").withColumn("field1", $"arr_exploded.field1")
The Java way:
import static org.apache.spark.sql.functions.col;
df.withColumn("arr_exploded", col("arr")).withColumn("field1", col("arr_exploded.field1"));
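To make the difference concrete, here is a small self-contained sketch. The Item and Record case classes are hypothetical stand-ins for the Array(Struct(field1, field2)) schema from the question; only the chaining with col(...) mirrors the advice above:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical types standing in for Array(Struct(field1, field2))
case class Item(field1: Int, field2: String)
case class Record(arr: Seq[Item])

object ChainedWithColumnSketch {
  // col(...) is resolved against whichever DataFrame it is applied to, so the
  // second withColumn can see the column added by the first.
  def addColumns(df: DataFrame): DataFrame =
    df.withColumn("arr_exploded", col("arr"))
      .withColumn("field1", col("arr_exploded.field1"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: local run, for illustration only
      .appName("chained-withColumn-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(Record(Seq(Item(1, "a"), Item(2, "b")))).toDF()
    addColumns(df).printSchema()

    spark.stop()
  }
}
```

Because arr_exploded is still an array of structs in this sketch (nothing is actually exploded, as in the question), field1 comes out as an array of the field1 values.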

Running threads in Spark DataFrame foreachPartition()

I use multiple threads inside foreachPartition(), which works great for me except for when the underlying iterator is TungstenAggregationIterator. Here is a minimal code snippet to reproduce:
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration
import scala.concurrent.{Await, Future}
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
object Reproduce extends App {
  val sc = new SparkContext("local", "reproduce")
  val sqlContext = new SQLContext(sc)
  import sqlContext.implicits._
  val df = sc.parallelize(Seq(1)).toDF("number").groupBy("number").count()
  df.foreachPartition { iterator =>
    val f = Future(iterator.toVector)
    Await.result(f, Duration.Inf)
  }
}
When I run this, I get:
java.lang.NullPointerException
at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:751)
at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:84)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
I believe I actually understand why this happens - TungstenAggregationIterator uses a ThreadLocal variable that returns null when called from a thread other than the original thread that got the iterator from Spark. From examining the code, this does not appear to differ between recent Spark versions.
However, this limitation is specific to TungstenAggregationIterator, and not documented, as far as I'm aware.
Is there a way to work around this limitation of TungstenAggregationIterator? Any relevant documentation? I have a workaround for this, but it's quite hacky and unnecessarily reduces runtime performance.
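For illustration only, one possible shape of a workaround is to drain the iterator on the task thread and hand only the buffered rows to other threads. This is a sketch under that assumption, not a documented Spark recommendation. The object name and the per-row work are hypothetical, and buffering a whole partition costs memory:

```scala
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration
import scala.concurrent.{Await, Future}
import org.apache.spark.sql.{Row, SparkSession}

object ThreadedPartitionSketch {
  // Consumes the iterator on the calling (task) thread, then fans the
  // already-materialized rows out to other threads.
  def processPartition(iterator: Iterator[Row]): Vector[Long] = {
    val rows = iterator.toVector // iterator is only touched on this thread
    val futures = rows.map(row => Future(row.getLong(1) * 2)) // hypothetical work
    Await.result(Future.sequence(futures), Duration.Inf)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local") // same local setup as the reproduction
      .appName("threaded-partition-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(1).toDF("number").groupBy("number").count()

    // df.rdd is used here only to keep the sketch simple
    df.rdd.foreachPartition { iterator =>
      processPartition(iterator)
      ()
    }

    spark.stop()
  }
}
```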

NotSerializableException: org.apache.hadoop.io.LongWritable

I know this question has been answered many times, but I have tried everything and could not find a solution. I have the following code, which raises a NotSerializableException:
val ids : Seq[Long] = ...
ids.foreach { id =>
  sc.sequenceFile("file", classOf[LongWritable], classOf[MyWritable]).lookup(new LongWritable(id))
}
With the following exception
Caused by: java.io.NotSerializableException: org.apache.hadoop.io.LongWritable
Serialization stack:
...
org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:84)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:301)
When creating the SparkContext, I do
val sparkConfig = new SparkConf().setAppName("...").setMaster("...")
sparkConfig.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
sparkConfig.registerKryoClasses(Array(classOf[BitString[_]], classOf[MinimalBitString], classOf[org.apache.hadoop.io.LongWritable]))
sparkConfig.set("spark.kryoserializer.classesToRegister", "org.apache.hadoop.io.LongWritable,org.apache.hadoop.io.Text,org.apache.hadoop.io.LongWritable")
and looking at the environment tab, I can see these entries. However, I do not understand why (1) the Kryo serializer does not seem to be used (the stack does not mention Kryo), and (2) LongWritable is not serialized.
I'm using Apache Spark v. 1.5.1

Repeatedly loading the same data inside a loop is extremely inefficient. If you perform actions against the same data, load it once and cache it:
val rdd = sc
  .sequenceFile("file", classOf[LongWritable], classOf[MyWritable])
rdd.cache()
Spark doesn't consider Hadoop Writables to be serializable. There is an open JIRA (SPARK-2421) for this. To handle LongWritables, a simple get should be enough:
rdd.map{case (k, v) => k.get()}
Regarding your custom class, it is your responsibility to deal with this problem.
An effective lookup requires a partitioned RDD. Otherwise it has to search every partition in your RDD.
import org.apache.spark.HashPartitioner
val numPartitions: Int = ???
val partitioned = rdd.partitionBy(new HashPartitioner(numPartitions))
Generally speaking, RDDs are not designed for random access. Even with a defined partitioner, lookup has to search the candidate partition linearly. With 5000 uniformly distributed keys and 10M objects in an RDD, it most likely means a repeated search over the whole RDD. You have a few options to avoid that:
filter:
val idsSet = sc.broadcast(ids.toSet)
rdd.filter { case (k, v) => idsSet.value.contains(k.get()) }
join (the keys must first be converted to Long, and the values must be serializable for the shuffle):
val idsRdd = sc.parallelize(ids).map((_, null))
idsRdd.join(rdd.map { case (k, v) => (k.get(), v) }).map { case (k, (_, v)) => (k, v) }
IndexedRDD, although it doesn't look like a particularly active project.
With 10M entries you'll probably be better off searching locally in memory than using Spark. For larger data you should consider using a proper key-value store.
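The load-once, convert, and filter advice can be put together in a short round-trip sketch. It assumes Text values instead of the asker's MyWritable (which is not shown), and the object name, temp path, and sample data are hypothetical:

```scala
import java.nio.file.Files
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object WritableLookupSketch {
  // Loads the sequence file once, copies keys and values out of the Writables
  // right away, and keeps only the requested ids via a broadcast set.
  def lookupAll(sc: SparkContext, path: String, ids: Seq[Long]): RDD[(Long, String)] = {
    val idsSet = sc.broadcast(ids.toSet)
    sc.sequenceFile(path, classOf[LongWritable], classOf[Text])
      .map { case (k, v) => (k.get(), v.toString) } // plain, serializable types
      .filter { case (k, _) => idsSet.value.contains(k) }
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local[*]") // assumption: local run, for illustration only
      .setAppName("writable-lookup-sketch")
    val sc = new SparkContext(conf)

    // Write a tiny sequence file so the sketch has something to read
    val path = Files.createTempDirectory("seq-sketch").resolve("data").toString
    sc.parallelize(Seq((1L, "a"), (2L, "b"), (3L, "c"))).saveAsSequenceFile(path)

    lookupAll(sc, path, Seq(1L, 3L)).collect().foreach(println)
    sc.stop()
  }
}
```

Converting to Long and String before anything is cached, collected, or shuffled means no Writable instance ever has to be serialized.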

I'm new to Apache Spark but tried to solve your problem; please evaluate whether this helps with the serialization issue. It occurs because, for Spark, Hadoop's LongWritable and other Writables are not serializable.
val temp_rdd = sc.parallelize(ids.map(id =>
sc.sequenceFile("file", classOf[LongWritable], classOf[LongWritable]).toArray.toSeq
)).flatMap(identity)
ids.foreach(id =>temp_rdd.lookup(new LongWritable(id)))

Try this solution. It worked fine for me.
SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("SparkMapReduceApp");
conf.registerKryoClasses(new Class<?>[]{
    LongWritable.class,
    Text.class
});

Error calling `JValue.extract` from distributed operations in spark-shell

I am trying to use the case class extraction feature of json4s in Spark,
i.e. calling jvalue.extract[MyCaseClass]. It works fine if I bring the JValue objects into the master and do the extraction there, but the same calls fail in the workers:
import org.json4s._
import org.json4s.jackson.JsonMethods._
import scala.util.{Try, Success, Failure}
val sqx = sqlContext
val data = sc.textFile(inpath).coalesce(2000)
case class PageView(
  client: Option[String]
)
def extract(json: JValue) = {
  implicit def formats = org.json4s.DefaultFormats
  Try(json.extract[PageView]).toOption
}
val json = data.map(parse(_)).sample(false, 1e-6).cache()
// count initial inputs
val raw = json.count
// count successful extractions locally -- same value as above
val loc = json.toLocalIterator.flatMap(extract).size
// distributed count -- always zero
val dist = json.flatMap(extract).count // always returns zero
// this throws "org.json4s.package$MappingException: Parsed JSON values do not match with class constructor"
json.map(x => {implicit def formats = org.json4s.DefaultFormats; x.extract[PageView]}).count
The implicit for Formats is defined locally in the extract function, since DefaultFormats is not serializable and defining it at top level caused it to be serialized for transmission to the workers rather than constructed there. I think the problem still has something to do with the remote initialization of DefaultFormats, but I am not sure what it is.
When I call the extract method directly instead of my extract function, as in the last example, it no longer complains about serialization but just throws an error that the JSON does not match the expected structure.
How can I get the extraction to work when distributed to the workers?
Edit
@WesleyMiao has reproduced the problem and found that it is specific to spark-shell. He reports that this code works as a standalone application.

I got the same exception as you when running your code in spark-shell. However, when I turned your code into a real Spark app and submitted it to a standalone Spark cluster, I got the expected results with no exception.
Below is the code I put in a simple Spark app.
val data = sc.parallelize(Seq("""{"client":"Michael"}""", """{"client":"Wesley"}"""))
val json = data.map(parse(_))
val dist = json.mapPartitions { jsons =>
  implicit val formats = org.json4s.DefaultFormats
  jsons.map(_.extract[PageView])
}
dist.collect() foreach println
When I ran it using spark-submit, I got the following result.
PageView(Some(Michael))
PageView(Some(Wesley))
I am also sure that it was not running in "local[*]" mode.
I now suspect that the exceptions in spark-shell have something to do with the PageView case class being defined in spark-shell and with how spark-shell serializes and distributes it to the executors.

As suggested here, I would move object creation into the map. That is, I would have a function createPageViews that has extract as an internal function, and pass createPageViews to the workers.
More precisely, I would use mapPartitions instead of map, so that createPageViews (and its internal function definition) is called only once per partition rather than once per record.
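A rough sketch of what such a createPageViews could look like, assuming a standalone application with PageView defined at top level (as in the previous answer). The exact signature, the error handling with Try, and the sample input are assumptions for illustration:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.json4s._
import org.json4s.jackson.JsonMethods._
import scala.util.Try

// Top-level case class, as in the standalone app from the previous answer
case class PageView(client: Option[String])

object CreatePageViewsSketch {
  // Runs once per partition, so Formats is constructed on the worker and is
  // never serialized as part of a closure.
  def createPageViews(lines: Iterator[String]): Iterator[PageView] = {
    implicit val formats: Formats = DefaultFormats
    def extract(json: JValue): Option[PageView] = Try(json.extract[PageView]).toOption
    lines
      .flatMap(line => Try(parse(line)).toOption) // skip lines that fail to parse
      .flatMap(json => extract(json))             // skip records that fail to extract
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local[*]") // assumption: local run, for illustration only
      .setAppName("create-page-views-sketch")
    val sc = new SparkContext(conf)

    val data = sc.parallelize(Seq("""{"client":"Michael"}""", """{"client":"Wesley"}"""))
    data.mapPartitions(createPageViews).collect().foreach(println)

    sc.stop()
  }
}
```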
