How to pass output of one function to another in Spark

I am sending the output of one function, which is a DataFrame, to another function.
val df1 = fun1
val df11 = df1.collect
val df2 = df11.map(x => fun2(x, df3))
The above two lines are written in the main function. df1 is very large, so if I collect it on the driver it causes out-of-memory or GC issues.
What are the ways to pass the output of one function to another in Spark?

Spark can run the data processing for you. You don't need the intermediate collect step. You should just chain all of the transformations together and then add an action at the end to save the resulting data out to disk.
Calling collect() is only useful for debugging very small results.
For example, you could do something like this:
rdd.map(x => fun1(x))
  .map(y => fun2(y))
  .saveAsObjectFile("output_path")  // saveAsObjectFile requires a target path
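With DataFrames the same idea applies: keep everything as transformations and let a single write trigger the whole chain. A minimal sketch, assuming fun2's per-row logic can be expressed as a DataFrame operation (here a hypothetical join with df3 on an "id" column) and using a placeholder output path:
val df1 = fun1                        // a DataFrame, never collected to the driver
val df2 = df1.join(df3, Seq("id"))    // hypothetical stand-in for fun2(x, df3)
df2.write.parquet("/tmp/df2_output")  // the write action triggers the whole chain at once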
This article might be helpful to explain more about this:
http://www.agildata.com/apache-spark-rdd-vs-dataframe-vs-dataset/

Related

Caching in spark before diverging the flow

I have a basic question regarding working with Spark DataFrame.
Consider the following piece of pseudo code:
val df1 = // Lazy Read from csv and create dataframe
val df2 = // Filter df1 on some condition
val df3 = // Group by on df2 on certain columns
val df4 = // Join df3 with some other df
val subdf1 = // All records from df4 where id < 0
val subdf2 = // All records from df4 where id > 0
// Then some more operations on subdf1 and subdf2 which won't trigger Spark evaluation yet
// Write out subdf1
// Write out subdf2
Suppose I start off with the main dataframe df1 (which I lazily read from the CSV), do some operations on this dataframe (filter, groupBy, join), and then come to a point where I split this dataframe based on a condition (e.g., id > 0 and id < 0). Then I proceed to operate further on these sub dataframes (let us name them subdf1, subdf2) and ultimately write out both of them.
Notice that the write function is the only command that triggers the Spark evaluation; the rest of the functions (filter, groupBy, join) result in lazy evaluations.
Now when I write out subdf1, I am clear that lazy evaluation kicks in and all the statements are evaluated, starting from reading the CSV to create df1.
My question comes when we start writing out subdf2. Does Spark understand the divergence in the code at df4 and store this dataframe when the command for writing out subdf1 was encountered? Or will it again start from the first line, creating df1 and re-evaluating all the intermediary dataframes?
If so, is it a good idea to cache the dataframe df4 (assuming I have sufficient memory)?
I'm using scala spark if that matters.
Any help would be appreciated.
No, Spark cannot infer that from your code. It will start all over again. To confirm this, you can do subdf1.explain() and subdf2.explain() and you should see that both dataframes have query plans that start right from the beginning where df1 was read.
So you're right that you should cache df4 to avoid redoing all the computations starting from df1, if you have enough memory. And of course, remember to unpersist by doing df4.unpersist() at the end if you no longer need df4 for any further computations.
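As a minimal sketch of that pattern (df4, subdf1 and subdf2 stand in for the dataframes from the pseudo code, and the output paths are just placeholders):
df4.cache()                          // materialized by the first action, then reused

val subdf1 = df4.filter("id < 0")
val subdf2 = df4.filter("id > 0")

subdf1.explain()                     // without cache() the plan would start at the CSV read

subdf1.write.parquet("/tmp/subdf1")  // first action: computes df4 and fills the cache
subdf2.write.parquet("/tmp/subdf2")  // second action: reads df4 from the cache

df4.unpersist()                      // release the memory once both writes are done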

Splitting a pipeline in spark?

Assume that I have a Spark pipeline like this (formatted to emphasize the important steps):
val foos1 = spark_session.read(foo_file).flatMap(toFooRecord)
.map(someComplicatedProcessing)
.map(transform1)
.distinct().collect().toSet
I'm adding a similar pipeline:
val foos2 = spark_session.read(foo_file).flatMap(toFooRecord)
.map(someComplicatedProcessing)
.map(transform2)
.distinct().collect().toSet
Then I do something with both results.
I'd like to avoid doing someComplicatedProcessing twice (not parsing the file twice is nice, too).
Is there a way to take the stream after the .map(someComplicatedProcessing) step and create two parallel streams feeding off it?
I know that I can store the intermediate result on disk and thus save the CPU time at the cost of more I/O. Is there a better way? What words do I web-search for?
First option - cache intermediate results:
val cached = spark_session.read(foo_file).flatMap(toFooRecord)
.map(someComplicatedProcessing)
.cache
val foos1 = cached.map(transform1)
.distinct().collect().toSet
val foos2 = cached.map(transform2)
.distinct().collect().toSet
Second option - use RDD and make a single pass:
val foos = spark_session.read(foo_file)
.flatMap(toFooRecord)
.map(someComplicatedProcessing)
.rdd
.flatMap(x => Seq(("t1", transform1(x)), ("t2", transform2(x))))
.distinct
.collect
.groupBy(_._1)
.mapValues(_.map(_._2))
val foos1 = foos("t1")
val foos2 = foos("t2")
The second option may require some type wrangling if transform1 and transform2 have incompatible return types.
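If they do differ, one sketch of that wrangling is to tag the two branches with Either instead of string keys (here processed stands for the RDD after .map(someComplicatedProcessing)):
val tagged = processed.flatMap(x => Seq(Left(transform1(x)), Right(transform2(x))))

val collected = tagged.distinct().collect()
val foos1 = collected.collect { case Left(a)  => a }.toSet   // transform1 results
val foos2 = collected.collect { case Right(b) => b }.toSet   // transform2 results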

How to run a repetitive task in parallel in Spark instead of looping sequentially?

I have just started with Spark, and I know that sequential, non-functional looping should be avoided if Spark is to give me maximum performance.
I have a function makeData. I need to create a dataframe with the return value of this function by calling this function n times. Currently, my code looks like this:
var myNewDF = sqlContext.createDataFrame(sc.emptyRDD[Row], minority_set.schema)
for ( i <- 1 to n ) {
myNewDF = myNewDF.unionAll(sqlContext.createDataFrame(sc.parallelize(makeData()),minority_set.schema))
}
Is there a way of doing this where each iteration happens in parallel?
Here the problem is that n can be large, and this loop takes a lot of time. Is there a more Scala/Spark-friendly way of achieving the same thing?
Since your whole dataset is already in memory (guessing by sc.parallelize(makeData())), there's no point in using Spark SQL's unionAll to do the unioning, which is also local (yet partitioned!).
I'd use Scala alone, and only once all the records are merged would I build a Dataset from it.
With that, I'd do something as follows:
val dataset = (1 to n).par.map { _ => makeData() }.reduce (_ ++ _)
val df = sqlContext.createDataFrame(sc.parallelize(dataset), minority_set.schema)
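Alternatively, if makeData() is serializable and can run on the executors (and returns a collection of Rows, as assumed above), a sketch of pushing the loop onto the cluster itself:
// Sketch: run the n makeData() calls on the executors instead of the driver
// (assumes makeData() is serializable and returns a collection of Rows).
val rowsRdd = sc.parallelize(1 to n).flatMap(_ => makeData())
val df = sqlContext.createDataFrame(rowsRdd, minority_set.schema)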

How to effectively read millions of rows from Cassandra?

I have the hard task of reading millions of rows from a Cassandra table. This table actually contains around 40~50 million rows.
The data is actually internal URLs for our system, and we need to fire all of them. To fire them, we are using Akka Streams, and it has been working pretty well, applying back pressure as needed. But we still have not found a way to read everything effectively.
What we have tried so far:
Reading the data as a stream using Akka Streams. We are using phantom-dsl, which provides a publisher for a specific table. But it does not read everything, only a small portion; it actually stops reading after the first 1 million rows.
Reading using Spark by a specific date. Our table is modeled like a time series table, with year, month, day, minutes... columns. Right now we are selecting by day, so Spark will not fetch a lot of data to be processed, but it is a pain to select all those days.
The code is the following:
val cassandraRdd =
  sc
    .cassandraTable("keyspace", "my_table")
    .select("id", "url")
    .where("year = ? and month = ? and day = ?", date.getYear, date.getMonthOfYear, date.getDayOfMonth)
Unfortunately I can't iterate over the partitions to fetch less data; I have to use collect because it complains that the actor is not serializable.
val httpPool: Flow[(HttpRequest, String), (Try[HttpResponse], String), HostConnectionPool] =
  Http().cachedHostConnectionPool[String](host, port).async

val source =
  Source
    .actorRef[CassandraRow](10000000, OverflowStrategy.fail)
    .map(row => makeUrl(row.getString("id"), row.getString("url")))
    .map(url => HttpRequest(uri = url) -> url)

val ref = Flow[(HttpRequest, String)]
  .via(httpPool.withAttributes(ActorAttributes.supervisionStrategy(decider)))
  .to(Sink.actorRef(httpHandlerActor, IsDone))
  .runWith(source)

cassandraRdd.collect().foreach { row =>
  ref ! row
}
I would like to know if any of you have experience reading millions of rows to do something other than aggregations and the like.
I have also thought about reading everything and sending it to a Kafka topic, where I would consume it using streaming (Spark or Akka), but the problem would be the same: how do you load all that data effectively?
EDIT
For now, I'm running on a cluster with a reasonable amount of memory (100 GB), doing a collect, and iterating over the result.
Also, this is far different from getting big data with Spark and analyzing it using things like reduceByKey, aggregateByKey, etc.
I need to fetch and send everything over HTTP =/
So far it is working the way I did it, but I'm afraid this data will get bigger and bigger to the point where fetching everything into memory makes no sense.
Streaming this data would be the best solution, fetching it in chunks, but I haven't found a good approach for this yet.
In the end, I'm thinking of using Spark to get all that data, generating a CSV file, and using Akka Stream IO to process it; this way I would avoid keeping a lot of things in memory, since it takes hours to process every million rows.
Well, after spending some time reading, talking with other people, and doing tests, the result could be achieved with the following code sample:
val sc = new SparkContext(sparkConf)

val cassandraRdd = sc.cassandraTable(config.getString("myKeyspace"), "myTable")
  .select("key", "value")
  .as((key: String, value: String) => (key, value))
  .partitionBy(new HashPartitioner(2 * sc.defaultParallelism))
  .cache()

cassandraRdd
  .groupByKey()
  .foreachPartition { partition =>
    partition.foreach { row =>
      implicit val system = ActorSystem()
      implicit val materializer = ActorMaterializer()

      val myActor = system.actorOf(Props(new MyActor(system)), name = "my-actor")

      val source = Source.fromIterator { () => row._2.toIterator }

      source
        .map { str =>
          myActor ! Count
          str
        }
        .to(Sink.actorRef(myActor, Finish))
        .run()
    }
  }

sc.stop()

class MyActor(system: ActorSystem) extends Actor {
  var count = 0

  def receive = {
    case Count =>
      count = count + 1
    case Finish =>
      println(s"total: $count")
      system.shutdown()
  }
}

case object Count
case object Finish
What I'm doing is the following:
Try to achieve a good number of partitions and a partitioner using the partitionBy and groupByKey methods
Use cache to prevent data shuffles, which make Spark move large amounts of data across nodes, use high I/O, etc.
Create the whole actor system with its dependencies, as well as the stream, inside the foreachPartition method. There is a trade-off here: you could have only one ActorSystem, but then you would have to misuse .collect as I wrote in the question. Creating everything inside instead, you still keep the ability to run things inside Spark, distributed across your cluster.
Finish each actor system at the end of the iterator using Sink.actorRef with a kill message (Finish)
Perhaps this code could be improved even more, but so far I'm happy to no longer use .collect and to work only inside Spark.

Apache Spark: stepwise execution

For a performance measurement I want to execute my Scala program written for Spark stepwise, i.e.
execute first operator; materialize result;
execute second operator; materialize result;
...
and so on. The original code:
var filename = new String("<filename>")
var text_file = sc.textFile(filename)
var counts = text_file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
counts.saveAsTextFile("file://result")
So I want the execution of var counts = text_file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b) to be stepwise.
Is calling counts.foreachPartition(x => {}) after every operator the right way to do it?
Or is writing to /dev/null with saveAsTextFile() a better alternative? Does Spark actually have something like a NullSink for that purpose? I wasn't able to write to /dev/null with saveAsTextFile() because /dev/null already exists. Is there a way to overwrite the Spark result folder?
And should the temporary result after each operation be cached with cache()?
What is the best way to separate the execution?
Spark supports two types of operations: actions and transformations. Transformations, as the name implies, turn datasets into new ones through the combination of the transformation operator and (in some cases, optionally) a function provided to the transformation. Actions, on the other hand, run through a dataset with some computation to provide a value to the driver.
There are two things that Spark does that makes your desired task a little difficult: it bundles non-shuffling transformations into execution blocks called stages and stages in the scheduling graph must be triggered through actions.
For your case, provided your input isn't massive, I think it would be easiest to trigger your transformations with a dummy action (e.g. count(), collect()) as the RDD will be materialized. During RDD computation, you can check the Spark UI to gather any performance statistics about the steps/stages/jobs used to create it.
This would look something like:
val text_file = sc.textFile(filename)
val words = text_file.flatMap(line => line.split(" "))
words.count()
val wordCount = words.map(word => (word, 1))
wordCount.count()
val wordCounts = wordCount.reduceByKey(_ + _)
wordCounts.count()
Some notes:
Since RDDs are, for all intents and purposes, immutable, they should be stored in vals
You can shorten your reduceByKey() syntax with underscore notation
Your approach with foreachPartition() could work since it is an action, but it would require a change in your functions since you are operating over an iterator on your partition
Caching only makes sense if you either create multiple RDDs from a parent RDD (branching out) or run iterated computation over the same RDD (perhaps in a loop)
You can also simply invoke RDD.persist() or RDD.cache() after every transformation, but ensure that you have the right StorageLevel defined.
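For instance, a minimal sketch of that approach, reusing the word-count RDDs from above (MEMORY_AND_DISK is just one possible StorageLevel choice, not a recommendation from the answer):
import org.apache.spark.storage.StorageLevel

val words = text_file.flatMap(line => line.split(" "))
words.persist(StorageLevel.MEMORY_AND_DISK)
words.count()                                    // materializes and stores the RDD

val wordCounts = words.map(word => (word, 1)).reduceByKey(_ + _)
wordCounts.persist(StorageLevel.MEMORY_AND_DISK)
wordCounts.count()                               // built from the persisted parent

words.unpersist()                                // free the storage once it is no longer needed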
