spark parallelize writes using list of dataframes - apache-spark

I have a list of DataFrames created using JDBC. Is there a way to write them out to Parquet in parallel?
val listOfTableNameAndDf = for {
  table <- tableNames
} yield (table, sqlContext.read.jdbc(jdbcUrl, table, new Properties))
I can write them sequentially, but is there a way to parallelize the writes or otherwise make this faster?
listOfTableNameAndDf.map { x =>
  x._2.write.mode(org.apache.spark.sql.SaveMode.Overwrite).parquet(getStatingDir(x._1))
}

You can use Futures to perform the write actions asynchronously:
dfs.map { case (name, table) =>
  Future(table.write.mode("overwrite").parquet(getStatingDir(name)))
}
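This only starts the writes, though; the driver still has to wait on the resulting futures before it exits. A fuller sketch including the imports and the wait, assuming the same dfs collection and the global execution context:
import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration
import scala.concurrent.ExecutionContext.Implicits.global

// Start all writes asynchronously, then block until every one has finished.
val writes = dfs.map { case (name, table) =>
  Future(table.write.mode("overwrite").parquet(getStatingDir(name)))
}
Await.result(Future.sequence(writes), Duration.Inf)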
but I doubt it will result in any significant improvement. In a case like yours there are a few main bottlenecks:
Cluster resources - if any single job saturates the available resources, the remaining jobs will be queued just as before.
Input source throughput - the source database has to keep up with the cluster.
Output IO - the output store has to keep up with the cluster.
If the source and the output are the same for each job, the jobs will compete for the same set of resources, and sequential execution of the driver code is almost never the issue.
If you're looking for improvements in the current code, I would recommend starting with the reader method that has the following signature:
jdbc(url: String, table: String, columnName: String,
     lowerBound: Long, upperBound: Long, numPartitions: Int,
     connectionProperties: Properties)
It requires more effort to use, but typically exhibits much better performance because the reads (and as a result the data) are distributed between the worker nodes.
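For illustration, here is a sketch of such a partitioned read; the partitioning column "id" and its bounds are placeholders, and in practice the bounds usually come from a min/max query against the source table:
import java.util.Properties

// Each of the 16 partitions issues its own range query on "id",
// so the table is read by multiple executors in parallel.
val df = sqlContext.read.jdbc(
  jdbcUrl,
  "my_table",
  columnName = "id",
  lowerBound = 0L,
  upperBound = 1000000L,
  numPartitions = 16,
  connectionProperties = new Properties)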

Related

How can I stop this Spark flatmap, which returns massive results, failing on writing?

I'm using a flatMap function to split absolutely huge XML files into tens of thousands of smaller XML String fragments which I want to write out to Parquet. This has a high rate of stage failure; exactly where is a bit cryptic, but it seems to happen while the DataFrameWriter is writing, when I lose an executor, probably because I'm exceeding some storage boundary.
To give a flavour, here's the class that's used in the flatMap, with some pseudo-code. Note that the class returns an Iterable - which I had hoped would allow Spark to stream the results from the flatMap, rather than (I suspect) holding it all in memory before writing it:
class XmlIterator(filepath: String, split_element: String) extends Iterable[String] {
  // open an XMLEventReader on a FileInputStream on the filepath
  // Implement an Iterable that returns a chunk of the XML file at a time
  def iterator = new Iterator[String] {
    def hasNext = {
      // advance in the input stream and return true if there's something to return
    }
    def next = {
      // return the current chunk as a String
    }
  }
}
And here is how I use it:
var dat = [a one-column DataFrame containing a bunch of paths to giga-files]

dat.repartition(1375) // repartition to the number of rows, as I want the DataFrameWriter
                      // to write out as soon as each file is processed
  .flatMap(rec => new XmlIterator(rec, "bibrecord"))
  .write
  .parquet("some_path")
This works beautifully for a few files in parallel but for larger batches I suffer stage failure. One part of the stack trace suggests to me that Spark is in fact holding the entire results of each flatMap as an array before writing out:
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
To be honest, I thought that by implementing the flatMap as an Iterable, Spark would be able to pull the results out one by one and avoid buffering the entire result in memory, but I'm a bit baffled.
Can anyone suggest an alternative, more memory-efficient strategy for saving out the results of the flatMap?
For what it's worth, I've managed to solve this myself by adding an intermediate stage that persists the flatMap output to disk. This lets me repartition the output of the flatMap before passing it to the DataFrameWriter. It works seamlessly.
dat.repartition(1375)
  .flatMap(rec => new XmlIterator(rec, "bibrecord"))
  .persist(StorageLevel.DISK_ONLY)
  .repartition(5000)
  .write
  .parquet("some_path")
I suspect that trying to pass the flatMap output directly to a DataFrameWriter was overwhelming some internal buffer - the output from each flatMap could be as much as 5 GB, and I assume Spark needed to hold all of this in memory.
If anyone has comments or pointers to the internal workings of the DataFrameWriter that would be super interesting.

Spark converting dataframe to RDD takes a huge amount of time, lazy execution or real issue?

In my spark application, I am loading data from Solr into a dataframe, running an SQL query on it, and then writing the resulting dataframe to MongoDB.
I am using spark-solr library to read data from Solr and mongo-spark-connector to write results to MongoDB.
The problem is that it is very slow: for datasets as small as 90 rows in an RDD, the Spark job takes around 6 minutes to complete (4 nodes, 96 GB RAM and 32 cores each).
I am sure that reading from Solr and writing to MongoDB are not the slow parts, because outside Spark both perform very fast.
When I inspect running jobs/stages/tasks on application master UI, it always shows a specific line in this function as taking 99% of the time:
override def exportData(spark: SparkSession, result: DataFrame): Unit = {
  try {
    val mongoWriteConfig = configureWriteConfig
    MongoSpark.save(result.withColumn("resultOrder", monotonically_increasing_id())
      .rdd
      .map(row => {
        implicit val formats: DefaultFormats.type = org.json4s.DefaultFormats
        val rMap = Map(row.getValuesMap(row.schema.fieldNames.filterNot(_.equals("resultOrder"))).toSeq: _*)
        val m = Map[String, Any](
          "queryId" -> queryId,
          "queryIndex" -> opIndex,
          "resultOrder" -> row.getAs[Long]("resultOrder"),
          "result" -> rMap
        )
        Document.parse(Serialization.write(m))
      }), mongoWriteConfig)
  } catch {
    case e: SparkException => handleMongoException(e)
  }
}
The line .rdd is shown to take most of the time to execute. Other stages take a few seconds or less.
I know that converting a dataframe to an rdd is not an inexpensive call but for 90 rows it should not take this long. My local standalone spark instance can do it in a few seconds.
I understand that Spark executes transformations lazily. Does that mean the operations before the .rdd call are what is actually taking so long, and it's just a display issue in the application master UI? Or is the dataframe-to-rdd conversion itself really taking this long? What can cause this?
By the way, SQL queries run on the dataframe are pretty simple ones, just a single group by etc.
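One rough way to tell the two apart is to force the upstream DataFrame before the conversion, so that the Solr read and the SQL query show up as their own stage in the UI; a minimal sketch (result is the DataFrame passed into the function above):
// Materialize the upstream query first; its cost is then attributed to this
// count instead of being folded into the later .rdd-based stage.
val cached = result.cache()
cached.count()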

Convert a Spark SQL batch source to structured streaming sink

I am trying to convert an org.apache.spark.sql.sources.CreatableRelationProvider into an org.apache.spark.sql.execution.streaming.Sink by simply implementing addBatch(...) to call createRelation(...), but there is a df.rdd inside createRelation(...), which causes the following error:
org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:374)
I was trying to look into how org.apache.spark.sql.execution.streaming.FileStreamSink, which also needs to get an RDD from the dataframe in a streaming job, handles this; it seems to play the trick of using df.queryExecution.executedPlan.execute() to generate the RDD instead of calling .rdd.
However, things do not seem to be that simple:
It seems the output ordering might need to be taken care of - https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L159
Might be some eager execution concerns? (not sure)
https://issues.apache.org/jira/browse/SPARK-20865
More details of the issue I am running into can be found here
Wondering what would be the idiomatic way to do this conversion?
Dataset.rdd() creates a new plan that breaks the incremental planning. Because StreamExecution uses the existing plan to collect metrics and update the watermark, we should never create a new plan; otherwise metrics and the watermark are updated in the new plan, and StreamExecution cannot retrieve them.
Here is an example of Scala code that converts column values in Structured Streaming; conversionFunctions stands for a per-column set of converters from Catalyst's internal representation back to external types:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.InternalRow

// conversionFunctions: one converter per column, defined elsewhere
val convertedRows: RDD[Row] = df.queryExecution.toRdd.mapPartitions { iter: Iterator[InternalRow] =>
  iter.map { row =>
    val convertedValues: Array[Any] = new Array(conversionFunctions.length)
    var i = 0
    while (i < conversionFunctions.length) {
      convertedValues(i) = conversionFunctions(i)(row, i)
      i += 1
    }
    Row.fromSeq(convertedValues)
  }
}

Method to get the number of cores for an executor on a task node?

E.g. I need to get a list of all available executors and their respective multithreading capacity (NOT the total multithreading capacity; sc.defaultParallelism already handles that).
Since this parameter is implementation-dependent (YARN and Spark standalone have different strategies for allocating cores) and situational (it may fluctuate because of dynamic allocation and long-running jobs), I cannot estimate it any other way. Is there a way to retrieve this information using the Spark API in a distributed transformation (e.g. TaskContext, SparkEnv)?
UPDATE: As of Spark 1.6, I have tried the following methods:
1) Run a single-stage job with many partitions ( >> defaultParallelism ) and count the number of distinct thread IDs for each executor ID:
import org.apache.spark.SparkEnv

val n = sc.defaultParallelism * 16
sc.parallelize(1 to n, n)
  .map(v => SparkEnv.get.executorId -> Thread.currentThread().getId)
  .groupByKey()
  .mapValues(_.toSeq.distinct)
  .collect()
This however leads to an estimate higher than the actual multithreading capacity, because each Spark executor uses an overprovisioned thread pool.
2) Similar to 1), except that n = defaultParallelism, and in every task I add a delay to prevent the resource negotiator from imbalanced sharding (a fast node completes its task and asks for more before slow nodes can start running):
val n = sc.defaultParallelism
sc.parallelize(1 to n, n)
  .map { v =>
    Thread.sleep(5000)
    SparkEnv.get.executorId -> Thread.currentThread().getId
  }
  .groupByKey()
  .mapValues(_.toSeq.distinct)
  .collect()
It works most of the time, but it is much slower than necessary and may be broken by a very imbalanced cluster or by task speculation.
3) I haven't tried this: use Java reflection to read BlockManager.numUsableCores. This is obviously not a stable solution; the internal implementation may change at any time.
Please tell me if you have found something better.
It is pretty easy with the Spark REST API. You have to get the application id:
val applicationId = spark.sparkContext.applicationId
the UI URL:
val baseUrl = spark.sparkContext.uiWebUrl
and build the query URL:
val url = baseUrl.map { url =>
  s"${url}/api/v1/applications/${applicationId}/executors"
}
With the Apache HTTP library (already in the Spark dependencies; adapted from https://alvinalexander.com/scala/scala-rest-client-apache-httpclient-restful-clients):
import org.apache.http.impl.client.DefaultHttpClient
import org.apache.http.client.methods.HttpGet
import scala.util.Try

val client = new DefaultHttpClient()
val response = url
  .flatMap(url => Try { client.execute(new HttpGet(url)) }.toOption)
  .flatMap(response => Try {
    val s = response.getEntity().getContent()
    val json = scala.io.Source.fromInputStream(s).getLines.mkString
    s.close
    json
  }.toOption)
and json4s:
import org.json4s._
import org.json4s.jackson.JsonMethods._

implicit val formats = DefaultFormats

case class ExecutorInfo(hostPort: String, totalCores: Int)

val executors: Option[List[ExecutorInfo]] = response.flatMap(json => Try {
  parse(json).extract[List[ExecutorInfo]]
}.toOption)
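From there, a small usage sketch gives the per-executor capacity the question asks for:
// Total cores per executor, e.g. Map("host1:12345" -> 8, ...)
val coresPerExecutor: Map[String, Int] =
  executors.getOrElse(Nil).map(e => e.hostPort -> e.totalCores).toMap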
As long as you keep the application id and the UI URL at hand, and open the UI port to external connections, you can do the same thing from any task.
I would try to implement a SparkListener in a way similar to what the web UI does. This code might be helpful as an example.
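A minimal sketch of that listener idea, relying on the standard executor-added/removed events (the map lives on the driver, so it would have to be shipped to tasks separately if needed there):
import scala.collection.mutable
import org.apache.spark.scheduler.{SparkListener, SparkListenerExecutorAdded, SparkListenerExecutorRemoved}

// Tracks executorId -> totalCores as executors come and go (driver side only).
class ExecutorCoresListener extends SparkListener {
  val coresByExecutor = mutable.Map.empty[String, Int]

  override def onExecutorAdded(added: SparkListenerExecutorAdded): Unit =
    coresByExecutor(added.executorId) = added.executorInfo.totalCores

  override def onExecutorRemoved(removed: SparkListenerExecutorRemoved): Unit =
    coresByExecutor -= removed.executorId
}

val coreTracker = new ExecutorCoresListener
sc.addSparkListener(coreTracker)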

How to effectively read millions of rows from Cassandra?

I have the hard task of reading millions of rows from a Cassandra table. This table actually contains around 40-50 million rows.
The data is internal URLs for our system, and we need to fire all of them. To fire them, we are using Akka Streams, and it has been working pretty well, applying back pressure as needed. But we still have not found a way to read everything effectively.
What we have tried so far:
Reading the data as a stream using Akka Streams. We are using phantom-dsl, which provides a publisher for a specific table, but it does not read everything, only a small portion; it actually stops reading after the first 1 million rows.
Reading with Spark by a specific date. Our table is modeled as a time-series table, with year, month, day, minutes... columns. Right now we are selecting by day, so Spark will not fetch too much data to process at once, but it is a pain to select all those days.
The code is the following:
val cassandraRdd =
  sc
    .cassandraTable("keyspace", "my_table")
    .select("id", "url")
    .where("year = ? and month = ? and day = ?", date.getYear, date.getMonthOfYear, date.getDayOfMonth)
Unfortunately I can't iterate over the partitions to fetch less data at a time; I have to use a collect because it complains that the actor is not serializable.
val httpPool: Flow[(HttpRequest, String), (Try[HttpResponse], String), HostConnectionPool] =
  Http().cachedHostConnectionPool[String](host, port).async

val source =
  Source
    .actorRef[CassandraRow](10000000, OverflowStrategy.fail)
    .map(row => makeUrl(row.getString("id"), row.getString("url")))
    .map(url => HttpRequest(uri = url) -> url)

val ref = Flow[(HttpRequest, String)]
  .via(httpPool.withAttributes(ActorAttributes.supervisionStrategy(decider)))
  .to(Sink.actorRef(httpHandlerActor, IsDone))
  .runWith(source)

cassandraRdd.collect().foreach { row =>
  ref ! row
}
I would like to know if any of you have experience reading millions of rows to do something other than aggregations and the like.
I have also thought about reading everything and sending it to a Kafka topic, where I would consume it with streaming (Spark or Akka), but the problem would be the same: how do I load all that data effectively?
EDIT
For now I'm running on a cluster with a reasonable amount of memory (100 GB), doing a collect and iterating over the result.
Also, this is quite different from pulling big data into Spark and analyzing it with things like reduceByKey, aggregateByKey, etc.
I need to fetch and send everything over HTTP =/
So far it is working the way I did it, but I'm afraid the data will keep getting bigger, to a point where fetching everything into memory makes no sense.
Streaming this data, fetching it in chunks, would be the best solution, but I haven't found a good approach for it yet.
In the end, I'm thinking of using Spark to get all that data, generating a CSV file, and using Akka Stream IO to process it; that way I would avoid keeping a lot of things in memory, since it takes hours to process every million rows.
Well, after spending some time reading, talking with other people and doing tests, the result could be achieved with the following code sample:
val sc = new SparkContext(sparkConf)

val cassandraRdd = sc.cassandraTable(config.getString("myKeyspace"), "myTable")
  .select("key", "value")
  .as((key: String, value: String) => (key, value))
  .partitionBy(new HashPartitioner(2 * sc.defaultParallelism))
  .cache()

cassandraRdd
  .groupByKey()
  .foreachPartition { partition =>
    partition.foreach { row =>
      implicit val system = ActorSystem()
      implicit val materializer = ActorMaterializer()

      val myActor = system.actorOf(Props(new MyActor(system)), name = "my-actor")

      val source = Source.fromIterator { () => row._2.toIterator }

      source
        .map { str =>
          myActor ! Count
          str
        }
        .to(Sink.actorRef(myActor, Finish))
        .run()
    }
  }

sc.stop()

class MyActor(system: ActorSystem) extends Actor {
  var count = 0

  def receive = {
    case Count =>
      count = count + 1
    case Finish =>
      println(s"total: $count")
      system.shutdown()
  }
}

case object Count
case object Finish
What I'm doing is the following:
Try to achieve a good number of partitions and a partitioner using the partitionBy and groupByKey methods.
Use cache to prevent a data shuffle, which would make Spark move large amounts of data across nodes and use heavy IO.
Create the whole actor system with its dependencies, as well as the stream, inside the foreachPartition method. There is a trade-off here: you could have a single ActorSystem, but then you would have to make bad use of .collect as I wrote in the question. By creating everything inside, you still have the ability to run things inside Spark, distributed across your cluster.
Finish each actor system at the end of the iterator using Sink.actorRef with a kill message (Finish).
Perhaps this code could be improved further, but so far I'm happy not to be using .collect anymore and to be working only inside Spark.
