I am using scala parallel collections.
val largeList = list.par.map(x => largeComputation(x)).toList
It is blazing fast, but I have a feeling that I may run into out-of-memory issues if we run too many "largeComputation" calls in parallel.
Therefore, when testing, I would like to know how many threads the parallel collection is using and, if need be, how I can configure the number of threads for parallel collections.
Here is a piece of the Scaladoc that explains how to change the task support and wrap a ForkJoinPool inside it. When you instantiate the ForkJoinPool, you pass the desired parallelism level as the parameter:
Here is a way to change the task support of a parallel collection:
import scala.collection.parallel._
val pc = mutable.ParArray(1, 2, 3)
pc.tasksupport = new ForkJoinTaskSupport(new scala.concurrent.forkjoin.ForkJoinPool(2))
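So for your case it will be (where n is the desired number of threads):
// On Scala 2.12+, use java.util.concurrent.ForkJoinPool instead (scala.concurrent.forkjoin is deprecated there)
val parList = list.par
parList.tasksupport = new ForkJoinTaskSupport(
  new scala.concurrent.forkjoin.ForkJoinPool(n)
)
val largeList = parList.map(x => largeComputation(x)).toList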
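If the worry is memory pressure from too many computations running at once, one way to check the effective limit while testing is to count how many calls are in flight. The following is a minimal sketch that uses plain Futures on a fixed-size thread pool instead of parallel collections (a swapped-in technique, not the task-support API quoted above); `largeComputation` here is a hypothetical stand-in and `BoundedParallelism`, `run` and `maxThreads` are names made up for illustration.

```scala
import java.util.concurrent.Executors
import java.util.concurrent.atomic.AtomicInteger
import scala.concurrent.duration._
import scala.concurrent.{Await, ExecutionContext, Future}

object BoundedParallelism {
  // Hypothetical stand-in for the question's largeComputation.
  def largeComputation(x: Int): Int = { Thread.sleep(20); x * 2 }

  // Applies largeComputation to every element with at most maxThreads calls
  // running at once. Returns the results in input order together with the
  // highest number of simultaneous calls that was observed.
  def run(xs: List[Int], maxThreads: Int): (List[Int], Int) = {
    val pool = Executors.newFixedThreadPool(maxThreads)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
    val running = new AtomicInteger(0)
    val peak = new AtomicInteger(0)
    try {
      val futures = xs.map { x =>
        Future {
          val now = running.incrementAndGet()
          peak.accumulateAndGet(now, (a: Int, b: Int) => math.max(a, b))
          try largeComputation(x) finally running.decrementAndGet()
        }
      }
      (Await.result(Future.sequence(futures), 1.minute), peak.get)
    } finally pool.shutdown()
  }

  def main(args: Array[String]): Unit = {
    val (results, peak) = run((1 to 20).toList, maxThreads = 2)
    println(s"results = $results, peak concurrency = $peak")
  }
}
```

The peak can never exceed the pool size, so lowering maxThreads is a direct way to trade speed for memory.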
I am learning Scala and, as an exercise, I am transforming some Python (PySpark) code to Scala (Spark/Scala) code. Everything was going OK until I started dealing with Scala threads. So, do you know how I can rewrite the following code in Scala?
Thank You in Advance!
import threading

def load_tables(table_name, spark):
    source_path = f"s3://data/tables/{table_name}"
    table = spark.read.format("csv").load(source_path)
    table.createOrReplaceTempView(table_name)

def read_initial_tables(spark):
    threads = []
    tables = ["table1", "table2", "table3"]
    for table in tables:
        t = threading.Thread(target=load_tables, args=(table, spark))
        threads.append(t)
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
...passing arguments into threads...
Scala uses the Java standard libraries, and starting a thread in Java is a little bit different from starting a thread in Python. The main difference is, in Python you can choose any target (i.e., any function or callable object) for the thread's top-level, and you can pass in any args that you like. But when you start a Java thread, the top-level function must be a no-argument method named run() that belongs to an object that implements java.lang.Runnable.
Your Python thread's top-level function is load_tables(table, spark). So, what you need in your Scala program is a thread whose top-level function is a run() function that calls load_tables(table, spark).
I don't actually know Scala, but maybe the example on this web page will steer you in the right direction: https://alvinalexander.com/scala/how-to-create-java-thread-runnable-in-scala/
Basically, I think all you have to do is follow his example, and put your load_tables(table, spark) call in the place where his example says, "your custom behavior here."
Solomon is right. I could not describe it better. Taking advantage of the syntactic sugar Scala provides over Java, the Scala version is no longer than your Python code:
import org.apache.spark.sql.SparkSession

def load_tables(table_name: String, spark: SparkSession): Runnable = () => {
  val source_path = s"s3://data/tables/$table_name"
  val table = spark.read.format("csv").load(source_path)
  table.createOrReplaceTempView(table_name)
}

def read_initial_tables(spark: SparkSession): Unit = {
  val tables = List("table1", "table2", "table3")
  val threads = for {
    table <- tables
  } yield new Thread(load_tables(table, spark))
  for (thread <- threads)
    thread.start()
  for (thread <- threads)
    thread.join()
}
You might ask where the run() method that Solomon was talking about is. The empty parentheses () right after the = sign in the definition of load_tables are the empty parameter list of run(), and the block of code between curly braces after the => sign is the body of run().
So a call to load_tables actually returns a new Runnable instance.
This is called Single Abstract Method (SAM) conversion, available since Scala 2.12. It is just syntactic sugar that gives the impression that load_tables is a callable as in Python, but it actually isn't: only its return value is, because it is a Runnable object. This short version is only possible because Runnable is a functional interface.
I'm not a specialist in Spark, so I'm not sure if this is the idiomatic way to code in Scala with Spark, but it's a good starting point to go from here.
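To make the SAM conversion described above concrete, here is a small sketch that puts the lambda form next to the explicit anonymous-class form it stands for. It assumes Scala 2.12 or later; `SamDemo`, `loaded`, `loadLambda` and `loadExplicit` are made-up names, and the body only records the table name instead of touching Spark.

```scala
import java.util.concurrent.ConcurrentHashMap

object SamDemo {
  // Thread-safe set standing in for "tables that have been loaded".
  val loaded = ConcurrentHashMap.newKeySet[String]()

  // Lambda form: the function literal is converted into a Runnable,
  // and its body becomes the body of run().
  def loadLambda(name: String): Runnable = () => { loaded.add(name); () }

  // Explicit form: the same thing written out as an anonymous class.
  def loadExplicit(name: String): Runnable = new Runnable {
    override def run(): Unit = { loaded.add(name); () }
  }

  def main(args: Array[String]): Unit = {
    val threads = List(new Thread(loadLambda("table1")), new Thread(loadExplicit("table2")))
    threads.foreach(_.start())
    threads.foreach(_.join())
    println(s"loaded: $loaded")
  }
}
```

Both methods return an object whose run() is only executed once a thread is started with it; calling loadLambda by itself does no work.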
Maybe not really what you are looking for but it could be interesting. Scala has some very convenient stuff for parallelization of collections with the method .par:
val parallelizedList = List(1, 2, 3, 4).par
parallelizedList.foreach(i => println(i)) // this is executed in parallel, not sequentially
// output:
// 2
// 4
// 1
// 3
So you can use this syntax with Spark to read multiple tables in parallel:
def loadTable(tableName: String, spark: SparkSession): Unit = {
  val sourcePath = s"s3://data/tables/$tableName"
  val table = spark.read.format("csv").load(sourcePath)
  table.createOrReplaceTempView(tableName)
}
val tableNames = List("table1", "table2", "table3")
tableNames.par.foreach(name => loadTable(name, spark))
EDIT
If you use Scala 2.12 or earlier, parallel collections are available out of the box. In 2.13 they have been moved to their own module, scala/scala-parallel-collections, so you need to add the dependency and import the converters:
libraryDependencies += "org.scala-lang.modules" %% "scala-parallel-collections" % "1.0.0"
import scala.collection.parallel.CollectionConverters._
Be careful if the actions you execute on a parallel collection modify the same data. This can lead to non-deterministic behaviour (see @Alin Gabriel Arhip's comment below).
Apparently, using parallel collections with Spark is not really encouraged (also see @Alin Gabriel Arhip's comment below), but I've never had any problem with them so far (although I usually only use them for very simple processing that I know won't use all available resources).
Say I have an object and I need to perform some operations on a member of this object, arr:
object A {
  val arr = (0 to 1000000).toList
  def main(args: Array[String]): Unit = {
    //...init spark context
    val rdd: RDD[Int] = ...
    rdd.map(arr.contains(_)).saveAsTextFile...
  }
}
What is the difference between broadcasted arr and not broadcasted?
i.e.
val arrBr = sc.broadcast(arr)
rdd.map(arrBr.value.contains(_))
and
rdd.map(arr.contains(_))
In my opinion, the object A is a singleton object, so it will be transferred through the nodes in Spark.
Is it necessary to use broadcast in this scenario?
In the case
rdd.map(arr.contains(_))
arr is serialized and shipped with each task
while in
val arrBr = sc.broadcast(arr)
rdd.map(arrBr.value.contains(_))
this is only done once per executor.
Therefore you should use broadcast when dealing with large data structures.
Just two additional things to mention besides Raphael's answer, which is correct. You must always consider the size of the variable that you broadcast: it shouldn't be too large, otherwise Spark will have difficulty distributing it efficiently across the cluster. In your case the raw Int data is roughly:
4 B x 1,000,000 = 4,000,000 B ~ 4 MB
(the in-memory List[Int] will be larger than that, because its elements are boxed). That is already around the default 4 MB of spark.broadcast.blockSize, the size of the blocks into which Spark splits a broadcast value for distribution; it can be changed by modifying that setting.
Another factor in deciding whether or not to use broadcast is whether you have joins and want to avoid shuffling. When you broadcast a dataframe, its keys are immediately available on every node, which avoids retrieving data from different nodes (shuffling).
E.g. I need to get a list of all available executors and their respective multithreading capacity (NOT the total multithreading capacity; sc.defaultParallelism already handles that).
Since this parameter is implementation-dependent (YARN and Spark standalone have different strategies for allocating cores) and situational (it may fluctuate because of dynamic allocation and long-running jobs), I cannot use other methods to estimate it. Is there a way to retrieve this information using the Spark API in a distributed transformation? (E.g. TaskContext, SparkEnv)
UPDATE As of Spark 1.6, I have tried the following methods:
1) Run a 1-stage job with many partitions ( >> defaultParallelism ) and count the number of distinct thread IDs for each executor ID:
val n = sc.defaultParallelism * 16
sc.parallelize(1 to n, n)
  .map(_ => SparkEnv.get.executorId -> Thread.currentThread().getId)
  .groupByKey()
  .mapValues(_.toSeq.distinct)
  .collect()
This however leads to an estimation higher than actual multithreading capacity because each Spark executor uses an overprovisioned thread pool.
2) Similar to 1, except that n = defaultParallelism, and in every task I add a delay to prevent the resource negotiator from imbalanced sharding (a fast node completes its task and asks for more before slow nodes can start running):
val n = sc.defaultParallelism
sc.parallelize(1 to n, n)
  .map { _ =>
    Thread.sleep(5000)
    SparkEnv.get.executorId -> Thread.currentThread().getId
  }
  .groupByKey()
  .mapValues(_.toSeq.distinct)
  .collect()
It works most of the time, but it is much slower than necessary and may be broken by a very imbalanced cluster or by task speculation.
3) I haven't tried this: use Java reflection to read BlockManager.numUsableCores. This is obviously not a stable solution, since the internal implementation may change at any time.
Please tell me if you have found something better.
It is pretty easy with the Spark REST API. You have to get the application id:
val applicationId = spark.sparkContext.applicationId
ui URL:
val baseUrl = spark.sparkContext.uiWebUrl
and query:
val url = baseUrl.map { url =>
s"${url}/api/v1/applications/${applicationId}/executors"
}
With Apache HTTP library (already in Spark dependencies, adapted from https://alvinalexander.com/scala/scala-rest-client-apache-httpclient-restful-clients):
import org.apache.http.impl.client.DefaultHttpClient
import org.apache.http.client.methods.HttpGet
import scala.util.Try
val client = new DefaultHttpClient()
val response = url
  .flatMap(url => Try { client.execute(new HttpGet(url)) }.toOption)
  .flatMap(response => Try {
    val s = response.getEntity().getContent()
    val json = scala.io.Source.fromInputStream(s).getLines.mkString
    s.close
    json
  }.toOption)
and json4s:
import org.json4s._
import org.json4s.jackson.JsonMethods._
implicit val formats = DefaultFormats
case class ExecutorInfo(hostPort: String, totalCores: Int)
val executors: Option[List[ExecutorInfo]] = response.flatMap(json => Try {
  parse(json).extract[List[ExecutorInfo]]
}.toOption)
As long as you keep the application id and the UI URL at hand and open the UI port to external connections, you can do the same thing from any task.
I would try to implement SparkListener in a way similar to web UI does. This code might be helpful as an example.
I have a list of dataframes created using JDBC. Is there a way to write them in parallel as Parquet?
val listOfTableNameAndDf = for {
table <- tableNames
} yield (table, sqlContext.read.jdbc(jdbcUrl, table, new Properties))
I can write them sequentially, but is there a way to parallelize the writes or make them faster?
listOfTableNameAndDf.map { x =>
  x._2.write.mode(org.apache.spark.sql.SaveMode.Overwrite).parquet(getStatingDir(x._1))
}
You can use Future to perform the write actions asynchronously:
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global
listOfTableNameAndDf.map { case (name, table) =>
  Future(table.write.mode("overwrite").parquet(getStatingDir(name)))
}
but I doubt it will result in any significant improvement. In a case like yours there are a few main bottlenecks:
Cluster resources - if any job saturates the available resources, the remaining jobs will be queued as before.
Input source throughput - the source database has to keep up with the cluster.
Output sink I/O - the output sink has to keep up with the cluster.
If the source and the output are the same for each job, the jobs will compete for the same set of resources, and sequential execution of the driver code is almost never the issue.
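If you do go the Future route, the futures also need to be awaited, otherwise the driver may move on (or exit) before the writes finish. Below is a minimal sketch of that pattern using only the standard library: `ParallelWrites`, `writeAll` and `parallelism` are names invented for illustration, and the `write` function is a placeholder for whatever performs the actual Parquet write.

```scala
import java.util.concurrent.Executors
import scala.concurrent.duration._
import scala.concurrent.{Await, ExecutionContext, Future}

object ParallelWrites {
  // Runs `write` for every (name, value) pair on a pool of `parallelism` threads
  // and blocks until all of them have finished. If any write fails, the
  // exception is rethrown here instead of being lost in a background thread.
  def writeAll[A](items: List[(String, A)], parallelism: Int)(write: (String, A) => Unit): Unit = {
    val pool = Executors.newFixedThreadPool(parallelism)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
    try {
      val writes = items.map { case (name, value) => Future(write(name, value)) }
      Await.result(Future.sequence(writes), 1.hour) // the timeout is an arbitrary choice
      ()
    } finally pool.shutdown()
  }

  def main(args: Array[String]): Unit =
    writeAll(List("t1" -> 1, "t2" -> 2, "t3" -> 3), parallelism = 2) { (name, value) =>
      println(s"writing $name ($value) on ${Thread.currentThread().getName}")
    }
}
```

With the question's data this would be called roughly as ParallelWrites.writeAll(listOfTableNameAndDf, 4) { (name, df) => df.write.mode("overwrite").parquet(getStatingDir(name)) }. The bottlenecks listed above still apply, so the pool size only limits how many jobs are submitted concurrently.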
If you're looking for improvements to the current code, I would recommend starting with the reader method that has the following signature:
jdbc(url: String, table: String, columnName: String,
lowerBound: Long, upperBound: Long, numPartitions: Int,
connectionProperties: Properties)
It requires more effort to use but typically exhibits much better performance because reads (and as a result data) are distributed between worker nodes.
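To give an intuition for why this overload distributes the read, here is a simplified sketch of how a numeric range can be split into one WHERE clause per partition, so that each worker fetches only its own slice. This is an illustration of the idea only, not Spark's actual implementation, which may differ in details such as rounding and NULL handling; `RangePartitions` and `predicates` are made-up names.

```scala
object RangePartitions {
  // Splits the range [lowerBound, upperBound) on `column` into numPartitions
  // contiguous slices. The first and last slices are left open-ended, so the
  // bounds only decide where the cuts fall and do not filter out any rows.
  // Rows where `column` is NULL are not covered by this simplified sketch.
  def predicates(column: String, lowerBound: Long, upperBound: Long, numPartitions: Int): List[String] = {
    require(numPartitions >= 1 && lowerBound < upperBound)
    val stride = math.max(1L, (upperBound - lowerBound) / numPartitions)
    (0 until numPartitions).toList.map { i =>
      val lo = lowerBound + i * stride
      val hi = lo + stride
      if (numPartitions == 1) "1 = 1"
      else if (i == 0) s"$column < $hi"
      else if (i == numPartitions - 1) s"$column >= $lo"
      else s"$column >= $lo AND $column < $hi"
    }
  }

  def main(args: Array[String]): Unit =
    predicates("id", 0L, 100L, 4).foreach(println)
    // prints:
    // id < 25
    // id >= 25 AND id < 50
    // id >= 50 AND id < 75
    // id >= 75
}
```

Choosing a column with evenly distributed values (for example an auto-increment key) keeps the slices similar in size, which is what makes the parallel read pay off.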
I have the hard task of reading millions of rows from a Cassandra table. This table contains around 40 to 50 million rows.
The data is actually internal URLs for our system and we need to fire all of them. To fire them, we are using Akka Streams, and it has been working pretty well, applying back pressure as needed. But we still have not found a way to read everything effectively.
What we have tried so far:
Reading the data as a stream using Akka Streams. We are using phantom-dsl, which provides a publisher for a specific table. But it does not read everything, only a small portion: it stops reading after the first 1 million rows.
Reading using Spark by a specific date. Our table is modeled like a time series table, with year, month, day, minutes... columns. Right now we are selecting by day, so Spark does not fetch too much to process at once, but selecting all those days one by one is a pain.
The code is the following:
val cassandraRdd =
  sc
    .cassandraTable("keyspace", "my_table")
    .select("id", "url")
    .where("year = ? and month = ? and day = ?", date.getYear, date.getMonthOfYear, date.getDayOfMonth)
Unfortunately, I can't iterate over the partitions to get less data at a time; I have to use collect, because otherwise Spark complains that the actor is not serializable.
val httpPool: Flow[(HttpRequest, String), (Try[HttpResponse], String), HostConnectionPool] = Http().cachedHostConnectionPool[String](host, port).async
val source =
  Source
    .actorRef[CassandraRow](10000000, OverflowStrategy.fail)
    .map(row => makeUrl(row.getString("id"), row.getString("url")))
    .map(url => HttpRequest(uri = url) -> url)
val ref = Flow[(HttpRequest, String)]
  .via(httpPool.withAttributes(ActorAttributes.supervisionStrategy(decider)))
  .to(Sink.actorRef(httpHandlerActor, IsDone))
  .runWith(source)
cassandraRdd.collect().foreach { row =>
  ref ! row
}
I would like to know if any of you have experience with reading millions of rows in order to do something other than aggregation and the like.
I have also thought about reading everything and sending it to a Kafka topic, which I would then consume using streaming (Spark or Akka), but the problem would be the same: how do I load all that data effectively?
EDIT
For now, I'm running on a cluster with a reasonable amount of memory (100 GB), doing a collect and iterating over the result.
Also, this is very different from loading big data with Spark and analyzing it using things like reduceByKey, aggregateByKey, etc.
I need to fetch and send everything over HTTP =/
So far it is working the way I did it, but I'm afraid this data will get bigger and bigger, to the point where fetching everything into memory makes no sense.
Streaming this data, fetching it in chunks, would be the best solution, but I haven't found a good approach for this yet.
In the end, I'm thinking of using Spark to get all that data, generating a CSV file and using Akka Stream IO to process it. This way I would avoid keeping a lot of things in memory, since it takes hours to process every million rows.
Well, after spending some time reading, talking with other people and doing tests, the result could be achieved with the following code sample:
val sc = new SparkContext(sparkConf)
val cassandraRdd = sc.cassandraTable(config.getString("myKeyspace"), "myTable")
  .select("key", "value")
  .as((key: String, value: String) => (key, value))
  .partitionBy(new HashPartitioner(2 * sc.defaultParallelism))
  .cache()
cassandraRdd
  .groupByKey()
  .foreachPartition { partition =>
    partition.foreach { row =>
      implicit val system = ActorSystem()
      implicit val materializer = ActorMaterializer()
      val myActor = system.actorOf(Props(new MyActor(system)), name = "my-actor")
      val source = Source.fromIterator { () => row._2.toIterator }
      source
        .map { str =>
          myActor ! Count
          str
        }
        .to(Sink.actorRef(myActor, Finish))
        .run()
    }
  }
sc.stop()

class MyActor(system: ActorSystem) extends Actor {
  var count = 0
  def receive = {
    case Count =>
      count = count + 1
    case Finish =>
      println(s"total: $count")
      system.shutdown()
  }
}

case object Count
case object Finish
What I'm doing is the following:
Try to achieve a good number of partitions and a suitable partitioner using the partitionBy and groupByKey methods.
Use cache to avoid repeated data shuffles, which would make Spark move large amounts of data across nodes with heavy I/O.
Create the whole actor system with its dependencies, as well as the stream, inside the foreachPartition method. There is a trade-off here: you can have only one ActorSystem, but then you have to resort to .collect, as I wrote in the question. By creating everything inside, however, you keep the ability to run things inside Spark, distributed across your cluster.
Finish each actor system at the end of the iterator by using Sink.actorRef with a message that shuts it down (Finish).
Perhaps this code could be improved further, but so far I'm happy not to use .collect anymore and to work only inside Spark.