I frequently want to parallelize a task that relies on a non-thread-safe shared resource. Consider the following non-thread-safe class; I want to map it over data: Vector[String].
class Processor { def apply: String => String }
Basically, I want to create n threads, each with an instance of Processor and a partition of the data. Scala parallel collections have spoiled me into thinking the parallelization should be dirt simple. However, they don't seem well suited for this problem. Yes, I can use actors but Scala actors might become deprecated and Akka seems like overkill.
The first thing that comes to mind is to have a synchronized map Thread -> Processor and then use parallel collections, looking up my Processor in this thread-safe map. Is there a better way?
Instead of building your own synchronized map, you can use ThreadLocal. That will guarantee a unique Processor per thread.
val processors = new ThreadLocal[Processor] {
override def initialValue() = new Processor
}
data.par.map(x => processors.get.apply(x))
Alternatively, you can try an executor service explicitly configured with a fixed number of threads:
import java.util.concurrent.{Executors, TimeUnit}
val processors = new ThreadLocal[Processor] {
override def initialValue() = new Processor
}
val N = 4
// create an executor with fixed number of threads
val execSvc = Executors.newFixedThreadPool(N)
// create the tasks
data foreach {
loopData =>
execSvc.submit(new Runnable() {
// run must return Unit to implement Runnable; the String result is discarded here
def run(): Unit = processors.get().apply(loopData)
})
}
// await termination
execSvc.shutdown()
while (!execSvc.awaitTermination(1, TimeUnit.SECONDS)) {}
// processing complete!
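The executor version above discards each result. If the mapped values are needed, one option is to submit Callable tasks and collect their futures in order. The following is a minimal sketch under stated assumptions: the Processor body here is a hypothetical stand-in (the question only gives its signature), and ThreadLocalDemo and processAll are illustrative names.

```scala
import java.util.concurrent.{Callable, Executors}

// Hypothetical stand-in for the non-thread-safe class from the question.
class Processor {
  private var calls = 0 // unsynchronized mutable state, unsafe to share across threads
  def apply(s: String): String = { calls += 1; s.toUpperCase }
}

object ThreadLocalDemo {
  // Each pool thread lazily gets its own Processor on first use.
  private val processors = new ThreadLocal[Processor] {
    override def initialValue(): Processor = new Processor
  }

  def processAll(data: Vector[String], nThreads: Int): Vector[String] = {
    val pool = Executors.newFixedThreadPool(nThreads)
    try {
      // Vector.map is strict, so every task is submitted before any get() blocks.
      val futures = data.map { item =>
        pool.submit(new Callable[String] {
          override def call(): String = processors.get().apply(item)
        })
      }
      futures.map(_.get()) // results come back in input order
    } finally {
      pool.shutdown()
    }
  }
}
```

With this shape, at most nThreads Processor instances are created, one per pool thread, and no instance is ever touched by two threads.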
Given the following case class:
case class User(name:String, age:Int)
An RDD is created from a List of instances of Users
The following code filters the RDD to keep only users over the age of 50:
trait Process {
def test {
val rdd = ... // create RDD
rdd.filter(_.age>50)
}
}
In order to add logging, a separate validate function is created and passed to the filter, as follows:
trait Process {
def validate(user: User): Boolean = {
if (user.age>50) {
true
}
else {
println("FAILED VALIDATION")
false
}
}
def test {
val rdd = ... // create RDD
rdd.filter(validate)
}
}
The following exception is thrown:
org.apache.spark.SparkException: Task not serializable
The code works by making the class in which the validate function is defined serializable:
trait Process extends Serializable
Is this the correct way to handle the Task not serializable exception, or is there a performance degradation to using serialization within Spark? Are there any better ways to do this?
Thanks
is there a performance degradation to using serialization within Spark
Task serialization (as opposed to data serialization, that occurs when shuffling / collecting data) is rarely noticeable performance-wise, as long as the serialized objects are small. Task serialization occurs once per task (regardless of the amount of data processed).
In this case (serializing the Process instance), the performance impact would probably be negligible since it's a small object.
The risk with this assumption ("Process is small, so it's OK") is that over time, Process might change: it would be easy for developers not to notice that this class gets serialized, so they might add members that would make this slower.
Are there any better ways to do this
You can avoid serialization completely by using static methods - methods of objects instead of classes. In this case, you can create a companion object for Process:
import Process._
trait Process {
def test {
val rdd = ... // create RDD
rdd.filter(validate)
}
}
object Process {
def validate(user: User): Boolean = {
if (user.age>50) {
true
} else {
println("FAILED VALIDATION")
false
}
}
}
Objects are "static", so Spark can use them without serialization.
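The difference can be seen without Spark by trying plain Java serialization on the two kinds of function. This is only a rough analogy, assuming the closure is shipped with Java serialization; InstanceValidator, StaticValidator and ClosureCheck are hypothetical names, not part of Spark.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

case class User(name: String, age: Int)

// Instance method: a function that calls it has to capture `this`.
class InstanceValidator {
  def validate(user: User): Boolean = user.age > 50
  def predicate: User => Boolean = u => validate(u)
}

// Method on a singleton object: the function reaches it statically and captures nothing.
object StaticValidator {
  def validate(user: User): Boolean = user.age > 50
}

object ClosureCheck {
  // True if Java serialization accepts the object graph rooted at f.
  def isSerializable(f: AnyRef): Boolean =
    try {
      new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(f)
      true
    } catch {
      case _: NotSerializableException => false
    }
}
```

Here ClosureCheck.isSerializable(new InstanceValidator().predicate) should be false because the captured InstanceValidator is not Serializable, while a function such as (u: User) => StaticValidator.validate(u) serializes with nothing attached to it.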
I have a simple method that writes a line of data to a File followed by a new line that is executed asynchronously.
def writeToFile(bw: BufferedWriter, str: String) = {
bw.write(str)
bw.newLine
}
When my program runs I'm getting "mixed up" rows in the file due to the async nature of the calls. For instance...say writeToFile(bw, "foo") is executed 3 times asynchronously I may get:
correct output
foo
foo
foo
possible incorrect output
foofoo
foo
I'm able to avoid this possibility by using synchronized method like this:
def writeToFile(bw: BufferedWriter, str: String) = synchronized {
bw.write(str)
bw.newLine
}
From what I have researched, I can't determine how "safe" this is with regard to scaling my application. The only examples I can find that use synchronized involve accessing collections, not writing to a file. My application is built on Play! Framework 2.4.2.
I personally would create an Akka actor for each BufferedWriter that encapsulates it completely.
import java.io.BufferedWriter
import akka.actor._
import playground.BufferedWriterActor.WriteToBuffer
object BufferedWriterActor {
val name = "BufferedWriterActor"
def props(bw: BufferedWriter) = Props(classOf[BufferedWriterActor], bw)
case class WriteToBuffer(str: String)
}
class BufferedWriterActor(bw: BufferedWriter) extends Actor {
def receive: Actor.Receive = {
case WriteToBuffer(str) =>
bw.write(str)
bw.newLine()
}
}
Use it like this:
import akka.actor.{ActorSystem, Props}
object HelloWorld {
def main(args: Array[String]): Unit = {
val system = ActorSystem("mySystem")
// Share this actor across all your threads.
val myActor = system.actorOf(BufferedWriterActor.props(bw), BufferedWriterActor.name)
// Send messages to this actor from all your threads.
myActor ! BufferedWriterActor.WriteToBuffer("The Text")
}
}
This funnels all writes to the buffer through one actor, which processes its messages one at a time, so lines can no longer interleave.
More info on akka and its actors is here:
http://akka.io/
http://doc.akka.io/docs/akka/snapshot/scala/actors.html
Also play framework itself uses akka so you should be able to use its default ActorSystem, but I do not remember how exactly, sorry.
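If pulling in an actor just for this feels heavy, the same idea of funnelling every write through one worker can be sketched with a single-thread executor from java.util.concurrent. This is an assumption-laden sketch, not the approach from the answer above: SerializedWriter is a hypothetical helper name, and error handling is omitted.

```scala
import java.io.BufferedWriter
import java.util.concurrent.{Executors, TimeUnit}

// Hypothetical helper: every write is queued and executed by one worker thread,
// so a string and its newline are always written back to back.
final class SerializedWriter(bw: BufferedWriter) {
  private val worker = Executors.newSingleThreadExecutor()

  // Returns immediately; the write happens later on the worker thread.
  def writeLine(str: String): Unit =
    worker.execute(new Runnable {
      override def run(): Unit = {
        bw.write(str)
        bw.newLine()
      }
    })

  // Stops accepting writes, waits for queued ones, then flushes and closes.
  def close(): Unit = {
    worker.shutdown()
    worker.awaitTermination(10, TimeUnit.SECONDS)
    bw.close()
  }
}
```

Callers never block on file I/O, unlike the synchronized version, but they also do not learn when (or whether) a given line reached the file.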
So I'm trying to work with both Squeryl and Akka Actors. I've done a lot of searching and all I've been able to find is the following Google Group post:
https://groups.google.com/forum/#!topic/squeryl/M0iftMlYfpQ
I think I might have shot myself in the foot as I originally created this factory pattern so I could toss around Database objects.
object DatabaseType extends Enumeration {
type DatabaseType = Value
val Postgres = Value(1,"Postgres")
val H2 = Value(2,"H2")
}
object Database {
def getInstance(dbType : DatabaseType, jdbcUrl : String, username : String, password : String) : Database = {
Class.forName(jdbcDriver(dbType))
new Database(Session.create(
_root_.java.sql.DriverManager.getConnection(jdbcUrl,username,password),
squerylAdapter(dbType)))
}
private def jdbcDriver(db : DatabaseType) = {
db match {
case DatabaseType.Postgres => "org.postgresql.Driver"
case DatabaseType.H2 => "org.h2.Driver"
}
}
private def squerylAdapter(db : DatabaseType) = {
db match {
case DatabaseType.Postgres => new PostgreSqlAdapter
case DatabaseType.H2 => new H2Adapter
}
}
}
Originally in my implementation, I tried surrounding all my statements in using(session), but I'd keep getting the dreaded "No session is bound to the current thread" error, so I added session.bindToCurrentThread to the constructor.
class Database(session: Session) {
session.bindToCurrentThread
def failedBatch(filename : String, message : String, start : Timestamp = now, end : Timestamp = now) =
batch.insert(new Batch(0,filename,Some(start),Some(end),ProvisioningStatus.Fail,Some(message)))
def startBatch(batch_id : Long, start : Timestamp = now) =
batch update (b => where (b.id === batch_id) set (b.start := Some(start)))
...more functions
This worked reasonably well, until I got to Akka actors.
class TransferActor() extends Actor {
def databaseInstance() = {
val dbConfig = config.getConfig("provisioning.database")
Database.getInstance(DatabaseType.Postgres,
dbConfig.getString("jdbcUrl"),
dbConfig.getString("username"),
dbConfig.getString("password"))
}
lazy val config = ConfigManager.current
override def receive: Actor.Receive = { /* .. do some work */
I constantly get the following:
[ERROR] [03/11/2014 17:02:57.720] [provisioning-system-akka.actor.default-dispatcher-4] [akka://provisioning-system/user/$c] No session is bound to current thread, a session must be created via Session.create
and bound to the thread via 'work' or 'bindToCurrentThread'
Usually this error occurs when a statement is executed outside of a transaction/inTrasaction block
java.lang.RuntimeException: No session is bound to current thread, a session must be created via Session.create
and bound to the thread via 'work' or 'bindToCurrentThread'
I'm getting a fresh Database object each time, not caching it with a lazy val, so shouldn't that constructor always get called and attach to my current thread? Does Akka attach different threads to different actors and swap them around? Should I just add a function to call session.bindToCurrentThread each time I'm in an actor? Seems kinda hacky.
Does Akka attach different threads to different actors and swap them around?
That's exactly how the actor model works. The idea is that you can have a small thread pool servicing a very large number of actors, because actors only need a thread when they have a message waiting to be processed.
Some general tips for Squeryl.... A session is a one to one association with a JDBC connection. The main advantage of keeping Sessions open is that you can have a transaction open that gives you a consistent view of the database as you perform multiple operations. If you don't need that, make your session/transaction code granular to avoid these types of issues. If you do need it, don't rely on Sessions being available in a thread local context. Use the transaction(session){} or transaction(sessionFactory){} methods to explicitly tell Squeryl where you want your Session to come from.
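To see why a thread-local binding breaks down once work hops between threads, here is a small sketch using a plain ThreadLocal. CurrentSession is a hypothetical stand-in for a thread-bound session registry and is not Squeryl's API; the String "session" is just a placeholder value.

```scala
import java.util.concurrent.{Callable, Executors}

// Hypothetical stand-in for a thread-bound session registry; not Squeryl's API.
object CurrentSession {
  private val bound = new ThreadLocal[String] // holds null until bound on this thread
  def bind(name: String): Unit = bound.set(name)
  def current: Option[String] = Option(bound.get())
}

object ThreadBindingDemo {
  // Binds on the calling thread, then looks the binding up from a pool thread.
  def lookupFromOtherThread(): Option[String] = {
    CurrentSession.bind("session-1")
    val pool = Executors.newSingleThreadExecutor()
    try {
      pool.submit(new Callable[Option[String]] {
        override def call(): Option[String] = CurrentSession.current
      }).get()
    } finally {
      pool.shutdown()
    }
  }
}
```

The lookup on the pool thread comes back empty even though the binding succeeded on the caller, which mirrors an actor being constructed on one dispatcher thread and handling a message on another.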
I have an Actor and some other object:
object Config {
val readValueFromConfig() = { //....}
}
class MyActor extends Actor {
val confValue = Config.readValueFromConfig()
val initValue = Future {
val a = confValue // sometimes it's null
val a = Config.readValueFromConfig() //always works well
}
//..........
}
The code above is a very simplified version of what I actually have. The odd thing is that sometimes val a = confValue returns null, whereas if I replace it with val a = Config.readValueFromConfig() then it always works well.
I wonder, is this due to the fact that the only way to interact with an actor is sending it a message? Therefore, since val confValue is not a local variable, I must either use val a = Config.readValueFromConfig() (a different object, not an actor) or val a = self ! GetConfigValue and read the result afterwards?
val readValueFromConfig() = { //....}
This gives me a compile error. I assume you mean without parentheses?
val readValueFromConfig = { //....}
Same logic with different timing gives different result = a race condition.
val confValue = Config.readValueFromConfig() is always executed during construction of MyActor objects (because it's a field of MyActor). Sometimes this is returning null.
val a = Config.readValueFromConfig() //always works well is always executed later - after MyActor is constructed, when the Future initValue is executed by its Executor. It seems this never returns null.
Possible causes:
Could be explained away if the body of readValueFromConfig was dependent upon another parallel/async operation having completed. Any chance you're reading the config asynchronously? Given the name of this method, it probably just reads synchronously from a file - meaning this is not the cause.
Singleton objects are not threadsafe?? I compiled your code. Here's the decompilation of your singleton object java class:
public final class Config
{
public static String readValueFromConfig()
{
return Config$.MODULE$.readValueFromConfig();
}
}
public final class Config$
{
public static final Config$ MODULE$;
private final String readValueFromConfig;
static
{
new Config$();
}
public String readValueFromConfig()
{
return this.readValueFromConfig;
}
private Config$()
{
MODULE$ = this;
this.readValueFromConfig = // ... your logic here;
}
}
Mmmkay... Unless I'm mistaken, that ain't thread-safe.
If two threads are accessing readValueFromConfig (say Thread1 accesses it first), then inside the private constructor Config$(), MODULE$ is unsafely published before this.readValueFromConfig is set (the reference to this prematurely escapes the constructor). Thread2, which is right behind, can read MODULE$.readValueFromConfig before it is set. This is highly likely to be a problem if '... your logic here' is slow and blocks the thread - which is precisely what synchronous I/O does.
Moral of story: avoid stateful singleton objects from Actors (or any Threads at all, including Executors) OR make them become thread-safe through very careful coding style. Work-Around: change to a def, which internally caches the value in a private val.
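One way to read that work-around is a def backed by a private lazy val, whose initialization Scala guards so that concurrent first callers all see the same fully computed value. This is a sketch under assumptions: load() is a hypothetical stand-in for the real config I/O, and the loads counter exists only to make the single initialization observable.

```scala
import java.util.concurrent.atomic.AtomicInteger

object Config {
  // Counts how often the (hypothetical) load actually runs.
  val loads = new AtomicInteger(0)

  // Hypothetical stand-in for slow, blocking config I/O.
  private def load(): String = {
    loads.incrementAndGet()
    Thread.sleep(50)
    "value"
  }

  // Initialized on first access; later and concurrent callers reuse the result.
  private lazy val cached: String = load()

  def readValueFromConfig(): String = cached
}
```

Callers keep using Config.readValueFromConfig() as before, but the slow work now happens on first use rather than inside the singleton's constructor.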
I wonder, is this due to the fact that the only way to interact with an actor is sending it a message? Therefore, since val confValue is not a local variable, I must either use val a = Config.readValueFromConfig() (a different object, not an actor)
Just because it's not an actor, doesn't mean it's necessarily safe. It probably isn't.
or val a = self ! GetConfigValue and read the result afterwards?
That's almost right. You mean self ? GetConfigValue, I think - that will return a Future, which you can then map over. ! doesn't return anything.
You cannot read from an actor's variables directly inside a Future because (in general) that Future could be running on any thread, on any processor core, and you don't have any memory barrier there to force the CPU caches to reload the value from main memory.
I realized that when I am trying to process items in a concurrent queue using multiple threads while multiple threads can be putting items into it, the ideal solution would be to use the Reactive Extensions with the Concurrent data structures.
My original question is at:
While using ConcurrentQueue, trying to dequeue while looping through in parallel
So I am curious if there is any way to have a LINQ (or PLINQ) query that will continuously be dequeueing as items are put into it.
I am trying to get this to work in a way where I can have n number of producers pushing into the queue and a limited number of threads to process, so I don't overload the database.
If I could use Rx framework then I expect that I could just start it, and if 100 items are placed in within 100ms, then the 20 threads that are part of the PLINQ query would just process through the queue.
There are three technologies I am trying to work together:
Rx Framework (Reactive LINQ)
PLINQ
System.Collections.Concurrent structures
Drew is right. I think the ConcurrentQueue, even though it sounds perfect for the job, is actually the underlying data structure that the BlockingCollection uses. Seems very back to front to me too.
Check out chapter 7 of this book*
http://www.amazon.co.uk/Parallel-Programming-Microsoft-NET-Decomposition/dp/0735651590/ref=sr_1_1?ie=UTF8&qid=1294319704&sr=8-1
and it will explain how to use the BlockingCollection and have multiple producers and multiple consumers each taking off the "queue". You will want to look at the "GetConsumingEnumerable()" method and possibly just call .ToObservable() on that.
*the rest of the book is pretty average.
edit:
Here is a sample program that I think does what you want?
class Program
{
private static ManualResetEvent _mre = new ManualResetEvent(false);
static void Main(string[] args)
{
var theQueue = new BlockingCollection<string>();
theQueue.GetConsumingEnumerable()
.ToObservable(Scheduler.TaskPool)
.Subscribe(x => ProcessNewValue(x, "Consumer 1", 10000000));
theQueue.GetConsumingEnumerable()
.ToObservable(Scheduler.TaskPool)
.Subscribe(x => ProcessNewValue(x, "Consumer 2", 50000000));
theQueue.GetConsumingEnumerable()
.ToObservable(Scheduler.TaskPool)
.Subscribe(x => ProcessNewValue(x, "Consumer 3", 30000000));
LoadQueue(theQueue, "Producer A");
LoadQueue(theQueue, "Producer B");
LoadQueue(theQueue, "Producer C");
_mre.Set();
Console.WriteLine("Processing now....");
Console.ReadLine();
}
private static void ProcessNewValue(string value, string consumerName, int delay)
{
Thread.SpinWait(delay);
Console.WriteLine("{1} consuming {0}", value, consumerName);
}
private static void LoadQueue(BlockingCollection<string> target, string prefix)
{
var thread = new Thread(() =>
{
_mre.WaitOne();
for (int i = 0; i < 100; i++)
{
target.Add(string.Format("{0} {1}", prefix, i));
}
});
thread.Start();
}
}
I don't know how best to accomplish this with Rx, but I would recommend just using BlockingCollection<T> and the producer-consumer pattern. Your main thread adds items into the collection, which uses ConcurrentQueue<T> underneath by default. Before that, you spin up a separate Task that uses Parallel.ForEach over the BlockingCollection<T> to process as many items from the collection concurrently as makes sense for the system. You will probably also want to look into the GetConsumingPartitioner method of the ParallelExtensions library to be most efficient, since the default partitioner will create more overhead than you want in this case. You can read more about this from this blog post.
When the main thread is finished, you call CompleteAdding on the BlockingCollection<T> and Task.Wait on the Task you spun up, to wait for all the consumers to finish processing all the items in the collection.
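For readers coming from the Scala questions earlier in this thread, the same producer-consumer shape can be sketched on the JVM. This is an analogy rather than the .NET API: LinkedBlockingQueue stands in for BlockingCollection<T>, a hypothetical sentinel value stands in for CompleteAdding, and ProducerConsumerDemo and process are illustrative names.

```scala
import java.util.concurrent.{ConcurrentLinkedQueue, Executors, LinkedBlockingQueue, TimeUnit}

object ProducerConsumerDemo {
  // Hypothetical sentinel standing in for CompleteAdding.
  private val Done = "\u0000DONE"

  // Returns how many items the consumers processed.
  def process(producers: Int, itemsEach: Int, consumers: Int): Int = {
    val queue = new LinkedBlockingQueue[String]()
    val processed = new ConcurrentLinkedQueue[String]()

    // A fixed number of consumers caps the concurrent load on the downstream system.
    val consumerPool = Executors.newFixedThreadPool(consumers)
    for (_ <- 1 to consumers) consumerPool.execute(new Runnable {
      override def run(): Unit = {
        var item = queue.take() // blocks until an item is available
        while (item != Done) {
          processed.add(item)
          item = queue.take()
        }
      }
    })

    val producerThreads = (1 to producers).map { p =>
      new Thread(new Runnable {
        override def run(): Unit =
          for (i <- 0 until itemsEach) queue.put(s"Producer $p $i")
      })
    }
    producerThreads.foreach(_.start())
    producerThreads.foreach(_.join())

    // One sentinel per consumer so that each of them terminates.
    for (_ <- 1 to consumers) queue.put(Done)
    consumerPool.shutdown()
    consumerPool.awaitTermination(30, TimeUnit.SECONDS)
    processed.size
  }
}
```

The consumer count plays the role of the limited number of processing threads from the question; producers can be added freely without changing it.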