Determine machine storing a Spark partition - apache-spark

How can I determine the hostname of the machine holding a particular partition of an RDD?
I realize Spark does not intend to expose this information to casual users, but I'm trying to interface Spark with another system, and knowing the physical locations of the partitions would allow for more efficient transfers.

You can call foreachPartition on the RDD and get the hostname inside the function, which runs on the executor processing that partition.
Something like this (in PySpark):
import socket
def f(iterator):
    log2file(socket.gethostname())
rdd.foreachPartition(f)
where log2file is some function of your own that logs to a file and socket.gethostname is the standard-library call that returns the hostname.
If you want to get the result back on the driver, you can use mapPartitions as follows:
def f(iterator): yield socket.gethostname()
rdd.mapPartitions(f).collect()
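The mapPartitions variant returns hostnames only, without saying which partition each one belongs to. A minimal sketch that pairs each partition index with the host that processed it, assuming the function is passed to PySpark's mapPartitionsWithIndex (tag_partition and locations are illustrative names, not from the answer above):

```python
import socket

def tag_partition(index, iterator):
    """Yield one (partition index, hostname) pair for the partition being processed.

    The iterator is deliberately left unconsumed: only the location of the
    task matters here, not the partition's contents.
    """
    yield (index, socket.gethostname())

# Assumed usage on a cluster (requires PySpark and an existing RDD named rdd):
#   locations = dict(rdd.mapPartitionsWithIndex(tag_partition).collect())

# Local check without Spark: call the function the way Spark would.
local_result = list(tag_partition(3, iter([1, 2, 3])))
```

The hostname reported is that of the executor that ran the task. For an RDD that is not cached, a later job may schedule the same partition on a different machine.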

I found a solution in another Stack Overflow question, "How to get ID of a map task in Spark?". This information is available in the TaskContext object, which you can use like so:
import org.apache.spark.TaskContext
sc.parallelize(1 to 10, 3).foreachPartition(_ => {
val ctx = TaskContext.get
val stageId = ctx.stageId
val partId = ctx.partitionId
val hostname = ctx.taskMetrics.hostname
println(s"Stage: $stageId, Partition: $partId, Host: $hostname")
})

Related

Method to get number of cores for a executor on a task node?

E.g. I need to get a list of all available executors and their respective multithreading capacity (NOT the total multithreading capacity; sc.defaultParallelism already handles that).
Since this parameter is implementation-dependent (YARN and Spark standalone have different strategies for allocating cores) and situational (it may fluctuate because of dynamic allocation and long-running jobs), I cannot use other methods to estimate it. Is there a way to retrieve this information using the Spark API in a distributed transformation (e.g. TaskContext, SparkEnv)?
UPDATE: As of Spark 1.6, I have tried the following methods:
1) Run a 1-stage job with many partitions (>> defaultParallelism) and count the number of distinct thread IDs for each executor ID:
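import org.apache.spark.SparkEnv
val n = sc.defaultParallelism * 16
// one element per partition; each task reports (executor ID, thread ID)
sc.parallelize(1 to n, n)
  .map(_ => SparkEnv.get.executorId -> Thread.currentThread().getId)
  .groupByKey()
  .mapValues(_.toSeq.distinct)
  .collect()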
This, however, leads to an estimate higher than the actual multithreading capacity, because each Spark executor uses an overprovisioned thread pool.
2) Similar to 1, except that n = defaultParallelism, and in every task I add a delay to prevent the resource negotiator from sharding unevenly (a fast node completes its task and asks for more before slow nodes can start running):
val n = sc.defaultParallelism
sc.parallelize(1 to n, n).map { _ =>
  // hold the task long enough for every slot to receive one
  Thread.sleep(5000)
  SparkEnv.get.executorId -> Thread.currentThread().getId
}
  .groupByKey()
  .mapValues(_.toSeq.distinct)
  .collect()
It works most of the time, but it is much slower than necessary and may be broken by a very imbalanced cluster or by task speculation.
3) I haven't tried this: use Java reflection to read BlockManager.numUsableCores. This is obviously not a stable solution, since the internal implementation may change at any time.
Please tell me if you have found something better.
It is pretty easy with the Spark REST API. You have to get the application id:
val applicationId = spark.sparkContext.applicationId
ui URL:
val baseUrl = spark.sparkContext.uiWebUrl
and query:
val url = baseUrl.map { url =>
s"${url}/api/v1/applications/${applicationId}/executors"
}
With the Apache HTTP library (already among Spark's dependencies; adapted from https://alvinalexander.com/scala/scala-rest-client-apache-httpclient-restful-clients):
import org.apache.http.impl.client.DefaultHttpClient
import org.apache.http.client.methods.HttpGet
import scala.util.Try
val client = new DefaultHttpClient()
val response = url
.flatMap(url => Try{client.execute(new HttpGet(url))}.toOption)
.flatMap(response => Try{
val s = response.getEntity().getContent()
val json = scala.io.Source.fromInputStream(s).getLines.mkString
s.close
json
}.toOption)
and json4s:
import org.json4s._
import org.json4s.jackson.JsonMethods._
implicit val formats = DefaultFormats
case class ExecutorInfo(hostPort: String, totalCores: Int)
val executors: Option[List[ExecutorInfo]] = response.flatMap(json => Try {
parse(json).extract[List[ExecutorInfo]]
}.toOption)
As long as you keep the application id and the UI URL at hand and open the UI port to external connections, you can do the same thing from any task.
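The parsing step can be sketched on its own, in Python, assuming the response body of the executors endpoint has already been fetched as a string. The sample payload is made up and contains only the fields used here; the "id" field and the convention that the driver is listed with id "driver" are assumptions, and cores_per_executor is a hypothetical helper name.

```python
import json

def cores_per_executor(payload, include_driver=False):
    """Map each executor's hostPort to its totalCores from an executors-endpoint JSON body."""
    cores = {}
    for entry in json.loads(payload):
        # Assumption: the driver is listed with id "driver"; skip it unless requested.
        if entry.get("id") == "driver" and not include_driver:
            continue
        cores[entry["hostPort"]] = entry["totalCores"]
    return cores

# Made-up sample body with only the fields used here.
sample = json.dumps([
    {"id": "driver", "hostPort": "10.0.0.1:40000", "totalCores": 0},
    {"id": "1", "hostPort": "10.0.0.2:40001", "totalCores": 4},
    {"id": "2", "hostPort": "10.0.0.3:40002", "totalCores": 8},
])
executor_cores = cores_per_executor(sample)
```

Keying the result by hostPort rather than by host keeps executors that share a machine apart.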
I would try to implement a SparkListener, in a way similar to what the web UI does. This code might be helpful as an example.

How to effectively read millions of rows from Cassandra?

I have a hard task: reading millions of rows from a Cassandra table. The table contains about 40 to 50 million rows.
The data is internal URLs for our system, and we need to fire all of them. To fire them we are using Akka Streams, and it has been working pretty well, applying back pressure as needed. But we still have not found a way to read everything effectively.
What we have tried so far:
Reading the data as a stream using Akka Streams. We are using phantom-dsl, which provides a publisher for a specific table. But it does not read everything, only a small portion: it stops reading after the first 1 million rows.
Reading using Spark by a specific date. Our table is modeled like a time-series table, with year, month, day, minutes... columns. Right now we are selecting by day, so Spark does not fetch too much to process at once, but selecting all those days is a pain.
The code is the following:
val cassandraRdd =
sc
.cassandraTable("keyspace", "my_table")
.select("id", "url")
.where("year = ? and month = ? and day = ?", date.getYear, date.getMonthOfYear, date.getDayOfMonth)
Unfortunately I can't iterate over the partitions to get less data; I have to use collect, because otherwise it complains that the actor is not serializable.
val httpPool: Flow[(HttpRequest, String), (Try[HttpResponse], String), HostConnectionPool] = Http().cachedHostConnectionPool[String](host, port).async
val source =
Source
.actorRef[CassandraRow](10000000, OverflowStrategy.fail)
.map(row => makeUrl(row.getString("id"), row.getString("url")))
.map(url => HttpRequest(uri = url) -> url)
val ref = Flow[(HttpRequest, String)]
.via(httpPool.withAttributes(ActorAttributes.supervisionStrategy(decider)))
.to(Sink.actorRef(httpHandlerActor, IsDone))
.runWith(source)
cassandraRdd.collect().foreach { row =>
ref ! row
}
I would like to know if any of you has experience reading millions of rows for something other than aggregation and the like.
I have also thought about reading everything and sending it to a Kafka topic, which I would consume using streaming (Spark or Akka), but the problem would be the same: how to load all that data effectively?
EDIT
For now, I'm running on a cluster with a reasonable amount of memory (100 GB), doing a collect and iterating over the result.
Also, this is very different from loading big data with Spark and analyzing it using things like reduceByKey, aggregateByKey, and so on.
I need to fetch and send everything over HTTP =/
So far it is working the way I did it, but I'm afraid this data will get bigger and bigger, to the point where fetching everything into memory makes no sense.
Streaming this data, fetching it in chunks, would be the best solution, but I haven't found a good approach for this yet.
In the end, I'm thinking of using Spark to get all that data, generate a CSV file, and use Akka Stream IO to process it. This way I would avoid keeping a lot of things in memory, since it takes hours to process every million.
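The "fetching in chunks" idea can be sketched independently of Spark or Cassandra: consume any row iterator lazily in fixed-size batches, so that only one batch is held in memory at a time. This is an illustration in Python, not the code from this thread; batched, fake_rows and the batch size are hypothetical. With Spark, an iterator such as the one returned by RDD.toLocalIterator could play the role of the row source, which is an assumption about how it would be wired in.

```python
from itertools import islice

def batched(rows, batch_size):
    """Yield lists of at most batch_size items, pulling from rows lazily."""
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

# Hypothetical stand-in for a row source; a generator, so nothing is materialized up front.
fake_rows = (f"/internal/url/{i}" for i in range(10))

sent = []
for batch in batched(fake_rows, 4):
    # In the real pipeline each batch would be handed to the HTTP stream here.
    sent.append(len(batch))
```

Because the source is only advanced when the next batch is requested, a slow consumer naturally throttles the read.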
Well, after spending some time reading, talking with other people, and running tests, the result can be achieved with the following code sample:
val sc = new SparkContext(sparkConf)
val cassandraRdd = sc.cassandraTable(config.getString("myKeyspace"), "myTable")
.select("key", "value")
.as((key: String, value: String) => (key, value))
.partitionBy(new HashPartitioner(2 * sc.defaultParallelism))
.cache()
cassandraRdd
.groupByKey()
.foreachPartition { partition =>
partition.foreach { row =>
implicit val system = ActorSystem()
implicit val materializer = ActorMaterializer()
val myActor = system.actorOf(Props(new MyActor(system)), name = "my-actor")
val source = Source.fromIterator { () => row._2.toIterator }
source
.map { str =>
myActor ! Count
str
}
.to(Sink.actorRef(myActor, Finish))
.run()
}
}
sc.stop()
class MyActor(system: ActorSystem) extends Actor {
var count = 0
def receive = {
case Count =>
count = count + 1
case Finish =>
println(s"total: $count")
system.shutdown()
}
}
case object Count
case object Finish
What I'm doing is the following:
Try to achieve a good number of partitions and a partitioner using the partitionBy and groupByKey methods.
Use cache to avoid repeated data shuffles, which make Spark move large amounts of data across nodes and use a lot of I/O.
Create the whole actor system with its dependencies, as well as the stream, inside the foreachPartition method. There is a trade-off here: you can have only one ActorSystem, but then you have to make bad use of .collect, as I wrote in the question. By creating everything inside, however, you still have the ability to run things inside Spark, distributed across your cluster.
Finish each actor system at the end of the iterator, using Sink.actorRef with a message to kill it (Finish).
Perhaps this code could be improved further, but so far I'm happy not to use .collect anymore and to work only inside Spark.

How to send transformed data from partitions to S3?

I have an RDD which is too big to collect. I have applied a chain of transformations to the RDD and want to send its transformed data directly from its partitions on my slaves to S3. I am currently operating as follows:
val rdd:RDD = initializeRDD
val rdd2 = rdd.transform
rdd2.first // in order to force calculation of RDD
rdd2.foreachPartition sendDataToS3
Unfortunately, the data that gets sent to S3 is untransformed. The RDD looks exactly like it did in stage initializeRDD.
Here is the body of sendDataToS3:
implicit class WriteableRDD[T](rdd:RDD[T]){
def transform:RDD[String] = rdd map {_.toString}
....
def sendPartitionsToS3(prefix:String) = {
rdd.foreachPartition { p =>
val filename = prefix+new scala.util.Random().nextInt(1000000)
val pw = new PrintWriter(new File(filename))
p foreach pw.println
pw.close
s3.putObject(S3_BUCKET, filename, new File(filename))
}
this
}
}
This is called with rdd.transform.sendPartitionsToS3(prefix).
How do I make sure the data that gets sent in sendDataToS3 is the transformed data?
My guess is there is a bug in your code that is not included in the question.
I'm answering anyway just to make sure you are aware of RDD.saveAsTextFile. You can give it a path on S3 (s3n://bucket/directory) and it will write each partition into that path directly from the executors.
I can hardly imagine when you would need to implement your own sendPartitionsToS3 instead of using saveAsTextFile.

NotSerializableException: org.apache.hadoop.io.LongWritable

I know this question has been answered many times, but I have tried everything and have not found a solution. I have the following code, which raises a NotSerializableException:
val ids : Seq[Long] = ...
ids.foreach{ id =>
sc.sequenceFile("file", classOf[LongWritable], classOf[MyWritable]).lookup(new LongWritable(id))
}
With the following exception
Caused by: java.io.NotSerializableException: org.apache.hadoop.io.LongWritable
Serialization stack:
...
org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:84)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:301)
When creating the SparkContext, I do
val sparkConfig = new SparkConf().setAppName("...").setMaster("...")
sparkConfig.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
sparkConfig.registerKryoClasses(Array(classOf[BitString[_]], classOf[MinimalBitString], classOf[org.apache.hadoop.io.LongWritable]))
sparkConfig.set("spark.kryoserializer.classesToRegister", "org.apache.hadoop.io.LongWritable,org.apache.hadoop.io.Text,org.apache.hadoop.io.LongWritable")
and looking at the environment tab, I can see these entries. However, I do not understand why
the Kryo serializer does not seem to be used (the stack does not mention Kryo)
LongWritable is not serialized.
I'm using Apache Spark v. 1.5.1
Loading the same data repeatedly inside a loop is extremely inefficient. If you perform several actions against the same data, load it once and cache it:
val rdd = sc
.sequenceFile("file", classOf[LongWritable], classOf[MyWritable])
rdd.cache()
Spark doesn't consider Hadoop Writables to be serializable. There is an open JIRA (SPARK-2421) for this. To handle LongWritables, a simple get should be enough:
rdd.map{case (k, v) => k.get()}
Regarding your custom class, it is your responsibility to deal with this problem.
Effective lookup requires a partitioned RDD. Otherwise it has to search every partition in your RDD.
import org.apache.spark.HashPartitioner
val numPartitions: Int = ???
val partitioned = rdd.partitionBy(new HashPartitioner(numPartitions))
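Why a partitioner helps lookup can be shown with a small simulation in Python. Plain lists stand in for RDD partitions, and the rule "hash of the key modulo the number of partitions" is the usual hash-partitioning idea rather than Spark's exact implementation; all function names are illustrative.

```python
def partition_index(key, num_partitions):
    """Non-negative modulo of the key's hash, the usual hash-partitioning rule."""
    return hash(key) % num_partitions

def build_partitions(pairs, num_partitions):
    """Distribute (key, value) pairs into num_partitions lists by key hash."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[partition_index(key, num_partitions)].append((key, value))
    return partitions

def lookup(partitions, key):
    """Scan only the partition the key hashes to; returns (values, rows scanned)."""
    candidate = partitions[partition_index(key, len(partitions))]
    return [v for k, v in candidate if k == key], len(candidate)

pairs = [(i, f"value-{i}") for i in range(1000)]
partitions = build_partitions(pairs, 8)
values, scanned = lookup(partitions, 42)
```

With 1000 integer keys spread over 8 partitions, the lookup scans 125 rows instead of 1000. The scan inside the candidate partition is still linear.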
Generally speaking, RDDs are not designed for random access. Even with a defined partitioner, lookup has to linearly search the candidate partition. With 5000 uniformly distributed keys and 10M objects in an RDD, it most likely means a repeated search over the whole RDD. You have a few options to avoid that:
filter
val idsSet = sc.broadcast(ids.toSet)
rdd.filter{case (k, v) => idsSet.value.contains(k)}
join
val idsRdd = sc.parallelize(ids).map((_, null))
idsRdd.join(rdd).map{case (k, (_, v)) => (k, v)}
IndexedRDD - it doesn't look like a particularly active project, though
With 10M entries you'll probably be better off searching locally in memory than using Spark. For larger data you should consider using a proper key-value store.
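The difference between the filter option and a loop of lookups can be sketched locally in Python. A list of pairs stands in for the RDD and a plain set for the broadcast variable; filter_by_ids and repeated_lookup are hypothetical names for illustration.

```python
def filter_by_ids(pairs, wanted_ids):
    """Single pass over the data, keeping pairs whose key is in wanted_ids."""
    wanted = set(wanted_ids)  # plays the role of the broadcast set
    return [(k, v) for k, v in pairs if k in wanted]

def repeated_lookup(pairs, wanted_ids):
    """One full scan per id, as a loop of lookups on an unpartitioned dataset would do."""
    found = []
    for wanted in wanted_ids:
        found.extend((k, v) for k, v in pairs if k == wanted)
    return found

pairs = [(i, i * i) for i in range(10_000)]
ids = [3, 500, 9_999]
single_pass = filter_by_ids(pairs, ids)
per_id = repeated_lookup(pairs, ids)
```

Both return the same pairs, but the first reads the data once regardless of how many ids are requested, while the second reads it once per id.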
I'm new to Apache Spark but tried to solve your problem; please evaluate whether it helps. The serialization problem occurs because, for Spark, Hadoop's LongWritable and other Writables are not serializable.
val temp_rdd = sc.parallelize(ids.map(id =>
sc.sequenceFile("file", classOf[LongWritable], classOf[LongWritable]).toArray.toSeq
)).flatMap(identity)
ids.foreach(id =>temp_rdd.lookup(new LongWritable(id)))
Try this solution. It worked fine for me.
SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("SparkMapReduceApp");
conf.registerKryoClasses(new Class<?>[]{
LongWritable.class,
Text.class
});

apache spark running task on each rdd

I have an RDD which is distributed across multiple machines in a Spark environment. I would like to execute a function on each worker machine on this RDD.
I do not want to collect the RDD and then execute a function on the driver. The function should be executed separately on each executor, on its own part of the RDD.
How can I do that?
Update (adding code)
I am running all this in spark shell
import org.apache.spark.sql.cassandra.CassandraSQLContext
import java.util.Properties
val cc = new CassandraSQLContext(sc)
val rdd = cc.sql("select * from sams.events where appname = 'test'");
val df = rdd.select("appname", "assetname");
Here I have a df with 400 rows. I need to save this df to a SQL Server table. When I try to use the df.write method it gives me errors, which I have posted in a separate thread:
spark dataframe not appending to the table
I can open a DriverManager connection and insert rows, but that will be done in the driver module of Spark:
import java.sql._
import com.microsoft.sqlserver.jdbc.SQLServerDriver
// open the connection
String connectionUrl = "jdbc:sqlserver://localhost:1433;" +
"databaseName=AdventureWorks;user=MyUserName;password=*****;";
Connection con = DriverManager.getConnection(connectionUrl);
// create a Statement from the connection
Statement statement = con.createStatement();
// insert the data
statement.executeUpdate("INSERT INTO Customers " + "VALUES (1001, 'Simpson', 'Mr.', 'Springfield', 2001)");
I need to do this writing in the executor machine. How can I achieve this?
In order to set up connections from workers to other systems, we should use rdd.foreachPartition(iter => ...)
foreachPartition lets you execute an operation for each partition, giving you access to the data of the partition as a local iterator.
With enough data per partition, the time spent setting up resources (like db connections) is amortized by using those resources over a whole partition.
An abstract example:
rdd.foreachPartition { iter =>
  // set up one db connection per partition
  val dbconn = Driver.connect(ip, port)
  iter.foreach { element =>
    val query = makeQuery(element)
    dbconn.execute(query)
  }
  dbconn.close
}
It's also possible to create singleton resource managers that manage those resources for each JVM of the cluster. See also this answer for a complete example of such a local resource manager: spark-streaming and connection pool implementation
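The singleton idea is that the resource is created lazily on first use and then shared by every task running in the same process. A minimal Python sketch of that pattern follows; ConnectionHolder and make_connection are hypothetical, and a plain object stands in for a real connection or pool.

```python
import threading

class ConnectionHolder:
    """Lazily creates one shared resource per process and reuses it afterwards."""
    _lock = threading.Lock()
    _instance = None
    created = 0  # counts how many times the factory actually ran

    @classmethod
    def get(cls, factory):
        # The lock keeps concurrent tasks from creating the resource twice.
        with cls._lock:
            if cls._instance is None:
                cls._instance = factory()
                cls.created += 1
            return cls._instance

def make_connection():
    # Hypothetical stand-in for opening a real database connection or pool.
    return object()

first = ConnectionHolder.get(make_connection)
second = ConnectionHolder.get(make_connection)
```

Both calls return the same object and the factory runs only once, so the setup cost is paid once per process instead of once per partition.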
