Spark RDD recursive operations on simple collection - apache-spark

I have user information in an RDD:
(Id:10, Name:bla, Address:50, ...)
And I have another collection containing the successive identity changes we gathered for each user.
(lastId, newId)
(10, 43)
(85, 90)
(43, 50)
I need to get the last identity for each user's id. In this example:
getFinalIdentity(10) = 50 (10 -> 43 -> 50)
For a while I used a broadcast variable containing these identities and iterated over the collection to get the final ID.
Everything was working fine until the reference data became too big to fit in a broadcast variable ...
I came up with a solution using an RDD to store the identities and iterating over it repeatedly (roughly as sketched below), but it is not very fast and looks very complex to me.
Is there an elegant and fast way to do this?
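To make the current approach concrete, here is a simplified sketch (not my exact code) of the iterative RDD solution: follow one hop of (lastId, newId) per pass until no id has an outgoing change left.
import org.apache.spark.rdd.RDD

// ids: the users' current ids; changes: the (lastId, newId) pairs
def resolveFinalIds(ids: RDD[Long], changes: RDD[(Long, Long)]): RDD[(Long, Long)] = {
  // (originalId, currentId), starting with currentId == originalId
  var resolved = ids.map(id => (id, id))
  // count how many ids still have an outgoing change to follow
  var remaining = resolved.map(_.swap).join(changes).count()
  while (remaining > 0) {
    resolved = resolved
      .map { case (orig, cur) => (cur, orig) }  // key by the current id
      .leftOuterJoin(changes)                   // follow one more hop if one exists
      .map { case (cur, (orig, maybeNext)) => (orig, maybeNext.getOrElse(cur)) }
      .cache()
    remaining = resolved.map(_.swap).join(changes).count()
  }
  resolved // (originalId, finalId)
}
Every pass shuffles the whole data set and runs an extra counting job, which is why I am looking for something better.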

Have you thought about graphs?
You could create a graph from the list of edges (lastId, newId). In that graph, the nodes with no outgoing edges hold the final id for the chains starting at nodes with no incoming edges.
It could be done in Spark with GraphX.
Below is an example. For each id it shows the first id in its chain, i.e. for the chain of changes (1 -> 2 -> 3) the result will be (1, 1), (2, 1), (3, 1).
import org.apache.spark.graphx.{EdgeDirection, EdgeTriplet, Graph, VertexId}
import org.apache.spark.{SparkConf, SparkContext}

object Main {
  val conf = new SparkConf().setAppName("myapp").setMaster("local[*]")
  val sc = new SparkContext(conf)

  def main(args: Array[String]): Unit = {
    sc.setLogLevel("ERROR")

    // RDD of pairs (oldId, newId)
    val changedIds = sc.parallelize(Seq((1L, 2L), (2L, 3L), (3L, 4L), (10L, 20L), (20L, 31L), (30L, 40L), (100L, 200L), (200L, 300L)))

    // Case classes for the Pregel operation
    case class Value(originId: VertexId) // vertex value
    case class Message(value: VertexId)  // message sent from one vertex to another

    // Create the graph from the id pairs
    val graph = Graph.fromEdgeTuples(changedIds, Value(0))

    // Initial message sent to all vertices at the start
    val initialMsg = Message(0)

    // How a vertex processes a received message
    def onMsgReceive(vertexId: VertexId, value: Value, msg: Message): Value = {
      // The initial message has value 0; in that case the vertex initializes its value to its own id
      if (msg.value == 0) Value(vertexId)
      // Otherwise the received value is the origin id
      else Value(msg.value)
    }

    // How vertices send messages
    def sendMsg(triplet: EdgeTriplet[Value, Int]): Iterator[(VertexId, Message)] = {
      // For each triplet a single message is sent to the destination vertex;
      // its payload is the source vertex's origin id
      Iterator((triplet.dstId, Message(triplet.srcAttr.originId)))
    }

    // How incoming messages to one vertex are merged
    def mergeMsg(msg1: Message, msg2: Message): Message = {
      // For this use case receiving two messages is really an error,
      // because one id can't have two different origin ids
      msg2 // Just return either of the incoming messages
    }

    // Kick off the Pregel computation
    val res = graph
      .pregel(initialMsg, Int.MaxValue, EdgeDirection.Out)(onMsgReceive, sendMsg, mergeMsg)

    // Print results
    res.vertices.collect().foreach(println)
  }
}
Output: pairs of (id, Value(firstIdInChain)):
(100,Value(100))
(4,Value(1))
(300,Value(100))
(200,Value(100))
(40,Value(30))
(20,Value(10))
(1,Value(1))
(30,Value(30))
(10,Value(10))
(2,Value(1))
(3,Value(1))
(31,Value(10))
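The example above propagates the first id in each chain; the question asks for the last one. One option (an untested sketch, reusing the same Value, Message, onMsgReceive, sendMsg and mergeMsg definitions) is to run the same Pregel computation on the reversed graph, so the final id flows back to every vertex:
// Reversing the edges makes the last id in each chain the one with no incoming edges,
// so it keeps its own id and propagates it along the (now reversed) outgoing edges.
val finalIds = graph
  .reverse
  .pregel(initialMsg, Int.MaxValue, EdgeDirection.Out)(onMsgReceive, sendMsg, mergeMsg)
finalIds.vertices.collect().foreach(println)
// e.g. for the chain 1 -> 2 -> 3 -> 4 every vertex should end up with Value(4)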

Related

How come the test case still passes even though, in my opinion, I have not provided the correct mocking?

I am testing this function. The main bit for me is the call to the add method of a repository (partitionsOfATagTransactionRepository.add(transaction, infoToAdd, mutationCondition)).
def updateOrCreateTagPartitionInfo(transaction: DistributedTransaction, currentTagPartition: Option[TagPartitions], tag: String) = {
  val currentCalendar = Calendar.getInstance() //TODOM - should I use a standard Locale/Timezone (eg GMT) to keep time consistent across all instances of the server application
  val currentYear = currentCalendar.get(Calendar.YEAR).toLong
  val currentMonth = currentCalendar.get(Calendar.MONTH).toLong
  val newTagParitionInfo = TagPartitionsInfo(currentYear.toLong, currentMonth.toLong)
  val (infoToAdd, mutationCondition) = currentTagPartition match {
    case Some(tagPartitionInfo) => {
      //checktest-should add new tag partition info to existing partition info
      (TagPartitions(tagPartitionInfo.tag, tagPartitionInfo.partitionInfo + (newTagParitionInfo)), new PutIfExists)
    }
    case None => {
      //checktest-should add new tag partition info if existing partition doesn't exist
      (TagPartitions(tag, Set(newTagParitionInfo)), new PutIfNotExists)
    }
  }
  partitionsOfATagTransactionRepository.add(transaction, infoToAdd, mutationCondition) // calling a repository method which I suppose needs mocking
  infoToAdd
}
I wrote this test case to test the method:
"should add new tag partition info if existing partition doesn't exist" in {
  val servicesTestEnv = new ServicesTestEnv(components = components)
  val questionTransactionDBService = new QuestionsTransactionDatabaseService(
    servicesTestEnv.mockAnswersTransactionRepository,
    servicesTestEnv.mockPartitionsOfATagTransactionRepository,
    servicesTestEnv.mockPracticeQuestionsTagsTransactionRepository,
    servicesTestEnv.mockPracticeQuestionsTransactionRepository,
    servicesTestEnv.mockSupportedTagsTransactionRepository,
    servicesTestEnv.mockUserProfileAndPortfolioTransactionRepository,
    servicesTestEnv.mockQuestionsCreatedByUserRepo,
    servicesTestEnv.mockTransactionService,
    servicesTestEnv.mockPartitionsOfATagRepository,
    servicesTestEnv.mockHelperMethods
  )
  val currentCalendar = Calendar.getInstance() //TODOM - should I use a standard Locale/Timezone (eg GMT) to keep time consistent across all instances of the server application
  val currentYear = currentCalendar.get(Calendar.YEAR).toLong
  val currentMonth = currentCalendar.get(Calendar.MONTH).toLong
  val newTagParitionInfo = TagPartitionsInfo(currentYear.toLong, currentMonth.toLong)
  val existingTag = "someExistingTag"
  val existingTagPartitions = None
  val result = questionTransactionDBService.updateOrCreateTagPartitionInfo(servicesTestEnv.mockDistributedTransaction,
    existingTagPartitions, existingTag) // calling the function under test without providing a mock for the repository's add method. The test passes! How? Shouldn't it throw a NullPointerException?
  val expectedResult = TagPartitions(existingTag, Set(newTagParitionInfo))
  verify(servicesTestEnv.mockPartitionsOfATagTransactionRepository, times(1))
    .add(servicesTestEnv.mockDistributedTransaction, expectedResult, new PutIfNotExists())
  result mustBe expectedResult
  result mustBe TagPartitions(existingTag, Set(newTagParitionInfo))
}
The various mocks are defined as
val mockCredentialsProvider = mock(classOf[CredentialsProvider])
val mockUserTokenTransactionRepository = mock(classOf[UserTokenTransactionRepository])
val mockUserTransactionRepository = mock(classOf[UserTransactionRepository])
val mockUserProfileAndPortfolioTransactionRepository = mock(classOf[UserProfileAndPortfolioTransactionRepository])
val mockHelperMethods = mock(classOf[HelperMethods])
val mockTransactionService = mock(classOf[TransactionService])
val mockQuestionsCreatedByUserRepo = mock(classOf[QuestionsCreatedByAUserForATagTransactionRepository])
val mockQuestionsAnsweredByUserRepo = mock(classOf[QuestionsAnsweredByAUserForATagTransactionRepository])
val mockDistributedTransaction = mock(classOf[DistributedTransaction])
val mockQuestionTransactionDBService = mock(classOf[QuestionsTransactionDatabaseService])
val mockQuestionNonTransactionDBService = mock(classOf[QuestionsNonTransactionDatabaseService])
val mockAnswersTransactionRepository = mock(classOf[AnswersTransactionRepository])
val mockPartitionsOfATagTransactionRepository = mock(classOf[PartitionsOfATagTransactionRepository])
val mockPracticeQuestionsTagsTransactionRepository = mock(classOf[PracticeQuestionsTagsTransactionRepository])
val mockPracticeQuestionsTransactionRepository = mock(classOf[PracticeQuestionsTransactionRepository])
val mockSupportedTagsTransactionRepository = mock(classOf[SupportedTagsTransactionRepository])
val mockPartitionsOfATagRepository = mock(classOf[PartitionsOfATagRepository])
The test case passes even though I have not provided any stubbing for partitionsOfATagTransactionRepository.add. Shouldn't I get a NullPointerException when the add method is called?
I was expecting that I would need to write something like doNothing().when(servicesTestEnv.mockPartitionsOfATagTransactionRepository).add(ArgumentMatchers.any[DistributedTransaction], ArgumentMatchers.any[TagPartitions], ArgumentMatchers.any[MutationCondition]) or when(servicesTestEnv.mockPartitionsOfATagTransactionRepository).add(ArgumentMatchers.any[DistributedTransaction], ArgumentMatchers.any[TagPartitions], ArgumentMatchers.any[MutationCondition]).thenReturn(...) for the test case to pass.
The Mockito team made a decision to return a default value when a method has not been stubbed.
See: https://javadoc.io/doc/org.mockito/mockito-core/latest/org/mockito/Mockito.html#stubbing
By default, for all methods that return a value, a mock will return either null, a primitive/primitive wrapper value, or an empty collection, as appropriate. For example 0 for an int/Integer and false for a boolean/Boolean.
This decision was made consciously: if you are focusing on a different aspect of behaviour of method under test, and the default value is good enough, you don't need to specify it.
Note that other mocking frameworks have taken the opposite path: they raise an exception when an unstubbed call is detected (for example, EasyMock).
See EasyMock vs Mockito: design vs maintainability?
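A minimal illustration of that default behaviour, using a hypothetical Repo trait (not from the post above):
import org.mockito.Mockito.mock

trait Repo {
  def count(): Int
  def find(id: Long): String
  def save(value: String): Unit
}

val repo = mock(classOf[Repo])
println(repo.count())  // 0    - default for Int
println(repo.find(1L)) // null - default for object return types
repo.save("x")         // void/Unit method: does nothing and throws nothing
That is exactly what happens to the unstubbed add call in the test above: it silently does nothing, and verify(...) can still confirm it was invoked.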

Hazelcast Jet rolling aggregation with removing previous data and adding new

We have a use case where we receive messages from Kafka that need to be aggregated. They have to be aggregated in such a way that if an update comes in for the same id, the existing value (if any) is subtracted and the new value is added.
From various forums I learned that Jet doesn't store the raw values, only the aggregated result and some internal data.
In that case, how can I achieve this?
Example
Balance 1 {id:1, amount:100} // aggregated result 100
Balance 2 {id:2, amount:200} // 300
Balance 3 {id:1, amount:400} // 600 after removing 100 and adding 400
I could achieve the simple case where every value is just added, but I was not able to achieve the aggregation where the existing value needs to be subtracted and the new value added.
rollingAggregate(AggregateOperations.summingDouble(<logic to add/remove>))
    .drainTo(Sinks.logger())
Balance 1, 2, 3 are a sequence of messages.
The comments show the aggregated value Jet computes after each message.
My aim is to add the new amount (if the id comes in for the first time) and subtract the old amount if an updated balance comes in, i.e. the id is the same as before.
You can try a custom aggregate operation which emits the previously and currently seen values, like this:
public static <T> AggregateOperation1<T, ?, Tuple2<T, T>> previousAndCurrent() {
    return AggregateOperation
            .withCreate(() -> new Object[2])
            .<T>andAccumulate((acc, current) -> {
                acc[0] = acc[1];
                acc[1] = current;
            })
            .andExportFinish((acc) -> tuple2((T) acc[0], (T) acc[1]));
}
The output will be a tuple of the form (previous, current). You can then apply a rolling aggregate again to that output. To keep the example simple, the input is a stream of (id, amount) entries.
Pipeline p = Pipeline.create();
p.drawFrom(Sources.<Integer, Long>mapJournal("map", START_FROM_OLDEST)) // (id, amount)
 .groupingKey(Entry::getKey)
 .rollingAggregate(previousAndCurrent(), (key, val) -> val)
 .rollingAggregate(AggregateOperations.summingLong(e -> {
     long prevValue = e.f0() == null ? 0 : e.f0().getValue();
     long newValue = e.f1().getValue();
     return newValue - prevValue;
 }))
 .drainTo(Sinks.logger());

JetConfig config = new JetConfig();
config.getHazelcastConfig().addEventJournalConfig(new EventJournalConfig().setMapName("map"));
JetInstance jet = Jet.newJetInstance(config);

IMapJet<Object, Object> map = jet.getMap("map");
map.put(0, 1L);
map.put(0, 2L);
map.put(1, 10L);
map.put(1, 40L);

jet.newJob(p).join();
This should produce as output: 1, 2, 12, 42.

spark graph frames aggregate messages multiple iterations

The Spark GraphFrames documentation has a nice example of how to apply the aggregateMessages function.
To me, it seems to only aggregate over the direct neighbours of each vertex, and not to iterate deeper into the graph the way GraphX's Pregel operator does.
How can I accomplish such iterations in GraphFrames as well, using aggregateMessages, similar to how iteration is handled in GraphX here: https://github.com/sparkling-graph/sparkling-graph/blob/master/operators/src/main/scala/ml/sparkling/graph/operators/measures/vertex/eigenvector/EigenvectorCentrality.scala
import org.apache.spark.sql.functions.sum
import org.graphframes.GraphFrame
import org.graphframes.examples
import org.graphframes.lib.AggregateMessages

val g: GraphFrame = examples.Graphs.friends // get example graph

// We will use AggregateMessages utilities later, so name it "AM" for short.
val AM = AggregateMessages

// For each user, sum the ages of the adjacent users.
val msgToSrc = AM.dst("age")
val msgToDst = AM.src("age")
val agg = g.aggregateMessages
  .sendToSrc(msgToSrc)               // send destination user's age to source
  .sendToDst(msgToDst)               // send source user's age to destination
  .agg(sum(AM.msg).as("summedAges")) // sum up ages, stored in AM.msg column
agg.show()
http://graphframes.github.io/user-guide.html#message-passing-via-aggregatemessages
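For reference, one common pattern (a rough sketch, not from the GraphFrames docs, reusing g and AM from the example above) is to drive the iteration from the driver: call aggregateMessages, rebuild the GraphFrame with the updated vertex values, and repeat.
import org.apache.spark.sql.functions.sum
import org.graphframes.GraphFrame

var current: GraphFrame = g
for (_ <- 1 to 5) { // fixed number of supersteps; use a convergence check in practice
  val agg = current.aggregateMessages
    .sendToSrc(AM.dst("age"))
    .sendToDst(AM.src("age"))
    .agg(sum(AM.msg).as("age")) // the aggregate becomes the new vertex attribute
  // Cache to keep the query plan from growing between iterations; GraphFrames also
  // provides AggregateMessages.getCachedDataFrame for exactly this purpose.
  // Note: vertices that received no messages are absent from agg.
  val newVertices = agg.cache()
  current = GraphFrame(newVertices, current.edges)
}
current.vertices.show()
Each pass only talks to direct neighbours, but repeating it lets information travel one hop further per iteration, much like a Pregel superstep.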

Periodic Broadcast in Apache Spark Streaming

I am implementing a stream learner for text classification. There are some single-valued parameters in my implementation that need to be updated as new stream items arrive. For example, I want to change the learning rate as new predictions are made. However, I doubt there is a way to broadcast variables after the initial broadcast. So what happens if I need to broadcast a variable every time I update it? If there is a way to do it, or a workaround for what I want to accomplish in Spark Streaming, I'd be happy to hear about it.
Thanks in advance.
I got this working by creating a wrapper class over the broadcast variable. The updateAndGet method of the wrapper class returns the refreshed broadcast variable. I am calling this function inside dStream.transform, as per the Spark documentation:
http://spark.apache.org/docs/latest/streaming-programming-guide.html#transform-operation
Transform Operation states:
"the supplied function gets called in every batch interval. This allows you to do time-varying RDD operations, that is, RDD operations, number of partitions, broadcast variables, etc. can be changed between batches."
The BroadcastWrapper class will look like this:
public class BroadcastWrapper {

    private Broadcast<ReferenceData> broadcastVar;
    private Date lastUpdatedAt = Calendar.getInstance().getTime();

    private static BroadcastWrapper obj = new BroadcastWrapper();

    private BroadcastWrapper() {}

    public static BroadcastWrapper getInstance() {
        return obj;
    }

    public JavaSparkContext getSparkContext(SparkContext sc) {
        JavaSparkContext jsc = JavaSparkContext.fromSparkContext(sc);
        return jsc;
    }

    public Broadcast<ReferenceData> updateAndGet(SparkContext sparkContext) {
        Date currentDate = Calendar.getInstance().getTime();
        long diff = currentDate.getTime() - lastUpdatedAt.getTime();
        if (broadcastVar == null || diff > 60000) { // Let's say we want to refresh every 1 min = 60000 ms
            if (broadcastVar != null)
                broadcastVar.unpersist();
            lastUpdatedAt = new Date(System.currentTimeMillis());
            // Your logic to refresh
            ReferenceData data = getRefData();
            broadcastVar = getSparkContext(sparkContext).broadcast(data);
        }
        return broadcastVar;
    }
}
You can use this wrapper's updateAndGet function inside the stream.transform method, which allows RDD-to-RDD transformations:
objectStream.transform(rdd -> {
    Broadcast<Object> var = BroadcastWrapper.getInstance().updateAndGet(rdd.context());
    /** Your code to manipulate the RDD using var.value() **/
    return rdd;
});
Refer to my full answer in this post: https://stackoverflow.com/a/41259333/3166245
Hope it helps
My understanding is that once a broadcast variable is initially sent out, it is 'read only'. I believe you can update the broadcast variable on the local nodes, but not on the remote nodes.
Maybe you need to consider doing this 'outside Spark'. How about using a NoSQL store (Cassandra, etc.) or even Memcache? You can then update the value from one task and periodically check the store from the other tasks.
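A rough sketch of that idea (the store-access helpers here are hypothetical placeholders, not a real client API):
import org.apache.spark.streaming.dstream.DStream

// Hypothetical helpers standing in for your real store client (Cassandra, Memcache, Redis, ...)
def loadReferenceData(): Map[String, Double] = ???                          // read the latest values
def applyReferenceData(rec: String, ref: Map[String, Double]): String = ??? // use them on one record

def enrichStream(stream: DStream[String]): DStream[String] =
  stream.transform { rdd =>
    rdd.mapPartitions { iter =>
      // Re-read once per partition per batch; cache per executor JVM if that is too costly.
      val ref = loadReferenceData()
      iter.map(applyReferenceData(_, ref))
    }
  }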
I came up with an ugly hack, but it worked!
You can see how a broadcast value is fetched from a broadcast object just by its broadcast id:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/broadcast/TorrentBroadcast.scala#L114
So I periodically rebroadcast through the same broadcast id.
val broadcastFactory = new TorrentBroadcastFactory()
broadcastFactory.unbroadcast(BroadcastId, true, true)
// append some ids to initIds
val broadcastcontent = broadcastFactory.newBroadcast[Set[String]](initIds, false, BroadcastId)
And I can get BroadcastId from the first broadcast value:
val ids = ssc.sparkContext.broadcast(initIds)
// broadcast id
val BroadcastId = ids.id
Then the workers use ids as a normal Broadcast value:
def func(record: Array[Byte], bc: Broadcast[Set[String]]) = ???
bkc.unpersist(true)
bkc.destroy()
bkc = sc.broadcast(tableResultMap)
bkv = bkc.value
You may try this; I can't guarantee whether it will be effective.
It is best that you collect the data to the driver and then broadcast them to all nodes.
Use DStream#foreachRDD to collect the computed RDDs at the driver, and once you know you need to change the learning rate, use SparkContext#broadcast(value) to send the new value to all nodes.
I would expect the code to look something like the following:
dStreamContainingBroadcastValue.foreachRDD { rdd =>
  val valueToBroadcast = rdd.collect()
  sc.broadcast(valueToBroadcast)
}
You may also find this thread from the Spark user mailing list useful. Let me know if that works.

Spark - convert string IDs to unique integer IDs

I have a dataset which looks like this, where each user and product ID is a string:
userA, productX
userA, productX
userB, productY
with ~2.8 million products and 300 million users; about 2.1 billion user-product associations.
My end goal is to run Spark collaborative filtering (ALS) on this dataset. Since it takes int keys for users and products, my first step is to assign a unique int to each user and product, and transform the dataset above so that users and products are represented by ints.
Here's what I've tried so far:
val rawInputData = sc.textFile(params.inputPath)
  .filter { line => !(line contains "\\N") }
  .map { line =>
    val parts = line.split("\t")
    (parts(0), parts(1)) // user, product
  }

// find all unique users and assign them IDs
val idx1map = rawInputData.map(_._1).distinct().zipWithUniqueId().cache()

// find all unique products and assign IDs
val idx2map = rawInputData.map(_._2).distinct().zipWithUniqueId().cache()

idx1map.map { case (id, idx) => id + "\t" + idx.toString }.saveAsTextFile(params.idx1Out)
idx2map.map { case (id, idx) => id + "\t" + idx.toString }.saveAsTextFile(params.idx2Out)

// join with user ID map:
// convert from (userStr, productStr) to (productStr, userIntId)
val rev = rawInputData.cogroup(idx1map).flatMap {
  case (id1, (id2s, idx1s)) =>
    val idx1 = idx1s.head
    id2s.map { (_, idx1) }
}

// join with product ID map:
// convert from (productStr, userIntId) to (userIntId, productIntId)
val converted = rev.cogroup(idx2map).flatMap {
  case (id2, (idx1s, idx2s)) =>
    val idx2 = idx2s.head
    idx1s.map { (_, idx2) }
}

// save output
val convertedInts = converted.map {
  case (a, b) => a.toInt.toString + "\t" + b.toInt.toString
}
convertedInts.saveAsTextFile(params.outputPath)
When I try to run this on my cluster (40 executors with 5 GB RAM each), it's able to produce the idx1map and idx2map files fine, but it fails with out of memory errors and fetch failures at the first flatMap after cogroup. I haven't done much with Spark before so I'm wondering if there is a better way to accomplish this; I don't have a good idea of what steps in this job would be expensive. Certainly cogroup would require shuffling the whole data set across the network; but what does something like this mean?
FetchFailed(BlockManagerId(25, ip-***.ec2.internal, 48690), shuffleId=2, mapId=87, reduceId=25)
The reason I'm not just using a hashing function is that I'd eventually like to run this on a much larger dataset (on the order of 1 billion products, 1 billion users, 35 billion associations), and the number of Int key collisions would become quite large. Is running ALS on a dataset of that scale even close to feasible?
It looks like you are essentially collecting all the lists of users, just to split them up again. Try using join instead of cogroup, which seems to do more of what you want. For example:
import org.apache.spark.SparkContext._
// Create some fake data
val data = sc.parallelize(Seq(("userA", "productA"),("userA", "productB"),("userB", "productB")))
val userId = sc.parallelize(Seq(("userA",1),("userB",2)))
val productId = sc.parallelize(Seq(("productA",1),("productB",2)))
// Replace userName with ID's
val userReplaced = data.join(userId).map{case (_,(prod,user)) => (prod,user)}
// Replace product names with ID's
val bothReplaced = userReplaced.join(productId).map{case (_,(user,prod)) => (user,prod)}
// Check results:
bothReplaced.collect() // Array((1,1), (1,2), (2,2))
Please drop a comment on how well it performs.
(I have no idea what FetchFailed(...) means)
My platform: CDH 5.7, Spark 1.6.0 standalone.
My test data size: 31,815,167 rows in total; 31,562,704 distinct user strings; 4,140,276 distinct product strings.
First idea:
My first idea was to use the collectAsMap action and then use the resulting map to change the user/product strings to ints (sketched below). With driver memory up to 12 GB, I got OOM or GC-overhead exceptions (the exception is bounded by the driver memory).
This idea only works on small data sizes; with bigger data you need a bigger driver memory.
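A sketch of that first idea, reusing the rawInputData (user, product) RDD and SparkContext sc from the question (only viable while the distinct id sets fit in driver memory):
val userIdMap: Map[String, Long] =
  rawInputData.map(_._1).distinct().zipWithUniqueId().collectAsMap().toMap
val userIdMapBc = sc.broadcast(userIdMap)

val userEncoded = rawInputData.map { case (user, product) =>
  (userIdMapBc.value(user), product) // the same would be done for products
}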
Second idea:
The second idea is to use the join method, as Tobber proposed. Here are some test results:
Job setup:
driver: 2G , 2 cpu;
executor : (8G , 4 cpu) * 7;
I followed these steps (sketched below):
1) find the unique user strings and zipWithIndex them;
2) join them with the original data;
3) save the encoded data.
The job takes about 10 minutes to finish.
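A sketch of those three steps against the question's rawInputData (user, product) pairs:
// 1) find the unique user and product strings and zip them with indexes
val userIds = rawInputData.map(_._1).distinct().zipWithIndex()    // (userStr, userLongId)
val productIds = rawInputData.map(_._2).distinct().zipWithIndex() // (productStr, productLongId)

// 2) join the original data against both dictionaries
val encoded = rawInputData
  .join(userIds)                                        // (userStr, (productStr, userLongId))
  .map { case (_, (productStr, userId)) => (productStr, userId) }
  .join(productIds)                                     // (productStr, (userLongId, productLongId))
  .map { case (_, (userId, productId)) => (userId, productId) }

// 3) save the encoded data (the output path is just an example)
encoded.map { case (u, p) => s"$u\t$p" }.saveAsTextFile("/tmp/encoded")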
