Mockito: mocked method returning values based on the value returned by another mocked method

I have a class, say Worker.class, with the following methods:
Worker worker = mock(Worker.class);
worker.getTime() ==> returns an enum/String, e.g. "Evening" or "Morning"
worker.getSpeed() ==> returns a Double: 1d, 2d, 3d, 4d, ...
worker.getRemainingJobCount() ==> returns a Long: 1L, 2L, 3L, 4L, ...
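For reference, a minimal Worker interface matching these descriptions (my assumption about its shape, not the original class) could look like:
public interface Worker {
    String getTime();              // e.g. "Morning", "Evening", "LateNight"
    Double getSpeed();             // e.g. 1d, 2d, 3d, ...
    Long getRemainingJobCount();   // e.g. 1L, 2L, 3L, ...
}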
I'm trying to create a simulation where a worker has a different speed and remaining job count in the "Evening" and the "Morning".
For example,
// Morning
when(worker.getTime()).thenReturn("Morning");
when(worker.getSpeed()).thenReturn(1d, 2d, 3d, 2d, 4d);
when(worker.getRemainingJobCount()).thenReturn(10L, 12L, 30L, 10L, 10L);
// Evening
when(worker.getTime()).thenReturn("Evening");
when(worker.getSpeed()).thenReturn(1d, 2d, 1d);
when(worker.getRemainingJobCount()).thenReturn(5L, 10L, 5L);
// LateNight
when(worker.getTime()).thenReturn("LateNight");
when(worker.getSpeed()).thenReturn(1d, 1d, 1d, 1d);
when(worker.getRemainingJobCount()).thenReturn(1L, 2L, 3L, 4L);
In other words, I want the returned value for getSpeed and getRemainingJobCount to depend on the value from getTime.
I am thinking about using Answer:
when(this.worker.getSpeed()).thenAnswer(
    new Answer<Double>() {
        public Double answer(InvocationOnMock invocation) throws Throwable {
            // How can I introduce the value from getTime() here,
            // so that I can use if-else conditions?
        }
    }
);
How may I write this in Mockito?
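One way to do that (a minimal sketch, not from the original post, assuming getSpeed() returns Double and getRemainingJobCount() returns Long as described above) is to have the Answer call getTime() on the same mock at answer time and branch on the result; sequences of return values per time of day could be handled with a counter or queue inside the Answer:
when(worker.getSpeed()).thenAnswer(new Answer<Double>() {
    @Override
    public Double answer(InvocationOnMock invocation) throws Throwable {
        // getTime() is stubbed on the same mock, so this returns whatever
        // the test currently configured ("Morning", "Evening", ...).
        switch (worker.getTime()) {
            case "Morning": return 2d;
            case "Evening": return 1d;
            default:        return 1d; // "LateNight", etc.
        }
    }
});
when(worker.getRemainingJobCount()).thenAnswer(new Answer<Long>() {
    @Override
    public Long answer(InvocationOnMock invocation) throws Throwable {
        switch (worker.getTime()) {
            case "Morning": return 10L;
            case "Evening": return 5L;
            default:        return 1L;
        }
    }
});
Re-stubbing when(worker.getTime()).thenReturn("Evening") later in the test then changes what both Answers return, because the most recent stubbing of getTime() wins.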

Related

Pregel API - why do iterations on a small graph consume so much memory?

I'm relatively new to Spark and Scala, but I've decided to post an example of code that is quite simple and, in my view, shouldn't cause a serious problem. In practice, however, it often causes an Out of Memory error in an AWS EMR Spark environment, depending on the value of maxIterations:
import java.net.URI
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._
import scala.util.Try
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.IOUtils
import java.io.IOException
val config = new SparkConf().setAppName("test graphx")
config.set("spark.driver.allowMultipleContexts","true")
val batch_id=new Integer(31)
val maxIterations=2 // 200 iterations are causing out of memory
var myVertices = sc.makeRDD(Array( (1L, ("A",batch_id,0.0,0.0,0.0,11.0)), (2L, ("B",batch_id,0.0,1000.0,0.0,300.0)), (3L, ( "C", batch_id, 1000.0, 1000.0, 0.0, 8.0)), (4L, ("D",batch_id,1000.0, 0.0, 0.0, 400.0)) ))
var myEdges = sc.makeRDD(Array(Edge(4L, 3L, (7.7, 0.0) ), Edge(2L, 3L, (5.0, 0.0) ), Edge(2L, 1L, (12.0, 0.0))))
var myGraph=Graph(myVertices,myEdges)
myGraph.cache
myGraph.triplets.foreach(println)
//we need to calculate some constant values for each edge before start of pregel
val initGraph=myGraph.mapTriplets(tr =>
(tr.attr._1, (tr.attr._1 *
(scala.math.sqrt((tr.dstAttr._3-tr.srcAttr._3)*(tr.dstAttr._3-tr.srcAttr._3)+( tr.dstAttr._4-tr.srcAttr._4)*( tr.dstAttr._4-tr.srcAttr._4)+(tr.dstAttr._5-tr.srcAttr._5)*(tr.dstAttr._5-tr.srcAttr._5))) *
(scala.math.sqrt((tr.dstAttr._3-tr.srcAttr._3)*(tr.dstAttr._3-tr.srcAttr._3)+( tr.dstAttr._4-tr.srcAttr._4)*( tr.dstAttr._4-tr.srcAttr._4)+(tr.dstAttr._5-tr.srcAttr._5)*(tr.dstAttr._5-tr.srcAttr._5))) /
(tr.dstAttr._6 * tr.dstAttr._6))
)
)
initGraph.triplets.take(100).foreach(println)
val distanceStep = 0.1
val tolerance = 1
val sssp = initGraph.pregel( (0.0, 0.0, 0.0, 0.0), maxIterations //500-3000
)(
(id: VertexId, vert: ((String, Integer, Double, Double, Double, Double)), msg: (Double, Double, Double, Double)) =>
(
vert._1,vert._2,
( if (scala.math.abs(msg._1)> tolerance) {vert._3+distanceStep*msg._1 } else { vert._3 }),
( if (scala.math.abs(msg._2)> tolerance) {vert._4+distanceStep*msg._2 } else { vert._4 }),
( if (scala.math.abs(msg._3)> tolerance) {vert._5+distanceStep*msg._3 } else { vert._5 }),
vert._6
),// Vertex Program
e => { // Send Message
Iterator(
(
e.dstId,
(
((e.srcAttr._3 - e.dstAttr._3)*distanceStep*scala.math.sqrt( 2*e.attr._2*e.srcAttr._6 / ((e.dstAttr._3-e.srcAttr._3)*(e.dstAttr._3-e.srcAttr._3)+( e.dstAttr._4-e.srcAttr._4)*( e.dstAttr._4-e.srcAttr._4)+(e.dstAttr._5-e.srcAttr._5)*(e.dstAttr._5-e.srcAttr._5)) )), //x
((e.srcAttr._4 - e.dstAttr._4)*distanceStep*scala.math.sqrt( 2*e.attr._2*e.srcAttr._6 / ((e.dstAttr._3-e.srcAttr._3)*(e.dstAttr._3-e.srcAttr._3)+( e.dstAttr._4-e.srcAttr._4)*( e.dstAttr._4-e.srcAttr._4)+(e.dstAttr._5-e.srcAttr._5)*(e.dstAttr._5-e.srcAttr._5)) )), //y
((e.srcAttr._5 - e.dstAttr._5)*distanceStep*scala.math.sqrt( 2*e.attr._2*e.srcAttr._6 / ((e.dstAttr._3-e.srcAttr._3)*(e.dstAttr._3-e.srcAttr._3)+( e.dstAttr._4-e.srcAttr._4)*( e.dstAttr._4-e.srcAttr._4)+(e.dstAttr._5-e.srcAttr._5)*(e.dstAttr._5-e.srcAttr._5)) )), //z
e.attr._1*distanceStep*scala.math.sqrt((e.dstAttr._3-e.srcAttr._3)*(e.dstAttr._3-e.srcAttr._3)+( e.dstAttr._4-e.srcAttr._4)*( e.dstAttr._4-e.srcAttr._4)+(e.dstAttr._5-e.srcAttr._5)*(e.dstAttr._5-e.srcAttr._5)) //vector module
)
)
)
},
{
(a, b) => (a._1 + b._1, a._2 + b._2, a._3 + b._3, 0) // Merge Message
}
)
sssp.vertices.take(10).foreach(println)
I run it in AWS EMR on a 4-node m5.x2large cluster via Zeppelin, but it can be quickly adapted and executed as a Spark job.
In short, this code creates a graph myGraph with 4 vertices and 3 edges. Then, for each triplet, I calculate some constant values and store the result in the graph object initGraph.
Then I apply the Pregel API to initGraph, whose execution is limited only by the number of iterations maxIterations. At this point I see strange behavior from the Pregel API: for small maxIterations values (less than 10) it runs quite fast, for 100-150 iterations it runs for 3-4 minutes in Zeppelin, and at 200 iterations it fails with various errors (ConnectionClosed, etc.).
I tried to monitor what's going on with the cluster when I set maxIterations to 150 or 200: allocated memory goes straight up and available memory decreases at the same pace.
As I'm quite new to Spark, I'm not sure whether this is correct behavior, and quite honestly I can't find an explanation of what could consume gigabytes of memory in 200 Pregel iterations on such a small graph. If you can reproduce it on your end, I'm curious to hear your advice on performance optimization, because if I expand the cluster and run the same code on larger hardware, it is only a question of maxIterations and graph size before I hit the same OutOfMemory error. The real graph I need to run this on has more than 1M vertices and ~7M edges, so I can't figure out what kind of hardware it would require if this problem isn't solved.

Hazelcast Jet rolling aggregation that removes previous data and adds new

We have a use case where we receive messages from Kafka that need to be aggregated. The aggregation has to work so that if an update arrives for the same id, the existing value (if any) is subtracted and the new value is added.
From various forums I learned that Jet doesn't store the raw values, only the aggregated result and some internal data.
In that case, how can I achieve this?
Example
Balance 1 {id:1, amount:100} // aggregated result 100
Balance 2 {id:2, amount:200} // 300
Balance 3 {id:1, amount:400} // 600 after removing 100 and adding 400
I could implement the simple case where every value is just added, but I was not able to implement the aggregation where the existing value needs to be subtracted and the new value added.
rollingAggregate(AggregateOperations.summingDouble(<logic to add/remove>))
.drainTo(Sinks.logger());
Balance 1, 2, 3 are a sequence of messages.
The comments show the aggregated value Jet should produce after each message.
My aim is to add the new amount if an id arrives for the first time, and to subtract the old amount and add the new one if an updated balance arrives, i.e. the id is the same as before.
You can try a custom aggregate operation which will emit the previous and currently seen values like this:
public static <T> AggregateOperation1<T, ?, Tuple2<T, T>> previousAndCurrent() {
    return AggregateOperation
            .withCreate(() -> new Object[2])
            .<T>andAccumulate((acc, current) -> {
                acc[0] = acc[1];
                acc[1] = current;
            })
            .andExportFinish((acc) -> tuple2((T) acc[0], (T) acc[1]));
}
The output will be a tuple of the form (previous, current). Then you can apply another rolling aggregate to that output. To simplify the problem, I use (id, amount) pairs as the input.
Pipeline p = Pipeline.create();
p.drawFrom(Sources.<Integer, Long>mapJournal("map", START_FROM_OLDEST)) // (id, amount)
 .groupingKey(Entry::getKey)
 .rollingAggregate(previousAndCurrent(), (key, val) -> val)
 .rollingAggregate(AggregateOperations.summingLong(e -> {
     long prevValue = e.f0() == null ? 0 : e.f0().getValue();
     long newValue = e.f1().getValue();
     return newValue - prevValue;
 }))
 .drainTo(Sinks.logger());
JetConfig config = new JetConfig();
config.getHazelcastConfig().addEventJournalConfig(new EventJournalConfig().setMapName("map"));
JetInstance jet = Jet.newJetInstance(config);
IMapJet<Object, Object> map = jet.getMap("map");
map.put(0, 1L);
map.put(0, 2L);
map.put(1, 10L);
map.put(1, 40L);
jet.newJob(p).join();
This should produce 1, 2, 12, 42 as output: the first entry for key 0 adds 1, its update adds 2 - 1 = 1, the first entry for key 1 adds 10, and its update adds 40 - 10 = 30.

Spark RDD recursive operations on simple collection

I have user information in an RDD:
(Id:10, Name:bla, Adress:50, ...)
And I have another collection containing the successive identity changes we gathered for each user:
(lastId, newId)
(10, 43)
(85, 90)
(43, 50)
I need to get the last identity for each user's id, in this example :
getFinalIdentity(10) = 50 (10 -> 43 -> 50)
For a while I used a broadcast variable containing these identities and iterated over the collection to get the final ID.
Everything worked fine until the reference data became too big to fit in a broadcast variable...
I came up with a solution that uses an RDD to store the identities and iterates recursively over it, but it is not very fast and looks overly complex to me.
Is there an elegant and fast way to do this?
Have you thought about graphs?
You could create a graph from the list of edges (lastId, newId). That way, a node with no outgoing edges is the final id for the chain that starts at a node with no incoming edges.
It could be done in Spark with GraphX.
Below is an example. For each id it shows the first id in its chain; that means that for the id chain 1 -> 2 -> 3 the result will be (1, 1), (2, 1), (3, 1).
import org.apache.spark.graphx.{EdgeDirection, EdgeTriplet, Graph, VertexId}
import org.apache.spark.{SparkConf, SparkContext}
object Main {
val conf = new SparkConf().setAppName("myapp").setMaster("local[*]")
val sc = new SparkContext(conf)
def main(args: Array[String]): Unit = {
sc.setLogLevel("ERROR")
// RDD of pairs (oldId, newId)
val changedIds = sc.parallelize(Seq((1L, 2L), (2L, 3L), (3L, 4L), (10L, 20L), (20L, 31L), (30L, 40L), (100L, 200L), (200L, 300L)))
// case classes for pregel operation
case class Value(originId: VertexId) // vertex value
case class Message(value: VertexId) // message sent from one vertex to another
// Create graph from id pairs
val graph = Graph.fromEdgeTuples(changedIds, Value(0))
// Initial message will be sent to all vertexes at the start
val initialMsg = Message(0)
// How vertex should process received message
def onMsgReceive(vertexId: VertexId, value: Value, msg: Message): Value = {
// Initial message will have value 0. In that case current vertex need to initialize its value to its own ID
if (msg.value == 0) Value(vertexId)
// Otherwise received value is initial ID
else Value(msg.value)
}
// How vertexes should send messages
def sendMsg(triplet: EdgeTriplet[Value, Int]): Iterator[(VertexId, Message)] = {
// For the triplet only single message shall be sent to destination vertex
// Its payload is source vertex origin ID
Iterator((triplet.dstId, Message(triplet.srcAttr.originId)))
}
// How incoming messages to one vertex should be merged
def mergeMsg(msg1: Message, msg2: Message): Message = {
// Generally for this case it's an error
// Because one ID can't have 2 different originIDs
msg2 // Just return any of the incoming messages
}
// Kick out pregel calculation
val res = graph
.pregel(initialMsg, Int.MaxValue, EdgeDirection.Out)(onMsgReceive, sendMsg, mergeMsg)
// Print results
res.vertices.collect().foreach(println)
}
}
Output: pairs of (id, Value(firstIdInChain))
(100,Value(100))
(4,Value(1))
(300,Value(100))
(200,Value(100))
(40,Value(30))
(20,Value(10))
(1,Value(1))
(30,Value(30))
(10,Value(10))
(2,Value(1))
(3,Value(1))
(31,Value(10))

Can I use UDAFs in window functions?

I created a user-defined aggregate function, called EdgeHistory, that concatenates all accumulated values into a list (ArrayType).
If I don't specify a window it works fine and returns an array of all the lists, but with the following example it fails:
case class ExampleRow(n: Int, list: List[(String, String, Float, Float)])
val x = Seq(
ExampleRow(1, List(("a", "b", 1f, 2f), ("c", "d", 2f, 3f))),
ExampleRow(2, List(("a", "b", 2f, 4f), ("c", "d", 4f, 6f))),
ExampleRow(3, List(("a", "b", 4f, 8f), ("c", "d", 8f, 12f)))
)
val df = sc.parallelize(x).toDF()
val edgeHistory = new EdgeHistory()
val y = df.agg(edgeHistory('list).over(Window.orderBy("n").rangeBetween(1, 0)))
It throws an error:
STDERR: Exception in thread "main" java.lang.UnsupportedOperationException: EdgeHistory('list) is not supported in a window operation.
at org.apache.spark.sql.expressions.WindowSpec.withAggregate(WindowSpec.scala:177)
at org.apache.spark.sql.Column.over(Column.scala:1052)
at szdavid92.AnalyzeGraphStream$.main(AnalyzeGraphStream.scala:75)
The error message seems pretty straightforward: it seems you cannot use UDAFs in window operations.
Do I understand correctly?
Why is this limitation?
UPDATE
I tried the SQL syntax and got a related error:
df.registerTempTable("data")
sqlContext.udf.register("edge_history", edgeHistory)
val y = sqlContext.sql(
"""
|SELECT n, list, edge_history(list) OVER (ORDER BY n ROWS BETWEEN 1 PRECEDING AND CURRENT ROW)
|FROM data
""".stripMargin)
that is
Exception in thread "main" org.apache.spark.sql.AnalysisException: Couldn't find window function edge_history;
at org.apache.spark.sql.hive.ResolveHiveWindowFunction$$anonfun$apply$1$$anonfun$applyOrElse$1$$anonfun$3.apply(hiveUDFs.scala:288)
at org.apache.spark.sql.hive.ResolveHiveWindowFunction$$anonfun$apply$1$$anonfun$applyOrElse$1$$anonfun$3.apply(hiveUDFs.scala:288)

Store countByKey result into Cassandra

I want to count the number of IndicatePresence messages for each user for any given day (out of a Cassandra table), and then store this in a separate Cassandra table to drive some dashboard pages. I managed to get countByKey working, but now I cannot figure out how to use the Spark-Cassandra saveToCassandra method with a Map (it only takes an RDD).
JavaSparkContext sc = new JavaSparkContext(conf);
CassandraJavaRDD<CassandraRow> indicatePresenceTable = javaFunctions(sc).cassandraTable("mykeyspace", "indicatepresence");
JavaPairRDD<UserDate, CassandraRow> keyedByUserDate = indicatePresenceTable.keyBy(new Function<CassandraRow, UserDate>() {
    private static final long serialVersionUID = 1L;

    @Override
    public UserDate call(CassandraRow cassandraIndicatePresenceRow) throws Exception {
        SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd");
        return new UserDate(cassandraIndicatePresenceRow.getString("userid"), sdf.format(cassandraIndicatePresenceRow.getDate("date")));
    }
});
Map<UserDate, Object> countByKey = keyedByUserDate.countByKey();
writerBuilder("analytics", "countbykey", ???).saveToCassandra();
Is there a way to use a Map directly with a writerBuilder? Or should I write my own custom reducer that returns an RDD but essentially does the same thing as countByKey? Or should I convert each entry in the Map into a new POJO (e.g. UserDateCount, with user, date, and count), use parallelize to turn the list into an RDD, and then store that?
The best thing to do would be to never return the result to the driver at all (which is what countByKey does). Instead, do a reduceByKey to get another RDD back in the form (key, count), map that RDD to the row format of your table, and then call saveToCassandra on it.
The most important strength of this approach is that we never serialize the data back to the driver application. All the information is kept on the cluster and saved from there directly to C*, rather than running through the bottleneck of the driver application.
Example (very similar to a MapReduce word count):
Map each element to (key, 1)
Call reduceByKey to change (key, 1) -> (key, count)
Map each element to something writable to C*: (key, count) -> WritableObject
Call save to C*
In Scala this would be something like
keyedByUserDate
  .map { case (key, _) => (key, 1) }           // Take the key portion of the tuple and replace the value portion with 1
  .reduceByKey( _ + _ )                        // Combine the value portions for all elements which share a key
  .map { case (key, value) => your C* format } // Change the Tuple2 to something that matches your C* table
  .saveToCassandra(ks, tab)                    // Save to Cassandra
In Java it is a little more convoluted (Insert your types in for K and V)
.mapToPair(new PairFunction<Tuple2<K, V>, K, Long>() {
    @Override
    public Tuple2<K, Long> call(Tuple2<K, V> input) throws Exception {
        return new Tuple2<>(input._1(), 1L);
    }
}).reduceByKey(new Function2<Long, Long, Long>() {
    @Override
    public Long call(Long value1, Long value2) throws Exception {
        return value1 + value2;
    }
}).map(new Function<Tuple2<K, Long>, OutputTableClass>() {
    @Override
    public OutputTableClass call(Tuple2<K, Long> input) throws Exception {
        // Do some work here
        return new OutputTableClass(col1, col2, col3 ... colN);
    }
}).saveToCassandra(ks, tab, mapToRow(OutputTableClass.class))
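As a note on the last step, with the Spark-Cassandra connector's Java API the save usually goes through the same javaFunctions/writerBuilder pattern used in the question. A minimal sketch (assuming the mapped RDD above is bound to a variable, say outputRdd, and targeting the table from the question):
javaFunctions(outputRdd)
    .writerBuilder("analytics", "countbykey", mapToRow(OutputTableClass.class))
    .saveToCassandra();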
