How can I access values outside of a Spark GraphX .map loop?

Brand new to Apache Spark and I'm a little confused about how to make updates to a value that sits outside of a .mapTriplets iteration in GraphX. See below:
def mapTripletsMethod(edgeWeights: Graph[Int, Double], stationaryDistribution: Graph[Double, Double]) = {
  val tempMatrix: SparseDoubleMatrix2D = graphToSparseMatrix(edgeWeights)
  stationaryDistribution.mapTriplets{ e =>
    val row = e.srcId.toInt
    val column = e.dstId.toInt
    val cellValue = -1 * tempMatrix.get(row, column) + e.dstAttr
    tempMatrix.set(row, column, cellValue) // this doesn't do anything to tempMatrix
    e
  }
}
I'm guessing this is due to the design of an RDD and that there's no simple way to update the tempMatrix value. When I run the above code the tempMatrix.set method does nothing, and it was rather difficult to follow the problem in the debugger.
Does anyone have an easy solution? Thank you!
Edit
I've made an update above to show that stationaryDistribution is a graph RDD.

You could make tempMatrix be of type RDD[((Int,Int), Double)] -- that is, each entry is a pair where the first element is in turn a (row,col) pair. Then use the PairRDDFunctions class to combine that with ((row,col),weight) triplets generated by your mapTriplets call. (So, don't think of it as updating the tempMatrix, but rather combining two RDDs to get a third.)
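Here is a minimal PySpark sketch of that combination idea (the RDD contents are made up; in Scala the same join comes from PairRDDFunctions):
from pyspark import SparkContext

sc = SparkContext("local", "combine-rdds")

# Matrix entries keyed by (row, col), as suggested above.
temp_matrix = sc.parallelize([((0, 1), 0.5), ((1, 2), 0.25)])

# ((row, col), dstAttr) pairs, as a mapTriplets pass could emit them.
updates = sc.parallelize([((0, 1), 0.1), ((1, 2), 0.2)])

# Join the two RDDs into a third instead of mutating temp_matrix.
combined = temp_matrix.join(updates).mapValues(
    lambda v: -1 * v[0] + v[1])  # mirrors -tempMatrix.get(row, col) + e.dstAttr

print(combined.collect())  # [((0, 1), -0.4), ((1, 2), -0.05)], order may vary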
If you need to support stationary distribution graphs where there is more than one edge per vertex pair it gets a little tricky: you'll probably need to combine those edges in a reduction pass to create an RDD with one entry per pair, with a list of weights, and then apply all the weights to a given (row,col) pair at the same time. Otherwise it's very simple.
Notice that PairRDDFunctions on the one hand gives you ways to combine multiple RDDs into one, and on the other hand ways to pull the values out into a Map on the master. Assuming that the distribution matrix is large enough to merit an RDD in the first place, I think you should do the whole thing on RDDs.
Another approach is to make tempMatrix a graph RDD too, which may or may not make sense depending on what you're going to do with it next.

Related

Python: How is a ~ used to exclude data?

In the below code I know that it is returning all records that are outside of the buffer, but I'm confused as to the mechanics of how that is happening.
I see that there is a "~" (aka a bitwise not) being used. From some googling, my understanding of ~ is that it returns the inverse of each bit in the input it is passed, e.g. if a bit is 0 it returns 1. Is this correct? If not, could someone please ELI5?
Could someone please explain the actual mechanics of how the below code is returning records that are outside of the "my_union" buffer?
NOTE: hospitals and collisions are just geo dataframes.
import geopandas as gpd

coverage = gpd.GeoDataFrame(geometry=hospitals.geometry).buffer(10000)
my_union = coverage.geometry.unary_union
outside_range = collisions.loc[~collisions["geometry"].apply(lambda x: my_union.contains(x))]
~ does indeed perform a bitwise not in Python. But here it is used to perform a logical not on each element of a list (or rather a pandas Series) of booleans. See this answer for an example.
Let's assume the collisions GeoDataFrame contains points, but it will work similarly for other types of geometries.
Let me further change the code a bit:
coverage = gpd.GeoDataFrame(geometry=hospitals.geometry).buffer(10000)
my_union = coverage.geometry.unary_union
within_my_union = collisions["geometry"].apply(lambda x: my_union.contains(x))
outside_range = collisions.loc[~within_my_union]
Then:
my_union is a single (Multi)Polygon.
my_union.contains(x) returns a boolean indicating whether the point x is within the my_union MultiPolygon.
collisions["geometry"] is a pandas Series containing the points.
collisions["geometry"].apply(lambda x: my_union.contains(x)) will run my_union.contains(x) on each of these points.
This will result in another pandas Series containing booleans, indicating whether each point is within my_union.
~ then negates these booleans, so that the Series now indicates whether each point is not within my_union.
collisions.loc[~within_my_union] then selects all the rows of collisions where the entry in ~within_my_union is True, i.e. all the points that don't lie within my_union.
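As a tiny self-contained illustration of those last two steps (the geometries here are made up):
import pandas as pd
from shapely.geometry import Point, Polygon

square = Polygon([(0, 0), (2, 0), (2, 2), (0, 2)])
points = pd.Series([Point(1, 1), Point(5, 5)])

inside = points.apply(lambda p: square.contains(p))
print(inside.tolist())    # [True, False]
print(points[~inside])    # keeps only the point outside the square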
I'm not sure exactly what you mean by the actual mechanics, and it's hard to know for sure without seeing input and output, but I had a go at explaining it below in case it is helpful:
All rows of the collisions dataframe whose geometry lies within my_union will be excluded from the newly created outside_range dataframe.

Python: Sort 2-tuple sometimes according to first and sometimes according to second element

I have a list of tuples of integers [(2,10), [...] (4,11),(3,9)].
Tuples are added to and deleted from the list regularly. It will contain up to ~5000 elements.
In my code I need to use this list sometimes sorted according to the first and sometimes according to the second tuple element. Hence the ordering of the list will change drastically, and resorting might take place at any time.
Python's Timsort is only fast when lists are already mostly sorted, so this general approach of frequent resorting might be inefficient. A better approach would be to use two naturally sorted data structures, like the SortedList. But then I would need two lists (one for the first tuple element, and one for the second) as well as a dictionary to create the mapping between the above tuples.
What is the pythonic way to solve this?
In Java I would do it like this:
TreeSet<Integer> leftTupleEntry = new TreeSet<Integer>();
TreeSet<Integer> rightTupleEntry = new TreeSet<Integer>();
HashMap<Integer, Integer> tupleMap = new HashMap<Integer, Integer>();
This gives both sorting strategies the best runtime complexity class, as well as the necessary connection between the two numbers.
When I need to sort it according to the first tuple element I need to access the whole list (as I need to calculate a cumulative sum, among other operations).
When I need to sort according to the second element, I'm only interested in the smallest elements, which is then usually followed by the deletion of those tuples.
Typically after any insertion a new sort according to the first element is requested.
first_element_list = sorted([i[0] for i in list_tuple])
second_element_list = sorted([i[1] for i in list_tuple])
What I did:
I used a SortedKeyList, sorted according to the first tuple element. Inserting into this list is O(log(n)), and reading from it is O(log(n)) too.
from operator import itemgetter
from sortedcontainers import SortedKeyList
self.list = SortedKeyList(key=itemgetter(0))
self.list.add((1,4))
self.list.add((2,6))
When I need the argmin according to the second tuple element I used
np.argmin(self.list, axis=0)[1]  # column 1 holds the second tuple elements
which is O(n). Not optimal.
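One way to avoid that O(n) scan is to keep a second SortedKeyList keyed on the second tuple element. A sketch, assuming sortedcontainers is available:
from operator import itemgetter
from sortedcontainers import SortedKeyList

by_first = SortedKeyList(key=itemgetter(0))
by_second = SortedKeyList(key=itemgetter(1))

def add(t):             # O(log n) in each list
    by_first.add(t)
    by_second.add(t)

def remove(t):          # O(log n) in each list
    by_first.remove(t)
    by_second.remove(t)

for t in [(2, 10), (4, 11), (3, 9)]:
    add(t)

smallest = by_second[0]   # (3, 9), the minimum by second element
remove(smallest)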

Difference between reduce and reduceByKey in Apache Spark

What is the difference between reduce and reduceByKey in Apache Spark in terms of their functionality?
Why is reduceByKey a transformation while reduce is an action?
This is close to a duplicate of my answer explaining reduceByKey, but I will elaborate on the specific part that makes the two different. However, refer to that answer for a bit more on the internals of reduceByKey.
Basically, reduce must pull the entire dataset down into a single location because it is reducing to one final value. reduceByKey, on the other hand, produces one value for each key. And since this operation can be run on each machine locally first, the result can remain an RDD and have further transformations done on it.
Note, however, that there is a reduceByKeyLocally you can use to automatically pull the Map down to a single location as well.
Please go through the official documentation.
reduce is an action which aggregates the elements of the dataset using a function func (which takes two arguments and returns one); we can use reduce on a single RDD.
reduceByKey, when called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V, V) => V.
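A minimal PySpark sketch contrasting the two (the sample data is made up):
from pyspark import SparkContext

sc = SparkContext("local", "reduce-vs-reduceByKey")

nums = sc.parallelize([1, 2, 3, 4])
total = nums.reduce(lambda a, b: a + b)         # action: returns 10 to the driver

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
by_key = pairs.reduceByKey(lambda a, b: a + b)  # transformation: result is still an RDD
print(by_key.collect())                         # [('a', 4), ('b', 2)], order may vary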
This is from Qt Assistant:
reduce(f): Reduces the elements of this RDD using the specified
commutative and associative binary operator. Currently reduces
partitions locally.
reduceByKey(func, numPartitions=None, partitionFunc=<function portable_hash>):
Merge the values for each key using an associative and commutative reduce
function.

Quadtree object movement

So I need some help brainstorming, from a theoretical standpoint. Right now I have some code that just draws some objects. The objects lie in the leaves of a quadtree. Now as the objects move I want to keep them placed in the correct leaf of the quadtree.
Right now I am just reconstructing the quadtree over the objects after I change their positions. I was trying to figure out a way to correct the tree without rebuilding it completely. All I can think of is having a bunch of pointers to adjacent leaf nodes.
Does anyone have an idea of how to figure out the node into which an object moves without just having a ton of pointers everywhere, or a link to articles on this? All I could find was different ways to build the quadtree, nothing about updating it.
If I understand your question, you want some way of mapping between spatial coordinates and leaves of the quadtree.
Here's one possible solution I've been looking at:
For simplicity, let's do the 1D case first, and let's assume we have 32 grid points in x. Every grid point then corresponds to some leaf of a depth-five binary tree (depth 0 = the whole grid, depth 1 = 2 points, depth 2 = 4 points, ..., depth 5 = 32 points).
Each leaf could be represented by the branch indices leading to the leaf. At each level there are two branches we can label A and B. So, a particular leaf might be labeled BBAAB, which would mean, go down the B branch, then the B branch, then the A branch, then the B branch and then the B branch.
So, how do you map e.g. BBABB to an x grid point between 0..31? Just convert it to binary, so that BBABB->11011 = 27. Thus, the mapping from gridpoint to leaf-node is simply a matter of translating the letters A and B into 0s and 1s and then interpreting the result as a binary number.
For the 2D case, it's only slightly more complicated. Now we have four branches from each node, so we can label each branch path using a four-letter alphabet, e.g. starting from the root and taking the 3rd branch and then the fourth branch and then the first branch and then the second branch and then the second branch again we would generate the string CDABB.
Now to convert the string (e.g. 'CDABB') into a pair of gridvalues (x,y).
Let's assume A is lower-left, B is lower right, C is upper left and D is upper right. Then, symbolically, we could write, A.x=0, A.y=0 / B.x=1, B.y=0 / C.x=0, C.y=1 / D.x=1, D.y=1.
Taking the example CDABB, we first look at its x values (CDABB).x = (01011), which gives us the x grid point. And similarly for y.
Finally, if you want to find out e.g. the node immediately to the right of CDABB, then simply convert it to a pair of binary numbers in x and y, add +1 to the x value and convert the new pair of binary numbers back into a string.
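A small Python sketch of this encoding (the helper names are mine, following the A/B/C/D convention above):
# Each branch label maps to its (x, y) bit, per the convention above.
BITS = {"A": (0, 0), "B": (1, 0), "C": (0, 1), "D": (1, 1)}
LABEL = {v: k for k, v in BITS.items()}

def path_to_xy(path):
    # Convert a branch string like "CDABB" to (x, y) grid coordinates.
    x = y = 0
    for c in path:
        bx, by = BITS[c]
        x = (x << 1) | bx
        y = (y << 1) | by
    return x, y

def xy_to_path(x, y, depth):
    # Convert (x, y) back to a branch string of the given depth.
    return "".join(LABEL[((x >> i) & 1, (y >> i) & 1)]
                   for i in range(depth - 1, -1, -1))

print(path_to_xy("CDABB"))        # (11, 24)
print(xy_to_path(11 + 1, 24, 5))  # "CDBAA", the neighbour to the right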
I'm sure this has all been discovered, but I haven't yet found this information on the web.
If you have the spatial data necessary to insert an element into the quad-tree in the first place (ex: its point or rectangle), then you have the same data needed to remove it.
An easy way is before you move an element, remove it from the quad-tree using the same data you used to originally insert it, then move it, then re-insert.
Removal from the quad-tree can first remove the element from the leaf node(s), then if the leaf nodes become empty, remove them from their parents. If the parents become empty, remove them from their parents, and so forth.
This simple method is efficient enough for a complex world of objects moving every frame as long as you implement the quad-tree efficiently (ex: use a free list for the nodes). There shouldn't have to be a heap allocation on a per-node basis to insert it, nor a heap deallocation involved in removing every single node. Most node allocations/deallocations should be a simple constant-time operation just involving, say, the manipulation of a couple of integers or pointers.
You can also make this a little more complex if you like. You can start off storing the previous position of an object and then move it. If the new position occupies nodes other than the previous position, then remove the object from the nodes it no longer occupies and insert it to the new ones. Otherwise just keep it in the same node(s).
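A sketch of that move routine (the quad-tree API names here are hypothetical):
def move(tree, obj, new_rect):
    # Simple version: remove using the same bounds that were used to
    # insert, then re-insert with the new bounds. The node-comparison
    # refinement described above can skip this work when the occupied
    # nodes don't change.
    tree.remove(obj, obj.rect)
    obj.rect = new_rect
    tree.insert(obj, obj.rect)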
Update
I usually try to avoid linking my previous answers, but in this case I ended up doing a pretty comprehensive write up on the topic which would be hard to replicate anywhere else. Here it is: https://stackoverflow.com/a/48330314/4842163

Update the quantile for a dataset when a new datapoint is added

Suppose I have a list of numbers and I've computed the q-quantile (using Quantile).
Now a new datapoint comes along and I want to update my q-quantile, without having stored the whole list of previous datapoints.
What would you recommend?
Perhaps it can't be done exactly without, in the worst case, storing all previous datapoints.
In that case, can you think of something that would work well enough?
One idea I had, if you can assume normality, is to use the inverse CDF instead of the q-quantile.
Keep track of the sample mean and variance as you go; then you can compute InverseCDF[NormalDistribution[sampleMean, Sqrt[sampleVariance]], q] (note that NormalDistribution takes the standard deviation, not the variance), which should be the value such that a fraction q of the values are smaller, which is what the q-quantile is.
(I see belisarius was thinking along the same lines.
Here's the link he pointed to: http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#On-line_algorithm )
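A Python sketch of this running estimate, using the online mean/variance update from the linked article (the sample stream is made up):
from statistics import NormalDist

# Welford's online algorithm for the running mean and variance.
n, mean, m2 = 0, 0.0, 0.0
for x in [12.0, 15.0, 9.0, 14.0, 11.0]:
    n += 1
    delta = x - mean
    mean += delta / n
    m2 += delta * (x - mean)

variance = m2 / (n - 1)                    # sample variance
q = 0.9
estimate = NormalDist(mean, variance ** 0.5).inv_cdf(q)
print(estimate)  # approximate q-quantile under the normality assumption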
Unless you know that your underlying data comes from some distribution, it is not possible to update arbitrary quantiles without retaining the original data. You can, as others suggested, assume that the data has some sort of distribution and store the quantiles this way, but this is a rather restrictive approach.
Alternately, have you thought of programming this somewhere besides Mathematica? For example, you could create a class for your datapoints that contains (1) the double value and (2) a timestamp for when the data came in. In a SortedList of these datapoint objects (which compares based on value), you could get the quantile very fast by simply referencing an index into the list. Want a historical quantile? Simply filter on the timestamps in your sorted list.
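A Python sketch of that sorted-structure idea (the nearest-rank indexing is one of several reasonable quantile conventions):
import bisect

values = []                      # kept sorted at all times

def add_point(v):
    bisect.insort(values, v)     # O(log n) search, O(n) insert into the list

def quantile(q):
    # Nearest-rank quantile over the sorted values.
    idx = min(int(q * len(values)), len(values) - 1)
    return values[idx]

for v in [3.0, 1.0, 4.0, 1.5, 9.0]:
    add_point(v)

print(quantile(0.5))             # 3.0, the median of the five values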
