Python: How is a ~ used to exclude data? - python-3.x

In the below code I know that it is returning all records that are outside of the buffer, but I'm confused as to the mechanics of how that is happening.
I see that a "~" (bitwise NOT) is being used. From some googling, my understanding is that ~ returns the inverse of each bit of its input, e.g. a 0 bit becomes a 1. Is this correct? If not, could someone please ELI5?
Could someone please explain the actual mechanics of how the below code is returning records that are outside of the "my_union" buffer?
NOTE: hospitals and collisions are just geo dataframes.
coverage = gpd.GeoDataFrame(geometry=hospitals.geometry).buffer(10000)
my_union = coverage.geometry.unary_union
outside_range = collisions.loc[~collisions["geometry"].apply(lambda x: my_union.contains(x))]

~ does indeed perform a bitwise NOT in Python. Here, however, it is applied to a pandas Series of booleans, where it performs an element-wise logical NOT. See this answer for an example.
Let's assume the collisions GeoDataFrame contains points, but it will work similarly for other types of geometries.
Let me further change the code a bit:
coverage = gpd.GeoDataFrame(geometry=hospitals.geometry).buffer(10000)
my_union = coverage.geometry.unary_union
within_my_union = collisions["geometry"].apply(lambda x: my_union.contains(x))
outside_range = collisions.loc[~within_my_union]
Then:
my_union is a single (Multi)Polygon.
my_union.contains(x) returns a boolean indicating whether the point x is within the my_union MultiPolygon.
collisions["geometry"] is a pandas Series containing the points.
collisions["geometry"].apply(lambda x: my_union.contains(x)) will run my_union.contains(x) on each of these points.
This will result in another pandas Series containing booleans, indicating whether each point is within my_union.
~ then negates these booleans, so that the Series now indicates whether each point is not within my_union.
collisions.loc[~within_my_union] then selects all the rows of collisions where the entry in ~within_my_union is True, i.e. all the points that don't lie within my_union.
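To make the difference concrete, here is a minimal sketch that contrasts ~ on plain integers with ~ on a boolean Series. The four-row DataFrame and the hand-written mask are made up for illustration and stand in for the real GeoDataFrame and the result of the contains() calls:

```python
import pandas as pd

# On plain Python integers, ~ flips every bit (two's complement), so ~x == -x - 1.
print(~0)  # -1
print(~5)  # -6

# Hypothetical stand-in for the collisions GeoDataFrame.
collisions = pd.DataFrame({"id": [1, 2, 3, 4]})

# Hypothetical stand-in for the result of .apply(lambda x: my_union.contains(x)).
within_my_union = pd.Series([True, False, True, False])

# On a boolean Series, ~ is an element-wise logical NOT.
not_within = ~within_my_union

# .loc with a boolean mask keeps only the rows where the mask is True.
outside_range = collisions.loc[not_within]
print(outside_range)
```

pandas overloads the ~ operator for Series, which is why the same symbol behaves as a logical NOT here rather than producing negative integers.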

I'm not sure exactly what you mean by the actual mechanics, and it's hard to know for sure without seeing input and output, but here is my attempt at explaining it, in case it helps:
Every row of the collisions dataframe whose geometry is contained in my_union is excluded from the newly created outside_range dataframe.


Reduce Time complexity

Question at hand : Complete the function minimumSwaps in the editor below. It must return an integer representing the minimum number of swaps to sort the array.
My Approach:
def minimumSwaps(arr):
    count = 0
    temp = [None]*len(arr)
    res1=sorted(arr)
    while(res1!=arr):
        for i in range(int(len(arr))):
            if(res1[i]!=arr[i]):
                y=res1.index(arr[i])
                arr[y] , arr[i]=arr[i] , arr[y]
                count = count +1
    return count
The code does give the required output for the majority of cases, but fails a few with a "time limit exceeded" error. Could someone suggest a few changes to reduce the time complexity and make the code more efficient? If possible, please try not to change the code in its entirety; I want to learn how to make code more efficient rather than try a whole new approach.
Link to one of the huge test cases
To me, this is a graph problem. Maybe a simpler solution is possible, but I don't think so.
You can observe that to get the minimum number of swaps, you just have to move every element into its sorted position. You can figure out where each element is supposed to be by sorting and building a lookup (an array indexed by element, or a dictionary) from each element to its sorted index.
Now, build a graph by making each item its own node, and connecting with a directed edge to the place it needs to be. We can observe that for a cycle of length k, we will need k-1 swaps to solve it. This is because we just need to swap each item forward, but the last swap actually solves two items rather than one. Thus, the answer is the sum of k-1 for each cycle, which can be reduced to n-c where c is the number of cycles.
To see why this works, consider the case of [2,3,1]. The sorted version of this array is [1,2,3]. Now, build the graph, where index 0 points to index 1 (since 2 needs to be in index 1), index 1 points to index 2, and index 2 points to index 0. We can run a search algorithm through the graph and find the number of cycles or components, and find that there is 1 cycle of length 3. So, the answer we produce is 3-1 = 2. As we can observe, this is indeed correct.
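The problem gets a little more complicated if the array can contain duplicates, but it's not so bad; you'd just have to think a little harder. Maybe this isn't the intended solution, but the cycle counting itself is O(n). The sort needed to find the target positions makes it O(n log n) overall, unless the values are known to be a permutation of 1..n, in which case each target index can be read directly from the value. Best of luck!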
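The cycle-counting idea described above can be sketched as follows. This is one possible implementation, not necessarily the intended one; it assumes all elements are distinct, and the name minimum_swaps and the helper arrays are chosen here for illustration:

```python
def minimum_swaps(arr):
    """Minimum number of swaps needed to sort arr (elements assumed distinct)."""
    n = len(arr)

    # order[k] is the current index of the k-th smallest element.
    order = sorted(range(n), key=arr.__getitem__)

    # target[i] is the index where the element currently at i must end up.
    target = [0] * n
    for sorted_pos, current_idx in enumerate(order):
        target[current_idx] = sorted_pos

    # Count the cycles of the permutation i -> target[i].
    visited = [False] * n
    cycles = 0
    for i in range(n):
        if visited[i]:
            continue
        cycles += 1
        j = i
        while not visited[j]:
            visited[j] = True
            j = target[j]

    # A cycle of length k costs k - 1 swaps, so the total is n - cycles.
    return n - cycles


print(minimum_swaps([2, 3, 1]))  # 2
```

Unlike the original loop, this never calls list.index inside a loop, which is what made the original approach quadratic.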

Point H/V capped at value 32768 pre-scaling

The Point H/V properties of a given Path in a PDF seem to be capped at a value of 32768 before scaling by a matrix transform. I'm trying to read the Point information for certain PDFs, where the library seems to be limiting the Point data incorrectly. When I transform the point using the transform matrix of the associated element, the matrix appears to be applied to the capped value rather than the true underlying value. A given point might have an H or V value greater than 32768, where the scaling value might be something like 0.006.
Is there a way to access a Point with a H or V value greater than 32768 before it is scaled? Or even getting the correct scaled value would be okay.
I'm seeing this behavior in version 15.0.4PlusP4k and other 15.0.4.x versions.
Yes, there are two interfaces to access the points in a path. PDEPathGetData returns a list of ASFixed values, and hence is limited to the ASFixed range (as you found). PDEPathGetDataEx (PDEPath path, ASReal *Data, int DataLen) will return the same array of Points, but they will not be limited to the ASFixed range.
Also, I should point out that Datalogics Support is always available to answer these types of questions from customers, both online and by phone.
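To see why the ASFixed range produces a cap near 32768, here is a rough Python model of a signed 32-bit fixed-point number with 16 integer and 16 fractional bits. The helper names are hypothetical and this is not the library's code; in particular, clamping out-of-range values (rather than wrapping) is an assumption made to match the capping described in the question:

```python
FRACTION_BITS = 16
INT32_MIN = -2**31
INT32_MAX = 2**31 - 1


def to_fixed_16_16(value):
    """Convert a real number to a 16.16 fixed-point integer (clamping is an assumption)."""
    raw = round(value * (1 << FRACTION_BITS))
    return max(INT32_MIN, min(INT32_MAX, raw))


def from_fixed_16_16(raw):
    """Convert a 16.16 fixed-point integer back to a float."""
    return raw / (1 << FRACTION_BITS)


true_h = 40000.0  # illustrative coordinate beyond the fixed-point range
scale = 0.006     # scaling value mentioned in the question

capped_h = from_fixed_16_16(to_fixed_16_16(true_h))
print(capped_h)          # just under 32768
print(capped_h * scale)  # scaled from the capped value
print(true_h * scale)    # scaled from the true value
```

Values inside the range round-trip unchanged, so only coordinates beyond roughly ±32768 are affected, which is consistent with reading the same path through the ASReal-based call instead.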

How can I access values outside of Spark GraphX .map loop?

Brand new to Apache Spark and I'm a little confused how to make updates to a value that sits outside of a .mapTriplets iteration in GraphX. See below:
def mapTripletsMethod(edgeWeights: Graph[Int, Double], stationaryDistribution: Graph[Double, Double]) = {
  val tempMatrix: SparseDoubleMatrix2D = graphToSparseMatrix(edgeWeights)
  stationaryDistribution.mapTriplets{ e =>
    val row = e.srcId.toInt
    val column = e.dstId.toInt
    var cellValue = -1 * tempMatrix.get(row, column) + e.dstAttr
    tempMatrix.set(row, column, cellValue) // this doesn't do anything to tempMatrix
    e
  }
}
I'm guessing this is due to the design of an RDD and there's no simple way to update the tempMatrix value. When I run the above code the tempMatrix.set method does nothing. It was rather difficult to try to follow the problem in the debugger.
Does anyone have an easy solution? Thank you!
Edit
I've made an update above to show that stationaryDistribution is a graph RDD.
You could make tempMatrix be of type RDD[((Int,Int), Double)] -- that is, each entry is a pair where the first element is in turn a (row,col) pair. Then use the PairRDDFunctions class to combine that with ((row,col),weight) triplets generated by your mapTriplets call. (So, don't think of it as updating the tempMatrix, but rather combining two RDDs to get a third.)
If you need to support stationary distribution graphs where there is more than one edge per vertex pair it gets a little tricky: you'll probably need to combine those edges in a reduction pass to create an RDD with one entry per pair, with a list of weights, and then apply all the weights to a given (row,col) pair at the same time. Otherwise it's very simple.
Notice that PairRDDFunctions gives you ways either to combine multiple RDDs into one, or to pull the values out into a Map on the master. Assuming that the distribution matrix is large enough to merit an RDD in the first place, I think you should do the whole thing on RDDs.
Another approach is to make the tempMatrix be a GraphRDD too, which may or may not make sense depending on what you're going to do with it next.
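The "combine two keyed collections into a third" idea from the first suggestion can be illustrated without Spark. The sketch below uses plain Python dicts as stand-ins for pair RDDs keyed by (row, col); it is not Spark code, the function name is hypothetical, and treating a missing matrix cell as 0.0 is an assumption about the sparse matrix:

```python
def combine_entries(matrix_entries, triplet_weights):
    """Return a new mapping instead of mutating the old one.

    matrix_entries: dict mapping (row, col) -> value, standing in for the tempMatrix RDD.
    triplet_weights: iterable of ((row, col), dst_attr), standing in for the triplets.
    """
    # Same arithmetic as in the question: -1 * old value + dstAttr.
    updates = {
        key: -1 * matrix_entries.get(key, 0.0) + dst_attr
        for key, dst_attr in triplet_weights
    }
    # Entries without a matching triplet are carried over unchanged.
    return {**matrix_entries, **updates}


old_matrix = {(0, 1): 2.0, (1, 2): 3.0}
triplets = [((0, 1), 0.5), ((2, 0), 1.0)]

new_matrix = combine_entries(old_matrix, triplets)
print(new_matrix)
print(old_matrix)  # unchanged
```

On real RDDs the same shape of computation would be expressed with a keyed join followed by a map, so nothing outside the closure has to be mutated.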

Generating all permutations excluding cyclic rotations in Python

I need to create a (practice) program for currency arbitrage that detects profitable "loops" given a series of exchange rates. So there might be different values for USD->JPY, JPY->USD, USD->EUR, and so on. In order to detect profitability, however, I first need to enumerate all possible loops -- USD->JPY->EUR->USD is one example, but USD->EUR->JPY->USD is a distinct example using the same currencies since it may hit different exchange rates.
If I ignore the last part of the loop, which will always be the same as the origin, it seems to be the case that every currency can only exist at most once in the "best" loop, as if a currency exists more than once it would actually be two different loops (at least one of which would still be profitable).
Similarly, I can ignore loops that are just rotations of already tested loops: USD->JPY->ASD is the same as JPY->ASD->USD.
So, given input like [USD,JPY,EUR,ASD] I need something that would return:
(USD,JPY,EUR,ASD)
(USD,JPY,ASD,EUR)
(USD,EUR,ASD,JPY)
(USD,EUR,JPY,ASD)
(USD,ASD,EUR,JPY)
(USD,ASD,JPY,EUR)
This solution uses the yield from syntax introduced in Python 3.3. Like the built-in itertools.permutations(), this:
Is a generator and does not require storing anything
Will yield an empty tuple if passed length 0
Assumes every item in the permuted object is itself unique
from itertools import permutations

def unique_cyclic_permutations(thing, length):
    if length == 0:
        yield ()
        return
    for x in permutations(thing[1:], length - 1):
        yield (thing[0],) + x
    if length < len(thing):
        yield from unique_cyclic_permutations(thing[1:], length)
The algorithm works by choosing a pivot, fixing it at the beginning, and then permuting the rest of the objects. In the case of a non-full length permutation, there will also be some permutations that don't include the pivot object at all. In this case, the generator recursively calls itself while excluding the original pivot.
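As a quick sanity check, the generator can be run on the currencies from the question and compared against a brute-force filter. The function is reproduced from the answer so the snippet runs on its own; brute_force is a hypothetical helper added here that rotates every permutation to a canonical form and deduplicates (it assumes length is at least 1 and that the items are unique):

```python
from itertools import permutations


def unique_cyclic_permutations(thing, length):
    # Reproduced from the answer above.
    if length == 0:
        yield ()
        return
    for x in permutations(thing[1:], length - 1):
        yield (thing[0],) + x
    if length < len(thing):
        yield from unique_cyclic_permutations(thing[1:], length)


def brute_force(thing, length):
    """Cross-check: rotate each permutation so its earliest-listed item comes first."""
    seen = set()
    for p in permutations(thing, length):
        start = min(range(length), key=lambda i: thing.index(p[i]))
        seen.add(p[start:] + p[:start])
    return seen


currencies = ["USD", "JPY", "EUR", "ASD"]
loops = list(unique_cyclic_permutations(currencies, 4))
for loop in loops:
    print(loop)
```

The full-length case yields the same six loops listed in the question, though not necessarily in the same order, and shorter lengths also include loops that skip USD.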

Update the quantile for a dataset when a new datapoint is added

Suppose I have a list of numbers and I've computed the q-quantile (using Quantile).
Now a new datapoint comes along and I want to update my q-quantile, without having stored the whole list of previous datapoints.
What would you recommend?
Perhaps it can't be done exactly without, in the worst case, storing all previous datapoints.
In that case, can you think of something that would work well enough?
One idea I had, if you can assume normality, is to use the inverse CDF instead of the q-quantile.
Keep track of the sample mean and sample variance as you go; then you can compute InverseCDF[NormalDistribution[sampleMean, Sqrt[sampleVariance]], q] (NormalDistribution takes the standard deviation, not the variance), which should be the value such that a fraction q of the values are smaller, which is what the q-quantile is.
(I see belisarius was thinking along the same lines.
Here's the link he pointed to: http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#On-line_algorithm )
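For readers outside Mathematica, the same idea can be sketched in Python: update the mean and variance online (Welford's algorithm, as described on the linked page) and read the quantile off a fitted normal distribution. The class name is made up for illustration, and the result is only as good as the normality assumption:

```python
import math
from statistics import NormalDist


class RunningNormalQuantile:
    """Approximate quantiles under a normality assumption, storing only three numbers."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def add(self, x):
        # Welford's online update for mean and variance.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self):
        if self.n < 2:
            raise ValueError("need at least two data points")
        return self.m2 / (self.n - 1)

    def quantile(self, q):
        # NormalDist takes the standard deviation, hence the square root.
        return NormalDist(self.mean, math.sqrt(self.variance())).inv_cdf(q)


tracker = RunningNormalQuantile()
for value in [1.0, 2.0, 3.0, 4.0, 5.0]:
    tracker.add(value)

print(tracker.mean, tracker.variance())
print(tracker.quantile(0.5))
```

Each new datapoint costs O(1) time and memory, at the price of returning a model-based estimate rather than the exact sample quantile.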
Unless you know that your underlying data comes from some distribution, it is not possible to update arbitrary quantiles without retaining the original data. You can, as others suggested, assume that the data has some sort of distribution and store the quantiles this way, but this is a rather restrictive approach.
Alternatively, have you thought of programming this somewhere besides Mathematica? For example, you could create a class for your datapoints that contains (1) the Double value and (2) a timestamp for when the data came in. In a SortedList of these datapoint objects (compared by value), you could get the quantile very fast by simply referencing the index of the datapoint. Want to get a historical quantile? Simply filter on the timestamps in your sorted list.
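A minimal Python sketch of that sorted-list approach follows. The class and method names are invented for illustration, and the quantile uses the nearest-rank definition (the element at position ceil(q * n) in sorted order), which is an assumption about which quantile convention is wanted:

```python
import bisect
import math


class TimestampedSample:
    """Keeps (value, timestamp) pairs sorted by value."""

    def __init__(self):
        self._points = []  # sorted list of (value, timestamp) tuples

    def add(self, value, timestamp):
        # Tuples compare by value first, so the list stays sorted by value.
        bisect.insort(self._points, (value, timestamp))

    def quantile(self, q, as_of=None):
        """Nearest-rank q-quantile, optionally using only points with timestamp <= as_of."""
        values = [v for v, t in self._points if as_of is None or t <= as_of]
        if not values:
            raise ValueError("no data points in range")
        rank = max(1, math.ceil(q * len(values)))
        return values[rank - 1]


sample = TimestampedSample()
for timestamp, value in enumerate([5, 1, 3, 4, 2], start=1):
    sample.add(value, timestamp)

print(sample.quantile(0.5))           # current median
print(sample.quantile(0.5, as_of=2))  # median of the first two points only
```

This stores every datapoint, so it trades memory for exact answers; insertion into a Python list is O(n) because of element shifting, while the index lookup itself is O(1) when no timestamp filter is applied.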
