Use Spark RDD to Find Path Cost

I am using Spark to design a TSP solver. Each element in the RDD is a 3-tuple (id, x, y), where id is the index of a point and (x, y) are its coordinates. Given an RDD storing a sequence of such 3-tuples, how can I evaluate the path cost of this sequence? For example, the sequence (1, 0, 0), (2, 0, 1), (3, 1, 1) gives the cost 1 + 1 = 2 (from the first point to the second point and then to the third point). It seems that to do this I have to know exactly how Spark partitions the sequence (RDD). Also, how can I evaluate the cost between the boundary points of two partitions? Or is there a simple operation that does this for me?

With any parallel processing, you want to put serious thought into what a single data element is, so that only the data that needs to be together is together.
So instead of having every row be a point, it's likely that every row should be the array of points that define a path, at which point calculating the total path length with Spark becomes easy. You'd just use whatever you would normally use to calculate the total length of an array of line segments given the defining points.
But even then it's not clear that we need the full generality of points. For the TSP, a candidate solution is a path that includes all locations, which means that we don't need to store the locations of the cities for every solution, or calculate the distances every time. We just need to calculate one matrix of distances, which we can then broadcast so every Spark worker has access to it, and then lookup the distances instead of calculating them.
(It's actually a permutation of location ids, rather than just a list of them, which can simplify things even more.)
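As a minimal sketch of the idea above, in plain Python rather than Spark: each candidate solution is a permutation of location ids, and its cost is looked up in a precomputed distance matrix. The names `distance_matrix` and `path_cost` are hypothetical. In a Spark job, the matrix would presumably be broadcast once and `path_cost` mapped over an RDD whose rows are whole paths.

```python
import math

def distance_matrix(points):
    """Build a dict-of-dicts of Euclidean distances from {id: (x, y)}."""
    return {
        a: {b: math.dist(pa, pb) for b, pb in points.items()}
        for a, pa in points.items()
    }

def path_cost(path, dist):
    """Sum the looked-up distances between consecutive ids in the path."""
    return sum(dist[a][b] for a, b in zip(path, path[1:]))

# The example from the question: (1, 0, 0), (2, 0, 1), (3, 1, 1)
points = {1: (0, 0), 2: (0, 1), 3: (1, 1)}
dist = distance_matrix(points)
cost = path_cost([1, 2, 3], dist)
```

This computes the cost of an open path, as in the question's example. For a closed tour you would also add the distance from the last id back to the first.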


How to randomly shuffle a population by preserving all properties except one?

A spherical region of space is filled with a specific distribution of smaller spheres of different sizes. Each sphere is associated with some physical properties: position, radius, mass, velocity, and ID, all represented as 1d or 3d numpy arrays. I would like to shuffle this population of spheres in a totally random manner such that every sphere preserves all of its properties except its 3d position array. I have found a similar question here (Randomly shuffle columns except first column), but is there an easy and fast Pythonic way to do this without using a DataFrame?
Thanks for your help.
If you're using pandas, you could just shuffle one column:
df['col'] = df['col'].sample(frac=1).values
This works equally well on any subset of columns, e.g.
cols = ['col1', 'col2']
df[cols] = df[cols].sample(frac=1).values
The two columns are shuffled together, i.e. their respective values remain aligned.
See also this answer.
You can implement a Knuth shuffle; it's quite straightforward.
You can adapt the implementation algorithm to only swap your desired properties.
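A minimal sketch of that adaptation, assuming the properties are stored as parallel sequences indexed by sphere: only the position sequence is permuted, so every other property stays attached to its sphere. The function name `shuffle_positions` and the sample data are hypothetical.

```python
import random

def shuffle_positions(positions, rng=random):
    """In-place Knuth (Fisher-Yates) shuffle applied to positions only."""
    for i in range(len(positions) - 1, 0, -1):
        j = rng.randint(0, i)  # inclusive on both ends
        positions[i], positions[j] = positions[j], positions[i]

# Parallel property lists: index k describes sphere k.
ids = [0, 1, 2, 3]
radii = [0.5, 1.0, 1.5, 2.0]
positions = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]

original_positions = list(positions)
shuffle_positions(positions, random.Random(42))
# ids and radii are untouched; positions now hold the same points in a new order.
```

With numpy arrays the same effect can be had by indexing the position array with a random permutation of row indices, for example one produced by `numpy.random.permutation`.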

Spark SQL window function causes skew in data distribution

The performance of this Spark SQL query is bad due to skew data distribution:
select c.*, coalesce(
OVER (PARTITION BY cid, pid, code
ORDER BY (cTime div (1000*3600))
RANGE BETWEEN 336 PRECEDING and 1 PRECEDING), 0L) as totalRevenue
from records c
I see in the Spark UI that a single task gets stuck, and the cluster fails if I increase the scanned range.
I am using YARN on AWS EMR, with Spark 2.2.0.
How can I overcome this issue?
I can only suggest several approaches for you to investigate. I would actually first try two approaches that don't treat the skew itself:
Try increasing the executor memory per the message. On YARN you may also need to increase the maximum container memory. The default in Spark is, IIRC, 1 GB per executor, and it's not uncommon to need to increase it.
Try switching to the MEMORY_AND_DISK or DISK_ONLY persistence levels. I believe this should work for your query, although it can be hard to eyeball the full query plan.
The reason for this is that at least to my eye your data is fundamentally skewed. You’re setting yourself up for maintenance difficulties if you start reshaping the data to address the skew in specific ways to the current shape of the data because the shape of the data may change over time. In my opinion at least you want to preserve the most straightforward implementation of your query for as long as you can, and only optimize skew issues programmatically if you hit problems with SLA violations, etc.
If those don't work, then you can try to address the skew directly. A simple approach is to create a third column populated with a random number for the column values that are known to be problematic. Do one pass of your summing operation with this column in place, using it as part of the key, then a second pass with the extra random column removed. Alternatively, you can run two queries and concatenate them: one with the random number for the skewed data (which must still be handled in two passes) and another, unaltered query for the non-problematic data.
Edit - compute partial sums through two frames
The fundamentally useful observation here is that addition is commutative and associative. My original proposal based on random numbers won't work but this will. Basically, you want to compute the partial sum of the frame you want in several parts. The easiest way to do this is probably as a set of ranges (two used here for simplicity):
create temporary table partial_revenue_1 as select c.*, coalesce(
OVER (PARTITION BY cid, pid, code
ORDER BY (cTime div (1000*3600))
RANGE BETWEEN 336 PRECEDING and 118 PRECEDING), 0L) as partialTotalRevenue
from records c
create temporary table partial_revenue_2 as select c.*, coalesce(
OVER (PARTITION BY cid, pid, code
ORDER BY (cTime div (1000*3600))
RANGE BETWEEN 117 PRECEDING and 1 PRECEDING), 0L) as partialTotalRevenue
from records c
create temporary table combined_partials as select * from
partial_revenue_1 union all select * from partial_revenue_2
select sum(partialTotalRevenue), first(c.some_col) ... from
combined_partials c group by cid, pid, code
Notice you need to use the first aggregate function to cull the duplicate fields that you will have from the earlier select * operations on the records table. Don't worry, this will be fine since both values came from the same table.
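The arithmetic behind splitting the frame can be checked with a small sketch in plain Python (not Spark SQL). It assumes hypothetical rows of (hour, revenue) for a single (cid, pid, code) partition, and the helper `range_sum` is made up for illustration. Because addition is associative and commutative, the sum over 336..1 hours preceding equals the sum over 336..118 plus the sum over 117..1.

```python
def range_sum(rows, hour, far, near):
    """Sum revenue of rows whose hour lies in [hour - far, hour - near]."""
    return sum(rev for h, rev in rows if hour - far <= h <= hour - near)

# Hypothetical data: one row per hour with an integer revenue.
rows = [(h, h % 7 + 1) for h in range(400)]

current_hour = 399
full = range_sum(rows, current_hour, 336, 1)     # RANGE BETWEEN 336 PRECEDING AND 1 PRECEDING
part1 = range_sum(rows, current_hour, 336, 118)  # first partial frame
part2 = range_sum(rows, current_hour, 117, 1)    # second partial frame
combined = part1 + part2
```

The two partial frames must be disjoint and together cover the original range, otherwise rows are double counted or dropped.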

Spark moving average

I am trying to implement moving average for a dataset containing a number of time series. Each column represents one parameter being measured, while one row contains all parameters measured in a second. So a row would look something like:
timestamp, parameter1, parameter2, ..., parameterN
I found a way to do something like that using window functions, but the following bugs me:
Partitioning Specification: controls which rows will be in the same partition with the given row. Also, the user might want to make sure all rows having the same value for the category column are collected to the same machine before ordering and calculating the frame. If no partitioning specification is given, then all data must be collected to a single machine.
The thing is, I don't have anything to partition by. So can I use this method to calculate moving average without the risk of collecting all the data on a single machine? If not, what is a better way to do it?
Every nontrivial Spark job demands partitioning. There is just no way around it if you want your jobs to finish before the apocalypse. The question is simple: When it comes time to do the inevitable aggregation (in your case, an average), how can you partition your data in such a way as to minimize shuffle by grouping as much related data as possible on the same machine?
My experience with moving averages is with stocks. In that case it's easy: the partition would be on the stock ticker symbol. After all, the calculation of the 50-day moving average for Stock A has nothing to do with that for Stock B, so those data don't need to be on the same machine. The obvious partition makes this simpler than your situation, not to mention that it only requires (probably) one data point per day (the closing price of the stock at the end of trading), while you have one per second.
So I can only say that you need to consider adding a feature to your data set whose sole purpose is to serve as a partition key even if it is irrelevant to what you're measuring. I would be surprised if there isn't one, but if not, then consider a time-based partition on days for example.
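A rough sketch of the time-based variant, in plain Python rather than Spark: rows are bucketed by day (the derived partition key) and a trailing moving average is computed independently inside each bucket. The function name, the window measured in rows, and the sample data are all assumptions for illustration. Note the limitation: windows here do not reach across day boundaries, so the first few averages of each day use fewer points.

```python
from collections import defaultdict

SECONDS_PER_DAY = 86400

def moving_average_by_day(rows, window):
    """rows: iterable of (timestamp_seconds, value).
    Returns {day: [(timestamp, trailing_average), ...]}."""
    buckets = defaultdict(list)
    for ts, value in rows:
        buckets[ts // SECONDS_PER_DAY].append((ts, value))  # day is the partition key

    result = {}
    for day, items in buckets.items():
        items.sort()  # order by timestamp within the bucket
        running = 0.0
        out = []
        for i, (ts, value) in enumerate(items):
            running += value
            if i >= window:
                running -= items[i - window][1]  # drop the value leaving the window
            out.append((ts, running / min(i + 1, window)))
        result[day] = out
    return result

rows = [(0, 1.0), (1, 2.0), (2, 3.0), (3, 4.0), (86400, 10.0), (86401, 20.0)]
averages = moving_average_by_day(rows, window=2)
```

Each bucket can be processed on a different machine because no bucket needs rows from another one.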

Apache Spark RDD sortByKey algorithm and time complexity

What is the Big-O time complexity for Apache Spark RDD sortByKey?
I am trying to assign row numbers to an RDD based on a particular order.
Say I have a {K,V} pair RDD and I wish to perform an order by key using sortByKey.
What is the time complexity for this operation, in big-O form?
And what is happening under-the-covers? Bubble sort? I hope not! My dataset is very large and runs across partitions, so I'm curious whether the sortByKey function is optimal, or does some kind of intermediate data structure within a partition and then something else across partitions to optimize message passing, or what.
A quick look at the code shows that a RangePartitioner is being used under the covers. The docs say:
partitions sortable records by range into roughly equal ranges. The ranges are determined by sampling the content of the RDD passed in
So in essence your data is sampled (O(n)), then only the unique sample keys (m of them) are sorted (O(m log m)) and the key ranges determined, then the entire data set is shuffled around (O(n), but costly), and finally the data is sorted locally within the range of keys received on a given partition (O(p log p), where p is the number of records on that partition).
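The steps above can be imitated in a few lines of plain Python. This is only a toy model of the sample, range-partition, local-sort idea, not Spark's actual implementation; the function name and parameters are hypothetical, and it gathers all keys in one place to draw the sample, which a real cluster would not do.

```python
import bisect
import random

def range_partition_sort(partitions, num_out, sample_size=20, seed=0):
    """Sort keys spread over input partitions into num_out ordered partitions."""
    rng = random.Random(seed)
    all_keys = [k for part in partitions for k in part]  # assumed non-empty

    # 1. Sample the keys and sort only the unique sampled keys.
    sample = sorted(set(rng.sample(all_keys, min(sample_size, len(all_keys)))))

    # 2. Pick num_out - 1 range boundaries from the sorted sample.
    step = len(sample) / num_out
    bounds = [sample[int(step * i)] for i in range(1, num_out)]

    # 3. "Shuffle": send every key to the partition owning its range.
    out = [[] for _ in range(num_out)]
    for part in partitions:
        for k in part:
            out[bisect.bisect_left(bounds, k)].append(k)

    # 4. Sort each output partition locally.
    return [sorted(p) for p in out]

data = [[5, 3, 9, 1], [8, 2, 7], [6, 4, 0]]
sorted_parts = range_partition_sort(data, num_out=3)
flattened = [k for part in sorted_parts for k in part]
```

Reading the output partitions in order yields the globally sorted sequence, because every key in one partition is less than or equal to every key in the next.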
zipWithIndex probably uses local sizes to assign numbers, using the partition number, so it is likely that partition metadata is stored for this effect:
Zips this RDD with its element indices. The ordering is first based on the partition index and then the ordering of items within each partition. So the first item in the first partition gets index 0, and the last item in the last partition receives the largest index.
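The indexing rule quoted above can be sketched in plain Python, assuming partitions are modelled as a list of lists (this is an illustration of the rule, not Spark's code, and `zip_with_index` is a hypothetical name): count the items in each partition, turn the counts into starting offsets, then number items locally from each offset.

```python
from itertools import accumulate

def zip_with_index(partitions):
    """Pair every item with a global index: partition order first, then local order."""
    sizes = [len(p) for p in partitions]
    # Starting index of each partition = total size of all earlier partitions.
    offsets = [0] + list(accumulate(sizes))[:-1]
    return [
        [(item, start + i) for i, item in enumerate(part)]
        for part, start in zip(partitions, offsets)
    ]

indexed = zip_with_index([["a", "b"], [], ["c"]])
```

Only the partition sizes have to be known globally; the numbering itself is local to each partition.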

combining rows/columns from spark data frames by mathematical operation

I have two spark data frames (A and B) with respective sizes a x m and b x m, containing floating point values.
Additionally, each data frame has a column 'ID', that is a string identifier. A and B have exactly the same set of 'ID's (i.e. contain information about the same group of customers.)
I'd like to combine a column of A with a column of B by some function.
More specifically, I'd like to build a scalar product of a column of A with a column of B, with the columns ordered according to the ID.
Even more specifically I'd like to calculate the correlation between columns of A and B.
Performing this operation on all pairs of columns would be the same as a matrix multiplication: A_transposed x B.
However, for now I'm only interested in correlations of a small subset of pairs.
I have two approaches in mind, but I struggle to implement them. (And don't know whether either is possible or advisable, at all.)
(1) Take the column of interest of each data frame and turn each entry into a key-value pair, where the key is the ID. Then apply something like reduceByKey() to the two columns of key-value pairs, followed by a summation.
(2) Take the column of interest of each data frame, sort it by its ID, cast it to an RDD (I haven't figured out how to do this) and simply apply
Statistics.corr(rdd1,rdd2) from pyspark.mllib.stat.
Also I wonder: Is it generally computationally preferable to operate on columns rather than rows (since spark data frames are columnar oriented) or does that make no difference?
Starting from Spark 1.4, if all you need is the Pearson correlation, you could do it like this:
cor = dfA.join(dfB, dfA.ID == dfB.ID, how='inner').select(dfA.value.alias('aval'), dfB.value.alias('bval')).corr('aval', 'bval')
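For reference, here is what the join-then-correlate step computes, as a minimal plain-Python sketch with no Spark involved. It assumes each column of interest is available as a hypothetical {ID: value} mapping; `pearson_by_id` is a made-up helper name.

```python
import math

def pearson_by_id(a, b):
    """Pearson correlation of two {id: value} mappings, aligned on shared ids."""
    ids = sorted(a.keys() & b.keys())  # inner join on the ID
    xs = [a[i] for i in ids]
    ys = [b[i] for i in ids]
    n = len(ids)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

col_a = {"c1": 1.0, "c2": 2.0, "c3": 3.0}
col_b = {"c3": 6.0, "c1": 2.0, "c2": 4.0}  # same customers, different row order
corr = pearson_by_id(col_a, col_b)
```

Aligning on the ID first is what makes the row order of the two inputs irrelevant, which is the role the join plays in the DataFrame version.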
