How did Apache Spark implement its topK() API? - apache-spark

In Apache Spark there is an RDD.top() API, which can return the top k elements from an RDD. I'd like to know how this operation is implemented. Does it first sort the RDD and then return the top k values, or does it use some other, more efficient implementation?

No, it doesn't sort the whole RDD; that operation would be too expensive.
Instead, it selects the top N elements of each partition separately using a priority queue, and these queues are then merged in a reduce step. That means only a small part of the whole RDD is shuffled across the network.
See RDD.scala for more details.
Example:
RDD.top(2) with 3 input partitions:

partitions:           [3, 5, 7, 10]   [8, 6, 4, 12]   [9, 1, 2, 11]
per-partition top 2:  [10, 7]         [12, 8]         [11, 9]
after reduce:         [12, 11]
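A minimal Python sketch of the same strategy (an illustration only, not Spark's actual Scala code; the answer above describes per-partition priority queues merged in a reduce, which is approximated here with heapq):

import heapq

def top_k(partitions, k):
    # "map" side: each partition independently keeps only its k largest
    # elements, so no partition ever ships more than k values.
    per_partition_tops = [heapq.nlargest(k, part) for part in partitions]
    # "reduce" side: merge the small per-partition results, again keeping k.
    merged = []
    for tops in per_partition_tops:
        merged = heapq.nlargest(k, merged + tops)
    return merged

partitions = [[3, 5, 7, 10], [8, 6, 4, 12], [9, 1, 2, 11]]
print(top_k(partitions, 2))  # [12, 11]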

Related

HuggingFace transformers - encoding long input with context

I am using a BERT-like model, which has a limit on input length.
I am looking to encode a long input and feed it into BERT.
The most common solution I know of is a sliding window that adds context to the input's segments.
For example:
model_max_size = 5
stride = 2
input = [1, ..., 12]
output = [
[1, 2, 3, 4, 5], -> [1, 2, 3, 4, 5]
[4, 5, 6, 7, 8], -> [6, 7, 8]
[7, 8, 9, 10, 11], -> [9, 10, 11]
[10, 11, 12] -> [12]
]
Is there a known good strategy?
Do you feed each input into consecutive windows and average their outputs?
Is there any already built-in implementation for this?
The HuggingFace tokenizer has the stride and return_overflowing_tokens features, but that's not quite it, as it works only for the first sliding window.
*I know there are other models that accept longer input (e.g. Longformer, BigBird, etc.), but I need to use this specific one.
Thanks!
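For reference, a minimal sketch of the sliding-window chunking described in the question (plain Python over token ids with the same numbers as the example; how to pool the overlapping model outputs afterwards is a separate choice):

def sliding_windows(tokens, max_size, overlap):
    # Split a long token sequence into windows of at most max_size tokens,
    # where each window repeats `overlap` tokens of left context.
    step = max_size - overlap
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + max_size])
        if start + max_size >= len(tokens):
            break
    return windows

tokens = list(range(1, 13))  # [1, ..., 12]
print(sliding_windows(tokens, max_size=5, overlap=2))
# [[1, 2, 3, 4, 5], [4, 5, 6, 7, 8], [7, 8, 9, 10, 11], [10, 11, 12]]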

Using multiple filters on multiple columns of a numpy array - more efficient way?

I have the following 2 arrays:
import numpy as np

arr = np.array([[1, 2, 3, 4],
                [5, 6, 7, 8],
                [7, 5, 6, 3],
                [2, 4, 8, 9]])
ids = np.array([6, 5, 7, 8])
Each row in the array arr describes a 4-digit id; there are no redundant ids, neither in their values nor in their combination. So if [1, 2, 3, 4] exists, no other combination of these 4 digits can exist. This will be important in a second.
The array ids contains a 4-digit id, but its order might not match. Now I need to go through each row of arr and check whether this id exists. In this example ids matches the 2nd row from the top of arr, i.e. arr[1, :].
My current solution creates a filter for each column to check whether the values of ids exist in any of the 4 columns. After that I apply these filters to arr. This seems way too complicated.
So I pretty much do this:
filter_1 = np.in1d(arr[:, 0], ids)
filter_2 = np.in1d(arr[:, 1], ids)
filter_3 = np.in1d(arr[:, 2], ids)
filter_4 = np.in1d(arr[:, 3], ids)
result = arr[filter_1 & filter_2 & filter_3 & filter_4]
Does anyone know a simpler solution? Maybe using generators?
Use np.isin across all of arr and reduce with all along each row to get the result:
In [15]: arr[np.isin(arr, ids).all(1)]
Out[15]: array([[5, 6, 7, 8]])
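A self-contained version of that one-liner, for completeness (it relies on the question's guarantee that the same 4 digits never appear in more than one row):

import numpy as np

arr = np.array([[1, 2, 3, 4],
                [5, 6, 7, 8],
                [7, 5, 6, 3],
                [2, 4, 8, 9]])
ids = np.array([6, 5, 7, 8])

# np.isin builds an elementwise membership mask; .all(axis=1) keeps rows
# whose four entries are all contained in ids, regardless of order.
mask = np.isin(arr, ids).all(axis=1)
print(arr[mask])             # [[5 6 7 8]]
print(np.flatnonzero(mask))  # matching row index: [1]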

Does Scipy recognize the special structure of this matrix to decompose it faster?

I have a matrix many of whose rows are already in upper triangular form. I would like to ask whether the command scipy.linalg.lu recognizes this special structure to decompose it faster. If I decompose this matrix on paper, I only apply Gaussian elimination to the rows that are not yet in upper triangular form. For example, I only need to make transformations on the last row of matrix B.
import numpy as np
A = np.array([[2, 5, 8, 7, 8],
              [5, 2, 2, 8, 9],
              [7, 5, 6, 6, 10],
              [5, 4, 4, 8, 10]])
B = np.array([[2, 5, 8, 7, 8],
              [0, 2, 2, 8, 9],
              [0, 0, 6, 6, 10],
              [5, 4, 4, 8, 10]])
Because my square matrix is of very large dimension and this procedure is repeated thousands of times, I would like to make use of this special structure to reduce the computational cost.
Thank you so much for your elaboration!
Not automatically.
You'll need to exploit the structure yourself if you want to. Whether you can make it faster than the built-in implementation depends on many factors (the number of zeros, etc.).
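For illustration, a minimal sketch showing that the dense routine is called the same way regardless of the structure (exploiting the already-triangular rows would require your own elimination code, and whether that pays off is exactly the trade-off mentioned above):

import numpy as np
from scipy.linalg import lu

# B is already upper triangular except for its last row, but
# scipy.linalg.lu still performs a full dense LU with partial pivoting.
B = np.array([[2., 5., 8., 7., 8.],
              [0., 2., 2., 8., 9.],
              [0., 0., 6., 6., 10.],
              [5., 4., 4., 8., 10.]])
P, L, U = lu(B)
print(np.allclose(P @ L @ U, B))  # True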

How does pandas perform rolling windows internally?

I've implemented a feature which behaves like rolling (with some meaningful differences), but I want to improve its performance.
My question is: let's say I have this array
a = [1, 2, 3, 4, 5]
When I execute a rolling sum over windows of two elements, does pandas first generate this matrix
a = [[NaN, 1], [1, 2], [2, 3], [3, 4], [4, 5]]
and then aggregate by rows?
Does it use plain for loops to generate these windows?
I'd like to know how its rolling system works, because it would really help me improve the performance of some implementations I have.
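For reference, a minimal example of the rolling sum discussed above (this only shows the user-facing behaviour; as far as I know the common aggregations run in compiled Cython window loops rather than by materializing such a matrix, so the pandas window internals are the place to confirm):

import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])

# Rolling sum over windows of size 2; the first window is incomplete,
# so its result is NaN.
print(s.rolling(window=2).sum().tolist())  # [nan, 3.0, 5.0, 7.0, 9.0]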

Parallel algorithm to check if a sequence is sorted

I need a parallel, cost-optimal algorithm to check whether a given sequence of n numbers is sorted.
For m threads, give each thread a chunk of n/m consecutive numbers with an overlap of 1 number. In each thread, check that the sequence it is assigned is in sorted order. If all subsequences are sorted, then the entire sequence is sorted.
Examples:
[1, 4, 5, 6, 11, 42] => [1, 4, 5, 6*] and [6, 11, 42] with 2 threads
[1, 4, 5, 6, 11, 42] => [1, 4, 5*], [5, 6, 11*] and [11, 42] with 3 threads
* this is the overlap of 1.
This solution runs in O(n/m) time per thread, i.e. O(n) total work.
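A sketch of that chunking in Python (illustrative only: with CPython's GIL these threads won't speed up pure-Python comparisons, but the chunk-with-one-element-overlap idea is the same):

from concurrent.futures import ThreadPoolExecutor

def is_sorted_parallel(seq, m):
    # Each worker scans a chunk of roughly n/m elements that overlaps its
    # right neighbour by one element, so boundary pairs are also checked.
    n = len(seq)
    if n < 2 or m < 2:
        return all(seq[i] <= seq[i + 1] for i in range(n - 1))
    chunk = (n + m - 1) // m  # ceil(n / m)

    def check(start):
        end = min(start + chunk + 1, n)  # +1 is the one-element overlap
        return all(seq[i] <= seq[i + 1] for i in range(start, end - 1))

    with ThreadPoolExecutor(max_workers=m) as pool:
        return all(pool.map(check, range(0, n, chunk)))

print(is_sorted_parallel([1, 4, 5, 6, 11, 42], 2))  # True
print(is_sorted_parallel([1, 4, 5, 3, 11, 42], 3))  # False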
