I've implemented a feature that behaves like rolling (with some meaningful differences), but I want to improve its performance.
My question is: let's say I have this array
a = [1, 2, 3, 4, 5]
When I execute a rolling sum over two elements, does pandas first generate this matrix
a = [[NaN, 1], [1, 2], [2, 3], [3, 4], [4, 5]]
and then aggregate by rows?
Does it use plain for loops to generate these windows?
I'd like to know how its rolling system works, because it would really help me improve the performance of some implementations I've got.
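For intuition, the window matrix described above can be built cheaply as a strided view; here is a minimal numpy sketch using sliding_window_view (available since numpy 1.20). Note this only illustrates the windowing concept the question describes; pandas does not materialize such a matrix, its rolling aggregations run in compiled loops over each window.
import numpy as np

a = np.array([1, 2, 3, 4, 5], dtype=float)
window = 2

# Strided view of all complete windows: [[1,2],[2,3],[3,4],[4,5]] (no copy).
windows = np.lib.stride_tricks.sliding_window_view(a, window)
sums = windows.sum(axis=1)                       # [3., 5., 7., 9.]

# pandas pads the first window-1 positions with NaN:
result = np.concatenate([[np.nan] * (window - 1), sums])
print(result)                                    # [nan  3.  5.  7.  9.]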
Related
Working with Python and matplotlib. Let's say, for example, I have the following lists:
A=[[1, 2, 3], [1, 2, 3], [1, 2, 3]]
B=[[4, 2, 6], [3, 2, 1], [5, 1, 4]]
Each row of these lists represents a single scatter plot, A being the x-axis and B being the y-axis. Is there an efficient way of stacking these scatter plots on top of each other into a single scatter plot? I have already tried a for loop:
for i in range(len(A)):
    plt.scatter(A[i], B[i])
It works, but it's a bit slow when working with larger numbers of entries. Is there a more efficient way to do this?
Unless there is a reason to do multiple calls to scatter, I would recommend flattening the lists and doing a single call to plt.scatter like so:
import itertools
import matplotlib.pyplot as plt

A = [[1, 2, 3], [1, 2, 3], [1, 2, 3]]
B = [[4, 2, 6], [3, 2, 1], [5, 1, 4]]

A_flat = list(itertools.chain.from_iterable(A))
B_flat = list(itertools.chain.from_iterable(B))
plt.scatter(A_flat, B_flat)
plt.show()
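If the rows are all the same length, the same flattening can also be done with numpy; an equivalent alternative, assuming numpy is acceptable here:
import numpy as np
import matplotlib.pyplot as plt

A = [[1, 2, 3], [1, 2, 3], [1, 2, 3]]
B = [[4, 2, 6], [3, 2, 1], [5, 1, 4]]

# np.ravel flattens the nested lists in one step.
plt.scatter(np.ravel(A), np.ravel(B))
plt.show()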
How does numpy's matrix class work? I understand it will likely be removed in the future, so I am trying to understand how it works, so I can do the same with ndarrays.
>>> x=np.matrix([[1,1,1],[2,2,2],[3,3,3]])
>>> x[:,0] + x[0,:]
matrix([[2, 2, 2],
        [3, 3, 3],
        [4, 4, 4]])
Seems like a row of ones got added to every row.
>>> x=np.matrix([[1,2,3],[1,2,3],[1,2,3]])
>>> x[0,:] + x[:,0]
matrix([[2, 3, 4],
        [2, 3, 4],
        [2, 3, 4]])
Now it seems like a column of ones got added to every column. What it does with the identity is even weirder:
>>> x=np.matrix([[1,0,0],[0,1,0],[0,0,1]])
>>> x[0,:] + x[:,0]
matrix([[2, 1, 1],
        [1, 0, 0],
        [1, 0, 0]])
EDIT:
It seems that if you take an (N,1)-shaped matrix and add it to a (1,N)-shaped matrix, one of these is replicated to form an (N,N) matrix and the other is added to every row or column of this new matrix. It seems to be a convenience restricted to vectors of the right sizes. A nice use case was networkx's implementation of Floyd-Warshall.
Is there an equivalently convenient one-liner for this using standard numpy ndarrays?
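For what it's worth, plain ndarrays replicate the same way through ordinary broadcasting, as long as the slices keep both dimensions; a minimal sketch:
import numpy as np

x = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]])

# Keep the sliced axes so the shapes are (3, 1) and (1, 3);
# broadcasting then replicates each across the other, as np.matrix did.
col = x[:, 0:1]   # shape (3, 1); x[:, [0]] or x[:, 0][:, None] also work
row = x[0:1, :]   # shape (1, 3)
print(col + row)
# [[2 1 1]
#  [1 0 0]
#  [1 0 0]]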
Assuming I have the following list of lists:
[[1, ["test"]], [2, ["array", "new"]], [3, ["apple"]], [4, ["balls"]]]
What is the most efficient way to sort this list so that the entries are grouped by the length of the inner string list and then alphabetically, assuming each inner list of strings is already sorted alphabetically? Something like:
[[3, ["apple"]], [4, ["balls"]], [1, ["test"]], [2, ["array", "new"]]]
I was thinking of using a radix sort, but I am unsure how to call radix sort to compare multiple lists.
l = [[1, ["test"]], [2, ["array", "new"]], [3, ["apple"]], [4, ["balls"]]]
print(sorted(l, key=lambda item: (len(item[1]), item[1])))
Output
[[3, ['apple']], [4, ['balls']], [1, ['test']], [2, ['array', 'new']]]
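To see why the tuple key produces that order, it can help to print the key each element gets; tuples compare field by field, so the list length decides first and the string lists break ties alphabetically:
l = [[1, ["test"]], [2, ["array", "new"]], [3, ["apple"]], [4, ["balls"]]]
for item in l:
    # key = (number of strings, the list itself, compared element-wise)
    print(item, '->', (len(item[1]), item[1]))
Since sorted is a stable O(n log n) Timsort, no radix sort is needed here.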
Complexity of functions in Python:
https://www.ics.uci.edu/~pattis/ICS-33/lectures/complexitypython.txt
Say we have two matrices A and B, each of size 2 by 2. Is there a command that can stack them horizontally and add A[:,1] to B[:,0], so that the resulting matrix C is 2 by 3, with C[:,0] = A[:,0], C[:,1] = A[:,1] + B[:,0], C[:,2] = B[:,1]? One step further, stacking them on the diagonal so that C[0:2,0:2] = A, C[1:3,1:3] = B, and the overlap is summed: C[1,1] = A[1,1] + B[0,0]. C is 3 by 3 in this case. Hard-coding this routine is not hard, but I'm just curious, since MATLAB has a similar function if my memory serves me well.
A straightforward approach is to copy or add the two arrays into a target:
In [882]: A=np.arange(4).reshape(2,2)
In [883]: C=np.zeros((2,3),int)
In [884]: C[:,:-1]=A
In [885]: C[:,1:]+=A # or B
In [886]: C
Out[886]:
array([[0, 1, 1],
       [2, 5, 3]])
Another approach is to pad A at the end, pad B at the start, and sum; while there is a convenient pad function, it won't be any faster.
And for the diagonal
In [887]: C=np.zeros((3,3),int)
In [888]: C[:-1,:-1]=A
In [889]: C[1:,1:]+=A
In [890]: C
Out[890]:
array([[0, 1, 0],
       [2, 3, 1],
       [0, 2, 3]])
Again, the two arrays could be padded and added.
I'm not aware of any specialized function to do this; even if there were, it probably would do the same thing. This isn't a common enough operation to justify a compiled version.
I have built up finite element sparse matrices by adding overlapping element matrices. The sparse formats for both MATLAB and scipy facilitate this (duplicate coordinates are summed).
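As a small illustration of that duplicate-summing behavior, here is a sketch with scipy.sparse, rebuilding the 3 by 3 diagonal case from above:
import numpy as np
from scipy import sparse

A = np.arange(4).reshape(2, 2)

# List A's entries twice, once in the top-left block and once shifted
# down-right by one; the duplicate coordinate (1, 1) is summed.
rows = [0, 0, 1, 1, 1, 1, 2, 2]
cols = [0, 1, 0, 1, 1, 2, 1, 2]
data = np.concatenate([A.ravel(), A.ravel()])
C = sparse.coo_matrix((data, (rows, cols)), shape=(3, 3))
print(C.toarray())
# [[0 1 0]
#  [2 3 1]
#  [0 2 3]]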
============
In [896]: np.pad(A,[[0,0],[0,1]],mode='constant') + np.pad(A,[[0,0],[1,0]],mode='constant')
Out[896]:
array([[0, 1, 1],
       [2, 5, 3]])
In [897]: np.pad(A,[[0,1],[0,1]],mode='constant') + np.pad(A,[[1,0],[1,0]],mode='constant')
Out[897]:
array([[0, 1, 0],
       [2, 3, 1],
       [0, 2, 3]])
What's the special MATLAB code for doing this?
In Octave I found (the dimension is the positional fourth argument):
prepad(A,3,0,2) + postpad(A,3,0,2)
In Apache Spark there is an RDD.top() API, which can return the top k elements from an RDD. I'd like to know how this operation is implemented. Does it first sort the RDD and then return the top k values? Or does it use some other, more efficient implementation?
No, it doesn't sort the whole RDD; that operation would be too expensive.
Rather, it selects the top k elements from each partition separately using a priority queue, and then these queues are merged together in the reduce operation. That means only a small part of the whole RDD is ever shuffled across the network.
See RDD.scala for more details.
Example:
3 input partitions
RDD.top(2)
[3, 5, 7, 10]   [8, 6, 4, 12]   [9, 1, 2, 11]
      ||              ||              ||
   [10, 7]         [12, 8]         [11, 9]
================== reduce ==================
                  [12, 11]
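A minimal Python sketch of the same idea, using heapq as the bounded priority queue (an illustration of the algorithm, not Spark's actual Scala implementation):
import heapq

def rdd_top(partitions, k):
    # Map side: each partition independently keeps only its k largest.
    per_partition = [heapq.nlargest(k, p) for p in partitions]
    # Reduce side: merge the small per-partition results, keep k overall.
    return heapq.nlargest(k, (x for part in per_partition for x in part))

partitions = [[3, 5, 7, 10], [8, 6, 4, 12], [9, 1, 2, 11]]
print(rdd_top(partitions, 2))  # [12, 11]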