Easiest way to get Pandas rolling window of values - python-3.x

I have a dataset. I want a window of 5 values. Does pandas have a native function that will give me a rolling window of 5 values until there are no longer 5 values that it can use? I want these to be rows.
I also want the new label to be the middle of the 5 values.
Input DataFrame
first label
0 1 0
1 2 1
2 3 2
3 4 3
4 5 4
5 6 5
Output DataFrame desired:
first label
0 [1, 2, 3, 4, 5] 2
1 [2, 3, 4, 5, 6] 3
I have tried using the .rolling function and haven't been successful.

You can build the windows with NumPy strides, then take the position of the middle value of each window and use NumPy indexing to pick the label:
import numpy as np
import pandas as pd
def rolling_window(a, window):
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
a = rolling_window(df['first'].to_numpy(), 5)
print (a)
[[1 2 3 4 5]
 [2 3 4 5 6]]
#get positions of middle value
i = rolling_window(np.arange(len(df)), 5)[:, 2]
print (i)
[2 3]
df = pd.DataFrame({'first': a.tolist(),
                   'label': df['label'].to_numpy()[i]})
print (df)
first label
0 [1, 2, 3, 4, 5] 2
1 [2, 3, 4, 5, 6] 3
You can optimize the code further by computing the strides only once, on the positions, and reusing them for both columns:
def rolling_window(a, window):
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
#get positions
idx = rolling_window(np.arange(len(df)), 5)
print (idx)
[[0 1 2 3 4]
 [1 2 3 4 5]]
df = pd.DataFrame({'first': df['first'].to_numpy()[idx].tolist(),
                   'label': df['label'].to_numpy()[idx][:, 2]})
print (df)
first label
0 [1, 2, 3, 4, 5] 2
1 [2, 3, 4, 5, 6] 3
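As a possible alternative to hand-written strides, NumPy 1.20 and later ship `numpy.lib.stride_tricks.sliding_window_view`, which builds the same windows and returns a read-only view. The sketch below assumes the question's sample data and an odd window size; the names `window`, `windows`, `middle` and `out` are only illustrative.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

# Sample data from the question.
df = pd.DataFrame({"first": [1, 2, 3, 4, 5, 6],
                   "label": [0, 1, 2, 3, 4, 5]})

window = 5  # assumed odd, so each window has a single middle element

# One row per full window; incomplete windows at the end are not produced.
windows = sliding_window_view(df["first"].to_numpy(), window)

# Position of the middle element of each window in the original frame.
middle = np.arange(len(windows)) + window // 2

out = pd.DataFrame({"first": windows.tolist(),
                    "label": df["label"].to_numpy()[middle]})
print(out)
```

Because the view is read-only, it avoids the accidental out-of-bounds writes that a mistaken `as_strided` call can cause.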

An alternative, though more of a hack; I don't think pandas has a native function for what you want.
Convert the dataframe to a NumPy array, transpose it, and pull out each window and its middle label with a list comprehension:
M = df.to_numpy().T
window = 5
outcome = [(M[0, i:i + window],
            M[1][i + window // 2])  # label at the middle of the current window
           for i in range(M.shape[1] - window + 1)
          ]
print(outcome)
[(array([1, 2, 3, 4, 5]), 2), (array([2, 3, 4, 5, 6]), 3)]
pd.DataFrame(outcome,columns=['first','label'])
first label
0 [1, 2, 3, 4, 5] 2
1 [2, 3, 4, 5, 6] 3


How to append item to match the length of two list in Python

I am working on a Python script which is connected to a server. Every x min, server returns two list but the length of these list is not same. For ex:
a = [8, 10, 1, 34]
b = [4, 6, 8]
As you can see above that a is of length 4 and b is of length 3. Similarly, sometimes it returns
a = [3, 6, 4, 5]
b = [8, 3, 5, 2, 9, 3]
I have to write a logic where I have to check if length of these two list is not same, then add the 0 at the end of the list which is smaller than other list. So for ex, if input is:
a = [3, 6, 4, 5]
b = [8, 3, 5, 2, 9, 3]
then output will be:
a = [3, 6, 4, 5, 0, 0]
b = [8, 3, 5, 2, 9, 3]
What can I try to achieve this?
def pad(list1, list2):
    # make copies of the existing lists so that original lists remain intact
    list1_copy = list1.copy()
    list2_copy = list2.copy()
    len_list1 = len(list1_copy)
    len_list2 = len(list2_copy)
    # find the difference in the element count between the two lists
    diff = abs(len_list1 - len_list2)
    # add `diff` number of elements to the end of the shorter list
    if len_list1 < len_list2:
        list1_copy += [0] * diff
    elif len_list1 > len_list2:
        list2_copy += [0] * diff
    return list1_copy, list2_copy
a = [3, 6, 4, 5]
b = [8, 3, 5, 2, 9, 3]
# prints: ([3, 6, 4, 5, 0, 0], [8, 3, 5, 2, 9, 3])
print(pad(a, b))
a = [8, 10, 1, 34]
b = [4, 6, 8]
# prints: ([8, 10, 1, 34], [4, 6, 8, 0])
print(pad(a, b))
For now, I can suggest this solution:
a = [3, 6, 4, 5]
b = [8, 3, 5, 2, 9, 3]
# Get the sizes of a and b.
sizeA, sizeB = len(a), len(b)
# Build the list of zeros that makes up the difference.
zeros = [0] * abs(sizeA - sizeB)
# Append the zeros to whichever list is shorter (nothing is added if the lengths are equal).
if sizeA < sizeB:
    a += zeros
else:
    b += zeros
print(a,b)
You should use extend instead of append: extend adds every element of one list to another, whereas append would add the whole list as a single element. The list being added here is the list of zeros.
a = [3, 6, 4, 5, 9, 3]
b = [8, 3, 5, 2]
lenA, lenB = len(a), len(b)
diff = abs(lenA - lenB)
if lenA < lenB:
    a.extend([0] * diff)
else:
    b.extend([0] * diff)
print(a)
print(b)
You could also try the padded() function from more_itertools.
It is probably more elegant and easier to adapt to future use cases.
Note: you need to run pip install more-itertools first.
# simple example to demo it:
from more_itertools import padded
# the last argument, 5, is the target total length, so two zeros are appended here
print(list(padded([1, 2, 3], 0, 5)))
# [1, 2, 3, 0, 0]
# more examples:
>>> L = [1, 2, 3]
>>> K = [3, 4, 5, 6, 8, 9]
>>> gap = len(K) - len(L)
>>> gap
3
>>> # the shorter list is L
>>> list(padded(L, 0, len(L) + gap))
[1, 2, 3, 0, 0, 0]
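If installing a third-party package is not an option, the standard library's `itertools.zip_longest` can do the same padding. This is only a sketch: the helper name `pad_to_same_length` and its `fill` parameter are made up for illustration, and like the `pad` function above it returns new lists instead of modifying the inputs.

```python
from itertools import zip_longest

def pad_to_same_length(first, second, fill=0):
    """Return copies of both lists, the shorter one padded with `fill`."""
    # zip_longest keeps pairing items until the longer input is exhausted,
    # substituting `fill` for the missing items of the shorter one.
    pairs = list(zip_longest(first, second, fillvalue=fill))
    if not pairs:  # both inputs were empty
        return [], []
    padded_first, padded_second = zip(*pairs)
    return list(padded_first), list(padded_second)

a = [3, 6, 4, 5]
b = [8, 3, 5, 2, 9, 3]
print(pad_to_same_length(a, b))
```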

How to convert list like value in each row into pure value in python dataframe?

I have a dataframe with a column of list-like values, shown below as colA, and I want to turn each one into the plain value shown under desired:
colA desired
[0] 0
[3, 1] 3,1
[3, 1, 2] 3,1,2
[3, 1] 3,1
The type for colA is object.
Is there a way to do it?
Thanks
Without a lambda, which will be slower for larger data sets, you can simply cast each list to a string and then strip the unwanted characters. Pass regex=True explicitly, because str.replace treats the pattern as a literal string by default in pandas 2.0 and later.
import pandas as pd
df = pd.DataFrame(data={'colA':[[0], [3,1],[3,1,2],[3,1]]})
df['desired'] = df['colA'].astype(str).str.replace(r"[\[\]']", '', regex=True)
df
Output:
colA desired
0 [0] 0
1 [3, 1] 3, 1
2 [3, 1, 2] 3, 1, 2
3 [3, 1] 3, 1
Try:
df = pd.DataFrame(data={'colA':[[0], [3,1],[3,1,2],[3,1], [4]]})
df['desired'] = df['colA'].apply(lambda x: ','.join(map(str, x)))
OUTPUT:
colA desired
0 [0] 0
1 [3, 1] 3,1
2 [3, 1, 2] 3,1,2
3 [3, 1] 3,1
4 [4] 4
If colA holds the string representation of the lists rather than actual lists:
from ast import literal_eval
df = pd.DataFrame(data={'colA':["[0]", "[3,1]","[3,1,2]","[3,1]", "[4]"]})
df['desired'] = df['colA'].apply(lambda x: ','.join(map(str, literal_eval(x))))
You can use str.replace (the astype(str) cast is needed when colA holds real lists rather than strings):
df['desired'] = df['colA'].astype(str).str.replace(r'[][]', '', regex=True)
Prints:
colA desired
0 [0] 0
1 [3, 1] 3, 1
2 [3, 1, 2] 3, 1, 2
3 [3, 1] 3, 1
You can use the regex demo to play with it.
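Note that the regex-based answers leave a space after each comma ("3, 1"), while the desired column in the question has none ("3,1"). A minimal sketch of one way to close that gap is to add whitespace to the character class; it assumes colA holds lists of numbers, since whitespace inside string elements would be removed as well.

```python
import pandas as pd

df = pd.DataFrame({"colA": [[0], [3, 1], [3, 1, 2], [3, 1]]})

# Cast each list to its string form, then drop brackets and whitespace.
df["desired"] = (
    df["colA"].astype(str).str.replace(r"[\[\]\s]", "", regex=True)
)
print(df["desired"].tolist())
```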

Compare neighbouring cells in a 2d array

Suppose this is a 3X3 matrix and I need to find the number of elements that are greater than their neighbours.
[[1 2 7],
[4 5 6],
[3 8 9]]
Neighbours are those cells whose corners touch each other.
1 has neighbours 2,4,5.
2 has neighbours 1,7,4,5,6.
7 has 2,5,6.
5 has 1,2,7,4,6,3,8,9 and so on.
This problem can be solved with two functions: 1) get_neighbors(matrix, r, c) and 2) compare_neighbors(matrix). The second function, compare_neighbors, simply calls get_neighbors for every coordinate, generated with itertools.product.
# code snippet:
from itertools import product
def get_neighbors(matrix, r, c):
    # flatten the (up to) 3x3 block around (r, c); the slices stop at the matrix edges
    vals = sum((row[c - (c > 0): c + 2]
                for row in matrix[r - (r > 0): r + 2]), [])
    vals.remove(matrix[r][c])   # remove the cell itself
    return set(vals)            # keep distinct values only
def compare_neighbors(matrix):
    ROW, COL = len(matrix), len(matrix[0])
    result = []
    for x, y in product(range(ROW), range(COL)):
        current = matrix[x][y]
        all_nums = get_neighbors(matrix, x, y)
        if all(n < current for n in all_nums):
            result.append(current)
    return result
Program running:
grid = [[1, 5, 4, 9],
[2, 6, 3, 2],
[8, 3, 6, 3],
[5, 4, 7, 1]]
matrix = [[1, 2, 7],
[4, 5, 6],
[3, 8, 9]]
print(f' {compare_neighbors(matrix)} ') # [7, 9]
print(f' {compare_neighbors(grid) } ') # [9, 8, 7]
You need to tackle three different problems:
ensure the data fits the problem statement: a matrix of strings, for example, does not work, nor does non-square input
get the actual neighbouring indexes for a given neighbourhood
check that no neighbour is greater than or equal to the value at the tested index
You can get the dimensions directly from the provided data (after asserting that it conforms to some base rules) and provide a generic check based on this, like so:
# different neighbourhood tuples (offsets relative to the tested cell)
neigh8 = tuple((a,b) for a in range(-1,2) for b in range(-1,2) if a or b)
neigh4 = tuple((a,b) for (a,b) in neigh8 if a == 0 or b == 0)
def assertions(data):
    """Check if data is list of lists and all dims are same.
    Check if all elements are either ints or floats.
    Exit with exception if not."""
    assert isinstance(data, list)
    outer_dim = len(data)
    for inner in data:
        assert isinstance(inner, list), f"Inner element not list: {inner}"
        assert outer_dim == len(inner), f"Inner element not of len {outer_dim}: {inner}"
        allNumbers = all(isinstance(i, (int, float)) for i in inner)
        assert allNumbers, f"Not all elements ints or floats: {inner}"
    return outer_dim
def test_surrounded_by_lower_numbers(data, idx, n_idx):
    """Test one element at 'idx' in 'data' for a given neighbourhood 'n_idx'
    and return True if surrounded only by smaller numbers."""
    def get_idx(data, idx, n_idx):
        """Get all indexes that conform to the given neighbourhood and are
        not identical to idx nor out of bounds."""
        n = []
        for (a,b) in n_idx:
            # identical to input idx
            if (idx[0]+a , idx[1]+b) == idx:
                continue
            # out of bounds
            if idx[0]+a < 0 or idx[1]+b < 0:
                continue
            if idx[0]+a >= len(data) or idx[1]+b >= len(data):
                continue
            n.append( (idx[0]+a , idx[1]+b ))
        return n
    value = data[idx[0]][idx[1]]
    n = get_idx(data, idx, n_idx)
    # check if all are smaller than the current value
    return all (data[a][b] < value for a,b in n)
def test_matrix(matrix, n_idx = neigh8):
    """Check all matrix values for given neighbourhood. Output all values that are
    surrounded only by strictly smaller values."""
    print()
    for i in matrix:
        for n in i:
            print(f"{float(n):>5.2f} ".replace(".00"," "), end=" ")
        print()
    print()
    dim = assertions(matrix)
    for (a,b) in ((a,b) for a in range(dim) for b in range(dim)):
        # pass the chosen neighbourhood through to the per-cell test
        if test_surrounded_by_lower_numbers(matrix, (a,b), n_idx):
            print(f"{(a,b)} = {matrix[a][b]} is biggest.")
Program:
# 3 x 3
test_matrix( [[1, 2, 7], [4, 5, 6], [3, 8, 9]] )
# 5 x 5
test_matrix([[1, 2, 7, 11, 9],
[4, 5, 6, -2, .5],
[9.1, 3,99.99, 8, 9.7],
[1,2,3,4,5],
[40,50,60,70,80]])
Output for 3x3 testcase:
1 2 7
4 5 6
3 8 9
(0, 2) = 7 is biggest.
(2, 2) = 9 is biggest.
Output for 5x5 testcase:
1 2 7 11 9
4 5 6 -2 0.50
9.10 3 99.99 8 9.70
1 2 3 4 5
40 50 60 70 80
(0, 3) = 11 is biggest.
(2, 0) = 9.1 is biggest.
(2, 2) = 99.99 is biggest.
(2, 4) = 9.7 is biggest.
(4, 4) = 80 is biggest.
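For larger matrices, the same 8-neighbour test can be sketched with NumPy instead of nested Python loops. The idea is to pad the matrix with negative infinity, so border cells need no special casing, and compare every cell against each of its eight shifted copies. The function name `strict_local_maxima` is hypothetical, and the input is assumed to be numeric and rectangular.

```python
import numpy as np

def strict_local_maxima(matrix):
    """Boolean mask of cells strictly greater than all 8 neighbours."""
    a = np.asarray(matrix, dtype=float)
    rows, cols = a.shape
    # A -inf border means out-of-bounds "neighbours" never block a cell.
    padded = np.pad(a, 1, constant_values=-np.inf)
    is_peak = np.ones(a.shape, dtype=bool)
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr == 0 and dc == 0:
                continue  # skip the cell itself
            # View of the padded array shifted by (dr, dc).
            neighbour = padded[1 + dr:1 + dr + rows, 1 + dc:1 + dc + cols]
            is_peak &= a > neighbour
    return is_peak

matrix = [[1, 2, 7],
          [4, 5, 6],
          [3, 8, 9]]
mask = strict_local_maxima(matrix)
print(np.argwhere(mask))        # coordinates of the peaks
print(np.asarray(matrix)[mask]) # their values
```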

Counting contiguous numbers within a list

I am completely new to the topic of programming but interested.
I am coding in python 3.x and have a question to my latest topic:
We have a list containing a few tens of thousands of randomly generated integers between 1 and 7.
import random
list_of_states = []
n = int(input('Enter number of elements:'))
for i in range(n):
    list_of_states.append(random.randint(1,7))
print (list_of_states)
Afterwards, I would like to count the contiguous numbers in this list and put them into a numpy.array
example: [1, 2, 3, 4, 4, 4, 7, 3, 1, 1, 1]
1 1
2 1
3 1
4 3
7 1
3 1
1 3
I would like to know whether someone has a hint/an idea of how I could do this.
This is a small part of a Markov chain, for which I need the frequency of each number.
Thanks for sharing
Nadim
Below is a crude way of doing this. I create a list of lists and then convert it to a numpy array. Please use this only as guidance and improve on it.
import numpy as np
num_list = [1,1,1,1,2,2,2,3,4,5,6,6,6,6,7,7,7,7,1,1,1,1,3,3,3]
temp_dict = {}      # holds the value and count of the current run only
two_dim_list = []
for x in num_list:
    if x in temp_dict:
        temp_dict[x] += 1
    else:
        # a new value starts: store the finished run, then start counting again
        if temp_dict:
            for k,v in temp_dict.items():
                two_dim_list.append([k,v])
            temp_dict = {}
        temp_dict[x] = 1
# store the last run
for k,v in temp_dict.items():
    two_dim_list.append([k,v])
print ("List of List = %s" %(two_dim_list))
two_dim_arr = np.array(two_dim_list)
print ("2D Array = %s" %(two_dim_arr))
Output:
List of List = [[1, 4], [2, 3], [3, 1], [4, 1], [5, 1], [6, 4], [7, 4], [1, 4], [3, 3]]
2D Array = [[1 4]
[2 3]
[3 1]
[4 1]
[5 1]
[6 4]
[7 4]
[1 4]
[3 3]]
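A more compact way to get the same run lengths is `itertools.groupby` from the standard library, which groups consecutive equal items. The sketch below uses the example list from the question; the function name `run_lengths` is only illustrative.

```python
from itertools import groupby

import numpy as np

def run_lengths(values):
    """Return a 2D array of [value, length] for each run of equal values."""
    # groupby yields one (key, group) pair per run of consecutive equal items.
    return np.array([[key, sum(1 for _ in group)]
                     for key, group in groupby(values)])

example = [1, 2, 3, 4, 4, 4, 7, 3, 1, 1, 1]
print(run_lengths(example))
```

Since `groupby` only merges neighbouring items, the two separate runs of 1 in the example stay separate, which is what the question asks for.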

Returning the N largest values' indices in a multidimensional array (can find solutions for one dimension but not multi-dimension)

I have a numpy array X, and I'd like to return another array Y whose entries are the indices of the n largest values of X. For example, suppose I have:
a = np.array([[1, 3, 5], [4, 5, 6], [9, 1, 7]])
Then, if I want the indices of the first 5 maxima (here 9, 7, 6, 5, 5 are the maxima), they are:
b = np.array([[2, 0], [2, 2], [1, 2], [1, 1], [0, 2]])
I've been able to find some solutions and make this work for a one-dimensional array like
c = np.array([1, 2, 3, 4, 5, 6]):
def f(a,N):
    return np.argsort(a)[::-1][:N]
But I have not been able to come up with something that works in more than one dimension. Thanks!
Approach #1
Get the argsort indices on its flattened version and select the last N indices. Then, get the corresponding row and column indices -
N = 5
idx = np.argsort(a.ravel())[-N:][::-1] #single slicing: `[:-N-1:-1]`
topN_val = a.ravel()[idx]
row_col = np.c_[np.unravel_index(idx, a.shape)]
Sample run -
# Input array
In [39]: a = np.array([[1,3,5],[4,5,6],[9,1,7]])
In [40]: N = 5
...: idx = np.argsort(a.ravel())[-N:][::-1]
...: topN_val = a.ravel()[idx]
...: row_col = np.c_[np.unravel_index(idx, a.shape)]
...:
In [41]: topN_val
Out[41]: array([9, 7, 6, 5, 5])
In [42]: row_col
Out[42]:
array([[2, 0],
[2, 2],
[1, 2],
[1, 1],
[0, 2]])
Approach #2
For performance, we can use np.argpartition to get top N indices without keeping sorted order, like so -
idx0 = np.argpartition(a.ravel(), -N)[-N:]
To get the sorted order, we need one more round of argsort -
idx = idx0[a.ravel()[idx0].argsort()][::-1]
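Putting the two steps of Approach #2 together, a small helper might look like the sketch below. The name `top_n_indices` is hypothetical, `n` is assumed to be between 1 and `a.size`, and the relative order of equal values (the two 5s here) is not guaranteed.

```python
import numpy as np

def top_n_indices(a, n):
    """Return (row_col, values) for the n largest entries, largest first."""
    flat = a.ravel()
    # Unordered positions of the n largest values.
    idx0 = np.argpartition(flat, -n)[-n:]
    # Sort only those n candidates, in descending order of value.
    idx = idx0[np.argsort(flat[idx0])[::-1]]
    # Convert flat positions back to one index per dimension.
    row_col = np.c_[np.unravel_index(idx, a.shape)]
    return row_col, flat[idx]

a = np.array([[1, 3, 5], [4, 5, 6], [9, 1, 7]])
row_col, values = top_n_indices(a, 5)
print(values)
print(row_col)
```

Because `np.unravel_index` takes the array's shape, the same helper should also work for arrays with more than two dimensions.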
