python count number of times pandas series value goes above value X - python-3.x

I see there are various Python/pandas ways to count the number of occurrences of a value.
However, is there a way to count the number of times the series value goes above $X? It shouldn't double-count while the series value stays above $X; it should count only when it crosses above it.
For example, in the data series below, I would expect the count to be 2 (not 4), given X = $200.
import pandas as pd

list_a = [100, 150, 210, 185, 166, 205, 230, 222, 195, 190, 160]
df = pd.DataFrame(list_a, columns=['mtm_level'])

Check with cumsum of a Boolean Series + where + nunique. Here "above" is strict, i.e. not inclusive of X:
X = 200
df['mtm_level'].le(X).cumsum().where(df['mtm_level'].gt(X)).nunique()
#2
X = 1000
df['mtm_level'].le(X).cumsum().where(df['mtm_level'].gt(X)).nunique()
#0
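As a sketch of an equivalent cross-check, you can compare the boolean "above X" mask with its shifted self and count only the False-to-True transitions (i.e., the upward crossings):
import pandas as pd

list_a = [100, 150, 210, 185, 166, 205, 230, 222, 195, 190, 160]
df = pd.DataFrame(list_a, columns=['mtm_level'])

X = 200
above = df['mtm_level'].gt(X)                               # True while the series is above X
crossings = (above & ~above.shift(fill_value=False)).sum()  # count only False -> True transitions
print(crossings)  # 2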

Related

python changing a list into a dictionary

So I'm taking a Python class right now and am struggling with dictionaries at the moment. My assignment is simple: I have to create a function "letter_positions" which will return a dictionary of all positions of a letter in a string.
For example,
positions = letter_positions("fifteen e's, seven f's, four g's, six h's, eight i's, four n's, five o's, six r's, eighteen s's, eight t's, four u's, three v's, two w's, three x's")
positions['e']
should return
{4, 5, 8, 14, 16, 43, 67, 83, 88, 89, 97, 121, 122, 141, 142}
So I'm pretty much done with the assignment, but I'm running into the issue that all the values (positions) assigned to the keys (letters) are lists.
Here's my code:
def letter_positions(n):
    answer = {}
    n = n.lower()
    x = 0
    for letter in n:
        if letter.isalpha():
            if letter not in answer:
                answer[letter] = []
            answer[letter].append(x)
        x += 1
    return answer
So instead of getting a dictionary of positions I'm getting a list of positions.
positions = letter_positions("fifteen e's, seven f's, four g's, six h's, eight i's, four n's, five o's, six r's, eighteen s's, eight t's, four u's, three v's, two w's, three x's")
positions['e']
returns
[4, 5, 8, 14, 16, 43, 67, 83, 88, 89, 97, 121, 122, 141, 142]
Is there any way for me to simply change the list into a dictionary, or am I approaching this in a completely wrong way?
Change your code like this:
def letter_positions(n):
    answer = {}
    n = n.lower()
    x = 0
    for letter in n:
        if letter.isalpha():
            answer[letter] = answer.get(letter, [])  # if the key 'letter' is not present yet, add it with an empty list as its value
            answer[letter].append(x)
        x = x + 1
    return answer
positions = letter_positions("fifteen e's, seven f's, four g's, six h's, eight i's, four n's, five o's, six r's, eighteen s's, eight t's, four u's, three v's, two w's, three x's")
print(positions['e'])
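If the assignment really expects a set of positions per letter (the braces in the expected output suggest a set rather than a list), a small variation, sketched here with collections.defaultdict and enumerate, would collect the positions into sets:
from collections import defaultdict

def letter_positions(text):
    positions = defaultdict(set)               # each letter maps to a set of positions
    for index, letter in enumerate(text.lower()):
        if letter.isalpha():
            positions[letter].add(index)
    return dict(positions)

positions = letter_positions("fifteen e's, seven f's, four g's, six h's, eight i's, four n's, five o's, six r's, eighteen s's, eight t's, four u's, three v's, two w's, three x's")
print(positions['e'])  # {4, 5, 8, 14, 16, ...}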

Find start/stop location of sharp cumulative events

Here is an example set of data:
EDIT: Included some more data.
import numpy as np

x = [0, 5, 6, 15, 20, 40, 73,
100,101,102,103,104,105,106,108,111,115,
116,117,118,119,120,123,124,125,126,127,
128,129,130,131, 150,161,170, 183, 194,
210, 234, 257, 271,272,273,274, 275,276,
277,278,279,280,281,282,283,284,285,287,
288,291,292,293,294,295,296,297,298,300,301,
302,303,304,305,306,307,308,309,310,311,
340, 351, 358, 360, 380, 390, 400,401,
402,403, 404, 405, 408, 409, 413, 420,
425,426,427,428,429,430,431,432,433,434,435,
436, 440, 450, 455]
y = np.arange(1, len(x)+1)
Here is what the data looks like visually; each sharp increase has the potential to be longer. The last sharp increase also has a pause, but I would like it to still be considered one set of data. (The black dots in the plot are the gradient.)
I am attempting to find the start/end x-values for each sharp increase in cumulative counts. So the output should be an array of indexes, like what Riley has done.
A vectorized method would be ideal, given the time constraints of going through the data quickly. Here is a rough outline of what has been done so far within a pandas dataframe:
Shift the "x-data" and take a difference
See if sequential differences are below a threshold to create logic array
Do a rolling sum on the logic array so that consecutive Trues continue to add to the count
Find when rolling sum exceeds another threshold
Compare with the previous value to ensure it is increasing/decreasing for the start/stop times
Add times to index list
It seems a little finicky on some of the rolling averages and isn't as quick as I would like. Multiplying some of these large arrays by logic arrays seems to take a good amount of time. A rough sketch of the outline above is shown below.
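A minimal sketch of that outline, using the x and y defined above (the two thresholds are placeholders, not the values actually used):
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': x, 'y': y})
diff_threshold = 2    # placeholder: largest x-spacing that still counts as a "sharp" step
run_threshold = 5     # placeholder: how many consecutive small steps are required

step_is_small = df['x'].diff().le(diff_threshold)           # logic array of small x-differences
run_length = step_is_small.rolling(run_threshold).sum()     # rolling sum, so consecutive Trues add up
in_event = run_length.ge(run_threshold)                     # True while inside a sharp increase
starts = df.index[in_event & ~in_event.shift(fill_value=False)]   # rising edges -> start indexes
stops = df.index[~in_event & in_event.shift(fill_value=False)]    # falling edges -> stop indexes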
EDIT: Here is the code Riley has provided, which offers an excellent start. It is also only a couple of lines of code, whereas my method above was almost 50 or so.
rate_threshold = 0.25
min_consecutive = 8
above_rate = np.gradient(y,x) >= rate_threshold
sequence_diff = np.diff(np.lib.stride_tricks.sliding_window_view(above_rate, window_shape=min_consecutive).all(axis=1).astype(int))
intervals = np.hstack([np.argwhere(sequence_diff==1) + 1, np.argwhere(sequence_diff==-1) + min_consecutive-1])
The new issue comes from the final sharp increase of data. Two sets of start/end points are returned, where the desired would just be one.
My initial thought is to include some kind of averaging routine with the sliding window to account for these drops in the gradient, so the endpoint is not so hard-set.
Not sure what your desired output would look like, so let's start with this, verify it does what you want it to, then go from there:
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rate_threshold = 1
min_consecutive = 5
above_rate = np.diff(y)/np.diff(x) >= rate_threshold
sequence_diff = np.diff(sliding_window_view(above_rate, window_shape=min_consecutive).all(axis=1).astype(int))
intervals = np.hstack([np.argwhere(sequence_diff==1) + 1, np.argwhere(sequence_diff==-1) + min_consecutive-1])
intervals is a 2D NumPy array of indices whose two columns are the first index and the last index of each sequence (of length min_consecutive) of rates above the threshold:
array([[ 7, 12],
[16, 20],
[22, 29],
[39, 52],
[56, 62],
[64, 74]], dtype=int64)
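Regarding the follow-up issue of the final increase being split into two intervals: one possible tweak, sketched here rather than part of the answer above, is to merge intervals whose gap is below some tolerance:
max_gap = 5  # placeholder: how many indices a "pause" may span and still count as one event

merged = [intervals[0].tolist()]
for start, stop in intervals[1:]:
    if start - merged[-1][1] <= max_gap:
        merged[-1][1] = stop  # extend the previous interval across the pause
    else:
        merged.append([start, stop])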

Python3 bisect_left: return value does not match the Python documentation description

My code is as follows:
import bisect
a = [186, 186, 150, 200, 160, 130, 197, 200]
print("left",bisect.bisect_left(a,150))
The return value is: 0
But as specified in the Python 3.9 documentation:
If x is already present in a, the insertion point will be before (to the left of) any existing entries.
150 exists in the list a, so the return value should be 1 (i.e., a.index(150) - 1), but the actual returned value is 0.
Would you please explain the reason?
The bisect module, and the underlying binary search algorithm in general, is made for sorted data. For unsorted data, the result is effectively arbitrary.
For the bisect_left algorithm, sorted-ness means the algorithm does not have to check for equality: in a sorted sequence a, the insertion point i for x is the one where a[i-1] < x and x <= a[i]. This is because sorted-ness enforces a[j] <= a[j+1].
As such, technically the insertion point will be before (to the left of) any existing entries equal to or larger than x. Sorted-ness guarantees that this is before any existing entries of x.
For the sequence [186, 186, 150, 200, 160, 130, 197, 200] and x=150, the insertion point is 0 because:
The list is initially bisected into [186, 186, 150, 200] and [160, ...].
The head of the right bisect is equal to or larger than x; assuming sorted-ness, there cannot be a value smaller than x in it.
All values in the left bisect are equal to or larger than x; assuming sorted-ness, the insertion point must be before all of them.
The only point before all values of the left bisection is 0.
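To see the documented behaviour, sort the list first; a short check:
import bisect

a = [186, 186, 150, 200, 160, 130, 197, 200]
print(bisect.bisect_left(a, 150))         # 0 -- effectively arbitrary, since a is not sorted

a_sorted = sorted(a)                       # [130, 150, 160, 186, 186, 197, 200, 200]
print(bisect.bisect_left(a_sorted, 150))   # 1 -- before the existing 150, as the docs describe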

Different types of result for np.where: the x, y are interchanged in the two conditions. What am I missing?

I'm trying to initialize two arrays, image and a.
In image I'm filling the contour cnt, whose points are in (x, y) order, with the color value 255.
And in array a I'm assigning the value 1 at a[2][0], where 2 is the x-axis point and 0 is the y-axis point.
But in the result, the output for image is in (y array, x array) order, while for a it is (x array, y array).
Where am I making the mistake?
import cv2
import numpy as np
image = np.zeros((700,700),dtype=np.int32)
cnt = np.array([[340, 157], [695, 157], [695, 309], [340, 309], [340, 157]])
image = cv2.fillPoly(image,[cnt],color=255)
a = np.zeros((10,10),dtype=np.int32)
a[2][0] = 1
print(np.where(a>0)) #output_1
print(np.where(image>0)) #output_2
output_1:
(array([2]), array([0]))
output_2:
(array([157, 157, 157, ..., 309, 309, 309]), array([340, 341, 342, ..., 693, 694, 695]))
There is no mistake with the code. I have rewritten this using numpy slices to fill up portions of the array.
import cv2
import numpy as np
image = np.zeros((700,700),dtype=np.int32)
print(image.shape)
# cnt = np.array([[340, 157], [695, 157], [695, 309], [340, 309], [340, 157]])
image[157:309, 340:695] = 255
# image = cv2.fillPoly(image,[cnt],color=255)
print(image.shape)
a = np.zeros((10,10),dtype=np.int32)
a[2, 0:5] = 1
print(np.where(a>0)) #output_1
print(np.where(image==255)) #output_2
(array([2, 2, 2, 2, 2], dtype=int64), array([0, 1, 2, 3, 4], dtype=int64))
(array([157, 157, 157, ..., 308, 308, 308], dtype=int64), array([340, 341, 342, ..., 692, 693, 694], dtype=int64))
If you look at image[157:309, 340:695] = 255, the slice along the y-axis is described by the rows (first index) and the slice along the x-axis by the columns (second index).
The cnt part of your code describes a range of rows from 157-309 and a range of columns from 340-695, and since np.where returns (row indices, column indices), that's why the output lists the y values first.
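In other words, np.where on a 2D array always returns (row indices, column indices); if you want (x, y) points you have to swap the two arrays yourself. A small sketch:
import numpy as np

a = np.zeros((10, 10), dtype=np.int32)
a[2][0] = 1

rows, cols = np.where(a > 0)         # rows (y) first, then columns (x)
print(rows, cols)                    # [2] [0]

xy = np.column_stack([cols, rows])   # reorder to (x, y) = (column, row) pairs
print(xy)                            # [[0 2]]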

How to extract specific pages based on a formula?

I'm trying to extract pages from a PDF that is 1000 pages long, but I only need pages in the pattern [9, 10, 17, 18, 25, 26, 33, 34, ...]. These numbers can be represented by the formula pg = 1/2 (7 - 3(-1)^n + 8n).
I tried to define the formula and plug it into tabula.read_pdf, but I'm not sure how to define the 'n' variable, where 'n' ranges from 0 up to 25. Right now I defined it as a list, which I think is the problem...
n = list(range(25+1))
pg = 1/2 (7 - 3 (-1)^n + 8*n)
df = tabula.read_pdf(path, pages = 'pg',index_col=0, multiple_tables=False)
When trying to execute, I get a TypeError: 'int' object is not callable on line pg = 1/2 (7 - 3 (-1)^n + 8*n). How would I define the variables so that tabula extracts pages that fit the condition of the formula?
The formula is x = 1/2 (8n - 3(-1)^n + 7).
Step 1:
pg = []  # empty list to store the page numbers calculated by the formula
for i in range(1, 25 + 1):  # for a 1000-page PDF, increase 25 accordingly
    k = int(1/2 * ((8 * i) - 3 * ((-1) ** i) + 7))
    pg.append(k)
print(pg)  # this will give you the list of page numbers
# [9, 10, 17, 18, 25, 26, 33, 34, 41, 42, 49, 50, 57, 58, 65, 66, 73, 74, 81, 82, 89, 90, 97, 98, 105]
Step 2:
import tabula
import pandas as pd

# Now run the loop through each of the pages with the table
df = pd.DataFrame([])
df_combine = pd.DataFrame([])
for page in pg:
    df = tabula.read_pdf(path, pages=page, index_col=0, multiple_tables=False, guess=False)  # modify it as per your requirement
    df_combine = pd.concat([df, df_combine])  # you can choose between merge or concat as per your need
OR
df_data = []
for page in pg:
    df = tabula.read_pdf(path, pages=page, index_col=0, multiple_tables=False, guess=False)  # modify it as per your requirement
    df_data.append(df)
df_combine = pd.concat(df_data, axis=1)
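As a possible shortcut (assuming a reasonably recent tabula-py, whose pages parameter also accepts an iterable of page numbers), the whole pg list could be passed in a single call instead of looping:
dfs = tabula.read_pdf(path, pages=pg, multiple_tables=True, guess=False)  # one DataFrame per table found
df_combine = pd.concat(dfs)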
Reference link to create formula
https://www.wolframalpha.com/widgets/view.jsp?id=a3af2e675c3bfae0f2ecce820c2bef43
