How to extract specific pages based on a formula? - python-3.x

I'm trying to extract pages from a PDF that is 1000 pages long but I only need pages in the pattern of [9,10,17,18,25,26,33,34,...etc]. These numbers can be represented in the formula: pg = 1/2 (7 - 3 (-1)^n + 8*n).
I tried to define the formula and plug into tabula.read_pdf but I'm not sure how to define the 'n' variable where 'n' ranges from 0 up to 25. Right now I defined it as a list which I think is the problem...
n = list(range(25+1))
pg = 1/2 (7 - 3 (-1)^n + 8*n)
df = tabula.read_pdf(path, pages = 'pg',index_col=0, multiple_tables=False)
When trying to execute, I get a TypeError: 'int' object is not callable on line pg = 1/2 (7 - 3 (-1)^n + 8*n). How would I define the variables so that tabula extracts pages that fit the condition of the formula?

Formula is x = 1/2(8n - 3(-1)^n + 7)
Step 1:
pg = []  # empty list to store the page numbers calculated by the formula
for n in range(1, 25 + 1):  # raise 25 until the computed pages reach the end of your PDF
    k = int(1/2 * ((8 * n) - 3 * ((-1) ** n) + 7))
    pg.append(k)
print(pg)  # this will give you the list of page numbers
# [9, 10, 17, 18, 25, 26, 33, 34, 41, 42, 49, 50, 57, 58, 65, 66, 73, 74, 81, 82, 89, 90, 97, 98, 105]
Step 2:
# Now loop through each of the pages with a table
df_combine = pd.DataFrame([])
for page in pg:
    df = tabula.read_pdf(path, pages=page, index_col=0, multiple_tables=False, guess=False)  # modify as per your requirement
    df_combine = pd.concat([df_combine, df])  # you can choose between merge or concat as per your need
OR
df_data = []
for page in pg:
    df = tabula.read_pdf(path, pages=page, index_col=0, multiple_tables=False, guess=False)  # modify as per your requirement
    df_data.append(df)
df_combine = pd.concat(df_data, axis=1)
Reference link to create formula
https://www.wolframalpha.com/widgets/view.jsp?id=a3af2e675c3bfae0f2ecce820c2bef43
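As an aside, tabula-py's `pages` parameter also accepts a list of page numbers, so once the list is computed the loop may not be needed at all. A sketch (`path` is assumed to point at your PDF, and tabula-py plus Java must be installed for the commented-out call):

```python
# Compute the page list from the formula in one comprehension
pg = [int((8 * n - 3 * (-1) ** n + 7) / 2) for n in range(1, 26)]
print(pg[:4])  # [9, 10, 17, 18]

# With the list in hand, a single call can read every listed page:
# df = tabula.read_pdf(path, pages=pg, index_col=0, multiple_tables=False)
```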

Related

python changing a list into a dictionary

so i'm taking a python class right now and am struggling with dictionaries at the moment. my assignment is simple: i have to create a function "letter_positions" which will return a dictionary of all positions of a letter in a string.
for example
positions = letter_positions("fifteen e's, seven f's, four g's, six h's, eight i's, four n's, five o's, six r's, eighteen s's, eight t's, four u's, three v's, two w's, three x's")
positions['e']
should return
{4, 5, 8, 14, 16, 43, 67, 83, 88, 89, 97, 121, 122, 141, 142}
so i'm pretty much done with the assignment but i'm running into the issue that i have all values (positions) assigned to the keys (letters) as a list.
here's my code:
def letter_positions(n):
    answer = {}
    n = n.lower()
    x = 0
    for letter in n:
        if letter.isalpha():
            if letter not in answer:
                answer[letter] = []
            answer[letter].append(x)
        x += 1
    return answer
so instead of getting a dictionary of positions i'm getting a list of positions.
positions = letter_positions("fifteen e's, seven f's, four g's, six h's, eight i's, four n's, five o's, six r's, eighteen s's, eight t's, four u's, three v's, two w's, three x's")
positions['e']
returns
[4, 5, 8, 14, 16, 43, 67, 83, 88, 89, 97, 121, 122, 141, 142]
is there any way for me to simply change the list into a dictionary or am i approaching this in a completely wrong way?
The sample output in your assignment is a set for each letter, so store a set instead of a list. dict.setdefault also lets you drop the membership check:
def letter_positions(n):
    answer = {}
    n = n.lower()
    x = 0
    for letter in n:
        if letter.isalpha():
            answer.setdefault(letter, set()).add(x)  # if the letter isn't a key yet, add it with an empty set, then add the position
        x = x + 1
    return answer
positions = letter_positions("fifteen e's, seven f's, four g's, six h's, eight i's, four n's, five o's, six r's, eighteen s's, eight t's, four u's, three v's, two w's, three x's")
print(positions['e'])
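For reference, the same idea can also be written with collections.defaultdict and enumerate, which removes the manual counter; since the assignment's sample output is a set, a set is used per letter (a sketch, not the original answer's code):

```python
from collections import defaultdict

def letter_positions(text):
    positions = defaultdict(set)  # a missing letter starts with an empty set
    for index, letter in enumerate(text.lower()):
        if letter.isalpha():
            positions[letter].add(index)
    return dict(positions)

positions = letter_positions("fifteen e's, seven f's")
print(positions['f'])  # {0, 2, 19}
```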

Find start/stop location of sharp cumulative events

Here is an example set of data:
EDIT: Included some more data.
x = [0, 5, 6,15, 20, 40, 73,
100,101,102,103,104,105,106,108,111,115,
116,117,118,119,120,123,124,125,126,127,
128,129,130,131, 150,161,170, 183, 194,
210, 234, 257, 271,272,273,274, 275,276,
277,278,279,280,281,282,283,284,285,287,
288,291,292,293,294,295,296,297,298,300,301,
302,303,304,305,306,307,308,309,310,311,
340, 351, 358, 360, 380, 390, 400,401,
402,403, 404, 405, 408, 409, 413, 420,
425,426,427,428,429,430,431,432,433,434,435,
436, 440, 450, 455]
y = np.arange(1, len(x)+1)
Here is what the data looks like visually; each sharp increase has the potential to be longer. The last sharp increase also has a pause, but I would like it to still be considered one set of data. (The black dots are the gradient.)
I am attempting to find the start/end x-values for each sharp increase in cumulative counts. So the output should be an array of indexes, like what Riley has done.
A vectorized method would be ideal to help with any time constraints and quickly go through the data. Here is a rough outline of what has been done so far within a pandas dataframe:
Shift the "x-data" and take a difference
See if sequential differences are below a threshold to create a logic array
Do a rolling sum on the logic array so Trues will keep adding to the count
Find when the rolling sum exceeds another threshold
Compare with the previous value to ensure it is increasing/decreasing for start/stop times
Add times to the index list
It seems a little finicky with some of the rolling averages and isn't as quick as I would like. Multiplying some of these large arrays by logic arrays seems to take a good amount of time.
EDIT: Here is the code Riley has provided; it offers an excellent start and is only a couple of lines of code, versus my method above, which was almost 50 or so.
rate_threshold = 0.25
min_consecutive = 8
above_rate = np.gradient(y,x) >= rate_threshold
sequence_diff = np.diff(np.lib.stride_tricks.sliding_window_view(above_rate, window_shape=min_consecutive).all(axis=1).astype(int))
intervals = np.hstack([np.argwhere(sequence_diff==1) + 1, np.argwhere(sequence_diff==-1) + min_consecutive-1])
The new issue comes from the final sharp increase of data. Two sets of start/end points are returned, where the desired would just be one.
My initial thought is to include some kind of average routine with the sliding window to account for these drops in the gradient so the end is so hard set.
Not sure what your desired output would look like, so let's start with this, verify it does what you want it to, then go from there:
from numpy.lib.stride_tricks import sliding_window_view

rate_threshold = 1
min_consecutive = 5
above_rate = np.diff(y)/np.diff(x) >= rate_threshold
sequence_diff = np.diff(sliding_window_view(above_rate, window_shape=min_consecutive).all(axis=1).astype(int))
intervals = np.hstack([np.argwhere(sequence_diff==1) + 1, np.argwhere(sequence_diff==-1) + min_consecutive-1])
intervals is a 2d numpy array of indices whose two columns are the first and last index of a sequence (of length min_consecutive) of rates above the threshold:
array([[ 7, 12],
[16, 20],
[22, 29],
[39, 52],
[56, 62],
[64, 74]], dtype=int64)
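The pause in the final increase could be handled by post-processing the intervals instead of changing the sliding window: merge adjacent intervals whose gap is small. This is only a sketch of that idea, with a hypothetical `max_gap` parameter, not part of Riley's answer:

```python
import numpy as np

def merge_close_intervals(intervals, max_gap):
    """Merge consecutive [start, end] index intervals separated by at most max_gap indices."""
    merged = [list(intervals[0])]
    for start, end in intervals[1:]:
        if start - merged[-1][1] <= max_gap:
            merged[-1][1] = end  # bridge the short gap into the previous interval
        else:
            merged.append([start, end])
    return np.array(merged)

print(merge_close_intervals(np.array([[7, 12], [16, 20], [22, 29]]), max_gap=3))
# [[ 7 12]
#  [16 29]]
```

Tuning `max_gap` controls how long a pause may be before two increases count as separate events.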

How can I use the output from a previous iteration as the input for a new iteration?

I just want to run a simple code that uses the output of the previous iteration as the input for the newest iteration. I sort of want it along the lines of this:
a_1 = 2
a_n = 3 * a_(n-1)
so a_2 = 3 * a_1 = 6
The code I've included below is just what I would like, and doesn't reflect what I think the actual code should look like.
import numpy as np
Nloop = 10
cList = np.zeros(Nloop)
a_1 = 2  # setting my initial value
cList[0] = a_1
for y in range(Nloop):
    a = cList[y-1]  # I know this isn't right, but for this I just
                    # want to get the output from the last iteration
    a_n = a * 3
    cList[y] = a_n
I want the result to look something like this:
print(cList)
[2, 6, 18, 54, 162, 486, 1458, 4374, 13122, 39366]
Any instructions/help/hints would be greatly appreciated. Let me know if you need more information.
Your code is correct. Simply start y from 1 instead of 0 by making the following change:
for y in range(1, Nloop):
As a side note, your whole program can be rewritten as:
>>> print([2 * (3 ** i) for i in range(10)])
[2, 6, 18, 54, 162, 486, 1458, 4374, 13122, 39366]
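Since the question already imports numpy, the same geometric sequence can also be built without a Python loop at all, a small vectorized alternative:

```python
import numpy as np

# Each term is 2 * 3**i, so build all exponents at once
cList = 2 * 3 ** np.arange(10)
print(cList.tolist())  # [2, 6, 18, 54, 162, 486, 1458, 4374, 13122, 39366]
```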

Want to concatenate two 3D matrices (10,61,1)

I am trying to concatenate two numpy arrays with 3 dimensions. I actually tried np.append and np.concatenate.
The shapes of what I want to concatenate are (10, 61, 1) and (1, 0, 0), and I want to keep the same structure. Instead, the final result has a flat shape like (n,): Python puts everything on the same dimension.
matrix_results_final = np.empty((1, 0, 0), float)
vec_y_labels_final = np.empty((1), int)
dict_minmax_scall = searchMinMaxValuesRawDataByElectrode(raw_data)
for electrode_number in electrode_numbers:
    # ... some code
    matrix_results_final = np.append(matrix_results_final, matrix_results)
Here matrix_results_final is (1, 0, 0) and matrix_results is (10, 61, 1).
I want matrix_results_final to have shape (10, 61, 1) after the first iteration, (20, 61, 1) after the second, and so on.
I don't get any error message.
If I understood you correctly, you want to append some data in a loop, like:
import numpy as np
arr = np.random.rand(10, 61, 1)
for i in range(3):
    print(arr.shape)
    arr2 = np.random.rand(10, 61, 1)
    arr = np.concatenate([arr, arr2], axis=0)
print(arr.shape)
(10, 61, 1)
(20, 61, 1)
(30, 61, 1)
(40, 61, 1)
Note that it would be more efficient to first allocate an empty array with the final shape and then fill it, so that no new memory has to be found on each iteration (the array never changes size):
arr = np.zeros((40, 61, 1))
for i in range(4):
    ii = i * 10
    arr[ii:ii+10, :, :] = np.random.rand(10, 61, 1)
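When the final size isn't known in advance, another common pattern (a sketch, not from the answer above) is to collect the pieces in a Python list and concatenate once at the end, which also avoids repeated reallocation:

```python
import numpy as np

chunks = []
for _ in range(4):
    chunks.append(np.random.rand(10, 61, 1))  # one (10, 61, 1) block per iteration
result = np.concatenate(chunks, axis=0)  # single concatenation at the end
print(result.shape)  # (40, 61, 1)
```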

python count number of times pandas series value goes above value X

I see there are various python/pandas ways to count the number of occurrences.
However, is there a way to count the number of times the series value goes above $X? It shouldn't keep counting while the series value stays above $X; it should count only when it crosses above it.
For example, in the data series below, I would expect the count to be 2, and not 4, given X = $200.
list_a = [100, 150, 210, 185, 166, 205, 230, 222, 195, 190, 160]
df = pd.DataFrame(list_a, columns=['mtm_level'])
Check with cumsum of a Boolean Series + where + nunique. Here "above" is not inclusive of X
X = 200
df['mtm_level'].le(X).cumsum().where(df['mtm_level'].gt(X)).nunique()
#2
X = 1000
df['mtm_level'].le(X).cumsum().where(df['mtm_level'].gt(X)).nunique()
#0
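An alternative (not from the answer above) is to count the upward crossings directly by comparing each value with its predecessor:

```python
import pandas as pd

s = pd.Series([100, 150, 210, 185, 166, 205, 230, 222, 195, 190, 160])
X = 200
# A crossing occurs where the value is above X but the previous value was not
crossings = (s.gt(X) & s.shift(fill_value=X).le(X)).sum()
print(crossings)  # 2
```

`fill_value=X` makes the first sample count as a crossing only if it is itself above X.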
