so i'm taking a python class right now and am struggling with dictionaries at the moment. my assignment is simple: i have to create a function "letter_positions" which returns a dictionary of all positions of a letter in a string.
for example
positions = letter_positions("fifteen e's, seven f's, four g's, six h's, eight i's, four n's, five o's, six r's, eighteen s's, eight t's, four u's, three v's, two w's, three x's")
positions['e']
should return
{4, 5, 8, 14, 16, 43, 67, 83, 88, 89, 97, 121, 122, 141, 142}
so i'm pretty much done with the assignment but i'm running into the issue that i have all values (positions) assigned to the keys (letters) as a list.
here's my code:
def letter_positions(n):
    answer = {}
    n = n.lower()
    x = 0
    for letter in n:
        if letter.isalpha():
            if letter not in answer:
                answer[letter] = []
            answer[letter].append(x)
        x += 1
    return answer
so instead of getting the {...} collection of positions shown above for each letter, i'm getting a list of positions.
positions = letter_positions("fifteen e's, seven f's, four g's, six h's, eight i's, four n's, five o's, six r's, eighteen s's, eight t's, four u's, three v's, two w's, three x's")
positions['e']
returns
[4, 5, 8, 14, 16, 43, 67, 83, 88, 89, 97, 121, 122, 141, 142]
is there any way for me to simply change the list into a dictionary or am i approaching this in a completely wrong way?
Change your code like this:
def letter_positions(n):
    answer = {}
    n = n.lower()
    x = 0
    for letter in n:
        if letter.isalpha():
            answer[letter] = answer.get(letter, [])  # if the key isn't present yet, add it with an empty list as its value
            answer[letter].append(x)
        x = x + 1
    return answer
positions = letter_positions("fifteen e's, seven f's, four g's, six h's, eight i's, four n's, five o's, six r's, eighteen s's, eight t's, four u's, three v's, two w's, three x's")
print(positions['e'])
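Since the expected output in the assignment ({4, 5, 8, ...}) is a set literal rather than a dict, the lists can be swapped for sets. A minimal sketch of that variant:

```python
def letter_positions(n):
    # Same traversal as the original, but collect positions in sets
    answer = {}
    for x, letter in enumerate(n.lower()):
        if letter.isalpha():
            answer.setdefault(letter, set()).add(x)
    return answer

positions = letter_positions("fifteen e's, seven f's")
print(positions['e'])  # {4, 5, 8, 14, 16}
```

An existing list-valued result can also be converted afterwards with {k: set(v) for k, v in answer.items()}.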
Here is an example set of data:
EDIT: Included some more data.
x = [0, 5, 6,15, 20, 40, 73,
100,101,102,103,104,105,106,108,111,115,
116,117,118,119,120,123,124,125,126,127,
128,129,130,131, 150,161,170, 183, 194,
210, 234, 257, 271,272,273,274, 275,276,
277,278,279,280,281,282,283,284,285,287,
288,291,292,293,294,295,296,297,298,300,301,
302,303,304,305,306,307,308,309,310,311,
340, 351, 358, 360, 380, 390, 400,401,
402,403, 404, 405, 408, 409, 413, 420,
425,426,427,428,429,430,431,432,433,434,435,
436, 440, 450, 455]
y = np.arange(1, len(x)+1)
Here is what the data looks like visually; each sharp increase could potentially be longer. The last sharp increase also has a pause, but I would like it to still be considered one set of data. Black dots are the gradient.
I am attempting to find the start/end x-values for each sharp increase in cumulative counts. So the output should be an array of indexes, like what Riley has done.
A vectorized method would be ideal to help with any time constraints and quickly go through the data. Here is a rough outline of what has been done so far within a pandas dataframe:
Shift the "x-data" and take a difference
Check whether sequential differences are below a threshold to create a logic array
Do a rolling sum on the logic array so consecutive Trues keep adding to the count
Find when the rolling sum exceeds another threshold
Compare with the previous value to ensure it is increasing/decreasing for start/stop times
Add the times to an index list
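A rough sketch of those steps in pandas, using a shortened sample of the data and placeholder thresholds (the values are assumptions, not tuned to the real dataset):

```python
import numpy as np
import pandas as pd

x = [0, 5, 6, 15, 20, 40, 73, 100, 101, 102, 103, 104]  # shortened sample
df = pd.DataFrame({'x': x})

diff_threshold = 2    # assumed: max x-step allowed inside a sharp increase
count_threshold = 3   # assumed: min run length that counts as an increase

df['dx'] = df['x'].diff()                               # 1. shift and difference
df['close'] = df['dx'] <= diff_threshold                # 2. logic array
df['run'] = df['close'].rolling(count_threshold).sum()  # 3. rolling sum of Trues
df['hit'] = df['run'] >= count_threshold                # 4. exceeds threshold
edges = df['hit'].astype(int).diff()                    # 5. rising/falling edges
starts = df.index[edges == 1]                           # 6. start indices
stops = df.index[edges == -1]                           #    stop indices
```

With this toy sample only the run of unit steps at the end registers a start; the thresholds would need tuning for the full dataset.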
It seems a little finicky with some of the rolling averages and isn't as quick as I would like. Multiplying some of these large arrays by logic arrays seems to take a good amount of time.
EDIT: Here is the code Riley has provided; it offers an excellent start. It is also only a couple of lines of code, versus my method above which was almost 50 or so.
rate_threshold = 0.25
min_consecutive = 8
above_rate = np.gradient(y,x) >= rate_threshold
sequence_diff = np.diff(np.lib.stride_tricks.sliding_window_view(above_rate, window_shape=min_consecutive).all(axis=1).astype(int))
intervals = np.hstack([np.argwhere(sequence_diff==1) + 1, np.argwhere(sequence_diff==-1) + min_consecutive-1])
The new issue comes from the final sharp increase of data. Two sets of start/end points are returned, where the desired result would be just one.
My initial thought is to include some kind of averaging routine with the sliding window to account for these drops in the gradient, so the end isn't so hard-set.
Not sure what your desired output would look like, so let's start with this, verify it does what you want it to, then go from there.
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rate_threshold = 1
min_consecutive = 5
above_rate = np.diff(y)/np.diff(x) >= rate_threshold
sequence_diff = np.diff(sliding_window_view(above_rate, window_shape=min_consecutive).all(axis=1).astype(int))
intervals = np.hstack([np.argwhere(sequence_diff==1) + 1, np.argwhere(sequence_diff==-1) + min_consecutive-1])
intervals is a 2d numpy array of indices whose two columns are the first index and the last index of a sequence (of length at least min_consecutive) of rates above the threshold:
array([[ 7, 12],
[16, 20],
[22, 29],
[39, 52],
[56, 62],
[64, 74]], dtype=int64)
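One way to handle the split at the final sharp increase, sketched as a post-processing step rather than a change to the answer above: merge neighbouring interval rows whose gap is small (max_gap is a made-up tolerance parameter):

```python
import numpy as np

def merge_intervals(intervals, max_gap=5):
    # Merge consecutive [start, end] rows whose gap is <= max_gap
    merged = [list(intervals[0])]
    for start, end in intervals[1:]:
        if start - merged[-1][1] <= max_gap:
            merged[-1][1] = end  # extend the previous interval
        else:
            merged.append([start, end])
    return np.array(merged)

intervals = np.array([[7, 12], [16, 20], [22, 29], [39, 52], [56, 62], [64, 74]])
print(merge_intervals(intervals, max_gap=2))
```

With max_gap=2 the last two rows collapse into [56, 74], which matches the "one set of data" behaviour described for the pause in the final increase.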
My code is as follows:
import bisect
a = [186, 186, 150, 200, 160, 130, 197, 200]
print("left",bisect.bisect_left(a,150))
The return value is: 0
But as specified in the document of Python 3.9:
If x is already present in a, the insertion point will be before (to the left of) any existing entries.
150 exists in the list a, so the return value should be 1 (i.e., a.index(150) - 1), but the actual return value is 0.
Would you please explain the reason?
The bisect module, and more generally the underlying binary search algorithm, is made for sorted data. For unsorted data, the result is effectively arbitrary.
For the bisect_left algorithm, sortedness means the algorithm does not have to check for equality: in a sequence a, the position i "to the left" of any existing x is the one where a[i - 1] < x and x <= a[i]. This is because sortedness enforces a[j] <= a[j + 1].
As such, technically the insertion point will be before (to the left of) any existing entries equal to or larger than x. Sortedness guarantees that this is before any existing entries of x.
For the sequence [186, 186, 150, 200, 160, 130, 197, 200] and x = 150, the insertion point is 0 because:
The list is initially bisected into [186, 186, 150, 200] and [160, ...].
The head of the right bisect is equal to or larger than x; assuming sortedness, there cannot be a value smaller than x in it.
All values in the left bisect are equal to or larger than x; assuming sortedness, the insertion point must be before all of them.
The only point before all values of the left bisect is 0.
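A quick demonstration that sorting the data first restores the documented behaviour:

```python
import bisect

a = [186, 186, 150, 200, 160, 130, 197, 200]
print(bisect.bisect_left(a, 150))   # 0: arbitrary, because a is unsorted

b = sorted(a)                       # [130, 150, 160, 186, 186, 197, 200, 200]
print(bisect.bisect_left(b, 150))   # 1: to the left of the existing 150
print(bisect.bisect_right(b, 150))  # 2: to the right of it
```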
I would like to delete the axis lines which are shown in the picture, and also put the numbers (their values) on each bar, I mean the value which belongs to each one. How can I do it?
The values are from a data set taken from Kaggle.
Here is some code to help you get the requested layout.
The states and the numbers are from Wikipedia.
import matplotlib.pyplot as plt
states = ['Acre', 'Alagoas', 'Amazonas', 'Amapá', 'Bahia', 'Ceará', 'Federal District',
'Espírito Santo', 'Goiás', 'Maranhão', 'Minas Gerais', 'Mato Grosso do Sul',
'Mato Grosso', 'Pará', 'Paraíba', 'Pernambuco', 'Piauí', 'Paraná', 'Rio de Janeiro',
'Rio Grande do Norte', 'Rondônia', 'Roraima', 'Rio Grande do Sul', 'Santa Catarina',
'Sergipe', 'São Paulo', 'Tocantins']
fires = [2918, 73, 7625, 24, 2383, 327, 68, 229, 1786, 5596, 2919, 451, 15476, 10747, 81, 132,
2818, 181, 396, 68, 6441, 4608, 2029, 1107, 62, 1616, 6436]
fires, states = zip(*sorted(zip(fires, states))) #sort both arrays on number of fires
fires = fires[-15:] # limit to the 15 highest numbers
states = states[-15:]
fig, ax = plt.subplots(figsize=(8, 6))
ax.barh(states, fires, color="#08519c")
plt.box(False) # remove the complete box around the plot
plt.xticks([]) # remove all the ticks on the x-axis
ax.yaxis.set_ticks_position('none') # removes the tick marks on the y-axis but leaves the text
for i, v in enumerate(fires):
    ax.text(v + 180, i, f'{v:,}'.replace(',', '.'), color='#08519c', fontweight='normal', ha='left', va='center')
plt.subplots_adjust(left=0.22) # more space to read the names
plt.title('Wildfires Brazil 2019', fontsize=20, y=0.98) # title larger and a bit lower
plt.show()
PS: about
for i, v in enumerate(fires):
    ax.text(v + 180, i, f'{v:,}'.replace(',', '.'), color='#08519c', fontweight='normal', ha='left', va='center')
This has v going through each element of fires, one by one; i is the index for which fires[i] == v. ax.text(x, y, 'some text') puts text at a certain position, measured in the same coordinates as those marked on the axes (that's why the axes are shown by default). When the tick labels are text instead of numbers, they are numbered internally 0, 1, 2, 3, ... . So x=v + 180 is the x-position where a number-of-fires of v+180 would sit, and y=i is just the position of label number i.
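As an aside, on Matplotlib 3.4+ the same labels can be produced with Axes.bar_label, which computes the placement for you. A minimal sketch with a made-up subset of the data:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt

# Made-up subset of the states/fires data, for illustration only
states = ['Mato Grosso', 'Pará', 'Amazonas', 'Rondônia', 'Tocantins']
fires = [15476, 10747, 7625, 6441, 6436]

fig, ax = plt.subplots()
bars = ax.barh(states, fires, color='#08519c')
texts = ax.bar_label(bars, labels=[f'{v:,}'.replace(',', '.') for v in fires],
                     padding=4, color='#08519c')
```

bar_label attaches one label per bar at the bar's end, so the manual v + 180 offset is replaced by the padding argument (in points).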
I'm trying to extract pages from a PDF that is 1000 pages long but I only need pages in the pattern of [9,10,17,18,25,26,33,34,...etc]. These numbers can be represented in the formula: pg = 1/2 (7 - 3 (-1)^n + 8*n).
I tried to define the formula and plug into tabula.read_pdf but I'm not sure how to define the 'n' variable where 'n' ranges from 0 up to 25. Right now I defined it as a list which I think is the problem...
n = list(range(25+1))
pg = 1/2 (7 - 3 (-1)^n + 8*n)
df = tabula.read_pdf(path, pages = 'pg',index_col=0, multiple_tables=False)
When trying to execute, I get a TypeError: 'int' object is not callable on line pg = 1/2 (7 - 3 (-1)^n + 8*n). How would I define the variables so that tabula extracts pages that fit the condition of the formula?
Formula is x = 1/2(8n - 3(-1)^n + 7)
Step 1:
pg = []  # empty list to store the page numbers calculated by the formula
for i in range(1, 25+1):  # for a 1000-page pdf use a larger range instead of 25
    k = int(1/2*((8*i) - 3*((-1)**i) + 7))
    pg.append(k)
print(pg)  # this will give you the list of page numbers
# [9, 10, 17, 18, 25, 26, 33, 34, 41, 42, 49, 50, 57, 58, 65, 66, 73, 74, 81, 82, 89, 90, 97, 98, 105]
Step 2:
# Now run the loop through each of the pages with a table
df_combine = pd.DataFrame([])
for pageiter in pg:
    df = tabula.read_pdf(path, pages=pageiter, index_col=0, multiple_tables=False, guess=False)  # modify it as per your requirement
    df_combine = pd.concat([df, df_combine])  # you can choose between merge or concat as per your need
OR
df_data = []
for pageiter in pg:
    df = tabula.read_pdf(path, pages=pageiter, index_col=0, multiple_tables=False, guess=False)  # modify it as per your requirement
    df_data.append(df)
df_combine = pd.concat(df_data)
Reference link used to create the formula:
https://www.wolframalpha.com/widgets/view.jsp?id=a3af2e675c3bfae0f2ecce820c2bef43
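As a side note, tabula.read_pdf also accepts a list of ints for pages, so the page numbers can be computed with a comprehension and passed in a single call (a sketch; path and the tabula call are assumptions, not run here):

```python
# Page numbers from the formula, using the loop index directly instead of n[i]
pg = [int((8*i - 3*(-1)**i + 7) / 2) for i in range(1, 26)]
print(pg[:6])  # [9, 10, 17, 18, 25, 26]

# Hypothetical single call instead of a per-page loop (not tested against a real file):
# df = tabula.read_pdf(path, pages=pg, multiple_tables=True)
```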
I see there are various python/pandas arguments to count the number of occurrences.
However, is there a way to count the number of times the series value goes above $X? It shouldn't double-count while the series value stays above $X; it should count only when it crosses above it.
For example, in the data series below, I would expect the count = 2, and not 4, given X = $200.
list_a = [100, 150, 210, 185, 166, 205, 230, 222, 195, 190, 160]
df = pd.DataFrame(list_a, columns=['mtm_level'])
Check with cumsum of a Boolean Series + where + nunique. Here "above" is not inclusive of X
X = 200
df['mtm_level'].le(X).cumsum().where(df['mtm_level'].gt(X)).nunique()
#2
X = 1000
df['mtm_level'].le(X).cumsum().where(df['mtm_level'].gt(X)).nunique()
#0
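An equivalent way to count upward crossings is to compare each value with its predecessor; a small sketch (not the answer's method, just an alternative):

```python
import pandas as pd

list_a = [100, 150, 210, 185, 166, 205, 230, 222, 195, 190, 160]
s = pd.Series(list_a)
X = 200

# A crossing is a point above X whose previous value was at or below X
crossings = int((s.gt(X) & s.shift().le(X)).sum())
print(crossings)  # 2
```

The shift() introduces a NaN at the front, and NaN comparisons evaluate to False, so a series that starts above X would need special handling with this variant.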