Assigning the value to the list from an array - python-3.x

res is a list containing 1867 values split across 11 sublists of different lengths.
For example: len(res[0]) = 147, len(res[1]) = 174, len(res[2]) = 168, and so on; the 11 sublists together hold 1867 elements.
altitude = [125,85,69,754,855,324,...] also has 1867 values.
I need to replace the values in res with the altitude values, taken in order as one continuous sequence.
I have tried:
for h in range(len(res)):
    res[h][:] = altitude
This stores all 1867 values in every sublist. I need the first 147 elements in the first sublist, the next 174 elements (starting from the 148th value) in the second sublist, and so on...
Thank You

You need to keep track of the number of elements assigned at each iteration to get the correct slice from altitude. If I understand correctly, res is a list of lists of varying lengths.
Here is a possible solution:
current_position = 0
for sublist in res:
    sub_len = len(sublist)
    sublist[:] = altitude[current_position:current_position + sub_len]
    current_position += sub_len
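For instance, with some made-up numbers (not the asker's data) the loop behaves like this:
res = [[0, 0], [0, 0, 0], [0]]
altitude = [125, 85, 69, 754, 855, 324]
# after running the loop above:
# res == [[125, 85], [69, 754, 855], [324]]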

Related

Extract subsequences from main dataframe based on the locations in another dataframe

I want to extract the subsequences indicated by the first and last locations in dataframe B.
The algorithm that I came up with is:
Identify the rows of B that fall within the locations of A.
Find the relative positions of the locations (i.e. shift the locations so that they start from 0).
Run a for loop over the relative positions to extract the subsequences.
The issue with the above algorithm is its runtime. I need an alternative approach that runs faster than the existing one.
Desired output:
first last sequences
3 5 ACA
8 12 CGGAG
105 111 ACCCCAA
115 117 TGT
Used data frames:
import pandas as pd
A = pd.DataFrame({'first.sequence': ['AAACACCCGGAG', 'ACCACACCCCAAATGTGT'],
                  'first': [1, 100], 'last': [12, 117]})
B = pd.DataFrame({'first': [3, 8, 105, 115], 'last': [5, 12, 111, 117]})
One solution could be as follows:
out = pd.merge_asof(B, A, on=['last'], direction='forward',
                    suffixes=('', '_y'))
out.loc[:, ['first', 'last']] = \
    out.loc[:, ['first', 'last']].sub(out.first_y, axis=0)
out = out.assign(sequences=out.apply(lambda row:
                                     row['first.sequence'][row['first']:row['last'] + 1],
                                     axis=1)).drop(['first.sequence', 'first_y'], axis=1)
out.update(B)
print(out)
first last sequences
0 3 5 ACA
1 8 12 CGGAG
2 105 111 ACCCCAA
3 115 117 TGT
Explanation
First, use pd.merge_asof to match the rows of B with the rows of A (the merge is on the last column, looking forward). In effect, the first values 3 and 8 from B get paired with the first value 1 from A, and 105 and 115 with 100. Now we know which string (sequence) needs slicing, and we also know where that string starts, e.g. at position 1 or 100 instead of the usual 0.
We use this last bit of information to find out where the string slice should start and end. So, we do out.loc[:,['first','last']].sub(out.first_y, axis=0). E.g. we "reset" 3 to 2 (minus 1) and 105 to 5 (minus 100).
Now, we can use df.apply to get the string slice for each sequence, essentially looping over each row. (If all your slices started and ended at the same indices, we could have used Series.str.slice instead.)
Finally, we assign the result to out (as col sequences), drop the cols we no longer need, and we use df.update to "reset" the columns first and last.
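As a small variation (my own sketch, not part of the original answer), the row-wise apply could be replaced by a plain list comprehension over the already-shifted first/last columns, before the helper columns are dropped:
# assumes 'first' and 'last' already hold the shifted (0-based) positions
out['sequences'] = [
    seq[start:stop + 1]
    for seq, start, stop in zip(out['first.sequence'], out['first'], out['last'])
]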

extract values from a list that falls in intervals specified by pandas dataframe

I have a huge list of length 103237 and a dataframe of shape (8173, 6). I want to extract the values from the list that fall between the values specified by two columns (1 and 2) of the dataframe. For example:
lst = [182,73,137,1,938]
###dataframe
0 1 2 3 4
John 150 183 NY US
Peter 30 50 SE US
Stef 900 969 NY US
Expected output list:
lst = [182,938]
Since 182 falls between 150 and 183 in the first row of the dataframe and 938 falls between 900 and 969 in the third row, I want the new list to contain 182 and 938 from the original list. To solve this I converted my dataframe to a numpy array:
nn = df.values
new_list = []
for item in lst:
    for i in range(nn.shape[0]):
        if item >= nn[i][1] and item <= nn[i][2]:
            new_list.append(item)
But the above code takes a long time, since it compares every item against every row (O(n*m)), and it doesn't scale to my list of 103237 items. How can I do this more efficiently?
Consider the following: given a value item, you can ask whether it lies inside any interval with the line
((df[1] <= item) & (df[2] >= item)).any()
The expressions (df[1] <= item) and (df[2] >= item) each return a boolean array of True/False values. The & combines them into a single boolean array that says, per row, whether item is inside that specific interval. The .any() at the end returns True if there is any True value in the array, i.e. if at least one interval contains the number.
So for a single item, the above line gives you the answer.
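To make that concrete with the example data above (the boolean values below are worked out by hand, so treat them as illustrative):
item = 182
(df[1] <= item)                              # [True, True, False]
(df[2] >= item)                              # [True, False, True]
((df[1] <= item) & (df[2] >= item)).any()    # True, so 182 is kept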
To scan over all items you can do the following:
new_list = []
for item in lst:
    if ((df[1] <= item) & (df[2] >= item)).any():
        new_list.append(item)
or with a list comprehension:
new_list = [item for item in lst if ((df[1] <= item) & (df[2] >= item)).any()]
Edit: if this code is still too slow you can accelerate it further with numba, but I believe the pandas vectorization (i.e. using df[1] <= item) is good enough.
You can iterate the list and compare each element with all pairs of column 1 and 2 to see if there is any pair that would include the element.
[e for e in lst if (df['1'].lt(e) & df['2'].gt(e)).any()]
I did a test with 110000 elements in the list and 9000 rows in the dataframe and the code takes 32s to run on a macbook pro.
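If even that is too slow, a sorted-interval approach can cut the per-item cost down to a binary search. The sketch below is my own alternative (not from either answer) and assumes the integer column labels 1 and 2 used above: it merges overlapping intervals first, then locates every item with np.searchsorted.
import numpy as np

lows = df[1].to_numpy()
highs = df[2].to_numpy()

# sort intervals by their lower bound and merge any that overlap
order = np.argsort(lows)
merged = []
for lo, hi in zip(lows[order], highs[order]):
    if merged and lo <= merged[-1][1]:
        merged[-1][1] = max(merged[-1][1], hi)
    else:
        merged.append([lo, hi])
starts = np.array([m[0] for m in merged])
ends = np.array([m[1] for m in merged])

items = np.asarray(lst)
# index of the last merged interval whose start is <= item (-1 if none)
idx = np.searchsorted(starts, items, side='right') - 1
inside = (idx >= 0) & (items <= ends[np.clip(idx, 0, None)])
new_list = list(items[inside])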

I want to improve the speed of my algorithm with multiple rows of input. Python. Find average of consecutive elements in list

I need to find the average of consecutive elements from a list.
First I am given the length of the list,
then the list of numbers,
then how many tests I need to perform (several rows of input),
then the inputs for the tests (and I need to print as many rows of results).
Every test row consists of a start and an end index into the list.
My algorithm:
nu = int(input())                 # first I am given the length of the list
numbers = input().split()         # then the list of numbers
num = input()                     # number of rows with inputs
k = [float(i) for i in numbers]   # the numbers in the list are of float type
i = 0
while i < int(num):
    a, b = input().split()        # start and end index in the list
    i += 1
    print(round(sum(k[int(a):int(b) + 1]) / (-int(a) + int(b) + 1), 6))  # round to 6 decimals
But it's not fast enough. I was told it's better to get rid of the while loop, but I don't know how. I'd appreciate any help.
Example:
Input:
8 - len(list)
79.02 36.68 79.83 76.00 95.48 48.84 49.95 91.91 - list
10 - number of test
0 0 - a1,b1
0 1
0 2
0 3
0 4
0 5
0 6
0 7
1 7
2 7
Output:
79.020000
57.850000
65.176667
67.882500
73.402000
69.308333
66.542857
69.713750
68.384286
73.668333
i = 0
while i < int(num):
    a, b = input().split()  # start and end index in the list
    i += 1
Replace your while loop with a for loop. You could also get rid of the repeated int calls in the print statement:
for _ in range(int(num)):
    a, b = [int(j) for j in input().split()]
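Putting those two suggestions together, the loop might look roughly like this (a sketch, reusing the k and num variables from the question):
for _ in range(int(num)):
    a, b = [int(j) for j in input().split()]  # start and end index
    print(round(sum(k[a:b + 1]) / (b - a + 1), 6))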
You didn't spell out the constraints, but I am guessing that the ranges to be averaged could be quite large. Computing sum(k[int(a):(int(b)+1)]) may take a while.
However, if you precompute partial sums of the input list, each query can be answered in a constant time (sum of numbers in the range is a difference of corresponding partial sums).
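A minimal sketch of the prefix-sum idea (variable names are my own, and it assumes the input format shown above):
from itertools import accumulate

n = int(input())                           # length of the list
k = [float(x) for x in input().split()]    # the numbers
prefix = [0.0] + list(accumulate(k))       # prefix[i] == sum(k[:i])

num = int(input())                         # number of tests
for _ in range(num):
    a, b = map(int, input().split())
    total = prefix[b + 1] - prefix[a]      # sum(k[a:b+1]) in O(1)
    print(f"{total / (b - a + 1):.6f}")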

I want to remove rows where a specific value doesn't increase. Is there a faster/more elegant way?

I have a dataframe with 30 columns, 1,000,000 rows, and a size of about 150 MB. One column is categorical with 7 different elements and another column (Depth) contains mostly increasing numbers. The graph for each of the elements looks more or less like this.
I tried saving the column Depth as a series and iterating through it while dropping rows that don't match the criteria. This was reeeeeaaaally slow.
Afterwards I added a boolean column to the dataframe indicating whether a row would be dropped, so I could drop all the rows at the end in a single step. Still slow. My last try (the code for it is in this post) was to create a boolean list that records whether each row passes the criteria. Still really slow (about 5 hours).
dropList = [True] * len(df.index)
for element in elements:
    currentMax = 0
    minIdx = df.loc[df['Element'] == element]['Depth'].index.min()
    maxIdx = df.loc[df['Element'] == element]['Depth'].index.max()
    for x in range(minIdx, maxIdx):
        if df.loc[df['Element'] == element]['Depth'][x] < currentMax:
            dropList[x] = False
        else:
            currentMax = df.loc[df['Element'] == element]['Depth'][x]
df: The main dataframe
elements: a list with the 7 different elements (same as in the categorical column in df)
All rows within an element where the Depth value isn't greater than all previous ones should be dropped. With the next element the comparison should start from 0 again.
Example:
Input: 'Depth' = [0 1 2 3 4 2 3 5 6]
'AnyOtherColumn' = [a b c d e f g h i]
Output: 'Depth' [0 1 2 3 4 5 6]
'AnyOtherColumn' = [a b c d e h i]
This should apply to whole rows in the dataframe of course.
Is there a way to get this faster?
EDIT:
The whole rows of the input dataframe should stay as they are. Just the ones where the 'Depth' does not increase should be dropped.
EDIT2:
The remaining rows should stay in their initial order.
How about you take a 2-step approach. First you use a fast sorting algorithm (for example Quicksort) and next you get rid of all the duplicates?
Okay, I found a way that's faster. Here is the code:
from tqdm import tqdm  # progress bar used in the loop below

dropList = [True] * len(df.index)
for element in elements:
    currentMax = 0
    minIdx = df.loc[df['Element'] == element]['Tiefe'].index.min()
    # maxIdx = df.loc[df['Element'] == element]['Tiefe'].index.max()
    elementList = df.loc[df['Element'] == element]['Tiefe'].to_list()
    for x in tqdm(range(len(elementList))):
        if elementList[x] < currentMax:
            dropList[x + minIdx] = False
        else:
            currentMax = elementList[x]
I took the column and saved it as a list. To preserve the index of the dataframe, I saved the lowest one, and within the loop it gets added back on.
Overall it seems the problem was the loc call: from an initial runtime of 5 hours, it's now about 10 seconds.
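For reference, a fully vectorised sketch (my own variation, not from the question or the answers, and assuming the column names 'Element' and 'Depth' used in the question): keep a row only when its Depth is at least as large as the running maximum of its element group.
# boolean mask: True where Depth equals the group-wise running maximum
keep = df['Depth'] >= df.groupby('Element')['Depth'].cummax()
df_filtered = df[keep]  # remaining rows keep their original order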

Variable lengths don't respond to filters...then they do. Can someone explain?

Edit: All sorted, Imanol Luengo set me straight. Here's the end result, in all its glory.
I don't understand the counts of my variables; maybe someone can explain? I'm filtering two columns of pass/fails for two locations, and I want a count of all 4 pass/fail combinations.
Here's the head of the data. There are 126 rows in total:
WT Result School
0 p Milan
1 p Roma
2 p Milan
3 p Milan
4 p Roma
Code so far:
data2 = pd.DataFrame(data[['WT Result', 'School']])
data2.dropna(inplace=True)
# Milan Counts
m_p = (data2['School']=='Milan') & (data2['WT Result']=='p')
milan_p = (m_p==True)
milan_pass = np.count_nonzero(milan_p) # Count of Trues for Milano
# Rome Counts
r_p = (data2['School']=='Roma') & (data2['WT Result']=='p')
rome_p = (r_p==True)
rome_pass = np.count_nonzero(rome_p) # Count of Trues for Rome
So what I've done, after stripping the excess columns (data2), is:
filter by location and == 'p' (vars m_p and r_p)
filter then by ==True (vars milan_p and rome_p)
Do a count_nonzero() for a count of 'True' (vars milan_pass and rome_pass)
Here's what I don't understand - these are the lengths of the variables:
data2: 126
m_p: 126
r_p: 126
milan_p: 126
rome_p: 126
milan_pass: 55
rome_pass: 47
Why do the lengths remain 126 once the filtering starts? To me, this shows that neither the filtering by location nor the filtering by 'p' worked. But when I do the final count_nonzero() the results are suddenly separated by location. What is happening?
You are not filtering, you are masking. Step by step:
m_p = (data2['School']=='Milan') & (data2['WT Result']=='p')
Here m_p is a boolean array with the same length as a column of data2. Each element of m_p is set to True if it satisfies both conditions, or to False otherwise.
milan_p = (m_p==True)
The above line is completely redundant. m_p is already a boolean array; comparing it to True will just create a copy of it. Thus, milan_p will be another boolean array with the same length as m_p.
milan_pass = np.count_nonzero(milan_p)
This just counts the number of nonzero (i.e. True) elements of milan_p. Of course, it matches the number of elements that you want to filter, but you are not filtering anything here.
Exactly the same applies to the Rome condition.
If you want to filter rows in pandas, you have to slice the dataframe with your newly generated mask:
filtered_milan = data2[m_p]
or alternatively
filtered_milan = data2[milan_p] # as m_p == milan_p
The above lines select the rows that have a True value in the mask (or condition), ignoring the False rows in the process.
The same applies to the second problem, Rome.
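For completeness, here is a short sketch (my own, not from the answer) that computes the pass counts per school directly, without intermediate masks; it assumes the data2 frame built in the question:
# count the 'p' rows per school in one go
pass_counts = data2[data2['WT Result'] == 'p'].groupby('School').size()
milan_pass = pass_counts.get('Milan', 0)
rome_pass = pass_counts.get('Roma', 0)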
