Variable lengths don't respond to filters...then they do. Can someone explain? - python-3.x

Edit: All sorted, Imanol Luengo set me straight. Here's the end result, in all its glory.
I don't understand the counts of my variables, maybe someone can explain? I'm filtering two columns of pass/fails for two locations. I want a count of all 4 pass/fails.
Here's the header of the columns. There are 126 values in total:
WT Result School
0 p Milan
1 p Roma
2 p Milan
3 p Milan
4 p Roma
Code so far:
import pandas as pd
import numpy as np
data2 = pd.DataFrame(data[['WT Result', 'School']])
data2.dropna(inplace=True)
# Milan Counts
m_p = (data2['School']=='Milan') & (data2['WT Result']=='p')
milan_p = (m_p==True)
milan_pass = np.count_nonzero(milan_p) # Count of Trues for Milano
# Rome Counts
r_p = (data2['School']=='Roma') & (data2['WT Result']=='p')
rome_p = (r_p==True)
rome_pass = np.count_nonzero(rome_p) # Count of Trues for Rome
So what I've done, after stripping the excess columns (data2), is:
filter by location and == 'p' (vars m_p and r_p)
filter then by ==True (vars milan_p and rome_p)
Do a count_nonzero() for a count of 'True' (vars milan_pass and rome_pass)
Here's what I don't understand - these are the lengths of the variables:
data2: 126
m_p: 126
r_p: 126
milan_p: 126
rome_p: 126
milan_pass: 55
rome_pass: 47
Why do the lengths remain 126 once the filtering starts? To me, this shows that neither the filtering by location nor the filtering by 'p' worked. But when I do the final count_nonzero() the results are suddenly separated by location. What is happening?

You are not filtering, you are masking. Step by step:
m_p = (data2['School']=='Milan') & (data2['WT Result']=='p')
Here m_p is a boolean array with the same length as a column of data2. Each element of m_p is set to True if it satisfies both conditions, or to False otherwise.
milan_p = (m_p==True)
The above line is completely redundant. m_p is already a boolean array; comparing it to True will just create a copy of it. Thus, milan_p will be another boolean array with the same length as m_p.
milan_pass = np.count_nonzero(milan_p)
This just counts the number of nonzero (i.e. True) elements of milan_p. Of course, it matches the number of elements that you want to filter, but you are not filtering anything here.
Exactly the same applies to the Rome condition.
If you want to filter rows in pandas, you have to slice the dataframe with your newly generated mask:
filtered_milan = data2[m_p]
or alternatively
filtered_milan = data2[milan_p] # as m_p == milan_p
The above lines select the rows that have a True value in the mask (or condition), ignoring the False rows in the process.
The same applies to the second problem, Rome.
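For reference, here's a minimal sketch of the full flow (masking, slicing and counting), assuming data is the original DataFrame loaded elsewhere:
import pandas as pd
# Keep only the relevant columns and drop missing values
data2 = data[['WT Result', 'School']].dropna()
# Boolean masks (not filters) for each school's passes
m_p = (data2['School'] == 'Milan') & (data2['WT Result'] == 'p')
r_p = (data2['School'] == 'Roma') & (data2['WT Result'] == 'p')
# Slicing with the masks is what actually filters the rows
filtered_milan = data2[m_p]
filtered_rome = data2[r_p]
# Counting passes works directly on the boolean masks
milan_pass = m_p.sum()
rome_pass = r_p.sum()
print(milan_pass, rome_pass)  # e.g. 55 47, per the counts above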

Related

Extract subsequences from main dataframe based on the locations in another dataframe

I want to extract the subsequences indicated by the first and last locations in data frame 'B'.
The algorithm that I came up with is:
Identify the rows of B that fall in the locations of A
Find the relative position of the locations (i.e. shift the locations to make them start from 0)
Start a for loop using the relative position as a range to extract the subsequences.
The issue with the above algorithm is runtime. I need an alternative approach that runs faster than the existing one.
Desired output:
first last sequences
3 5 ACA
8 12 CGGAG
105 111 ACCCCAA
115 117 TGT
Used data frames:
import pandas as pd
A = pd.DataFrame({'first.sequence': ['AAACACCCGGAG', 'ACCACACCCCAAATGTGT'],
                  'first': [1, 100], 'last': [12, 117]})
B = pd.DataFrame({'first': [3,8,105,115], 'last':[5,12,111,117]})
One solution could be as follows:
out = pd.merge_asof(B, A, on=['last'], direction='forward',
                    suffixes=('', '_y'))
out.loc[:, ['first', 'last']] = \
    out.loc[:, ['first', 'last']].sub(out.first_y, axis=0)
out = out.assign(sequences=out.apply(lambda row:
                 row['first.sequence'][row['first']:row['last']+1],
                 axis=1)).drop(['first.sequence', 'first_y'], axis=1)
out.update(B)
print(out)
first last sequences
0 3 5 ACA
1 8 12 CGGAG
2 105 111 ACCCCAA
3 115 117 TGT
Explanation
First, use pd.merge_asof (merging on last with direction='forward') to match each row of B with the row of A whose sequence contains it. I.e. the rows with first 3 and 8 match A's first value 1, and the rows with first 105 and 115 match 100. Now we know which string (sequence) needs slicing and we also know where that string starts, e.g. at position 1 or 100 instead of the usual 0.
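To make that first step concrete, here is the intermediate merge on its own (illustrative, using the data frames above):
out = pd.merge_asof(B, A, on=['last'], direction='forward', suffixes=('', '_y'))
# Each row of B now carries the matching 'first.sequence' string from A, plus A's
# start position as 'first_y': 1 for the rows with last 5 and 12, and 100 for the
# rows with last 111 and 117. Subtracting first_y turns the absolute positions
# into string indices.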
We use this last bit of information to find out where the string slice should start and end. So, we do out.loc[:,['first','last']].sub(out.first_y, axis=0). E.g. we "reset" 3 to 2 (minus 1) and 105 to 5 (minus 100).
Now, we can use df.apply to get the string slice for each row, essentially looping over each row. (If all slices started and ended at the same indices, we could have used Series.str.slice instead.)
Finally, we assign the result to out (as col sequences), drop the cols we no longer need, and we use df.update to "reset" the columns first and last.

extract values from a list that falls in intervals specified by pandas dataframe

I have a huge list of length 103237. And I have a data frame of shape (8173, 6). I want to extract those values from the list that fall between the values specified by two columns (1 and 2) of the pandas dataframe. For example:
lst = [182,73,137,1,938]
###dataframe
0 1 2 3 4
John 150 183 NY US
Peter 30 50 SE US
Stef 900 969 NY US
Expected output list:
lst = [182,938]
Since 182 falls between 150 and 183 in the first row of the dataframe and 938 falls between 900 and 969 in row 3, I want the new list to contain 182 and 938 from the original list. In order to solve this problem I converted my dataframe to a numpy array:
nn = df.values
new_list = []
for item in lst:
    for i in range(nn.shape[0]):
        if item >= nn[i][1] and item <= nn[i][2]:
            new_list.append(item)
But the above code takes a long time, since it is roughly O(n*m) (list length times number of rows), and it doesn't scale well to my list, which has 103237 items. How can I do this more efficiently?
Consider the following: assuming you have a value item, you can ask whether it is inside any interval with the following line
((df[1] <= item) & (df[2] >= item)).any()
The expressions (df[1] <= item) and (df[2] >= item) each return a boolean array of True/False values. The & combines them into a single boolean array saying, for each row, whether item is inside that specific interval. Adding .any() at the end returns True if there is any True value in the boolean array, i.e. if there is at least one interval that contains the number.
So for a single item, you can get an answer with the above line.
To scan over all items you can do the following:
new_list = []
for item in lst:
    if ((df[1] <= item) & (df[2] >= item)).any():
        new_list.append(item)
or with a list comprehension:
new_list = [item for item in lst if ((df[1] <= item) & (df[2] >= item)).any()]
Edit: if this code is too slow you can accelerate it even further with numba, but I believe pandas vectorization (i.e. using df[1] <= item) is good enough.
You can iterate over the list and compare each element with all pairs of columns 1 and 2 to see whether any pair would include the element.
[e for e in lst if (df['1'].lt(e) & df['2'].gt(e)).any()]
I did a test with 110000 elements in the list and 9000 rows in the dataframe and the code takes 32s to run on a macbook pro.
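If that is still too slow, here is a fully vectorized sketch using NumPy broadcasting (assuming the interval columns are the ones labelled 1 and 2, as above). The intermediate boolean matrix has len(lst) x len(df) entries, so with roughly 100k items and 8k rows it may be worth processing the list in chunks:
import numpy as np
items = np.asarray(lst)
lo = df[1].to_numpy()
hi = df[2].to_numpy()
# mask[i] is True if items[i] falls inside at least one [lo, hi] interval
mask = ((items[:, None] >= lo) & (items[:, None] <= hi)).any(axis=1)
new_list = items[mask].tolist()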

Assigning the value to the list from an array

res is a list with 1867 values divided into 11 sets, each with a different number of elements.
ex: len(res[0]) = 147, len(res[1]) = 174, len(res[2]) = 168, and so on; the 11 sets total 1867 elements.
altitude=[125,85,69,754,855,324,...] has 1867 values.
I need to replace the values in res with consecutive values from the altitude list.
I have tried:
for h in range(len(res)):
    res[h][:] = altitude
It is storing all 1867 values in every set. I need the first 147 elements in set 1, the next 174 elements (starting from the 148th value) in set 2, and so on...
Thank You
You need to keep track of the number of elements assigned at each iteration to get the correct slice from altitude. If I understand correctly, res is a list of lists with varying lengths.
Here is a possible solution:
current_position = 0
for sublist in res:
    sub_len = len(sublist)
    sublist[:] = altitude[current_position: current_position + sub_len]
    current_position += sub_len
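A toy run of that loop, with made-up sublist sizes (stand-ins for the 11 real sets) and mostly made-up altitude values, just to illustrate the slicing:
res = [[0, 0, 0], [0, 0], [0, 0, 0, 0]]
altitude = [125, 85, 69, 754, 855, 324, 17, 42, 9]
current_position = 0
for sublist in res:
    sub_len = len(sublist)
    sublist[:] = altitude[current_position: current_position + sub_len]
    current_position += sub_len
print(res)  # [[125, 85, 69], [754, 855], [324, 17, 42, 9]]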

I want to remove rows where a specific value doesn't increase. Is there a faster/more elegant way?

I have a dataframe with 30 columns, 1,000,000 rows and a size of about 150 MB. One column is categorical with 7 different elements and another column (Depth) contains mostly increasing numbers. The graph for each of the elements looks more or less like this.
I tried to save the column Depth as a series and iterate through it while dropping rows that don't match the criteria. This was reeeeeaaaally slow.
Afterwards I added a boolean column to the dataframe which indicates whether a row will be dropped or not, so I could drop all the rows at the end in a single step. Still slow. My last try (the code for it is in this post) was to create a boolean list that records whether each row passes the criteria. Still really slow (about 5 hours).
dropList = [True]*len(df.index)
for element in elements:
    currentMax = 0
    minIdx = df.loc[df['Element']==element]['Depth'].index.min()
    maxIdx = df.loc[df['Element']==element]['Depth'].index.max()
    for x in range(minIdx, maxIdx):
        if df.loc[df['Element']==element]['Depth'][x] < currentMax:
            dropList[x] = False
        else:
            currentMax = df.loc[df['Element']==element]['Depth'][x]
df: The main dataframe
elements: a list with the 7 different elements (same as in the categorical column in df)
All rows within an element where the value of Depth isn't bigger than all previous ones should be dropped. With the next element it should start from 0 again.
Example:
Input: 'Depth' = [0 1 2 3 4 2 3 5 6]
'AnyOtherColumn' = [a b c d e f g h i]
Output: 'Depth' [0 1 2 3 4 5 6]
'AnyOtherColumn' = [a b c d e h i]
This should apply to whole rows in the dataframe of course.
Is there a way to get this faster?
EDIT:
The whole rows of the input dataframe should stay as they are. Just the ones where the 'Depth' does not increase should be dropped.
EDIT2:
The remaining rows should stay in their initial order.
How about you take a 2-step approach? First you use a fast sorting algorithm (for example quicksort) and next you get rid of all the duplicates.
Okay, I found a way that's faster. Here is the code:
from tqdm import tqdm
dropList = [True]*len(df.index)
for element in elements:
    currentMax = 0
    minIdx = df.loc[df['Element']==element]['Tiefe'].index.min()
    # maxIdx = df.loc[df['Element']==element]['Tiefe'].index.max()
    elementList = df.loc[df['Element']==element]['Tiefe'].to_list()
    for x in tqdm(range(len(elementList))):
        if elementList[x] < currentMax:
            dropList[x+minIdx] = False
        else:
            currentMax = elementList[x]
I took the column and saved it as a list. To preserve the index of the dataframe, I saved the lowest one, and within the loop it gets added back.
Overall it seems the problem was the loc function. From an initial runtime of 5 hours, it's now about 10 seconds.
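For what it's worth, a fully vectorized sketch using a grouped cumulative maximum (assuming the column names 'Element' and 'Depth' from the question, and non-negative depths to match the currentMax = 0 initialisation):
# Running maximum of Depth within each Element group, computed in the original row order
running_max = df.groupby('Element')['Depth'].cummax()
# Keep a row only if its Depth is at least as large as every earlier Depth in its group;
# boolean indexing preserves the initial order of the remaining rows
filtered = df[df['Depth'] >= running_max]
For the example above, Depth = [0 1 2 3 4 2 3 5 6] gives a running maximum of [0 1 2 3 4 4 4 5 6], so the rows with 2 and 3 after the first 4 are dropped, matching the desired output.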

Iterating over columns and comparing each row value of that column to another column's value in Pandas

I am trying to iterate through a range of 3 columns (named 0, 1, 2). In each iteration of that column I want to compare each row-wise value to another column called Flag (a row-wise comparison for equality) in the same frame. I then want to return the matching field.
I want to check if the values match.
Maybe there is an easier approach: concatenate those columns into a single list, then iterate through that list and see if there are any matches to that extra column? I am not very well versed in Pandas or Numpy yet.
I'm trying to think of something efficient as well, as I have a large data set to perform this on.
Most of this is pretty free thought, so I am just trying lots of different methods.
Some attempts so far using the iterate over each column method:
##Sample Data
df = pd.DataFrame([['123', '456', '789', '123'],
                   ['357', '125', '234', '863'],
                   ['168', '298', '573', '298'],
                   ['123', '234', '573', '902']])
df = df.rename(columns={3: 'Flag'})
##Loop to find matches
i = 0
while i <= 2:
    df['Matches'] = df[i].equals(df['Flag'])
    i += 1
My thought process is to iterate over each column named 0 - 2, check whether the row-wise values match between 'Flag' and columns 0-2, and then return whether they matched or not. I am not entirely sure which would be the best way to store the match result.
Maybe utilizing a different structured approach would be beneficial.
I provided a sample frame that should have some matches if I can execute this properly.
Thanks for any help.
You can use iloc in combination with eq, then flag the row if any of the columns match using .any:
m = df.iloc[:, :-1].eq(df['Flag'], axis=0).any(axis=1)
df['indicator'] = m
0 1 2 Flag indicator
0 123 456 789 123 True
1 357 125 234 863 False
2 168 298 573 298 True
3 123 234 573 902 False
You can also use the result for boolean indexing to select rows. To see how it is built up, the eq comparison alone gives:
df.iloc[:, :-1].eq(df['Flag'], axis=0)
0 1 2
0 True False False
1 False False False
2 False True False
3 False False False
Then if we chain it with any:
df.iloc[:, :-1].eq(df['Flag'], axis=0).any(axis=1)
0 True
1 False
2 True
3 False
dtype: bool
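If the goal is to keep only the matching rows rather than add an indicator column, a short usage sketch with the mask m from above:
matching = df[m]  # boolean indexing keeps rows 0 and 2, where one of columns 0-2 equals Flag
print(matching)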
