extract values from a list that fall in intervals specified by a pandas dataframe - python-3.x

I have a huge list of length 103237, and a dataframe of shape (8173, 6). I want to extract the values from the list that fall between the values specified by two columns (1 and 2) of the pandas dataframe. For example:
lst = [182,73,137,1,938]
###dataframe
    0    1    2   3   4
 John  150  183  NY  US
Peter   30   50  SE  US
 Stef  900  969  NY  US
Expected output list:
lst = [182,938]
Since 182 falls between 150 and 183 in the first row of the dataframe and 938 falls between 900 and 969 in the third row, I want the new list to contain 182 and 938 from the original list. To solve this I converted my dataframe to a numpy array:
nn = df.values
new_list = []
for item in lst:
    for i in range(nn.shape[0]):
        if item >= nn[i][1] and item <= nn[i][2]:
            new_list.append(item)
But the above code takes a long time, since it compares every list item against every dataframe row, and it doesn't scale to my list of 103237 items. How can I do this more efficiently?

Consider the following: given a value item, you can ask whether it lies inside any interval with the line
((df[1] <= item) & (df[2] >= item)).any()
The expressions (df[1] <= item) and (df[2] >= item) each return a boolean array of True/False values. The & combines them element-wise into a single boolean array saying, for each row, whether item is inside that row's interval. Adding .any() at the end returns True if there is any True value in the boolean array, i.e. if there is at least one interval that contains the number.
So for a single item, the above line gives you the answer.
To scan over all items you can do the following:
new_list = []
for item in lst:
    if ((df[1] <= item) & (df[2] >= item)).any():
        new_list.append(item)
or with a list comprehension:
new_list = [item for item in lst if ((df[1] <= item) & (df[2] >= item)).any()]
Edit: if this code is too slow, you can accelerate it even further with numba, but I believe pandas vectorization (i.e. using df[1] <= item) is good enough.
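For reference, here is the list-comprehension version run end to end on the sample data from the question (a minimal sketch; the DataFrame is rebuilt here with default integer column labels, matching the post):
import pandas as pd

# Sample data from the question; columns get the default integer labels 0..4
lst = [182, 73, 137, 1, 938]
df = pd.DataFrame([["John", 150, 183, "NY", "US"],
                   ["Peter", 30, 50, "SE", "US"],
                   ["Stef", 900, 969, "NY", "US"]])

# Keep an item if any row's [column 1, column 2] interval contains it
new_list = [item for item in lst if ((df[1] <= item) & (df[2] >= item)).any()]
print(new_list)  # [182, 938]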

You can iterate the list and compare each element with all pairs of columns 1 and 2 to see whether any pair's interval includes the element.
[e for e in lst if (df[1].le(e) & df[2].ge(e)).any()]
I did a test with 110000 elements in the list and 9000 rows in the dataframe and the code takes 32s to run on a macbook pro.
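If the per-item loop is still too slow, one further option (not taken from the answers above, so treat it as a sketch) is to compare all items against all intervals at once with NumPy broadcasting; it assumes lst and df as defined in the question:
import numpy as np

items = np.asarray(lst)        # shape (n_items,)
lo = df[1].to_numpy()          # lower bounds, shape (n_rows,)
hi = df[2].to_numpy()          # upper bounds, shape (n_rows,)

# (n_items, n_rows) boolean matrix; reduce over the interval axis
mask = ((items[:, None] >= lo) & (items[:, None] <= hi)).any(axis=1)
new_list = items[mask].tolist()
Note that the intermediate matrix needs memory proportional to n_items * n_rows (roughly 100,000 x 8,000 booleans here), so you may want to process lst in chunks.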

Related

How to change Pandas Column Values in List Format

I'm trying to multiply each value in a column by 0.01 but the column values are in list format. How do I apply it to each element of the list in each row? For example, my data looks like this:
ID Amount
156 [14587, 38581, 55669]
798 [67178, 98635]
And I'm trying to multiply each element in the lists by 0.01.
ID Amount
156 [145.87, 385.81, 556.69]
798 [671.78, 986.35]
I've tried the following code but got an error message saying "can't multiply sequence by non-int of type 'float'".
df['Amount'] = df3['Amount'].apply(lambda x: x*0.00000001 in x)
You need another loop / list comprehension in apply:
df['Amount'] = df.Amount.apply(lambda lst: [x * 0.01 for x in lst])
df
ID Amount
0 156 [145.87, 385.81, 556.69]
1 798 [671.78, 986.35]
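For reproducibility, a minimal self-contained version of the above, with the sample frame rebuilt from the question's data:
import pandas as pd

df = pd.DataFrame({"ID": [156, 798],
                   "Amount": [[14587, 38581, 55669], [67178, 98635]]})

# Multiply every element of every list by 0.01
df["Amount"] = df["Amount"].apply(lambda lst: [x * 0.01 for x in lst])
print(df)
# ID 156 -> [145.87, 385.81, 556.69], ID 798 -> [671.78, 986.35]
# (up to floating-point rounding)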

Assigning the value to the list from an array

res is a list holding 1867 values divided into 11 sets, each with a different number of elements.
For example, res[0] has 147 values, res[1] has 174, res[2] has 168, and so on; the 11 sets together hold 1867 elements.
altitude = [125, 85, 69, 754, 855, 324, ...] also has 1867 values.
I need to replace the res list values with the altitude values, taken in order.
I have tried:
for h in range(len(res)):
    res[h][:] = altitude
This stores all 1867 values in every set. Instead, I need the first 147 elements in set 1, the next 174 elements (starting from the 148th value) in set 2, and so on...
Thank You
You need to keep track of the number of elements assigned at each iteration to get the correct slice from altitude. If I understand correctly, res is a list of lists of varying lengths.
Here is a possible solution:
current_position = 0
for sublist in res:
    sub_len = len(sublist)
    sublist[:] = altitude[current_position: current_position + sub_len]
    current_position += sub_len
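A minimal illustration with made-up sublist lengths and values (the real res has 11 sublists totalling 1867 values, and altitude has 1867 values):
res = [[0, 0, 0], [0, 0], [0, 0, 0, 0]]             # lengths 3, 2 and 4
altitude = [125, 85, 69, 754, 855, 324, 13, 42, 7]  # 9 values in total

current_position = 0
for sublist in res:
    sub_len = len(sublist)
    sublist[:] = altitude[current_position: current_position + sub_len]
    current_position += sub_len

print(res)  # [[125, 85, 69], [754, 855], [324, 13, 42, 7]]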

I want to remove rows where a specific value doesn't increase. Is there a faster/more elegant way?

I have a dataframe with 30 columns, 1,000,000 rows and about 150 MB in size. One column is categorical with 7 different elements and another column (Depth) contains mostly increasing numbers. The graph for each of the elements looks more or less like this.
I tried to save the column Depth as a series and iterate through it while dropping rows that don't match the criteria. This was really slow.
Afterwards I added a boolean column to the dataframe which indicates whether each row will be dropped or not, so I could drop the rows at the end in a single step. Still slow. My last try (the code for it is in this post) was to create a boolean list that records whether each row passes the criteria. Still really slow (about 5 hours).
dropList = [True]*len(df.index)
for element in elements:
    currentMax = 0
    minIdx = df.loc[df['Element']==element]['Depth'].index.min()
    maxIdx = df.loc[df['Element']==element]['Depth'].index.max()
    for x in range(minIdx, maxIdx):
        if df.loc[df['Element']==element]['Depth'][x] < currentMax:
            dropList[x] = False
        else:
            currentMax = df.loc[df['Element']==element]['Depth'][x]
df: The main dataframe
elements: a list with the 7 different elements (same as in the categorical column in df)
All rows within an element where the Depth value isn't bigger than all previous ones should be dropped. With the next element the running maximum should start at 0 again.
Example:
Input: 'Depth' = [0 1 2 3 4 2 3 5 6]
'AnyOtherColumn' = [a b c d e f g h i]
Output: 'Depth' = [0 1 2 3 4 5 6]
'AnyOtherColumn' = [a b c d e h i]
This should apply to whole rows in the dataframe of course.
Is there a way to get this faster?
EDIT:
The whole rows of the input dataframe should stay as they are. Just the ones where the 'Depth' does not increase should be dropped.
EDIT2:
The remaining rows should stay in their initial order.
How about a two-step approach? First use a fast sorting algorithm (for example quicksort), and then get rid of all the duplicates.
Okay, I found a way that's faster. Here is the code:
from tqdm import tqdm  # progress bar used in the inner loop

dropList = [True]*len(df.index)
for element in elements:
    currentMax = 0
    # 'Tiefe' is the Depth column
    minIdx = df.loc[df['Element']==element]['Tiefe'].index.min()
    # maxIdx = df.loc[df['Element']==element]['Tiefe'].index.max()
    elementList = df.loc[df['Element']==element]['Tiefe'].to_list()
    for x in tqdm(range(len(elementList))):
        if elementList[x] < currentMax:
            dropList[x+minIdx] = False
        else:
            currentMax = elementList[x]
I took the column and saved it as a list. To preserve the index of the dataframe I saved the lowest index, and within the loop it gets added back on.
Overall it seems the problem was the loc lookup inside the inner loop. From an initial runtime of 5 hours, it's now about 10 seconds.
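For completeness, the same drop criterion can be expressed without an explicit loop using a grouped cumulative maximum. This is a sketch of an alternative rather than the poster's solution; it assumes the column is named 'Depth' as in the question text (the posted code uses 'Tiefe') and that Depth values are non-negative, as in the example:
# Keep a row only if its Depth equals the running maximum of Depth within
# its Element group; boolean indexing preserves the original row order.
keep = df['Depth'] >= df.groupby('Element')['Depth'].cummax()
df_filtered = df[keep]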

Slicing specific rows of a column in pandas Dataframe

In the following data frame in pandas, I want to extract the rows whose dates in column A fall between '03/01' and '06/01'. I don't want to use the index at all, as my input would be a start date and an end date. How could I do so?
       A    B
0  01/01   56
1  02/01   54
2  03/01   66
3  04/01   77
4  05/01   66
5  06/01   72
6  07/01  132
7  08/01  127
First create a list of the dates you need using date_range. I'm adding the year 2000 since you need to supply a year for this to work; I then cut it off to get the desired strings. In real life you might want to pay attention to the actual year because of things like leap days.
date_start = '03/01'
date_end = '06/01'
dates = [x.strftime('%m/%d') for x in pd.date_range('2000/{}'.format(date_start),
                                                    '2000/{}'.format(date_end), freq='D')]
dates is now equal to:
['03/01',
'03/02',
'03/03',
'03/04',
.....
'05/29',
'05/30',
'05/31',
'06/01']
Then simply use the isin method and you are done:
df = df.loc[df.A.isin(dates)]
df
If your column is a datetime column, I guess you can skip the strftime part in the list comprehension to get the right result.
You are welcome to use boolean masking, i.e.:
df[(df.A >= start_date) & (df.A <= end_date)]
Inside the brackets is a boolean array of True and False values. Only rows that fulfill your given condition (evaluate to True) will be returned. This is a great tool to have and it works well with pandas and numpy.
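A minimal sketch of the boolean-masking approach on the sample frame; plain string comparison works here only because the dates are zero-padded MM/DD strings, which sort the same way as the dates within a single year:
import pandas as pd

df = pd.DataFrame({"A": ["01/01", "02/01", "03/01", "04/01",
                         "05/01", "06/01", "07/01", "08/01"],
                   "B": [56, 54, 66, 77, 66, 72, 132, 127]})

start_date, end_date = "03/01", "06/01"
result = df[(df.A >= start_date) & (df.A <= end_date)]
print(result)  # rows 2..5: 03/01, 04/01, 05/01, 06/01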

Variable lengths don't respond to filters...then they do. Can someone explain?

Edit: All sorted, Imanol Luengo set me straight. Here's the end result, in all its glory.
I don't understand the counts of my variables; maybe someone can explain? I'm filtering two columns of pass/fail values for two locations, and I want a count of all 4 pass/fail combinations.
Here's the head of the two columns; there are 126 rows in total:
WT Result School
0 p Milan
1 p Roma
2 p Milan
3 p Milan
4 p Roma
Code so far:
data2 = pd.DataFrame(data[['WT Result', 'School']])
data2.dropna(inplace=True)
# Milan Counts
m_p = (data2['School']=='Milan') & (data2['WT Result']=='p')
milan_p = (m_p==True)
milan_pass = np.count_nonzero(milan_p) # Count of Trues for Milano
# Rome Counts
r_p = (data2['School']=='Roma') & (data2['WT Result']=='p')
rome_p = (r_p==True)
rome_pass = np.count_nonzero(rome_p) # Count of Trues for Rome
So what I've done, after stripping the excess columns (data2), is:
filter by location and == 'p' (vars m_p and r_p)
filter then by ==True (vars milan_p and rome_p)
Do a count_nonzero() for a count of 'True' (vars milan_pass and rome_pass)
Here's what I don't understand - these are the lengths of the variables:
data2: 126
m_p: 126
r_p: 126
milan_p: 126
rome_p: 126
milan_pass: 55
rome_pass: 47
Why do the lengths remain 126 once the filtering starts? To me, this suggests that neither the filtering by location nor the filtering by 'p' worked. But when I do the final count_nonzero() the results are suddenly separated by location. What is happening?
You are not filtering, you are masking. Step by step:
m_p = (data2['School']=='Milan') & (data2['WT Result']=='p')
Here m_p is a boolean array with the same length as a column of data2. Each element of m_p is set to True if the corresponding row satisfies both conditions, or to False otherwise.
milan_p = (m_p==True)
The above line is completely redundant. m_p is already a boolean array; comparing it to True will just create a copy of it. Thus, milan_p is another boolean array with the same length as m_p.
milan_pass = np.count_nonzero(milan_p)
This just counts the number of nonzero (i.e. True) elements of milan_p. Of course it matches the number of elements that you want to filter, but you are not filtering anything here.
Exactly the same applies to the Rome condition.
If you want to filter rows in pandas, you have to slice the dataframe with your newly generated mask:
filtered_milan = data2[m_p]
or alternatively
filtered_milan = data2[milan_p] # as m_p == milan_p
The above lines select the rows that have a True value in the mask (or condition), ignoring the False rows in the process.
The same applies for the second problem, rome.
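As a follow-up to the filtering point, all four pass/fail counts can also be read off in one step with a grouped size, reusing the question's data2 and masks. This is a sketch; it assumes failing rows are marked with some value other than 'p' in 'WT Result':
# Rows per (School, WT Result) combination, e.g. Milan/p is 55 and Roma/p is 47
counts = data2.groupby(['School', 'WT Result']).size()
print(counts)

# Filtering with the masks and taking the length gives the same pass counts
print(len(data2[m_p]))  # 55
print(len(data2[r_p]))  # 47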
