I want to remove rows where a specific value doesn't increase. Is there a faster/more elegant way? - python-3.x

I have a dataframe with 30 columns, 1.000.000 rows and about 150 MB size. One column is categorical with 7 different elements and another column (Depth) contains mostly increasing numbers. The graph for each of the elements looks more or less like this.
I tried to save the column Depth as series and iterate through it while dropping rows that won't match the criteria. This was reeeeeaaaally slow.
Afterwards I added a boolean column to the dataframe that marks whether each row should be dropped, so I could drop all the rows in a single step at the end. Still slow. My last attempt (code below) was to build a boolean list recording whether each row passes the criterion. Still really slow (about 5 hours).
dropList = [True]*len(df.index)
for element in elements:
    currentMax = 0
    minIdx = df.loc[df['Element']==element]['Depth'].index.min()
    maxIdx = df.loc[df['Element']==element]['Depth'].index.max()
    for x in range(minIdx,maxIdx):
        if df.loc[df['Element']==element]['Depth'][x] < currentMax:
            dropList[x]=False
        else:
            currentMax = df.loc[df['Element']==element]['Depth'][x]
df: The main dataframe
elements: a list with the 7 different elements (same as in the categorical column in df)
All rows in an element, where the value Depth isn't bigger than all previous ones should be dropped. With the next element it should start with 0 again.
Example:
Input:  'Depth' = [0 1 2 3 4 2 3 5 6]
        'AnyOtherColumn' = [a b c d e f g h i]
Output: 'Depth' = [0 1 2 3 4 5 6]
        'AnyOtherColumn' = [a b c d e h i]
This should apply to whole rows in the dataframe of course.
Is there a way to get this faster?
EDIT:
The whole rows of the input dataframe should stay as they are. Just the ones where the 'Depth' does not increase should be dropped.
EDIT2:
The remaining rows should stay in their initial order.

How about you take a 2-step approach. First you use a fast sorting algorithm (for example Quicksort) and next you get rid of all the duplicates?

Okay, I found a way that's faster. Here is the code:
from tqdm import tqdm  # progress bar only; not needed for the logic

dropList = [True]*len(df.index)
for element in elements:
    currentMax = 0
    # 'Tiefe' is the 'Depth' column from the question
    minIdx = df.loc[df['Element']==element]['Tiefe'].index.min()
    # maxIdx = df.loc[df['Element']==element]['Tiefe'].index.max()
    elementList = df.loc[df['Element']==element]['Tiefe'].to_list()
    for x in tqdm(range(len(elementList))):
        if elementList[x] < currentMax:
            dropList[x+minIdx]=False
        else:
            currentMax = elementList[x]
I took the column and saved it as a plain list. To preserve the dataframe's index, I saved the lowest index of each element and add it back inside the loop.
Overall it seems the problem was the repeated loc lookup inside the loop. The runtime went from about 5 hours to about 10 seconds.
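For what it's worth, the same filter can probably be written without a Python loop at all. The following is only a sketch under a few assumptions: the columns are named Element and Depth as in the question, depths are non-negative (the loop above starts currentMax at 0), and ties are kept, as they are with the `<` comparison in the loop. The function name keep_running_max and the small example frame are made up for illustration.

```python
import pandas as pd


def keep_running_max(df, group_col="Element", value_col="Depth"):
    """Keep rows whose value is at least the running maximum within their group.

    Hypothetical helper: column names default to those used in the question.
    """
    # Running maximum of the value column, restarted for every group.
    running_max = df.groupby(group_col)[value_col].cummax()
    # A row survives only if it sets (or ties) the running maximum.
    # Boolean indexing keeps whole rows in their original order.
    return df[df[value_col] >= running_max]


# Made-up example: the question's sample for one element plus a second element.
example = pd.DataFrame({
    "Element": ["x"] * 9 + ["y"] * 3,
    "Depth": [0, 1, 2, 3, 4, 2, 3, 5, 6, 0, 2, 1],
    "AnyOtherColumn": list("abcdefghijkl"),
})
result = keep_running_max(example)
print(result)
```

Because the mask is computed per group, this does not rely on the rows of one element being contiguous or on the index being a plain 0..n-1 range, which the dropList[x+minIdx] trick does.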

Related

groupby consecutive identical values in pandas dataframe and cumulative count of the number of occurrences

I have a problem where I would like to count the number of times the current value has not changed in a dataframe over rolling periods.
For example:
df = pd.DataFrame({'col':list('aaaabbab')})
would somehow give output of
0
1
2
3
0
1
0
0
I have been trying something along the following
df['col'] = df['col'] == df['col'].shift(1)
df.rolling(window=3).sum().reset_index(drop=True, level=0)
I have added in the rolling as I will want to look at the full data set in terms of rolling periods but even without having it over rolling periods I can not quite figure out the logic.
I am not sure if I am missing something simple or this may not be possible using shift
You need to generate a grouper for the change in values. For this compare each value with the previous one and apply a cumsum. This gives you groups in the itertools.groupby style ([1, 1, 1, 1, 2, 2, 3, 4]), finally group and apply a cumcount.
df['count'] = (df.groupby(df['col'].ne(df['col'].shift()).cumsum())
                 .cumcount()
              )
output:
  col  count
0   a      0
1   a      1
2   a      2
3   a      3
4   b      0
5   b      1
6   a      0
7   b      0
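To make the grouper idea easier to follow, here is the same one-liner split into its intermediate steps. This is just an illustrative breakdown; the variable names changed, run_id and count are mine, not part of the answer.

```python
import pandas as pd

df = pd.DataFrame({"col": list("aaaabbab")})

# True wherever the value differs from the previous row (the first row is always True).
changed = df["col"].ne(df["col"].shift())

# Cumulative sum of the change flags gives one id per run of identical values.
run_id = changed.cumsum()

# Position of each row inside its run.
count = df.groupby(run_id).cumcount()

print(pd.DataFrame({"col": df["col"], "changed": changed, "run_id": run_id, "count": count}))
```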
edit: for fun here is a solution using itertools (much faster):
from itertools import groupby, chain
df['count'] = list(chain(*(list(range(len(list(g))))
                           for _,g in groupby(df['col']))))
NB. this runs much faster (88 µs vs 707 µs on the provided example)
I can't comment, so this is just to add to @mozway's answer.
My goal was to count consecutive values efficiently across an entire, very large dataframe.
The problem I ran into is that, by construction,
np.nan == np.nan
returns False, so a column containing nothing but NaN will still have its counter stuck at 0.
A simple workaround is to replace every NaN in your df with a value that does not already occur in it.
For instance, for a float dataset you could do
df.fillna('NA')
which works, but it changes the dtype of your columns to object, which makes the following code much slower (20x on my setup).
I would rather advise something like:
import numpy as np
from itertools import groupby, chain

all_values = list(np.unique(np.array(df)))
all_values = [a for a in all_values if a==a]  # drop NaN (NaN != NaN)
unik_val = min(all_values)-1  # a numeric value that cannot already be in df
temp = df.fillna(unik_val).copy()
for col in temp.columns:
    temp[col] = list(chain(*(list(range(len(list(g))))
                             for _,g in groupby(temp[col]))))
temp  # per-column consecutive counts
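As a possible alternative to filling NaN with a sentinel, the change mask from the cumsum approach can be adjusted so that two adjacent NaN values count as "unchanged". This is only a sketch, not something taken from the answers above; consecutive_count is a hypothetical helper name.

```python
import numpy as np
import pandas as pd


def consecutive_count(s):
    """Count consecutive repeats in a Series, treating adjacent NaN as equal.

    Hypothetical helper for illustration.
    """
    prev = s.shift()
    # NaN != NaN, so handle "both missing" explicitly.
    both_nan = s.isna() & prev.isna()
    changed = s.ne(prev) & ~both_nan
    # One id per run of identical (or identically missing) values.
    run_id = changed.cumsum()
    return s.groupby(run_id).cumcount()


s = pd.Series([np.nan, np.nan, 1.0, 1.0, np.nan])
counts = consecutive_count(s)
print(counts.tolist())
```

For a whole dataframe, df.apply(consecutive_count) would run this column by column without changing any dtype.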

How to combine two rows of same dataset side by side in python?

I have a dataset and I want to combine the first two rows of the same dataset into a single dataset. The original dataset is very big but I have mentioned a small example here.
df
  one two three
0   T   H     A
1   N   K     S
2   F   O     R
3   H   L     P
After combining the first two rows it should look like this:
df
  one two three one two three
0   T   H     A   N   K     S
I'm very new to StackOverflow and started my career recently in python. If my question is not formatted correctly please suggest edits. Thanks.
You can use df.iloc to get two slices of the dataframe one for even rows and another for odd rows. Then pd.concat(..., axis=1) to get them back together.
Notice pd.concat will try to align the input dataframes on their index (i.e. 0, 1, 2, 3) and if one of the dataframes does not have data for a particular index then it will fill with null values. So we need reset_index to get the desired output.
df = pd.concat(
    [
        df.iloc[::2].reset_index(drop=True),
        df.iloc[1::2].reset_index(drop=True)
    ], axis=1
)
Output
  one two three one two three
0   T   H     A   N   K     S
1   F   O     R   H   L     P
You can read more about pd.concat in this answer and of course the user guide
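To see why reset_index matters, here is a small sketch that rebuilds the example frame from the question and compares the result with and without it. The variable names (even, odd, misaligned, aligned, first_two) are made up; the last expression is one way to combine only the first two rows, which is what the question literally asks for.

```python
import pandas as pd

df = pd.DataFrame({
    "one": list("TNFH"),
    "two": list("HKOL"),
    "three": list("ASRP"),
})

even = df.iloc[::2]   # rows 0 and 2
odd = df.iloc[1::2]   # rows 1 and 3

# Without reset_index the frames are aligned on their original labels,
# so every row is half filled with NaN.
misaligned = pd.concat([even, odd], axis=1)

# With reset_index both halves share the labels 0 and 1 and line up side by side.
aligned = pd.concat(
    [even.reset_index(drop=True), odd.reset_index(drop=True)], axis=1
)

# Only the first two rows, side by side.
first_two = pd.concat(
    [df.iloc[[0]].reset_index(drop=True), df.iloc[[1]].reset_index(drop=True)],
    axis=1,
)

print(misaligned)
print(aligned)
print(first_two)
```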

I want to improve the speed of my algorithm with multiple rows of input. Python. Find average of consecutive elements in list

I need to find the average of consecutive elements from a list.
First I am given the length of the list,
then the list of numbers,
then the number of tests I need to perform (several rows of input),
then the inputs for those tests (I need to print one row of results per test).
Every test row consists of the start and end index of a range in the list.
My algorithm:
nu = int(input()) # first I am given the length of the list
numbers = input().split() # then the list of numbers
num = input() # number of rows with inputs
k =[float(i) for i in numbers] # the numbers in the list are floats
i= 0
while i < int(num):
    a,b = input().split() # start and end element in list
    i += 1
    print(round(sum(k[int(a):(int(b)+1)])/(-int(a)+int(b)+1),6)) # round to 6 decimals
But it's not fast enough. I was told it's better to get rid of the "while" but I don't know how. I'd appreciate any help.
Example:
Input:
8 - len(list)
79.02 36.68 79.83 76.00 95.48 48.84 49.95 91.91 - list
10 - number of test
0 0 - a1,b1
0 1
0 2
0 3
0 4
0 5
0 6
0 7
1 7
2 7
Output:
79.020000
57.850000
65.176667
67.882500
73.402000
69.308333
66.542857
69.713750
68.384286
73.668333
i= 0
while i < int(num):
    a,b = input().split() # start and end element in list
    i += 1
Replace your while-loop with a for loop. Also you could get rid of multiple int calls in the print statement:
for _ in range(int(num)):
    a, b = [int(j) for j in input().split()]
You didn't spell out the constraints, but I am guessing that the ranges to be averaged could be quite large. Computing sum(k[int(a):(int(b)+1)]) may take a while.
However, if you precompute partial sums of the input list, each query can be answered in constant time (the sum of the numbers in a range is the difference of the corresponding partial sums).
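A minimal sketch of the partial-sum idea, using the sample list from the question. The helper names build_prefix_sums and range_average are hypothetical, and reading the input is left out so the example stays self-contained.

```python
from itertools import accumulate


def build_prefix_sums(values):
    """prefix[i] holds the sum of the first i values, so prefix[0] is 0."""
    return [0.0, *accumulate(values)]


def range_average(prefix, a, b):
    """Average of values[a..b] inclusive, computed in constant time."""
    return (prefix[b + 1] - prefix[a]) / (b - a + 1)


values = [79.02, 36.68, 79.83, 76.00, 95.48, 48.84, 49.95, 91.91]
prefix = build_prefix_sums(values)  # built once, O(n)

# A few of the queries from the example input.
for a, b in [(0, 0), (0, 1), (0, 7), (2, 7)]:
    print(f"{range_average(prefix, a, b):.6f}")
```

Building the prefix list costs one pass over the data; after that every query is two lookups and a division, no matter how long the range is.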

Exporting a list as a new column in a pandas dataframe as part of a nested for loop

I am inputting multiple spreadsheets with multiple columns of data. For each spreadsheet, the maximum value of each column is found. Then, for each element in the column, the element is divided by the maximum value of that column. The output should be a value (between 0 and 1) for each element in the column in ascending order. This is appended to a list which should be added to the source spreadsheet as a column.
Currently, the nested loops are performing correctly apart from the final step, as far as I understand. Each column is added to the spreadsheet EXCEPT the values are for the final column of the source spreadsheet rather than values related to each individual column.
I have tried changing the indents to associate levels of the code with different parts (as I think this is the problem) and tried moving the appended column along in the dataframe, to no avail.
for i in distlist:
    #listname = i[4:] + '_norm'
    df2 = pd.read_excel(i,header=0,index_col=None, skip_blank_lines=True)
    df3 = df2.dropna(axis=0, how='any')
    cols = []
    for column in df3:
        cols.append(column)
    for x in cols:
        listname = x + ' norm'
        maxval = df3[x].max()
        print(maxval)
        mylist = []
        for j in df3[x]:
            findNL = (j/maxval)
            mylist.append(findNL)
        df3[listname] = mylist
    saveloc = 'E:/test/'
    filename = i[:-18] + '_Normalised.xlsx'
    df3.to_excel(saveloc+filename, index=False)
New columns are added to the output dataframe with bespoke headings relating to the field headers in the source spreadsheet and renamed according to (listname). The data in each one of these new columns is identical and relates to the final column in the spreadsheet. To me, it seems to be overwriting the values each time (as if looping through the entire spreadsheet, not outputting for each column), and adding it to the spreadsheet.
Any help would be much appreciated. I think it's something simple, but I haven't managed to work out what...
If I understand you correctly, you are overcomplicating things. You don't need a for loop for this. You can simplify your code:
# Make example dataframe, this is not provided
df = pd.DataFrame({'col1':[1, 2, 3, 4],
                   'col2':[5, 6, 7, 8]})
print(df)
   col1  col2
0     1     5
1     2     6
2     3     7
3     4     8
Now we can use DataFrame.apply to divide each column by its maximum, add_suffix to give the new columns a _norm suffix, and then concat them with the original columns into one final dataframe:
df_conc = pd.concat([df, df.apply(lambda x: x/x.max()).add_suffix('_norm')],axis=1)
print(df_conc)
   col1  col2  col1_norm  col2_norm
0     1     5       0.25      0.625
1     2     6       0.50      0.750
2     3     7       0.75      0.875
3     4     8       1.00      1.000
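The same normalisation can likely be done without apply, since dividing a dataframe by df.max() broadcasts the per-column maxima across the rows. A small sketch, wrapped in a hypothetical helper (add_normalised_columns is my name for it) that uses the ' norm' suffix from the question so it could be called once per spreadsheet inside the asker's outer loop:

```python
import pandas as pd


def add_normalised_columns(df, suffix=" norm"):
    """Return df plus one extra column per original column, scaled by that column's max.

    Hypothetical helper; assumes every column is numeric.
    """
    # df.max() is a Series indexed by column name, so the division is column-wise.
    normalised = df / df.max()
    return pd.concat([df, normalised.add_suffix(suffix)], axis=1)


df = pd.DataFrame({"col1": [1, 2, 3, 4], "col2": [5, 6, 7, 8]})
result = add_normalised_columns(df)
print(result)
```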
Many thanks. I think I was just overcomplicating it. Incidentally, I think my code may do the same job, but because there is so little difference in the values, it wasn't notable.
Thanks for your help, @Erfan

pandas how to flatten a list in a column while keeping list ids for each element

I have the following df,
A                                                      id
[ObjectId('5abb6fab81c0')]                             0
[ObjectId('5abb6fab81c3'), ObjectId('5abb6fab81c4')]   1
[ObjectId('5abb6fab81c2'), ObjectId('5abb6fab81c1')]   2
I would like to flatten each list in A and assign its corresponding id to each element of the list, like this:
A                          id
ObjectId('5abb6fab81c0')   0
ObjectId('5abb6fab81c3')   1
ObjectId('5abb6fab81c4')   1
ObjectId('5abb6fab81c2')   2
ObjectId('5abb6fab81c1')   2
I think the comment is coming from this question? You can use my original post or this one:
df.set_index('id').A.apply(pd.Series).stack().reset_index().drop(columns='level_1')
Out[497]:
   id    0
0   0  1.0
1   1  2.0
2   1  3.0
3   1  4.0
4   2  5.0
5   2  6.0
Or
pd.DataFrame({'id':df.id.repeat(df.A.str.len()),'A':df.A.sum()})
Out[498]:
   A  id
0  1   0
1  2   1
1  3   1
1  4   1
2  5   2
2  6   2
This probably isn't the most elegant solution, but it works. The idea here is to loop through df (which is why this is likely an inefficient solution), and then loop through each list in column A, appending each item and the id to new lists. Those two new lists are then turned into a new DataFrame.
a_list = []
id_list = []
for index, a, i in df.itertuples():
    for item in a:
        a_list.append(item)
        id_list.append(i)
df1 = pd.DataFrame(list(zip(a_list, id_list)), columns=['A', 'id'])
As I said, inelegant, but it gets the job done. There's probably at least one better way to optimize this, but hopefully it gets you moving forward.
EDIT (April 2, 2018)
I had the thought to run a timing comparison between mine and Wen's code, simply out of curiosity. The two variables are the length of column A, and the length of the list entries in column A. I ran a bunch of test cases, iterating by orders of magnitude each time. For example, I started with A length = 10 and ran through to 1,000,000, at each step iterating through randomized A entry list lengths of 1-10, 1-100 ... 1-1,000,000. I found the following:
Overall, my code is noticeably faster (especially at increasing A lengths) as long as the list lengths are less than ~1,000. As soon as the randomized list length hits the ~1,000 barrier, Wen's code takes over in speed. This was a huge surprise to me! I fully expected my code to lose every time.
Length of column A generally doesn't matter - it simply increases the overall execution time linearly. The only case in which it changed the results was for A length = 10. In that case, no matter the list length, my code ran faster (also strange to me).
Conclusion: If the list entries in A are on the order of a few hundred elements (or less) long, my code is the way to go. But if you're working with huge data sets, use Wen's! Also worth noting that as you hit the 1,000,000 barrier, both methods slow down drastically. I'm using a fairly powerful computer, and each were taking minutes by the end (it actually crashed on the A length = 1,000,000 and list length = 1,000,000 case).
Flattening and unflattening can be done with these functions:
def flatten(df, col):
    col_flat = pd.DataFrame([[i, x] for i, y in df[col].apply(list).items() for x in y], columns=['I', col])
    col_flat = col_flat.set_index('I')
    df = df.drop(columns=col)
    df = df.merge(col_flat, left_index=True, right_index=True)
    return df
Unflattening:
def unflatten(flat_df, col):
    return flat_df.groupby(level=0).agg({**{c:'first' for c in flat_df.columns}, col: list})
After unflattening we get the same dataframe except for the column order:
(df.sort_index(axis=1) == unflatten(flatten(df, 'A'), 'A').sort_index(axis=1)).all().all()
>> True
To create a unique index you can call reset_index after flattening.
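On pandas 0.25 or newer, DataFrame.explode should cover the flattening part directly. A short sketch, using plain integers in place of the ObjectId values from the question (the example frame is made up):

```python
import pandas as pd

# Stand-in data: integers instead of ObjectId values.
df = pd.DataFrame({"A": [[1], [2, 3, 4], [5, 6]], "id": [0, 1, 2]})

# explode repeats the other columns for every element of each list in A;
# reset_index gives the flattened rows a fresh 0..n-1 index.
flat = df.explode("A").reset_index(drop=True)
print(flat)
```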
