Finding smallest numbers Python numpy list - python-3.x

I have a Python 3 list that contains an arbitrary number of numpy arrays of varying size/shape. The problem is to set the smallest p% (say, p = 20%) of the numbers in the list, in terms of magnitude, to zero.
Example code:
l = []
l.append(np.random.normal(1.5, 1, size = (4, 3)))
l.append(np.random.normal(1, 1, size = (4, 4)))
l.append(np.random.normal(1.8, 2, size = (2, 4)))
for x in l:
    print(x.shape)
'''
(4, 3)
(4, 4)
(2, 4)
'''
How can I zero out the smallest p% of numbers in the Python list 'l' "globally"? That is, across all of the numpy arrays contained in 'l' taken together, the p% of numbers with the smallest magnitude should be set to zero.
I am using Python 3.8 and numpy 1.18.
Thanks!
Toy example:
l
'''
[array([[ 0.95400011, 1.95433152, 0.40316605],
[ 1.34477354, 3.24612127, 1.54138912],
[ 1.158594 , 0.77954464, 0.4600395 ],
[-0.03092974, 3.55349303, 0.85526191]]),
array([[ 2.33613547, 0.12361808, 0.27620035, 0.70452795],
[ 0.76989846, -0.28613191, 1.90050011, 2.73843595],
[ 0.13510186, 0.91035556, 1.42402321, 0.60582303],
[-0.13655066, 2.4881577 , 2.0882935 , 1.40347429]]),
array([[-1.63365952, 1.2616223 , 0.86784273, -0.34538727],
[ 1.37161267, 2.4570491 , -0.72419948, 1.91873343]])]
'''
'l' has 36 numbers in it. Now 20% of 36 = 7.2 or rounded down = 7. So the idea is that 7 smallest magnitude numbers out of 36 numbers are removed by masking them to zero!

You can try the following. It computes a threshold on the magnitudes of all values and then updates the arrays in place, setting a value to 0 when its magnitude is under the threshold.
Let me know if you need more details.
import numpy as np
l = []
l.append(np.random.normal(1.5, 1, size = (4, 3)))
l.append(np.random.normal(1, 1, size = (4, 4)))
l.append(np.random.normal(1.8, 2, size = (2, 4)))
acc = []
p = 20  # percentile of magnitudes to set to 0
for x in l:
    acc.append(np.abs(x).flatten())
threshold = np.percentile(np.concatenate(acc), p)
for x in l:
    x[np.abs(x) < threshold] = 0

You can use this:
p = 20  # percentile of magnitudes to set to 0
lower = np.percentile(np.hstack([np.abs(x).flatten() for x in l]), p)
for x in l:
    x[np.abs(x) < lower] = 0
You basically stack the magnitudes of all numbers into a single array, use np.percentile to find the threshold for the lowest p%, and then zero every entry whose magnitude falls below that threshold.
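If you need exactly floor(p% of N) entries zeroed, as in the toy example (7 out of 36), a count-based variant may be easier to reason about than a percentile. The following is only a sketch: the function name zero_smallest_globally is made up for illustration, and it assumes float arrays with no ties at the threshold (with ties, more than k entries would be zeroed).

```python
import numpy as np

def zero_smallest_globally(arrays, p):
    """Zero, in place, the floor(p% of N) smallest-magnitude entries across all arrays.

    Returns k, the number of entries targeted. (Hypothetical helper, not from the answers.)
    """
    magnitudes = np.concatenate([np.abs(a).ravel() for a in arrays])
    k = int(magnitudes.size * p / 100)  # round down, as in the toy example
    if k == 0:
        return 0
    # np.partition places the k-th smallest magnitude at index k - 1
    threshold = np.partition(magnitudes, k - 1)[k - 1]
    for a in arrays:
        a[np.abs(a) <= threshold] = 0
    return k

rng = np.random.default_rng(0)
arrays = [
    rng.normal(1.5, 1, size=(4, 3)),
    rng.normal(1, 1, size=(4, 4)),
    rng.normal(1.8, 2, size=(2, 4)),
]
n_zeroed = zero_smallest_globally(arrays, 20)
total_zeros = sum(int((a == 0).sum()) for a in arrays)
print(n_zeroed, total_zeros)
```

np.partition avoids a full sort, which may matter if the arrays are large.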

Related

Keep duplicate items in list of tuples if only the first index matches between the tuples

Input [(1,3), (3,1), (1,5), (2,3), (2,4), (44,33), (33,22), (44,22), (22,33)]
Expected Output [(1,3), (1,5), (2,3), (2,4), (44,33), (44,22)]
I am trying to figure out the above and have tried lots of stuff. So far my only success has been,
for x in range(len(list1)):
    if list1[0][0] == list1[x][0]:
        print(list1[x])
Output: (1, 3) \n (1, 5)
Any sort of advice or help would be appreciated.
Use a collections.defaultdict(list) keyed by the first value, and keep only the values that are ultimately duplicated:
from collections import defaultdict # At top of file, for collecting values by first element
from itertools import chain # At top of file, for flattening result
dct = defaultdict(list)
inp = [(1,3), (3,1), (1,5), (2,3), (2,4), (44,33), (33,22), (44,22), (22,33)]
# For each tuple
for tup in inp:
    first, _ = tup # Extract first element (and verify it's actually a pair)
    dct[first].append(tup) # Collect with other tuples sharing the same first element
# Extract all lists which have two or more elements (first element duplicated at least once)
# Would be list of lists, with each inner list sharing the same first element
onlydups = [lst for firstelem, lst in dct.items() if len(lst) > 1]
# Flattens to make single list of all results (if desired)
flattened_output = list(chain.from_iterable(onlydups))
Importantly, this doesn't require ordered input, and it scales well, doing O(n) work (scaling your solution naively would produce an O(n²) solution, considerably slower for larger inputs).
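If the result should keep the original order of the input rather than being grouped by first element, a two-pass variant with collections.Counter is one option. This is a sketch, not part of the answer above; the names pairs, first_counts and result are chosen here for illustration.

```python
from collections import Counter

pairs = [(1, 3), (3, 1), (1, 5), (2, 3), (2, 4), (44, 33), (33, 22), (44, 22), (22, 33)]

# Pass 1: count how often each first element occurs
first_counts = Counter(first for first, _ in pairs)

# Pass 2: keep, in input order, the tuples whose first element occurs more than once
result = [pair for pair in pairs if first_counts[pair[0]] > 1]
print(result)
# [(1, 3), (1, 5), (2, 3), (2, 4), (44, 33), (44, 22)]
```

Both passes are linear, so this is also O(n).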
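Another approach is the following. Note that it treats each tuple as an unordered set and drops repeated sets, so its output differs from the expected output in the question:
def sort(L: list):
    K = []
    for i in L:
        if set(i) not in K:
            K.append(set(i))
    output = [tuple(m) for m in K]
    return output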
output :
[(1, 3), (1, 5), (2, 3), (2, 4), (33, 44), (33, 22), (44, 22)]

Defining a function to calculate mean-differences at specific array size

I have an array:
arr = np.array([1,2,3,4,5,6,7,8])
I want to define a function to calculate the difference of means of the elements of this array but at a given length.
For example:
diff_avg(arr, size=2)
Expected Result:
[-2, -2]
because:
((1+2)/2) - ((3+4)/2) = -2 -> first 4 elements because size is 2, so 2 groups of 2 elements
((5+6)/2) - ((7+8)/2) = -2 -> last 4 elements
if size=3
then:
output: [-3]
because:
((1+2+3)/3) - ((4+5+6)/3) = -3 -> first 6 elements
what I did so far:
def diff_avg(first_group, second_group, size):
    results = []
    x = np.mean(first_group) - np.mean(second_group)
    results.append(x)
    return results
I don't know how to add the size parameter.
I can get the first size elements with arr[:size], but how do I get the next size elements?
Can anyone help me?
First, truncate the array to remove the extra items:
size = 3
sized_array = arr[:arr.size // (size * 2) * (size * 2)]
# array([1, 2, 3, 4, 5, 6])
Next, reshape the sized array so that each row holds one pair of groups, and take the mean of each group:
means = sized_array.reshape(-1, 2, size).mean(axis=2)
# array([[2., 5.]])
Finally, take the difference within each pair:
means[:, 0] - means[:, 1]
# array([-3.])
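Put together as the diff_avg(arr, size) function the question asks for, the steps might look like the sketch below. It assumes a one-dimensional input and lays the data out as (pairs, 2, size), so each row holds one pair of consecutive groups; leftover elements that do not fill a complete pair are dropped.

```python
import numpy as np

def diff_avg(arr, size):
    """Difference of means of consecutive groups of `size` elements, taken pairwise."""
    arr = np.asarray(arr)
    block = 2 * size                   # one pair of groups
    usable = arr.size // block * block # drop elements that don't fill a whole pair
    # Shape (n_pairs, 2, size): axis 2 runs over the elements of one group
    means = arr[:usable].reshape(-1, 2, size).mean(axis=2)
    return means[:, 0] - means[:, 1]

arr = np.array([1, 2, 3, 4, 5, 6, 7, 8])
print(diff_avg(arr, size=2))  # [-2. -2.]
print(diff_avg(arr, size=3))  # [-3.]
```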

What is the best possible way to find the first AND the last occurrences of an element in a list in Python?

The basic way I usually do this is with list.index(element) and reversed_list.index(element), but this is too slow when I need to search for many elements and the list is very long, say 10^5 or 10^6 elements or even more. What is the fastest way to do the same?
You can build auxiliary lookup structures:
lst = [1,2,3,1,2,3] # super long list
last = {n: i for i, n in enumerate(lst)}
first = {n: i for i, n in reversed(list(enumerate(lst)))}
last[3]
# 5
first[3]
# 2
The construction of the lookup dicts takes linear time, but then the lookup itself is constant.
Calls to list.index(), by contrast, take linear time each, so doing them repeatedly is quadratic (given that the number of lookups you make depends on the size of the list).
You could also build a single structure in one iteration:
from collections import defaultdict
lookup = defaultdict(lambda: [None, None])
for i, n in enumerate(lst):
    lookup[n][1] = i
    if lookup[n][0] is None:
        lookup[n][0] = i
lookup[3]
# [2, 5]
lookup[2]
# [1, 4]
Well, something has to do the work of finding the element, and in a large list this can take time! Without more information or a code example it is difficult to help, but the usual go-to answer is to use another data structure. For example, if you can keep your elements in a dictionary instead of a list, with the key being the element and the value being a list of its indices, lookups will be much quicker.
You can just remember first and last index for every element in the list:
In [9]: l = [random.randint(1, 10) for _ in range(100)]
In [10]: first_index = {}
In [11]: last_index = {}
In [12]: for idx, x in enumerate(l):
    ...:     if x not in first_index:
    ...:         first_index[x] = idx
    ...:     last_index[x] = idx
    ...:
In [13]: [(x, first_index.get(x), last_index.get(x)) for x in range(1, 11)]
Out[13]:
[(1, 3, 88),
(2, 23, 90),
(3, 10, 91),
(4, 13, 98),
(5, 11, 57),
(6, 4, 99),
(7, 9, 92),
(8, 19, 95),
(9, 0, 77),
(10, 2, 87)]
In [14]: l[0]
Out[14]: 9
Your approach sounds good, I did some testing and:
import numpy as np
long_list = list(np.random.randint(0, 100_000, 100_000_000))
# This takes 10ms on my machine
long_list.index(999)
# This takes 1,100ms on my machine
long_list[::-1].index(999)
# This takes 1,300ms on my machine
list(reversed(long_list)).index(999)
# This takes 200ms on my machine
long_list.reverse()
long_list.index(999)
long_list.reverse()
But at the end of the day, a Python list does not seem like the best data structure for this.
As others have suggested, you can build a dict:
indexes = {}
for i, val in enumerate(long_list):
    if val in indexes:
        indexes[val].append(i)
    else:
        indexes[val] = [i]
This is memory expensive, but solves your problem (depends on how often you modify the original list).
You can then do:
# This takes 0.02ms on my machine
ix = indexes.get(999)
ix[0], ix[-1]
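If the data already lives in a numpy array, np.unique with return_index=True reports the first occurrence of each distinct value, and running it on the reversed array yields the last occurrences. The sketch below is an alternative to the dict-of-lists above, not something taken from the answers; the function name first_last_indices is made up for illustration.

```python
import numpy as np

def first_last_indices(values):
    """Map each distinct value to a (first_index, last_index) tuple."""
    arr = np.asarray(values)
    # return_index gives the index of the first occurrence of each unique value
    uniq, first = np.unique(arr, return_index=True)
    # First occurrence in the reversed array corresponds to the last occurrence in the original
    _, first_in_reversed = np.unique(arr[::-1], return_index=True)
    last = arr.size - 1 - first_in_reversed
    return {u: (f, l) for u, f, l in zip(uniq.tolist(), first.tolist(), last.tolist())}

lookup = first_last_indices([1, 2, 3, 1, 2, 3])
print(lookup[3])  # (2, 5)
```

This stores only two indices per distinct value, so it uses less memory than keeping every index.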

Generate a list with two unique elements with specific length [duplicate]

Simple question here:
I'm trying to get an array that alternates values (1, -1, 1, -1.....) for a given length. np.repeat just gives me (1, 1, 1, 1,-1, -1,-1, -1). Thoughts?
I like @Benjamin's solution. An alternative though is:
import numpy as np
a = np.empty((15,))
a[::2] = 1
a[1::2] = -1
This also allows for odd-length lists.
EDIT: Also just to note speeds, for an array of 10000 elements:
import numpy as np
from timeit import Timer

if __name__ == '__main__':
    # Setup code run once per timer; method4 needs itertools
    setupstr = """
import itertools
import numpy as np
N = 10000
"""
    method1 = """
a = np.empty((N,), int)
a[::2] = 1
a[1::2] = -1
"""
    method2 = """
a = np.tile([1,-1], N)
"""
    method3 = """
a = np.array([1,-1]*N)
"""
    method4 = """
a = np.array(list(itertools.islice(itertools.cycle((1,-1)), N)))
"""
    nl = 1000  # number of repetitions per method
    t1 = Timer(method1, setupstr).timeit(nl)
    t2 = Timer(method2, setupstr).timeit(nl)
    t3 = Timer(method3, setupstr).timeit(nl)
    t4 = Timer(method4, setupstr).timeit(nl)
    print('method1', t1)
    print('method2', t2)
    print('method3', t3)
    print('method4', t4)
Results in timings of:
method1 0.0130500793457
method2 0.114426136017
method3 4.30518102646
method4 2.84446692467
If N = 100, things start to even out but starting with the empty numpy arrays is still significantly faster (nl changed to 10000)
method1 0.05735206604
method2 0.323992013931
method3 0.556654930115
method4 0.46702003479
Numpy arrays are special awesome objects and should not be treated like python lists.
use resize():
In [38]: np.resize([1,-1], 10) # 10 is the length of result array
Out[38]: array([ 1, -1, 1, -1, 1, -1, 1, -1, 1, -1])
it can produce odd-length array:
In [39]: np.resize([1,-1], 11)
Out[39]: array([ 1, -1, 1, -1, 1, -1, 1, -1, 1, -1, 1])
Use numpy.tile!
import numpy
a = numpy.tile([1,-1], 15)
use multiplication:
[1,-1] * n
If you want a memory efficient solution, try this:
def alternator(n):
    for i in range(n):
        if i % 2 == 0:
            yield 1
        else:
            yield -1
Then you can iterate over the answers like so:
for i in alternator(n):
    pass  # do something with i
Maybe you're looking for itertools.cycle?
import itertools
list_ = (1,-1,2,-2) # ,3,-3, ...
for n, item in enumerate(itertools.cycle(list_)):
    if n == 30:
        break
    print(item)
I'll just throw these out there because they could be more useful in some circumstances.
If you just want to alternate between positive and negative:
[(-1)**i for i in range(n)]
or for a more general solution
nums = [1, -1, 2]
[nums[i % len(nums)] for i in range(n)]
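One thing to watch when comparing these answers: np.resize takes the final length, whereas np.tile and list multiplication take a number of repetitions, so np.tile([1, -1], n) has 2 * n elements. The sketch below builds the same odd-length result four ways; the function name alternating and the variable names are chosen here for illustration.

```python
import numpy as np

def alternating(n):
    """Length-n integer array 1, -1, 1, -1, ... built with slice assignment."""
    a = np.empty(n, dtype=int)
    a[::2] = 1
    a[1::2] = -1
    return a

n = 7
by_slices = alternating(n)
by_resize = np.resize([1, -1], n)               # second argument is the final length
by_tile = np.tile([1, -1], n // 2 + 1)[:n]      # second argument is a repeat count, so trim
by_power = np.array([(-1) ** i for i in range(n)])
print(by_slices)
```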

Python3: set a range of data

I feel this must be very basic but I cannot find a simple way.
I am using python3
I have many data files with x,y data where x goes from 0 to 140 (floating).
Let's say
0, 2.1
0.5,3.5
0.8,3.2
...
I want to import the values of x within the range 25.4 to 28.1 and their corresponding values in y. The files may have different lengths, so the first value with x>25.4 might appear in a different row in each file.
I am looking for something equivalent to the following command in gnuplot:
set xrange [25.4:28.1]
This time I cannot use gnuplot because the data processing requires more than the capabilities of gnuplot.
I imported the data with Pandas but I cannot set a range.
Thank you.
r = range(start, stop, step) is the pattern for this in Python.
So, for example, to get:
r == [0, 1, 2]
You would write:
r = [x for x in range(3)]
And to get:
r == [0, 5, 10]
You would write:
r = [x for x in range(0, 11, 5)]
This doesn't get you very far because:
r = [0, .2, 4.3, 6.3]
r = [x for x in r if x in range(3, 10)]
# r == []
But you can do:
r = [0, .2, 4.3, 6.3]
r = [x for x in r if 3 < x < 10]
# r == [4.3, 6.3]
Pandas and Numpy give you a much more concise way of doing this. Consider the following demo of .between
import pandas as pd
import io
text = io.StringIO("""Close Top_Barrier Bottom_Barrier
0 441.86 441.964112 426.369888
1 448.95 444.162225 425.227108
2 449.99 446.222271 424.285063
3 449.74 447.947051 423.678282
4 451.97 449.879254 423.029413""")
df = pd.read_csv(text, sep='\\s+')
df = df[df["Close"].between(449, 452)] # between
df
So for your df you can do the same, passing your lower and upper bounds: df = df[df["x"].between(25.4, 28.1)]
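Applied to headerless x,y files like the ones in the question, this might look like the sketch below. The sample rows past the first two are made up for illustration, and the column names "x" and "y" are an assumption, since the files have no header. Note that .between includes both endpoints by default.

```python
import io
import pandas as pd

# Stand-in for one data file; in practice pass the file path to read_csv instead
raw = io.StringIO("0, 2.1\n0.5,3.5\n25.4,1.0\n26.7,4.2\n28.1,0.3\n30.0,9.9\n")

# No header row in the file, so name the columns explicitly
df = pd.read_csv(raw, header=None, names=["x", "y"], skipinitialspace=True)

# Keep only the rows with 25.4 <= x <= 28.1, together with their y values
in_range = df[df["x"].between(25.4, 28.1)]
print(in_range)
```

Because the filter is on the values of x rather than on row positions, it works regardless of how long each file is.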
