List comprehension of 3 nested loops and the output is based on if-else condition - python-3.x

Is it possible to convert this into a list comprehension? For example, I have a list v. In the source code below, v = dictionary.keys()
v = ["naive", "bayes", "classifier"]
I have the following nested list t.
t = [["naive", "bayes"], ["lol"]]
The expected output O should be:
O = [[1, 1, 0], [0, 0, 0]]
1 if the dictionary contains the word and 0 if not. I'm creating a spam/ham feature matrix. Due to the large dataset, I'd like to convert the code below into a list comprehension for a faster iteration.
ham_feature_matrix = []
for each_file in train_ham:
    feature_vector = [0] * len(dictionary)
    for each_word in each_file:
        for d, dicword in enumerate(dictionary.keys()):
            if each_word == dicword:
                feature_vector[d] = 1
    ham_feature_matrix.append(feature_vector)

I couldn't test this, but this translates as:
ham_feature_matrix = [[[int(each_word == dicword) for dicword in dictionary] for each_word in each_file] for each_file in train_ham]
[int(each_word == dicword) for dicword in dictionary] is the part which changes the most compared to your original code.
Basically, since you're iterating on the words of the dictionary, you don't need enumerate to set the matching slots to 1. The comprehension builds the list with the result of the comparison which is 0 or 1 when converted to integers. You don't need to get the keys since iterating on a dictionary iterates on the keys by default.
The rest of the loops is trivial.
One issue I see is that you iterate over a dictionary to build each list of 0/1 values, so the column order follows the dictionary's iteration order. Since Python 3.7 that is guaranteed to be insertion order; on older versions the order isn't fixed, so you could get different results from run to run (as in your original code) unless you sort the keys somehow.
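Note that the comprehension above yields one 0/1 list per word, whereas the original loop yields one vector per file. A minimal sketch of a per-file version follows, assuming the sample data from the question; the dictionary values and the names vocab and file_words are made up for illustration.

```python
# Hypothetical sample data: only the keys of `dictionary` matter here.
dictionary = {"naive": 0, "bayes": 0, "classifier": 0}
train_ham = [["naive", "bayes"], ["lol"]]

# Freeze the column order once so every row uses the same layout.
vocab = list(dictionary)

# One row per file: 1 if the vocabulary word occurs in the file, else 0.
# Converting each file to a set makes the membership test O(1).
ham_feature_matrix = [
    [int(word in file_words) for word in vocab]
    for file_words in map(set, train_ham)
]

print(ham_feature_matrix)  # [[1, 1, 0], [0, 0, 0]]
```

Most of the speed-up here would come from the set membership test rather than from the comprehension syntax itself.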

Related

Python code to store integers in a list and then find the sum of the integers stored in the list

A list of integer values is passed through the input function and stored in a list. After that, I want to find the sum of all the numbers in the list.
lst = list( input("Enter the list of items :") )
sum_element = 0
for i in lst:
    sum_element = sum_element+int(i)
print(sum_element)
Say you want to create a list with 8 elements. Writing list(8) does not create a list with 8 elements; it raises a TypeError, because an int is not iterable. To get a list that has the number 8 as its only element, write [8].
list() does not take a size (as you might expect from constructors in other languages) but rather acts as a 'Converter' that turns an iterable into a list. So list('382') will convert this string to the following list: ['3','8','2'].
So to get the input list you might want to do something like this:
my_list = []
for i in range(int(input('Length: '))):
    my_list.append(int(input(f'Element {i}: ')))
and then continue with your code for summation.
A more pythonic way would be
my_list = [int(input(f'Element {i}: '))
           for i in range(int(input('Length: ')))]
For adding all the elements up you could use the built-in sum() function:
my_list_sum = sum(my_list)
lst=map(int,input("Enter the elements with space between them: ").split())
print(sum(lst))
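The split-and-sum approach above can be sketched without interactive input, which makes it easier to check. This is only an illustration: the function name sum_of_ints and the sample string are assumptions, not part of the original answers.

```python
def sum_of_ints(text):
    """Sum the whitespace-separated integers found in `text`."""
    # str.split() with no argument splits on any run of whitespace,
    # so extra spaces between the numbers are harmless.
    return sum(int(token) for token in text.split())


total = sum_of_ints("3 8 2")
print(total)  # 13
```

One thing to keep in mind with the map() version: map returns a one-shot iterator, so after sum(lst) has consumed it, lst cannot be iterated again. Wrap it in list(...) if the numbers are needed later.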

The best way of iterating through an array whose length changes in Python

I am implementing an algorithm which might affect the size of some array, and I need to iterate through the entire array. Basically a 'for x in arrayname' would not work because it does not update if the contents of arrayname are changed in the loop. I came up with an ugly solution which is shown in the following example:
import numpy as np

test = np.array([1,2,3])
N = len(test)
ii=0
while ii < N:
    N = len(test)
    print(test[ii])
    if test[ii] ==2:
        test = np.append(test,4)
    ii+=1
I am wondering whether a cleaner solution exists.
Thanks in advance!
Assuming all the elements are going to be added at the end and no elements are being deleted you could store the new elements in a separate list:
master_list = [1,2,3]
curr_elems = master_list
while len(curr_elems) > 0:  # keep looping over new elements added
    new_elems = []
    # loop over the current elements: initially the whole list,
    # then only the elements added in the previous pass
    for item in curr_elems:
        if should_add_element(item):
            new_elems.append(generate_new_element(item))
    master_list.extend(new_elems)  # add all the new elements to our master list
    curr_elems = new_elems  # iterate over the new elements in the next pass
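To make the batch approach above concrete, here is a small runnable sketch. The two helper functions are left undefined in the answer, so the rules used here (add a doubled copy of every item below 4) are purely hypothetical stand-ins.

```python
def should_add_element(item):
    # Hypothetical rule: only small items spawn a new element.
    return item < 4


def generate_new_element(item):
    # Hypothetical rule: the new element is double the old one.
    return item * 2


master_list = [1, 2, 3]
curr_elems = list(master_list)  # copy, so we never iterate a list we extend

while curr_elems:
    # Collect everything produced by the current batch...
    new_elems = [generate_new_element(x) for x in curr_elems
                 if should_add_element(x)]
    # ...append it to the result, and process only the new batch next time.
    master_list.extend(new_elems)
    curr_elems = new_elems

print(master_list)  # [1, 2, 3, 2, 4, 6, 4]
```

The loop ends as soon as a batch produces no new elements, so it only terminates if the rules eventually stop generating items.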
The while loop seems the best solution. As the condition is re-evaluated at each iteration, you don’t need to reset the length of the list in the loop, you can do it inside the condition:
import random
l = [1, 2, 3, 4, 5]
i = 0
while i < len(l):
    if random.choice([True, False]):
        del l[i]
    else:
        i += 1
    print(f'{l=}')
This example gives a blueprint for a more complex algorithm. Of course, in this simple case, it could be coded more simply with a filter, or like this:
l = [1, 2, 3, 4, 5]
[x for x in l if random.choice([True, False])]
You might want to check this related post for more creative solutions: How to remove items from a list while iterating?

If condition working differently for same value in python

I am trying to write a function which will return True or False if the given number is not greater than 2.
It seems simple, but the if condition gives different results for the same value, 2. The code I used is:
ele_list = [1,2,3,2]
for i in ele_list:
    if not i>2:
        print(i,False)
        ele_list.remove(i)
        print(ele_list)
The output I am receiving is:
1 False
[2, 3, 2]
2 False
[3, 2]
I am confused to see that the first 2 in the list passes through the if condition but the second 2 in the list does not. Please help me figure this out.
Removing elements from the list you're looping over is generally a bad idea.
What's happening here is that when you remove an element, you change the length of the list, and therefore change which elements are located at which indexes, as well as changing the "goal" of the for loop.
Let's have a look at the following example:
ele_list = [4,3,2,1]
for elem in ele_list:
    print(elem)
    ele_list.remove(elem)
In the first iteration of the loop elem is the value 4 which is located at index 0. Then you're removing from the array the first value equal to elem. In other words the value 4 at index 0 is now removed. This shifts which element is stored at what index. Before the removal ele_list[0] would be equal to 4, however after the removal ele_list[0] will equal 3, since 3 is the value that prior to the removal was stored at index 1.
Now when the loop continues to the second iteration, the index that the loop "looks at" is incremented by 1. So the variable elem will now be the value of ele_list[1], which in the updated list (after the removal of the value 4 in the previous iteration) is equal to 2. Then (same as before) you're removing the first value equal to elem, here the 2 at index 1, so the list is now just 2 elements long: [3, 1].
When the loop is about to start the third iteration, it checks whether the new index (in this case 2) is smaller than the length of the list. It is not, since 2 is not smaller than 2, so the loop ends.
The simplest solution is to create a copy of the list and loop over the copy instead. This can easily be done using the slice syntax: ele_list[:]
ele_list = [1,2,3,2]
for elem in ele_list[:]:
    if not elem > 2:
        print(elem, False)
        ele_list.remove(elem)
        print(ele_list)
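The walkthrough above can be checked with a short sketch that records which elements the loop actually visits. The function name visit_while_removing is made up for this illustration.

```python
def visit_while_removing(values):
    """Remove each visited element during iteration and report what happened."""
    values = list(values)  # work on a copy so the caller's list is untouched
    visited = []
    for elem in values:
        visited.append(elem)
        # Removing shifts every later element one index to the left,
        # so the loop's next index skips over one element.
        values.remove(elem)
    return visited, values


visited, remaining = visit_while_removing([4, 3, 2, 1])
print(visited)    # [4, 2]
print(remaining)  # [3, 1]
```

Only every other element is visited, and the skipped ones are exactly the ones left in the list.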
The problem is that you're modifying your list as you're iterating over it, as mentioned in @Olian04's answer.
It sounds like what you really want to do, however, is only keep values that are > 2. This is really easy using a list comprehension:
filtered_vals = [v for v in ele_list if v > 2]
If you merely want a function that gives you True for numbers greater than 2 and False for others, you can do something like this:
def gt_2(lst):
    return [v > 2 for v in lst]
Or, finally, if you want to find out if any of the values is > 2 just do:
def any_gt_2(lst):
    return any(v > 2 for v in lst)
I think the problem here is how the list's remove method interacts with the for statement.
See the documentation, read the "note" part:
https://docs.python.org/3.7/reference/compound_stmts.html?highlight=while#grammar-token-for-stmt
This can lead to nasty bugs that can be avoided by making a temporary copy using a slice of the whole sequence
A possible solution, as suggested in the documentation:
ele_list = [1,2,3,2]
for i in ele_list[:]:
    if not i>2:
        print(i,False)
        ele_list.remove(i)
        print(ele_list)
"""
1 False
[2, 3, 2]
2 False
[3, 2]
2 False
[3]
"""

List index out of range with only some data sets?

I am trying to code up a numerical clustering tool. Basically, I have a list (here called 'product') that should be transformed from an ascending list to a list that indicates linkage between numbers in the data set. Reading in the data set, removing carriage returns and hyphens works okay, but manipulating the list based on the data set is giving me a problem.
# opening file and returning raw data
file = input('Data file: ')
with open(file) as t:
    nums = t.readlines()
    t.close()
print(f'Raw data: {nums}')
# counting pairs in raw data
count = 0
for i in nums:
    count += 1
print(f'Count of number pairs: {count}')
# removing carriage returns and hyphens
one = []
for i in nums:
    one.append(i.rsplit())
new = []
for i in one:
    for a in i:
        new.append(a.split('-'))
print(f'Data sets: {new}')
# finding the range of the final list
my_list = []
for i in new:
    for e in i:
        my_list.append(int(e))
ran = max(my_list) + 1
print(f'Range of final list: {ran}')
# setting up the product list
rcount = count-1
product = list(range(ran))
print(f'Unchanged product: {product}')
for i in product:
    for e in range(rcount):
        if product[int(new[e][0])] < product[int(new[e][1])]:
            product[int(new[e][1])] = product[int(new[e][0])]
        else:
            product[int(new[e][0])] = product[int(new[e][1])]
print(f'Resulting product: {product}')
I expect the result to be [0, 1, 1, 1, 1, 5, 5, 7, 7, 9, 1, 5, 5], but am met with a 'list index out of range' when using a different data set.
The data set used to give the above desired product is as follows: '1-2\n', '2-3\n', '3-4\n', '5-6\n', '7-8\n', '2-10\n', '11-12\n', '5-12\n', '\n'
However, the biggest issue I am facing is using other data sets. As it turns out, if there is no additional carriage return at the end, I get the list index out of range error.
I can't quite figure out what you're actually trying to do here. What does "indicates linkages" mean, and how does the final output do so? Also, can you show an example of a dataset where it actually fails? And provide the actual exception that you get?
Regardless, your code is massively over-complicated, and cleaning it up a little may also fix your index issue. Using nums as from your sample above:
# Drop empty elements, split on hyphen, and convert to integers
pairs = [list(map(int, item.split('-'))) for item in nums if item.strip()]
# You don't need a for loop to count a list
count = len(pairs)
# You can get the maximum element with a nested generator expression
largest = max(item for p in pairs for item in p)
Also, in your final loop you're iterating over product while also modifying it in-place, which tends to not be a good idea. If I had more understanding of what you're trying to achieve I might be able to suggest a better approach.
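As a rough sketch of how the cleaned-up parsing could feed the asker's own linking loop, the snippet below uses the sample data from the question. It keeps the asker's min-propagation logic as is and does not claim to be a better clustering method; the function names parse_pairs and link are made up for illustration. Looping over the parsed pairs directly, rather than over a line count, means a trailing blank line no longer matters.

```python
raw_lines = ['1-2\n', '2-3\n', '3-4\n', '5-6\n', '7-8\n',
             '2-10\n', '11-12\n', '5-12\n', '\n']


def parse_pairs(lines):
    """Turn 'a-b' lines into (a, b) integer tuples, skipping blank lines."""
    return [tuple(map(int, line.split('-'))) for line in lines if line.strip()]


def link(pairs):
    """Repeatedly copy the smaller label across each pair (the asker's logic)."""
    size = max(n for pair in pairs for n in pair) + 1
    product = list(range(size))
    for _ in range(size):        # same number of passes as the original loop
        for a, b in pairs:       # iterate the pairs themselves, not a count
            product[a] = product[b] = min(product[a], product[b])
    return product


pairs = parse_pairs(raw_lines)
print(link(pairs))  # [0, 1, 1, 1, 1, 5, 5, 7, 7, 9, 1, 5, 5]
```

Dropping the final '\n' entry from raw_lines gives the same pairs and therefore the same result.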

Efficiently Perform Nested Dictionary Lookups and List Appending Using Numpy Nonzero Indices

I have working code to perform a nested dictionary lookup and append results of another lookup to each key's list using the results of numpy's nonzero lookup function. Basically, I need a list of strings appended to a dictionary. These strings and the dictionary's keys are hashed at one point to integers and kept track of using separate dictionaries with the integer hash as the key and the string as the value. I need to look up these hashed values and store the string results in the dictionary. It's confusing so hopefully looking at the code helps. Here's a simplified version of code:
for key in ResultDictionary:
    ResultDictionary[key] = []
true_indices = np.nonzero(numpy_array_of_booleans)
for idx in range(0, len(true_indices[0])):
    ResultDictionary.get(HashDictA.get(true_indices[0][idx])).append(HashDictB.get(true_indices[1][idx]))
This code works for me, but I am hoping there's a way to improve the efficiency. I am not sure if I'm limited due to the nested lookup. The speed is also dependent on the number of true results returned by the nonzero function. Any thoughts on this? Appreciate any suggestions.
Here are two suggestions:
1) since your hash dicts are keyed with ints it might help to transform them into arrays or even lists for faster lookup if that is an option.
k, v = map(list, (HashDictB.keys(), HashDictB.values()))
mxk, mxv = max(k), max(map(len, v))  # largest key, length of the longest string
lookupB = np.empty((mxk+1,), dtype=f'U{mxv}')
lookupB[k] = v
2) you probably can save a number of lookups in ResultDictionary and HashDictA by processing your numpy_array_of_booleans row-wise:
i, j = np.where(numpy_array_of_booleans)
bnds, = np.where(np.r_[True, i[:-1] != i[1:], True])
ResultDict = {HashDictA[i[l]]: [HashDictB[jj] for jj in j[l:r]] for l, r in zip(bnds[:-1], bnds[1:])}
2b) if for some reason you need to incrementally add associations you could do something like (I'll shorten variable names for that)
from operator import itemgetter
res = {}
def add_batch(data, res, hA, hB):
    i, j = np.where(data)
    bnds, = np.where(np.r_[True, i[:-1] != i[1:], True])
    for l, r in zip(bnds[:-1], bnds[1:]):
        if l+1 == r:
            res.setdefault(hA[i[l]], set()).add(hB[j[l]])
        else:
            res.setdefault(hA[i[l]], set()).update(itemgetter(*j[l:r])(hB))
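The row-wise grouping from suggestion 2 can be tried on a tiny example. The sample dictionaries and the boolean array below are invented for illustration; only the variable names follow the question.

```python
import numpy as np

# Hypothetical sample data.
HashDictA = {0: "row_a", 1: "row_b", 2: "row_c"}
HashDictB = {0: "x", 1: "y", 2: "z"}
numpy_array_of_booleans = np.array([
    [True, False, True],
    [False, False, False],
    [False, True, False],
])

# nonzero returns indices in row-major order, so `i` is already sorted.
i, j = np.nonzero(numpy_array_of_booleans)

# Boundaries of each run of equal row indices, plus both ends.
bnds = np.flatnonzero(np.r_[True, i[:-1] != i[1:], True])

# One HashDictA lookup per row instead of one per True entry.
result = {
    HashDictA[int(i[l])]: [HashDictB[int(jj)] for jj in j[l:r]]
    for l, r in zip(bnds[:-1], bnds[1:])
}

print(result)  # {'row_a': ['x', 'z'], 'row_c': ['y']}
```

Rows without any True entry (row_b here) do not appear in the result at all, unlike the original loop, which first initialises every key to an empty list. This sketch also assumes the array contains at least one True value.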
You can't do much about the dictionary lookups - you have to do those one at a time.
You can clean up the array indexing a bit:
idxes = np.argwhere(numpy_array_of_booleans)
for i,j in idxes:
    ResultDictionary.get(HashDictA.get(i)).append(HashDictB.get(j))
argwhere is transpose(nonzero(...)), turning the tuple of arrays into a (n,2) array of index pairs. I don't think this makes a difference in speed, but the code is cleaner.
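The relationship between argwhere and nonzero described above can be confirmed on a small array; the sample mask is made up for illustration.

```python
import numpy as np

mask = np.array([
    [True, False],
    [False, True],
    [True, True],
])

# argwhere gives one (row, column) pair per True entry...
pairs = np.argwhere(mask)

# ...which is the transpose of the tuple of index arrays from nonzero.
same = np.array_equal(pairs, np.transpose(np.nonzero(mask)))

print(pairs.tolist())  # [[0, 0], [1, 1], [2, 0], [2, 1]]
print(same)            # True
```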
