Subset String Array based on length - string

I have a vector with > 30000 words. I want to create a subset of this vector which contains only those words whose length is greater than 5. What is the best way to achieve this?
Basically, df contains multiple sentences.
So,
wordlist = df2;
wordlist = [strip(wordlist[i]) for i in [1:length(wordlist)]];
Now, I need to subset wordlist so that it contains only those words whose length is greater than 5.

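sub(A,find(x->length(x)>5,A)) # => creates a view (most efficient way to make a subset)
EDIT: getindex() returns a copy of the desired elements
getindex(A,find(x->length(x)>5,A)) # => makes a copy
Note: on Julia 1.x, sub is spelled view and find is spelled findall, so these become view(A,findall(x->length(x)>5,A)) and A[findall(x->length(x)>5,A)].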

You can use filter
wordlist = filter(x->islenatleast(x,6),wordlist)
and combine it with a fast condition such as islenatleast defined as:
function islenatleast(s,l)
    if sizeof(s)<l return false end
    # assumes each char takes at least a byte
    l==0 && return true
    p=1
    i=0
    while i<l
        if p>sizeof(s) return false end
        p = nextind(s,p)
        i += 1
    end
    return true
end
According to my timings, islenatleast is faster than computing the whole length (under some conditions). It also shows a strength of Julia: you can define a primitive that is competitive with the core function length.
But doing:
wordlist = filter(x->length(x)>5,wordlist)
will also do.

Related

Palindrome problem - Trying to check 2 lists for equality python3.9

I'm writing a program to check whether a given user input is a palindrome. If it is, the program should print "Yes"; if not, "No". I realize that this program is far too complex, since I only needed to check the whole word using the reversed() function, but I ended up splitting the word into two lists and then checking the lists against each other.
Despite that, I'm not clear why the last conditional isn't returning the expected "Yes" when I pass it "racecar" as input. When I print the lists in lines 23 and 24, I get two lists that are identical, but when I compare them in the conditional, I always get "No", meaning they are not equal to each other. Can anyone explain why this is? I've tried converting the lists to strings, but no luck.
def odd_or_even(a): # function for determining if odd or even
    if len(a) % 2 == 0:
        return True
    else:
        return False
the_string = input("How about a word?\n")
x = int(len(the_string))
odd_or_even(the_string) # find out if the word has an odd or an even number of characters
if odd_or_even(the_string) == True: # if even
    for i in range(x):
        first_half = the_string[0:int((x/2))] #create a list with part 1
        second_half = the_string[(x-(int((x/2)))):x] #create a list with part 2
else: #if odd
    for i in range(x):
        first_half = the_string[:(int((x-1)/2))] #create a list with part 1 without the middle index
        second_half = the_string[int(int(x-1)/2)+1:] #create a list with part 2 without the middle index
print(list(reversed(second_half)))
print(list(first_half))
if first_half == reversed(second_half): ##### NOT WORKING BUT DONT KNOW WHY #####
    print("Yes")
else:
    print("No")
Despite your comments, first_half and second_half are substrings of your input, not lists. When you print them out, you convert them to lists, but in the comparison you convert neither first_half nor reversed(second_half). You are therefore comparing a string to an iterator (returned by reversed), and that comparison is always False.
So a basic fix is to do the conversion for the if, just like you did when printing the lists out:
if list(first_half) == list(reversed(second_half)):
A better fix might be to compare as strings, by making one of the slices use a step of -1, so you don't need to use reversed. Try second_half = the_string[-1:x//2:-1] (or similar, you probably need to tweak either the even or odd case by one). Or you could use the "alien smiley" slice to reverse the string after you slice it out of the input: second_half = second_half[::-1].
There are a few other oddities in your code, like your for i in range(x) loop that overwrites all of its results except the last one. Just use x - 1 in the slicing code and you don't need that loop at all. You're also calling int a lot more often than you need to (if you used // instead of /, you could get rid of literally all of the int calls).
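The suggestions above can be sketched as follows. This is only an illustration, not the asker's program: the function name is_palindrome is made up here, and it assumes the goal is simply to compare the first half of the word with the reversed second half as strings, using // so that no int calls or loops are needed.

```python
def is_palindrome(word):
    # Integer division: the middle character of an odd-length word is skipped.
    half = len(word) // 2
    first_half = word[:half]
    second_half = word[len(word) - half:]
    # second_half[::-1] is a reversed *string*, so this compares str to str.
    return first_half == second_half[::-1]


# reversed() returns an iterator, which never compares equal to a string.
string_vs_iterator = ("rac" == reversed("car"))

print("Yes" if is_palindrome("racecar") else "No")
```

Because both sides of the comparison are strings, the same function handles even and odd lengths without a separate branch.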

list and dictionary: which one is faster

I have the following pieces of code doing the sorting of a list by swapping pairs of elements:
# Complete the minimumSwaps function below.
def minimumSwaps(arr):
    counter = 0
    val_2_indx = {val: arr.index(val) for val in arr}
    for indx, x in enumerate(arr):
        if x != indx+1:
            arr[indx] = indx+1
            s_indx = val_2_indx[indx+1]
            arr[s_indx] = x
            val_2_indx[indx+1] = indx
            val_2_indx[x] = s_indx
            counter += 1
    return counter
def minimumSwaps(arr):
    temp = [0] * (len(arr) + 1)
    for pos, val in enumerate(arr):
        temp[val] = pos
    swaps = 0
    for i in range(len(arr)):
        if arr[i] != i+1:
            swaps += 1
            t = arr[i]
            arr[i] = i+1
            arr[temp[i+1]] = t
            temp[t] = temp[i+1]
            temp[i+1] = i
    return swaps
The second function runs much faster than the first one. However, I was told that dictionaries are faster than lists.
What's the reason here?
A list is a data structure, and a dictionary is a data structure. It doesn't make sense to say one is "faster" than the other, any more than you can say that an apple is faster than an orange. One might grow faster, you might be able to eat the other one faster, and they might both fall to the ground at the same speed when you drop them. It's not the fruit that's faster, it's what you do with it.
If your problem is that you have a sequence of strings and you want to know the position of a given string in the sequence, then consider these options:
You can store the sequence as a list. Finding the position of a given string using the .index method requires a linear search, iterating through the list in O(n) time.
You can store a dictionary mapping strings to their positions. Finding the position of a given string requires looking it up in the dictionary, in O(1) time.
So it is faster to solve that problem using a dictionary.
But note also that in your first function, you are building the dictionary using the list's .index method - which means doing n linear searches each in O(n) time, building the dictionary in O(n^2) time because you are using a list for something lists are slow at. If you build the dictionary without doing linear searches, then it will take O(n) time instead:
val_2_indx = { val: i for i, val in enumerate(arr) }
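As a small check, assuming arr is a permutation with no repeated values (as in the question), both constructions produce the same mapping; only the cost of building it differs. The sample list and the names slow and fast are illustrative.

```python
arr = [4, 3, 1, 2]

# One linear search per element: O(n^2) overall.
slow = {val: arr.index(val) for val in arr}

# A single pass with enumerate: O(n) overall.
fast = {val: i for i, val in enumerate(arr)}

print(slow == fast)
```

With repeated values the two would differ, because .index returns the first position while the enumerate version keeps the last one.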
But now consider a different problem. You have a sequence of numbers, and they happen to be the numbers from 1 to n in some order. You want to be able to look up the position of a number in the sequence:
You can store the sequence as a list. Finding the position of a given number requires linear search again, in O(n) time.
You can store them in a dictionary like before, and do lookups in O(1) time.
You can store the inverse sequence in a list, so that lst[i] holds the position of the value i in the original sequence. This works because every permutation is invertible. Now getting the position of i is a simple list access, in O(1) time.
This is a different problem, so it can take a different amount of time to solve. In this case, both the list and the dictionary allow a solution in O(1) time, but it turns out it's more efficient to use a list. Getting by key in a dictionary has a higher constant cost than getting by index in a list, because getting by key in a dictionary requires computing a hash, and then probing an array to find the right index. (Getting from a list just requires accessing an array at an already-known index.)
This second problem is the one in your second function. See this part:
temp = [0] * (len(arr) + 1)
for pos, val in enumerate(arr):
    temp[val] = pos
This creates a list temp, where temp[val] = pos whenever arr[pos] == val. This means the list temp is the inverse permutation of arr. Later in the code, temp is used only to get these positions by index, which is an O(1) operation and happens to be faster than looking up a key in a dictionary.
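To make the inverse-permutation idea concrete, here is a small sketch on a made-up four-element input. It assumes, as in the question, that arr holds the numbers 1 to n in some order; index 0 of temp is unused padding.

```python
arr = [4, 3, 1, 2]

# temp[val] holds the position of val in arr.
temp = [0] * (len(arr) + 1)
for pos, val in enumerate(arr):
    temp[val] = pos

# Looking up a position is now a plain index access.
for val in range(1, len(arr) + 1):
    print(val, "is at position", temp[val])
```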

Recursion Problem - breaking long sentence in multiple short strings

I'm trying to take a string and break it into small chunks if it is over certain number of words.
I keep on getting a RecursionError: maximum recursion depth exceeded in comparison
What in my code is making this happen?
import math
# Shorten Sentence into small pieces
def shorten(sentenceN):
    # If it is a string - and length over 6 - then shorten recursively
    if (isinstance(sentenceN, str)):
        sentence = sentenceN.split(' ')
        array = []
        length = len(sentenceN)
        halfed = math.floor(length / 2)
        if length < 6:
            return [sentenceN]
        # If sentence is long - break into two parts then rerun shorten on each part
        else:
            first = shorten(" ".join(sentence[:halfed]))
            second = shorten(" ".join(sentence[halfed:]))
            array.append(first)
            array.append(second)
            return array
    # If the object is an array (sentence is already broken up) - run shorten on each - append
    # result to array for returning
    if(isinstance(sentenceN, list)):
        array = []
        for sentence in sentenceN:
            array.append(shorten(sentence))
        return array
# example sentences to use
longSentence = "On offering to help the blind man, the man who then stole his car, had not, at that precise moment."
shortSentence = "On offering to help the blind man."
shorten(shortSentence)
shorten(longSentence)
Python limits how deeply recursive calls can nest (1000 frames by default), and a function that exceeds that limit raises "maximum recursion depth exceeded".
Here you have recursion:
first = shorten(" ".join(sentence[:halfed]))
second = shorten(" ".join(sentence[halfed:]))
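Each call has to be kept on the stack until it returns. The problem here is not that your sentence is too long, though: length is len(sentenceN), which counts characters, while the slices work on the list of words. For shortSentence, length is 34 and halfed is 17, but there are only 7 words, so sentence[:halfed] is the whole sentence again and shorten keeps calling itself with the same input until it hits the maximum recursion depth.
You have to change the logic so that the check and the split both count words, for example by computing length from sentence rather than sentenceN before this test:
    if length < 6:
        return [sentenceN]
Increasing the recursion depth only helps when the recursion actually terminates, which it does not here: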
import sys
sys.setrecursionlimit(10**6)
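A minimal sketch of one way to make the recursion terminate, assuming the intent is to measure length in words. The max_words parameter is introduced here for illustration, and unlike the original this version returns a flat list of chunks instead of nested lists.

```python
def shorten(sentence, max_words=5):
    words = sentence.split()
    # Base case: short enough, counted in words (not characters).
    if len(words) <= max_words:
        return [sentence]
    # Both halves are strictly shorter than the input, so the recursion ends.
    half = len(words) // 2
    first = shorten(" ".join(words[:half]), max_words)
    second = shorten(" ".join(words[half:]), max_words)
    return first + second


short_sentence = "On offering to help the blind man."
long_sentence = ("On offering to help the blind man, the man who then stole "
                 "his car, had not, at that precise moment.")
print(shorten(short_sentence))
print(shorten(long_sentence))
```

The recursion depth grows only with the logarithm of the word count, so the default recursion limit is not an issue.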

How to slice a list of strings till index of matched string depending on if-else condition

I have a list of strings =
['after','second','shot','take','note','of','the','temp']
I want to strip all strings after the appearance of 'note'.
It should return
['after','second','shot','take']
There are also lists that do not have the flag word 'note'.
So in case of a list of strings =
['after','second','shot','take','of','the','temp']
it should return the list as it is.
How can I do that quickly? I have to repeat the same thing with many lists of unequal length.
tokens = [tokens[:tokens.index(v)] if v == 'note' else v for v in tokens]
There is no need for iteration when you can slice the list:
strings[:strings.index('note')+1]
where strings is your input list of strings. The end of a slice is exclusive, so the +1 keeps 'note' in the result; drop the +1 to exclude it, as in the expected output in the question.
In case of missing data ('note'):
try:
    final_lst = strings[:strings.index('note')+1]
except ValueError:
    final_lst = strings
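The slicing approach can be wrapped in a small helper. This is a sketch with made-up names (truncate_at, keep_flag), and it assumes that by default the flag word itself should be dropped, as in the expected output in the question.

```python
def truncate_at(tokens, flag="note", keep_flag=False):
    try:
        end = tokens.index(flag)
    except ValueError:
        # Flag word is absent: return the list unchanged (as a copy).
        return list(tokens)
    return tokens[:end + 1] if keep_flag else tokens[:end]


with_flag = ['after', 'second', 'shot', 'take', 'note', 'of', 'the', 'temp']
without_flag = ['after', 'second', 'shot', 'take', 'of', 'the', 'temp']
print(truncate_at(with_flag))
print(truncate_at(without_flag))
```

Passing keep_flag=True reproduces the +1 variant, which keeps 'note' at the end of the result.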
If you want to make sure the flagged word is present:
if 'note' in lst:
    lst = lst[:lst.index('note')+1]
Pretty much the same as @Austin's answer above.

Choosing minimum numbers from a given list to give a sum N( repetition allowed)

How can I find the minimum number of elements, taken from a list with repetition allowed, that sum to a given number (N)?
For example, if list = [1,3,7,4] and N=14, the function should return 2, as 7+7=14.
Again, if N=11, the function should return 2, as 7+4=11. I think I have figured out the algorithm, but I am unable to implement it in code.
Please use Python, as that is the only language I understand (at present).
Sorry!!!
Since you mention dynamic programming in your question, and you say that you have figured out the algorithm, I will just include an implementation of the basic tabular method written in Python, without too much theory.
The idea is to keep a table in which we compute every value we need once, without having to repeat the same computations many times.
For every target value from 1 up to the requested one, the basic formula tries each number in the list and keeps the smallest count that reaches that target.
It should work, but you can of course make some optimizations, such as sorting the list and/or finding dividends, in order to build a smaller table and terminate faster.
Here is the code:
import sys
# num_list: list of numbers
# value: value for which we want to get the minimum number of addends
def min_sum(num_list, value):
    list_len = len(num_list)
    # We will use the typical dynamic programming table construct:
    # the index into the list will be the sum value we want,
    # and the value will be the
    # minimum number of items to sum
    # Base case value = 0, first element of the list is zero
    value_table = [0]
    # Initialize all table values to MAX
    # for range i use value+1 because python range doesn't include the end
    # number
    for i in range(1, value+1):
        value_table.append(sys.maxsize)
    # try every combination that is smaller than <value>
    for i in range(1, value+1):
        for j in range(0, list_len):
            if (num_list[j] <= i):
                tmp = value_table[i-num_list[j]]
                if ((tmp != sys.maxsize) and (tmp + 1 < value_table[i])):
                    value_table[i] = tmp + 1
    return value_table[value]
## TEST ##
num_list = [1,3,16,5,3]
value = 22
print("Min Sum: ",min_sum(num_list,value)) # Outputs 3
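For comparison, here is a more compact sketch of the same tabular idea, run on the two examples from the question. The name min_addends is made up, the numbers are assumed to be positive integers, and returning None for an unreachable target is a choice made here rather than something the original code does.

```python
def min_addends(numbers, target):
    INF = float("inf")
    # best[t] = fewest numbers (repetition allowed) that sum to t.
    best = [0] + [INF] * target
    for total in range(1, target + 1):
        for n in numbers:
            if n <= total and best[total - n] + 1 < best[total]:
                best[total] = best[total - n] + 1
    return None if best[target] == INF else best[target]


print(min_addends([1, 3, 7, 4], 14))  # 7 + 7
print(min_addends([1, 3, 7, 4], 11))  # 7 + 4
```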
It would be helpful if you included your algorithm in pseudocode; it will look very much like Python :-)
Another aspect: your first example amounts to a multiplication of one item from the list (7) by a number outside the list (2), whereas the second operation is 7+4, with both values in the list.
Is there a limitation on which operation or which items to use (from within or outside the list)?
