How to access the count value of a Counter object in Python3? - python-3.x

Scenario
Given a few lines of code, I have included the line
counts = Counter(rank for rank in ranks)
because I want to find the highest count of a character in a string.
So I end up with the following object:
Counter({'A': 4, 'K': 1})
Here, the value I'm looking for is 4, because it is the highest count. Assuming the object is called counts, max(counts) returns 'K', presumably because 'K' > 'A' in Unicode.
Question
How can I access the largest count/value, rather than the "largest" key?

You can use max as suggested by others. Note, though, that the Counter class provides the most_common(k) method, which is slightly more flexible:
counts.most_common(1)[0][1]
Its real performance benefit, however, only shows when you want more than one most-common element.
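For example, a quick sanity check with the counter from the question:
from collections import Counter

counts = Counter("AAAAK")           # Counter({'A': 4, 'K': 1})
print(counts.most_common(1))        # [('A', 4)] -- (element, count) pairs
print(counts.most_common(1)[0][1])  # 4, the highest count
print(counts.most_common(2))        # [('A', 4), ('K', 1)]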

Maybe
max(counts.values())
would work?
From the Python documentation:
A Counter is a dict subclass for counting hashable objects. It is a collection where elements are stored as dictionary keys and their counts are stored as dictionary values.
So you should treat the counter as a dictionary. To get the biggest value, use max() on the counter's .values().
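For example, this contrasts the pitfall from the question with the fix:
from collections import Counter

counts = Counter({'A': 4, 'K': 1})
print(max(counts))           # 'K' -- max over the keys
print(max(counts.values()))  # 4  -- max over the counts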

Related

Pythonic way to find lowest value in dictionary with list as values

I have a dictionary whose values are lists. In the given dictionary I want to find the lowest number (for every item in each list, considering the value at index 0). I have written a script that works fine, but I am looking for a more Pythonic way to solve this.
c = {'Apple': ['210-219', '246-255'], 'Orange': ['159-161', '202-204', '207-209', '209-211', '220-222', '238-240', '245-247', '261-263']}
loweststart = []
for ckey, cvalue in c.items():
    for i in cvalue:
        print(i.split('-')[0])
        start = int(i.split('-')[0])
        loweststart.append(start)
print(loweststart)
print('The lowest:', min(loweststart))
A Pythonic way:
min_list = [int(element.split('-')[0]) for value in c.values() for element in value]
print(min(min_list))
You can use the min function with a generator expression that iterates through the items in the sub-lists of the dict and outputs the integer values of the first tokens in the strings:
min(int(s[:s.find('-')]) for l in c.values() for s in l)
Using a generator expression is more efficient because it avoids the need to create a temporary list to store all the values extracted from the sub-lists.
As much as I hate the adjective Pythonic, this would seem to qualify:
min([min([int(i.split('-')[0]) for i in ci[1]]) for ci in c.items()])
(The logic is slightly different than the original, in that it finds the minimum of each list, then the minimum of those minima, but the end result is the same.)

Getting a "list index can't be a float" error when I use the same iterator from a loop in an if statement in Python 3.4

I try to iterate through a list and check whether each value is a negative number, using the following code:
for i in listy:
    if (listy[i] < 0): ...
For some reason, Python tries to evaluate listy[0.5] < 0, because 0.5 happens to be the first item in the list. How can I fix this?
i is the value, not the index.
This line is not what you want: (listy[i] < 0).
You probably meant to do i < 0.
(listy[i] < 0) is trying to use the value as the index. In your list the value is a float, which can't be used as an index.
If you really want to use the index you could do:
for i in range(len(listy)):
    if listy[i] < 0:
        # do something
In C and many other languages, you often use the length of an array when iterating through it. You can do this in Python as well but you can also iterate through the elements without explicitly using the index (position). You seem to be mixing the two approaches and therefore you get an unexpected result.
This will iterate through the values only:
for i in listy:
    # here, i is the value of the list entry
    print(i)
This will use the length of the list and the index (position):
for i in range(len(listy)):
    # here, i is the index (position) of the list entry, not the value
    print(listy[i])
This will give you both index and value:
for i, val in enumerate(listy):
    # here, i is the index (position) and val is the value
    print(i, val)

How would I look for the shortest unique subsequence from a set of words in Python?

If I have a set of similar words such as:
\bigoplus
\bigotimes
\bigskip
\bigsqcup
\biguplus
\bigvee
\bigwedge
...
\zebra
\zeta
I would like to find the shortest unique set of letters that would characterize each word uniquely,
i.e.
\bigop:
\bigoplus
\bigot:
\bigotimes
\bigsk:
\bigskip
and so on; the sequence only needs to be long enough to characterize a word uniquely.
EDIT: note that the unique identifier always starts from the beginning of the word, i.e. the characterization always begins at the beginning of the word. I'm writing an app that gives snippet suggestions while typing, so in general users will start typing from the start of the word.
My thoughts:
I was thinking of sorting the words and grouping them based on the first letter, then probably using a longest-common-subsequence algorithm to find the longest subsequence in common, taking its length and using length+1 characters for that unique substring. But I'm stuck, since the algorithms I know for longest common subsequence only take two parameters at a time, and I may have more than two words in each group starting with a particular letter.
Am I solving an already-solved problem? Google was no help.
I'm assuming you want to find the prefixes that uniquely identify the strings, because if you could pick any subsequence, then for example om would be enough to identify \bigotimes in your example.
You can make use of the fact that for a given word, the word with the longest common prefix will be adjacent to it in lexicographical order.
Since your dictionary seems to be sorted already, you can figure out the solution for every word by finding the longest prefix that disambiguates it from both its neighbors.
Example:
>>> lst = r"""
... \bigoplus
... \bigotimes
... \bigskip
... \bigsqcup
... \biguplus
... \bigvee
... \bigwedge
... """.split()
>>> lst.sort() # necessary if lst is not already sorted
>>> lst = [""] + lst + [""]
>>> import os
>>> def cp(x): return len(os.path.commonprefix(x))
...
>>> { lst[i]: 1 + max(cp(lst[i-1:i+1]), cp(lst[i:i+2])) for i in range(1,len(lst)-1) }
{'\\bigvee': 5,
'\\bigsqcup': 6,
'\\biguplus': 5,
'\\bigwedge': 5,
'\\bigotimes': 6,
'\\bigoplus': 6,
'\\bigskip': 6}
The numbers indicate how long the minimal uniquely identifying prefix of a word is.
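If you want the prefixes themselves rather than just their lengths, you can slice each word with its computed count (continuing the same session; lengths is just a name for the dict computed above):
>>> lengths = { lst[i]: 1 + max(cp(lst[i-1:i+1]), cp(lst[i:i+2]))
...             for i in range(1, len(lst) - 1) }
>>> { w: w[:n] for w, n in lengths.items() }  # e.g. '\\bigoplus' -> '\\bigop', '\\bigskip' -> '\\bigsk'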
Thought I'd dump this here since it was the most similar to a question I was about to ask:
Looking for a better solution (will report back when I find one) to iterating through a sequence of strings and mapping the shortest unique prefix to each.
For example, in a sequence of:
['blue', 'black', 'bold']
# 'blu' --> 'blue'
# 'bla' --> 'black'
# 'bo' --> 'bold'
Looking to improve upon my first, feeble solution. Here's what I came up with:
# Note: iterating through the keys in a dict, mapping the shortest
# unique prefix to the original string.
shortest_unique_strings = {}
for k in mydict:
    for ix in range(len(k)):
        # when the list comp below matches only one key,
        # k[:ix+1] is the shortest unique prefix of k
        if len([key for key in mydict if key.startswith(k[:ix+1])]) == 1:
            shortest_unique_strings[k[:ix+1]] = k
            break
Note on improving efficiency: we should be able to remove keys/strings that have already been found, so that successive searches don't have to repeat work on those items.
Note: I specifically refrained from creating/using any functions outside of built-ins.

algorithms for fast string approximate matching

Given a source string s and n equal length strings, I need to find a quick algorithm to return those strings that have at most k characters that are different from the source string s at each corresponding position.
What is a fast algorithm to do so?
PS: I should mention that this is an academic question; I want to find the most efficient algorithm if possible.
Also, I missed one very important piece of information: the n equal-length strings form a dictionary, against which many source strings s will be queried. There seems to be room for a preprocessing step to make this more efficient.
My gut instinct is just to iterate over each of the n strings, maintaining a counter of how many characters differ from s, but I'm not claiming it is the most efficient solution. However, it would be O(n), so unless this is a known performance problem or an academic question, I'd go with that.
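A minimal sketch of that brute-force filter (dictionary, s, and k are assumed inputs, named here only for illustration):
def differs_at_most(s, t, k):
    # count positions where the equal-length strings differ
    return sum(a != b for a, b in zip(s, t)) <= k

matches = [t for t in dictionary if differs_at_most(s, t, k)]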
Sedgewick in his book "Algorithms" writes that Ternary Search Tree allows "to locate all words within a given Hamming distance of a query word". Article in Dr. Dobb's
Given that the strings are fixed length, you can compute the Hamming distance between two strings to determine the similarity; this is O(n) on the length of the string. So, worst case is that your algorithm is O(nm) for comparing your string against m words.
As an alternative, a fast solution that's also a memory hog is to preprocess your dictionary into a map. Keys are tuples (p, c), where p is a position in the string and c is the character at that position; values are the strings that have character c at position p (so "the" will be in the map under (0, 't'), (1, 'h'), and (2, 'e')). To query the map, iterate through the query string's characters and construct a result map from the retrieved strings: keys are strings, values are the number of times each string was retrieved from the primary map (so with the query string "the", the key "thx" will have a value of 2, and the key "tea" will have a value of 1). Finally, iterate through the result map and discard strings whose values are less than K.
You can save memory by discarding keys that can't possibly reach K once the result map has been completed. For example, if K is 5 and N is 8, then from the 5th character of the query string onward you can discard any retrieved strings that aren't already in the result map, since they can no longer reach 5 matching characters. Or, when you've finished with the 6th character of the query string, you can iterate through the result map and remove all keys whose values are less than 3.
If need be you can offload the primary precomputed map to a NoSql key-value database or something along those lines in order to save on main memory (and also so that you don't have to precompute the dictionary every time the program restarts).
Rather than storing a tuple (p, c) as the key in the primary map, you can instead concatenate the position and character into a string (so (5, 't') becomes "5t", and (12, 'x') becomes "12x").
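A rough sketch of that precomputed map in Python (names are made up for illustration; since "at most k different" out of length N means at least N - k matching, you'd query with min_matches = N - k):
from collections import defaultdict

def build_index(words):
    # primary map: (position, character) -> strings with that character there
    index = defaultdict(list)
    for w in words:
        for p, c in enumerate(w):
            index[(p, c)].append(w)
    return index

def query(index, s, min_matches):
    # result map: string -> number of positions where it agrees with s
    hits = defaultdict(int)
    for p, c in enumerate(s):
        for w in index.get((p, c), []):
            hits[w] += 1
    return [w for w, n in hits.items() if n >= min_matches]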
Without knowing where in each input string the matching characters will be, for a particular string you might need to check every character no matter what order you check them in. Therefore it makes sense to just iterate over each string character by character and keep a running total of mismatches. If i is the number of mismatches so far, return false as soon as i exceeds k, and return true as soon as the number of unchecked characters remaining is at most k - i (even if they all mismatched, the total would stay within k).
Note that depending on how long the strings are and how many mismatches you allow, it might be faster to iterate over the whole string rather than performing these checks, or perhaps to perform them only every couple of characters. Play around with it to see what gives the fastest performance.
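An illustrative version of those two early-exit checks (a sketch, not benchmarked):
def at_most_k_mismatches(s, t, k):
    mismatches = 0
    for pos, (a, b) in enumerate(zip(s, t)):
        if a != b:
            mismatches += 1
            if mismatches > k:
                return False        # too many mismatches: bail out early
        if len(s) - pos - 1 <= k - mismatches:
            return True             # even all-mismatching leftovers stay within k
    return True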
My method, if we're thinking out loud :P I can't see a way to do this without going through each of the n strings, but I'm happy to be corrected. It would begin with a preprocessing pass that saves a second copy of your n strings with the characters sorted in ascending order.
The comparison would then check each string a character at a time, say n', against each character in s, say s'.
If s' is less than n', they're not equal; move to the next s'. If n' is less than s', go to the next n'. Otherwise record a matching character. Repeat until k mismatches are found or enough matches are found, and mark the string accordingly.
For further consideration, additional preprocessing could compute, for each adjacent pair of strings in n, the total number of characters in which they differ. When comparing a string in n to s, if a sufficient difference already exists between it and its adjacent string, there may be no need to compare the neighbor at all.

find frequency of every word

I was asked this question in an interview, but I was not able to answer it.
The question is:
You are given a directed graph in which every node is a character, and you are also given an array of strings.
The task is to calculate the frequency of every string in the array by searching the graph.
My approach: I used a trie and a suffix tree, but the interviewer was not fully satisfied. Can you give me an algorithm for the given problem?
How about the following... To find the number of occurrences of a string s in a directed graph:
Start with a breadth-first search (marking already-visited nodes to avoid cycles).
When the first character is found, switch to a depth-first search with max depth = length(s).
If the string sequence is detected, increment the occurrence count for each occurrence found by the DFS.
Resume the BFS.
Some caveats:
I do not believe the DFS should share the BFS's visited-node list (you may need to go back to the beginning and overlap, for example).
The BFS should also not share the DFS's visited list. For example, you could be looking for "Alan", encounter "AAlan", and need to make sure you restart on the second A.
Now for an array, I can just repeat this procedure for each string. Sure, there may be a more efficient solution, but I'd start off thinking about it this way.
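A minimal sketch of that search-from-each-start idea (adj maps a node to its successors and labels maps a node to its character; both representations are assumptions, not from the original answer):
def count_from(node, adj, labels, s, pos):
    # count walks from `node` that spell s[pos:], depth-limited by len(s)
    if labels[node] != s[pos]:
        return 0
    if pos == len(s) - 1:
        return 1
    return sum(count_from(u, adj, labels, s, pos + 1)
               for u in adj.get(node, []))

def count_occurrences(adj, labels, s):
    # try every node as a potential starting point
    return sum(count_from(v, adj, labels, s, 0) for v in labels)
For example, with adj = {1: [2], 2: [3]} and labels = {1: 'c', 2: 'a', 3: 't'}, count_occurrences(adj, labels, 'cat') returns 1.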
Did your answer include any discussion of a breadth-first or depth-first search? If someone mentioned searching a graph, I'd almost always reply with a variation of one of these.
Here's another solution:
First we need to do some preprocessing on the string array.
Let's define C as the set of all the characters composing the strings in the array.
For each character in C, we keep track of every string containing that character, its position in that string, and a Boolean value stating whether it is the last char of that string. This can be done using a dictionary.
For example, let's say our array is ['one', 'two', 'three']. Our dictionary would look something like this:
'o': (0, 0, false), (1, 2, true)
'n': (0, 1, false)
'e': (0, 2, true), (2, 3, false), (2, 4, true)
't': (1, 0, false), (2, 0, false)
'w': (1, 1, false)
'h': (2, 1, false)
'r': (2, 2, false)
Next we are going to use DFS and Dynamic Programming.
Basically, whenever you visit an edge, you check the parent and the child in the dict to see if they compose a substring, and you store that information.
Using this method, you can easily detect all occurrences of every string in the array.
Building the preprocessing table can be done in O(L), where L is the sum of the lengths of all the strings in the array.
Discovering all occurrences can be done in O(m * k), where m is the number of edges (not the number of nodes, as a node can be visited multiple times) and k is the number of strings.
The implementation can be a little tricky and there are some pitfalls you should avoid.
See this graph: each level has all 4*4 edges (hard to draw, please bear with me).
There may be a lot of occurrences.
I think the interviewer may be expecting dynamic programming:
Process each string individually; let f[i][j] denote the number of ways to complete the string's last j letters starting from node i. The rest is easy.
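A sketch of that recurrence, assuming the same adj/labels representation as the earlier sketch (memoized with lru_cache so each (node, j) state is computed only once):
from functools import lru_cache

def count_with_dp(adj, labels, s):
    m = len(s)

    @lru_cache(maxsize=None)
    def f(i, j):
        # number of walks starting at node i that spell the last j letters of s
        if labels[i] != s[m - j]:
            return 0
        if j == 1:
            return 1
        return sum(f(u, j - 1) for u in adj.get(i, ()))

    return sum(f(i, m) for i in labels)
This runs in O(edges * len(s)) per string, which lines up with the O(m * k) bound quoted in the earlier answer across k strings.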
