Check if a set of characters is contained in a string?

There is a pool of randomly chosen letters, and you want to make a word from them. I found some code that can help with this, but if the word has, for example, two L's and the pool has only one, I'd like the program to detect that.

If I understand this correctly, you will also need a list of all valid words in whichever language you are using.
Assuming you have this, then one strategy for solving this problem could be to generate a key for every word in the dictionary that is a sorted list of the letters in that word. You could then group all words in the dictionary by these keys.
Then the task of finding out if a valid word can be constructed from a given list of random characters would be easy and fast.
Here is a simple implementation of what I am suggesting:
list_of_all_valid_words = ['this', 'pot', 'is', 'not', 'on', 'top']

def make_key(word):
    # words that use exactly the same letters share the same key
    return "".join(sorted(word))

lookup_dictionary = {}
for word in list_of_all_valid_words:
    key = make_key(word)
    lookup_dictionary.setdefault(key, set()).add(word)

def words_from_chars(s):
    # sorted() gives a stable order; sets have no guaranteed order
    return sorted(lookup_dictionary.get(make_key(s), set()))

print(words_from_chars('xyz'))
print(words_from_chars('htsi'))
print(words_from_chars('otp'))
Output:
[]
['this']
['pot', 'top']
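Note that the lookup above only finds words that use every letter in the pool. For the case described in the question (a word needing two L's when the pool has only one), a minimal sketch could compare letter counts with collections.Counter. The function name can_make and the sample pools are made up for illustration.

```python
from collections import Counter

def can_make(word, pool):
    """Return True if pool holds every letter of word often enough."""
    need = Counter(word)
    have = Counter(pool)
    # A Counter returns 0 for letters that are missing from the pool.
    return all(have[letter] >= n for letter, n in need.items())

print(can_make("hello", "ohlelx"))  # True: the pool has two l's
print(can_make("hello", "ohelx"))   # False: the pool has only one l
```

Unlike the sorted-key lookup, this check allows leftover letters in the pool.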

Related

Getting individual characters in a list instead of the words themselves

def get_unique_words(text):
    split_text = text.split()
    print(split_text)
    for word in text:
        print(word)
Hi there,
in this code, I am trying to create a list which contains the words of text sorted alphabetically. For example, with The quick brown fox jumps over the lazy dog!, it would give ['brown', 'dog', 'fox', 'jumps', 'lazy', 'over', 'quick', 'the'].
However, with the code above, I always get the individual characters instead of a list of words.
Why am I getting the individual characters instead of words?
Note: I don't need to type out the get_unique_words( ) part
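I do not believe the for loop is necessary, and I think it is what causes the issue you described: it iterates over text (the original string) rather than split_text, so each word is a single character. A for loop could still be used to check for duplicate values.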
I think you can sort your list alphabetically in the following way
def get_unique_words(text):
    # converts all alphabetical characters to lower
    lower_text = text.lower()
    # splits string on space character
    split_text = lower_text.split(' ')
    # sorts values in list
    split_text.sort()
    # empty list to populate unique words
    results_list = []
    # iterate over the list
    for word in split_text:
        # check to see if value is already in results list
        if word not in results_list:
            # append the word if it is unique
            results_list.append(word)
    print(results_list)

text = "The quick brown fox jumps over the lazy dog!"
get_unique_words(text)
This returns the following list
['brown', 'dog!', 'fox', 'jumps', 'lazy', 'over', 'quick', 'the']
Taking the next step you probably want to remove duplicates and also drop any non-alphabetical characters.
For the non-alphabetical characters, it would be best to use regex which can be imported
import re
Here is a good post on how to remove non-alphabetical characters
Python, remove all non-alphabet chars from string
And for removing duplicates it may be best to convert the list into a dictionary. Here is a post on how to do just that
https://www.w3schools.com/python/trypython.asp?filename=demo_howto_remove_duplicates
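As a rough sketch of those two steps combined, the version below uses re.findall to keep only runs of letters and a set to drop duplicates (a set is used here instead of the dictionary approach from the linked post). It assumes plain ASCII text and returns the list rather than printing it.

```python
import re

def get_unique_words(text):
    # Lowercase first, then keep only runs of the letters a-z.
    words = re.findall(r"[a-z]+", text.lower())
    # set() removes duplicates; sorted() returns an alphabetical list.
    return sorted(set(words))

print(get_unique_words("The quick brown fox jumps over the lazy dog!"))
# ['brown', 'dog', 'fox', 'jumps', 'lazy', 'over', 'quick', 'the']
```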
You aren't returning anything from your function, but that's a separate issue.
Calling list on a string will make a list of all the characters. If you do my_list.append(word) instead of my_list = my_list + list(word), you should get closer to what you are looking for.
Also note that capital letters are sorted before lowercase, so as is your list will start with "The".

How can I create a dictionary for a large amount of text and list the most frequent word?

I am new to coding and I am trying to create a dictionary from a large body of text. I would also like the most frequent word to be shown.
For example, if I had a block of text such as:
text = '''George Gordon Noel Byron was born, with a clubbed right foot, in London on January 22, 1788. He was the son of Catherine Gordon of Gight, an impoverished Scots heiress, and Captain John (“Mad Jack”) Byron, a fortune-hunting widower with a daughter, Augusta. The profligate captain squandered his wife’s inheritance, was absent for the birth of his only son, and eventually decamped for France as an exile from English creditors, where he died in 1791 at 36.'''
I know the steps I would like the code to take. I want words that are the same but capitalised to be counted together so Hi and hi would count as Hi = 2.
I am trying to get the code to loop through the text and create a dictionary showing how many times each word appears. My final goal is to then have the code state which word appears most frequently.
I don't know how to approach such a large amount of text, the examples I have seen are for a much smaller amount of words.
I have tried to remove white space and also create a loop but I am stuck and unsure if I am going the right way about coding this problem.
a.replace(" ", "")
print(a.replace)  # this is what I tried to write to remove white spaces
# this gave <built-in method replace of str object at 0x000001A49AD8DAE0>, I have no idea what this means!
I am unsure of how to create the dictionary.
To count the word frequency would I do something like:
frequency = {}
for value in my_dict.values():
    if value in frequency:
        frequency[value] = frequency[value] + 1
    else:
        frequency[value] = 1
What I was expecting to get was a dictionary that lists each word shown with a numerical value showing how often it appears in the text.
Then I wanted to have the code show the word that occurs the most.
This may be too simple for your requirements, but you could do this to create a dictionary of each word and its number of repetitions in the text.
text = "..."  # text here.
frequency = {}
for word in text.split(" "):
    if word not in frequency:
        frequency[word] = 1
    else:
        frequency[word] += 1
print(frequency)
This only splits the text up at each ' ' and counts the number of each occurrence.
If you want to get only the words, you may have to remove the ',' and other characters which you do not wish to have in your dictionary.
To remove characters such as ',' you can do:
text = text.replace(",", "")
Hope this helps and happy coding.
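The question also asks for Hi and hi to be counted together. One possible sketch, assuming ASCII punctuation only (string.punctuation does not include curly quotes such as “ and ”), lowercases the text and strips punctuation from both ends of each word. The function name word_frequencies and the sample sentence are made up for illustration.

```python
import string

def word_frequencies(text):
    frequency = {}
    # split() with no argument splits on any run of whitespace.
    for raw in text.lower().split():
        # Remove leading and trailing ASCII punctuation such as "," or "!".
        word = raw.strip(string.punctuation)
        if not word:
            continue  # the token was punctuation only
        frequency[word] = frequency.get(word, 0) + 1
    return frequency

freq = word_frequencies("Hi there, hi again. Hi!")
print(freq)                     # {'hi': 3, 'there': 1, 'again': 1}
print(max(freq, key=freq.get))  # hi
```

The length of the text does not change the approach: the loop visits each word once, so the same code works for a paragraph or a whole book.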
First, to remove all non-alphabet characters, aside from ', we can use regex.
After that, we go through a list of the words and use a dictionary:
import re
d = {}
text = text.split(" ")  # turns it into a list
# each word is split into characters; anything that is not a letter or apostrophe is dropped
text = [re.findall("[a-zA-Z']", word) for word in text]
# puts each word back together
text = ["".join(chars) for chars in text]
# there may be a better way to do the two steps above. If so, please tell.
for word in text:
    if word in d:
        d[word] += 1
    else:
        d[word] = 1
# when testing I got one key ""; this happens for tokens with no letters, such as "22,"
d.pop("", None)
You can use regex and Counter from collections:
import re
from collections import Counter
text = "This cat is not a cat, even if it looks like a cat"
# Extract words with regex, ignoring symbols and space
words = re.compile(r"\b\w+\b").findall(text.lower())
count = Counter(words)
# {'cat': 3, 'a': 2, 'this': 1, 'is': 1, 'not': 1, 'even': 1, 'if': 1, 'it': 1, 'looks': 1, 'like': 1}
# To get the most frequent
most_frequent = max(count, key=lambda k: count[k])
# 'cat'
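As an alternative to max, Counter also has a most_common method that returns (word, count) pairs sorted by count. A short sketch with the same sample sentence:

```python
import re
from collections import Counter

text = "This cat is not a cat, even if it looks like a cat"
count = Counter(re.findall(r"\b\w+\b", text.lower()))

# most_common(n) returns the n most frequent words with their counts.
print(count.most_common(2))  # [('cat', 3), ('a', 2)]

word, times = count.most_common(1)[0]
print(word, times)  # cat 3
```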

Comparing user input list with dictionary and printing out corresponding value

Starting out by saying this is for school and I'm still learning so I'm not looking for a direct solution.
What I want to do is take an input from a user (one word or more).
I then make it into a list.
I have my dictionary and the code that I'm posting is printing out the values correctly.
My question is how do I compare the characters in my list to the keys in the dictionary and then print only those values that correspond to the keys?
I have also read a ton of different questions regarding dictionaries but it was no help at all.
Example of output:
Word: wow
Output: 96669
user_word = input("Please enter a word: ")
user_listed = list(user_word)

def keypresses():
    my_dict = {'.':1, ',':11, '?':111, '!':1111, ':':11111, 'a':2, 'b':22, 'c':222, 'd':3, 'e':33, 'f':333, 'g':4, 'h':44,
               'i':444, 'j':5, 'k':55, 'l':555, 'm':6, 'n':66, 'o':666, 'p':7, 'q':77, 'r':777, 's':7777, 't':8, 'u':88,
               'v':888, 'w':9, 'x':99, 'y':999, 'z':9999, ' ':0}
    for key, value in my_dict.items():
        print(value)
I am not going to hand you code for the project, but I will definitely point you in the right direction.
There are two parts to this in my view: match each character to a key to get its value, and combine the numbers for the output.
For the first part, you can iterate character by character with a simple for loop:
for letter in 'string':
    print(letter)
This would output s, t, r, i, n, g, one letter per line. So you can use this to find the value for the key (each letter).
Then, you can get the value as a string (so as not to add the numbers mathematically), with something like:
letter = 'w'
value = my_dict[letter]
value_as_string = str(value)
Then, combine this all in a for loop and concatenate the strings to create the desired output.

Matching the value of a word in a list with the place value of another list

I am trying to work out how I can compare a list of words against a string and report back the word number from list one when they match. I can easily get the unique list of words from a sentence - just removing duplicates, and with enumerate I can get a value for each word, so Mary had a little lamb becomes 1, Mary, 2, had, 3, a etc. But I cannot work out how to then search the original list again and replace each word with its number value (so it becomes 1 2 3 etc).
Any ideas greatly received!
my_list.index(word)
will return the index of the item word within my_list. You can start digging into the documentation here
Thank you for this info. I can see the logic for this and it should work, however I get:
line 27, in <module>
output=words.index(result)
ValueError: ['word1', 'word2'] is not in list
With the following code:
def remove_duplicates(words):
    output = []
    seen = set()
    for value in words:
        # If value has not been encountered yet,
        # ... add it to both list and set.
        if value not in seen:
            output.append(value)
            seen.add(value)
    return output

# Remove duplicates from this list.
sentence = input("Enter a sentence ")
words = sentence.split(' ')
result = remove_duplicates(words)
print(result)
Very confusing :(
I have found an answer on here:
positions = [ i+1 for i in range(len(result)) if each == result[i]]
Which works well.
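The ValueError above arises because index() was given the whole list result rather than a single word. A minimal sketch of one way to replace every word with its position, assuming 1-based numbering as in the question and a fixed sample sentence instead of input():

```python
def remove_duplicates(words):
    output = []
    seen = set()
    for value in words:
        if value not in seen:
            output.append(value)
            seen.add(value)
    return output

sentence = "Mary had a little lamb Mary had a lamb"  # sample input
words = sentence.split()
unique = remove_duplicates(words)

# Build the word -> position mapping once, numbering from 1.
position = {word: i for i, word in enumerate(unique, start=1)}

# Look up each word of the original sentence, one word at a time.
numbers = [position[word] for word in words]
print(numbers)  # [1, 2, 3, 4, 5, 1, 2, 3, 5]
```

Using unique.index(word) + 1 for each word would give the same numbers; the dictionary just avoids searching the list again for every word.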

Generate sensible strings using a pattern

I have a table of strings (about 100,000) in following format:
pattern , string
e.g. -
*l*ph*nt , elephant
c*mp*t*r , computer
s*v* , save
s*nn] , sunny
]*rr] , worry
To simplify, assume a * denotes a vowel, a consonant stands unchanged and ] denotes either a 'y' or a 'w' (say, for instance, semi-vowels/round-vowels in phonology).
Given a pattern, what is the best way to generate the possible sensible strings? A sensible string is defined as a string having each of its consecutive two-letter substrings, that were not specified in the pattern, inside the data-set.
e.g. -
h*ll* --> hallo, hello, holla ...
'hallo' is sensible because 'ha', 'al', 'lo' can be seen in the data-set as with the words 'have', 'also', 'low'. The two letters 'll' is not considered because it was specified in the pattern.
What are the simple and efficient ways to do this?
Are there any libraries/frameworks for achieving this?
I have no specific language in mind, but would prefer to use Java for this program.
This is particularly well suited to Python itertools, set and re operations:
import re
import itertools

VOWELS = 'aeiou'
SEMI_VOWELS = 'wy'
DATASET = '/usr/share/dict/words'
SENSIBLES = set()

def digraphs(word, digraph=r'..'):
    '''
    >>> sorted(digraphs('bar'))
    ['ar', 'ba']
    '''
    # the lookahead makes findall return overlapping two-letter matches
    return set(re.findall(r'(?=(%s))' % digraph, word))

def expand(pattern, wildcard, elements):
    '''
    >>> expand('h?', '?', 'aeiou')
    ['ha', 'he', 'hi', 'ho', 'hu']
    '''
    tokens = pattern.split(wildcard)
    results = set()
    # product (not permutations), so the same letter can fill several wildcards
    for combo in itertools.product(elements, repeat=len(tokens) - 1):
        filled = [tokens[0]]
        for letter, token in zip(combo, tokens[1:]):
            filled.extend((letter, token))
        results.add(''.join(filled))
    return sorted(results)

def enum(pattern):
    # two-letter substrings fully specified in the pattern are not checked
    not_sensible = digraphs(pattern, r'[^*\]]{2}')
    for p in expand(pattern, '*', VOWELS):
        for q in expand(p, ']', SEMI_VOWELS):
            if (digraphs(q) - not_sensible).issubset(SENSIBLES):
                print(q)

## Init the data-set (may be long...)
## you may want to pre-compute this
## and adapt it to your data-set.
with open(DATASET, 'r') as dataset_file:
    for word in dataset_file:
        SENSIBLES.update(digraphs(word.rstrip()))

enum('*l*ph*nt')
enum('s*nn]')
enum('h*ll*')
As there aren't many possibilities for two-letter substrings, you can go through your dataset and generate a table that contains the count for every two-letter substring, so the table will look something like this:
ee 1024 times
su 567 times
...
xy 45 times
xz 0 times
The table will be small as you'll only have about 26*26 = 676 values to store.
You have to do this only once for your dataset (or update the table every time it changes if the dataset is dynamic) and can use the table for evaluating possible strings. For example, add the values for 'ha', 'al' and 'lo' to get a "score" for the string 'hallo'. After that, choose the string(s) with the highest score(s).
Note that the scoring can be improved by checking longer substrings, for example three letters, but this will also result in larger tables.
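A rough Python sketch of this counting and scoring idea follows. The tiny word list, the candidate strings and the helper names bigram_counts and score are all made up for illustration. For simplicity it scores every two-letter substring, including pairs fixed by the pattern such as 'll'; those add the same amount to every candidate of one pattern, so the ranking is unaffected.

```python
from collections import Counter

def bigram_counts(words):
    """Count every two-letter substring across the data set."""
    counts = Counter()
    for word in words:
        for i in range(len(word) - 1):
            counts[word[i:i + 2]] += 1
    return counts

def score(candidate, counts):
    """Sum the data-set counts of the candidate's two-letter substrings."""
    return sum(counts[candidate[i:i + 2]] for i in range(len(candidate) - 1))

words = ["have", "also", "low", "hello", "help"]  # hypothetical data set
counts = bigram_counts(words)

candidates = ["hallo", "hello", "hillo"]  # expansions of h*ll*
for candidate in candidates:
    print(candidate, score(candidate, counts))

best = max(candidates, key=lambda c: score(c, counts))
print(best)  # hello
```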
