Searching for strings in a 'dictionary' file with multiple wildcard values

I am trying to create a function which will take 2 parameters. A word with wildcards in it like "*arn*val" and a file name containing a dictionary. It returns a list of all words that match the word like ["carnival"].
My code works fine for anything with only one "*" in it, however any more and I'm stumped as to how to do it.
Just searching for the wildcard string in the file was returning nothing.
Here is my code:
dictionary_file = open(dictionary_filename, 'r')
dictionary = dictionary_file.read()
dictionary_file.close()
dictionary = dictionary.split()
alphabet = ["a","b","c","d","e","f","g","h","i",
            "j","k","l","m","n","o","p","q","r",
            "s","t","u","v","w","x","y","z"]
new_list = []
for letter in alphabet:
    if wildcard.replace("*", letter) in dictionary:
        new_list += [wildcard.replace("*", letter)]
return new_list
The parameters: the first is the wildcard string (wildcard), and the second is the dictionary file name (dictionary_filename).
Most answers on this site were about Regex, which I have no knowledge of.

Your particular error is that .replace replaces all occurrences with the same letter, e.g., "*arn*val" -> "carncval" or "iarnival". You want a different letter at each position. You could fix it with a second nested loop over the alphabet (or use itertools.product() to generate all possible letter pairs), but a simpler way is to use regular expressions:
import re
# each `*` corresponds to an ascii lowercase letter
pattern = re.escape(wildcard).replace("\\*", "[a-z]")
matches = list(filter(re.compile(pattern+"$").match, known_words))
Note: it doesn't support escaping * in the wildcard.
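For reference, the regex approach can be wrapped into a small function. This is only a minimal sketch: find_matches is a hypothetical name, it takes an already loaded list of words instead of a file name, and it assumes every * stands for exactly one lowercase ASCII letter.

```python
import re

def find_matches(wildcard, words):
    """Return the words matching wildcard, where each * is one lowercase letter.

    find_matches is a hypothetical helper name, not part of the question's code.
    """
    # re.escape turns "*" into "\*"; swap each escaped star for a one-letter class.
    pattern = re.compile(re.escape(wildcard).replace(r"\*", "[a-z]"))
    # fullmatch requires the whole word to match, so no explicit anchors are needed.
    return [word for word in words if pattern.fullmatch(word)]

sample_words = ["carnival", "carnivals", "barn", "garnival"]
print(find_matches("*arn*val", sample_words))
```

Using fullmatch instead of match plus "$" also avoids accepting a word that is followed by a trailing newline.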
If the input wildcards are file patterns (where * matches any number of characters rather than exactly one), then you could use the fnmatch module to filter words:
import fnmatch
matches = fnmatch.filter(known_words, wildcard)
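To make the difference between the two wildcard styles concrete, here is a small sketch with a made-up word list (known_words below is illustrative data, not the asker's dictionary). In fnmatch patterns * matches any run of characters, including none, while ? matches exactly one character.

```python
import fnmatch

# Illustrative data only; in the question this would come from the dictionary file.
known_words = ["carnival", "carnal", "barnval", "scarnival"]

# "*" matches any run of characters, so words of different lengths qualify.
star_matches = fnmatch.filter(known_words, "*arn*val")

# "?" matches exactly one character, which is closer to the question's meaning of "*".
single_char_matches = fnmatch.filter(known_words, "?arn?val")

print(star_matches)
print(single_char_matches)
```

If each wildcard should stand for exactly one character, translating * to ? before calling fnmatch.filter is one option, at the cost of also matching digits and punctuation.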

Related

How can I find all the strings that contains "/1" and remove from a file using Python?

I have a file that contains strings like "1405079/1"; the only thing they have in common is the "/1" at the end. I want to find those strings and remove them. Below is my sample code, but it's not doing anything.
with open("jobstat.txt","r") as jobstat:
    with open("runjob_output.txt", "w") as runjob_output:
        for line in jobstat:
            string_to_replace = ' */1'
            line = line.replace(string_to_replace, " ")
with open("jobstat.txt","r") as jobstat:
    with open("runjob_output.txt", "w") as runjob_output:
        for line in jobstat:
            string_to_replace ='/1'
            line =line.rstrip(string_to_replace)
            print(line)
Anytime you have a "pattern" you want to match against, use a regular expression. The pattern here, given the information you've provided, is a string with an arbitrary number of digits followed by /1.
You can use re.sub to match against that pattern, and replace instances of it with another string.
import re
original_string= "some random text with 123456/1, and midd42142/1le of words"
pattern = r"\d*\/1"
replacement = ""
re.sub(pattern, replacement, original_string)
Output:
'some random text with , and middle of words'
Replacing instances of the pattern with something else:
>>> re.sub(pattern, "foo", original_string)
'some random text with foo, and middfoole of words'
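Applied to the file from the question, the same idea could look like the sketch below. Note that neither attempt in the question ever writes to runjob_output, and str.replace treats ' */1' as literal text, not as a wildcard. The names remove_job_ids and JOB_ID are hypothetical, and the pattern is an assumption: it requires at least one digit before /1 and a word boundary after it, so that something like 22/12 is left alone.

```python
import re

# Assumed pattern: one or more digits, "/1", then a word boundary.
JOB_ID = re.compile(r"\d+/1\b")

def remove_job_ids(lines):
    """Yield each line with every '<digits>/1' token removed."""
    for line in lines:
        yield JOB_ID.sub("", line)

sample_lines = ["job 1405079/1 done\n", "keep 22/12 here\n"]
cleaned = list(remove_job_ids(sample_lines))
print(cleaned)
```

Because an open text file is an iterable of lines, the input file can be passed directly as lines, and the result can be written with runjob_output.writelines(remove_job_ids(jobstat)).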

Get number from string in Python

I have a string, I have to get digits only from that string.
url = "www.mylocalurl.com/edit/1987"
Now from that string, I need to get 1987 only.
I have been trying this approach,
id = [int(i) for i in url.split() if i.isdigit()]
But I only get an empty list, [].
You can use regex and get the digit alone in the list.
import re
url = "www.mylocalurl.com/edit/1987"
digit = re.findall(r'\d+', url)
output:
['1987']
Replace all non-digits with blank (effectively "deleting" them):
import re
num = re.sub(r'\D', '', url)
You aren't getting anything because, by default, the .split() method splits a string on whitespace. Since the URL contains no spaces, it comes back as a single item, and that item fails the isdigit() check. What you can do instead is capture the digits using a regex. For example:
import re
url = "www.mylocalurl.com/edit/1987"
regex = r'(\d+)'
numbers = re.search(regex, url)
captured = numbers.groups()[0]
If you do not know what regular expressions are, the code is basically saying: using the regex r'(\d+)', which means "capture one or more consecutive digits", search through the url. captured then holds the first captured group, which is the string '1987'.
If you don't want to use a regex, you can still use your .split() method, but this time pass / as the separator, for example url.split('/'), and take the last element.
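The non-regex route could be sketched like this. trailing_id is a hypothetical helper name, and the sketch assumes the number is always the last path segment of the URL.

```python
def trailing_id(url):
    """Return the last path segment as an int, or None if it is not a number.

    trailing_id is a hypothetical helper; it assumes the ID is the final segment.
    """
    # Drop any trailing slash, then split once from the right on "/".
    last_segment = url.rstrip("/").rsplit("/", 1)[-1]
    # isdecimal() only accepts characters that int() can parse.
    return int(last_segment) if last_segment.isdecimal() else None

url = "www.mylocalurl.com/edit/1987"
print(trailing_id(url))
```

Unlike re.findall(r'\d+', url), this ignores digits that appear elsewhere in the URL, which may or may not be what you want.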

How to get demangled function name using regex

I have a list of mangled function names like _Z6__comp7StudentS_ and _Z4SortiSt6vectorI7StudentSaIS0_EE. I read the wiki and found out that they follow a defined structure: _Z is the mangled-symbol prefix, followed by a number and then the function name of that length.
So I want to retrieve that function name using a regex. The closest I have come is _Z(?:\d)(?<function_name>[a-z_A-Z]){\1}. But referring to \1 won't work because it's a string, right? Is there a single regex pattern solution to this?
You can use 2 capture groups, and slice the text of capture group 2 using the length from capture group 1:
import re
pattern = r"_Z(\d+)([a-z_A-Z]+)"
s = "_Z4SortiSt6vectorI7StudentSaIS0_EE"
m = re.search(pattern, s)
if m:
    print(m.group(2)[0: int(m.group(1))])
Output
Sort
Using _Z6__comp7StudentS_ will return __comp
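A variation on the same idea, offered as a sketch rather than a full demangler: match only the _Z prefix and the length, then slice the name straight out of the original string. That way the name may contain digits, which the [a-z_A-Z]+ group would cut short. function_name is a hypothetical helper, and it only handles symbols where the length comes directly after _Z.

```python
import re

# Only the prefix and the length are matched; the name itself is sliced out.
LENGTH_PREFIX = re.compile(r"_Z(\d+)")

def function_name(mangled):
    """Return the length-prefixed name after _Z, or None if the shape is unexpected."""
    m = LENGTH_PREFIX.match(mangled)
    if m is None:
        return None
    length = int(m.group(1))
    start = m.end()
    name = mangled[start:start + length]
    # Reject symbols that are shorter than the length they declare.
    return name if len(name) == length else None

for symbol in ["_Z6__comp7StudentS_", "_Z4SortiSt6vectorI7StudentSaIS0_EE"]:
    print(function_name(symbol))
```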

How to remove/delete characters from end of string that match another end of string

I have thousands of strings (not in English) that are in this format:
['MyWordMyWordSuffix', 'SameVocabularyItemMyWordSuffix']
I want to return the following:
['MyWordMyWordSuffix', 'SameVocabularyItem']
Because strings are immutable and I want to start the matching from the end I keep confusing myself on how to approach it.
My best guess is some kind of loop that starts from the end of the strings and keeps checking for a match.
However, since I have so many of these to process it seems like there should be a built in way faster than looping through all the characters, but as I'm still learning Python I don't know of one (yet).
The nearest example I could find already on SO can be found here but it isn't really what I'm looking for.
Thank you for helping me!
You can use commonprefix from os.path to find the common suffix between them:
from os.path import commonprefix
def getCommonSuffix(words):
    # get common suffix by reversing both words and finding the common prefix
    prefix = commonprefix([word[::-1] for word in words])
    return prefix[::-1]
which you can then use to slice out the suffix from the second string of the list:
word_list = ['MyWordMyWordSuffix', 'SameVocabularyItemMyWordSuffix']
suffix = getCommonSuffix(word_list)
if suffix:
    print("Found common suffix:", suffix)
    # filter out suffix from second word in the list
    word_list[1] = word_list[1][0:-len(suffix)]
    print("Filtered word list:", word_list)
else:
    print("No common suffix found")
Output:
Found common suffix: MyWordSuffix
Filtered word list: ['MyWordMyWordSuffix', 'SameVocabularyItem']
Demo: https://repl.it/#glhr/55705902-common-suffix
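Since the question mentions thousands of such strings, the two steps above could be folded into one function and mapped over the data. This is a sketch under the assumption that the data is a list of two-item lists like the one in the question; strip_shared_suffix and the sample pairs are made up for illustration.

```python
from os.path import commonprefix

def strip_shared_suffix(pair):
    """Return [first, second] with the suffix shared by both removed from second.

    strip_shared_suffix is a hypothetical helper name.
    """
    first, second = pair
    # commonprefix compares character by character, so reversing finds the suffix.
    suffix_length = len(commonprefix([first[::-1], second[::-1]]))
    # Slicing by explicit end index stays correct when suffix_length is 0.
    return [first, second[:len(second) - suffix_length]]

pairs = [
    ["MyWordMyWordSuffix", "SameVocabularyItemMyWordSuffix"],
    ["abc", "xyz"],
]
print([strip_shared_suffix(pair) for pair in pairs])
```

The explicit end index matters: second[:-0] would return an empty string when the two words share no suffix.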

Making all letters in a text lowercase using re.sub in python but exclude specific string?

I am writing a script to convert all uppercase letters in a text to lower case using regex, but excluding specific strings/characters such as "TEA", "CHI", "I", "#Begin", "#Language", "ENG", "#Participants", "#Media", "#Transcriber", "#Activities", "SBR", "#Comment" and so on.
The script I currently have is shown below. However, it does not produce the desired output. For instance, when I input "#Activities: SBR", the output given is "#Activities#activities: sbr#activities: sbrSBR". The intended output is "#Activities: SBR".
I am using Python 3.5.2
Can anyone help to provide some guidance? Thank you.
import os
from itertools import chain
import re
def lowercase_exclude_specific_string(line):
    line = line.strip()
    PATTERN = r'[^TEA|CHI|I|#Begin|#Language|ENG|#Participants|#Media|#Transcriber|#Activities|SBR|#Comment]'
    filtered_line = re.sub(PATTERN, line.lower(), line)
    return filtered_line
First, let's see why you're getting the wrong output.
For instance when I input "#Activities: SBR", the output given is "#Activities#activities: sbr#activities: sbrSBR".
This is because your code
PATTERN = r'[^TEA|CHI|I|#Begin|#Language|ENG|#Participants|#Media|#Transcriber|#Activities|SBR|#Comment]'
filtered_line = re.sub(PATTERN, line.lower(), line)
is doing negated character class matching: inside [^...], the | is a literal character rather than alternation, so the pattern matches any single character that is not in the set, and each match is replaced with line.lower() (which is "#activities: sbr").
The code will match ":" and " " (whitespace) and replace both with "#activities: sbr", giving you the result "#Activities#activities: sbr#activities: sbrSBR".
Now to fix that code. Unfortunately, there is no direct way to negate words in a line and apply substitution on the other words on that same line. Instead, you can split the line first into individual words, then apply re.sub on it using your PATTERN. Also, instead of a negated character class, you should use a negative lookahead:
(?!...)
Negative lookahead assertion. This is the opposite of the positive assertion; it succeeds if the contained expression doesn’t match at the current position in the string.
Here's the code I got:
def lowercase_exclude_specific_string(line):
    line = line.strip()
    words = re.split(r"\s+", line)
    result = []
    for word in words:
        # \b stops a longer word that merely starts with an excluded token (e.g. "IS") from being skipped
        PATTERN = r"^(?!(?:TEA|CHI|I|#Begin|#Language|ENG|#Participants|#Media|#Transcriber|#Activities|SBR|#Comment)\b).*$"
        lword = re.sub(PATTERN, word.lower(), word)
        result.append(lword)
    return " ".join(result)
The re.sub only matches words that the negative lookahead does not exclude, and replaces each one with its lowercase form. If a word is excluded, the pattern does not match and re.sub returns it unchanged.
Each word is appended to a list, and the list is joined at the end to rebuild the line.
Samples:
print(lowercase_exclude_specific_string("#Activities: SBR"))
print(lowercase_exclude_specific_string("#Activities: SOME OTHER TEXT SBR"))
print(lowercase_exclude_specific_string("Begin ABCDEF #Media #Comment XXXX"))
print(lowercase_exclude_specific_string("#Begin AT THE BEGINNING."))
print(lowercase_exclude_specific_string("PLACE #Begin AT THE MIDDLE."))
print(lowercase_exclude_specific_string("I HOPe thIS heLPS."))
Output:
#Activities: SBR
#Activities: some other text SBR
begin abcdef #Media #Comment xxxx
#Begin at the beginning.
place #Begin at the middle.
I hope this helps.
EDIT:
As mentioned in the comments, apparently there is a tab between : and the next character. Since the code splits the string on \s+ and rejoins the words with single spaces, the tab isn't preserved, but it can be restored by replacing the colon and the joining space with :\t in the final result.
    return " ".join(result).replace(": ", ":\t")
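As an alternative to splitting and rejoining, re.sub also accepts a function as the replacement, which leaves tabs and other whitespace exactly where they were. The sketch below is a different technique from the lookahead above, not a correction of it. KEEP, TOKEN and lowercase_except are hypothetical names, and the token pattern #?\w+ is an assumption about what counts as a word.

```python
import re

# Tokens that must keep their original casing (taken from the question's list).
KEEP = {"TEA", "CHI", "I", "#Begin", "#Language", "ENG", "#Participants",
        "#Media", "#Transcriber", "#Activities", "SBR", "#Comment"}

# Assumed token shape: an optional leading "#" followed by word characters.
TOKEN = re.compile(r"#?\w+")

def lowercase_except(line, keep=KEEP):
    """Lowercase every token that is not in keep; everything else is untouched."""
    def fix(match):
        token = match.group(0)
        return token if token in keep else token.lower()
    # Text between tokens (spaces, tabs, punctuation) is never part of a match.
    return TOKEN.sub(fix, line)

print(lowercase_except("#Activities:\tSOME OTHER TEXT SBR"))
print(lowercase_except("I HOPe thIS heLPS."))
```

Because only whole tokens are compared against the set, a word such as "thIS" is lowercased even though it contains the excluded token "I".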
