regex_replace on string for string match and not substring match - apache-spark

This:
words = words.withColumn('value_2', F.regexp_replace('value', '|'.join(stopWords), ''))
works fine for substrings.
However, I have a stop word 'a', and as a result 'was' becomes 'ws'. I only want it to match the standalone word 'A' or 'a' and leave 'was' as is.

Place word boundaries around the alternation:
words = words.withColumn('value_2', F.regexp_replace('value', '\\b(' + '|'.join(stopWords) + ')\\b', ''))
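If the stop words may contain regex metacharacters, or if 'A' and 'a' should both match, the pattern can be built a little more defensively. The following is a minimal sketch in plain Python re rather than Spark, so it runs without a cluster. build_stopword_pattern and remove_stopwords are hypothetical helper names, and the sketch assumes simple stop words for which Python and Java regex syntax agree. Under that assumption, the same pattern string could be passed to F.regexp_replace.

```python
import re

def build_stopword_pattern(stop_words):
    # Escape each word so characters like "." or "+" are matched literally,
    # then wrap the alternation in word boundaries; (?i) ignores case.
    alternation = "|".join(re.escape(w) for w in stop_words)
    return r"(?i)\b(?:" + alternation + r")\b"

def remove_stopwords(text, stop_words):
    # Remove whole-word matches only, then collapse the leftover spaces.
    stripped = re.sub(build_stopword_pattern(stop_words), "", text)
    return " ".join(stripped.split())

print(remove_stopwords("A dog was a good boy", ["a", "the"]))
# dog was good boy
```

Note that 'was' is left untouched because the 'a' inside it is not surrounded by word boundaries.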

Related

Variable length lookahead/behind regex when you don't know exactly what you're matching (python)

I have some transcriptions that unfortunately contain lots of occurrences of words separated by a period but no space (ie word.word).
Is there a way to use regex to separate these, but leave other words like decimals and abbreviations such as U.K. or U.S.A alone? I'm planning to tokenize the text, so I want the word.word occurrences to be counted as separate words, but I don't want to mess up abbreviations, decimals, or any other places where the period is part of the word. Since I want to replace these specific word.word periods with a space but leave all others alone (or at least not replace them with a space, because that would break up the abbreviation), my first thought was something like this:
text = re.sub("(?<!\d){2,}\.(?!\d){2,}", " ", text)
The idea is to look for periods that are surrounded by at least two non-digits on each side, and then replace just the period with a space. But it seems that variable-length lookbehind/lookahead isn't something you can do. I've tested this in some regex testers and it still matches the letter abbreviations above, although it does not match decimals.
Is there another way to write what I've thought about or another way to approach this? I've gotten somewhat mentally stuck in this solution and I can't find another way that will do close to what I'm looking to do - can it even be done?
Thank you!
OK, so I have written the code below. Given the string "i.would.like.to.visit.the.U.S.A.or.the.u.k.while.i.am.eating.a.banana.b" (the trailing b is there on purpose, to make sure single letters are not deleted for no reason), the output was:
['i', 'would', 'like', 'to', 'visit', 'the', 'USA', 'or', 'the', 'uk', 'while', 'i', 'am', 'eating', 'a', 'banana', 'b'].
The code is:
text = "i.would.like.to.visit.the.U.S.A.or.the.u.k.while.i.am.eating.a.banana.b"
def split(string: str):
    string = string.split(".")
    length = len(string) - 1
    obj = enumerate(string)
    together = []
    for index, word in obj:
        sub = []
        if index and len(word) == 1 and index < length:
            idx = index
            while len(string[idx]) == 1:
                sub.append((string[idx], idx))
                idx += 1
                next(obj)
            together.append(sub)
    if together:
        deleted = 0
        for sub in together:
            if len(sub) > 1:
                string[sub[0][1] - deleted:sub[-1][1] + 1 - deleted] = ["".join(x[0] for x in sub)]
                deleted += len(sub) - 1
    return string
print(split(text))
You can change "".join(x[0] for x in sub) to ".".join(x[0] for x in sub) in order to keep the dots (U.S.A instead of USA).
If you are just trying to add a space when both sides have two or more characters, the following is what you are looking for.
text = re.sub(r"([^\d.]{2})\.([^\d.]{2})", r"\1. \2", text)
Example:
"This sentence ends.The following is an abbreviation A.B.C." becomes
"This sentence ends. The following is an abbreviation A.B.C."

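The fixed-width version of the question's lookaround idea can be sketched as follows. This is a variation rather than the code above: it replaces the period with a space instead of appending one, and it requires two letters on each side. split_joined_words is a hypothetical name. Python's re accepts a lookbehind as long as its width is fixed, which a two-character class is.

```python
import re

# Letters (or underscore) only: a word character that is not a digit.
LETTER = r"[^\W\d]"
# A period with two letters directly before it and two directly after it.
JOINED_WORDS = re.compile(r"(?<=" + LETTER + r"{2})\.(?=" + LETTER + r"{2})")

def split_joined_words(text):
    # Lookarounds do not consume text, so back-to-back cases such as
    # "one.two.three" are all replaced in a single pass.
    return JOINED_WORDS.sub(" ", text)

print(split_joined_words("one.two.three costs 3.14 in the U.S.A. today"))
# one two three costs 3.14 in the U.S.A. today
```

One limitation of the "two or more" rule in either form: a single-letter word joined to its neighbour (as in i.am) is left alone, because it looks the same as part of an abbreviation.
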
How can I create a new string from an original string, replacing all non-instances of a character

So Let's say I have a random string "Mississippi"
I want to create a new string from "Mississippi" but replacing all the non-instances of a particular character.
For example if we use the letter "S". In the new string, I want to keep all the S's in "MISSISSIPPI" and replace all the other letters with a "_".
I know how to do the reverse:
word = "MISSISSIPPI"
word2 = word.replace("S", "_")
print(word2)
word2 gives me MI__I__IPPI
but I can't figure out how to get word2 to be __SS_SS____
(The classic Hangman Game)
You need to use the sub function from Python's re module with a regular expression containing a negated character set, such as
import re
line = re.sub(r"[^S]", "_", line)
This replaces any non S character with the desired character.
You could do this with str.maketrans() and str.translate(), but it would be easier with regular expressions. The trick is that you need to insert your string of valid characters into the regular expression programmatically:
import re
word = "MISSISSIPPI"
show = 'S' # augment as the game progresses
print(re.sub(r"[^{}]".format(show), "_", word))
A simpler way is to map a function across the string:
>>> ''.join(map(lambda w: '_' if w != 'S' else 'S', 'MISSISSIPPI'))
'__SS_SS____'
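For a Hangman game where the set of guessed letters grows, the same idea can be written without a regex, which avoids problems when the guessed characters are empty or include something special inside a character class (such as ^ or ]). A minimal sketch, with mask_word as a hypothetical name:

```python
def mask_word(word, guessed):
    # Keep characters that have been guessed; hide everything else.
    guessed = set(guessed)
    return "".join(ch if ch in guessed else "_" for ch in word)

print(mask_word("MISSISSIPPI", "S"))   # __SS_SS____
print(mask_word("MISSISSIPPI", "SP"))  # __SS_SS_PP_
```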

Replace number into character from string using python

I have a string like this
convert_text = "tet1+tet2+tet34+tet12+tet3"
I want to replace the numbers in the string above with characters. The mapping list is available separately. When I try to replace the digit 1 with the character 'g' using replace, as below:
import re
convert_text = convert_text.replace('1','g')
print(convert_text)
output is
"tetg+tet2+tet34+tetg2+tet3"
How can I differentiate between single-digit and two-digit values? Is there any way to do this with a regex or something else?
You can use a regular expression with a callable replacement argument to substitute consecutive runs of digits with a value in a lookup table, eg:
import re
# Input text
convert_text = "tet1+tet2+tet34+tet12+tet3"
# Lookup table mapping each number (as a string) to its replacement
replacements = {'1': 'A', '2': 'B', '3': 'C', '12': 'T', '34': 'X'}
# Replace each run of digits with its entry in the lookup table
converted_text = re.sub(r'(\d+)', lambda m: replacements[m.group()], convert_text)
Which gives you:
'tetA+tetB+tetX+tetT+tetC'
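Looking up replacements[m.group()] raises a KeyError if the text contains a number that is not in the table. A small variation, sketched here with a hypothetical convert helper, leaves unmapped numbers unchanged by using dict.get with the matched text as the default:

```python
import re

replacements = {'1': 'A', '2': 'B', '3': 'C', '12': 'T', '34': 'X'}

def convert(text, table):
    # m.group() is the whole run of digits; fall back to the run itself
    # when the table has no entry for it.
    return re.sub(r'\d+', lambda m: table.get(m.group(), m.group()), text)

print(convert("tet1+tet2+tet34+tet12+tet3+tet99", replacements))
# tetA+tetB+tetX+tetT+tetC+tet99
```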
import re
convert_text = "tet1+tet2+tet34+tet12+tet3"
pattern = re.compile(r'((?<!\d)\d(?!\d))')
convert_text2=pattern.sub('g',convert_text)
convert_text2
Out[2]: 'tetg+tetg+tet34+tet12+tetg'
This uses a negative lookbehind and a negative lookahead, which are written in parentheses as
(?<!pat) and
(?!pat).
The positive lookbehind/lookahead forms are the same with = instead of !.
EDIT: if you need to replace whole runs of digits instead, the regex is
pattern2 = re.compile(r'\d+')
In any of these patterns you can replace \d with the specific digit you need.
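As an illustration of substituting a specific number for \d, here is a minimal sketch that replaces only a standalone occurrence of a given number, so that 1 is not touched inside 12. replace_standalone is a hypothetical helper name.

```python
import re

def replace_standalone(text, digits, replacement):
    # (?<!\d) and (?!\d) ensure the digits are not part of a longer number.
    pattern = r'(?<!\d)' + re.escape(digits) + r'(?!\d)'
    return re.sub(pattern, replacement, text)

convert_text = "tet1+tet2+tet34+tet12+tet3"
print(replace_standalone(convert_text, '1', 'g'))
# tetg+tet2+tet34+tet12+tet3
```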

Is there a way to substring, which is between two words in the string in Python?

My question is more or less similar to:
Is there a way to substring a string in Python?
but it's more specifically oriented.
How can I get a part of a string that is located between two known words in the initial string?
Example:
myString = "this is the initial string"
Substring = "initial"
knowing that "the" and "string" are the two known words in the string that can be used to get the substring.
Thank you!
You can start with simple string manipulation here. str.index is your best friend there, as it will tell you the position of a substring within a string; and you can also start searching somewhere later in the string:
>>> myString = "this is the initial string"
>>> myString.index('the')
8
>>> myString.index('string', 8)
20
Looking at the slice [8:20], we already get close to what we want:
>>> myString[8:20]
'the initial '
Of course, since we found the beginning position of 'the', we need to account for its length. And finally, we might want to strip whitespace:
>>> myString[8 + 3:20]
' initial '
>>> myString[8 + 3:20].strip()
'initial'
Combined, you would do this:
startIndex = myString.index('the')
substring = myString[startIndex + 3 : myString.index('string', startIndex)].strip()
If you want to look for matches multiple times, then you just need to repeat doing this while looking only at the rest of the string. Since str.index will only ever find the first match, you can use this to scan the string very efficiently:
searchString = 'this is the initial string but I added the relevant string pair a few more times into the search string.'
startWord = 'the'
endWord = 'string'
results = []
index = 0
while True:
    try:
        startIndex = searchString.index(startWord, index)
        endIndex = searchString.index(endWord, startIndex)
        results.append(searchString[startIndex + len(startWord):endIndex].strip())
        # move the index to the end
        index = endIndex + len(endWord)
    except ValueError:
        # str.index raises a ValueError if there is no match; in that
        # case we know that we’re done looking at the string, so we can
        # break out of the loop
        break
print(results)
# ['initial', 'relevant', 'search']
You can also try something like this:
mystring = "this is the initial string"
mystring = mystring.strip().split(" ")
for i in range(1,len(mystring)-1):
    if(mystring[i-1] == "the" and mystring[i+1] == "string"):
        print(mystring[i])
I suggest using a combination of the split and join methods.
This should help if you are looking for more than one word in the substring.
Turn the string into a list of words:
words = string.split()
Get the indexes of your opening and closing markers, then return the substring:
start = words.index('the')
end = words.index('string')
substring = ' '.join(words[start + 1:end])
You may want to check that both markers are present before proceeding, since list.index raises a ValueError when a word is missing.
If your problem gets more complex, e.g. multiple occurrences of the marker pair, I suggest using a regular expression.
import re
substring = ' '.join(re.findall(r'the (.+?) string', string))
re.findall itself returns the matches as separate items in a list, so drop the join if you want to keep them apart.
The spaces in the pattern keep the surrounding whitespace out of the captured text; you can modify this to suit your needs.
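The regular-expression approach can be wrapped up as a small function. This is a sketch under a few assumptions: between is a hypothetical name, the marker words are escaped so they are matched literally, and word boundaries are added so that 'the' does not match inside a word such as 'other'.

```python
import re

def between(text, start_word, end_word):
    # Non-greedy capture of whatever sits between the two marker words.
    # \b keeps "the" from matching inside words such as "other".
    pattern = (r'\b' + re.escape(start_word) + r'\b\s*(.*?)\s*\b'
               + re.escape(end_word) + r'\b')
    return re.findall(pattern, text)

print(between("this is the initial string", "the", "string"))
# ['initial']
```

Because re.findall returns every non-overlapping match, the same call also handles text in which the marker pair occurs several times.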

Removing a string that startswith a specific char Python

text='I miss Wonderland #feeling sad #omg'
prefix=('#','#')
for line in text:
    if line.startswith(prefix):
        text=text.replace(line,'')
print(text)
The output should be:
'I miss Wonderland'
But my output is the original string with the prefix removed
So it seems that you do not in fact want to remove the whole "string" or "line", but rather individual words? Then you'll want to split your string into words:
words = text.split(' ')
Now iterate over each element in words, checking whether it starts with the prefix. Lastly, combine the remaining elements back into one string:
result = ""
for word in words:
    if not word.startswith(prefix):
        result += (word + " ")
result = result.strip()  # drop the trailing space
In your case, for line in text iterates over each character in the text, not each word. So when it gets to, e.g., the '#' in '#feeling', it removes the '#', but 'feeling' remains because none of its other characters is '#'. You can confirm that your code is going character by character by doing:
for line in text:
    print(line)
Try the following instead, which does the filtering in a single line:
text = 'I miss Wonderland #feeling sad #omg'
prefix = ('#','#')
words = text.split() # Split the text into a list of its individual words.
# Join only those words that don't start with prefix
print(' '.join([word for word in words if not word.startswith(prefix)]))
