replacing unigrams and n-grams in python without changing words


This seems like it should be straightforward, but it is not. I want to implement string replacement in Python. The strings to be replaced can be unigrams or n-grams, but I do not want to replace a string that is contained within a word.
So for example:
x = 'hello world'
x.replace('llo', 'll')
returns:
'hell world'
but I don't want that to happen.
Splitting the string on whitespace works for individual words (unigrams), but I also want to replace n-grams, so that:
'this world is a happy place to be'
is converted to:
'this world is a miserable cesspit to be'
and splitting on whitespace does not work for that.
Is there a built-in function in Python 3 that allows me to do this?
I could do:
# n-gram: fall back to a plain substring replacement
if len(old_string.split(' ')) > 1:
    x = x.replace(old_string, new_string)
# unigram: only replace whole words
else:
    x_array = x.split(' ')
    x_array = [new_string if y == old_string else y for y in x_array]
    x = ' '.join(x_array)

You could do this:
import re
re_search = r'(?P<pre>[^ ])llo(?P<post>[^ ])'
re_replace = r'\g<pre>ll\g<post>'
print(re.sub(re_search, re_replace, 'hello world'))
print(re.sub(re_search, re_replace, 'helloworld'))
Output:
hello world
hellworld
Note how you need to add the pre and post groups back in the replacement.
Now that I see the comments: \b (a word boundary) may work more nicely.
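Following up on the \b suggestion, here is a minimal sketch of a word-boundary replacement. The helper name replace_phrase is hypothetical, and the approach assumes that the phrase to replace starts and ends with a word character (a letter, digit or underscore), since \b only matches next to such characters.

```python
import re

def replace_phrase(text, old, new):
    # re.escape keeps any regex metacharacters in `old` literal;
    # \b on both sides stops matches that start or end inside a word.
    pattern = r"\b" + re.escape(old) + r"\b"
    # A function replacement inserts `new` verbatim (no backslash processing).
    return re.sub(pattern, lambda match: new, text)

print(replace_phrase("hello world", "llo", "ll"))
print(replace_phrase("this world is a happy place to be",
                     "happy place", "miserable cesspit"))
```

The same function handles unigrams and n-grams, so no separate branch for single words should be needed.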

Related

How to replace hyphen and newline in string in Python

I am working on a text with several words hyphenated across line breaks.
A typical string looks like this:
"this good pe-
riod has"
I tried:
my_string.replace('-'+"\r","")
However, it does not work.
I would like to get:
"this good period has"

Have you tried this?
import re
text = """this good pe-
riod has"""
print(re.sub(r"-\s+", '', text))
# this good period has

After you match -, you should match the newline \n :
my_string = """this good pe-
riod has"""
print(my_string.replace("-\n",""))
# this good period has

It depends on how the lines in your string end. You could also use my_string.replace('-\r\n', ''), or match an optional carriage return using re.sub with the pattern -(?:\r?\n|\r)
If there has to be a word character before and after the hyphen, instead of removing every hyphen at the end of a line, you could use lookarounds:
(?<=\w)-\r?\n(?=\w)
For example:
import re
regex = r"(?<=\w)-\r?\n(?=\w)"
my_string = """this good pe-
riod has"""
print(re.sub(regex, "", my_string))
Output:
this good period has
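As a rough sketch that combines the points above, the helper below (the name join_hyphenated is made up for illustration) accepts \n, \r\n and a bare \r as line endings. It also shows that the result has to be assigned, because neither str.replace nor re.sub changes the original string in place.

```python
import re

# Hyphen between two word characters, followed by any common line ending.
HYPHEN_BREAK = re.compile(r"(?<=\w)-(?:\r\n|\r|\n)(?=\w)")

def join_hyphenated(text):
    # Strings are immutable, so the result is returned, not changed in place.
    return HYPHEN_BREAK.sub("", text)

sample = "this good pe-\r\nriod has"
sample = join_hyphenated(sample)  # reassign to keep the result
print(sample)
```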

How to filter only text in a line?

I have many lines like these:
_ÙÓ´Immediate Transformation With Vee_ÙÓ´
‰ÛÏThe Real Pernell Stacks‰Û
I want to get something like this:
Immediate Transformation With Vee
The Real Pernell Stacks
I tried this:
for t in test:
    t.isalpha()
but characters like Ó count as alphabetic as well.
I also thought about creating a list of English words, spaces and punctuation marks, and deleting everything from the line that is not in this list. However, I do not think this is the right option, since the line can contain words that are not English, and that's fine.

Using regex.
Example:
import re
data = """_ÙÓ´Immediate Transformation With Vee_ÙÓ´
‰ÛÏThe Real Pernell Stacks‰Û"""
for line in data.splitlines(keepends=False):
    print(re.sub(r"[^A-Za-z\s]", "", line))
Output:
Immediate Transformation With Vee
The Real Pernell Stacks

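Use re. Splitting on single characters leaves empty strings in the result, so filter them out before joining:
import re
result = ' '.join(filter(None, re.split(r'[^A-Za-z]', s)))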
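For comparison, here is a sketch without regex that builds on the isalpha() attempt from the question. str.isalpha() is Unicode-aware, which is why Ó passes; combining it with str.isascii() (available from Python 3.7) restricts the check to ASCII letters. The function name keep_ascii_text is hypothetical, and, like the [^A-Za-z\s] pattern, this also drops legitimate accented letters.

```python
def keep_ascii_text(line):
    # Keep ASCII letters and whitespace; drop everything else.
    kept = "".join(
        ch for ch in line
        if ch.isascii() and (ch.isalpha() or ch.isspace())
    )
    # Remove whitespace left over at either end.
    return kept.strip()

line = "_\u00d9\u00d3\u00b4Immediate Transformation With Vee_\u00d9\u00d3\u00b4"
print(keep_ascii_text(line))
```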

print(f"...:")-statement too long - break it into multiple lines without messing up the format

I have a console program with formatted output. To always get the same length of the printout, I have a rather complex formatted print statement.
print(f"\n{WHITE_BG}{64*'-'}")
print(f"\nDirektvergleich{9*' '}{RED}{players[0].name}{4*' '}{GREEN}vs.{4*' '}{RED}{players[1].name}{CLEAR}\n")
print(f"""{15*'~'}{' '}{YELLOW}Gesamt{CLEAR}:{' '}{players[0].name}{' '}{GREEN}{int(player1_direct_wins)}{(int(4-len(player1_direct_wins)))*' '}-{(int(4-len(player1_direct_losses)))*' '}{int(player1_direct_losses)}{CLEAR}{' '}{players[1].name}{' '}{(28-len(players[0].name)-len(players[1].name))*'~'}\n""")
print(f"""{15*'~'}{' '}{YELLOW}Trend{CLEAR}:{' '}{players[0].name}{' '}{GREEN}{int(player1_trend_wins)}{(int(4-len(player1_trend_wins)))*' '}-{(int(4-len(player1_trend_losses)))*' '}{int(player1_trend_losses)}{CLEAR}{' '}{players[1].name}{' '}{(28-len(players[0].name)-len(players[1].name))*'~'}""")
print(f"\n{WHITE_BG}{64*'-'}")
This leads to the following output in my Windows cmd:
[screenshot of the console output]
For readability, I tried to spread the print call over multiple lines, and on Stack Overflow I found the idea of using triple quotes. But when I break the print(f"...") statement in the middle, I mess up my formatting.
Example:
print(f"\n{WHITE_BG}{64*'-'}") # feed in as a string?!
print(f"\nDirektvergleich{9*' '}{RED}{players[0].name}{4*' '}{GREEN}vs.{4*' '}{RED}{players[1].name}{CLEAR}\n")
print(f"""{15*'~'}{' '}{YELLOW}Gesamt{CLEAR}:{' '}{players[0].name}{' '}{GREEN}{int(player1_direct_wins)}{(int(4-len(player1_direct_wins)))*' '}-
{(int(4-len(player1_direct_losses)))*' '}{int(player1_direct_losses)}{CLEAR}{' '}{players[1].name}{' '}{(28-len(players[0].name)-len(players[1].name))*'~'}\n""")
print(f"""{15*'~'}{' '}{YELLOW}Trend{CLEAR}:{' '}{players[0].name}{' '}{GREEN}{int(player1_trend_wins)}{(int(4-len(player1_trend_wins)))*' '}-
{(int(4-len(player1_trend_losses)))*' '}{int(player1_trend_losses)}{CLEAR}{' '}{players[1].name}{' '}{(28-len(players[0].name)-len(players[1].name))*'~'}""")
print(f"\n{WHITE_BG}{64*'-'}")
leads to...
[screenshot of the broken console output]
Can anyone point me in the right direction on how to format my output in the displayed way, but without these absurdly long lines?
Thank you in advance!

Triple-quoted strings preserve newline characters, so they are indeed not what you want here. However, when it finds two adjacent string literals, the Python parser automatically concatenates them into a single string, i.e.:
s = "foo" "bar"
is equivalent to
s = "foobar"
This also works if you put your strings within parentheses:
s = ("foo" "bar")
in which case you can put each string on its own line as well:
s = (
    "foo"
    "bar"
)
This also applies to f-strings, as long as each replacement field ({...}) stays within a single literal, so what you want is something like:
print((
    f"{15*'~'}{' '}{YELLOW}Gesamt{CLEAR}:{' '}{players[0].name}{' '}{GREEN}"
    f"{int(player1_direct_wins)}{(int(4-len(player1_direct_wins)))*' '}-"
    f"{(int(4-len(player1_direct_losses)))*' '}{int(player1_direct_losses)}"
    f"{CLEAR}{' '}{players[1].name}{' '}"
    f"{(28-len(players[0].name)-len(players[1].name))*'~'}\n"
))
That being said, I'd rather use intermediate variables than try to cram such complex expressions into an f-string.
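To illustrate the intermediate-variable idea, here is a minimal sketch. The colour codes, the plain name arguments and the function summary_line are hypothetical stand-ins for the question's constants and player objects, and the counts are assumed to be integers. The format specs <4 and >4 do the padding that the question computes by hand with 4-len(...).

```python
# Hypothetical stand-ins for the question's colour constants.
YELLOW, GREEN, CLEAR = "\033[33m", "\033[32m", "\033[0m"

def summary_line(label, name_a, name_b, wins, losses):
    prefix = 15 * "~"
    # Wins left-aligned and losses right-aligned, each in a 4-character field.
    score = f"{wins:<4}-{losses:>4}"
    # Fill the rest of the line so every row has the same visible width.
    filler = (28 - len(name_a) - len(name_b)) * "~"
    return (
        f"{prefix} {YELLOW}{label}{CLEAR}: {name_a} "
        f"{GREEN}{score}{CLEAR} {name_b} {filler}"
    )

print(summary_line("Gesamt", "Anna", "Ben", 12, 3))
print(summary_line("Trend", "Anna", "Ben", 4, 1))
```

Each piece now has a name, and no single line has to carry the whole layout.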

An Elegant Solution to Python's Multiline String?

I was trying to log the completion of a scheduled event that I set up to run in Django. To keep my code presentable, instead of putting the string on a single line, I used a multiline string for the logger output inside a management Command class method, as shown here:
# the usual imports...
# ....
import logging
import textwrap
logger = logging.getLogger(__name__)
class Command(BaseCommand):
    def handle(self, *args, **kwargs):
        # some code here
        # ....
        final_statement = f'''\
            this is the final statement \
            with multiline string to have \
            a neater code.'''
        dedented_text = textwrap.dedent(final_statement)
        logger.info(dedented_text.replace('  ', ''))
I tried a few methods I found. However, most quick and easy methods still left big chunks of spaces in the terminal output, as shown here:
this is the final statement             with multiline string to have             a neater code.
So I came up with a creative solution to my problem, by using:
dedented_text.replace('  ', '')
This replaces two spaces with nothing, so that the normal single spaces between words are not removed. It finally produced:
this is the final statement with multiline string to have a neater code.
Is this an elegant solution, or did I miss something on the internet?

You could use a regex to replace each newline, together with the whitespace around it, with a single space. Additionally, wrapping it in a function leads to less repetitive code, so let's do that.
import re
def single_line(string):
    # Collapse every line break and its surrounding whitespace into one space
    return re.sub(r"\s*\n\s*", " ", string).strip()
final_statement = single_line(f'''
    this is the final statement
    with multiline string to have
    a neater code.''')
print(final_statement)
Alternatively, if you wish to avoid this particular problem (and don't mind the development overhead), you could store the strings in a file, such as JSON, so you can quickly edit prompts while keeping your code clean.

Thanks to Neil's suggestion, I have come up with a more elegant solution: a function that replaces every pair of spaces with nothing.
def single_line(string):
    return string.replace('  ', '')
final_statement = '''\
this is a much neater
final statement
to present my code
'''
print(single_line(final_statement))
Adapting Neil's solution, I have dropped the regex import. That's one line less of code!
Also, making it a function improves readability, as the whole print statement reads like English: "print single line final statement".
Any better ideas?
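One more option, offered as a sketch and not as the definitive answer: str.split() with no arguments splits on any run of whitespace, including newlines and indentation, so joining the pieces with single spaces collapses the text to one line without a regex. The trade-off is that it also collapses any intentional runs of spaces inside the text.

```python
def single_line(text):
    # split() with no arguments splits on any whitespace run and drops
    # leading/trailing whitespace; join puts one space between the words.
    return " ".join(text.split())

final_statement = '''
    this is a much neater
    final statement
    to present my code
'''
print(single_line(final_statement))
```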

The issue with both Neil’s and Wong Siwei’s answers is that they don’t work if your multiline string contains lines that are more indented than others:
my_string = """\
  this is my
  string and
    it has various
  indentation
    levels"""
What you want in the case above is to remove the two-space indentation, not every space at the beginning of a line.
The solution below should work in all cases:
import re
def dedent(s):
    indent_level = None
    # re.MULTILINE makes ^ match at the start of every line; blank lines are skipped
    for m in re.finditer(r"^( *)\S", s, re.MULTILINE):
        line_indent_level = len(m.group(1))
        if indent_level is None or indent_level > line_indent_level:
            indent_level = line_indent_level
    if not indent_level:
        return s
    # Strip exactly indent_level leading spaces from each line, keeping the newlines
    return re.sub(r"^ {%d}" % indent_level, "", s, flags=re.MULTILINE)
It first scans the whole string to find the lowest indentation level, then uses that information to dedent all of its lines.
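For comparison, the standard library's textwrap.dedent also removes only the whitespace that is common to all lines, so relative indentation survives. A short sketch, with the two-space and four-space indentation of the sample assumed for illustration:

```python
import textwrap

my_string = """\
  this is my
  string and
    it has various
  indentation
    levels"""

# Only the common two-space prefix is removed; deeper lines keep the rest.
print(textwrap.dedent(my_string))
```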
If you only care about making your code easier to read, you may instead use C-like implicit string concatenation:
my_string = (
    "this is my string"
    " and I write it on"
    " multiple lines"
)
print(repr(my_string))
# => 'this is my string and I write it on multiple lines'
You may also want to make the concatenation explicit with +:
my_string = "this is my string" + \
    " and I write it on" + \
    " multiple lines"

How to decode a text file by extracting alphabet characters and listing them into a message?

We were given an assignment to write code that sorts through a long message filled with special characters (e.g. [, {, %, $, *), with only a few alphabet characters scattered throughout, in order to reveal a special message.
I've been searching on this site for a while and haven't found anything specific enough that would work.
I put the text file into a pastebin if you want to see it
https://pastebin.com/48BTWB3B
Anyway, this is the code I've come up with so far:
code = open('code.txt', 'r')
lettersList = code.readlines()
lettersList.sort()
for letters in lettersList:
    print(letters)
It prints out code.txt, but as short lists, essentially cutting it into smaller pieces. I want it to pick out the alphabet characters, collect them into a list and print the decoded message.

This is something you can do pretty easily with regex.
import re
with open('code.txt', 'r') as filehandle:
    contents = filehandle.read()
letters = re.findall("[a-zA-Z]+", contents)
If you want to condense the list into a single string, you can use a join:
single_str = ''.join(letters)
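To try the approach without the file, here is a small sketch in which an invented sample string stands in for the contents of code.txt. Note that findall returns the letters in the order they appear, whereas calling sort() as in the question would rearrange the lines and scramble the message.

```python
import re

# Hypothetical sample standing in for the contents of code.txt.
contents = "[{%h$*e]]\n{l%%l$o**"

# Each match is a run of consecutive letters, in order of appearance.
letters = re.findall("[a-zA-Z]+", contents)
message = "".join(letters)
print(message)
```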
