clean whitespaces using chatterbot preprocessors - python-3.x

I write "hi hello". After cleaning whitespace using chatterbot.preprocessors.clean_whitespace, I want my input to be shown as "hihello", but ChatterBot gives me a different reply. How can I print my input after preprocessing?

The clean_whitespace preprocessor in ChatterBot won't remove all white space, just preceding, trailing, and consecutive spaces. It cleans the white space, it doesn't remove it entirely.
It sounds like you want to create your own preprocessor. This can be as simple as creating a function like this:
def remove_whitespace(chatbot, statement):
    for character in ['\n', '\r', '\t', ' ']:
        statement.text = statement.text.replace(character, '')
    return statement
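To see what such a preprocessor does to the input, it can be called directly and the result printed. The following is a minimal sketch that assumes only that the statement object exposes a text attribute, as the function above relies on. SimpleNamespace is a stand-in used for illustration here, not a ChatterBot class, and None is passed for the chatbot argument because the function never uses it.

```python
from types import SimpleNamespace


def remove_whitespace(chatbot, statement):
    # Remove every newline, carriage return, tab and space from the text.
    for character in ['\n', '\r', '\t', ' ']:
        statement.text = statement.text.replace(character, '')
    return statement


# Stand-in for a statement object (assumption: only .text is needed).
statement = SimpleNamespace(text="hi hello")
cleaned = remove_whitespace(None, statement)
print(cleaned.text)  # hihello
```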

Related

How to filter only text in a line?

I have many lines like these:
_ÙÓ´Immediate Transformation With Vee_ÙÓ´
‰ÛÏThe Real Pernell Stacks‰Û
I want to get something like this:
Immediate Transformation With Vee
The Real Pernell Stacks
I tried this:
for t in test:
    t.isalpha()
but characters like this Ó count as well
So I also thought that I can create a list of English words, a space and punctuation marks and delete all the elements from the line that are not in this list, but I do not think that this is the right option, since the line can contain not only English words and that's fine.
Using Regex.
Ex:
import re
data = """_ÙÓ´Immediate Transformation With Vee_ÙÓ´
‰ÛÏThe Real Pernell Stacks‰Û"""
for line in data.splitlines(keepends=False):
    print(re.sub(r"[^A-Za-z\s]", "", line))
Output:
Immediate Transformation With Vee
The Real Pernell Stacks
Alternatively, use re.split on runs of non-letters and strip the result:
result = ' '.join(re.split(r'[^A-Za-z]+', s)).strip()

How to split a csv line that has commas for formatting numbers

I download a CSV file using requests, and I need to split each line, but the number fields contain formatting commas, like:
line='2019-07-05,sitename.com,"14,740","14,559","7,792",$11.47'
When I try to split it:
data = line.split(',')
I get this value:
['2019-07-05', 'sitename.com', '"14', '740"', '"14', '559"', '"7', '792"', '$11.47']
I need:
['2019-07-05', 'sitename.com', '14740', '14559', '7792', '$11.47']
I need to solve this in Python 3.7.
Any help is welcome.
I usually don't like using regex but there may be no other option here. Try this - it basically removes the inside ,s in two steps:
import re
line='2019-07-05,sitename.com,"14,740","14,559","7,792",$11.47'
new_line = re.sub(r',(?!\d)', r"xxx", line).replace(',','').replace('xxx',',')
print(new_line)
Output
2019-07-05,sitename.com,"14740","14559","7792",$11.47
You can now strip the remaining quotes and split:
data = new_line.replace('"', '').split(',')
Explanation:
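The regex ,(?!\d) matches every , in line that is not immediately followed by a digit, which in this line means the field delimiters. The .sub replaces those (temporarily) with xxx. The next .replace deletes the remaining commas, which are the ones inside numbers, by replacing them with nothing. Finally, the last .replace restores the , delimiters by replacing the temporary xxx markers with ,. Note that this relies on no field starting with a digit directly after a delimiter; an unquoted number in the middle of the line would be merged with the previous field.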
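As an alternative to the regex, the standard library's csv module already understands quoted fields, so commas inside "14,740" are not treated as delimiters. The sketch below is one possible approach, not the answer's method; split_csv_line is a hypothetical helper name, and it assumes that removing commas from every field is acceptable (a text field that legitimately contains a comma would lose it).

```python
import csv

line = '2019-07-05,sitename.com,"14,740","14,559","7,792",$11.47'


def split_csv_line(line):
    # csv.reader accepts any iterable of strings; wrap the single line in a list.
    fields = next(csv.reader([line]))
    # Quoted fields arrive without their quotes; drop the thousands separators.
    return [field.replace(',', '') for field in fields]


data = split_csv_line(line)
print(data)
# ['2019-07-05', 'sitename.com', '14740', '14559', '7792', '$11.47']
```

When the whole downloaded file is available as text, passing text.splitlines() to csv.reader handles every row the same way.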

regex - Making all letters in a text lowercase using re.sub in python but exclude specific string?

I am writing a script to convert all uppercase letters in a text to lower case using regex, but excluding specific strings/characters such as "TEA", "CHI", "I", "#Begin", "#Language", "ENG", "#Participants", "#Media", "#Transcriber", "#Activities", "SBR", "#Comment" and so on.
The script I currently have is shown below. However, it does not produce the desired output. For instance, when I input "#Activities: SBR", the output given is "#Activities#activities: sbr#activities: sbrSBR". The intended output is "#Activities: SBR".
I am using Python 3.5.2
Can anyone help to provide some guidance? Thank you.
import os
from itertools import chain
import re
def lowercase_exclude_specific_string(line):
    line = line.strip()
    PATTERN = r'[^TEA|CHI|I|#Begin|#Language|ENG|#Participants|#Media|#Transcriber|#Activities|SBR|#Comment]'
    filtered_line = re.sub(PATTERN, line.lower(), line)
    return filtered_line
First, let's see why you're getting the wrong output.
For instance when I input "#Activities: SBR", the output given is
"#Activities#activities: sbr#activities: sbrSBR".
This is because your code
PATTERN = r'[^TEA|CHI|I|#Begin|#Language|ENG|#Participants|#Media|#Transcriber|#Activities|SBR|#Comment]'
filtered_line = re.sub(PATTERN, line.lower(), line)
uses a negated character class, so it matches every single character that is not listed inside the brackets (there, | is just another listed character, not alternation) and replaces each match with line.lower() (which is "#activities: sbr").
The code will match ":" and " " (whitespace) and replace both with "#activities: sbr", giving you the result "#Activities#activities: sbr#activities: sbrSBR".
Now to fix that code. Unfortunately, there is no direct way to negate words in a line and apply substitution on the other words on that same line. Instead, you can split the line first into individual words, then apply re.sub on it using your PATTERN. Also, instead of a negated character class, you should use a negative lookahead:
(?!...)
Negative lookahead assertion. This is the opposite of the positive assertion; it succeeds if the contained expression doesn’t match at the current position in the string.
Here's the code I got:
def lowercase_exclude_specific_string(line):
    line = line.strip()
    # Raw string, so that \s is not treated as a string escape.
    words = re.split(r"\s+", line)
    result = []
    for word in words:
        PATTERN = r"^(?!TEA|CHI|I|#Begin|#Language|ENG|#Participants|#Media|#Transcriber|#Activities|SBR|#Comment).*$"
        lword = re.sub(PATTERN, word.lower(), word)
        result.append(lword)
    return " ".join(result)
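The re.sub matches only words that do not start with one of the strings in PATTERN, and replaces each such word with its lowercase value. If a word starts with an excluded string, it is not matched and re.sub returns it unchanged. Because the lookahead tests the start of the word, a word that merely begins with an excluded string (for example IS, which begins with I) is also left unchanged.
Each word is stored in a list, and the list is joined afterwards to rebuild the line.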
Samples:
print(lowercase_exclude_specific_string("#Activities: SBR"))
print(lowercase_exclude_specific_string("#Activities: SOME OTHER TEXT SBR"))
print(lowercase_exclude_specific_string("Begin ABCDEF #Media #Comment XXXX"))
print(lowercase_exclude_specific_string("#Begin AT THE BEGINNING."))
print(lowercase_exclude_specific_string("PLACE #Begin AT THE MIDDLE."))
print(lowercase_exclude_specific_string("I HOPe thIS heLPS."))
Output:
#Activities: SBR
#Activities: some other text SBR
begin abcdef #Media #Comment xxxx
#Begin at the beginning.
place #Begin at the middle.
I hope this helps.
EDIT:
As mentioned in the comments, there is apparently a tab between the : and the next character. Since the code splits the string on \s+, the tab can't be preserved, but it can be restored by replacing ": " (the colon plus the space added by the join) with ":\t" in the final result.
return " ".join(result).replace(": ", ":\t")
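The word-by-word idea can also be sketched with exact matching against a set instead of a lookahead. This is a variation, not the answer's code: it avoids treating the excluded strings as prefixes, and because re.sub only rewrites the non-whitespace tokens, tabs and other whitespace stay where they were. The names EXCLUDED and lowercase_except are hypothetical, and the set of trailing punctuation stripped before the comparison is an assumption chosen to fit the sample lines.

```python
import re

# Words that must keep their original casing (taken from the question).
EXCLUDED = {"TEA", "CHI", "I", "#Begin", "#Language", "ENG", "#Participants",
            "#Media", "#Transcriber", "#Activities", "SBR", "#Comment"}


def lowercase_except(line, excluded=EXCLUDED):
    def convert(match):
        token = match.group(0)
        # Assumption: ignore trailing punctuation when comparing to the set.
        core = token.rstrip(":.,;!?")
        return token if core in excluded else token.lower()

    # \S+ matches each run of non-whitespace; whitespace itself is untouched.
    return re.sub(r"\S+", convert, line.strip())


print(lowercase_except("#Activities:\tSBR"))
print(lowercase_except("I HOPe thIS heLPS."))  # I hope this helps.
print(lowercase_except("IS THIS IT"))          # is this it
```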

Removing whitespace after splitting a string in python

1) Prompt the user for a string that contains two strings separated by a comma.
2) Report an error if the input string does not contain a comma. Continue to prompt until a valid string is entered. Note: If the input contains a comma, then assume that the input also contains two strings.
3) Using string splitting, extract the two words from the input string and then remove any spaces. Output the two words.
4) Using a loop, extend the program to handle multiple lines of input. Continue until the user enters q to quit.
I wrote a program with these instructions although I cannot work out how to remove extra spaces that may be attached to the outputted words. For example if you enter "Billy, Bob" it works fine, but if you enter "Billy,Bob" you will get an IndexError: list index out of range, or if you enter "Billy , Bob" Billy will be outputted with an extra space attached to the string. Here is my code.
usrIn=0
while usrIn!='q':
    usrIn = input("Enter input string: \n")
    if "," in usrIn:
        tokens = usrIn.split(", ")
        print("First word:",tokens[0])
        print("Second word:",tokens[1])
        print('')
        print('')
    else:
        print("Error: No comma in string.")
How do I remove spaces from the outputs so I can just use a usrIn.split(",") ?
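You can use the .strip() method, which removes leading and trailing whitespace (Python strings have no .trim() method).
Split on the comma first and strip each piece, for example [part.strip() for part in usrIn.split(",")]. If a piece can still contain whitespace inside it, you can split it again with a whitespace regex, for instance re.split(r"\s+", piece); str.split itself does not accept a regex.
The \s matches a whitespace character, while the + operator matches one or more of them in a row.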
Hope this helps :)
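Put together, the split-then-strip approach could look like the sketch below. The helper names parse_pair and handle are hypothetical, and the loop reads from a list of sample lines rather than input() so that the example runs on its own; it also assumes that "remove any spaces" means the spaces around each word.

```python
def parse_pair(text):
    """Return the two comma-separated words without surrounding whitespace,
    or None when the text contains no comma."""
    if "," not in text:
        return None
    # Split on the first comma only, then strip each side.
    first, second = text.split(",", 1)
    return first.strip(), second.strip()


def handle(lines):
    """Process lines until 'q' is seen; return the messages to print."""
    messages = []
    for text in lines:
        if text.strip() == "q":
            break
        pair = parse_pair(text)
        if pair is None:
            messages.append("Error: No comma in string.")
        else:
            messages.append("First word: " + pair[0])
            messages.append("Second word: " + pair[1])
    return messages


for message in handle(["Billy , Bob", "Billy,Bob", "Billy Bob", "q"]):
    print(message)
```

In an interactive program, the list of sample lines would be replaced by repeated calls to input("Enter input string: \n").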

replacing unigrams and n-grams in python without changing words

This seems like it should be straightforward, but it is not. I want to implement string replacement in Python where the strings to be replaced can be unigrams or n-grams, but I do not want to replace a string that is contained within a word.
So for example:
x='hello world'
x.replace('llo','ll')
returns:
'hell world'
but I don't want that to happen.
Splitting the string on whitespace works for individual words (unigrams), but I also want to replace n-grams
so:
'this world is a happy place to be'
to be converted to:
'this world is a miserable cesspit to be'
and splitting on whitespace does not work.
Is there an in-built function in Python3 that allows me to do this?
I could do:
if len(new_string.split(' '))>1:
    x = x.replace(old_string,new_string)
else:
    x_array=x.split(' ')
    x_array=[new_string if y==old_string else y for y in x_array]
    x=' '.join(x_array)
You could do this:
import re
re_search = r'(?P<pre>[^ ])llo(?P<post>[^ ])'
re_replace = r'\g<pre>ll\g<post>'
print(re.sub(re_search, re_replace, 'hello world'))
print(re.sub(re_search, re_replace, 'helloworld'))
Output:
hello world
hellworld
Note that the pre and post groups have to be added back in the replacement, because the characters they match are consumed by the pattern.
Having now seen the comments, \b (a word boundary) may work better.
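The \b idea mentioned above can be sketched as follows. replace_whole is a hypothetical helper name; the sketch assumes the string to replace starts and ends with a word character (a letter, digit or underscore), since \b only marks a boundary next to such characters. re.escape keeps any regex metacharacters in the search string literal, and the replacement is passed through a function so that backslashes in it are not interpreted.

```python
import re


def replace_whole(text, old, new):
    # \b on both sides means `old` must not be glued to other word characters,
    # so it matches whole words and whole n-grams but not parts of words.
    pattern = r"\b" + re.escape(old) + r"\b"
    return re.sub(pattern, lambda match: new, text)


print(replace_whole('hello world', 'llo', 'll'))
# hello world
print(replace_whole('this world is a happy place to be',
                    'happy place', 'miserable cesspit'))
# this world is a miserable cesspit to be
```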
