Removing Narrow No-Break Space Unicode Characters (U+202F, U+00A0) in Python NLP (python-3.x)

Non-breaking spaces are printed as whitespace, but handled internally as \xa0. How do I remove all these characters at once?
So far I've replaced it directly:
text = text.replace('\u202f','')
text = text.replace('\u200d','')
text = text.replace('\xa0','')
But each time I scrape text sentences from an external source, the characters are different. How do I remove them all at once?

You can use regular expression substitution instead.
If you want to replace all whitespace, you can just use:
import re
text = re.sub(r'\s', '', text)
This matches all Unicode whitespace. At the time of writing, the code points recognized as whitespace (i.e. matched by \s) in Python regular expressions are these:
0x0009
0x000A
0x000B
0x000C
0x000D
0x001C
0x001D
0x001E
0x001F
0x0020
0x0085
0x00A0
0x1680
0x2000
0x2001
0x2002
0x2003
0x2004
0x2005
0x2006
0x2007
0x2008
0x2009
0x200A
0x2028
0x2029
0x202F
0x205F
0x3000
This should fit your needs.
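For example, a quick check (a minimal sketch with made-up input) shows that \s catches both the no-break space (\xa0) and the narrow no-break space (\u202f) in one pass:
import re
text = 'foo\xa0bar\u202fbaz qux'
print(re.sub(r'\s', '', text))  # -> 'foobarbazqux'
One caveat: \u200d (ZERO WIDTH JOINER) is not whitespace, so \s does not match it and it still needs its own replace.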

Related

Convert bytes to string while replacing SOH character

Using Python 3.6. I have a byte string coming over a socket (a FIX message) that contains the hex character 0x01 (SOH, start of heading) as a delimiter. The raw bytes look like
b'8=FIX.4.4\x019=65\x0135=0\x0152=20220809-21:37:06.893\x0149=TRADEWEB\x0156=ABFIXREPO\x01347=UTF-8\x0110=045\x01'
I want to store this byte string to a file for logging. I have seen FIX messages where this delimiter is rendered as the ^A control-character notation. The final string I would like to have is:
8=FIX.4.4^A9=65^A35=0^A52=20220809-21:37:06.893^A49=TRADEWEB^A56=ABFIXREPO^A347=UTF-8^A10=045^A
I have tried various ways to achieve this, for example repr(bytes) and (ord(b) in bytes), but could not get it to work.
Any pointers are highly appreciated.
Thanks.
One way I can think of doing this is to decode the bytes, split on the delimiter, and insert the control-character notation in a loop:
data = b'8=FIX.4.4\x019=65\x0135=0\x0152=20220809-21:37:06.893\x0149=TRADEWEB\x0156=ABFIXREPO\x01347=UTF-8\x0110=045\x01'
fields = data.decode().split("\x01")
print(fields)
ctl_char = "^A"
string = ""
for field in fields[:-1]:  # the trailing \x01 leaves an empty last element
    string += field + ctl_char
print(string)
Please note there may be a better way of doing it.
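For instance, one simpler alternative (just a sketch, reusing data from the snippet above): since the decoded text is a plain str, a single replace swaps every SOH delimiter for the printable notation without an explicit loop.
log_line = data.decode().replace('\x01', '^A')
print(log_line)  # matches the desired log line above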

Regex: Match x if not surrounded with y [duplicate]

I'm very new to programming and regex, so apologies if this has been asked before (I didn't find it, though).
I want to use Python to summarise word frequencies in a literary text. Let's assume the text is formatted like
Chapter 1
blah blah blah
Chapter 2
blah blah blah
....
Now I read the text as a string, and I want to use re.findall to get every word in this text, so my code is
wordlist = re.findall(r'\b\w+\b', text)
But the problem is that it matches the word Chapter in every chapter title, which I don't want to include in my stats. So I want to ignore whatever matches Chapter\s*\d+. What should I do?
Thanks in advance, guys.
Solutions
You could remove all Chapter + space + digits sequences first:
wordlist = re.findall(r'\b\w+\b', re.sub(r'Chapter\s*\d+\s*','',text))
If you want to use just one search, you can use a negative lookahead to skip any word that is Chapter followed by a number, requiring words to begin with a letter:
wordlist = re.findall(r'\b(?!Chapter\s+\d+)[A-Za-z]\w*\b',text)
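As a quick illustration, here is a minimal sketch of both approaches on a made-up snippet of text:
import re

text = 'Chapter 1\nblah blah blah\nChapter 2\nmore blah'

# 1) strip the chapter headings first, then collect the words
print(re.findall(r'\b\w+\b', re.sub(r'Chapter\s*\d+\s*', '', text)))
# ['blah', 'blah', 'blah', 'more', 'blah']

# 2) single pass with the negative lookahead
print(re.findall(r'\b(?!Chapter\s+\d+)[A-Za-z]\w*\b', text))
# ['blah', 'blah', 'blah', 'more', 'blah']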
If performance is an issue, loading a huge string and parsing it with a regex wouldn't be the correct method anyway. Just read the file line by line, toss any line that matches r'^Chapter\s*\d+' and parse each remaining line separately with r'\b\w+\b':
import re

chapter = re.compile(r'^Chapter\s*\d+')
words = re.compile(r'\b\w+\b')
wordlist = []
with open("huge_file.txt") as f:
    for line in f:
        if not chapter.match(line):
            wordlist.extend(words.findall(line))
print(len(wordlist))
Performance
I wrote a small Ruby script to generate a huge file:
all_dicts = Dir["/usr/share/dict/*"].map{|dict|
  File.readlines(dict)
}.flatten

File.open('huge_file.txt','w+') do |txt|
  newline = true
  txt.puts "Chapter #{rand(1000)}"
  50_000_000.times do
    if rand < 0.05
      txt.puts
      txt.puts
      txt.puts "Chapter #{rand(1000)}"
      newline = true
    end
    txt.write " " unless newline
    newline = false
    txt.write all_dicts.sample.chomp
    if rand < 0.10
      txt.puts
      newline = true
    end
  end
end
The resulting file has more than 50 million words and is about 483 MB:
Chapter 154
schoolyard trashcan's holly's continuations
Chapter 814
assure sect's Trippe's bisexuality inexperience
Dumbledore's cafeteria's rubdown hamlet Xi'an guillotine tract concave afflicts amenity hurriedly whistled
Carranza
loudest cloudburst's
Chapter 142
spender's
vests
Ladoga
Chapter 896
petition's Vijayawada Lila faucets
addendum Monticello swiftness's plunder's outrage Lenny tractor figure astrakhan etiology's
coffeehouse erroneously Max platinum's catbird succumbed nonetheless Nissan Yankees solicitor turmeric's regenerate foulness firefight
spyglass
disembarkation athletics drumsticks Dewey's clematises tightness tepid kaleidoscope Sadducee Cheerios's
The two-step process took 12.2 s on average to extract the word list, the lookahead method took 13.5 s, and Wiktor's answer also took 13.5 s. The lookahead method I first wrote used re.IGNORECASE, and it took around 18 s.
There's basically no difference in performance between the regex methods when reading the whole file.
What surprised me, though, is that the line-by-line script took around 20.5 s and didn't use much less memory than the other scripts. If you have any idea how to improve the script, please comment!
Match what you do not need and capture what you need, using this technique with re.findall, which returns only the captured values when the pattern contains a group:
re.findall(r'\bChapter\s*\d+\b|\b(\w+)\b',s)
Details:
\bChapter\s*\d+\b - Chapter as a whole word, followed by zero or more whitespace characters and one or more digits
| - or
\b(\w+)\b - match and capture into Group 1 one or more word characters
To avoid getting empty strings in the resulting list, filter it:
import re
s = "Chapter 1: Black brown fox 45"
print(list(filter(None, re.findall(r'\bChapter\s*\d+\b|\b(\w+)\b', s))))

clean whitespaces using chatterbot preprocessors

I write "hi hello"; after cleaning whitespace using chatterbot.preprocessors.clean_whitespace I want my input to become "hihello", but chatterbot replies with a different answer. How can I print my input after preprocessing?
The clean_whitespace preprocessor in ChatterBot won't remove all whitespace, just leading, trailing, and consecutive whitespace. It cleans the whitespace; it doesn't remove it entirely.
It sounds like you want to create your own preprocessor. This can be as simple as creating a function like this:
def remove_whitespace(chatbot, statement):
    # Strip every whitespace character from the statement's text.
    for character in ['\n', '\r', '\t', ' ']:
        statement.text = statement.text.replace(character, '')
    return statement
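To actually use it, the function has to be registered with the bot. A hedged sketch, assuming the function above lives in a module named my_preprocessors on your import path (ChatterBot takes preprocessors as a list of dotted import paths):
from chatterbot import ChatBot

bot = ChatBot(
    'Example Bot',
    preprocessors=['my_preprocessors.remove_whitespace']
)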

How to find special characters from Python Data frame

I need to find special characters in an entire dataframe.
In the data frame below, some columns contain special characters. How do I find which columns contain them?
I want to display the text of each column if it contains special characters.
You can set up an alphabet of valid characters, for example
import string
alphabet = string.ascii_letters+string.punctuation
Which is
'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
And just use
df.col.str.strip(alphabet).astype(bool).any()
For example,
df = pd.DataFrame({'col1':['abc', 'hello?'], 'col2': ['ÃÉG', 'Ç']})
col1 col2
0 abc ÃÉG
1 hello? Ç
Then, with the above alphabet,
df.col1.str.strip(alphabet).astype(bool).any()
False
df.col2.str.strip(alphabet).astype(bool).any()
True
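If you also want to display which values contain the special characters, as the question asks, a small sketch reusing df and alphabet from above:
for col in df.columns:
    # strip leaves a non-empty remainder only for values with characters
    # outside the alphabet, so the mask flags exactly those rows
    mask = df[col].str.strip(alphabet).astype(bool)
    if mask.any():
        print(col, df.loc[mask, col].tolist())
# col2 ['ÃÉG', 'Ç']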
The term special characters can be very tricky, because it depends on your interpretation. For example, you might or might not consider # to be a special character. Also, some languages (such as Portuguese) have characters like ã and é that others (such as English) do not.
To remove unwanted characters from dataframe columns, use regex:
import re

def strip_character(dataCol):
    # Remove everything except letters, spaces, and common punctuation.
    r = re.compile(r'[^a-zA-Z !@#$%&*_+\-=|\:";<>,./()[\]{}\']')
    return r.sub('', dataCol)

# resultCol and dataCol are placeholder column names
df[resultCol] = df[dataCol].apply(strip_character)
# Whitespace could also be considered in some cases.
import string
unwanted = string.ascii_letters + string.punctuation + string.whitespace
print(unwanted)
# This helped me extract '10' from '10+ years'.
df.col = df.col.str.strip(unwanted)
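A tiny check of that strip trick, reusing the unwanted alphabet above (hypothetical data):
import pandas as pd

s = pd.Series(['10+ years', '5 years'])
print(s.str.strip(unwanted).tolist())  # ['10', '5']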

Decode a Unicode-escaped character from a PyQt5 QLabel widget?

I am trying to read text from a QLineEdit that might contain Unicode escape sequences and print it to a QLabel so that the proper characters are displayed, using PyQt5 and Python 3.4.
I tried many different things that I read here on stackoverflow but couldn't find a working solution for Python 3.
def on_pushButton_clicked(self):
    text = self.lineEdit.text()
    self.label.setText(text)
Now if I do something like this:
decodedText = str("dsfadsfa \u2662 \u8f1d \u2662").encode("utf-8")
self.label.setText(decodedText.decode("utf-8"))
This does print out the proper characters. If I apply the same to the above method I get the escaped sequences.
I don't get the difference between the str returned by QLineEdit's text() and str("\u2662"). Why does one encode the characters properly while the other doesn't?
The difference is that "\u2662" isn't a string with a Unicode escape, it's a string literal with a Unicode escape. A string with the same Unicode escape would be "\\u2662".
>>> import codecs
>>> codecs.getdecoder('unicode-escape')('\\u2662')
('♢', 6)
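Applied to the widgets from the question, a hedged sketch of the slot (assuming self.lineEdit and self.label as above): encode the text to bytes first so the unicode-escape codec can process the literal backslash sequences, using backslashreplace so any non-Latin-1 characters survive the round trip.
def on_pushButton_clicked(self):
    raw = self.lineEdit.text()
    # the user typed literal backslash escapes, so decode them here
    decoded = raw.encode('latin-1', 'backslashreplace').decode('unicode-escape')
    self.label.setText(decoded)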
