How to find special characters from Python Data frame - python-3.x

I need to find special characters across an entire DataFrame.
In my DataFrame, some columns contain special characters. How can I find which columns contain them?
I want to display a message for each column that contains special characters.

You can set up an alphabet of valid characters, for example:
import string
alphabet = string.ascii_letters+string.punctuation
Which is
'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
And just use
df.col.str.strip(alphabet).astype(bool).any()
For example,
import pandas as pd
df = pd.DataFrame({'col1':['abc', 'hello?'], 'col2': ['ÃÉG', 'Ç']})
col1 col2
0 abc ÃÉG
1 hello? Ç
Then, with the above alphabet,
df.col1.str.strip(alphabet).astype(bool).any()
False
df.col2.str.strip(alphabet).astype(bool).any()
True
The term "special characters" is tricky because it depends on your interpretation. For example, you might or might not consider # to be a special character. Also, some languages (such as Portuguese) use characters like ã and é, while others (such as English) do not.
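To see which characters trigger the check, it can help to list them explicitly. Below is a minimal plain-Python sketch, not part of the answer above. The helper name find_special, the ALLOWED set and the sample columns dict are assumptions made for illustration. With pandas, the same helper could probably be applied per column via something like df[col].map(find_special).

```python
import string

# Assumption: letters, digits, punctuation and the space count as "normal".
ALLOWED = set(string.ascii_letters + string.digits + string.punctuation + " ")

def find_special(text, allowed=ALLOWED):
    """Return the characters of `text` not in `allowed`, in order of first appearance."""
    seen = []
    for ch in text:
        if ch not in allowed and ch not in seen:
            seen.append(ch)
    return seen

# Hypothetical column data mirroring the example above.
columns = {"col1": ["abc", "hello?"], "col2": ["\u00c3\u00c9G", "\u00c7"]}

# For every column, list the offending characters of each value.
report = {name: [find_special(v) for v in values] for name, values in columns.items()}
print(report)
```

A column contains special characters whenever any of its lists is non-empty, which also tells you which characters to add to the alphabet if they turn out to be acceptable.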

To remove unwanted characters from DataFrame columns, use a regex:
import re

def strip_character(value):
    # Keep letters, spaces and the listed punctuation; the hyphen is escaped so +-= is not read as a range.
    r = re.compile(r'[^a-zA-Z !@#$%&*_+\-=|:";<>,./()\[\]{}\']')
    return r.sub('', value)

df[resultCol] = df[dataCol].apply(strip_character)

# Whitespace could also be treated as unwanted in some cases.
import string
unwanted = string.ascii_letters + string.punctuation + string.whitespace
print(unwanted)
# This helped me extract '10' from '10+ years'.
df.col = df.col.str.strip(unwanted)
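
Note that strip only trims characters from both ends of a string and stops at the first character that is not in the set, so unwanted characters in the middle survive. A small plain-Python sketch of that behaviour follows; the sample strings are made up, and pandas' .str.strip applies the same rule to each element.

```python
import string

unwanted = string.ascii_letters + string.punctuation + string.whitespace

# The trailing "+ years" is trimmed; the digits stop the stripping.
trimmed_end = "10+ years".strip(unwanted)

# The hyphen sits between digits, not at either end, so it is kept.
trimmed_middle = "about 3-5 years".strip(unwanted)

print(repr(trimmed_end))
print(repr(trimmed_middle))
```

If characters inside the string must go as well, a regex substitution is the better fit.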

Related

How to extract the exact word from a string while reducing false positive discovery

I want to extract an exact word from a string. My code produces false positives because it also matches the search item when it is only a substring of a longer word. Here is the code:
import re
text="Hello I am not react-dom"
item_search=['react', 'react-dom']
Found_item=[]
for i in range(0, len(item_search)):
    Q=re.findall(r'\b%s\b'%item_search[i], text, flags=re.IGNORECASE | re.MULTILINE | re.UNICODE)
    Found_item.append(Q)
print(Found_item)
The output is: [['react'], ['react-dom']]. In the result, I don't want to see react as an item.
The expected Output is: [[''], ['react-dom']]
\b marks the boundary between a word character and a non-word character, for example between a word and punctuation. So \b matches between the t of react and the -. Since we need whole words here, use a lookbehind and a lookahead to ensure that there is no non-space character on either side of the match (this is not the same as requiring a space on either side, because it also allows the start and end of the string). Thus you could use:
re.findall(rf"(?<!\S)({'|'.join(item_search)})(?!\S)", text)
['react-dom']
Edit:
If the sentence may include other non-word characters such as periods (see @DYZ's comment), you could use:
(?<!\S)({'|'.join(item_search)})\W*(?!\S)
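To get one result list per search item, as in the question, the lookaround idea can be wrapped in a small helper. This is a sketch under assumptions: the function name find_exact is made up, and re.escape is added here so that characters like - or . in a search item are treated literally.

```python
import re

def find_exact(items, text):
    """Map each search item to its whitespace-delimited matches in `text`."""
    results = {}
    for item in items:
        # No non-space character may touch the match on either side.
        pattern = rf"(?<!\S){re.escape(item)}(?!\S)"
        results[item] = re.findall(pattern, text, flags=re.IGNORECASE)
    return results

found = find_exact(["react", "react-dom"], "Hello I am not react-dom")
print(found)
```

Here react yields an empty list, because the only candidate is immediately followed by -, while react-dom is found once.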

Combining several replace-statements into one in pandas

There is a DataFrame in pandas, see image below
Basically it is a table scraped from Wikipedia's article: https://de.wikipedia.org/wiki/Liste_der_Gro%C3%9Fst%C3%A4dte_in_Deutschland#Tabelle
For further processing, I am trying to clean up the data. So, these statements work well
df['Name'] = df['Name'].str.replace('\d+', '')
df['Name'] = df['Name'].str.strip()
df['Name'] = df['Name'].str.replace(',', '')
df['Name'] = df['Name'].str.replace('­-', '')
But how can I bring all these four statements into one? Probably using regular expressions.
I tried with df['Name'] = df['Name'].str.replace(r'[\d\-,]+', '') but it did not work. Maybe because of the word wrap character that was used.
My desired output is " Ber,li-n2 "-> "Berlin".
The unknown circumstances are going around 'Mönchen­gladbach1, 5'.
You are removing the data, so you may join the patterns you remove into a single pattern like the one you have. r'[\d,-]+' is a bit better stylistically.
You may remove any dash punctuation + soft hyphen (\u00AD) using [\u00AD\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D], so you may want to add these codes to the regex.
Remember to assign the cleaned data back to the column and add .str.strip().
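You may use the following (regex=True is passed explicitly because pandas 2.0 and later treat the pattern as a literal string by default):
df['Name'] = df['Name'].str.replace(r'[\u00AD\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D\d,-]+', '', regex=True).str.strip()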
If you do not want to add str.strip(), add ^\s+ and \s+$ alternatives to the regex:
df['Name'] = df['Name'].str.replace(r'^\s+|[\u00AD\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D\d,-]+|\s+$', '', regex=True)
Details
^\s+ - 1+ whitespaces at the start of the string
| - or
[\u00AD\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D\d,-]+ - 1 or more soft hyphens, Unicode dashes, digits, commas or - chars
| - or
\s+$ - 1+ whitespaces at the end of the string.
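The same idea can be tried outside pandas with the re module. The sketch below uses a deliberately shorter character class (soft hyphen, ASCII hyphen, the U+2010 to U+2015 dash range, digits and commas), which is an assumption about what the data contains. The helper name clean_name is made up as well.

```python
import re

# Soft hyphen, ASCII hyphen, U+2010..U+2015 dashes, digits and commas.
# Assumption: this shorter class is enough for the data at hand.
UNWANTED = re.compile(r"[\u00AD\-\u2010-\u2015\d,]+")

def clean_name(name):
    """Remove unwanted characters, then trim surrounding whitespace."""
    return UNWANTED.sub("", name).strip()

print(clean_name(" Ber,li-n2 "))
print(clean_name("M\u00f6nchen\u00adgladbach1, 5"))
```

Stripping after the substitution matters for the second value, because removing "1," and "5" leaves a trailing space behind.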
You can go with
df['Name'] = df['Name'].str.replace('(\d+|,|­<|>|-)', '')
Put the items you want to remove into a group and separate the alternatives with the pipe |.

Python: How to Remove range of Characters \x91\x87\xf0\x9f\x91\x87 from File

I have this file with some lines that contain some unicode literals like:
"b'Who\xe2\x80\x99s he?\n\nA fan rushed the field to join the Cubs\xe2\x80\x99 celebration after Jake Arrieta\xe2\x80\x99s no-hitter."
I want to remove those \xe2\x80\x99-like characters.
I can remove them if I declare a string that contains these characters but my solutions don't work when reading from a CSV file. I used pandas to read the file.
SOLUTIONS TRIED
1. Regex
2. Decoding and Encoding
3. Lambda
Regex Solution
line = "b'Who\xe2\x80\x99s he?\n\nA fan rushed the field to join the Cubs\xe2\x80\x99 celebration after Jake Arrieta\xe2\x80\x99s no-hitter."
code = (re.sub(r'[^\x00-\x7f]',r'', line))
print (code)
LAMBDA SOLUTION
stripped = lambda s: "".join(i for i in s if 31 < ord(i) < 127)
code2 = stripped(line)
print(code2)
ENCODING SOLUTION
code3 = (line.encode('ascii', 'ignore')).decode("utf-8")
print(code3)
HOW FILE WAS READ
df = pandas.read_csv('file.csv',encoding = "utf-8")
for index, row in df.iterrows():
    print(stripped(row['text']))
    print(re.sub(r'[^\x00-\x7f]',r'', row['text']))
    print(row['text'].encode('ascii', 'ignore').decode("utf-8"))
SUGGESTED METHOD
df = pandas.read_csv('file.csv',encoding = "utf-8")
for index, row in df.iterrows():
    en = row['text'].encode()
    print(type(en))
    newline = en.decode('utf-8')
    print(type(newline))
    print(repr(newline))
    print(newline.encode('ascii', 'ignore'))
    print(newline.encode('ascii', 'replace'))
Your string is valid utf-8. Therefore it can be directly converted to a python string.
You can then encode it to ascii with str.encode(). It can ignore non-ascii characters with 'ignore'.
Also possible: 'replace'
line_raw = b'Who\xe2\x80\x99s he?'
line = line_raw.decode('utf-8')
print(repr(line))
print(line.encode('ascii', 'ignore'))
print(line.encode('ascii', 'replace'))
'Who’s he?'
b'Whos he?'
b'Who?s he?'
To come back to your original question, your 3rd method was correct. It was just in the wrong order: decode the bytes first, then encode to ASCII.
code3 = line_raw.decode("utf-8").encode('ascii', 'ignore')
print(code3)
To finally provide a working pandas example, here you go:
import pandas
df = pandas.read_csv('test.csv', encoding="utf-8")
for index, row in df.iterrows():
    print(row['text'].encode('ascii', 'ignore'))
There is no need to do decode('utf-8'), because pandas does that for you.
Finally, if you have a python string that contains non-ascii characters, you can just strip them by doing
text = row['text'].encode('ascii', 'ignore').decode('ascii')
This converts the text to ascii bytes, strips all the characters that cannot be represented as ascii, and then converts back to text.
You should look up the difference between Python 3 strings and bytes; that should clear things up for you, I hope.
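The round trip described above can be wrapped in a tiny helper. This is only a sketch: the name to_ascii and its errors parameter are made up for illustration, while 'ignore' and 'replace' are the standard error handlers of str.encode.

```python
def to_ascii(text, errors="ignore"):
    """Drop (or replace) every character that ASCII cannot represent."""
    return text.encode("ascii", errors).decode("ascii")

# bytes -> str first, exactly as pandas does when it reads the file.
sample = b"Who\xe2\x80\x99s he?".decode("utf-8")

print(to_ascii(sample))
print(to_ascii(sample, errors="replace"))
```

With 'ignore' the curly apostrophe disappears; with 'replace' it becomes a question mark.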

Adding a space character (" ") 2 character spaces after a period character (".")?

We have a legacy system that is exporting reports as .txt files, but in almost all instances when a date is supplied, it is after a currency denomination, and looks like this example:
25.0002/14/18 (25 bucks on feb 14th) or 287.4312/08/17.
Is there an easy way to parse for . and add a space character two spaces to the right to separate the string in Python? Any help is greatly appreciated!
The code below adds a space between the currency amount and the date in a given string.
import re
my_file_text = "This is some text 287.4312/08/17"
new_text = re.sub(r"(\d+\.\d{2})(\d{2}/\d{2}/\d{2})", r"\1 \2", my_file_text)
print(new_text)
OUTPUT
This is some text 287.43 12/08/17
REGEX
(\d+\.\d{2}): This part of the regex captures the currency amount in its own group. It assumes one or more digits before the . and exactly two digits after it, so something like 1000.25 is captured correctly, while 1000.205 and .25 are not.
(\d{2}/\d{2}/\d{2}): This part captures the date, it assumes that the day, month and year portion of the dates will always be represented using two digits each and separated by a /.
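To apply the two groups described above to several lines of a report, the pattern can be compiled once and reused. This is a minimal sketch; the helper name split_amount_and_date and the sample lines are assumptions for illustration.

```python
import re

# Amount with exactly two decimals, immediately followed by MM/DD/YY.
FUSED = re.compile(r"(\d+\.\d{2})(\d{2}/\d{2}/\d{2})")

def split_amount_and_date(line):
    """Insert a single space between a fused amount and date."""
    return FUSED.sub(r"\1 \2", line)

lines = ["25.0002/14/18", "Total 287.4312/08/17 due", "no amount here"]
fixed = [split_amount_and_date(line) for line in lines]
print(fixed)
```

Lines without a fused amount and date pass through unchanged, so the helper should be safe to run over a whole file.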
There may be more efficient methods, but an easy way could be:
def fix(string):
    if '.' in string:
        # Split only on the first period so extra periods do not break the unpacking.
        part_1, part_2 = string.split('.', 1)
        part_2_fixed = part_2[:2] + ' ' + part_2[2:]
        string = part_1 + '.' + part_2_fixed
    return string
In [1]: string = '25.0002/14/18'
In [2]: fix(string)
Out[2]: '25.00 02/14/18'

regex - Making all letters in a text lowercase using re.sub in python but exclude specific string?

I am writing a script to convert all uppercase letters in a text to lower case using regex, but excluding specific strings/characters such as "TEA", "CHI", "I", "#Begin", "#Language", "ENG", "#Participants", "#Media", "#Transcriber", "#Activities", "SBR", "#Comment" and so on.
The script I currently have is shown below, but it does not produce the desired output. For instance, when I input "#Activities: SBR", the output given is "#Activities#activities: sbr#activities: sbrSBR". The intended output is "#Activities: SBR".
I am using Python 3.5.2
Can anyone help to provide some guidance? Thank you.
import os
from itertools import chain
import re
def lowercase_exclude_specific_string(line):
    line = line.strip()
    PATTERN = r'[^TEA|CHI|I|#Begin|#Language|ENG|#Participants|#Media|#Transcriber|#Activities|SBR|#Comment]'
    filtered_line = re.sub(PATTERN, line.lower(), line)
    return filtered_line
First, let's see why you're getting the wrong output.
For instance when I input "#Activities: SBR", the output given is
"#Activities#activities: sbr#activities: sbrSBR".
This is because your code
PATTERN = r'[^TEA|CHI|I|#Begin|#Language|ENG|#Participants|#Media|#Transcriber|#Activities|SBR|#Comment]'
filtered_line = re.sub(PATTERN, line.lower(), line)
is doing negated character class matching, meaning it will match all characters that are not in the list and replace them with line.lower() (which is "#activities: sbr"). You can see the matched characters in this regex demo.
The code will match ":" and " " (whitespace) and replace both with "#activities: sbr", giving you the result "#Activities#activities: sbr#activities: sbrSBR".
Now to fix that code. Unfortunately, there is no direct way to negate words in a line and apply substitution on the other words on that same line. Instead, you can split the line first into individual words, then apply re.sub on it using your PATTERN. Also, instead of a negated character class, you should use a negative lookahead:
(?!...)
Negative lookahead assertion. This is the opposite of the positive assertion; it succeeds if the contained expression doesn’t match at the current position in the string.
Here's the code I got:
def lowercase_exclude_specific_string(line):
    line = line.strip()
    words = re.split(r"\s+", line)
    result = []
    for word in words:
        PATTERN = r"^(?!TEA|CHI|I|#Begin|#Language|ENG|#Participants|#Media|#Transcriber|#Activities|SBR|#Comment).*$"
        lword = re.sub(PATTERN, word.lower(), word)
        result.append(lword)
    return " ".join(result)
The re.sub will only match words that do not start with one of the excluded strings, and replaces each such word with its lowercase value. If a word starts with an excluded string, it is left unmatched and re.sub returns it unchanged.
Each word is then stored in a list, then joined later to form the line back.
Samples:
print(lowercase_exclude_specific_string("#Activities: SBR"))
print(lowercase_exclude_specific_string("#Activities: SOME OTHER TEXT SBR"))
print(lowercase_exclude_specific_string("Begin ABCDEF #Media #Comment XXXX"))
print(lowercase_exclude_specific_string("#Begin AT THE BEGINNING."))
print(lowercase_exclude_specific_string("PLACE #Begin AT THE MIDDLE."))
print(lowercase_exclude_specific_string("I HOPe thIS heLPS."))
#Activities: SBR
#Activities: some other text SBR
begin abcdef #Media #Comment xxxx
#Begin at the beginning.
place #Begin at the middle.
I hope this helps.
EDIT:
As mentioned in the comments, apparently there is a tab in between : and the next character. Since the code splits the string using \s, the tab can't be preserved, but it can be restored by replacing : with :\t in the final result.
return " ".join(result).replace(":", ":\t")
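
Because the lookahead only checks how a word starts, words such as INPUT or TEAM would also be left untouched (they begin with I and TEA). A possible alternative, sketched here under assumptions, passes a function to re.sub and compares each whole token against a set. The names KEEP and lowercase_except are made up, and stripping trailing punctuation before the comparison is a guess at what the data needs. Since only the non-whitespace tokens are replaced, tabs and other spacing stay as they are.

```python
import re

# Tokens that must keep their original casing (taken from the question).
KEEP = {"TEA", "CHI", "I", "#Begin", "#Language", "ENG", "#Participants",
        "#Media", "#Transcriber", "#Activities", "SBR", "#Comment"}

def lowercase_except(line, keep=KEEP):
    """Lowercase every token of `line` unless it is listed in `keep`."""
    def repl(match):
        token = match.group(0)
        # Ignore trailing punctuation such as ":" or "." when comparing.
        core = token.rstrip(":.,;!?")
        return token if core in keep else token.lower()
    # \S+ matches each run of non-whitespace; the whitespace itself is never touched.
    return re.sub(r"\S+", repl, line)

print(lowercase_except("#Activities:\tSBR"))
print(lowercase_except("I HOPe thIS heLPS."))
print(lowercase_except("INPUT TEAM TEA"))
```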
