Remove non-meaningful characters in pandas dataframe - python-3.x

I am trying to remove all
\xf0\x9f\x93\xa2, \xf0\x9f\x95\x91\n, \xe2\x80\xa6, \xe2\x80\x99
type sequences from the strings below in a pandas column. Although each text starts with b', it is a string, not a bytes object.
Text
_____________________________________________________
"b'Hello! \xf0\x9f\x93\xa2 End Climate Silence is looking for volunteers! \n\n1-2 hours per week. \xf0\x9f\x95\x91\n\nExperience doing digital research\xe2\x80\xa6
"b'I doubt if climate emergency 8s real, I think people will look ba\xe2\x80\xa6 '
"b'No, thankfully it doesn\xe2\x80\x99t. Can\xe2\x80\x99t see how cheap to overtourism in the alan alps can h\xe2\x80\xa6"
"b'Climate Change Poses a WidelllThreat to National Security "
"b""This doesn't feel like targeted propaganda at all. I mean states\xe2\x80\xa6"
"b'berates climate change activist who confronted her in airport\xc2\xa0
The above content is in a pandas dataframe as a column.
I have tried
string.encode('ascii', errors='ignore')
and regex, but without luck. Any suggestions would be helpful.

Your strings look like byte strings, but they aren't, so encode/decode won't work. Try something like this:
>>> df['text'].str.replace(r'\\x[0-9a-f]{2}', '', regex=True)
0 b'Hello! End Climate Silence is looking for v...
1 b'I doubt if climate emergency 8s real, I thin...
2 b'No, thankfully it doesnt. Cant see how cheap...
3 b'Climate Change Poses a WidelllThreat to Nati...
4 b""This doesn't feel like targeted propaganda ...
5 b'berates climate change activist who confront...
Name: text, dtype: object
Note that you still have to clean up the unbalanced single/double quotes and remove the leading 'b' character.
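One possible follow-up cleanup, as a sketch (assuming the leading b, the literal \n markers and the stray quotes are the only remaining artifacts), chains a few more replacements:
cleaned = (
    df['text']
    .str.replace(r'\\x[0-9a-f]{2}', '', regex=True)  # drop the \xNN escape sequences
    .str.replace(r'\\n', ' ', regex=True)            # drop literal \n markers
    .str.replace(r'^b', '', regex=True)              # drop the leading b
    .str.strip('\'" ')                               # trim stray quotes and whitespace
)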

You could go through your strings and keep only ascii characters:
my_str = "b'Hello! \xf0\x9f\x93\xa2 End Climate Silence is looking for volunteers! \n\n1-2 hours per week. \xf0\x9f\x95\x91\n\nExperience doing digital research\xe2\x80\xa6"
new_str = "".join(c for c in my_str if c.isascii())
print(new_str)
Note that .encode('ascii', errors='ignore') doesn't change the string it's applied to; it returns a new bytes object, so assign the result and decode it back if you want a str. This should work:
new_str = my_str.encode('ascii', errors='ignore').decode('ascii')
print(new_str)
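Applied to the dataframe column, the same idea might look like this sketch (assuming the column is named text as above):
# keep only ASCII characters in every row of the column
df['text'] = df['text'].apply(lambda s: ''.join(c for c in s if c.isascii()))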

Related

Python: Using Pandas and Regex to Clean Phone Numbers with Country Code

I'm attempting to use pandas to clean phone numbers so that only the 10-digit phone number is returned, with the country code (if present) and any special characters removed.
Here's some sample code:
import pandas
phone_series = pandas.Series(['+1(192) 167-4056', '123-345-4568', '1233455678', '(123) 123-1234'])
r1 = '[^0-9]+'
phone_series.str.replace(r1, '', regex=True)
Returns
0 11921674056
1 1233454568
2 1233455678
3 1231231234
As you can see, this regex works well except for the country code, and unfortunately the system I'm loading into cannot accept the country code. What I'm struggling with is finding a regex that will strip the country code as well. All the regexes I've found match the 10 digits I need, and in this case, using pandas str.replace, I need to match what should be removed instead.
I could easily write a function and use .apply but I feel like there is likely a simple regex solution that I'm missing.
Thanks for any help!
I don't think regex is necessary here, which is nice because regex is a pain in the buns.
To extend your current solution:
phone_series = pandas.Series(['+1(192) 167-4056', '123-345-4568', '1233455678', '(123) 123-1234'])
r1 = '[^0-9]+'
phone_series = phone_series.str.replace(r1, '', regex=True)
phone_series = phone_series.apply(lambda x: x[-10:])
My lazier solution:
>>> phone_series = pd.Series(['+1(192) 167-4056', '123-345-4568', '1233455678', '(123) 123-1234'])
>>> p2 = phone_series.apply(lambda x: ''.join([i for i in x if str.isnumeric(i)])[-10:])
>>> p2
0 1921674056
1 1233454568
2 1233455678
3 1231231234
dtype: object
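If you do want a regex-only route, one sketch (assuming the country code is always a single leading 1, possibly with a +) starts from the raw series, strips non-digits, and then drops a leading 1 from any 11-digit result:
raw = pandas.Series(['+1(192) 167-4056', '123-345-4568', '1233455678', '(123) 123-1234'])
cleaned = (
    raw
    .str.replace(r'\D', '', regex=True)             # keep digits only
    .str.replace(r'^1(?=\d{10}$)', '', regex=True)  # drop a leading country-code 1
)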

Identify numbers, in a large data string, that directly precede a letter (up to 2 digits) in between other characters

I have a string containing thousands of lines of this data without line breaks (only a few lines are shown here, with line breaks added for readability)
5BengaluruUrban4598962MSARICoughBreathlessnessDM23.07.2020atGovernmenthospital
7DakshinaKannada4786665FSARICoughDMHTN23-07-2020atPrivatehospital
The format is
(entry number)(district)(patient number)(age)(gender)(case of)(symptoms)(comorbidity)(date of death)(place of death)
without spaces or brackets.
Problem: The data I want to collect is the age.
However, I can't seem to find a way to single out the age since it's surrounded by a lot of other numbers in the data. I have tried various iterations of counting, limiting it to 1 to 99, separating the data, etc., and failed.
My idea: Since the gender is always either 'M' or 'F', the two digits before the gender are the age. Isolating the two digits before the gender seems like an ideal solution:
xxM
xxF
My goal: I would like to collect all the xx numbers, irrespective of gender, and store them in a list. How do I go about this?
import re
input_str = '5BengaluruUrban4598962MSARICoughBreathlessnessDM23.07.2020atGovernmenthospital7DakshinaKannada4786665FSARICoughDMHTN23-07-2020atPrivatehospital'
ages = [found[-3:-1] for found in re.findall('[0-9]+[MF]', input_str, re.I)]
print(ages)
# ['62', '65']
This works fine with the sample, but if there are districts starting with 'M'/'F' then the entry number will be collected as well.
A workaround is to match exactly seven digits (if the patient number is always 5 digits and the age is generally 2 digits).
ages = [found[-3:-1] for found in re.findall(r'\d{7}[MF]', input_str, re.I)]
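Equivalently, a capture group avoids the slicing; a sketch under the same 5-digit patient number / 2-digit age assumption:
# capture only the two age digits sitting between the patient number and the gender letter
ages = re.findall(r'\d{5}(\d{2})[MF]', input_str)
# ['62', '65']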
Using the structure you gave, I've built a dict of regular expressions to match the components, then put the parsed pieces back into a dict.
There are ways I can imagine this will not work:
if the age is < 10 it is only 1 digit, so you will pick up a digit of the patient number
there may be strings that don't match the regular expressions, which will give odd results
It's the most structured way I can think of to go.
import re
data = "5BengaluruUrban4598962MSARICoughBreathlessnessDM23.07.2020atGovernmenthospital7DakshinaKannada4786665FSARICoughDMHTN23-07-2020atPrivatehospital"
md = {
    "entrynum": "([0-9]+)",
    "district": "([A-Za-z]+)",
    "patnum_age": "([0-9]+)",
    "sex": "([MF])",
    "remainder": "(.*)$"
}
data_dict = {list(md.keys())[i]: tk
             for i, tk in
             enumerate([tk for tk in re.split("".join(md.values()), data) if tk != ""])}
print(f"Assumed age:{data_dict['patnum_age'][-2:]}\nparsed:{data_dict}\n")
output
Assumed age:62
parsed:{'entrynum': '5', 'district': 'BengaluruUrban', 'patnum_age': '4598962', 'sex': 'M', 'remainder': 'SARICoughBreathlessnessDM23.07.2020atGovernmenthospital7DakshinaKannada4786665FSARICoughDMHTN23-07-2020atPrivatehospital'}
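If you need the ages from every record in the blob rather than just the first, one sketch (again assuming a 5-digit patient number, a 2-digit age and letter-only district names) lets finditer walk the whole string:
# entry number, district, 5-digit patient number, 2-digit age, gender, per record
record_re = re.compile(r'(\d+)([A-Za-z]+)(\d{5})(\d{2})([MF])')
ages = [int(m.group(4)) for m in record_re.finditer(data)]
print(ages)  # [62, 65]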

What is the easiest way to split a string into a first name and last name?

The dataset has 14k rows and contains many titles, etc.
I am a beginner in Pandas and Python and I'd like to know how to go about extracting the first name and last name from this dataset.
Dataset:
0 Pr.Doz.Dr. Klaus Semmler Facharzt für Frauenhe...
1 Dr. univ. (Budapest) Dalia Lax
2 Dr. med. Jovan Stojilkovic
3 Dr. med. Dirk Schneider
4 Marc Scheuermann
14083 Bag Kinderarztpraxis
14084 Herr Ulrich Bromig
14085 Sohn Heinrich
14086 Herr Dr. sc. med. Amadeus Hartwig
14087 Jasmin Rieche
for name in dataset:
    first = name.split()[-2]
    last = name.split()[-1]
    # save here
This will work for most names, but not all. For robustness you may need a list of titles such as (dr., md., univ.) to skip over.
As it doesn't contain any structure, you're out of luck. An ad-hoc solution could be to just write down a list of all locations/titles/conjunctions and other noise you've identified and then strip those from the rows. Then, if you notice some other things you'd like to exclude, just add them to your list.
This will not solve the issue of certain rows having their name in reverse order. So it'll require you to manually go over everything and check if the row is valid, but it might be quicker than editing each row by hand.
A simple, brute-force example would be:
excludes = {'dr.', 'herr', 'budapest', 'med.', 'für', ... }
new_entries = []
for title in all_entries:
    cleaned_result = []
    parts = title.split(' ')
    for part in parts:
        if part.lower() not in excludes:
            cleaned_result.append(part)
    new_entries.append(' '.join(cleaned_result))
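Since the rows live in a pandas column, the same idea can also be applied with .apply; a sketch, assuming the column is called name (both the column name and the exclusion set here are hypothetical):
import pandas as pd

excludes = {'dr.', 'herr', 'frau', 'med.', 'univ.', 'prof.'}  # hypothetical list, extend as needed

def clean_name(raw):
    # drop known titles/noise, then take the last two tokens as first/last name
    parts = [p for p in raw.split() if p.lower() not in excludes]
    return pd.Series({'first': parts[-2] if len(parts) > 1 else '',
                      'last': parts[-1] if parts else ''})

names = df['name'].apply(clean_name)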

list.sort() method places single digits last, however non-single digits are placed first - python [duplicate]

I have some files that need to be sorted by name. Unfortunately I can't use a regular sort, because I also want to sort the numbers in the string, so I did some research and found that what I am looking for is called natural sorting.
I tried the solution given here and it worked perfectly.
However, strings like PresserInc-1_10.jpg and PresserInc-1_11.jpg cause that specific natural-key algorithm to fail, because it only matches the first integer, which in this case would be 1 and 1, and so it throws off the sorting. What I think might help is to match all numbers in the string and group them together, so if I have PresserInc-1_11.jpg the algorithm should give me 111 back. So my question is: is this possible?
Here's a list of filenames:
files = ['PresserInc-1.jpg', 'PresserInc-1_10.jpg', 'PresserInc-1_11.jpg', 'PresserInc-10.jpg', 'PresserInc-2.jpg', 'PresserInc-3.jpg', 'PresserInc-4.jpg', 'PresserInc-5.jpg', 'PresserInc-6.jpg', 'PresserInc-11.jpg']
Google: Python natural sorting.
Result 1: The page you linked to.
But don't stop there!
Result 2: Jeff Atwood's blog that explains how to do it properly.
Result 3: An answer I posted based on Jeff Atwood's blog.
Here's the code from that answer:
import re
def natural_sort(l):
    convert = lambda text: int(text) if text.isdigit() else text.lower()
    alphanum_key = lambda key: [convert(c) for c in re.split('([0-9]+)', key)]
    return sorted(l, key=alphanum_key)
Results for your data:
PresserInc-1.jpg
PresserInc-1_10.jpg
PresserInc-1_11.jpg
PresserInc-2.jpg
PresserInc-3.jpg
etc...
See it working online: ideone
If you don't mind third party libraries, you can use natsort to achieve this.
>>> import natsort
>>> files = ['PresserInc-1.jpg', 'PresserInc-1_10.jpg', 'PresserInc-1_11.jpg', 'PresserInc-10.jpg', 'PresserInc-2.jpg', 'PresserInc-3.jpg', 'PresserInc-4.jpg', 'PresserInc-5.jpg', 'PresserInc-6.jpg', 'PresserInc-11.jpg']
>>> natsort.natsorted(files)
['PresserInc-1.jpg',
'PresserInc-1_10.jpg',
'PresserInc-1_11.jpg',
'PresserInc-2.jpg',
'PresserInc-3.jpg',
'PresserInc-4.jpg',
'PresserInc-5.jpg',
'PresserInc-6.jpg',
'PresserInc-10.jpg',
'PresserInc-11.jpg']
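If you'd rather sort the list in place, natsort also provides a key factory that plugs into list.sort (a small sketch):
>>> files.sort(key=natsort.natsort_keygen())  # same ordering as natsorted, but in place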

Python: Print entire line of string match and not cut off after the period

See the bottom for the solution I came up with.
Hopefully this is an easy question for you guys. I'm trying to match a string to a list and print just the string that matched. I was successful using re, but it is cutting off the rest of the string after the period. The span per re is (0, 10), and when I look at the output without using re it is (0, 14), not (0, 10), so match is cutting off the info after the period. I would like to learn how to tell it to print the entire span, or learn a new way to match a variable string to a list and print that exact string. My original attempts printed anything with TESTPR in it, 3 in total; the others I do not want printed have a 1 at the front, and the last match has an additional R at the end. Here is my current match code:
#OLD See below
for element in catalog:
    z = re.match("((TESTPRR )\w+)", element)
    if z:
        print(z.group())
Output: TESTPR 105
It should show:
Wanted output: TESTPT 105.465
It will go up to 3 decimal places after the period and no more. I am currently taking a Python class and love it so far, but this one has me stumped, as I am just now learning about re and matching by reading ahead; we have not gotten to it yet in class.
I am open to learning a different way to search for and match a string and print just that string. My first attempt, which prints 3 results, was this:
catalog = [ long list pulled from API then code here to make it a nice column]
prod = 'TESTPR'
print ([s for s in catalog if prod in s])
When I add a space at the end of prod I can get rid of the match with the extra character at the end, but I cannot add a space to do the same thing with the match that has an extra character at the front. This applies to the code above, not to the re match code. Thanks!
Answer below!
Since you are interested in learning about ways to match strings and solve your problem: try fuzzywuzzy.
In your case you could try:
from fuzzywuzzy import process
catalog = [long list pulled from API then code here to make it a nice column]
prod = "TESTPR"
hit = process.extractOne(prod, catalog, score_cutoff = 75) #you can adjust this to suit how close the match should be
print(hit[0]) #hit will be sth like ("TESTPT 105.465", 75)
Output: TESTPT 105.465
For information on different ways of using fuzzywuzzy, check out this link.
You can use different ways of matching such as:
fuzz.partial_ratio
fuzz.ratio
fuzz.token_sort_ratio
fuzz.token_set_ratio
For these, use: from fuzzywuzzy import fuzz
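For example, a scorer can be passed straight to extractOne; a small sketch (the catalog values here are made up for illustration):
from fuzzywuzzy import fuzz, process

catalog = ['TESTPT 105.465', '1TESTPR 105.465', 'TESTPR 105.465R']  # hypothetical sample values
hit = process.extractOne('TESTPR', catalog, scorer=fuzz.token_sort_ratio, score_cutoff=75)
if hit:  # extractOne returns None when nothing clears the cutoff
    print(hit[0])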
I kept at it with re.match and got the correct regex, so the entire match prints and it does not cut off the numbers after the period.
My original match, as you can see above, was re.match("((TESTPRR )\w+)", element); some of the parentheses were unneeded and I needed to add a few more expressions, and now it prints the correct match. See above for the old code and below for the new code that works.
# New code, replaced \w+ with \w*\d*[.,]?\d*$
for element in catalog:
    z = re.match(r"STRING\w*\d*[.,]?\d*$", element)
    if z:
        print(z.group())
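A slightly tighter pattern, as a sketch (assuming the product code is in prod and the number has at most 3 decimal places), anchors on the prefix and the number explicitly:
import re

prod = 'TESTPR'
# the prefix, a space, then digits with an optional 1-3 digit decimal part
pattern = re.compile(re.escape(prod) + r' \d+(?:\.\d{1,3})?$')
for element in catalog:
    z = pattern.match(element)
    if z:
        print(z.group())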
