How to customize unidecode? - python-3.x

I'm using the unidecode module to replace Unicode characters with ASCII equivalents. However, there are some characters, such as Greek letters and symbols like Å, that I want to preserve. How can I achieve this?
For example,
from unidecode import unidecode
test_str = 'α, Å ©'
unidecode(test_str)
gives the output a, A (c), while what I want is α, Å (c).

Run unidecode on each character individually, and keep a whitelist set of characters that bypass unidecode.
>>> import string
>>> from unidecode import unidecode
>>> whitelist = set(string.printable + 'αÅ')
>>> test_str = 'α, Å ©'
>>> ''.join(ch if ch in whitelist else unidecode(ch) for ch in test_str)
'α, Å (c)'
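The per-character approach above can be wrapped in a small helper. This is only a sketch: `transliterate_except` and `stand_in` are hypothetical names, and `stand_in` is a placeholder covering just the characters in this example so that the snippet runs without the third-party package installed. In practice you would presumably pass `unidecode.unidecode` as the last argument.

```python
def transliterate_except(text, keep, transliterate):
    """Transliterate every character of `text` except those in `keep`."""
    return ''.join(ch if ch in keep else transliterate(ch) for ch in text)


# Placeholder for unidecode.unidecode (assumption: only these three characters matter here).
_TABLE = {'α': 'a', 'Å': 'A', '©': '(c)'}


def stand_in(ch):
    # Characters missing from the table (plain ASCII here) pass through unchanged.
    return _TABLE.get(ch, ch)


result = transliterate_except('α, Å ©', {'α', 'Å'}, stand_in)
print(result)  # α, Å (c)
```

Taking the transliteration function as a parameter keeps the whitelist logic independent of which library does the actual conversion.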

Related

How to remove both number and text from a parenthesis using regex in python?

In the following text, I want to remove everything inside the parentheses, including the number and the string. I use the following syntax, but I get 22701 instead of 2270. How can I get only 2270 using re.sub? Thanks
import regex as re
import numpy as np
import pandas as pd
text = "2270 (1st xyz)"
text_new = re.sub(r"[a-zA-Z()\s]","",text)
text_new
Does the text always follow the same pattern? Try:
import re
import numpy as np
import pandas as pd
text = "2270 (1st xyz)"
text_new = re.sub(r"\s\([^)]*\)","",text)
print(text_new)
Output:
2270
Simply use the regex pattern \(.*?\):
import re
text = "2270 (1st xyz)"
text_new = re.sub(r"\(.*?\)", "", text)
print(text_new)
Output:
2270
Explanation on the pattern \(.*?\):
The \ before each parenthesis tells re to treat the parenthesis as a literal character, since parentheses are special characters in re by default.
The . matches any character except the newline character.
The * matches zero or more occurrences of the pattern immediately specified before the *.
The ? tells re to match as little text as possible, thus making it non-greedy.
Note the trailing space in the output. To remove it, simply add it to the pattern:
import re
text = "2270 (1st xyz)"
text_new = re.sub(r" \(.*?\)", "", text)
print(text_new)
Output:
2270
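The two answers above can be combined into one reusable pattern. The following is a minimal sketch, assuming the parentheses are never nested; `PAREN_GROUP` and `strip_parenthesized` are illustrative names, not part of the original answers.

```python
import re

# Optional leading whitespace, then "(", anything that is not ")", then ")".
PAREN_GROUP = re.compile(r"\s*\([^)]*\)")


def strip_parenthesized(text):
    """Remove every parenthesized group and the whitespace directly before it."""
    return PAREN_GROUP.sub("", text)


print(strip_parenthesized("2270 (1st xyz)"))       # 2270
print(strip_parenthesized("2270 (1st) 15 (2nd)"))  # 2270 15
```

Using `\s*` instead of a single literal space also covers input where the group follows the number directly or after several spaces.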

How to convert iso 8859-1 in simple letter using python

I'm trying to clean my sqlite database using python. First I loaded the data using this code:
import sqlite3, pandas as pd
con = sqlite3.connect("DATABASE.db")
df = pd.read_sql_query("SELECT TITLE from DOCUMENT", con)
This returns the dirty values. For example, I get "Conciliaci\363n" where I want "Conciliacion". I used this code:
df['TITLE']=df['TITLE'].apply(lambda x: x.decode('iso-8859-1').encode('utf8'))
I got b'' in blank cells and still got 'Conciliaci\\363n'. So maybe I'm doing something wrong. How can I solve this problem? Thanks in advance.
It's unclear, but if your string contains a literal backslash and numbers like this:
>>> s= r"Conciliaci\363n" # A raw string to make a literal escape code
>>> s
'Conciliaci\\363n' # debug display of string shows an escaped backslash
>>> print(s)
Conciliaci\363n # printing prints the escape
Then this will decode it correctly:
>>> s.encode('ascii').decode('unicode-escape') # convert to byte string, then decode
'Conciliación'
If you want to lose the accent mark as your question shows, then decomposing the Unicode string, converting to ASCII ignoring errors, then converting back to a Unicode string will do it:
>>> s2 = s.encode('ascii').decode('unicode-escape')
>>> s2
'Conciliación'
>>> import unicodedata as ud
>>> ud.normalize('NFD',s2) # Make Unicode decomposed form
'Conciliación' # The ó is now an ASCII 'o' and a combining accent
>>> ud.normalize('NFD',s2).encode('ascii',errors='ignore').decode('ascii')
'Conciliacion' # accent isn't ASCII, so is removed
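The two steps above (decode the literal escape, then strip the accent) can be combined into one function. This is a sketch under stated assumptions: `clean_title` is a hypothetical name, the input is assumed to be pure ASCII text containing literal backslash escapes, and blank cells are assumed to arrive as `None` or an empty string.

```python
import unicodedata


def clean_title(value):
    """Decode literal escapes such as \\363, then drop accent marks."""
    if not value:  # None or empty string, e.g. a blank database cell
        return ""
    # Turn the literal text \363 into the character it encodes (ó).
    decoded = value.encode("ascii").decode("unicode-escape")
    # Split accented letters into a base letter plus a combining mark.
    decomposed = unicodedata.normalize("NFD", decoded)
    # The combining mark is not ASCII, so errors="ignore" drops it.
    return decomposed.encode("ascii", errors="ignore").decode("ascii")


print(clean_title(r"Conciliaci\363n"))  # Conciliacion
```

With the DataFrame from the question, this could presumably be applied as `df['TITLE'] = df['TITLE'].apply(clean_title)`.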

Python String Replacement using only one line of code (replace "H" and "h" by "*")

Python: In the string "Hello, how are you?" , how can I replace both "H" and "h" by "*" ? I'd like to do it in only one line of code ...
You can use re, for example:
import re
old_text="Hello, how are you?"
new_text = re.sub(r'h', '*', old_text, flags=re.IGNORECASE)
print(new_text)
With regular expressions:
import re
s = "Hello, how are you?"
s_replaced = re.sub('(H|h)', '*', s)
The pattern (H|h) matches either an uppercase H or a lowercase h.
There are several ways to do this:
You can chain .replace() several times as it operates on and returns a string:
>>> print('Hello, how are you?'.replace('H', '*').replace('h', '*'))
*ello, *ow are you?
Or use Regex:
>>> import re
>>> re.sub('[Hh]', '*', 'Hello, how are you?')
'*ello, *ow are you?'
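Another option, not mentioned in the answers above, is `str.translate` with a table built by `str.maketrans`. A minimal sketch (the variable names are illustrative):

```python
# Map both 'H' and 'h' to '*' in a single pass over the string.
table = str.maketrans('Hh', '**')
replaced = 'Hello, how are you?'.translate(table)
print(replaced)  # *ello, *ow are you?
```

The two statements can be folded into one line if needed, and the approach extends to more characters by lengthening both arguments of `str.maketrans` equally.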

Return emoji name instead of emoji

I have this: '1⃣' (without the single quotes) in Python 3, which is :one:. Is there a way to take an emoji like the one above and print its name (in this case :one:) instead of the emoji itself?
I'm getting the emoji from a discord.py reaction object.
In your case, that emoji is a two-character string. You can get the number by getting the first character of the string:
char = '1⃣'
print(char[0]) # 1
With another emoji that isn't just two characters, you can use the unicodedata module:
import unicodedata
char = '❤'
name = unicodedata.name(char)
print(name) # HEAVY BLACK HEART
In simple cases like these, the name of the emote is often the last word of the Unicode name:
import unicodedata
char = '1⃣'
name = unicodedata.name(char[0])
name = name.split(' ')[-1]
print(f':{name.lower()}:')
# :one:
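To see why the first character is the one that carries the name here, it can help to list the Unicode name of every code point in the emoji. This is a small sketch; `code_point_names` is a hypothetical helper, and the fallback label for unnamed code points is an assumption of this example.

```python
import unicodedata


def code_point_names(emoji):
    """Return the Unicode name of each code point in `emoji`."""
    # The second argument is returned for code points that have no name.
    return [unicodedata.name(ch, f"U+{ord(ch):04X}") for ch in emoji]


keycap_one = "1\u20e3"  # digit one followed by U+20E3
print(code_point_names(keycap_one))
# ['DIGIT ONE', 'COMBINING ENCLOSING KEYCAP']
```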

Getting a value error: invalid literal for int() with base 10: '56,990'

I am trying to scrape a website containing the price of a laptop. However, the price is a string, and for comparison purposes I need to convert it to an int. When I try, I get a ValueError: invalid literal for int() with base 10: '56,990'
Below is the code:
from bs4 import BeautifulSoup
import requests
r = requests.get("https://www.flipkart.com/apple-macbook-air-core-i5-5th-gen-8-gb-128-gb-ssd-mac-os-sierra-mqd32hn-a-a1466/p/itmevcpqqhf6azn3?pid=COMEVCPQBXBDFJ8C&srno=s_1_1&otracker=search&lid=LSTCOMEVCPQBXBDFJ8C5XWYJP&fm=SEARCH&iid=2899998f-8606-4b81-a303-46fd62a7882b.COMEVCPQBXBDFJ8C.SEARCH&qH=9e3635d7234e9051")
data = r.text
soup = BeautifulSoup(data,"lxml")
data=soup.find('div',{"class":"_1vC4OE _37U4_g"})
cost=(data.text[1:].strip())
print(int(cost))
PS: I used text[1:] to remove the currency character.
I get the error on the last line. Basically, I need to get the int value of the cost.
The value has a comma in it. So you need to replace the comma with empty character before converting it to integer.
print(int(cost.replace(',','')))
Python's int() does not accept , as a group separator, so you'll need to remove the commas. Try:
cost = data.text[1:].strip().translate(str.maketrans('', '', ','))
Rather than invent a new solution for every character you don't want (strip() function for whitespace, [1:] index for the currency, something else for the digit separator) consider a single solution to gather what you do want:
>>> import re
>>> text = "\u20B956,990\n"
>>> cost = re.sub(r"\D", "", text)
>>> print(int(cost))
56990
The re.sub() replaces anything that isn't a digit with nothing.
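The digit-only approach above can be wrapped in a function that also reports input with no digits at all. This is a sketch, with `parse_price` as a hypothetical name; it assumes whole-number prices, since a decimal point would be removed along with everything else that is not a digit.

```python
import re


def parse_price(text):
    """Return the digits in `text` as an int, ignoring every other character."""
    digits = re.sub(r"\D", "", text)
    if not digits:
        # int("") would raise a less helpful error, so fail explicitly.
        raise ValueError(f"no digits found in {text!r}")
    return int(digits)


print(parse_price("\u20B956,990\n"))  # 56990
```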
