In Excel there is a function called clean(), which removes all nonprintable characters from text. Reference https://support.microsoft.com/en-us/office/clean-function-26f3d7c5-475f-4a9c-90e5-4b8ba987ba41#:~:text=Removes%20all%20nonprintable%20characters%20from,files%20and%20cannot%20be%20printed.
I am wondering whether there is a direct function or method in Python that does the same.
Also, how can I mimic the CLEAN() function in Python using only a regular expression?
Any pointers would be very helpful.
The CLEAN function in Excel removes only "the first 32 nonprinting characters in the 7-bit ASCII code (values 0 through 31)", according to the documentation you link to, so to mimic it, you can filter characters of a given string whose ord values are less than 32:
def clean(s):
    # Keep only characters with code point 32 or higher (drop 0 through 31).
    return ''.join(c for c in s if ord(c) >= 32)
Or you can use a regular expression substitution to remove characters with hex values between \x00 and \x1f:
import re
def clean(s):
    return re.sub(r'[\x00-\x1f]+', '', s)
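As a quick check, here is a minimal sketch that runs both approaches on a sample string. The names clean_filter and clean_regex and the sample text are illustrative only; they are not part of Excel or of any library.

```python
import re

def clean_filter(s):
    # Keep only characters whose code point is 32 or higher.
    return ''.join(c for c in s if ord(c) >= 32)

def clean_regex(s):
    # Remove runs of characters in the range \x00-\x1f.
    return re.sub(r'[\x00-\x1f]+', '', s)

# Hypothetical sample containing BEL, TAB, CR and LF.
sample = 'Monthly\x07 report\t2024\r\n'
print(repr(clean_filter(sample)))  # 'Monthly report2024'
print(repr(clean_regex(sample)))   # 'Monthly report2024'
```

Both versions leave ordinary spaces (code point 32) untouched, which matches the documented range of 0 through 31.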
I want to write a function that takes a given string T, keeps only its digits, and groups them into blocks of three.
However, if the final digits cannot form a block of three, I want the last block split into two blocks of two.
For example, this is my code
import re
def num_format(T):
    clean_number = re.sub('[^0-9]+', '', T)
    formatted_number = re.sub(r"(\d{3})(?=(\d{3})+(?!\d{3}))", r"\1-", clean_number)
    return formatted_number
num_format("05553--70002654")
This returns '055-537-000-2654'.
However, I want it to be '055-537-000-26-54'.
I used a regular expression, but I have no idea how to split the last remaining digits into two blocks.
I would really appreciate help figuring this out.
Thanks in advance.
You can use
def num_format(T):
    clean_number = ''.join(c for c in T if c.isdigit())
    return re.sub(r'(\d{3})(?=\d{2})|(?<=\d{2})(?=\d{2}$)', r'\1-', clean_number)
Note that you can get rid of all non-numeric characters with a plain Python comprehension; that part of the solution is borrowed from "Removing all non-numeric characters from string in Python".
The regex matches:
(\d{3}) - Group 1 (\1): three digits...
(?=\d{2}) - followed by two digits
| - or
(?<=\d{2})(?=\d{2}$) - a position that is preceded by two digits and followed by two digits at the end of the string.
Full example:
import re
def num_format(T):
    clean_number = ''.join(c for c in T if c.isdigit())
    return re.sub(r'(\d{3})(?=\d{2})|(?<=\d{2})(?=\d{2}$)', r'\1-', clean_number)
print(num_format("05553--70002654"))
# => 055-537-000-26-54
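For comparison, the same grouping can be sketched without a regular expression. This is only a sketch under an assumption the question does not spell out: digits are taken in blocks of three, and a remainder of four is split into two blocks of two, while a remainder of two or three stays as a single block. The name num_format_blocks is hypothetical.

```python
def num_format_blocks(text):
    # Keep digits only, as in the answer above.
    digits = ''.join(c for c in text if c.isdigit())
    blocks = []
    i = 0
    # Take blocks of three while more than four digits remain.
    while len(digits) - i > 4:
        blocks.append(digits[i:i + 3])
        i += 3
    rest = digits[i:]
    if len(rest) == 4:
        # Four leftover digits become two blocks of two.
        blocks.extend([rest[:2], rest[2:]])
    elif rest:
        # Two or three leftover digits (or a short input) stay as one block.
        blocks.append(rest)
    return '-'.join(blocks)

print(num_format_blocks("05553--70002654"))  # 055-537-000-26-54
```

The explicit loop makes the rule for the tail easy to change if a different split is wanted for other input lengths.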
For example, I want to remove the first 2 letters from the strings "ПРИВЕТ" and "HELLO."; the first of these contains only two-byte Unicode characters.
I tried string.sub("ПРИВЕТ") and string.sub("HELLO.")
and got "РИВЕТ" and "LLO.".
string.sub() removed 2 bytes (not characters) from these strings, so I want to know how to remove characters instead.
Something like utf8.sub().
The key standard function for this task is utf8.offset(s,n), which gives the position in bytes of the start of the n-th character of s.
So try this:
print(string.sub(s,utf8.offset(s,3),-1))
You can define utf8.sub as follows:
function utf8.sub(s,i,j)
    i=utf8.offset(s,i)
    j=utf8.offset(s,j+1)-1
    return string.sub(s,i,j)
end
(This code only works for positive j. See http://lua-users.org/lists/lua-l/2014-04/msg00590.html for the general case.)
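To illustrate what a byte offset like the one returned by utf8.offset means, here is a rough Python sketch of the same idea on UTF-8 encoded bytes. The function utf8_offset is a hypothetical helper written for this illustration, not a port of Lua's implementation; it assumes valid UTF-8 input and n >= 1, and it raises an error where Lua would return a failure value.

```python
def utf8_offset(data: bytes, n: int) -> int:
    """Return the 1-based byte position where the n-th character starts."""
    count = 0
    for pos, byte in enumerate(data, start=1):
        # Continuation bytes look like 0b10xxxxxx; any other byte starts a character.
        if byte & 0xC0 != 0x80:
            count += 1
            if count == n:
                return pos
    if n == count + 1:
        # One past the last character: the position just after the end.
        return len(data) + 1
    raise ValueError("n is out of range")

data = "ПРИВЕТ".encode("utf-8")
start = utf8_offset(data, 3)
print(start)  # 5, because each of the first two letters takes two bytes
rest = data[start - 1:].decode("utf-8")  # the string without its first two letters
```

For the ASCII string "HELLO." the third character starts at byte 3, which is why byte-based and character-based slicing agree there.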
There is also https://github.com/Stepets/utf8.lua, a pure Lua library that extends the standard string functions to support UTF-8 strings.
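I found a simpler solution (the solution using offset() didn't work for me in all cases). Note that utf8.codepoint takes i and j as byte positions, so they coincide with character indices only for ASCII text: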
function utf8.sub(s, i, j)
    return utf8.char(utf8.codepoint(s, i, j))
end
I am trying to write a function that takes a string txt and returns an int made up of the ASCII codes of that string's characters. It also takes a second argument, n, an int that specifies the number of digits each character should translate to. The default value of n is 3. n is always >= 3 and the input string is always non-empty.
Example outputs:
string_to_number('fff')
102102102
string_to_number('ABBA', n = 4)
65006600660065
My current strategy is to split txt into its characters by converting it into a list. Then I convert the characters into their ord values and append these to a new list. I then try to combine the elements of this new list into a number (e.g. I would go from ['102', '102', '102'] to ['102102102']). Then I try to convert the first (and only) element of this list into an integer. My current code looks like this:
def string_to_number(txt, n=3):
    characters = list(txt)
    ord_values = []
    for character in characters:
        ord_values.append(ord(character))
    joined_ord_values = ''.join(ord_values)
    final_number = int(joined_ord_values[0])
    return final_number
The issue is that I get a TypeError. I can write code that successfully returns the integer for a single-character string, but for strings with more than one character I can't, because of this TypeError. Is there any way of fixing this? Thank you, and apologies if this is quite long.
Try this:
def string_to_number(text, n=3):
    return int(''.join('{:0>{}}'.format(ord(c), n) for c in text))
print(string_to_number('fff'))
print(string_to_number('ABBA', n=4))
Output:
102102102
65006600660065
Edit: without a comprehension, as the OP asked in a comment
def string_to_number(text, n=3):
    l = []
    for c in text:
        l.append('{:0>{}}'.format(ord(c), n))
    return int(''.join(l))
Useful link(s):
string formatting in python: contains pretty much everything you need to know about string formatting in python
The join method expects an iterable of strings, so you'll need to convert your ASCII codes into strings. This almost gets it done:
ord_values.append(str(ord(character)))
except that it doesn't respect your number-of-digits requirement.
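One way to cover the number-of-digits requirement as well, sketched here as a minimal fix that keeps the structure and variable names of the question's code, is to pad each converted value with str.zfill. This is just one option; it assumes every code fits in n digits.

```python
def string_to_number(txt, n=3):
    ord_values = []
    for character in txt:
        # str() makes the value joinable; zfill pads it with leading zeros to n digits.
        ord_values.append(str(ord(character)).zfill(n))
    # Join the padded pieces and convert the whole string at once.
    return int(''.join(ord_values))

print(string_to_number('fff'))        # 102102102
print(string_to_number('ABBA', n=4))  # 65006600660065
```

The leading zero of the first block disappears in the second result because the final value is an int.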
For example, if I had the value '00010010', how would a simple function print it as "H"?
Other answers seem to be rather complicated or don't work at all.
You can use the chr and ord functions to convert between numbers and characters. In this case, given a binary number, you'll also need int to convert from a binary string to a Python integer.
For example:
>>> chr(int("00010010", 2))
'\x12'
This gives the ASCII character for the given input. Note that the binary "00010010" does not correspond to an "H" character in ASCII; it is the control character 0x12. The value of "H" can be found with the ord function:
>>> bin(ord("H"))
'0b1001000'
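Building on the two calls above, a small helper for longer inputs might look like the following sketch. The name binary_to_text is hypothetical, and the code assumes the input consists of complete 8-bit groups with no separators.

```python
def binary_to_text(bits):
    # Split the string into 8-character chunks and decode each as a base-2 integer.
    chunks = [bits[i:i + 8] for i in range(0, len(bits), 8)]
    return ''.join(chr(int(chunk, 2)) for chunk in chunks)

print(binary_to_text('01001000'))          # H
print(binary_to_text('0100100001101001'))  # Hi
```

Here '01001000' is the zero-padded 8-bit form of the value that bin(ord("H")) reports.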
In Python 3, Unicode characters can be printed like this:
print('\uFFFF')
But how can I print higher Unicode characters such as 001FFFFF? print('\u001FFFFF') just prints 001F as a Unicode character followed by four F characters. Using print('\u001F\uFFFF') results in two Unicode characters instead of the one I want. Is it possible to print the Unicode character 001FFFFF in Python 3?
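Use an upper-case U, which takes exactly eight hex digits. Note that 001FFFFF is above the highest valid code point (0x10FFFF), so Python rejects that particular escape; any value up to 0x10FFFF works:
print('\U0010FFFF')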
There is another way in Python 3, using the built-in function chr(i), which, according to the documentation, will
"Return the string representing a character whose Unicode code point is the integer i."
and
"The valid range for the argument is from 0 through 1,114,111 (0x10FFFF in base 16)."
so the number of hex digits is not limited, as long as the value stays within that range.
print(chr(97))
print(chr(0xFFFF))
print(chr(0x10080))
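To see where that range ends in practice, here is a small sketch. The helper safe_chr is a hypothetical name introduced for illustration; it simply wraps chr and returns None for values that chr rejects.

```python
import sys

def safe_chr(code_point):
    """Return the character for code_point, or None if it is out of range."""
    try:
        return chr(code_point)
    except (ValueError, OverflowError):
        # chr raises for arguments outside 0..0x10FFFF.
        return None

print(hex(sys.maxunicode))             # 0x10ffff
print(safe_chr(0x1FFFFF))              # None
print(chr(0x1F600) == '\U0001F600')    # True
```

The last line shows that chr with a hex literal and the eight-digit \U escape refer to the same character.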