How to use f'string bytes'string together? [duplicate] - python-3.x

I'm looking for a formatted byte string literal. Specifically, something equivalent to
name = "Hello"
bytes(f"Some format string {name}")
Possibly something like fb"Some format string {name}".
Does such a thing exist?

No. The idea is explicitly dismissed in the PEP:
For the same reason that we don't support bytes.format(), you may
not combine 'f' with 'b' string literals. The primary problem
is that an object's __format__() method may return Unicode data
that is not compatible with a bytes string.
Binary f-strings would first require a solution for
bytes.format(). This idea has been proposed in the past, most
recently in PEP 461. The discussions of such a feature usually
suggest either
adding a method such as __bformat__() so an object can control how it is converted to bytes, or
having bytes.format() not be as general purpose or extensible as str.format().
Both of these remain as options in the future, if such functionality
is desired.

In 3.6+ you can do:
>>> a = 123
>>> f'{a}'.encode()
b'123'

You were actually super close in your suggestion; if you add an encoding kwarg to your bytes() call, then you get the desired behavior:
>>> name = "Hello"
>>> bytes(f"Some format string {name}", encoding="utf-8")
b'Some format string Hello'
Caveat: This works in 3.8 for me, but note at the bottom of the Bytes Object headline in the docs seem to suggest that this should work with any method of string formatting in all of 3.x (using str.format() for versions <3.6 since that's when f-strings were added, but the OP specifically asks about 3.6+).

From python 3.6.2 this percent formatting for bytes works for some use cases:
print(b"Some stuff %a. Some other stuff" % my_byte_or_unicode_string)
But as AXO commented:
This is not the same. %a (or %r) will give the representation of the string, not the string iteself. For example b'%a' % b'bytes' will give b"b'bytes'", not b'bytes'.
Which may or may not matter depending on if you need to just present the formatted byte_or_unicode_string in a UI or if you potentially need to do further manipulation.

As noted here, you can format this way:
>>> name = b"Hello"
>>> b"Some format string %b World" % name
b'Some format string Hello World'
You can see more details in PEP 461
Note that in your example you could simply do something like:
>>> name = b"Hello"
>>> b"Some format string " + name
b'Some format string Hello'

This was one of the bigger changes made from python 2 to python3. They handle unicode and strings differently.
This s how you'd convert to bytes.
string = "some string format"
string.encode()
print(string)
This is how you'd decode to string.
string.decode()
I had a better appreciation for the difference between Python 2 versus 3 change to unicode through this coursera lecture by Charles Severence. You can watch the entire 17 minute video or fast forward to somewhere around 10:30 if you want to get to the differences between python 2 and 3 and how they handle characters and specifically unicode.
I understand your actual question is how you could format a string that has both strings and bytes.
inBytes = b"testing"
inString = 'Hello'
type(inString) #This will yield <class 'str'>
type(inBytes) #this will yield <class 'bytes'>
Here you could see that I have a string a variable and a bytes variable.
This is how you would combine a byte and string into one string.
formattedString=(inString + ' ' + inBytes.encode())

Related

How to use python to convert a backslash in to forward slash for naming the filepaths in windows OS?

I have a problem in converting all the back slashes into forward slashes using Python.
I tried using the os.sep function as well as the string.replace() function to accomplish my task. It wasn't 100% successful in doing that
import os
pathA = 'V:\Gowtham\2019\Python\DailyStandup.txt'
newpathA = pathA.replace(os.sep,'/')
print(newpathA)
Expected Output:
'V:/Gowtham/2019/Python/DailyStandup.txt'
Actual Output:
'V:/Gowtham\x819/Python/DailyStandup.txt'
I am not able to get why the number 2019 is converted in to x819. Could someone help me on this?
Your issue is already in pathA: if you print it out, you'll see that it already as this \x81 since \201 means a character defined by the octal number 201 which is 81 in hexadecimal (\x81). For more information, you can take a look at the definition of string literals.
The quick solution is to use raw strings (r'V:\....'). But you should take a look at the pathlib module.
Using the raw string leads to the correct answer for me.
import os
pathA = r'V:\Gowtham\2019\Python\DailyStandup.txt'
newpathA = pathA.replace(os.sep,'/')
print(newpathA)
OutPut:
V:/Gowtham/2019/Python/DailyStandup.txt
Try this, Using raw r'your-string' string format.
>>> import os
>>> pathA = r'V:\Gowtham\2019\Python\DailyStandup.txt' # raw string format
>>> newpathA = pathA.replace(os.sep,'/')
Output:
>>> print(newpathA)
V:/Gowtham/2019/Python/DailyStandup.txt

How does ruamel.yaml determine the encoding of escaped byte sequences in a string?

I am having trouble figuring out where to modify or configure ruamel.yaml's loader to get it to parse some old YAML with the correct encoding. The essence of the problem is that an escaped byte sequence in the document seems to be interpreted as latin1, and I have no earthly clue where it is doing that, after some source diving here. Here is a code sample that demonstrates the behavior (this in particular was run in Python 3.6):
from ruamel.yaml import YAML
yaml = YAML()
yaml.load('a:\n b: "\\xE2\\x80\\x99"\n') # Note that this is a str (that is, unicode) with escapes for the byte escapes in the YAML document
# ordereddict([('a', ordereddict([('b', 'â\x80\x99')]))])
Here are the same bytes decoded manually, just to show what it should parse to:
>>> b"\xE2\x80\x99".decode('utf8')
'’'
Note that I don't really have any control over the source document, so modifying it to produce the correct output with ruamel.yaml is out of the question.
ruamel.yaml doesn't interpret individual strings, it interprets the
stream it gets hanled, i.e. the argument to .load(). If that
argument is a byte-stream or a file like object then its encoding is
determined based on the BOM, defaulting to UTF-8. But again: that is
at the stream level, not at individual scalar content after
interpreting escapes. Since you hand .load() Unicode (as this is
Python 3) that "stream" needs no further decoding. (Although
irrelevant for this question: it is done in the reader.py:Reader methods stream and
determine_encoding)
The hex escapes (of the form \xAB), will just put a specific hex
value in the type the loader uses to construct the scalar, that is
value for key 'b', and that is a normal Python 3 str i.e. Unicode in
one of its internal representations. That you get the â in your
output is because of how your Python is configured to decode it str
tyes.
So you won't "find" the place where ruamel.yaml decodes that
byte-sequence, because that is already assumed to be Unicode.
So the thing to do is that you double decode your double quoted
scalars (you only have to address those as plain, single quoted,
literal/folded scalars cannot have the hex escapes). There are various
points at which you can try to do that, but I think
constructor.py:RoundTripConsturtor.construct_scalar and
scalarstring.py:DoubleQuotedScalarString are the best candidates. The former of those might take some digging to find, but the latter is actually the type you'll get if you inspect
that string after loading when you add the option to preserve quotes:
yaml = ruamel.yaml.YAML()
yaml.preserve_quotes = True
data = yaml.load('a:\n b: "\\xE2\\x80\\x99"\n')
print(type(data['a']['b']))
which prints:
<class 'ruamel.yaml.scalarstring.DoubleQuotedScalarString'>
knowing that you can inspect that rather simple wrapper class:
class DoubleQuotedScalarString(ScalarString):
__slots__ = ()
style = '"'
def __new__(cls, value, anchor=None):
# type: (Text, Any) -> Any
return ScalarString.__new__(cls, value, anchor=anchor)
"update" the only method there (__new__) to do your double
encoding (you might have to put in additional checks to not double encode all
double quoted scalars0:
import sys
import codecs
import ruamel.yaml
def my_new(cls, value, anchor=None):
# type information only needed if using mypy
# value is of type 'str', decode to bytes "without conversion", then encode
value = value.encode('latin_1').decode('utf-8')
return ruamel.yaml.scalarstring.ScalarString.__new__(cls, value, anchor=anchor)
ruamel.yaml.scalarstring.DoubleQuotedScalarString.__new__ = my_new
yaml = ruamel.yaml.YAML()
yaml.preserve_quotes = True
data = yaml.load('a:\n b: "\\xE2\\x80\\x99"\n')
print(data)
which gives:
ordereddict([('a', ordereddict([('b', '’')]))])

Python instructional code to print long strings with line breaks

What's wrong with this code? Except for the print statement, it is the direct answer code from a Udacity learning python lesson. It suggests br as an html response, but to me that didn't make sense in python. The python run results print the letters <BR> between every letter of the string.
def breakify(strings):
return "<br>".join(strings)
print(breakify("Haiku frogs in snow" "A limerick came from Nantucket" "Tetrametric drum-beats thrumming,"))
Output:
H<br>a<br>i<br>k<br>u<br> <br>f<br>r<br>o<br>g<br>s<br> <br>i<br>n<br> <br>s<br>n<br>o<br>w<br>A<br> <br>l<br>i<br>m<br>e<br>r<br>i<br>c<br>k<br> <br>c<br>a<br>m<br>e<br> <br>f<br>r<br>o<br>m<br> <br>N<br>a<br>n<br>t<br>u<br>c<br>k<br>e<br>t<br>T<br>e<br>t<br>r<br>a<br>m<br>e<br>t<br>r<br>i<br>c<br> <br>d<br>r<br>u<br>m<br>-<br>b<br>e<br>a<br>t<br>s<br> <br>t<br>h<br>r<br>u<br>m<br>m<br>i<br>n<br>g<br>,
The strings are being concatenated due to string literal concatenation.
Simply put them in a list (or tuple) and separate them with commas.
Example with shorter strings for readability:
print(breakify(["Haiku", "limerick", "drum"]))
Output:
Haiku<br>limerick<br>drum
You got the output you did because str.join takes any iterable, and a string is an iterable. For example:
>>> '.'.join('hello')
'h.e.l.l.o'

How do I convert a Python 3 byte-string variable into a regular string? [duplicate]

This question already has answers here:
Convert bytes to a string
(22 answers)
Closed 2 years ago.
I have read in an XML email attachment with
bytes_string=part.get_payload(decode=False)
The payload comes in as a byte string, as my variable name suggests.
I am trying to use the recommended Python 3 approach to turn this string into a usable string that I can manipulate.
The example shows:
str(b'abc','utf-8')
How can I apply the b (bytes) keyword argument to my variable bytes_string and use the recommended approach?
The way I tried doesn't work:
str(bbytes_string, 'utf-8')
You had it nearly right in the last line. You want
str(bytes_string, 'utf-8')
because the type of bytes_string is bytes, the same as the type of b'abc'.
Call decode() on a bytes instance to get the text which it encodes.
str = bytes.decode()
How to filter (skip) non-UTF8 charachers from array?
To address this comment in #uname01's post and the OP, ignore the errors:
Code
>>> b'\x80abc'.decode("utf-8", errors="ignore")
'abc'
Details
From the docs, here are more examples using the same errors parameter:
>>> b'\x80abc'.decode("utf-8", "replace")
'\ufffdabc'
>>> b'\x80abc'.decode("utf-8", "backslashreplace")
'\\x80abc'
>>> b'\x80abc'.decode("utf-8", "strict")
Traceback (most recent call last):
...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0:
invalid start byte
The errors argument specifies the response when the input string can’t be converted according to the encoding’s rules. Legal values for this argument are 'strict' (raise a UnicodeDecodeError exception), 'replace' (use U+FFFD, REPLACEMENT CHARACTER), or 'ignore' (just leave the character out of the Unicode result).
UPDATED:
TO NOT HAVE ANY b and quotes at first and end
How to convert bytes as seen to strings, even in weird situations.
As your code may have unrecognizable characters to 'utf-8' encoding,
it's better to use just str without any additional parameters:
some_bad_bytes = b'\x02-\xdfI#)'
text = str( some_bad_bytes )[2:-1]
print(text)
Output: \x02-\xdfI
if you add 'utf-8' parameter, to these specific bytes, you should receive error.
As PYTHON 3 standard says, text would be in utf-8 now with no concern.

Python 3.4: str : AttributeError: 'str' object has no attribute 'decode

I have this code part of a function that replace badly encoded foreign characters from a string :
s = "String from an old database with weird mixed encodings"
s = str(bytes(odbc_str.strip(), 'cp1252'))
s = s.replace('\\x82', 'é')
s = s.replace('\\x8a', 'è')
(...)
print(s)
# b"String from an old database with weird mixed encodings"
I need here a "real" string, not bytes. But whend i want to decode them, i have an exception :
s = "String from an old database with weird mixed encodings"
s = str(bytes(odbc_str.strip(), 'cp1252'))
s = s.replace('\\x82', 'é')
s = s.replace('\\x8a', 'è')
(...)
print(s.decode("utf-8"))
# AttributeError: 'str' object has no attribute 'decode'
Do you know why s is bytes here ?
Why can't i decode it to a real string ?
Do you know how to do it the clean way ? (today i return s[2:][:-1]. Working but very ugly, and i would like to understand this behavior)
Thanks in advance !
EDIT :
pypyodbc in python3 use all unicode by default. That confused me. On connect, you can tell him to use ANSI.
con_odbc = pypyodbc.connect("DSN=GP", False, False, 0, False)
Then, i can convert the returned stuffs into cp850, which is the initial codepage of the database.
str(odbc_str, "cp850", "replace")
No more need to manualy replace each special character.
Thank you very much pepr
The printed b"String from an old database with weird mixed encodings" is not the representation of the string content. It is the value of the string content. As you did not pass the encoding argument to str()... (see the doc https://docs.python.org/3.4/library/stdtypes.html#str)
If neither encoding nor errors is given, str(object) returns object.__str__(), which is the “informal” or nicely printable string representation of object. For string objects, this is the string itself. If object does not have a __str__() method, then str() falls back to returning repr(object).
This is what happened in your case. The b" are actually two characters that are the part of the string content. You can also try:
s1 = 'String from an old database with weird mixed encodings'
print(type(s1), repr(s1))
by = bytes(s1, 'cp1252')
print(type(by), repr(by))
s2 = str(by)
print(type(s2), repr(s2))
and it prints:
<class 'str'> 'String from an old database with weird mixed encodings'
<class 'bytes'> b'String from an old database with weird mixed encodings'
<class 'str'> "b'String from an old database with weird mixed encodings'"
This is the reason why s[2:][:-1] works for you.
If you think more about it, then (in my opinion) or you want to get bytes or bytearray from the database (if possible), and to fix the bytes (see bytes.translate https://docs.python.org/3.4/library/stdtypes.html?highlight=translate#bytes.translate) or you successfully get the string (being lucky that there was no exception when constructing that string), and you want to replace the wrong characters by the correct characters (see also str.translate() https://docs.python.org/3.4/library/stdtypes.html?highlight=translate#str.translate).
Possibly, the ODBC used internally the wrong encoding. (That is the content of the database may be correct, but it was misinterpreted by the ODBC, and you are not able to tell the ODBC what is the correct encoding.) Then you want to encode the string back to bytes using that wrong encoding, and then decode the bytes using the right encoding.

Resources