Bytes in an string to bytes [duplicate] - python-3.x

I've downloaded some web pages with requests and saved the content in a postgres database [in a text field] using Django's ORM. For some sudocode of what's going on, here ya go:
art = Article()
page = requests.get("http://example.com")
art.raw_html = page.content
art.save()
I verified that page.content is a bytes object, and I guess I assumed that this object would automatically be decoded upon saving, but it doesn't seem to be... it has been converted to some weird string representation of a bytes object, ostensibly by Django. It looks like this in the interpreter when I call art.raw_html:
'b\'<!DOCTYPE html>\\n<html lang="en" class="pb-page"
And if I call it with print I get this:
b'<!DOCTYPE html>\n<html lang="en" class="pb-page"
And for the life of me I can't re-encode it to a bytes object, even if I trim off the leading b' and trailing '.
I feel like there's an easy solution to this and I feel like an idiot... but after lots of experiments and googling, I'm not figuring it out.
Incidentally, if I manually copy what's returned from the print statement (like with my cursor), I can convert the clipboard contents back to a bytes object just fine and then decode it into some readably-formatted html.
Clearly there is a better way. (And yes, going forward I'll stop saving the content like this in the first place.)

You can use eval or ast.literal_eval as below.
data = "b'gAAAAABc1arg48DmsOwQEbeiuh-FQoNSRnCOk9OvXXOE2cbBe2A46gmP6SPyymDft1yp5HsoHEzXe0KljbsdwTgPG5jCyhMmaA=='"
eval(data)
b'gAAAAABc1arg48DmsOwQEbeiuh-FQoNSRnCOk9OvXXOE2cbBe2A46gmP6SPyymDft1yp5HsoHEzXe0KljbsdwTgPG5jCyhMmaA=='
Using ast.literal_eval
import ast
ast.literal_eval(data)
thanks to #juanpa.arrivillaga. I just added to answer.

Related

Creating marker on map using Folium results in blank HTML page

I tried to create a Map using folium library in python3. It work's fine until i add a Marker to the map. On adding Marker the output results in just a blank HTML page.
import folium
map = folium.Map(location=[20.59,78.96], tiles = "Mapbox Bright")
folium.Marker([26.80,80.76],popup="Hey, It's Moradabad").add_to(map)
map.save("mapOutput.html")
#MCO is absolutely correct.
I like to leverage html.escape() to handle the problem characters like so
import folium
import html
map = folium.Map(location=[20.59,78.96], tiles = "Mapbox Bright")
folium.Marker([26.80,80.76],popup=html.escape("Hey, It's Moradabad")).add_to(map)
map.save("mapOutput.html")
in my experience, folium is very sensitive to single quotes (').
The reason being, folium generates javascript, and transforms your strings into javascript strings, which it encloses with single quotes. It does, however, not escape any single quotes contained in the string (not even sure if that's possible in js), so having a single quote in the string has the same effect as not closing a string in python.
Solution: Replace all single quotes in your strings with backticks (`) or (better) use #Bob Haffner's answer.
Edit: out of sheer coincidence i stumbled over another solution just now. Apparently folium.Popup has an argument parse_html. Use it like this:
folium.Marker([x,y], popup=folium.Popup("Foo'Bar",parse_html=True)).add_to(map)
however, i could not make it work without running into a unicodeescape error. nonetheless it exists, supposedly for this purpose, and might work for you.
Edit 2: As i've been told on github, this issue will be fixed in the next release. Ref: folium#962

Google Protocol Buffers (protobuf) in Python3 - trouble with ParseFromString (encoding?)

I've got Google Protocol buffers 80% working in Python3. My .proto file works, I'm encoding data, life is almost good.
The problem is that I can't ParseFromString the result of SerializeToString.
When I print SerializeToString it looks like what I'd expect, a fairly compact binary representation (preceded by b').
My guess is that perhaps this is a difference in the way Python2 and Python3 handle strings. The putput of SerializeToString is Bytes, not a string.
Printed output of SerializeToString (Python type is ):
b'\x10\xd7\xeb\x8e\xcd\x04\x1a\x0cnamegoeshere2#\x08\x80\xf8\xde\xc3\x9f\xb0\x81\x89\x14\x11\x00\x00\x00\x00\x00\x80d\xc0\x19\x00\x00\x00\x00\x00\xc0m#!\x00\x00\x00\x00\x00\x80R\xc0)\x00\x00\x00\x00\x00x\xb7\xc01\x00\x00\x00\x00\x00\x8c\x95#9\x00\x00\x00\x00\x00\x16\xb2#'
result of ParseFromString(message):
None
No error is provided...
So - my best guess is that all I need to do is .decode() the bytes object generated, the problem is that I have no clue what the encoding is. I've tried UTF-8, -16, Latin-1, and a few others without success. My Google-Fu is strong but I haven't found anything on this.
Any help would be appreciated.
ParseFromString is a method -- it does not return anything, but rather fills in self with the parsed content. Use it like:
message = MyMessageType()
message.ParseFromString(data)
print(message.some_field)

How to convert between bytes and strings in Python 3?

This is a Python 101 type question, but it had me baffled for a while when I tried to use a package that seemed to convert my string input into bytes.
As you will see below I found the answer for myself, but I felt it was worth recording here because of the time it took me to unearth what was going on. It seems to be generic to Python 3, so I have not referred to the original package I was playing with; it does not seem to be an error (just that the particular package had a .tostring() method that was clearly not producing what I understood as a string...)
My test program goes like this:
import mangler # spoof package
stringThing = """
<Doc>
<Greeting>Hello World</Greeting>
<Greeting>你好</Greeting>
</Doc>
"""
# print out the input
print('This is the string input:')
print(stringThing)
# now make the string into bytes
bytesThing = mangler.tostring(stringThing) # pseudo-code again
# now print it out
print('\nThis is the bytes output:')
print(bytesThing)
The output from this code gives this:
This is the string input:
<Doc>
<Greeting>Hello World</Greeting>
<Greeting>你好</Greeting>
</Doc>
This is the bytes output:
b'\n<Doc>\n <Greeting>Hello World</Greeting>\n <Greeting>\xe4\xbd\xa0\xe5\xa5\xbd</Greeting>\n</Doc>\n'
So, there is a need to be able to convert between bytes and strings, to avoid ending up with non-ascii characters being turned into gobbledegook.
The 'mangler' in the above code sample was doing the equivalent of this:
bytesThing = stringThing.encode(encoding='UTF-8')
There are other ways to write this (notably using bytes(stringThing, encoding='UTF-8'), but the above syntax makes it obvious what is going on, and also what to do to recover the string:
newStringThing = bytesThing.decode(encoding='UTF-8')
When we do this, the original string is recovered.
Note, using str(bytesThing) just transcribes all the gobbledegook without converting it back into Unicode, unless you specifically request UTF-8, viz., str(bytesThing, encoding='UTF-8'). No error is reported if the encoding is not specified.
In python3, there is a bytes() method that is in the same format as encode().
str1 = b'hello world'
str2 = bytes("hello world", encoding="UTF-8")
print(str1 == str2) # Returns True
I didn't read anything about this in the docs, but perhaps I wasn't looking in the right place. This way you can explicitly turn strings into byte streams and have it more readable than using encode and decode, and without having to prefex b in front of quotes.
This is a Python 101 type question,
It's a simple question but one where the answer is not so simple.
In python3, a "bytes" object represents a sequence of bytes, a "string" object represents a sequence of unicode code points.
To convert between from "bytes" to "string" and from "string" back to "bytes" you use the bytes.decode and string.encode functions. These functions take two parameters, an encoding and an error handling policy.
Sadly there are an awful lot of cases where sequences of bytes are used to represent text, but it is not necessarily well-defined what encoding is being used. Take for example filenames on unix-like systems, as far as the kernel is concerned they are a sequence of bytes with a handful of special values, on most modern distros most filenames will be UTF-8 but there is no gaurantee that all filenames will be.
If you want to write robust software then you need to think carefully about those parameters. You need to think carefully about what encoding the bytes are supposed to be in and how you will handle the case where they turn out not to be a valid sequence of bytes for the encoding you thought they should be in. Python defaults to UTF-8 and erroring out on any byte sequence that is not valid UTF-8.
print(bytesThing)
Python uses "repr" as a fallback conversion to string. repr attempts to produce python code that will recreate the object. In the case of a bytes object this means among other things escaping bytes outside the printable ascii range.
TRY THIS:
StringVariable=ByteVariable.decode('UTF-8','ignore')
TO TEST TYPE:
print(type(StringVariable))
Here 'StringVariable' represented as a string. 'ByteVariable' represent as Byte. Its not relevent to question Variables..

artist working with base 64

Hi Im fairly new to base64 code and could do with some info/help etc. I have recently used it while converting my photos into code as part of an experiment regarding my sculptures, I converted a photo into 152 pages of base 64 using an online converter, just to get started. I was wondering how I can reverse this, and also if I edited out some code would the image change or would it not convert at all!? Like I said I am interested in using this to help my experiments. thanks for any info .
Base64 decode functions will decode it. If you remove some of the base64 code it may not translate back into a valid image or translate into valid base64 code at all depending on how much you remove. Depending on how the image was originally compressed, it MAY give you something, but the results will be incredibly unpredictable... So you'll never know what the result of the change will be on the final product. Part of the base64 standard involves a padded width (which is why some strings end in up to 2 = signs).

Groovy says my Unicode string is too long

As part of my probably wrong and cumbersome solution to print out a form I have taken a MS-Word document, saved as XML and I'm trying to store that XML as a groovy string so that I can ${fillOutTheFormProgrammatically}
However, with MS-Word documents being as large as they are, the String is 113100 unicode characters and Groovy says its limited to 65536. Is there some way to change this or am I stuck with splitting up the string?
Groovy - need to make a printable form
That's what I'm trying to do.
Update: to be clear its too long of a Groovy String.. I think a regular string might be all good. Going to change strategy and put some strings in the file I can easily find like %!%variable_name%!% and then do .replace(... uh i feel a new question coming on here...
Are you embedding this string directly in your groovy code? The jvm itself has a limit on the length of string constants, see the VM Spec if you are interested in details.
A ugly workaround might be to split the string in smaller parts and concatenate them at runtime. A better solution would be to save the text in an external file and read the contents from your code. You could also package this file along with your code and access it from the classpath using Class#getResourceAsStream.

Resources