Python3, how to encode this string correctly? - python-3.x

Disclaimer: I've already done a lot of research to solve this on my own, but most of the questions I found here concern Python 2.7 or don't solve my problem.
Let's say I have the following (the example comes from the BeautifulSoup docs; I'm trying to solve a bigger issue):
>>> markup = "<h1>Sacr\xc3\xa9 bleu!</h1>"
>>> print(markup)
<h1>SacrÃ© bleu!</h1>
To me, markup should be a bytes object, so that I could do:
>>> markup = b"<h1>Sacr\xc3\xa9 bleu!</h1>"
>>> print(str(markup, 'utf-8'))
<h1>Sacré bleu!</h1>
Yeah! But how do I get from "<h1>Sacr\xc3\xa9 bleu!</h1>" (which is wrong) to b"<h1>Sacr\xc3\xa9 bleu!</h1>"?
Because if I do:
>>> markup = "<h1>Sacr\xc3\xa9 bleu!</h1>"
>>> bytes(markup, "utf-8")
b'<h1>Sacr\xc3\x83\xc2\xa9 bleu!</h1>'
You see? It inserted \x83\xc2 for free.
>>> print(bytes(markup))
TypeError: string argument without an encoding

If you have the Unicode string "<h1>Sacr\xc3\xa9 bleu!</h1>", something has already gone wrong. Either your input is broken, or you did something wrong when processing it. For example, here, you've copied a Python 2 example into a Python 3 interpreter.
If you have your broken string because you did something wrong to get it, then you should really fix whatever it was you did wrong. If you need to convert "<h1>Sacr\xc3\xa9 bleu!</h1>" to b"<h1>Sacr\xc3\xa9 bleu!</h1>" anyway, then encode it in latin-1:
bytestring = broken_unicode.encode('latin1')
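As a minimal sketch of why this works (the variable names below are only illustrative): latin-1 maps every code point below 256 to the single byte with the same value, so encoding the broken string with it recovers the original bytes, which can then be decoded as UTF-8.

```python
# The mis-decoded string: UTF-8 bytes that were interpreted as latin-1 text.
broken_unicode = "<h1>Sacr\xc3\xa9 bleu!</h1>"

# latin-1 turns each character back into the single byte with the same value.
bytestring = broken_unicode.encode("latin-1")

# Those bytes are valid UTF-8, so decoding them yields the intended text.
fixed = bytestring.decode("utf-8")

print(bytestring)
print(fixed)
```

This only repairs the symptom; if the string was broken by your own processing, fixing that step is still the better option.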

Related

Sum of two numbers Python

This is my code, but Spyder keeps reporting an IndexError (list index out of range) for a = int(tokens[0]). Please advise.
import sys
input_ = sys.stdin.read()
tokens = input_.split()
a = int(tokens[0])
b = int(tokens[1])
print(a+b)
The code below also works. I saw someone run the code above on Linux and it worked, but I am on Windows, so I am wondering why it does not run properly for me. Thanks all!
def sum_of_two_digits(first_digit, second_digit):
    return first_digit + second_digit

if __name__ == '__main__':
    a, b = map(int, input().split())
    print(sum_of_two_digits(a, b))
To prove you're getting the input you expect, you can print(len(tokens)) or simply print(input_). But suffice it to say this is not a Linux/Windows issue. Rather, your tokens list is empty by the time you index into it (tokens[0]).
Nothing is ending up in input_. This may be because you're using read() and are entering the values in an unexpected way (I am guessing). input() will probably serve you better; note that the 'Linux' version you refer to uses input(). read() blocks until it receives end-of-file (Ctrl+D on Linux, Ctrl+Z followed by Enter on Windows), though that has probably already happened if you get as far as the list index error.
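A small sketch of the check described above, assuming the input arrives as one string; the helper name sum_from_text is hypothetical and only illustrates validating the token count before indexing.

```python
def sum_from_text(text):
    """Add the first two whitespace-separated integers found in text."""
    tokens = text.split()
    # An empty or short token list is what triggers the IndexError,
    # so report it explicitly instead of indexing blindly.
    if len(tokens) < 2:
        raise ValueError(f"expected two integers, got {len(tokens)} token(s)")
    return int(tokens[0]) + int(tokens[1])


print(sum_from_text("3 4\n"))
```

In a script you could pass sys.stdin.read() or input() to this function; with empty input it fails with a message that says what was actually received.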

pandas reading data from column in as float or int and not str despite dtype setting

I have an issue with pandas (0.23.4) on Python 3.7 where data is being read in as scientific notation instead of as a plain string, despite the dtype setting. Here is an example of the data being read in:
-------------------
codes
-------------------
001234544
00023455
123456789
A1253532
780E9000
00678E10
The problem comes with lines 5 and 6 of the above because they contain, I think, 'E' characters and are being turned into scientific notation.
My reader is set up as follows.
accounts = pd.read_excel('gym_accounts.xlsx', sheet_name='Sheet1', dtype=str)
Despite the dtype=str setting, it appears that pandas uses something called a "sniffer" that detects the data type automatically, so the values are being changed back to what I assume is float or int and then rendered in scientific notation. One suggestion in another thread is to use converters within read_csv, like the following:
pd.read_csv('my.csv', converters = {i: str for i in range(0, 100)})
I am curious whether this is a possible solution to my problem, but I also have no idea how long that range should be, as it changes often. Is there any way to query the length of the column and feed that as a variable into the range call?
It looks like I can do something like len(accounts.index), but I can't do this until after the reader has read the file, so something like the following doesn't work:
accounts = pd.read_excel('gym_accounts.xlsx', sheet_name='Sheet1', converters={i: str for i in range(0, gym_length)})
gym_length = len(accounts.index)
The length check comes after the (I guess you'd call it) data reader, so obviously it doesn't work.

convert ascii to decimal python

I have a pandas DataFrame in which one of the columns is filled with hex-encoded ASCII characters. I'm trying to convert this column from ASCII to decimal; for example, the following hex string should be converted from:
313533313936393239382e323834303638
to:
1531969298.284068
I've tried
outf['data'] = outf['data'].map(lambda x: bytearray.fromhex(x).decode())
as well as
outf['data'] = outf['data'].map(lambda x: ascii.fromhex(x).decode())
The error that I get is as follows:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbb in position 8: invalid start byte
I'm not sure where the problem manifests itself. I have a txt file and a sample of its contents are as follows:
data time
313533313936393239382e32373737343800 1531969299.283273000
313533313936393239382e32373838303400 1531969299.284253000
313533313936393239382e32373938353700 1531969299.285359000
When the data was normal integers, the lambda worked fine, where I used:
outf['data'] = outf['data'].astype(str)
outf['data'] = outf['data'].str[:-2:]
outf['data'] = outf['data'].map(lambda x: bytearray.fromhex(x).decode())
outf['data'] = outf['data'].astype(int)
However, now it says there's something wrong with the encoding.
I've looked on Stack Overflow, but I wasn't able to find anything similar that worked.
If someone were to help me out, I would very much appreciate it.
You can use map with a lambda that calls bytearray.fromhex, followed by astype to convert to float.
outf['data'].map(lambda x: bytearray.fromhex(x).decode()).astype(float)
Such lambda would do the trick:
>>> f = lambda v: float((bytearray.fromhex(v)))
>>> f('313533313936393239382e323834303638')
1531969298.284068
Note that the use of numpy's astype hinted by Scott Boston in the comment section may be better performance-wise.
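The sample rows in the question end in 00 (a NUL byte), which the question removes with .str[:-2]. As a hedged sketch, a plain-Python helper (the name hex_ascii_to_float is hypothetical) could strip trailing NUL bytes before converting, assuming every value is hex-encoded ASCII text of a decimal number:

```python
def hex_ascii_to_float(hex_text):
    """Decode hex-encoded ASCII digits (optionally NUL-terminated) to a float."""
    raw = bytes.fromhex(hex_text)
    # Drop any trailing NUL terminator such as the "00" in the sample rows.
    text = raw.rstrip(b"\x00").decode("ascii")
    return float(text)


print(hex_ascii_to_float("313533313936393239382e323834303638"))
print(hex_ascii_to_float("313533313936393239382e32373737343800"))
```

Decoding as ASCII raises UnicodeDecodeError on bytes such as 0xbb, so a failure here points to a row that does not actually hold hex-encoded ASCII rather than to the conversion itself.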

Python 3 Create md5 hash

I must decode the following data:
b'E\x00\x00 <\xed\x00\x00>\x01\x15\xe2\xac\x140f\xa1C\xacP\x00\x00\xf8V\x00\x07\x00\x07\x00\x01\x07\x9a'
into an understandable string. For that, we were told to use hashlib and md5. But I don't know how to use it to decipher this message.
I've tried something like this:
message.hashlib().md5().decode()
But I do not get any result.
You can't get there from here. A hash is a small, fixed-size digest of the data that destroys virtually all of the information in it. It is used to identify a particular revision of the data and can be used later to check whether the data has changed. A good hash algorithm changes its output dramatically with even a one-character change in the data. Consider A Midsummer Night's Dream on gutenberg.org. It's about 100,000 characters long and its MD5 hash is 16 bytes. You are not going to get the original back from that!
>>> import hashlib
>>> import requests
>>> night = requests.get("http://www.gutenberg.org/ebooks/1514.txt.utf-8")
>>> len(night.text)
112127
>>> print(night.text[20000:20200])
h power to say, Behold!
The jaws of darkness do devour it up:
So quick bright things come to confusion.
HERMIA
If then true lovers have ever cross'd,
It stands as an edict in destiny:
Then let
>>> print(night.text[20000:20300])
h power to say, Behold!
The jaws of darkness do devour it up:
So quick bright things come to confusion.
HERMIA
If then true lovers have ever cross'd,
It stands as an edict in destiny:
Then let us teach our trial patience,
Because it is a customary cross;
As due to love as thoughts, and dre
>>> hash = hashlib.md5(night.text.encode("utf-8")).hexdigest()
>>> print(hash)
cce0d35b8b2c4dafcbde3deb983fec0a
The hash can be very useful to see if the text has changed:
>>> hash2 = hashlib.md5(requests.get("http://www.gutenberg.org/ebooks/1514.txt.utf-8").text.encode("utf-8")).hexdigest()
>>> hash == hash2
True
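The same two properties can be seen without a network request. This is a minimal sketch using only hashlib; the helper md5_hex and the sample strings are illustrative and not part of the session above.

```python
import hashlib


def md5_hex(data):
    """Return the MD5 digest of a bytes object as a hex string."""
    return hashlib.md5(data).hexdigest()


line = b"So quick bright things come to confusion."
altered = b"So quick bright things come to confusion!"

# A one-character change produces a completely different digest.
print(md5_hex(line))
print(md5_hex(altered))

# The digest is 16 bytes no matter how much data goes in.
print(hashlib.md5(line * 1000).digest_size)
```

Because 41,000 bytes and 41 bytes both collapse to 16 bytes, there is no way to run the process in reverse.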
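I suggest you read the official documentation of hashlib.
Simple example:
import hashlib
text = 'Some text 2'
m = hashlib.md5()
# Feed the hash either a bytes literal ...
m.update(b"Some text")
# ... or an encoded str. Note that calling update() twice hashes the concatenation of both inputs.
m.update(text.encode('UTF-8'))
print(m.hexdigest())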

Can't pull out the information from object using Beautiful Soup 4

I am working (for the first time) with scraping a website. I am trying to pull the latitude (in decimal degrees) from a website. I have managed to pull out the correct parent node that contains the information, but I am stuck on how to pull out the actual number from this. All of the searching I have done has only told me how to pull it out if I know the string (which I don't) or if the string is in a child node, which it isn't. Any help would be great.
Here is my code:
a_string = soup.find(string="Latitude in decimal degrees")
a_string.find_parents("p")
Out[46]: [<p><b>Latitude in decimal degrees</b><font size="-2">
(<u>see definition</u>)
</font><b>:</b> 35.7584895</p>]
test = a_string.find_parents("p")
print(test)
[<p><b>Latitude in decimal degrees</b><font size="-2"> (<u>see definition</u>)</font>
<b>:</b> 35.7584895</p>]
I need to pull out the 35.7584895 and save it as an object so I can append it to a dataset.
I am using Beautiful Soup 4 and python 3
The first thing to notice is that, since you used the find_parents method (plural), test is a list. You need only its first item.
I will simulate your situation like this.
>>> import bs4
>>> HTML = '<p><b>Latitude in decimal degrees</b><font size="-2"> (<u>see definition</u>)</font><b>:</b> 35.7584895</p>'
>>> item_soup = bs4.BeautifulSoup(HTML, 'lxml')
The simplest way of recovering the textual content of this is to do this:
>>> item_soup.text
'Latitude in decimal degrees (see definition): 35.7584895'
However, you want the number. You can get this in various ways, two of which come to mind. I assign the result of the previous statement to text so that I can manipulate it.
>>> text = item_soup.text
One way is to search for the colon.
>>> text[1+text.rfind(':'):].strip()
'35.7584895'
The other is to use a regex.
>>> import re
>>> re.search(r'(\d+\.\d+)', text).group(1)
'35.7584895'
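Since the goal is to append the value to a dataset, it may help to convert it to a float. The following is a sketch under the assumption that the number always follows the last colon in the paragraph text; the function name extract_number_after_colon is hypothetical.

```python
import re


def extract_number_after_colon(text):
    """Return the decimal number following the last colon in text, as a float."""
    # If there is no colon, rfind returns -1 and the whole text is searched.
    tail = text[text.rfind(":") + 1:]
    # Allow an optional minus sign so southern/western coordinates also match.
    match = re.search(r"-?\d+\.\d+", tail)
    if match is None:
        raise ValueError("no decimal number found after the last colon")
    return float(match.group(0))


latitude = extract_number_after_colon(
    "Latitude in decimal degrees (see definition): 35.7584895"
)
print(latitude)
```

You would pass it the .text of the first item returned by find_parents("p").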
