Python 3 Create md5 hash - python-3.x

I must decode the following data:
b'E\x00\x00 <\xed\x00\x00>\x01\x15\xe2\xac\x140f\xa1C\xacP\x00\x00\xf8V\x00\x07\x00\x07\x00\x01\x07\x9a'
into an understandable string. For that, we were told to use hashlib and md5, but I don't know how to use them to decipher this message.
I've tried something like this:
message.hashlib().md5().decode()
But I don't get any result.

You can't get there from here. A hash is a small, fixed-size digest of data that destroys virtually all of the information in the data. It is used to identify a revision of the data and can be used later to see if the data has changed. A good hash algorithm changes its output dramatically with even a one-character change in the data. Consider A Midsummer Night's Dream on gutenberg.org. It's about 100,000 characters and its md5 hash is 16 bytes. You are not going to get the original back from that!
>>> import hashlib
>>> import requests
>>> night = requests.get("http://www.gutenberg.org/ebooks/1514.txt.utf-8")
>>> len(night.text)
112127
>>> print(night.text[20000:20200])
h power to say, Behold!
The jaws of darkness do devour it up:
So quick bright things come to confusion.
HERMIA
If then true lovers have ever cross'd,
It stands as an edict in destiny:
Then let
>>> print(night.text[20000:20300])
h power to say, Behold!
The jaws of darkness do devour it up:
So quick bright things come to confusion.
HERMIA
If then true lovers have ever cross'd,
It stands as an edict in destiny:
Then let us teach our trial patience,
Because it is a customary cross;
As due to love as thoughts, and dre
>>> hash = hashlib.md5(night.text.encode("utf-8")).hexdigest()
>>> print(hash)
cce0d35b8b2c4dafcbde3deb983fec0a
The hash can be very useful to see if the text has changed:
>>> hash2 = hashlib.md5(requests.get("http://www.gutenberg.org/ebooks/1514.txt.utf-8").text.encode("utf-8")).hexdigest()
>>> hash == hash2
True
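The sensitivity to small changes described above can be seen without downloading anything. This is only a sketch: the two byte strings are made-up inputs (a line from the excerpt, with and without a trailing "!"), and the output is not meant to match any digest shown earlier.

```python
import hashlib

original = b"The jaws of darkness do devour it up"
changed = b"The jaws of darkness do devour it up!"  # one extra character

digest_original = hashlib.md5(original).hexdigest()
digest_changed = hashlib.md5(changed).hexdigest()

# Count how many of the 32 hex characters differ between the two digests.
differing = sum(a != b for a, b in zip(digest_original, digest_changed))
print(digest_original)
print(digest_changed)
print(f"{differing} of {len(digest_original)} hex characters differ")
```

Both digests have the same length regardless of the input size, which is why the original text cannot be rebuilt from them.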

I suggest you read the official documentation of hashlib.
Simple example:
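import hashlib
text = 'Some text 2'
m = hashlib.md5()
# Feed the hash either a bytes literal ...
# m.update(b"Some text")
# ... or an encoded str. Calling update() twice would hash the concatenation.
m.update(text.encode('UTF-8'))
print(m.hexdigest())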
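For comparison with the attempt in the question, here is a minimal sketch of the call order hashlib expects, applied to the bytes from the question. It only shows how to compute the digest; as explained above, the result is a fingerprint of the data, not a decoded message.

```python
import hashlib

# The bytes from the question, copied verbatim.
message = b'E\x00\x00 <\xed\x00\x00>\x01\x15\xe2\xac\x140f\xa1C\xacP\x00\x00\xf8V\x00\x07\x00\x07\x00\x01\x07\x9a'

digest = hashlib.md5(message)   # hashlib is the module, md5 the constructor
print(len(message))             # length of the input in bytes
print(digest.digest_size)       # md5 always produces 16 bytes
print(digest.hexdigest())       # 32 hex characters, not readable text
```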

Related

Remove non meaningful characters in pandas dataframe

I am trying to remove all
\xf0\x9f\x93\xa2, \xf0\x9f\x95\x91\n\, \xe2\x80\xa6,\xe2\x80\x99t
type characters from the strings below in a pandas column. Although the text starts with b', it's a string.
Text
_____________________________________________________
"b'Hello! \xf0\x9f\x93\xa2 End Climate Silence is looking for volunteers! \n\n1-2 hours per week. \xf0\x9f\x95\x91\n\nExperience doing digital research\xe2\x80\xa6
"b'I doubt if climate emergency 8s real, I think people will look ba\xe2\x80\xa6 '
"b'No, thankfully it doesn\xe2\x80\x99t. Can\xe2\x80\x99t see how cheap to overtourism in the alan alps can h\xe2\x80\xa6"
"b'Climate Change Poses a WidelllThreat to National Security "
"b""This doesn't feel like targeted propaganda at all. I mean states\xe2\x80\xa6"
"b'berates climate change activist who confronted her in airport\xc2\xa0
The above content is a column in a pandas dataframe.
I have tried
string.encode('ascii', errors= 'ignore')
and regex, but without luck. It would be helpful to get some suggestions.
Your strings look like byte strings but are actually ordinary str objects, so encode/decode doesn't help. Try something like this:
>>> df['text'].str.replace(r'\\x[0-9a-f]{2}', '', regex=True)
0 b'Hello! End Climate Silence is looking for v...
1 b'I doubt if climate emergency 8s real, I thin...
2 b'No, thankfully it doesnt. Cant see how cheap...
3 b'Climate Change Poses a WidelllThreat to Nati...
4 b""This doesn't feel like targeted propaganda ...
5 b'berates climate change activist who confront...
Name: text, dtype: object
Note you have to clean your unbalanced single/double quotes and remove the first 'b' character.
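The same pattern can be tried on a single string with the standard re module, which makes it easier to see what the regex matches. This is a sketch: strip_hex_escapes is a hypothetical helper name, and the sample is a shortened copy of one row from the question, written as a raw string so the backslashes are literal characters.

```python
import re

# Matches a literal backslash, an "x", and two lowercase hex digits.
HEX_ESCAPE = re.compile(r"\\x[0-9a-f]{2}")

def strip_hex_escapes(text: str) -> str:
    """Remove literal backslash-x-hex-hex sequences from text."""
    return HEX_ESCAPE.sub("", text)

sample = r"b'No, thankfully it doesn\xe2\x80\x99t. Can\xe2\x80\x99t see"
cleaned = strip_hex_escapes(sample)
print(cleaned)
```

The leading b' is left in place here, so it still has to be removed separately, as the note above says.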
You could go through your strings and keep only ascii characters:
my_str = "b'Hello! \xf0\x9f\x93\xa2 End Climate Silence is looking for volunteers! \n\n1-2 hours per week. \xf0\x9f\x95\x91\n\nExperience doing digital research\xe2\x80\xa6"
new_str = "".join(c for c in my_str if c.isascii())
print(new_str)
Note that .encode('ascii', errors= 'ignore') doesn't change the string it's applied to but returns the encoded result (a bytes object). This should work:
new_str = my_str.encode('ascii',errors='ignore')
print(new_str)
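Since encode returns bytes, a decode step is needed if a str is wanted back. A small sketch, assuming the non-ASCII characters are real characters in the string (as in the my_str literal above) rather than literal backslash sequences:

```python
my_str = "Experience doing digital research\xe2\x80\xa6"

as_bytes = my_str.encode("ascii", errors="ignore")  # non-ASCII characters are dropped
as_text = as_bytes.decode("ascii")                  # back to an ordinary str

print(as_bytes)
print(as_text)
```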

How to use porterstemmer

I have a data set that includes restaurant reviews. I've processed my data, and this is what my data set looks like (the trailing 0 or 1 shows whether the review is negative or positive):
0 ['wow', 'loved', 'place'] 1
1 ['crust', 'good'] 0
2 ['not', 'tasty', 'texture', 'nasty'] 0
3 ['stopped', 'late', 'may', 'bank', 'holiday', ... 1
4 ['the', 'selection', 'menu', 'great', 'prices'] 1
To be brief, I want to use PorterStemmer, and this is how I tried to use it:
for i in range(1000):
    for word in df['Review'][i]:
        word = stemmer.stem(word)
I tried to use PorterStemmer for stemming, but it did not work. No word was stemmed (for example, in the first row I expected 'loved' to become 'love'). My data is still the same as the dataframe I shared above, and I could not fix this.
Your code - which would be much easier to run/debug if it were a minimal reproducible example - is missing a step that stores the result of the stemming back into the dataframe:
for i in range(1000):
    stemmed = []
    for word in df['Review'][i]:
        word = stemmer.stem(word)
        stemmed.append(word)
    df.at[i, 'Review'] = stemmed ########## added: store the stemmed list
If you also add (inside the inner loop):
print( f"{word=}" )
the stemming output is:
word='wow'
word='love'
word='place'
word='crust'
word='good'
word='not'
word='tasti'
word='textur'
word='nasti'
word='stop'
word='late'
word='may'
word='bank'
word='holiday'
word='the'
word='select'
word='menu'
word='great'
word='price'
Next time you ask a question you should make the code a minimal reproducible example. This does two things: 1) getting your code to be minimal and confirming that the code you post still shows the same problem often helps to find/fix the problem, and 2) it's much easier for readers of your question to test for themselves. Never forget you're asking people to use their own time and effort to try to help you, with the only reward being at best a point or two of reputation; providing good, short, runnable code and a clear description of what's wrong with it will help you get an answer.
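Why the original loop left the data unchanged can be shown without NLTK or pandas. In this sketch, toy_stem is a hypothetical stand-in for stemmer.stem (it only strips a trailing "d"); the point is the difference between rebinding the loop variable and storing the result.

```python
def toy_stem(word: str) -> str:
    """Hypothetical stand-in for stemmer.stem: strips a trailing 'd'."""
    return word[:-1] if word.endswith("d") else word

review = ["wow", "loved", "place"]

# Rebinding the loop variable only changes the local name "word";
# the list it came from is left untouched.
for word in review:
    word = toy_stem(word)
unchanged = list(review)

# Building a new list and keeping it preserves the results.
stemmed = [toy_stem(word) for word in review]

print(unchanged)
print(stemmed)
```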

List.Sort() method places single digit last however non single digits are placed first python [duplicate]

I have some files that need to be sorted by name. Unfortunately I can't use a regular sort, because I also want to sort by the numbers in the string, so I did some research and found that what I am looking for is called natural sorting.
I tried the solution given here and it worked perfectly.
However, strings like PresserInc-1_10.jpg and PresserInc-1_11.jpg cause that specific natural key algorithm to fail, because it only matches the first integer, which in this case would be 1 and 1, and so it throws off the sorting. So what I think might help is to match all numbers in the string and group them together, so that for PresserInc-1_11.jpg the algorithm gives me 111 back. Is this possible?
Here's a list of filenames:
files = ['PresserInc-1.jpg', 'PresserInc-1_10.jpg', 'PresserInc-1_11.jpg', 'PresserInc-10.jpg', 'PresserInc-2.jpg', 'PresserInc-3.jpg', 'PresserInc-4.jpg', 'PresserInc-5.jpg', 'PresserInc-6.jpg', 'PresserInc-11.jpg']
Google: Python natural sorting.
Result 1: The page you linked to.
But don't stop there!
Result 2: Jeff Atwood's blog that explains how to do it properly.
Result 3: An answer I posted based on Jeff Atwood's blog.
Here's the code from that answer:
import re
def natural_sort(l):
    convert = lambda text: int(text) if text.isdigit() else text.lower()
    alphanum_key = lambda key: [convert(c) for c in re.split('([0-9]+)', key)]
    return sorted(l, key=alphanum_key)
Results for your data:
PresserInc-1.jpg
PresserInc-1_10.jpg
PresserInc-1_11.jpg
PresserInc-2.jpg
PresserInc-3.jpg
etc...
See it working online: ideone
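For readers who want to run it locally, here is a self-contained sketch of the same idea applied to the file list from the question. The helper name alphanum_key mirrors the lambda above; writing it as a named function is just a readability choice. Each key is a list alternating text and integers, so every number in the name takes part in the comparison rather than being merged into a single value such as 111.

```python
import re

def alphanum_key(name: str) -> list:
    """Split a name into text and integer chunks, e.g. 'a-1_10' -> ['a-', 1, '_', 10, '']."""
    return [int(chunk) if chunk.isdigit() else chunk.lower()
            for chunk in re.split(r"([0-9]+)", name)]

files = ['PresserInc-1.jpg', 'PresserInc-1_10.jpg', 'PresserInc-1_11.jpg',
         'PresserInc-10.jpg', 'PresserInc-2.jpg', 'PresserInc-3.jpg',
         'PresserInc-4.jpg', 'PresserInc-5.jpg', 'PresserInc-6.jpg',
         'PresserInc-11.jpg']

ordered = sorted(files, key=alphanum_key)
for name in ordered:
    print(name)
```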
If you don't mind third party libraries, you can use natsort to achieve this.
>>> import natsort
>>> files = ['PresserInc-1.jpg', 'PresserInc-1_10.jpg', 'PresserInc-1_11.jpg', 'PresserInc-10.jpg', 'PresserInc-2.jpg', 'PresserInc-3.jpg', 'PresserInc-4.jpg', 'PresserInc-5.jpg', 'PresserInc-6.jpg', 'PresserInc-11.jpg']
>>> natsort.natsorted(files)
['PresserInc-1.jpg',
'PresserInc-1_10.jpg',
'PresserInc-1_11.jpg',
'PresserInc-2.jpg',
'PresserInc-3.jpg',
'PresserInc-4.jpg',
'PresserInc-5.jpg',
'PresserInc-6.jpg',
'PresserInc-10.jpg',
'PresserInc-11.jpg']

Python3, how to encode this string correctly?

Disclaimer: I've already done a lot of research to solve this on my own, but most of the questions I found here concern Python 2.7 or don't solve my problem.
Let's say I have the following (the example comes from the BeautifulSoup docs; I'm trying to solve a bigger issue):
>>> markup = "<h1>Sacr\xc3\xa9 bleu!</h1>"
>>> print(markup)
<h1>SacrÃ© bleu!</h1>
To me, markup should be a bytes object, so that I could do:
>>> markup = b"<h1>Sacr\xc3\xa9 bleu!</h1>"
>>> print(str(markup, 'utf-8'))
<h1>Sacré bleu!</h1>
Yeah! But how do I get from "<h1>Sacr\xc3\xa9 bleu!</h1>" (which is wrong) to b"<h1>Sacr\xc3\xa9 bleu!</h1>"?
Because if I do:
>>> markup = "<h1>Sacr\xc3\xa9 bleu!</h1>"
>>> bytes(markup, "utf-8")
b'<h1>Sacr\xc3\x83\xc2\xa9 bleu!</h1>'
You see? It inserted \x83\xc2 for free.
>>> print(bytes(markup))
TypeError: string argument without an encoding
If you have the Unicode string "<h1>Sacr\xc3\xa9 bleu!</h1>", something has already gone wrong. Either your input is broken, or you did something wrong when processing it. For example, here, you've copied a Python 2 example into a Python 3 interpreter.
If you have your broken string because you did something wrong to get it, then you should really fix whatever it was you did wrong. If you need to convert "<h1>Sacr\xc3\xa9 bleu!</h1>" to b"<h1>Sacr\xc3\xa9 bleu!</h1>" anyway, then encode it in latin-1:
bytestring = broken_unicode.encode('latin1')
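A short sketch of that round trip, using the string from the question. latin-1 maps each code point below 256 to the byte with the same value, so encoding with it recovers the original bytes, which can then be decoded as UTF-8. The variable name repaired is only for illustration.

```python
broken_unicode = "<h1>Sacr\xc3\xa9 bleu!</h1>"

# Each character U+0000..U+00FF becomes the single byte with the same value.
bytestring = broken_unicode.encode("latin1")

# The recovered bytes are valid UTF-8 and decode to the intended text.
repaired = bytestring.decode("utf-8")

print(bytestring)
print(repaired)
```

Encoding with UTF-8 instead turns each of the two wrong characters into two bytes, which is where the extra \x83\xc2 in the question comes from.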

Python3: dictionary of dictionaries for table-content in file

The task at hand, where I got stuck, is that I have to put the table content of a file into a dictionary-of-dictionaries structure.
The file contains something like this (first six lines of the ASCII file):
Name-----------|Alt name-------|------RA|-----DEC|-----z|---CR|----FX|---FX*|Error|---LX|--NH|ID-|Ref#----
RXCJ0000.1+0816 UGC12890 0.0295 8.2744 0.0396 0.26 5.80 5.39 12.4 0.37 5.9 1,3
RXCJ0001.9+1204 A2692 0.4877 12.0730 0.2033 0.08 1.82 1.81 17.9 3.24 5.1 1
RXCJ0004.9+1142 UGC00032 1.2473 11.7006 0.0761 0.17 3.78 3.68 12.7 0.93 5.3 2,4
RXCJ0005.3+1612 A2703 1.3440 16.2105 0.1164 0.24 4.96 4.94 11.8 2.88 3.7 B 2,5
RXCJ0006.3+1052 a) 1.5906 10.8677 0.1698 0.15 3.28 3.28 19.3 4.05 5.6 1
I can provide a file sample if necessary.
The following code works fine till it comes to storing each line-dict into a second dict.
#!/usr/bin/env python3
from collections import *
from re import *
obsrun = {}
objects = {}
re = compile(r'\d+.\d\d\d\d')
filename = 'test.asc'
with open(filename, 'r') as f:
    lines = f.readlines()
    for l in lines[2:]:
        #split the read lines into a list
        o_bject = l.split()
        #print(o_bject)
        #iterate over each entry and populate the line-dictionary with values of interest
        #what's needed (in col of table): identifier, common name, rightascension, declination
        for k in o_bject:
            objects.__setitem__('id', o_bject[0])
            objects.__setitem__('common_name', o_bject[1])
            # sometimes the common name has blanks, multiple entries or replacements
            if re.match(o_bject[2]):
                objects.__setitem__('ra', float(o_bject[2] ) )
                objects.__setitem__('dec', float(o_bject[3] ) )
            else:
                objects.__setitem__('ra', float(o_bject[3] ) )
                objects.__setitem__('dec', float(o_bject[4] ) )
        #extract the identifier (name of the object) for use as key
        name = objects.get('id')
        #print(name)
        print(objects) #*
        # as documented in http://stackoverflow.com/questions/1024847/add-to-a-dictionary-in-python
        obsrun[name] = objects
#print(obsrun)
#getting an ordered dictionary sorted by keys
OrderedDict(sorted(obsrun.items(), key= lambda t: t[0] ) ) #t[0] keys,t[1] values
What one can see from the console output is that the inner for-loop does what it's supposed to do. This is confirmed by the print(objects) at *.
But when it comes to storing the row-dicts as values in the second dict, it's populated with the same values. The keys are built correctly.
What I don't understand is that the print() command displays the correct content of "objects", but it is not stored in "obsrun" correctly.
Does the error lie in the dict view nature, or what did I do wrong?
How should I improve the code?
Thanks in advance,
Christian
You created only one dictionary, so each time through the loop you are modifying the same one.
Move the line
objects = {}
into the body of the outer for loop (the one that iterates over the lines of the file). This will create a separate dict for each line of the file.
Also, using __setitem__ directly is unnecessary and makes the code harder to read. Change the lines from objects.__setitem__('id', o_bject[0]) to objects['id'] = o_bject[0].
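The effect of reusing one dictionary can be reproduced in a few lines. This is a sketch with two made-up (name, ra) pairs taken from the sample rows; the names obsrun_shared and obsrun_fresh are only for illustration.

```python
rows = [("RXCJ0000.1+0816", 0.0295), ("RXCJ0001.9+1204", 0.4877)]

# One dict created outside the loop: every key ends up pointing at the same object.
shared = {}
obsrun_shared = {}
for name, ra in rows:
    shared["id"] = name
    shared["ra"] = ra
    obsrun_shared[name] = shared

# A new dict created inside the loop: each key gets its own object.
obsrun_fresh = {}
for name, ra in rows:
    entry = {"id": name, "ra": ra}
    obsrun_fresh[name] = entry

print(obsrun_shared["RXCJ0000.1+0816"]["ra"])  # value from the last row
print(obsrun_fresh["RXCJ0000.1+0816"]["ra"])   # value from its own row
```

Printing inside the loop looks correct in both versions, because at that moment the shared dict does hold the current row; it is only overwritten on the next iteration.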
It's worth pointing out that you don't really need a dict-of-dicts unless you are trying to look up the entries by name. (You don't explain much what the use case is, here.)
The one thing that leaps out from your code is that you're using __setitem__ a lot - I think maybe you are coming from C++ or Java, where dictionaries do not have language support built in. In Python, this is not the case: you can say d[key] = value to add an item to a dictionary.
Here's some code to create a list (array) of dictionaries. It would be pretty trivial to make Table a dictionary keyed on one of the fields. I'll leave that for you to figure out. :)
Alternatively, a list is much easier to iterate over than a dict, if your problem is going to be performing computations on the data. So if you have to add up or average up or find the min/max, you probably want this version.
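#!/usr/bin/env python3
from pprint import pprint

Table = []
with open('test.asc') as data:
    # Header looks like "Name-----------|Alt name-------|...": drop the newline and the dashes
    header = data.readline().strip().replace('-', '')
    Field_names = header.split('|')
    # Read in the remaining lines, one at a time
    for line in data:
        # NOTE: splitting on whitespace misaligns rows with blank or multi-word fields
        fields = line.split()
        Table.append(dict(zip(Field_names, fields)))
pprint(Table)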
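One possible way to turn such a list into a dictionary keyed on one of the fields is a dict comprehension. The sketch below uses two hand-written rows shaped like the entries of Table (values abridged), and assumes the "Name" field is unique per row; table and by_name are illustrative names.

```python
# Two rows shaped like the output of the list-of-dicts code (values abridged).
table = [
    {"Name": "RXCJ0000.1+0816", "Alt name": "UGC12890", "RA": "0.0295", "DEC": "8.2744"},
    {"Name": "RXCJ0001.9+1204", "Alt name": "A2692", "RA": "0.4877", "DEC": "12.0730"},
]

# Key each row dict by its "Name" field; the row dicts are referenced, not copied.
by_name = {row["Name"]: row for row in table}

print(by_name["RXCJ0001.9+1204"]["RA"])
```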
So you're saying that assigning "objects" into obsrun just links to "objects" rather than copying its content? So I have to keep a separate inner dict for each row, since it's just linked.
You're right about __setitem__. I used it to make it clearer to me what exactly I'm doing there.
I will try moving objects = {} into the inner for-loop.
Thanks for the answer. Will get back to report if that did the trick.
Update: That did it! Thanks so much, I really got stuck there, but I learned something important about dictionaries: in this case they are just linked, so it's already memory-saving.
cheers,
Christian
