Python 3 UTF-8 decoding/encoding problems with data hiding

I'm trying to take the text from a file (the text is Russian), hide it in an image, and later retrieve it from the image. However, I keep getting binascii.Error: Odd-length string when I try to retrieve the data from the image I hid it in.
I feel like the problem may lie in how I hide the text. When I do someString = file.read() on the file and print someString, everything comes out fine. But when I run:
import codecs

file = open(<text file path>, 'r', encoding='utf-8')
entireText = file.read()
print(codecs.encode(entireText, 'utf-8'))
I get the following:
b'\xd0\x9f\xd1\x80\xd0\xb8\xd0\xbf\xd0\xb5\xd0\xb2:\n\xd0\x9e\xd1\x87\xd0\xb8 \xd1\x87\xd1\x91\xd1\x80\xd0\xbd\xd1\x8b\xd0\xb5, \xd0\xbe\xd1\x87\xd0\xb8 \xd0
That is only a piece of it, but it shows the pattern: the bytes object that codecs.encode returns has literal colons, spaces, commas, and \n scattered throughout. If I use codecs to decode it, I get the original text back in perfect shape.
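For example, decoding that bytes object gives back the original text (same variables as the snippet above):

raw = codecs.encode(entireText, 'utf-8')   # the bytes shown above
print(codecs.decode(raw, 'utf-8'))         # prints the original Russian text again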
If it helps, here are the functions I use to make it happen:
import binascii

def stringToBinary(msg):
    return bin(int(binascii.hexlify(msg.encode('utf-8')), 16))[2:]

def binaryToString(bNum):
    return binascii.unhexlify('%x' % (int('0b' + bNum, 2))).decode('utf-8')
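Note that bin() and '%x' both drop leading zeros, so even a fully recovered bit string can hand unhexlify an odd-length hex string. A defensive sketch of binaryToString (only the padding lines are new):

def binaryToString(bNum):
    hexStr = '%x' % int(bNum, 2)
    if len(hexStr) % 2:              # unhexlify requires an even number of hex digits
        hexStr = '0' + hexStr
    return binascii.unhexlify(hexStr).decode('utf-8')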
If that is not enough, the entire file is here: http://pastebin.com/f541DpzS
EDIT: I think I'm getting the error because the image I'm trying to hide the text in didn't have enough pixels to hold the complete message, so the code was trying to convert the binary number back to a string without all of its bits, thus throwing binascii.Error: Odd-length string.
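If that diagnosis is right, a capacity check before embedding would fail fast instead of silently truncating. A rough sketch, assuming a Pillow image and a typical LSB scheme that stores one bit per RGB channel value (these names are illustrative, not from the linked script):

def fits_in_image(bits, image):
    width, height = image.size                 # Pillow's Image.size is (width, height)
    return len(bits) <= width * height * 3     # one bit per R, G and B value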

Related

Why does page 323 from Automate the Boring Stuff generate an int of 21?

I'm going back through the book "Automate the Boring Stuff" (which has been a great book, btw) as I need to brush up on CSV parsing for a project, and I'm trying to understand why each output is generated. Why does this code from page 323 create an output of 21, when it's four words, 16 characters, and three commas? Not to mention that I'm entering strings and it outputs a number.
#%%
import csv
outputFile = open('output.csv', 'w', newline='')
outputWriter = csv.writer(outputFile)
outputWriter.writerow(['spam', 'eggs', 'bacon', 'ham'])
First I thought it was the number of characters, but that adds up to 16. Then I thought maybe each word gets a space, plus one at the beginning and end of the CSV file? That could technically account for it, but nothing explicit is stated; it's more "oh, it's obvious because " without it ever being spelled out. I'm not seeing a reference to how that number is created.
There must be a plausible explanation, but I don't understand why it's 21.
I've tried breakpoint() and pdb, but I'm still learning how to use those; the breakdown I get doesn't contain anything that answers it. No counting or summation that I can see.
The docs state that csvwriter.writerow() returns "the return value of the call to the write method of the underlying file object."
In your example
outputFile = open('output.csv', 'w', newline='')
is your underlying file object which you then hand to csv.writer().
If we look a bit deeper we can find the type of outputFile with print(type(outputFile)).
<class '_io.TextIOWrapper'>
While the docs don't explicitly define the write method for TextIOWrapper, they do state that it inherits from TextIOBase, which defines its write() method as "Write the string s to the stream and return the number of characters written."
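You can see this directly by printing writerow's return value, rerunning the question's code:

import csv

outputFile = open('output.csv', 'w', newline='')
outputWriter = csv.writer(outputFile)
print(outputWriter.writerow(['spam', 'eggs', 'bacon', 'ham']))  # prints 21
outputFile.close()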
If we look at the text file written:
spam,eggs,bacon,ham
We see 19 visible characters; writerow() also writes the csv module's default \r\n line terminator, bringing the total to 21 characters written.
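Counting it out:

print(len('spam,eggs,bacon,ham'))      # 19 visible characters
print(len('spam,eggs,bacon,ham\r\n'))  # 21 with the default '\r\n' row terminator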

Skip processing fenced code blocks when processing Markdown files line by line

I'm a very inexperienced Python coder, so it's quite possible that I'm approaching this particular problem in completely the wrong way, but I'd appreciate any suggestions/help.
I have a Python script that goes through a Markdown file line by line and rewrites [[wikilinks]] as standard Markdown [wikilink](wikilink) style links. I'm doing this using two regexes in one function as shown below:
import logging
import re

def modify_links(file_obj):
    """
    Function will parse file contents (opened in utf-8 mode) and modify standalone [[wikilinks]] and in-line
    [[wikilinks]](wikilinks) into traditional Markdown link syntax.

    :param file_obj: Path to file
    :return: List object containing modified text. Newlines will be returned as '\n' strings.
    """
    file = file_obj
    linelist = []
    logging.debug("Going to open file %s for processing now.", file)
    try:
        with open(file, encoding="utf8") as infile:
            for line in infile:
                linelist.append(re.sub(r"(\[\[)((?<=\[\[).*(?=\]\]))(\]\])(?!\()", r"[\2](\2.md)", line))
                # Finds references that are in style [[foo]] only by excluding links in style [[foo]](bar).
                # Capture group $2 returns just foo
        linelist_final = [re.sub(r"(\[\[)((?<=\[\[)\d+(?=\]\]))(\]\])(\()((?!=\().*(?=\)))(\))",
                                 r"[\2](\2 \5.md)", line) for line in linelist]
        # Finds only references in style [[foo]](bar). Capture group $2 returns foo and capture group $5
        # returns bar
    except EnvironmentError:
        logging.exception("Unable to open file %s for reading", file)
    logging.debug("Finished processing file %s", file)
    return linelist_final
This works fine for most Markdown files. However, I can occasionally get a Markdown file that has [[wikilinks]] within fenced code blocks such as the following:
# Reference
Here is a reference to “the Reactome Project” using smart quotes.
Here is an image: ![](./images/Screenshot.png)
[[201802150808]](Product discovery)
```
[[201802150808 Product Prioritization]]
def foo():
print("bar")
```
In the above case I should skip processing the [[201802150808 Product Prioritization]] inside the fenced code block. I have a regex that identifies the fenced code block correctly, namely:
(?<=```)(.*?)(?=```)
However, since the existing function is running line by line, I have not been able to figure out a way to skip the entire section in the for loop. How do I go about doing this?
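One minimal approach that keeps the line-by-line structure is to toggle a flag on every ``` fence marker and skip the rewriting while the flag is set. A rough sketch (with a simplified form of the first regex; it assumes fences always start a line and are never nested):

import re

def rewrite_outside_fences(path):
    in_fence = False
    out = []
    with open(path, encoding="utf8") as infile:
        for line in infile:
            if line.lstrip().startswith("```"):
                in_fence = not in_fence      # entering or leaving a fenced block
            elif not in_fence:
                line = re.sub(r"\[\[(.*?)\]\](?!\()", r"[\1](\1.md)", line)
            out.append(line)
    return out

This handles simple fences like the example above, but it still misses harder edge cases: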
You need to use a full Markdown parser to be able to cover all of the edge cases. Of course, most Markdown parsers convert Markdown directly to HTML. However, a few use a two-step process, where step one converts the raw text to an abstract syntax tree (AST) and step two renders the AST to the output format. It is not uncommon to find a Markdown renderer (one that outputs Markdown) which can replace the default HTML renderer.
You would simply need to modify either the parser step (using a plugin to add support for the wikilink syntax) or the AST directly. Then pass the AST to a Markdown renderer, which will give you a nicely formatted and normalized Markdown document. If you are looking for a Python solution, mistune or Pandoc filters might be a good place to start.
But why go through all that when a few well-crafted regular expressions can be run on the source text? Because Markdown parsing is complicated. I know, it seems easy at first; after all, Markdown is easy for a human to read (which was one of its defining design goals). However, parsing is actually very complicated, with parts of the parser reliant on previous steps.
For example, in addition to fenced code blocks, what about indented code blocks? But you can't just check for indentation at the beginning of a line, because a single line of a nested list could look identical to an indented code block. You want to skip the code block, but not the paragraph nested in a list. And what if your wikilink is broken across two lines? Generally when parsing inline markup, Markdown parsers will treat a single line break no different than a space. The point of all of this is that before you can start parsing inline elements, the entire document needs to first be parsed into its various block-level elements. Only then can you step through those and parse inline elements like links.
I'm sure there are other edge cases I haven't thought of. The only way to cover them all is to use a full-fledged Markdown parser.
I was able to create a reasonably complete solution to this problem by making a few changes to my original function, namely:
Replace the built-in re module with the regex module available on PyPI.
Change the function to read the entire file into a single variable instead of reading it line by line.
The revised function is as follows:
import logging
import regex

def modify_links(file_obj):
    """
    Function will parse file contents (opened in utf-8 mode) and modify standalone [[wikilinks]] and in-line
    [[wikilinks]](wikilinks) into traditional Markdown link syntax.

    :param file_obj: Path to file
    :return: String containing modified text. Newlines will be returned as '\\n' in the string.
    """
    file = file_obj
    try:
        with open(file, encoding="utf8") as infile:
            line = infile.read()
            # Read the entire file as a single string
        linelist = regex.sub(r"(?V1)"
                             r"(?s)```.*?```(*SKIP)(*FAIL)(?-s)|(?s)`.*?`(*SKIP)(*FAIL)(?-s)"
                             # Ignore fenced & inline code blocks. V1 engine allows in-line flags so
                             # we enable newline matching only here.
                             r"|(\ {4}|\t).*(*SKIP)(*FAIL)"
                             # Ignore code blocks beginning with 4 spaces/1 tab
                             r"|(\[\[(.*)\]\](?!\s\(|\())", r"[\3](\3.md)", line)
        # Finds references that are in style [[foo]] only by excluding links in style [[foo]](bar) or
        # [[foo]] (bar). Capture group $3 returns just foo
        linelist_final = regex.sub(r"(?V1)"
                                   r"(?s)```.*?```(*SKIP)(*FAIL)(?-s)|(?s)`.*?`(*SKIP)(*FAIL)(?-s)"
                                   r"|(\ {4}|\t).*(*SKIP)(*FAIL)"
                                   # Refer comments above for this portion.
                                   r"|(\[\[(\d+)\]\](\s\(|\()(.*)(?=\))\))", r"[\3](\3 \5.md)", linelist)
        # Finds only references in style [[123]](bar) or [[123]] (bar). Capture group $3 returns 123 and capture
        # group $5 returns bar
    except EnvironmentError:
        logging.exception("Unable to open file %s for reading", file)
    return linelist_final
The above function handles [[wikilinks]] in inline code, fenced code blocks, and code blocks indented with 4 spaces. There is currently one case where it wrongly ignores a valid [[wikilink]]: when the link appears at the 3rd level or deeper of a Markdown list, i.e.:
* Level 1
  * Level 2
    * [[wikilink]] # Not recognized
      * [[wikilink]] # Not recognized
However, my documents do not have wikilinks nested at that level in lists, so it's not a problem for me.

np.save is converting floats to weird characters

I am attempting to append results to an ongoing CSV file. Each result comes out as an ndarray:
[IN]: print(savearray)
[OUT]: [[ 0.55219001 0.39838119]]
Initially I tried
np.savetxt('flux_ratios.csv', savearray, delimiter=",")
But this overwrites the old data every time I save, so instead I am attempting to append the data like this:
f = open('flux_ratios.csv', 'ab')
np.save(f, 'a', savearray)
f.close()
This is (in a sense) appending; however, it is saving the numerical data as weird characters, as can be seen in this screenshot:
I have no idea why or how this is happening so any help would be greatly appreciated!
First off, np.save writes a binary format, whereas np.savetxt writes text. You are mixing binary data into a text file, which is why you get the odd characters when you open it. (Note also that np.save has no mode argument; in np.save(f, 'a', savearray), the string 'a' is actually what gets passed as the array to save.)
You could just change np.save(f, 'a', savearray) to np.savetxt(f, savearray, delimiter=',').
Otherwise you could also consider using pandas' DataFrame.to_csv in append mode.
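A minimal sketch of both options, reusing the file name from the question (each run appends one row):

import numpy as np
import pandas as pd

savearray = np.array([[0.55219001, 0.39838119]])

# Option 1: np.savetxt accepts an already-open handle, so append mode works
with open('flux_ratios.csv', 'a') as f:
    np.savetxt(f, savearray, delimiter=',')

# Option 2: DataFrame.to_csv supports append mode directly
pd.DataFrame(savearray).to_csv('flux_ratios.csv', mode='a', header=False, index=False)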

Arabic text replaced with escape sequences when creating CSV files using python

I am trying to create a CSV file that contains Arabic tweets collected using tweepy for a project I am doing. Gathering the data is all fine; however, when I am writing to the CSV file, all Arabic results are escaped as \xXXXX sequences, as follows:
b'#\xd8\xa7\xd9\x84\xd9\x8a\xd9\x88\xd9\x85_\xd8\xa7\xd9\x84\xd8\xb9\xd8\xa7\xd9\x84\xd9\x85\xd9\x8a_\xd9\x84\xd9\x84\xd8\xa7\xd8\xb9\xd8\xa7\xd9\x82\xd9\x87_2017 \xd8\xa7\xd9\x84\xd8\xa5\xd8\xb9\xd8\xa7\xd9\x82\xd8\xa9 \xd8\xa7\xd9\x84\xd8\xad\xd9\x82\xd9\x8a\xd9\x82\xd9\x8a\xd8\xa9 \xd8\xa7\xd8\xb9\xd8\xa7\xd9\x82\xd8\xa9 \xd8\xa7\xd9\x84\xd9\x81\xd9\x83\xd8\xb1 \xd9\x88\xd9\x84\xd9\x8a\xd8\xb3\xd8\xaa \xd8\xa7\xd8\xb9\xd8\xa7\xd9\x82\xd8\xa9
I looked at many previously asked questions, and all I could find were suggestions for Python 2 or code similar to what I am already writing. When I was creating JSON files instead, I used ensure_ascii=False, but I couldn't find anything similar for CSV. Below is my code:
import codecs
import csv

with codecs.open('tweets.csv', 'a', encoding='utf-8') as file:
    fieldnames = ['tweet', 'country']
    writer = csv.DictWriter(file, fieldnames=fieldnames)
    data = {'tweet': status.text, 'country': status.place.full_name}
    writer.writerow(data)
I tried adding .encoding='utf-8' to status.text and status.place as well, but that also didn't work. Any suggestions?
You have to make sure the Arabic string you have is decoded from UTF-8 before you write it. Assuming status.text is of type bytes, you should write text = status.text.decode('utf-8') (you may have to do the same for status.place.full_name). If it is of type str, though, it won't have a decode() method; a str object can be written as-is without producing escape sequences.
If you try to specify the encoding of a bytes object (like the one you presumably have) as 'utf-8', that won't work, because the text is already UTF-8-encoded bytes. To get UTF-8 characters in the file, you must call decode() on the bytes object; that way the writer receives the characters rather than the raw bytes.
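A sketch of that guard inside the writing code (the isinstance check is illustrative; on Python 3, tweepy's fields are usually already str):

text = status.text
if isinstance(text, bytes):
    text = text.decode('utf-8')      # bytes -> str, so csv writes characters, not a repr
place = status.place.full_name
if isinstance(place, bytes):
    place = place.decode('utf-8')
writer.writerow({'tweet': text, 'country': place})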

read login data from text file into dictionary error

Using the Stack Overflow answer at this link: https://stackoverflow.com/a/4804039, I have attempted to read the file contents into a dictionary. There is an error that I cannot seem to fix.
Code
def login():
    print("====Login====")
    userinfo = {}
    with open("userinfo.txt", "r") as f:
        for line in f:
            (key, val) = line.split()
            userinfo[key] = val
    print(userinfo)
File Contents
{'user1': 'pass'}
{'user2': 'foo'}
{'user3': 'boo'}
Error:
(key,val)=line.split()
ValueError: not enough values to unpack (expected 2, got 0)
I have a question to which I would very much appreciate a twofold answer:
What is the best and most efficient way to read the file contents, as shown, into a dictionary, given that they have already been stored in dictionary format?
Is there a way to WRITE the dictionary to the file that would make this "reading" easier? My code for writing to the userinfo.txt file in the first place is shown below.
Write code
with open("userinfo.txt", "a", newline="") as fo:
    writer = csv.writer(fo)
    writer.writerow([{username: password}])
Could any answers please attempt the following:
Provide a solution to the error using the original code.
Suggest the best method to do the same thing (simplest for teaching purposes). Note that I do not wish to use pickle, json, or anything other than very basic file handling (so only reading from a text file or the csv reader/writer tools). For instance, would it be best to read the file contents into a list and then convert the list into a dictionary? Or is there another way?
Is there a method of writing a dictionary to a text file using the csv writer or other basic text-file handling, so that reading the file contents back into a dictionary can be done more effectively on the other end?
Update:
Blank line removed, and the code works but produces the erroneous output:
{"{"Vjr':": "'open123'}", "{'mvj':": "'mvv123'}"}
I think I need to understand the split() and strip() methods and how to use them in this context to produce the desired result (reading the contents into the dictionary userinfo).
Well, let's start with the basics first. The error message:
ValueError: not enough values to unpack (expected 2, got 0)
means a line was empty, so do you have a blank line in the file?
Yes, there are other options for saving your dictionary and bringing it back, but you should understand this first, and it may work just fine for you. :-) The split() is acting on the string you read from the file and by default splits on whitespace, so that is what you are seeing. You could format your text file like username:pass instead and then use split(':').
File Contents
user1:pass
user2:foo
user3:boo
Code
def login():
    print("====Login====")
    userinfo = {}
    with open("userinfo.txt", "r") as f:
        for line in f:
            (key, val) = line.split(':')
            userinfo[key] = val.strip()
    print(userinfo)

if __name__ == '__main__':
    login()
This simple format may be best if you want to be able to edit the text file by hand, and I like to keep things as simple as possible. ;-)
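For completeness, a sketch of the matching write side in the same colon-separated format (username and password stand for whatever variables hold the new credentials):

def register(username, password):
    # One 'username:password' record per line, matching what login() parses
    with open("userinfo.txt", "a") as fo:
        fo.write(username + ":" + password + "\n")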
