Writing to files in ASCII with Python 3, not UTF-8

I have a program I created with two sections.
The first section copies a text file with an integer in the middle of the file name, in this format:
file = "Filename" + str(int) + ".txt"
The user can create as many copies of the file as they would like.
The second part of the program is where I am having the problem. There is an integer at the very bottom of the file that is supposed to correspond with the integer in the file name. After the first part is done, I open each file one at a time in "r+" read/write mode, so that I can use file.seek(1000) to get to about where the integer is in the file.
Now, in my opinion the next part should be easy: I should simply have to write str(int) into the file at that point. But it wasn't that easy. It worked just fine that way on Linux at home, but at work on Windows it proved difficult. What I ended up having to do after file.seek(1000) is write to the file using Unicode UTF-8. I accomplished this with the code snippet below, which I have commented so that it is clear what is going on. Instead of having to write this in Unicode, I would love to be able to write it in good old regular ASCII characters. Eventually this program will be expanded to include a lot more data at the bottom of each file, and having to write the data in Unicode is going to make things extremely difficult. If I just write the data without turning it into Unicode, this is the result: the string is supposed to say #2 =1534, but instead it says #2 =ㄠ㌵433.
If someone can show me what I am doing wrong, that would be great. I would love to just use something like file.write('1534') to write the data to the file instead of having to do it in Unicode UTF-8.
while a1 < d1:
    file = "file" + str(a1) + ".par"
    f = open(file, "r+")
    f.seek(1011)
    data = f.read()  # reads the data from that point in the file into a variable
    numList = list(str(a1))  # "a1" is the integer in the file name; I had to turn the integer into a list to accomplish the next task
    replaceData = '\x00' + numList[0] + '\x00' + numList[1] + '\x00' + numList[2] + '\x00' + numList[3] + '\x00'  # this line turns the integer into UTF-8 Unicode; I am by no means a Unicode expert
    currentData = data  # probably didn't need to be done, now that I'm looking at this
    data = data.replace(currentData, replaceData)  # replaces the UTF-8 string in the "data" variable with the new UTF-8 string in "replaceData"
    f.seek(1011)  # return to where I need to be in the file to write the data
    f.write(data)  # write the new Unicode data to the file
    f.close()  # close the file
    f.close()  # make sure the file is closed (sometimes it seems that this fails on Windows)
    a1 += 1  # advance the integer, then return to the top of the loop

Here is an example of writing to a file in ASCII. You need to open the file in binary mode, and the .encode method on strings is a convenient way to get the end result you want.
s = '12345'
ascii_bytes = s.encode('ascii')  # encode the text as ASCII bytes
with open('somefile', 'wb') as f:
    f.write(ascii_bytes)
You can obviously also open in rb+ (read and write binary mode) in your case if the file already exists:
with open('somefile', 'rb+') as f:
    existing = f.read()
    f.write(b'ascii without encoding!')
You can also just pass literals with the b prefix (bytes literals), which may only contain ASCII characters, as shown in the second example.
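Applied to the loop from the question, a minimal sketch might look like the following (assuming, as in the question, that the counter always has four digits, so the in-place overwrite does not shift any later bytes; a1 and d1 keep their meaning from the question):
a1, d1 = 1000, 1004  # hypothetical counter range, standing in for the question's values
while a1 < d1:
    name = "file" + str(a1) + ".par"
    with open(name, "rb+") as f:  # binary read/write mode
        f.seek(1011)  # jump to where the integer lives in the file
        f.write(str(a1).encode('ascii'))  # overwrite it with plain ASCII digits
    a1 += 1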

Related

How to convert Hex to original file format?

I have a .tgz file that was formatted as shell code, it looks like this (Hex):
"\x1F\x8B\x08\x00\x44\x7A\x91\x4F\x00\x03\xED\x59\xED\x72.."
It was generated this way (python3):
import os

def main():
    dump_src = "MyPlugin.tgz"
    fc = ""
    try:
        with open(dump_src, 'rb') as fd:
            fcr = fd.read()
            for byte in bytearray(fcr):
                fc += "\\x{:02x}".format(byte)
    except:
        fcr = dump_src
        for byte in bytearray(fcr):
            fc += "\\x{:02x}".format(byte)
    print(fc)
    # failed attempt:
    fcback = bytes(int(fc[i+2:i+4], 16) for i in range(0, len(fc), 4))
    print(fcback)

if __name__ == "__main__":
    main()
How can I convert this back to the original tgz archive?
Edit: failed attempt in the last section outputs this:
b'\x8b\x00\x10]\x03\x93o0\x85%\xe2!\xa4H\xf1Fi\xa7\x15\xf61&\x13N\xd9[\xfag\x11V\x97\xd3\xfb%\xf7\xe3\\\xae\xc2\xff\xa4>\xaf\x11\xcc\x93\xf1\x0c\x93\xa4\x1b\xefxj\xc3?\xf9\xc1\xe8\xd1\xd9\x01\x97qB"\x1a\x08\x9cO\x7f\xe9\x19\xe3\x9c\x05\xf2\x04a\xaa\x00A,\x15"RN-\xb6\x18K\x85\xa1\x11\x83\xac/\xffR\x8a\xa19\xde\x10\x0b\x08\x85\x93\xfc]\x8a^\xd2-T\x92\x9a\xcc-W\xc7|\xba\x9c\xb3\xa6V0V H1\x98\xde\x03#\x14\'\n 1Y\xf7R\x14\xe2#\xbe*:\xe0\xc8\xbb\xc9\x0bo\x8bm\xed.\xfd\xae\xef\x9fT&\xa1\xf4\xcf\xa7F\xf4\xef\xbb"8"\xb5\xab,\x9c\xbb\xfc3\x8b\xf5\x88\xf4A\x0ek%5eO\xf4:f\x0b\xd6\x1bi\xb6\xf3\xbf\xf7\xf9\xad\xb5[\xdba7\xb8\xf9\xcd\xba\xdd,;c\x0b\xaaT"\xd4\x96\x17\xda\x07\x87& \xceH\xd6\xbf\xd2\xeb\xb4\xaf\xbd\xc2\xee\xfc\'3zU\x17>\xde\x06u\xe3G\x7f\x1e\xf3\xdf\xb6\x04\x10A\x04\x10A\x04\x10A\x04\x10A\xff\x9f\xab\xe8(\x00'
And when I output it to a file (e.g. via python3 main.py > MyFile.tgz) the file is corrupted.
Since you know the format of the data (each byte is encoded as a string of four characters in the format "\xAB"), it's easy to revert the conversion and get the original bytes again. It only takes one line of Python code:
data = bytes(int(fc[i+2:i+4], 16) for i in range(0, len(fc), 4))
This uses:
range(start, stop, step) with step 4 to iterate in groups of 4 characters through your string
slicing to get each group of 2 hexadecimal digits
int(x, base) to convert the hexadecimal string to an integer
a generator expression to immediately pass the converted elements to:
bytes() to create a bytes object with the data
The variable data is now of type bytes and you could directly write it to a file (to decompress with an external zip program), or pass it to zlib.decompress() (to further process it in Python).
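As a quick illustration with a short, made-up escape string (rather than the full dump from the question), the round trip looks like this:
fc = "\\x1f\\x8b\\x08\\x00"  # hypothetical sample in the 4-characters-per-byte text format
data = bytes(int(fc[i+2:i+4], 16) for i in range(0, len(fc), 4))
print(data)  # b'\x1f\x8b\x08\x00' -- the original bytes, ready to be written to a file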
UPDATE (follow-up on the comments and updated question):
Firstly, I have tested the above code and it does result in the same bytes as the input. Are you really sure that the example output in your question is the actual result of the code in your question? Please try to be careful when copying code and/or output. A few remarks:
Your code is not properly formatted, so I cannot run it without making modifications. And when I have made modifications to the code, I might run different code than you do, yielding different results. So next time please copy-paste your exact (working, tested) code without modifications.
The format string in your code uses lowercase hexadecimal format, and your first example output uses uppercase. So that output cannot be from this code.
I don't have access to your file "MyPlugin.tgz", but when I test your code with another .tgz file (after fixing the IndentationErrors), my output is correct. It starts with \x1f\x8b as expected (this is the magic number in the gzip header). I can't explain why your output is different...
Secondly, it seems like you don't fully understand how bytes and string representations work. When you write print(fcback), a string representation of the Python object fcback (in this case a bytes object) is printed. The string representation of a bytes object is not the same as the binary data! When printing a bytes object, each byte that corresponds to a printable ASCII character is replaced by that character, other bytes are escaped (similar to the formatted string that your code generates). Also, it starts with b' and ends with '.
You cannot print binary data to your terminal and then pipe the output to a file. This will result in a different file. The correct way to write the data to a file is using file.write(data) in your Python code.
Here's a fully working example:
def binary_to_text(data):
    """Convert a bytes object to a formatted text string."""
    text = ""
    for byte in data:
        text += "\\x{:02x}".format(byte)
    return text

def text_to_binary(text):
    """Convert a formatted text string to a bytes object."""
    return bytes(int(text[i+2:i+4], 16) for i in range(0, len(text), 4))

def main():
    # Read the binary data from the input file:
    with open('MyPlugin.tgz', 'rb') as input_file:
        input_data = input_file.read()
    # Convert binary to text (based on your original code):
    text = binary_to_text(input_data)
    print(text[0:100])
    # Convert the text back to binary:
    output_data = text_to_binary(text)
    print(output_data[0:100])
    # Write the binary data back to a file:
    with open('MyPlugin-restored.tgz', 'wb') as output_file:
        output_file.write(output_data)

if __name__ == '__main__':
    main()
Note that I only print the first 100 elements to keep the output short. Also notice that the second print statement prints a much longer text. This is because the first print gets 100 characters (which are printed as-is), while the second print gets 100 bytes (of which most are escaped, causing the output to be longer).

How can I decode a .bin into a .pdf

I extracted an embedded object from an Excel spreadsheet that was a PDF, but the Excel zip file saves embedded objects as binary files.
I am trying to read the binary file and return it to its original format as a PDF. I took some code from another question with a similar issue, but when I try opening the PDF, Adobe gives the error "can't open because file is damaged...not decoded correctly..".
Does anyone know of a way to do this?
import os
import base64

with open('oleObject1.bin', 'rb') as f:
    binaryData = f.read()
print(binaryData)
with open(os.path.expanduser('test1.pdf'), 'wb') as fout:
    fout.write(base64.decodebytes(binaryData))
Link to the object file on github
Thanks Ryan, I was able to see what you are talking about. Here is the solution for future reference.
import os

str1 = b'%PDF-'  # begin-of-PDF marker
str2 = b'%%EOF'  # end-of-PDF marker
with open('oleObject1.bin', 'rb') as f:
    binary_data = f.read()
print(binary_data)
# Convert the bytes to a mutable bytearray
binary_byte_array = bytearray(binary_data)
# Find where the PDF begins
result1 = binary_byte_array.find(str1)
print(result1)
# Remove all characters before the PDF begins
del binary_byte_array[:result1]
print(binary_byte_array)
# Find where the PDF ends
result2 = binary_byte_array.find(str2)
print(result2)
# Subtract the position of where the PDF ends (plus 5 for the %%EOF characters)
# from the length of the array, and delete that many characters from the end
print(len(binary_byte_array))
to_remove = len(binary_byte_array) - (result2 + 5)
print(to_remove)
del binary_byte_array[-to_remove:]
print(binary_byte_array)
with open(os.path.expanduser('test1.pdf'), 'wb') as fout:
    fout.write(binary_byte_array)
The bin file contains a valid PDF; there is no decoding required. The bin file does, however, have bytes before and after the PDF that need to be trimmed.
To get the first byte, look for the first occurrence of the string %PDF-.
To get the final byte, look for the last occurrence of %%EOF.
Note, I do not know what "format" the leading/trailing bytes that Excel adds are in. The solution above obviously would not work if either of the ASCII strings above could also appear in the leading/trailing data.
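Since the prose above recommends the last occurrence of %%EOF while the solution code uses find() (the first occurrence), a slightly more robust variant could use rfind(); this is a minimal sketch under the same assumptions about the markers:
import os

with open('oleObject1.bin', 'rb') as f:
    data = f.read()
start = data.find(b'%PDF-')  # first begin-of-PDF marker
end = data.rfind(b'%%EOF') + len(b'%%EOF')  # end of the LAST end-of-PDF marker
with open(os.path.expanduser('test1.pdf'), 'wb') as fout:
    fout.write(data[start:end])  # slice out just the PDF payload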
You could also try a Python library that can write PDF files, such as reportlab or pyPDF.

Erasing part of a text file in Python

I have a really big text file on my hard disk. It contains around 8 million JSON objects separated by commas, and I want to remove the last one; however, because the file is so big I cannot do it in regular editors (Notepad++, Sublime, Visual Studio Code, ...). So, I decided to use Python, but I have no clue how to erase part of an existing file using Python. Any kind of help would be appreciated.
P.S. My file has this structure:
json1, json2, json3, ...
where each JSON object looks like {"a":"something", "b":"something", "c":"something"}
The easiest way would be to make the file content valid JSON by enclosing it in [ and ], so it becomes a list of dicts. After removing the last item from the list, you can dump it back into a string and strip its first and last characters, which are the [ and ] that your original text file does not want:
import json
with open('file.txt', 'r') as r, open('newfile.txt', 'w') as w:
    w.write(json.dumps(json.loads('[%s]' % r.read())[:-1])[1:-1])
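For readability, the same idea spelled out step by step might look like this (still a sketch, and it still loads the whole file into memory):
import json

with open('file.txt', 'r') as r:
    items = json.loads('[%s]' % r.read())  # wrap in [ ] so the content parses as a list
items = items[:-1]  # drop the last JSON object
with open('newfile.txt', 'w') as w:
    w.write(json.dumps(items)[1:-1])  # strip the enclosing [ and ] again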
Since you only want the last JSON object removed from the file, a much more efficient method is to identify the first valid JSON object at the end of the file and truncate the file at the position of that object's preceding comma.
This can be accomplished by seeking and reading backwards from the end of the file, one relatively small chunk at a time. Split each chunk by { (since it marks the beginning of a JSON object) and prepend the fragments one at a time to a buffer until the buffer is parsable as a JSON object (this makes the code able to handle nested dict structures). At that point, find the preceding comma in the preceding fragment and prepend it to the buffer, so that finally you can seek the file to where the buffer starts and truncate the file:
import json

chunk_size = 1024
with open('file.txt', 'rb+') as f:
    f.seek(-chunk_size, 2)
    buffer = ''
    while True:
        fragments = f.read(chunk_size).decode().split('{')
        f.seek(-chunk_size * 2, 1)
        i = len(fragments)
        for fragment in fragments[:0:-1]:
            i -= 1
            buffer = '{%s%s' % (fragment, buffer)
            try:
                json.loads(buffer)
                break
            except ValueError:
                pass
        else:
            buffer = fragments[0] + buffer
            continue
        break
    next_fragment = fragments[i - 1]
    # if we don't have a comma in the preceding fragment and it is already the first
    # fragment, we need to read backwards a little more
    if i == 1 and ',' not in fragments[0]:
        f.seek(-2, 1)
        next_fragment = f.read(2).decode() + next_fragment
    buffer = next_fragment[next_fragment.rindex(','):] + buffer
    f.seek(-len(buffer.encode()), 2)
    f.truncate()
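To sanity-check the approach, one could first run it on a small, hypothetical three-object sample; note that the backward f.seek(-chunk_size * 2, 1) assumes the file is at least two chunks long, so chunk_size has to be reduced for a file this small:
objs = ['{"a": 1}', '{"b": {"c": 2}}', '{"d": 3}']
with open('file.txt', 'w') as f:
    f.write(', '.join(objs))
# ... run the truncation code above with chunk_size = 16 ...
with open('file.txt') as f:
    print(f.read())  # expected output: {"a": 1}, {"b": {"c": 2}}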

The output values are all in one line (python3/csv.write)

I am writing a list of dicts into a CSV file, but the output is all on one line. How can I write each value on a new line?
f = open(os.getcwd() + '/friend1.csv', 'w+', newline='')
for Member in MemberList:
    f.write(str(Member))
f.close()
Take a look at the writing example in the csv module of the standard library and this question. Either that, or simply append a newline ("\n") after each write: f.write(str(Member) + "\n").
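A minimal sketch with csv.DictWriter from the standard library, assuming MemberList is a list of dicts that all share the same keys (the field names and data below are made up):
import csv
import os

MemberList = [  # hypothetical data standing in for the question's MemberList
    {"name": "alice", "id": 1},
    {"name": "bob", "id": 2},
]
with open(os.path.join(os.getcwd(), 'friend1.csv'), 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=["name", "id"])
    writer.writeheader()
    writer.writerows(MemberList)  # one CSV row per dict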

python3 opening files and reading lines

Can you explain what is going on in this code? I don't seem to understand how you can open the file and read it line by line in a for loop, instead of reading all of the sentences at the same time. Thanks
Let's say I have these sentences in a document file:
cat:dog:mice
cat1:dog1:mice1
cat2:dog2:mice2
cat3:dog3:mice3
Here is the code:
from sys import argv

filename = input("Please enter the name of a file: ")
f = open(filename, 'r')
d1ct = dict()
print("Number of times each animal visited each station:")
print("Animal Id Station 1 Station 2")
for line in f:
    if '\n' == line[-1]:
        line = line[:-1]
    (AnimalId, Timestamp, StationId,) = line.split(':')
    key = (AnimalId, StationId,)
    if key not in d1ct:
        d1ct[key] = 0
    d1ct[key] += 1
The magic is at:
for line in f:
    if '\n' == line[-1]:
        line = line[:-1]
Python file objects are special in that they can be iterated over in a for loop; on each iteration, the file yields the next line of the file. Because that line includes its last character, which could be a newline, it's often useful to check for and remove it.
As Moshe wrote, open file objects can be iterated. However, they are not of the file type in Python 3.x (as they were in Python 2.x). If the file object is opened in text mode, then the unit of iteration is one text line including the \n.
You can use line = line.rstrip() to remove the \n plus any trailing whitespace.
If you want to read the content of the file at once (into a multiline string), you can use content = f.read().
There is a minor bug in the code: the open file should always be closed. That means calling f.close() after the for loop. Or you can wrap the open in the newer with construct, which will close the file for you -- I suggest getting used to the latter approach.
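A minimal sketch of the question's loop rewritten with the with construct and rstrip(), as suggested (the counting logic is unchanged):
filename = input("Please enter the name of a file: ")
d1ct = dict()
with open(filename, 'r') as f:  # the file is closed automatically when the block ends
    for line in f:
        line = line.rstrip()  # strip the \n plus any trailing whitespace
        (AnimalId, Timestamp, StationId) = line.split(':')
        key = (AnimalId, StationId)
        d1ct[key] = d1ct.get(key, 0) + 1  # shorthand for the if-not-in check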
