I encoded a video file to a text file using base64:
import base64
with open("VIDEO_FILE.mp4", "rb") as videoFile:
    text = base64.b64encode(videoFile.read())
    file = open("TEXT_FILE.txt", "wb")
    file.write(text)
    file.close()
Now, I would like to read the file and extract the base64 string for a specific frame. Is this possible? Do you know how the frames are separated in the text file?
Many thanks!
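A note on what the text file contains may help here. Base64 maps every 3 input bytes to 4 ASCII characters, so the text file is the whole MP4 container re-encoded as one string. It has no separators between frames, and frames in an MP4 are compressed inside the container rather than stored as independent byte ranges. The sketch below uses a synthetic byte string (an assumption standing in for `videoFile.read()`) to show that decoding returns exactly the original bytes. Getting a single frame would then need a video decoder applied to those bytes.

```python
import base64

# Synthetic stand-in for videoFile.read(); a real MP4 would be read from disk.
video_bytes = bytes(range(256)) * 4

# Same call as in the question: one continuous base64 string, no frame markers.
text = base64.b64encode(video_bytes)

# Decoding gives back the original container bytes unchanged.
restored = base64.b64decode(text)

print(len(video_bytes), len(text), restored == video_bytes)
```

To obtain the base64 string of one frame, a plausible route is to decode the video with a video library, pick the frame, encode it as an image, and base64-encode that image.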
I am studying NLP techniques, and while I have some experience with .txt files, working with .docx has been troublesome. I am trying to run regexes on strings, and since my source is a Word document, this is my approach:
I use textract to extract the text from the .docx file and decode the resulting bytes to a string:
import textract
my_text = textract.process("1337.docx")
my_text = my_text.decode("utf-8")
I read the file:
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text
I then apply some regexes, such as removing all numbers, and call everything from the main script:
def regextest(doc):
    ...
    ...
text = load_doc(my_text)
tokens = regextest(text)
print(tokens)
I get the exception:
OSError: [Errno 36] File name too long: Are you buying a Tesla?\n\n\n\n - I believe the pricing is...(and more text from the file)
I know I am transforming my docx file to a text file and then, when I read the "filename", it is actually the whole text. How can I preserve the file and make it work? How would you guys approach this?
It seems that you are passing the contents of the file (my_text) as the filename parameter to load_doc, hence the error.
You probably want to pass an actual file name instead, possibly '1337.docx', and not the contents of that file.
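Another way to read the diagnosis above: `my_text` already holds the decoded text, so the regex step can work on it directly without going through `load_doc`. The sketch below assumes that reading. The body of `regextest` is a hypothetical stand-in (the question elides it), and the sample string replaces the `textract.process("1337.docx")` call so the example runs without the library or the file.

```python
import re

def regextest(text):
    # Hypothetical stand-in for the elided regex step:
    # remove all digits, then split on whitespace.
    return re.sub(r"\d+", "", text).split()

# In the question this would be textract.process("1337.docx").decode("utf-8").
my_text = "Are you buying a Tesla 3?\n\n - I believe the pricing is 35000"

# Pass the text itself; no file needs to be opened again.
tokens = regextest(my_text)
print(tokens)
```

If the text should also be kept on disk, it can be written to a .txt file once, and that file's name (not its contents) is what `load_doc` expects.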
I am trying to read a file from a directory and convert its content to base64, but the result is blank. My code is below.
const configAvtar = _.get(req, ['files', 'configFile']);
configAvtar.mv('./uploads/' + configAvtar.name);
const contents = fs.readFileSync(`${process.env['root_dir']}/uploads/${configAvtar.name}`, {encoding: 'base64'});
console.log('contents', contents);
Here I first upload the file and then convert its content to base64, but the console shows blank output. I need to convert the file's content to base64 and get the encoded output.
I have a CSV file that contains non-UTF-8 bytes (possibly 0x84) in each line. I need to read this file using the C engine of pd.read_csv().
These bytes cause an error while reading: 'utf-8' codec can't decode byte 0x84 in position 14.
Is there any way to fix this without changing the file?
Try these options if it helps:
Option 1:
Set the engine as python.
pd.read_csv(filename, engine='python')
Option 2:
Try utf-16 encoding, because the error could also mean the file is encoded in UTF-16. Otherwise, pass the file's actual encoding, for example:
encoding = "cp1252"
encoding = "ISO-8859-1"
Option 3:
Read the file as bytes
with open(filename, 'rb') as f:
    data = f.read()
The b in the mode specifier tells open() to treat the file as binary, so the contents remain a bytes object. No decoding is attempted this way.
Alternatively, you can use the open function from the codecs module to read the file:
import codecs
with codecs.open(filename, 'r', encoding='utf-8', errors='ignore') as f:
    data = f.read()
This strips out (ignores) the undecodable bytes and returns the string without them. Only use this if you want to drop those bytes rather than convert them.
https://docs.python.org/3/howto/unicode.html#the-string-type
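To make the difference between these options concrete, here is a small sketch on an in-memory byte string. The sample data is an assumption, not the asker's file. It shows that a lone 0x84 byte is invalid UTF-8, that cp1252 and ISO-8859-1 both decode it (to different characters), and that errors='ignore' silently drops it.

```python
# Assumed sample: one CSV-like line containing a stray 0x84 byte.
raw = b"id;name\n1;caf\x84\n"

# Strict UTF-8 decoding fails on 0x84, as in the reported error.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Single-byte encodings accept every byte, but map 0x84 differently.
as_cp1252 = raw.decode("cp1252")      # 0x84 becomes U+201E
as_latin1 = raw.decode("ISO-8859-1")  # 0x84 becomes U+0084 (a control character)

# errors="ignore" removes the byte instead of converting it.
ignored = raw.decode("utf-8", errors="ignore")

print(utf8_ok, repr(as_cp1252), repr(as_latin1), repr(ignored))
```

Because the two single-byte encodings give different results for the same byte, it is worth checking which one matches the source system before passing it as `encoding=` to `pd.read_csv`.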
I extracted an embedded object from an Excel spreadsheet. The object was a PDF, but the Excel zip file saves embedded objects as binary files.
I am trying to read the binary file and return it to its original format as a PDF. I took some code from another question with a similar issue, but when I try opening the PDF, Adobe gives the error "can't open because file is damaged...not decoded correctly.."
Does anyone know of a way to do this?
with open('oleObject1.bin','rb') as f:
    binaryData = f.read()
    print(binaryData)
with open(os.path.expanduser('test1.pdf'), 'wb') as fout:
    fout.write(base64.decodebytes(binaryData))
Thanks Ryan, I was able to see what you are talking about. Here is the solution for future reference.
import os
str1 = b'%PDF-'  # Begin PDF
str2 = b'%%EOF'  # End PDF
with open('oleObject1.bin', 'rb') as f:
    binary_data = f.read()
# Find where the PDF begins
result1 = binary_data.find(str1)
print(result1)
# Find where the PDF ends: take the last %%EOF, since a PDF can contain more than one
result2 = binary_data.rfind(str2)
print(result2)
# Keep only the bytes from %PDF- through the end of the last %%EOF
# (add len(str2) so the %%EOF characters themselves are included)
pdf_bytes = binary_data[result1:result2 + len(str2)]
print(len(pdf_bytes))
with open(os.path.expanduser('test1.pdf'), 'wb') as fout:
    fout.write(pdf_bytes)
The bin file contains a valid PDF. There is no decoding required. The bin file though does have bytes before and after the PDF that need to be trimmed.
To get the first byte, look for the first occurrence of the string %PDF-.
To get the final byte, look for the last %%EOF.
Note that I do not know what "format" the leading/trailing bytes added by Excel are in. The solution above obviously would not work if either of the ASCII strings above could also appear in the leading/trailing data.
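The trimming rule described above (first %PDF-, last %%EOF) can be sketched as a small function with a guard for missing markers. The function name `extract_pdf` and the synthetic input are illustrative assumptions. The input is a made-up byte string standing in for oleObject1.bin, not the real OLE wrapper format.

```python
PDF_START = b"%PDF-"
PDF_END = b"%%EOF"

def extract_pdf(data: bytes) -> bytes:
    """Return the bytes from the first %PDF- through the last %%EOF."""
    start = data.find(PDF_START)
    end = data.rfind(PDF_END)
    if start == -1 or end == -1 or end < start:
        raise ValueError("no PDF markers found in the data")
    return data[start:end + len(PDF_END)]

# Synthetic stand-in: leading junk, a fake PDF with two %%EOF markers, trailing junk.
wrapped = (
    b"\x01\x02OLE-HEADER"
    + b"%PDF-1.4\nbody\n%%EOF\nupdate\n%%EOF"
    + b"\x00\x00trailer"
)
pdf = extract_pdf(wrapped)
print(pdf)
```

Using `rfind` for the end marker matters when a PDF has been saved incrementally and contains several %%EOF markers. The caveat above still applies: if the wrapper bytes themselves contain either marker, this heuristic picks the wrong boundary.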
You should try using a Python library that can write PDF files, such as reportlab or pyPDF.
I want to write data to a cp1250-encoded file and zip it without temporarily storing it on the filesystem.
I figured out that I need something like this:
f = io.TextIOBase(newline='', encoding='cp1250')
writer = csv.writer(f, delimiter=';', dialect='excel', quoting=csv.QUOTE_ALL)
writer.writerow([3,3,3,4])
with ZipFile('cvs.zip', 'w') as zip_file:
    zip_file.writestr('test.cvs', f.getvalue())
But now on the third line I get:
io.UnsupportedOperation: write
This is probably because I am using io.TextIOBase, but I can't set an encoding on any StringIO.
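One possible approach, sketched under the assumption that the whole CSV fits in memory: write the rows to an `io.StringIO` (which holds text and therefore has no encoding), then encode the finished string to cp1250 bytes and hand those bytes to `writestr`. The second data row and the use of an in-memory `io.BytesIO` for the archive are illustrative choices. Passing 'cvs.zip' to `ZipFile` instead would write the archive to disk as in the question.

```python
import csv
import io
import zipfile

# Text buffer for the CSV; newline='' lets the csv module control line endings.
text_buffer = io.StringIO(newline='')
writer = csv.writer(text_buffer, delimiter=';', dialect='excel', quoting=csv.QUOTE_ALL)
writer.writerow([3, 3, 3, 4])
writer.writerow(['žluťoučký', 'kůň'])  # sample non-ASCII row that cp1250 can represent

# Encode once, after all rows are written.
payload = text_buffer.getvalue().encode('cp1250')

# writestr accepts bytes, so no temporary file is needed.
zip_buffer = io.BytesIO()
with zipfile.ZipFile(zip_buffer, 'w') as zip_file:
    zip_file.writestr('test.cvs', payload)

print(len(payload), len(zip_buffer.getvalue()))
```

Characters that cp1250 cannot represent would raise `UnicodeEncodeError` at the `encode` call, so an `errors=` policy may be needed depending on the data.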