Base64 encoded file says "GZIP", but decoding it in Python outputs corrupt HTML

I'm having trouble reading data from files in an old backup (from a Windows system).
Here is an example of what the content looks like:
GZIP
-}_HTML>
<H AD>
<META HTTP-EQUV="Conten-Type" CO!TENT="tex/html; chrset=wind&ws-1252">
It's almost proper HTML... but some characters are corrupted.
In Base64, it looks like this:
R1pJUAwAAAAKAAAALX0AAF9IVE1MPg0KPEggQUQ+DQo8TUVUQSBIVFRQLUVRVQ5WPSJDb250ZW4ZLVR5cGUiIENPIVRFTlQ9InRleBwvaHRtbDsgY2gTcnNldD13aW5kJndzLTEyNTIi
Since it says "GZIP" at the top, I tried decompressing it with gzip in Python.
import zlib
import base64
s = "R1pJUAwAAAAKAAAALX0AAF9IVE1MPg0KPEggQUQ+DQo8TUVUQSBIVFRQLUVRVQ5WPSJDb250ZW4ZLVR5cGUiIENPIVRFTlQ9InRleBwvaHRtbDsgY2gTcnNldD13aW5kJndzLTEyNTIi"
s = base64.b64decode(s.encode('Latin1'))
zlib.decompress(s, 31)
However, I get this error:
zlib.error: Error -3 while decompressing data: incorrect header check
The same happens with this code:
import gzip
s = gzip.decompress(s)
s = str(s,'utf-8')
print(s)
gzip.BadGzipFile: Not a gzipped file (b'GZ')
Any idea how I can recover data from this file?

Despite the word "GZIP" at the top, the data is neither gzip nor compressed in any other way. What you see after Base64 decoding is the content itself.
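One way to check the "not gzip" part of this answer is to look at the first bytes after Base64 decoding. A real gzip stream starts with the two magic bytes 0x1f 0x8b, whereas this data starts with the ASCII letters "GZIP", which is why gzip reports b'GZ'. The sketch below only inspects the sample from the question; the helper name has_gzip_magic is made up for illustration.

```python
import base64

# The Base64 sample copied from the question.
SAMPLE_B64 = "R1pJUAwAAAAKAAAALX0AAF9IVE1MPg0KPEggQUQ+DQo8TUVUQSBIVFRQLUVRVQ5WPSJDb250ZW4ZLVR5cGUiIENPIVRFTlQ9InRleBwvaHRtbDsgY2gTcnNldD13aW5kJndzLTEyNTIi"

# Every gzip member begins with these two bytes.
GZIP_MAGIC = b"\x1f\x8b"


def has_gzip_magic(data: bytes) -> bool:
    """Return True if the data starts with the gzip magic bytes (hypothetical helper)."""
    return data[:2] == GZIP_MAGIC


raw = base64.b64decode(SAMPLE_B64)

print(raw[:4])              # the literal ASCII text that opens the file
print(has_gzip_magic(raw))  # whether gzip/zlib could accept this as a gzip stream
print(raw[16:40])           # start of the HTML-like payload after the 16-byte prefix
```

The check says nothing about what the 16 bytes before the HTML-like text mean; it only shows that the gzip and zlib errors are expected for this input.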

Related

Not able to read file in Pypandoc

I am trying to convert a PDF to HTML using Pandoc. I have installed the pandoc binary and added its path to the environment variables, and I am using:
import pypandoc
import os
os.environ.setdefault('PYPANDOC_PANDOC', 'C://Program Files//Pandoc//pandoc.exe')
file_path = r"D:/46580375_1593783098922.pdf"
output = pypandoc.convert_file("46580375_1593783098922.pdf", to='html', outputfile= 'test.html')
It gives me this error:
RuntimeError: Invalid input format! Got "pdf" but expected one of
these: commonmark, creole, csv, docbook, docx, dokuwiki, epub, fb2,
gfm, haddock, html, ipynb, jats, jira, json, latex, man, markdown,
markdown_github, markdown_mmd, markdown_phpextra, markdown_strict,
mediawiki, muse, native, odt, opml, org, rst, t2t, textile, tikiwiki,
twiki, vimwiki
What am I missing?

As the error says, PDF is not among the input formats pandoc supports, so you can't convert PDF to HTML via pandoc.

Download xml file from the server with Python3

I am trying to download an XML file from a public data bank:
http://api.worldbank.org/v2/en/indicator/SP.POP.TOTL?downloadformat=xml
I tried to do it with requests:
import requests
url = 'http://api.worldbank.org/v2/en/indicator/SP.POP.TOTL?downloadformat=xml'
response = requests.get(url)
response.encoding = 'utf-8' #or response.apparent_encoding
print(response.content)
and wget
import wget
wget.download(url, './my.xml')
But both approaches produce garbage instead of a correct file (it looks like a broken encoding, but I cannot fix it).
If I download the file with a web browser, I get a correct UTF-8 XML file.
What am I doing wrong in the code?
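No answer is recorded here, so the cause is not confirmed. One thing worth ruling out when downloaded bytes look like a "broken encoding" is that the response is not text at all but an archive or compressed stream. The sketch below is an assumption-laden illustration: sniff_container is a hypothetical helper, and the bytes it inspects are built in memory rather than fetched from the URL in the question.

```python
import io
import zipfile


def sniff_container(data: bytes) -> str:
    """Guess the container type from the leading magic bytes (hypothetical helper)."""
    if data[:4] == b"PK\x03\x04":
        return "zip"
    if data[:2] == b"\x1f\x8b":
        return "gzip"
    if data.lstrip()[:1] == b"<":
        return "xml-like"
    return "unknown"


# Build a small zip archive in memory to stand in for a downloaded body.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("data.xml", "<?xml version='1.0' encoding='utf-8'?><root/>")
archive_bytes = buffer.getvalue()

print(sniff_container(archive_bytes))
print(sniff_container(b"<?xml version='1.0'?><root/>"))

# If the body turns out to be a zip archive, the XML can be read from it.
with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
    first_name = archive.namelist()[0]
    xml_text = archive.read(first_name).decode("utf-8")
print(xml_text)
```

With requests, the same check could be applied to response.content, since that attribute holds the undecoded bytes and is not affected by response.encoding.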

Decode binary file in python3 does not work

I'm trying to decode a binary file.
Here is a snippet from the file:
Content-Length: 122092
Content-Type: application/octet-stream
b'\x00\x01\x00\x00\x00\x0e\x00\x80\x00\x03\x00`FFTMm\x02\xd2u\x00\x00\x00\xec\x00\x00\x00\x1cGDEF\x00\'\x023\x00\x00\x01\x08\x00\x00\x00\x1eOS/2\x886z\x01\x00\x00\x01(\x00\x00\x00`cmap\xc9\x03\xa0\xac\x00\x00\x01\x88\x00\x00\x02\xc2gasp\xff\xff\x00\x03\x00\x00\x04L\x00\x00\x00\x08glyf\x05.G/\x00\x00\x04T\x00\x01\xb1\xechead\t\xe6\x15\x97\x00\x01\xb6#\x00\x00\x006hhea\x0e\xf9\n(\x00\x01\xb6x\x00\x00\x00$hmtx:b\x13<\x00\x01\xb6\x9c\x00\x00\x08\xaaloca\x18\xce\x84\xc4\x00\x01\xbfH\x00\x00\x04\\maxp\x02\x96\x02\x1c\x00\x01\xc3\xa4\x00\x00\x00
...
\xf52\xb5{ \x9e\xa8V\xb1\x93\x89m8\xca\xca\xdf\x14K\xb1g\x1dX3\xe4\xa3\xa5e\x195>\x9a&\x15\xa8\xa5eZ\xfe\x00c\x80\x06\xef\xb6\x18M\x06\x9d\xc1\xc0\xeb\x95\xc5\x8e\xce?\xcdh\xbetV\xfb\xbe\x99\x03\xbb\x
I used the following code to decode the file:
file_data = open('font2.txt', 'rb').read()
file_data = file_data.decode('utf-16')
print(file_data)
Unfortunately, it does not seem to be decoded correctly: the output of decode looks the same as the encoded file.
Do you have any suggestions for how to solve this?
I am grateful for any hints. :)
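No answer is recorded here. As a hedged observation, the first bytes of the snippet are consistent with the table directory of a TrueType/OpenType (sfnt) font: a version field, a table count, and then 16-byte records with four-letter tags such as FFTM, GDEF, and cmap. If that reading is right, the file is font data rather than encoded text, and no text codec would make it readable. The sketch below parses only the 28 bytes visible at the start of the snippet; the field names follow the usual sfnt layout and are an assumption about this particular file.

```python
import struct

# First 28 bytes copied from the snippet in the question.
HEAD = (
    b"\x00\x01\x00\x00\x00\x0e\x00\x80\x00\x03\x00`"
    b"FFTMm\x02\xd2u\x00\x00\x00\xec\x00\x00\x00\x1c"
)

# Assumed sfnt header: version, table count, and three search helper fields,
# all big-endian.
version, num_tables, search_range, entry_selector, range_shift = struct.unpack(
    ">IHHHH", HEAD[:12]
)

# Assumed first table record: tag, checksum, offset, length.
tag, checksum, offset, length = struct.unpack(">4sIII", HEAD[12:28])

print(hex(version), num_tables)
print(tag, offset, length)

# In an sfnt file the first table can start right after the directory,
# which occupies 12 bytes plus 16 bytes per table record.
print(12 + 16 * num_tables == offset)
```

If the file on disk really contains the printed b'...' text rather than the raw bytes, it would first have to be turned back into bytes before any such parsing applies.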

open large gzip file (~1gb) in python

I am a beginner in Python. I have written a few lines of code to open a large gzip file (about 1 GB) and extract some content, but I am getting a memory-related error. Could somebody please explain how to open the gzip file with limited memory? Below is the part of the code that throws the error.
import os
import gzip
with gzip.open("test.gz","rb") as peak:
    for line in peak:
        file_content = line.read().decode("utf-8")
        print(file_content)
Error: File "/software/anaconda3/lib/python3.7/gzip.py", line 276, in read
return self._buffer.read(size)

I tried to recreate your issue but was unable to. I used fallocate to create a big file and then gzipped it, but hit no error in Python:
$ fallocate -l 2G tempfile.img
$ gzip tempfile.img
$ ipython
>>> import gzip
>>> with gzip.open('tempfile.img.gz', 'rb') as fIn:
...     content = fIn.read()
If you hit an exception, it should have some name like OSError or something more specific. My guess is that you have a 32-bit installation of Python which would impose memory limits in the gigabyte range. This SO thread covers a way to check if you're running 32- or 64-bit.
If you post the name of the exception or a reproducible example, then I can update this answer.
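Independent of the exact exception, a common way to keep memory use low is to iterate over the gzip file line by line instead of reading it all at once. Note that iterating a file opened in binary mode already yields bytes objects, so each line can be decoded directly without calling .read() on it. The sketch below is a minimal illustration under those assumptions: iter_gzip_lines and python_bitness are hypothetical helper names, and the demo file is a tiny temporary one, not the 1 GB file from the question.

```python
import gzip
import os
import struct
import tempfile


def iter_gzip_lines(path, encoding="utf-8"):
    """Yield decoded lines one at a time so the whole file is never held in memory."""
    with gzip.open(path, "rt", encoding=encoding) as handle:
        for line in handle:
            yield line.rstrip("\n")


def python_bitness():
    """Return 32 or 64 depending on the pointer size of this interpreter."""
    return struct.calcsize("P") * 8


# Small self-contained demo with a temporary gzip file.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.gz")
    with gzip.open(demo_path, "wt", encoding="utf-8") as out:
        out.write("first\nsecond\nthird\n")
    demo_lines = list(iter_gzip_lines(demo_path))

print(demo_lines)
print(python_bitness())
```

For a large file, the loop body would process each line as it arrives rather than collecting everything into a list as the demo does.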

File downloaded larger than original

I'm working on a small Python 3 server and I want to download a SQLite database from it. But when I tried that, I discovered that the downloaded file is larger than the original: the original file is 108K and the downloaded file is 247K. I've tried this many times with the same result each time. I also compared the SHA-256 checksums, and they differ.
Here is my downloader.py file:
import cgi
import os
print('Content-Type: application/octet-stream')
print('Content-Disposition: attachment; filename="Library.db"\n')
db = os.path.realpath('..') + '/Library.db'
with open(db,'rb') as file:
    print(file.read())
Thanks in advance !
EDIT:
I tried this:
$ ./downloader > file
The resulting file is also 247K.

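Well, I've finally found the solution. The problem (which I didn't see at first) was that the server sent plain text to the client: print(file.read()) writes the text representation of the bytes object (b'...' with escape sequences) rather than the raw bytes. Here is one way to send binary data: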
import cgi
import os
import shutil
import sys
print('Content-Type: application/octet-stream; file="Library.db"')
print('Content-Disposition: attachment; filename="Library.db"\n')
sys.stdout.flush()
db = os.path.realpath('..') + '/Library.db'
with open(db,'rb') as file:
    shutil.copyfileobj(file, sys.stdout.buffer)
But if someone has a better syntax, I would be glad to see it! Thank you!
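The size difference can be reproduced without a server. The sketch below is an illustration with a made-up payload, not the actual database: it compares what print() writes for a bytes object with what a binary copy writes, using in-memory streams in place of sys.stdout and sys.stdout.buffer.

```python
import io
import shutil

# Stand-in for the binary contents of a file such as Library.db.
payload = bytes(range(256)) * 4

# What print(file.read()) does: it writes the repr of the bytes object as text.
text_out = io.StringIO()
print(payload, file=text_out)
as_text = text_out.getvalue().encode("ascii")

# What shutil.copyfileobj does with a binary stream: it copies the raw bytes.
binary_out = io.BytesIO()
shutil.copyfileobj(io.BytesIO(payload), binary_out)
as_binary = binary_out.getvalue()

print(len(payload), len(as_text), len(as_binary))
print(as_text[:2])  # the text form begins with the b' prefix of the repr
```

The text form is longer because every non-printable byte becomes a multi-character escape sequence, while the binary copy has exactly the size of the original.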
