The XML File I'm trying to read starts with b':
b'<?xml version="1.0" encoding="UTF-8" ?><root><property_id type="dict"><n53987 type="int">54522</n53987><n65731 type="int">66266</n65731><n44322 type="int">44857</n44322><n11633 type="int">12148</n11633><n28192 type="int">28727</n28192><n69053 type="int">69588</n69053><n26529 type="int">27064</n26529><n4844 type="int">4865</n4844><n7625 type="int">7646</n7625><n54697 type="int">55232</n54697><n6210 type="int">6231</n6210><n26710 type="int">27245</n26710><n57915 type="int">58450</n57915
import xml.etree.ElementTree as etree
tree = etree.decode("UTF-8").parse("./property.xml")
How can I decode this file? And read the dict type afterwards?
You can try this. Note that it returns an Element instance rather than an ElementTree:
import ast
import xml.etree.ElementTree as etree
tree = None
with open("property.xml", "r") as xml_file:
    f = xml_file.read()
    # convert string representation of bytes back to bytes
    raw_xml_bytes = ast.literal_eval(f)
    # read XML from raw bytes
    tree = etree.fromstring(raw_xml_bytes)
Another way is to read the file, write the decoded content out as a regular XML file, and then parse that file. This returns an ElementTree instance. You can achieve this using the following:
import ast
import xml.etree.ElementTree as etree
tree = None
with open("property.xml", "r") as xml_file:
    f = xml_file.read()
    # convert string representation of bytes back to bytes
    raw_xml_bytes = ast.literal_eval(f)
# save the converted string version of the XML file
with open('output.xml', 'w', encoding='utf-8') as file_obj:
    file_obj.write(raw_xml_bytes.decode('utf-8'))
# read saved XML file (pass the file object, not the text read earlier)
with open('output.xml', 'r', encoding='utf-8') as xml_file:
    tree = etree.parse(xml_file)
Opening an XML file in binary mode ('rb') and reading it returns data of type bytes, which has a .decode() method (cf. https://docs.python.org/3/library/stdtypes.html#bytes.decode). You can do the following, using the appropriate encoding name:
my_xml_text = xml_file.read().decode('utf-8')
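None of the answers above covers the second half of the question (reading the dict afterwards), so here is a minimal sketch. It assumes the file holds the repr of a bytes object as shown in the question, and that each child tag is the original integer key with an "n" prefix added (my guess from the sample, not something the question confirms). The shortened file_text below stands in for the real file content.

```python
import ast
import xml.etree.ElementTree as ET

# Stand-in for the text read from property.xml: the repr of a bytes object
file_text = (
    "b'<?xml version=\"1.0\" encoding=\"UTF-8\" ?><root>"
    "<property_id type=\"dict\">"
    "<n53987 type=\"int\">54522</n53987>"
    "<n65731 type=\"int\">66266</n65731>"
    "</property_id></root>'"
)

# Turn the b'...' text back into real bytes, then parse the XML
raw_xml_bytes = ast.literal_eval(file_text)
root = ET.fromstring(raw_xml_bytes)

# Assumption: tags look like "n<key>", so drop the leading "n" to get the key
property_id = {
    int(child.tag[1:]): int(child.text)
    for child in root.find("property_id")
}
print(property_id)
```

If the keys should stay as the tag names, use child.tag directly instead of int(child.tag[1:]).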
Related
I need to create a zip file from multiple txt files generated from strings.
import zipfile
from io import StringIO
def zip_files(file_arr):
    # file_arr is an array of [(fname, fbuffer), ...]
    f = StringIO()
    z = zipfile.ZipFile(f, 'w', zipfile.ZIP_DEFLATED)
    for f in file_arr:
        z.writestr(f[0], f[1])
    z.close()
    return f.getvalue()
file1 = ('f1.txt', 'Question1\nQuestion2\n\nQuestion3')
file2 = ('f2.txt', 'Question4\nQuestion5\n\nQuestion6')
f_arr = [file1, file2]
return zip_files(f_arr)
This throws the error TypeError: string argument expected, got 'bytes' on writestr(). I have tried to use BytesIO instead of StringIO, but I still get an error. This is based on an answer that does this for Python 2.
I can't seem to find anything online about using zipfile for multiple files stored
Zip files are binary files, so you should use an io.BytesIO stream instead of an io.StringIO one.
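A minimal sketch of that change, keeping the question's zip_files signature. Note that the loop in the question reuses the name f for both the buffer and the loop variable, which may be why switching to BytesIO alone still failed. The names buffer, archive, name and text are my own.

```python
import io
import zipfile


def zip_files(file_arr):
    """Build a zip archive in memory from [(fname, text), ...] and return its bytes."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        # Use distinct names so the buffer is not overwritten by the loop variable
        for name, text in file_arr:
            archive.writestr(name, text)
    # The archive is complete once the with-block has closed it
    return buffer.getvalue()


file1 = ("f1.txt", "Question1\nQuestion2\n\nQuestion3")
file2 = ("f2.txt", "Question4\nQuestion5\n\nQuestion6")
zip_bytes = zip_files([file1, file2])
```

The returned bytes can be written to disk in "wb" mode or sent as a download.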
I have used tweepy to store the text of tweets in a csv file using Python csv.writer(), but I had to encode the text in utf-8 before storing, otherwise tweepy throws a weird error.
Now, the text data is stored like this:
"b'Lorem Ipsum\xc2\xa0Assignment '"
I tried to decode this using this code (there is more data in other columns, text is in 3rd column):
with open('data.csv','rt',encoding='utf-8') as f:
    reader = csv.reader(f,delimiter=',')
    for row in reader:
        print(row[3])
But, it doesn't decode the text. I cannot use .decode('utf-8') as the csv reader reads data as strings i.e. type(row[3]) is 'str' and I can't seem to convert it into bytes, the data gets encoded once more!
How can I decode the text data?
Edit: Here's a sample line from the csv file:
67783591545656656999,3415844,1450443669.0,b'Virginia School District Closes After Backlash Over Arabic Assignment: The Augusta County school district in\xe2\x80\xa6 | #abcde',52,18
Note: If the solution is in the encoding process, please note that I cannot afford to download the entire data again.
The easiest way is as below. Try it out.
import csv
from io import StringIO
byte_content = b"iam byte content"
content = byte_content.decode()
file = StringIO(content)
csv_data = csv.reader(file, delimiter=",")
If your input file really contains strings with Python-syntax b prefixes on them, one way to work around it (even though it's not really a valid format for CSV data to contain) would be to use Python's ast.literal_eval() function as @Ry suggested, although I would use it in a slightly different manner, as shown below.
This will provide a safe way to parse strings in the file which are prefixed with a b indicating they are byte-strings. The rest will be passed through unchanged.
Note that this doesn't require reading the entire CSV file into memory.
import ast
import csv
def _parse_bytes(field):
    """Convert string represented in Python byte-string literal b'' syntax into
    a decoded character string - otherwise return it unchanged.
    """
    try:
        result = ast.literal_eval(field)
    except (ValueError, SyntaxError):
        return field  # not a Python literal, pass it through unchanged
    # only byte-strings are converted; numbers etc. stay as the original text
    return result.decode() if isinstance(result, bytes) else field

def my_csv_reader(filename, /, **kwargs):
    with open(filename, 'r', newline='') as file:
        for row in csv.reader(file, **kwargs):
            yield [_parse_bytes(field) for field in row]

reader = my_csv_reader('bytes_data.csv', delimiter=',')
for row in reader:
    print(row)
You can use ast.literal_eval to convert the incorrect fields back to bytes safely:
import ast
def _parse_bytes(bytes_repr):
    result = ast.literal_eval(bytes_repr)
    if not isinstance(result, bytes):
        raise ValueError("Malformed bytes repr")
    return result
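Putting the pieces together for the sample line from the question, here is a minimal sketch. It reads from an in-memory string instead of data.csv, and the helper name decode_field is made up for the example.

```python
import ast
import csv
import io

# The sample row from the question; raw strings keep the backslashes literal,
# exactly as they appear in the CSV file
sample = (
    r"67783591545656656999,3415844,1450443669.0,"
    r"b'Virginia School District Closes After Backlash Over Arabic Assignment: "
    r"The Augusta County school district in\xe2\x80\xa6 | #abcde',52,18"
)


def decode_field(field):
    """Return decoded text if the field looks like b'...', else the field itself."""
    if field.startswith(("b'", 'b"')):
        try:
            value = ast.literal_eval(field)
        except (ValueError, SyntaxError):
            return field
        if isinstance(value, bytes):
            return value.decode("utf-8")
    return field


rows = [
    [decode_field(field) for field in row]
    for row in csv.reader(io.StringIO(sample))
]
print(rows[0][3])
```

Only fields that start with b' or b" are evaluated, so the numeric columns stay as the strings the csv module produced.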
When I use xmltodict to load the xml file below I get an error:
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 1
Here is my file:
<?xml version="1.0" encoding="utf-8"?>
<mydocument has="an attribute">
<and>
<many>elements</many>
<many>more elements</many>
</and>
<plus a="complex">
element as well
</plus>
</mydocument>
Source:
import xmltodict
with open('fileTEST.xml') as fd:
    xmltodict.parse(fd.read())
I am on Windows 10, using Python 3.6 and xmltodict 0.11.0
If I use ElementTree it works
tree = ET.ElementTree(file='fileTEST.xml')
for elem in tree.iter():
    print(elem.tag, elem.attrib)
mydocument {'has': 'an attribute'}
and {}
many {}
many {}
plus {'a': 'complex'}
Note: I might have encountered a new line problem.
Note2: I used Beyond Compare on two different files.
It crashes on the file that is UTF-8 BOM encoded, and works on the plain UTF-8 file.
UTF-8 BOM is a sequence of bytes (EF BB BF) that allows the reader to identify a file as being encoded in UTF-8.
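To illustrate, here is a minimal sketch using only the standard library. The file name with_bom.xml and the tiny document are made up for the demo, and cp1252 is assumed as the locale default encoding (common on Western-European Windows setups, but not universal). Decoded that way, the BOM turns into three stray characters in front of the XML declaration, whereas the utf-8-sig codec strips it.

```python
import codecs
import os
import tempfile
import xml.etree.ElementTree as ET

xml_bytes = codecs.BOM_UTF8 + (
    b'<?xml version="1.0" encoding="utf-8"?>'
    b'<mydocument has="an attribute"/>'
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "with_bom.xml")  # hypothetical file name
    with open(path, "wb") as fh:
        fh.write(xml_bytes)

    # Decoding the BOM bytes EF BB BF as cp1252 yields three junk characters
    with open(path, encoding="cp1252") as fh:
        mangled = fh.read()

    # utf-8-sig decodes UTF-8 and silently drops a leading BOM if present
    with open(path, encoding="utf-8-sig") as fh:
        clean = fh.read()

print(len(mangled) - len(clean))  # number of stray characters in front
root = ET.fromstring(clean)
print(root.tag, root.attrib)
```

Opening the file in binary mode and handing the bytes to the parser is another option, since the parser can then detect the BOM itself.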
I think you forgot to define the encoding type.
I suggest that you try to initialize that xml file to a string variable:
import xml.etree.ElementTree as ET
import xmltodict
import json
tree = ET.parse('your_data.xml')
xml_data = tree.getroot()
#here you can change the encoding type to be able to set it to the one you need
xmlstr = ET.tostring(xml_data, encoding='utf-8', method='xml')
data_dict = dict(xmltodict.parse(xmlstr))
In my case the file was being saved with a Byte Order Mark, as is the default with Notepad++.
I resaved the file as plain UTF-8 without the BOM.
Python 3
One Liner
# assumes: import xmltodict; from xml.etree import ElementTree
data: dict = xmltodict.parse(ElementTree.tostring(ElementTree.parse(path).getroot()))
Helper for .json and .xml
I wrote a small helper function to load .json and .xml files from a given path.
I thought it might come in handy for some people here:
import json
from xml.etree import ElementTree
import xmltodict

def load_json(path: str) -> dict:
    if path.endswith(".json"):
        print(f"> Loading JSON from '{path}'")
        with open(path, mode="r") as open_file:
            content = open_file.read()
        return json.loads(content)
    elif path.endswith(".xml"):
        print(f"> Loading XML as JSON from '{path}'")
        xml = ElementTree.tostring(ElementTree.parse(path).getroot())
        return xmltodict.parse(xml, attr_prefix="#", cdata_key="#text", dict_constructor=dict)
    print(f"> Loading failed for '{path}'")
    return {}
Notes
If you want to get rid of the # and #text markers in the JSON output, use the parameters attr_prefix="" and cdata_key="".
Normally xmltodict.parse() returns an OrderedDict, but you can change that with the parameter dict_constructor=dict.
Usage
path = "my_data.xml"
data = load_json(path)
print(json.dumps(data, indent=2))
# OUTPUT
#
# > Loading XML as JSON from 'my_data.xml'
# {
#   "mydocument": {
#     "#has": "an attribute",
#     "and": {
#       "many": [
#         "elements",
#         "more elements"
#       ]
#     },
#     "plus": {
#       "#a": "complex",
#       "#text": "element as well"
#     }
#   }
# }
Sources
ElementTree.tostring()
ElementTree.parse()
xmltodict
json.dumps()
I had the same problem and solved it just by passing the encoding to the open function.
In this case it would be something like:
import xmltodict
with open('fileTEST.xml', encoding='utf8') as fd:
    xmltodict.parse(fd.read())
In my case, the issue was with the first 3 characters. So removing them worked:
import xmltodict
from xml.parsers.expat import ExpatError
with open('your_data.xml') as f:
    data = f.read()
try:
    doc = xmltodict.parse(data)
except ExpatError:
    doc = xmltodict.parse(data[3:])
xmltodict seems to not be able to parse <?xml version="1.0" encoding="utf-8"?>
If you remove this line, it works.
Not specific to the original post but for those who are also running into the same error at a different line, I was able to fix it by correcting the XML/XHTML error.
In my case, the document I was working with had a text description containing a bare ampersand symbol "&" instead of the escaped form "&amp;", so to fix my issue I had to edit the file before running it through the parser.
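A small sketch of that failure and the fix, using the standard library parser rather than xmltodict (both are built on expat, so the error is the same kind). The description text is made up. If you generate the XML yourself, xml.sax.saxutils.escape() does the replacement for you.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

description = "Tom & Jerry"  # made-up text with a bare ampersand

# A bare "&" starts an entity reference, so the document is not well-formed
try:
    ET.fromstring(f"<text>{description}</text>")
    raw_parsed = True
except ET.ParseError:
    raw_parsed = False

# escape() turns "&" into "&amp;" (and "<", ">" into "&lt;", "&gt;")
fixed = ET.fromstring(f"<text>{escape(description)}</text>")

print(raw_parsed)
print(fixed.text)
```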
filePath = "Black1.xml"
xml_file = open(filePath, 'r')
tree = etree.parse(filePath, etree.XMLParser(encoding="gb2312"))
root = tree.getroot()
print (etree.tostring(root, encoding = 'gb2312'))
I use the above code to print a gb2312 XML file, but the output is garbled. Is there a proper way to print a gb2312 XML file?
The original file content is:
But the printed content is:
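One possible explanation, assuming the garbled output is a b'...' value full of \x escapes (the screenshots are not available, so this is a guess): tostring() with a byte encoding returns a bytes object, and printing bytes shows escape sequences instead of characters. A sketch with the standard library's xml.etree.ElementTree and a made-up element:

```python
import xml.etree.ElementTree as etree

# Made-up element standing in for the content of Black1.xml
root = etree.Element("color")
root.text = "\u9ed1\u8272"  # two Chinese characters that exist in gb2312

# With a byte encoding, tostring() returns bytes; print() shows \x escapes
raw = etree.tostring(root, encoding="gb2312")
print(raw)

# Decode the bytes to get readable text ...
decoded = raw.decode("gb2312")
print(decoded)

# ... or ask for a str directly with the special value "unicode"
text = etree.tostring(root, encoding="unicode")
print(text)
```

Whether the characters then display correctly still depends on the encoding of the terminal or console being used.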
How to start creating my own filetype in Python ? I have a design in mind but how to pack my data into a file with a specific format ?
For example I would like my fileformat to be a mix of an archive ( like other format such as zip, apk, jar, etc etc, they are basically all archives ) with some room for packed files, plus a section of the file containing settings and serialized data that will not be accessed by an archive-manager application.
My requirement for this is about doing all this with the default modules for Cpython, without external modules.
I know that this can be long to explain and do, but I can't see how to start this in Python 3.x with Cpython.
Try this:
from zipfile import ZipFile
import json
data = json.dumps(['foo', {'bar': ('baz', None, 1.0, 2)}])
with ZipFile('foo.filetype', 'w') as myzip:
    myzip.writestr('digest.json', data)
The file is now a zip archive containing a JSON file, which is easy to read back in many languages. You can add more files to the archive with myzip.write() or myzip.writestr(). You can read the data back with:
with ZipFile('foo.filetype', 'r') as myzip:
    json_data_read = myzip.read('digest.json')
    newdata = json.loads(json_data_read)
Edit: you can append arbitrary data to the file with:
f = open('foo.filetype', 'a')
f.write(data)
f.close()
This works for WinRAR, but Python can no longer process the zip file.
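An alternative sketch that puts the extra data in front of the archive instead of behind it. It relies on the documented behaviour of zipfile's mode 'a': if the existing file is not a zip file, a new archive is appended to it. The layout (a 4-byte length followed by a JSON header) and the names write_container and read_container are my own invention, and whether a given archive manager tolerates the leading bytes is not guaranteed.

```python
import json
import os
import struct
import tempfile
import zipfile


def write_container(path, settings, files):
    """Write a length-prefixed JSON header, then append a zip archive."""
    header = json.dumps(settings).encode("utf-8")
    with open(path, "wb") as fh:
        fh.write(struct.pack("<I", len(header)))  # 4-byte little-endian length
        fh.write(header)
    # Mode "a" on a non-zip file appends a new archive after the existing bytes
    with zipfile.ZipFile(path, "a") as archive:
        for name, payload in files.items():
            archive.writestr(name, payload)


def read_container(path):
    """Return (settings, {name: bytes}) from a file made by write_container."""
    with open(path, "rb") as fh:
        (size,) = struct.unpack("<I", fh.read(4))
        settings = json.loads(fh.read(size).decode("utf-8"))
    with zipfile.ZipFile(path) as archive:
        files = {name: archive.read(name) for name in archive.namelist()}
    return settings, files


with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "foo.filetype")
    write_container(demo_path, {"version": 1}, {"digest.json": '["foo"]'})
    demo_settings, demo_files = read_container(demo_path)

print(demo_settings, demo_files)
```

Everything here uses only modules shipped with CPython, as the question requires.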
Use this:
import base64
import gzip
import ast
def save(data):
    # !r stores the repr, so strings keep their quotes and can be parsed back
    data = "[{!r}]".format(data).encode()
    data = base64.b64encode(data)
    return gzip.compress(data)

def load(data):
    data = gzip.decompress(data)
    data = base64.b64decode(data)
    return ast.literal_eval(data.decode())[0]
How to use this with file:
open(filename, "wb").write(save(data)) # save data
data = load(open(filename, "rb").read()) # load data
This may look like something an archive program can open,
but the payload is Base64-encoded, so it has to be decoded before the data can be read.
You can also store any value that ast.literal_eval() can parse back (strings, bytes, numbers, tuples, lists, dicts, sets, booleans and None).
example:
open(filename, "wb").write(save({"foo": "bar"})) # dict
open(filename, "wb").write(save("foo bar")) # string
open(filename, "wb").write(save(b"foo bar")) # bytes
# there's more you can store!
This may not be exactly what you asked for, but I think it may help you.
I faced a similar problem and ended up creating a zip file and then renaming its extension to my custom file format. Note that it can still be opened with WinRAR.