how to parse Unicode xml data with encoding header declare


I have an XML string whose header declares the encoding as GB2312. I want to parse it with lxml's etree.parse and use XPath to get values. The XML content contains non-ASCII characters. My current code is below:
from lxml import etree
from io import BytesIO
xml_data = """<?xml version="1.0" encoding="GB2312"?>
<CHARGEINFO>
<ITEM>
<DETAILEDCHARGESID>5E8C4E14E9C711EBB56E00155D96E1E0</DETAILEDCHARGESID>
<UNIT>元</UNIT>
<CHARGENUM>1</CHARGENUM>
<CHARGEDATE>2021-07-13 14:18:47</CHARGEDATE>
<CHARGESTATE>0</CHARGESTATE>
</ITEM>
</CHARGEINFO>"""
xml_root = etree.parse(BytesIO(xml_data.encode())).getroot()
xml_root.xpath("/CHARGEINFO/ITEM")
The error output is:
Traceback (most recent call last):
File "/Users/gui/Documents/workspace/PIS_middle_layer/venv/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3441, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-11-527d4b9dab58>", line 14, in <module>
xml_root = etree.parse(BytesIO(xml_data.encode())).getroot()
File "src/lxml/etree.pyx", line 3521, in lxml.etree.parse
File "src/lxml/parser.pxi", line 1876, in lxml.etree._parseDocument
File "src/lxml/parser.pxi", line 1896, in lxml.etree._parseMemoryDocument
File "src/lxml/parser.pxi", line 1784, in lxml.etree._parseDoc
File "src/lxml/parser.pxi", line 1141, in lxml.etree._BaseParser._parseDoc
File "src/lxml/parser.pxi", line 615, in lxml.etree._ParserContext._handleParseResultDoc
File "src/lxml/parser.pxi", line 725, in lxml.etree._handleParseResult
File "src/lxml/parser.pxi", line 654, in lxml.etree._raiseParseError
File "<string>", line 1
XMLSyntaxError: switching encoding: encoder error, line 1, column 38
Is there any better way to parse this xml data?

You've declared the XML to be encoded in GB2312, but xml_data.encode() encodes to UTF-8 by default, so the bytes you wrap in BytesIO do not match the declaration. Hence the "switching encoding" error. The declaration and the actual encoding of the data need to agree.
Change the xml_root assignment line to:
xml_root = etree.parse(BytesIO(xml_data.encode('gb2312'))).getroot()
OR change the xml_data assignment line to:
xml_data = """<?xml version="1.0" encoding="UTF-8"?>
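One way to keep the declaration and the bytes in agreement is to read the encoding name from the declaration itself before encoding. The following is only a sketch using the standard library: the helper name encode_per_declaration and its regular expression are hypothetical, not part of lxml, and the pattern assumes the declaration sits at the very start of the string.

```python
import re

# Hypothetical helper pattern: captures the encoding name in a leading XML declaration.
_DECL_ENCODING = re.compile(r'^<\?xml[^>]*?encoding=["\']([A-Za-z0-9._-]+)["\']')


def encode_per_declaration(xml_text, default="utf-8"):
    """Encode an XML str with the encoding named in its own declaration."""
    match = _DECL_ENCODING.match(xml_text)
    encoding = match.group(1) if match else default
    return xml_text.encode(encoding)


xml_data = """<?xml version="1.0" encoding="GB2312"?>
<CHARGEINFO><ITEM><UNIT>元</UNIT></ITEM></CHARGEINFO>"""

# The bytes now match the declared encoding, so they can be wrapped in
# BytesIO and handed to etree.parse as in the answer above.
xml_bytes = encode_per_declaration(xml_data)
```

The resulting xml_bytes would replace xml_data.encode() in the original call.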

Related

python pandas, unicode decode error on read_csv

When importing a csv file I am getting an error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 15: invalid start byte
traceback:
Traceback (most recent call last):
File "<ipython-input-2-99e71d524b4b>", line 1, in <module>
runfile('C:/AppData/FinRecon/py_code/python3/DataJoin.py', wdir='C:/AppData/FinRecon/py_code/python3')
File "C:\Users\stack\AppData\Local\Continuum\anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 786, in runfile
execfile(filename, namespace)
File "C:\Users\stack\AppData\Local\Continuum\anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 110, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "C:/AppData/FinRecon/py_code/python3/DataJoin.py", line 500, in <module>
M5()
File "C:/AppData/FinRecon/py_code/python3/DataJoin.py", line 221, in M5
s3 = pd.read_csv(working_dir+"S3.csv", sep=",") #encode here encoding='utf-16
File "C:\Users\stack\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\parsers.py", line 702, in parser_f
return _read(filepath_or_buffer, kwds)
File "C:\Users\stack\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\parsers.py", line 435, in _read
data = parser.read(nrows)
File "C:\Users\stack\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\parsers.py", line 1139, in read
ret = self._engine.read(nrows)
File "C:\Users\stack\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\parsers.py", line 1995, in read
data = self._reader.read(nrows)
File "pandas/_libs/parsers.pyx", line 899, in pandas._libs.parsers.TextReader.read
File "pandas/_libs/parsers.pyx", line 914, in pandas._libs.parsers.TextReader._read_low_memory
File "pandas/_libs/parsers.pyx", line 991, in pandas._libs.parsers.TextReader._read_rows
File "pandas/_libs/parsers.pyx", line 1123, in pandas._libs.parsers.TextReader._convert_column_data
File "pandas/_libs/parsers.pyx", line 1176, in pandas._libs.parsers.TextReader._convert_tokens
File "pandas/_libs/parsers.pyx", line 1299, in pandas._libs.parsers.TextReader._convert_with_dtype
File "pandas/_libs/parsers.pyx", line 1315, in pandas._libs.parsers.TextReader._string_convert
File "pandas/_libs/parsers.pyx", line 1553, in pandas._libs.parsers._string_box_utf8
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 15: invalid start byte
What I've tried:
`s3 = pd.read_csv(working_dir+"S3.csv", sep=",", encoding='utf-16')`
With that I get the error UnicodeError: UTF-16 stream does not start with BOM
What can be done to get this file to be read properly?

Try using s3 = pd.read_csv(working_dir+"S3.csv", sep=",", encoding='Latin-1')
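Encoding issues mostly arise from the characters within the data. UTF-8 can represent every Unicode character, but it has a byte structure that must be respected at all times. The byte 0xa0 from your error is a continuation byte in UTF-8 and cannot start a character, so decoding fails. In Latin-1, 0xa0 is a non-breaking space, which suggests the file was saved in a single-byte encoding such as Latin-1 or Windows-1252. In the same way, Latin-1 stores the Latin small letter i with diaeresis, the right-pointing double angle quotation mark, and the inverted question mark as the single bytes 0xef, 0xbb and 0xbf respectively, and those bytes are not valid UTF-8 when they stand alone.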
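When the encoding of a file is unknown, one option is to try a few candidates in order on the raw bytes. The sketch below is an illustration only: decode_with_fallback and the sample bytes are made up, and trying encodings in sequence is a heuristic, since Latin-1 accepts any byte sequence and may decode the data incorrectly without raising.

```python
def decode_with_fallback(raw, encodings=("utf-8", "cp1252", "latin-1")):
    """Return (text, encoding_name) for the first candidate that decodes raw."""
    for name in encodings:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


# Made-up sample: 0xe9 and 0xa0 are single-byte characters, not valid UTF-8 here.
sample = b"name,amount\ncaf\xe9,10\xa0000\n"
text, used = decode_with_fallback(sample)
print(used)
```

The encoding name that succeeds can then be passed as the encoding argument to pd.read_csv.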

Failing to parse xml file with lxml

I am trying to parse an XML file that pulls in other files through entity references (such as &matrices;), but parsing fails.
The file starts with:
<!DOCTYPE gdml [
<!ENTITY materials SYSTEM "materialsOptical.xml">
<!ENTITY solids SYSTEM "solids.xml">
<!ENTITY matrices SYSTEM "matrices.xml">
]>
<gdml xmlns:gdml="http://cern.ch/2001/Schemas/GDML" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="schema/gdml.xsd">
<define>
<constant name="PI" value="1.*pi"/>
&matrices;
And code being used is
from lxml import etree
#root = etree.fromstring(currentString)
parser = etree.XMLParser(resolve_entities=True)
root = etree.parse(filename, parser=parser)
But I get an error
File "/usr/share/freecad/Mod/GDML/importGDML.py", line 702, in processGDML
root = etree.parse(filename, parser=parser)
File "src/lxml/etree.pyx", line 3426, in lxml.etree.parse
File "src/lxml/parser.pxi", line 1839, in lxml.etree._parseDocument
File "src/lxml/parser.pxi", line 1865, in lxml.etree._parseDocumentFromURL
File "src/lxml/parser.pxi", line 1769, in lxml.etree._parseDocFromFile
File "src/lxml/parser.pxi", line 1162, in lxml.etree._BaseParser._parseDocFromFile
File "src/lxml/parser.pxi", line 600, in lxml.etree._ParserContext._handleParseResultDoc
File "src/lxml/parser.pxi", line 710, in lxml.etree._handleParseResult
File "src/lxml/parser.pxi", line 639, in lxml.etree._raiseParseError
<class 'lxml.etree.XMLSyntaxError'>: Failure to process entity matrices, line 14, column 11 (detector.gdml, line 14)

Okay, I had downloaded the matrices file as matrices.gdml, whereas the parent file referred to it as matrices.xml.
After renaming the file so the names match, it works now.
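A quick check for this kind of mismatch is to list the external entity files the document declares and see which ones are missing on disk. This is a rough sketch with the standard library only: missing_entity_files is a hypothetical helper, the regular expression handles only the simple double-quoted SYSTEM form shown in the question, and it assumes the targets are relative paths and the file is UTF-8.

```python
import os
import re

# Matches declarations of the form <!ENTITY name SYSTEM "file">.
_ENTITY = re.compile(r'<!ENTITY\s+(\S+)\s+SYSTEM\s+"([^"]+)"\s*>')


def missing_entity_files(xml_path):
    """Return (entity_name, target) pairs whose target file does not exist."""
    base_dir = os.path.dirname(os.path.abspath(xml_path))
    with open(xml_path, encoding="utf-8") as handle:
        text = handle.read()
    missing = []
    for name, target in _ENTITY.findall(text):
        # Targets are assumed to be relative to the parent document.
        if not os.path.exists(os.path.join(base_dir, target)):
            missing.append((name, target))
    return missing
```

Calling missing_entity_files("detector.gdml") before parsing would have reported matrices.xml as absent in the situation described above.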

NLTK download returns a parse error regarding xml

I have Python 3 on Windows, installed with Anaconda. I am trying to run this piece of code, which downloads punkt from NLTK, but it returns the following error.
import nltk
nltk.download('punkt')
sent_segmenter = nltk.data.load("tokenizers/punkt/english.pickle")
sentences = sent_segmenter.tokenize(st)
print(sentences)
Error:
Traceback (most recent call last):
File "C:\Users\G751\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 3267, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-7-2eb0f40de2a3>", line 2, in <module>
nltk.download('punkt')
File "C:\Users\G751\Anaconda3\lib\site-packages\nltk\downloader.py", line 787, in download
for msg in self.incr_download(info_or_id, download_dir, force):
File "C:\Users\G751\Anaconda3\lib\site-packages\nltk\downloader.py", line 636, in incr_download
info = self._info_or_id(info_or_id)
File "C:\Users\G751\Anaconda3\lib\site-packages\nltk\downloader.py", line 609, in _info_or_id
return self.info(info_or_id)
File "C:\Users\G751\Anaconda3\lib\site-packages\nltk\downloader.py", line 1019, in info
self._update_index()
File "C:\Users\G751\Anaconda3\lib\site-packages\nltk\downloader.py", line 962, in _update_index
ElementTree.parse(urlopen(self._url)).getroot()
File "C:\Users\G751\Anaconda3\lib\xml\etree\ElementTree.py", line 1197, in parse
tree.parse(source, parser)
File "C:\Users\G751\Anaconda3\lib\xml\etree\ElementTree.py", line 598, in parse
self._root = parser._parse_whole(source)
File "<string>", line unknown
ParseError: not well-formed (invalid token): line 1, column 0

Please use
nltk.download(force=True)
This should work.

character causing syntax issue with statsmodel

I'm trying to fit a linear model to some data using the code below, and I'm getting the error shown after it. I think the error is caused by the '%' in the field name. Many fields in my data follow this naming convention. Does anyone know how to solve this issue with statsmodels?
code:
mod = ols('fieldA%'+'~'+'fieldB',data=smp_df).fit()
error:
Traceback (most recent call last):
File "C:\Users\username\AppDataPython\envs\py36\lib\site-packages\IPython\core\interactiveshell.py", line 3267, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-24-9e4f478cefb9>", line 3, in <module>
mod = ols('fieldA%'+' ~'+'fieldB',data=smp_df).fit()
File "C:\Users\username\AppDataPython\envs\py36\lib\site-packages\statsmodels\base\model.py", line 155, in from_formula
missing=missing)
File "C:\Users\username\AppDataPython\envs\py36\lib\site-packages\statsmodels\formula\formulatools.py", line 65, in handle_formula_data
NA_action=na_action)
File "C:\Users\username\AppDataPython\envs\py36\lib\site-packages\patsy\highlevel.py", line 310, in dmatrices
NA_action, return_type)
File "C:\Users\username\AppDataPython\envs\py36\lib\site-packages\patsy\highlevel.py", line 165, in _do_highlevel_design
NA_action)
File "C:\Users\username\AppDataPython\envs\py36\lib\site-packages\patsy\highlevel.py", line 70, in _try_incr_builders
NA_action)
File "C:\Users\username\AppDataPython\envs\py36\lib\site-packages\patsy\build.py", line 689, in design_matrix_builders
factor_states = _factors_memorize(all_factors, data_iter_maker, eval_env)
File "C:\Users\username\AppDataPython\envs\py36\lib\site-packages\patsy\build.py", line 354, in _factors_memorize
which_pass = factor.memorize_passes_needed(state, eval_env)
File "C:\Users\username\AppDataPython\envs\py36\lib\site-packages\patsy\eval.py", line 474, in memorize_passes_needed
subset_names = [name for name in ast_names(self.code)
File "C:\Users\username\AppDataPython\envs\py36\lib\site-packages\patsy\eval.py", line 474, in <listcomp>
subset_names = [name for name in ast_names(self.code)
File "C:\Users\username\AppDataPython\envs\py36\lib\site-packages\patsy\eval.py", line 105, in ast_names
for node in ast.walk(ast.parse(code)):
File "C:\Users\username\AppDataPython\envs\py36\lib\ast.py", line 35, in parse
return compile(source, filename, mode, PyCF_ONLY_AST)
File "<unknown>", line 1
CIM_ID_SALES %
^
SyntaxError: invalid syntax

Converting a supposed excel file in csv in python

I am having an issue with some code for converting a file to CSV.
I am using the code below as a start:
directory = 'C:\OI Data'
filename = 'OpenInterest08-24-16'
data_xls = pd.read_excel(os.path.join(directory,filename), 'Sheet1', index_col=None)
data_xls.to_csv(os.path.join(directory,filename +'.csv'), encoding='utf-8')
and I am getting the following error:
Traceback (most recent call last):
File "", line 1, in
File "C:\Anaconda2\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 714, in runfile
execfile(filename, namespace)
File "C:\Anaconda2\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 74, in execfile
exec(compile(scripttext, filename, 'exec'), glob, loc)
File "C:/Users/Public/Documents/Python Scripts/work.py", line 26, in
data_xls = pd.read_excel(os.path.join(directory,filename), 'Sheet1', index_col=None)
File "C:\Anaconda2\lib\site-packages\pandas\io\excel.py", line 170, in read_excel
io = ExcelFile(io, engine=engine)
File "C:\Anaconda2\lib\site-packages\pandas\io\excel.py", line 227, in init
self.book = xlrd.open_workbook(io)
File "C:\Anaconda2\lib\site-packages\xlrd__init__.py", line 441, in open_workbook
ragged_rows=ragged_rows,
File "C:\Anaconda2\lib\site-packages\xlrd\book.py", line 91, in open_workbook_xls
biff_version = bk.getbof(XL_WORKBOOK_GLOBALS)
File "C:\Anaconda2\lib\site-packages\xlrd\book.py", line 1230, in getbof
bof_error('Expected BOF record; found %r' % self.mem[savpos:savpos+8])
File "C:\Anaconda2\lib\site-packages\xlrd\book.py", line 1224, in bof_error
raise XLRDError('Unsupported format, or corrupt file: ' + msg)
xlrd.biffh.XLRDError: Unsupported format, or corrupt file: Expected BOF record; found '\n\n\n\n\n '
I am struggling to figure out what format the file is actually in. It comes from:
https://www.theice.com/marketdata/reports/icefuturesus/PreliminaryOpenInterest.shtml?futuresExcel=&tradeDate=8%2F24%2F16
I am still a beginner at Python and some help would be much appreciated.
Thanks

You can start by fixing this part:
data_xls.to_csv(os.path.join(directory,filename,'.csv'), encoding='utf-8')
What happens when you do that is:
'C:\OI Data\\OpenInterest08-24-16\\.csv'
Which is not what you want. Instead do:
os.path.join(directory,filename+'.csv')
Which will give you:
'C:\OI Data\\OpenInterest08-24-16.csv'
Also, this is not a problem here, but in general be careful with this because a single backslash and a character can indicate an escape sequence, e.g. \n is a newline:
directory = 'C:\OI Data'
Instead escape the backslash like so:
directory = 'C:\\OI Data'
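The difference between the two join calls, and the raw-string alternative to doubling backslashes, can be checked directly. This sketch uses ntpath, the Windows flavour of os.path, purely so that it produces Windows-style results on any operating system; on Windows itself os.path.join behaves the same way.

```python
import ntpath

# A raw string keeps the single backslash literal, equivalent to 'C:\\OI Data'.
directory = r"C:\OI Data"
filename = "OpenInterest08-24-16"

# Passing '.csv' as a separate component makes it its own path segment.
wrong = ntpath.join(directory, filename, ".csv")

# Concatenating the extension first yields a single file name.
right = ntpath.join(directory, filename + ".csv")

print(wrong)
print(right)
```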
