Handling an invalid XML attribute in an Excel document

I'm using openpyxl to read an Excel document. For reasons that I don't understand at all, two of the cell-style names have a ctrl-d in them in xl/styles.xml in the ZIP archive storing the spreadsheet:
<cellStyle name="^D" xfId="20" builtinId="53" customBuiltin="true"/>
<cellStyle name="^D 2" xfId="21" builtinId="53" customBuiltin="true"/>
(That's a ctrl-D in both names.) Openpyxl's load_workbook function quite reasonably chokes with the following error:
lxml.etree.XMLSyntaxError: invalid character in attribute value, line 2, column 11879
Approaches that I've considered:
Preprocess and replace styles.xml
Ignore styles altogether somehow
Manually remove the cell styles in oocalc (or Excel)
Any ideas/advice?

Shoot whoever or whatever produced the file because this is invalid XML! ;-) Submit a bug upstream.
If you can clean it up in MS Excel, that's going to be easier; otherwise you can write your own preprocessor using openpyxl's code: styles/stylesheet.py will let you read the source without having to worry about namespaces, and you should be able to change the elements in place. Stylesheets are almost never that big (although some libraries do produce massive ones with junk in them).
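If you go the preprocessing route, a simpler alternative to hooking openpyxl's stylesheet parser is to rewrite the archive itself. Here is a minimal sketch (broken.xlsx and clean.xlsx are placeholder names) that replaces XML-illegal control characters in xl/styles.xml before openpyxl sees them:
import re
import zipfile
import openpyxl

def sanitize_xlsx(src, dst):
    # XML 1.0 forbids control characters below U+0020 except tab, LF and CR.
    # Replace them with "_" so attribute values stay non-empty and distinct.
    bad = re.compile(rb"[\x00-\x08\x0b\x0c\x0e-\x1f]")
    with zipfile.ZipFile(src) as zin, zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as zout:
        for item in zin.infolist():
            data = zin.read(item.filename)
            if item.filename == "xl/styles.xml":
                data = bad.sub(b"_", data)
            zout.writestr(item, data)

sanitize_xlsx("broken.xlsx", "clean.xlsx")
wb = openpyxl.load_workbook("clean.xlsx")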

Related

Python-docx: Editing a pre-existing Run

import docx
doc = docx.Document('CLT.docx')
test = doc.paragraphs[12].runs[2].text
print(test)
doc.save(input('Name of docx file? Make sure to add file extension '))
I've been trying to figure out a way to add or edit text in a pre-existing run using python-docx. I've tried test.clear() just to see if I can remove the text, but that doesn't seem to work. I also tried test.add_run('test'), and that didn't work either. I know how to add a new run, but that only adds it at the end of the paragraph, which doesn't help me much. Currently, print outputs the text I'd like to alter within the document, "TERMOFINTERNSHIP". Is there something I'm missing?
The text of a run can be edited in its entirety. So to replace "ac" with "abc" you just do something like this:
>>> run.text
"ac"
>>> run.text = "abc"
>>> run.text
"abc"
You cannot simply insert characters at some location; you need to extract the text, edit that str value using Python's str methods, and assign it back in its entirety. In a sense, the "editing" happens outside python-docx; you're just using python-docx to get the "before" version and store the "after" version.
But note that while this works, it's not likely to help much in the general case, because runs break at seemingly random locations within a line. There is no guarantee your search string will occur within a single run, so you will need an algorithm that locates all the runs containing any part of the search string and then allocates your edits across those runs (see the sketch below).
An empty run is valid, so setting run.text == "" may help when there are extra bits in the middle somewhere. Also note that runs can be formatted differently; if part of your search string is bold and part is not, for example, the result may not be what you want.
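For completeness, here is a minimal sketch of such an algorithm (the name replace_across_runs, the replacement text, and the output file name are invented for illustration):
import docx

def replace_across_runs(paragraph, old, new):
    # Concatenate the run texts, locate the match, then rewrite each
    # affected run. The replacement inherits the formatting of the first
    # affected run; later affected runs keep only text past the match
    # (an empty run is valid, as noted above).
    full = "".join(run.text for run in paragraph.runs)
    start = full.find(old)
    if start == -1:
        return
    end = start + len(old)
    pos = 0
    replaced = False
    for run in paragraph.runs:
        run_start, run_end = pos, pos + len(run.text)
        pos = run_end
        if run_end <= start or run_start >= end:
            continue  # this run lies entirely outside the match
        head = run.text[:start - run_start] if run_start < start else ""
        tail = run.text[end - run_start:] if run_end > end else ""
        if not replaced:
            run.text = head + new + tail
            replaced = True
        else:
            run.text = tail

doc = docx.Document('CLT.docx')
replace_across_runs(doc.paragraphs[12], 'TERMOFINTERNSHIP', 'TERM OF INTERNSHIP')
doc.save('CLT-edited.docx')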

How to encode blob names that end with a period?

Azure docs:
Avoid blob names that end with a dot (.), a forward slash (/), or a
sequence or combination of the two.
I cannot avoid such names due to legacy S3 compatibility, so I must encode them.
How should I encode such names?
I don't want to use Base64, since that would make it very hard to debug when looking in Azure's blob console.
Go has https://golang.org/pkg/net/url/#QueryEscape but it has this limitation:
Go's implementation of url.QueryEscape (specifically, the shouldEscape private function) escapes all characters except the following: alphabetic characters, decimal digits, '-', '_', '.', '~'.
I don't think there's any universal solution to this outside your application's scope. Within your application, you can use ANY encoding you like, so it comes down to personal preference for how you want your data laid out. There is no "right" way to do this.
Regardless, I believe you should go for these properties:
Conversion MUST be bidirectional and without conflicts in your expected file-name space
DO keep file names without trailing dots unencoded
For dot-ending files, DO encode just the conflicting dots, keeping the original name readable
This keeps most (non-conflicting) file names short and with their original, intuitive and hopefully meaningful names, and should you ever be able to rename or phase out the conflicting files, you can simply drop the conversion logic without restructuring all the stored data and their URLs.
I'll suggest two schemes. Let's say you have these files:
/someParent/normal.txt
/someParent/extensionless
/someParent/single.
/someParent/double..
Use special subcontainers
You could remove the N dots from the end of the filename and translate them into a subcontainer named "dot", "dotdot", etc.
The resulting URLs would look like:
/someParent/normal.txt
/someParent/extensionless
/someParent/dot/single
/someParent/dotdot/double
When reading, you remove the "dot"*N folder level and append N dots back to the file name.
Obviously this assumes you never need such "dot" folders as data themselves.
This scheme is preferable when the stored files can come in with any extension but you can make some assumptions about the folder structure.
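A minimal sketch of this scheme in Python (the helper names encode_dots/decode_dots are invented here; the idea itself is language-agnostic):
import re
def encode_dots(path):
    # Move the N trailing dots of the file name into a "dot"*N subcontainer.
    parent, _, name = path.rpartition("/")
    stripped = name.rstrip(".")
    n = len(name) - len(stripped)
    if n == 0:
        return path  # non-conflicting names stay untouched
    return parent + "/" + "dot" * n + "/" + stripped
def decode_dots(path):
    parent, _, name = path.rpartition("/")
    grandparent, _, folder = parent.rpartition("/")
    if re.fullmatch(r"(?:dot)+", folder):
        # Drop the artificial folder level and append the dots again.
        return grandparent + "/" + name + "." * (len(folder) // 3)
    return path
assert encode_dots("/someParent/double..") == "/someParent/dotdot/double"
assert decode_dots("/someParent/dotdot/double") == "/someParent/double.."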
Use discardable artificial extension
Since the conflict is at the end, you could just append a never-used dummy extension to the affected files, for example "endswithdots"; you could choose something more suitable depending on what the expected extensions are:
/someParent/normal.txt
/someParent/extensionless
/someParent/single.endswithdots
/someParent/double..endswithdots
When reading, if the file name ends with "endswithdots", you remove that part from the end of the filename.
This scheme is preferable when your data could have any container structure but you can make some assumptions about the incoming extensions.
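Again a minimal sketch in Python (MARKER and the helper names are invented for illustration):
MARKER = "endswithdots"  # assumed never to occur as a real extension
def encode_ext(name):
    # Conflicting names end with a dot, so appending the marker yields
    # ".endswithdots" at the end; all other names are left alone.
    return name + MARKER if name.endswith(".") else name
def decode_ext(name):
    return name[:-len(MARKER)] if name.endswith("." + MARKER) else name
assert encode_ext("/someParent/double..") == "/someParent/double..endswithdots"
assert decode_ext("/someParent/single.endswithdots") == "/someParent/single."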
I would advise against Base64 or other whole-name encodings, as they make file names notably longer and lose any meaningful detail the names may contain.

PexObserve only records 255 characters

I am using Pex from the command line to find input values for test case generation.
I use PexObserve to record certain values during execution.
One of the values that I want to record is an XML-String.
However, when parsing the XML I receive "malformed XML" exceptions, since Pex only writes the first 255 characters into the log.
Is there a way to record the full XML string? Or does PexObserve have a different type that will let me record longer texts?
Leaving this here in case somebody runs into the same issue at some point.
I've found a solution that worked for me.
Unfortunately the 255 character limit is set internally in static readonly fields.
Therefore I needed to use reflection.
My solution works by including the following line in the PUT:
typeof(Microsoft.Pex.Framework.PexObserve.ValueWriterManager).GetField("MaxWrittenElements").SetValue(null, 1000);
Replace the 1000 with any value you like.
BUT: remember that this is a quick-fix solution that might not work for you.
It may have unwanted side effects: you're also changing the number of List elements that are written, and perhaps other things.

Import from Excel via foreach loop

I intend to import and work with a variety of Excel files through a foreach loop. The import itself is not working, though, as Stata won't recognize `x' as a substitute for the Excel filenames.
local excelfiles "bb_01 bit0_2 bun comp_03 comp_c01m LLU-ck"
foreach item of local excelfiles {
import excel using "D:\...\...\...\Data\Files\`x'.XLS", sheet("DynamicReport") cellrange(A2:AI201) firstrow
keep v1 v2 v3 v4
save "D:\...\...\...\...\`x'.dta", replace
The error I get is file D:\...\...\...\...\Data\Files.XLS not found
There are various problems here.
Your code is inconsistent. You declare item in the foreach statement, but refer to x within the loop. So, as far as Stata is concerned, local macro x is never defined. That is not an error in itself, but Stata replaces references to local macros that do not exist (are empty) with an empty string, with the consequence you report.
Your code would still not work if you replaced the reference to x with a reference to item. See (e.g.) http://www.stata.com/manuals14/u.pdf 18.3.11 and http://www.stata-journal.com/sjpdf.html?articlenum=pr0042 for warnings about following backslashes immediately with local macro references. The problem is that the backslash is both an escape character and a separator within full Windows filepaths. The clash is best resolved by using forward slashes in filepaths, which Stata accepts even on Windows.
The loop is never closed in the code segment you show.
I can't check your code otherwise, as it is not reproducible. I presume the triple dots ... are not literal but stand in for detail that should not be crucial.
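Putting those three fixes together, the corrected loop might look like this (the ... segments are the question's placeholders, left as-is):
local excelfiles "bb_01 bit0_2 bun comp_03 comp_c01m LLU-ck"
foreach item of local excelfiles {
    import excel using "D:/.../.../.../Data/Files/`item'.XLS", sheet("DynamicReport") cellrange(A2:AI201) firstrow
    keep v1 v2 v3 v4
    save "D:/.../.../.../.../`item'.dta", replace
}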

Use Alex macros from another file

Is there any way to have an Alex macro defined in one source file and used in other source files? In my case, I have definitions for $LowerCaseLetter and $UpperCaseLetter (these are all letters except e and O, since they have special roles in my code). How can I refer to these macros from other .x files?
Disproving that something exists is always harder than finding something that does, but I think the info below shows that Alex can only take macro definitions from the .x file it is reading (other than predefined stuff like $white), and not via includes from other files.
You can get the sourcecode for Alex by doing the following:
> cabal unpack alex
> cd alex-3.1.3
In src/Main.hs, predefined macros are first set in variables called initSetEnv (the charset macros $white, $printable, and "."), and initREEnv (regexp macros; there are none). These get passed into runP in src/ParseMonad.hs, which holds the current parsing state, including all defined macros. The initial state is set from the values passed in, but macros can be added using a function called newSMac (or newRMac for regular-expression macros).
Since this seems to be the only way macros can be set, it is then just a matter of some grep bookkeeping to verify that the only way macros can be added is through an actual macro definition in the source .x file. Unsurprisingly, Alex recursively uses its own .x/.y files for parsing .x source files (src/parser.y, src/Scan.x). It is a couple of levels of indirection away, but you can verify that the only way newSMac can be called is through the src/Scan.x macro:
#smac = \$ #id | \$ \{ #id \}
<0> #smac #ws? \= { smacdef }
Other than some obvious predefined stuff, I don't believe reuse in lexers is all that typical anyway, because at the token level things are usually pretty simple (often simple tokens like SPACE, WORD, NUMBER, and a few operators, symbols, and parens are all that is needed). The complexity comes at the parsing stage, although for technical reasons parser includes aren't that common either (see scannerless parsing for a newer technology that does allow reuse through nesting, like JavaScript embedded in HTML; the tools for scannerless parsing are still pretty primitive, though).
