Extracting paragraph styles from a DOCX in Python3


I have been trying to solve this problem in Python 3 for a while.
I normally extract information from DOCX documents using the python-docx library:
    from docx import Document

    document = Document("test.docx")
    for paragraph in document.paragraphs:
        for run in paragraph.runs:
            print(run.font.name)  # prints None
As you can see, this is very simple python-docx code for extracting some information. I can access properties such as font name, size, outline level, etc.
However, all of these properties return None because they haven't been explicitly defined.
I have checked StackOverflow for a similar problem and found these.
Extracting word document with styles associated to the content
How to get actual style of text in word document using python docx
The documentation also says that if a property returns None, the value is inherited from the default style.
I also tried some XML parsing, but could not reach the desired parameters:
    WORD_NAMESPACE = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
    PARA = WORD_NAMESPACE + 'p'

    words = document._element.xpath('//w:r')  # every run element in the document
    # iter() replaces the deprecated getiterator(); iterating an element
    # directly replaces the deprecated getchildren().
    for elem in document.element.iter(PARA):
        for i, child in enumerate(elem):
            if child.tag == WORD_NAMESPACE + 'pPr':
                ...
                # No idea how to access all the styles, or with which
                # tags, etc.
How can these default (inherited) styles be extracted too? I want to extract properties such as indentation level, bold, italic, font name, and size from a DOCX. What alternative approaches are there? I want to solve this in Python 3.

I believe what you're trying to discover is the effective style, which is to say, after the style hierarchy for this run has been traversed, which character formatting attribute has which of its possible values for this particular run.
This is a non-trivial problem. The style hierarchy works similarly to Cascading Style Sheets (CSS) in that formatting attributes can be set at various levels and the "closest" setting wins. At the same time, a run can have no assigned character formatting or style, in which case the "farthest away" or "default of last resort" determines its character formatting.
In order to compute the effective formatting you need to traverse the style hierarchy level by level until you find the setting you're looking for (like say font.name). To do that, you need to know what the style hierarchy is for a particular run and then have access to each level of it.
So that's a pretty big ask. Roughly, just to give an idea, the style hierarchy would typically be:
1. Character formatting applied directly to the run, like bold, italic, font name and size.
2. Character formatting applied using a character style.
3. Character formatting applied using a character style attached to a paragraph style.
4. Character formatting associated with the default paragraph style.
5. An assigned document-level default.
6. A client-internal final fallback, the client being something like Word, OpenOffice, etc.
There are exceptions when the text appears in a table and a table-style may fit in that hierarchy somewhere. I expect there are others as well.
There's no API support for this in python-docx at the moment and I haven't seen any successful implementations of it. I don't believe it is impossible to get a pretty useful implementation, it's just hard enough that most folks find some way to avoid it.
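To give a feel for the traversal described above, here is a minimal and deliberately partial sketch. It only walks the levels that are reachable through attributes python-docx objects expose (run.font, run.style, paragraph.style and each style's base_style and font). It does not look at document-level defaults, table styles, or the application's own fallback, so a result of None still means "decided further down the hierarchy". The function names are made up for this illustration.

```python
def style_chain(style):
    """Yield a style, then each ancestor reached through base_style."""
    while style is not None:
        yield style
        style = style.base_style


def effective_font_attr(run, paragraph, attr):
    """Return the nearest non-None value of font.<attr> for this run.

    Lookup order: direct run formatting, the run's character style chain,
    then the paragraph's style chain. Returns None when none of these
    levels sets the attribute.
    """
    value = getattr(run.font, attr)
    if value is not None:
        return value
    for start in (run.style, paragraph.style):
        for style in style_chain(start):
            value = getattr(style.font, attr)
            if value is not None:
                return value
    return None


# Usage with python-docx (assumes a file named "test.docx" exists):
#   from docx import Document
#   for paragraph in Document("test.docx").paragraphs:
#       for run in paragraph.runs:
#           print(run.text, effective_font_attr(run, paragraph, "name"))
```

Note that an explicit False (for example bold turned off at a nearer level) counts as a setting and wins over a True further away, which matches the "closest setting wins" rule.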

Related

How to show whitespace for given scope in a Sublime Text color scheme

Is there any way to set white space visible for a given scope?
I'm working on modifying a color scheme to suit my liking and would like to be able to show spaces within a given scope. I haven't seen anything suggesting it's possible in the color scheme documentation on Sublime's website.
For my specific case, and I imagine there are other useful cases, I'm working with Markdown and want to highlight a double-space line break. I'm able to set the background, but this doesn't look quite right. I'm hoping to make whitespace visible for this small scope and change the foreground color to make it stand out.

The short answer to your question is no; or rather, yes, but only in the way that you've already discovered.
Color schemes can only apply foreground/background colors to scopes as well as bold/italic font weights. So assuming that there is a specific scope detected by the syntax you're using that is used for the things you're trying to highlight, the only thing the color scheme can do is alter the background color to make them visible.
The only thing that can render white space natively is the draw_white_space setting, which at the moment only allows you to turn it off everywhere, turn it on everywhere, or turn it on only for selected text. In this case that doesn't really help.
There are possibilities for something like this in the plugin realm though (these examples can be tested by opening the Sublime console with View > Show Console or Ctrl+` and entering the code in there; they also assume that you're using the default Markdown syntax):
view.add_regions("whitespace", view.find_by_selector("punctuation.definition.hard-line-break.markdown"), "comment", flags=sublime.DRAW_NO_FILL)
This will cause all of the hard line breaks to be outlined as if they were find results; the color is selected by the scope (which is comment here); that would make them visible without making the whole character position have a background color.
view.add_regions("whitespace", view.find_by_selector("punctuation.definition.hard-line-break.markdown"), "comment", "dot", flags=sublime.HIDDEN)
This will add a dot (colored as a comment) in the gutter for lines that end with this scope; you can also combine this with the previous example to outline them and also call attention in the gutter.
style = '<style>.w { color: darkgray; }</style>'
content = '<body id="whitespace">' + style + '<span class="w">··</span></body>'
phantom_set = sublime.PhantomSet(view, "whitespace")
phantoms = [sublime.Phantom(r, content, sublime.LAYOUT_INLINE) for r in view.find_by_selector("punctuation.definition.hard-line-break.markdown")]
phantom_set.update(phantoms)
This uses Sublime's ability to apply inline HTML phantoms into the document in order to inject a small inline sequence of two unicode center dots immediately between the actual whitespace and the text that comes before it. Here the content can be what you like if you can generate the appropriate HTML; we're just applying a color to the text in this example.
A potential downside here is that the characters you see in the inline HTML aren't considered to be part of the document flow; the cursor will skip over them in one chunk, and they're followed by the actual whitespace.
Going the plugin route, you'd need an event handler like on_load() to apply these when a file is loaded and on_modified() to re-update them after modifications are made to the buffer. There may or may not be a package that already exists that has implemented this.

Partial ligature selection with DirectWrite

Using the HitTestTextPosition-style APIs of IDWriteTextLayout, I have not managed to properly handle text positions inside "ti", "ffi" or other ligatures with fonts like Calibri. They always return a position before or after the ligature, never inside it, like t|i or f|f|i.
What is the recommended way to handle caret movement inside ligatures with the DirectWrite API?

There... is no "inside" position if you have GSUB replacements turned on?
OpenType GSUB ligatures are single glyph replacements for codepoint sequences, rather than being "several glyphs, smushed together". They are literally distinct, single glyphs, with single bounding boxes, and a single left and right side bearing for cursor placement/alignment. If you have the text A + E and the font has a ligature replacement that turns it into Æ, then with ligatures enabled there really are only two cursor positions in that code sequence: |Æ and Æ|. You can't place the cursor "in the middle", because there is no "middle"; it's a single, atomic, indivisible element.
The same goes for f-ligatures like ﬀ, ﬁ, ﬂ, ﬃ, ﬄ, or ﬅ: these are single glyphs once shaped with GSUB turned on. This is in fact what's supposed to happen: having GSUB ligatures enabled means you expressly want text to be presented, for all intents and purposes, as having atomic glyphs for many-to-one substitutions, like turning the full phrase "صلى الله عليه وعلى آله وسلم‎", as well as variations of that, into the single glyph ﷺ.
If you want to work with the base codepoint sequences (so that if you have a text with f + f + i it doesn't turn that into ﬃ) you will need to load the font with the liga OpenType feature disabled.

The text editors I know of use the simple hack of (1) dividing the width of the glyph cluster by the number of code points within the cluster (excluding any zero-width combining marks), rather than using the GDEF caret positioning information. This includes even Word, which you can tell if you look closely enough. It's not precise, but since it's simple and close enough at ordinary reading sizes, it's what many do.
(2) I've heard that some (I don't know which) may also use the original glyph advances of the unshaped characters (pre-ligation) and scale them proportionally to the ligature cluster width.
(3) Some text editors may use the GDEF table, but I never knew of any for sure (possibly Adobe InDesign?).
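Methods 1 and 2 boil down to simple arithmetic once you have the numbers. The following is a minimal sketch in Python rather than DirectWrite code; the function names are made up for illustration, and it assumes you have already obtained the ligature cluster's width and, for method 2, the advances of the unshaped characters. Offsets are measured from the leading edge of the cluster, so right-to-left text would need them mirrored.

```python
def caret_offsets_even(cluster_width, code_point_count):
    """Method 1: split the cluster width evenly across its code points.

    code_point_count should already exclude zero-width combining marks.
    Returns code_point_count + 1 offsets, from 0 to cluster_width.
    """
    step = cluster_width / code_point_count
    return [step * i for i in range(code_point_count + 1)]


def caret_offsets_proportional(cluster_width, unshaped_advances):
    """Method 2: scale the pre-ligation advances to the ligature's width."""
    total = sum(unshaped_advances)
    offsets = [0.0]
    running = 0.0
    for advance in unshaped_advances:
        running += advance
        offsets.append(cluster_width * running / total)
    return offsets


# An "ffi" ligature 12 units wide: carets at |ffi, f|fi, ff|i, ffi|
print(caret_offsets_even(12.0, 3))
# The same idea, but weighting the narrow "i" less than the two "f"s
print(caret_offsets_proportional(10.0, [2.0, 2.0, 1.0]))
```

Method 2 places the inner carets closer to where the individual letter shapes sit when the component characters differ a lot in width, at the cost of needing the font face's unshaped advances.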
The most challenging aspect of using methods 2 or 3 with IDWriteTextLayout is that accessing the corresponding IDWriteFontFace in that run requires quite the indirection because the specific IDWriteFontFace used (after resolving font family name+WWS+variable font axes) is stored in the layout but not publicly accessible via any "getter" API. The only way you can extract them is by "drawing" the glyph runs via IDWriteTextLayout::Draw into a user-defined IDWriteTextRenderer interface to record all the DWRITE_GLYPH_RUN::fontFace's. Then you could call IDWriteFontFace::GetDesignGlyphAdvances on the code points or IDWriteFontFace::TryGetFontTable to read the OpenType GDEF table (which is complex to read). It's a lot of work, and that's because...
The official PadWrite example has the same issue
IDWriteTextLayout was designed for displaying text rather than editing it. It has some functionality for hit-testing which is useful if you want to display an underlined link in a paragraph and test for it being clicked (in which case the ligature would be whole anyway within a word), or if you want to draw some decorations around some text, but it wasn't really intended for the full editing experience, which includes caret navigation. It was always intended that actual text editing engines (e.g. those used in Word, PowerPoint, OpenOffice, ...) would call the lower-level APIs, which they do.
The PadWrite sample I wrote is a little misleading because although it supports basic editing, that was just so you can play around with the formatting and see how things worked. It had a long way to go before it could really be an interactive editor. For one (the big one), it completely recreated the IDWriteTextLayout each edit, which is why the sample only presented a few paragraphs of text, because a full editor with several pages of text would want to incrementally update the text. I don't work on that team anymore, but I've thought of creating a DWrite helper library on GitHub to fill in some hindsight gaps, and if I ever did, I'd probably just ... use method 1 :b.

React-Native Change Text With setNativeProps

Has anyone figured out a way to dynamically mutate text on the screen without triggering a render?
A large part of my screen utilizes setNativeProps for moving parts, meaning that the animations become lagged despite using shouldComponentUpdate. I would like to use the Text tag instead of the TextInput tag workaround suggested in this post for stylistic reasons.
Best case scenario is a workaround that involves setNativeProps, as it would follow the pattern of the rest of the screen; however, my current plan is to render all the digits 0-9 on the screen and move them into place, so any help would be greatly appreciated!

As it turns out, you can actually format TextInputs the same exact way as Text elements (from what I have tested). For placing text horizontally, you have to set the width (something I had trouble with before). For those still interested in the original question however, you can nest TextInputs inside of a Text Element (one per text element because there is no justification and it automatically places them in a row). Styling applied to the Text Element will apply to the TextInput.

Specifying alternate glyphs for potentially unrepresented unicode code point

Unicode has a lot of fancy code points that can be used to do things like change the direction in which text is laid out, include in-band protocol information, and other things.
I want to find out if there is a way of specifying, as part of a unicode string, a 'fallback' glyph or set of glyphs where an uncommon unicode code point is unrepresented in the font being used to display the text?
Example:
"æ<character to signify use following as alternate if previous character unavailable>ae<character to specify end of alternate>"
I've had a look through some of these code points at unicode.org and tried some different searches on SO without success. Some of the topics I've looked at:
Is there a way to programatically determine if a font file has a specific Unicode Glyph? (Windows/C#)
How to map code points to unicode characters depending on the font used? (Java)

Laying out graphics in RTF

I'm interested in how to construct certain kinds of layout in RTF documents, ideally using techniques that do not depend only on the most recent RTF standards, and that are "native", i.e., they do not involve embedding other representations, like picture files. In particular:
1. In PostScript and DVI, I can specify at any time the coordinate at which the next text will be printed: can this be done with RTF?
2. Can RTF compose characters through overstriking?
3. Can lines, outline boxes and filled boxes be drawn, with their geometry specified either absolutely or relative to text?

1. You can use the \pvpg \phpg \posx123 \posy123 construct after you start a paragraph with \pard to position it relative to the top left of the page. See: http://biblioscape.com/rtf15_spec.htm#Heading39
3. Yes, but it's rather involved, and I think it was only introduced in RTF 1.5. See the drawing objects section of the spec. Here is a basic example of drawing a box (I'm not sure it's entirely valid, but it should give you an idea of how to work with drawing objects):
{\rtf1\ansi\deff0
{\pard {\*\do
\dobxcolumn \dobypara
\dprect \dpx0 \dpy0 \dpxsize1000 \dpysize1000 \dplinew25
}\par}
}
If you're doing any work with RTF it's worth picking up O'Reilly's RTF Pocket Guide.
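As a rough illustration of the paragraph-positioning control words mentioned in this answer, here is a small Python sketch that builds an RTF string with one paragraph placed relative to the page. The helper names are made up for this example, the positions are in twips (1/1440 inch) as the RTF spec defines them, and it only handles plain ASCII text. Whether a given RTF reader honors the positioning is not guaranteed, so treat it as something to experiment with.

```python
def escape_rtf(text):
    """Escape the three characters that are special in RTF text."""
    return text.replace("\\", "\\\\").replace("{", "\\{").replace("}", "\\}")


def positioned_paragraph(text, x_twips, y_twips):
    """Return a paragraph group positioned relative to the page's top left."""
    return "{\\pard\\pvpg\\phpg\\posx%d\\posy%d %s\\par}" % (
        x_twips, y_twips, escape_rtf(text))


def rtf_document(paragraphs):
    """Wrap paragraph groups in a minimal RTF header and closing brace."""
    return "{\\rtf1\\ansi\\deff0\n" + "\n".join(paragraphs) + "\n}"


# One inch from the left edge, two inches from the top (1440 twips per inch)
doc = rtf_document([
    positioned_paragraph("Placed by coordinates", 1440, 2880),
])
print(doc)
```

Saving the printed output to a .rtf file and opening it in a word processor is the quickest way to see how a particular reader treats these control words.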

1. I don't believe this is possible. You'd need to use tabs and newlines to get the text where you want it.
2. Not really, unless \strike and \striked1 count.
3. http://www.biblioscape.com/rtf15_spec.htm#Heading52 says drawing objects are an option, and so is inserting images, but neither is really "native", both being absent from the first RTF specs. (And the latter is a bad choice for, e.g., just a line.)
