How to redefine --top-level-division in pandoc for each constituent file separately?

I have a bunch of .md constituent content files marked up with headers #, ##, etc.
I want to flexibly compile new documents, with the constituent files unchanged but residing at different levels of the ToC hierarchy of the final document.
For example:
In compiled-1.pdf, top-level header # Foo from constituent-1.md might end up as "Chapter Foo" --- no change to its level in hierarchy.
However, in compiled-2.pdf, the very same # Foo from the very same constituent-1.md might end up as "Section Foo" --- a demotion to level 2 in the ToC hierarchy of compiled-2.pdf.
In each constituent .md file the top-level header is always # and each constituent .md file is always treated as a whole, indivisible unit. Therefore, all of a constituent file's headers are to be demoted by the same factor. Also, a constituent file's headers are never promoted.
I feel the problem has to do with re-setting --top-level-division for each file. How do I do it properly (using .yaml configs and make)?
But maybe a better way is to create, for each final document, a master file that establishes a hierarchy of constituent files with a combination of include('constituent-1.md') etc. and define('level', '1') etc. Such a master file would then be pre-processed with m4 to search and replace # with ## or ### etc., according to each file's level, and then piped to pandoc.
What's the best approach?

I think these are the right ideas, but not the right tools. Instead of using m4, you might want to check out pandoc filters, especially the built-in Lua filters or the excellent panflute Python package. These allow you to manipulate the actual document structure instead of just its textual representation.
E.g., this Lua filter demotes all headers in a document:
function Header (header)
  header.level = header.level + 1
  return header
end
Similarly, you could define your own include statement based on code blocks:
```{include="FILENAME.md"}
```
and include the referenced file with this filter:
function CodeBlock (cb)
  if not cb.attributes.include then
    return
  end
  local fh = io.open(cb.attributes.include)
  local blocks = pandoc.read(fh:read('*a')).blocks
  fh:close()
  return blocks
end
It is also possible to apply a filter to only a subset of blocks (requires a little hack):
local blocks = …
local div = pandoc.Div(blocks)
local filtered_blocks = pandoc.walk_block(div, YOUR_FILTER).content
You can combine and extend these building blocks to write your own filter and define your extensions. This way, one can have a main document which includes all your sub-files and shifts header levels as necessary.
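For illustration, here is a rough panflute sketch that combines both building blocks; the include and shift attribute names are my own choice, not a pandoc or panflute convention:
```
# include-filter.py -- a sketch, not a polished implementation.
# Master file usage (hypothetical attribute names):
#   ```{include="constituent-1.md" shift="1"}
#   ```
import panflute as pf

def shift_headers(elem, doc, levels=0):
    # Demote headers by a fixed amount; pandoc supports levels 1-6.
    if isinstance(elem, pf.Header):
        elem.level = min(elem.level + levels, 6)
        return elem

def include(elem, doc):
    if isinstance(elem, pf.CodeBlock) and 'include' in elem.attributes:
        with open(elem.attributes['include']) as fh:
            blocks = pf.convert_text(fh.read())  # markdown -> panflute blocks
        levels = int(elem.attributes.get('shift', '0'))
        wrapper = pf.Div(*blocks)  # wrap the blocks so the subtree can be walked
        wrapper.walk(lambda e, d: shift_headers(e, d, levels))
        return list(wrapper.content)

if __name__ == '__main__':
    pf.run_filter(include)
```
You would then run something like pandoc master.md --filter ./include-filter.py -o compiled-2.pdf, with a different shift value per constituent file in each master document.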

Related

How to build text from mixed xml content using Python?

I have a situation in which an XML document has information at varying depths (according to S1000D schemas), and I'm looking for a generic method to extract correct sentences.
I need to interpret a simple element containing text as one individual part/sentence, and when an element containing text contains other elements that in turn contain text, I need to flatten/concatenate it into one string/sentence. The nested elements shall not be visited again once this is done.
Using Python's lxml library and applying the tostring function works OK if the source XML is pretty-printed: I can split the concatenated string on newlines to get each sentence. If the source isn't pretty-printed and is all on one single line, there are no newlines to split on.
I have tried the iter function and applying XPaths to each node, but this often yields different results in Python than what I get when applying the same XPath in XMLSpy.
I have started down some of the following paths, and my question is if you have some input on which ones to continue on, or if you have other solutions.
I think I could use XSLT to preprocess the XML file, and then use a simpler Python script to divide the content into a list of sentences for further processing. Using Saxon with Python is now doable, but here I run into problems if the XML source contains entities that I cannot redirect Saxon to resolve (such as &nbsp;). I have no problem parsing files with lxml, so I tend to lean towards a cleaner Python solution.
lxml doesn't seem to have XPath support that can give me all nodes with text that contain one or more children containing text, plus all nodes that are simple elements with no parents containing text nodes. Is there a way to preprocess the parsed tree so that I can ensure it is pretty-printed in memory, so that tostring works the same way for every XML file? Otherwise, my logic gives me one string for a document with no whitespace, and multiple sentences/strings if the source had been pretty-printed. This doesn't feel OK.
What are my options? Use XSLT 1.0 in Python, or other parsers to get a better handle on where I am in the tree, ...?
Just to reiterate the issue here: I am looking for a generic way to extract text, and the only rules for the XML source are that a sentence may be built from an element with child elements with text, but there won't be additional levels. The other possibility is the simple element, but this one cannot be included in a parent element with text, since that case is covered by the first rule.
Help/thoughts are appreciated.
This is downright ugly code, a hasty hack with no real thought given to form, beauty or finesse. All I am after is one way of doing this in Python. I'll tidy things up when I find a good solution that I want to keep. This is one possible solution, so I figured I'd post it to see if someone can be kind enough to show me how to do this instead.
The problem has been finding XPath expressions that could get me all elements with text content, and then acting on them depending on their context. All my XPath expressions have given me the correct nodes, but also a root or ancestor that pulled in a more or less complete string at the beginning, so I gave up on those. My XPath works as it should in XSLT, but not in Python - I don't know why...
I had to resort to regex to find nodes that contain strings that are not whitespace only.
Using lxml with XPath and tostring gives different results depending on how the source XML is formatted, so I had to get around that.
The following formats have been tested:
<?xml version="1.0" encoding="UTF-8"?>
<root>
<subroot>
<a>Intro, element a: <b>Nested b to be included in a, <c>and yet another nested c-element</c> and back to b.</b></a>
<!-- Comment -->
<a>Simple element.</a>
<a>Text with<b> 1st nested b</b>, back in a, <b>and yet another b-element</b>, before ending in a.</a>
</subroot>
</root>
<?xml version="1.0" encoding="UTF-8"?>
<root>
<subroot>
<a>Intro, element a: <b>Nested b to be included in a, <c>and yet another nested c-element,
</c> and back to b.</b>
</a>
<!-- Comment -->
<a>Simple element.</a>
<a>Text with<b> 1st nested b</b>, back in a, <b>and yet another b-element</b>, before ending in a.</a>
</subroot>
</root>
<?xml version="1.0" encoding="UTF-8"?><root><subroot><a>Intro, element a: <b>Nested b to be included in a, <c>and yet another nested c-element</c> and back to b.</b></a><!-- Comment --><a>Simple element.</a><a>Text with<b> 1st nested b</b>, back in a, <b>and yet another b-element</b>, before ending in a.</a></subroot></root>
Python code:
import re
from lxml import etree as ET

dmParser = ET.XMLParser(resolve_entities=False, recover=True)
xml_doc = r'C:/Temp/xml-testdoc.xml'
parsed = ET.parse(xml_doc, dmParser)
ns = {"re": "http://exslt.org/regular-expressions"}

for elem in parsed.xpath(r"//*[re:match(text(), '\S')]", namespaces=ns):
    tmp = elem.xpath(r"parent::*[re:match(text(), '\S')]", namespaces=ns)
    if tmp and tmp[0].text and tmp[0].text.strip():  # The parent also holds text,
        continue                                     # so discard this node
    elif elem.xpath(r"./*[re:match(text(), '\S')]", namespaces=ns):  # A child node also contains text
        # Flatten the subtree to plain text and collapse unwanted whitespace
        line = re.sub(r'\s+', ' ', ET.tostring(elem, encoding='unicode', method='text').strip())
        if line:
            print(line)
    else:  # Simple element
        print(elem.text.strip())
Always yields:
Intro, element a: Nested b to be included in a, and yet another nested c-element, and back to b.
Simple element.
Text with 1st nested b, back in a, and yet another b-element, before ending in a.

How to encode blob names that end with a period?

Azure docs:
Avoid blob names that end with a dot (.), a forward slash (/), or a
sequence or combination of the two.
I cannot avoid such names due to legacy s3 compatibility and so I must encode them.
How should I encode such names?
I don't want to use base64 since that would make it very hard to debug when looking in Azure's blob console.
Go has https://golang.org/pkg/net/url/#QueryEscape but it has this limitation:
Go's implementation of url.QueryEscape (specifically, the shouldEscape private function) escapes all characters except the following: alphabetic characters, decimal digits, '-', '_', '.', '~'. Since '.' is never escaped, a name's trailing dots come through QueryEscape unchanged.
I don't think there's any universal solution to this outside your application scope. Within your application scope, you can do ANY encoding, so it falls to personal preference how you like your data to be laid out. There is no "right" way to do this.
Regardless, I believe you should go for these properties:
Conversion MUST be bidirectional and without conflicts in your expected file name space
DO keep file names without ending dots unencoded
With dot-ending files, DO encode just the conflicting dots, keeping the original name readable.
This would keep most (the non-conflicting) files short and with the original intuitive or hopefully meaningful names and should you ever be able to rename or phase out the conflicting files just remove the conversion logic without restructuring all stored data and their urls.
I'll suggest two approaches for this. Let's say you have these files:
/someParent/normal.txt
/someParent/extensionless
/someParent/single.
/someParent/double..
Use special subcontainers
You could remove the N dots from the end of the filename and translate them into a subcontainer named "dot", "dotdot", etc.
The resulting URLs would look like:
/someParent/normal.txt
/someParent/extensionless
/someParent/dot/single
/someParent/dotdot/double
When reading, you can remove the "dot"*N folder level and append N dots back to the file name.
Obviously this assumes you don't ever need to have such "dot" folders as data themselves.
This is preferred if stored files can come in with any extension but you can make some assumptions on folder structure.
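A minimal round-trip sketch of this scheme in Python (assuming, as above, that "dot"-style folders never occur in real data):
```
import re

def encode(name):
    # Move N trailing dots into a 'dot' * N pseudo-folder.
    base = name.rstrip('.')
    n = len(name) - len(base)
    if n == 0:
        return name  # non-conflicting names stay untouched
    parent, _, leaf = base.rpartition('/')
    prefix = parent + '/' if parent else ''
    return prefix + 'dot' * n + '/' + leaf

def decode(name):
    parts = name.split('/')
    if len(parts) >= 2 and re.fullmatch(r'(?:dot)+', parts[-2]):
        n = len(parts[-2]) // 3  # each 'dot' stands for one trailing dot
        return '/'.join(parts[:-2] + [parts[-1] + '.' * n])
    return name

assert decode(encode('/someParent/double..')) == '/someParent/double..'
```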
Use discardable artificial extension
Since the conflict is at the end, you could just append a never-used dummy extension to the affected files. For example "endswithdots", but you could choose something more suitable depending on what the expected extensions are:
/someParent/normal.txt
/someParent/extensionless
/someParent/single.endswithdots
/someParent/double..endswithdots
On reading, if the file extension is "endswithdots", you remove that part from the end of the filename.
This is preferred if your data could have any container structure but you can make some assumptions on incoming extensions.
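The same round trip for this scheme might look like this (again assuming the dummy extension never occurs naturally):
```
SUFFIX = 'endswithdots'  # hypothetical never-used dummy extension

def encode(name):
    # Only names ending in a dot need the dummy extension.
    return name + SUFFIX if name.endswith('.') else name

def decode(name):
    # Encoded names always carry a dot immediately before the suffix.
    return name[:-len(SUFFIX)] if name.endswith('.' + SUFFIX) else name
```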
I would suggest against Base64 or other full-name encoding as it would make file names notably longer and lose any meaningful details the file names may contain.

Switch Case Conditional output in LaTeX

I want to write a report which has a structure like this:
\begin{document}
\input[option=a]{class}
\input[option=b]{class}
\input[option=c]{class}
\input[option=d]{class}
\end{document}
class.tex has content like this:
here are some shared content
switch(option)
case a
some text a
case b
some text b
case c
some text c
case d
some text d
endswitch
Here maybe more shared content.
Is there any way to do this in LaTeX?
A simplified way of doing this could be with \if...\else...\fi logic statements.
At the top of the .tex file, set up a switch with
\newif\ifswitch
The default value will be false. To set the value to be true use
\switchtrue
Then in the text of the document use
\ifswitch
<<text to include if switch is true>>
\else
<<text to include if switch is false>>
\fi % ends the if statement
So for your particular question you could have a set of switches
\newif\ifConditionA
\newif\ifConditionB
\newif\ifConditionC
\newif\ifConditionD
This is not as elegant as using a switch statement, but allows conditions where you want text from A and C at the same time for example.
A reference where this is discussed: two versions of a document with 'if else' logic statements.
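Put together, a minimal version of the asker's setup might look like this (the switch names are illustrative):
```
\documentclass{article}
\newif\ifConditionA
\newif\ifConditionC
\ConditionAtrue   % enable just the parts wanted for this build
\begin{document}
here are some shared content
\ifConditionA some text a\fi
\ifConditionC some text c\fi
Here maybe more shared content.
\end{document}
```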
You can use the following (crude) method of identifying the textual components between which you want to extract stuff from a file:
\documentclass{article}
\usepackage{filecontents}
\begin{filecontents*}{class.tex}
switch(option)
case a
some text a
case b
some text b
case c
some text c
case d
some text d
endswitch
\end{filecontents*}
\usepackage{catchfile}
% \inputclass{<from>}{<to>} -- extracts the part of class.tex between the two markers
\newcommand{\inputclass}[2]{%
\CatchFileDef{\class}{class.tex}{}%
\long\def\classsegment##1#1 ##2 #2##3\relax{##2}%
\show\classsegment % debugging only: shows the macro in the log/terminal; remove once it works
\expandafter\classsegment\class\relax
}
\begin{document}
\inputclass{case c}{case d}
\inputclass{case a}{case b}
\inputclass{case d}{endswitch}
\inputclass{case b}{case c}
\end{document}
Related:
How to extract information between two unique words in a large text file
How to extract data between two different xml tags
\input only part of a file
The last one is a more adaptable approach using the catchfilebetweentags package. This requires the insertion of appropriate tags within your code, which might not be as helpful. You could also use listings to include specific lines of code from an external file.
As I understand it, what you want is to update the function for each different part of the text, while defining the function in just one place.
The easy way to do this is to renew a variable command at the start of each section.
At start:
\newcommand{\VARIABLENAME}{VARIABLE_1}
At section:
\renewcommand{\VARIABLENAME}{VARIABLE_2}
There are more advanced ways of doing this as well, involving defining variables, but for all it is worth, this is more readable and simpler to implement.
Note: If you are planning to make something more dynamic than just a class, I recommend implementing something in another language such as Python to write the file in LaTeX, as that usually gives a lot more room for modification.

Possible to balance unidic vs. unidic-neologd?

With the sentence "場所は多少わかりづらいんですけど、感じのいいところでした。" (i.e. "It is a bit hard to find, but it is a nice place.") using mecab with -d mecab-unidic-neologd the first line of output is:
場所 バショ バショ 場所 名詞-固有名詞-人名-姓
I.e. it says "場所" is a person's surname. Using normal mecab-unidic, it more accurately says that "場所" is just a common noun.
場所 バショ バショ 場所 名詞-普通名詞-一般
My first question is has unidic-neologd replaced all the entries in unidic, or has it simply appended its 3 million proper nouns?
Then, secondly, assuming it is a merger, is it possible to re-weight the entries to prefer plain unidic entries a bit more strongly? I.e. I'd love to get 中居正広のミになる図書館 and SMAP each recognized as single proper nouns, but I also need it to see that 場所 is always going to mean "place" (except in the cases where it is followed by a name suffix such as さん or 様, of course).
References: unidic-neologd
Neologd merges with unidic (or ipadic), which is the reason it keeps "unidic" in the name. If an entry has multiple parts of speech, like 場所, which entry to use is chosen by minimizing cost across the sentence using part-of-speech transitions and, for words in the dictionary, the per-token cost.
If you look in the CSV file that contains neologd dictionary entries you'll see two entries for 場所:
場所,4786,4786,4329,名詞,固有名詞,一般,*,*,*,バショ,場所,場所,バショ,場所,バショ,固,*,*,*,*
場所,4790,4790,4329,名詞,固有名詞,人名,姓,*,*,バショ,場所,場所,バショ,場所,バショ,固,*,*,*,*
And in lex.csv, the default unidic dictionary:
場所,5145,5145,4193,名詞,普通名詞,一般,*,*,*,バショ,場所,場所,バショ,場所,バショ,混,*,*,*,*
The fourth column is the cost. A lower cost item is more likely to be selected, so in this case you can raise the cost for 場所 as a proper noun, though honestly I would just delete it. You can read more about fiddling with cost here (Japanese).
If you want to weight all default unidic entries more strongly, you can modify the neologd CSV file to increase all of its costs. This is one way to create a file like that:
awk -F, 'BEGIN{OFS=FS}{$4 = $4 * 100; print $0}' neolog.csv > neolog.fix.csv
You will have to remove the original csv file before building (see Note 2 below).
In this particular case, I think you should report this as a bug to the Neologd project.
Note 1: As mentioned above, since which entry is selected depends on the sentence as a whole, it's possible to get the non-proper-noun tag even with the default configuration. Example sentence:
お店の場所知っている?
Note 2: The way the neologd dictionary combines with the default unidic dictionary is based on a subtle aspect of the way Mecab dictionary builds work. Specifically, all CSV files in a dictionary build directory are used when creating the system dictionary. Order isn't specified so it's unclear what happens in the case of collisions.
This feature is mentioned in the Mecab documentation here (Japanese).

How to use MFC Resource id stored as string

I am developing an MFC application in which I have menus defined in the .rc file. I have a requirement to remove a few menu items at run time, which are defined in an XML file.
The menu IDs are stored as strings in the XML, like below:
<exclusionmenu>ID_FILE_NEW</exclusionmenu>
<exclusionmenu>ID_FILE_OPEN</exclusionmenu>
From the XML, the menu IDs are retrieved as strings, but the RemoveMenu function expects a UINT menu ID. How do I convert a menu ID string defined in the XML to a UINT menu ID?
Note: This is not a direct CString-to-UINT conversion; ID_FILE_NEW is a macro and it has an int value.
The symbolic names for resource identifiers are defined in a header file, Resource.h by default. In source code and resource scripts, the symbolic names are replaced with their respective numeric values by the preprocessor. When compilation begins, the symbolic information is already gone.
To implement a scheme that uses symbolic names for configuration, you have to extract and preserve the mapping between symbolic names and resource identifiers for later use at runtime, or apply the mapping to your configuration files prior to deployment. The following is a list of potential options:
Use an associative container and populate it at application startup: An appropriate container would be std::map<std::string, unsigned int>. Populating this container is conveniently performed using C++11's list initialization feature:
static std::map<std::string, unsigned int> IdMap = {
    {"ID_FILE_NEW", ID_FILE_NEW},
    {"ID_FILE_OPEN", ID_FILE_OPEN},
    // ...
};
At runtime you can use this container to retrieve the resource identifier given its symbolic constant:
unsigned int GetId(const std::string& name) {
    if (IdMap.find(name) == IdMap.end())
        throw std::runtime_error("Unknown resource identifier.");
    return IdMap[name];
}
The downside to this approach is that you have to keep IdMap and the resources in sync. Whenever a resource is added, modified, or removed, the container contents must be updated to account for the changes made.
Parse Resource.h and store the mapping: The header file containing the symbolic identifier names has a fairly simple structure. Code lines that define a symbolic constant usually have the following layout:
\s* '#' \s* 'define' \s+ <name> \s+ <value> <comment>?
A parser to extract the mappings is not as difficult to implement as it may appear, and should be run at an appropriate time in the build process. Once the mapping has been extracted, it can be stored in a file of arbitrary format, for example an INI file. This file can either be deployed alongside the application, or compiled into the binary image as a resource. At application startup the contents are read back, and used to construct a mapping as described in the previous paragraph. In contrast to the previous solution, parsing the Resource.h file does not require manually updating the code when resources change.
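As a rough sketch, a build-time Python script along these lines could extract the mapping and dump it to an INI-style file (the regex is deliberately simplified; real Resource.h files may need more care):
```
import re

DEFINE = re.compile(r'^\s*#\s*define\s+(\w+)\s+(0[xX][0-9a-fA-F]+|\d+)')

def parse_resource_header(path):
    # Map symbolic names (e.g. ID_FILE_NEW) to their numeric values.
    mapping = {}
    with open(path) as fh:
        for line in fh:
            m = DEFINE.match(line)
            if m:
                mapping[m.group(1)] = int(m.group(2), 0)  # base 0 handles hex and decimal
    return mapping

with open('resource_ids.ini', 'w') as out:
    out.write('[ResourceIds]\n')
    for name, value in parse_resource_header('Resource.h').items():
        out.write(f'{name}={value}\n')
```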
Parse Resource.h and transform the configuration XML file: Like the previous solution, this option also requires parsing the Resource.h file. Using this information, the configuration XML file can then be transformed, substituting the numeric values for the symbolic names prior to deployment. This, too, requires additional work. Once this is done, though, the process can be automated, and the results verified to maintain consistency. At runtime you can simply read the XML and have the numeric identifiers readily available.
The only way your scenario would work is if you distribute Resource.h with your application and have logic to parse Resource.h at startup into a table containing the ID_* names and their values.
You can't; the string form is 'lost' at compile time, as it's a preprocessor token. You can store the string variations of the menu items: somewhere in your code, have a std::map and fill it with values: menu_ids["ID_FILE_NEW"] = ID_FILE_NEW; then call RemoveMenu(menu_ids[string_from_xml]);
