How to get a pandoc Lua filter to avoid counting words in code blocks with this pattern inside an R Markdown file?

This is a follow-up question to this post. What I want to achieve is to avoid counting words in headers and inside code blocks having this pattern:
```{r label-name}
all code words not to be counted.
```
Rather than this pattern:
```
{r label-name}
all code words not to be counted.
```
Because when I use the latter pattern I lose font-lock fontification in the R Markdown buffer in Emacs, so I always use the first one.
Consider this MWE:
MWE (MWE-wordcount.Rmd)
# Results {-}
## Topic 1 {-}
This is just a random text with a citation in markdown \@ref(fig:pca-scree)).
Below is a code block.
```{r pca-scree, echo = FALSE, fig.align = "left", out.width = "80%", fig.cap = "Scree plot with parallel analysis using simulated data of 100 iterations (red line) suggests retaining only the first 2 components. Observed dimensions with their eigenvalues are shown in green."}
knitr::include_graphics("./plots/PCA_scree_parallel_analysis.png")
```
## Topic 2 {-}
<!-- todo: a comment that needs to be avoided by word count hopefully-->
The result should be 17 words only, not counting words in code blocks, comments, or Markdown markup (like the headers).
I followed the method explained here to get pandoc to count the words using a Lua filter. In short, I did these steps:
from command line:
mkdir -p ~/.local/share/pandoc/filters
Then created a file there named wordcount.lua with this content:
-- counts words in a document
words = 0

wordcount = {
  Str = function(el)
    -- we don't count a word if it's entirely punctuation:
    if el.text:match("%P") then
      words = words + 1
    end
  end,

  Code = function(el)
    _, n = el.text:gsub("%S+", "")
    words = words + n
  end,
}

function Pandoc(el)
  -- skip metadata, just count body:
  pandoc.walk_block(pandoc.Div(el.blocks), wordcount)
  print(words .. " words in body")
  os.exit(0)
end
I put the following elisp code in scratch buffer and evaluated it:
(defun pandoc-count-words ()
  (interactive)
  (shell-command-on-region (point-min) (point-max)
                           "pandoc --lua-filter wordcount.lua"))
From inside the MWE Markdown file (MWE-wordcount.Rmd) I issued M-x pandoc-count-words and I get the count in the minibuffer.
Using the first pattern I get 62 words.
Using the second pattern I get 22 words, more reasonable.
This method successfully avoids counting words inside a comment.
Questions
How to get the Lua filter code to avoid counting words when using the first pattern rather than the second?
How to get the Lua filter to avoid counting words in the headers (##)?
I would also appreciate it if the answer explained how the Lua code works.

This is a fun question; it combines quite a few technologies. The most important here is R Markdown, and we need to look under the hood to understand what's going on.
One of the first steps in R Markdown processing is to parse the document, find all R code blocks (marked by the {r ...} pattern), execute those blocks, and replace the blocks with the evaluation results. The modified input text is then passed to pandoc, which parses it into an abstract document tree (AST). That AST can be examined or modified with a filter before pandoc writes the document in the target format.
This is relevant because it is R Markdown, not pandoc, that recognizes input of the form
``` {r ...}
# code
```
as a code block, while pandoc parses it as inline code that is identical to ` {r ...} # code `, i.e., all newlines in the code are ignored. The reason for this lies in pandoc's attribute parsing and the overloading of the ` character in Markdown syntax.¹
This gives us the answer to your first question: we can't! The two code snippets look exactly the same by the time they reach the filter in pandoc's AST; they cannot be distinguished. However, we get proper code blocks with newlines if we run R Markdown's knitr step to execute the code.
So one solution could be to make the wordcount.lua filter part of the R Markdown processing step, but to run the filter only when the COUNT_WORDS environment variable is set. We can do that by adding this snippet to the top of the filter file:
if not os.getenv 'COUNT_WORDS' then
  return {}
end
See the R Markdown cookbook on how to integrate the filter.
I'm leaving out the second question, because this answer is already quite long and that subquestion is worth a separate post.
¹: pandoc would recognize this as a code block if the r was preceded by a dot, as in
``` {.r}
# code
```

Related

How do I print exact sentence by filtering using regular expression in Python

I'm new to regular expressions and got stuck with the code below.
import re
s = "5. Consider the task in Figure 8.11, which are balanced in fig 99.2"
output = re.findall((r'[A-Z][a-z]*'), s)[0]
output2 = re.findall(r'\b[^A-Z\s\d]+\b', s)
mixing = " ".join(str(x) for x in output2)
finalmix = output+" " + mixing
print(finalmix)
Here I'm trying to print "Consider the task in Figure 8.11, which are balanced in fig 99.2" from the given string s as a sentence in the output. So I joined the two outputs with a join statement at the end to get it as a sentence. But it's a lot confusing now, since "Figure 8.11" and "fig 99.2" will not be printed, as I have not written a regex for them; I cannot determine what regex I should be using and how to combine it at the end.
It's probably because I'm using the wrong approach to print the given sentence from the string s. I'd be glad if anyone could help me fix the code or guide me toward some alternate approach, as this code looks absurd.
This is the output I get:
Consider the task in . which are balanced in .
To capture all bulleted items, I would use:
import re
s = "5. Consider the task in Figure 8.11, which are balanced in fig 99.2"
items = re.findall(r'\d+\.(?!\d)(.*?)(?=\d+\.(?!\d)|$)', s, flags=re.DOTALL)
print(items)
This prints:
['Consider the task in Figure 8.11, which are balanced in fig 99.2']
Here is an explanation of the regex pattern:
\d+\. match a bulleted number
(?!\d) which is NOT followed by another number
(.*?) match and capture all content, across newlines, until hitting
(?=\d+\.(?!\d)|$) another number bullet OR the end of the input
@TimBiegeleisen's answer works, but is somewhat verbose, because using re.findall requires repeating the bullet-point pattern at the start and again as a lookahead at the end.
For the purpose of finding strings between repeating patterns (bullet points in this case) it may be simpler to use re.split instead. Slice the resulting list to discard the first item since we don't need what comes before the first bullet point:
re.split(r'\d+\.(?!\d)\s*', s)[1:]
This returns:
['Consider the task in Figure 8.11, which are balanced in fig 99.2']

Is there a way to only list a certain format of text from a list?

I am quite new to Python.
And I want to get only a certain format from a bigger list. Example:
Whats in the list:
/ABC/EF213
/ABC/EF
/ABC/12AC4
/ABC/212
However, the only ones I want listed are the ones with this format, /###/#####, while the rest get discarded.
You could use a generator expression or a for loop to check each element of the list to see if it matches a pattern. One way of doing this would be to check if the item matches a regex pattern.
As an example:
import re
original_list = ["Item I don't want", "/ABC/EF213", "/ABC/EF", "/ABC/12AC4", "/ABC/212", "123/456", "another useless item", "/ABC/EF"]
filtered_list = [item for item in original_list if re.fullmatch(r"/\w+/\w+", item) is not None]
print(filtered_list)
outputs
['/ABC/EF213', '/ABC/EF', '/ABC/12AC4', '/ABC/212', '/ABC/EF']
If you need help making regex patterns, there are many great websites, such as regexr, which can help you.
Every string can be indexed like a list without any conversion. If the only format you want to check is /###/##### then you can simply write if statements like these:
for text in your_list:
    if len(text) == 10 and text[0] == "/" and text[4] == "/":  # ...and so on
        print(text)
Of course this would require a lot of if statements and would take a pretty long time. So I would recommend doing a faster and simpler scan. We could perform this by, for example, splitting the texts, which would look something like this:
for text in your_list:
    checkstring = text.split("/")
Now you have your text split into parts, and you can simply check what lengths these new parts have with the len() function.
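For example, a minimal sketch of how that check could look, assuming /###/##### literally means three characters, then five (adjust the length tests to whatever the format really is):
your_list = ["/ABC/EF213", "/ABC/EF", "/ABC/12AC4", "/ABC/212"]

for text in your_list:
    checkstring = text.split("/")   # "/ABC/EF213" -> ['', 'ABC', 'EF213']
    if len(checkstring) == 3 and checkstring[0] == "" \
            and len(checkstring[1]) == 3 and len(checkstring[2]) == 5:
        print(text)

# prints /ABC/EF213 and /ABC/12AC4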

How to quote some special words (registry numbers) to be not tokenized with Spacy?

I have some numbers inside text which I would like to keep as single tokens. Some of them:
7-2017-19121-B
7-2016-26132
wd/2012/0616
JLG486-01
H14-0890-12
How can I prevent them from being split into separate tokens? I already use a regex in a custom tokenizer to never split words with dashes, but it works only with letters, not with numbers. I don't want to change the default regex, which is big and very complicated. How can I do it easily?
What I have done already is use this "hyphen protector". For 7-2014-1721-Y I got the tokens [7, -, 2014, -, 1721-Y], so the last part is not divided but the previous ones are. As I said, the code is complicated, and I would like to extend it to do the same for number-number patterns.
This is the function:
from spacy.tokenizer import Tokenizer
from spacy.util import compile_prefix_regex, compile_suffix_regex, compile_infix_regex

def custom_tokenizer(nlp):
    prefix_re = compile_prefix_regex(nlp.Defaults.prefixes)
    suffix_re = compile_suffix_regex(nlp.Defaults.suffixes)
    # changing default infixes
    def_infx = nlp.Defaults.infixes
    cur_infx = (d.replace('-|–|—|', '') for d in def_infx)
    infix_re = compile_infix_regex(cur_infx)
    return Tokenizer(nlp.vocab, prefix_search=prefix_re.search, suffix_search=suffix_re.search,
                     infix_finditer=infix_re.finditer, token_match=None)
Maybe there's some easier way besides modifying it? I've tried to quote these "plates" with some escape characters, like {7-2017-19121-B}, but it doesn't work.
By the way, there's a regex which matches these special "numbers". Maybe a workaround for me would be just removing them from the text (which I'll try later), but for now I'm asking if I have any chance here.
["(?=[^\d\s]*\d)(?:[a-zA-Z\d]+(?:/[a-zA-Z\d]+)+)", "(?:[[A-Z\d]+(?:[-][A-Z\d]+)+)"]
Hint: I found out that changing 7-2017-19121-B to 7/2017/19121/B works as needed. The question (for me to check) is how to adapt this to my current code and keep the performance I have now.
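For reference, a rough sketch of that hint as a pre-processing step (the pattern is an assumed cleanup of my second regex above, treating the doubled [ as a typo):
import re

# assumed fix of the second regex above (the doubled "[" treated as a typo)
plate_re = re.compile(r"[A-Z\d]+(?:-[A-Z\d]+)+")

def slashify_plates(text):
    # turn the hyphens inside each matched registry number into slashes
    return plate_re.sub(lambda m: m.group().replace("-", "/"), text)

print(slashify_plates("Got JLG486-01 and 7-2017-19121-B codes."))
# => Got JLG486/01 and 7/2017/19121/B codes.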
You may add them as "special cases":
nlp.tokenizer.add_special_case("7-2017-19121-B", [{ORTH: "7-2017-19121-B"}])
...
nlp.tokenizer.add_special_case("H14-0890-12", [{ORTH: "H14-0890-12"}])
Test:
print([w.text for w in nlp("Got JLG486-01 and 7-2017-19121-B codes.")])
# => ['Got', 'JLG486-01', 'and', '7-2017-19121-B', 'codes', '.']
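If the exact strings are not known in advance, a possible extension of the same idea (just a sketch, not from the original answer; it reuses add_special_case) is to scan the text with a regex first and register every hit before tokenizing:
import re
import spacy
from spacy.attrs import ORTH

nlp = spacy.blank("en")  # or your own pipeline with the custom tokenizer

# assumed pattern for the registry numbers, adapted from the question's regexes
plate_re = re.compile(r"[A-Za-z\d]+(?:[-/][A-Za-z\d]+)+")

def add_plate_special_cases(text):
    # register every matched registry number as a single-token special case
    for plate in set(plate_re.findall(text)):
        nlp.tokenizer.add_special_case(plate, [{ORTH: plate}])

text = "Got JLG486-01 and 7-2017-19121-B codes."
add_plate_special_cases(text)
print([w.text for w in nlp(text)])
# expected: ['Got', 'JLG486-01', 'and', '7-2017-19121-B', 'codes', '.']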

Python: Print entire line of string match and not cut off after the period

See bottom for the solution I came up with.
Hopefully this is an easy question for you guys. I'm trying to match a string to a list and print just the string that matched. I was successful using re, but it is cutting off the rest of the string after the period. The span per re is 0,10, and when I look at the output without using re it is 0,14, not 0,10, so match is cutting off the info after the period. So I would like to learn how to tell it to print the entire span, or learn a new way to match a var string to a list and print that exact string. My original attempts printed anything with TESTPR in it, 3 printed total; the others I do not want printed have a 1 at the front, and the last match has an additional R at the end. Here is my current match code:
# OLD: see below
for element in catalog:
    z = re.match("((TESTPRR )\w+)", element)
    if z:
        print(z.group())
Output: TESTPR 105
It should show:
Wanted output: TESTPT 105.465
It will go up to 3 decimal places after the period and no more. I am currently taking a Python class to learn Python and love it so far, but this one has me stumped as I am just now learning about re and matching by reading as we have not gotten to that yet in class.
I am open to learning a different way to search for and match a string and print just that string. My first attempt, which prints 3 results, was this:
catalog = [ long list pulled from API then code here to make it a nice column]
prod = 'TESTPR'
print ([s for s in catalog if prod in s])
When I add a space at the end of prod I can get rid of the match with the extra char at the end, but I cannot add a space to do the same thing with the match that has an extra char at the front. This is for the code above and not for the re.match code. Thanks!
Answer below!
Since you are interested in learning about ways to match strings and solve your problem: try fuzzywuzzy.
In your case you could try:
from fuzzywuzzy import process
catalog = [long list pulled from API then code here to make it a nice column]
prod = "TESTPR"
hit = process.extractOne(prod, catalog, score_cutoff = 75) #you can adjust this to suit how close the match should be
print(hit[0]) #hit will be sth like ("TESTPT 105.465", 75)
Output: TESTPT 105.465
For information on different ways of using fuzzywuzzy, check out this link.
You can use different ways of matching, such as:
fuzz.partial_ratio
fuzz.ratio
fuzz.token_sort_ratio
fuzz.token_set_ratio
For these, import fuzz: from fuzzywuzzy import fuzz
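As a quick illustration of how those scorers can be called directly (the exact scores depend on your strings, so treat the numbers as indicative only):
from fuzzywuzzy import fuzz

a, b = "TESTPR", "TESTPT 105.465"
print(fuzz.ratio(a, b))             # similarity of the whole strings
print(fuzz.partial_ratio(a, b))     # best-matching substring
print(fuzz.token_sort_ratio(a, b))  # order-insensitive token comparison
print(fuzz.token_set_ratio(a, b))   # set-based token comparison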
Kept at it with re.match and got the correct regex so the entire match prints and it does not cut off numbers after the period.
My original match, as you can see above, was re.match("((TESTPRR )\w+)", element); some of the ( were unneeded, I needed to add a few more expressions, and now it prints the correct match. See above for the old code and below for the new code that works.
# New code: replaced \w+ with \w*\d*[.,]?\d*$
for element in catalog:
    z = re.match("STRING\w*\d*[.,]?\d*$", element)
    if z:
        print(z.group())

What is Natural Language Processing Doing Exactly in This Code?

I am new to natural language processing and I want to use it to write a news aggregator (in Node.js in my case). Rather than just use a prepackaged framework, I want to learn the nuts and bolts, and I am starting with the NLP portion. I found this one tutorial that has been the most helpful so far:
http://www.p-value.info/2012/12/howto-build-news-aggregator-in-100-loc.html
In it, the author gets the RSS feeds and loops through them looking for the elements (or fields) title and description. I know Python and understand the code. But what I don't understand is what NLP is doing here with title and description under the hood (besides scraping and tokenizing, which is apparent... and those tasks don't need NLP).
import feedparser
import nltk

corpus = []
titles = []
ct = -1
for feed in feeds:
    d = feedparser.parse(feed)
    for e in d['entries']:
        words = nltk.wordpunct_tokenize(nltk.clean_html(e['description']))
        words.extend(nltk.wordpunct_tokenize(e['title']))
        lowerwords = [x.lower() for x in words if len(x) > 1]
        ct += 1
        print ct, "TITLE", e['title']
        corpus.append(lowerwords)
        titles.append(e['title'])
(reading your question more carefully maybe this was all already obvious to you, but it doesn't look like anything more deep or interesting is going on)
wordpunct_tokenize is set up here (last line) as
wordpunct_tokenize = WordPunctTokenizer().tokenize
WordPunctTokenizer is implemented by this code:
class WordPunctTokenizer(RegexpTokenizer):
    def __init__(self):
        RegexpTokenizer.__init__(self, r'\w+|[^\w\s]+')
The heart of this is just the regular expression r'\w+|[^\w\s]+', which defines what strings are considered to be tokens by this tokenizer. There are two options, separated by the |:
\w+, that is, one or more "word" characters (alphabetical or numeric)
[^\w\s]+, one or more characters that are neither "word" characters nor whitespace, so this matches any run of punctuation
Here is a reference for Python regular expressions.
I have not dug into the RegexpTokenizer, but I assume it is set up such that the tokenize function returns an iterator that searches a string for the first match of the regular expression, then the next, and so on.
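To see what that pattern does on its own (a plain re illustration, not NLTK's actual implementation), you can run it with re.findall:
import re

# the same pattern the tokenizer uses
pattern = r'\w+|[^\w\s]+'

print(re.findall(pattern, "Node.js isn't hard, right?"))
# ['Node', '.', 'js', 'isn', "'", 't', 'hard', ',', 'right', '?']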
