NLP Challenge: Automatically removing bibliography/references?

I recently came across the following problem: when applying a topic model to a bunch of parsed PDF files, I discovered that the content of the reference sections unfortunately also counts toward the model, i.e. words from the references appear in the tokenized list of words.
Is there any known "best-practice" to solve this problem?
I thought about a search strategy where the Python code automatically removes all content after the last mention of "references" or "bibliography". If I went by the first, or a random, mention of "references" or "bibliography" within the full text, the parser might not capture the true full content.
The input PDFs are all from different journals and thus have different page structures.
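A rough sketch of what I have in mind, in Python (the heading pattern is just a guess and would need tuning per journal):

import re

def strip_references(text):
    # Find the last "References"/"Bibliography" heading on its own line
    # (case-insensitive) and drop everything after it.
    headings = list(re.finditer(r'(?im)^\s*(references|bibliography)\s*$', text))
    return text[:headings[-1].start()] if headings else text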

The syntax is what makes a bibliography entry distinct from a regular sentence.
Test for the pattern that matches whichever reference style (or styles) you are trying to remove.
That is: a date, an unquoted string, a string, page numbers in a certain format.
I'd spend some time searching for a tool that already recognizes bibliographies before building this yourself, as the pattern will be unique to each style (MLA etc.)
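For illustration, a hedged sketch of such a pattern test in Python; the pattern only approximates one APA-like style and is not a general solution:

import re

# Very rough APA-ish entry: "Author, A. ... (1999) ...", anchored per line.
apa_like = re.compile(r'^[A-Z][\w-]+,\s+[A-Z]\..*?\(\d{4}\)', re.M)

def looks_like_reference(line):
    return bool(apa_like.match(line))

print(looks_like_reference("Smith, J. (2019). A study of things. Journal, 3(2), 1-10."))  # True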

A couple of additional features to consider for detecting the start of the reference section (see the sketch after these):
Check whether the mention of "references" or "bibliography" occurs in the last pages as opposed to earlier pages.
Run entity recognition on some window of words (~50?) after the mention; if a high share of those tokens are entities, that indicates journal names, author names, etc.
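A sketch of the second heuristic using spaCy (the model name and the 0.5 threshold are placeholders, not tested values):

import spacy

nlp = spacy.load("en_core_web_sm")

def entity_density(text, window=50):
    # Share of the first `window` tokens that belong to a named entity.
    doc = nlp(" ".join(text.split()[:window]))
    ent_tokens = sum(len(ent) for ent in doc.ents)
    return ent_tokens / max(len(doc), 1)

# Treat a "references" mention as the real heading only if, say,
# entity_density(text_after_mention) > 0.5.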

Related

For a project, searching issues whose title includes ".*.", or exactly "operator", or matches a regular expression

I used to use Bugzilla and its very powerful search engine.
But the project and its bug tracker have been moved to GitLab.
When I search (in online GitLab) for all issues in the project whose title includes some item like "./", or ".*." (Kronecker product), or "//" (one-line comments), etc., no issue is returned, while many issues matching the query actually exist! I tried "\.\*\." and other variants, with no more success.
What should be the query syntax to return the right list?
When querying "operator" (with the double quotes for exact matching), when validating the query, quotes disappear, and i get a list of issues whose title includes operand, or operation, or oper, etc. How can i get only issues exactly matching "operator" ?
Is is possible to filter issues with a title matching a regular expression?
All this (and much more) was possible and very useful with Bugzilla. And for the time being, i am quite handicapped and loose a lot of time when Searching things for the project on Gitlab.
Thanks for any hints.
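One possible workaround, if the built-in search turns out to be too limited: pull the issues through the GitLab REST API and filter titles client-side with a real regex. A minimal sketch in Python, assuming API v4 and a valid token (the project ID and token are placeholders):

import re
import requests

URL = "https://gitlab.com/api/v4/projects/<PROJECT_ID>/issues"
HEADERS = {"PRIVATE-TOKEN": "<YOUR_TOKEN>"}
pattern = re.compile(r'\.\*\.')  # titles containing ".*."

page = 1
while True:
    issues = requests.get(URL, headers=HEADERS,
                          params={"per_page": 100, "page": page}).json()
    if not issues:
        break
    for issue in issues:
        if pattern.search(issue["title"]):
            print(issue["iid"], issue["title"])
    page += 1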

Remove part of a string in each row of a large column of data in KNIME

I am stumped.
I have a column with some thousand rows of unique addresses regarding universities, pharma companies, etc. in a KNIME workflow.
Example:
55 Shattuck Street Boston Massachusetts 02115 US [NAT: US RES: US] for all designated states
What I need is to clean the data, so each row looks nice and computable, like this:
55 Shattuck Street Boston Massachusetts 02115 US.
My problem is I can't seem to get the system to remove everything after US. Does anyone know a suitable approach in KNIME?
You should be able to use either String Replacer or String Manipulation for this. The first one lets you use either a simple wildcard or a full regular expression pattern while the second one uses a Java-like syntax - the choice comes down to how many different variations on the input data you need to handle and which syntax you prefer.
If you just need to remove any text between square brackets, including the space before the opening bracket, then you can use String Replacer in regular-expression mode with a pattern along the lines of the one sketched below.
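For illustration, the same removal expressed as a plain regex substitution (shown here in Python; the pattern carries over to String Replacer's regex mode):

import re

row = "55 Shattuck Street Boston Massachusetts 02115 US [NAT: US RES: US] for all designated states"
clean = re.sub(r'\s\[.*$', '', row)
print(clean)  # 55 Shattuck Street Boston Massachusetts 02115 US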
Besides the nodes already mentioned by nekomatic, which will work perfectly for the given scenario, there's also a user-friendly regular expression tool in the Palladian nodes extension called Regex Extractor. It allows you to build your regexes with a live preview, as you might know from popular online regex testers.
For your scenario, you could e.g. set up a regex like this:
^(?<address>.*)(?:\s\[.*)
In prose, this means: capture all characters up to a space followed by an opening square bracket, and output them into a column named address.
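A quick way to verify the pattern (note that Python uses (?P<name>...) for named groups instead of the (?<name>...) syntax above):

import re

m = re.match(r'^(?P<address>.*)(?:\s\[.*)',
             "55 Shattuck Street Boston Massachusetts 02115 US [NAT: US RES: US] for all designated states")
print(m.group("address"))  # 55 Shattuck Street Boston Massachusetts 02115 US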
The Palladian extension is available as a free plugin for KNIME Desktop and provides a variety of tools for web, text, and geo data mining and classification.

Creating text summary using NLP

I am in the middle of applying NLP to a set of comments that I have received from my data. These comments are stored in one column. I have cleaned them and stored them in a list; there are no stop words, special characters, etc. Now I want to create a summary from this text. What would be the best method to do that? I have already tried heapq without success, so I don't want any solution based on that.
My clean text is stored in a list named clean_text_summary and it looks like this:
clean_text_summary = ['You are so bad - I hate your product','I am going to deregister','You are frauds'....]
I need to get the most common things that people have talked about, as a summary.
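A minimal sketch of one possible baseline: count the most frequent words and bigrams across all comments with scikit-learn (the parameters are illustrative, not a recommendation):

from sklearn.feature_extraction.text import CountVectorizer

clean_text_summary = ['You are so bad - I hate your product',
                      'I am going to deregister',
                      'You are frauds']

vec = CountVectorizer(ngram_range=(1, 2), stop_words='english')
counts = vec.fit_transform(clean_text_summary).sum(axis=0).A1
top = sorted(zip(vec.get_feature_names_out(), counts),
             key=lambda t: t[1], reverse=True)[:10]
print(top)  # the most frequent unigrams/bigrams, i.e. the common themes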

Named entity recognition - tagging tools

Does anyone have a recommendation for a tagging tool for NER types in raw text?
The input for the tool should be a library of text files (simple .txt format), there should be a convenient UI for selecting words and setting the tag/annotation for the selection, and the output should be a structured representation of the tags (e.g. start index, end index, tag) in JSON format.
Founder of LightTag here.
We provide a super convenient interface to do span annotations such as named entity recognition, classification, and relationships.
You can work as one labeler or bring in a team, and LightTag will distribute the work between everyone automatically (no more selecting files and remembering what you labeled already).
You can upload your own suggestions and let labelers use those, or use LightTag's built-in model.
Of course you can annotate at the character level and highlight subwords or multi word phrases.
You can try https://github.com/lasigeBioTM/MER (bash)
see the demo at http://labs.fc.ul.pt/mer/
Online tools:
I guess Dataturks' POS tool should work fine for your use case; you can just upload your data and specify the labels. The UI seems convenient enough.
Here is the link:
https://dataturks.com
It's an online tool, so you can work with multiple people to get the tagging done.
The exact output format you are looking for is not supported, but it can easily be converted: the output looks like word___LABEL word2___LABEL, so a simple two-line script can convert it to start and end indices.
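For example, a minimal sketch of that conversion (assuming whitespace-separated word___LABEL tokens; offsets refer to the untagged text):

def to_spans(tagged):
    spans, pos = [], 0
    for token in tagged.split():
        word, _, label = token.partition("___")
        spans.append({"start": pos, "end": pos + len(word), "tag": label})
        pos += len(word) + 1  # +1 for the separating space
    return spans

print(to_spans("John___PERSON works___O"))
# [{'start': 0, 'end': 4, 'tag': 'PERSON'}, {'start': 5, 'end': 10, 'tag': 'O'}]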
Offline:
Another tool you can check out is Prodigy; it's downloadable software and does similar things, though you might have to pay for it upfront.
https://prodi.gy

Moving data from Word to Access seamlessly

I am trying to migrate structured documents (i.e. documents that are mostly some metadata and one big table) to a database. When I try to move tabular data from Word to Excel, my main point of pain is handling CRLFs within a cell in Word. Any solution for this?
Now, since I will be transferring from Word to Access:
What will be the default behaviour when I attempt to populate a field with a string that contains a CRLF?
What is the cheapest way to get Access to respect "rich text"? (mostly boldface and overstrike)
Thanks
It should just enter the two characters as any other two characters.
HTML is a pretty good solution.
For a more detailed answer, we should probably know how you are doing this "migration".
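If scripting the migration is an option, one way to tame the CRLFs before the import is to flatten each Word table cell first; a sketch with python-docx (the file name is a placeholder, and this assumes plain-text content is enough):

from docx import Document  # pip install python-docx

doc = Document("source.docx")
for table in doc.tables:
    for row in table.rows:
        # Each paragraph in a cell is one line in Word; join them
        # so every cell becomes a single CRLF-free string.
        cells = [" ".join(p.text for p in cell.paragraphs) for cell in row.cells]
        print("\t".join(cells))  # e.g. write to a delimited file for Access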
