Calculate length of string using Yahoo Pipes - string

I am using Yahoo Pipes to fetch articles from various sources, including Google. However, articles from Google include the title and the source of the title in the description. Is there a way in Yahoo Pipes to remove the title and source and leave the rest of the article intact? I tried the sub-string module, but it requires the length of the string, which varies for each article. I guess that if there is a way to calculate the length of the title and source and pass it to the sub-string module, this may work.
Any help would be great.
Regards

Take a look at http://pipes.yahoo.com/pipes/pipe.info?_id=8KZMRx473hGtVMYsP27D0g, which can be used as a subpipe (i.e. within a loop) to calculate the length of a string. It should be relatively straightforward to add a second text input module and modify the Pipe to cater for your second text string.

Related

Find and save a Specific String Until a semicolon

I have a large dataset (~25GB) and I would like to retrieve the data following 8 specific modifiers. For example, if I have the "AC_afr" tag I searched for, I would also like to keep its data "AC_afr=8855525;". I need a way to search for a tag, and then keep everything after that tag until the semicolon.
I would normally open it up in Excel, but the data is much too large.
I have looked online for grep options, but could not find a solution.
Example of the data:
AC_afr=0;AN_afr=8250;non_neuro_AN_eas_kor=2404;non_neuro_AF_eas_kor=0.00000e+00;non_neuro_nhomalt_eas_kor=0;non_cancer_AF_nfe_seu=0.00000e+00;non_cancer_nhomalt_nfe_seu=0;
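One possible approach, sketched here in Python rather than grep, is to stream the file line by line and pull out each wanted tag together with its value up to the next semicolon. The tag list `TAGS` and the function names are hypothetical; the question names only "AC_afr" out of its 8 modifiers, so the second tag is taken from the sample data for illustration.

```python
import re

# Hypothetical tag list: the question names only "AC_afr"; add the other modifiers here.
TAGS = ["AC_afr", "AN_afr"]

# Match a whole tag name (not the tail of a longer name such as "non_cancer_AC_afr"),
# then "=" and everything up to the next ";".
PATTERN = re.compile(
    r"(?<![A-Za-z0-9_])(" + "|".join(map(re.escape, TAGS)) + r")=([^;]*);"
)


def extract_tags(line):
    """Return {tag: value} for every wanted tag found on one line."""
    return {m.group(1): m.group(2) for m in PATTERN.finditer(line)}


def extract_from_file(path):
    """Yield one dict per line, reading lazily so the file never sits fully in memory."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            yield extract_tags(line)
```

Because the file is read one line at a time, memory use depends on the longest line rather than on the 25GB total. The lookbehind is a design choice to keep "AC_afr" from matching inside longer tag names.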

Remove part of a string in each row of a large column of data in KNIME

I am stumped.
I have a column with some thousand rows of unique addresses of universities, pharma companies, etc. in a KNIME workflow.
Example:
55 Shattuck Street Boston Massachusetts 02115 US [NAT: US RES: US] for all designated states
What I need is to clean the data so that each row looks nice and computable, like this:
55 Shattuck Street Boston Massachusetts 02115 US.
My problem is that I can't seem to get the system to remove everything after US. Does anyone know a suitable approach in KNIME?
You should be able to use either String Replacer or String Manipulation for this. The first one lets you use either a simple wildcard or a full regular expression pattern while the second one uses a Java-like syntax - the choice comes down to how many different variations on the input data you need to handle and which syntax you prefer.
If you just need to remove any text between square brackets including the space before the open bracket then you can use String Replacer configured like this:
Besides the nodes already mentioned by nekomatic, which will work perfectly for the given scenario, there's also a user-friendly regular expression tool in the Palladian nodes extension called Regex Extractor. It lets you build your regexes with a live preview, as you might know from popular online regex testers.
For your scenario, you could e.g. set up a regex like this:
^(?<address>.*)(?:\s\[.*)
In prose, this means: Capture all characters until a space + square opening bracket and output into a column named address.
The Palladian extension is available here as a free plugin for KNIME Desktop and provides a variety of different tools for web, text, and geo data mining and classification.
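To check what the pattern above does outside KNIME, here is a minimal Python sketch. It is only an illustration of the regex, not of the Regex Extractor node itself. Python spells named groups `(?P<name>...)` instead of `(?<name>...)`, and the helper name `clean_address` is hypothetical.

```python
import re

# Same pattern as above, with Python's (?P<name>...) spelling for the named group.
ADDRESS = re.compile(r"^(?P<address>.*)(?:\s\[.*)")


def clean_address(row):
    """Return the text before the last ' [' on the row; rows without a bracket pass through."""
    match = ADDRESS.match(row)
    return match.group("address") if match else row


row = "55 Shattuck Street Boston Massachusetts 02115 US [NAT: US RES: US] for all designated states"
print(clean_address(row))
```

Note that `.*` is greedy, so if a row contained several bracketed parts, only the text after the last " [" would be dropped.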

NLP Challenge: Automatically removing bibliography/references?

I recently came across following problem: When applying a topic model on a bunch of parsed PDF files, I discovered that content of the references unfortunately also counts for the model. I.e. words within the references appear in the tokenized list of words.
Is there any known "best-practice" to solve this problem?
I thought about a search strategy where the Python code automatically removes all content after the last mention of "references" or "bibliography". If I went by the first mention, or a random mention, of "references" or "bibliography" within the full text, the parser might cut off part of the actual content.
The input PDF are all from different journals and thus have a different page structure.
The syntax is what makes a bibliography entry distinct from a regular sentence.
Test for the pattern that matches the reference style (or styles) you are trying to remove.
For example: date, unquoted string, string, page numbers in a certain format.
I'd spend some time searching for a tool that already recognizes bibliographies before doing this, as the pattern will be unique to each style (MLA etc.).
A couple of additional features to consider for detecting the start of the reference section:
Check if the mention of "references" or "bibliography" is in the last pages as opposed to earlier pages.
Run entity recognition on some length of words (~50?) after the word; if a high number of those 50 tokens are entities, that indicates journal names, author names, etc.
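The "last mention, and only if it is late in the document" idea can be sketched as follows. This is a rough heuristic under stated assumptions, not a tested best practice: it assumes the heading sits on a line of its own, and the `tail_fraction` threshold and the function name are made up for illustration. The entity-recognition check is not included.

```python
import re

# Assumption: the section heading appears alone on its line, e.g. "References".
HEADING = re.compile(r"^\s*(references|bibliography)\s*$", re.IGNORECASE | re.MULTILINE)


def strip_references(text, tail_fraction=0.5):
    """Cut the text at the last heading line, but only if it lies in the final part."""
    matches = list(HEADING.finditer(text))
    if not matches:
        return text  # no heading found: leave the text alone
    last = matches[-1]
    # Ignore headings that occur too early; tail_fraction is an arbitrary threshold.
    if last.start() < len(text) * (1 - tail_fraction):
        return text
    return text[: last.start()].rstrip()
```

Requiring a standalone heading line keeps in-sentence mentions such as "see the references" from triggering a cut, at the cost of missing headings like "5. References" unless the pattern is extended.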

Query wikipedia

I would like to query two or three terms in order to locate them in Wikipedia's entries. Specifically, I'm trying to see if some terms get repeated in the first paragraphs (abstract) across entries. This could be done directly or through DBpedia. Thanks
Using the MediaWiki API, you can find articles that contain those keywords.
Try the API:Search documentation.
To do what you want, you'd probably also need to find the articles that contain those keywords and then parse the text to check whether they appear in the first paragraphs.
With this:
?action=parse&page=Nicolas_Cage&prop=text&section=0
you can get the HTML of the first section of a page (see this post).
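As a small sketch of the two requests, the following builds the URLs for a keyword search and for the first section of a page. The English Wikipedia endpoint, the function names, and the `limit` default are assumptions for illustration; the code only builds the URLs and leaves fetching and parsing to whichever HTTP client you prefer.

```python
from urllib.parse import urlencode

# Assumption: the English Wikipedia endpoint; other wikis use their own api.php URL.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def search_url(terms, limit=10):
    """URL for a full-text search (action=query, list=search) on all terms together."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": " ".join(terms),
        "srlimit": limit,
        "format": "json",
    }
    return API_ENDPOINT + "?" + urlencode(params)


def intro_url(page_title):
    """URL for the HTML of section 0 (the lead section) of one page."""
    params = {
        "action": "parse",
        "page": page_title,
        "prop": "text",
        "section": 0,
        "format": "json",
    }
    return API_ENDPOINT + "?" + urlencode(params)


print(search_url(["actor", "film"]))
print(intro_url("Nicolas_Cage"))
```

The titles returned by the search request can be fed one by one into `intro_url`, and the returned HTML can then be checked for the terms.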

Extract parameters and result contents from website

I have a website where I can input a list of strings and it'll display the results of each in the same format (basically a table).
What I want to do is save the results together with their corresponding parameters (the input strings that I searched for) and output them into a file to analyze later. So basically, I want to capture my input and the output it returns. For example, if I search "stack" on Google, I want my output file to contain "stack" and all the displayed results from the search.
I've done some research on web and screen scraping, but I can't find anything that fits my needs. I looked into the curl functions in PHP, but it looks like they can only get the contents of a specific URL, which I don't have since I'll be repeating the searches frequently.
I also looked into the HTML Agility Pack and HttpWatch, but they don't seem to be able to extract contents this dynamically.
I was wondering if there are any ideas or tips that I could use. I was thinking of writing a plugin or application that captures the parameters of my request (the input strings) and the results sent from the server, but I'm not really sure how to do this. Any tips? Or maybe there's an existing one that I wasn't able to find?
Thanks in advance!
