Find and save a Specific String Until a semicolon - search

I have a large dataset (~25 GB) and would like to retrieve the data that follows 8 specific tags. For example, if I search for the "AC_afr" tag, I also want to keep its value, as in "AC_afr=8855525;". I need a way to search for a tag and keep everything from that tag up to the next semicolon.
I would normally open the file in Excel, but the dataset is much too large.
I have looked online for grep options, but could not find a solution.
Example of the data:
AC_afr=0;AN_afr=8250;non_neuro_AN_eas_kor=2404;non_neuro_AF_eas_kor=0.00000e+00;non_neuro_nhomalt_eas_kor=0;non_cancer_AF_nfe_seu=0.00000e+00;non_cancer_nhomalt_nfe_seu=0;
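One possible approach is to stream the file line by line, so the 25 GB never has to fit in memory, and pull out each "tag=value;" pair with a regular expression. The following is a minimal Python sketch, assuming the file is plain text. The tag list and the function names are placeholders, since the question does not name all eight tags.

```python
import re

# Placeholder list: the question names only "AC_afr"; fill in the real eight tags.
TAGS = ["AC_afr", "AN_afr"]


def build_pattern(tags):
    """Compile a regex matching any of the tags, its '=', and text up to the next ';'."""
    names = "|".join(re.escape(t) for t in tags)
    # The lookbehind keeps "AC_afr" from matching inside a longer name
    # such as "non_neuro_AC_afr".
    return re.compile(r"(?<![A-Za-z0-9_])(?:" + names + r")=[^;]*;")


def extract(lines, tags):
    """Yield, for each input line, the list of 'tag=value;' strings found in it."""
    pattern = build_pattern(tags)
    for line in lines:
        yield pattern.findall(line)


def extract_to_file(src_path, dst_path, tags):
    """Stream src_path and write the matches of each line to dst_path."""
    with open(src_path, encoding="utf-8", errors="replace") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for hits in extract(src, tags):
            if hits:
                dst.write("".join(hits) + "\n")
```

With GNU grep, something like grep -oE 'AC_afr=[^;]*;' file prints only the matching parts in a similar way, but that simple pattern would also match inside a longer tag name that ends in "AC_afr".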

Related

Creating text summary using NLP

I am in the middle of applying NLP to a set of comments from my data. The comments are stored in one column. I have cleaned them and stored them in a list, with stop words, special characters, etc. removed. Now I want to create a summary from this text. What would be the best method for that? I have already tried heapq without success, so I don't want a solution based on it.
My clean text is stored in a list named clean_text_summary, and it looks like this:
clean_text_summary = ['You are so bad - I hate your product','I am going to deregister','You are frauds'....]
As a summary, I need the most common things that people have talked about.
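If "most common things" can be approximated by the most frequent words or word pairs, a simple frequency count with collections.Counter is one option. The sketch below is only that, a frequency count and not a real summarizer. The helper name top_ngrams and its parameters are made up for illustration, and the list is shortened to the three comments shown above.

```python
from collections import Counter

# Shortened to the comments visible in the question.
clean_text_summary = [
    "You are so bad - I hate your product",
    "I am going to deregister",
    "You are frauds",
]


def top_ngrams(comments, n=2, k=5):
    """Return the k most frequent n-word sequences across all comments."""
    counts = Counter()
    for comment in comments:
        # Lowercase and keep only alphanumeric tokens (drops the lone "-").
        words = [w for w in comment.lower().split() if w.isalnum()]
        # zip over shifted copies of the word list yields consecutive n-grams.
        counts.update(zip(*(words[i:] for i in range(n))))
    return counts.most_common(k)


print(top_ngrams(clean_text_summary, n=2, k=3))
```

Setting n=1 counts single words instead of pairs; larger n gives longer phrases but needs more data before anything repeats.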

Asciidoctor nested table

I am trying to create nested tables in my Asciidoctor pdf output but I cannot find the syntax.
If I understand it right, nested tables should be supported in Asciidoctor as of 1.5.0. I am running a Docker container that has 1.5.5 (https://github.com/asciidoctor/docker-asciidoctor).
I've tried as per example in table 11 here: http://www.methods.co.nz/asciidoc/newtables.html but to no avail.
Note that Asciidoc and Asciidoctor are not the same thing.
Therefore, make sure you are looking at the correct documentation.
I have not tried it, but if a nested table is going to work, the cell containing it will have to use the asciidoc style. You will then most likely have to put the table in a block and escape all the pipe symbols (using \| instead of |, or using some other delimiter).
A web search turned up this open issue in the Asciidoctor tracker requesting (improvements to) nested table support. So this seems not to be implemented yet, at least in some backends. The first comment contains an example of how to specify a nested table.
Are you sure you cannot use something other than nested tables? They are usually not the most readable thing.
In order to make it work, you need to delete two unintended newlines. Here's the modified content.
[width="75%",cols="1,2a"]
|==============================================
|Normal cell |Cell with nested table
[cols="2,1"]
!==============================================
!Nested table cell 1 !Nested table cell 2
!==============================================
|==============================================
I must say this was my first time using asciidoctor-pdf. Although the Docker image streamlines the process as much as possible, there is a much quicker way to get rendered feedback: Asciidoctor.js, a Chrome extension that converts your .adoc file to HTML and reloads when you save the file.
Asciidoctor.js comes from the same great team that created and maintains Asciidoctor, so it has the latest Asciidoctor under the hood.

Query wikipedia

I would like to query two or three terms in order to locate them in Wikipedia's entries. Specifically, I'm trying to see whether some terms are repeated in the first paragraphs (abstract) across entries. This could be done directly or through DBpedia. Thanks.
Using the MediaWiki API, you can find articles that contain those keywords.
See the API:Search documentation.
To do what you describe, you would probably need to find the articles that contain those keywords first, and then parse their text to check whether the keywords appear in the first paragraphs.
With this:
?action=parse&page=Nicolas_Cage&prop=text&section=0
you can get the HTML of the first section of a page (see this post).
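The two steps described above (search for the keywords, then fetch section 0 of each hit and check it) could be sketched as follows in Python. This only builds the request URLs and checks a piece of returned HTML, so the actual HTTP call is left to the reader. The function names are invented for illustration, the English Wikipedia endpoint is an assumption, and the tag stripping is deliberately crude.

```python
import re
from urllib.parse import urlencode

# Assumed endpoint: the English Wikipedia API.
API = "https://en.wikipedia.org/w/api.php"


def search_url(terms, limit=10):
    """URL for a full-text search (action=query, list=search) for all terms."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": " ".join(terms),
        "srlimit": limit,
        "format": "json",
    }
    return API + "?" + urlencode(params)


def first_section_url(title):
    """URL that returns the HTML of section 0 (the lead) of a page."""
    params = {
        "action": "parse",
        "page": title,
        "prop": "text",
        "section": 0,
        "format": "json",
    }
    return API + "?" + urlencode(params)


def terms_present(html, terms):
    """Map each term to True/False depending on whether it occurs in the HTML text."""
    # Crude tag stripping; enough for a rough containment check.
    text = re.sub(r"<[^>]+>", " ", html).lower()
    return {t: t.lower() in text for t in terms}
```

Each title returned by the search request can be passed to first_section_url, and the HTML from that response to terms_present.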

Extract parameters and result contents from website

I have a website where I can input a list of strings and it'll display the results of each in the same format (basically a table).
What I want to do is to be able to save the results as well as their corresponding parameters (the input string that I searched) and output them into a file to analyze later. So basically capture my input and the output it returns. It's kind of like, if I search "stack" on google, I want my output file to be "stack" and all the displayed results from the search.
I've done some research on web and screen scraping, but I can't find anything that fits my needs. I looked into the cURL functions in PHP, but it looks like they can only get the contents of a specific URL, which I don't have, since I'll be repeating the searches frequently with different inputs.
I also looked into the HTML Agility Pack and HttpWatch, but they don't seem to be able to extract contents this dynamically.
I was wondering if there are any ideas or tips that I could use. I was thinking maybe a plugin or application that I could write that captures the parameters of my request (input strings) and the results sent from the server, but I'm not really sure how to do this, any tips? Or maybe there's an existing one that I wasn't able to find?
Thanks in advance!
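One way to pair each input with its output is to drive the searches from a script instead of the browser: loop over the input strings, request the result page for each one, and store the input next to every result row. The Python sketch below assumes the results come back as an HTML table. How the page is requested depends on the site, which is not described here, so fetch is a hypothetical callable supplied by the caller that takes a query string and returns the result HTML. All names are illustrative.

```python
import csv
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every table row as a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def collect_results(queries, fetch):
    """Run fetch(query) for each query and prefix every result row with the query."""
    records = []
    for query in queries:
        parser = TableRows()
        parser.feed(fetch(query))
        parser.close()
        for row in parser.rows:
            records.append([query] + row)
    return records


def save_csv(records, path):
    """Write the [query, cell, cell, ...] records to a CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(records)
```

Because the query is stored in the first column of every row, the output file can later be grouped or filtered by the input string that produced each result.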

Calculate length of string using yahoo pipes

I am using Yahoo Pipes to fetch articles from various sources, including Google. However, articles from Google include the title and the source of the title in the description. Is there a way in Yahoo Pipes to remove the title and source and leave the rest of the article intact? I tried the sub-string module, but it requires the length of the string, which varies from article to article. I guess that if there is a way to calculate the length of the title and source and pass it to the sub-string module, this may work.
Any help would be great.
Regards
Take a look at http://pipes.yahoo.com/pipes/pipe.info?_id=8KZMRx473hGtVMYsP27D0g, which can be used as a subpipe (i.e. within a loop) to calculate the length of a string. It should be relatively straightforward to add a second text input module and modify the Pipe to cater for your second text string.
