Compare two text columns to find partial matches and new rows - Excel

I have two lists of messages. The first contains short messages; the second is a master file of longer texts that include the short messages from the first list, plus many new messages. I want to find the new entries in the master file (the second list), i.e. the rows with no partial match against the first list.
The desired output looks something like the above, where NO means the message is a new error.
I tried =IF(ISERROR(VLOOKUP("*"&A2&"*",C:C,1,0)),"No","Yes"), but that works the other way around: for each short message it checks whether it appears somewhere in the master file. What I want is the reverse: for each long message in the master file, check whether it contains any of the short messages, and if there is no (partial) match, label it as new.

This should work, though I can't test it at the moment. SEARCH looks for each short message in $A$2:$A$8 inside B2, ISNUMBER flags the hits, and SUMPRODUCT adds them up, so anything greater than zero means at least one partial match:
=IF(SUMPRODUCT(--ISNUMBER(SEARCH($A$2:$A$8,B2)))>0,"YES","NO")

Try:
=IF(OR(ISNUMBER(FIND(" "&$A$2:$A$8&" "," "&B2&" "))),"YES","NO")
Note the use of spaces; otherwise "aaa" would also be found in "kkaaa". (FIND is case-sensitive, unlike the SEARCH used above, and in older Excel versions this may need to be entered as an array formula with Ctrl+Shift+Enter.)


Problems working out the re.compile mask/cypher

I have working code using re.compile that searches for a given key and extracts specified bytes from that line.
Working cypher
S011=re.compile(r"S0\w*\W*11\b")
It searches for 'S0' at the start and '11' further in (the intervening alphanumeric characters change with each file):
S012PA041 11 1001650953.34N 72627.05E 426930.97227906.7 285.3227033224
I am trying to use the same method for a different input file, but I can't work out the correct mask/cypher. There are several lines starting with 'P1', so that alone is not exclusive enough; the 'P1....,V0' combination is the exclusive key. Again, the numbers between the keys change with each event and file.
P1,0,01169-72-063,,1001,,1,2020:07:31:12:48:01.7,1,V01,2,,436389.57,7196330.69,,64.88354429,7.65691702,,64.88327349,7.65520631,,0.00,0.00,0.00,0.00,,248.04
I have tried these, but with no success:
V0=re.compile(r"^P1\w*\W*V0")
V0=re.compile(r"^P1\w*\W*V0\w*\W*")
V0=re.compile(r"^P1\w*V0\w*")
After running more combinations than a safe-cracker on Red Bull, I've finally got the right sequence for the regex.
The line to be identified, using 'P1' at the start and 'V01' further in as the search keys:
P1,0,01169-72-063,,1001,,1,2020:07:31:12:48:01.7,1,V01,2,,436389.57,7196330.69,,64.88354429,7.65691702,,64.88327349,7.65520631,,0.00,0.00,0.00,0.00,,248.04
The re.compile code to identify it:
V0=re.compile(r"^P1\s*,*:*\S*V01\s*,*:*\S*\b")
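For anyone who wants to sanity-check the pattern, here is a minimal sketch of using it to filter a file (the input file name is hypothetical):
import re

V0 = re.compile(r"^P1\s*,*:*\S*V01\s*,*:*\S*\b")

with open("events.txt") as f:  # hypothetical input file
    for line in f:
        if V0.match(line):
            # Only lines starting with 'P1' and containing 'V01' land here
            print(line.rstrip())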

Create automated report from web data

I have a set of APIs I need to source data from, covering four different data categories. This data is then used for reporting purposes in Excel.
I initially created web queries in Excel, but my laptop just crashes because there are too many queries that have to be updated. Do you guys know a smart workaround?
This is an example of the APIs I will source data from (40 different ones in total):
https://api.similarweb.com/SimilarWebAddon/id.priceprice.com/all
The data points I need are:
EstimatedMonthlyVisits, TopOrganicKeywords, OrganicSearchShare, TrafficSources
Any ideas how I can create an automated report which queries the above data on request?
Thanks so much.
If Excel is crashing under the load, which doesn't surprise me, you should consider using Python or R for this task. Here is a worked R example, adapted from the article linked at the end, to give you a feel for the language.
install.packages("XML")
install.packages("plyr")
install.packages("ggplot2")
install.packages("gridExtra")
require("XML")
require("plyr")
require("ggplot2")
require("gridExtra")
Next we need to set our working directory and parse the XML file as a matter of practice, so we're sure that R can access the data within the file. This is basically reading the file into R. Then, just to confirm that R knows our file is in XML, we check the class. Indeed, R is aware that it's XML.
setwd("C:/Users/Tobi/Documents/R/InformIT") #you will need to change the filepath on your machine
xmlfile=xmlParse("pubmed_sample.xml")
class(xmlfile) #"XMLInternalDocument" "XMLAbstractDocument"
Now we can begin to explore our XML. Perhaps we want to confirm that our HTTP query on Entrez pulled the correct results, just as when we query PubMed's website. We start by looking at the contents of the first node or root, PubmedArticleSet. We can also find out how many child nodes the root has and their names. This process corresponds to checking how many entries are in the XML file. The root's child nodes are all named PubmedArticle.
xmltop = xmlRoot(xmlfile) #gives content of root
class(xmltop)#"XMLInternalElementNode" "XMLInternalNode" "XMLAbstractNode"
xmlName(xmltop) #give name of node, PubmedArticleSet
xmlSize(xmltop) #how many children in node, 19
xmlName(xmltop[[1]]) #name of root's children
To see the first two entries, we can do the following.
# have a look at the content of the first child entry
xmltop[[1]]
# have a look at the content of the 2nd child entry
xmltop[[2]]
Our exploration continues by looking at subnodes of the root. As with the root node, we can list the name and size of the subnodes as well as their attributes. In this case, the subnodes are MedlineCitation and PubmedData.
#Root Node's children
xmlSize(xmltop[[1]]) #number of nodes in each child
xmlSApply(xmltop[[1]], xmlName) #name(s)
xmlSApply(xmltop[[1]], xmlAttrs) #attribute(s)
xmlSApply(xmltop[[1]], xmlSize) #size
We can also separate each of the 19 entries by these subnodes. Here we do so for the first and second entries:
#take a look at the MedlineCitation subnode of 1st child
xmltop[[1]][[1]]
#take a look at the PubmedData subnode of 1st child
xmltop[[1]][[2]]
#subnodes of 2nd child
xmltop[[2]][[1]]
xmltop[[2]][[2]]
The separation of entries is really just us indexing into the tree structure of the XML. We can continue to do this until we exhaust a path, or, in XML terminology, reach the end of a branch. We can do this via the numbers of the child nodes or their actual names:
#we can keep going till we reach the end of a branch
xmltop[[1]][[1]][[5]][[2]] #title of first article
xmltop[['PubmedArticle']][['MedlineCitation']][['Article']][['ArticleTitle']] #same command, but more readable
Finally, we can transform the XML into a more familiar structure—a dataframe. Our command completes with errors due to non-uniform formatting of data and nodes. So we must check that all the data from the XML is properly inputted into our dataframe. Indeed, there are duplicate rows, due to the creation of separate rows for tag attributes. For instance, the ELocationID node has two attributes, ValidYN and EIDType. Take the time to note how the duplicates arise from this separation.
#Turning XML into a dataframe
Madhu2012=ldply(xmlToList("pubmed_sample.xml"), data.frame) #completes with errors: "row names were found from a short variable and have been discarded"
View(Madhu2012) #for easy checking that the data is properly formatted
Madhu2012.Clean=Madhu2012[Madhu2012[25]=='Y',] #gets rid of duplicated rows
Here is a link that should help you get started.
http://www.informit.com/articles/article.aspx?p=2215520
If you have never used R before, it will take a little getting used to, but it's worth it. I've been using it for a few years now, and compared to Excel I have seen R run anywhere from a few times faster to orders of magnitude faster. Good luck.
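If you go the Python route instead, note that the endpoint you posted returns JSON, which is even easier to handle than XML. A minimal sketch follows; the output file name is a placeholder, and the four field names are copied from the question, so check them against the real payload before relying on this.
import csv
import requests

# The question mentions 40 endpoints like this one; list them all here
urls = [
    "https://api.similarweb.com/SimilarWebAddon/id.priceprice.com/all",
]

# The four data points from the question; adjust if the real JSON keys differ
fields = ["EstimatedMonthlyVisits", "TopOrganicKeywords",
          "OrganicSearchShare", "TrafficSources"]

with open("report.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["url"] + fields)
    for url in urls:
        data = requests.get(url, timeout=30).json()
        # dict.get() keeps a missing key from crashing the whole run
        writer.writerow([url] + [data.get(field) for field in fields])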

How to convert many .sra files into one fastq file?

I downloaded a series of .sra files, all belonging to one sample, from NCBI. I tried to convert one .sra file into fastq, but I get an error.
My code:
$ fastq-dump -I --split-files ERRXXXXX.sra
My .sra files are paired-end.
I used $ fastq-dump SRR5XXXXX.sra to convert another dataset, and it worked well.
So I would like to know: how can I turn many .sra files into one .fastq file? Thank you for your kindness.
I don't really understand the whole message, but regarding your specific question, "how to make many .sra into one .fastq document", the answer is pretty simple:
Generate a fastq file from each of the .sra files you are interested in, in the usual way.
Concatenate all those fastq files into a single one: cat fastq1.fq fastq2.fq ... fastqN.fq > new_fastq.fq
Remove the intermediate files if they are no longer needed.
The new_fastq.fq file then contains all the information from the original .sra files.
Take care not to mix first and second ends in the same fastq (unless you know what you are doing, of course); see the sketch below.
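To automate those steps for many paired-end files, something along these lines should work. This is only a sketch: it assumes fastq-dump is on your PATH, all the .sra files sit in the current directory, and the merged output names are placeholders.
import glob
import subprocess

# Step 1: dump each .sra into per-end fastq files (_1/_2 for paired data)
for sra in sorted(glob.glob("*.sra")):
    subprocess.run(["fastq-dump", "--split-files", sra], check=True)

# Step 2: concatenate each end separately so first and second ends stay apart
for end in ("1", "2"):
    with open(f"merged_{end}.fq", "w") as out:
        for fq in sorted(glob.glob(f"*_{end}.fastq")):
            with open(fq) as src:
                out.write(src.read())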

Combining phrases from a list of words in Python 3

I'm doing my best to grab information out of a lot of PDF files. I have the data in a dictionary where each key is a date and each value is a list of occupations.
It looks like this when well-formed:
'12/29/2014': [['COUNSELING',
'NURSING',
'NURSING',
'NURSING',
'NURSING',
'NURSING']]
However, occasionally there are occupations of several words which cannot be reliably understood in single-word form, such as this:
'11/03/2014': [['DENTISTRY',
'OSTEOPATHIC',
'MEDICINE',
'SURGERY',
'SOCIAL',
'SPEECH-LANGUAGE',
'PATHOLOGY']]
Notice that "osteopathic medicine & surgery" and "speech-language pathology" are the full text for two of these entries. This gets hairier when we also have examples of just "osteopathic medicine" or even "medicine."
So my question is this: how should I go about testing combinations of these words to see if they match more complex occupational titles? I can rely on the order of the words, as I have maintained it from the source.
Thanks!
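One workable approach, since the word order is preserved: a greedy longest-match scan against a vocabulary of known multi-word titles, trying the longest candidate at each position first and falling back to shorter ones. A minimal sketch follows; the KNOWN_TITLES set is illustrative, and you would build it from the real list of occupational titles.
# Known multi-word titles, stored as tuples of their single-word tokens
KNOWN_TITLES = {
    ("OSTEOPATHIC", "MEDICINE", "SURGERY"),
    ("OSTEOPATHIC", "MEDICINE"),
    ("SPEECH-LANGUAGE", "PATHOLOGY"),
}
MAX_LEN = max(len(t) for t in KNOWN_TITLES)

def merge_titles(words):
    """Greedily merge runs of words that match a known multi-word title."""
    merged, i = [], 0
    while i < len(words):
        # Try the longest candidate first so a 3-word title beats its 2-word prefix
        for n in range(min(MAX_LEN, len(words) - i), 1, -1):
            if tuple(words[i:i + n]) in KNOWN_TITLES:
                merged.append(" ".join(words[i:i + n]))
                i += n
                break
        else:
            merged.append(words[i])
            i += 1
    return merged

print(merge_titles(["DENTISTRY", "OSTEOPATHIC", "MEDICINE", "SURGERY",
                    "SOCIAL", "SPEECH-LANGUAGE", "PATHOLOGY"]))
# ['DENTISTRY', 'OSTEOPATHIC MEDICINE SURGERY', 'SOCIAL', 'SPEECH-LANGUAGE PATHOLOGY']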

Logstash - Extracting information from different events

I have a case where related information appears on different lines, and these lines are not even consecutive. Is there any way to read them and combine them into a single event?
Example log, where I want to gather information from the first line and the last line into a single event:
07:11:02.002015|OrderId=100 Client=TEST
07:11:02.002016|blah1
07:11:02.002017|blah2
07:11:02.002018|blah3
07:11:02.002019|OrderId=100 Symbol=APPLE Price=99 Quantity=100
I want to gather all the information about OrderId=100 from lines 1 and 5.
Or can you suggest the best way to work with this kind of log?
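Logstash's aggregate filter is designed for exactly this kind of correlation: it merges events that share a task id, which here would be the OrderId. If pre-processing the log outside Logstash is an option, the idea is also simple enough to sketch in Python (the input file name is hypothetical; the key=value format is taken from your example):
import re
from collections import defaultdict

# Matches key=value tokens such as OrderId=100 or Symbol=APPLE
PAIR = re.compile(r"(\w+)=(\S+)")

orders = defaultdict(dict)
with open("orders.log") as f:  # hypothetical input file
    for line in f:
        pairs = dict(PAIR.findall(line))
        if "OrderId" in pairs:
            # Merge every line mentioning this OrderId into one record
            orders[pairs["OrderId"]].update(pairs)

for order_id, record in orders.items():
    print(order_id, record)
# For OrderId=100 this prints one merged record:
# 100 {'OrderId': '100', 'Client': 'TEST', 'Symbol': 'APPLE', 'Price': '99', 'Quantity': '100'}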
