How can I separate consecutive strings without any delimiters? - vcf-variant-call-format

My input data is a VCF (Variant Call Format) file. Each line that I am interested in looks like this:
chrI 22232 DEL00BED N <DEL> . PASS SUPP=1159;SUPP_VEC=11111111111111111111111011111111111
I want to count how many of the n samples support (1) a specific deletion at a specific position (22232). To do this I looked at the SUPP_VEC= value, but I don't know how to split it because 1) it is a single string and 2) it has no delimiters. How could I add a space between every character? Or how could I split and count the values of SUPP_VEC= in Python 3?
I was also curious to know what SUPP means. I found one record with SUPP=2 and checked in Excel whether the number of presence (1) / absence (0) flags in SUPP_VEC matched the value of SUPP. However, I could only count one 1 instead of two. Does anybody know what SUPP means?
The reason for my procedure is to have a frequency table for a specific deletion type.
I hope I made myself clear.
Thank you in advance.
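A minimal sketch of one way to do this in Python 3. A string is already a sequence of characters, so it can be split and counted without any delimiter. The helper name parse_info is hypothetical, and the code assumes the INFO field is the eighth whitespace-separated column of the record, as in the example line above.

```python
# Example record copied from the question (columns separated by whitespace here).
line = "chrI 22232 DEL00BED N <DEL> . PASS SUPP=1159;SUPP_VEC=11111111111111111111111011111111111"


def parse_info(info_field):
    """Turn 'KEY=value;KEY2=value2' into a dict (hypothetical helper)."""
    pairs = (item.split("=", 1) for item in info_field.split(";") if "=" in item)
    return dict(pairs)


columns = line.split()          # splits on tabs or spaces
info = parse_info(columns[7])   # assumption: INFO is the 8th column
supp_vec = info["SUPP_VEC"]

flags = [int(ch) for ch in supp_vec]   # one 0/1 value per character
n_present = supp_vec.count("1")        # how many flags are 1
n_absent = supp_vec.count("0")         # how many flags are 0
spaced = " ".join(supp_vec)            # same string with a space between characters

print(columns[1], n_present, n_absent)
```

The list flags keeps one value per sample in the original order, so it could be used as one row of a frequency table, while spaced is only needed if the values must be pasted into a tool that expects delimiters.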

Related

Comparing values with some alphanumerics

I've looked through the forums but couldn't find any questions (with answers) that helped. Any guidance would be appreciated.
I'm working on an Excel/Access project that cross references error codes. The codes are twelve digits long, with the first half and second half that need to be sortable. 99% of these codes are entirely numeric, but the 1% that includes letters is really screwing me up.
For example, a common error code might be "386748000123". This would be split into "386748" and "000123", with the first being the code for the type of system and the second being the type of error.
But then the 1% are something like this: "0957AB003A41". "0957AB", and "003A41".
If I format the columns (in Excel and Access) as numbers then the numeric comparisons are far easier: "000123" equals "123". If I format the columns as strings then I can compare the alphanumeric values, but "000123" and "123" stop matching.
The possible solution I've come across is utilizing the Val function inside an Access query to purely compare values but I've never used it and it seems like only a partial fix. Val ignores the strings, which means "0957AB" will have the same value as "0957XY" - and that doesn't work for this project.
I'm sure many of you have had similar issues, so I'm hoping to get some ideas on different ways the problem has been approached and resolved.
You have not provided a minimal sample of the data or the expected output, and there is no code that I can amend for you. However, the only part you are having trouble with is comparing the alphanumeric codes, so you should format all of your data as strings and then compare. To make 123 equal to "000123", just format the numeric values as strings, as below:
format(123,"000000")
which will give you "000123"
Edit
From your comment I learned that the problem is the key, which is always or often a number. Format will return the proper string for comparison; if the value is already a 6-character string it will return it unchanged, so there would not be a problem.
Do something like this:
if format(key,"000000")=format(code,"000000") then
    'do something
end if
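The same idea, sketched in Python rather than VBA for illustration: convert every key to a fixed-width, zero-padded string before comparing, so that numeric keys line up with their padded text form while alphanumeric keys stay distinct. The function name normalize_code and the choice to upper-case the letters are assumptions, not part of the original answer.

```python
def normalize_code(value, width=6):
    """Return value as an upper-case string left-padded with zeros to `width`."""
    return str(value).strip().upper().zfill(width)


# A numeric key and its padded text form compare equal after normalizing.
print(normalize_code(123) == normalize_code("000123"))

# Alphanumeric codes keep their letters, so these two stay different.
print(normalize_code("0957AB") == normalize_code("0957XY"))
```

Unlike a Val-style numeric conversion, nothing is discarded here, so "0957AB" and "0957XY" are never treated as the same code.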

Is there a quick way for excel to identify and remove duplicate series from a cell such as this?

Is there a built in function, or a simple UDF that can identify the pattern in the information below and remove the duplicates?
Assume the following is all within a single excel cell:
80154, 80299, 80299, 82055, 82145, 82205, 82520, 82570, 83840, 83925,
83925, 83986, 83992, 84315, 80154, 80299, 80299, 82055, 82145, 82205,
82520, 82570, 83840, 83925, 83925, 83986, 83992, 84315
There are two sets of data (starts with 80154 ends with 84315). I want to end up with only one set, but I want to do it to 50,000 lines. The final output should be just the BOLD text. Also, sometimes the data repeats itself 3 times, again, I just want the unique set of data.
NOTE: I can't just remove duplicates, because sometimes there will be duplicates in the set that I need to capture in the final output. For example, (A,A,B,C,A,A,B,C) needs to be reduced to (A,A,B,C).
This finds where the first 20% of the string is repeated and cuts the string at that point.
If it does not find a duplicate, it returns the whole string.
=IFERROR(LEFT(A1,FIND(LEFT(A1,LEN(A1)/5),A1,2)-3),A1)
Play with the 5 until you find the length of string that gives the correct answer on all your strings. The higher the number, the shorter the string it compares.
Also, if it is cutting off too much or not enough (for example, leaving the comma at the end), adjust the -3 up or down.
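For comparison, here is a hedged sketch in Python of an exact alternative to the 20% heuristic: split the cell into items and look for the shortest leading block that rebuilds the whole list when repeated. The function name collapse_repeats is hypothetical, and the sketch assumes the cell contains only whole repetitions of the set.

```python
def collapse_repeats(cell_text):
    """Return the shortest block of items that, repeated whole, rebuilds the cell."""
    items = [part.strip() for part in cell_text.split(",") if part.strip()]
    total = len(items)
    for period in range(1, total + 1):
        # A valid block must divide the list evenly and reproduce it exactly.
        if total % period == 0 and items[:period] * (total // period) == items:
            return ", ".join(items[:period])
    return ", ".join(items)  # only reached for an empty cell


print(collapse_repeats("A,A,B,C,A,A,B,C"))
```

Duplicates inside the set survive because whole blocks are compared rather than individual values. One caveat: a set made up of a single repeated value, such as (A,A), would collapse to (A).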

Assigning and reading multidimensional arrays in Python

I'm stumped.
for a in range(0,500): # 500 is a highly variable number but using it for example purposes
    b = findall(r'<(.*?)>', d) # d will return a highly variable number of matches, anywhere from 45-10000
    c.append([b])
print(c[0][1])
This returns an error because everything from 'b' goes into c[0][0]. I can understand this. The question is how do I split 'b' apart so I can put it into c so I can
print(c[0][234])
and have it give me back the 235th, err, element 234 of the 1st, err, 0th line?
As I said above, the number of passes through 'b' will be variable. Until I get the entire file prepped I can only say that it will end up way north of 10,000 and probably closer to 100,000 by the time I have all the data collection finished. The number of elements stored can and will be highly variable depending on the file they come from. They are all coming from a csv file, but I'm hoping not to add any 'complexity' by having to deal with the csv module, since I've never used it before and that will probably just lead to more questions.
I have tried something similar to the following (with different variables, naturally, so things would be appropriately matched up):
d = list(zip(*(e.split(',') for e in b)))
All this did was split on each and every letter instead of on the comma.
Your error is coming from the square brackets you have in c.append([b]). The brackets create an extra list that contains the list b. So rather than a two dimensional data structure, you're ending up with three dimensions. Your indexing fails because c[0][1] is trying to get a second value from the middle list (which only ever has one item in it).
You might get what you want with c[0][0][1] instead. But you probably don't actually want that extra level in your data structure. You can avoid creating it by using: c.append(b)
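The difference between the two forms of append can be seen in a small sketch. The sample strings in lines are hypothetical stand-ins for whatever text is searched on each pass of the loop.

```python
import re

# Hypothetical stand-in for the text searched on each pass of the loop.
lines = ["<a>,<b>,<c>", "<d>,<e>"]

nested = []   # built with append([b]): three levels deep
rows = []     # built with append(b): two levels deep

for d in lines:
    b = re.findall(r"<(.*?)>", d)   # list of the matches found in this line
    nested.append([b])
    rows.append(b)

print(nested[0][0][1])  # extra index needed to reach the match
print(rows[0][1])       # direct row/column access
```

With rows, each inner list can have a different length, so the varying number of matches per line is not a problem.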

SAS: Match single word within string values of a single variable then replace entire string value with a blank

I'm working in SAS 9.2, in an existing dataset. I need a simple way to match a single word within string values of a single variable, and then replace the entire string value with a blank. I don't have experience with SQL, macros, etc., and I'm hoping for a way to do this (even if the code is less efficient) that will be clear to a novice.
Specifically, I need to remove the entire string containing the word "growth" in a variable "pathogen." Sample values include "No growth during two days", "no growth," "growth did not occur," etc. I cannot enter all possible strings since I don't yet know how they will vary (we have only entered a few observations so far).
TRANSWD and TRANSLATE will not work as they will not allow me to replace an entire phrase when the target word is only a part of the string.
Other methods I've looked at (for example, a SESUG paper using PRX at http://analytics.ncsu.edu/sesug/2007/CC06.pdf) appear to remove all instances of the target string in every variable in the dataset, instead of just in the variable of interest.
Obviously I could subset the dataset to a single variable before I perform one of these actions and then merge back, but I'm hoping for something less complicated. Although I will certainly give something more complicated a shot if someone can provide me with sample code to adapt (and it would be greatly appreciated).
Thanks in advance--Kim
Could you be a little more clear on how the data set is constructed? I think mjsqu's solution will work if your variable pathogen is stored sentence by sentence. If not, then I would say your best bet is to parse the blocks into sentences and then apply mjsqu's solution.
DATA dataset1;
format Ref best1.
pathogen $40.;
input Ref pathogen $40. ;
datalines;
1 No growth during two days
2 no growth,
3 growth did not occur,
4 does not have the word
;
RUN;
DATA dataout;
SET dataset1;
IF index(lowcase(pathogen),"growth") THEN pathogen="";
RUN;

How to convert a string containing non-numeric values into numeric values?

I have several variables of the form:
1 gdppercap
2 19786,97
3 20713,737
4 20793,163
5 23070,398
6 5639,175
I have copy-pasted the data into Stata, and it thinks they are strings. So far I have tried:
destring gdppercap, generate(gdppercap_n)
but get
gdppercap contains nonnumeric characters; no generate
And:
encode gdppercap, gen(gdppercap_n)
but get a variable numbered from 1 to 1055 regardless of the previous value.
Also I've tried:
gen gdppercap_n = real(gdppercap)
But get:
(1052 missing values generated)
Can you help me? As far as I can tell, Stata does not like the fact that the variable contains fractional numbers.
If I understand you correctly, the interpretation as string arises from one and possibly two facts:
1. The variable name may be echoed in the first observation. If so, that's text and it's inconsistent with a numeric variable. The root problem there is likely to be a copy-and-paste operation that copied too much. Stata typically gives you a choice when importing by copy-and-paste of whether the first row of what you copied is to be treated as variable names or as data, and you need the first choice, so that column headers become variable names, not data. It may be best to go back and do the copy-and-paste correctly. However, Stata can struggle with multiple header lines in a spreadsheet. Alternatively, use import excel, not a copy-and-paste. Alternatively, drop in 1 to remove the first observation, provided that it consistently is superfluous.
2. Commas indicate decimal places. destring can easily cope with this: see the help for its dpcomma option. Stata has no objection to fractions; that would be absurd. The problem is that you need to flag your use of commas.
Note that
destring is a wrapper for real(), so real() is not a way round this.
encode is for mapping genuine categorical variables to integers, as you discovered, and as its help does explain. It is not for fixing data input errors.
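The two causes described above can be illustrated outside Stata with a short Python sketch. The function name parse_decimal_comma is hypothetical, and the sample values are taken from the question. A header echoed as data cannot be converted at all, while a decimal comma only needs to be flagged as such.

```python
def parse_decimal_comma(text):
    """Convert '19786,97' to 19786.97; return None if the text is not numeric."""
    try:
        return float(text.strip().replace(",", "."))
    except ValueError:
        return None


# First entry is the variable name echoed as data; the rest use decimal commas.
raw = ["gdppercap", "19786,97", "20713,737", "5639,175"]
values = [parse_decimal_comma(item) for item in raw]
print(values)
```

The header row comes back as None, which mirrors why a single text value is enough to keep the whole column from being read as numeric.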
You can write a for loop to convert a comma to a period. I don't quite know your variables but imagine you have a variable gdppercap with information like 1234,343 and you want that to be 1234.343 before you do the destring.
For example:
forvalues x = 1(1)10 {
    replace gdppercap = substr(gdppercap, 1, `x'-1) + "." + substr(gdppercap, `x'+1, .) if substr(gdppercap, `x', 1) == ","
}
