GATK GnarlyGenotyper limit of alleles

I am joint calling 167 samples with GATK GenomicsDBImport, but I got this kind of error:
Sample/Callset 45( TileDB row idx 107) at Chromosome Chr1 position
1320197 (TileDB column 247913574) has too many genotypes in the
combined VCF record : 1081 : current limit : 1024 (num_alleles,
ploidy) = (46, 2). Fields, such as PL, with length equal to the
number of genotypes will NOT be added for this sample for this
location.
Following the advice I found at the link below, I decided to use GnarlyGenotyper to call the variants, as it seems to handle more alleles.
https://gatk.broadinstitute.org/hc/en-us/community/posts/360072168712-GenomicsDBImport-Attempting-to-genotype-more-than-50-alleles?page=1#community_comment_360012343671
I ran the following script, with the option that should accept more alleles:
~/gatk-4.2.0.0/gatk GnarlyGenotyper \
-R "$reference" \
-V gendb://GenomicsDBImport_GATK \
--max-alternate-alleles 100 \
-O GenotypeGVCFs_gnarly.vcf
Unfortunately I got the following error as well:
Chromosome Chr1 position 198912 (TileDB column 246792289) has too many
alleles in the combined VCF record : 7 : current limit : 6. Fields,
such as PL, with length equal to the number of genotypes will NOT be
added for this location.
Has anyone used this tool before? Is it possible to allow more alleles?
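As a side diagnostic (not a fix), it can help to know how many sites actually exceed the default cap of 6 alternate alleles. Here is a minimal pysam sketch, assuming a combined VCF is available to inspect; the file name is hypothetical and pysam is an extra dependency:
import pysam

# count records whose number of ALT alleles exceeds the default limit of 6
vcf = pysam.VariantFile("combined.vcf")  # hypothetical file name
over_cap = sum(1 for rec in vcf if rec.alts is not None and len(rec.alts) > 6)
print(over_cap, "records have more than 6 alternate alleles")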

Related

How do I fix USER FATAL MESSAGE 740?

How do I fix USER FATAL MESSAGE 740? This error is generated by Nastran when I try to run a BDF/DAT file of mine.
*** USER FATAL MESSAGE 740 (RDASGN)
UNIT NUMBER 5 HAS ALREADY BEEN ASSIGNED TO THE LOGICAL NAME INPUT
USER ACTION: CHANGE THE UNIT NUMBER ON THE ASSIGN STATEMENT AND IF THE UNIT IS USED FOR
PARAM,POST,<0 THEN SPECIFY PARAM,OUNIT2 WITH THE NEW UNIT NUMBER.
AVOID USING THE FOLLOWING UNIT NUMBERS THAT ARE ASSIGNED TO SPECIAL FILES IN MSC.NASTRAN:
1 THRU 12, 14 THRU 22, 40, 50, 51, 91, 92. SEE THE MSC.NASTRAN INSTALLATIONS/OPERATIONS
GUIDE SECTION ON MAKING FILE ASSIGNMENTS OR MSC.NASTRAN QUICK REFERENCE GUIDE ON
ASSIGN PHYSICAL FILE FOR REFERENCE.
Below is the head of my BDF file.
assign userfile='SUB1_PLATE.csv', status=UNKNOWN, form=formatted, unit=52
SOL 200
CEND
ECHO = NONE
DESOBJ(MIN) = 35
set 30=1008,1007,1015,1016
DESMOD=SUB1_PLATE
SUBCASE 1
$! Subcase name : DefaultLoadCase
$LBCSET SUBCASE1 DefaultLbcSet
ANALYSIS = STATICS
SPC = 1
LOAD = 6
DESSUB = 99
DISPLACEMENT(SORT1,PLOT,REAL)=ALL
STRESS(SORT1,PLOT,VONMISES,CORNER)=ALL
BEGIN BULK
param,xyunit,52
[...]
ENDDATA
Below is the solution.
Correct
assign userfile='SUB1_PLAT.csv', status=UNKNOWN, form=formatted, unit=52
I shortened the name of the CSV file to SUB1_PLAT.csv, which reduced the line length to 72 characters.
Incorrect
assign userfile='SUB1_PLATE.csv', status=UNKNOWN, form=formatted, unit=52
The File Management Section is limited to 72 characters per line, spaces included. The incorrect line stretches to 73 characters. The Nastran reader ignores the 73rd character onward, so instead of reading "unit=52" it reads "unit=5", and unit 5 is already assigned to the logical name INPUT, which triggers the error.
|<--------------------- 72 Characters -------------------------------->||<- Characters are ignored truncated ->
assign userfile='SUB1_PLATE.csv', status=UNKNOWN, form=formatted, unit=52
References
MSC Nastran Reference Guide
The records of the first four sections are input in free-field format
and only columns 1 through 72 are used for data. Any information in
columns 73 through 80 may appear in the printed echo, but will not be
used by the program. If the last character in a record is a comma,
then the record is continued to the next record.
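As an aside, over-long lines can be caught before Nastran sees them with a minimal Python sketch along these lines (the BDF file name is hypothetical); per the reference above, only the first four sections are limited to columns 1 through 72, so in practice only those lines need checking:
# flag lines that exceed the 72-character limit
with open('SUB1_PLATE.bdf') as bdf:  # hypothetical file name
    for lineno, line in enumerate(bdf, start=1):
        text = line.rstrip('\n')
        if len(text) > 72:
            print(f"line {lineno}: {len(text)} chars, ignored tail: {text[72:]!r}")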

Why does my PySpark regular expression not give more than the first row?

Taking inspiration from this answer: https://stackoverflow.com/a/61444594/4367851 I have been able to split my .txt file into columns in a Spark DataFrame. However, it only gives me the first game - even though the sample .txt file contains many more.
My code:
basefile = spark.sparkContext.wholeTextFiles("example copy 2.txt").toDF().\
selectExpr("""split(replace(regexp_replace(_2, '\\\\n', ','), ""),",") as new""").\
withColumn("Event", col("new")[0]).\
withColumn("White", col("new")[2]).\
withColumn("Black", col("new")[3]).\
withColumn("Result", col("new")[4]).\
withColumn("UTCDate", col("new")[5]).\
withColumn("UTCTime", col("new")[6]).\
withColumn("WhiteElo", col("new")[7]).\
withColumn("BlackElo", col("new")[8]).\
withColumn("WhiteRatingDiff", col("new")[9]).\
withColumn("BlackRatingDiff", col("new")[10]).\
withColumn("ECO", col("new")[11]).\
withColumn("Opening", col("new")[12]).\
withColumn("TimeControl", col("new")[13]).\
withColumn("Termination", col("new")[14]).\
drop("new")
basefile.show()
Output:
+--------------------+---------------+-----------------+--------------+--------------------+--------------------+-----------------+-----------------+--------------------+--------------------+-----------+--------------------+--------------------+--------------------+
| Event| White| Black| Result| UTCDate| UTCTime| WhiteElo| BlackElo| WhiteRatingDiff| BlackRatingDiff| ECO| Opening| TimeControl| Termination|
+--------------------+---------------+-----------------+--------------+--------------------+--------------------+-----------------+-----------------+--------------------+--------------------+-----------+--------------------+--------------------+--------------------+
|[Event "Rated Cla...|[White "BFG9k"]|[Black "mamalak"]|[Result "1-0"]|[UTCDate "2012.12...|[UTCTime "23:01:03"]|[WhiteElo "1639"]|[BlackElo "1403"]|[WhiteRatingDiff ...|[BlackRatingDiff ...|[ECO "C00"]|[Opening "French ...|[TimeControl "600...|[Termination "Nor...|
+--------------------+---------------+-----------------+--------------+--------------------+--------------------+-----------------+-----------------+--------------------+--------------------+-----------+--------------------+--------------------+--------------------+
Input file:
[Event "Rated Classical game"]
[Site "https://lichess.org/j1dkb5dw"]
[White "BFG9k"]
[Black "mamalak"]
[Result "1-0"]
[UTCDate "2012.12.31"]
[UTCTime "23:01:03"]
[WhiteElo "1639"]
[BlackElo "1403"]
[WhiteRatingDiff "+5"]
[BlackRatingDiff "-8"]
[ECO "C00"]
[Opening "French Defense: Normal Variation"]
[TimeControl "600+8"]
[Termination "Normal"]
1. e4 e6 2. d4 b6 3. a3 Bb7 4. Nc3 Nh6 5. Bxh6 gxh6 6. Be2 Qg5 7. Bg4 h5 8. Nf3 Qg6 9. Nh4 Qg5 10. Bxh5 Qxh4 11. Qf3 Kd8 12. Qxf7 Nc6 13. Qe8# 1-0
[Event "Rated Classical game"]
.
.
.
Each game starts with [Event, so I feel like it should be doable since the file has a repeating structure; alas, I can't get it to work.
Extra points:
I don't actually need the move list, so if it's easier it can be deleted.
I only want the content of what is inside the " " for each new line once it has been converted to a Spark DataFrame.
Many thanks.
wholeTextFiles reads each file into a single record. If you read only one file, the result will be an RDD with only one row, containing the whole text file. The regexp logic in the question returns only one result per row, and this will be the first entry in the file.
Probably the best solution would be to split the file at the OS level into one file per game, so that Spark can read the games in parallel (a minimal sketch of such a split is shown after the output below). But if a single file is not too big, splitting the games can also be done within PySpark:
Read the file(s):
basefile = spark.sparkContext.wholeTextFiles(<....>).toDF()
Create a list of columns and convert this list into a list of column expressions using regexp_extract:
from pyspark.sql import functions as F
cols = ['Event', 'White', 'Black', 'Result', 'UTCDate', 'UTCTime', 'WhiteElo', 'BlackElo', 'WhiteRatingDiff', 'BlackRatingDiff', 'ECO', 'Opening', 'TimeControl', 'Termination']
cols = [F.regexp_extract('game', rf'{col} \"(.*)\"',1).alias(col) for col in cols]
Extract the data:
split the whole file into an array of games
explode this array into single records
delete the line breaks within each record so that the regular expression works
use the column expressions defined above to extract the data
basefile.selectExpr("split(_2,'\\\\[Event ') as game") \
.selectExpr("explode(game) as game") \
.withColumn("game", F.expr("concat('Event ', replace(game, '\\\\n', ''))")) \
.select(cols) \
.show(truncate=False)
Output (for an input file containing three copies of the game):
+---------------------+-----+-------+------+----------+--------+--------+--------+---------------+---------------+---+--------------------------------+-----------+-----------+
|Event |White|Black |Result|UTCDate |UTCTime |WhiteElo|BlackElo|WhiteRatingDiff|BlackRatingDiff|ECO|Opening |TimeControl|Termination|
+---------------------+-----+-------+------+----------+--------+--------+--------+---------------+---------------+---+--------------------------------+-----------+-----------+
|Rated Classical game |BFG9k|mamalak|1-0 |2012.12.31|23:01:03|1639 |1403 |+5 |-8 |C00|French Defense: Normal Variation|600+8 |Normal |
|Rated Classical game2|BFG9k|mamalak|1-0 |2012.12.31|23:01:03|1639 |1403 |+5 |-8 |C00|French Defense: Normal Variation|600+8 |Normal |
|Rated Classical game3|BFG9k|mamalak|1-0 |2012.12.31|23:01:03|1639 |1403 |+5 |-8 |C00|French Defense: Normal Variation|600+8 |Normal |
+---------------------+-----+-------+------+----------+--------+--------+--------+---------------+---------------+---+--------------------------------+-----------+-----------+
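For completeness, the OS-level split mentioned above could look like the following minimal Python sketch (file names are hypothetical):
# split a multi-game file into one file per game;
# every game starts with "[Event "
with open('games.txt') as src:
    text = src.read()

games = ['[Event ' + chunk for chunk in text.split('[Event ') if chunk.strip()]

for i, game in enumerate(games):
    with open(f'game_{i:05d}.txt', 'w') as out:
        out.write(game)
Spark can then read the resulting game_*.txt files in parallel with wholeTextFiles.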

Failing to use sumproduct on date ranges with multiple conditions [Python]

From the replacement data table (below in the image), I am trying to incorporate the solbox product replacements into the time-series data format (above in the image). I need to extract the number of consumers per day from this information.
What I need to find out:
On a specific date, how many solbox products were active
On a specific date, how many solbox products (belonging to consumers) were active
I have used this line of code in Excel but cannot implement it properly in Python.
=SUMPRODUCT((Record_Solbox_Replacement!$O$2:$O$1367 = "consumer") * (A475>=Record_Solbox_Replacement!$L$2:$L$1367)*(A475<Record_Solbox_Replacement!$M$2:$M$1367))
I tried in python -
timebase_df['date'] = pd.date_range(start = replace_table_df['solbox_started'].min(), end = replace_table_df['solbox_started'].max(), freq = frequency)
timebase_df['date_unix'] = timebase_df['date'].astype(np.int64) // 10**9
timebase_df['no_of_solboxes'] = (
    (timebase_df['date_unix'] >= replace_table_df['started'].to_numpy())
    & (timebase_df['date_unix'] < replace_table_df['ended'].to_numpy())
    & (replace_table_df['customer_type'] == 'customer')
)
ERROR:
~\Anaconda3\Anaconda4\lib\site-packages\pandas\core\ops\array_ops.py in comparison_op(left, right, op)
232 # The ambiguous case is object-dtype. See GH#27803
233 if len(lvalues) != len(rvalues):
--> 234 raise ValueError("Lengths must match to compare")
235
236 if should_extension_dispatch(lvalues, rvalues):
ValueError: Lengths must match to compare
Can someone help me, please? I can explain in the comment section if I have missed something.
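For what it's worth, SUMPRODUCT compares one date against every row of the replacement table, which in NumPy terms is a broadcast comparison rather than the element-wise comparison attempted above (hence the length mismatch). A minimal sketch, assuming the column names from the question:
import numpy as np

# column names taken from the question; treat them as assumptions
is_consumer = (replace_table_df['customer_type'] == 'consumer').to_numpy()
starts = replace_table_df['started'].to_numpy()
ends = replace_table_df['ended'].to_numpy()
dates = timebase_df['date_unix'].to_numpy()

# dates[:, None] has shape (n_dates, 1) and broadcasts against every
# replacement interval, mirroring SUMPRODUCT's row-wise products
active = (dates[:, None] >= starts) & (dates[:, None] < ends) & is_consumer
timebase_df['no_of_solboxes'] = active.sum(axis=1)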

Calculating number of times each IAT was called from PE

I am trying to count the number of IAT (import address table) entries for each DLL imported by a PE. My code looks like this:
counter = 0
for entry in file.DIRECTORY_ENTRY_IMPORT:
    print(entry.dll)
    for imp in entry.imports:
        print('\t', hex(imp.address), imp.name)
        counter = counter + 1
    print(entry.dll, ":", counter)
The output is somewhat like this:
b'KERNEL32.dll'
0x180006000 b'GetProcAddress'
0x180006008 b'LoadLibraryA'
0x180006010 b'IsProcessorFeaturePresent'
0x180006018 b'GetStartupInfoW'
0x180006020 b'SetUnhandledExceptionFilter'
0x180006028 b'UnhandledExceptionFilter'
0x180006030 b'IsDebuggerPresent'
0x180006038 b'RtlVirtualUnwind'
0x180006040 b'RtlLookupFunctionEntry'
0x180006048 b'RtlCaptureContext'
0x180006050 b'InitializeSListHead'
0x180006058 b'DisableThreadLibraryCalls'
0x180006060 b'GetSystemTimeAsFileTime'
0x180006068 b'GetCurrentThreadId'
0x180006070 b'GetCurrentProcessId'
0x180006078 b'QueryPerformanceCounter'
0x180006080 b'GetModuleHandleW'
b'KERNEL32.dll' : 17
b'MSVCP140.dll'
0x180006090 b'?_Xout_of_range#std##YAXPEBD#Z'
0x180006098 b'?_Xlength_error#std##YAXPEBD#Z'
0x1800060a0 b'?_Xbad_alloc#std##YAXXZ'
b'MSVCP140.dll' : 20
b'VCRUNTIME140.dll'
0x1800060b0 b'_purecall'
0x1800060b8 b'__std_terminate'
0x1800060c0 b'memmove'
0x1800060c8 b'_CxxThrowException'
0x1800060d0 b'__std_type_info_destroy_list'
0x1800060d8 b'__RTDynamicCast'
0x1800060e0 b'memcpy'
0x1800060e8 b'__C_specific_handler'
0x1800060f0 b'__std_exception_copy'
0x1800060f8 b'__std_exception_destroy'
0x180006100 b'__CxxFrameHandler3'
0x180006108 b'memset'
b'VCRUNTIME140.dll' : 32
But it should count each DLL's entries individually. For example, MSVCP140.dll should be counted as '3', not '20'. Any help would be gladly appreciated.
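The counts are cumulative because counter is never reset between DLLs. A minimal sketch of the likely fix, moving the reset inside the outer loop (same pefile usage as in the question):
for entry in file.DIRECTORY_ENTRY_IMPORT:
    counter = 0  # reset per DLL so counts are not cumulative
    print(entry.dll)
    for imp in entry.imports:
        print('\t', hex(imp.address), imp.name)
        counter = counter + 1
    print(entry.dll, ":", counter)
Since entry.imports is a list, len(entry.imports) should give the same number without manual counting.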

svm train output file has less lines than that of the input file

I am currently building a binary classification model and have created an input file for svm-train (svm_input.txt). This input file has 453 lines, 4 features and 2 classes [0, 1],
i.e.
0 1:15.0 2:40.0 3:30.0 4:15.0
1 1:22.73 2:40.91 3:36.36 4:0.0
1 1:31.82 2:27.27 3:22.73 4:18.18
0 1:22.73 2:13.64 3:36.36 4:27.27
1 1:30.43 2:39.13 3:13.04 4:17.39 ......................
My problem is that when I count the number of lines in the output model generated by svm-train (svm_train_model.txt), it has 12 fewer data lines than the input file. The line count shows 450, which includes the 9 lines at the beginning listing the various generated parameters,
i.e.
svm_type c_svc
kernel_type rbf
gamma 1
nr_class 2
total_sv 441
rho -0.156449
label 0 1
nr_sv 228 213
SV
Therefore 450 - 9 = 441 data lines remain, so 12 lines from the original input of 453 are gone. I am new to SVM and was hoping that someone could shed some light on why this might have happened?
Thanks in advance
Update:
I now believe that, in generating the model, svm-train removed lines where the label and all feature values are exactly the same.
To explain: my input is a set of miRNAs which have been classified as 1 or 0 depending on their involvement in a particular process (i.e. 1 = Yes & 0 = No). The input file looks something like:
0 1:22 2:30 3:14 4:16
1 1:26 2:15 3:17 4:25
0 1:22 2:30 3:14 4:16
Here, lines one and three are exactly the same and, as a result, one will be removed from the output model. My question is then both why the output model would do this and how I can get around it (whilst using the same features)?
Whilst some of the labels and their corresponding feature values are identical within the input file, these are still different miRNAs.
NOTE: The input file does not have a feature for the miRNA name (which would clearly show the difference between the lines). In terms of the features used (nucleotide percentage content), some of the miRNAs have exactly the same percentage content of A, U, G & C, so they are viewed as duplicates and removed from the output model, even though they are not (hence the output model has fewer lines).
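A quick way to verify how many such apparent duplicates exist, as a minimal sketch (assuming the input file is named svm_input.txt):
from collections import Counter

# count identical "label index:value ..." lines in the libsvm input file
with open('svm_input.txt') as fh:
    counts = Counter(line.strip() for line in fh if line.strip())

surplus = sum(n - 1 for n in counts.values() if n > 1)
print(surplus, 'lines are exact duplicates of an earlier line')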
The format of the input file is:
Column 0 - label (i.e 1 or 0): 1=Yes & 0=No
Column 1 - Feature 1 = Percentage Content "A"
Column 2 - Feature 2 = Percentage Content "U"
Column 3 - Feature 3 = Percentage Content "G"
Column 4 - Feature 4 = Percentage Content "C"
The input file actually looks something like this (see the very first two lines below: they appear identical, yet each line represents a different miRNA):
1 1:23 2:36 3:23 4:18
1 1:23 2:36 3:23 4:18
0 1:36 2:32 3:5 4:27
1 1:14 2:41 3:36 4:9
1 1:18 2:50 3:18 4:14
0 1:36 2:23 3:23 4:18
0 1:15 2:40 3:30 4:15
In terms of software, I am using libsvm-3.22 and Python 2.7.5.
My first observation: align your input file properly. The libsvm code does not look for exactly 4 features; it identifies them by the string literals you provide separating the feature values from the labels. I suggest manually converting your input file to create the desired input format.
Try the following Python code.
Requirement: h5py, if your input comes from MATLAB (.mat files):
pip install h5py
import h5py
import numpy as np

# read the training labels (give the label .mat file for training)
f = h5py.File('traininglabel.mat', 'r')
variables = f.items()
labels = []
c = []
for var in variables:
    data = var[1]
    lables = (data.value[0])
    trainlabels = []
    for i in lables:
        trainlabels.append(str(i))

# normalise the label strings '0.0'/'1.0' to '0'/'1'
finaltrain = []
trainlabels = np.array(trainlabels)
for i in range(0, len(trainlabels)):
    if trainlabels[i] == '0.0':
        trainlabels[i] = '0'
    if trainlabels[i] == '1.0':
        trainlabels[i] = '1'
    print trainlabels[i]

# read the training features (give the features .mat file here)
f = h5py.File('training_features.mat', 'r')
variables = f.items()
lables = []
file = open('traindata.txt', 'w+')
for var in variables:
    data = var[1]
    lables = data.value

# write one "label feature1 feature2 ..." line per training sample
for i in range(0, 1000):  # number of training samples in features.mat
    file.write(str(trainlabels[i]))
    file.write(' ')
    for j in range(0, 49):
        file.write(str(lables[j][i]))
        file.write(' ')
    file.write('\n')
