Better splitting of multiallelic sites than bcftools norm -m-any

I am trying to split the multiallelic sites of my VCF. I used bcftools norm -m-any; however, the result does not look right to me. Here's an example.
Let's say, I have this multiallelic site:
REF ALT GT1 GT2 GT3
A C,G 1/2 0/2 0/1
After splitting I get these two:
REF ALT GT1 GT2 GT3
A C 1/0 0/0 0/1
A G 0/1 0/1 0/0
So the genotype for the "unused" ALT allele in a given row is simply set to REF. Is there a way to change this behavior? It doesn't seem reasonable, at least for my analysis. I would like my result to look more like this:
REF ALT GT1 GT2 GT3
A C 1/. 0/. 0/1
A G ./1 0/1 0/.
or:
REF ALT GT1 GT2 GT3
A C ./. ./. 0/1
A G ./. 0/1 ./.
Or something similar. At the very least, I don't want to see REF where there was an ALT allele before.

Have you tried bcftools norm -a?
You can also check the --atom-overlaps option: 'Alleles missing because of an overlapping variant can be set either to missing (.) or to the star allele (*), as recommended by the VCF specification.'
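For example, a sketch of the kind of invocation meant here, assuming a bcftools version that has -a/--atomize and --atom-overlaps (1.11 or later; input.vcf and output.vcf are placeholder names):
bcftools norm -m-any -a --atom-overlaps . input.vcf -o output.vcf
With --atom-overlaps ., alleles that go missing during atomization are written as missing (.) rather than as REF.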

How to get delta percentage from /proc/schedstat

I am trying to get the node's CFS scheduler throttling as a percentage. For that I read two values twice (ignoring timeslices) from /proc/schedstat, which has the following format:
$ cat /proc/schedstat
version 15
timestamp 4297299139
cpu0 0 0 0 0 0 0 1145287047860 105917480368 8608857
(the 7th and 8th numeric fields are CpuTime and RunqTime, respectively)
So I read the file, sleep for some time, read it again, compute the elapsed time and the delta between the two readings, and then calculate the percentage with the following code:
cpuTime := float64(delta.CpuTime) / delta.TimeDelta / 10000000  // cumulative ns over seconds, scaled to percent
runqTime := float64(delta.RunqTime) / delta.TimeDelta / 10000000
percent := runqTime
The trick is that percent can come out at something like 2000%.
I assumed that RunqTime is cumulative and expressed in nanoseconds, so I divided it by 10^7 (to map it into a 0-100% range), and TimeDelta is the difference between measurements in seconds. What is wrong with this? How do I do it properly?
I, for one, do not know how to interpret the output of /proc/schedstat.
You do quote an answer to a unix.stackexchange question, with a link to a mail on LKML that mentions a possible patch to the documentation.
However, "schedstat" is a term which is suspiciously missing from my local man proc page, and from the copies of man proc I could find on the internet. Actually, when searching for schedstat on Google, the results I get either do not mention the word "schedstat" at all (for example: links to copies of the man page, which mention "sched" and "stat" separately), or are non-authoritative comments (fun fact: some of them cite that very stackexchange answer as a reference ...).
So at the moment: if I really had to understand what's in the output, I think I would try to read the code for my version of the kernel.
As for "how do you compute delta?": I understand what you intend to do; what I had in mind was more like "what code have you written to do it?".
By running cat /proc/schedstat; sleep 1 in a loop on my machine, I see that the "timestamp" entry is incremented by ~250 units on each iteration (so I honestly can't say what the underlying unit for that field is ...).
To compute delta.TimeDelta: do you use that field, or do you take two instances of time.Now()?
The other deltas are less ambiguous; I do imagine you took the difference between the counters you see :)
Do note that, on my mostly idle machine, I sometimes see increments higher than 10^9 over a second on these counters. So again: I do not know how to interpret these numbers.
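In case it helps, here is a minimal Go sketch of the measurement loop I would start from. It assumes (and this is exactly the assumption in question) that the 7th and 8th numeric fields of the cpu0 line are cumulative run time and runqueue wait time in nanoseconds, and it uses time.Now() rather than the "timestamp" field for the wall-clock delta:

package main

import (
	"fmt"
	"log"
	"os"
	"strconv"
	"strings"
	"time"
)

// readCPU0 returns the 7th and 8th numeric fields of the cpu0 line,
// assumed here to be cumulative run time and runqueue wait time, in ns.
func readCPU0() (run, wait uint64, err error) {
	data, err := os.ReadFile("/proc/schedstat")
	if err != nil {
		return 0, 0, err
	}
	for _, line := range strings.Split(string(data), "\n") {
		f := strings.Fields(line)
		if len(f) >= 9 && f[0] == "cpu0" {
			run, _ = strconv.ParseUint(f[7], 10, 64)
			wait, _ = strconv.ParseUint(f[8], 10, 64)
			return run, wait, nil
		}
	}
	return 0, 0, fmt.Errorf("cpu0 line not found")
}

func main() {
	run1, wait1, err := readCPU0()
	if err != nil {
		log.Fatal(err)
	}
	start := time.Now()
	time.Sleep(time.Second)
	run2, wait2, err := readCPU0()
	if err != nil {
		log.Fatal(err)
	}
	elapsed := float64(time.Since(start).Nanoseconds())
	// ns of run/wait time per ns of wall clock, scaled to percent
	fmt.Printf("run %.1f%%  wait %.1f%%\n",
		float64(run2-run1)/elapsed*100,
		float64(wait2-wait1)/elapsed*100)
}

If this still prints values far above 100%, that would at least confirm the fields don't mean what that interpretation says.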

Top command: How to stick to one unit (KB/KiB)

I'm using the top command in several distros to feed a Bash script. Currently I'm calling it with top -b -n1.
I'd prefer a unified output in KiB or KB. However, it will display large units in megabytes or gigabytes. Is there an option to avoid these large units?
Please consider the following example:
4911 root 20 0 274m 248m 146m S 0 12.4 0:07.19 example
Edit: To answer 123's question, I transform the columns and send them to a log monitoring appliance. If there's no alternative, I'll convert the units via awk beforehand as per this thread.
Consider cutting out the middleman top and reading directly from /proc/[1-9]*/statm. All those files consist of one line of numbers, of which the first three correspond to top's VIRT, RES and SHR, respectively, in units of pages (normally 4096 B), so multiplying by 4 gives you KiB.
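For instance, a quick sketch with awk (assuming the usual 4 KiB page size, which getconf PAGESIZE can confirm; 1234 is a placeholder PID):
awk '{ printf "VIRT=%d KiB RES=%d KiB SHR=%d KiB\n", $1*4, $2*4, $3*4 }' /proc/1234/statm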
You need a config file. You can create it yourself as $HOME/.toprc or have top write it for you. The latter is easy: just press W while top is running in interactive mode.
But first you need to put top into the state you want. To change the memory scale, press e until you see the unit you want. (Then save with W.)
Either way, you need this setting in your config: Task_mscale=0 for the lowest scale.
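As a quick check (assuming your top writes its rcfile to $HOME/.toprc, as above):
grep Task_mscale "$HOME/.toprc"
If it prints Task_mscale=0, subsequent batch runs like top -b -n1 should stick to the lowest unit.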

What is the behavior of the carry flag for CP on a Game Boy?

On page 87 of the Game Boy CPU Manual it is claimed that the CP n instruction sets the carry flag when there was no borrow, and that this means A < n. That seems self-contradictory, because with a "no borrow" rule the carry flag would be set when A > n.
An example: if A=0 and B=1, CP B sets the flags like SUB A,B, which is 0 - 1. As an addition this becomes 0 + 255 = 255, and the carry flag is not set, even though A < B.
I came across this same issue in other Z80 documents as well, so I don't believe this is a typo.
Am I misunderstanding how borrow and SUB work or is there something else going on? Is SUB not equal to ADD with two's complement in terms of flags?
The Game Boy CPU manual has it backwards: SUB, SBC and CP all set the carry flag when there is a borrow. If SUB/SBC/CP A,n is executed, carry is set if n > A; otherwise it is clear.
This is consistent with Z80 (and 8080) operation. MAME and MESS implement carry the same way as well.
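To put the corrected rule in code, here is an illustrative Go sketch (not taken from any particular emulator; the half-carry flag is omitted for brevity):

package main

import "fmt"

// cpFlags returns the Z and C flags that CP n would produce on the
// Game Boy: the subtraction wraps modulo 256, and carry means borrow.
func cpFlags(a, n uint8) (zero, carry bool) {
	return a-n == 0, n > a
}

func main() {
	z, c := cpFlags(0, 1)                  // the A=0, B=1 example from the question
	fmt.Printf("zero=%v carry=%v\n", z, c) // prints zero=false carry=true
}

So in the A=0, B=1 example, CP B does set the carry flag, because A < B means the subtraction borrows.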

How can I Remove a Wandering DC Offset from an Audio Clip?

I've licensed some audio clips, but some of them come with what I have learned is a "DC Offset" that should normally have been removed during production.
Audacity's "Normalize" filter is able to fix a static DC Offset, but after applying it to my audio clips, I noticed that their DC offset varies (within 0.5 seconds it could go from 0.05 to 0.03 along a normalized amplitude range). For example:
To the left, silence is at 0.02, to the right, it's at 0.00 - this is after normalization by Audacity.
With me not being an audio engineer and not having any professional tools, is there a way to fix this?
A DC offset is a frequency component at 0 Hz. The "wandering DC offset" will be made of very low frequency components, so you should be able to remove this by using a high-pass filter with a cutoff of around 15 Hz. That way, you'll remove any sub-sonic DC related stuff without altering the audible frequency range.
Use a filter with a steep rolloff. Seeing as you're doing this offline, you can use a simple IIR type and filter the signal in both forward and reverse directions to remove any phase distortion that would otherwise be imposed by the filtering.
If you use MATLAB, the operation would look something like this:
[x, fs] = wavread('myfile.wav');        % audioread in newer MATLAB releases
[b, a] = butter(8, 15/(fs/2), 'high');  % 8th-order Butterworth high-pass at 15 Hz
y = filtfilt(b, a, x);                  % zero-phase: filters forward and reverse
From the command line, you can try sox.
sox fileIn.wav fileOut.wav highpass 10
This applies a high-pass filter at a frequency of 10 Hz.
It should remove the DC offset (though perhaps not at the very beginning of the file).
See the sox manual for a little more information (though not much).
As @learnvst explains in his answer, what looks like a "wandering DC offset" is actually just content at very low frequencies. You can remove this LF content with a high-pass filter. Since frequencies below 20 Hz are generally inaudible, you should be able to take out the "wandering DC" without audibly changing the file.
The latest version of Audacity (2.0.5) includes a high-pass filter. Select Effect > High Pass Filter... and adjust the cutoff frequency and rolloff parameters. A cutoff of around 15 Hz and a rolloff of 6 dB/oct should do the trick.
# For each wav: measure the overall DC offset with ffprobe's astats filter,
# then apply the opposite shift with ffmpeg's dcshift filter.
# Note: this rewrites the originals in place, via a scratch copy in /tmp.
for f in *.wav; do
mv "$f" /tmp/dc1.wav
# Keep only the "Overall" section of the astats output, then its DC offset line
dc=$(ffprobe -f lavfi "amovie=/tmp/dc1.wav,astats=metadata=1" 2>&1 | sed '/Overall/,$!d' | grep 'DC offset')
dc=$(echo "$dc" | awk '{ print $6 }')  # 6th field is the numeric value
dc=$(echo "$dc * -1" | bc)             # negate it to cancel the offset
echo "dcshift" "$dc"
ffmpeg -hide_banner -loglevel error -y -i "/tmp/dc1.wav" -af "dcshift=$dc:limitergain=0.02" "$f"
done

HaarTraining with OpenCV error

I have about 15000 cropped images containing the object of interest (positive samples) and 7000 negative images (no object of interest). The cropped images have a resolution of 48x96 and are placed in a folder. The .txt file listing the positive samples looks something like this: picture1.pgm 1 0 0 48 96, meaning that there is 1 positive sample in picture1 from (0,0) to (48,96). Likewise I have a .txt file for the negative images.
The command for training is the following:
c:\libraries\OpenCV2.4.1\opencv\built\bin\Debug>opencv_haartrainingd.exe -data data/cascade -vec data/positives.vec -bg c:/users/gheorghi/desktop/daimler/pedestrian_stereo_extracted/nonpedestrian/nonpedestrian/c0/negatives.txt -npos 15660 -nneg 7129 -nstage 14 -mem 1000 -mode ALL -w 18 -h 36 -nonsym
But at some point I always get this error :
Parent node: 0
*** 1 cluster ***
OpenCV Error: Assertion failed (elements_read == 1) in unknown function, file C:\libraries\OpenCV2.4.1\opencv\apps\haartraining\cvhaartraining.cpp, line 1858
How can I overcome this? Any help is appreciated. Many thanks.
I found that the problem can be solved in two ways: you can either decrease the number of positives or increase the number of negatives. Either way, it turns out that a smaller positive-to-negative ratio helps.
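For example, a sketch of what that might look like with the command from the question (the exact margin needed varies; -npos is simply lowered well below the 15660 samples actually in the .vec file, leaving headroom for the positives each stage consumes):

opencv_haartrainingd.exe -data data/cascade -vec data/positives.vec -bg negatives.txt -npos 14000 -nneg 7129 -nstage 14 -mem 1000 -mode ALL -w 18 -h 36 -nonsym

The elements_read == 1 assertion typically fires when the trainer runs out of positive samples to read from the .vec file partway through a stage.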
I answered the question here.
It may be of some help.
The same issue was posted by many others; I used the advice given here.
