Lifting over a GWAS summary statistic file from build 38 to build 37 - linux

I am using the UCSC liftOver tool and the associated chain file to lift over my GWAS summary statistics (a tab-separated file) from build 38 to build 37. The GWAS summary stat file looks like:
1 chr1_17626_G_A 17626 A G 0.016 -0.0332 0.0237 0.161
1 chr_20184_G_A 20184 A G 0.113 -0.185 0.023 0.419
Following are the UCSC tool and the associated chain file I am using:
liftover: http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/liftOver
chain file: ftp://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToHg19.over.chain.gz
I want to create a file in BED format from the GWAS summary stat file, which is the input required by the tool. I would like the first three columns to be tab separated, and the rest of the columns to be merged into a single column separated by a non-tab separator such as ":", so as to preserve them while running the liftOver. The first three columns of the input BED file would be:
awk 'BEGIN{OFS="\t"} {print "chr"$1, $3-1, $3}' gwas_summary_file > ucsc.input.file
#$1 = chrX, where X is the chromosome number
#$2 = bp position - 1 for SNPs
#$3 = bp position (hg38) for SNPs
The above three are the required columns for the tool.
My questions are:
How can I use a non-tab separator, say ":", to merge the rest of the columns of the GWAS summary stat file into one column?
After running the liftOver, how can I unpack the columns separated by ":"?

I am not sure if this answers your questions but please take a look.
You can use awk to merge multiple columns with ":":
awk '{print $1 ":" $2 ":" $3}' file
and then, say you want to replace ":" with a tab in $1, you can do
awk '{gsub(/:/,"\t",$1)}1' file
Is this of any help?
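Putting the pieces together, here is one way the whole merge-then-unpack round trip could look. The file names and the liftOver invocation are illustrative assumptions; only the awk merge/unpack logic is exercised on sample data (the unpack is applied to the pre-liftOver file just for demonstration).

```shell
# Sample GWAS summary stat line (tab-separated), mimicking the format above
printf '1\tchr1_17626_G_A\t17626\tA\tG\t0.016\t-0.0332\t0.0237\t0.161\n' > gwas.txt

# Merge: three tab-separated BED columns, plus all original columns
# joined with ":" into a 4th column so nothing is lost during liftOver
awk 'BEGIN{FS=OFS="\t"}
{
    merged = $1
    for (i = 2; i <= NF; i++) merged = merged ":" $i
    print "chr"$1, $3 - 1, $3, merged
}' gwas.txt > ucsc.input.bed

# Hypothetical liftOver step (binary and chain file assumed to be present):
# ./liftOver ucsc.input.bed hg38ToHg19.over.chain.gz lifted.bed unmapped.bed

# Unpack: turn the ":"-separated 4th column back into tab-separated columns
awk 'BEGIN{FS=OFS="\t"}{gsub(/:/, "\t", $4)}1' ucsc.input.bed
```

Note this relies on none of the original fields containing ":" themselves; any character absent from the data would work as the separator.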

Related

Linux filtering a file by two columns and printing the output

I have a table that has 9 columns as shown below.
How would I first filter on the strand column so that only rows with a "+" are selected, and then, of those, select the ones that have 3 exons (in the exon count column)?
I have been trying to use grep for this, as I understand I can pick out a word from a column, but I only get the particular column or just the total number of matches.
Using awk:
awk -F "," ' $4=="+" && $9=="3" ' file.csv
If it's not CSV, remove -F "," from this command.
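A quick demonstration on made-up data, assuming (as the question suggests) a 9-column CSV with the strand in column 4 and the exon count in column 9:

```shell
# Hypothetical 9-column CSV; column 4 = strand, column 9 = exon count
cat > file.csv <<'EOF'
geneA,chr1,100,+,200,x,y,z,3
geneB,chr1,300,-,400,x,y,z,3
geneC,chr2,500,+,600,x,y,z,2
geneD,chr2,700,+,800,x,y,z,3
EOF

# Keep rows on the "+" strand that have exactly 3 exons
awk -F "," '$4=="+" && $9=="3"' file.csv
```

Only the geneA and geneD rows survive both conditions.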

Extract specific columns from delimited file (long row to next line)

I want to extract 2 columns from a delimited file (delimiter '||') in unix. This can easily be done if the complete row is on one line, like below:
foo||bar||baz||quux
by
cut -d'||' -f1 file_name
but in my case, a single row's record in the file wraps onto the next line, for example:
foo||bar||baz||quux||foo||bar||baz||quux||foo||bar||baz||quux
||quux||bar||baz||quux||foo||bar||baz||quux||foo||bar||baz||quux
and the output from the above command is
foo
quux
but it should be just "foo", because that is what is in the first column.
The file contains, in row 1:
foo||bar||baz||quux||foo||bar||baz||quux||foo||bar||baz||quux
||quux||bar||baz||quux||foo||bar||baz||quux||foo||bar||baz||quux
and in row 2:
foo2||bar2||baz2||quux2||foo2||bar2||baz2||quux2||foo2||bar2||baz2||quux2
||quux2||bar2||baz2||quux2||foo2||bar2||baz2||quux2||foo2||bar2||baz2||quux2
The output should be:
foo
foo2
Almost, but the -d switch only takes one char:
cut -d'|' -f1 file_name
Output:
foo
foo2
Note: since the delimiters are doubled, the -f switch won't work as expected if the field number is greater than 1. One way to handle that is to adjust the field number to 2n-1. So to get field #3, use -f$(( (3*2) - 1 )).
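For example, on a single-line record (sample data assumed):

```shell
# One record with doubled '||' delimiters
printf 'foo||bar||baz||quux\n' > file_name

# Field 1 is unaffected by the doubled delimiter
cut -d'|' -f1 file_name

# Logical field 3 sits at physical field 2*3-1 = 5
cut -d'|' -f$(( (3*2) - 1 )) file_name
```

The first command prints "foo" and the second prints "baz", since splitting on a single '|' leaves an empty field between each pair of delimiters.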
Using awk: since it's the first field of every other record (NR%2), use:
$ awk -F\| 'NR%2{print $1}' file
foo
foo2
Data (four records):
$ cat file
foo||bar||baz||quux||foo||bar||baz||quux||foo||bar||baz||quux
||quux||bar||baz||quux||foo||bar||baz||quux||foo||bar||baz||quux
foo2||bar2||baz2||quux2||foo2||bar2||baz2||quux2||foo2||bar2||baz2||quux2
||quux2||bar2||baz2||quux2||foo2||bar2||baz2||quux2||foo2||bar2||baz2||quux2
An interesting phenomenon is that mawk accepts -F"\|\|" (dual pipes) as a delimiter but GNU awk doesn't.

Join the original sorted files, include 2 fields in one file and 1 field in 2nd file

I need help with a linux command.
I have 2 files, StockSort and SalesSort. They are sorted and they have 3 fields each. I know how to join on 1 field in the 1st file and 1 field in the 2nd file, but I can't get the right syntax for joining two fields from the 1st file and only 1 field from the second file. I also need to save the result in a new file.
So far I have this command, but it doesn't work. I think the mistake is in the "2,3" part, where I need to combine two fields from the 1st file.
join -1 2,3 -2 2 StockSort SalesSort >FinalReport
StockSort file
3976:diode:350
4105:resistor:750
4250:resistor:500
SalesSort file
3976:120:net
4105:250:chg
5500:100:pde
Output should be like this:
3976:350:120
4105:750:250
4250:500:100
You can try
join -t: -o 1.1,1.3,2.2 StockSort SalesSort > FinalReport
where
-t sets the column separator
-o is the output format (a comma-separated list of filenumber.fieldnumber)
Here is an awk solution:
$ awk 'BEGIN{ FS=OFS=":"}
FNR==NR {Stock[$1]=$3; next}
$1 in Stock{ print $1,Stock[$1],$2}' StockSort SalesSort
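The join command can be checked against the sample files from the question (file names as given there):

```shell
cat > StockSort <<'EOF'
3976:diode:350
4105:resistor:750
4250:resistor:500
EOF
cat > SalesSort <<'EOF'
3976:120:net
4105:250:chg
5500:100:pde
EOF

# Join on the default first field; emit file1.field1, file1.field3, file2.field2
join -t: -o 1.1,1.3,2.2 StockSort SalesSort
```

Note that only IDs present in both files appear (3976 and 4105), so the 4250 and 5500 rows are dropped; that differs slightly from the output sketched in the question, which pairs up unmatched keys.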

Linux - putting lines that contain a string at a specific column in a new file

I want to pull all rows from a text file in linux which contain a specific number (in this case 9913) in a specific column (column 4). This is a tab-delimited file, so I am calling this a column, though I am not sure that is the right term.
In some cases, there is only one number in column 4, but in other lines there are multiple numbers in this column (ex. 9913; 4444; 5555). I would like to get any rows for which the number 9913 appears in the 4th column (whether or not it is the only number or in a list). How do I put all lines which contain the number 9913 in column 4 and put them in their own file?
Here is an example of what I have tried:
cat file.txt | grep 9913 > newFile.txt
The result is a mixture of the following:
CDR1as CDR1as ENST00000003100.8 9913 AAA-GGCAGCAAGGGACUAAAA (line that I want)
CDR1as CDR1as ENST00000399139.1 9606 GUCCCCA................(example line I don't want)
I do not get any results when matching on a specific column. As a helper pointed out below, the code does not seem to recognize the columns, and I get blank files when using awk:
awk '$4 == "9913"' file.txt > newfile.txt
will give me no transfer of data to a new file.
Thanks
This is one way of doing it:
awk '$4 == "9913" {print $0}' file.txt > newfile.txt
or just
awk '$4 == "9913"' file.txt > newfile.txt
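If column 4 can also hold a list like "9913; 4444; 5555" (as the question mentions), the exact-equality test misses those rows. One alternative, sketched on made-up data and assuming entries are separated by "; ", is a regex that matches 9913 only as a whole entry within the field:

```shell
# Made-up tab-delimited sample; column 4 holds one or more IDs
printf 'a\tb\tc\t9913\tAAA\n'             >  file.txt
printf 'a\tb\tc\t9913; 4444; 5555\tBBB\n' >> file.txt
printf 'a\tb\tc\t9606\tCCC\n'             >> file.txt
printf 'a\tb\tc\t19913\tDDD\n'            >> file.txt

# Match 9913 at the start of the field or after "; ", ending at ";" or end of field
awk -F'\t' '$4 ~ /(^|; )9913(;|$)/' file.txt > newfile.txt
cat newfile.txt
```

The anchors keep partial matches like 19913 or 99130 from slipping through, which a plain grep 9913 would not.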

AWK compare two columns in two seperate files

I would like to compare two files and do something like this: if the 5th column in the first file is equal to the 5th column in the second file, print the whole line from the first file. Is that possible? I searched for the issue but was unable to find a solution :(
The files are separated by tabulators and I tried something like this:
zcat file1.txt.gz file2.txt.gz | awk -F'\t' 'NR==FNR{a[$5];next}$5 in a {print $0}'
Has anybody tried to do a similar thing? :)
Thanks in advance for help!
Your script is fine, but you need to provide each file individually to awk and in reverse order.
$ cat file1.txt
a b c d 100
x y z w 200
p q r s 300
1 2 3 4 400
$ cat file2.txt
. . . . 200
. . . . 400
$ awk 'NR==FNR{a[$5];next} $5 in a {print $0}' file2.txt file1.txt
x y z w 200
1 2 3 4 400
EDIT:
As pointed out in the comments, the generic solution above can be improved and tailored to OP's situation of starting with compressed tab-separated files:
$ awk -F'\t' 'NR==FNR{a[$5];next} $5 in a' <(zcat file2.txt.gz) <(zcat file1.txt.gz)
x y z w 200
1 2 3 4 400
Explanation:
NR is the number of the current record being processed, and FNR is the number of the current record within its file. Thus NR == FNR is only true while awk is processing the first file given to it (which in our case is file2.txt).
a[$5] adds the value of the 5th column as an index to the array a. Arrays in awk are associative arrays, but often you don't care about the associated value and just want to build a collection of keys. This is a pithy way to collect all the values seen in the 5th column of the first file. The next statement, which follows, says to immediately read the next available record without evaluating any more statements in the awk program.
Summarizing the above, this line says "If you're reading the first file (file2.txt), save the value of column 5 in the array called a and move on to the next record without continuing with the rest of the awk program."
NR == FNR { a[$5]; next }
Hopefully it's clear from the above that the only way we can get past that first line of the awk program is if we are reading the second file (file1.txt in our case).
$5 in a evaluates to true if the value of the 5th column occurs as an index in
the a array. In other words, it is true for every record in file1.txt whose 5th
column we saw as a value in the 5th column of file2.txt.
In awk, when the pattern portion evaluates to true, the accompanying action is
invoked. When there's no action given, as below, the default action is triggered
instead, which is to simply print the current record. Thus, by just saying
$5 in a, we are telling awk to print all the records in file1.txt whose 5th
column also occurs in file2.txt, which of course was the given requirement.
$5 in a
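The pitfall in the original command is easy to reproduce. When both files arrive on one pipe, FNR never resets, so NR == FNR holds for every line and nothing is ever printed; passing the files as separate arguments fixes it (sample data assumed):

```shell
printf 'a b c d 100\nx y z w 200\n' > f1.txt
printf '. . . . 200\n' > f2.txt

# Piped: awk sees one stream, NR==FNR is always true, so there is no output
cat f2.txt f1.txt | awk 'NR==FNR{a[$5];next} $5 in a'

# Separate arguments: FNR resets at f1.txt, and the match works
awk 'NR==FNR{a[$5];next} $5 in a' f2.txt f1.txt
```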
