Linux filtering a file by two columns and printing the output - linux

I have a table that has 9 columns as shown below.
How would I first filter on the strand column so that only rows with a "+" are selected, and then, of those, select the rows that have 3 exons (in the exon count column)?
I have been trying to use grep for this, since I understand it can pick out a word from a column, but I only get that particular column or just the total count.

Using awk (here $4 is taken to be the strand column and $9 the exon count column):
awk -F "," ' $4=="+" && $9=="3" ' file.csv
If the file is not comma-separated, remove -F "," from this command.
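If the column positions are uncertain, the columns can be looked up by header name instead. A minimal sketch, assuming a comma-separated file whose header row contains columns named `strand` and `exon_count`; both names, the file name `genes.csv`, and the sample rows are hypothetical, since the original table is not shown:

```shell
# Hypothetical sample data; the real table is not shown in the question.
cat > genes.csv <<'EOF'
name,chrom,start,strand,end,cds_start,cds_end,score,exon_count
geneA,chr1,100,+,900,150,850,0,3
geneB,chr1,200,-,950,250,900,0,3
geneC,chr2,300,+,800,350,750,0,5
geneD,chr2,400,+,990,450,950,0,3
EOF

# Map header names to column numbers on line 1, then filter the other lines.
filtered=$(awk -F, '
    NR == 1 {
        for (i = 1; i <= NF; i++) col[$i] = i
        print
        next
    }
    $col["strand"] == "+" && $col["exon_count"] == 3
' genes.csv)

printf '%s\n' "$filtered"
```

The header line is passed through unchanged, so the result can still be read as a table.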

Related

Shell | Sort Date and Month in Ascending order

I want to sort the file records in ascending order of month and date; where those values are equal, the records should be ordered by the very next column, also ascending.
Date & Month to sort: (current scenario)
ver.....03.02../ver>
ver.....19.01../ver>
ver.....02.02..ver>
File content:
ver>0.1.1-ABC-XYA-BR-03.02-v1.0-1-4d4f3dd/ver>
ver>0.1.1-XYZ-LOK-BR-19.01-v1.0-5-8a8d7dd/ver>
ver>0.1.1-DXD-UIJ-BR-02.02-v1.0-4-9o2k4wk/ver>
How can I achieve the following result?
ver>0.1.1-XYZ-LOK-BR-19.01-v1.0-5-8a8d7dd/ver>
ver>0.1.1-DXD-UIJ-BR-02.02-v1.0-4-9o2k4wk/ver>
ver>0.1.1-ABC-XYA-BR-03.02-v1.0-1-4d4f3dd/ver>
I tried using sort: (not working)
sort -n sortfile.txt
ver>0.1.1-DXD-UIJ-BR-02.02-v1.0-4-9o2k4wk/ver>
ver>0.1.1-ABC-XYA-BR-03.02-v1.0-1-4d4f3dd/ver>
ver>0.1.1-XYZ-LOK-BR-19.01-v1.0-5-8a8d7dd/ver>
You can use sort, but you will need to specify the field separator -t '-' so that fields are separated by '-'. Then specify the keydefs: sort numerically on the 5th field beginning with its 4th character, then on the 5th field beginning with its 1st character, and finally do a version sort on field 6 if all else is equal. That would be:
sort -t '-' -k5.4n -k5.1n -k6V contents
Providing full start and stop characters within each keydef can be done as:
sort -t '-' -k5.4n,5.5 -k5.1n,5.2 -k6V contents
(though for this data the output isn't changed)
Example Use/Output
$ sort -t '-' -k5.4n -k5.1n -k6V contents
ver>0.1.1-XYZ-LOK-BR-19.01-v1.0-5-8a8d7dd/ver>
ver>0.1.1-DXD-UIJ-BR-02.02-v1.0-4-9o2k4wk/ver>
ver>0.1.1-ABC-XYA-BR-03.02-v1.0-1-4d4f3dd/ver>
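When the key positions are awkward to express as character offsets, the same ordering can be sketched as a decorate-sort-undecorate pipeline: awk pulls the day and month out of the fifth `-`-separated field, sort orders on those two helper columns, and cut removes them again. The file name `contents` is taken from the answer above; the tab-separated helper prefix is an assumption of this sketch, not part of the original answer.

```shell
cat > contents <<'EOF'
ver>0.1.1-ABC-XYA-BR-03.02-v1.0-1-4d4f3dd/ver>
ver>0.1.1-XYZ-LOK-BR-19.01-v1.0-5-8a8d7dd/ver>
ver>0.1.1-DXD-UIJ-BR-02.02-v1.0-4-9o2k4wk/ver>
EOF

# Prefix each line with "month<TAB>day<TAB>", sort on those two numeric
# columns, then drop them again with cut.
sorted=$(awk -F- '{
        split($5, d, ".")            # d[1] = day, d[2] = month
        printf "%s\t%s\t%s\n", d[2], d[1], $0
    }' contents |
    sort -t "$(printf '\t')" -k1,1n -k2,2n |
    cut -f3-)

printf '%s\n' "$sorted"
```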

Uniqing a delimited file based on a subset of fields

I have data such as below:
1493992429103289,207.55,207.5
1493992429103559,207.55,207.5
1493992429104353,207.55,207.5
1493992429104491,207.6,207.55
1493992429110551,207.55,207.5
The values in the last two columns change throughout the day and are repeated regularly. By grouping them the way outlined in my desired output (below), I can see each time their values changed (with the epoch time in the first column). Is there a way to achieve the desired output shown below:
1493992429103289,207.55,207.5
1493992429104491,207.6,207.55
1493992429110551,207.55,207.5
So I consolidate the data by the second two columns. However, the consolidation is not completely unique (as can be seen by 207.55, 207.5 being repeated)
I have tried:
uniq -f 1
However, the output contains only the first line and does not continue through the list.
The awk solution below never prints a pair of values that has already occurred, so it gives the output shown below the awk code:
awk '!x[$2 $3]++'
1493992429103289,207.55,207.5
1493992429104491,207.6,207.55
I do not wish to sort the data by the second two columns. However, since the first is epoch time, it may be sorted by the first column.
uniq has no option to set the field delimiter; fields must be separated by blanks. With the help of tr you can convert the commas to spaces and back:
tr ',' ' ' <file | uniq -f1 | tr ' ' ','
1493992429103289,207.55,207.5
1493992429104491,207.6,207.55
1493992429110551,207.55,207.5
You can use an Awk statement as below,
awk 'BEGIN{FS=OFS=","} s != $2 || t != $3 {print} {s=$2;t=$3}' file
which produces the output you need.
1493992429103289,207.55,207.5
1493992429104491,207.6,207.55
1493992429110551,207.55,207.5
The idea is to store the second and third column values of the previous line in the variables s and t respectively, and to print the current line only if either value differs from the stored one.
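The same previous-line comparison can also be written with a single joined key, so there is only one test to get right. A sketch, assuming the sample rows from the question are saved in a file called `data.csv` (a hypothetical name):

```shell
cat > data.csv <<'EOF'
1493992429103289,207.55,207.5
1493992429103559,207.55,207.5
1493992429104353,207.55,207.5
1493992429104491,207.6,207.55
1493992429110551,207.55,207.5
EOF

# Print a row only when its (column 2, column 3) pair differs from the pair
# on the previous row; a pair that comes back later is printed again.
changes=$(awk -F, '{ key = $2 FS $3 } NR == 1 || key != prev { print } { prev = key }' data.csv)

printf '%s\n' "$changes"
```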
I found an answer which is not as elegant as Inian's but satisfies my purpose.
Since my first column is always epoch time in microseconds and never changes in length, I can use the following uniq command:
uniq -s 17
You can try to compare the current line with the previous line manually (with a loop).
prev_line=""
# start at the first line
i=1
# strip the first column, which is not part of the comparison
sed 's#^[0-9][0-9]*,##' ./data_file > ./transform_data_file
# make sure the list of lines to delete exists even if nothing repeats
: > ./line_to_be_suppress
# read the file without its first column, one line at a time
while IFS= read -r current_line
do
    # if the previous line is the same as the current line
    if [ "x$prev_line" = "x$current_line" ]
    then
        # record the line number so it can be deleted afterwards
        echo "$i" >> ./line_to_be_suppress
    fi
    # remember the current line as the previous line
    prev_line=$current_line
    # increment the current line number
    i=$(( i + 1 ))
done < ./transform_data_file
# delete the recorded lines, last first so earlier line numbers stay valid
for line_to_suppress in $(tac ./line_to_be_suppress) ; do sed -i "${line_to_suppress}d" ./data_file ; done
rm line_to_be_suppress
rm transform_data_file
Since your first field seems to have a fixed length of 16 digits, which is 17 characters including the , delimiter, you could use the -s option of uniq, which would be more optimal for larger files:
uniq -s 17 file
Gives this output:
1493992429103289,207.55,207.5
1493992429104491,207.6,207.55
1493992429110551,207.55,207.5
From man uniq:
-f num
Ignore the first num fields in each input line when doing comparisons.
A field is a string of non-blank characters separated from adjacent fields by blanks.
Field numbers are one based, i.e., the first field is field one.
-s chars
Ignore the first chars characters in each input line when doing comparisons.
If specified in conjunction with the -f option, the first chars characters after
the first num fields will be ignored. Character numbers are one based,
i.e., the first character is character one.
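Rather than counting characters by hand, the value for -s can be derived from the first line of the file. A sketch, assuming every row has a first field of the same width and that the rows are stored in a file named `data.csv` (hypothetical):

```shell
cat > data.csv <<'EOF'
1493992429103289,207.55,207.5
1493992429103559,207.55,207.5
1493992429104353,207.55,207.5
1493992429104491,207.6,207.55
1493992429110551,207.55,207.5
EOF

# Width of the first field on line 1, plus one for the comma that follows it.
first_field=$(head -n 1 data.csv | cut -d, -f1)
skip=$(( ${#first_field} + 1 ))

# Compare lines while ignoring the timestamp and its trailing comma.
deduped=$(uniq -s "$skip" data.csv)

printf 'skip=%s\n' "$skip"
printf '%s\n' "$deduped"
```

If the width of the first field can vary between rows, this shortcut no longer applies and a field-aware tool such as awk is the safer choice.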

Awk matching values of first two columns and printing in blank field

I have a csv file which looks like below:
2212,A1,
2212,A1,128
2307,B1,
2307,B1,107
How can I copy the value of the 3rd column into rows where the 3rd column is missing, if the values of the first 2 columns are the same? E.g. the first two columns of the first two rows are the same, so the value of the 3rd column of the second row should automatically be printed in the empty third column of the first row.
expected output:
2212,A1,128
2212,A1,128
2307,B1,107
2307,B1,107
Please help, as I couldn't even think of a solution, and there are millions of values like this in my file.
If you first sort the file in reverse order, the rows with data precede the empty rows:
$ sort -r file
2307,B1,107
2307,B1,
2212,A1,128
2212,A1,
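Then use the following awk to process the output of sort. It replaces a row with the previous row whenever the previous row starts with the current one:
$ sort -r file | awk 'NR>1 && index(prev,$0)==1 {$0=prev} {prev=$0} 1'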
2307,B1,107
2307,B1,107
2212,A1,128
2212,A1,128
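Alternatively, a single awk pass without sorting (the groups come out in arbitrary order):
awk -F, '{a[$1FS$2]++} $NF!="" {b[$1FS$2]=$NF} END{for (i in a) for (j=1;j<=a[i];j++) print i FS b[i]}' file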
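If the original row order has to be preserved, or sorting millions of rows is undesirable, the file can be read twice instead. A sketch of that two-pass approach, assuming each (column 1, column 2) pair has at most one distinct non-empty third value; the file name `pairs.csv` is hypothetical:

```shell
cat > pairs.csv <<'EOF'
2212,A1,
2212,A1,128
2307,B1,
2307,B1,107
EOF

# Pass 1 (NR == FNR): remember the non-empty third column for each key.
# Pass 2: print every row in its original order with the remembered value.
filled=$(awk -F, -v OFS=, '
    NR == FNR { if ($3 != "") val[$1 FS $2] = $3; next }
    { print $1, $2, val[$1 FS $2] }
' pairs.csv pairs.csv)

printf '%s\n' "$filled"
```

Only one value per distinct key is kept in memory, so the memory use depends on the number of unique keys rather than on the number of rows.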

Split and compare in awk

I want to split a field and compare values in a single awk command.
Input file (tab-delimited)
1 aaa 1|3
2 bbb 3|3
3 ccc 0|2
Filtration
First column value > 1
First value of the third column, split by "|", > 2
Process
Check whether the first column value is bigger than 1
Split the third column value by "|"
Check whether the first value of the third column is bigger than 2
Print the line only if both checks pass
Command line (example)
awk -F "\t" '{if($1>1 && ....?) print}' file
Output
2 bbb 3|3
Please let me know command line for above processing.
You can set the field separator to either tab or pipe and check the 1st and 3rd values:
awk -F'\t|\\|' '$1>1 && $3>2' file
or
awk -F"\t|\\\\|" '$1>1 && $3>2' file
You can read about all this character escaping in this comprehensive answer by Ed Morton in awk: fatal: Invalid regular expression when setting multiple field separators.
Otherwise, you can split the 3rd field and check the value of the first slice:
awk -F"\t" '{split($3,a,"|")} $1>1 && a[1]>2' file
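To see the split-based variant run end to end, here is a sketch that recreates the tab-delimited sample with printf and applies both conditions from the question. The file name `input.tsv` is hypothetical, and the `+ 0` is an extra precaution of this sketch that forces a numeric comparison.

```shell
# Recreate the three sample rows with real tab characters.
printf '1\taaa\t1|3\n2\tbbb\t3|3\n3\tccc\t0|2\n' > input.tsv

# split() with the one-character separator "|" treats it literally,
# so a[1] holds the value before the pipe.
matched=$(awk -F'\t' '{ split($3, a, "|") } $1 > 1 && a[1] + 0 > 2' input.tsv)

printf '%s\n' "$matched"
```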

Mapping lines to columns in *nix

I have a text file that was created when someone pasted from Excel into a text-only email message. There were originally five columns.
Column header 1
Column header 2
...
Column header 5
Row 1, column 1
Row 1, column 2
etc
Some of the data is single-word, some has spaces. What's the best way to get this data into column-formatted text with unix utils?
Edit: I'm looking for the following output:
Column header 1 Column header 2 ... Column header 5
Row 1 column 1 Row 1 column 2 ...
...
I was able to achieve this output by manually converting the data to CSV in vim by adding a comma to the end of each line, then manually joining each set of 5 lines with J. Then I ran the csv through column -ts, to get the desired output. But there's got to be a better way next time this comes up.
Perhaps a perl-one-liner ain't "the best" way, but it should work:
perl -ne 'BEGIN{$fields_per_line=5; $field_separator="\t";
$line_break="\n"}
chomp;
print $_,
$. % $fields_per_line ? $field_separator : $line_break;
END{print $line_break if $. % $fields_per_line}' INFILE > OUTFILE.CSV
Just substitute the "5", "\t" (tabspace), "\n" (newline) as needed.
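A shorter alternative, assuming the input has exactly five lines per row: paste with five `-` operands reads five consecutive lines from standard input and joins them with a tab. The sample lines and the file name `pasted.txt` below are hypothetical stand-ins for the email text.

```shell
cat > pasted.txt <<'EOF'
Column header 1
Column header 2
Column header 3
Column header 4
Column header 5
Row 1, column 1
Row 1, column 2
Row 1, column 3
Row 1, column 4
Row 1, column 5
EOF

# Each "-" consumes one line from stdin, so five of them build one row.
rows=$(paste - - - - - < pasted.txt)

printf '%s\n' "$rows"
```

Because the fields are joined with tabs, values that contain spaces or commas stay intact, and the result can be piped through `column -t -s "$(printf '\t')"` to align it for display.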
You would have to use a script that reads the file line by line and keeps a counter. When the script reaches the line you want, use the cut command with a space as the delimiter to get the word you want:
counter=0
lineNumber=3
# INFILE is a placeholder for the input file name
while read -r line
do
    counter=$(( counter + 1 ))
    if [ "$lineNumber" -eq "$counter" ]
    then
        echo "$line" | cut -d" " -f 4
    fi
done < INFILE
