Subset delimited file by subset of one column - linux

I have a text file (call it infile.txt) where the columns have headers and are delimited by semicolons. A subset of it is reproduced below:
SCHCD; SCHNAME
13110208001; GOVT MIDSCHOOL
10110208002; GOVT HIGHSCHOOL
21110208101; MATRIC
21110208102; UPPER SECONDARY
13110208201; SECONDARY
I want a subset of the file where the first two characters of "SCHCD" are "13". So my subset (call it outfile.txt) should look like:
SCHCD; SCHNAME
13110208001; GOVT MIDSCHOOL
13110208201; SECONDARY

With awk:
awk ' NR == 1 || /^13/ ' infile.txt > outfile.txt
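The pattern /^13/ tests the start of the whole line, which coincides with SCHCD only because SCHCD is the first column. A minimal sketch that names the field explicitly, assuming the semicolon-delimited layout shown in the question (the scratch directory is only for illustration):

```shell
# Recreate the sample from the question in a scratch directory.
tmp=$(mktemp -d)
cat > "$tmp/infile.txt" <<'EOF'
SCHCD; SCHNAME
13110208001; GOVT MIDSCHOOL
10110208002; GOVT HIGHSCHOOL
21110208101; MATRIC
21110208102; UPPER SECONDARY
13110208201; SECONDARY
EOF

# -F';' splits each line on semicolons.
# NR == 1 keeps the header row; $1 ~ /^13/ keeps rows whose
# first field starts with "13".
awk -F';' 'NR == 1 || $1 ~ /^13/' "$tmp/infile.txt" > "$tmp/outfile.txt"
cat "$tmp/outfile.txt"
```

If SCHCD were ever moved to another column, only the field number in the pattern would need to change.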

Related

Linux filtering a file by two columns and printing the output

I have a table that has 9 columns as shown below.
How would I first filter on the strand column so that only rows with a "+" are selected, and then, of those, select the ones that have 3 exons (in the exon count column)?
I have been trying to use grep for this as I understand I can pick out a word from a column, but I only get the particular column or just the total number.
Using awk:
awk -F "," ' $4=="+" && $9=="3" ' file.csv
If it's not CSV then remove -F "," from this command
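Since the table itself is not reproduced here, the following is only a sketch on made-up data. The file name genes.csv, the column names, and every row are hypothetical; it assumes, as the command above does, that the strand is column 4 and the exon count is column 9.

```shell
tmp=$(mktemp -d)
# Hypothetical 9-column CSV; only columns 4 (strand) and 9 (exon count) matter.
cat > "$tmp/genes.csv" <<'EOF'
id,chrom,name,strand,start,end,cds_start,cds_end,exon_count
1,chr1,geneA,+,100,900,150,850,3
2,chr1,geneB,-,200,800,250,750,3
3,chr2,geneC,+,300,700,350,650,5
4,chr2,geneD,+,400,600,450,550,3
EOF

# Keep the header, then rows where the strand is "+" and the exon count is 3.
# $9 == 3 is a numeric comparison, so a field written as "03" would also match.
awk -F',' 'NR == 1 || ($4 == "+" && $9 == 3)' "$tmp/genes.csv" > "$tmp/plus_3exons.csv"
cat "$tmp/plus_3exons.csv"
```

Both conditions sit in one pattern joined by &&, so no separate "select, then select again" pass is needed.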

Extract substring from first column

I have a large text file with 2 columns. The first column is large and complicated, but contains a name="..." portion. The second column is just a number.
How can I produce a text file such that the first column contains ONLY the name, but the second column stays the same and shows the number? Basically, I want to extract a substring from the first column only AND have the 2nd column stay unaltered.
Sample data:
application{id="1821", name="app-name_01"} 0
application{id="1822", name="myapp-02", optionalFlag="false"} 1
application{id="1823", optionalFlag="false", name="app_name_public"} 3
...
So the result file would be something like this
app-name_01 0
myapp-02 1
app_name_public 3
...
If your actual Input_file is the same as the sample shown, the following code may help.
awk '{sub(/.*name=\"/,"");sub(/\".* /," ")} 1' Input_file
Output will be as follows.
app-name_01 0
myapp-02 1
app_name_public 3
Using GNU awk
$ awk 'match($0,/name="([^"]*)"/,a){print a[1],$NF}' infile
app-name_01 0
myapp-02 1
app_name_public 3
Non-Gawk
awk 'match($0,/name="([^"]*)"/){t=substr($0,RSTART,RLENGTH);gsub(/name=|"/,"",t);print t,$NF}' infile
app-name_01 0
myapp-02 1
app_name_public 3
Input:
$ cat infile
application{id="1821", name="app-name_01"} 0
application{id="1822", name="myapp-02", optionalFlag="false"} 1
application{id="1823", optionalFlag="false", name="app_name_public"} 3
...
Here's a sed solution:
sed -r 's/.*name="([^"]+).* ([0-9]+)$/\1 \2/g' Input_file
Explanation:
The parentheses store whatever they enclose in groups.
The first group is everything after name=" up to the next ". [^"] means "not a double quote".
The second group is simply "one or more digits at the end of the line, preceded by a space".
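To see the two groups at work, here is the same substitution run on the sample lines. It is a sketch that assumes a sed accepting -E for extended regular expressions (GNU and BSD sed both do; -r is the GNU spelling).

```shell
tmp=$(mktemp -d)
cat > "$tmp/Input_file" <<'EOF'
application{id="1821", name="app-name_01"} 0
application{id="1822", name="myapp-02", optionalFlag="false"} 1
application{id="1823", optionalFlag="false", name="app_name_public"} 3
EOF

# \1 = the text between name=" and the next double quote
# \2 = the digits that follow the last space on the line
result=$(sed -E 's/.*name="([^"]+).* ([0-9]+)$/\1 \2/' "$tmp/Input_file")
printf '%s\n' "$result"
```

The trailing g flag is left out here because the pattern consumes the whole line, so it can match at most once per line.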

Linux - How to remove certain lines from a files based on a field value

I want to remove certain lines from a tab-delimited file and write output to a new file.
a b c 2017-09-20
a b c 2017-09-19
es fda d 2017-09-20
es fda d 2017-09-19
The 4th column is the date. I want to keep only the lines that have "2017-09-19" in the 4th column (lines 2 and 4) and write them to a new file. The new file should have the same format as the raw file.
How to write the linux command for this example?
Note: The search criteria should be on the 4th field as I have other fields in the real data and possibly have same value as 4th field.
With awk:
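awk -F'\t' '$4=="2017-09-19"' file > newfile
-F'\t': sets the input field separator to a tab. Matching lines are printed unchanged, so the tab-delimited format is preserved without setting OFS (the output field separator, a space by default).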
Use grep to filter:
grep '2017-09-19' file.txt > filtered_file.txt
This is not perfect, since the string 2017-09-19 is not required to appear in the 4th column, but if your file looks like the example, it'll work.
Sed solution:
sed -nr "/^([^\t]*\t){3}2017-09-19/p" input.txt >output.txt
Here:
-n - do not print every line automatically
-r - use extended regular expressions
/regexp/p - print lines that match the regular expression regexp
^ - beginning of the line
(regexp){3} - repeat regexp 3 times
[^\t] - any character except a tab
\t - a tab character
* - repeat the preceding item zero or more times
2017-09-19 - the text to search for
That is, skip 3 tab-separated columns from the beginning of the line, then check that column 4 starts with the required value.
awk '/2017-09-19/' file >newfile
cat newfile
a b c 2017-09-19
es fda d 2017-09-19
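Whole-line matches (grep, or awk '/2017-09-19/') and a test on the 4th field give different results as soon as the date also occurs in another column. A small sketch, assuming a tab-delimited file; the third row is a hypothetical addition to the sample that carries the date in column 3 only:

```shell
tmp=$(mktemp -d)
# printf expands \t into real tabs; the last row is made up for the demonstration.
printf 'a\tb\tc\t2017-09-20\n'             >  "$tmp/file"
printf 'a\tb\tc\t2017-09-19\n'             >> "$tmp/file"
printf 'es\tfda\t2017-09-19\t2017-09-20\n' >> "$tmp/file"

# Matching anywhere on the line also counts the made-up third row.
anywhere=$(awk '/2017-09-19/ { n++ } END { print n + 0 }' "$tmp/file")
echo "lines matched anywhere: $anywhere"

# Testing only the 4th tab-separated field keeps the second row alone.
awk -F'\t' '$4 == "2017-09-19"' "$tmp/file" > "$tmp/newfile"
cat "$tmp/newfile"
```

The field test is the one that satisfies the note in the question about other fields possibly holding the same value.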

Linux - putting lines that contain a string at a specific column in a new file

I want to pull all rows from a text file in linux which contain a specific number (in this case 9913) in a specific column (column 4). This is a tab-delimited file, so I am calling this a column, though I am not sure it is.
In some cases, there is only one number in column 4, but in other lines there are multiple numbers in this column (ex. 9913; 4444; 5555). I would like to get any rows for which the number 9913 appears in the 4th column (whether or not it is the only number or in a list). How do I put all lines which contain the number 9913 in column 4 and put them in their own file?
Here is an example of what I have tried:
cat file.txt | grep 9913 > newFile.txt
result is a mixture of the following:
CDR1as CDR1as ENST00000003100.8 9913 AAA-GGCAGCAAGGGACUAAAA (files that I want)
CDR1as CDR1as ENST00000399139.1 9606 GUCCCCA................(file ex. I don't want)
I do not get any results when I address a specific column. As the command below shows, the columns do not seem to be recognized, and I get blank files when using awk.
awk '$4 == "9913"' file.txt > newfile.txt
writes no data to the new file.
Thanks
This is one way of doing it
awk '$4 == "9913" {print $0}' file.txt > newfile.txt
or just
awk '$4 == "9913"' file.txt > newfile.txt
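The == test compares the whole field, so it cannot match rows where column 4 holds a list such as "9913; 4444; 5555". One possible sketch, assuming the file really is tab-delimited and list items are separated by a semicolon plus optional spaces; the rows are hypothetical stand-ins modeled on the question, with shortened sequences and a made-up third identifier:

```shell
tmp=$(mktemp -d)
# Hypothetical tab-delimited rows modeled on the question's sample.
printf 'CDR1as\tCDR1as\tENST00000003100.8\t9913\tAAA-GGCAGC\n'       >  "$tmp/file.txt"
printf 'CDR1as\tCDR1as\tENST00000399139.1\t9606\tGUCCCCA\n'          >> "$tmp/file.txt"
printf 'CDR1as\tCDR1as\tENST00000000001.1\t4444; 9913; 5555\tACGU\n' >> "$tmp/file.txt"

# Split column 4 on ";" plus optional spaces and print the row if any
# item is exactly 9913, so 99139 or 19913 would not count as a match.
awk -F'\t' '{
    n = split($4, ids, /; */)
    for (i = 1; i <= n; i++)
        if (ids[i] == "9913") { print; next }
}' "$tmp/file.txt" > "$tmp/newFile.txt"
cat "$tmp/newFile.txt"
```

Setting -F'\t' also keeps the 9913 inside the identifier in column 3 of the second row from being considered at all.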

Split and compare in awk

I want to split a field and compare values in an awk command.
Input file (tab-delimited)
1 aaa 1|3
2 bbb 3|3
3 ccc 0|2
Filtration
First column value > 1
First value of the third column (split on "|") > 2
Process
Check whether the first column value is greater than 1
Split the third column value on "|"
Check whether the first value of the third column is greater than 2
Print the line only if both conditions hold
Command line (example)
awk -F "\t" '{if($1>1 && ....?) print}' file
Output
2 bbb 3|3
Please let me know the command line for the above processing.
You can set the field separator to either tab or pipe and check the 1st and 3rd values:
awk -F'\t|\\|' '$1>1 && $3>2' file
or
awk -F"\t|\\\\|" '$1>1 && $3>2' file
You can read about all this character escaping in this comprehensive answer by Ed Morton in awk: fatal: Invalid regular expression when setting multiple field separators.
Otherwise, you can split the 3rd field and check the value of the first slice:
awk -F"\t" '{split($3,a,"|")} $1>1 && a[1]>2' file
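The split() approach can be checked against the sample input from the question. This is a minimal sketch with strict comparisons on both conditions, matching the "> 1" and "> 2" filters the question states; the scratch file is only for illustration.

```shell
tmp=$(mktemp -d)
# The three tab-delimited sample rows from the question.
printf '1\taaa\t1|3\n2\tbbb\t3|3\n3\tccc\t0|2\n' > "$tmp/file"

# split() stores the pieces of column 3 in a[]; a[1] is the part before "|".
# The row is printed only when $1 > 1 and a[1] > 2 both hold.
result=$(awk -F'\t' '{ split($3, a, "|") } $1 > 1 && a[1] > 2' "$tmp/file")
printf '%s\n' "$result"
```

Compared with the two-separator version, this keeps "3|3" as a single field, which helps if later columns must keep their usual numbers.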
