How to convert tab-separated files into a list? - linux

I have a tab separated file as shown below,
ENSONIT00000008797.2 GO:0000003 GO:0000149 GO:0000226
I want to convert this file into a list like this:
ENSONIT00000008797.2 GO:0000003
ENSONIT00000008797.2 GO:0000149
ENSONIT00000008797.2 GO:0000226

Do you mean this? If there is only one column on a line, nothing will be printed for it.
awk '{ for(i = 2; i <= NF; i++) print $1 "\t" $i}' file
PS: by default, awk splits each line on spaces or tabs.
Tip: use sort and uniq to further format the output to your requirements.
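As a quick sanity check, here is that loop run on a reconstructed sample line (the file name go_terms.txt is just a placeholder):

```shell
# Recreate the sample line with real tabs: one ID followed by several GO terms
printf 'ENSONIT00000008797.2\tGO:0000003\tGO:0000149\tGO:0000226\n' > go_terms.txt

# Pair the first field with every remaining field, one pair per line
awk '{ for (i = 2; i <= NF; i++) print $1 "\t" $i }' go_terms.txt
# -> ENSONIT00000008797.2  GO:0000003
#    ENSONIT00000008797.2  GO:0000149
#    ENSONIT00000008797.2  GO:0000226
```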

Related

Search for a string and print the line in a different order using Linux

I need to write a shell script that does the following which I am showing below with an example.
Suppose I have a file cars.txt which depicts a table like this
Person|Car|Country
The '|' is the separator. So the first two lines go like this
Michael|Ford|USA
Rahul|Maruti|India
I have to write a shell script that will find the lines in the cars.txt file that have USA as the country and print them like
USA|Ford|Michael
I am not very adept with Unix so I need some help here.
Will this do?
while read -r i; do
    NAME="$(cut -d'|' -f1 <<<"$i")"
    MAKE="$(cut -d'|' -f2 <<<"$i")"
    COUNTRY="$(cut -d'|' -f3 <<<"$i")"
    echo "$COUNTRY|$MAKE|$NAME"
done < <(grep "USA$" cars.txt)
Updated to locate "USA" on any line, not just the first line as provided in your question
Using awk you can do what you are attempting in a very simple manner, e.g.
$ awk -F'|' '/USA/ {for (i = NF; i >= 1; i--) printf "%s%s", $i, i==1 ? RS : FS}' cars.txt
USA|Ford|Michael
Explanation
awk -F'|' - read the file using '|' as the field separator, specified as -F'|' at the beginning of the call, or as FS within the command itself,
/USA/ - select only lines containing "USA",
for (i = NF; i >= 1; i--) - loop over the fields in reverse order,
printf "%s%s", $i, i==1 ? RS : FS - output the field followed by the field separator '|' (FS) if i is not equal to 1, or by the record separator (RS, "\n" by default) if i equals 1. The form test ? true_val : false_val is the ternary operator: it tests whether i == 1 and, if so, yields RS for output, otherwise FS.
It will be orders of magnitude faster than spawning subshells on every line for the command substitutions, grep and cut (plus the pipes).
Printing Only The 1st Occurrence of Line Containing "USA"
To print only the first line with "USA", all you need to do is exit after processing, e.g.
$ awk -F'|' '/USA/ {for (i = NF; i >= 1; i--) printf "%s%s", $i, i==1 ? RS : FS; exit}' cars.txt
USA|Ford|Michael
Explanation
Simply adding exit at the end of the action causes awk to stop processing records after the first match.
While both awk and sed take a little time to make friends with, together they provide the Unix-Swiss-Army-Knife for text processing. Well worth the time to learn both. It only takes a couple of hours to get a good base by going through one of the tutorials. Good luck with your scripting.

Reformat data using awk

I have a dataset that contains rows of UUIDs followed by locations and transaction IDs. The UUIDs are separated by a semi-colon (';') and the transactions are separated by tabs, like the following:
01234;LOC_1=ABC LOC_1=BCD LOC_2=CDE
56789;LOC_2=DEF LOC_3=EFG
I know all of the location codes in advance. What I want to do is transform this data into a format I can load into SQL/Postgres for analysis, like this:
01234;LOC_1=ABC
01234;LOC_1=BCD
01234;LOC_2=CDE
56789;LOC_2=DEF
56789;LOC_3=EFG
I'm pretty sure I can do this easily using awk (or similar) by looking up location IDs from a file (ex. LOC_1) and matching any instance of the location ID and printing that out next to the UUID. I haven't been able to get it right yet, and any help is much appreciated!
My locations file is named location and my dataset is data. Note that I can edit the original file or write the results to a new file, either is fine.
awk without using split: use semicolon or tab as the field separator
awk -F'[;\t]' -v OFS=';' '{for (i=2; i<=NF; i++) print $1,$i}' file
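Since the tabs are invisible on the page, here is the same one-liner run against a reconstructed copy of the data (the file name data.txt is an assumption):

```shell
# Recreate the sample with real tabs between the LOC= entries
printf '01234;LOC_1=ABC\tLOC_1=BCD\tLOC_2=CDE\n56789;LOC_2=DEF\tLOC_3=EFG\n' > data.txt

# Treat both ';' and tab as input separators; join output fields with ';'
awk -F'[;\t]' -v OFS=';' '{ for (i = 2; i <= NF; i++) print $1, $i }' data.txt
# -> 01234;LOC_1=ABC
#    01234;LOC_1=BCD
#    01234;LOC_2=CDE
#    56789;LOC_2=DEF
#    56789;LOC_3=EFG
```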
I don't think you need to match against a known list of locations; you should be able to just print each line as you go:
$ awk '{print $1; split($1,a,";"); for (i=2; i<=NF; ++i) print a[1] ";" $i}' file
01234;LOC_1=ABC
01234;LOC_1=BCD
01234;LOC_2=CDE
56789;LOC_2=DEF
56789;LOC_3=EFG
Your comment about knowing the locations, together with the mapping file, makes me suspect your example isn't exactly what is being asked - but it seems like you want to reformat each set of tab-delimited LOC= values into rows with their UUID in front.
If so, this will do the trick:
awk 'BEGIN {OFS=FS=";"} {n = split($2, locs, "\t"); for (i = 1; i <= n; i++) print $1, locs[i]}'
Given:
$ cat -A data.txt
01234;LOC_1=ABC^ILOC_1=BCD^ILOC_2=CDE$
56789;LOC_2=DEF^ILOC_3=EFG$
Then:
$ awk 'BEGIN {OFS=FS=";"} {n = split($2, locs, "\t"); for (i = 1; i <= n; i++) print $1, locs[i]}' data.txt
01234;LOC_1=ABC
01234;LOC_1=BCD
01234;LOC_2=CDE
56789;LOC_2=DEF
56789;LOC_3=EFG
The BEGIN {OFS=FS=";"} block sets the input and output delimiters to ;.
For each row, split($2, locs, "\t") splits the second field on tabs into the array locs and returns the element count, and
the loop then prints the UUID alongside each loc value, in order - for (i = 1; i <= n; i++) print $1, locs[i]. (A for (n in locs) loop would usually work too, but awk does not guarantee iteration order for in loops.)
Here is an approach without a loop or a split (assuming the Input_file matches the samples shown):
awk 'BEGIN{FS=OFS=";"}{gsub(/[[:space:]]+/,"\n"$1 OFS)} 1' Input_file
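To see why this works: gsub replaces every run of whitespace in the line with a newline followed by a fresh copy of the first field and the separator, so each LOC= entry lands on its own UUID-prefixed line. A minimal sketch:

```shell
# $1 is the UUID ("01234") because FS is ';'; each tab run becomes "\n01234;"
printf '01234;LOC_1=ABC\tLOC_1=BCD\tLOC_2=CDE\n' |
awk 'BEGIN{FS=OFS=";"} {gsub(/[[:space:]]+/, "\n" $1 OFS)} 1'
# -> 01234;LOC_1=ABC
#    01234;LOC_1=BCD
#    01234;LOC_2=CDE
```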
This might work for you (GNU sed):
sed -r 's/((.*;)\S+)\s+(\S+)/\1\n\2\3/;P;D' file
Repeatedly replace the white space between locations with a newline, followed by the UUID and a ;, printing/deleting each line as it appears.

bash text file transpose, add new column and make one big two-column again

I have a large text file:
modularity_class;keys;columna1;columna2;columna3;
1;Antimalarial;Borneo;Cytotoxicity;Indonesia
0;Africa;malaria;morbidity;mortality
6;Anopheles albimanus;compression sprayer;house?spraying;;
12;;;;Tanzania;;
The final result should be:
Antimalarial;1
Borneo;1
Cytotoxicity;1
Indonesia;1
Africa;0
malaria;0
morbidity;0
mortality;0
Anopheles albimanus;6
compression sprayer;6
house?spraying;6
Tanzania;12
As you can see, I need to:
remove the first row (should be trivial)
transpose each row (one by one)
add the first value of the original row to every transposed element as a second column
skip every null/blank value between semicolon delimiters
I've read about awk, sed, tr and so on... but I cannot figure out how to do it efficiently.
Note: every row may have a different length or number of elements.
Simple awk should do the trick:
awk -F';' 'NR>1 {
for(i=2; i<=NF; i++) {
if($i!="")
print $i FS $1
}
}' file
One-liner:
awk -F';' 'NR>1 { for(i=2; i<=NF; i++) { if($i!="") print $i FS $1 } }' file
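Run against the sample data (header included, which NR>1 skips), this produces exactly the requested list:

```shell
# Recreate the sample file, header row first
cat > sample.txt <<'EOF'
modularity_class;keys;columna1;columna2;columna3;
1;Antimalarial;Borneo;Cytotoxicity;Indonesia
0;Africa;malaria;morbidity;mortality
6;Anopheles albimanus;compression sprayer;house?spraying;;
12;;;;Tanzania;;
EOF

# Skip the header (NR>1); print each non-empty field, then ';' and the row's first value
awk -F';' 'NR>1 { for (i = 2; i <= NF; i++) { if ($i != "") print $i FS $1 } }' sample.txt
# -> Antimalarial;1 ... Tanzania;12, blanks between the semicolons skipped
```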

How to display duplicates from a text file using awk

I'm trying to find out how to use the awk command in order to display words that show up multiple times in a text file. In addition, how can you display the name(s) of the file(s) they appear in?
ex: first sentence first file.
Second sentence followed by the second word.
This should display: "first" and "second"
I assume that with -i you mean the comparison/counting should ignore case.
If I understand your requirements correctly, a command like this should work:
awk '{ for (i = 1; i <= NF; i++) { cnt[tolower($i)]++; if (cnt[tolower($i)] > 1) print tolower($i) } }' yourfile | sort -u
It prints these words for your example:
first
second
sentence
If you need case-sensitive counting, just delete the tolower() calls.
For each line in the file, the script iterates through each word (the for (i = 1; i <= NF; i++) loop):
increments a counter for each word ( cnt[tolower($i)]++ )
if the count is larger than 1, the word is printed
the pipe to sort -u sorts the output and removes the duplicates from it.
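Putting it together on the two example sentences (the file name words.txt is just a placeholder):

```shell
# Recreate the example text
cat > words.txt <<'EOF'
first sentence first file.
Second sentence followed by the second word.
EOF

# Count each word case-insensitively; print a word once its count exceeds 1,
# then sort -u collapses the repeated prints
awk '{ for (i = 1; i <= NF; i++) { cnt[tolower($i)]++; if (cnt[tolower($i)] > 1) print tolower($i) } }' words.txt | sort -u
# -> first
#    second
#    sentence
```

Note that "the" appears only once in the sample, so it is not printed.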

Removing fields from a file keeping the delimiter intact

We have a requirement to remove certain fields inside a delimited file using a shell script, but we do not want to lose the delimiter, i.e.
$cat file1
col1,col2,col3,col4,col5
1234,4567,7890,9876,6754
abcd,efgh,ijkl,rtyt,cvbn
now we need to generate a different file (say file2) out of file1, with the second and fourth columns removed but the delimiter (,) intact,
i.e.
$cat file2
col1,,col3,,col5
1234,,7890,,6754
abcd,,ijkl,,cvbn
Please suggest what would be the easiest and most efficient way of achieving this; also, as the file has around 300 fields/columns, AWK is not working because of its limit on the number of fields.
awk '{$2 = $4 = ""}1' FS=, OFS=, file1 > file2
awk 'BEGIN { FS = OFS = ","; } { $2 = $4 = ""; print }' file1 > file2
Which, simply put:
sets input & output field separator to , at start,
empties fields 2 & 4 for each line, and then prints the line back.
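For what it's worth, the field-count limit belonged to historic versions of awk; modern implementations (gawk, mawk, BSD awk) handle hundreds of fields without trouble. A quick check of the second form on the sample:

```shell
# Recreate the sample input
cat > file1 <<'EOF'
col1,col2,col3,col4,col5
1234,4567,7890,9876,6754
abcd,efgh,ijkl,rtyt,cvbn
EOF

# Blank out fields 2 and 4; with OFS="," the surrounding commas are preserved
awk 'BEGIN { FS = OFS = ","; } { $2 = $4 = ""; print }' file1 > file2
cat file2
# -> col1,,col3,,col5
#    1234,,7890,,6754
#    abcd,,ijkl,,cvbn
```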
