Reformat data using awk - linux

I have a dataset that contains rows of UUIDs followed by locations and transaction IDs. The UUIDs are separated by a semi-colon (';') and the transactions are separated by tabs, like the following:
01234;LOC_1=ABC LOC_1=BCD LOC_2=CDE
56789;LOC_2=DEF LOC_3=EFG
I know all of the location codes in advance. What I want to do is transform this data into a format I can load into SQL/Postgres for analysis, like this:
01234;LOC_1=ABC
01234;LOC_1=BCD
01234;LOC_2=CDE
56789;LOC_2=DEF
56789;LOC_3=EFG
I'm pretty sure I can do this easily using awk (or similar) by looking up location IDs from a file (ex. LOC_1) and matching any instance of the location ID and printing that out next to the UUID. I haven't been able to get it right yet, and any help is much appreciated!
My locations file is named location and my dataset is data. Note that I can edit the original file or write the results to a new file, either is fine.

awk without using split: use semicolon or tab as the field separator
awk -F'[;\t]' -v OFS=';' '{for (i=2; i<=NF; i++) print $1,$i}' file
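Run against the sample above (here saved as data, the questioner's file name), the expected output is:
$ awk -F'[;\t]' -v OFS=';' '{for (i=2; i<=NF; i++) print $1,$i}' data
01234;LOC_1=ABC
01234;LOC_1=BCD
01234;LOC_2=CDE
56789;LOC_2=DEF
56789;LOC_3=EFG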

I don't think you need to match against a known list of locations; you should be able to just print each line as you go:
$ awk '{print $1; split($1,a,";"); for (i=2; i<=NF; ++i) print a[1] ";" $i}' file
01234;LOC_1=ABC
01234;LOC_1=BCD
01234;LOC_2=CDE
56789;LOC_2=DEF
56789;LOC_3=EFG

Your comment about knowing the locations and the mapping file makes me suspect that what your example seems to do isn't exactly what is being asked - but it seems like you want to reformat each set of tab-delimited LOC= values into separate rows, each with its UUID in front.
If so, this will do the trick:
awk ' BEGIN {OFS=FS=";"} {split($2,locs,"\t"); for (n in locs) { print $1,locs[n]}}'
Given:
$ cat -A data.txt
01234;LOC_1=ABC^ILOC_1=BCD^ILOC_2=CDE$
56789;LOC_2=DEF^ILOC_3=EFG$
Then:
$ awk ' BEGIN {OFS=FS=";"} {split($2,locs,"\t"); for (n in locs) { print $1,locs[n]}}' data.txt
01234;LOC_1=ABC
01234;LOC_1=BCD
01234;LOC_2=CDE
56789;LOC_2=DEF
56789;LOC_3=EFG
The BEGIN {OFS=FS=";"} block sets the input and output delimiter to ;.
For each row, we then split the second field into an array named locs, splitting on tab, via - split($2,locs,"\t")
And then loop through locs printing the UUID and each loc value - for (n in locs) { print $1,locs[n]}
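For readability, the same logic can also be written as a standalone commented script (a sketch; the name reformat.awk is just an example, and it loops by index so the output order matches the input):
# reformat.awk - one output line per LOC= value, prefixed with its UUID
BEGIN { OFS = FS = ";" }              # read and write ;-delimited fields
{
    n = split($2, locs, "\t")         # split the second field on tabs into locs[1..n]
    for (i = 1; i <= n; i++)          # ordered loop instead of "for (n in locs)"
        print $1, locs[i]             # print UUID;LOC_x=value
}
Run it with awk -f reformat.awk data.txt.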

How about one without a loop or split, as follows (assuming the Input_file matches the samples shown)?
awk 'BEGIN{FS=OFS=";"}{gsub(/[[:space:]]+/,"\n"$1 OFS)} 1' Input_file
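Here gsub replaces every run of whitespace (the tabs between the LOC= values) with a newline followed by the first field and the output separator, so each record is expanded in place before the trailing 1 prints it. On the sample data the expected output is:
$ awk 'BEGIN{FS=OFS=";"}{gsub(/[[:space:]]+/,"\n"$1 OFS)} 1' Input_file
01234;LOC_1=ABC
01234;LOC_1=BCD
01234;LOC_2=CDE
56789;LOC_2=DEF
56789;LOC_3=EFG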

This might work for you (GNU sed):
sed -r 's/((.*;)\S+)\s+(\S+)/\1\n\2\3/;P;D' file
Repeatedly replace the whitespace between locations with a newline followed by the UUID and a ;, printing and deleting each line as it appears.
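On the sample file this is expected to produce the same flattened list as the awk answers (assuming GNU sed, as noted):
$ sed -r 's/((.*;)\S+)\s+(\S+)/\1\n\2\3/;P;D' data
01234;LOC_1=ABC
01234;LOC_1=BCD
01234;LOC_2=CDE
56789;LOC_2=DEF
56789;LOC_3=EFG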

Related

How to convert tab separated files into a list?

I have a tab separated file as shown below,
ENSONIT00000008797.2 GO:0000003 GO:0000149 GO:0000226
I want to convert this file as shown below:
List
ENSONIT00000008797.2 GO:0000003
ENSONIT00000008797.2 GO:0000149
ENSONIT00000008797.2 GO:0000226
Do you mean this? If there is only one column on a line, it will not print anything for that line.
awk '{ for(i = 2; i <= NF; i++) print $1 "\t" $i}' file
PS: awk splits lines on spaces or tabs by default.
Tip: use sort and uniq to format the output to your requirements.
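For the sample line above (assuming the data is in a file named file), the expected result is:
$ awk '{ for(i = 2; i <= NF; i++) print $1 "\t" $i}' file
ENSONIT00000008797.2 GO:0000003
ENSONIT00000008797.2 GO:0000149
ENSONIT00000008797.2 GO:0000226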

Grouping related rows of data into a single column in Linux

I have a csv file that gets generated daily and automatically that has output similar to the following example:
"N","3.5",3,"Bob","10/29/17"
"Y","4.5",5,"Bob","10/11/18"
"Y","5",6,"Bob","10/28/18"
"Y","3",1,"Jim",
"N","4",2,"Jim","09/29/17"
"N","2.5",4,"Joe","01/26/18"
I need to transform the text so that it is grouped by person (the fourth column), with all of a person's records on a single row and the columns repeated in the same sequence: 1, 2, 3, 5. Some cells may be missing data but must remain in the sequence so the columns line up. So the output I need will look like this:
"Bob","N","3.5",3,"10/29/17","Y","4.5",5,"10/11/18","Y","5",6,"10/28/18"
"Jim","Y","3",1,,"N","4",2,"09/29/17"
"Joe","N","2.5",4,"01/26/18"
I am open to using sed, awk, or pretty much any standard Linux command to get this task done. I've been trying to use awk, and though I get close, I can't figure out how to finish it.
Here is the command where I'm close. It lists the header and the names, but no other data:
awk -F"," 'NR==1; NR>1 {a[$4]=a[$4] ? i : ""} END {for (i in a) {print i}}' test2.csv
You need a little more code:
$ awk 'BEGIN {FS=OFS=","}
{k=$4; $4=$5; NF--; a[k]=(k in a?a[k] FS $0:$0)}
END {for(k in a) print k,a[k]}' file
"Bob","N","3.5",3,"10/29/17" ,"Y","4.5",5,"10/11/18" ,"Y","5",6,"10/28/18"
"Jim","Y","3",1, ,"N","4",2,"09/29/17"
"Joe","N","2.5",4,"01/26/18"
Note that the NF-- trick may not work in all awks.
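If your awk rejects NF--, a portable variant (a sketch, untested across awks) rebuilds the kept fields explicitly instead of truncating the record:
$ awk 'BEGIN {FS=OFS=","}
{k=$4; rec=$1 OFS $2 OFS $3 OFS $5; a[k]=(k in a ? a[k] OFS rec : rec)}
END {for (k in a) print k, a[k]}' file
As with the original, the for (k in a) loop does not guarantee any particular output order.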
You could also try the following, which reads the Input_file twice; it prints the output in the same order in which the 4th column appears in the Input_file.
awk '
BEGIN{
  FS=OFS=","
}
FNR==NR{
  a[$4]=a[$4]?a[$4] OFS $1 OFS $2 OFS $3 OFS $5:$4 OFS $1 OFS $2 OFS $3 OFS $5
  next
}
a[$4]{
  print a[$4]
  delete a[$4]
}
' Input_file Input_file
If there is any chance that any of the CSV values contains a comma, then a "CSV-aware" tool would be advisable to obtain a reliable but straightforward solution.
One approach would be to use one of the many readily available csv2tsv command-line tools. A variety of elegant solutions then becomes possible. For example, one could pipe the CSV into csv2tsv, awk, and tsv2csv.
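Such a pipeline might look like the following sketch (assuming csv2tsv and tsv2csv read standard input and write standard output; the exact tool names and options depend on which converter you install):
csv2tsv < input.csv |
awk 'BEGIN {FS=OFS="\t"}
     {k=$4; rec=$1 OFS $2 OFS $3 OFS $5; a[k]=(k in a ? a[k] OFS rec : rec)}
     END {for (k in a) print k, a[k]}' |
tsv2csv
The awk in the middle is the same grouping logic as above, just run on tab-separated fields, where embedded commas are no longer a problem.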
Here is another solution that uses csv2tsv and jq:
csv2tsv < input.csv | jq -Rrn '
[inputs | split("\t")]
| group_by(.[3])[]
| sort_by(.[2])
| [.[0][3]] + ( map( del(.[3])) | add)
| @csv
'
This produces:
"Bob","N","3.5","3","10/29/17 ","Y","4.5","5","10/11/18 ","Y","5","6","10/28/18 "
"Jim","Y","3","1"," ","N","4","2","09/29/17 "
"Joe","N","2.5","4","01/26/18"
Trimming the excess spaces is left as an exercise :-)

Use sed to find and replace a number followed by its successor in bash

I have a string that contains multiple occurrences of number ranges, which are separated by a comma, e.g.,
2-12,59-89,90-102,103-492,593-3990,3991-4930
Now I would like to merge all directly neighbouring ranges in the string, i.e., remove anything that is of the form -(x),(x+1), to get something like this:
2-12,59-492,593-4930
Can anyone think of a method to accomplish this? I honestly cannot post anything that I have tried, because all my attempts were highly unsuccessful. It seems to me that it is not possible to find anything of the form -(x),(x+1) using sed, since that would require performing arithmetic or comparisons between a matched number and another number that has to be part of the very command that is searching for the numbers.
If everybody agrees that sed is NOT the correct tool for doing this, I will do it another way, but I am still interested if it's possible.
with awk
awk -F, -v RS="-" -v ORS="-" '$2!=$1+1' file
With the appropriate separator settings, print the record when the second field is not the first field plus one.
RS is the record separator and ORS is the output record separator.
test:
> awk -F, -v RS="-" -v ORS="-" '$2!=$1+1' <<< "2-12,59-89,90-102,103-492,593-3990,3991-4930"
2-12,59-492,593-4930
awk solution:
awk -F'-' '{ r=$1;
    for (i=2; i<=NF; i++) {
        split($i, a, ",");
        r=sprintf("%s%s", r, a[2]-a[1]==1? "" : FS $i)
    }
    print r
}' file
-F'-' - treat - (hyphen) as the field separator
r - the resulting string
split($i, a, ",") - split the adjacent range boundaries into array a on the , separator
a[2]-a[1]==1 - the crucial condition, reflecting (x),(x+1)
The output:
2-12,59-492,593-4930
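Written out with comments, the same logic reads like this (a sketch; the script name merge_ranges.awk is just an example, run with awk -f merge_ranges.awk file):
# merge_ranges.awk - merge directly adjacent ranges in a comma-separated list
BEGIN { FS = "-" }                   # hyphen separates range boundaries
{
    r = $1                           # start the result with the first range start
    for (i = 2; i <= NF; i++) {
        split($i, a, ",")            # a[1] = end of one range, a[2] = start of the next
        if (a[2] - a[1] == 1)        # adjacent ranges: drop this boundary entirely
            continue
        r = r FS $i                  # otherwise keep the "end,start" pair
    }
    print r
}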
This might work for you (GNU sed):
sed -r ' s/^/\n/;:a;ta;s/\n([^-]*-)([0-9]*)(.*,)/\1\n\2\n\2\n\3/;Td;:b;s/(\n.*\n.*)9(_*\n)/\1_\2/;tb;s/(\n.*\n)(_*\n)/\10\2/;s/$/\n0123456789/;s/(\n.*\n[0-9]*)([0-8])(_*\n.*)\n.*\2(.).*/\1\4\3/;:z;tz;s/(\n.*\n[^_]*)_([^\n]*\n)/\10\2/;tz;:c;tc;s/([0-9]*-)\n(.*)\n(.*)\n,(\3)-/\n\1/;ta;s/\n(.*)\n.*\n,/\1,\n/;ta;:d;s/\n//g' file
This proof-of-concept sed solution iteratively increments and compares the end of one range with the start of the next. If the comparison is true it removes both and repeats, otherwise it moves on to the next range, repeating until all ranges have been compared.

How to append same header data to one header in Linux

My data is separated by a comma delimiter.
Take the value before the comma as the main header column; whenever the same header occurs again, append its data under that one header, wrapping each value in open and close curly brackets.
Please consider my example for a better understanding.
Input file data
19,66:BILL
19,34
19,02
21,:0
21,:0
21,:1
21,37
26,:19
26,87
27,35
31,77
31,12
31,202
Output file data
19,{66:BILL}{34}{02}
21,:{0}{:0}{:1}
21,37
26,{:19}{87}
27,35
31,{77}{12}{102}
A solution using awk
$ awk -F, '{a[$1]=a[$1]"{"$2"}"} END{for (i in a) print i FS a[i]}' input.csv
Assuming that the input file contains only two columns, the script builds an array a by appending the value $2 of every row with the same index $1 to the same element a[$1].
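The same one-liner, expanded with comments (a sketch; the script name append.awk is just an example):
# append.awk - collect all second-column values under their first-column key
BEGIN { FS = "," }                       # fields are comma separated
{ a[$1] = a[$1] "{" $2 "}" }             # append {value} to this key's entry
END { for (i in a) print i FS a[i] }     # print each key followed by its collected values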
input.csv
19,66:BILL
19,34
19,02
21,:0
21,:0
21,:1
21,37
26,:19
26,87
27,35
31,77
31,12
31,202
output
19,{66:BILL}{34}{02}
21,{:0}{:0}{:1}{37}
26,{:19}{87}
27,{35}
31,{77}{12}{202}

Get list of all duplicates based on first column within large text/csv file in linux/ubuntu

I am trying to extract all the duplicates based on the first column/index of my very large text/csv file (7+ GB / 100+ Million lines). Format is like so:
foo0:bar0
foo1:bar1
foo2:bar2
The first column is any lowercase UTF-8 string and the second column is any UTF-8 string. I have been able to sort my file based on the first column, and only the first column, with:
sort -t':' -k1,1 filename.txt > output_sorted.txt
I have also been able to drop all duplicates with:
sort -t':' -u -k1,1 filename.txt > output_uniq_sorted.txt
These operations take 4-8 min.
I am now trying to extract all duplicates based on the first column and only the first column, to make sure all entries in the second columns are matching.
I think I can achieve this with awk with this code:
BEGIN { FS = ":" }
{
    count[$1]++;
    if (count[$1] == 1) {
        first[$1] = $0;
    }
    if (count[$1] == 2) {
        print first[$1];
    }
    if (count[$1] > 1) {
        print $0;
    }
}
running it with:
awk -f awk.dups input_sorted.txt > output_dup.txt
Now the problem is that this takes way too long (3+ hours) and is still not done. I know uniq can get all duplicates with something like:
uniq -D sorted_file.txt > output_dup.txt
The problem is specifying the delimiter and only using the first column. I know uniq has a -f N option to skip the first N fields. Is there a way to get these results without having to change/process my data? Is there another tool that could accomplish this? I have already used python + pandas with read_csv to get the duplicates, but this leads to errors (segmentation fault) and is not efficient, since I shouldn't have to load all the data into memory when the data is already sorted. I have decent hardware:
i7-4700HQ
16GB ram
256GB ssd samsung 850 pro
Anything that can help is welcome,
Thanks.
SOLUTION FROM BELOW
Using:
awk -F: '{if(p!=$1){p=$1; c=0; p0=$0} else c++} c==1{print p0} c'
with the command time I get the following performance.
real 0m46.058s
user 0m40.352s
sys 0m2.984s
If your file is already sorted you don't need to store more than one line; try this:
$ awk -F: '{if(p!=$1){p=$1; c=0; p0=$0} else c++} c==1{print p0} c' sorted.input
If you try this please post the timings...
I have changed the awk script slightly because I couldn't fully understand what was happening in the above answer.
awk -F: '{if(p!=$1){p=$1; c=0; p0=$0} else c++} c>=1{if(c==1){print p0;} print $0}' sorted.input > duplicate.entries
I have tested and this produces the same output as the above but might be easier to understand.
{if(p!=$1){p=$1; c=0; p0=$0} else c++}
If the first token on the line is not the same as the previous one, we save that token, set c to 0, and save the whole line into p0. If it is the same, we increment c.
c>=1{if(c==1){print p0;} print $0}
In the case of a repeat, we check whether it is the first repeat. If it is, we print the saved line and the current line; if not, we just print the current line.
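Putting the pieces together as a commented script (a sketch; the name dups.awk is just an example, run with awk -f dups.awk sorted.input):
# dups.awk - print every line whose first ':'-delimited field repeats,
# assuming the input is already sorted on that field
BEGIN { FS = ":" }
{
    if (p != $1) { p = $1; c = 0; p0 = $0 }   # new key: remember it and its first line
    else c++                                  # same key again: count the repeat
}
c >= 1 {                                      # we are on a repeated line
    if (c == 1) print p0                      # first repeat: also emit the saved first line
    print $0                                  # emit the current duplicate line
}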
