Move every x (dynamic) number of lines to a single line [Shell] - Linux

So I have data that looks like this
/blah
etc1: etc
etc2
etc3: etc
etc4
/blah
etc1: etc
etc2
etc3
/blah
etc1: etc
etc2
etc3: etc
etc4
/blah
etc1
etc2
So I can't count on a specific number of lines; my thought was to use / as a delimiter and put every line that follows, up to the next /, on the same line (comma-delimited?).
Ideal Expected Output:
/blah,etc1: etc,etc2,etc3: etc,etc4,,
/blah,etc1: etc,etc2,etc3,,
/blah,etc1: etc,etc2,etc3: etc,etc4,,
/blah,etc1,etc2,,
I'd prefer shell/bash/ksh, but an Excel solution would work too.

Here's an awk solution:
awk '
/^\// { if (NR > 1) print ","; printf "%s,", $0; next }   # delimiter line: finish the previous output line, start a new one
{ gsub(/^ +| +$/, ""); printf "%s,", $0 }                 # data line: trim surrounding blanks, append with a comma
END { print "," }                                         # finish the last output line
' file
Note that it assumes that the input file starts with a /blah-like line, but doesn't end with one.
Crammed into a (less readable) one-liner:
awk '/^\// {if(NR>1) print","; printf"%s,",$0; next} {gsub(/^ +| +$/, ""); printf"%s,",$0} END {print","}' file
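If the input might be empty or might end with a /blah line, a variant that buffers each record and only flushes complete records avoids those assumptions (a lightly tested sketch of the same idea):
awk '
/^\//  { if (line != "") print line ",,"; line = $0; next }   # flush the previous record, start a new one
       { gsub(/^ +| +$/, ""); line = line "," $0 }            # trim and append a continuation line
END    { if (line != "") print line ",," }                    # flush the final record
' file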

A sed solution:
sed -r ':a;N;$!ba;s/\n\s+/,/g' input | sed 's/$/,,/'
you get:
/blah,etc1: etc,etc2,etc3: etc,etc4,,
/blah,etc1: etc,etc2,etc3,,
/blah,etc1: etc,etc2,etc3: etc,etc4,,
/blah,etc1,etc2,,
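If you'd rather avoid the second sed invocation: after the :a;N;$!ba loop the whole file sits in one pattern space, so $ only matches its very end, and you have to append the trailing ,, before each embedded newline yourself. A single-invocation sketch (GNU sed assumed, as above):
sed -r ':a;N;$!ba;s/\n\s+/,/g;s/\n/,,\n/g;s/$/,,/' input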

Related

Match lines based on patterns and reformat file Bash/Linux

I am looking preferably for a bash/Linux method for the problem below.
I have a text file (input.txt) that looks like so (and many many more lines):
TCCTCCGC+TAGTTAGG_Vel_24_CC_LlanR_34 CC_LlanR
GGAGTATG+TCTATTCG_Vel_24_CC_LlanR_22 CC_LlanR
TTGACTAG+TGGAGTAC_Vel_02_EN_DavaW_11 EN_DavaW
TCGAATAA+TGGTAATT_Vel_24_CC_LlanR_23 CC_LlanR
CTGCTGAA+CGTTGCGG_Vel_02_EN_DavaW_06 EN_DavaW
index_07_barcode_04_PA-17-ACW-04 17-ACW
index_09_barcode_05_PA-17-ACW-05 17-ACW
index_08_barcode_37_PA-21-YC-15 21-YC
index_09_barcode_04_PA-22-GB-10 22-GB
index_10_barcode_37_PA-28-CC-17 28-CC
index_11_barcode_29_PA-32-MW-07 32-MW
index_11_barcode_20_PA-32-MW-08 32-MW
I want to produce a file that looks like
CC_LlanR(TCCTCCGC+TAGTTAGG_Vel_24_CC_LlanR_34,GGAGTATG+TCTATTCG_Vel_24_CC_LlanR_22,TCGAATAA+TGGTAATT_Vel_24_CC_LlanR_23)
EN_DavaW(TTGACTAG+TGGAGTAC_Vel_02_EN_DavaW_11,CTGCTGAA+CGTTGCGG_Vel_02_EN_DavaW_06)
17-ACW(index_07_barcode_04_PA-17-ACW-04,index_09_barcode_05_PA-17-ACW-05)
21-YC(index_08_barcode_37_PA-21-YC-15)
22-GB(index_09_barcode_04_PA-22-GB-10)
28-CC(index_10_barcode_37_PA-28-CC-17)
32-MW(index_11_barcode_29_PA-32-MW-07,index_11_barcode_20_PA-32-MW-08)
I thought that I could do something along the lines of this.
cat input.txt | awk '{print $1}' | grep -e "CC_LlanR" | paste -sd',' > intermediate_file
cat input.txt | awk '{print $2"("}' something something??
But I only know how to grep one pattern at a time? Is there a way to find all the matching lines at once and output them in this format?
Thank you!
(Happy Easter/ long weekend to all!)
With your shown samples, please try the following.
awk '
FNR==NR{
  arr[$2]=(arr[$2]?arr[$2]",":"")$1
  next
}
($2 in arr){
  print $2"("arr[$2]")"
  delete arr[$2]
}
' Input_file Input_file
2nd solution: within a single read of Input_file, try the following.
awk '{arr[$2]=(arr[$2]?arr[$2]",":"")$1} END{for(i in arr){print i"("arr[i]")"}}' Input_file
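One caveat with the single-read version: for(i in arr) visits indices in an unspecified order, so the groups may not come out in input order. With GNU awk you could force a sorted traversal via PROCINFO["sorted_in"] (a sketch):
awk 'BEGIN{PROCINFO["sorted_in"]="@ind_str_asc"} {arr[$2]=(arr[$2]?arr[$2]",":"")$1} END{for(i in arr){print i"("arr[i]")"}}' Input_file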
Explanation (1st solution): adding a detailed explanation for the 1st solution here.
awk '                                ##Starting awk program from here.
FNR==NR{                             ##Checking condition FNR==NR which will be TRUE when Input_file is read the first time.
  arr[$2]=(arr[$2]?arr[$2]",":"")$1  ##Creating array indexed by the 2nd field, appending $1 values separated by commas.
  next                               ##next will skip all further statements from here.
}
($2 in arr){                         ##Checking condition: if the 2nd field is present in arr then do the following.
  print $2"("arr[$2]")"              ##Printing 2nd field with its collected values ( arr[$2] ) in parentheses.
  delete arr[$2]                     ##Deleting arr value with 2nd field index here.
}
' Input_file Input_file              ##Mentioning Input_file names here.
Assuming your input is grouped by the $2 value as shown in your example (if it isn't, just run sort -k2,2 on your input first), here's a solution that makes one pass, stores only one token at a time in memory, and produces the output in the same order of $2s as the input:
$ cat tst.awk
BEGIN { ORS="" }
$2 != prev {
    printf "%s%s(", ORS, $2
    ORS = ")\n"
    sep = ""
    prev = $2
}
{
    printf "%s%s", sep, $1
    sep = ","
}
END { print "" }
$ awk -f tst.awk input.txt
CC_LlanR(TCCTCCGC+TAGTTAGG_Vel_24_CC_LlanR_34,GGAGTATG+TCTATTCG_Vel_24_CC_LlanR_22)
EN_DavaW(TTGACTAG+TGGAGTAC_Vel_02_EN_DavaW_11)
CC_LlanR(TCGAATAA+TGGTAATT_Vel_24_CC_LlanR_23)
EN_DavaW(CTGCTGAA+CGTTGCGG_Vel_02_EN_DavaW_06)
17-ACW(index_07_barcode_04_PA-17-ACW-04,index_09_barcode_05_PA-17-ACW-05)
21-YC(index_08_barcode_37_PA-21-YC-15)
22-GB(index_09_barcode_04_PA-22-GB-10)
28-CC(index_10_barcode_37_PA-28-CC-17)
32-MW(index_11_barcode_29_PA-32-MW-07,index_11_barcode_20_PA-32-MW-08)
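For ungrouped input, the pre-sort mentioned above just slots in front of the same script:
$ sort -k2,2 input.txt | awk -f tst.awk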
This might work for you (GNU sed):
sed -E 's/^(\S+)\s+(\S+)/\2(\1)/;H
x;s/(\n\S+)\((\S+)\)(.*)\1\((\S+)\)/\1(\2,\4)\3/;x;$!d;x;s/.//' file
Append each manipulated line to the hold space.
Before moving on to the next line, accumulate like keys into a single line.
Delete every line except the last.
Replace the last line by the contents of the hold space.
Remove the first character (newline artefact introduced by the H command) and print the result.
N.B. The final solution is unsorted and in the original order.

How to modify a text file so that every line has the same number of columns?

I've got a text file with several lines. Every line has words separated by commas, and the number of words per line is not the same. With the help of the awk command, I would like to make every line have the same number of columns. For example, if the text file is as follows:
word1, text, help, test
number, begin
last, line, line
I would like the output to be as follows, where every line has the same number of columns, padded with extra null words:
word1, text, help, test
number, begin, null, null
last, line, line, null
I tried the following code:
awk '{print $0,Null}' file.txt
$ awk 'BEGIN {OFS=FS=", "}
NR==FNR {max=max<NF?NF:max; next}
{for(i=NF+1;i<=max;i++) $i="null"}1' file{,}
First scan to find the max number of columns, then fill the missing entries in the second round. If the first line contains all the columns (a header, perhaps), you can change it to
$ awk 'BEGIN {OFS=FS=", "}
NR==1 {max=NF}
{for(i=NF+1;i<=max;i++) $i="null"}1' file
file{,} is expanded by bash to file file, a neat trick to avoid repeating the filename (and it eliminates possible typos).
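You can see the expansion with echo:
$ echo file{,}
file file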
Passing twice through the input file, using getline on first pass:
awk '
BEGIN {
    OFS=FS=", "
    while(getline < ARGV[1]) {
        if (NF > max) {max = NF}
    }
    close(ARGV[1])
}
{ for(i=NF+1; i<=max; i++) $i="null" } 1
' file.txt
Alternatively, keeping it simple by running awk twice...
#!/bin/bash
infile="file.txt"
maxfields=$(awk 'BEGIN {FS=", "} {if (NF > max) {max = NF}} END{print max}' "$infile" )
awk -v max="$maxfields" 'BEGIN {OFS=FS=", "} {for(i=NF+1;i<=max;i++) $i="null"} 1' "$infile"
Use these Perl one-liners. The first one goes through the file and finds the max number of fields to use. The second one goes through the file and prints the input fields, padded at the end with "null" strings:
export num_fields=$( perl -F'/,\s+/' -lane 'print scalar @F;' in_file | sort -nr | head -n1 )
perl -F'/,\s+/' -lane 'print join ", ", map { defined $F[$_] ? $F[$_] : "null" } 0..( $ENV{num_fields} - 1 );' in_file > out_file
The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-n : Loop over the input one line at a time, assigning it to $_ by default.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.
-a : Split $_ into array @F on whitespace or on the regex specified in -F option.
-F'/,\s+/' : Split into @F on comma with whitespace.
SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches

Filling empty spaces in a CSV file

I have a CSV file where some columns are empty such as
oski14,safe,0,13,53,4
oski15,Unknow,,,,0
oski16,Unknow,,,,0
oski17,Unknow,,,,0
oski18,unsafe,0.55,,1,2
oski19,unsafe,0.12,4,,56
How do I replace all the empty columns with the word "empty"?
I have tried using awk (which is a command I am learning to use).
I want to have
oski14,safe,0,13,53,4
oski15,Unknow,empty,empty,empty,0
oski16,Unknow,empty,empty,empty,0
oski17,Unknow,empty,empty,empty,0
oski18,unsafe,0.55,empty,1,2
oski19,unsafe,0.12,4,empty,56
I tried to replace just the 3rd column to see if I was on the right track
awk -F '[[:space:]]' '$2 && !$3{$3="empty"}1' file
this left me with
oski14,safe,0,13,53,4
oski15,Unknow,,,,0
oski16,Unknow,,,,0
oski17,Unknow,,,,0
oski18,unsafe,0.55,,1,2
oski19,unsafe,0.12,4,,56
I have also tried
nawk -F, '{$3="\ "?"empty":$3;print}' OFS="," file
this resulted in
oski14,safe,empty,13,53,4
oski15,Unknow,empty,,,0
oski16,Unknow,empty,,,0
oski17,Unknow,empty,,,0
oski18,unsafe,empty,,1,2
oski19,unsafe,empty,4,,56
Lastly I tried
awk '{if (!$3) {print $1,$2,"empty"} else {print $1,$2,$3}}' file
this left me with
oski14,safe,empty,13,53,4 empty
oski15,Unknow,empty,,,0 empty
oski16,Unknow,empty,,,0 empty
oski17,Unknow,empty,,,0 empty
oski18,unsafe,empty,,1,2 empty
oski19,unsafe,empty,4,,56 empty
With a sed that supports EREs with a -E argument (e.g. GNU sed or OSX/BSD sed):
$ sed -E 's/(^|,)(,|$)/\1empty\2/g; s/(^|,)(,|$)/\1empty\2/g' file
oski14,safe,0,13,53,4
oski15,Unknow,empty,empty,empty,0
oski16,Unknow,empty,empty,empty,0
oski17,Unknow,empty,empty,empty,0
oski18,unsafe,0.55,empty,1,2
oski19,unsafe,0.12,4,empty,56
You need to do the substitution twice because given contiguous commas like ,,, one regexp match would use up the first 2 ,s and so you'd be left with ,empty,,.
The above would change a completely empty line into empty; let us know if that's an issue.
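You can watch the double substitution at work on a small example; the g flag cannot rescan the , that the previous match already consumed:
$ echo 'a,,,b' | sed -E 's/(^|,)(,|$)/\1empty\2/g'
a,empty,,b
$ echo 'a,,,b' | sed -E 's/(^|,)(,|$)/\1empty\2/g; s/(^|,)(,|$)/\1empty\2/g'
a,empty,empty,b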
This is the awk command:
awk 'BEGIN { FS=","; OFS="," }; { for (i=1;i<=NF;i++) { if ($i == "") { $i = "empty" }}; print $0 }' yourfile
As suggested in the comments, you can shorten the BEGIN procedure to FS=OFS="," as awk allows chained assignment (which I did not know, thank you @EdMorton).
I've set FS="," in the BEGIN procedure instead of using the -F, option just for uniformity with setting OFS=",".
Clearly you can put the script in a nicer-looking form:
#!/usr/bin/awk -f
BEGIN {
    FS = ","
    OFS = ","
}
{
    for (i = 1; i <= NF; ++i)
        if ($i == "")
            $i = "empty"
    print $0
}
and use it as a standalone program (you have to chmod +x it), even if this is known to have some drawbacks (consult the comments to this question as well as this answer):
./the_script_above your_file
or
down_the_pipe | ./the_script_above | further_processing
Clearly you are still able to feed the above script to awk this way:
awk -f the_script_above file1 file2

How to make awk grab string in between a second set of single-quotes

Please help, this is driving me mad.
I've got a standard wp-config.php file and I'm trying to get awk to output only the database name, database username and password on a single line, but no matter what I try it spits out either irrelevant nonsense or syntax errors.
define('DB_NAME', 'pinkywp_wrdp1');
/** MySQL database username */
define('DB_USER', 'pinkywp_user1');
/** MySQL database password */
define('DB_PASSWORD', 'Mq2uMCLuGvfyw');
Desired output:
pinkywp_wrdp1 pinkywp_user1 Mq2uMCLuGvfyw
Actual output:
./dbinfo.sh: line 28: unexpected EOF while looking for matching `''
./dbinfo.sh: line 73: syntax error: unexpected end of file
With GNU awk:
Use ' as the field separator and, if the current line contains 5 columns, print the content of column 4 with a trailing blank.
awk -F "'" 'NF==5 {printf("%s ",$4)}' file
Output:
pinkywp_wrdp1 pinkywp_user1 Mq2uMCLuGvfyw
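Note that this leaves a trailing blank and no final newline; printing the separator before each field and adding an END block fixes both (a sketch of the same approach):
awk -F "'" 'NF==5 {printf("%s%s", sep, $4); sep=" "} END {print ""}' file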
$ awk -F"'" '$1~/^define/ && $2~/^DB_/{ printf "%s%s", $4, (++cnt%3 ? OFS : ORS)}'
pinkywp_wrdp1 pinkywp_user1 Mq2uMCLuGvfyw
A few awk solutions:
1) with GNU flavor:
awk -v RS="');" '{ printf "%s%s", (NR==1? "":OFS), substr($NF, 2) }END{ print "" }' file
2) quotes-independent solution:
awk -F', ' '/define/{
    gsub(/^["\047]|["\047]\);$/, "", $2);
    printf "%s%s", (NR==1? "":" "), $2
}
END{ print "" }' file
The output (for both approaches):
pinkywp_wrdp1 pinkywp_user1 Mq2uMCLuGvfyw
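Since the end goal appears to be a dbinfo.sh script, you could capture the three values into shell variables; a bash sketch, assuming the file is named wp-config.php:
read -r dbname dbuser dbpass <<< "$(awk -F "'" 'NF==5 {printf "%s ", $4}' wp-config.php)"
echo "DB: $dbname  USER: $dbuser  PASS: $dbpass"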

Unix command to create new output file by combining 2 files based on condition

I have 2 files. Basically I want to match the column names from File 1 with the column names listed in File 2. The resulting output file should have data for the columns that match File 2 and a Null value for the remaining column names in File 2.
Example:
file1
Name|Phone_Number|Location|Email
Jim|032131|xyz|xyz@qqq.com
Tim|037903|zzz|zzz@qqq.com
Pim|039141|xxz|xxz@qqq.com
File2
Location
Name
Age
Based on these 2 files, I want to create a new file which has data in the below format:
Output:
Location|Name|Age
xyz|Jim|Null
zzz|Tim|Null
xxz|Pim|Null
Is there a way to get this result using join, awk or sed? I tried with join but couldn't get it working.
$ cat tst.awk
BEGIN { FS=OFS="|" }
NR==FNR { names[++numNames] = $0; next }
FNR==1 {
    for (nameNr=1; nameNr<=numNames; nameNr++) {
        name = names[nameNr]
        printf "%s%s", name, (nameNr<numNames ? OFS : ORS)
    }
    for (i=1; i<=NF; i++) {
        name2fldNr[$i] = i
    }
    next
}
{
    for (nameNr=1; nameNr<=numNames; nameNr++) {
        name = names[nameNr]
        fldNr = name2fldNr[name]
        printf "%s%s", (fldNr ? $fldNr : "Null"), (nameNr<numNames ? OFS : ORS)
    }
}
$ awk -f tst.awk file2 file1
Location|Name|Age
xyz|Jim|Null
zzz|Tim|Null
xxz|Pim|Null
Get the book Effective Awk Programming, 4th Edition, by Arnold Robbins.
I'd suggest using csvcut, which is part of CSVKit (https://csvkit.readthedocs.org), along the lines of the following:
#!/bin/bash
HEADERS=File2
PSV=File1
headers=$(tr '\n' ',' < "$HEADERS" | sed 's/,$//')
awk -F'|' '
BEGIN {OFS=FS}
NR==1 {print $0, "Age"; next}
{print $0, "Null"}' "$PSV" |
csvcut -d'|' -c "$headers"
I realize this may not be entirely satisfactory, but csvcut doesn't currently have options to handle missing columns or translate missing data to a specified value.