Merge values for same key - linux

Is it possible to use awk to merge values with the same key into one row?
For instance
a,100
b,200
a,131
a,102
b,203
b,301
Can I convert them to a file like this:
a,100,131,102
b,200,203,301

You can use awk like this:
awk -F, '{a[$1] = a[$1] FS $2} END{for (i in a) print i a[i]}' file
a,100,131,102
b,200,203,301
We use -F, to set comma as the field separator, and the array a to accumulate the values for each key.
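Note that for (i in a) visits the keys in an unspecified order. If you need the keys sorted, pipe the result through sort, or, with GNU awk only, request sorted traversal. A sketch of the gawk-specific variant:
awk -F, '{a[$1] = a[$1] FS $2} END{PROCINFO["sorted_in"] = "@ind_str_asc"; for (i in a) print i a[i]}' file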
Reference: Effective AWK Programming

If Perl is an option,
perl -F, -lane '$a{$F[0]} = "$a{$F[0]},$F[1]"; END{for $k (sort keys %a){print "$k$a{$k}"}}' file
These command-line options are used:
-n loop around each line of the input file
-l removes newlines before processing, and adds them back in afterwards
-a autosplit mode – split input lines into the @F array. Defaults to splitting on whitespace.
-e execute the perl code
-F autosplit modifier, in this case splits on ,
@F is the array of words in each line, indexed starting with $F[0]
$F[0] is the first element in @F (the key)
$F[1] is the second element in @F (the value)
%a is a hash which stores a string containing all matches of each key
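If you prefer collecting the values in a hash of arrays rather than concatenating strings, a roughly equivalent sketch (same assumptions about the input) is:
perl -F, -lane 'push @{$h{$F[0]}}, $F[1]; END{print join(",", $_, @{$h{$_}}) for sort keys %h}' file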

tl;dr
If you presort the input, it is possible to use sed to join the lines, e.g.:
sort foo | sed -nE ':a; $p; N; s/^([^,]+)([^\n]+)\n\1/\1\2/; ta; P; s/.+\n//; ba'
A bit more explanation
The above one-liner can be saved into a script file. See below for a commented version.
parse.sed
# A goto label
:a
# Always print when on the last line
$p
# Read one more line into pattern space and join the
# two lines if the key fields are identical
N
s/^([^,]+)([^\n]+)\n\1/\1\2/
# Jump to label 'a' and redo the above commands if the
# substitution command was successful
ta
# Assuming sorted input, we have now collected all the
# fields for this key, print it and move on to the next
# key
P
s/.+\n//
ba
The logic here is as follows:
1. Assume sorted input.
2. Look at two consecutive lines. If their key fields match, remove the key from the second line and append the value to the first line.
3. Repeat step 2 until the key match fails.
4. Print the collected values and reset to collect values for the next key.
Run it like this:
sort foo | sed -nEf parse.sed
Output:
a,100,102,131
b,200,203,301

With datamash
$ datamash -st, -g1 collapse 2 <ip.txt
a,100,131,102
b,200,203,301
From the manual:
-s, --sort
sort the input before grouping; this removes the need to manually pipe the input through 'sort'
-t, --field-separator=X
use X instead of TAB as field delimiter
-g, --group=X[,Y,Z]
group via fields X,[Y,Z]
collapse
comma-separated list of all input values
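datamash supports other per-group operations besides collapse; for instance, to also count the values per key (a sketch, same GNU datamash assumed):
datamash -st, -g1 count 2 collapse 2 <ip.txt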

Related

Bash script sorting of lines in file

Hi, I'm writing a bash script. I need to remove blank lines and sort the numbers in ascending order. Can anyone help?
for line in `sed '/^$/d' $userInput`; do
    myarray[$index]="$line"
    index=$(($index+1))
done
Currently I'm using the code above to remove blank lines, but I am not able to sort the numbers.
$userInput is the file. The file contains a few lines of numbers, e.g.
1,4,5,6,2,3
If you have perl, and assuming index is zero-based, you could do:
declare -a myarray=($(
  perl -F, -lanE '/\S/ and say join ",", sort {$a <=> $b} map {$_+0} @F' "$userInput"
))
index=${#myarray[@]}
The Perl script does:
read input from file $userInput
-F, : set comma as delimiter for autosplit option
-l : don't treat newline character as part of line
-a : autosplit input lines into array @F
-n : loop over the input lines (implied by -a in recent perls; added explicitly here)
-E : execute the perl code (like -e, but enables newer features such as say)
/\S/ and : require that the input line contains non-whitespace (if not true, the following command is skipped)
map {$_+0} @F : convert the elements of @F to numeric (i.e. strip whitespace) (denote the result as @r1)
sort {$a <=> $b} @r1 : numerically sort the elements of @r1 (denote the result as @r2)
join ",", @r2 : construct a comma-delimited string from the elements of @r2 (denote the result as @r3)
say @r3 : output @r3 with a trailing newline
The lines output by the Perl script are used as elements of a new bash array myarray. Finally, we set index from the number of elements in myarray.
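For comparison, a pure-shell sketch of the same per-line transformation (assuming GNU sort and paste; it sorts each non-blank line's numbers and rejoins them with commas):
while IFS= read -r line; do
  [[ $line ]] || continue                          # skip blank lines
  tr ',' '\n' <<< "$line" | sort -n | paste -sd, -
done < "$userInput"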

How to parse a specific column or data without losing content from other columns/rows after parsing?

I have the following input file and need to grep a value, in this case "225". The value is actually a variable $pd, so it changes depending on the user's input. It can be an integer or an alphanumeric string, matched case-insensitively and exactly. For example, if the value of the variable is "225", then "0225" or "11225" is not a valid match.
Input File:
10.20.223.10|2000-H1|1/1/2|DeviceX_4021|LG
10.20.223.10|2000-H1|1/1/3|Undiscoverable|Unkwn
10.20.225.10|2000-H1|1/1/5|DeviceZ_2050|LG
10.20.223.10|2000-H1|1/1/8|DeviceY_225_|Kenmore
10.20.223.10|2000-H1|1/1/8|DeviceY_01225_|Kenmore
10.20.225.10|2000-H1|1/1/8|DeviceY_2250_|Kenmore
Desired Output File:
10.20.223.10|2000-H1|1/1/8|DeviceY_225_|Kenmore
If the user input is "lg", it should output the matching lines rather than ignoring them, even though the input file has "LG" in uppercase. (This part is already fixed in the script.)
Desired Output:
10.20.223.10|2000-H1|1/1/2|DeviceX_4021|LG
10.20.225.10|2000-H1|1/1/5|DeviceZ_2050|LG
$ awk -F'|' -v n='225' '$4 ~ n' file
10.20.223.10|2000-H1|1/1/8|DeviceY_225_|Kenmore
or if you don't want a partial match (e.g. against 1225) then one way is:
$ awk -F'|' -v n='225' '$4 ~ ("(^|[^0-9])" n "([^0-9]|$)")' file
10.20.223.10|2000-H1|1/1/8|DeviceY_225_|Kenmore
or:
$ awk -F'|' -v n='225' '$4 ~ ("(^|_)" n "(_|$)")' file
10.20.223.10|2000-H1|1/1/8|DeviceY_225_|Kenmore
There are other possibilities too. The right solution depends on requirements you haven't told us about, and may pass or fail on input other than what you've shown us so far.
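As an aside, for the case-insensitive vendor match mentioned in the question (which the asker says is already handled), a comparable awk sketch would lowercase both sides before comparing:
awk -F'|' -v n='lg' 'tolower($5) == tolower(n)' file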
awk
awk -F"|" -v var="[A-Za-z].225_" '$4 ~ var{print}'
sed
sed -n '/[A-Za-z].225./p'
grep
grep '[A-Za-z].225.'
Output
10.20.223.10|2000-H1|1/1/8|DeviceY_225_|Kenmore
Using sed:
sed -n '/^\([^|]*|\)\{3\}[^|]*225/p' < input
Explanation:
the -n option disables automatic output at the end of each sed cycle
the pattern matches arbitrary contents of the first three (\{3\}) columns of data via the \(parenthesized\) pattern [^|]*| -- any number of non-delimiter characters followed by the column delimiter
it matches additional input at the beginning of the fourth column, but not spanning columns, with a similar subexpression: [^|]*
then comes the literal text you want to match
the p command after the pattern causes the line to be printed to sed's output in the event that it matches the pattern
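With extended regular expressions (sed -E) the same pattern needs less escaping; a bracket expression keeps the delimiter literal. An equivalent sketch:
sed -nE '/^([^|]*[|]){3}[^|]*225/p' < input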
There's almost certainly an awk solution too, but in Perl it's this:
$ perl -aF'\|' -ne '$F[3] =~ /(^|[^0-9])225([^0-9]|$)/ and print' < input
10.20.223.10|2000-H1|1/1/8|DeviceY_225_|Kenmore
-a: Autosplit the input into array @F
-F'\|': Set the autosplit delimiter to |
-n: Run code for each line in the input file
-e: Here's the code to run
$F[3]: The 4th element of the autosplit array @F
=~: Regex match
and print: Print the input line if the regex matches
Update: You can get the string you're interested in from a command line parameter by assigning it in a BEGIN block.
$ perl -aF'\|' -ne 'BEGIN { $x = shift } $F[3] =~ /(^|[^0-9])$x([^0-9]|$)/ and print' 225 < input

Print range of rows if a condition is met in AWK

What I am trying to do is show the 2 rows above and 2 rows below a line that meets a certain criterion, using awk and without a pipe. For example, I am searching for the string 's62234', and when it is found I want to print that row together with the two rows before it and the two rows after it.
This is the file I am using (thefmifile.txt)
s62098:x:1271:504:Velizar Vrabchev,SI,3,1:/home/SI/s62098:/bin/bash
s62101:x:1272:504:Georgi Georgiev,SI,3,5:/home/SI/s62101:/bin/bash
s62108:x:1273:504:Sherif Kunch,SI,3,1:/home/SI/s62108:/bin/bash
s62111:x:1274:504:Yulian Bizeranov,SI,3,3:/home/SI/s62111:/bin/bash
s62121:x:1275:504:Daniel Dimitrov,SI,2,1:/home/SI/s62121:/bin/bash
s62133:x:1276:504:Ivaylo Ivanov,SI,2,2:/home/SI/s62133:/bin/bash
s62160:x:1277:504:Veniyana Tsolova,SI,2,3:/home/SI/s62160:/bin/bash
s62199:x:1278:504:Nikola Petrov,SI,2,5:/home/SI/s62199:/bin/bash
s62219:x:1279:504:Viliyan Ivanov,SI,2,6:/home/SI/s62219:/bin/bash
s62234:x:1280:504:Viktoriya Dobreva,SI,2,3:/home/SI/s62234:/bin/bash
s855264:x:1281:504:Toni Dupkarski,SI,4,2:/home/SI/s855264:/bin/bash
s81555:x:1282:503:Elena Georgieva,KN,2,0:/home/KN/s81555:/bin/bash
s81585:x:1283:503:Stela Marinova,KN,2,0:/home/KN/s81585:/bin/bash
s81441:x:1284:503:Vesela Plamenova Borislavova , KN, k2, g7:/home/KN/s81441:/bin/bash
s81644:x:1285:503:Viktor Rusev, KN, k2, g7:/home/KN/s81644:/bin/bash
s81628:x:1286:503:Iliyan Yordanov Yordanov, KN, k2, g6:/home/KN/s81628:/bin/bash
s81490:x:1287:503:Yana Spasova, KN, k2, g6:/home/KN/s81490:/bin/bash
What I have tried is using awk to find the row that meets the criterion and using NR to get the numbers of the other rows needed, but it seems I am missing something.
Here is the command I used:
cat thefmifile.txt | awk -F ':' '$1==s62234 {for (x = NR -2; x <= NR + 2; x++){print}}'
The output is not what I expect.
And this is the desired output:
s62199:x:1278:504:Nikola Petrov,SI,2,5:/home/SI/s62199:/bin/bash
s62219:x:1279:504:Viliyan Ivanov,SI,2,6:/home/SI/s62219:/bin/bash
s62234:x:1280:504:Viktoriya Dobreva,SI,2,3:/home/SI/s62234:/bin/bash
s855264:x:1281:504:Toni Dupkarski,SI,4,2:/home/SI/s855264:/bin/bash
s81555:x:1282:503:Elena Georgieva,KN,2,0:/home/KN/s81555:/bin/bash
With {print x} it shows the numbers of the lines I need, but is there some way to access the lines of the file as elements of an array, so I could just use this 'x' as an index (e.g. something like NR[x])?
Or Is there some other way to retrieve these rows?
Thank you!
$ awk -v n=2 -F':' '$1=="s62234"{for (i=0;i<n;i++) print buf[(NR+i)%n]; c=n+1} c&&c--; {buf[NR%n]=$0}' file
s62199:x:1278:504:Nikola Petrov,SI,2,5:/home/SI/s62199:/bin/bash
s62219:x:1279:504:Viliyan Ivanov,SI,2,6:/home/SI/s62219:/bin/bash
s62234:x:1280:504:Viktoriya Dobreva,SI,2,3:/home/SI/s62234:/bin/bash
s855264:x:1281:504:Toni Dupkarski,SI,4,2:/home/SI/s855264:/bin/bash
s81555:x:1282:503:Elena Georgieva,KN,2,0:/home/KN/s81555:/bin/bash
buf[] is just an array storing the n lines preceding the current line, so those can be printed when your $1=="s62234" condition is met. c&&c--; is a condition that stays true while c is non-zero, which causes awk to print (the default action) the current line plus the n subsequent lines, since c=n+1 is also set when your condition is met; in other words, it prints the current line and decrements c until c reaches zero.
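To avoid hard-coding the key, the same script can take it as a variable; a small variation on the above:
awk -v n=2 -v key='s62234' -F':' '$1==key{for (i=0;i<n;i++) print buf[(NR+i)%n]; c=n+1} c&&c--; {buf[NR%n]=$0}' file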
You could also try the following; a simple grep can handle this task.
grep -A2 -B2 '^s62234:' Input_file
Equivalently, -C2 asks grep for two lines of context on both sides:
grep -C2 '^s62234:' Input_file
That's easily doable with grep:
-B, --before-context=NUM print NUM lines of leading context
-A, --after-context=NUM print NUM lines of trailing context
-C, --context=NUM print NUM lines of output context
-NUM same as --context=NUM
With awk, you could do something like this:
awk -F ':' '$1=="s62234"{print l2;print l1;a=3}a&&a-->0{print}{l2=l1;l1=$0}' thefmifile.txt
You can handle the number of before-lines dynamically by storing them in an array and using a loop.
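A sketch of that idea, keeping every line seen so far in an array (fine for small files; the key and window size are parameters here):
awk -F':' -v key='s62234' -v n=2 '
  $1==key { for (i=NR-n; i<NR; i++) if (i>0) print saved[i]; c=n+1 }
  c&&c--
  { saved[NR]=$0 }' thefmifile.txt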

rearranging column based on condition

I have a *.csv file. with value as below
"ASDP02","8801942183589"
"ASDP06","8801939151023"
"CSDP04","8801963981740"
"ASDP09","8801946305047"
"ASDP12","8801941195677"
"ASDP05","8801922826186"
"CSDP08","8801983008938"
"ASDP04","8801944346555"
"CSDP11","8801910831518"
or sometimes the value is as below
"8801989353984","KSDP05"
"8801957608165","ASDP11"
"8801991455848","CSDP10"
"8801981363116","CSDP07"
"8801921247870","KSDP07"
"8801965386240","CSDP06"
"8801956293036","KSDP10"
"8801984383904","KSDP11"
"8801944211742","ASDP09"
I just want to put the numeric value (e.g. 8801989353984) always in the 1st column. Is this possible using a bash script?
Sed is also your friend here
Input
cat 41189347
"ASDP02","8801942183589"
"ASDP06","8801939151023"
"CSDP04","8801963981740"
"ASDP09","8801946305047"
"ASDP12","8801941195677"
"ASDP05","8801922826186"
"CSDP08","8801983008938"
"ASDP04","8801944346555"
"CSDP11","8801910831518"
Script
sed -E 's/^("[[:alpha:]]+.*"),("[[:digit:]]+")$/\2,\1/' 41189347
Output
"8801942183589","ASDP02"
"8801939151023","ASDP06"
"8801963981740","CSDP04"
"8801946305047","ASDP09"
"8801941195677","ASDP12"
"8801922826186","ASDP05"
"8801983008938","CSDP08"
"8801944346555","ASDP04"
"8801910831518","CSDP11"
awk to the rescue!
$ awk -F, -v OFS=, '$1~/[A-Z]/{t=$2;$2=$1;$1=t}1' file
If the first field has alphabetic characters, swap the first and second columns and print.
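An equivalent sketch that inverts the test, anchoring on the quoted digits instead:
awk -F, -v OFS=, '$1 !~ /^"[0-9]+"$/{t=$2; $2=$1; $1=t} 1' file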
Bash can do the work, but awk might be a better choice for rearranging your file:
sample.csv:
"ASDP02","8801942183589"
"8801944211742","ASDP09"
command:
awk -F, 'BEGIN{OFS=","}{$1=$1;if(substr($1, 2, length($1) - 2) + 0 == substr($1, 2, length($1) - 2)){print $1,$2}else{print $2,$1}}' sample.csv
substr($1, 2, length($1) - 2) + 0 == substr($1, 2, length($1) - 2) checks whether the first column is numeric. If it is, print the original line; otherwise switch column 1 and column 2.
Output:
"8801942183589","ASDP02"
"8801944211742","ASDP09"
You can create a pure bash script to generate another file that has the structure you need:
#!/bin/bash
csv_file="/path/to/your/csvfile"
output_file="/path/to/output_file"
#Optional
rm -rf "${output_file}"
readarray -t LINES < <(cat < "${csv_file}" 2> /dev/null)
for item in "${LINES[@]}"; do
  if [[ $item =~ ^\"([0-9A-Z]+)\"\,\"([0-9]+)\" ]]; then
    echo "\"${BASH_REMATCH[2]}\",\"${BASH_REMATCH[1]}\"" >> "${output_file}"
  else
    echo "$item" >> "${output_file}"
  fi
done
This works even if your file is "mixed", i.e. with some lines in the right format and other lines in the wrong format.
The following commands assume that the cells in the CSV files do not contain newlines or commas. Otherwise, you should write a more complicated script in Perl, PHP, or another programming language capable of parsing CSV files properly. Bash, however, is definitely not appropriate for this task.
Perl
perl -F, -nle '@F = reverse @F if $F[0] =~ /^"\d+"$/;
               print join(",", @F)' file
Beware: if the cells contain newlines or commas, use Perl's Text::CSV module, for instance. Although it is a simple task in Perl, it goes beyond the scope of the current question.
The command splits the input lines by commas (-F,) and stores the result into the @F array, for each line. The items in the array are reversed if the first field $F[0] matches the regular expression. You can also swap the items this way: ($F[0], $F[1]) = ($F[1], $F[0]).
Finally, it joins the array items with commas and prints the result to the standard output.
If you want to edit the file in-place, use -i option: perl -i.backup -F, ....
AWK
awk -F, -vOFS=, '/^"[0-9]+",/ {print; next}
{ t = $1; $1 = $2; $2 = t; print }' file
The input and output field separators are set to , with -F, and -vOFS=,.
If the line matches the pattern /^"[0-9]+",/ (the line begins with a "numeric" CSV column), the script prints the record and advances to the next record. Otherwise the next block is executed.
In the next block, it swaps the first two columns and prints the result to the standard output.
If you want to edit the file in-place, see answers to this question.

Getting n-th line of text output

I have a script that generates two lines of output each time. I'm really just interested in the second line. Moreover, I'm only interested in the text that appears between a pair of #'s on the second line. Additionally, between the hashes another delimiter is used: ^A. It would be great if I could also break apart each piece of text that is ^A-delimited. (Note that ^A is the SOH control character and can be typed using Ctrl-A.)
output | sed -n '1p' #prints the 1st line of output
output | sed -n '1,3p' #prints the 1st, 2nd and 3rd line of output
your.program | tail -n +2 | cut -d# -f2
should get you 2/3 of the way.
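The remaining third, splitting the ^A-delimited parts onto separate lines, could be appended with tr (a sketch; '\001' is the octal escape for the SOH character):
your.program | tail -n +2 | cut -d# -f2 | tr '\001' '\n'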
Improving Grumdrig's answer:
your.program | head -n 2 | tail -n 1 | cut -d# -f2
I'd probably use awk for that.
your_script | awk -F# 'NR == 2 && NF == 3 {
num_tokens=split($2, tokens, "^A")
for (i = 1; i <= num_tokens; ++i) {
print tokens[i]
}
}'
This says
1. Set the field separator to #
2. On lines that are the 2nd line, and also have 3 fields (text#text#text)
3. Split the middle (2nd) field using "^A" as the delimiter into the array named tokens
4. Print each token
Obviously this makes a lot of assumptions. You might need to tweak it if, for example, # or ^A can appear legitimately in the data without being separators. But something like that should get you started. You might need to use nawk or gawk or something; I'm not entirely sure whether plain awk can handle splitting on a control character.
bash:
read
read line
result="${line#*#}"
result="${result%#*}"
IFS=$'\001' read -r -a result <<< "$result"
The result array now contains the elements you're interested in. Just pipe the output of the script to this one.
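For example, wrapped in a block so it can be fed from a pipe (a sketch; the printf is only there to show the collected parts):
output | {
  read -r                                   # discard the first line
  read -r line
  result="${line#*#}"
  result="${result%#*}"
  IFS=$'\001' read -r -a result <<< "$result"
  printf '%s\n' "${result[@]}"
}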
Here's a possible awk solution:
awk -F"#" 'NR==2{
for(i=2;i<=NF;i+=2){
split($i,a,"\001") # split on SOH
for(o in a ) print o # print the splitted hash
}
}' file
