Retaining one member of a pair - linux

Good afternoon to all,
I have a file containing two fields, each representing a member of a pair.
I want to retain one member of each pair and it does not matter which member as these are codes for duplicate samples in a study.
Each pair appears twice in my file, with each member of the pair appearing once in either column.
An example of an input file is:
XXX1 XXX7
XXX2 XXX4
abc2 dcb3
XXX7 XXX1
dcb3 abc2
XXX4 XXX2
And an example of the desired output would be
XXX1
XXX2
abc2
How might this be accomplished in bash? Thank you.

Here is a combination of GNU awk, cut and sort, store the scipt as duplicatePairs.awk:
{ if ( $1 < $2) print $1, $2
else print $2, $1
}
and run it like this: awk -f duplicatePairs.awk your_file | sort -u | cut -d" " -f1
The if sorts the pairs such that a line with x,y and a line with y,x will be printed the same. Then sort -u can remove the duplicate lines. And the cut selects the first column.
With a slightly larger awk script, we can solve the requirements "awk-only":
{
smallest = $1;
if ( $1 > $2) {
smallest = $2
}
if( !(smallest in seen) ) {
seen [ smallest ] = 1
print smallest
}
}
Run it like this: awk -f duplicatePairs.awk your_file

While the answer posted by Lars above works very well I would like to suggest an alternative, just in case someone stumbles upon this problem.
I had previously used awk '!seen[$2,$1]++ {print $1}' to the same result. I didn't realize it had worked since the number of lines in my file wasn't halved. This turned out to be because of some wrong assumptions I made about my data.

Related

AWK make a simple subtract and find the minimum value of that

I have this matrix:
{{1,4},{6,8}}
and I want to substract the second value from the first value like: 4-1 and 8-6
and then, comparer both and show what was the minimun value from both, in this case: 8-6=2
All of this using AWK in terminal
You seem a little confused about whether you want to subtract the first from the second or the second from the first. Also, about whether your data is in a file or a variable. However, this should get you started...
If we replace any mad braces or commas with spaces:
echo "{{1,4},{6,8}}" | awk '{gsub(/[{},]/," "); print}'
1 4 6 8
Now we can access the fields as $1 through $4 and do what you want:
echo "{{1,4},{6,8}}" | awk '{gsub(/[{},]/," "); x=$2-$1; y=$4-$3; if(x<y)print x; else print y}'
2
As a, maybe more elegant, alternative suggested by #3161993 in the comments, you could set the field separator to be one or more open or close braces or commas, like this:
awk -F '[,{}]+' '{x=$3-$2; y=$5-$4; if(x<y) print x; else print y}' <<< "{{1,4},{6,8}}"
2
And, as #EdMorton kindly pointed out, it can be made a bit more succinct with a ternary operator like this:
awk -F '[,{}]+' '{x=$3-$2; y=$5-$4; print (x<y ? x : y)}' <<< "{{1,4},{6,8}}"

How to extract specific value using grep and awk?

I am facing a problem to extract a specific value in a .txt file using grep and awk.
I show below an excerpt from the .txt file:
"-
bravais-lattice index = 2
lattice parameter (alat) = 10.0000 a.u.
unit-cell volume = 250.0000 (a.u.)^3
number of atoms/cell = 2
number of atomic types = 1
number of electrons = 28.00
number of Kohn-Sham states= 18
kinetic-energy cutoff = 60.0000 Ry
charge density cutoff = 300.0000 Ry
convergence threshold = 1.0E-09
mixing beta = 0.7000"
I also defined some variable: ELEMENT and lat.
I want to extract the "unit-cell volume" value which is equal to 250.00.
I tried the following to extract the value using grep and awk:
volume=`grep "unit-cell volume" ./latt.10/$ELEMENT.scf.latt_$lat.out | awk '{printf "%15.12f\n",$5}'`
However, when i run the bash file I always get 00.000000 as a result instead of the correct value of 250.00.
Can anyone help, please?
Thanks in advance.
awk '{printf "%15.12f\n",$5}'
You're asking awk to print out the fifth field of the line ($5).
unit-cell volume = 250.0000 (a.u.)^3
1 2 3 4 5
The fifth field is (a.u.)^3, which you are then asking awk to interpret as a number via the %f format code. It's not a number, though (or actually, doesn't start with a number), and when awk is asked to treat a non-numeric string as a number, it uses 0 instead. Thus it prints 0.
Solution: use $4 instead.
By the way, you can skip invoking grep by using awk itself to select the line, e.g.
awk /^ unit-cell/ {...}
The /^ unit-cell/ is a regular expression that matches "unit-cell" (with a leading space) at the beginning of the line. Adjust as necessary if you have other lines that start with unit-cell which you don't want to select.
You never need grep when you're using awk since awk can do anything useful that grep can do. It sounds like this is all you need:
$ awk -F'=' '/unit-cell volume/{printf "%.2f\n",$2}' file
250.00
The above works because when FS is = that means $2 is <spaces>250.000 (a.u.)^3 and when awk is asked to convert a string to a number it strips off leading spaces and anything after the numeric part so that leaves 250.000 to be converted to a number by %.2f.
In the script you posted $5 was failing because the 5th space-separated field in:
$1 $2 $3 $4 $5
<unit-cell> <volume> <=> <250.0000> <(a.u.)^3>
is (a.u.)^3 - you could have just added print $5 to see that.
Since you are processing key-value pairs where the key can have variable amount on space in it, you need to tune that field number ($4, $5 etc.) separately for each record you want to process unless you set the field separator (FS) appropriately to FS=" *= *". Then the key will always be in $1 and value in $2.
Then use split to split the value and unit parts from each other.
Also, you can loose that grep by defining in awk a pattern (or condition, /unit-cell volume/) for that printaction:
$ awk 'BEGIN{FS=" *= *"} /unit-cell volume/{split($2,a," +");print a[1]}' file
250.0000
Explained:
$ awk '
BEGIN { FS=" *= *" } # set appropriate field separator
/unit-cell volume/ { # pattern or condition
split($2,a," +") # split value part to value and possible unit parts
print a[1] # output value part
}' file

Comparing two files using awk and printing contains which are matching from other files

I have two files:
file1.txt
919167,hutch,mumbai
919594,idea,mumbai
file2.txt
919167000000
919594000000
Output
919167000000,hutch,mumbai
919594000000,idea,mumbai
How can I achieve this using AWK? I've got a huge file of phone numbers which needs to be compared like this. I believe Awk can handle it; if not please let me know how can I do this.
Extra definitions
Is the common part always a 6-digit number? Yes always 6.
Are the two files already sorted? file1 is not sorted. file2 can be sorted.
Are the trailing digits in file 2 always zeros? No, these are phone numbers this can vary, purpose of this is to get series information of the phone number.
Is there any danger of file 1 containing three records for a given number while file 2 contains 2 records, or is it one-to-one? It's one-to-one.
Can there be records in file 1 with no match in file 2, or vice versa?_ Yes.
If so, do you want to see the unmatched records? Yes I want both records.
Extended data
file1.txt
919167,hutch,mumbai
919594,idea,mumbai
918888,airtel,karnataka
file2.txt
919167838888
919594998484
919212334323
Output Expected:
919167838888,hutch,mumbai
919594998484,idea,mumbai
919212334323,nomatch,nomatch
As I noted in a comment, there's a lot of unstated information needed to give a definitive answer. However, we can make some plausible guesses:
The common number is the first 6 digits of file 2 (we don't care about the trailing digits, but will simply copy them to the output).
The files are sorted in order.
If there are unmatched records in either file, those records will be ignored.
The tools of choice are probably sed and join:
sed 's/^\([0-9]\{6\}\)/\1,\1/' file2.txt |
join -t, -o 1.2,2.2,2.3 - file1.txt
This edits file2.txt to create a comma-separated first field with the 6-digit phone number followed by all the rest of the line. The input is fed to the join command, which joins on the first column, and outputs the 'rest of the line' (column 2) from file2.txt and columns 2 and 3 from file1.txt.
If the phone numbers are variable length, then the matching operation is horribly complex. For that, I'd drop into Perl (or Python) to do the work. If the data is unsorted, it can be sorted before being fed into the commands. If you want unmatched records, you can specify how to handle those in the options to join.
The extra information needed is now available. The key information is the 6-digits is fixed — phew! Since you're on Linux, I'm assuming bash is available with 'process substitution':
sort file2.txt |
sed 's/^\([0-9]\{6\}\)/\1,\1/' |
join -t, -o 1.2,2.2,2.3 -a 1 -a 2 -e 'no-match' - <(sort file1.txt)
If process substitution is not available, simply sort file1.txt in situ:
sort -o file1.txt file1.txt
Then use file1.txt in place of <(sort file1.txt).
I think the comment might be asking for inputs such as:
file1.txt
919167,hutch,mumbai
919594,idea,mumbai
902130,airtel,karnataka
file2.txt
919167000000
919594000000
919342313242
Output
no-match,airtel,karnataka
919167000000,hutch,mumbai
919342313242,no-match,no-match
919594000000,idea,mumbai
If that's not what the comment is about, please clarify by editing the question to add the extra data and output in a more readable format than comments allow.
Working with the extended data, this mildly modified command:
sort file2.txt |
sed 's/^\([0-9]\{6\}\)/\1,\1/' |
join -t, -o 1.2,2.2,2.3 -a 1 -e 'no-match' - <(sort file1.txt)
produces the output:
919167838888,hutch,mumbai
919212334323,no-match,no-match
919594998484,idea,mumbai
which looks rather like a sorted version of the desired output. The -a n options control whether the unmatched records from file 1 or file 2 (or both) are printed; the -e option controls the value printed for the unmatched fields. All of this is readily available from the man pages for join, of course.
Here's one way using GNU awk. Run like:
awk -f script.awk file2.txt file1.txt
Contents of script.awk:
BEGIN {
FS=OFS=","
}
FNR==NR {
sub(/[ \t]+$/, "")
line = substr($0, 0, 6)
array[line]=$0
next
}
{
printf ($1 in array) ? $0"\n" : "FILE1 no match --> "$0"\n"
dup[$1]++
}
END {
for (i in array) {
if (!(i in dup)) {
printf "FILE2 no match --> %s\n", array[i]
}
}
}
Alternatively, here's the one-liner:
awk 'BEGIN { FS=OFS="," } FNR==NR { sub(/[ \t]+$/, ""); line = substr($0, 0, 6); array[line]=$0; next } { printf ($1 in array) ? $0"\n" : "FILE1 no match --> "$0"\n"; dup[$1]++} END { for (i in array) if (!(i in dup)) printf "FILE2 no match --> %s\n", array[i] }' file2.txt file1.txt
awk -F, 'FNR==NR{a[$1]=$2","$3;next}{for(i in a){if($1~/i/) print $1","a[i]}}' your_file

sed - pass match to external command

I have written a little script using sed to transform this:
kaefert#Ultrablech ~ $ cat /sys/class/power_supply/BAT0/uevent
POWER_SUPPLY_NAME=BAT0
POWER_SUPPLY_STATUS=Full
POWER_SUPPLY_PRESENT=1
POWER_SUPPLY_TECHNOLOGY=Li-ion
POWER_SUPPLY_CYCLE_COUNT=0
POWER_SUPPLY_VOLTAGE_MIN_DESIGN=7400000
POWER_SUPPLY_VOLTAGE_NOW=8370000
POWER_SUPPLY_POWER_NOW=0
POWER_SUPPLY_ENERGY_FULL_DESIGN=45640000
POWER_SUPPLY_ENERGY_FULL=44541000
POWER_SUPPLY_ENERGY_NOW=44541000
POWER_SUPPLY_MODEL_NAME=UX32-65
POWER_SUPPLY_MANUFACTURER=ASUSTeK
POWER_SUPPLY_SERIAL_NUMBER=
into a csv file format like this:
kaefert#Ultrablech ~ $ Documents/Asus\ Zenbook\ UX32VD/power_to_csv.sh
"date";"status";"voltage µV";"power µW";"energy full µWh";"energy now µWh"
2012-07-30 11:29:01;"Full";8369000;0;44541000;44541000
2012-07-30 11:29:02;"Full";8369000;0;44541000;44541000
2012-07-30 11:29:04;"Full";8369000;0;44541000;44541000
... (in a loop)
What I would like now is to divide each of those numbers by 1.000.000 so that they don't represent µV but V and W instead of µW, so that they are easily interpretable on a quick glance. Of course I could do this manually afterwards once I've opened this csv inside libre office calc, but I would like to automatize it.
So what I found is, that I can call external programs in between sed, like this:
...
s/\nPOWER_SUPPLY_PRESENT=1\nPOWER_SUPPLY_TECHNOLOGY=Li-ion\nPOWER_SUPPLY_CYCLE_COUNT=0\nPOWER_SUPPLY_VOLTAGE_MIN_DESIGN=7400000\nPOWER_SUPPLY_VOLTAGE_NOW=\([0-9]\{1,\}\)/";'`echo 0`'\1/
and that I could get values like I want by something like this:
echo "scale=6;3094030/1000000" | bc | sed 's/0\{1,\}$//'
But the problem now is, how do I pass my match "\1" into the external command?
If you are interested in looking at the full script, you'll find it there:
http://koega.no-ip.org/mediawiki/index.php/Battery_info_to_csv
if your sed is GNU sed. you can use 'e' to pass matched group to external command/tools within sed command.
an example might be helpful to make it clear:
say, you have a problem:
you have a string "120+20foobar" now you want to get the calculation result of 120+20 part, and replace "oo" to "xx" in "foobar" part.
Note that this example is not for solving the problem above, just for
showing the sed 'e' usage
so you could make 120+20 in the first match group, and rest in 2nd group, then pass two groups to different command/tools and then get the result. like:
kent$ echo "100+20foobar"|sed -r 's#([0-9+]*)(.*)#echo \1 \|bc\;echo \2 \| sed "s/oo/xx/g"#ge'
120
fxxbar
in this way, you could nest many seds one in another one, till you get lost. :D
As sed doesn't do arithmetic on its own I would recommend using awk for something like this, e.g. to divide 3rd, 5th and 6th field by a million do something like this:
awk -F';' -v OFS=';' '
NR == 1
NR != 1 {
$3 /= 1e6
$5 /= 1e6
$6 /= 1e6
print
}'
Explanation
-F';' and -v OFS=';' specify the input and output field separator.
NR == 1 pass first line through without change.
NR != 1 if it is not the first line, divide and print.
To divide by 1,000,000 directly, you do so :
Q='3094030/1000000'
sed ':r /^[[:digit:]]\{7\}/{s$\([[:digit:]]*\)\([[:digit:]]\{6\}\)/1000000$\1.\2$;p;d};s:^:0:;br;d'

Getting n-th line of text output

I have a script that generates two lines as output each time. I'm really just interested in the second line. Moreover I'm only interested in the text that appears between a pair of #'s on the second line. Additionally, between the hashes, another delimiter is used: ^A. It would be great if I can also break apart each part of text that is ^A-delimited (Note that ^A is SOH special character and can be typed by using Ctrl-A)
output | sed -n '1p' #prints the 1st line of output
output | sed -n '1,3p' #prints the 1st, 2nd and 3rd line of output
your.program | tail +2 | cut -d# -f2
should get you 2/3 of the way.
Improving Grumdrig's answer:
your.program | head -n 2| tail -1 | cut -d# -f2
I'd probably use awk for that.
your_script | awk -F# 'NR == 2 && NF == 3 {
num_tokens=split($2, tokens, "^A")
for (i = 1; i <= num_tokens; ++i) {
print tokens[i]
}
}'
This says
1. Set the field separator to #
2. On lines that are the 2nd line, and also have 3 fields (text#text#text)
3. Split the middle (2nd) field using "^A" as the delimiter into the array named tokens
4. Print each token
Obviously this makes a lot of assumptions. You might need to tweak it if, for example, # or ^A can appear legitimately in the data, without being separators. But something like that should get you started. You might need to use nawk or gawk or something, I'm not entirely sure if plain awk can handle splitting on a control character.
bash:
read
read line
result="${line#*#}"
result="${result%#*}"
IFS=$'\001' read result -a <<< "$result"
$result is now an array that contains the elements you're interested in. Just pipe the output of the script to this one.
here's a possible awk solution
awk -F"#" 'NR==2{
for(i=2;i<=NF;i+=2){
split($i,a,"\001") # split on SOH
for(o in a ) print o # print the splitted hash
}
}' file

Resources