Delete lines from a file matching first 2 fields from a second file in shell script - linux

Suppose I have setA.txt:
a|b|0.1
c|d|0.2
b|a|0.3
and I also have setB.txt:
c|d|200
a|b|100
Now I want to delete from setA.txt every line whose first 2 fields match those of some line in setB.txt, so the output should be:
b|a|0.3
I tried:
comm -23 <(sort setA.txt) <(sort setB.txt)
But comm compares whole lines, so it won't work. How can I do this?

$ awk -F\| 'FNR==NR{seen[$1,$2]=1;next;} !seen[$1,$2]' setB.txt setA.txt
b|a|0.3
This reads through setB.txt just once, extracts the needed information from it, and then reads through setA.txt while deciding which lines to print.
How it works
-F\|
This sets the field separator to a vertical bar, |.
FNR==NR{seen[$1,$2]=1;next;}
FNR is the number of lines read so far from the current file and NR is the total number of lines read. Thus, when FNR==NR, we are reading the first file, setB.txt. If so, set the value of associative array seen to true, 1, for the key consisting of fields one and two. Lastly, skip the rest of the commands and start over on the next line.
!seen[$1,$2]
If we get to this command, we are working on the second file, setA.txt. Since ! means negation, the condition is true if seen[$1,$2] is false which means that this combination of fields one and two was not in setB.txt. If so, then the default action is performed which is to print the line.
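To try the two-pass idea without touching real files, here is a minimal sketch. It assumes sample data written to a temporary directory (the extra c|e|0.4 line is made up for illustration) and uses the "in" operator, which tests for a key without creating an empty array entry:

```shell
# Hypothetical sample data in a temporary directory.
tmpdir=$(mktemp -d)
printf '%s\n' 'a|b|0.1' 'c|d|0.2' 'b|a|0.3' 'c|e|0.4' > "$tmpdir/setA.txt"
printf '%s\n' 'c|d|200' 'a|b|100' > "$tmpdir/setB.txt"

result=$(awk -F'|' '
  FNR == NR { seen[$1, $2] = 1; next }   # first file: remember each key
  !(($1, $2) in seen)                     # second file: print lines with unseen keys
' "$tmpdir/setB.txt" "$tmpdir/setA.txt")

printf '%s\n' "$result"
rm -rf "$tmpdir"
```

One caveat: if the first file is empty, FNR==NR also holds while the second file is being read, so nothing is printed.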

This should work:
sed -n 's#^\([^|]*|[^|]*\)|.*#/^\1|/d#p' setB.txt | sed -f- setA.txt
How this works:
sed -n 's#^\([^|]*|[^|]*\)|.*#/^\1|/d#p'
generates an output:
/^c|d|/d
/^a|b|/d
which is then used as a sed script for the next sed after the pipe and outputs:
b|a|0.3

(IFS=$'|'; while read -r x y z; do grep -q -P "^\Q$x|$y|\E" setB.txt || echo "$x|$y|$z"; done < setA.txt)
Explanation: grep -q only tests whether grep can find the regexp and produces no output; -P enables Perl syntax, so that the \Q...\E construct makes the | match literally, and ^ anchors the match to the start of the line.
IFS=$'|' makes bash use | instead of whitespace (space, tab, newline) as the token separator when read splits each line.
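If grep -P is not available, the same loop can be sketched in plain POSIX shell. This is only an illustration under assumptions: has_key is a made-up helper name, the sample files live in a temporary directory, and a quoted case pattern is used so the key is compared literally rather than as a regexp:

```shell
# Hypothetical sample data in a temporary directory.
tmpdir=$(mktemp -d)
printf '%s\n' 'a|b|0.1' 'c|d|0.2' 'b|a|0.3' > "$tmpdir/setA.txt"
printf '%s\n' 'c|d|200' 'a|b|100' > "$tmpdir/setB.txt"

# has_key FILE X Y: succeed if some line of FILE starts with "X|Y|".
has_key() {
  while IFS= read -r line; do
    case $line in
      "$2|$3|"*) return 0 ;;   # quoted part is matched literally
    esac
  done < "$1"
  return 1
}

# IFS='|' applies only to this read, so the rest of the script is unaffected.
kept=$(
  while IFS='|' read -r x y z; do
    has_key "$tmpdir/setB.txt" "$x" "$y" || printf '%s\n' "$x|$y|$z"
  done < "$tmpdir/setA.txt"
)

printf '%s\n' "$kept"
rm -rf "$tmpdir"
```

Like the grep loop, this rereads setB.txt for every line of setA.txt, so it is only reasonable for small files.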

Related

Linux Bash: extracting text from file into variable

I haven't found anything that clearly answers my question. Although very close, I think...
I have a file with a line:
# Skipsdata for serienummer 1158
I want to extract the 4 digit number at the end and put it into a variable, this number changes from file to file so I can't just search for "1158". But the "# Skipsdata for serienummer" always remains the same.
I believe that either grep, sed or awk may be the answer but I'm not 100 % clear on their usage.
Using awk:
numberRequired=$(awk '/# Skipsdata for serienummer/{print $NF}' file)
printf "%s\n" "$numberRequired"
1158
You can use grep with the -o switch, which prints only the matched part instead of the whole line.
Print all numbers at the end of lines from file yourFile
grep -Po '\d+$' yourFile
Print all four digit numbers at the end of lines like described in your question:
grep -Po '^# Skipsdata for serienummer \K\d{4}$' yourFile
-P enables perl style regexes which support \d and especially \K.
\d matches any digit (0-9).
\d{4} matches exactly four digits.
\K lets grep forget the previously matched part, such that only the part afterwards is printed.
There are multiple ways to find your number. Assuming the input data is in a file called inputfile:
mynumber=$(sed -n 's/# Skipsdata for serienummer //p' <inputfile) will print only the number and ignore all the other lines;
mynumber=$(grep '^# Skipsdata for serienummer' inputfile | cut -d ' ' -f 5) will filter the relevant lines first, then only output the 5th field (the number)
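A slightly stricter variant of the sed approach, as a sketch: it anchors the fixed prefix, captures only digits, and checks the result before using it. The sample file contents and the variable name serial are assumptions made for the demonstration:

```shell
# Hypothetical input file in a temporary directory.
tmpdir=$(mktemp -d)
printf '%s\n' '# Some header' '# Skipsdata for serienummer 1158' 'data 42' > "$tmpdir/inputfile"

# -n plus the p flag prints only lines where the substitution succeeded;
# \1 is the run of digits that follows the fixed prefix.
serial=$(sed -n 's/^# Skipsdata for serienummer \([0-9][0-9]*\)$/\1/p' "$tmpdir/inputfile")

# Guard against an empty or non-numeric result before using it.
case $serial in
  ''|*[!0-9]*) echo "no serial number found" >&2 ;;
  *) echo "serial: $serial" ;;
esac
rm -rf "$tmpdir"
```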

Find the most common line in a file in bash

I have a file of strings:
string-string-123
string-string-123
string-string-123
string-string-12345
string-string-12345
string-string-12345-123
How do I retrieve the most common line in bash (string-string-123)?
You can use sort with uniq
sort file | uniq -c | sort -n -r
You could use awk to do this:
awk '{++a[$0]}END{for(i in a)if(a[i]>max){max=a[i];k=i}print k}' file
The array a keeps a count of each line. Once the file has been read, we loop through it and find the line with the maximum count.
Alternatively, you can skip the loop in the END block by assigning the line during the processing of the file:
awk 'max < ++c[$0] {max = c[$0]; line = $0} END {print line}' file
Thanks to glenn jackman for this useful suggestion.
It has rightly been pointed out that the two approaches above will only print out one of the most frequently occurring lines in the case of a tie. The following version will print out all of the most frequently occurring lines:
awk 'max<++c[$0] {max=c[$0]} END {for(i in c)if(c[i]==max)print i}' file
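A quick way to check the tie handling, as a sketch with made-up sample data. The final sort is only there because awk's for-in loop visits array keys in no guaranteed order:

```shell
# Hypothetical sample file in which two lines tie for first place.
tmpdir=$(mktemp -d)
printf '%s\n' apple pear apple pear fig > "$tmpdir/lines.txt"

most_common=$(awk '
  { if (++count[$0] > max) max = count[$0] }              # track the highest count seen
  END { for (line in count) if (count[line] == max) print line }
' "$tmpdir/lines.txt" | sort)

printf '%s\n' "$most_common"
rm -rf "$tmpdir"
```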
Tom Fenech's elegant awk answer works great [in the amended version that prints all most frequently occurring lines in the event of a tie].
However, it may not be suitable for large files, because all distinct input lines are stored in an associative array in memory, which could be a problem if there are many non-duplicate lines; that said, it's much faster than the approaches discussed below.
Grzegorz Żur's answer combines multiple utilities elegantly to implicitly produce the desired result, but:
all distinct lines are printed (highest-frequency count first)
output lines are prefixed by their occurrence count (which may actually be desirable).
While you can pipe Grzegorz Żur's answer to head to limit the number of lines shown, you can't assume a fixed number of lines in general.
Building on Grzegorz's answer, here's a generic solution that shows all most-frequently-occurring lines - however many there are - and only them:
sort file | uniq -c | sort -n -r | awk 'NR==1 {prev=$1} $1!=prev {exit} 1'
If you don't want the output lines prefixed with the occurrence count:
sort file | uniq -c | sort -n -r | awk 'NR==1 {prev=$1} $1!=prev {exit} 1' |
sed 's/^ *[0-9]\{1,\} //'
Explanation of Grzegorz Żur's answer:
uniq -c outputs the set of unique input lines prefixed with their respective occurrence count (-c), followed by a single space.
sort -n -r then sorts the resulting lines numerically (-n), in descending order (-r), so that the most frequently occurring line(s) are at the top.
Note that sort, if -k is not specified, will generally try to sort by the entire input line, but -n causes only the longest prefix that is recognized as an integer to be used for sorting, which is exactly what's needed here.
Explanation of my awk command:
NR==1 {prev=$1} stores the 1st whitespace-separated field ($1) in variable prev for the first input line (NR==1)
$1!=prev {exit} terminates processing, if the 1st whitespace-separated field is not the same as the previous line's - this means that a non-topmost line has been reached, and no more lines need printing.
1 is shorthand for { print } meaning that the input line at hand should be printed as is.
Explanation of my sed command:
^ *[0-9]\{1,\} matches the numeric prefix (denoting the occurrence count) of each output line, as (originally) produced by uniq -c
applying s/...// means that the prefix is replaced with an empty string, i.e., effectively removed.
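Putting the pieces together, the pipeline can be wrapped in a small function. This is a sketch: most_frequent_lines is a made-up name and the sample data is invented, with one line containing a space to show that the sed step removes only the count prefix:

```shell
# most_frequent_lines FILE: print every line that shares the highest count.
most_frequent_lines() {
  sort -- "$1" | uniq -c | sort -n -r |
    awk 'NR == 1 { top = $1 } $1 != top { exit } 1' |   # stop at the first lower count
    sed 's/^ *[0-9]\{1,\} //'                            # strip the count prefix
}

# Hypothetical sample file: "a" and "b b" both occur twice, "c" once.
tmpdir=$(mktemp -d)
printf '%s\n' 'b b' a 'b b' a c > "$tmpdir/lines.txt"

# The order of tied lines is not guaranteed, hence the extra sort.
winners=$(most_frequent_lines "$tmpdir/lines.txt" | sort)
printf '%s\n' "$winners"
rm -rf "$tmpdir"
```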

Shell Linux : grep exact sentence with NULL character

I have a file like
key\0value\n
akey\0value\n
key2\0value\n
I have to create a script that takes a word as its argument and returns every line whose key is exactly the same as the argument.
I tried
grep -aF "$key\x0"
but grep does not seem to understand the \x0 (same result with \0). Furthermore, I have to check that the line begins with "$key\0"
I can only use sed, grep, tr and other non-matching commands
To have the \0 taken into account try :
grep -Pa "^key\x0"
it works for me.
Using sed
sed will work:
$ sed -n '/^key1\x00/p' file
key1value
The use of \x00 to represent a hex character is a GNU extension to sed. Since this question is tagged linux, that is not a problem.
Since the null character does not display well, one might (or might not) want to improve the display with something like this:
$ sed -n 's/^\(akey\)\x00/\1-->/p' file
akey-->value
Using sed with keys that contain special characters
If the key itself can contain sed or shell active characters, then we must escape them first and then run sed against the input file:
#!/bin/bash
printf -v script '/^%s\\x00/p' "$(sed 's:[]\[^$.*/]:\\&:g' <<<"$1")"
sed -n "$script" file
To use this script, simply supply the key as the first argument on the command line, enclosed in single-quotes, of course, to prevent shell processing.
To see how it works, let's look at the pieces in turn:
sed 's:[]\[^$.*/]:\\&:g' <<<"$1"
This puts a backslash escape in front of all sed-active characters.
printf -v script '/^%s\\x00/p' "$(sed 's:[]\[^$.*/]:\\&:g' <<<"$1")"
This creates a sed command using the escaped key and stores it in the shell variable script.
sed -n "$script" file
This runs sed using the shell variable script as the sed command.
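The escaping step can be tried on its own. In this sketch escape_for_sed is a made-up helper name, and the demo uses = as the delimiter instead of the null character so that it does not depend on GNU's \x00 extension:

```shell
# escape_for_sed STRING: backslash-escape ] \ [ ^ $ . * and /
escape_for_sed() {
  printf '%s\n' "$1" | sed 's:[]\[^$.*/]:\\&:g'
}

# Hypothetical sample data: the "." in the key must not match the "x".
tmpdir=$(mktemp -d)
printf '%s\n' 'a.b=1' 'axb=2' > "$tmpdir/pairs.txt"

key='a.b'
escaped=$(escape_for_sed "$key")
matches=$(sed -n "/^${escaped}=/p" "$tmpdir/pairs.txt")

printf '%s\n' "$escaped" "$matches"
rm -rf "$tmpdir"
```

Without the escaping, the pattern /^a.b=/ would also print the axb=2 line.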
Using awk
The question states that awk is not an acceptable tool. For completeness, though, here is an awk solution:
$ awk -F'\x00' -v k=key1 '$1 == k' file
key1value
Explanation:
-F'\x00'
awk divides the input up into records (lines) and divides the records up into fields. Here, we set the field separator to the null character. Consequently, the first field, denoted $1, is the key.
-v k=key1
This creates an awk variable, called k, and sets it to the key that we are looking for.
$1 == k
This statement looks for records (lines) for which the first field matches our specified key. If a match is found, the line is printed.
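If the awk or sed at hand does not cope with the null character, one workaround is to translate it to another delimiter first with tr. This is a sketch under assumptions: the data is assumed to contain no tab characters, lookup_key is a made-up name, and the matching lines come out with a tab in place of the null character:

```shell
# Hypothetical sample file with NUL-separated key/value pairs.
tmpdir=$(mktemp -d)
printf 'key\0value\nakey\0value\nkey2\0value\n' > "$tmpdir/file"

# lookup_key KEY FILE: print lines whose key equals KEY exactly.
lookup_key() {
  # Appending "" forces a string comparison even if both sides look numeric.
  tr '\000' '\t' < "$2" | awk -F'\t' -v k="$1" '($1 "") == (k "")'
}

found=$(lookup_key key "$tmpdir/file")
missing=$(lookup_key ke "$tmpdir/file")

printf '%s\n' "$found"
rm -rf "$tmpdir"
```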

How to delete 5 lines before and 6 lines after pattern match using Sed?

I want to search for a pattern "xxxx" in a file and delete 5 lines before this pattern and 6 lines after this match. How can I do this using sed?
This might work for you (GNU sed):
sed ':a;N;s/\n/&/5;Ta;/xxxx/!{P;D};:b;N;s/\n/&/11;Tb;d' file
Keep a rolling window of 5 lines and on encountering the specified string add 6 more (11 in total) and delete.
N.B. This is a barebones solution and will most probably need tailoring to your specific needs. Open questions include: what if there are multiple matching strings throughout the file? What if the string is within the first five lines, or multiple strings are within five lines of each other?
Here's one way you could do it using awk. I assume that you also want to delete the line itself and that the file is small enough to fit into memory:
awk '{a[NR]=$0}/xxxx/{f=NR}END{for(i=1;i<=NR;++i)if(i<f-5||i>f+6)print a[i]}' file
Store every line into the array a. When the pattern /xxxx/ is matched, save the line number. After the whole file has been processed, loop through the array, only printing the lines you want to keep.
Alternatively, you can use grep to obtain the line number first:
grep -n 'xxxx' file | awk -F: 'NR==FNR{f=$1;next} FNR<f-5||FNR>f+6' - file
In both cases, the lines deleted will be surrounding the last line where the pattern is matched.
A third option would be to use grep to obtain the line number then use sed to delete the lines:
line=$(grep -nm1 'xxxx' file | cut -d: -f1)
sed "$((line-5)),$((line+6))d" file
In this case I've also added the -m switch so grep exits after finding the first match.
If you know the line number (which is not difficult to obtain), you can use something like this:
filename="test"
start=`expr $curr_line - 5`
end=`expr $curr_line + 6`
sed "${start},${end}d" "$filename"
(optionally with sed -i to edit the file in place)
Of course, you also have to handle edge cases, such as start being less than 1 or end being greater than the number of lines in the file.
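Those edge cases can be handled with a small wrapper. This is a sketch rather than a definitive implementation: delete_around is a made-up name, it acts on the first match only, and it prints the result instead of editing the file in place:

```shell
# delete_around PATTERN BEFORE AFTER FILE
# Prints FILE without the first matching line, the BEFORE lines above it
# and the AFTER lines below it. Prints FILE unchanged if nothing matches.
delete_around() {
  match=$(grep -n -- "$1" "$4" | head -n 1 | cut -d: -f1)
  if [ -z "$match" ]; then
    cat -- "$4"
    return 0
  fi
  start=$((match - $2))
  [ "$start" -lt 1 ] && start=1   # clamp: a sed range cannot start below line 1
  end=$((match + $3))             # sed accepts an end beyond the last line
  sed "${start},${end}d" "$4"
}

# Hypothetical 20-line files with the pattern on line 10 and on line 3.
tmpdir=$(mktemp -d)
awk 'BEGIN { for (i = 1; i <= 20; i++) print (i == 10 ? "xxxx" : "line" i) }' > "$tmpdir/demo.txt"
awk 'BEGIN { for (i = 1; i <= 20; i++) print (i == 3 ? "xxxx" : "line" i) }' > "$tmpdir/top.txt"

middle=$(delete_around xxxx 5 6 "$tmpdir/demo.txt")    # keeps lines 1-4 and 17-20
near_top=$(delete_around xxxx 5 6 "$tmpdir/top.txt")   # keeps lines 10-20
printf '%s\n' "$middle"
rm -rf "$tmpdir"
```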
Another, perhaps easier to follow, solution would be to use grep to find the keyword and the corresponding line:
grep -n 'KEYWORD' <file>
then use sed to get the line number only like this:
grep -n 'KEYWORD' <file> | sed 's/:.*//'
Now that you have the line number simply use sed like this:
sed -i "${LINE_START},${LINE_END}d" <file>
to remove lines before and/or after! With only -i, the <file> is overwritten (no backup).
A script example could be:
#!/bin/bash
KEYWORD=$1
LINES_BEFORE=$2
LINES_AFTER=$3
FILE=$4
# Line number of the match (assumes the keyword occurs exactly once)
LINE_NO=$(grep -n -- "$KEYWORD" "$FILE" | sed 's/:.*//')
echo "Keyword found in line: $LINE_NO"
LINE_START=$((LINE_NO - LINES_BEFORE))
LINE_END=$((LINE_NO + LINES_AFTER))
echo "Deleting lines $LINE_START to $LINE_END!"
sed -i "${LINE_START},${LINE_END}d" "$FILE"
Please note that this will work only if the keyword is found once! Adapt the script to your needs!

Count the number of occurrences in a string. Linux

Okay so what I am trying to figure out is how do I count the number of periods in a string and then cut everything up to that point but minus 2. Meaning like this:
string="aaa.bbb.ccc.ddd.google.com"
number_of_periods="5"
number_of_periods=`expr $number_of_periods-2`
string=`echo $string | cut -d"." -f$number_of_periods`
echo $string
result: "aaa.bbb.ccc.ddd"
The way that I was thinking of doing it was sending the string to a text file and then just greping for the number of times like this:
grep -c "." infile
The reason I don't want to do that is because I want to avoid creating another text file for I do not have permission to do so. It would also be simpler for the code I am trying to build right now.
EDIT
I don't think I made it clear but I want to make finding the number of periods more dynamic because the address I will be looking at will change as the script moves forward.
If you don't need to count the dots, but just remove the penultimate dot and everything afterwards, you can use Bash's built-in string manipulation.
${string%substring}
Deletes shortest match of $substring from back of $string.
Example:
$ string="aaa.bbb.ccc.ddd.google.com"
$ echo ${string%.*.*}
aaa.bbb.ccc.ddd
Nice and simple and no need for sed, awk or cut!
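For comparison, here is a short sketch of the related expansions, which also work in a plain POSIX sh. The variable names are made up for illustration. Note that a string with fewer than two dots does not match the pattern .*.* and is left unchanged:

```shell
string="aaa.bbb.ccc.ddd.google.com"

without_last_two=${string%.*.*}   # shortest suffix matching ".*.*" removed
first_label=${string%%.*}         # longest suffix starting at a dot removed
last_label=${string##*.}          # longest prefix ending at a dot removed

short="google.com"
short_result=${short%.*.*}        # only one dot, so nothing is removed

printf '%s\n' "$without_last_two" "$first_label" "$last_label" "$short_result"
```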
What about this:
echo "aaa.bbb.ccc.ddd.google.com"|awk 'BEGIN{FS=OFS="."}{NF=NF-2}1'
(further shortened by helpful comment from #steve)
gives:
aaa.bbb.ccc.ddd
The awk command:
awk 'BEGIN{FS=OFS="."}{NF=NF-2}1'
works by separating the input line into fields (FS) by ., then joining them as output (OFS) with ., but the number of fields (NF) has been reduced by 2. The final 1 in the command is responsible for the print.
This will reduce a given input line by eliminating the last two period separated items.
This approach is "shell-agnostic" :)
Perhaps this will help:
#!/bin/sh
input="aaa.bbb.ccc.ddd.google.com"
number_of_fields=$(echo $input | tr "." "\n" | wc -l)
interesting_fields=$(($number_of_fields-2))
echo $input | cut -d. -f-${interesting_fields}
grep -o "\." <<<"aaa.bbb.ccc.ddd.google.com" | wc -l
5
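If the count itself is needed, it can be computed from the variable without any temporary file and then fed to cut. A minimal sketch, with variable names chosen for illustration and assuming the string contains at least two periods:

```shell
string="aaa.bbb.ccc.ddd.google.com"

# Delete everything except periods, then count the remaining bytes.
dots=$(printf '%s' "$string" | tr -cd '.' | wc -c | tr -d ' ')

# N periods separate N+1 fields; dropping the last two fields leaves N-1.
keep=$((dots - 1))
trimmed=$(printf '%s\n' "$string" | cut -d. -f "1-$keep")

printf '%s\n' "$dots" "$trimmed"
```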

Resources