Extract data between two words using sed or awk - Linux

I have a log file and I am trying to extract data between 2 words of that log file.
username=#$^#$^&###%^&==&employeeid
There is data before and after these words, but I am only interested in the data between them. Thus the expected output is just the value between username= and &employeeid:
#$^#$^&###%^&==
I want to grep the file first and then extract the value from the matching lines with sed, something like below, but it is not working for me:
grep "e553bb57-b94b-cb0f-f4ba-eb9a02ab0050" /path/abc/logfile.txt | sed -n '/username=/{s/.*username=//;s/\S*=.*//;p}'

How about
echo 'username=#$^#$^&###%^&==&employeeid' | sed 's/username=\(.*\)&employeeid/\1/'
The output is
#$^#$^&###%^&==
The matched part is captured in \1.
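If your grep supports Perl-compatible regular expressions (GNU grep's -P), you can also pull the value out directly; a minimal sketch, assuming the delimiters are literally username= and &employeeid:
grep -oP 'username=\K.*?(?=&employeeid)' /path/abc/logfile.txt
Here \K discards the username= prefix from the reported match and the lookahead stops just before &employeeid, so only the value itself is printed.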

Comparing two csv files with different lengths, outputting only the lines where the same value matches in two different columns

I've been trying to compare two csv files using a simple shell script, but I think the code I was using is not doing its job. What I want to do is compare the two files using column 6 from first.csv and column 2 from second.csv, and when they match, output the matching line from first.csv. See below for an example.
first.csv
1,0,3210820,0,536,7855712
1,0,3523340820,0,36,53712
1,0,321023423i420,0,336,0255712
1,0,321082234324,0,66324,027312
second.csv
14,7855712,Whie,Black
124,7855712,Green,Black
174,1197,Black,Orange
1284,98132197,Yellow,purple
35384,9811123197,purple,purple
13354,0981123131197,green,green
183434,0811912313127,white,green
The output should be this line from the first file:
1,0,3210820,0,536,7855712
I've been using the code below.
cat first.csv | while read line
do
cat second.csv | grep $line > output_file
done
Please help. Thank you.
Your question is not entirely clear, but here is what I think you want:
cat first.csv | while read LINE; do
    VAL=`echo "$LINE" | cut -d, -f6`
    grep -q "$VAL" second.csv && echo $LINE
done
The first line in the loop extracts the 6th field from the line and stores it in VAL. The next line checks (quietly), if VAL occurs in second.csv and if so, outputs the line.
Note that grep will check for any occurrence in second.csv, not only in field 2. To check only against field 2, change it to:
cut -d, -f2 second.csv | grep -q "$VAL" && echo $LINE
Unrelated to your question, I would like to note that tasks like this can be solved much more efficiently in a language like Python.
Well... If you have bash with process substitution, you can treat all the 2nd fields in second.csv (each with a $ appended to anchor the search at the end of the line) as input from a file. Then grep -f matches the 2nd column of second.csv against the end of each line in first.csv, which does what you intend.
You can use the <(process) form to feed the 2nd fields in as if they were a file:
grep -f <(awk -F, '{print $2"$"}' second.csv) first.csv
Example Output
With the data you show in first.csv and second.csv you get:
1,0,3210820,0,536,7855712
Adding the "$" anchor to each 2nd field from second.csv ensures the match is satisfied only in the 6th field (at the end of the line) in first.csv.
The benefit here is that there is only a single call each to grep and awk, rather than an additional subshell spawned per iteration. That doesn't matter with small files like your sample input, but with millions of lines we are talking hours (or days) of difference in processing time.
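If you want an exact, full-field comparison rather than a pattern match, a single awk pass over both files is another option; a minimal sketch using the usual NR==FNR lookup idiom:
awk -F, 'NR==FNR {keys[$2]; next} $6 in keys' second.csv first.csv
The first file on the command line (second.csv) is read into an array keyed on its field 2; lines of first.csv are then printed only when their field 6 is exactly one of those keys, so 7855712 matches but a longer value merely containing it would not.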

How to Grep the complete sequences containing a specific motif in a fasta file?

How can I grep the complete sequences containing a specific motif in a fasta or txt file with one Linux command and write them to another file? I also want to include the lines beginning with ">" before these target sequences.
Example: I have a fasta file of 10000 sequences.
$cat file.fa
>name1
AEDIA
>name2
ALKME
>name3
AAIII
I want to grep sequences containing KME, so I should get:
>name2
ALKME
Below is the approach I currently use, based on the answers I got. Others may find it helpful. Thanks to Pierre Lindenbaum, Philipp Bayer, cpad0112 and batMan.
Preprocess the fasta file so that each sequence is on a single line (this is very important):
awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);} END {printf("\n");}' < file.fa > file1.fa
Get rid of the first empty line:
tail -n +2 file1.fa > file2.fa
Extract the target sequences containing the substring, including their names, and save them to another file:
LC_ALL=C grep -B 1 KME file2.fa > result.txt
Note: KME is used here as an example target substring.
If you have multiline fasta files, first linearize them with awk, then use a second awk to keep only the sequences containing the motif. Using grep would be dangerous if a sequence name happens to contain the motif.
awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' input.fa |\
awk -F '\t' '{if(index($2,"KME")!=0) printf("%s\n%s\n",$1,$2);}'
grep -B1 KME file > output_file
-B1 : prints 1 line before the match as well
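For completeness, the linearize-and-filter steps can be folded into a single awk pass; a minimal sketch (the motif is passed in with -v, and each matching sequence is printed linearized, i.e. on one line):
awk -v motif="KME" '/^>/ {if (index(seq, motif)) print name ORS seq; name=$0; seq=""; next}
{seq = seq $0}
END {if (index(seq, motif)) print name ORS seq}' file.fa
Because the sequence lines are concatenated before the test, a motif that spans a line break in the original file is still found, and the header lines are never searched.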

To find and print a matching word between two commas using either sed, grep or cut

I have a log and it has data like :
ProcessID='3940', Key='1', Number='5547', TotalNumberOfInputMessages='1', TotalElapsedTime='1332',
There is a lot of other such info in the log, but I am particularly interested in printing only the TotalNumberOfInputMessages occurrences. There are many such occurrences in the log file, with the value of TotalNumberOfInputMessages changing each time.
I want output like :
TotalNumberOfInputMessages='1'
TotalNumberOfInputMessages='diff value'
TotalNumberOfInputMessages='diff value'
TotalNumberOfInputMessages='diff value'
How can I achieve this with cut, sed or grep?
You can use grep -Eo:
grep -Eo "\bTotalNumberOfInputMessages='[^']*'" file
TotalNumberOfInputMessages='1'
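If you only have sed available, the same extraction is possible; a minimal sketch, assuming at most one occurrence per line:
sed -n "s/.*\(TotalNumberOfInputMessages='[^']*'\).*/\1/p" file
The -n option together with the p flag prints only lines where the substitution succeeded; with several occurrences on one line, the greedy leading .* would keep only the last one.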

How to delete 5 lines before and 6 lines after pattern match using Sed?

I want to search for a pattern "xxxx" in a file and delete the 5 lines before this pattern and the 6 lines after the match. How can I do this using sed?
This might work for you (GNU sed):
sed ':a;N;s/\n/&/5;Ta;/xxxx/!{P;D};:b;N;s/\n/&/11;Tb;d' file
Keep a rolling window of 6 lines (the current line plus the 5 before it); on encountering the specified string, append 6 more lines (12 in total) and delete them all.
N.B. This is a barebones solution and will most probably need tailoring to your specific needs. Questions to consider: what if there are multiple strings throughout the file? What if the string is within the first five lines, or multiple strings are within five lines of each other, etc.?
Here's one way you could do it using awk. I assume that you also want to delete the line itself and that the file is small enough to fit into memory:
awk '{a[NR]=$0}/xxxx/{f=NR}END{for(i=1;i<=NR;++i)if(i<f-5||i>f+6)print a[i]}' file
Store every line into the array a. When the pattern /xxxx/ is matched, save the line number. After the whole file has been processed, loop through the array, only printing the lines you want to keep.
Alternatively, you can use grep to obtain the line number first:
grep -n 'xxxx' file | awk -F: 'NR==FNR{f=$1}NR<f-5||NR>f+6' - file
In both cases, the lines deleted will be surrounding the last line where the pattern is matched.
A third option would be to use grep to obtain the line number then use sed to delete the lines:
line=$(grep -nm1 'xxxx' file | cut -d: -f1)
sed "$((line-5)),$((line+6))d" file
In this case I've also added the -m switch so grep exits after finding the first match.
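If the pattern may occur more than once, a two-pass awk variant handles every match rather than only the last one; a minimal sketch that reads the file twice:
awk 'NR==FNR {if (/xxxx/) for (i=FNR-5; i<=FNR+6; i++) del[i]; next} !(FNR in del)' file file
The first pass marks the 12-line window around every match for deletion; the second pass prints only the unmarked lines. Overlapping windows and windows that run past either end of the file are handled naturally.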
If you know the line number (which is not difficult to obtain), you can use something like this:
filename="test"
# curr_line is assumed to hold the matched line number
start=`expr $curr_line - 5`
end=`expr $curr_line + 6`
sed "${start},${end}d" "$filename"   # optionally sed -i to edit in place
Of course, you have to remember additional conditions: start shouldn't be less than 1, and end shouldn't be greater than the number of lines in the file.
Another, maybe easier to follow, solution would be to use grep to find the keyword and the corresponding line number:
grep -n 'KEYWORD' <file>
then use sed to get the line number only like this:
grep -n 'KEYWORD' <file> | sed 's/:.*//'
Now that you have the line number, simply use sed like this:
sed -i "${LINE_START},${LINE_END} d" <file>
to remove lines before and/or after! Note that with a bare -i you will overwrite <file> (no backup).
A script example could be:
#!/bin/bash
KEYWORD=$1
LINES_BEFORE=$2
LINES_AFTER=$3
FILE=$4
# grep -n prefixes the match with its line number; the sed keeps only the number
LINE_NO=$(grep -n "$KEYWORD" "$FILE" | sed 's/:.*//')
echo "Keyword found in line: $LINE_NO"
LINE_START=$((LINE_NO - LINES_BEFORE))
LINE_END=$((LINE_NO + LINES_AFTER))
echo "Deleting lines $LINE_START to $LINE_END!"
sed -i "$LINE_START,$LINE_END d" "$FILE"
Please note that this will work only if the keyword is found once! Adapt the script to your needs!

Quickest way to remove 70+ strings from a file?

I have 70+ strings I need to find and delete in a file. I need to remove the entire line in the file that the string appears in.
I know I can use sed -i '/string to remove/d' fileA.txt to remove them one at a time. However, considering I have 70+, it will take some time doing it this way.
Is there a way I can put these 70+ strings in a file and have sed go through them one by one? Or if I create a file containing the strings, is there a way to compare the two files so it removes any line from fileA that contains one of the strings?
You could use grep:
grep -vf file_with_words.txt file.txt
where file_with_words.txt would be the file containing the list of words, each word being on a different line and file.txt is the file that you want to remove the lines from.
If your list of words contains regex metacharacters, then tell grep to consider those as fixed strings (if that is what you want):
grep -F -vf file_with_words.txt file.txt
Using sed, you'd need to say:
sed '/word1\|word2\|word3/d' file.txt
or
sed -E '/word1|word2|word3/d' file.txt
You could use command substitution to construct the pattern too:
sed -E "/$(paste -sd'|' file_with_words.txt)/d" file.txt
but grep is clearly the tool to use in this case.
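If you do want sed to take the strings from a file, you can also generate a sed script from the word list, one /word/d command per word; a minimal sketch (delete.sed is just a scratch file name, and the words are assumed to contain no slashes or regex metacharacters):
# delete.sed is a scratch file; any name works
sed 's|.*|/&/d|' file_with_words.txt > delete.sed
sed -f delete.sed file.txt
Unlike running sed once per word, this reads file.txt only once.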
If you want to do the job in bash, here's how:
search=fileA.txt
queries=queries.txt
while read query
do
    sed -i '' "/$query/d" "$search"   # BSD/macOS sed syntax; with GNU sed use plain -i
done < "$queries"
where queries.txt looks like
I
want
to
delete
these
lines
