GREP to search for second column in a text file - linux

I need to GREP second column (path name) from the text file. I have a text file which has a md5checksum filepath size . Such as:
ce75d423203a62ed05fe53fe11f0ddcf kart/pan/mango.sh 451b
8e6777b67f1812a9d36c7095331b23e2 kart/hey/local 301376b
e0ddd11b23378510cad9b45e3af89d79 yo/cat/so 293188b
4e0bdbe9bbda41d76018219f3718cf6f asuo/hakl 25416b
the above is the text file, I used grep -Eo '[/]' file.txt but it prints only / , but i want the output like this:
kart/pan/mango.sh
kart/hey/local
yo/cat/so
asuo/hakl
Lastly I have to use GREP.

If you can live with spaces before and after, you can use:
grep -o "\s[[:alnum:]/]*\s"
If you need the spaces removed, you will need some zero-width look-ahead/look-behind which is only available with -P (perl regexes), if you have that you can use:
grep -Po "(?<=\s)[[:alnum:]/]+(?=\s)"
(?<=\s) - look-behind to see if there is a space preceding the string, but not capture it
(?=\s) - look-ahead to see if there is a space after the match, but not capture it
[:alnum:] - match alpha numeric chars
[[:alnum:]/] - match alphanumeric chars and /
+ - match one or more
However, grep is not the right tool for this, cut/sed/awk are way better

cut -d ' ' -f 2
-d ' ' means your delimiter is the space
-f 2 meant you want to only print field number two

Use awk instead.
awk '{print $2}' file.txt

When you are allowed to combine grep with other tools and your input file only has slashes in the second field, you can use
tr " " "\n" < file.txt | grep '/'

Related

String split and extract the last field in bash

I have a text file FILENAME. I want to split the string at - of the first column field and extract the last element from each line. Here "$(echo $line | cut -d, -f1 | cut -d- -f4)"; alone is not giving me the right result.
FILENAME:
TWEH-201902_Pau_EX_21-1195060301,15cef8a046fe449081d6fa061b5b45cb.final.cram
TWEH-201902_Pau_EX_22-1195060302,25037f17ba7143c78e4c5a475ee98e25.final.cram
TWEH-201902_Pau_T-1383-1195060311,267364a6767240afab2b646deec17a34.final.cram
code I tried:
while read line; do \
DNA="$(echo $line | cut -d, -f1 | cut -d- -f4)";
echo $DNA
done < ${FILENAME}
Result I want
1195060301
1195060302
1195060311
Would you please try the following:
while IFS=, read -r f1 _; do # set field separator to ",", assigns f1 to the 1st field and _ to the rest
dna=${f1##*-} # removes everything before the rightmost "-" from "$f1"
echo "$dna"
done < "$FILENAME"
Well, I had to do with the two lines of codes. May be someone has a better approach.
while read line; do \
DNA="$(echo $line| cut -d, -f1| rev)"
DNA="$(echo $DNA| cut -d- -f1 | rev)"
echo $DNA
done < ${FILENAME}
I do not know the constraints on your input file, but if what you are looking for is a 10-digit number, and there is only ever one 10-digit number per line... This should do niceley
grep -Eo '[0-9]{10,}' input.txt
1195060301
1195060302
1195060311
This essentially says: Show me all 10 digit numbers in this file
input.txt
TWEH-201902_Pau_EX_21-1195060301,15cef8a046fe449081d6fa061b5b45cb.final.cram
TWEH-201902_Pau_EX_22-1195060302,25037f17ba7143c78e4c5a475ee98e25.final.cram
TWEH-201902_Pau_T-1383-1195060311,267364a6767240afab2b646deec17a34.final.cram
A sed approach:
sed -nE 's/.*-([[:digit:]]+)\,.*/\1/p' input_file
sed options:
-n: Do not print the whole file back, but only explicit /p.
-E: Use Extend Regex without need to escape its grammar.
sed Extended REgex:
's/.*-([[:digit:]]+)\,.*/\1/p': Search, capture one or more digit in group 1, preceded by anything and a dash, followed by a comma and anything, and print only the captured group.
Using awk:
awk -F[,] '{ split($1,arr,"-");print arr[length(arr)] }' FILENAME
Using , as a separator, take the first delimited "piece" of data and further split it into an arr using - as the delimiter and awk's split function. We then print the last index of arr.

How can I give all words from one file to 'tr' for searching and deleting in text from another file?

How to give all words from one file to tr for searching and deleting in text from another file?
For example, I have a file vocabulary.txt and loveStroty.txt. I'm trying to delete all words that in are vocabulary from love Story.
$ voc="one free" #files look like this strings
$ love="one two free four"
$ tr "$voc" '' <<< $love
Example for output (doesn't matter if it is with separators or with new line separated):
two
four
I'm assuming your input files look like this:
$ cat lovestory.txt
one two free four
$ cat vocabulary.txt
one free
In Bash, I can then use grep, process substitution and tr to remove every word from lovestory.txt that exists in vocabulary.txt like this:
$ grep -vFxf <(tr ' ' '\n' < vocabulary.txt) <(tr ' ' '\n' < lovestory.txt)
two
four
tr ' ' '\n' < file replaces every space in file with a newline; grep -vFx removes matches of complete lines (fixed strings, no regular expressions).
If files are not big enough, you could give sed utility a try:
# Define the text which replaces the searched words
replace="<Replacement string here>"
for word in $(cat /path/to/<file_containing_words>); do
sed -i "s/${word}/${replace}/g" <file_to_be_replaced>
done
So, for your specific example
replace=""
for word in $(cat /path/to/voc); do
sed -i "s/${word}/${replace}/g" /path/to/love
done
With GNU awk for multi-char RS:
$ awk -v RS='\\s+' 'NR==FNR{a[$0];next} !($0 in a)' vocabulary.txt lovestory.txt
two
four

return all lines that match String1 in a file after the last matching String2 in the same file

I figured out how to get the line number of the last matching word in the file :
cat -n textfile.txt | grep " b " | tail -1 | cut -f 1
It gave me the value of 1787. So, I passed it manually to the sed command to search for the lines that contains the sentence "blades are down" after that line number and it returned all the lines successfully
sed -n '1787,$s/blades are down/&/p' myfile.txt
Is there a way that I can pass the line number from the first command to the second one through a variable or a file so I can but them in the script to be executed automatically ?
Thank you.
You can do this by just connecting your two commands with xargs. 'xargs -I %' allows you to take the stdin from a previous command and place it whenever you want in the next command. The '%' is where your '1787' will be written:
cat -n textfile.txt | grep " b " | tail -1 | cut -f 1 | xargs -I % sed -n %',$s/blades are down/&/p' myfile.txt
You can use:
command substitution to capture the result of the first command in a variable.
simple string concatenation to use the variable in your sed comand
startLine=$(grep -n ' b ' textfile.txt | tail -1 | cut -d ':' -f1)
sed -n ${startLine}',$s/blades are down/&/p' myfile.txt
You don't strictly need the intermediate variable - you could simply use:
sed $(grep -n ' b ' textfile.txt | tail -1 | cut -d ':' -f1)',$s/blades are down/&/p' myfile.txt`
but it may make sense to do error checking on the result of the command substitution first.
Note that I've streamlined the first command by using grep's -n option, which puts the line number separated with : before each match.
First we can get "half" of the file after the last match of string2, then you can use grep to match all the string1
tac your_file | awk '{ if (match($0, "string2")) { exit ; } else {print;} }' | \
grep "string1"
but the order is reversed if you don't care about the order. But if you do care, just add another tac at the end with a pipe |.
This might work for you (GNU sed):
sed -n '/\n/ba;/ b /h;//!H;$!d;x;//!d;s/$/\n/;:a;/\`.*blades are down.*$/MP;D' file
This reads through the file storing all lines following the last match of the first string (" b ") in the hold space.
At the end of file, it swaps to the hold space, checks that it does indeed have at least one match, then prints out those lines that match the second string ("blades are down").
N.B. it makes the end case (/\n/) possible by adding a new line to the end of the hold space, which will eventually be thrown away. This also caters for the last line edge condition.

How to concatenate multiple lines of output to one line?

If I run the command cat file | grep pattern, I get many lines of output. How do you concatenate all lines into one line, effectively replacing each "\n" with "\" " (end with " followed by space)?
cat file | grep pattern | xargs sed s/\n/ /g
isn't working for me.
Use tr '\n' ' ' to translate all newline characters to spaces:
$ grep pattern file | tr '\n' ' '
Note: grep reads files, cat concatenates files. Don't cat file | grep!
Edit:
tr can only handle single character translations. You could use awk to change the output record separator like:
$ grep pattern file | awk '{print}' ORS='" '
This would transform:
one
two
three
to:
one" two" three"
Piping output to xargs will concatenate each line of output to a single line with spaces:
grep pattern file | xargs
Or any command, eg. ls | xargs. The default limit of xargs output is ~4096 characters, but can be increased with eg. xargs -s 8192.
grep xargs
In bash echo without quotes remove carriage returns, tabs and multiple spaces
echo $(cat file)
This could be what you want
cat file | grep pattern | paste -sd' '
As to your edit, I'm not sure what it means, perhaps this?
cat file | grep pattern | paste -sd'~' | sed -e 's/~/" "/g'
(this assumes that ~ does not occur in file)
This is an example which produces output separated by commas. You can replace the comma by whatever separator you need.
cat <<EOD | xargs | sed 's/ /,/g'
> 1
> 2
> 3
> 4
> 5
> EOD
produces:
1,2,3,4,5
The fastest and easiest ways I know to solve this problem:
When we want to replace the new line character \n with the space:
xargs < file
xargs has own limits on the number of characters per line and the number of all characters combined, but we can increase them. Details can be found by running this command: xargs --show-limits and of course in the manual: man xargs
When we want to replace one character with another exactly one character:
tr '\n' ' ' < file
When we want to replace one character with many characters:
tr '\n' '~' < file | sed s/~/many_characters/g
First, we replace the newline characters \n for tildes ~ (or choose another unique character not present in the text), and then we replace the tilde characters with any other characters (many_characters) and we do it for each tilde (flag g).
Here is another simple method using awk:
# cat > file.txt
a
b
c
# cat file.txt | awk '{ printf("%s ", $0) }'
a b c
Also, if your file has columns, this gives an easy way to concatenate only certain columns:
# cat > cols.txt
a b c
d e f
# cat cols.txt | awk '{ printf("%s ", $2) }'
b e
I like the xargs solution, but if it's important to not collapse spaces, then one might instead do:
sed ':b;N;$!bb;s/\n/ /g'
That will replace newlines for spaces, without substituting the last line terminator like tr '\n' ' ' would.
This also allows you to use other joining strings besides a space, like a comma, etc, something that xargs cannot do:
$ seq 1 5 | sed ':b;N;$!bb;s/\n/,/g'
1,2,3,4,5
Here is the method using ex editor (part of Vim):
Join all lines and print to the standard output:
$ ex +%j +%p -scq! file
Join all lines in-place (in the file):
$ ex +%j -scwq file
Note: This will concatenate all lines inside the file it-self!
Probably the best way to do it is using 'awk' tool which will generate output into one line
$ awk ' /pattern/ {print}' ORS=' ' /path/to/file
It will merge all lines into one with space delimiter
paste -sd'~' giving error.
Here's what worked for me on mac using bash
cat file | grep pattern | paste -d' ' -s -
from man paste .
-d list Use one or more of the provided characters to replace the newline characters instead of the default tab. The characters
in list are used circularly, i.e., when list is exhausted the first character from list is reused. This continues until
a line from the last input file (in default operation) or the last line in each file (using the -s option) is displayed,
at which time paste begins selecting characters from the beginning of list again.
The following special characters can also be used in list:
\n newline character
\t tab character
\\ backslash character
\0 Empty string (not a null character).
Any other character preceded by a backslash is equivalent to the character itself.
-s Concatenate all of the lines of each separate input file in command line order. The newline character of every line
except the last line in each input file is replaced with the tab character, unless otherwise specified by the -d option.
If ‘-’ is specified for one or more of the input files, the standard input is used; standard input is read one line at a time,
circularly, for each instance of ‘-’.
On red hat linux I just use echo :
echo $(cat /some/file/name)
This gives me all records of a file on just one line.

UNIX Shell Script remove one column from the file

I have a file like the following:
Header1:value1|value2|value3|
Header2:value4|value5|value6|
The column number is unknown and I have a function which can return the column number.
And I want to write a script which can remove one column from the file. For exampple, after removing column 1, I will get:
Header1:value2|value3|
Header2:value5|value6|
I use cut to achieve this and so far I can give the values after removing one column but without the headers. For example
value2|value3|
value5|value6|
Could anyone tell me how can I add headers back? Or any command can do that directly? Thanks.
Replace the colon with a pipe, do your cut command, then replace the first pipe with a colon again:
sed 's/:/|/' input.txt | cut ... | sed 's/|/:/'
You may need to adjust the column number for the cut command, to ensure you don't count the header.
Turn the ':' into '|', so that the header is another field, rather than part of the first field. You can do that either in whatever generates the data to begin with, or by passing the data through tr ':' '|' before cut. The rest of your fields will be offset by +1 then, but that should be easy enough to compensate for.
Your problem is that HeaderX are followed by ':' which is not the '|' delimiter you use in cut.
You could separate first your lines in two parts with :, with something like
"cut -f 1 --delimiter=: YOURFILE", then remove the first column and then put back the headers.
awk can handle multiple delimiters. So another alternative is...
jkern#ubuntu:~/scratch$ cat ./data188
Header1:value1|value2|value3|
Header2:value4|value5|value6|
jkern#ubuntu:~/scratch$ awk -F"[:|]" '{ print $1 $3 $4 }' ./data188
Header1value2value3
Header2value5value6
you can do it just with sed without cut:
sed 's/:[^|]*|/:/' input.txt
My solution:
$ sed 's,:,|,' data | awk -F'|' 'BEGIN{OFS="|"}{$2=""; print}' | sed 's,||,:,'
Header1:value2|value3|
Header2:value5|value6|
replace : with |
-F'|' tells awk to use | symbol as field separator
in each line we replace 2nd (because header now becomes first) field with empty string and printing result line with new field separator (|)
return back header by replacing first | with :
Not perfect, but should works.
$ cat file.txt | grep 'Header1' | awk -F"1" '{ print $1 $2 $3 $4}'
This will print all values in separate columns. You can print any number of columns.
Just chiming in with a Perl solution:
(rearrange/remove fields as needed)
-l effectively adds a newline to every print statement
-a autosplit mode splits each line using the -F expression into array #F
-n adds a loop around the -e code
-e your 'one liner' follows this option
$ perl -F[:\|] -lane 'print "$F[0]:$F[1]|$F[2]|$F[3]"' input.txt

Resources