Search and print record in shell script - linux

I am using grep to search a text based database that holds contact info. However, it prints the : delimiter from the text file. How do I remove the delimiter or change it to a tab?
DATAFILE
Name:Address:Phone:Email
Stan Marsh:123 South Park:456:sm#cc.com
Each line is one record. I tried using grep, awk, cut, and tr, but I can't get the output to come out without the : delimiter. I saw a lot of tutorials on printing a database's first column, or even several columns, but I just need to print the matching record without delimiters (or with them replaced) after searching for it. I also saw how to print a whole file with the delimiters removed or replaced, but I'm having a hard time combining that with a grep search.

Use sed to replace it (the \t escape in the replacement is understood by GNU sed):
grep YourKeyWord yourFileName | sed -e 's/:/\t/g'
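A minimal sketch of the same search-then-replace idea, assuming the sample record from the question (the temporary file and the search term Stan are only illustrative). tr is used here because it understands \t on any POSIX system, and the awk variant does the search and the delimiter change in one step:

```shell
# Hypothetical sample data file, built from the record in the question
datafile=$(mktemp)
printf '%s\n' \
  'Name:Address:Phone:Email' \
  'Stan Marsh:123 South Park:456:sm#cc.com' > "$datafile"

# Search with grep, then turn every colon into a tab with tr
result=$(grep 'Stan' "$datafile" | tr ':' '\t')
printf '%s\n' "$result"

# Same thing in awk alone: split on ":", join with tabs.
# Assigning $1 = $1 forces awk to rebuild the line using OFS.
result_awk=$(awk -F: -v OFS='\t' '/Stan/ { $1 = $1; print }' "$datafile")
printf '%s\n' "$result_awk"

rm -f "$datafile"
```

If the fields should be separated by spaces instead, tr ':' ' ' or OFS=' ' would be the corresponding change.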

Related

Extract text from each line from a multiple-line text file based on a condition, Linux

I have a txt file with a single column in which each line represents a different fastq.gz file from a sequence output. See an example below:
36108-ABZG339L_S237_L001_R1_001.fastq.gz
36108-ABZG339L_S237_L001_R2_001.fastq.gz
36108-ABZGM_S7_L001_R1_001.fastq.gz
36108-ABZGM_S7_L001_R2_001.fastq.gz
First of all, I would like to convert the first "-" symbol to underscore "_".
I achieved that through the following command:
sed 's/[-]/_/Ig' inputfile.txt > outputfile.txt
Then the outputfile.txt is:
36108_ABZG339L_S237_L001_R1_001.fastq.gz
36108_ABZG339L_S237_L001_R2_001.fastq.gz
36108_ABZGM_S7_L001_R1_001.fastq.gz
36108_ABZGM_S7_L001_R2_001.fastq.gz
Afterwards, I would like to extract in a new txt file only the text between first and second underscore, so:
ABZG339L
ABZG339L
ABZGM
ABZGM
How can I achieve this? I tried sed and awk but I can't figure it out.
Thanks in advance for your help,
Magí
1st solution: To get the expected output you showed, you don't need to substitute - with _ first and then print. awk can treat several characters as field separators, so the needed value can be printed directly.
awk -F'-|_' '{print $2}' Input_file
Explanation: This makes both _ and - field separators for the whole Input_file, then prints the 2nd field of each line.
2nd solution: Using sed with a back reference.
sed -E 's/^[^-]*-([^_]*).*/\1/' Input_file
Explanation: The -E option enables ERE (extended regular expressions). The pattern matches from the start of the line through the first -, captures everything up to the next _ in the 1st capture group (retrieved later with the back reference \1 during substitution), and then matches the rest of the line. The substitution replaces the whole line with only the captured value, which gives the desired result.
3rd solution: Using GNU grep. The -oP options print only the matched part and enable the PCRE regex engine. The pattern matches everything from the start of the line through the first - and then discards that part of the match with \K. It then matches everything up to the next _ and prints it.
grep -oP '^.*?-\K[^_]*' Input_file
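To check that the awk and sed solutions agree, here is a small sketch run against the four file names from the question. The temporary file and variable names are illustrative only, and the awk separator is written as the bracket expression [-_], which should behave the same as '-|_':

```shell
# Hypothetical input file with the sample names from the question
input=$(mktemp)
printf '%s\n' \
  '36108-ABZG339L_S237_L001_R1_001.fastq.gz' \
  '36108-ABZG339L_S237_L001_R2_001.fastq.gz' \
  '36108-ABZGM_S7_L001_R1_001.fastq.gz' \
  '36108-ABZGM_S7_L001_R2_001.fastq.gz' > "$input"

# awk: split on either "-" or "_" and print the 2nd field
ids_awk=$(awk -F'[-_]' '{ print $2 }' "$input")

# sed: capture the text between the first "-" and the next "_"
ids_sed=$(sed -E 's/^[^-]*-([^_]*).*/\1/' "$input")

printf '%s\n' "$ids_awk"

rm -f "$input"
```

Redirecting either command to a file (for example > outputfile.txt) would give the new txt file asked for in the question.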

Wildcard sed search/remove within other text in the same line

I'm trying to remove a matching string with partial wildcards using sed, and the searches I've done for answers on this site either don't seem to apply or I can't convert them to my situation.
Below is the string of text I need to remove:
www.foo.com.cp123.bar.com
It is in a file with other entries on the same line. The line that has my entries always starts with serveralias:, however, as below:
serveralias: www.domain.com mail.domain.com www.foo.com.cp123.bar.com domain.com
I can identify what I need to remove via the 'cp123.bar.com' text as that always stays the same. It's the preceding 'www.foo.com' that changes. It can appear just once or multiple times within the line, but it will always end in 'cp123.bar.com'. I've tried the following two commands based on my research:
sed 's/\ .*cp123.bar.com\ //g' file.txt
sed 's/\ [^:]+$cp123.bar.com\ //g' file.txt
I'm using the spaces between each entry as the start and stop point for the find/replace(delete), but that's a band-aid and not always going to work since the entry I need to delete is occasionally at the end of the line (without a space afterward). If I don't include the spaces, though, everything gets removed since I'm using wildcards, including the www.domain.com, mail.domain.com, etc. text I need to keep there. Running either of the sed commands above doesn't do anything, just prints what's currently in the file.
Any ideas on what I need to change? I'm happy to clarify anything if need be.
sed needs the -r flag (GNU sed; -E is the more portable spelling) to use extended regular expressions. Without it, the + is not treated as a quantifier in the regexp. Thus, a
sed -r 's/ +[^ ]+\.cp123\.bar\.com//g'
will do what you want. It removes the following substrings:
one or more spaces
followed by one or more non-space characters
followed by .cp123.bar.com
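A short sketch of that substitution on the sample line from the question. The /^serveralias:/ address is an addition of mine that limits the edit to the serveralias line, and the second line of the sample file is made up to show that other lines are left alone:

```shell
# Hypothetical sample file; only the first line comes from the question
file=$(mktemp)
printf '%s\n' \
  'serveralias: www.domain.com mail.domain.com www.foo.com.cp123.bar.com domain.com' \
  'other: www.foo.com.cp123.bar.com' > "$file"

# On lines starting with "serveralias:", delete every entry that ends in
# .cp123.bar.com together with the spaces in front of it.
# -E is used instead of -r; both enable extended regular expressions in GNU sed.
cleaned=$(sed -E '/^serveralias:/ s/ +[^ ]+\.cp123\.bar\.com//g' "$file")
printf '%s\n' "$cleaned"

rm -f "$file"
```

Because the pattern only consumes the spaces before the entry, it also works when the entry is the last one on the line.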

How can I remove a doubled section of a string?

I'm having trouble with data manipulation in a txt file. My file currently looks like this:
HG02239 -23.42333333
NA06985NA06985 -20.125
NA06991NA06991 -20.92
This shows some of my tab-delimited data. Half the entries are in the correct seven-characters (letterletternumbernumbernumbernumbernumber) format, but some are doubled up. I want to go into the second column (first column is empty for a reason!) and remove the repeats in the string so it would read
HG02239 -23.42333333
NA06985 -20.125
NA06991 -20.92
I can't work out how to do this with sed/awk on a per column basis. I feel like I should be able to write a regex, but because the data is a repeat, I don't want to lose the first half of the string; and I can't work out how to cut on a specific column, or I would just delete the 7th character. Any help much appreciated!
Solution
You can solve this with a backreference. For example, using GNU sed:
$ cat << EOF | sed --regexp-extended 's/(.{7})\1/\1/'
HG02239 -23.42333333
NA06985NA06985 -20.125
NA06991NA06991 -20.92
EOF
HG02239 -23.42333333
NA06985 -20.125
NA06991 -20.92
If you aren't using GNU sed, you may need to escape the capture groups. In addition, you can tune the regular expression if you need a more accurate character match.
Explanation
The cat pipeline is just a here-document to make it easy to display and test the code. You can call sed directly on your file, or use the -i flag to perform an in-place edit when you're comfortable with the results.
The sed script does the following:
It stores any group of 7 consecutive characters in a capture group using an "interval expression" (the number in the curly braces).
The \1 is a backreference that matches the first capture group.
The match looks for "a capture group followed by a copy of the capture group."
The substitution replaces the match with a single copy of the capture group.
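As a sketch of the "more accurate character match" mentioned above, the capture group can be narrowed to two uppercase letters followed by five digits, which is the ID format described in the question. The leading tab in the sample lines is an assumption based on the asker's remark that the first column is empty:

```shell
# Sample rows: empty first column, then ID, then value (tab-delimited)
deduped=$(printf '\tHG02239\t-23.42333333\n\tNA06985NA06985\t-20.125\n' |
  sed -E 's/([A-Z]{2}[0-9]{5})\1/\1/')

# Only IDs that are immediately repeated are collapsed to a single copy
printf '%s\n' "$deduped"
```

With this tighter pattern, an arbitrary run of 14 characters that happens to repeat (for example inside the numeric column) would no longer be touched.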
One way, using awk:
awk '{ print substr($1, 1, 7), $2 }' file.txt
Output:
HG02239 -23.42333333
NA06985 -20.125
NA06991 -20.92
You could use something like this:
sed -i 's|\([A-Z]\{2\}[0-9]\{5\}\)[A-Z0-9]*\s*\(.*\)|\1 \2|g' <your-file>

Replacing comma on specific lines only

I have a dataset that is comma separated. But I have a little problem with its format. I want everything to be in the form x,x,x
Below is a sample of my dataset:
995970,16779453
995971,16828069
995972,
995973,16828069
995974,16827226
As you can see, most of my dataset is in the proper format but I have those commas on single id#'s also (my data is in form id#, connection#). How would I go about removing the commas on those single id#'s? I can't seem to figure it out just using a text editor. Any suggestions?
Edit: can I use some sort of regex expression to only remove it from those ids that have a specified length?
Edit2: Ok I figured it out using some regex, thanks for all the help!
In vi one would do something like
:%s/,$//
This means
: (enter a line mode command)
% (try the command on every line)
s (substitute)
,$ (match a comma at the end of a line)
(empty replacement text)
Sometimes you need something like /, *$/ to match a comma followed by 0 or more trailing spaces. You can get vi on Windows in various ways; one way is to install Cygwin.
You can select regular expression mode in Notepad++ and do find and replace using the following regex ,$. Leave the replace field blank.
With the sed command (the $ anchors the match to the end of the line, so commas between two values are kept):
sed 's/, *$//' < FILE
or in place (requires GNU sed):
sed -i -e 's/, *$//' FILE
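A minimal sketch on the sample data from the question, showing that anchoring the pattern with $ removes only the trailing commas. The second command is one possible answer to the asker's follow-up about IDs of a specified length; the length of six digits is taken from the sample and is otherwise an assumption:

```shell
# Hypothetical data file with rows from the question
data=$(mktemp)
printf '%s\n' \
  '995970,16779453' \
  '995971,16828069' \
  '995972,' \
  '995973,16828069' > "$data"

# Remove a comma (plus any trailing spaces) only at the end of a line
fixed=$(sed 's/, *$//' "$data")

# Variant: only touch lines that are exactly a six-digit id followed by a comma
fixed_len=$(sed -E 's/^([0-9]{6}),$/\1/' "$data")

printf '%s\n' "$fixed"

rm -f "$data"
```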

Simple Text Search Bash

I have a text file with 10 k lines. How do I extract all the lines where a certain keyword appears? It's fundamental that I am able to select the entire line where a certain text pattern shows up. How can I do this in bash?
Use grep to search for text and print matching lines:
grep yourKeyword yourFile.txt
If the pattern consists of several words, you must quote the pattern:
grep "your key string" yourFile.txt
Besides grep, you can also use awk, which has the advantage of being able to process lines as it searches them:
awk '/pattern/ { print $0 }' file   # replace the action with your own processing
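For illustration, a small sketch of both approaches on a made-up three-line file (the file contents and the keyword error are hypothetical). grep -i makes the search case-insensitive, and the awk program counts the matching lines instead of printing them:

```shell
# Hypothetical input file
log=$(mktemp)
printf '%s\n' \
  'alpha error one' \
  'beta ok' \
  'gamma Error two' > "$log"

# Print every whole line that contains the keyword, ignoring case
matches=$(grep -i 'error' "$log")
printf '%s\n' "$matches"

# awk: count the lines that contain the (case-sensitive) pattern;
# "n + 0" prints 0 rather than an empty string when nothing matches
count=$(awk '/error/ { n++ } END { print n + 0 }' "$log")
printf '%s\n' "$count"

rm -f "$log"
```

If the keyword contains characters that are special in regular expressions, such as dots, grep -F treats it as a fixed string.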
