How can I remove a doubled section of a string?

I'm having trouble with data manipulation in a txt file. My file currently looks like this:
HG02239 -23.42333333
NA06985NA06985 -20.125
NA06991NA06991 -20.92
This shows some of my tab-delimited data. Half the entries are in the correct seven-character format (two letters followed by five digits), but some are doubled up. I want to go into the second column (first column is empty for a reason!) and remove the repeats in the string so it would read
HG02239 -23.42333333
NA06985 -20.125
NA06991 -20.92
I can't work out how to do this with sed/awk on a per column basis. I feel like I should be able to write a regex, but because the data is a repeat, I don't want to lose the first half of the string; and I can't work out how to cut on a specific column, or I would just delete the 7th character. Any help much appreciated!

Solution
You can solve this with a backreference. For example, using GNU sed:
$ cat << EOF | sed --regexp-extended 's/(.{7})\1/\1/'
HG02239 -23.42333333
NA06985NA06985 -20.125
NA06991NA06991 -20.92
EOF
HG02239 -23.42333333
NA06985 -20.125
NA06991 -20.92
If you aren't using GNU sed, you may need to escape the capture groups. In addition, you can tune the regular expression if you need a more accurate character match.
Explanation
The cat pipeline is just a here-document to make it easy to display and test the code. You can call sed directly on your file, or use the -i flag to perform an in-place edit when you're comfortable with the results.
The sed script does the following:
It stores any group of 7 consecutive characters in a capture group using an "interval expression" (the number in the curly braces).
The \1 is a backreference that matches the first capture group.
The match looks for "a capture group followed by a copy of the capture group."
The substitution replaces the match with a single copy of the capture group.
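As a sketch of the two remarks above (escaped groups for a non-GNU sed, and a tighter character match), the same substitution can be written in basic regular expression syntax. The sample lines are copied from the question. The variable names are only illustrative, and the character classes assume IDs made of two capital letters followed by five digits.

```shell
# Sample data copied from the question (variable names are illustrative).
input='HG02239 -23.42333333
NA06985NA06985 -20.125
NA06991NA06991 -20.92'

# BRE syntax: \( \) captures, \{n\} is an interval, \1 is the backreference.
# Only an ID of two capitals plus five digits that is immediately repeated
# gets collapsed to a single copy.
deduped=$(printf '%s\n' "$input" | sed 's/\([A-Z]\{2\}[0-9]\{5\}\)\1/\1/')
printf '%s\n' "$deduped"
```

Lines whose ID is not doubled pass through unchanged, because the pattern needs two adjacent copies to match.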

One way, using awk:
awk '{ print substr($1, 1, 7), $2 }' file.txt
Output:
HG02239 -23.42333333
NA06985 -20.125
NA06991 -20.92
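The command above lets awk split on any whitespace and rejoins the fields with a single space. If the tab layout described in the question has to survive (empty first column, ID in the second), a variant might look like the sketch below. The three-column layout and the sample lines are assumptions based on the question's description.

```shell
# Assumed layout: empty column 1, ID in column 2, value in column 3 (tab-separated).
input=$(printf '\tHG02239\t-23.42333333\n\tNA06985NA06985\t-20.125\n')

# Split and rejoin on tabs, and keep only the first 7 characters of column 2.
trimmed=$(printf '%s\n' "$input" |
    awk 'BEGIN { FS = OFS = "\t" } { $2 = substr($2, 1, 7) } 1')
printf '%s\n' "$trimmed"
```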

You could use something like that:
sed -i 's|\([A-Z]\{2\}[0-9]\{5\}\)[A-Z0-9]*\s*\(.*\)|\1 \2|g' <your-file>

Related

Extract text from each line from a multiple-line text file based on a condition, Linux

I have a txt file with only one column that each line represent a different fastq.gz file from a sequence output. See an example below:
36108-ABZG339L_S237_L001_R1_001.fastq.gz
36108-ABZG339L_S237_L001_R2_001.fastq.gz
36108-ABZGM_S7_L001_R1_001.fastq.gz
36108-ABZGM_S7_L001_R2_001.fastq.gz
First of all, I would like to convert the first "-" symbol to underscore "_".
I achieved that through the following command:
sed 's/[-]/_/Ig' inputfile.txt > outputfile.txt
Then the outputfile.txt is:
36108_ABZG339L_S237_L001_R1_001.fastq.gz
36108_ABZG339L_S237_L001_R2_001.fastq.gz
36108_ABZGM_S7_L001_R1_001.fastq.gz
36108_ABZGM_S7_L001_R2_001.fastq.gz
Afterwards, I would like to extract in a new txt file only the text between first and second underscore, so:
ABZG339L
ABZG339L
ABZGM
ABZGM
How can I achieve this? I tried with sed and awk, but I cannot figure it out.
Thanks in advance for your help,
Magí
1st solution: To get the expected output shown, you do not need to substitute - with _ first and then print. We can use the power of awk to set multiple field separators and then print the needed value.
awk -F'-|_' '{print $2}' Input_file
Explanation: The awk program sets both _ and - as field separators for the whole Input_file and then prints the 2nd field/column.
2nd solution: Using sed solution, using sed's back reference capability here.
sed -E 's/^[^-]*-([^_]*).*/\1/' Input_file
Explanation: sed's -E option enables ERE (extended regular expressions). The pattern matches from the start of the line up to the first -, captures everything after it up to the first _ in the first capture group (referenced later as \1), and then matches the rest of the line. The substitution replaces the whole line with the captured value, which gives the desired result.
3rd solution: Using GNU grep. The -oP options enable the PCRE regex engine and print only the matched text. The pattern matches everything from the start of the line up to the first -, discards that part of the match with \K, and then matches everything up to the next _, which is what gets printed.
grep -oP '^.*?-\K[^_]*' Input_file
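For completeness, the same extraction can be sketched in plain POSIX shell with parameter expansion, without awk, sed, or grep. The function name extract_sample is hypothetical, and the sketch assumes every name contains a - before the first _.

```shell
# Hypothetical helper: print the text between the first "-" and the next "_".
extract_sample() {
    rest=${1#*-}                  # drop everything up to and including the first "-"
    printf '%s\n' "${rest%%_*}"   # drop everything from the first "_" onward
}

extract_sample '36108-ABZG339L_S237_L001_R1_001.fastq.gz'
extract_sample '36108-ABZGM_S7_L001_R2_001.fastq.gz'
```

To process a whole file, the function could be called inside a `while IFS= read -r line` loop.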

how to transpose values two by two using shell?

I have my data in a file, stored in lines like this:
3.172704445659,50.011996744997,3.1821975358417,50.012335988197,3.2174797791605,50.023182479597
And I would like 2 columns :
3.172704445659 50.011996744997
3.1821975358417 50.012335988197
3.2174797791605 50.023182479597
I know the sed command to replace ',' (sed "s/,/ /"), but I don't know how to insert a line break after every two numbers.
Do you have any ideas ?
One in awk:
$ awk -F, '{for(i=1;i<=NF;i++)printf "%s%s",$i,(i%2&&i!=NF?OFS:ORS)}' file
Output:
3.172704445659 50.011996744997
3.1821975358417 50.012335988197
3.2174797791605 50.023182479597
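The one-liner is terse, so here is the same logic spread out with comments, as a sketch. The shortened sample line and the variable names are illustrative.

```shell
line='3.172704445659,50.011996744997,3.1821975358417,50.012335988197'

pairs=$(printf '%s\n' "$line" | awk -F, '{
    for (i = 1; i <= NF; i++) {
        # Odd-numbered field that is not the last one: follow it with OFS (a space).
        # Even-numbered field, or the last field: end the output line with ORS.
        sep = (i % 2 && i != NF) ? OFS : ORS
        printf "%s%s", $i, sep
    }
}')
printf '%s\n' "$pairs"
```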
Solution viable for those without knowledge of awk command - simple for loop over an array of numbers.
IFS=',' read -ra NUMBERS < file
NUMBERS_ON_LINE=2
INDEX=0
for NUMBER in "${NUMBERS[@]}"; do
    if ((INDEX == NUMBERS_ON_LINE - 1)); then
        INDEX=0
        echo "$NUMBER"
    else
        ((INDEX++))
        echo -n "$NUMBER "
    fi
done
Since you already tried sed, here is a solution using sed:
sed -r "s/(([^,]*,){2})/\1\n/g; s/,\n/\n/g" YOURFILE
-r uses sed extended regexp
there are two substitutions used:
the first substitution, with the (([^,]*,){2}) part, captures two comma separated numbers at once and store them into \1 for reuse: \1 holds in your example at the first match: 3.172704445659,50.011996744997,. Notice: both commas are present.
(([^,]*,){2}) means capture a sequence consisting of NOT comma - that is the [^,]* part followed by a ,
we want two such sequences - that is the (...){2} part
and we want to capture it for reuse in \1 - that is the outer pair of parentheses
then substitute with \1\n - that just inserts the newline after the match, in other words a newline after each second comma
as we have now a comma before the newline that we need to get rid of, we do a second substitution to achieve that:
s/,\n/\n/g
a comma followed by a newline is replaced with only a newline - in other words the comma is deleted
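Note that these two substitutions keep the comma inside each pair. A sketch that adds a third substitution to turn the remaining commas into spaces, assuming GNU sed (which accepts \n in the replacement text); the variable names are illustrative:

```shell
line='3.172704445659,50.011996744997,3.1821975358417,50.012335988197,3.2174797791605,50.023182479597'

# 1) newline after every second comma, 2) drop the comma left before each
# newline, 3) replace the comma that remains inside each pair with a space.
pairs=$(printf '%s\n' "$line" |
    sed -E 's/(([^,]*,){2})/\1\n/g; s/,\n/\n/g; s/,/ /g')
printf '%s\n' "$pairs"
```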
awk and sed are powerful tools, and in fact constitute programming languages in their own right. So, they can, of course, handle this task with ease.
But so can bash, which will have the benefits of being more portable (no outside dependencies), and executing faster (as it uses only built-in functions):
IFS=$', \n'
values=($(</path/to/file))
printf '%s %s\n' "${values[@]}"

How to use invert "-v" in grep when I do not have a file but a long string that is just one line?

Suppose I have
echo "The first part. The second part. The third part."
and want to remove The first part and The third part to get:
The second part.
I tried:
echo "The first part. The second part. The third part." | grep -v -e "The first part." -e "The third part."
but the inverting flag appears to work only for files with multiple lines. How can I do it for a single string?
Use sed instead:
echo "The first part. The second part. The third part." \
| sed -e 's/[[:space:]]*The first part\.[[:space:]]*//g' \
-e 's/[[:space:]]*The third part\.[[:space:]]*//g'
grep is a line-based tool: it selects the lines which satisfy a condition. The task you want to implement is rather to remove substrings from a file. This is a matter of substitution, not selection, and the best tool for this task is sed:
sed 's/string_to_get_rid_of//g' file
Of course, it is possible that your file is structured in records and you want to remove all records which contain a particular word; in that case there is another option. Assume that your file is split into records which are delimited by a unique character (e.g. the full stop, .). Then it is better to use awk. Awk allows you to redefine its record separator from a newline (the default) to anything you want by defining RS and ORS (the latter for the output):
awk 'BEGIN{RS=ORS="."}/string_that_should_not_appear/{next}1' file
Assume you have a file with the content:
foo.bar.baz.qux
quux.quuz.corge
If we want to remove all the records which contain qux, we do:
awk 'BEGIN{RS=ORS="."}/qux/{next}1' file
which returns
foo.bar.baz.quuz.corge.
Notice that the record containing "qux" contained a newline and that an extra ORS is added at the end. Also, you might get
foo.bar.baz.quuz.corge
.
which is due to the POSIX requirement that text files end with a newline.
In case of the OP, it would read:
awk 'BEGIN{RS=ORS="."}/The first part/{next}/The third part/{next}1' file
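A quick check of this record-based approach on the string from the question, fed in without a trailing newline so that no extra final record is produced (the variable names are illustrative):

```shell
text='The first part. The second part. The third part.'

# Records are split on "."; the unwanted records are skipped, the rest are
# printed followed by ORS, which is also ".".
kept=$(printf '%s' "$text" |
    awk 'BEGIN { RS = ORS = "." } /The first part/ { next } /The third part/ { next } 1')
printf '%s\n' "$kept"
```

The leading space belongs to the second record and survives, since only whole records are dropped.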

Using sed to obtain pattern range through multiple files in a directory

I was wondering if it was possible to use the sed command to find a range between 2 patterns (in this case, dates) and output these lines in the range to a new file.
Right now, I am just looking at one file and getting lines within my time range of the file FileMoverTransfer.log. However, after a certain time period, these logs are moved to new log files with a suffix such as FileMoverTransfer.log-20180404-xxxxxx.gz. Here is my current code:
sed -n '/^'$start_date'/,/^'$end_date'/p;/^'$end_date'/q' FileMoverTransfer.log >> /public/FileMoverRoot/logs/intervalFMT.log
However, the following doesn't work; sed doesn't seem to be able to look through all of the files in the directory starting with FileMoverTransfer.log:
sed -n '/^'$start_date'/,/^'$end_date'/p;/^'$end_date'/q' FileMoverTransfer.log* >> /public/FileMoverRoot/logs/intervalFMT.log
Any help would be greatly appreciated. Thanks!
The range operator only operates within a single file, so you can't use it if the start is in one file and the end is in another file.
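You can use cat to concatenate all the files, and pipe this to sed (any rotated .gz files have to be decompressed first, for example with gunzip -c, since cat would pass on the compressed bytes):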
cat FileMoverTransfer.log* | sed -n "/^$start_date/,/^$end_date/p;/^$end_date/q" >> /public/FileMoverRoot/logs/intervalFMT.log
And instead of quoting and unquoting the sed command, you can use double quotes so that the variables will be expanded inside it. This will also prevent problems if the variables contain whitespace.
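A small sketch of the double-quoted form on made-up log lines. The date format and the log content are assumptions, since the question does not show them.

```shell
start_date='2018-04-02'   # hypothetical date format
end_date='2018-04-03'

log='2018-04-01 a
2018-04-02 b
2018-04-02 c
2018-04-03 d
2018-04-04 e'

# The shell expands the variables inside the double quotes before sed runs.
selected=$(printf '%s\n' "$log" |
    sed -n "/^$start_date/,/^$end_date/p;/^$end_date/q")
printf '%s\n' "$selected"
```

Printing stops at the first line that starts with the end date, because the q command quits right after that line has been printed.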
awk solution
As the OP confirmed that an awk solution would be acceptable, I post it.
(gunzip -c FileMoverTransfer.log-*.gz; cat FileMoverTransfer.log ) \
|awk -v st="$start_date" -v en="$end_date" '$1>=st&&$1<=en{print;next}$1>en{exit}'\
>/public/FileMoverRoot/logs/intervalFMT.log
This solution is functionally almost identical to Barmar’s sed solution, with the difference that his solution, like the OP’s, will print and quit at the first record matching the end date, while mine will print all lines matching the end date and quit at the first record past the end date, without printing it.
Some remarks:
The OP didn't specify the date format. I suppose it is a format compatible with ordinary string order, otherwise some conversion function should be used.
The files FileMoverTransfer.log-*.gz must be named in such a way that their alphabetical ordering corresponds to the chronological order (which is probably the case.)
I suppose that the dates are separated from the rest of the line by whitespace. If they aren’t, you have to supply the -F option to awk. E.g., if the dates are separated by -, you must write awk -F- ...
awk is much faster than sed in this case, because awk simply looks for the separator (whitespace or whatever was supplied with -F) while sed performs a regexp match.
There is no concept of range in my code, only date comparison. The only place where I suppose that the lines are ordered is when I say $1>en{exit}, that is exit when a line is newer than the end date. If you remove that final pattern and its action, the code will run through the whole input, but you could drop the requirement that the files be ordered.
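The difference in how the end date is handled can be seen on a few made-up lines. The date format and the log content below are assumptions; the dates are chosen so that plain string comparison matches chronological order.

```shell
log='2018-04-01 a
2018-04-02 b
2018-04-03 c
2018-04-03 d
2018-04-04 e'

# Print every line whose first field lies between st and en (inclusive),
# and stop reading at the first line past the end date.
selected=$(printf '%s\n' "$log" |
    awk -v st='2018-04-02' -v en='2018-04-03' \
        '$1 >= st && $1 <= en { print; next } $1 > en { exit }')
printf '%s\n' "$selected"
```

Both lines dated 2018-04-03 are kept, whereas the sed range would stop after the first of them.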

Prefix search names to output in bash

I have a simple egrep command searching for multiple strings in a text file which outputs either null or a value. Below is the command and the output.
cat Output.txt|egrep -i "abc|def|efg"|cut -d ':' -f 2
Output is:-
xxx
(null)
yyy
Now, I am trying to prefix my search texts to the output like below.
abc:xxx
def:
efg:yyy
Any help on the code to achieve this or where to start would be appreciated.
-Abhi
Since I do not know the exact content of your input file (it is not specified in the question), I will make some assumptions in order to answer your question.
Case 1: the patterns you are looking for are always located in the same column
If it is the case, the answer is quite straightforward:
$ cat grep_file.in
abc:xxx:uvw
def:::
efg:yyy:toto
xyz:lol:hey
$ egrep -i "abc|def|efg" grep_file.in | cut -d':' -f1,2
abc:xxx
def:
efg:yyy
After the grep just use the cut with the 2 columns that you are looking for (here it is 1 and 2)
REMARK:
Do not cat the file, pipe it and then grep it, since this is doing the work twice!!! Your grep command will already read the file so do not read it twice, it might not be that important on small files but you will feel the difference on 10GB files for example!
Case 2: the patterns you are looking for are NOT located in the same column
In this case it is a bit more tricky, but not impossible. There are many ways of doing it; here I will detail the awk way:
$ cat grep_file2.in
abc:xxx:uvw
::def:
efg:yyy:toto
xyz:lol:hey
If your input file is in this format; with your pattern that could be located anywhere:
$ awk 'BEGIN{FS=":";ORS=FS}{tmp=0;for(i=1;i<=NF;i++){tmp=match($i,/abc|def|efg/);if(tmp){print $i;break}}if(tmp){printf "%s\n", $2}}' grep_file2.in
abc:xxx
def:
efg:yyy
Explanations:
FS=":";ORS=FS sets both the input field separator and the output record separator to :. Then, on each line, you define a test variable that becomes true when you reach your pattern. You loop over all the fields of the line until you reach it; if you do, you print the field, break the loop, and print the second field followed by an EOL character.
If you do not meet your pattern you do nothing.
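The same logic can be sketched in a more spread-out form. This version builds the output line in one print instead of switching ORS, which is a simplification rather than the exact command above; the sample data is the content of grep_file2.in shown earlier.

```shell
input='abc:xxx:uvw
::def:
efg:yyy:toto
xyz:lol:hey'

result=$(printf '%s\n' "$input" | awk -F: '{
    for (i = 1; i <= NF; i++) {
        if (match($i, /abc|def|efg/)) {
            # First field containing one of the patterns: print it,
            # then a colon and the second field, and stop scanning the line.
            print $i ":" $2
            break
        }
    }
}')
printf '%s\n' "$result"
```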
If you prefer the sed way, you can use the following command:
$ sed -n '/abc\|def\|efg/{h;s/.*\(abc\|def\|efg\).*/\1:/;x;s/^[^:]*:\([^:]*\):.*/\1/;H;x;s/\n//p}' grep_file2.in
abc:xxx
def:
efg:yyy
Explanations:
/abc\|def\|efg/{} filters the lines that contain at least one of the patterns provided; the instructions in the block are executed only for those lines. h;s/.*\(abc\|def\|efg\).*/\1:/; saves the line in the hold space and replaces the pattern space with the matched pattern followed by :. x;s/^[^:]*:\([^:]*\):.*/\1/; exchanges the pattern and hold spaces and extracts the second column. Last but not least, H;x;s/\n//p joins both extracted elements on one line and prints it.
try this
$ egrep -io "(abc|def|efg):[^:]*" file
will print the match and the next token after delimiter.
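Newer GNU grep releases flag egrep as obsolescent, so the same search can also be written with grep -E. A sketch on the sample data from Case 1 above (-o is an extension, available in GNU and BSD grep; the variable names are illustrative):

```shell
input='abc:xxx:uvw
def:::
efg:yyy:toto
xyz:lol:hey'

# -E: extended regex, -i: ignore case, -o: print only the matched part.
result=$(printf '%s\n' "$input" | grep -Eio '(abc|def|efg):[^:]*')
printf '%s\n' "$result"
```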
If we can assume that there are only two fields, that abc etc will always match in the first field, and that getting the last match on a line which contains multiple matches is acceptable, a very simple sed script could work.
sed -n 's/^[^:]*\(abc\|def\|efg\)[^:]*:\([^:]*\)/\1:\2/p' file
If other but similar conditions apply (e.g. there are three fields or more but we don't care about matches in the first two) the required modifications are trivial. If not, you really need to clarify your question.

Resources