Grep the first line from each contiguous group of matching lines - linux

I have a data file which looks like this:
a separator
interesting line 1
interesting line 2
a comment
interesting line 3
interesting line 4
interesting line 5
a non interesting line
some other data
interesting line 6
.
.
.
and I would like to extract the first interesting line from each contiguous group, no matter how many lines are in the group or how many extra lines separate the groups.
For the test input above the output would be:
interesting line 1
interesting line 3
interesting line 6
I could easily do this in python by having a state variable that triggers when I match a line, and resets when I encounter a non-matching line, but what about a one-line shell script? Is there a not-too-obscure way to do this?

You can use grep with a greedy regex to pull out each contiguous group, then print the first line of every match with head:
grep -Pzo '([^\n]*interesting line[^\n]*(\n|$))+' file |
while IFS='' read -d '' -r match
do
head -n1 <<< "$match"
done
grep parameters:
-P : Use Perl-compatible regular expressions (instead of the default basic regular expressions); this is needed for the \n in the regex.
-z : Treat input and output as sequences of lines terminated by a zero byte (ASCII NUL) instead of a newline. Combined with -o, a NUL then separates each match in the output, allowing us to reliably split the matches in the read loop.
the regex ([^\n]*interesting line[^\n]*(\n|$))+ will match each group of contiguous lines containing interesting line (the same shape works for any other string).
In the while condition, IFS is emptied for the read. Otherwise, with the default IFS, the trailing newline of each match would be eaten by read (which might not matter here), and leading whitespace would be stripped too. It is good practice to always clear IFS for "while read" loops so the variable holds the text exactly as it was read.
read parameters:
-d '' : Use the empty string as delimiter (= the ASCII NUL character). This is equivalent to -d $'\0' (see https://unix.stackexchange.com/q/61029/283498).
-r : don't interpret any backslash in the lines (see https://unix.stackexchange.com/q/192786/283498).
match : just a variable name I chose, which is used in the body of the loop.
And in the body of the loop: head -n1 <<< "$match" prints only the first line of the current match (head -n 1 prints the first line of its input). Side note: <<< is a bashism; the command is equivalent to echo "$match" | head -n1.
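If you would rather keep the state-variable idea from the question in a single command, an awk one-liner can do the same job (this is my own sketch, not part of the grep answer; interesting line stands in for whatever pattern you are matching):
awk '/interesting line/ { if (!f) print; f = 1; next } { f = 0 }' file
The flag f is set on the first matching line of a group (which gets printed) and cleared on any non-matching line, so only the first line of each contiguous group is printed.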

Related

Using echo in bash puts last variable in front of the output

I'm trying to write a script and one of the parts of the script requires me to concatenate some variables together to create a URL.
REPO_URL='https://github.com/Example/Repo.Game/'
FILENAME='Example.Game-linux.zip'
latest_version="$(curl -LIs "${REPO_URL}/releases/latest" | grep -i '^location:' | cut -d' ' -f2 | cut -d'/' -f8)"
echo "$latest_version"
echo "$FILENAME"
echo "$REPO_URL"
echo "${REPO_URL}releases/download/${latest_version}/${FILENAME}"
Output:
2.0.5164
Example.Game-linux.zip
https://github.com/Example/Repo.Game/
/Example.Game-linux.ziple/Repo.Game/releases/download/2.0.5164
My actual output:
2.0.5164
Oxide.Rust-linux.zip
https://github.com/OxideMod/Oxide.Rust/
/Oxide.Rust-linux.zipideMod/Oxide.Rust/releases/download/2.0.5164
It looks like some kind of overflow problem? I'm not exactly sure. I added abcabc to the filename and the output became
/Oxide.Rust-linux.zipabcabc/Oxide.Rust/releases/download/2.0.5164
Any help would be appreciated.
I resolved the problem by removing the carriage return character from the variable.
tr -d '\r' seems to have resolved it. I'm not sure where the carriage return came from, and if anyone has advice on how to clean up this mess I would love to hear it.
latest_version="$(curl -LIs "${REPO_URL}/releases/latest" | grep -i '^location:' | cut -d' ' -f2 | cut -d'/' -f8 | tr -d '\r')
You can use ANSI quoting and variable substitution to remove control characters from variables without having to invoke sub-shells.
ANSI quoting uses the special format $'\*' to represent special characters. For example use $'\t' for tab, $'\n' for new-line and $'\r' for carriage-return.
Variable substitution uses extra characters at the end of the variable name to perform actions on the variable. For example
${variable//[pattern]/[substitution]} will replace all instances of [pattern] in ${variable} with [substitution].
${variable%[pattern]} will remove [pattern] from ${variable} if it is at the end.
By combining these two, you can remove carriage-return characters from the end of your variable like this:
echo "${variable%$'\r'}"
Note: Variable substitution doesn't actually change the contents of the variable. To do that, you have to re-assign the result back to the variable:
variable="${variable%$'\r'}"
There is a cleaner way to get the version number, minus any trailing carriage return, from GitHub using sed.
latest_version=$(curl -LIs "${REPO_URL}/releases/latest" | sed -n 's/^Location:.*\/\([^\r]*\).*$/\1/p')
sed reads every line of input (STDIN by default) and performs operations on it defined by the action string parameter. The action string is a little tricky to explain in this case, but here goes:
The -n option suppresses the printing of each input line. Output will then only happen if it is explicitly stated in the action string.
The s/[pattern]/[substitution]/p construct says: whenever you find [pattern], replace it with [substitution] and print the result. Our [pattern] is ^Location:.*\/\([^\r]*\).*$, and our [substitution] is \1.
The expression ^ matches the beginning of the line.
The expression . means any single character, and the expression .* means any number of characters (including zero). This will match the largest possible string, so, for example .*/ will match abc/def/ in the string abc/def/ghi.
The expression \/ just escapes the forward slash (because we are using backslash as a delimiter, we have to escape it).
The expression \([pattern]\) says any time you find [pattern], remember it. In our case, it will remember whatever matches [^\r]*.
The expression [{chars}] matches any one of the characters in {chars}, and [^{chars}] matches any character that is not in {chars}. So [^\r]* matches any number of characters that are not a carriage return.
The expression $ matches the end of a line.
The expression \1 is replaced by the first remembered pattern.
So altogether, our action string says:
If you find a line that starts with Location:, followed by any number of characters, followed by a /, followed by any number of characters that are not a carriage return (which will be remembered), followed by any number of characters, followed by an end of line, then print the remembered characters.
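As a quick sanity check, here is the sed expression run against a made-up Location header (this assumes GNU sed, which accepts \r inside the bracket expression):
$ printf 'Location: https://github.com/Example/Repo.Game/releases/tag/2.0.5164\r\n' | sed -n 's/^Location:.*\/\([^\r]*\).*$/\1/p'
2.0.5164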

How to extract a specific text from gz file?

I need to extract the 5th to 11th characters from my fastq.gz data; this data is just too large for running in R. So I was wondering if I can do it directly on the Linux command line?
The fastq file looks like this:
#NB501399:67:HFKTCBGX5:1:11101:13202:1044 1:N:0:CTTGTA
GAGGTNACGGAGTGGGTGTGTGCAGGGCCTGGTGGGAATGGGGAGACCCGTGGACAGAGCTTGTTAGAGTGTCCTAGAGCCAGGGGGAACTCCAGGCAGGGCAAATTGGGCCCTGGATGTTGAGAAGCTGGGTAACAAGTACTGAGAGAAC
+
AAAAA#EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAAEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAE6
#NB501399:67:HFKTCBGX5:1:11101:1109:1044 1:N:0:CTTGTA
TAGGCNACCTGGTGGTCCCCCGCTCCCGGGAGGTCACCATATTGATGCCGAACTTAGTGCGGACACCCGATCGGCATAGCGCACTACAGCCCAGAACTCCTGGACTCAAGCGATCCTCCAGCCTCAGCCTCCCGAGTAGCTGGGACTACAG
+
And I only want to extract the 5th to 11th characters located in the sequence part (for the first record that is TNACGG, for the second CNACCT) and write them to a new txt file. Can I do that?
You can use GNU sed with zcat:
zcat fastq.gz | sed -n '2~4{s/.\{4\}\(.\{6\}\).*/\1/;p}'
-n means lines are not printed by default
2~4 means start with line 2, then select every fourth line (the sequence line of each four-line FASTQ record)
when the "address" matches, the substitution remembers the fifth to tenth characters in \1 and replaces the whole line with them; p prints the result
Another using zgrep and positive lookbehind:
$ zgrep -oP "(?<=^[ACTGN]{4})[ACTGN]{6}" foo.gz
TNACGG
CNACCT
Explained:
zgrep : man zgrep: search possibly compressed files for a regular expression
-o Print only the matched (non-empty) parts of a matching line
-P Interpret the pattern as a Perl-compatible regular expression (PCRE).
(?<=^[ACTGN]{4}) positive lookbehind
[ACTGN]{6} match 6 of the listed characters that are preceded by the above
foo.gz my test file
$ zcat fastq.gz | awk '(NR%4)==2{print substr($0,5,6)}'
TNACGG
CNACCT

How can I get the length of each output line of grep

I am very new to bash scripting.
I have a network trace file I want to parse. Part of the trace file is (two packets):
[continues...]
+---------+---------------+----------+
05:00:00,727,744 ETHER
|0
|00|03|a0|09|5c|1c|00|10|07|df|a4|20|08|00|45|00|00|38|e7|55|
+---------+---------------+----------+
05:00:00,727,751 ETHER
|0
|00|03|a0|09|5c|1c|00|10|07|df|a4|20|08|00|45|00|00|38|e7|56|00|00|3a|01|
[continues...]
For each packet, I want to print the time stamp and the length of the packet (the hex values on the line that follows the |0 header), so the output will look like:
05:00:00.727744 20 bytes
05:00:00.727751 24 bytes
I can get the line with time stamp and the packets separately using grep in bash:
times=$(grep '..\:..\:' $fileName)
packets=$(grep '..|..|' $fileName)
But I can't work with the separate output lines after that. The whole result is concatenated in the two variables "times" and "packets". How can I get the length of each packet?
P.S. A good reference that really explains how to do bash programming, rather than just walking through examples, would be appreciated.
Okay, with plain old shell...
You can get the length of the line like this:
line="|00|03|a0|09|5c|1c|00|10|07|df|a4|20|08|00|45|00|00|38|e7|55|"
wc -c<<<$line
62
There are sixty-two characters in that line. Think of each byte as a three-character group |00, where 00 can be any pair of hex digits; there is also an extra | on the end, and wc -c counts the trailing newline as well.
So, if we take the value from wc -c and subtract 2, we get 60. If we divide that by 3, we get 20, which is the number of bytes in the packet.
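Those three steps look like this at the prompt (assuming $line still holds the sample line assigned above):
$ length=$(wc -c <<< "$line")    # 62: twenty three-character |xx groups, a trailing |, and the newline
$ (( length -= 2, length /= 3 ))
$ echo "$length bytes"
20 bytes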
Okay, now we need a little loop to figure out which kind of line we have and then parse it:
#! /bin/bash
while read line
do
    if [[ $line =~ ^[[:digit:]]{2} ]]
    then
        echo -n "${line% *} "
    elif [[ $line =~ ^\|[[:digit:]]{2} ]]
    then
        length=$(wc -c<<<$line)
        ((length-=2))
        ((length=length/3))
        echo "$length bytes"
    fi
done < test.txt
There's a PURE BASH solution to your problem!
You're a beginning Bash programmer, and you have no idea what's going on...
Let's take this one step at a time:
A common way to loop through a file in BASH is using a while read loop. This combines the while with a read:
while read line
do
    echo "My line is '$line'"
done < test.txt
Each line in test.txt is being read into the $line shell variable.
Let's take the next one:
if [[ $line =~ ^[[:digit:]]{2} ]]
This is an if statement. Always use the [[ ... ]] brackets because they fix issues with the shell interpolating stuff. Plus, they have a bit more power.
The =~ is a regular expression match. The [[:digit:]] matches any digit. The ^ anchors the regular expression to the beginning of the line, and {2} means I want exactly two of these. This says if I match a line that starts with two digits (which is your timestamp line), execute this if clause.
${line% *} is a pattern filter. The % says to match the smallest glob pattern at the right end of $line and remove it; I use this to strip the trailing ETHER from the line. The -n tells echo not to print a newline.
Let's take my elif which is an else if clause.
elif [[ $line =~ ^\|[[:digit:]]{2} ]]
Again, I am matching a regular expression. This regular expression starts with (The ^) a |. I have to put a backslash in front because | is a magical regular expression character and \ kills the magic. It's now just a pipe. Then, that's followed by two digits. Note this skips |0 but catches |00.
Now, we have to do some calculations:
length=$(wc -c<<<$line)
The $(...) says to execute the enclosed command and substitute its output back into the line. The wc -c counts the characters and <<<$line is what we're counting. This gave us 62 characters. We have to subtract 2, then divide by 3. That's the next two lines:
((length-=2))
((length/=3))
The ((...)) allows me to do integer based math. The first subtracts 2 from $length and the next divides it by 3. Now, I can echo this out:
echo "$length bytes"
And that's our pure Bash answer to this question.
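Run against the trace excerpt saved as test.txt, this should print something like the lines below; note that the timestamps pass through unchanged, so converting 05:00:00,727,744 into the 05:00:00.727744 form the question asked for is left as an extra step:
05:00:00,727,744 20 bytes
05:00:00,727,751 24 bytes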
You really don't want to do such things with your shell.
You want to write a real parser that understands the format and outputs the needed information.
For a quick and dirty hack you can do something like this:
perl -wne 'print "$& " if /^\d\S*/; print split(/\|/)-2, " bytes\n" if /^\|..\|/'

Match a string that contains a newline using sed

I have a string like this one:
#
pap
which basically translates to \t#\n\tpap, and I want to replace it with:
#
pap
python
which translates to \t#\n\tpap\n\tpython.
Tried this with sed in a lot of ways but it's not working, maybe because sed handles new lines in a different way. I tried with:
sed -i "s/\t#\n\tpap/\t#\tpython\n\tpap/" /etc/freeradius/sites-available/default
...and many different other ways with no result. Any idea how I can do my replace in this situation?
try this line with gawk:
awk -v RS="\0" -v ORS="" '{gsub(/\t#\n\tpap/,"yourNEwString")}7' file
if you want to let sed handle new lines, you have to read the whole file first:
sed ':a;N;$!ba;s/\t#\n\tpap/NewString/g' file
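For this particular file, the whole-file form would look something like the line below (a sketch, assuming GNU sed, which understands \t and \n in both the pattern and the replacement; add -i once the output looks right):
sed ':a;N;$!ba;s/\t#\n\tpap/\t#\n\tpap\n\tpython/' /etc/freeradius/sites-available/default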
This might work for you (GNU sed):
sed '/^\t#$/{n;/^\tpap$/{p;s//\tpython/}}' file
If a line contains only \t# print it, then if the next line contains only \tpap print it too, then replace that line with \tpython and print that.
A GNU sed solution that doesn't require reading the entire file at once:
sed '/^\t#$/ {n;/^\tpap$/a\\tpython'$'\n''}' file
/^\t#$/ matches comment-only lines (matching \t# exactly), in which case (only) the entire {...} expression is executed:
n loads and prints the next line.
/^\tpap$/ matches that next line against \tpap exactly.
in case of a match, a\\tpython will then output \n\tpython before the following line is read - note that the spliced-in newline ($'\n') is required to signal the end of the text passed to the a command (you can alternatively use multiple -e options).
(As an aside: with BSD sed (OS X), it gets cumbersome, because
Control chars. such as \n and \t aren't directly supported and must be spliced in as ANSI C-quoted literals.
Leading whitespace is invariably stripped from the text argument to the a command, so a substitution approach must be used: s//&\'$'\n\t'python'/ replaces the pap line with itself plus the line to append:
sed '/^'$'\t''#$/ {n; /^'$'\t''pap$/ s//&\'$'\n\t'python'/;}' file
)
An awk solution (POSIX-compliant) that also doesn't require reading the entire file at once:
awk '{print} /^\t#$/ {f=1;next} f && /^\tpap$/ {print "\tpython"} {f=0}' file
{print}: prints every input line
/^\t#$/ {f=1;next}: sets flag f (for 'found') to 1 if a comment-only line (matching \t# exactly) is found and moves on to the next line.
f && /^\tpap$/ {print "\tpython"}: if a line is preceded by a comment line and matches \tpap exactly, outputs extra line \tpython.
{f=0}: resets the flag that indicates a comment-only line.
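A quick check with a two-line stand-in for the config file shows the behaviour; cat -A makes the tabs (^I) and line ends ($) visible:
$ printf '\t#\n\tpap\n' | awk '{print} /^\t#$/ {f=1;next} f && /^\tpap$/ {print "\tpython"} {f=0}' | cat -A
^I#$
^Ipap$
^Ipython$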
A couple of pure bash solutions:
Concise, but somewhat fragile, using parameter expansion:
in=$'\t#\n\tpap\n' # input string
echo "${in/$'\t#\n\tpap\n'/$'\t#\n\tpap\n\tpython\n'}"
Parameter expansion only supports patterns (wildcard expressions) as search strings, which limits the matching abilities:
Here the assumption is made that pap is followed by \n, whereas no assumption is made about what precedes \t#, potentially resulting in false positives.
If the assumption could be made that \t#\n\tpap is always enclosed in \n, echo "${in/$'\n\t#\n\tpap\n'/$'\n\t#\n\tpap\n\tpython\n'}" would work robustly; otherwise, see below.
Robust, but verbose, using the =~ operator for regex matching:
The =~ operator supports extended regular expressions on the right-hand side and thus allows more flexible and robust matching:
in=$'\t#\n\tpap' # input string
# Search string and string to append after.
search=$'\t#\n\tpap'
append=$'\n\tpython'
out=$in # Initialize output string to input string.
if [[ $in =~ ^(.*$'\n')?("$search")($'\n'.*)?$ ]]; then # perform regex matching
out=${out/$search/$search$append} # replace match with match + appendage
fi
echo "$out"
You can just translate the character \n to another one, then apply sed, then apply the reverse translation. If tr is used, the replacement must be a single-byte character, for instance \v (vertical tab, nowadays almost unused).
cat FILE|tr '\n' '\v'|sed 's/\t#\v\tpap/&\v\tpython/'|tr '\v' '\n'|sponge FILE
or, without sponge:
cat FILE|tr '\n' '\v'|sed 's/\t#\v\tpap/&\v\tpython/'|tr '\v' '\n' >FILE.bak && mv FILE.bak FILE

Convert string to hexadecimal on command line

I'm trying to convert "Hello" to 48 65 6c 6c 6f in hexadecimal as efficiently as possible using the command line.
I've tried looking at printf and google, but I can't get anywhere.
Any help greatly appreciated.
Many thanks in advance,
echo -n "Hello" | od -A n -t x1
Explanation:
The echo program will provide the string to the next command.
The -n flag tells echo to not generate a new line at the end of the "Hello".
The od program is the "octal dump" program. (We will be providing a flag to tell it to dump it in hexadecimal instead of octal.)
The -A n flag is short for --address-radix=n, with n being short for "none". Without this part, the command would output an ugly numerical address prefix on the left side. This is useful for large dumps, but for a short string it is unnecessary.
The -t x1 flag is short for --format=x1, with the x being short for "hexadecimal" and the 1 meaning 1 byte.
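For the string from the question, this produces the expected hex bytes (od prints a leading space and single spaces between the bytes):
$ echo -n "Hello" | od -A n -t x1
 48 65 6c 6c 6f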
If you want to do this and remove the spaces you need:
echo -n "Hello" | od -A n -t x1 | sed 's/ *//g'
The first two commands in the pipeline are well explained by #TMS in his answer, as edited by #James. The last command differs from #TMS comment in that it is both correct and has been tested. The explanation is:
sed is a stream editor.
s is the substitute command.
/ opens a regular expression - any character may be used; / is conventional, but inconvenient for processing, say, XML or path names.
/ (or the alternate character you chose) closes the regular expression and opens the substitution string.
In / */ the * matches any sequence of the previous character (in this case, a space).
/ (or the alternate character you chose) closes the substitution string.
In this case, the substitution string // is empty, i.e. the match is deleted.
g is the option to do this substitution globally on each line instead of just once per line.
The quotes keep the command parser from getting confused - the whole sequence is passed to sed as the first option, namely, a sed script.
#TMS's brainchild (sed 's/^ *//') only strips spaces from the beginning of each line (^ matches the beginning of the line - the 'pattern space' in sed-speak).
If you additionally want to remove newlines, the easiest way is to append
| tr -d '\n'
to the command pipes. It functions as follows:
| feeds the previously processed stream to this command's standard input.
tr is the translate command.
-d specifies deleting the match characters.
Quotes list your match characters - in this case just newline (\n).
Translate only matches single characters, not sequences.
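Putting the pieces together, the full pipeline then yields one unbroken hex string:
$ echo -n "Hello" | od -A n -t x1 | sed 's/ *//g' | tr -d '\n'
48656c6c6f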
sed is uniquely awkward when dealing with newlines. This is because sed is one of the oldest unix commands - it was created before people really knew what they were doing. Pervasive legacy software keeps it from being fixed. I know this because I was born before unix was born.
The historical origin of the problem was the idea that a newline was a line separator, not part of the line. It was therefore stripped by line processing utilities and reinserted by output utilities. The trouble is, this makes assumptions about the structure of user data and imposes unnatural restrictions in many settings. sed's inability to easily remove newlines is one of the most common examples of that malformed ideology causing grief.
It is possible to remove newlines with sed - it is just that all solutions I know about make sed process the whole file at once, which chokes for very large files, defeating the purpose of a stream editor. Any solution that retains line processing, if it is possible, would be an unreadable rat's nest of multiple pipes.
If you insist on using sed try:
sed -z 's/\n//g'
-z tells sed to use nulls as line separators.
Internally, a string in C is terminated with a null. The -z option is also a result of legacy, provided as a convenience for C programmers who might like to use a temporary file filled with C-strings and uncluttered by newlines. They can then easily read and process one string at a time. Again, the early assumptions about use cases impose artificial restrictions on user data.
If you omit the g option, this command removes only the first newline. With the -z option sed interprets the entire file as one line (unless there are stray nulls embedded in the file), terminated by a null and so this also chokes on large files.
You might think
sed 's/^/\x00/' | sed -z 's/\n//' | sed 's/\x00//'
might work. The first command puts a null at the front of each line on a line by line basis, resulting in \n\x00 ending every line. The second command removes one newline from each line, now delimited by nulls - there will be only one newline by virtue of the first command. All that is left are the spurious nulls. So far so good. The broken idea here is that the pipe will feed the last command on a line by line basis, since that is how the stream was built. Actually, the last command, as written, will only remove one null since now the entire file has no newlines and is therefore one line.
A simple pipe implementation just hands the downstream command a byte stream into which all upstream output is fed. The next command may be running in another thread, concurrently reading that stream, but it just sees the stream as a whole (albeit incomplete) and has no awareness of the chunk boundaries that fed it. Even if the pipe is a memory buffer, the next command sees the stream as a whole. The defect is inextricably baked into sed.
To make this approach work, you need a g option on the last command, so again, it chokes on large files.
The bottom line is this: don't use sed to process newlines.
echo hello | hexdump -v -e '/1 "%02X "'
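That hexdump format string prints upper-case hex, one byte at a time; note that without -n on echo, the trailing newline byte (0A) shows up too:
$ echo hello | hexdump -v -e '/1 "%02X "'
68 65 6C 6C 6F 0A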
Playing around with this further, a working solution is to simply remove the * - it is unnecessary both for the original requirement of removing the spaces and when you want to substitute an actual character instead, as follows:
echo -n "Hello" | od -A n -t x1 | sed 's/ /%/g'
%48%65%6c%6c%6f
So I consider this an improvement on the original answer, since the command now does exactly what is required, not just apparently.
Combining the answers from TMS and i-always-rtfm-and-stfw, the following works under Windows using gnu-utils versions of the programs 'od', 'sed', and 'tr':
echo "Hello"| tr -d '\42' | tr -d '\n' | tr -d '\r' | od -v -A n -tx1 | sed "s/ //g"
or in a CMD file as:
#echo "%1"| tr -d '\42' | tr -d '\n' | tr -d '\r' | od -v -A n -tx1 | sed "s/ //g"
A limitation on my solution is it will remove all double quotes (").
"tr -d '\42'" removes quote marks that the Windows 'echo' will include.
"tr -d '\r'" removes the carriage return, which Windows includes as well as '\n'.
The pipe (|) character must follow immediately after the string or the Windows echo will add that space after the string.
There is no '-n' switch to the Windows echo command.
