sed explanation so I can recreate a bit of code? - linux

Can someone please explain the following sed command?
title=$(wget -q -O - https://twitter.com/intent/user?user_id=$ID | sed -n 's/^.*<title>\(.*\) on Twitter<.title>.*$/\1/p')
printf "%s\n" "$title"
I tried (and failed terribly) to recreate it because I thought I understood what was going on in the code. So I wrote (well, more modded) it to be the following:
data-user-id=$(wget -q -O - https://twitter.com/$Username | sed -n 's/^.*"data-user-id">\([^<]*\)<.*$/\1/p')
printf "%s\n" "$data-user-id"
Obviously it errored because the syntax is wrong or something. But I'm trying to understand what is going on so I can make my own variant of it.
P.S. I can't just use the API for this due to how everything needs to be configured.

Give this a try:
wget -q -O - https://twitter.com/"${Username}" | sed -n '/data-screen-name=.'"${Username}"'".*data-user-id=/I {s/^.*data-screen-name=.'"${Username}"'".*data-user-id="\([0-9]*\)".*$/\1/Ip;q}'
128700677
data-user-id appears on several lines, so you need to select the line where data-screen-name matches the username.
sed uses regular expressions; here are two good tutorials to start with:
Regular Expressions
Sed - An Introduction and Tutorial by Bruce Barnett
A different sed script with a different output:
Username="StackOverflow"
wget -q -O - https://twitter.com/"${Username}" | sed -n '/data-screen-name=.'"${Username}"'".*data-user-id=/I {p;q}'
data-screen-name="StackOverflow" data-name="Stack Overflow" data-user-id="128700677"
-n instructs sed not to print anything unless the p command is used.
. means any single character.
* applies to the preceding character in the regex and means zero or more occurrences of it.
.* means zero or more of any character.
/data-screen-name=.'"${Username}"'".*data-user-id=/ selects lines that contain data-screen-name=, then any one character (.), then StackOverflow, then a " character, then zero or more of any character (.*), then data-user-id=.
The I after the closing / makes the match case-insensitive (a GNU sed extension).
{p;q} are the commands executed when the regex above matches.
p prints the current line.
q exits the sed script.
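The selection step can be tried offline. This is a minimal sketch, assuming GNU sed (for the I flag) and a few made-up lines of markup in place of the real page; only the line containing 128700677 is taken from the output shown above. It also stores the result in user_id, since a hyphenated name such as data-user-id is not a valid shell variable name, which is one reason the attempt in the question fails.

```shell
# Hypothetical sample markup; nothing is fetched from the network here.
Username="StackOverflow"
html='<div class="card" data-user-id="999">
<div data-screen-name="StackOverflow" data-name="Stack Overflow" data-user-id="128700677">
<div data-screen-name="Someone" data-user-id="555">'

# Print the first line that matches the address, then quit.
line=$(printf '%s\n' "$html" |
  sed -n '/data-screen-name=.'"${Username}"'".*data-user-id=/I {p;q}')

# Same address, plus an s command that keeps only the digits.
# Note the underscore: "data-user-id" is not a valid shell variable name.
user_id=$(printf '%s\n' "$html" |
  sed -n '/data-screen-name=.'"${Username}"'".*data-user-id=/I {s/^.*data-screen-name=.'"${Username}"'".*data-user-id="\([0-9]*\)".*$/\1/Ip;q}')

printf '%s\n%s\n' "$line" "$user_id"
```

The first line of the sample carries a data-user-id but no data-screen-name, so the address skips it; that is the situation the answer describes.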
The first sed script at the top contains an additional s/regex/replacement/ to clean up the line.
The additional elements used:
^ means the start of the line.
\( ... \) are used to define a group.
"\([0-9]*\)" is a group made of only digits, surrounded by two " characters which are not part of the group. It is the first group found in the regex, so it can be referenced in the replacement part with \1.

Assuming the title of the page is "foo on Twitter", it extracts "foo" from it.
But use XMLStarlet instead, since it allows you to specify XPath to extract the data instead of having to poke around with regular expressions.

Related

Extract path from a entire string in bash shell script

I need to extract path from a string. I found examples in another post, but missing additional steps.
I have a string as below:
title="test test good dskgkdh hdfyr /rlsmodules/svnrepo/SOURCE/CBL/MQ/BASELINE/MQO000.CBL kdlkfg nsfgf trhrnrt"
cobsrc=$(awk '{match($0,/\/[^"]*/,a);print a[0]}' <<< $title)
echo $cobsrc
Output is
/rlsmodules/svnrepo/SOURCE/CBL/MQ/BASELINE/MQO000.CBL kdlkfg nsfgf trhrnrt
I need only
/rlsmodules/svnrepo/SOURCE/CBL/MQ/BASELINE/MQO000.CBL
What modification is required?
An existing post on similar query:
how to extract path from string in shell script
Four solutions, in order of my own preference.
First option would be simple parameter expansion, in two steps:
$ title="/${title#*/}"
$ title="${title%% *}"
$ echo "$title"
/rlsmodules/svnrepo/SOURCE/CBL/MQ/BASELINE/MQO000.CBL
The first line removes everything up to and including the first slash (while prepending a slash to replace the one that's stripped); the second line removes everything from the first remaining space onward.
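If you would rather not overwrite title, the same two expansions can be wrapped in a small function. This is only a sketch: path_by_expansion is a made-up name, and the guard for input without any slash is an addition that the original two-liner does not have.

```shell
# path_by_expansion is a hypothetical helper name used only for this sketch.
path_by_expansion() {
  case $1 in
    */*) ;;            # contains a slash: carry on
    *) return 1 ;;     # no slash, so there is no path to extract
  esac
  path="/${1#*/}"      # drop everything up to and including the first slash, then restore it
  path="${path%% *}"   # drop everything from the first remaining space onward
  printf '%s\n' "$path"
}

title="test test good dskgkdh hdfyr /rlsmodules/svnrepo/SOURCE/CBL/MQ/BASELINE/MQO000.CBL kdlkfg nsfgf trhrnrt"
cobsrc=$(path_by_expansion "$title")
printf '%s\n' "$cobsrc"
```

Without the guard, a string with no slash would come back with a slash glued to its first word, which is rarely what you want.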
Or, if you prefer, use a regex:
$ [[ $title =~ ^[^/]*(/[^ ]+)\ ]]
$ echo ${BASH_REMATCH[1]}
/rlsmodules/svnrepo/SOURCE/CBL/MQ/BASELINE/MQO000.CBL
The regex translates as:
the start of the string,
a run of zero or more non-slashes,
a parenthesized atom:
a slash followed by one or more non-space characters,
a space, which ends the previous atom.
The $BASH_REMATCH array holds the matches: index 0 is the whole match and index 1 is the content of the parenthesized atom.
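One thing to watch: the regex above insists on a space after the path, so a path at the very end of the string would not match. Below is a hedged variant, assuming bash; path_by_regex is a made-up name, and the ( |$) alternation is an addition that is not part of the original answer. Keeping the regex in a variable also avoids quoting surprises inside [[ ]].

```shell
#!/usr/bin/env bash
# path_by_regex is a hypothetical helper name used only for this sketch.
path_by_regex() {
  # Same regex as above, except the path may end at a space OR at end of string.
  local re='^[^/]*(/[^ ]+)( |$)'
  if [[ $1 =~ $re ]]; then
    printf '%s\n' "${BASH_REMATCH[1]}"
  else
    return 1   # no slash-led word found
  fi
}

mid=$(path_by_regex "test good /rlsmodules/svnrepo/MQO000.CBL kdlkfg nsfgf")
end=$(path_by_regex "test good /rlsmodules/svnrepo/MQO000.CBL")
printf '%s\n%s\n' "$mid" "$end"
```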
Next option might be grep -o:
$ grep -o '/[^ ]*' <<<"$title"
(Result redacted -- you know what it'll be.)
You could of course assign this output to a variable using command substitution, which you already know about.
Last option is another external tool...
$ sed 's:^[^/]*::;s/ .*//' <<<"$title"
This is the same functionality as is handled by the parameter expansion (at the top of the answer) only in a sed script, which requires a call to an external program. Included only for pedantry. :)
Could you please try following.
echo "$title" | awk 'match($0,/\/.*\/[^ ]*/){print substr($0,RSTART,RLENGTH)}'
Output will be as follows.
/rlsmodules/svnrepo/SOURCE/CBL/MQ/BASELINE/MQO000.CBL
Second solution: assuming the path itself contains no spaces, the following may help you too.
echo "$title" | awk '{sub(/[^/]* /,"");sub(/ .*/,"")} 1'

Do not print unmatched text with sed

I want to print only matched lines and strip unmatched ones, but with following:
$ echo test12 test | sed -n 's/^.*12/**/p'
I always get:
** test
instead of:
**
What am I doing wrong?
[edit1]
I should provide more information about what I need; I should actually have started with it. I have a command which produces lots of lines of output, and I want to grab only the matching parts of the lines and strip the rest. So in the above example, 12 was meant to mark the end of the matched part of the line, and instead of ** I should have put &, which represents the matched string. So the full example is:
echo test12 test | sed -n 's/^.*12/&/p'
which produces exactly the same output as input:
test12 test
the expected output is:
test12
As suggested I started to find a grep alternative and the following looks promising:
$ echo test12 test | grep -Eo "^.*12"
but I don't see how to format the matched part; this only strips the unmatched text.
EDIT: In some cases the -E flag might be needed for sed, in which case the parentheses no longer need to be escaped. Check your sed's man page.
I think what you are looking for is this:
echo test12 test | sed -n 's/^\(.*12\).*$/\1/p'
If you want to discard the rest of the line, you have to match it as well but leave it out of the replacement. The \( and \) delimit a group that is then referenced by \1.
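Because the group is referenced in the replacement, you can also format the matched part there, which is what grep -o could not do. A small sketch with made-up input lines; the square brackets are just an example of formatting, and the -E form assumes a sed that supports that flag.

```shell
# Three sample lines: two contain 12, one does not.
input=$(printf '%s\n' 'test12 test' 'no match here' 'abc12 def')

# Basic regex syntax: escaped parentheses form the group.
basic=$(printf '%s\n' "$input" | sed -n 's/^\(.*12\).*$/[\1]/p')

# Extended regex syntax (-E): same group, unescaped parentheses.
extended=$(printf '%s\n' "$input" | sed -nE 's/^(.*12).*$/[\1]/p')

printf '%s\n' "$basic"
```

The line without 12 is dropped because -n suppresses default output and p only fires when the substitution succeeds.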
Good luck :)
Additional information on sed:
sed works on lines, and the ampersand character (&) represents the part of the line that was matched by the given regular expression, not the whole line. If a regex is "open" at the end (i.e. it doesn't end with the end-of-line anchor $), the s command replaces only the matched part and leaves the rest of the line untouched, which is why the trailing " test" survives unless you also match it with .*$.
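A quick experiment shows what & actually expands to. This sketch wraps & in square brackets (an arbitrary marker chosen for illustration) so that the extent of the match is visible in the output.

```shell
# Regex stops at 12: only "test12" is matched, the rest of the line is kept as-is.
open_ended=$(echo 'test12 test' | sed -n 's/^.*12/[&]/p')

# Regex extended with .*$: now the match, and therefore &, covers the whole line.
whole_line=$(echo 'test12 test' | sed -n 's/^.*12.*$/[&]/p')

printf '%s\n%s\n' "$open_ended" "$whole_line"
```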
Try:
echo test12 test | sed -n 's/^.*/**/p'
You don't need to match the number 12 explicitly, since ^.* already matches the whole line, 12 included. Be aware that this replaces every input line, whether or not it contains 12.
Your regular expression matches everything from the beginning of the line up to the last '12'. Only that matched part is replaced with '**', which is why you get '** test'. If you only want the matching part, I recommend using grep.

Line numbering in Grep

I have this command using grep:
cat nastava.html | grep '<td>[A-Z a-z]*</td><td>[0-9/]*</td>' | sed 's/[ \t]*<td>\([A-Z a-z]*\)<\/td><td>\([0-9]\{1,3\}\)\/[0-9]\{2\}\([0-9]\{2\}\)<\/td>.*/\1 mi\3\2 /'
|sort|grep -n ".*" | sed -r 's/(.*):(.*)/\1. \2/' >studenti.txt
I don't understand the second line. sort is OK, and grep -n numbers the sorted list, but why do we use ".*" here? It won't work without it, and I don't understand why.
The grep is used purely for the side effect of the line numbering with the -n option here, so the main thing is really to use a regular expression which matches all the input lines. As such, .* is not very elegant -- ^ would work without scanning the whole of every line, and $ trivially matches every line as well. Since you know the input lines are not empty, and thus contain at least one character, the simple regular expression . would work perfectly, too.
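To see that the choice of pattern makes no difference to the numbering, here is a small sketch with two made-up, non-empty input lines.

```shell
# Two sample lines, sorted as in the original pipeline.
sample=$(printf '%s\n' 'bravo' 'alpha' | sort)

# Each pattern matches every non-empty line, so grep -n numbers them all.
with_dotstar=$(printf '%s\n' "$sample" | grep -n '.*')
with_caret=$(printf '%s\n' "$sample" | grep -n '^')
with_dot=$(printf '%s\n' "$sample" | grep -n '.')

printf '%s\n' "$with_caret"
```

grep separates the number from the line with a colon, which is what the trailing sed in the original command rewrites into a period and a space.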
However, as the end goal is to perform line numbering, a better solution is to use a dedicated tool for this purpose.
... | sort | nl -ba -s '. '
The -ba option specifies to number all lines (the default is to only add a line number to non-empty lines; we know there are no empty lines, so it's not strictly necessary here, but it's good to know) and the -s option specifies the separator string to put after the number.
A possible minor complication is that the line number format is whitespace-padded, so in the end, this solution may not work for you if you specifically want unpadded numbers. (But a sed postprocessor to fix that up is a lot simpler than the postprocessor for grep you have now -- just sed 's/^ *//' will remove leading whitespace).
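Put together, the nl variant with the whitespace fix-up might look like this. The three input words are made-up stand-ins for the real student lines.

```shell
# Sort, number every line with ". " as the separator, then strip nl's left padding.
numbered=$(printf '%s\n' 'charlie' 'alpha' 'bravo' |
  sort |
  nl -ba -s '. ' |
  sed 's/^ *//')

printf '%s\n' "$numbered"
```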
... As an aside, the ugly cat | grep | sed pipeline can be abbreviated to just
sed -n 's%[ \t]*<td>\([A-Z a-z]*\)</td><td>\([0-9]\{1,3\}\)/[0-9]\{2\}\([0-9]\{2\}\)</td>.*%\1 mi\3\2 %p' nastava.html
The cat was never necessary in the first place, and the sed script can easily be refactored to only print when a substitution was performed (your grep regular expression was not exactly equivalent to the one you have in the sed script but I assume that was the intent). Also, using a different separator avoids having to backslash the slashes.
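The refactored command can be checked against a few lines of sample input. This is a sketch only: the table rows below are invented, since the real nastava.html is not shown, and the command reads standard input instead of the file.

```shell
# Hypothetical rows standing in for nastava.html.
html='<tr>
  <td>Marko Markovic</td><td>123/2015</td><td>9</td>
<td>Name</td><td>Index</td>'

# -n plus the p flag: only lines where the substitution succeeded are printed.
out=$(printf '%s\n' "$html" |
  sed -n 's%[ \t]*<td>\([A-Z a-z]*\)</td><td>\([0-9]\{1,3\}\)/[0-9]\{2\}\([0-9]\{2\}\)</td>.*%\1 mi\3\2 %p')

printf '[%s]\n' "$out"
```

The brackets in the final printf are there only to make the trailing space visible; the replacement text ends with one, exactly as in the original command.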
... And of course, if nastava.html is your own web page, the whole process is umop apisdn. You should have the students' results in a machine-readable form, and generate a web page from that, rather than the other way around.
grep needs a regular expression to match. You can't run grep with no expression at all. If you want to number all the lines, just specify an expression that matches anything. I'd probably use ^ instead of .*.

how to replace a special characters by character using shell

I have a string variable x=tmp/variable/custom-sqr-sample/test/example
in the script, what I want to do is to replace all the “-” with the /,
after that,I should get the following string
x=tmp/variable/custom/sqr/sample/test/example
Can anyone help me?
I tried the following, but it did not work:
exa=tmp/variable/custom-sqr-sample/test/example
exa=$(echo $exa|sed 's/-///g')
sed supports almost any character as the delimiter, which comes in handy when you need to match a /. Common choices are | and #; pick one that does not occur in the string you are working on.
$ echo $x
tmp/variable/custom-sqr-sample/test/example
$ sed 's#-#/#g' <<< $x
tmp/variable/custom/sqr/sample/test/example
In the command you tried above, all you need is to escape the slash, i.e.
echo $exa | sed 's/-/\//g'
but choosing a different delimiter is nicer.
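The reason s/-///g fails is that sed reads s/-// as a complete substitution (replace - with nothing) and then treats the remaining /g as flags it does not recognize. The sketch below runs the three working spellings side by side; the variable names are arbitrary.

```shell
exa=tmp/variable/custom-sqr-sample/test/example

# Three equivalent ways to write the same substitution.
with_hash=$(printf '%s\n' "$exa" | sed 's#-#/#g')      # # as the delimiter
with_pipe=$(printf '%s\n' "$exa" | sed 's|-|/|g')      # | as the delimiter
with_escape=$(printf '%s\n' "$exa" | sed 's/-/\//g')   # / as the delimiter, slash escaped

printf '%s\n' "$with_hash"
```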
The tr tool may be a better choice than sed in this case:
x=tmp/variable/custom-sqr-sample/test/example
echo "$x" | tr -- - /
(The -- isn't strictly necessary, but keeps tr (and humans) from mistaking - for an option.)
In bash, you can use parameter substitution:
$ exa=tmp/variable/custom-sqr-sample/test/example
$ exa=${exa//-/\/}
$ echo $exa
tmp/variable/custom/sqr/sample/test/example
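For comparison, here are the two sed-free approaches next to each other. This is a sketch assuming bash, since the ${var//pattern/string} form is not available in a plain POSIX sh; the variable names are arbitrary.

```shell
#!/usr/bin/env bash
x=tmp/variable/custom-sqr-sample/test/example

# tr translates single characters: every - becomes /.
via_tr=$(printf '%s\n' "$x" | tr -- - /)

# Bash expansion: the doubled // after the name means "replace all occurrences".
via_expansion=${x//-/\/}

printf '%s\n%s\n' "$via_tr" "$via_expansion"
```

The expansion runs inside the shell itself, so it avoids starting an external process, which can matter inside a loop.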

Linux command line: split a string

I have long file with the following list:
/drivers/isdn/hardware/eicon/message.c//add_b1()
/drivers/media/video/saa7134/saa7134-dvb.c//dvb_init()
/sound/pci/ac97/ac97_codec.c//snd_ac97_mixer_build()
/drivers/s390/char/tape_34xx.c//tape_34xx_unit_check()
(PROBLEM)/drivers/video/sis/init301.c//SiS_GetCRT2Data301()
/drivers/scsi/sg.c//sg_ioctl()
/fs/ntfs/file.c//ntfs_prepare_pages_for_non_resident_write()
/drivers/net/tg3.c//tg3_reset_hw()
/arch/cris/arch-v32/drivers/cryptocop.c//cryptocop_setup_dma_list()
/drivers/media/video/pvrusb2/pvrusb2-v4l2.c//pvr2_v4l2_do_ioctl()
/drivers/video/aty/atyfb_base.c//aty_init()
/block/compat_ioctl.c//compat_blkdev_driver_ioctl()
....
It contains all the functions in the kernel code. The notation is file//function.
I want to copy some 100 files from the kernel directory to another directory, so I want to strip the function name from every line, leaving just the filename.
It's super-easy in python, any idea how to write a 1-liner in the bash prompt that does the trick?
Thanks,
Udi
cat "func_list" | sed "s#//.*##" > "file_list"
Didn't run it :)
You can use pure Bash:
while read -r line; do echo "${line%//*}"; done < funclist.txt
Edit:
The syntax of the echo command is doing the same thing as the sed command in Eugene's answer: deleting the "//" and everything that comes after.
Broken down:
"echo ${line}" is the same as "echo $line"
the "%" deletes the pattern that follows it if it matches the trailing portion of the parameter
"%" makes the shortest possible match, "%%" makes the longest possible
"//*" is the pattern to match, "*" is similar to sed's ".*"
See the Parameter Expansion section of the Bash man page for more information, including:
using ${parameter#word} for matching the beginning of a parameter
${parameter/pattern/string} to do sed-style replacements
${parameter:offset:length} to retrieve substrings
etc.
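The expansions listed above can be tried on a single line from the question. This is a sketch assuming bash; the variable names on the left are made up for illustration.

```shell
#!/usr/bin/env bash
line='/drivers/scsi/sg.c//sg_ioctl()'

file=${line%//*}          # shortest suffix matching //* removed: the file name
nothing=${line%%/*}       # longest suffix matching /* removed: starts at the first /, so empty
func=${line##*//}         # longest prefix matching *// removed: the function name
joined=${line/\/\//:}     # first // replaced by a colon
word=${line:1:7}          # 7 characters starting at offset 1

printf '%s\n' "$file" "$func" "$joined" "$word"
```

The %% line shows why the single % is the right choice here: the longest match would swallow the whole string.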
here's a one liner in (g)awk
awk -F"//" '{print $1}' file
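Since a source file usually contains more than one function, the resulting list will probably have duplicates. A possible follow-up, sketched with three sample lines (the second entry is invented to force a duplicate), pipes the awk output through sort -u so each file appears once before copying.

```shell
# Sample input in the file//function notation; two entries share a file.
files=$(printf '%s\n' \
  '/drivers/scsi/sg.c//sg_ioctl()' \
  '/drivers/scsi/sg.c//sg_open()' \
  '/drivers/net/tg3.c//tg3_reset_hw()' |
  awk -F'//' '{print $1}' |
  sort -u)

printf '%s\n' "$files"
```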
Here's one using cut and rev
cat file | rev | cut -d'/' -f3- | rev
