Script to extract strings between two strings in Linux

I am trying to write a little script that will let me "org-capture" articles from my rss-reader (newsboat). So my scenario is this: I will pipe the article to a script; however, the article gets piped in one line, like this:
Title: ABC boss quits over Australian political interference claims Author: Date: Thu, 27 Sep 2018 09:39:16 +0200 Link: https://www.bbc.co.uk/news/world-australia-45661871 The broadcaster's chair quits amid allegations the government leaned on him to dismiss two journalists.
So what I need to do is to consistently store the link and the title in a variable and then call a command with these variables (emacsclient org-protocol:/ ...)
So basically I need this:
TITLE="ABC boss quits over Australian political interference claims"
URL="https://www.bbc.co.uk/news/world-australia-45661871"
I considered using awk or sed, but they work best for separate lines. So, I thought maybe split the single line at 'Title:', 'Author:', 'Date:' and 'Link:' and then extract with awk/sed.
I found similar use cases and questions here, but not quite the same. I want a pretty minimal script without necessarily using python.
Am I on the right track?
Thanks for helping out.

With GNU awk for the 3rd arg to match():
$ cat tst.awk
match($0,/^Title:\s*(.*)\s+Author:\s*(.*)\s+Date:\s*(.*)\s+Link:\s*(\S+)\s+(.*)/,a) {
printf "TITLE=\"%s\"\n", a[1]
printf "URL=\"%s\"\n", a[4]
}
$ awk -f tst.awk file
TITLE="ABC boss quits over Australian political interference claims"
URL="https://www.bbc.co.uk/news/world-australia-45661871"
I showed how to save all the other fields too so you can also do anything else you need to with your input.

This might work for you (GNU sed):
sed -r 's/^Title: (.*) Author:.* Link: (\S+).*/TITLE="\1"\nURL="\2"/' file
Use pattern matching to extract the fields required. The first may contain spaces so match on the key Author:. The second is a string of non-space characters following the key Link:.
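Both answers above print the assignments rather than setting shell variables. If the goal is to have the values in variables inside the script, plain POSIX parameter expansion may be enough. This is only a sketch: it assumes the input always has the Title:/Author:/Date:/Link: layout shown in the question, and that the title itself never contains " Author:". The emacsclient call is left out because the question does not give it in full.

```shell
#!/bin/sh
# Sample input copied from the question; in real use the text would arrive
# on stdin, e.g. article=$(cat)
article="Title: ABC boss quits over Australian political interference claims Author: Date: Thu, 27 Sep 2018 09:39:16 +0200 Link: https://www.bbc.co.uk/news/world-australia-45661871 The broadcaster's chair quits amid allegations the government leaned on him to dismiss two journalists."

# TITLE: drop the leading "Title: ", then everything from " Author:" onwards.
TITLE=${article#Title: }
TITLE=${TITLE%% Author:*}

# URL: drop everything up to and including " Link: ", then keep the first word.
URL=${article#* Link: }
URL=${URL%% *}

printf 'TITLE="%s"\nURL="%s"\n' "$TITLE" "$URL"
```

No external process is started, so this also works in a minimal /bin/sh.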

Related

Embedding quotation marks in command string generated by AWK?

I need to match all instances of strings in one file, with a master list in another. However, if my string is abc I want only that, not abcdef, abc1234 and so on.
So, a word boundary for the regex? Right now, I'm using a simple awk one liner:
cat results_file| sort -k 1| awk -F" " '{ print $1" /home/owner/file_2_search"}'|
xargs -L 1 /bin/grep -i
However, to force a word boundary, I'd need to grep string\b and the quotes (single or double) seem to be required.
In awk, \b is a special character, you need \\b ... And the quoted quotes ... (arg) ... Or am I missing something and overdoing this?
This is a Linux box, so presumably gawk. I have gone over quoting rules for awk, and realize this has got to be simple (and not complex ... but), but am not seeing it.
Had meant to post as an answer, not a comment. Will try to pose a more readable question, but confess to having second thoughts about doing this as a one-liner in the first place -- may be best to follow an alternate method. Appreciate the willingness to help.
--Joe
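One possible way around the quoting problem is to skip awk and xargs and let grep read all the search strings from a file, using -w to require whole-word matches. This is a sketch under assumptions: the file contents below are made up, the search strings are taken to be the first space-separated column, and -w (supported by GNU and BSD grep) is assumed to be available.

```shell
tmp=$(mktemp -d)

# Hypothetical sample data standing in for the two files in the question.
printf 'abc 1\nxyz 2\n' > "$tmp/results_file"
printf 'abc\nabcdef\nabc1234\nfoo XYZ bar\n' > "$tmp/file_2_search"

# First column of the results file, one search string per line.
cut -d' ' -f1 "$tmp/results_file" | sort -u > "$tmp/patterns"

# -F: fixed strings, -w: whole words only, -i: ignore case,
# -f: read the patterns from a file.
matches=$(grep -F -w -i -f "$tmp/patterns" "$tmp/file_2_search")
printf '%s\n' "$matches"

rm -rf "$tmp"
```

With this sample data only "abc" and "foo XYZ bar" are printed; "abcdef" and "abc1234" are skipped because the match is not a whole word.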

Delete Repeated Characters without back-referencing with SED

Let's say we have a file that contains
HHEELLOO
HHYYPPOOTTHHEESSIISS
and we want to delete repeated characters. To my knowledge we can do this with
s/\([A-Z]\)\1/\1/g
This is a homework problem and the professor said he wants us to try the exercises without back-referencing or extended regular expressions. Is that possible on this one? I would appreciate it if anyone could point me in the right direction, thanks!
The only reasonable way to do this is to use the right tool for the job, in this case tr:
$ tr -s 'A-Z' < file
HELO
HYPOTHESIS
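If you were going to use sed for that specific problem, though, and every character in the input appears exactly twice, then it'd just be the following, which keeps the first character of each pair and drops the second: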
$ sed 's/\(.\)./\1/g' file
HELO
HYPOTHESIS
If that's not what you're looking for then edit your question to show more truly representative sample input and expected output.
Here's one way:
s/AA/A/g
s/BB/B/g
...
s/ZZ/Z/g
As a one-liner:
sed 's/AA/A/g; s/BB/B/g; ...'
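Typing 26 substitutions by hand is tedious, so the script can be generated in a loop instead. A minimal sketch, assuming upper-case ASCII input as in the question. Note that each letter gets a single pass, so a run of three or more identical letters is only shortened, not reduced to one.

```shell
# Build "s/AA/A/g;s/BB/B/g;...;s/ZZ/Z/g" without back-references.
script=
for c in A B C D E F G H I J K L M N O P Q R S T U V W X Y Z; do
    script="${script:+$script;}s/$c$c/$c/g"
done

# Apply the generated script to the sample input from the question.
result=$(printf 'HHEELLOO\nHHYYPPOOTTHHEESSIISS\n' | sed "$script")
printf '%s\n' "$result"
```

For the two sample lines this prints HELO and HYPOTHESIS.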

Search a string in a file, then print the lines that start with that string

I have an assignment and I have no idea when it comes to managing files, reading and writing. Here's my main problem:
I have a script that manages an address book. At the moment the menu is finished and functions are being used, but I don't know how to search or write a file.
The first "option" gives the user the option (duh!) to search the address book by the contact name. The pattern I want to use is something along the lines of "name:address:email:phone", letting the user put spaces in the name and address but not in the email or phone, and only numbers in the last one. I believe I could achieve this with regular expressions, which I understand a bit from Java lessons.
How can I do this, then? I know grep may be useful, but I don't know the parameters even after reading the man pages. Parsing line by line could be done with for line in $(file) but I'm still not sure.
If you're allowed to use grep, then you can probably use awk too, and that's what I would prefer for most parts of your assignment.
Looking up a contact by name:
awk -v name="Anton Kovalenko" -F: '$1==name' "$file"
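To make that concrete, here is a small sketch against a throwaway address book. The records, the temporary file and the validation regex are my own assumptions based on the "name:address:email:phone" layout described in the question, not part of the assignment.

```shell
book=$(mktemp)

# Hypothetical records in the name:address:email:phone layout.
cat > "$book" <<'EOF'
Anton Kovalenko:12 Main St:anton@example.com:5551234
Ann Lee:3 Oak Ave:ann@example.com:5559876
Bad Entry:nowhere:not an email:12ab
EOF

# Records that fit the rules: four fields, no spaces in the email,
# digits only in the phone.
valid=$(grep -E '^[^:]+:[^:]+:[^: ]+:[0-9]+$' "$book")

# Exact lookup by name, as in the awk command above.
found=$(awk -v name="Anton Kovalenko" -F: '$1 == name' "$book")

# Every contact whose line starts with a given prefix.
prefix_hits=$(grep '^An' "$book")

printf '%s\n' "$found"
rm -f "$book"
```

The awk lookup compares the whole first field, so "Anton" alone would not match, while the grep prefix search matches any line that starts with the given text.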
Here's one way to do it:
grep "^something" "$file" | while IFS= read -r line
do
# do whatever you want with "$line" here
printf '%s\n' "$line"
done

What are the differences among grep, awk & sed? [duplicate]

I am confused about the differences between grep, awk and sed in terms of their role in Unix/Linux system administration and text processing.
Short definition:
grep: search for specific terms in a file
#usage
$ grep This file.txt
Every line containing "This"
Every line containing "This"
Every line containing "This"
Every line containing "This"
$ cat file.txt
Every line containing "This"
Every line containing "This"
Every line containing "That"
Every line containing "This"
Every line containing "This"
Now awk and sed are completely different from grep.
awk and sed are text processors. Not only can they find what you are looking for in text, they can also remove, add and modify the text (and much more).
awk is mostly used for data extraction and reporting. sed is a stream editor.
Each one of them has its own functionality and specialties.
Example
Sed
$ sed -i 's/cat/dog/' file.txt
# this replaces the first occurrence of 'cat' on each line with 'dog', editing file.txt in place (-i); use s/cat/dog/g to replace every occurrence
Awk
$ awk '{print $2}' file.txt
# this will print the second column of file.txt
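Worth noting: without the g flag, sed's s command only touches the first match on each line. A quick sketch on a made-up input line shows the difference, together with the awk column selection. It reads from a pipe rather than using -i, so no file is modified.

```shell
line='cat cat bird'   # hypothetical sample input

# Without g: only the first "cat" on the line is replaced.
first=$(printf '%s\n' "$line" | sed 's/cat/dog/')

# With g: every "cat" on the line is replaced.
all=$(printf '%s\n' "$line" | sed 's/cat/dog/g')

# awk splits on whitespace by default; $2 is the second column.
second=$(printf '%s\n' "$line" | awk '{print $2}')

printf '%s\n%s\n%s\n' "$first" "$all" "$second"
```

This prints "dog cat bird", then "dog dog bird", then "cat".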
Basic awk usage:
Compute the sum, average, max, min, or whatever else you need.
$ cat file.txt
A 10
B 20
C 60
$ awk 'BEGIN {sum=0; count=0; OFS="\t"} {sum+=$2; count++} END {print "Average:", sum/count}' file.txt
Average: 30
I recommend that you read this book: Sed & Awk: 2nd Ed.
It will help you become a proficient sed/awk user on any unix-like environment.
Grep is useful if you want to quickly search for lines that match in a file. It can also return some other simple information like matching line numbers, match count, and file name lists.
Awk is an entire programming language built around reading delimited, record-oriented text files (such as simple CSV-style data), processing the records, and optionally printing out a result data set. It can do many things but it is not the easiest tool to use for simple tasks.
Sed is useful when you want to make changes to a file based on regular expressions. It allows you to easily match parts of lines, make modifications, and print out results. It's less expressive than awk but that lends it to somewhat easier use for simple tasks. It has many more complicated operators you can use (I think it's even Turing complete), but in general you won't use those features.
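The "other simple information" grep can return corresponds to a few standard options. A short sketch, with file names and contents made up for illustration:

```shell
tmp=$(mktemp -d)
printf 'This one\nThat one\nThis two\n' > "$tmp/file.txt"
printf 'nothing here\n' > "$tmp/other.txt"

# -n: prefix each matching line with its line number.
numbered=$(grep -n This "$tmp/file.txt")

# -c: print only the number of matching lines.
count=$(grep -c This "$tmp/file.txt")

# -l: print only the names of files that contain a match.
files=$(grep -l This "$tmp/file.txt" "$tmp/other.txt")

printf '%s\n%s\n%s\n' "$numbered" "$count" "$files"
rm -rf "$tmp"
```

Here -n reports lines 1 and 3, -c reports 2, and -l lists only file.txt.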
I just want to mention one thing: there are many tools that can do text processing, e.g.
sort, cut, split, join, paste, comm, uniq, column, rev, tac, tr, nl, pr, head, tail...
They are very handy, but you have to learn their options.
A lazy way (not the best way) to learn text processing might be to learn only grep, sed and awk. With these three tools you can solve almost 99% of text processing problems without memorizing all the commands above and their options. :)
And once you have learned and used the three, you will know the difference. The difference here really means which tool is good at solving which kind of problem.
An even lazier way might be to learn a scripting language (Python, Perl or Ruby) and do all your text processing with it.
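For a feel of what the smaller tools listed above look like in practice, here is a sketch that uses a few of them on made-up data. Each does one narrow job and they are normally chained with pipes.

```shell
# Hypothetical two-column sample data.
data=$(printf 'pear 3\napple 1\npear 3\nfig 2\n')

# cut picks a column; sort -u sorts and drops duplicate lines.
names=$(printf '%s\n' "$data" | cut -d' ' -f1 | sort -u)

# head keeps the first lines; tr translates characters (lower to upper case).
upper=$(printf '%s\n' "$data" | head -n 1 | tr 'a-z' 'A-Z')

# tail keeps the last lines of its input.
last=$(printf '%s\n' "$data" | tail -n 1)

printf '%s\n%s\n%s\n' "$names" "$upper" "$last"
```

The first pipeline yields apple, fig and pear, the second "PEAR 3", and the third "fig 2".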

How do I grep for entire, possibly wrapped, lines of code?

When searching code for strings, I constantly run into the problem that I get meaningless, context-less results. For example, if a function call is split across 3 lines, and I search for the name of a parameter, I get the parameter on a line by itself and not the name of the function.
For example, in a file containing
...
someFunctionCall ("test",
MY_CONSTANT,
(some *really) - long / expression);
grepping for MY_CONSTANT would return a line that looked like this:
MY_CONSTANT,
Likewise, in a comment block:
/////////////////////////////////////////
// FIXMESOON, do..while is the wrong choice here, because
// it makes the wrong thing happen
/////////////////////////////////////////
Grepping for FIXMESOON gives the very frustrating answer:
// FIXMESOON, do..while is the wrong choice here, because
When there are thousands of hits, single line results are a little meaningless. What I would like to do is have grep be aware of the start and stop points of source code lines, something as simple as having it consider ";" as the line separator would be a good start.
Bonus points if you can make it return the entire comment block if the hit is inside a comment.
I know you can't do this with grep alone. I am also aware of the option to have grep return a certain number of lines of context. Any suggestions on how to accomplish this under Linux? FYI my preferred languages are C and Perl.
I'm sure I could write something, but I know that somebody must have already done this.
Thanks!
You can use pcregrep with the -M option (multiline matching; pcregrep is grep with Perl-compatible regular expressions). Something like:
pcregrep -M ";*\R*.*thingtosearchfor*\R*.*;.*"
Here's an example using awk.
$ cat file
blah1
blah2
function1 ("test",
MY_CONSTANT,
(some *really) - long / expression);
function2( one , two )
blah3
blah4
$ awk -vRS=")" '/function1/{gsub(".*function1","function1");print $0RT}' file
function1 ("test",
MY_CONSTANT,
(some *really)
The concept behind it: RS is awk's record separator. Setting it to ")" means every record in your file ends at a ")" instead of at a newline. This makes it easy to find "function1", since each record can then be matched as a whole. RT, which holds the text that matched RS, is specific to GNU awk. If you don't use awk, the same concept can be applied by splitting the input on ")".
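The same idea fits the ";" suggestion from the question: treat every ";" as the end of a record and print the whole record that contains the search term. This is a rough sketch on made-up input. It will mis-split on semicolons inside strings, comments or for(;;) headers, so it is a search aid and not a parser.

```shell
src=$(mktemp)

# Hypothetical source file containing the call from the question.
cat > "$src" <<'EOF'
x = 1;
someFunctionCall ("test",
MY_CONSTANT,
(some *really) - long / expression);
y = 2;
EOF

# RS=';' makes each statement one record. Strip the newline left over from
# the previous statement, then print the record with its ";" restored.
stmt=$(awk -v RS=';' '/MY_CONSTANT/ { sub(/^\n+/, ""); print $0 ";" }' "$src")

printf '%s\n' "$stmt"
rm -f "$src"
```

This prints the full three-line someFunctionCall statement instead of the bare MY_CONSTANT line. It uses only a single-character RS, so it does not depend on GNU awk.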
You could run grep with the options that print the file name and line number, pipe those results through xargs into awk to parse the columns, and then use a small script of your own to display the N lines surrounding each hit. :)
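A stripped-down version of that pipeline might look like the sketch below, which handles a single file and only the first hit. The variable names and the sample file are assumptions made for illustration, and sed stands in for the awk step. grep's own context option does much the same, so this mainly spells the idea out.

```shell
src=$(mktemp)

# Hypothetical source file.
cat > "$src" <<'EOF'
int a;
someFunctionCall ("test",
MY_CONSTANT,
(some *really) - long / expression);
int b;
EOF

n=1   # lines of context wanted on each side of the hit

# grep -n prefixes each hit with "LINE:"; keep the number of the first hit.
hit=$(grep -n 'MY_CONSTANT' "$src" | head -n 1 | cut -d: -f1)

# Clamp the start of the window to the first line of the file.
start=$((hit - n))
[ "$start" -lt 1 ] && start=1
end=$((hit + n))

# Print only the lines inside the window.
context=$(sed -n "${start},${end}p" "$src")

printf '%s\n' "$context"
rm -f "$src"
```

With n=1 the window covers lines 2 to 4, which happens to be the whole function call in this sample.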
If this isn't an academic endeavour you could just use cscope (for C code only though). If you are willing to drop the requirement to search in comments ctags should be enough (and it also supports Perl).
I had a situation in which I had an XML file full of the names of zip files in an XML-style format, that is, with angle brackets around the names of the files, say example.zip<\stuff>
I used awk to change all the angle brackets into newlines and then used grep :)
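That trick can be sketched roughly as follows. The tag name and file names are invented, since the original file is not shown.

```shell
# Hypothetical one-line XML with file names between tags.
xml='<file>example.zip</file><file>notes.txt</file><file>other.zip</file>'

# Turn every "<" and ">" into a newline so each tag and each file name
# lands on its own line, then keep only the lines ending in ".zip".
zips=$(printf '%s\n' "$xml" | awk '{ gsub(/[<>]/, "\n"); print }' | grep '\.zip$')

printf '%s\n' "$zips"
```

This leaves example.zip and other.zip, one per line.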
