Remove Words Shorter Than 4 Characters Using Linux

I have read the following and tried to rework the command logic for what I want, but I just haven't been able to get it right.
Delete the word whose length is less than 2 in bash
Tried: echo $example | sed -e 's/ [a-zA-Z0-9]\{4\} / /g'
Remove all words bigger than 6 characters using sed
Tried: sed -e s'/[A-Za-z]\{,4\}//g'
Please help me with a simple awk or sed command for the following:
Here is an example line of fantastic data
And get:
Here example line fantastic data

$ echo Here is an example line of fantastic data | sed -E 's/\b\w{1,3}\b\s*//g'
Here example line fantastic data

If you store the sentence in a variable, you can iterate through it in a for loop. Then you can check whether each word is longer than 3 characters (i.e. at least 4) before printing it.
sentence="Here is an example line of fantastic data";
for word in $sentence; do
if [ ${#word} -gt 2]; then
echo -n $word;
echo -n " ";
fi
done
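Running this on the example sentence prints Here example line fantastic data (with a trailing space and no final newline; a bare echo after the loop adds one if you need it).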

This is a Bash example of how to do it if you have a lot of sentences in a file, which would be the most common case, right?
SCRIPT (Remove words with two letters or shorter)
#!/bin/bash
while IFS= read -r line
do
    echo "$line" | sed -E 's/\b\w{1,2}\b//g'
done < sentences.txt
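A side note: the while loop isn't strictly necessary, since sed reads files itself. A single invocation produces the same output:
$ sed -E 's/\b\w{1,2}\b//g' sentences.txt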
INPUT
$ cat sentences.txt
Edgar Allan Poe (January 19, 1809 to October 7, 1849) was an
American writer, poet, critic and editor best known for evocative
short stories and poems that captured the imagination and interest
of readers around the world. His imaginative storytelling and tales
of mystery and horror gave birth to the modern detective story.
Many of Poe’s works, including “The Tell-Tale Heart” and
“The Fall of the House of Usher,” became literary classics. Some
aspects of Poe’s life, like his literature, is shrouded in mystery,
and the lines between fact and fiction have been blurred substantially
since his death.
OUTPUT
$ ./grep_tests.sh
Edgar Allan Poe (January , 1809 October , 1849) was
American writer, poet, critic and editor best known for evocative
short stories and poems that captured the imagination and interest
readers around the world. His imaginative storytelling and tales
mystery and horror gave birth the modern detective story.
Many Poe’ works, including “The Tell-Tale Heart” and
“The Fall the House Usher,” became literary classics. Some
aspects Poe’ life, like his literature, shrouded mystery,
and the lines between fact and fiction have been blurred substantially
since his death.

Related

Grep for string and read content until next match string

I am trying to read a file and search for a string using grep. Once I find the string, I want to read everything after the string until I match another string. So in my example, I am searching for ...SUMMARY... and I want to read everything until the occurrence of ... Here is an example:
...SUMMARY...
Severe thunderstorms are most likely across north-central/northeast
Texas and the Ark-La-Tex region during the late afternoon and
evening. Destructive hail and wind, along with a few tornadoes are
possible. Severe thunderstorms are also expected across the
Mid-South and Ohio Valley.
...North-central/northeast TX and southeast OK/ArkLaTex...
In the wake of a decaying MCS across the Lower Mississippi River
Valley, a northwestward-extending outflow boundary will continue to
modify/drift northward with rapid/strong destabilization this
afternoon particularly along and south of it. A quick
reestablishment of lower/some middle 70s F surface dewpoints will
occur into prior-MCS-impacted areas, with MLCAPE in excess of 4000
J/kg expected for parts of north-central/northeast Texas into far
southeast Oklahoma and the nearby ArkLaTex. Special 19Z observed
soundings are expected from Fort Worth/Shreveport to help better
gauge/confirm this destabilization trend and the degree of capping.
I have tried using the following code, but it only displays the ...SUMMARY... line and the next line.
sed -n '/...SUMMARY.../,/.../p'
What can I do to solve this?
=======================================================================
Followup:
This is the result I am trying to get. Only show the paragraph under ...SUMMARY... and end at the next ... so this is what I should get in the end:
Severe thunderstorms are most likely across north-central/northeast
Texas and the Ark-La-Tex region during the late afternoon and
evening. Destructive hail and wind, along with a few tornadoes are
possible. Severe thunderstorms are also expected across the
Mid-South and Ohio Valley.
I have tried the following based on a recommendation from Shellter:
sed -n '/...SUMMARY.../,/**...**/p'
But I get everything.
You may use
sed -n '/^[[:blank:]]*\.\.\.SUMMARY\.\.\./,/^[[:blank:]]*\.\.\./{//!p;}' file
NOTES:
Escape the literal dots: the escapes are necessary because an unescaped /.../ matches any line containing any three characters
Literal asterisks (as in the /**...**/ attempt) would also need escaping if they were really part of the pattern
^ matches the start of a line and [[:blank:]]* matches zero or more horizontal whitespace characters
{//!p;} gets you the contents between two lines excluding those lines (see How to print lines between two patterns, inclusive or exclusive (in sed, AWK or Perl)?)
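If awk reads more naturally to you, here is a flag-based sketch using the same start/end markers (an alternative to the sed command above, not part of it):
awk '
    /^[[:blank:]]*\.\.\.SUMMARY\.\.\./ { found = 1; next }  # start marker: raise the flag, skip this line
    found && /^[[:blank:]]*\.\.\./     { exit }             # next ... marker: stop
    found                                                   # bare condition: print while the flag is set
' file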

Script to extract strings between two strings in linux

I am trying to write a little script that will let me "org-capture" articles from my rss-reader (newsboat). So my scenario is this: I will pipe the article to a script; however, the article gets piped in one line, like this:
Title: ABC boss quits over Australian political interference claims Author: Date: Thu, 27 Sep 2018 09:39:16 +0200 Link: https://www.bbc.co.uk/news/world-australia-45661871 The broadcaster's chair quits amid allegations the government leaned on him to dismiss two journalists.
So what I need to do is to consistently store the link and the title in variables and then call a command with these variables (emacsclient org-protocol:/ ...)
So basically I need this:
TITLE="ABC boss quits over Australian political interference claims"
URL="https://www.bbc.co.uk/news/world-australia-45661871"
I considered using awk or sed, but they work best for separate lines. So, I thought maybe split the single line at 'Title:', 'Author:', 'Date:' and 'Link:' and then extract with awk/sed.
I found similar use cases and questions here, but not quite the same. I want a pretty minimal script without necessarily using python.
Am I on the right track?
Thanks for helping out.
With GNU awk for the 3rd arg to match():
$ cat tst.awk
match($0,/^Title:\s*(.*)\s+Author:\s*(.*)\s+Date:\s*(.*)\s+Link:\s*(\S+)\s+(.*)/,a) {
    printf "TITLE=\"%s\"\n", a[1]
    printf "URL=\"%s\"\n", a[4]
}
$ awk -f tst.awk file
TITLE="ABC boss quits over Australian political interference claims"
URL="https://www.bbc.co.uk/news/world-australia-45661871"
I showed how to save all the other fields too so you can also do anything else you need to with your input.
This might work for you (GNU sed):
sed -r 's/^Title: (.*) Author:.* Link: (\S+).*/TITLE="\1"\nURL="\2"/' file
Use pattern matching to extract the fields required. The first may contain spaces so match on the key Author:. The second is a string of non-space characters following the key Link:.
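If you want the values in shell variables rather than printed as assignments, a minimal sketch (assuming the piped article was saved to article.txt, a hypothetical file name):
TITLE=$(sed -r 's/^Title: (.*) Author:.*/\1/' article.txt)   # greedy match stops at the Author: key
URL=$(sed -r 's/.* Link: (\S+).*/\1/' article.txt)           # non-space run after the Link: key
printf '%s\n%s\n' "$TITLE" "$URL"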

How to reverse each word in a text file with linux commands without changing order of words

There are lots of questions about how to reverse each word in a sentence, and I could readily do this in Python or JavaScript, for example, but how can I do it with Linux commands? It looks like tac might be an option, but it seems like that would reverse lines as well as words, rather than just words. What other tools can do this? I literally have no idea. I know rev and tac and awk all seem like contenders...
So I'd like to go from:
cat dog sleep
pillow green blue
to:
tac god peels
wollip neerg eulb
Slight follow-up: from this reference it looks like I could use awk to break each field up into an array of single characters and then write a for loop to reverse each word manually. This is quite awkward. Surely there's a better/more succinct way to do this?
Try this on for size:
sed -e 's/\s\+/ /g' -e 's/ /\n/g' < file.txt | rev | tr '\n' ' ' ; echo
It collapses all the whitespace (so the original line breaks become single spaces and everything ends up on one line) and counts punctuation as part of "words", but it looks like it (at least mostly) works. Hooray for sh!
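If you want to keep the line structure intact, here is a plain awk sketch (an alternative, not the pipeline above) that reverses each field in place:
awk '{
    for (i = 1; i <= NF; i++) {        # walk the fields of the current line
        w = ""
        for (j = length($i); j >= 1; j--)
            w = w substr($i, j, 1)     # rebuild the field back to front
        $i = w                         # put the reversed word back
    }
    print
}' file.txt
This prints tac god peels and wollip neerg eulb on separate lines, as in the desired output (though runs of spaces between words collapse to one).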

Grep (a.txt: English word list, b.txt: one string per line) Q: is each string from b.txt built only from those words or not?

I have a list of English words (one per line, around 100,000) in a.txt, and b.txt contains strings (around 50,000 lines, one string per line; a string can be a pure word, word+something, or garbage). I would like to know which strings from b.txt consist of English words only (without any additional chars).
Can I do this with grep?
Example:
a.txt:
apple
pie
b.txt:
applepie
applebs
bspie
bsabcbs
Output:
c.txt:
applepie
Since your question is underspecified, maybe this answer can help as a shot in the dark to clarify your question:
c='cat b.txt'
while IFS='' read -r line
do
    c="$c | grep '$line'"
done < a.txt
eval "$c" > c.txt
But this would also match a line like this is my apple on a pie. I don't know if that's what you want.
This is another try:
re=''
while IFS='' read -r line
do
    re="$re${re:+|}$line"
done < a.txt
grep -E "^($re)*$" b.txt > c.txt
This will let pass only the lines which have nothing but a concatenation of these words. But it will also let pass things like 'appleapplepieapplepiepieapple'. Again, I don't know if this is what you want.
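For the example files this amounts to grep -E '^(apple|pie)*$' b.txt, which accepts applepie and rejects applebs, bspie and bsabcbs, exactly the c.txt asked for.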
Given your latest explanation in the question I would propose another approach (because building such a regex out of 100,000+ words is not going to work).
A working approach for this amount of words could be to remove all recognized words from the text and see which lines get emptied in the process. This can easily be done iteratively without exploding the memory usage or other resources. It will take time, though.
cp b.txt inprogress.txt
while IFS='' read -r line
do
    sed -i "s/$line//g" inprogress.txt
done < a.txt
for lineNumber in $(grep -n '^$' inprogress.txt | sed 's/://')
do
    sed -n "${lineNumber}p" b.txt
done
rm inprogress.txt
But this still would not really solve your issue. Consider having the words to and potato in your list: if removing to happens first, it leaves pota behind in your text file, and pota is not a word, so it would never be removed.
You could address that by sorting your word file by word length (longest words first), but that would still be problematic for some compound words, e.g. redart (red + art): dart would be removed first, leaving re, and if re is not in your word list, you would not recognize the string as a compound of words.
Actually, your problem is one of logical programming and natural language processing and probably does not fit SO. You should have a look at the language Prolog, which is designed around problems like yours.
I will post this as an answer as well since I feel this is the correct answer to your specific question.
Your requirement is to find non-English words in a file (b.txt) based on a word list (a.txt) which contains a list of English words. Based on the example in your question, said word list does not contain compound words (e.g. applepie), but you would still like to match the file against compound words built from words in your word list (e.g. apple and pie).
There are two problems you are facing:
Not every permutation of words in a.txt will be a valid English compound word so just based on this your problem is already impossible to solve.
If you, nonetheless, were to attempt building a list of compound words yourself by compiling a list of all possible permutations you cannot easily do this because of the size of your wordlist (and resulting memory problems). You would most probably have to store your words in a more complex data structure, e.g. a tree, and build permutations on the fly by traversing the tree which is not doable in shell scripting.
Because of these points and your actual question being "can this be done with grep?" the answer is no, this is not possible.

What are the differences among grep, awk & sed? [duplicate]

This question already has answers here:
What are the differences between Perl, Python, AWK and sed? [closed]
(5 answers)
What is the difference between sed and awk? [closed]
(3 answers)
I am confused about the differences between grep, awk and sed in terms of their role in Unix/Linux system administration and text processing.
Short definition:
grep: search for specific terms in a file
#usage
$ cat file.txt
Every line containing "This"
Every line containing "This"
Every line containing "That"
Every line containing "This"
Every line containing "This"
$ grep This file.txt
Every line containing "This"
Every line containing "This"
Every line containing "This"
Every line containing "This"
Now awk and sed are completely different from grep.
awk and sed are text processors. Not only do they have the ability to find what you are looking for in text, they have the ability to remove, add and modify the text as well (and much more).
awk is mostly used for data extraction and reporting. sed is a stream editor.
Each one of them has its own functionality and specialties.
Example
Sed
$ sed -i 's/cat/dog/' file.txt
# this will replace the first occurrence of 'cat' on each line with 'dog' (add the g flag to replace every occurrence)
Awk
$ awk '{print $2}' file.txt
# this will print the second column of file.txt
Basic awk usage:
Compute sum/average/max/min/etc., whatever you may need.
$ cat file.txt
A 10
B 20
C 60
$ awk '{sum += $2; count++} END {print "Average:", sum/count}' file.txt
Average: 30
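The same shape covers max/min; for example, a sketch for the maximum of the second column:
$ awk 'NR == 1 || $2 > max { max = $2 } END { print "Max:", max }' file.txt
Max: 60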
I recommend that you read this book: Sed & Awk: 2nd Ed.
It will help you become a proficient sed/awk user on any unix-like environment.
Grep is useful if you want to quickly search for lines that match in a file. It can also return some other simple information like matching line numbers, match count, and file name lists.
Awk is an entire programming language built around reading CSV-style files, processing the records, and optionally printing out a result data set. It can do many things but it is not the easiest tool to use for simple tasks.
Sed is useful when you want to make changes to a file based on regular expressions. It allows you to easily match parts of lines, make modifications, and print out results. It's less expressive than awk, but that lends it to somewhat easier use for simple tasks. It has many more complicated operators you can use (I think it's even Turing complete), but in general you won't use those features.
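To make the division of labor concrete, here is the same family of tasks on a hypothetical log.txt (a sketch; the file name and contents are assumed):
$ grep 'ERROR' log.txt                              # grep: select the matching lines
$ sed 's/ERROR/WARNING/' log.txt                    # sed: edit the stream as it passes through
$ awk '/ERROR/ { n++ } END { print n+0 }' log.txt   # awk: compute something over the records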
I just want to mention one thing: there are many tools that can do text processing, e.g.
sort, cut, split, join, paste, comm, uniq, column, rev, tac, tr, nl, pr, head, tail, ...
They are very handy, but you have to learn their options etc.
A lazy way (not the best way) to learn text processing might be: only learn grep, sed and awk. With these three tools, you can solve almost 99% of text-processing problems and don't need to memorize all the different commands and options above. :)
And if you've learned and used the three, you'll know the difference. Actually, the difference here means which tool is good at solving what kind of problem.
An even lazier way might be learning a scripting language (Python, Perl or Ruby) and doing all text processing with it.
