Use awk or sed to eliminate repeat entries in text file - linux

I have a text file like this...
apples
berries
berries
cherries
and I want it to look like this...
apples
berries
cherries
That's it. I just want to eliminate doubled entries. I would prefer for this to be an awk or sed "one-liner" but if there's some other common bash tool that I have overlooked that would be fine.

sort -u file
Use this if you are not worried about the order of the output.
Remove duplicates by retaining the order:
awk '!a[$0]++' file

There is a special command for this task, called uniq:
$ uniq file
apples
berries
cherries
This requires duplicate lines to be adjacent; equal lines that are not adjacent are not removed.
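A quick sketch of how the three approaches above differ when a duplicate is not adjacent. The sample data and the variable names (input, sorted, ordered, adjacent) are made up for illustration, and the awk line keys on the whole line ($0):

```shell
# Sample input: "cherries" repeats at the end, "berries" repeats adjacently.
input=$(printf '%s\n' cherries apples berries berries cherries)

# sort -u: removes every duplicate, but the output comes back sorted.
sorted=$(printf '%s\n' "$input" | sort -u)

# awk: removes every duplicate and keeps the first-seen order.
ordered=$(printf '%s\n' "$input" | awk '!seen[$0]++')

# uniq: only collapses adjacent repeats, so the trailing "cherries" survives.
adjacent=$(printf '%s\n' "$input" | uniq)

printf 'sort -u:\n%s\n\nawk:\n%s\n\nuniq:\n%s\n' "$sorted" "$ordered" "$adjacent"
```

If the order does not matter and the file is large, sort -u (or sort | uniq) avoids holding every distinct line in awk's memory.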

Find two lines and replace with one

I am looking for a solution that would let me search text files on a linux server and find a pattern such as:
Text 123
Blue Green
And then replaces it with one line, every time it finds it in a file...
Order Blue Green
I am not sure what would be the easiest way to solve this. I have seen many guides using SED but only for finding one line and replacing it.
You ask about sed, here is an answer in sed.
Let me mention however, that while sed is fun for this kind of exercise, you probably should choose something else, more flexible and easier to learn; perl for example.
Look for the first line: /Text 123/
When found, start a loop: :a
Append the next line: N
Replace twins of the searched text with a single copy and print it:
s/Text 123\nText 123/Text 123/p;
Loop while that replacement succeeds: ta;
Then try the actual replacement: s///
Rely on the appended lines being printed unchanged if the replacement does not trigger.
Code:
sed "/Text 123/{:a;N;s/Text 123\nText 123/Text 123/p;ta;s/Text 123\nBlue Green/Order Blue Green/}"
Test input:
Text 123
Do not replace
Lala
Text 123
Blue Green
lulu
Text 123
Do not replace either
Text 123
Text 123
Blue Green
preceding should be replaced
Output:
Text 123
Do not replace
Lala
Order Blue Green
lulu
Text 123
Do not replace either
Text 123
Order Blue Green
preceding should be replaced
Platform: Windows and GNU sed version 4.2.1
Note:
On that platform the sed line can use environment variables for the two text fragments, which you probably want to do:
sed "/%EnvVar2%/{:a;N;s/%EnvVar2%\n%EnvVar2%/%EnvVar2%/p;ta;s/%EnvVar2%\n%EnvVar%/Order %EnvVar%/}"
Platform2:
still Windows
using bash GNU bash, version 3.1.17(1)-release (i686-pc-msys)
GNU sed version 4.2.1 (same)
On this platform, variables can e.g. be used like:
sed "/${EnvVar2}/{:a;N;s/${EnvVar2}\n${EnvVar2}/${EnvVar2}/p;ta;s/${EnvVar2}\n${EnvVar}/Order ${EnvVar}/}"
On this platform it is important to use "..." in order to be able to use variables;
it does not work with '...'.
As #edMorton has hinted, on all platforms be careful when the text you want to replace itself looks like a variable reference, e.g. "Text $123" in bash. In that case, where you are not using variables but replacing text that looks like one, using '...' instead of "..." is the way to go.
sed is for simple substitutions on individual lines, that is all. If you find yourself trying to use constructs other than s, g, and p (with -n) then you are on the wrong track as all other sed constructs became obsolete in the mid-1970s when awk was invented.
Your problem is not a substitution on individual lines but on a multi-line record. With GNU awk (needed for the multi-char RS), that is:
$ awk -v RS='^$' -v ORS= '{gsub(/Text 123\nBlue Green/,"Order Blue Green")}1' file
Order Blue Green
but there are several other approaches depending on your real needs.
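One of those other approaches, as a rough sketch: hold back a matching first line until the next line is known. This is not from the answers above; it assumes the two fragments must match whole lines exactly (no regex), and the variable names first, second and held are made up for the example. It should work in any POSIX awk, without reading the whole file into memory:

```shell
input=$(printf '%s\n' 'Text 123' 'Do not replace' 'Text 123' 'Blue Green' \
                      'lulu' 'Text 123' 'Text 123' 'Blue Green')

result=$(printf '%s\n' "$input" |
awk -v first='Text 123' -v second='Blue Green' '
  # A held first line followed by the second line: emit the merged line.
  held && $0 == second { print "Order " second; held = 0; next }
  # A held first line followed by anything else: release it unchanged.
  held                 { print first; held = 0 }
  # Hold back a first line until the next line has been seen.
  $0 == first          { held = 1; next }
                       { print }
  # Do not lose a first line that happens to be the last line of the input.
  END                  { if (held) print first }
')

printf '%s\n' "$result"
```

Passing the fragments with -v keeps them out of the awk program text, which sidesteps the shell quoting problems discussed for the sed version.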

How to reverse each word in a text file with linux commands without changing order of words

There's lots of questions indicating how to reverse each word in a sentence, and I could readily do this in Python or Javascript for example, but how can I do it with Linux commands? It looks like tac might be an option, but seems like this would likely reverse lines as well as words, rather than just words? What other tools can do this? I literally have no idea. I know rev and tac and awk all seem like contenders...
So I'd like to go from:
cat dog sleep
pillow green blue
to:
tac god peels
wollip neerg eulb
Slight follow-up:
From this reference it looks like I could use awk to break each field up into an array of single characters and then write a for loop to reverse manually each word in this way. This is quite awkward. Surely there's a better/more succinct way to do this?
Try this on for size:
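sed -E -e 's/\s+/ /g' -e 's/ /\n/g' < file.txt | rev | tr '\n' ' ' ; echo
It collapses all whitespace (including the line breaks, so the output ends up on a single line) and counts punctuation as part of "words", but it looks like it (at least mostly) works. Note the -E: without it, GNU sed treats the + in \s+ as a literal plus sign. Hooray for sh!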
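If the line breaks should survive, here is a sketch of a variant that is not in the answer above: rev reverses each whole line (which also reverses the word order), and awk then puts the words back in their original order. It assumes rev (from util-linux) is installed, and like the pipeline above it collapses runs of whitespace into single spaces. The variable names are made up for the example:

```shell
input=$(printf '%s\n' 'cat dog sleep' 'pillow green blue')

result=$(printf '%s\n' "$input" | rev | awk '{
  line = ""
  # Walk the fields from last to first to undo the word-order reversal.
  for (i = NF; i >= 1; i--) {
    sep = (i > 1 ? " " : "")
    line = line $i sep
  }
  print line
}')

printf '%s\n' "$result"
```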

How can I remove lines that contain more than N words

Is there a good one-liner in bash to remove lines containing more than N words from a file?
example input:
I want this, not that, but thank you it is very nice of you to offer.
The very long sentence finding form ordering system always and redundantly requires an initial, albeit annoying and sometimes nonsensical use of commas, completion of the form A-1 followed, after this has been processed by the finance department and is legal, by a positive approval that allows for the form B-1 to be completed after the affirmative response to the form A-1 is received.
example output:
I want this, not that, but thank you it is very nice of you to offer.
In Python I would code something like this:
if len(line.split()) < 40:
    print(line)
To only show lines containing fewer than 40 words, you can use awk:
awk 'NF < 40' file
Using the default field separator, each word is treated as a field. Lines with fewer than 40 fields are printed.
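A small sketch with the limit passed in as a variable. The question title asks to drop lines with more than N words, which strictly means keeping NF <= N, whereas the answer above mirrors the Python snippet's "< 40". The limit of 5 and the sample lines are made up for the example:

```shell
max=5   # hypothetical limit: lines with more than 5 words are dropped

input=$(printf '%s\n' 'one two three' \
                      'one two three four five six' \
                      'exactly five words right here')

# NF is the number of whitespace-separated fields, i.e. words, on the line.
kept=$(printf '%s\n' "$input" | awk -v max="$max" 'NF <= max')

printf '%s\n' "$kept"
```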
Note: this answer assumes the first reading of the question, i.e. how to print the lines that are shorter than a given number of characters.
Use awk with length():
awk 'length($0)<40' file
You can even give the length as a parameter:
awk -v maxsize=40 'length($0) < maxsize' file
A test with 10 characters:
$ cat a
hello
how are you
i am fine but
i would like
to do other
things
$ awk 'length($0)<10' a
hello
things
If you feel like using sed for this, you can say:
sed -rn '/^.{,39}$/p' file
This checks if the line contains less than 40 characters. If so, it prints it.

What are the differences among grep, awk & sed? [duplicate]

I am confused about the differences between grep, awk and sed in terms of their role in Unix/Linux system administration and text processing.
Short definition:
grep: search for specific terms in a file
#usage
$ grep This file.txt
Every line containing "This"
Every line containing "This"
Every line containing "This"
Every line containing "This"
$ cat file.txt
Every line containing "This"
Every line containing "This"
Every line containing "That"
Every line containing "This"
Every line containing "This"
Now awk and sed are completely different from grep.
awk and sed are text processors. Not only do they have the ability to find what you are looking for in text, they have the ability to remove, add and modify the text as well (and much more).
awk is mostly used for data extraction and reporting. sed is a stream editor.
Each one of them has its own functionality and specialties.
Example
Sed
$ sed -i 's/cat/dog/' file.txt
# this replaces the first occurrence of 'cat' on each line with 'dog', editing file.txt in place; use s/cat/dog/g to replace every occurrence
Awk
$ awk '{print $2}' file.txt
# this will print the second column of file.txt
Basic awk usage:
Compute sum/average/max/min/etc., whatever you may need.
$ cat file.txt
A 10
B 20
C 60
$ awk 'BEGIN {sum=0; count=0; OFS="\t"} {sum+=$2; count++} END {print "Average:", sum/count}' file.txt
Average: 30
I recommend that you read this book: Sed & Awk: 2nd Ed.
It will help you become a proficient sed/awk user on any unix-like environment.
Grep is useful if you want to quickly search for lines that match in a file. It can also return some other simple information like matching line numbers, match count, and file name lists.
Awk is an entire programming language built around reading CSV-style files, processing the records, and optionally printing out a result data set. It can do many things but it is not the easiest tool to use for simple tasks.
Sed is useful when you want to make changes to a file based on regular expressions. It allows you to easily match parts of lines, make modifications, and print out results. It's less expressive than awk, but that makes it somewhat easier to use for simple tasks. It has many more complicated operators you can use (I think it's even Turing complete), but in general you won't use those features.
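To make the division of labour concrete, here is a sketch that runs all three tools over the same made-up input: grep selects lines, sed rewrites them, and awk computes over fields. The data and variable names are invented for the example:

```shell
input=$(printf '%s\n' 'cat 3' 'dog 5' 'cat 7')

# grep: keep only the lines that match a pattern.
found=$(printf '%s\n' "$input" | grep 'cat')

# sed: rewrite matching text on each line and pass everything else through.
edited=$(printf '%s\n' "$input" | sed 's/cat/lion/')

# awk: treat each line as fields and compute something from them.
total=$(printf '%s\n' "$input" | awk '$1 == "cat" { sum += $2 } END { print sum }')

printf 'grep:\n%s\n\nsed:\n%s\n\nawk total for cat: %s\n' "$found" "$edited" "$total"
```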
I just want to mention one thing: there are many tools that can do text processing, e.g.
sort, cut, split, join, paste, comm, uniq, column, rev, tac, tr, nl, pr, head, tail.....
They are very handy, but you have to learn their options etc.
A lazy way (not the best way) to learn text processing might be: only learn grep, sed and awk. With these three tools, you can solve almost 99% of text processing problems and don't need to memorize all the different commands and options above. :)
And once you have learned and used the three, you will know the difference. Actually, the difference here means which tool is good at solving which kind of problem.
An even lazier way might be to learn a scripting language (python, perl or ruby) and do all your text processing with it.

What is the easiest way to join 12 columns?

I have 12 columns separated by a tab. How can I join them side-by-side?
[Added] You can also tell me other methods as AWK: the faster the better.
Since you asked specifically about awk (there are tools better suited to the job), the following is a first-cut solution:
awk '{print $1$2$3$4$5$6$7$8$9$10$11$12}'
A more complicated and configurable solution, where you could change the number of columns used for output, would be:
awk -v lim=12 '{for(x=1;x<=lim;x++){printf "%s",$x};print ""}'
Other possibilities, if you're not restricted to awk, are:
tr -d '\011' # to combine ALL columns on the line.
cut --output-delimiter='' -f1-12 # more general (1-12 or 3-7 or 1-6,9).
Based on your edit and comments, I suggest cut is the best tool for the job. Use "man cut", "info cut" or "cut --help" for more details (this depends on your platform).
If you are just using awk to concatenate the columns, I would use 'tr' instead and delete the tabs:
tr -d '\011' < file1 > file2
Try this:
{
print $1$2$3$4$5$6$7$8$9$(10)$(11)$(12)
}
I'm not an awk genius so I don't know if there's some sort of looping construct you can use.
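There is a looping construct, and for the "join everything" case awk does not even need one. A sketch, assuming tab-separated input; the sample row and the variable names are made up. Assigning a field to itself makes awk rebuild the line using OFS, which is set to the empty string here:

```shell
row=$(printf 'a\tb\tc\td\te\tf\tg\th\ti\tj\tk\tl')

# Join all columns: rebuild the record with an empty output field separator.
joined=$(printf '%s\n' "$row" | awk -F'\t' -v OFS= '{ $1 = $1 } 1')

# Join only the first lim columns with an explicit for loop.
first_three=$(printf '%s\n' "$row" |
  awk -F'\t' -v lim=3 '{ out = ""; for (x = 1; x <= lim; x++) out = out $x; print out }')

printf '%s\n%s\n' "$joined" "$first_three"
```

Setting -F'\t' explicitly means that spaces inside a column are left alone instead of being treated as separators.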
Well, it depends on your editor/command of choice. But generally, it boils down to replacing the tab character with nothing.
For example, in vim: ":%s/\t//g"
You did not mention which tool you would like to use, but any text editor should be able to replace the tabs with nothing. I guess that would work; that's what I usually do.
