How to delete duplicate lines in bash - linux

Given a long text file like this one (that we will call file.txt):
1 AA
2 ab
3 azd
4 ab
5 AA
6 aslmdkfj
7 AA
How can I delete lines that appear at least twice in the same file in bash, keeping the first occurrence? What I mean is that I want this result:
1 AA
2 ab
3 azd
6 aslmdkfj
I do not want any line to appear twice in the output, given a specific text file. Could you show me the command, please?

Assuming whitespace is significant, the typical solution is:
awk '!x[$0]++' file.txt
(E.g., the line "ab " is not considered the same as "ab". It is probably simplest to pre-process the data if you want to treat whitespace differently.)
--EDIT--
Given the modified question, which I'll interpret as only wanting to check uniqueness after a given column, try something like:
awk '!x[ substr( $0, 2 )]++' file.txt
This compares only the substring starting at the 2nd character, ignoring each line's first character (adjust the offset if the first column is wider). This is a typical awk idiom: we are simply building an array named x (one-letter variable names are a terrible idea in a script, but are reasonable for a one-liner on the command line) which holds the number of times a given string has been seen; the first time a string is seen, the line is printed. In the first case we use the entire input line contained in $0. In the second case we use only the substring consisting of everything from the 2nd character onward.
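As a quick check, the substr variant reproduces the desired output (the printf here just recreates the question's sample data):

```shell
# Recreate the sample input and keep only the first occurrence of each
# value, ignoring the leading line-number character of each line:
printf '%s\n' '1 AA' '2 ab' '3 azd' '4 ab' '5 AA' '6 aslmdkfj' '7 AA' |
awk '!x[ substr( $0, 2 ) ]++'
# Prints:
# 1 AA
# 2 ab
# 3 azd
# 6 aslmdkfj
```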

Try this simple script:
cat file.txt | sort | uniq
cat will output the contents of the file,
sort will put duplicate entries adjacent to each other
uniq will remove adjacent duplicate entries.
Hope this helps!

The uniq command will do what you want.
But make sure the file is sorted first: uniq only removes consecutive duplicate lines.
Like this:
sort file.txt | uniq
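Equivalently, sort -u combines the two steps in one command. Note that, like sort | uniq, it does not preserve the original line order; the awk approach above does.

```shell
# sort -u is the one-command equivalent of sort | uniq; LC_ALL=C gives a
# locale-independent byte ordering of the output.
printf '%s\n' 'ab' 'AA' 'ab' 'azd' 'AA' | LC_ALL=C sort -u
# Prints:
# AA
# ab
# azd
```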

Related

linux shell script delimiter

How to change delimiter from current comma (,) to semicolon (;) inside .txt file using linux command?
Here is my ME_1384_DataWarehouse_*.txt file:
Data Warehouse,ME_1384,Budget for HW/SVC,13/05/2022,10,9999,13/05/2022,27,08,27,08
Data Warehouse,ME_1384,Budget for HW/SVC,09/05/2022,10,9999,09/05/2022,45,58,45,58
Data Warehouse,ME_1384,Budget for HW/SVC,25/05/2022,10,9999,25/05/2022,7,54,7,54
Data Warehouse,ME_1384,Budget for HW/SVC,25/05/2022,10,9999,25/05/2022,7,54,7,54
It is very important that the values of the last two columns are numbers with 2 decimal places; the value of the last 2 columns in the first row, for example, is "27,08".
That could be the main reason why the delimiter couldn't be changed properly.
I tried with:
sed 's/,/;/g' ME_1384_DataWarehouse_*.txt
and every comma was changed, including the ones in the values of the last 2 columns mentioned above.
Is there anyone who can help me out with this issue?
With sed you can replace the nth occurrence of a certain lookup string. Example:
$ sed 's/,/;/4' file
will replace the 4th comma with a semicolon.
So, if you know you have 11 fields (10 commas), you can do
$ sed 's/,/;/g;s/;/,/10;s/;/,/8' file
Example:
$ seq 1 11 | paste -sd, | sed 's/,/;/g;s/;/,/10;s/;/,/8'
1;2;3;4;5;6;7;8,9;10,11
Your question is somewhat unclear, but if you are trying to say "don't change the last comma, or the third-to-last one", a solution to that might be
perl -pi~ -e 's/,(?![^,]+(?:,[^,]+,[^,]+)?$)/;/g' ME_1384_DataWarehouse_*.txt
Perl in isolation does not loop over the input lines, but the -p option says to loop over input one line at a time, like sed, and print every line (there is also -n to simulate the behavior of sed -n). The -i~ says to modify the file, but save the original with a tilde added to its file name as a backup. The regex uses a negative lookahead (?!...) to protect the two fields you want to exempt from the replacement; lookaheads are a modern regex feature that isn't supported by older tools like sed.
Once you are satisfied with the solution, you can remove the ~ after -i to disable the generation of backups.
You can do this with awk:
awk -F, 'BEGIN {OFS=";"} {a=$NF;NF-=1; printf "%s,%s\n",$0,a} ' input_file
This should work with most awk versions (do not count on the Solaris default awk).
The idea is to store the last field of the row in a variable, decrease the number of fields, and then print the rebuilt line using the new delimiter, followed by a comma and the stored last field. Note that this protects only the final comma (the last "27,08" pair); the comma inside the second-to-last value is still replaced.

how to print two strings in a line one with space delimiter and another between two strings in Linux

I have a file with more than 100 lines.
But only some lines have a specific pattern, like abc.
My question is that I want to print two things:
5th word of line which has pattern abc.
words between 2 distinct strings (xxx, yyy).
Say for example my file has the content below:
This is first line.
Second line has abc pattern with xxx as first separator and yyy as second separator.
This is third line.
Again fourth line has same pattern abc with separators xxx and yyy.
And so on.
The required output is like below:
pattern as first separator and
same and
I tried many ways in Linux, but whenever I was able to print the 5th word I could not also print the content between xxx and yyy, and vice versa.
Can any one help me please?
Let me answer your question:
My question is that I want to print two things:
5th word of line which has pattern abc.
words between 2 distinct strings (xxx, yyy).
You can use awk for both parts of your question:
awk '/abc/{print $5}' input_file.txt
awk '/xxx.*yyy/{if(match($0,"xxx.*yyy")){print substr($0,RSTART,RLENGTH)}}' input_file.txt
If you need to combine both requirements in one command:
awk '/abc/{print $5} /xxx.*yyy/{if(match($0,"xxx.*yyy")){print substr($0,RSTART,RLENGTH)}}' input_file.txt
OUTPUT:
pattern
xxx as first separator and yyy
same
xxx and yyy
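Here is the combined command run against the question's sample text (note the closing quote after the regex string and the RLENGTH variable name):

```shell
# For each line: print the 5th word if the line contains abc, and print
# the (greedy) xxx...yyy span if the line contains one.
printf '%s\n' \
  'This is first line.' \
  'Second line has abc pattern with xxx as first separator and yyy as second separator.' \
  'This is third line.' \
  'Again fourth line has same pattern abc with separators xxx and yyy.' |
awk '/abc/{print $5} /xxx.*yyy/{if(match($0,"xxx.*yyy")){print substr($0,RSTART,RLENGTH)}}'
# Prints:
# pattern
# xxx as first separator and yyy
# same
# xxx and yyy
```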

AWK - Show lines where column contains a specific string

I have a document (.txt) composed like that.
info1: info2: info3: info4
And I want to show some information by column.
For example, I have different information in the "info3" field; I want to see only the lines that have "test" in the "info3" column.
I think I have to use sort but I'm not sure.
Any idea?
The other answers assume that the third column is exactly equal to test. It looks like you are looking for columns whose value includes test, so we need awk's match function:
awk -F: 'match($3, "test")' file
You can use awk for this. Assuming your columns are de-limited by : and column 3 has entries having test, below command lists only those lines having that value.
awk -F':' '$3=="test"' input-file
Assuming that the spacing is consistent, and you're looking for only test in the third column, use
grep ".*:.*: test:.*" file.txt
Or to take care of any spacing that might occur
grep ".*:.*: *test *:.*" file.txt
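A quick illustration of the contains-vs-equals distinction, using made-up sample lines. With -F: the space after each colon becomes part of the field ($3 is " test"), which is why the substring match works where an exact $3=="test" comparison would not:

```shell
# Keep lines whose 3rd colon-separated column contains "test":
printf '%s\n' 'info1: info2: test: info4' 'info1: info2: prod: info4' 'info1: info2: testcase: info4' |
awk -F: 'match($3, "test")'
# Prints:
# info1: info2: test: info4
# info1: info2: testcase: info4
```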

Extracting two columns and search for specific words in the first column remaining without cuting the ones remaining

I have a .csv file filled with people's names, their group, the city they live in, and the day they are able to work; these 4 pieces of information are separated by ":".
For example:
Dennis:GR1:Thursday:Paris
Charles:GR3:Monday:Levallois
Hugues:GR2:Friday:Ivry
Michel:GR2:Tuesday:Paris
Yann:GR1:Monday:Pantin
I'd like to cut the 2nd and 3rd columns and print all the lines containing names ending with "s", without losing the other remaining column.
For example, I would like to have something like this:
Dennis:Paris
Charles:Levallois
Hugues:Ivry
I tried to do this with grep and cut, but with cut I ended up with just the 1st column remaining.
I hope that I've been able to make myself understood!
It sounds like all you need is:
$ awk 'BEGIN{FS=OFS=":"} $1~/s$/{print $1, $4}' file
Dennis:Paris
Charles:Levallois
Hugues:Ivry
To address your comment requesting a grep+cut solution:
$ grep -E '^[^:]+s:' file | cut -d':' -f1,4
Dennis:Paris
Charles:Levallois
Hugues:Ivry
but awk is the right way to do this.

How do I sort input with a variable number of fields by the second-to-last field?

Editor's note: The original title of the question mentioned tabs as the field separators.
In a text such as
500 east 23rd avenue Toronto 2 890 400000 1
900 west yellovillage blvd Mississauga 3 800 600090 3
how would you sort in ascending order of the second to last column?
Editor's note: The OP later provided another sample input line, 500 Jackson Blvd Toronto 3 700 40000 2, which contains only 8 whitespace-separated input fields (compared to the 9 above), revealing the need to deal with a variable number of fields in the input.
Note: There are several, potentially separate questions:
Update: Question C was the relevant one.
Question A: As implied by the question's title only: how can you use the tab character (\t) as the field separator?
Question B: How can you sort input by the second-to-last field, without knowing that field's specific index up front, given a fixed number of fields?
Question C: How can you sort input by the second-to-last field, without knowing that field's respective index up front, given a variable number of fields?
Answer to question A:
sort's -t option allows you to specify a field separator.
By default, sort uses any run of line-interior whitespace as the separator.
Assuming Bash, Ksh, or Zsh, you can use an ANSI C-quoted string ($'...') to specify a single tab as the field separator ($'\t'):
sort -t $'\t' -n -k8,8 file # -n sorts numerically; omit for lexical sorting
Answer to question B:
Note: This assumes that all input lines have the same number of fields, and that input comes from file file:
# Determine the index of the next-to-last column, based on the first
# line, using Awk:
nextToLastColNdx=$(head -n 1 file | awk -F '\t' '{ print NF - 1 }')
# Sort numerically by the next-to-last column (omit -n to sort lexically):
sort -t $'\t' -n -k$nextToLastColNdx,$nextToLastColNdx file
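A minimal end-to-end sketch under question B's assumptions: tab-separated input with a fixed field count of 3 (the file name /tmp/fixed.tsv is illustrative):

```shell
# Three tab-separated columns per line; sort numerically on the
# next-to-last column, whose index is detected from the first line.
printf '30\t5\tx\n10\t2\ty\n20\t9\tz\n' > /tmp/fixed.tsv
nextToLastColNdx=$(head -n 1 /tmp/fixed.tsv | awk -F '\t' '{ print NF - 1 }')
sort -t $'\t' -n -k"$nextToLastColNdx,$nextToLastColNdx" /tmp/fixed.tsv
# Prints the rows in key order 2, 5, 9.
```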
Note: To sort by a single field, always specify it as the end field too (e.g., -k8,8), as above, because sort, given only a start field index (e.g., -k8), sorts from the specified field through the remainder of the line.
Answer to question C:
Note: This assumes that input lines may have a variable number of fields, and that on each line it is that line's second-to-last field that should act as the sort field; input comes from file file:
awk '{ printf "%s\t%s\n", $(NF-1), $0 }' file |
sort -n -k1,1 | # omit -n to perform lexical sorting
cut -f2-
The awk command extracts each line's second-to-last field and prepends it to the input line on output, separated by a tab.
The result is sorted by the first field (i.e., each input line's second-to-last field).
Finally, the artificially prepended sort field is removed again, using cut.
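Applying the pipeline to the question's sample lines (which have 9, 9, and 8 fields respectively) sorts them by each line's own second-to-last field, i.e., by 400000, 600090, and 40000:

```shell
# Decorate-sort-undecorate: prepend each line's $(NF-1) as a tab-separated
# sort key, sort numerically on it, then strip the key with cut.
printf '%s\n' \
  '500 east 23rd avenue Toronto 2 890 400000 1' \
  '900 west yellovillage blvd Mississauga 3 800 600090 3' \
  '500 Jackson Blvd Toronto 3 700 40000 2' |
awk '{ printf "%s\t%s\n", $(NF-1), $0 }' |
sort -n -k1,1 |
cut -f2-
# Prints:
# 500 Jackson Blvd Toronto 3 700 40000 2
# 500 east 23rd avenue Toronto 2 890 400000 1
# 900 west yellovillage blvd Mississauga 3 800 600090 3
```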
I suggest looking at "man sort".
You will see how to specify a field separator and how to specify the field index that should be used as a key for sorting.
You can use sort -k 2
For example :
echo -e '000 west \n500 east\n500 east\n900 west' | sort -k 2
The result is :
500 east
500 east
900 west
000 west
You can find more information in the man page of sort. Take a look at the end of the man page: just before AUTHOR there is some interesting information :)
Bye
