Remove duplicate lines with a twist gnuwin32 - linux

Okay, so I want to remove duplicate lines, but it's a bit more complicated than that.
I have a file named users.txt; an example of the file is:
users:email#email.com
users1:email#email.com
Now, due to a bug in my system, people were able to register with the same email as someone else, so I want to remove lines whose email appears more than once. Example of the issue:
user:display:email#email.com
user2:email#email.com
user3:email#email.com
user4:email#email.com
Notice how user, user2, user3, and user4 all have the same email. I want to remove user2, user3, and user4 but keep user, or vice versa (whichever line is picked up first), and remove any other lines containing the same email.
So if
email#email.com is in 20 lines, remove 19
spam#spam.com is in 555 lines, remove 554
and so forth.

This can be done with awk:
awk '!a["user:display:email#email.com"]++' filename
a[key]++ evaluates to the counter's current value and then increments it: the first time it is 0 (false), and every time after that it is non-zero (true). So a[key]++ on its own prints everything after the first occurrence.
! inverts that, so only the first occurrence is printed and every later one is suppressed. (Note that with a fixed string as the key, the counter is bumped on every input line, so this keeps just the first line.)
example:
$ awk 'a["user:display:email#email.com"]++' filename
user2:email#email.com
user3:email#email.com
user4:email#email.com
line_random1
linerandom_2_
Now with !
$ awk '!a["user:display:email#email.com"]++' filename
user:display:email#email.com
So, now you just need to work out what to awk on. I have no idea how big your file is, but to at least count the occurrences of an email I would do the following:
$ grep -o 'email#email.com' filename | wc -l
4
If you know what to awk on, just write the result to a new file, to be safe.
awk '!a["user:display:email#email.com"]++' filename >> new_filename

awk to the rescue!
$ awk -F: '!a[$NF]++' file
user:display:email#email.com
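Here -F: makes awk split on colons and $NF is the last field, the email, so only the first line per email is kept. For the "vice versa" case (keep the last occurrence instead of the first), one sketch is to reverse the file with GNU tac, dedupe, and reverse again:
$ tac file | awk -F: '!a[$NF]++' | tac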

Related

Linux : remove duplicate line

I have a text file. I would like to remove all duplicate lines.
I tried these, but they did not work:
sort -ur file.txt
or
uniq -D -f 2 file.txt
file.txt
34.78.54.21 websrv1 nameweb
34.78.54.21 nameweb
I just need one line
From your input I assume you are referring to the first field (34.78.54.21) as a duplicate. If you just want to keep the first occurrence of each number then this works for you:
awk '!a[$1]++' file.txt
Output:
34.78.54.21 websrv1 nameweb
This command checks whether $1 already exists as a key in the array. If it does not, it is added to the array and the default action (print) happens. For the next line with the same $1, the key is already in the array, so the whole expression evaluates to false and nothing is printed.
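If keeping the original line order doesn't matter, a sort-only sketch gives one line per first field as well, though which of the duplicates survives is then up to the sort order rather than file order:
sort -u -k1,1 file.txt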

renaming files using loop in unix

I have a situation here.
I have a lot of files like the ones below in Linux:
SIPTV_FIPTV_ID00$line_T20141003195717_C0000001000_FWD148_IPV_001.DATaac
SIPTV_FIPTV_ID00$line_T20141003195717_C0000001000_FWD148_IPV_001.DATaag
I want to remove the $line and put in its place a counter from 0001 to 6000 for my 6000 such files.
Also, I want to remove the trailing 3 characters of each file name after this is done.
After the fix the files should look like:
SIPTV_FIPTV_ID0000001_T20141003195717_C0000001000_FWD148_IPV_001.DAT
SIPTV_FIPTV_ID0000002_T20141003195717_C0000001000_FWD148_IPV_001.DAT
Please help.
With some assumptions, I think this should do it:
1. list of the files is in a file named input.txt, one file per line
2. the code is running in the directory the files are in
3. bash is available
awk '{i++;printf "mv \x27"$0"\x27 ";printf "\x27"substr($0,1,16);printf "%05d", i;print substr($0,22,47)"\x27"}' input.txt | bash
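If input.txt doesn't exist yet, a plain listing is enough to create it, assuming the glob below matches exactly the files to rename:
ls *.DAT??? > input.txt
The \x27 sequences in the awk output wrap each name in single quotes, so the literal $ in $line isn't expanded when bash runs the mv commands.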
From the command prompt, give the following command:
% printf '%s\n' *.DAT??? | awk '{
old=$0;
sub("\\$line",sprintf("%5.5d",++n));
sub("...$","");
print "mv \x27" old "\x27 \x27" $0 "\x27"}'
%
and check the output, if it looks OK
% printf '%s\n' *.DAT??? | awk '{
old=$0;
sub("\\$line",sprintf("%5.5d",++n));
sub("...$","");
print "mv \x27" old "\x27 \x27" $0 "\x27"}' | sh
%
A commentary: printf '%s\n' *.DAT??? is meant to give awk the list of filenames you want to modify, one per line (echo would put them all on a single line, which breaks the per-file logic); you may want something more articulated if the example names you gave aren't representative of the whole spectrum. Regarding the awk script itself, I used sprintf to generate a string with the correct number of zeroes for the replacement of $line; the idiom "\\$..." with two backslashes to quote the dollar sign is required by gawk and does no harm in mawk; and the \x27 single quotes around each name stop the shell from expanding the literal $line in the old names. As a last remark, I have to say that in similar cases I prefer to make at least a dry run before passing the commands to the shell.
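Since the first answer already assumes bash is available, here is a minimal pure-bash sketch of the same rename, under the same assumption that the glob *.DAT??? matches exactly the files to process:
n=0
for f in *.DAT???; do
  n=$((n+1))
  new=${f/\$line/$(printf '%05d' "$n")}   # replace the literal $line with the counter
  new=${new%???}                          # drop the trailing 3 characters
  echo mv -- "$f" "$new"                  # remove the echo once the output looks right
done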

What is the usage of sorted command?

I have read most of the examples that come with the sort command. However, I am not sure what the usage of the sort command is in this style:
sort <word> sorted
That would just be two file names, as in
sort file1 file2 file3...
If you pass multiple file names, sort concatenates them and sorts all of them together.
If you're asking how to sort a string with the sort command:
echo "tatoine" | grep -o . | sort | tr -d "\n"
aeinott
because sort operates on lines, you've got to cut the string into multiple lines with one letter on each (grep -o .), and after sorting you just delete the newlines with the tr command.
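If grep -o . looks too cryptic, fold can do the same one-character-per-line split:
echo "tatoine" | fold -w1 | sort | tr -d "\n"
aeinott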
Are those < and > symbols explicit, or do they indicate a parameter that is to be replaced? If the latter, then you're reading from a file called "word", and writing the sorted data to a file called "sorted".
Are you trying to save the content in a sorted order?
Let's say you have a file name.txt with the following content.
Zoe
John
Amy
Mary
Mark
Peter
You can use the sort command "sort name.txt" and the output goes to the console.
You can save the output using "sort name.txt -o sortedname.txt"
e.g.
Amy
John
Mark
Mary
Peter
Zoe
You can find more options with the commands "man sort" and "info sort".
rojomoke was right about the > and < symbols. Those are redirection operators.
We usually read the data from standard input (stdin) and output goes to standard output aka the screen (stdout)
< means get the data from somewhere else. e.g. a file.
> means redirect the output to somewhere else e.g. a file.
So for the command above, "sort name.txt -o sortedname.txt", I could have written it as follows:
sort < name.txt > sortedname.txt
You can read more about the redirection in this wiki entry.
https://en.wikipedia.org/wiki/Redirection_(computing)
Operators like | and >> will come in handy down the road.
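For instance, a small sketch combining both (top_names.txt is just a hypothetical output file):
sort name.txt | head -3 >> top_names.txt
This sorts the names, keeps the first three with head, and appends them to top_names.txt instead of overwriting it.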

How to find the particular text stored in the file "data.txt" and it occurs only once

The line I seek is stored in the file data.txt and is the only line of text that occurs only once.
How do I go about finding that particular line using Linux?
This is a little bit old, but I think you are looking for this...
cat data.txt | sort | uniq -u
This will show only the values that occur exactly once in the file. I assume you are familiar with OverTheWire if you are asking? If so, this is what you are looking for.
To provide some context (I need more rep to comment) this is a question that features in an online "wargame" called Bandit that involves using the command line to discover passwords on an online Linux server to advance up the levels.
For those who would like to see data.txt in full, I've Pastebin'd it here; it looks like this:
NN4e37KW2tkIb3dC9ZHyOPdq1FqZwq9h
jpEYciZvDIs6MLPhYoOGWQHNIoQZzE5q
3rpovhi1CyT7RUTunW30goGek5Q5Fu66
JOaWd4uAPii4Jc19AP2McmBNRzBYDAkO
JOaWd4uAPii4Jc19AP2McmBNRzBYDAkO
9WV67QT4uZZK7JHwmOH0jnhurJMwoGZU
a2GjmWtTe3tTM0ARl7TQwraPGXgfkH4f
7yJ8imXc7NNiovDuAl1ZC6xb0O0mMBx1
UsvVyFSfZZWbi6wgC7dAFyFuR6jQQUhR
FcOJhZkHlnwqcD8QbvjRyn886rCrnWZ7
E3ugYDa6Wh2y8C8xQev7vOS8O3OgG1Hw
E3ugYDa6Wh2y8C8xQev7vOS8O3OgG1Hw
ME7nnzbId4W3dajsl6Xtviyl5uhmMenv
J5lN3Qe4s7ktiwvcCj9ZHWrAJcUWEhUq
aouHvjzagN8QT2BCMB6e9rlN4ffqZ0Qq
ZRF5dlSuwuVV9TLhHKvPvRDrQ2L5ODfD
9ZjR3NTHue4YR6n4DgG5e0qMQcJjTaiM
QT8Bw9ofH4x3MeRvYAVbYvV1e1zq3Xim
i6A6TL6nqvjCAPvOdXZWjlYgyvqxmB7k
tx7tQ6kgeJnC446CHbiJY7fyRwrwuhrs
One way to do it is to use:
sort data.txt | uniq -u
The sort command is like cat in that it displays the contents of the file, but it sorts the file lexicographically by lines (it reorders them so that matching lines end up next to each other).
The | is a pipe that redirects the output from one command into another.
The uniq command reports or omits repeated lines and by passing it the -u argument we tell it to report only unique lines.
Used together like this, the command will sort data.txt lexicographically by each line, find the unique line and print it back in the terminal for you.
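The sort step matters because uniq only compares adjacent lines; a tiny sketch shows the difference:
$ printf 'a\nb\na\n' | uniq -u
a
b
a
$ printf 'a\nb\na\n' | sort | uniq -u
b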
sort -u data.txt | while read line; do if [ "$(grep -c "$line" data.txt)" = 1 ]; then echo "$line"; fi; done
was my solution, until I saw the easier one here:
sort data.txt | uniq -u
Add more information to your post.
What does data.txt look like?
Like this:
11111111
11111111
pass1111
11111111
Or like this?
afawfdgd
password
somethin
gelse...
And do you know whether the password is in the file, or are you searching for a non-repeated string?
If you know the password, use something like this:
cat data.txt | grep 'password'
If you don't know the password and it is the only unique line in the file, you must create a script.
For example, in Python:
# read data.txt line by line and print any line containing 'pass'
with open("data.txt", "r") as f:
    for line in f:
        if 'pass' in line:
            print(line.strip())
Of course, replace 'pass' with something else, for example a slice of the line you expect.
And one with only one tool in use, awk:
awk '{a[$1]++}END{for(i in a){if(a[i] == 1){print i} }}' data.txt
sort data.txt | uniq -c | grep '^ *1 '
and it will print the only text that occurs exactly once.
uniq -c prefixes each line with its count, so you want the line whose count is exactly 1. Anchoring the pattern (note the space after the 1) keeps it from also matching counts like 10 or 21, which a bare grep 1 would hit.

using grep in an if statement to get all items, ignoring spaces

This is part of a homework problem in a beginning bash class.
I need to bring in the passwd file, which I have done with my passfile variable, and then I need to be able to extract certain pieces of it and display the different fields. When I manually grep from the CLI using the statement below, it works fine. I want all the fields, and I get them all.
grep 1000 passfile | cut -c1-
However, when I do this from the script, it stops or breaks or starts over at the first blank space in the user's full name. John D. Doe will return 3 lines when I only want one. I see this by echoing the value of i, as shown below.
for i in `grep 1000 ${passfile} | cut -c1-`
do
    user=`echo $i | cut -d : -f1`
    userID=`echo $i | cut -d : -f3`
done
For example, if the line reads
jdoe:x:123:1000:John D Doe:/home/jdoe:/bin/bash
I get the following:
i = jdoe:x:123:1000:John
which gives me:
User is jdoe, UID is 509
but then in the next line i starts at R.
i = R. so User is R., UID is R.
next line
i = Johnson:/home/jjohnson:/bin/bash
which returns User is Johnson, UID is /bin/bash
The passwd file holds many users, so I need to use the for loop to process them all. I think if I can get it to ignore the space I can get it. But not knowing a whole lot about Linux, I'm not sure if I'm even going down the right path. Thanks in advance for guidance/help.
The splitting on spaces is done by the for loop, not by cut: the shell breaks the loop's word list on whitespace, which is why the full name comes apart. (If you do keep using cut on fields, specify the separator; its default delimiter is a tab, not a colon.)
You probably want to use IFS=: and a read statement in a while loop to get the values in:
while IFS=: read user password uid gid comment home shell
do
...whatever...
done < /etc/passwd
Or you can pipe the output of grep into the while loop.
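A minimal complete sketch of that loop, assuming (like the awk answers below) that the 1000 you grep for is the fourth field:
while IFS=: read -r user password uid gid comment home shell
do
    if [ "$gid" = "1000" ]; then
        echo "User is $user, UID is $uid"
    fi
done < /etc/passwd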
Are you allowed to use any external program? If so, I'd recommend awk:
uid=1000
awkcmd="\$4==\"$uid\" {print \"user:\",\$1}"
cat $PASSWORDFILE | awk -F ":" "$awkcmd"
(The variable is lowercase here because UID is a readonly variable in bash.)
When parsing structured files with specific field delimiters, such as the passwd file, the appropriate tool for the job is awk:
uid=1000
awk -F: -v uid="$uid" '$4==uid{print "user: "$1}' /etc/passwd
You do not have to use grep or cut or anything else. (Of course, you can also use a pure bash while read loop, as demonstrated above.)
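Run against the sample line from the question (jdoe:x:123:1000:John D Doe:/home/jdoe:/bin/bash), it prints:
user: jdoe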
