How to remove 1 instance of each (identical) line in a text file in Linux?

There is a file:
Mary
Mary
Mary
Mary
John
John
John
Lucy
Lucy
Mark
I need to get
Mary
Mary
Mary
John
John
Lucy
I cannot figure out how to get the lines ordered according to how many times each line is repeated in the text, i.e. with the most frequently occurring lines listed first.

If your file is already sorted (most-frequent words at top, repeated words only in consecutive lines) – your question makes it look like that's the case – you could reformulate your problem to: "Skip a word when it is encountered for the first time". Then a possible (and efficient) awk solution would be:
awk 'prev==$0{print}{prev=$0}'
or if you prefer an approach that looks more familiar if coming from other programming languages:
awk '{if(prev==$0)print;prev=$0}'
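For example, assuming the sample above is saved as names.txt (a placeholder name), running
awk 'prev==$0{print}{prev=$0}' names.txt
prints exactly the desired output: Mary three times, John twice, Lucy once, and Mark not at all.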
Partially working solutions below. I'll keep them for reference, maybe they are helpful to somebody else.
If your file is not too big, you could use awk to count identical lines and then output each group the number of times it occurred, minus 1.
awk '
  { lines[$0]++ }
  END {
    for (line in lines) {
      for (i = 1; i < lines[line]; ++i) {
        print line
      }
    }
  }
'
Since you mentioned that the most frequent line must come first, you have to sort first:
sort | uniq -c | sort -nr | awk '{count=$1;for(i=1;i<count;++i){$1="";print}}' | cut -c2-
Note that the latter will reformat your lines (e.g. collapsing/squeezing repeated spaces). See Is there a way to completely delete fields in awk, so that extra delimiters do not print?
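If preserving the original spacing matters, one possible variant (a sketch, not carefully tested) is to strip only the count prefix that uniq -c adds, instead of blanking $1:
sort | uniq -c | sort -nr | awk '{count=$1; sub(/^[[:space:]]*[0-9]+ /,""); for(i=1;i<count;++i) print}'
The output is still ordered most-frequent-first, but the lines themselves pass through untouched.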

don't sort for no reason :
nawk '_[$-__]--'
gawk '__[$_]++'
mawk '__[$_]++'
Mary
Mary
Mary
John
John
Lucy
for 1 GB+ files, you can speed things up a bit by preventing FS from splitting the line into unnecessary fields
mawk2 '__[$_]++' FS='\n'
for 100 GB inputs, one idea would be to use parallel to create, say, 10 instances of awk, piping the full 100 GB to each instance but assigning each of them a particular range to partition on their end (e.g. instance 4 handles lines beginning with F-Q, etc).
But instead of outputting it all and THEN attempting to sort the monstrosity, each instance could simply tally up and print only a frequency report of how many copies ("Nx") of each unique line ("Lx") it has recorded.
From there one could sort a much smaller file along the column holding the Lx's, THEN pipe it to one more awk that would print out Nx copies of each line Lx.
probably a lot faster than trying to sort 100 GB
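A rough, untested sketch of that idea (the three ranges, the scratch file names and bigfile.txt are all placeholders; a real run would want more instances, tighter partitions, and data lines that contain no tabs):
i=0
for range in '[A-E]' '[F-Q]' '[R-Z]'; do
    # each instance scans the whole file but only tallies lines whose first
    # character falls in its range, then emits one "count<TAB>line" record
    # per unique line that occurred more than once
    awk -v pat="^$range" '$0 ~ pat { n[$0]++ }
        END { for (l in n) if (n[l] > 1) print n[l] "\t" l }' bigfile.txt > "freq.$i" &
    i=$((i + 1))
done
wait

# the combined frequency report is far smaller than the input: sort it on the
# line column, then expand each tally into (count - 1) copies of its line,
# which is what the one-liner above keeps
sort -t"$(printf '\t')" -k2 freq.* |
    awk -F'\t' '{ for (i = 1; i < $1; i++) print $2 }'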
I created a test scenario by cloning 71 shuffled copies of a raw file with these stats :
uniq rows = 8125950. | UTF8 chars = 160950688. | bytes = 160950688.
i.e. 8.12 mn unique rows spanning 154 MB
resulting in a 10.6 GB test file :
in0: 10.6GiB 0:00:30 [ 354MiB/s] [ 354MiB/s] [============>] 100%
rows = 576942450. | UTF8 chars = 11427498848. | bytes = 11427498848.
even when using just a single instance of awk, it finished filtering the 10.6 GB in ~13.25 mins, which is reasonable given that it's tracking 8.1 mn unique hash keys.
in0: 10.6GiB 0:13:12 [13.7MiB/s] [13.7MiB/s] [============>] 100%
out9: 10.5GiB 0:13:12 [13.6MiB/s] [13.6MiB/s] [<=> ]
( pvE 0.1 in0 < testfile.txt | mawk2 '__[$_]++' FS='\n' )
783.31s user 15.51s system 100% cpu 13:12.78 total
5e5f8bbee08c088c0c4a78384b3dd328 stdin

Related

How do I use grep to get numbers larger than 50 from a txt file

I am relatively new to grep and unix. I am trying to get the names of people who have won more than 50 races from a txt file. So far the code I have used is cat file.txt | grep -E "[5-9][0-9]$", but this is only giving me numbers from 50-99. How could I get it to go from 50 to 200? Thank you!!
driver      races   wins
Some_Man    90      160
Some_Man    10      80
the above is similar to the format of the data, although it is not tabulated.
Do you have to use grep? You could use awk like this:
awk '{if($[replace with the field number]>50)print $2}' < file.txt
assuming your fields are delimited by spaces; otherwise you could use the -F flag to specify the delimiter.
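With the driver/races/wins layout shown above, the win count would be the third field and the name the first, so a concrete guess (skipping the header row; adjust the field numbers if your real file differs) might be:
awk 'NR > 1 && $3 > 50 { print $1 }' < file.txt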
If you must use grep, then it's a regular expression like the one you already have. To make it cover 50 to 200 (including 200 itself) you could do:
cat file.txt | grep -E "(\b[5-9][0-9]|\b1[0-9][0-9]|\b200)$"
Input:
Rank Country Driver Races Wins
1 [United_Kingdom] Lewis_Hamilton 264 94
2 [Germany] Sebastian_Vettel 254 53
3 [Spain] Fernando_Alonso 311 32
4 [Finland] Kimi_Raikkonen 326 21
5 [Germany] Nico_Rosberg 200 23
Awk would be a better candidate for this:
awk '$4>=50 && $4<=200 { print $0 }' file
Check to see if the fourth space-delimited field ($4 - change to whatever field number this actually is) is both greater than or equal to 50 and less than or equal to 200, and print the line ($0) if the condition is met.
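For the sample input above, the win count is the fifth field and the driver name the third, so a concrete version (the field numbers are a guess at what you actually want) would be:
awk '$5 >= 50 && $5 <= 200 { print $3 }' file
which prints Lewis_Hamilton and Sebastian_Vettel for that input.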

Grep logs for occurrences per second

I am trying to search logs for a range of time looking for the number of occurrences a specific account has. For instance I am running this now:
sed '/23:50:28/,/23:55:02/! d' log.log | grep account_number | wc -l
This nicely returns the total number of entries this account has within the given time frame. My question is: how can I also get a list of all those occurrences broken down by each per-second time entry? Example:
23:50:28 - 2
23:50:29 - 1
23:50:30 - 3
etc.
etc.
Thanks
awk to the rescue!
awk '/23:50:28/,/23:55:02/{if(/account_number/) a[$1]++}
     END{for(k in a) print k " - " a[k]}' log | sort
obviously not tested since there is no sample input.
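For this to work the timestamp has to be the first whitespace-delimited field on each line. A purely hypothetical log excerpt, only to illustrate the assumed layout:
23:50:28 login account_number=12345 ok
23:50:28 balance account_number=12345 ok
23:50:29 logout account_number=12345 ok
With that input the command above would print:
23:50:28 - 2
23:50:29 - 1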

How to sort lines in textfile according to a second textfile

I have two text files.
File A.txt:
john
peter
mary
alex
cloey
File B.txt:
peter does something
cloey looks at him
franz is the new here
mary sleeps
I'd like to
merge the two
sort one file according to the other
put the unknown lines of B at the end
like this:
john
peter does something
mary sleeps
alex
cloey looks at him
franz is the new here
$ awk '
NR==FNR { b[$1]=$0; next }
{ print ($1 in b ? b[$1] : $1); delete b[$1] }
END { for (i in b) print b[i] }
' fileB fileA
john
peter does something
mary sleeps
alex
cloey looks at him
franz is the new here
The above will print the remaining items from fileB in a "random" order (see http://www.gnu.org/software/gawk/manual/gawk.html#Scanning-an-Array for details). If that's a problem then edit your question to clarify your requirements for the order those need to be printed in.
It also assumes the keys in each file are unique (e.g. peter only appears as a key value once in each file). If that's not the case then again edit your question to include cases where a key appears multiple times in your sample input/output and additionally explain how you want them handled.
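If you do need the leftover fileB lines printed in the order they appear in fileB, one possible tweak (a sketch under the same unique-key assumption) is to remember the input order in a second array:
awk '
NR==FNR { b[$1]=$0; order[++n]=$1; next }
{ if ($1 in b) { print b[$1]; used[$1]=1 } else print $1 }
END { for (i=1; i<=n; i++) if (!(order[i] in used)) print b[order[i]] }
' fileB fileA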

Extract rows and substrings from one file conditional on information of another file

I have a file 1.blast with coordinate information like this
1 gnl|BL_ORD_ID|0 100.00 33 0 0 1 3
27620 gnl|BL_ORD_ID|0 95.65 46 2 0 1 46
35296 gnl|BL_ORD_ID|0 90.91 44 4 0 3 46
35973 gnl|BL_ORD_ID|0 100.00 45 0 0 1 45
41219 gnl|BL_ORD_ID|0 100.00 27 0 0 1 27
46914 gnl|BL_ORD_ID|0 100.00 45 0 0 1 45
and a file 1.fasta with sequence information like this
>1
TCGACTAGCTACGACTCGGACTGACGAGCTACGACTACGG
>2
GCATCTGGGCTACGGGATCAGCTAGGCGATGCGAC
...
>100000
TTTGCGAGCGCGAAGCGACGACGAGCAGCAGCGACTCTAGCTACTG
I am now looking for a script that takes the first column from 1.blast, extracts those sequence IDs (= first column $1) plus their sequences from the 1.fasta file, and then keeps everything in each sequence except the positions between $7 and $8, meaning for the first two matches the output would be
>1
ACTAGCTACGACTCGGACTGACGAGCTACGACTACGG
>27620
GTAGATAGAGATAGAGAGAGAGAGGGGGGAGA
...
(please notice that the first three entries from >1 are not in this sequence)
The IDs are consecutive, meaning I can extract the required information like this:
awk '{print 2*$1-1, 2*$1, $7, $8}' 1.blast
This then gives me a matrix that contains in the first column the right sequence-identifier row, in the second column the right sequence row (= the one after the ID row), and then the two coordinates that should be excluded. So basically a matrix that contains all the information needed to decide which elements shall be extracted from 1.fasta.
Unfortunately I do not have too much experience with scripting, hence I am now a bit lost: how do I feed these values into, e.g., a suitable sed command?
I can get specific rows like this:
sed -n 3,4p 1.fasta
and the string that I want to remove e.g. via
sed -n 5p 1.fasta | awk '{print substr($0,2,5)}'
But my problem now is: how can I pipe the information from the first awk call into the other commands so that they extract the right rows and then remove the given coordinates from the sequence rows? So substr isn't quite the right command; I would need something like remstr(string,start,stop) that removes everything between those two positions from a given string, but I think I could write that myself. The correct piping in particular is the problem here for me.
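One way the pieces described above could be wired together, as an inefficient sketch (it rescans 1.fasta for every blast row and assumes each sequence sits on a single line):
awk '{print 2*$1, $7, $8}' 1.blast |
while read -r row start stop; do
    sed -n "$((row - 1))p" 1.fasta        # the ">ID" header line
    sed -n "${row}p" 1.fasta |
        awk -v s="$start" -v e="$stop" '{ print substr($0, 1, s - 1) substr($0, e + 1) }'
done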
If you do bioinformatics and work with DNA sequences (or even more complicated things like sequence annotations), I would recommend having a look at Bioperl. This obviously requires knowledge of Perl, but has quite a lot of functionality.
In your case you would want to generate Bio::Seq objects from your fasta-file using the Bio::SeqIO module.
Then, you would need to read the fasta-entry-numbers and positions wanted into a hash. With the fasta-name as the key and the value being an array of two values for each subsequence you want to extract. If there can be more than one such subsequence per fasta-entry, you would have to create an array of arrays as the value entry for each key.
With this data structure, you could then go ahead and extract the sequences using the subseq method from Bio::Seq.
I hope this is a way to go for you, although I'm sure that this is also feasible with pure bash.
This isn't an answer, it is an attempt to clarify your problem; please let me know if I have gotten the nature of your task correct.
foreach row in blast:
    get the proper (blast[$1]) sequence from fasta
    drop bases (blast[$7..$8]) from sequence
    print blast[$1], shortened_sequence
If I've got your task correct, you are being hobbled by your programming language (bash) and the peculiar format of your data (a record split across rows). Perl or Python would be far more suitable to the task; indeed Perl was written in part because multiple file access in awk of the time was really difficult if not impossible.
You've come pretty far with the tools you know, but it looks like you are hitting the limits of their convenient expressibility.
As both thunk and msw have pointed out, more suitable tools are available for this kind of task, but here you have a script that can teach you something about how to handle it with awk:
Content of script.awk:
## Process first file from arguments.
FNR == NR {
    ## Save ID and the range of characters to remove from sequence.
    blast[ $1 ] = $(NF-1) " " $NF
    next
}

## Process second file. For each FASTA id...
$1 ~ /^>/ {
    ## Get number.
    id = substr( $1, 2 )

    ## Read next line (the sequence).
    getline sequence

    ## If the ID is one found in the other file, get ranges and
    ## extract those characters from sequence.
    if ( id in blast ) {
        split( blast[id], ranges )
        sequence = substr( sequence, 1, ranges[1] - 1 ) substr( sequence, ranges[2] + 1 )

        ## Print both lines with the shortened sequence.
        printf "%s\n%s\n", $0, sequence
    }
}
Assuming the 1.blast file of the question and a customized 1.fasta to test it:
>1
TCGACTAGCTACGACTCGGACTGACGAGCTACGACTACGG
>2
GCATCTGGGCTACGGGATCAGCTAGGCGATGCGAC
>27620
TTTGCGAGCGCGAAGCGACGACGAGCAGCAGCGACTCTAGCTACTGTTTGCGA
Run the script like:
awk -f script.awk 1.blast 1.fasta
That yields:
>1
ACTAGCTACGACTCGGACTGACGAGCTACGACTACGG
>27620
TTTGCGA
Of course I'm assuming some things, the most important being that fasta sequences are not longer than one line.
Updated the answer:
awk '
NR==FNR && NF {
    id=substr($1,2)
    getline seq
    a[id]=seq
    next
}
($1 in a) && NF {
    ## drop the characters between positions $7 and $8 (inclusive)
    a[$1]=substr(a[$1],1,$7-1) substr(a[$1],$8+1)
    print ">"$1"\n"a[$1]
} ' 1.fasta 1.blast

Bash Script - Divide Column 2 by Column 3 in the middle but keep 1 and 4 on either side

I have a list that has an ID, population, area and province, that looks like this:
1:517000:405212:Newfoundland and Labrador
2:137900:5660:Prince Edward Island
3:751400:72908:New Brunswick
4:938134:55284:Nova Scotia
5:7560592:1542056:Quebec
6:12439755:1076359:Ontario
7:1170300:647797:Manitoba
8:996194:651036:Saskatchewan
9:3183312:661848:Alberta
10:4168123:944735:British Comumbia
11:42800:1346106:Northwest Territories
12:31200:482443:Yukon Territories
13:29300:2093190:Nunavut
I need to display the names of the provinces with the lowest and the highest population density (population/area). How can I divide column 2 by column 3 (to 2 decimal places) but keep the file information intact on either side (e.g. 1:1.28:Newfoundland and Labrador)? After that I figure I can just pump it into sort -t: -nk2 | head -n 1 and sort -t: -nrk2 | head -n 1 to pull them.
The recommended command given was grep.
Since you seem to have the sorting and extraction under control, here's an example awk script that should work for you:
#!/usr/bin/env awk -f
BEGIN {
    FS=":"
    OFS=":"
    OFMT="%.2f"
}
{
    print $1,$2/$3,$4
}
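Assuming the script above is saved as density.awk and the data as provinces.txt (both names are just placeholders), the sort/head step you already planned then gives the two provinces directly:
awk -f density.awk provinces.txt | sort -t: -k2 -n  | head -n 1    # lowest density:  13:0.01:Nunavut
awk -f density.awk provinces.txt | sort -t: -k2 -rn | head -n 1    # highest density: 2:24.36:Prince Edward Island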
