Remove duplicates from INSANE BIG WORDLIST - linux

What is the best way of doing this? It's a 250 GB text file, one word per line.
Input:
123
123
123
456
456
874
875
875
8923
8932
8923
Output wanted:
123
456
874
875
8923
8932
I need to keep 1 copy of each duplicated line. I DON'T WANT "if there are 2 of the SAME lines, remove both" - just remove the extra copies, always keeping 1 unique line.
What I do now:
$ cat final.txt | sort | uniq > finalnoduplicates.txt
I'm running this in a screen session. Is it working? I don't know, because when I check the size of the output file, it's 0:
123user@instance-1:~$ ls -l
total 243898460
-rw-rw-r-- 1 123user 249751990933 Sep 3 13:59 final.txt
-rw-rw-r-- 1 123user 0 Sep 3 14:26 finalnoduplicates.txt
123user@instance-1:~$
But when I check htop, the CPU usage of the process running this command is at 100%.
Am I doing something wrong?

You can do this using just sort.
$ sort -u final.txt > finalnoduplicates.txt
You can simplify this further and just have sort do all of it:
$ sort -u final.txt -o finalnoduplicates.txt
Finally, since your input file is purely numerical data, you can tell sort via the -n switch to compare numerically, which can further improve the overall performance of this task:
$ sort -nu final.txt -o finalnoduplicates.txt
sort's man page
-n, --numeric-sort
compare according to string numerical value
-u, --unique
with -c, check for strict ordering; without -c, output only the
first of an equal run
-o, --output=FILE
write result to FILE instead of standard output
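Note that sort reads and sorts the entire input (spilling to temporary files) before it writes a single line of output, so an empty finalnoduplicates.txt while the command is still running is expected. For a 250 GB input you can also tune GNU sort's memory buffer, temp directory and thread count; a rough sketch, where the 8G buffer, the /mnt/bigdisk/tmp path and the thread count are assumptions to adjust for your machine:
# assumes GNU coreutils sort; -S, -T, --parallel and --compress-program are GNU options
$ sort -nu -S 8G -T /mnt/bigdisk/tmp --parallel=4 --compress-program=gzip \
      final.txt -o finalnoduplicates.txt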

I found out about this awesome tool called Duplicut. The entire point of the project is to give you the deduplication you would get from unique sorting on huge wordlists without running into memory limits.
It is pretty simple to install; this is the GitHub link:
https://github.com/nil0x42/duplicut
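For reference, a rough install-and-run sketch: the clone-and-make steps are the usual ones for a C project, and the -o output flag is an assumption on my part, so verify the exact usage with ./duplicut --help:
$ git clone https://github.com/nil0x42/duplicut
$ cd duplicut && make
# the -o flag below is assumed, not verified against the README
$ ./duplicut final.txt -o finalnoduplicates.txt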

Related

Merge text files in a numeric order

I'm stuck with a problem. I would like to merge two text files given a specific onset time.
For example:
Text1 (in column):
30
100
200
Text2 (in column):
10
50
70
My output should be 1 text file (single column) like this:
10
30
50
70
100
200
I can use cat or merge to combine files, but I'm not sure how to take care of the ordering of the onset times.
Thank you in advance for all your help!
Like this:
sort -n file1 file2
Most sort commands (e.g. GNU coreutils, FreeBSD, OpenBSD, macOS, uutils) have a merge option for creating one sorted file from multiple files that are already sorted.
sort -m -n text1 text2
The only sort without such an option I could find is from busybox. But even that version tolerates an -m option, ignores it, sorts the files as usual, and therefore still gives the expected result.
I would have assumed that using -m doesn't really matter that much compared to just sorting the concatenated files like busybox does, since sorting algorithms should have optimizations for already sorted parts. However, a small test on my system with GNU coreutils 8.28 proved the contrary:
shuf -i 1-1000000000 -n 10000000 | sort -n > text1 # 10M lines, 95MB
shuf -i 1-1000000000 -n 10000000 | sort -n > text2
time sort -m -n text1 text2 | md5sum # real: 2.0s (used only 1 CPU core)
time sort -n text1 text2 | md5sum # real: 4.5s (used 2 CPU cores)
Although you could just hand both files to sort -n, it seems inelegant not to use the fact that your input files are already sorted. If it is indeed the case that your inputs are sorted, you could do something like:
awk 'BEGIN { n = (getline a < "text2") }
     { while (n > 0 && a+0 < $1+0) { print a; n = (getline a < "text2") } print }
     END { while (n > 0) { print a; n = (getline a < "text2") } }' text1
(The a+0 and $1+0 force a numeric comparison, and the END block flushes whatever is left of text2 once text1 runs out.)
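A quick check of this merge on the asker's sample data (creating the two files first):
$ printf '30\n100\n200\n' > text1
$ printf '10\n50\n70\n' > text2
$ awk 'BEGIN { n = (getline a < "text2") }
       { while (n > 0 && a+0 < $1+0) { print a; n = (getline a < "text2") } print }
       END { while (n > 0) { print a; n = (getline a < "text2") } }' text1
10
30
50
70
100
200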

Count directories and subdirectories

I want to combine directories and sub-directories and sum up the first column as follows:
original output:
8 ./.g/apps/panel/mon/lt/prefs
12 ./.g/apps/panel/mon/lt
40 ./.g/apps/panel/mon
44 ./.g/apps/panel
88 ./.g/apps
112 ./.g
4 ./.g
4 ./.pof
20 ./.local/share/applications
4 ./.local/share/m/packages
8 ./.local/share/m
4 ./.local/share/Trash/info
4 ./.local/share/Trash/files
12 ./.local/share/Trash
44 ./.local/share
new output:
308 ./.g
4 ./.pof
96 ./.local/share
The original command is du -k, and I'm trying with awk and cut but it fails.
Edit: I got this far:
du -k | awk '{print $1}' | cut -d "/" -f 1
Now I'm struggling to merge similar lines and sum up the first column (see the awk sketch just below).
P.S. this is just an output example.
thank you.
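If you want to post-process the du -k output you already have, a sketch with awk that sums the first column grouped by the top-level directory component; it matches the asker's expected totals, though it labels the third group ./.local rather than ./.local/share, and it assumes paths without spaces:
du -k | awk '$2 != "." { split($2, p, "/"); sum[p[1] "/" p[2]] += $1 }
             END { for (k in sum) print sum[k], k }'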
Use du -d 1 to list the cumulative size of everything 1 directory level below the current one.
du -h -d 1
Adding -h gives human-readable sizes.
You can try with command:
du -sh *
Try
du -sk .g .pof .local/share
The -s switch is summary, that is, du will search all the files, all the way down the folders inside, and report just the grand total. (The -k switch prints the size in kilobytes; thanks Romeo Ninov.)
You have to manually specify each folder you want to know the grand total of.
If you type, for example
du -sk .
it will output just a single number, accounting for the current folder (and below) file sizes.
If you type
du -sk *
the result will depend on what your shell expands * to (usually all the files and folders not starting with a dot (.) in the current folder).

How can I list (ls) the 5 last modified files in a directory?

I know ls -t will list all files by modified time. But how can I limit these results to only the last n files?
Try using head or tail. If you want the 5 most-recently modified files:
ls -1t | head -5
The -1 (that's a one) says one file per line and the head says take the first 5 entries.
If you want the last 5 try
ls -1t | tail -5
The accepted answer lists only the filenames, but to get the top 5 files one can also use:
ls -lht | head -6
where:
-l outputs in a list format
-h makes output human readable (i.e. file sizes appear in kb, mb, etc.)
-t sorts output by placing most recently modified file first
head -6 will show 5 files because ls -l prints a "total" block-count line as the first line of output.
I think this is a slightly more elegant and possibly more useful approach.
Example output:
total 26960312
-rw-r--r--# 1 user staff 1.2K 11 Jan 11:22 phone2.7.py
-rw-r--r--# 1 user staff 2.7M 10 Jan 15:26 03-cookies-1.pdf
-rw-r--r--# 1 user staff 9.2M 9 Jan 16:21 Wk1_sem.pdf
-rw-r--r--# 1 user staff 502K 8 Jan 10:20 lab-01.pdf
-rw-rw-rw-# 1 user staff 2.0M 5 Jan 22:06 0410-1.wmv
Use tail command:
ls -t | tail -n 5
By default ls -t sorts output from newest to oldest, so the combination of commands to use depends on which direction you want your output to be ordered.
For the newest 5 files ordered from newest to oldest, use head to take the first 5 lines of output:
ls -t | head -n 5
For the newest 5 files ordered from oldest to newest, use the -r switch to reverse ls's sort order, and use tail to take the last 5 lines of output:
ls -tr | tail -n 5
Note that ls -t sorts by last-modification time, newest first. If you want to sort by status-change time (ctime) instead, add -c, i.e. ls -ltc. To list the newest n files: ls -lt | head -n "$n".
None of the other answers worked for me. The results included both folders and files, which is not what I would expect.
The solution that worked for me was:
find . -type f -mmin -10 -ls
This lists all the files in the current directory (and below) that were modified in the last 10 minutes. It will not list the last 5 files, but I think it might help nevertheless.
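If you specifically want the newest 5 regular files, one sketch is to combine GNU find's -printf with sort; this assumes GNU find, stays in the current directory only, and will misbehave on filenames containing newlines:
$ find . -maxdepth 1 -type f -printf '%T@ %p\n' | sort -rn | head -n 5 | cut -d' ' -f2-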
If you want to watch the last five modified files, refreshing every 2 seconds, with the total number of files shown at the top, use this:
watch 'ls -Art | wc -l ; ls -ltr | tail -n 5'

Using sort in linux, how can I make 12 after 2?

I have a file, leading with numbers:
$ cat file
1
3
13
2
4
12
When I use cat file | sort, it displays like this:
$ cat file | sort
1
12
13
2
3
4
How can I get the answer like this:
1
2
3
4
12
13
Use the -n option to enable numerical sorting:
$ cat file | sort -n
This is faster and more portable than -g, which is a GNU extension to sort.
Use the -g option of sort for general numeric sorting (it can be slow for large inputs):
$ sort -g file
or:
$ sort -n file
The difference can be found in a related question.
UPD: Fixed the useless cat as stated in comments.
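To see the difference: -n only looks at a leading integer or decimal prefix, while -g parses general floating-point notation such as scientific notation. A quick check with GNU sort:
$ printf '1e3\n2e2\n5\n' | sort -n
1e3
2e2
5
$ printf '1e3\n2e2\n5\n' | sort -g
5
2e2
1e3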

How to crop(cut) text files based on starting and ending line-numbers in cygwin?

I have a few log files, around 100 MB each.
Personally I find it cumbersome to deal with such big files. I know that log lines that are interesting to me are only between 200 to 400 lines or so.
What would be a good way to extract the relevant log lines from these files, i.e. I just want to pipe a range of line numbers to another file?
For example, the inputs are:
filename: MyHugeLogFile.log
Starting line number: 38438
Ending line number: 39276
Is there a command that I can run in cygwin to cat out only that range in that file? I know that if I can somehow display that range in stdout then I can also pipe to an output file.
Note: Adding Linux tag for more visibility, but I need a solution that might work in cygwin. (Usually linux commands do work in cygwin).
Sounds like a job for sed:
sed -n '8,12p' yourfile
...will send lines 8 through 12 of yourfile to standard out.
If you want to prepend the line number, you may wish to use cat -n first:
cat -n yourfile | sed -n '8,12p'
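Using the asker's actual line numbers, and quitting as soon as the range has been printed (which helps on big files), a sketch with either sed or awk; extracted.log is just a placeholder output name:
$ sed -n '38438,39276p;39276q' MyHugeLogFile.log > extracted.log
$ awk 'NR > 39276 { exit } NR >= 38438' MyHugeLogFile.log > extracted.log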
You can use wc -l to figure out the total # of lines.
You can then combine head and tail to get at the range you want. Let's assume the log is 40,000 lines and you want lines 38438 through 39276: that is the last 1563 lines (40000 - 38438 + 1), and of those you want the first 839 (39276 - 38438 + 1). So:
tail -1563 MyHugeLogFile.log | head -839 | ....
Or there's probably an easier way using sed or awk.
I saw this thread when I was trying to split a file into files of 100,000 lines each. A better solution than sed for that is:
split -l 100000 database.sql database-
It will give files like:
database-aaa
database-aab
database-aac
...
And if you simply want to cut part of a file - say from line 26 to 142 - and append it to a new file:
cat file-to-cut.txt | sed -n '26,142p' >> new-file.txt
How about this:
$ seq 1 100000 | tail -n +10000 | head -n 10
10000
10001
10002
10003
10004
10005
10006
10007
10008
10009
It uses tail to output from the 10,000th line and onwards and then head to only keep 10 lines.
The same (almost) result with sed:
$ seq 1 100000 | sed -n '10000,10010p'
10000
10001
10002
10003
10004
10005
10006
10007
10008
10009
10010
This one has the advantage of allowing you to input the line range directly.
If you are interested only in the last X lines, you can use the "tail" command like this.
$ tail -n XXXXX yourlogfile.log >> mycroppedfile.txt
This will save the last XXXXX lines of your log file to a new file called "mycroppedfile.txt"
This is an old thread but I was surprised nobody mentioned grep. The -A option allows specifying a number of lines to print after a search match and the -B option includes lines before a match. The following command would output 10 lines before and 10 lines after occurrences of "my search string" in the file "mylogfile.log":
grep -A 10 -B 10 "my search string" mylogfile.log
If there are multiple matches within a large file the output can rapidly get unwieldy. Two helpful options are -n which tells grep to include line numbers and --color which highlights the matched text in the output.
If there is more than one file to be searched, grep allows multiple files to be listed, separated by spaces. Wildcards can also be used. Putting it all together:
grep -A 10 -B 10 -n --color "my search string" *.log someOtherFile.txt
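grep also has -C N as a shorthand for the same amount of context on both sides (-A N -B N), so the earlier command can be shortened; this assumes a grep that supports -C, which GNU and BSD grep both do:
$ grep -C 10 -n --color "my search string" mylogfile.log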
