How to split a text file in Linux from the bottom of the file to the top based on a given pattern

How can I split a text file in Linux (it doesn't matter which command) from the bottom of the file to the top, based on a given pattern?
If I have the file:
111
aaa
222
aaa
333
aaa
The output should be:
1st file:
333
aaa
2nd file:
222
aaa
3rd file:
111
aaa
Thank you.

Reverse the file with tac and then run it through csplit. The -k option keeps the output files even when the repeat count "{99}" overshoots the actual number of matches, so you don't need to know the number of splits in advance.
tac file | csplit -s -k - "/aaa/+1" "{99}"
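By default csplit names the pieces xx00, xx01, and so on; here is a quick sketch (assuming GNU coreutils) to inspect them after running the command above:

# print every piece with a small header
for f in xx*; do
    echo "== $f =="
    cat "$f"
done

Note that tac also reverses the line order inside each chunk, so if a piece should keep its original top-to-bottom order, pipe that piece through tac once more.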

Related

Find different lines in 2 UNSORTED files with different sizes

I have two files, f_hold and f_new. f_new is 2 times bigger than f_hold. Both files are unsorted.
How can I discard the lines in f_new which are also in f_hold? Example:
f_hold:
aaa
bbb
ccc
ddd
eee
f_new:
ppp
ddd
aaa
ccc
bbb
fff
jjj
nnn
what I want:
ppp
fff
jjj
nnn
So it is not a simple line-by-line comparison.
I tried several tips like 'grep -Fxv -f', 'comm', etc., but they compare the files line by line. Is there a Linux command to do that?
For the example you provided, using grep will work:
grep -v -f f_hold f_new
The -f flag means "read the patterns from a file".
The -v flag inverts the matching.
UPDATE:
I guess awk can be much faster:
awk 'NR==FNR{a[$0];next} !($0 in a)' f_hold f_new
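For reference, here is the same one-liner spread out with comments (the array name a is arbitrary):

awk '
    NR == FNR {     # true only while reading the first file, f_hold
        a[$0]       # remember the whole line as an array key
        next        # skip the second block for these lines
    }
    !($0 in a)      # for f_new: print only lines never seen in f_hold
' f_hold f_new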

Remove duplicates from INSANE BIG WORDLIST

What is the best way of doing this? It's a 250 GB text file with 1 word per line.
Input:
123
123
123
456
456
874
875
875
8923
8932
8923
Output wanted:
123
456
874
875
8923
8932
I need to keep 1 copy of each duplicated line. I DON'T WANT to remove both copies if there are 2 of the SAME LINES; just remove 1, always keeping 1 unique line.
What I do now:
$ cat final.txt | sort | uniq > finalnoduplicates.txt
I'm running this in a screen session. Is it working? I don't know, because when I check the size of the output file, it's 0:
123user@instance-1:~$ ls -l
total 243898460
-rw-rw-r-- 1 123user 249751990933 Sep 3 13:59 final.txt
-rw-rw-r-- 1 123user 0 Sep 3 14:26 finalnoduplicates.txt
123user@instance-1:~$
But when I check htop, the CPU usage of the process running this command is at 100%.
Am I doing something wrong?
You can do this using just sort.
$ sort -u final.txt > finalnoduplicates.txt
You can simplify this further and just have sort do all of it:
$ sort -u final.txt -o finalnoduplicates.txt
Finally, since your input file is purely numerical data, you can tell sort via the -n switch to further improve the overall performance of this task:
$ sort -nu final.txt -o finalnoduplicates.txt
From sort's man page:
-n, --numeric-sort
compare according to string numerical value
-u, --unique
with -c, check for strict ordering; without -c, output only the
first of an equal run
-o, --output=FILE
write result to FILE instead of standard output
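For a file this size, GNU sort may also need a bigger scratch area and can use more memory and CPU. A sketch; the path, buffer size, and thread count below are placeholders to adjust for your machine:

# -T: directory for temporary files (needs roughly as much free space as the input)
# -S: in-memory buffer size before sort spills to temp files
# --parallel: number of sorting threads
sort -nu -T /mnt/scratch -S 8G --parallel=4 final.txt -o finalnoduplicates.txt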
I found out about an awesome tool called Duplicut. The entire point of the project is to combine the advantages of unique sorting with a way around the memory limits that huge wordlists run into.
It is pretty simple to install; this is the GitHub link:
https://github.com/nil0x42/duplicut

How to select uncommon rows from two large files using Linux (in the terminal)?

Both files have two columns: names and IDs. (The files are in xls or txt format.)
File 1:
AAA K0125
ccc K0234
BMN_a K0567
BMN_c K0567
File 2:
AKP K0897
BMN_a K0567
ccc K0234
I want to print the uncommon rows from these two files.
How can it be done using the Linux terminal?
Try something like this:
join -j 1 -v 1 -v 2 file1 file2
The -v 1 -v 2 options print the lines from each file whose join field (the first column) has no match in the other file. This assumes the two files are sorted.
First sort both files, then use the comm utility with the -3 option:
sort file1 > file1_sorted
sort file2 > file2_sorted
comm -3 file1_sorted file2_sorted
A portion from man comm
-3 suppress column 3 (lines that appear in both files)
Output:
AAA K0125
AKP K0897
BMN_c K0567
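The same thing works as a one-liner with bash process substitution, so no intermediate sorted files are needed. Keep in mind that comm prefixes the lines unique to the second file with a tab; a sketch, stripping it with GNU sed if you want a flat list:

comm -3 <(sort file1) <(sort file2)
# remove the leading tab on file2-only lines
comm -3 <(sort file1) <(sort file2) | sed 's/^\t//'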

Merge two files on Linux keeping only lines that appear in both files

In Linux, how can I merge two files and only keep lines that have a match in both files?
Each line is separated by a newline (\n).
So far, the approach I found is to sort both files and then use comm -12. Is this the best approach (assuming it's correct)?
fileA contains
aaa
bbb
ccc
ddd
fileB contains
aaa
ddd
eee
and I'd like a new file to contain
aaa
ddd
Provided both input files are lexicographically sorted, you can indeed use comm:
$ comm -12 fileA fileB > fileC
If that's not the case, you should sort your input files first:
$ comm -12 <(sort fileA) <(sort fileB) > fileC
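If you would rather not sort at all, grep can compute the same intersection; a sketch, noting that it keeps fileB's original order and any duplicates it contains:

# -F: fixed strings, -x: match whole lines, -f: read the patterns from fileA
grep -Fxf fileA fileB > fileC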

Linux: how to sort the lines of a file

I have a file called abc. The content of abc is:
ccc
abc
ccc
ccc
a
b
dd
ccc
I want to sort the lines of the file and delete all duplicates (in this case, the ccc lines are duplicates).
In the shell script I use this:
sort -u < $1
But the sorted result goes to standard output instead of being saved into the abc file. How do I do this?
You can redirect the output to a file, as long as it is not the same file you are reading from (redirecting back into the input file truncates it before sort has read it):
sort -u < "$1" > abc_sorted
try
sort -u abc -o abc_sorted
or if you want to replace the file
sort -u abc -o abc
you could also do
sort abc | uniq > abc_sorted
You can simply do it using the commands sort and uniq together with | (pipe) and > (redirection). If your file is named file:
sort file | uniq > file_sorted
Do not redirect straight back into the same file you are sorting; the > truncates it before sort can read it.
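If the result really must end up back in the same file, write to a temporary file first and then move it over, or use sort -u abc -o abc as shown above (safe because sort reads all of its input before writing):

sort file | uniq > file.tmp && mv file.tmp file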
