sort across multiple files in linux - linux

I have multiple (many) files; each very large:
file0.txt
file1.txt
file2.txt
I do not want to join them into a single file because the resulting file would be 10+ GB. Each line in each file contains a 40-byte string. The strings are fairly well ordered right now (roughly 1 step in 10 is a decrease in value instead of an increase).
I would like the lines ordered. (in-place if possible?) This means some of the lines from the end of file0.txt will be moved to the beginning of file1.txt and vice versa.
I am working on Linux and fairly new to it. I know about the sort command for a single file, but am wondering if there is a way to sort across multiple files. Or maybe there is a way to make a pseudo-file made from smaller files that linux will treat as a single file.
What I know I can do:
I can sort each file individually and read into file1.txt to find the value larger than the largest in file0.txt (and similarly grab the lines from the end of file0.txt), join them, and then sort. But this is a pain, and it assumes no values from file2.txt belong in file0.txt (although that is highly unlikely in my case).
Edit
To be clear, if the files look like this:
f0.txt
DDD
XXX
AAA
f1.txt
BBB
FFF
CCC
f2.txt
EEE
YYY
ZZZ
I want this:
f0.txt
AAA
BBB
CCC
f1.txt
DDD
EEE
FFF
f2.txt
XXX
YYY
ZZZ

I don't know about a command doing in-place sorting, but I think a faster "merge sort" is possible:
for file in *.txt; do
    sort -o "$file" "$file"
done
sort -m *.txt | split -d -l 1000000 - output
The sort in the for loop makes sure the content of each input file is sorted. If you don't want to overwrite the original, simply change the value after the -o parameter. (If you expect the files to be sorted already, you could change the sort statement to "check-only": sort -c "$file" || exit 1)
The second sort does efficient merging of the input files, all while keeping the output sorted.
This is piped to the split command which will then write to suffixed output files. Notice the - character; this tells split to read from standard input (i.e. the pipe) instead of a file.
Also, here's a short summary of how the merge sort works:
1. sort reads a line from each file.
2. It orders these lines and selects the one which should come first. This line gets sent to the output, and a new line is read from the file which contained this line.
3. Repeat step 2 until there are no more lines in any file.
4. At this point, the output should be a perfectly sorted file.
5. Profit!
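As a rough illustration, the whole pipeline can be run on the three small files from the question. This is only a sketch: the scratch directory, the 3-line chunk size, and the LC_ALL=C setting (added here so that ordering is plain byte order regardless of locale) are assumptions made for the demo, not part of the answer above.

```shell
#!/bin/sh
# Hypothetical scratch directory; the file contents are the f0/f1/f2 example from the question.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'DDD\nXXX\nAAA\n' > f0.txt
printf 'BBB\nFFF\nCCC\n' > f1.txt
printf 'EEE\nYYY\nZZZ\n' > f2.txt

# Assumption: force byte-order comparison so the result does not depend on the locale.
export LC_ALL=C

# Step 1: sort every file on its own (in place, via -o).
for file in f*.txt; do
    sort -o "$file" "$file"
done

# Step 2: merge the already-sorted files and cut the stream into 3-line pieces.
# -d gives numeric suffixes: output00, output01, output02.
sort -m f*.txt | split -d -l 3 - output

first_chunk=$(cat output00)
middle_chunk=$(cat output01)
last_chunk=$(cat output02)
printf '%s\n' "$first_chunk" "$middle_chunk" "$last_chunk"

# Clean up the scratch directory.
cd / && rm -rf "$workdir"
```

With real data, the 3 would be replaced by however many lines each output file should hold.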

It isn't exactly what you asked for, but the sort(1) utility can help, a little, using the --merge option. Sort each file individually, then sort the resulting pile of files:
for f in file*.txt ; do sort -o "$f" < "$f" ; done
sort --merge file*.txt | split -l 100000 - sorted_file
(That's 100,000 lines per output file. Perhaps that's still way too small.)

I believe that this is your best bet, using stock Linux utilities:
1. Sort each file individually, e.g. for f in file*.txt; do sort "$f" > "sorted_$f"; done
2. Merge everything using sort -m sorted_file*.txt | split -d -l <lines> - <prefix>, where <lines> is the number of lines per file, and <prefix> is the filename prefix. (The -d tells split to use numeric suffixes.)
The -m option to sort lets it know the input files are already sorted, so it only has to merge them rather than sort from scratch.

mmap() the 3 files, as all lines are 40 bytes long, you can easily sort them in place (SIP :-). Don't forget the msync at the end.

If the files are sorted individually, then you can use sort -m file*.txt to merge them together - read the first line of each file, output the smallest one, and repeat.

Related

Prefix search names to output in bash

I have a simple egrep command searching for multiple strings in a text file which outputs either null or a value. Below is the command and the output.
cat Output.txt|egrep -i "abc|def|efg"|cut -d ':' -f 2
Output is:-
xxx
(null)
yyy
Now I am trying to prefix the search strings to the output, like below.
abc:xxx
def:
efg:yyy
Any help on the code to achieve this or where to start would be appreciated.
-Abhi
Since I do not know the exact content of your input file (it is not specified in the question), I will make some assumptions in order to answer.
Case 1: the patterns you are looking for are always located in the same column
If it is the case, the answer is quite straightforward:
$ cat grep_file.in
abc:xxx:uvw
def:::
efg:yyy:toto
xyz:lol:hey
$ egrep -i "abc|def|efg" grep_file.in | cut -d':' -f1,2
abc:xxx
def:
efg:yyy
After the grep just use the cut with the 2 columns that you are looking for (here it is 1 and 2)
REMARK:
Do not cat the file and pipe it into grep, since that does the work twice! Your grep command already reads the file, so do not read it twice. It might not matter much on small files, but you will feel the difference on 10 GB files, for example.
Case 2: the patterns you are looking for are NOT located in the same column
In this case it is a bit more tricky, but not impossible. There are many ways of doing this; here I will detail the awk way:
$ cat grep_file2.in
abc:xxx:uvw
::def:
efg:yyy:toto
xyz:lol:hey
If your input file is in this format, where the pattern can appear in any column:
$ awk 'BEGIN{FS=":";ORS=FS}{tmp=0;for(i=1;i<=NF;i++){tmp=match($i,/abc|def|efg/);if(tmp){print $i;break}}if(tmp){printf "%s\n", $2}}' grep_file2.in
abc:xxx
def:
efg:yyy
Explanations:
FS=":";ORS=FS sets both the input field separator and the output record separator to :. Then, on each line, a test variable tmp becomes true once the pattern is found: you loop over all the fields of the line until one matches; when it does, you print that field and break out of the loop, then print the second field followed by an EOL character.
If you do not meet your pattern you do nothing.
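The same logic may be easier to follow when laid out over several lines. The sketch below is a rewrite for readability, not the answer's exact command: it uses -F: and a single printf instead of the ORS trick, the variable name found is made up for the example, and the temporary file stands in for grep_file2.in.

```shell
#!/bin/sh
# Temporary stand-in for grep_file2.in from the answer above.
input=$(mktemp)
printf '%s\n' 'abc:xxx:uvw' '::def:' 'efg:yyy:toto' 'xyz:lol:hey' > "$input"

awk_result=$(awk -F: '
{
    found = ""
    # Scan the fields left to right and stop at the first one that matches.
    for (i = 1; i <= NF; i++) {
        if (match($i, /abc|def|efg/)) {
            found = $i
            break
        }
    }
    # Print "<matching field>:<second field>" only when something matched.
    if (found != "") {
        printf "%s:%s\n", found, $2
    }
}' "$input")

printf '%s\n' "$awk_result"
rm -f "$input"
```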
If you prefer the sed way, you can use the following command:
$ sed -n '/abc\|def\|efg/{h;s/.*\(abc\|def\|efg\).*/\1:/;x;s/^[^:]*:\([^:]*\):.*/\1/;H;x;s/\n//p}' grep_file2.in
abc:xxx
def:
efg:yyy
Explanations:
/abc\|def\|efg/{} filters the lines that contain at least one of the patterns provided; the instructions in the block are executed only on those lines. h;s/.*\(abc\|def\|efg\).*/\1:/; saves the line in the hold space and replaces the pattern space with the matched pattern followed by a colon. x;s/^[^:]*:\([^:]*\):.*/\1/; exchanges the pattern and hold spaces and extracts the 2nd column element. Last but not least, H;x;s/\n//p regroups both extracted elements on one line and prints it.
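For readers who find the one-liner dense, here is the same sed script spread over several lines with a comment per step. It is a sketch that assumes GNU sed (the \| alternation is a GNU extension in basic regular expressions), and the temporary file is a stand-in for grep_file2.in.

```shell
#!/bin/sh
# Temporary stand-in for grep_file2.in from the answer above.
input=$(mktemp)
printf '%s\n' 'abc:xxx:uvw' '::def:' 'efg:yyy:toto' 'xyz:lol:hey' > "$input"

sed_result=$(sed -n '
# Only act on lines that contain one of the patterns.
/abc\|def\|efg/ {
    # Save the original line in the hold space.
    h
    # Reduce the pattern space to the matched word plus a colon.
    s/.*\(abc\|def\|efg\).*/\1:/
    # Swap: the pattern space now holds the original line, the hold space "word:".
    x
    # Reduce the original line to its second field.
    s/^[^:]*:\([^:]*\):.*/\1/
    # Append it to the hold space (H inserts a newline first).
    H
    # Swap back, delete that newline, and print the joined result.
    x
    s/\n//p
}' "$input")

printf '%s\n' "$sed_result"
rm -f "$input"
```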
Try this:
$ egrep -io "(abc|def|efg):[^:]*" file
This will print the match and the next token after the delimiter.
If we can assume that there are only two fields, that abc etc will always match in the first field, and that getting the last match on a line which contains multiple matches is acceptable, a very simple sed script could work.
sed -n 's/^[^:]*\(abc\|def\|efg\)[^:]*:\([^:]*\)/\1:\2/p' file
If other but similar conditions apply (e.g. there are three fields or more but we don't care about matches in the first two) the required modifications are trivial. If not, you really need to clarify your question.

How do I do a one way diff in Linux?

Normal behavior of diff:
Normally, diff will tell you all the differences between two files. For example, it will tell you anything that is in file A that is not in file B, and will also tell you everything that is in file B, but not in file A. For example:
File A contains:
cat
good dog
one
two
File B contains:
cat
some garbage
one
a whole bunch of garbage
something I don't want to know
If I do a regular diff as follows:
diff A B
the output would be something like:
2c2
< good dog
---
> some garbage
4c4,5
< two
---
> a whole bunch of garbage
> something I don't want to know
What I am looking for:
What I want is just the first part, for example, I want to know everything that is in File A, but not file B. However, I want it to ignore everything that is in file B, but not in file A.
What I want is the command, or series of commands:
???? A B
that produces the output:
2c2
< good dog
4c4,5
< two
I believe a solution could be achieved by piping the output of diff into sed or awk, but I am not familiar enough with those tools to come up with a solution. I basically want to remove all lines that begin with --- and >.
Edit: I edited the example to account for multiple words on a line.
Note: This is a "sub-question" of: Determine list of non-OS packages installed on a RedHat Linux machine
Note: This is similar to, but not the same as the question asked here (e.g. not a dupe):
One-way diff file
An alternative, if your files consist of single-line entities only, and the output order doesn't matter (the question as worded is unclear on this), would be:
comm -23 <(sort A) <(sort B)
comm requires its inputs to be sorted, and the -2 means "don't show me the lines that are unique to the second file", while -3 means "don't show me the lines that are common between the two files".
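A small run on the A and B files from the question might look like the sketch below. It writes the sorted copies to temporary files instead of using <(...) so that it also works in shells without process substitution; the scratch directory, the .sorted file names, and the LC_ALL=C setting are assumptions made for the example.

```shell
#!/bin/sh
# Hypothetical scratch directory holding the A and B files from the question.
workdir=$(mktemp -d)
printf '%s\n' 'cat' 'good dog' 'one' 'two' > "$workdir/A"
printf '%s\n' 'cat' 'some garbage' 'one' 'a whole bunch of garbage' \
    "something I don't want to know" > "$workdir/B"

# Assumption: use byte order for both sort and comm so they agree on ordering.
export LC_ALL=C

sort "$workdir/A" > "$workdir/A.sorted"
sort "$workdir/B" > "$workdir/B.sorted"

# -2 hides lines unique to B, -3 hides lines common to both: only A-only lines remain.
only_in_a=$(comm -23 "$workdir/A.sorted" "$workdir/B.sorted")
printf '%s\n' "$only_in_a"

rm -rf "$workdir"
```

Note that the result comes out in sorted order, not in the order the lines appear in A.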
If you need the "differences" to be presented in the order they occur, though, the diff / awk solution is fine (although the grep bit isn't really necessary; it could be diff A B | awk '/^</ { $1 = ""; print }').
EDIT: fixed which set of lines to report - I read it backwards originally...
As stated in the comments, one mostly correct answer is
diff A B | grep '^<'
although this would give the output
< good dog
< two
rather than
2c2
< good dog
4c4,5
< two
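diff A B|grep '^<'|awk '{print $2}'
grep '^<' selects the lines that start with <.
awk '{print $2}' prints the second whitespace-separated column, i.e. only the first word after the <. For lines containing several words (such as "good dog"), use cut -c3- in place of the awk step to keep the whole line.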
If you want to also see the files in question, in case of diffing folders, you can use
diff public_html temp_public_html/ | grep '^[^>]'
to match all but lines starting with >
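The question asks for output that keeps the change markers (2c2, 4c4,5) and only drops the lines beginning with --- and >. One possible way to get that, sketched here on the question's own A and B files, is to invert the match with grep -v. The scratch directory is an assumption for the demo.

```shell
#!/bin/sh
# Hypothetical scratch directory holding the A and B files from the question.
workdir=$(mktemp -d)
printf '%s\n' 'cat' 'good dog' 'one' 'two' > "$workdir/A"
printf '%s\n' 'cat' 'some garbage' 'one' 'a whole bunch of garbage' \
    "something I don't want to know" > "$workdir/B"

# Drop the "> ..." lines (only in B) and the "---" separators; keep everything else.
one_way=$(diff "$workdir/A" "$workdir/B" | grep -v -e '^>' -e '^---')
printf '%s\n' "$one_way"

rm -rf "$workdir"
```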

How to join two files in shell

There are two files-:
File1-:
email
abc#gmail.com
dbc#yahoo.com
hbc#ymail.com
File2-:
abc#gmail.com,dpk,25,India
dbc#yahoo.com,dpk,25,India
hbc#ymail.com,dpk,25,India
kbc#gmail.com,dpk,25,India
nbc#ymail.com,dpk,25,India
Required file should be-:
abc#gmail.com,dpk,25,India
dbc#yahoo.com,dpk,25,India
hbc#ymail.com,dpk,25,India
We are not using grep because the actual files contain a huge amount of data, and grepping for each email id from File1 in File2 takes a very long time.
Is it possible using the join or comm utility? If yes, please help. I tried but did not get the desired result. Also, these two utilities work on sorted data, but the data in the two files is not sorted.
grep -Ff File1 File2
This takes the fixed strings (-F) from File1 (-f) as patterns to grep in File2 for. Grepping for fixed string should speed up operations significantly.
If that doesn't cut it...
join -t',' File1 File2
...should do as well, but requires both files to be sorted. (Joining on the first field is the default so you only have to tell join to use the comma as field delimiter.) If the files really are huge and require sorting first, I am not sure this will actually be faster.
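A sketch of the sort-then-join route on the sample data follows. The rows of File2 are deliberately shuffled to mimic unsorted input; the scratch directory, the .sorted file names, and LC_ALL=C (so that sort and join agree on ordering) are assumptions made for the example.

```shell
#!/bin/sh
# Hypothetical scratch directory; data taken from the question, File2 shuffled on purpose.
workdir=$(mktemp -d)
printf '%s\n' 'email' 'hbc#ymail.com' 'abc#gmail.com' 'dbc#yahoo.com' > "$workdir/File1"
printf '%s\n' 'nbc#ymail.com,dpk,25,India' 'abc#gmail.com,dpk,25,India' \
    'kbc#gmail.com,dpk,25,India' 'hbc#ymail.com,dpk,25,India' \
    'dbc#yahoo.com,dpk,25,India' > "$workdir/File2"

# Assumption: byte-order collation for both sort and join.
export LC_ALL=C

# Sort both files on the first comma-separated field, which is the join key.
sort -t, -k1,1 "$workdir/File1" > "$workdir/File1.sorted"
sort -t, -k1,1 "$workdir/File2" > "$workdir/File2.sorted"

# Keep only the File2 rows whose first field also appears in File1.
joined=$(join -t, "$workdir/File1.sorted" "$workdir/File2.sorted")
printf '%s\n' "$joined"

rm -rf "$workdir"
```

The header line email in File1 has no partner in File2, so join drops it without any special handling.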

How to determine if the content of one file is included in the content of another file

First, my apologies for what is perhaps a rather stupid question that doesn't quite belong here.
Here's my problem: I have two large text files containing a lot of file names, let's call them A and B, and I want to determine if A is a subset of B, disregarding order, i.e. for each file name in A, find if file name is also in B, otherwise A is not a subset.
I know how to preprocess the files (to remove anything but the file name itself, removing different capitalization), but now I'm left to wonder if there is a simple way to perform the task with a shell command.
Diff probably doesn't work, right? Even if I 'sort' the two files first, so that at least the files that are present in both will be in the same order, since A is probably a proper subset of B, diff will just tell me that every line is different.
Again, my apologies if the question doesn't belong here, and in the end, if there is no easy way to do it I will just write a small program to do the job, but since I'm trying to get a better handle on the shell commands, I thought I'd ask here first.
Do this:
cat b | sort -u | wc
cat a b | sort -u | wc
If you get the same result, a is a subset of b.
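The comparison can also be scripted so that the two counts are checked automatically. This is a minimal sketch with made-up file names and contents; it uses wc -l so that only the line counts are compared.

```shell
#!/bin/sh
# Hypothetical sample data: every name in a also occurs in b.
workdir=$(mktemp -d)
printf '%s\n' 'x.txt' 'y.txt' > "$workdir/a"
printf '%s\n' 'z.txt' 'x.txt' 'y.txt' 'x.txt' > "$workdir/b"

# Number of distinct lines in b alone, and in a and b together.
n_b=$(sort -u "$workdir/b" | wc -l)
n_ab=$(cat "$workdir/a" "$workdir/b" | sort -u | wc -l)

# If adding a contributes no new distinct lines, a is a subset of b.
if [ "$n_b" -eq "$n_ab" ]; then
    verdict="a is a subset of b"
else
    verdict="a is not a subset of b"
fi
printf '%s\n' "$verdict"

rm -rf "$workdir"
```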
Here's how to do it in awk
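awk '
# read A, the supposed subset file
FNR == NR {a[$0]; next}
# process file B: forget every line of A that also appears in B
$0 in a {delete a[$0]}
# anything still left in a[] was never seen in B
END {if (length(a) == 0) {print "A is a subset of B"} else {print "A is not a subset of B"}}
' A B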
Test if an XSD file is a subset of a WSDL file:
xmllint --format file.wsdl | awk '{$1=$1};1' | sort -u | wc
xmllint --format file.wsdl file.xsd | awk '{$1=$1};1' | sort -u | wc
This adapts the elegant concept of RichieHindle's prior answer using:
xmllint --format instead of cat, to pretty print the XML so each XML element was on one line, as required by sort -u | wc. Other pretty printing commands might work here, e.g. jq . for json.
an awk command to normalise the whitespace: strip leading and trailing (because the indentation is different in both files), and collapse internal. Caveat: does not consider XML attribute order within the element.

egrep not writing to a file

I am using the following command in order to extract domain names & the full domain extension from a file. Ex: www.abc.yahoo.com, www.efg.yahoo.com.us.
[a-z0-9\-]+\.com(\.[a-z]{2})?' source.txt | sort | uniq | sed -e 's/www.//'
> dest.txt
The command writes correctly when I specify a small maximum with the -m 100 parameter after source.txt. The problem occurs if I don't specify it, or if I specify a huge number. However, I have previously written to files with grep (not egrep) using similarly huge numbers, and that worked. I also checked the last-modified date and time of the destination file while the command was running, and it seems the file is not being modified. What could be the problem?
As I mentioned in your earlier question, it's probably not an issue with egrep, but that your file is too big and that sort won't output anything (to uniq) until egrep is done. I suggested that you split the file into manageable chunks using the split command. Something like this:
split -l 10000000 source.txt split_source.
This will split the source.txt file into 10-million-line chunks called split_source.aa, split_source.ab, split_source.ac etc. You can then run the entire command on each of those files (changing the redirect at the end to append: >> dest.txt).
The problem here is that you can get duplicates across multiple files, so at the end you may need to run
sort dest.txt | uniq > dest_uniq.txt
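The split, process, and de-duplicate sequence can be tried on a tiny file first. In the sketch below the sample lines, the 2-line chunk size, and the scratch directory are assumptions, and a plain sort | uniq stands in for the full extraction command, which is not shown in the question.

```shell
#!/bin/sh
# Hypothetical scratch directory and sample input (5 lines, with duplicates).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
export LC_ALL=C
printf '%s\n' 'b.com' 'a.com' 'b.com' 'c.com' 'a.com' > source.txt

# Split into 2-line chunks; split names them with two-letter suffixes.
split -l 2 source.txt split_source.
chunks=$(echo split_source.*)

# Process each chunk separately, appending to one result file.
: > dest.txt
for chunk in split_source.*; do
    sort "$chunk" | uniq >> dest.txt
done

# Duplicates can still occur across chunks, so de-duplicate once more at the end.
sort dest.txt | uniq > dest_uniq.txt
final=$(cat dest_uniq.txt)
printf '%s\n' "$chunks" "$final"

cd / && rm -rf "$workdir"
```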
Your question is missing information.
That aside, a few thoughts. First, to debug and isolate your problem:
Run egrep <params> | less so you can see what egrep is doing, and eliminate any problem from sort, uniq, or sed (my bet is on sort).
How big is your input? Any chance sort is dying from too much input?
I'm going to need to see the full command to make further comments.
Second, to improve your script:
You may want to sort | uniq AFTER sed, otherwise you could end up with duplicates in your result set, AND an unsorted result set. Maybe that's what you want.
Consider wrapping your regular expressions with "^...$", if it's appropriate to establish beginning of line (^) and end of line ($) anchors. Otherwise you'll be matching portions in the middle of a line.
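The first suggestion, running sort | uniq after sed, can be illustrated with two made-up input lines that only become identical once the www. prefix is stripped. The sample host names are assumptions; the sed expression is the one from the question.

```shell
#!/bin/sh
export LC_ALL=C

# Original order: de-duplicate first, strip "www." afterwards.
before_lines=$(printf '%s\n' 'www.abc.com' 'abc.com' | sort | uniq | sed -e 's/www.//' | wc -l)

# Suggested order: strip "www." first, then sort and de-duplicate.
after_lines=$(printf '%s\n' 'www.abc.com' 'abc.com' | sed -e 's/www.//' | sort | uniq | wc -l)

# The first pipeline leaves abc.com twice; the second leaves it once.
echo "uniq before sed: $before_lines line(s); uniq after sed: $after_lines line(s)"
```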

Resources