How to join two files in shell - linux

There are two files:
File1:
email
abc#gmail.com
dbc#yahoo.com
hbc#ymail.com
File2:
abc#gmail.com,dpk,25,India
dbc#yahoo.com,dpk,25,India
hbc#ymail.com,dpk,25,India
kbc#gmail.com,dpk,25,India
nbc#ymail.com,dpk,25,India
Required file should be:
abc#gmail.com,dpk,25,India
dbc#yahoo.com,dpk,25,India
hbc#ymail.com,dpk,25,India
We are not using grep because the actual files contain huge amounts of data, and grepping each email id from File1 in File2 takes a very long time.
Is it possible using the join or comm utility? If yes, please help. I have tried but did not get the desired result; also, these two utilities work on sorted data, but the data in the two files is not sorted.

grep -Ff File1 File2
This takes the fixed strings (-F) from File1 (-f) and uses them as patterns to search for in File2. Searching for fixed strings should speed up operations significantly.
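Note that File1 starts with a header line (email), which would otherwise be used as a pattern too. With GNU grep you can strip it first and feed the patterns on standard input (result.csv is just a name chosen for this sketch):
tail -n +2 File1 | grep -Ff - File2 > result.csv    # use File1 minus its header as the fixed-string pattern list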
If that doesn't cut it...
join -t',' File1 File2
...should do as well, but requires both files to be sorted. (Joining on the first field is the default so you only have to tell join to use the comma as field delimiter.) If the files really are huge and require sorting first, I am not sure this will actually be faster.
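If you do take the join route, a minimal sketch might look like this (File1.sorted, File2.sorted and result.csv are made-up names; tail -n +2 again drops File1's header line):
tail -n +2 File1 | sort > File1.sorted       # drop the header line, then sort the email list
sort -t',' -k1,1 File2 > File2.sorted        # sort File2 on its first (email) field
join -t',' File1.sorted File2.sorted > result.csv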

Related

Sorting numerically if the number is not at the start of a line

I used grep -Eo '[0-9]{1,}kg' *.dat, which filters out the values ending in kg. Now I'm trying to sort them in increasing order. My output from grep is:
blue_whale.dat:240kg
crocodile.dat:5kg
elephant.dat:6kg
giraffe.dat:15kg
hippopotamus.dat:4kg
humpback_whale.dat:5kg
ostrich.dat:1kg
sea_turtle.dat:10kg
I've tried to use sort -n, but the sorting doesn't work.
Edit:
I have a bunch of files recording each animal's weight and length. I filtered out the weights for each animal; that part was easy. Then I want to order them in increasing order, which I thought was just sort -n.
Edit:
In my directory I have many .dat files. They contain values like 110000kg 24m, and I need to order them by increasing weight.
You need to use the command in this manner:
grep -Eo '[0-9]{1,}kg' *.dat | sort -t: -n -k2
Use the "-t" option to specify the colon as field separator.
You can use -r option for decreasing or reverse order.
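With the grep output shown above, that pipeline should give something like this (the relative order of the two 5kg lines depends on sort's tie-breaking):
ostrich.dat:1kg
hippopotamus.dat:4kg
crocodile.dat:5kg
humpback_whale.dat:5kg
elephant.dat:6kg
sea_turtle.dat:10kg
giraffe.dat:15kg
blue_whale.dat:240kg
The -n comparison reads the leading digits of the second field and ignores the trailing kg.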

Linux / Unix — pipe the output to join

I am very new to Linux / Unix, and from time to time I do some exercises.
I was working through my exercises until I got stuck on one part.
Plain sort quotes.t5 and pipe the output to join.
In join use field separator, read from stdin and from quotes.comms, output to quotes.t6
The problem is, I don't understand what this part is asking.
A few days ago I ran this command on the server:
wget 'http://finance.yahoo.com/d/quotes.csv?s=BARC.L+0992.HK+RHT+AAPL+ADI+AEIS+AGNC+AMAT+AMGN+AMRN+ARCC+ARIA+ARNA+ATVI+BBRY+BIDU+BRCD+BRCM+BSFT+CENX+CERE+CMCSA+COCO+CSCO+CSIQ+CSOD+CTRP+CTSH+CYTX+DRYS+DTV+DXM+EA+EBAY+EGLE+ENDP+ESRX+EXPD+EXTR+FANG+FAST+FB+FCEL+FITB+FLEX+FOXA+FSLR+FTR+GALE+GERN+GILD+GMCR+GRPN+GTAT+HBAN+HDS+HIMX+HSOL+IMGN+INTC+JASO+JBLU+JDSU+KERX+LINE+LINTA+MDLZ+MNKD+MPEL+MSFT+MU+MXIM+MYL+NFLX+NIHD+NUAN+NVDA+ONNN+ORIG+OTEX+OXBT+PENN+PMCS+PSEC+QCOM+RBCN+REGN+RFMD+RSOL+SCTY+SINA+SIRI+SNDK+SPWR+SYMC+TSLA+TUES+TWGP+TXN+VOLC+WEN+YHOO+ZNGA&f=nab' -O quotes.csv
But the produced file quotes.csv was not good enough to get insight into finances and stuff so I need some help from you!
Checkpointing. When you have finished this lesson you must get this:
$ sha256sum -c quotesshasums
quotes.t1: OK
quotes.t2: OK
quotes.t3: OK
quotes.t4: OK
quotes.t5: OK
quotes.t6: OK
quotes.csv
We have a source file with stock price data.
Lines are terminated with CRLF, which is not Unix style. Make them LF terminated.
That means removing the CR (\r) byte from each line. To do this, use the sed (man sed) substitute command; output to quotes.t1.
More info at http://en.wikipedia.org/wiki/Newline
Run checkpoint to test if quotes.t1 is OK.
Use the head and tail commands to output everything except the first and last line of quotes.t1 to quotes.t2.
Make fields separated with pipe (vertical bar |) instead of comma.
sed -re 's/,([0-9.]+),([0-9.]+)/|\1|\2/g' quotes.t2 > quotes.t3
Numeric sort by the third field (key); don't forget the new separator. Output to quotes.t4.
Output the last five lines and cut them, leaving the first and third fields in the result. Output to quotes.t5.
Plain sort quotes.t5 and pipe the output to join.
In join use field separator, read from stdin and from quotes.comms, output to quotes.t6
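For context, the earlier steps might be implemented roughly like this; the exact flags are guesses from the step descriptions, and head -n -1 is a GNU extension:
sed -e 's/\r$//' quotes.csv > quotes.t1             # strip the trailing CR from each line
tail -n +2 quotes.t1 | head -n -1 > quotes.t2       # everything except the first and last line
# (the sed command producing quotes.t3 is given above)
sort -t'|' -k3,3n quotes.t3 > quotes.t4             # numeric sort on the third |-separated field
tail -n 5 quotes.t4 | cut -d'|' -f1,3 > quotes.t5   # last five lines, fields 1 and 3 only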
If needed, I can post all parts of this exercise, but I think you may already know what I need to do for this part.
Mainly, what I need to know is what that join step means. I searched Google about it, but I still don't get it.
Transferring an abbreviated version of the comments into an answer.
The original version of the question was asking about:
Plain sort quotes.t5 and pipe the output to join.
In join use field separator, read from stdin and from quotes.comms, output to quotes.t6
You need to know that join is a command. It can read from standard input if you specify - as one of its two input file names.
The steps are then, it seems to me, quite straight-forward:
sort quotes.t5 | join -t'|' - quotes.comm > quotes.t6
or perhaps:
sort quotes.t5 | join -t'|' quotes.comm - >quotes.t6
I'm not sure how you tell which is required, except by interpreting 'read from stdin and quotes.comms' as meaning standard input first and quotes.comms second.

How to sort content of a text file in Terminal Linux by splitting at a specific char?

I have a school assignment to sort a file's contents in a specific order.
I had to do it with Windows batch files first, and now I have to do the same in Linux.
The file looks more or less like this the whole way through:
John Doe : Crocodiles : 1035
In Windows I solved the problem with this:
sort /r /+39 file.txt
The rows in the file are supposed to get sorted by the number of points (which is the number to the right) in decreasing order.
Also the second part of the assignment is to sort the rows by the center column.
How can I get the same result(s) in Linux? I have tried a couple of different variations of the sort command in Linux too but so far without success.
I'd do it with:
sort -nr -t: -k3
-nr - numeric sort, in reverse order
-t: - the key separator is a colon
-k3 - sort on the third field
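Assuming the file is named file.txt as in the Windows example, the two parts of the assignment might look like this (the -k2,2 key for the second part is an assumption based on the sample line):
sort -t: -k3 -nr file.txt    # part 1: sort by points, decreasing
sort -t: -k2,2 file.txt      # part 2: sort by the center (animal) column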
The Linux equivalent of your Windows command, sort /r /+39 file, is:
sort -r -k +39 file

egrep not writing to a file

I am using the following command in order to extract domain names & the full domain extension from a file. Ex: www.abc.yahoo.com, www.efg.yahoo.com.us.
[a-z0-9\-]+\.com(\.[a-z]{2})?' source.txt | sort | uniq | sed -e 's/www.//' > dest.txt
The command writes to the file correctly when I specify a small maximum match count, -m 100, after source.txt. The problem occurs if I don't specify it, or if I specify a huge number. However, I have previously written to files with grep (not egrep) using similarly huge numbers, and that was successful. I also checked the last modified date and time while the command was executing, and it seems no modification is happening to the destination file. What could be the problem?
As I mentioned in your earlier question, it's probably not an issue with egrep, but that your file is too big and that sort won't output anything (to uniq) until egrep is done. I suggested that you split the file into manageable chunks using the split command. Something like this:
split -l 10000000 source.txt split_source.
This will split the source.txt file into 10-million-line chunks called split_source.aa, split_source.ab, split_source.ac, etc. You can then run the entire command on each one of those files (perhaps changing the redirection at the end to append: >> dest.txt).
The problem here is that you can get duplicates across multiple files, so at the end you may need to run
sort dest.txt | uniq > dest_uniq.txt
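A sketch of that per-chunk loop, where <pattern> and the -o flag stand in for the original (truncated) egrep invocation; the final sort | uniq above then removes duplicates that span chunks:
for f in split_source.*; do
  egrep -o '<pattern>' "$f" | sort | uniq | sed -e 's/www.//' >> dest.txt   # same pipeline as before, appending per chunk
done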
Your question is missing information.
That aside, a few thoughts. First, to debug and isolate your problem:
Run egrep <params> | less so you can see what egrep is doing, and eliminate any problem from sort, uniq, or sed (my bet is on sort).
How big is your input? Any chance sort is dying from too much input?
Gonna need to see the full command to make further comments.
Second, to improve your script:
You may want to sort | uniq AFTER sed, otherwise you could end up with duplicates in your result set, AND an unsorted result set. Maybe that's what you want.
Consider wrapping your regular expression with ^...$ if it's appropriate to establish beginning-of-line (^) and end-of-line ($) anchors. Otherwise you'll be matching portions in the middle of a line.
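A sketch of the reordering suggested above, again with <pattern> and -o standing in for the truncated original; deduplication now happens after sed, so stripping www. cannot reintroduce duplicates:
egrep -o '<pattern>' source.txt | sed -e 's/www.//' | sort | uniq > dest.txt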

sort across multiple files in linux

I have multiple (many) files; each very large:
file0.txt
file1.txt
file2.txt
I do not want to join them into a single file because the resulting file would be 10+ gigabytes. Each line in each file contains a 40-byte string. The strings are fairly well ordered right now (about 1 in 10 steps is a decrease in value instead of an increase).
I would like the lines ordered. (in-place if possible?) This means some of the lines from the end of file0.txt will be moved to the beginning of file1.txt and vice versa.
I am working on Linux and fairly new to it. I know about the sort command for a single file, but am wondering if there is a way to sort across multiple files. Or maybe there is a way to make a pseudo-file out of the smaller files that Linux will treat as a single file.
What I know I can do:
I can sort each file individually, then read into file1.txt to find the values larger than the largest in file0.txt (and similarly grab the lines from the end of file0.txt), join and sort again. But this is a pain, and it assumes no values from file2.txt belong in file0.txt (an assumption that is highly unlikely to hold in my case).
Edit
To be clear, if the files look like this:
f0.txt
DDD
XXX
AAA
f1.txt
BBB
FFF
CCC
f2.txt
EEE
YYY
ZZZ
I want this:
f0.txt
AAA
BBB
CCC
f1.txt
DDD
EEE
FFF
f2.txt
XXX
YYY
ZZZ
I don't know about a command doing in-place sorting, but I think a faster "merge sort" is possible:
for file in *.txt; do
  sort -o "$file" "$file"
done
sort -m *.txt | split -d -l 1000000 - output
The sort in the for loop makes sure the content of the input files is sorted. If you don't want to overwrite the original, simply change the value after the -o parameter. (If you expect the files to be sorted already, you could change the sort statement to "check-only": sort -c $file || exit 1)
The second sort does efficient merging of the input files, all while keeping the output sorted.
This is piped to the split command which will then write to suffixed output files. Notice the - character; this tells split to read from standard input (i.e. the pipe) instead of a file.
Also, here's a short summary of how the merge sort works:
sort reads a line from each file.
It orders these lines and selects the one which should come first. This line gets sent to the output, and a new line is read from the file which contained this line.
Repeat step 2 until there are no more lines in any file.
At this point, the output should be a perfectly sorted file.
Profit!
It isn't exactly what you asked for, but the sort(1) utility can help, a little, using the --merge option. Sort each file individually, then sort the resulting pile of files:
for f in file*.txt ; do sort -o $f < $f ; done
sort --merge file*.txt | split -l 100000 - sorted_file
(That's 100,000 lines per output file. Perhaps that's still way too small.)
I believe that this is your best bet, using stock Linux utilities:
sort each file individually, e.g. for f in file*.txt; do sort "$f" > "sorted_$f"; done
sort everything using sort -m sorted_file*.txt | split -d -l <lines> - <prefix>, where <lines> is the number of lines per file, and <prefix> is the filename prefix. (The -d tells split to use numeric suffixes).
The -m option to sort lets it know the input files are already sorted, so it can be smart.
mmap() the 3 files; since all lines are 40 bytes long, you can easily sort them in place (SIP :-). Don't forget the msync at the end.
If the files are sorted individually, then you can use sort -m file*.txt to merge them together - read the first line of each file, output the smallest one, and repeat.
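As a concrete illustration on the three sample files above (split -l 3 and the part_ prefix are arbitrary choices matching the 3-line samples):
for f in f0.txt f1.txt f2.txt; do
  sort -o "$f" "$f"                                  # sort each file in place
done
sort -m f0.txt f1.txt f2.txt | split -l 3 - part_    # merge the sorted files and re-split into 3-line pieces
With the sample data, part_aa should then contain AAA/BBB/CCC, part_ab DDD/EEE/FFF, and part_ac XXX/YYY/ZZZ, which matches the desired result apart from the file names.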
