Linux / Unix — pipe the output to join

I am very new to Linux / Unix, and from time to time I do exercises to learn.
I was working through my exercises until I got stuck on one part:
Plain sort quotes.t5 and pipe the output to join.
In join use field separator, read from stdin and from quotes.comms, output to quotes.t6
The problem is, I don't understand what this part is asking.
A few days ago I ran this command on the server:
wget 'http://finance.yahoo.com/d/quotes.csv?s=BARC.L+0992.HK+RHT+AAPL+ADI+AEIS+AGNC+AMAT+AMGN+AMRN+ARCC+ARIA+ARNA+ATVI+BBRY+BIDU+BRCD+BRCM+BSFT+CENX+CERE+CMCSA+COCO+CSCO+CSIQ+CSOD+CTRP+CTSH+CYTX+DRYS+DTV+DXM+EA+EBAY+EGLE+ENDP+ESRX+EXPD+EXTR+FANG+FAST+FB+FCEL+FITB+FLEX+FOXA+FSLR+FTR+GALE+GERN+GILD+GMCR+GRPN+GTAT+HBAN+HDS+HIMX+HSOL+IMGN+INTC+JASO+JBLU+JDSU+KERX+LINE+LINTA+MDLZ+MNKD+MPEL+MSFT+MU+MXIM+MYL+NFLX+NIHD+NUAN+NVDA+ONNN+ORIG+OTEX+OXBT+PENN+PMCS+PSEC+QCOM+RBCN+REGN+RFMD+RSOL+SCTY+SINA+SIRI+SNDK+SPWR+SYMC+TSLA+TUES+TWGP+TXN+VOLC+WEN+YHOO+ZNGA&f=nab' -O quotes.csv
But the produced quotes.csv file was not good enough to get insight into the finances, so I need some help from you!
Checkpointing. When you finish this lesson you must get this:
$ sha256sum -c quotesshasums
quotes.t1: OK
quotes.t2: OK
quotes.t3: OK
quotes.t4: OK
quotes.t5: OK
quotes.t6: OK
We have a source file, quotes.csv, with stock price data.
Lines are terminated with CRLF, which is not Unix style. Make them LF-terminated.
That means removing the CR (\r) byte from each line. To do this, use the sed (man sed) substitute
command; output to quotes.t1.
More info at http://en.wikipedia.org/wiki/Newline
Run checkpoint to test if quotes.t1 is OK.
Use the head and tail commands to output all except the first and last line of file
quotes.t1 to quotes.t2.
Make fields separated with pipe (vertical bar |) instead of comma.
sed -re 's/,([0-9.]+),([0-9.]+)/|\1|\2/g' quotes.t2 > quotes.t3
Numeric sort by the third field (key); don't forget the new separator. Output to quotes.t4
Output the last five lines, and cut them, leaving the first and third fields in the result. Output to quotes.t5
Plain sort quotes.t5 and pipe the output to join.
In join use field separator, read from stdin and from quotes.comms, output to quotes.t6
If needed, I can post all parts of this exercise, but I think you may know what I need to do at this part.
Mainly, what I need to know is what that join means. I searched Google about this, but I still don't get it.

Transferring an abbreviated version of the comments into an answer.
The original version of the question was asking about:
Plain sort quotes.t5 and pipe the output to join.
In join use field separator, read from stdin and from quotes.comms, output to quotes.t6
You need to know that join is a command. It can read from standard input if you specify - as one of its two input file names.
The steps are then, it seems to me, quite straightforward:
sort quotes.t5 | join -t'|' - quotes.comms > quotes.t6
or perhaps:
sort quotes.t5 | join -t'|' quotes.comms - > quotes.t6
I'm not sure how you tell which is required, except by interpreting 'read from stdin and from quotes.comms' as meaning standard input first and quotes.comms second.
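For intuition, here is a tiny, made-up example of what join does with | as the field separator (the file names and contents are invented for illustration; join matches lines on the first field by default, and both inputs must be sorted on that field):

$ cat left.t
AAPL|519.68
MSFT|37.84
$ cat right.t
AAPL|Apple Inc.
MSFT|Microsoft Corp.
$ join -t'|' left.t right.t
AAPL|519.68|Apple Inc.
MSFT|37.84|Microsoft Corp.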

Related

How to execute multiple commands within the same input

Cyber newbie.
This has me completely stumped. I need to search a file called ‘countries’ for all countries containing the letter ‘y’. Following this, sort the output of this command in reverse order and write the output to a file called ‘output’.
How do I sort by a particular character?
Thanks
grep y countries | sort -r > output
should do it.
The pipe character | sends the output of the command on its left, grep, as input to the command on its right, sort.
The output redirection character > sends the result to the file 'output'.
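For example, with a hypothetical countries file (contents invented for illustration):

$ cat countries
France
Germany
Libya
Spain
$ grep y countries | sort -r > output
$ cat output
Libya
Germany

grep keeps only the lines containing y (Germany and Libya), and sort -r writes them to output in reverse alphabetical order.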

concatenate two strings and one variable using bash

I need to generate a filename from three parts: two strings and one variable.
for f in `cat files.csv`; do echo fastq/$f\_1.fastq.gze; done
files.csv has the following lines:
Sample_11
Sample_12
I need to generate the following:
fastq/Sample_11_1.fastq.gze
fastq/Sample_12_1.fastq.gze
My problem is that I got the files below:
_1.fastq.gze_11
_1.fastq.gze_12
The string after the variable deletes the string before it.
I appreciate any help
Regards
By the way, your idiom for f in `cat files.csv` should be avoided. Refer: Dangerous Backticks
while read f
do
echo "fastq/${f}/_1.fastq.gze"
done < files.csv
You can make it a one-liner with xargs and printf.
xargs printf 'fastq/%s_1.fastq.gze\n' <files.csv
The function of printf is to apply the first argument (the format string) to each argument in turn.
xargs says to run this command on as many files as it can fit onto the command line (splitting it up into multiple invocations if the input file is too large to fit all the arguments onto a single command line, subject to the ARG_MAX constant in your kernel).
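With the two sample lines from files.csv above, and assuming the file has plain LF line endings, this produces exactly the desired names:

$ xargs printf 'fastq/%s_1.fastq.gze\n' <files.csv
fastq/Sample_11_1.fastq.gze
fastq/Sample_12_1.fastq.gze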
Your best bet, generally, is to wrap the variable name in braces. So, in this case:
echo fastq/${f}_1.fastq.gze
See this answer for some details about the general concept, as well.
Edit: An additional thought looking at the now-provided output makes me think that this isn't a coding problem at all, but rather a conflict between line-endings and the terminal/console program.
Specifically, if the CSV file ends its lines with a carriage return (ASCII/Unicode 13), the carriage return at the end of Sample_11 might "rewind" the cursor to the start of the line, so the rest of the text overwrites what was already printed.
In that case, based loosely on this article, I'd recommend replacing cat (if you understandably don't want to re-architect the actual script with something like while) with something that will strip the carriage returns, such as:
for f in $(tr -cd '\011\012\040-\176' < files.csv)
do
echo fastq/${f}_1.fastq.gze
done
As the cited article explains, Octal 11 is a tab, 12 a line feed, and 40-176 are typeable characters (Unicode will require more thinking). If there aren't any line feeds in the file, for some reason, you probably want to replace that with tr '\015' '\012', which will convert the carriage returns to line feeds.
Of course, at that point, it is better to find whatever produces the file and ask them to put reasonable line endings into it...
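If you'd rather normalize the file once up front, here is a minimal sketch (assuming carriage returns are the only problem; files_unix.csv is an invented intermediate name):

tr -d '\r' < files.csv > files_unix.csv   # strip all carriage returns
while read -r f
do
    echo "fastq/${f}_1.fastq.gze"
done < files_unix.csv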

Bash - process backspace control character when redirecting output to file

I have to run a third-party program in background and capture its output to file. I'm doing this simply using the_program > output.txt. However, the coders of said program decided to be flashy and show processed lines in real-time, using \b characters to erase the previous value. So, one of the lines in output.txt ends up like Lines: 1(b)2(b)3(b)4(b)5, (b) being an unprintable character with ASCII code 08. I want that line to end up as Lines: 5.
I'm aware that I can write it as-is and post-process the file using AWK, but I wonder if it's possible to somehow process the control characters in-place, by using some kind of shell option or by piping some commands together, so that line would become Lines: 5 without having to run any additional commands after the program is done?
Edit:
Just a clarification: what I wrote here is a simplified version; the actual line count processed by the program is in the hundreds of thousands, so that string ends up quite long.
Thanks for your comments! I ended up piping the output of that program to the AWK script I linked in the question. I get a well-formed file in the end.
the_program | ./awk_crush.sh > output.txt
The only downside is that I get the output only once the program itself has finished, even though the initial output exceeds 5M and should be passed along in smaller chunks. I don't know the exact reason; perhaps the AWK script waits for EOF on stdin. Either way, on a more modern system I would use
stdbuf -oL the_program | ./awk_crush.sh > output.txt
to process the output line by line. I'm stuck on RHEL4 with expired support, though, so I'm unable to use either stdbuf or unbuffer. I'll leave it as-is; it's fine too.
The contents of awk_crush.sh are based on this answer, except with the ^H sequences (which are supposed to be ASCII 08 characters entered via Vim commands) replaced with the escape sequence \b:
#!/usr/bin/awk -f
# Repeatedly strip "any character followed by a backspace" until no
# backspaces remain, then print the cleaned-up line.
function crushify(data) {
    while (data ~ /[^\b]\b/) {
        gsub(/[^\b]\b/, "", data)
    }
    print data
}
{ crushify($0) }
Basically, it replaces the character before \b and the \b itself with an empty string, and repeats while there are \b characters in the string - just what I needed. It doesn't handle other escape sequences, though; if that's necessary, there's a more complete SED solution by Thomas Dickey.
Pipe it to col -b, from util-linux:
the_program | col -b
Or, if the input is a file, not a program:
col -b < input > output
Mentioned in Unix & Linux: Evaluate large file with ^H and ^M characters.
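You can see the effect by simulating the program's output with printf (the \b escapes stand in for the literal backspace characters):

$ printf 'Lines: 1\b2\b3\b4\b5\n' | col -b
Lines: 5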

egrep not writing to a file

I am using the following command in order to extract domain names & the full domain extension from a file. Ex: www.abc.yahoo.com, www.efg.yahoo.com.us.
[a-z0-9\-]+\.com(\.[a-z]{2})?' source.txt | sort | uniq | sed -e 's/www.//' > dest.txt
The command writes correctly when I specify a small maximum parameter, -m 100, after source.txt. The problem occurs if I don't specify it, or if I specify a huge number. However, I could previously write to files with grep (not egrep) using huge numbers similar to what I'm trying now, and that was successful. I also checked the last-modified date and time while the command was executing, and it seems there is no modification happening to the destination file. What could be the problem?
As I mentioned in your earlier question, it's probably not an issue with egrep, but that your file is too big and sort won't output anything (to uniq) until egrep is done. I suggested that you split the file into manageable chunks using the split command. Something like this:
split -l 10000000 source.txt split_source.
This will split the source.txt file into 10-million-line chunks called split_source.aa, split_source.ab, split_source.ac, etc. Then you can run the entire command on each of those files (perhaps changing the redirection to append at the end: >> dest.txt).
The problem here is that you can get duplicates across the multiple files, so at the end you may need to run:
sort dest.txt | uniq > dest_uniq.txt
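Putting that suggestion together, a rough sketch of the chunked run (the egrep pattern here is the abbreviated fragment from the question, so substitute your real command):

split -l 10000000 source.txt split_source.
for chunk in split_source.*
do
    egrep '[a-z0-9\-]+\.com(\.[a-z]{2})?' "$chunk" | sort | uniq | sed -e 's/www\.//' >> dest.txt
done
sort dest.txt | uniq > dest_uniq.txt    # de-duplicate across chunks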
Your question is missing information.
That aside, a few thoughts. First, to debug and isolate your problem:
Run the egrep <params> | less so you can see what egrep is doing, and eliminate any problem from sort, uniq, or sed (my bet's on sort).
How big is your input? Any chance sort is dying from too much input?
Gonna need to see the full command to make further comments.
Second, to improve your script:
You may want to sort | uniq AFTER sed, otherwise you could end up with duplicates in your result set, AND an unsorted result set. Maybe that's what you want.
Consider wrapping your regular expressions with "^...$", if it's appropriate to establish beginning of line (^) and end of line ($) anchors. Otherwise you'll be matching portions in the middle of a line.

Using sed to print range when pattern is inside the range?

I have a log file full of queries, and I only want to see the queries that have an error. The log entries look something like:
path to file executing query
QUERY
SIZE: ...
ROWS: ...
MSG: ...
DURATION: ...
I want to print all of this stuff, but only when MSG: contains something of interest (an error message). All I've got right now is sed -n '/^path to file/,/^DURATION/p' and I have no idea where to go from here.
Note: Queries are often multiline, so using grep's -B sadly doesn't work all the time (this is what I've been doing thus far, just being generous with the -B value)
Somehow I'd like to use only sed, but if I absolutely must use something else like awk I guess that's fine.
Thanks!
You haven't said what an error message looks like, so I'll assume it contains the word "ERROR":
sed -n '/^MSG.*ERROR/{H;g;N;p;};/^DURATION/{s/.*//;h;d;};H' < logname
(I wish there were a tidier way to purge the hold space. Anyone?...)
I can suggest a solution with grep. It will work if the structure in the log file is always the same as above (i.e., MSG is in the 5th line, and one line follows):
egrep -i '^MSG:.*error' -A 1 -B 4 logfile
That means: if the word error occurs in an MSG line, then output the block beginning 4 lines before the MSG line and ending one line after it.
Of course you have to adjust the regexp to recognize an error.
This will not work if the structure of those blocks differs.
Perhaps you can use the cgrep.sed script, as described in the Unix Power Tools book.
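If awk is acceptable after all, one more sketch: buffer each entry and print it only when it contains an error (this assumes every entry starts with the "path to file" line and that errors contain the word ERROR, as above):

awk '
/^path to file/ { block = "" }    # a new log entry begins: reset the buffer
{ block = block $0 ORS }          # accumulate every line of the current entry
/^DURATION/ {                     # entry is complete: print it if it had an error
    if (block ~ /MSG:.*ERROR/) printf "%s", block
}
' logfile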
