I have a tab separated file that looks like this. I need to prepare it so I can import it in R:
1 344544 rs30540
2 284783 rs34560
14 384643 rs30567
19 584643 rs31110
Genome_phase,common=1,19,genomes=hg19
11 222643 rs30543
44 544643 rs32345
Genome_phase,common=1,23,genomes=hg19
I want to keep only the rows that start with numbers and drop all others that begin with characters. It is a huge file of a few Gbs. Is there any way to do that in Linux?
You can use awk or gawk (depending on which one is installed on your Linux system).
gawk '/^[0-9]/' file > newfile
Since your aim is to read this into R, I suggest you use fread from library(data.table), as it is fast for large files. In that case you can use the ability of fread to accept a shell command that preprocesses the file as its input.
cmd = paste("gawk '/^[0-9]/'", shQuote(filename))
x = fread(cmd = cmd)
Based on the title of the question, I guess you also want to replace \t with :. Then try the following:
sed -n '/^[0-9]\+\t/s/\t/:/gp' inputfile.txt
/^[0-9]\+\t/ matches lines that start with one or more digits followed by a \t.
s/\t/:/ substitutes \t by :.
g means global (i.e., substitute all \t), and p means 'print the line'. (We need to explicitly force print since we put option -n, i.e., 'silent'.)
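Not every sed understands \t and \+, so as a hedged alternative the same filter-and-replace step can be sketched in awk. The function name digits_to_colons is made up for this illustration, and the sample input is taken from the question.

```shell
# Keep only lines that start with a digit and join their tab-separated fields with ':'.
digits_to_colons() {
    # $1 = $1 forces awk to rebuild the line using the output separator OFS.
    awk -F'\t' -v OFS=':' '/^[0-9]/ { $1 = $1; print }' "$@"
}

# Demo on three lines from the question; the Genome_phase line is dropped.
printf '1\t344544\trs30540\nGenome_phase,common=1,19,genomes=hg19\n11\t222643\trs30543\n' | digits_to_colons
```

With no file argument the function reads standard input; with a file name it reads that file, so digits_to_colons inputfile.txt > newfile should work for the large file as well.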
I would like to see the actual file contents without it being formatted to print. For example, to show:
\n0.032,170\n0.034,290
Instead of:
0.032,170
0.34,290
Is there a command to echo the file's actual data in bash? I've tried using head, cat, more, etc. but all those seem to echo the "print-formatted" text. For example:
$ cat example.csv
0.032,170
0.34,290
How can I print the actual characters within the file?
This reads as if you misunderstand what the "actual characters in the file" are. You will not find the characters \ and n in that file, only a line feed, which is a single specific character. So utilities like cat do actually output exactly the characters in the file.
Putting it the other way around: if you really had those two characters literally in the file, then a utility like cat would actually output them. I just checked that, just to be sure.
You can easily check that yourself by opening the file in a hex editor. There you will see the byte 0A (decimal 10), which is the line feed character. You will not see the pair of characters \ and n anywhere in that file.
Many programming languages and also shell environments use escape sequences like \n in string definitions and identify those as control characters which would not be typable otherwise. So maybe that is where your impression comes from that your files should contain those two characters.
To display newlines as \n, you might try:
awk 1 ORS='\\n' input-file
This is not the "actual characters in the file", as \n is merely a conventional method of displaying a newline, but this does seem to be what you want.
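If no hex editor is at hand, od -c gives a similar view: it prints each byte and shows control characters as escape sequences. A small sketch, with show_bytes as a hypothetical helper name:

```shell
# Dump a file (or stdin) byte by byte; control characters appear as escapes such as \n.
show_bytes() {
    od -c "$@"
}

# The line feed shows up as \n in the dump, although the data holds a single byte (0A) there.
printf '0.032,170\n0.34,290\n' | show_bytes
```

Running show_bytes example.csv on the real file should show the same pattern: one \n entry per line ending, and no separate \ and n bytes.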
I need to generate filename from three parts, two strings, and one variable.
for f in `cat files.csv`; do echo fastq/$f\_1.fastq.gze; done
files.csv has the following lines:
Sample_11
Sample_12
I need to generate the following:
fastq/Sample_11_1.fastq.gze
fastq/Sample_12_1.fastq.gze
My problem is that I got the below files:
_1.fastq.gze_11
_1.fastq.gze_12
the string after the variable deletes the string before it.
I appreciate any help
Regards
By the way, your idiom for f in `cat files.csv` should be avoided. Refer to: Dangerous Backticks
while IFS= read -r f
do
echo "fastq/${f}_1.fastq.gze"
done < files.csv
You can make it a one-liner with xargs and printf.
xargs printf 'fastq/%s_1.fastq.gze\n' <files.csv
The function of printf is to apply the first argument (the format string) to each argument in turn.
xargs runs this command with as many arguments as fit on one command line, splitting the input across multiple invocations if it is too large to fit on a single command line (the limit depends on the ARG_MAX constant in your kernel).
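The format-reuse behaviour of printf can be seen without xargs at all. The sketch below first passes two names directly, then feeds the same names through xargs; names_via_xargs is a name invented for this example.

```shell
# printf reuses its format string until all arguments are consumed.
printf 'fastq/%s_1.fastq.gze\n' Sample_11 Sample_12

# The same thing with the names coming from stdin, as in the xargs one-liner.
names_via_xargs() {
    xargs printf 'fastq/%s_1.fastq.gze\n'
}
printf 'Sample_11\nSample_12\n' | names_via_xargs
```

Both commands should print the two desired file names, one per line.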
Your best bet, generally, is to wrap the variable name in braces. So, in this case:
echo fastq/${f}_1.fastq.gze
See this answer for some details about the general concept, as well.
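A minimal sketch of why the braces matter, using a hypothetical helper called make_name:

```shell
# Build the file name for one sample; the braces mark where the variable name ends.
make_name() {
    f=$1
    echo "fastq/${f}_1.fastq.gze"
}

# Without braces the shell would read $f_1 as a variable called f_1 (usually unset),
# because underscores and digits are valid characters in variable names.
make_name Sample_11    # prints fastq/Sample_11_1.fastq.gze
```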
Edit: An additional thought looking at the now-provided output makes me think that this isn't a coding problem at all, but rather a conflict between line-endings and the terminal/console program.
Specifically, if the CSV file ends its lines with just a carriage return (ASCII/Unicode 13), the end of Sample_11 might "rewind" the line to the start and overwrite.
In that case, based loosely on this article, I'd recommend replacing cat (if you understandably don't want to re-architect the actual script with something like while) with something that will strip the carriage returns, such as:
for f in $(tr -cd '\011\012\040-\176' < files.csv)
do
echo fastq/${f}_1.fastq.gze
done
As the cited article explains, Octal 11 is a tab, 12 a line feed, and 40-176 are typeable characters (Unicode will require more thinking). If there aren't any line feeds in the file, for some reason, you probably want to replace that with tr '\015' '\012', which will convert the carriage returns to line feeds.
Of course, at that point, better is to find whatever produces the file and ask them to put reasonable line-endings into their file...
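If the file turns out to have Windows-style CRLF endings (an assumption; the question does not show the raw bytes), a narrower fix is to delete only the carriage returns and keep the while loop. The function name names_from_list is hypothetical.

```shell
# Read sample names from a file with CRLF line endings and print the file names.
names_from_list() {
    tr -d '\r' < "$1" | while IFS= read -r f; do
        if [ -n "$f" ]; then
            printf 'fastq/%s_1.fastq.gze\n' "$f"
        fi
    done
}

# Demo with a temporary file that imitates a Windows-style files.csv.
tmp=$(mktemp)
printf 'Sample_11\r\nSample_12\r\n' > "$tmp"
names_from_list "$tmp"
rm -f "$tmp"
```

This would not help if the lines end in a bare carriage return with no line feed; in that case the tr '\015' '\012' conversion is the one to use.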
I have to run a third-party program in background and capture its output to file. I'm doing this simply using the_program > output.txt. However, the coders of said program decided to be flashy and show processed lines in real-time, using \b characters to erase the previous value. So, one of the lines in output.txt ends up like Lines: 1(b)2(b)3(b)4(b)5, (b) being an unprintable character with ASCII code 08. I want that line to end up as Lines: 5.
I'm aware that I can write it as-is and post-process the file using AWK, but I wonder if it's possible to somehow process the control characters in-place, by using some kind of shell option or by piping some commands together, so that line would become Lines: 5 without having to run any additional commands after the program is done?
Edit:
Just a clarification: what I wrote here is a simplified version. The actual line count processed by the program is in the hundreds of thousands, so that string ends up quite long.
Thanks for your comments! I ended up piping the output of that program to AWK Script I linked in the question. I get a well-formed file in the end.
the_program | ./awk_crush.sh > output.txt
The only downside is that I get the output only once the program itself has finished, even though the initial output exceeds 5M and should be passed on in smaller chunks. I don't know the exact reason; perhaps the AWK script waits for EOF on stdin. Either way, on a more modern system I would use
stdbuf -oL the_program | ./awk_crush.sh > output.txt
to process the output line by line. I'm stuck on RHEL4 with expired support though, so I can use neither stdbuf nor unbuffer. I'll leave it as is; it's fine too.
The contents of awk_crush.sh are based on this answer, except with ^H sequences (which are supposed to be ASCII 08 characters entered via VIM commands) replaced with escape sequence \b:
#!/usr/bin/awk -f
# Repeatedly delete each "character + backspace" pair until none remain.
function crushify(data) {
    while (data ~ /[^\b]\b/) {
        gsub(/[^\b]\b/, "", data)
    }
    print data
}
{ crushify($0) }
Basically, it replaces character before \b and \b itself with empty string, and repeats it while there are \b in the string - just what I needed. It doesn't care for other escape sequences though, but if it's necessary, there's a more complete SED solution by Thomas Dickey.
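The same loop can be sketched with plain sed, in case the awk at hand does not treat \b as a backspace inside a regular expression (an assumption about older awk variants, not something tested on RHEL4). The function name crush_backspaces is made up for this example.

```shell
# Remove every "character + backspace" pair, looping until no pair is left.
crush_backspaces() {
    bs=$(printf '\b')                      # a literal backspace (ASCII 08)
    sed -e ':a' -e "s/[^$bs]$bs//" -e 'ta'
}

printf 'Lines: 1\b2\b3\b4\b5\n' | crush_backspaces    # prints Lines: 5
```

It would be used the same way as the AWK script: the_program | crush_backspaces > output.txt.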
Pipe it to col -b, from util-linux:
the_program | col -b
Or, if the input is a file, not a program:
col -b < input > output
Mentioned in Unix & Linux: Evaluate large file with ^H and ^M characters.
I have thousands of files in a directory, and each file contains a number of variable definitions starting with the keyword DEFINE and ending with a semicolon (;). I want to copy every occurrence of the text between these two markers (inclusive) into a target file.
Example: Below is the content of the text file:
/* This code is for lookup */
DEFINE variable as a1 expr= extract (n123f1 using brach, code);
END.
Now, from the above content I just want to copy the section starting with DEFINE and ending with ; into a target file, i.e. the output should be:
DEFINE variable as a1 expr= extract (n123f1 using brach, code);
This needs to be done for thousands of scripts and multiple occurrences. Please help out.
Thanks a lot. The provided code works, but only when the whole statement is on a single line. The data is not confined to one line, though; it is spread over multiple lines, like below:
/* This code is for lookup */
DEFINE variable as a1 expr= if branchno > 55
then
extract (n123f1 using brach, code)
else
branchno = null
;
END.
The code follows the above pattern. I need to capture all the data between DEFINE and the semicolon (;); every DEFINE is followed by an ending semicolon.
It sounds like you want grep(1):
grep '^DEFINE.*;$' input > output
Try using grep. Let's say you have files with extension .txt in present directory,
grep -ho 'DEFINE.*;' *.txt > outfile
Output:
DEFINE variable as a1 expr= extract (n123f1 using brach, code);
Short Description
-o prints only the matching string rather than the whole line, which helps if the line also contains other text that you want to omit.
-h will suppress file names before matching result
Read man page of grep by typing man grep on your terminal
EDIT
If you want capability to search in multiple lines, you can use pcregrep with -M option
pcregrep -M 'DEFINE.*?(\n|.)*?;' *.txt > outfile
Works fine on my system. Check man pcregrep for more details
Reference : SO Question
One can make a simple solution using sed with version :
sed -n -e '/^DEFINE/{:a p;/;$/!{n;ba}}' your-file
Option -n prevents sed from printing every line; then each time a line begins with DEFINE, print the line (command p) then enter a loop: until you find a line ending with ;, grab the next line and loop to the print command. When exiting the loop, you do nothing.
It looks a bit dirty; it seems that the version sed15 has a shorter (and more straightforward) way to achieve this in one line:
sed -n -e '/^DEFINE/,/;$/p' your-file
Indeed, only for this version of sed, both patterns are treated; for other versions of sed like mine under cygwin, the range patterns must be on separate lines to work properly.
One last thing to remember: it does not handle nested ranges, i.e. it stops printing after the first end pattern it encounters, even if several start patterns have been matched. Prefer something with awk if this is a feature you are looking for.
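For comparison, a minimal awk sketch of the same DEFINE-to-semicolon extraction. It assumes that DEFINE starts a line and that the closing ; is the last non-blank character on its line; extract_defines is a name invented here, and the sketch does not attempt nested ranges either.

```shell
# Print every block that starts at a line beginning with DEFINE
# and ends at the first following line that ends in a semicolon.
extract_defines() {
    awk '
        /^DEFINE/               { inside = 1 }   # start of a block
        inside                  { print }        # print while inside a block
        inside && /;[ \t]*$/    { inside = 0 }   # end of the block
    ' "$@"
}

# Demo with a multi-line definition shaped like the one in the question.
printf '/* lookup */\nDEFINE variable as a1 expr= if branchno > 55\nthen\n;\nEND.\n' | extract_defines
```

Since awk accepts several file arguments, something like extract_defines *.txt > outfile should cover a whole directory, provided the file list fits on one command line.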
I have a file with tons of call logs and I am trying to clean it up using bash. I figured out how to search for a string and delete the entire line it is on but that isn't what I want to accomplish.
I want to search for a string as an example:
There are tons of MAC addresses in the file, such as MAC:00-0A-DD-84-01-33, and I want to remove them all.
There is also a call ID at the beginning of each line that looks like: 354469805 or 354469894 and I want to remove all of those as well.
I'm just starting in bash so please excuse my ignorance. I am entering 2 lines of the call log below for clarification. I want to delete the 3544 number, the MAC address, and the word Telepacific.
354469725 06/24/2013 09:34 00:03:26 Chante Squires 105 TelePacific MAC:00-0A-DD-84-01-1D TelePacific 17025290701 1
354469732 06/24/2013 09:59 00:01:16 Chante Squires 105 TelePacific MAC:00-0A-DD-84-01-1D TelePacific 12132238375 1
You could use sed:
sed -i 's/^[0-9]\{9\}\|MAC:[0-9A-Fa-f]\{2\}\([-\:][0-9A-Fa-f]\{2\}\)\{5\}//g' input.log
Between the 's/ and //g' is a regular expression that matches the removal criteria in your question. The s command in front means "search and replace" (substitute). The empty replacement between the last two slashes means each match is replaced with nothing. The g flag at the end means "replace all matches" if they occur more than once in a line. Finally, the -i switch means "edit the file in place".
This solution assumes that your call IDs are all 9 digits and that the MAC address has six groups of two hexadecimal digits separated by dashes or colons.
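The question also asks to drop the word TelePacific. A hedged variant using extended regular expressions (sed -E, available in GNU and BSD sed) could look like the sketch below. It writes to standard output instead of editing in place, assumes the word is always spelled exactly TelePacific, and uses the made-up function name strip_call_log.

```shell
# Remove the leading 9-digit call ID, any MAC address, and the word TelePacific.
strip_call_log() {
    sed -E \
        -e 's/^[0-9]{9}[[:space:]]*//' \
        -e 's/MAC:[0-9A-Fa-f]{2}([-:][0-9A-Fa-f]{2}){5}//g' \
        -e 's/TelePacific//g' "$@"
}

# Demo on the first log line from the question.
printf '%s\n' '354469725 06/24/2013 09:34 00:03:26 Chante Squires 105 TelePacific MAC:00-0A-DD-84-01-1D TelePacific 17025290701 1' | strip_call_log
```

The removed fields leave their surrounding whitespace behind, so the output has a run of blanks where they used to be.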
One way with awk (you will lose the extra tabs and spaces; every field will be separated by a single space):
awk '{for(i=2;i<NF;i++) if(8>i || i>10) printf "%s ", $i; print $NF}' log