text file contains lines of bizarre characters - want to fix - linux

I'm an inexperienced programmer grappling with a new problem in a large text file which contains data I am trying to process. Here's a screen capture of what I'm looking at (using 'less' - I am on a linux server):
https://drive.google.com/file/d/0B4VAqfRxlxGpaW53THBNeGh5N2c/view?usp=sharing
Bioinformaticians will recognize this file as a "fastq" file containing DNA sequence data. The top half of the screenshot contains data in its expected format (which I admit contains some "bizarre" characters, but that is not the issue). However, the bottom half (with many characters shaded in white) is completely messed up. If I were to scroll down the file, it eventually returns to normal text after about 500 lines. I want to fix it because it is breaking downstream operations I am trying to perform (which complain about precisely this position in the file).
Is there a way to grep for and remove the shaded lines? Or can I fix this problem by somehow changing the encoding on the offending lines?
Thanks

If you are lucky, you can use
strings file > file2
If that doesn't work, try it another way.
Determine the line length of the correct lines (I think the first two lines are different).
head -1 file | wc -c
head -2 file | tail -1 | wc -c
Note that wc also counts the line ending, so subtract 1 from both lengths.
Then read the file one line at a time. Use a case statement so you do not have to write a chain of else-if constructions comparing the length to the expected lengths. In the code below I accept the lengths 20, 100 and 330.
Redirect everything to another file outside the loop (redirecting with > inside the loop would truncate the output file on every line).
while IFS= read -r line; do
  case ${#line} in
    20|100|330) printf '%s\n' "$line" ;;
  esac
done < file > file2
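If the file is large, the same length filter can be done entirely in awk, which is usually much faster than a shell read loop. A minimal sketch, assuming the same accepted lengths of 20, 100 and 330:
awk 'length($0)==20 || length($0)==100 || length($0)==330' file > file2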
A totally different approach would be filtering out the wrong lines with sed, awk or grep, but that requires knowing which characters you will and won't accept.
If you are lucky, all of the ugly lines will have a character in common, such as '<' or perhaps '#'. In that case you can use egrep:
egrep -v "<|#" file > file2

BASED ON INSPECTION OF THE SCREENSHOT
sed -r 's/<[[:alnum:]]{2}>//g;s/\^.//g;s/ESC\^*C*//g' file
To make the changes in the file itself and keep a backup copy with a .bak extension, do
sed -r -i.bak 's/<[[:alnum:]]{2}>//g;s/\^.//g;s/ESC\^*C*//g' file
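If the goal is simply to strip every non-printable byte rather than to delete or edit whole lines, another possible approach is tr. This is only a sketch and assumes the good data is plain ASCII (which is true for fastq); it keeps printable characters, tabs and newlines and deletes everything else:
tr -cd '[:print:]\t\n' < file > file2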

Related

AWK very slow when splitting large file based on column value and using close command

I need to split a large log file into smaller ones based on the id found in the first column. This solution worked wonders and was very fast for months:
awk -v dir="$nome" -F\; '{print>dir"/"dir"_"$1".log"}' ${nome}.all;
Where $nome is a file and directory name.
It was very fast and worked until the log file reached several million lines (a +2GB text file); then it started to show
"Too many open files"
The solution is indeed very simple, adding the close command:
awk -v dir="$nome" -F\; '{print>dir"/"dir"_"$1".log"; close(dir"/"dir"_"$1".log")}' ${nome}.all;
The problem is that now it's VERY slow; it's taking forever to do something that used to be done in seconds, and I need to optimize this.
AWK is not mandatory, I can use an alternative, I just don't know how.
Untested since you didn't provide any sample input/output to test with but this should do it:
sort -t';' -k1,1 "${nome}.all" |
awk -v dir="$nome" -F\; '$1!=prev{close(out); out=dir"/"dir"_"$1".log"; prev=$1} {print > out}'
Your first script:
awk -v dir="$nome" -F\; '{print>dir"/"dir"_"$1".log"}' ${nome}.all;
had 3 problems:
1. It wasn't closing the output files as it went and so exceeded the open-file limit you saw.
2. It had an unparenthesized expression on the right side of the output redirection, which is undefined behavior per POSIX.
3. It wasn't quoting the shell variable ${nome} in the file name.
It's worth mentioning that gawk would be able to handle 1 and 2 without failing but it would slow down as the number of open files grew and it was having to manage the opens/closes internally.
Your second script:
awk -v dir="$nome" -F\; '{print>dir"/"dir"_"$1".log"; close(dir"/"dir"_"$1".log")}' ${nome}.all;
though now closing the output file name, still had problems 2 and 3 and then added 2 new problems:
4. It was opening and closing the output files once per input line instead of only when the output file name had to change.
5. It was overwriting the output file for each $1 for every line written to it instead of appending to it.
The above assumes you have multiple lines of input for each $1 and so each output file will have multiple lines. Otherwise the slowdown you saw when closing the output files wouldn't have happened.
The above sort could rearrange the order of the input lines for each $1. If that's a problem, add -s for "stable sort" if you have GNU sort, or let us know, as it's easy to work around with POSIX tools.
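For reference, here's a sketch of that stable-sort variant with the shell variable quoted as well (which also takes care of problem 3); note that -s is a GNU sort extension:
sort -s -t';' -k1,1 "${nome}.all" |
awk -v dir="$nome" -F';' '$1!=prev{close(out); out=dir"/"dir"_"$1".log"; prev=$1} {print > out}'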

Can you use 'less' or 'more' to output one page worth of text?

So in Linux, less is used to read files page by page for better readability. I was wondering if you can do something like less file.txt > output.txt to get one page worth of file.txt and write it to output.txt.
Apparently this does not work; output.txt is exactly the same as the original file. I'm wondering why this is the case, and if there are other workarounds. Thank you!
You can use the split command.
split -l 100 -d -a 3 input output
This will split the input file every 100 lines (-l 100), use numeric suffixes (-d) and use 3 digits for the suffix (-a 3) in the output file names, producing something like this: output000, output001, output002.
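You can verify the chunk sizes afterwards, for example with:
wc -l output*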
You can use head to get a specific number of lines, and tput lines to see how many lines there are on your current terminal.
Here's a script that fetches a pageful, or the standard 25 lines if no terminal is available:
#!/bin/bash
lines=$(tput lines) || lines=25
head -n "$lines" file.txt > output.txt
We use head and tail to get the first or last n lines of a file:
tail -n 20 /var/log/messages
head -n 10 src/main.h
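If you want to break the whole file into page-sized chunks rather than only capture the first page, here's a sketch combining the two ideas above (file and output names as in the split example; -d needs GNU split):
lines=$(tput lines) || lines=25
split -l "$lines" -d -a 3 file.txt output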

Split large gz files while preserving rows

I have a larger .gz file (2.1G) that I am trying to load into R, but it is large enough that I have to split it into pieces and load each individually before recombining them. However, I am having difficulty in splitting the file in a way that preserves the structure of the data. The file itself, with the exception of the first two rows, is a 56318 x 9592 matrix with non-homogenous entries.
I'm using Ubuntu 16.04. First, I tried using the split command from terminal as suggested by this link (https://askubuntu.com/questions/54579/how-to-split-larger-files-into-smaller-parts?rq=1)
$ split --lines=10000 "originalFile.gct.gz" "originalFile.gct.gz.part-"
Doing this, though, creates far more files than I would have expected (since my matrix has 57000 rows, I was hoping to output 6 files, each 10000 rows in size). When reading one of these into R and investigating the dimensions, I see that each is a matrix of 62x9592, indicating that the columns have all been preserved, but I'm getting significantly fewer rows than I would have hoped. Further, when reading it in, I get an error specifying an unexpected end of file. My thought is that it's not reading in how I want it to.
I found two possible alternatives here - https://superuser.com/questions/381394/unix-split-a-huge-gz-file-by-line
In particular, I've tried piping different arguments using gunzip, and then passing the output through to split (on the assumption that perhaps the file being compressed is what led to the inconsistent line endings). I tried
$ zcat originalFile.gct.gz | split -l 10000 "originalFile.gct.gz" "originalFile.gct.gz.part-"
but, doing this, I ended up with the exact same splits that I had previously. I have the same problem replacing "zcat" with "gunzip -c", which should have sent the uncompressed output to the split command.
Another answer on that link suggested piping to head or tail with something like zcat, for example
$ zcat originalFile.gct.gz | head -n 10000 >> "originalFile.gct.gz.1"
With zcat, this works perfectly, and it's exactly what I want. The dimension for this ends up being 10000x9592, so this is the ideal solution. One thing that I'll note is that this output is an ASCII text file rather than a compressed file, and I'm perfectly OK with that.
However, I want to be able to do this until the end of the file, making an additional output file for every 10000 rows. For this particular case, it's not a huge deal to just make the six, but I have tens of files like this, some of which are >10gb. My question, then, is: how can I use the split command so that it takes the first 10000 lines of the unzipped file and then outputs them, automatically updating the suffix with each new file? Basically, I want the output that I got from using "head", but with "split" so that I can do it over the entire file.
Here is the solution that ended up working for me
$ zcat originalFile.gct.gz | split -l 10000 - "originalFile.gct.gz-"
As Guido mentioned in the comment, my original command
$ zcat originalFile.gct.gz | split -l 10000 "originalFile.gct.gz" "originalFile.gct.gz.part-"
was discarding the output of zcat, and split was once again reading the compressed data directly. By giving split "-" as its input argument, I was able to pass the standard output of zcat into split, and now the piping works as I was expecting it to.
When you want more control over the splitting, you can use awk.
You mentioned that the first two rows were special.
Try something like
zcat originalFile.gct.gz |
awk 'BEGIN {j=1} NR<3 {next} {i++} {print > ("originalFile.gct.part" j)} i%10000==0 {j++}'
When you want your output files compressed, modify the awk command: let it print the names of the completed files and use xargs to gzip them, as sketched below.
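Here is a sketch of that idea, assuming 10000-line chunks and the same part-file naming as above: the data lines go to the part files, the names of completed parts go to awk's stdout, and xargs gzips them (-r is a GNU xargs extension that skips gzip when there is nothing to compress):
zcat originalFile.gct.gz |
awk 'BEGIN {j=1; f="originalFile.gct.part" j}
     NR<3 {next}
     {i++; print > f}
     i%10000==0 {close(f); print f; j++; f="originalFile.gct.part" j}
     END {if (i%10000) {close(f); print f}}' |
xargs -r gzip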
If splitting based on the content of the file works for you, try:
zcat originalFile.gct.gz | awk -F',' '{print $0 | "gzip > /tmp/file_" $1 ".gct.gz"}'
An example line of my file was:
2014,daniel,2,1,2,3
So I was splitting the file by year (the first column) using the variable $1, getting an output of:
/tmp/file_2014.gct.gz
/tmp/file_2015.gct.gz
/tmp/file_2016.gct.gz
/tmp/file_2017.gct.gz
/tmp/file_2018.gct.gz

How to do something like grep -B to select only one line?

Everything is in the title. Basically, let's say I have this text:
some text lalala
another line
much funny wow grep
I grep for "funny" and I want my output to be "lalala".
Thank you
One possible answer is to use either ed or ex to do this (it is trivial in them):
ed - yourfile <<< 'g/funny/.-2p'
(Or replace ed with ex. You might have red, the restricted editor, too; it can't modify files.) This looks for the pattern /funny/ globally, and whenever it is found, prints the line two lines before the matching line (that's the .-2p part). Or, if you want the most recent line containing 'lalala' before the line matching 'funny':
ed - yourfile <<< 'g/funny/?lalala?p'
The only problem is if you're trying to process standard input rather than a file; then you have to save the standard input to a file and process that file, which spoils the concurrency.
You can't do negative offsets in sed. GNU sed allows positive offsets, so you could use sed -n '/lalala/,+2p' file to print from the 'lalala' line through the 'funny' line (which isn't quite what you want) based on finding 'lalala', but you cannot find the 'lalala' line based on finding 'funny'. Standard sed does not allow offsets at all.
If you need to print just the IP address found on a line 8 lines before the pattern-matching line, you need a slightly more involved ed script, but it is still doable:
ed - yourfile <<< 'g/funny/.-8s/.* //p'
This uses the same basic mechanism to find the right line, then runs a substitute command to remove everything up to the last space on the line and print the modified version. Since there isn't a w command, it doesn't actually modify the file.
Since grep -B prints the whole block of lines before the match (plus the matching line itself), you'll have to pipe the output into something like grep or Awk to keep just the line you want.
grep -B 2 "funny" file|awk 'NR==1{print $NF; exit}'
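If you want the whole line two lines above the match rather than only its last field, a simpler variant of the same idea is (this assumes a single match; with several matches grep inserts "--" group separators that you would have to filter out):
grep -B 2 "funny" file | head -n 1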
You could also just use Awk.
awk -v s="funny" '/[[:space:]]lalala$/{n=NR+2; o=$NF}NR==n && $0~s{print o}' file
For the specific example of an IP address 8 lines before the match as mentioned in your comment:
awk -v s="funny" '
/[[:space:]][0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}$/ {
n=NR+8
ip=$NF
}
NR==n && $0~s {
print ip
}' file
These Awk solutions first find the output field you might want, then print the output only if the word you want exists in the nth following line.
Here's an attempt at a slightly generalized Awk solution. It maintains a circular queue of the last q lines and prints the line at the head of the queue when it sees a match.
#!/bin/sh
: ${q=8}
e=$1
shift
awk -v q="$q" -v e="$e" '{ m[(NR%q)+1] = $0 }
$0 ~ e { print m[((NR+1)%q)+1] }' "${@--}"
Adapting to a different default (I set it to 8) or adding proper option handling (currently, you'd run it like q=3 ./qgrep regex file), as well as printing only a particular field (such as an IP address) from the remembered line, should be easy enough.
(I also didn't bother to make it work correctly if you see a match in the first q-1 lines. It will just print an empty line then.)

Using grep to find difference between two big wordlists

I have one 78k-line .txt file with British words and a 5k-line .txt file with the most common British words. I want to filter the most common words out of the big list so that I end up with a new list containing only the less common words.
I managed to solve my problem another way, but I would really like to know what I am doing wrong, since the following does not work.
I have tried the following:
# To make sure they are trimmed
cut -d" " -f1 78kfile.txt | tac | tac > 78kfile.txt
cut -d" " -f1 5kfile.txt | tac | tac > 5kfile.txt
grep -xivf 5kfile.txt 78kfile.txt > cleansed
# But this procedure apparently gives me two empty files.
If I run just the grep without cut first, I get words that I know are in both files.
I have also tried this:
sort 78kfile.txt > 78kfile-sorted.txt
sort 5kfile.txt > 5kfile-sorted.txt
comm -3 78kfile-sorted.txt 5kfile-sorted.txt
//No luck either
The two text files in case anyone wants to try for them selves:
https://www.dropbox.com/s/dw3k8ragnvjcfgc/5k-most-common-sorted.txt
https://www.dropbox.com/s/1cvut5z2zp9qnmk/brit-a-z-sorted.txt
After downloading your files, I noticed that (a) brit-a-z-sorted.txt has Microsoft line endings while 5k-most-common-sorted.txt has Unix line endings and (b) you are trying to do whole-line compare (grep -x). So, first we need to convert to a common line ending:
dos2unix <brit-a-z-sorted.txt >brit-a-z-sorted-fixed.txt
Now, we can use grep to remove the common words:
grep -xivFf 5k-most-common-sorted.txt brit-a-z-sorted-fixed.txt >less-common.txt
I also added the -F flag to ensure that the words are interpreted as fixed strings rather than as regular expressions. This also speeds things up.
I note that there are several words in the 5k-most-common-sorted.txt file that are not in the brit-a-z-sorted.txt. For example, "British" is in the common file but not the larger file. Also the common file has "aluminum" while the larger file has only "aluminium".
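You can spot-check such discrepancies yourself with a case-insensitive, fixed-string, whole-line grep, for example:
grep -ixF "british" brit-a-z-sorted-fixed.txt
grep -ixF "aluminium" 5k-most-common-sorted.txt
Each command prints the word if it is present in that file and nothing otherwise.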
What do the grep options mean? For those who are curious:
-f means read the patterns from a file.
-F means treat them as fixed strings, not regular expressions.
-i means ignore case.
-x means do whole-line matches
-v means invert the match. In other words, print those lines that do not match any of the patterns.
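For completeness, once the line endings are normalized, comm can do the same job. A sketch: comm -23 keeps only the lines unique to the first file, both inputs must be sorted the same way, and unlike grep -i this comparison is case-sensitive:
dos2unix <brit-a-z-sorted.txt | sort >big-sorted.txt
sort 5k-most-common-sorted.txt >small-sorted.txt
comm -23 big-sorted.txt small-sorted.txt >less-common.txt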
