How to edit a 300 GB text file (genomics data)? - linux

I have a 300 GB text file that contains genomics data with over 250k records. There are some records with bad data, and our genomics program 'PoPoolation' allows us to comment out the "bad" records with an asterisk. Our problem is that we cannot find a text editor that will load the data so that we can comment out the bad records. Any suggestions? We have both Windows and Linux boxes.
UPDATE: More information
The program PoPoolation (https://code.google.com/p/popoolation/) crashes when it reaches a "bad" record, giving us the line number that we can then comment out. Specifically, we get a message from Perl that says "F#€%& Scaffolding". The manual suggests we can just use an asterisk to comment out the bad line. Sadly, we will have to repeat this process many times...
One more thought... Is there an approach that would allow us to add the asterisk to the line without opening the entire text file at once? This could be very useful given that we will have to repeat the process an unknown number of times.

Based on your update:
One more thought... Is there an approach that would allow us to add
the asterisk to the line without opening the entire text file at once.
This could be very useful given that we will have to repeat the
process an unknown number of times.
Here you have an approach: if you know the line number, you can add an asterisk at the beginning of that line with:
sed 'LINE_NUMBER s/^/*/' file
See an example:
$ cat file
aa
bb
cc
dd
ee
$ sed '3 s/^/*/' file
aa
bb
*cc
dd
ee
If you add -i, the file will be updated in place:
$ sed -i '3 s/^/*/' file
$ cat file
aa
bb
*cc
dd
ee
That said, I always think it's better to redirect the output to another file:
sed '3 s/^/*/' file > new_file
so that you keep your original file intact and save the updated version in new_file.

If you are required to have a person mark these records manually with a text editor, for whatever reason, you should probably use split to split the file up into manageable pieces.
split -a4 -d -l100000 hugefile.txt part.
This will split the file up into pieces with 100000 lines each. The names of the files will be part.0000, part.0001, etc. Then, after all the files have been edited, you can combine them back together with cat:
cat part.* > new_hugefile.txt
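Since the program reports a global line number when it crashes, you can also work out which piece a given line falls in before editing. A quick shell-arithmetic sketch (the line number 1234567 is made up for illustration):
$ N=1234567
$ printf 'part.%04d line %d\n' $(( (N-1)/100000 )) $(( (N-1)%100000+1 ))
part.0012 line 34567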

The simplest solution is to use a stream-oriented editor such as sed. All you need is to be able to write one or more regular expressions that will identify all (and only) the bad records. Since you haven't provided any details on how to identify the bad records, this is as specific as the answer can get.
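For instance, if the bad records can be matched by a pattern (BAD_PATTERN below is a placeholder, not something taken from your data), a single pass comments them all out:
$ sed '/BAD_PATTERN/ s/^/*/' file > new_file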

A basic pattern in R is to read the data in chunks, edit each chunk, and write it out:
fin = file("fin.txt", "r")
fout = file("fout.txt", "w")
while (length(txt <- readLines(fin, n=1000000))) {
    ## txt now holds up to 1000000 lines; add an asterisk to problem lines.
    ## "BAD_PATTERN" is a placeholder predicate -- replace it with whatever
    ## actually identifies a bad record in your data.
    bad = grepl("BAD_PATTERN", txt)
    txt[bad] = paste0("*", txt[bad])
    writeLines(txt, fout)
}
close(fin); close(fout)
While not ideal, this works on Windows (implied by the mention of Notepad++) and in a language you are presumably familiar with (R). Using sed (definitely the appropriate tool in the long run) would require installing additional software and coming up to speed with it.

Related

Need to create a single record from 3 consecutive lines of text

I could easily write a little parsing program to do this task, but I just know that some Linux command-line guru can teach me something new here. I've pulled apart a bunch of files to gather some data so that I can create a table from it. The data is presently in this format:
.serviceNum=6360,
.transportId=1518,
.suid=6360,
.serviceNum=6361,
.transportId=1518,
.suid=6361,
.serviceNum=6362,
.transportId=1518,
.suid=6362,
.serviceNum=6359,
.transportId=1518,
.suid=6359,
.serviceNum=203,
.transportId=117,
.suid=20203,
.serviceNum=9436,
.transportId=919,
.suid=16294,
.serviceNum=9524,
.transportId=906,
.suid=17613,
.serviceNum=9439,
.transportId=917,
.suid=9439,
What I would like is this:
.serviceNum=6360,.transportId=1518,.suid=6360,
.serviceNum=6361,.transportId=1518,.suid=6361,
.serviceNum=6362,.transportId=1518,.suid=6362,
.serviceNum=6359,.transportId=1518,.suid=6359,
.serviceNum=203,.transportId=117,.suid=20203,
.serviceNum=9436,.transportId=919,.suid=16294,
.serviceNum=9524,.transportId=906,.suid=17613,
.serviceNum=9439,.transportId=917,.suid=9439,
So, the question is: is there a Linux command-line tool that will read through the file and automatically remove the EOL/CR at the end of the first and second line of each group of three, joining every three lines into one? I've seen old-school Linux gurus do incredible things on the command line, and this is one of those instances where I think it's worth my time to inquire. :)
TIA
Use cat and paste and see the magic
cat inputfile.txt | paste - - -
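paste reads one input line for each "-" argument and joins each group of three; by default it joins with tabs, so to reproduce the desired output exactly you can request an empty delimiter ("\0" in paste's delimiter-list syntax means the empty string, not a NUL byte):
$ paste -d '\0' - - - < inputfile.txt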
Perl to the rescue:
perl -pe 'chomp if $. % 3' < input
-p processes the input line by line, printing each line after the code runs;
chomp removes the trailing newline;
$. contains the current input line number;
% is the modulo operator, so the newline is kept only on every third line, which joins each group of three lines into one.
Another Perl approach accumulates three lines and prints them together (-l strips the newline on input and adds one back on print; the END block prints any leftover lines if the total is not a multiple of three):
perl -lne '$a .= $_; if ($. % 3 == 0) { print $a; $a = "" } END { print $a if length $a }' filename
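For comparison, here is an equivalent awk one-liner; it prints each line without its newline and emits a newline only after every third line:
$ awk '{printf "%s%s", $0, (NR % 3 ? "" : "\n")}' filename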

Saving a flat file through Vim adds an invisible byte to the file that creates a new line

The title is not really specific, but I have trouble identifying the correct keywords as I'm not sure what is going on here. For the same reason, it is possible that my question is a duplicate. If that's the case: sorry!
I have a Linux application that receives data via flat files. I don't know exactly how those files are generated, but I can read them without any problem. They are short files, only a line each.
For test purposes, I tried to modify one of those files and reinject it into the application. But when I do that, I can see in the log that a mysterious page break was added at the end of the message (resulting in the application not recognising the message)...
For the sake of example, let's say I receive a flat file, named original, that contains the following:
ABCDEF
I make a copy of this file and name it copy.
If I compare those two files using the "diff" command, it says they are identical (as I expect them to be).
If I open copy via Vi and then quit without changing or saving anything, and then use the "diff" command, it says they are identical (as I also expect them to be).
If I open copy via Vi and then save it without changing anything, and then use the "diff" command, I get the following (I added the dot for layout purposes):
diff original copy
1c1
< ABCDEF
\ No newline at end of file
---
.> ABCDEF
And if I compare the size of my two files, I can see that original is 71 bytes when copy is 72.
It seems that the format of the file changes when I save it. I first thought of an encoding problem, so I used the ":set list" command in Vim to see the invisible characters. But for both files, I see the following:
ABCDEF$
I have found other ways to do my test, but this problem still bugs me and I would really like to understand it. So, my two questions are:
What is happening here?
How can I modify a file like that without creating this mysterious page break?
Thank you for your help!
What happens is that Vim by default assumes that the files you edit end with a "newline" character. That's normal behavior in UNIX-land. But the "files" your program is reading look more like "streams" to me, because they don't end with a newline character.
To ensure that those "files" are written without a newline character, set the following options before writing:
:set binary noeol
See :help 'eol'.
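To see the extra byte directly, you can compare hex dumps of the two files (a quick check assuming xxd is available; the content matches the example above):
$ printf 'ABCDEF' > original
$ printf 'ABCDEF\n' > copy
$ xxd copy | tail -1
00000000: 4142 4344 4546 0a                        ABCDEF.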

Keep only rows in a range using vim

I was wondering if there's a way in Vim to keep only rows in a certain range, e.g. say I wanted to keep only rows 1-20 in a file and discard everything else. Better yet, say I wanted to keep lines 1-20 and 40-60; is there a way to do this?
Is there a way to do this without manually deleting stuff?
If you mean entire lines by "rows", just use the :delete command with the inverted range:
:21,$delete
removes all lines except 1-20.
If the ranges are non-consecutive, an alternative is the :vglobal command with regular expression atoms that match only in certain lines. For example, to only keep lines 3 and 7:
:g!/\%3l\|\%7l/delete
There are also atoms for "lines less/larger than", so you can build ranges with them, too.
In order to keep lines 1 through 20 and 40 through 60, the following construct should do:
:v/\%>0l\%<21l\|\%>39l\%<61l/d
If you want to (as I now understand from your comments) save (different) parts of the buffer as new files, it's best to not modify the original file, but to write fragments as separate files. In fact, Vi(m) supports this well, because you can just pass a range to the :write command:
:1,20w newfile1
:40,60w newfile2
Append works, too:
:40,60w >> newfile1
There's more than one way to achieve what you want.
If this question is really about the first rows in a file:
head -20 <filename> > newfile
If it is to be a vim solution:
:21<Enter>
dG
However, you mention that you want to split up a large file into smaller pieces. The tool for this is split: it lets you break a file into chunks of equal line count or equal size.

Adding content to the middle of a file... without reading it till the end

I have read various questions/answers here on unix.stackexchange on how to add or remove lines to/from a file without needing to create a temporary file.
https://unix.stackexchange.com/questions/11067/is-there-a-way-to-modify-a-file-in-place?lq=1
It appears all these answers require one to at least read to the end of the file, which can be time-consuming if the input is a large file. Is there a way around this? I would expect a file system to be implemented like a linked list... so there should be a way to reach the required "lines" and then just add stuff (a node, in linked-list terms). How do I go about doing this?
Am I correct in thinking so, or am I missing something?
PS: I need this to be done in C and cannot use any shell commands.
As of Linux 4.1, fallocate(2) supports the FALLOC_FL_INSERT_RANGE flag, which allows one to insert a hole of a given length in the middle of a file without rewriting the following data. However, it is rather limited: the hole must be inserted at a filesystem block boundary, and the size of the inserted hole must be a multiple of the filesystem block size. Additionally, in 4.1, this feature was only supported by the XFS filesystem, with Ext4 support added in 4.2.
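For illustration, util-linux exposes the same operation through the fallocate(1) command (the underlying C call is fallocate(fd, FALLOC_FL_INSERT_RANGE, offset, len)). The file name and sizes below are made up, and both offset and length must be multiples of the filesystem block size:
$ fallocate --insert-range --offset 4096 --length 8192 data.bin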
For all other cases, it is still necessary to rewrite the rest of the file, as indicated in other answers.
The short answer is that yes, it is possible to modify the contents of a file in place, but no, it is not possible to remove or add content in the middle of the file.
UNIX filesystems are implemented using an inode pointer structure which points to whole blocks of data. Each line of a text file does not "know" about its relationship to the previous or next line; they are simply adjacent to each other within the block. To add content between those two lines would require all of the following content to be shifted further "down" within the block, pushing some data into the next block, which in turn would have to shift into the following block, and so on.
In C you can fopen a file for update, read its contents, and overwrite some of them, but I do not believe there is (even theoretically) any way to insert new data in the middle, or to delete data (except by overwriting it with nulls).
You can modify a file in place, for example using dd.
$ echo Hello world, how are you today? > helloworld.txt
$ cat helloworld.txt
Hello world, how are you today?
$ echo -n Earth | dd of=helloworld.txt conv=notrunc bs=1 seek=6
$ cat helloworld.txt
Hello Earth, how are you today?
The problem is that if your change also changes the length, it will not quite work correctly:
$ echo -n Joe | dd of=helloworld.txt conv=notrunc bs=1 seek=6
Hello Joeth, how are you today?
$ echo -n Santa Claus | dd of=helloworld.txt conv=notrunc bs=1 seek=6
Hello Santa Clausare you today?
When you change the length, you have to rewrite the file, if not completely then at least from the point of change onward.
In C this is the same as with dd. You open the file, you seek, and you write.
You can open a file in read/write mode, read it (or use "seek" to jump to the position you want, if you know it), and then write into the file; but you overwrite the data that is there (this is not an insertion).
Then you can choose either to truncate the file at the last point you wrote, or to keep all the remaining data after that point without reading it.
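From the shell, the equivalent of that truncation is the truncate(1) command (in C it would be ftruncate(2); the size below is arbitrary):
$ truncate -s 1024 file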

vim -- How to read range of lines from a file into current buffer

I want to read lines n1 to n2 from file foo.c into the current buffer.
I tried: 147,227r /path/to/foo/foo.c
But I get: "E16: Invalid range", though I am certain that foo.c contains more than 1000 lines.
Filter the range with sed and read its output:
:r! sed -n 147,227p /path/to/foo/foo.c
You can do it in pure Vimscript, without having to use an external tool like sed:
:put =readfile('/path/to/foo/foo.c')[146:226]
Note that we must subtract one from the line numbers because list indices start from 0 while line numbers start from 1.
Disadvantages: This solution is 7 characters longer than the accepted answer. It will temporarily read the entire file into memory, which could be a concern if that file is huge.
The {range} refers to the destination in the current file, not to the range of lines in the source file.
After some experimentation, it seems
:147,227r /path/to/foo/foo.c
means "insert the contents of /path/to/foo/foo.c after line 227 of this file", i.e. it ignores the 147.
Other solutions posted are great for specific line numbers. It's often the case that you want to read from the top or bottom of another file, and in that case reading the output of head or tail is very fast. For example:
:r !head -20 xyz.xml
will read the first 20 lines of xyz.xml into the current buffer at the cursor position.
:r !tail -10 xyz.xml
will read the last 10 lines of xyz.xml into the current buffer at the cursor position.
The head and tail commands are extremely fast, so even combining them can be much faster than other approaches for very large files.
:r !head -700030 xyz.xml| tail -30
will read lines 700001 to 700030 of xyz.xml into the current buffer.
This operation should complete instantly even for fairly large files.
I just had to do this in a code project of mine and did it this way:
In buffer with /path/to/foo/foo.c open:
:147,227w export.txt
In buffer I'm working with:
:r export.txt
Much easier in my book... It requires having both files open, but if I'm importing a set of lines, I usually have them both open anyway. This method is more general and easier for me to remember, especially if I'm trying to export/import a trickier set of lines using :g/<search criteria>/.w >> export.txt or some other more complicated way of selecting lines.
You will need to:
:r /path/to/foo/foo.c
:228,$d
:1,146d
Three steps, but it will get it done...
A range permits a command to be applied to a group of lines in the current buffer.
So the range given to the :read command specifies where to insert the content in the current file, not which lines of the source file to read.
