Find the file offset for a character index, ignoring newlines

I have a 3 GB text file (a FASTA file with DNA sequences). It contains about 50 million lines of differing length, though most lines are 70 characters wide. I want to extract a string from this file given two character indices. The difficult part is that newlines must not be counted as characters.
For good speed I want to use seek() to reach the beginning of the string and start reading, but for that I need the offset in bytes.
My current approach is to write a new file with all the newlines removed, but that takes another 3 GB on disk. I want to find a solution that requires less disk space.
Using a dictionary mapping each character count to a file offset is not practicable either, because there would be one key for every byte, using at least 16 bytes × 3 billion characters = 48 GB.
I think I need a data structure that lets me retrieve the number of newline characters that come before a character of a given index; then I can add that number to the character index to obtain the file offset in bytes.
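Something like the following is what I have in mind: checkpoint the byte offset every N lines, then bisect to the nearest checkpoint and scan forward. This is only a rough sketch, and it assumes one byte per character, which holds for plain-ASCII FASTA:

import bisect

def build_index(path, every=1000):
    # Checkpoint (characters seen, byte offset) at every `every`-th line
    # boundary; ~50M lines / 1000 gives a ~50k-entry index.
    chars, offsets = [0], [0]
    seen = 0
    with open(path, 'rb') as f:
        for lineno, line in enumerate(f, 1):
            seen += len(line.rstrip(b'\n'))   # don't count the newline
            if lineno % every == 0:
                chars.append(seen)
                offsets.append(f.tell())
    return chars, offsets

def char_to_byte(path, chars, offsets, index):
    # Find the last checkpoint at or before `index`, then scan forward.
    i = bisect.bisect_right(chars, index) - 1
    remaining = index - chars[i]
    with open(path, 'rb') as f:
        f.seek(offsets[i])
        while True:
            start = f.tell()
            line = f.readline()
            if not line:
                raise IndexError('character index past end of file')
            body = len(line.rstrip(b'\n'))
            if remaining < body:
                return start + remaining
            remaining -= body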

The samtools fai index was designed for exactly this purpose. It is a very small, compact index file with enough information to quickly seek to any point of any record in the FASTA file, as long as the file is properly formatted.
You can create the index with the samtools faidx command, and then use other programs in the samtools package to pull out subsequences or alignments very quickly using the index.
See http://www.htslib.org/doc/samtools.html for usage.
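For reference, the .fai index is just a small text file with five tab-separated columns per record (NAME, LENGTH, OFFSET, LINEBASES, LINEWIDTH), so you can also compute the seek position yourself. A rough sketch (genome.fa is a placeholder file name; base indices are 0-based):

def fai_offset(offset, linebases, linewidth, base_index):
    # OFFSET points at the first base of the record; each full line of
    # LINEBASES bases occupies LINEWIDTH bytes (bases plus line terminator).
    full_lines, rest = divmod(base_index, linebases)
    return offset + full_lines * linewidth + rest

with open('genome.fa.fai') as fai:
    name, length, offset, linebases, linewidth = fai.readline().split('\t')
with open('genome.fa', 'rb') as fa:
    fa.seek(fai_offset(int(offset), int(linebases), int(linewidth), 1_000_000))
    print(fa.read(70))   # next 70 bytes (note: these include newlines)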

If a torrent contains multiple files, how do I know which piece corresponds to which file?

I'm building a BitTorrent client application in Java and I have two small questions:
Can a torrent contain folders? Recursively?
If a torrent contains n files (not directories, for simplicity), do I need to create n files with their corresponding sizes? When I receive a piece from a peer, how do I know to which file the piece belongs?
For example, here is a torrent which contains 2 files:
TorrentInfo{Created By: ruTorrent (PHP Class - Adrien Gibrat)
Main tracker: http://tracker.hebits.net:35777/tracker.php?do=announce&passkey=5d3ab309eda55c1e7183975099958ab2
Comment: null
Info_hash: c504216ca4a113d26f023a10a1249ca3a6217997
Name: Veronica.2017.1080p.BluRay.DTS-HD.MA.5.1.x264-HDH
Piece Length: 16777216
Pieces: 787
Total Size: null
Is Single File Torrent: false
File List:
TorrentFile{fileLength=13202048630, fileDirs=[Veronica.2017.1080p.BluRay.DTS-HD.MA.5.1.x264-HDH.mkv]}
TorrentFile{fileLength=62543, fileDirs=[Veronica.2017.1080p.BluRay.DTS-HD.MA.5.1.x264-HDH.srt]}
The docs don't say much: https://wiki.theory.org/index.php/BitTorrentSpecification
What you are doing is similar to what I did. The following excerpts from the spec are the important ones for your questions.
1. Yes; no.
Info in Multiple File Mode
name: the name of the directory in which to store all the files. This is purely advisory. (string)
path: a list containing one or more string elements that together represent the path and filename. Each element in the list corresponds to either a directory name or (in the case of the final element) the filename. For example, the file "dir1/dir2/file.ext" would consist of three string elements: "dir1", "dir2", and "file.ext". This is encoded as a bencoded list of strings such as l4:dir14:dir28:file.exte
Info in Single File Mode
name: the filename. This is purely advisory. (string)
The filename includes the folder name.
2. Maybe.
Whether you need to create n files with their corresponding sizes depends on whether you need to download all n files.
Peer wire protocol (TCP)
piece:
The piece message is variable length: <len=0009+X><id=7><index><begin><block>, where X is the length of the block. The payload contains the following information:
index: integer specifying the zero-based piece index
begin: integer specifying the zero-based byte offset within the piece
block: block of data, which is a subset of the piece specified by index.
For the purposes of piece boundaries in the multi-file case, consider the file data as one long continuous stream, composed of the concatenation of each file in the order listed in the files list. The number of pieces and their boundaries are then determined in the same manner as the case of a single file. Pieces may overlap file boundaries.
(I am sorry for my English; I am not a native speaker.)
Can a torrent contain folders? Recursively?
Yes.
Sort of. In BEP3, nested directories are mapped onto path elements, i.e. /dir1/dir2/dir3/file.ext is represented as path: ["dir1", "dir2", "dir3", "file.ext"] in the file list. BEP52 changes this to a tree-based structure that more closely resembles a directory tree.
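To make the path encoding concrete, here is a quick sketch (ASCII-only for brevity; real bencoding operates on raw bytes):

def bencode_path(parts):
    # A bencoded list of strings: 'l' + <len>:<string> per element + 'e'
    return 'l' + ''.join(f'{len(p)}:{p}' for p in parts) + 'e'

print(bencode_path(['dir1', 'dir2', 'file.ext']))   # l4:dir14:dir28:file.exte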
If a torrent contains n files (not directories, for simplicity), do I need to create n files with their corresponding sizes? When I receive a piece from a peer, how do I know to which file the piece belongs?
The bittorrent wire protocol deals with a contiguous address space of bytes which are grouped into fixed-size pieces. How a client stores those bytes locally is in principle up to the implementation. But if you want to store them in the file layout described in the .torrent, then you have to calculate a mapping between the piece address space and file offsets. In BEP3, files are not aligned to piece boundaries, so a single piece can straddle multiple files. BEP 47 and BEP 52 aim to simplify this by introducing padding files or implicit alignment gaps, respectively.
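A sketch of that mapping; `files` is the ordered (path, size) list from the .torrent's file list, the file names are placeholders, and the sizes are taken from the example torrent above:

def piece_to_files(files, piece_length, index, begin, length):
    # Translate a block (piece `index`, offset `begin`, `length` bytes)
    # into (path, offset_in_file, nbytes) slices.
    pos = index * piece_length + begin    # offset in the concatenated stream
    out = []
    for path, size in files:
        if pos >= size:                   # block starts after this file
            pos -= size
            continue
        n = min(size - pos, length)       # bytes of the block inside this file
        out.append((path, pos, n))
        length -= n
        pos = 0                           # the next file is entered at its start
        if length == 0:
            break
    return out

files = [('movie.mkv', 13202048630), ('movie.srt', 62543)]
# A 16 KiB block near the end of piece 786 straddles both files:
print(piece_to_files(files, 16777216, 786, 15156800, 16384))
# [('movie.mkv', 13202048576, 54), ('movie.srt', 0, 16330)]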

Changing the head of a large Fortran binary file without dealing with the whole body

I have a large binary file (~GB size) generated by a Fortran 90 program. I want to modify something in the head part of the file. The structure of the file is very complicated and contains many different variables, which I want to avoid going into. After reading and re-writing the head, is it possible to "copy and paste" the remainder of the file without knowing its detailed structure? Or, even better, can I avoid re-writing the whole file altogether and just make changes to the original file? (Not sure if it matters, but the length of the header will change.)
Since you are changing the length of the header, you will have to write a new, revised file. You can avoid having to "understand" the records after the header by opening the file with stream access and just reading bytes (or perhaps four-byte words, if the file is a multiple of four bytes) until you reach EOF, copying them to the new file. But if the file was originally created with sequential access and you want to access it that way in the future, you will have to handle the record-length information for the header record(s), including altering the value(s) to be consistent with the changed length of the record(s). This record-length information is typically a four-byte integer at the beginning and end of each record, but it depends on the compiler.
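For the copy-and-replace route, assuming the common layout of four-byte little-endian record-length markers before and after each record (the gfortran default; other compilers and endiannesses differ), a rough sketch of the idea in Python:

import shutil
import struct

def replace_header(src, dst, new_header):
    # Rewrite the first sequential-access record, stream-copy the rest.
    with open(src, 'rb') as fin, open(dst, 'wb') as fout:
        (reclen,) = struct.unpack('<i', fin.read(4))   # leading length marker
        fin.seek(reclen + 4, 1)          # skip old header and trailing marker
        marker = struct.pack('<i', len(new_header))
        fout.write(marker + new_header + marker)       # new header record
        shutil.copyfileobj(fin, fout)    # copy the remainder byte for byte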

Partial buffer writes

I'm looking through the node Buffer documentation in detail, and I can't get my head around the explanation for buffer.write().
Specifically, I don't get what the behaviour is when a write is attempted with a string larger than the buffer's capacity. The following passage seems to contradict itself:
If buffer did not contain enough space to fit the entire string, it will write a partial amount of the string. length defaults to buffer.length - offset. The method will not write partial characters.
The first sentence claims it will write what it can, while the last one says it's an all-or-nothing operation.
Am I missing something?
In certain encodings (like UTF-8) a single character can be represented by multiple bytes.
When the documentation says "The method will not write partial characters", I think it means that if a character needs 3 bytes but only 2 bytes are left in the buffer, the character won't be written at all (as opposed to writing only its first 2 bytes).
http://en.wikipedia.org/wiki/UTF-8
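The effect is easy to demonstrate with any UTF-8 encoder (Python used here just to show the byte-level behaviour):

s = '€'                          # one character, three bytes in UTF-8
b = s.encode('utf-8')
print(len(b))                    # 3
print(b[:2])                     # b'\xe2\x82' - the first two bytes alone
print(b[:2].decode('utf-8', errors='replace'))   # '�': a broken character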

why does an iterable object have no length in Python?

I think I am constantly improving my previous question. Basically, I need to chunk up a large text (CSV) file to send pieces to a multiprocessing.Pool. To do so, I think I need an iterable object where the lines can be iterated over.
(see how to multiprocess large text files in python?)
Now I have realized that the file object itself (an _io.TextIOWrapper) is iterable line by line after you open a text file, so perhaps my chunking code (below; sorry for missing it previously) could chunk it, if it could get its length. But if it's iterable, why can't I simply ask for its length (in lines, not bytes)?
Thanks!
import itertools

def chunks(l, n):
    """Yield successive `n`-item chunks from the iterable `l`."""
    l_c = iter(l)
    while 1:
        x = tuple(itertools.islice(l_c, n))
        if not x:
            return
        yield x
The reason files are iterable is that they are read in series. The length of a file in lines can't be calculated unless the whole file is processed. (The file's length in bytes is no indicator of how many lines it has.)
The problem is that, if the file were Gigabytes long, you might not want to read it twice if it could be helped.
That is why it is better to not know the length; that is why one should deal with data files as an Iterable rather than a collection/vector/array that has a length.
Your chunking code should be able to deal directly with the file object itself, without knowing its length.
However, if you want to know the number of lines before processing fully, your two options are:
buffer the whole file into an array of lines first, then pass those lines to your chunker
read the file twice, the first time discarding all the data and just counting the lines
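For instance, the chunks() generator from the question can be fed the open file object directly; here is a sketch combining it with a Pool (big.csv and the process worker are placeholders):

from multiprocessing import Pool

def process(lines):                # placeholder worker: count lines per chunk
    return len(lines)

if __name__ == '__main__':
    with open('big.csv') as f, Pool() as pool:
        # imap consumes chunks(f, ...) lazily, so the whole file is never
        # held in memory at once (pool.map would buffer every chunk first)
        for result in pool.imap(process, chunks(f, 100_000)):
            print(result)

And if you do need the line count up front, option 2 is a one-liner: nlines = sum(1 for _ in open('big.csv')).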

Splitting long input into multiple text files

I have some code that generates an unbounded number of lines of output, so I can't store them all in a single output file.
Instead, I split the output across multiple files, currently according to index numbers. My problem is that I don't know in advance how many lines there will be. Is it possible to split the output into different files without giving an index? For example:
first 100,000 lines in m.txt
lines 100,001 to 200,000 in n.txt
If you don't need to be able to find a particular line based on the file name, you can split the output based on file size: write lines to m1.txt until the next line would push it over 1 MB, then move on to the next file, m2.txt.
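A sketch of that size-based rotation (the 1 MB threshold and the file-name prefix are just examples):

def write_rotated(lines, max_bytes=1_000_000, prefix='m'):
    n, size = 1, 0
    out = open(f'{prefix}{n}.txt', 'w')
    for line in lines:              # each `line` is assumed to end with '\n'
        if size + len(line) > max_bytes and size > 0:
            out.close()             # this file would exceed the limit
            n, size = n + 1, 0
            out = open(f'{prefix}{n}.txt', 'w')
        out.write(line)
        size += len(line)           # counts characters; equals bytes for ASCII
    out.close()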
split(1) appears to be exactly the tool for your job.
Generate files with a running index: start by opening e.g. m_000001.txt, write a fixed number of lines to that file, close it, open the next file (m_000002.txt), and continue; see the sketch below.
Making sure that you don't overflow the disk is a housekeeping task to be done separately. Here one can think of backups, compression, file rotation and so on.
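A sketch of the running-index scheme (the file-name pattern and one hundred thousand lines per file are assumptions):

import itertools

def write_numbered(lines, per_file=100_000, prefix='m_'):
    lines = iter(lines)
    for n in itertools.count(1):
        batch = list(itertools.islice(lines, per_file))
        if not batch:
            break
        with open(f'{prefix}{n:06d}.txt', 'w') as out:
            out.writelines(batch)   # m_000001.txt, m_000002.txt, ...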
You may want to use logrotate for this purpose. It has a lot of options: check out the man page.
Here's the introduction of the man page:
"logrotate is designed to ease administration of systems that generate large numbers of log files. It allows automatic rotation, compression, removal, and mailing of log files. Each log file may be handled daily, weekly, monthly, or when it grows too large."
Four ways to split while writing:
A) after a fixed number of characters (size)
B) after a fixed number of lines
C) after a fixed interval of time between writes
D) after a fixed number of calls to the writing function
Based on those splits, you can name the output files.
