If a torrent contains multiple files, how do I know which piece corresponds to which file? - bittorrent

I'm building a BitTorrent client application in Java and I have two small questions:
Can a torrent contain folders? Recursively?
If a torrent contains n files (not directories - for simplicity), do I need to create n files with their corresponding sizes? When I receive a piece from a peer, how do I know to which file it (the piece) belongs?
For example, here is a torrent which contains 2 files:
TorrentInfo{Created By: ruTorrent (PHP Class - Adrien Gibrat)
Main tracker: http://tracker.hebits.net:35777/tracker.php?do=announce&passkey=5d3ab309eda55c1e7183975099958ab2
Comment: null
Info_hash: c504216ca4a113d26f023a10a1249ca3a6217997
Name: Veronica.2017.1080p.BluRay.DTS-HD.MA.5.1.x264-HDH
Piece Length: 16777216
Pieces: 787
Total Size: null
Is Single File Torrent: false
File List:
TorrentFile{fileLength=13202048630, fileDirs=[Veronica.2017.1080p.BluRay.DTS-HD.MA.5.1.x264-HDH.mkv]}
TorrentFile{fileLength=62543, fileDirs=[Veronica.2017.1080p.BluRay.DTS-HD.MA.5.1.x264-HDH.srt]}
The docs don't say much: https://wiki.theory.org/index.php/BitTorrentSpecification

What you are doing is similar to what I did...
The spec excerpts quoted below are the parts that matter for your questions.
1. Yes; no: directories are not nested recursively in the metainfo; each file instead carries a path list spelling out its folders (see the excerpts below).
Info in Multiple File Mode
name: the name of the directory in which to store all the files. This is purely advisory. (string)
path: a list containing one or more string elements that together represent the path and filename. Each element in the list corresponds to either a directory name or (in the case of the final element) the filename. For example, the file "dir1/dir2/file.ext" would consist of three string elements: "dir1", "dir2", and "file.ext". This is encoded as a bencoded list of strings such as l4:dir14:dir28:file.exte
Info in Single File Mode
name: the filename. This is purely advisory. (string)
That is, a file's path list includes its folder names.
2. Maybe;
Whether you need to create n files with their corresponding sizes depends on whether you need to download all n files.
Peer wire protocol (TCP)
piece: <len=0009+X><id=7><index><begin><block>
The piece message is variable length, where X is the length of the block. The payload contains the following information:
index: integer specifying the zero-based piece index
begin: integer specifying the zero-based byte offset within the piece
block: block of data, which is a subset of the piece specified by index.
For the purposes of piece boundaries in the multi-file case, consider the file data as one long continuous stream, composed of the concatenation of each file in the order listed in the files list. The number of pieces and their boundaries are then determined in the same manner as the case of a single file. Pieces may overlap file boundaries.
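To make that arithmetic concrete, here is a minimal Java sketch (method name and the pieceLength parameter are mine; only the message layout, len = 9 + X followed by index, begin and block, comes from the spec quoted above) that reads a piece message and turns it into an absolute offset in that continuous stream:

import java.io.DataInputStream;
import java.io.IOException;

class PieceMessageReader {
    // Reads the payload of a "piece" message (id = 7), assuming the 4-byte
    // length prefix is already in `len` and the id byte has been consumed.
    // pieceLength is the "piece length" value from the .torrent's info dict.
    static void readPiece(DataInputStream in, int len, long pieceLength)
            throws IOException {
        int index = in.readInt();          // zero-based piece index
        int begin = in.readInt();          // byte offset within that piece
        byte[] block = new byte[len - 9];  // len = 9 + X, so the block is len - 9 bytes
        in.readFully(block);
        // Absolute position of this block in the concatenated file stream:
        long globalOffset = (long) index * pieceLength + begin;
        // ... write `block` at globalOffset using a piece-to-file mapping
    }
}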
Sorry for my English; I am not a native speaker...

Can a torrent contain folders? Recursively?
Yes.
Sort of. In BEP 3, nested directories are mapped into path elements, i.e. /dir1/dir2/dir3/file.ext is represented as path: ["dir1", "dir2", "dir3", "file.ext"] in the file list. BEP 52 changes this to a tree-based structure that more closely resembles a directory tree.
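For instance, materializing such a path list as a relative path on disk might look like this (a trivial sketch; the class and method names are mine):

import java.nio.file.Path;
import java.util.List;

class PathListJoiner {
    // ["dir1", "dir2", "dir3", "file.ext"] -> dir1/dir2/dir3/file.ext
    static Path toRelativePath(List<String> elements) {
        Path p = Path.of(elements.get(0));
        for (int i = 1; i < elements.size(); i++) {
            p = p.resolve(elements.get(i));
        }
        return p;
    }
}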
If a torrent contains n files (not directories - for simplicity), do I need to create n files with their corresponding sizes? When I receive a piece from a peer, how do I know to which file it (the piece) belongs?
The BitTorrent wire protocol deals with a contiguous address space of bytes which are grouped into fixed-size pieces. How a client stores those bytes locally is in principle up to the implementation. But if you want to store them in the file layout described in the .torrent, then you have to calculate a mapping between the piece address space and file offsets. In BEP 3, files are not aligned to piece boundaries, so a single piece can straddle multiple files. BEP 47 and BEP 52 aim to simplify this by introducing padding files or implicit alignment gaps, respectively.
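A minimal sketch of that mapping (the helper types and names are mine, not from any spec): walk the file list in metainfo order, keep a running offset of where each file starts in the concatenated stream, and intersect the block's [offset, offset + length) range with each file.

import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

class PieceToFileMapper {
    record FileEntry(Path path, long length) {}   // one entry of the "files" list, in order
    record FileSlice(FileEntry file, long offsetInFile, long length) {}

    // Maps a block at globalOffset (= index * pieceLength + begin) of
    // blockLength bytes onto the file(s) it covers.
    static List<FileSlice> mapToFiles(List<FileEntry> files,
                                      long globalOffset, long blockLength) {
        List<FileSlice> slices = new ArrayList<>();
        long fileStart = 0;                 // where the current file begins in the stream
        for (FileEntry f : files) {
            long fileEnd = fileStart + f.length();
            long from = Math.max(globalOffset, fileStart);
            long to = Math.min(globalOffset + blockLength, fileEnd);
            if (from < to) {                // the block overlaps this file
                slices.add(new FileSlice(f, from - fileStart, to - from));
            }
            fileStart = fileEnd;
        }
        return slices;
    }
}

For the example torrent above this means the last piece (index 786) covers both the tail of the .mkv and the entire 62543-byte .srt, since the subtitle file shares that final piece with the video.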

Related

Concatenate files by inode

Is there a method in Linux to concatenate existing files by essentially turning two files into one file with two fragments? I'm imagining updating the first file's inode pointers to include the second file's blocks and then removing the second file's inode.
This is not "physically" possible on most filesystems, and there is no Linux system call to do it.
Consider the case of appending two files to each other, where each file is 1 GB + 1 byte. Simply concatenating the two would leave a single 1-byte extent in the middle of the file; most filesystems have no way of representing this, as they only use partial extents at the end of a file.

Changing the head of a large Fortran binary file without dealing with the whole body

I have a large binary file (~ GB size) generated from a Fortran 90 program. I want to modify something in the head part of the file. The structure of the file is very complicated and contains many different variables, which I want to avoid going into. After reading and re-writing the head, is it possible to "copy and paste" the remainder of the file without knowing its detailed structure? Or even better, can I avoid re-writing the whole file altogether and just make changes to the original file? (Not sure if it matters, but the length of the header will be changed.)
Since you are changing the length of the header, I think that you have to write a new, revised file. You could avoid having to "understand" the records after the header by opening the file with stream access and just reading bytes (or perhaps four-byte words, if the file is a multiple of four bytes) until you reach EOF, copying them to the new file. But if the file was originally created as sequential access and you want to access it that way in the future, you will have to handle the record length information for the header record(s), including altering the value(s) to be consistent with the changed length of the record(s). This record length information is typically a four-byte integer at the beginning and end of each record, but it depends on the compiler.
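A byte-level sketch of the copy part, in Java for illustration (oldHeaderLength and newHeader are assumptions standing in for the program-specific details; the Fortran record-length bookkeeping described above is not handled here):

import java.io.*;

class HeaderRewriter {
    // Writes newHeader to a new file, then copies the remainder of the
    // original byte-for-byte without interpreting its structure.
    static void rewriteHeader(File oldFile, File newFile,
                              long oldHeaderLength, byte[] newHeader)
            throws IOException {
        try (InputStream in = new BufferedInputStream(new FileInputStream(oldFile));
             OutputStream out = new BufferedOutputStream(new FileOutputStream(newFile))) {
            out.write(newHeader);            // the revised header, possibly a new length
            in.skipNBytes(oldHeaderLength);  // skip the old header bytes
            in.transferTo(out);              // copy everything else until EOF
        }
    }
}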

Find the file offset for a character index, ignoring newline

I have a text file of 3 GB size (a FASTA file with DNA sequences). It contains about 50 million lines of differing length, though most lines are 70 characters wide. I want to extract a string from this file, given two character indices. The difficult part is that newlines must not be counted as characters.
For good speed, I want to use seek() to reach the beginning of the string and start reading, but I need the offset in bytes for that.
My current approach is to write a new file, with all the newlines removed, but that takes another 3GB on disk. I want to find a solution which requires less disk space.
Using a dictionary mapping each character count to a file offset is not practicable either, because there would be one key for every byte, therefore using at least 16 bytes * 3 billion characters = 48 GB.
I think I need a data structure that allows me to retrieve the number of newline characters that come before a character of a certain index; then I can add their number to the character index to obtain the file offset in bytes.
The SamTools fai index was designed for exactly this purpose. It is a very small, compact index file with enough information to quickly seek to any point in the FASTA file for any record inside, as long as the file is properly formatted.
You can create a SamTools index using the samtools faidx command.
You can then use other programs in the SamTools package to pull out subsequences or alignments very quickly using the index.
See http://www.htslib.org/doc/samtools.html for usage.
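The offset arithmetic behind such an index is small enough to sketch (Java here; the names are mine, but the per-record values mirror what a .fai line stores: the byte offset where the sequence starts, the characters per line, and the bytes per line including the newline):

class FastaOffsets {
    // Converts a zero-based character index (newlines not counted) into a
    // byte offset usable with seek(), for one FASTA record.
    static long byteOffset(long seqStartByte,  // file offset where the sequence begins
                           long lineBases,     // characters per line (e.g. 70)
                           long lineWidth,     // bytes per line incl. newline (e.g. 71)
                           long charIndex) {
        long fullLines = charIndex / lineBases; // complete lines before the character
        long column = charIndex % lineBases;    // position within its own line
        return seqStartByte + fullLines * lineWidth + column;
    }
}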

data pointers in inode data structure

I have gone through the inode code in the Linux kernel, but I am unable to figure out where the data pointers are in the inode. I know that there are 15 pointers [0-14], of which 12 are direct, 1 is single indirect, 1 is double indirect and 1 is triple indirect.
Can someone please locate these data members? Also, please specify how you located them, as I have searched on Google many times with different keywords, but all in vain.
It is up to a specific file system how it accesses its data, so there are no "data pointers" in general (some file systems may be virtual, meaning they generate their data on the fly or retrieve it from the network).
If you're interested in ext4, you can look up the ext4-specific inode structure (struct ext4_inode) in fs/ext4/ext4.h, where data of an inode is indeed referenced by indices of 12 direct blocks, 1 of single indirection, 1 of double indirection and 1 of triple indirection.
This means that blocks [0..11] of an inode's data have numbers e4inode->i_block[0/1/.../11], whereas e4inode->i_block[12] is the number of a block which is itself filled with data block numbers (so it holds the indices of the inode's data blocks in the range [12, 12 + fs->block_size / sizeof(__le32))). The same trick is applied to i_block[13], only it holds doubly-indirected indices (blocks filled with indices of blocks that hold the list of blocks holding the actual data), starting from index 12 + fs->block_size / sizeof(__le32), and i_block[14] holds triply-indirected indices.
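As a small illustration of those ranges (hypothetical Java mirroring the classic ext2/ext3-style indirect scheme described above, with ptrsPerBlock = block_size / sizeof(__le32)):

class BlockLocator {
    // Reports which i_block slot is the root of the lookup for a given
    // logical block number within a file.
    static String locate(long logicalBlock, long ptrsPerBlock) {
        if (logicalBlock < 12)
            return "direct: i_block[" + logicalBlock + "]";
        logicalBlock -= 12;
        if (logicalBlock < ptrsPerBlock)
            return "single indirect via i_block[12]";
        logicalBlock -= ptrsPerBlock;
        if (logicalBlock < ptrsPerBlock * ptrsPerBlock)
            return "double indirect via i_block[13]";
        return "triple indirect via i_block[14]";
    }
}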
As explained here:
http://computer-forensics.sans.org/blog/2010/12/20/digital-forensics-understanding-ext4-part-1-extents
Ext4 uses extents instead of block pointers to track the file content.
If you are interested in the ext3/ext2 data structures, where content pointers are used:
http://www.slashroot.in/how-does-file-deletion-work-linux
has many good diagrams elaborating on it. And here:
http://mcgrewsecurity.com/training/extx.pdf
page 16 has examples of the details of "block pointers" (which are basically block numbers, or offset values relative to the start of the disk image; 1 block is usually 512 bytes).
If you want to walk the filesystem physically, say for an ext3-formatted hard drive, see this:
http://wiki.sleuthkit.org/index.php?title=FS_Analysis
but you can always just use the "dd" command to do everything; you only need to know where to start and stop reading, and the input for the dd command is usually a replica of the hard disk image itself, for many reasons.

How to find the position of Central Directory in a Zip file?

I am trying to find the position of the first Central Directory file header in a Zip file.
I'm reading these:
http://en.wikipedia.org/wiki/Zip_(file_format)
http://www.pkware.com/documents/casestudies/APPNOTE.TXT
As I see it, I can only scan through the Zip data, identify by the header what kind of section I am at, and then do that until I hit the Central Directory header. I would obviously read the File Headers before that and use the "compressed size" to skip the actual data, and not for-loop through every byte in the file...
If I do it like that, then I practically already know all the files and folders inside the Zip file, in which case I don't see much use for the Central Directory anymore.
To my understanding, the purpose of the Central Directory is to list file metadata and the positions of the actual data in the Zip file, so that you don't need to scan the whole file?
After reading about the End Of Central Directory record, Wikipedia says:
This ordering allows a zip file to be created in one pass, but it is usually decompressed by first reading the central directory at the end.
How would I find the End of Central Directory record easily? We need to remember that it can have an arbitrarily sized comment, so I may not know how many bytes from the end of the data stream it is located at. Do I just scan for it?
P.S. I'm writing a Zip file reader.
Start at the end and scan towards the beginning, looking for the end of central directory signature and counting the number of bytes you have scanned. When you find a candidate, read the comment length (L) from byte offset 20 of the record. Check whether the candidate sits exactly 22 + L bytes from the end of the file (the fixed part of the record is 22 bytes). Then check that the start of the central directory (pointed to by the 4-byte offset at byte 16 of the record) has an appropriate signature.
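A sketch of that backward scan in Java (the method name is mine; the signature bytes, the 22-byte fixed record size and the 65535-byte comment limit come from the zip format):

import java.io.IOException;
import java.io.RandomAccessFile;

class EocdFinder {
    // Returns the byte offset of the End of Central Directory record,
    // or -1 if none is found. The EOCD signature is 0x06054b50, stored
    // on disk as the bytes 50 4B 05 06.
    static long findEocd(RandomAccessFile f) throws IOException {
        long fileLen = f.length();
        int maxScan = (int) Math.min(fileLen, 22 + 65535); // record + max comment
        byte[] tail = new byte[maxScan];
        f.seek(fileLen - maxScan);
        f.readFully(tail);
        for (int i = maxScan - 22; i >= 0; i--) {
            if (tail[i] == 0x50 && tail[i + 1] == 0x4B
                    && tail[i + 2] == 0x05 && tail[i + 3] == 0x06) {
                int commentLen = (tail[i + 20] & 0xFF) | ((tail[i + 21] & 0xFF) << 8);
                // the record plus its comment must reach exactly to end of file
                if (i + 22 + commentLen == maxScan) {
                    return fileLen - maxScan + i;
                }
            }
        }
        return -1;
    }
}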
If you assume the bits are pretty random when the signature check happens to be a wild guess (e.g. a guess landing in a data segment), the probability of getting all the signature bits correct is pretty low. You could refine this and figure out the chance of landing in a data segment and the chance of hitting a legitimate header (as a function of the number of such headers), but this already sounded like a low likelihood to me. You could increase your confidence level by then checking the signature of the first file record listed, but be sure to handle the boundary case of an empty zip file.
I ended up looping through the bytes starting from the end. The loop stops if it finds a matching byte sequence, if the index goes below zero, or if it has already gone through 64 KB.
Just cross your fingers and hope that there isn't an entry with a CRC, timestamp or datestamp of 06054B50, or any other sequence of four bytes that happens to be 06054B50.
