Does the last piece of a torrent correspond to the last piece of the last file? - bittorrent

I am trying to download multiple torrents where I only want the last pieces of the .mp4 files.
I can't target specific parts of specific files directly, but for example, if my torrent contains 3 files:
1.mp4
2.mp4
3.mp4
is the last piece of the torrent the same as the last piece of the 3.mp4 file? So that by downloading the last piece of the torrent I will be downloading the last piece of 3.mp4?
And is there a way to target the last pieces of 2.mp4 and 1.mp4 as well?
Thank you.

Is the last piece of the torrent the same as the last piece of the 3.mp4 file? So that by downloading the last piece of the torrent I will be downloading the last piece of 3.mp4?
Yes
Is there a way to target the last pieces of 2.mp4 and 1.mp4 as well?
Yes
When a torrent is created, all the files in it are concatenated together and then chunked up into pieces.
example:
Files |-------------------#1|----------------#2|---------------------------#3|
Pieces |--0|--1|--2|--3|--4|--5|--6|--7|--8|--9|-10|-11|-12|-13|-14|-15|-16|17|
All pieces have the same length except the last one.
A file has one or more pieces.
A piece may contain (parts from) more than one file.
It's very rare that internal file and piece boundaries align (except when padding files are used).
A file in a multi-file torrent almost always has a piece shared with another file.
The metadata in the .torrent file contains the piece size, the file sizes and the exact order of the files.
The files are ordered as they appear on the files list in the .torrent file.
The order is decided by the torrent creator and may be arbitrary: file size, order on disk, alphabetical, random, creation date, name length, etc.
From that data, it's possible to calculate exactly in which piece, and at what offset within that piece, a specific file ends.
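For completeness, here is a minimal sketch of that calculation, assuming the third-party bencodepy package for decoding the .torrent metadata and a multi-file torrent as in the question; the file name example.torrent is a placeholder.
import bencodepy   # third-party bencode decoder, assumed to be available

def last_piece_per_file(torrent_path):
    meta = bencodepy.decode(open(torrent_path, "rb").read())
    info = meta[b"info"]
    piece_length = info[b"piece length"]
    offset = 0                        # running offset in the concatenated file stream
    result = []
    for f in info[b"files"]:          # multi-file torrent: files in metadata order
        end = offset + f[b"length"]   # one byte past this file's last byte
        name = b"/".join(f[b"path"]).decode("utf-8", "replace")
        result.append((name, (end - 1) // piece_length, (end - 1) % piece_length))
        offset = end
    return result

for name, piece, piece_offset in last_piece_per_file("example.torrent"):
    print(f"{name}: last byte is in piece {piece} at offset {piece_offset}")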

Related

Automatically gathering multiple variables with the same name into an array

I have a Python program which reads a word and its meanings (one or more) from a JSON file and creates audio for each meaning separately.
As a result, multiple audio files such as M-1.mp3, M-2.mp3, ..., M-n.mp3 are generated dynamically.
Later on I want to concatenate all the audio files into one audio clip with a function which requires the name of each audio file to be listed one by one.
Is there a way I can pass the filenames as a list to the function, provided that I know the number of audio files that I want to concatenate?
I want something like this:
One_Audio=concatenate(listnames("M-",3))
to get this:
One_Audio=concatenate("M-1.mp3","M-2.mp3","M-3.mp3")
effect.
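A minimal sketch of such a helper: concatenate is the asker's own function and is left undefined here; only the name-building part is shown.
def listnames(prefix, n, suffix=".mp3"):
    """Return ['M-1.mp3', 'M-2.mp3', ..., 'M-n.mp3'] for prefix 'M-'."""
    return [f"{prefix}{i}{suffix}" for i in range(1, n + 1)]

# If concatenate() wants one name per argument, unpack the list with *:
# One_Audio = concatenate(*listnames("M-", 3))
# If it accepts a list directly:
# One_Audio = concatenate(listnames("M-", 3))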

Microsoft translator engine customization: parallel txt files

I am trying to perform some NMT engine customization for Japanese but I am having some difficulties uploading parallel txt files. I've gathered 10k parallel sentences and I've put them into two txt files:
As the guide suggested, I've also been careful to remove sentences containing the \n and \r characters in them, but upon uploading I get the following:
What's wrong?
We display the sentence counts because the model training engine operates at the sentence level. The expected format of a parallel txt file set is one sentence per line. During the upload process we run a sentence breaker, which identifies end-of-sentence markers and splits accordingly; this is why the sentence count does not always match the line count. Sentences, not lines of the input file, are the units we operate on.
This is also why we suggest removing newline characters within sentences. The newline is considered an end of sentence marker, so having newlines within a sentence creates a false sentence break.
In response to your second concern, we do run a sentence aligning process on most data that is submitted. If there is an inconsistent number of sentences in the uploaded parallel files we can usually get most of the sentence pairs, as long as the sentences are fairly close.
After some "debugging" I've noticed that the number shown in the portal is the number of sentences (instead of lines, my bad!). I find it kind of confusing (and not really useful in my opinion). What would be the usefulness of displaying this information?
In addition, I've noticed that there is no warning if you upload one file containing fewer lines than the other, which would make the parallel files not parallel anymore: the whole point of parallel files is to have X lines in the source file and X lines in the target file. It would be helpful if at least a warning were shown to prevent mistakes; with parallel files, len(f1) != len(f2) is a great indicator that something is off.
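A quick client-side sanity check along those lines can catch mismatched files before uploading. This is a minimal sketch; the file names are placeholders.
def count_lines(path):
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)

src, tgt = "train.ja.txt", "train.en.txt"   # placeholder names
n_src, n_tgt = count_lines(src), count_lines(tgt)
if n_src != n_tgt:
    print(f"WARNING: {src} has {n_src} lines but {tgt} has {n_tgt}; the files are not parallel")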

Adding big chunks of custom data to a jpg image file

I wonder if there is an obvious and elegant way to add additional data to a JPEG while keeping it readable for standard image viewers. More precisely, I would like to embed a picture of the backside of a (scanned) photo into it. Old photos often have personal messages written on the back, be it the date or some notes. Sure, you could use EXIF and add some text, but an actual image of the back is preferable.
Sure I could also save 2 files xyz.jpg and xyz_back.jpg, or arrange both images side by side, always visible in one picture, but that's not what I'm looking for.
It is possible and has been done. On the Samsung Note 2 and 3, for example, you can add handwritten notes to the photos you've taken as an image, and some smartphones allow you to embed voice recordings in the image files while preserving the readability of those files on other devices.
There are two ways you can do this.
1) Use an Application Marker (APP0–APPF), the preferred method
2) Use a Comment Marker (COM)
If you use an APPn marker:
1) Do not make it the first APPn marker in the file. Every known JPEG file format expects some kind of format-specific APPn marker right after the SOI marker. Make sure that your marker does not go there.
2) Place a unique application identifier (null terminated string) at the start of the data (something done by convention).
All kinds of applications store additional data this way.
One issue is that the length field is only 16-bits (Big Endian format). If you have a lot of data, you will have to split it across multiple markers.
If you use a COM marker, make sure it comes after the first APPn marker in the file. However, I would discourage using a COM marker for something like this as it might choke applications that try to display the contents.
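Here is a minimal sketch of the APPn approach described above, following the advice given: a unique, null-terminated identifier at the start of each segment, data split into chunks that fit the 16-bit length field, and the new segments placed after the first APPn segment (e.g. APP0/JFIF or APP1/Exif). The marker APP11 (0xFFEB), the identifier "MYBACKSIDE" and the file names are arbitrary choices for illustration.
import struct

MARKER = b"\xFF\xEB"              # APP11, assumed not to clash with other software
IDENT = b"MYBACKSIDE\x00"         # unique application identifier, null-terminated
MAX_PAYLOAD = 65533 - len(IDENT)  # 16-bit length field counts itself (2 bytes)

def build_segments(data):
    """Split arbitrary data into one or more APP11 segments."""
    segments = b""
    for i in range(0, len(data), MAX_PAYLOAD):
        chunk = IDENT + data[i:i + MAX_PAYLOAD]
        segments += MARKER + struct.pack(">H", len(chunk) + 2) + chunk
    return segments

def embed(jpeg_bytes, data):
    """Insert the custom segments right after the first APPn segment."""
    assert jpeg_bytes[:2] == b"\xFF\xD8", "not a JPEG (missing SOI)"
    pos = 2
    # If the first segment is an APPn (0xFFE0-0xFFEF), skip over it so ours is not the first.
    if jpeg_bytes[pos] == 0xFF and 0xE0 <= jpeg_bytes[pos + 1] <= 0xEF:
        seg_len = struct.unpack(">H", jpeg_bytes[pos + 2:pos + 4])[0]
        pos += 2 + seg_len
    return jpeg_bytes[:pos] + build_segments(data) + jpeg_bytes[pos:]

# Usage with placeholder file names:
# with open("front.jpg", "rb") as f, open("back.jpg", "rb") as b:
#     out = embed(f.read(), b.read())
# with open("front_with_back.jpg", "wb") as o:
#     o.write(out)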
An interesting question. There are file formats that support multiple images per file (multipage TIFF comes to mind) but JPEG doesn't support this natively.
One feature of the JPEG file format is the concept of APP segments. These are regions of the JPEG file that can contain arbitrary information (as a sequence of bytes). Exif is actually stored in one of these segments, and is identified by a preamble.
Take a look at this page: http://www.sno.phy.queensu.ca/~phil/exiftool/#JPEG
You'll see many segments there that start with APP such as APP0 (which can store JFIF data), APP1 (which can contain Exif) and so forth.
There's nothing stopping you from storing data in one of these segments. Conformant JPEG readers will ignore this unrecognised data, but you could write software to store and retrieve data from within there. It's even possible to embed another JPEG file within such a segment! There's no precedent I know of for doing this, however.
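To illustrate the retrieval side, here is a minimal companion sketch that walks the JPEG segments and pulls the embedded data back out. It uses the same assumed APP11 marker and "MYBACKSIDE" identifier as the embedding sketch above and stops at the SOS marker, where compressed image data begins.
import struct

IDENT = b"MYBACKSIDE\x00"

def extract(jpeg_bytes):
    chunks = []
    pos = 2  # skip SOI
    while pos + 4 <= len(jpeg_bytes) and jpeg_bytes[pos] == 0xFF:
        marker = jpeg_bytes[pos + 1]
        if marker == 0xDA:          # SOS: compressed image data follows, stop scanning
            break
        seg_len = struct.unpack(">H", jpeg_bytes[pos + 2:pos + 4])[0]
        payload = jpeg_bytes[pos + 4:pos + 2 + seg_len]
        if marker == 0xEB and payload.startswith(IDENT):   # one of our APP11 segments
            chunks.append(payload[len(IDENT):])
        pos += 2 + seg_len
    return b"".join(chunks)

# Usage with a placeholder file name:
# with open("front_with_back.jpg", "rb") as f:
#     back_bytes = extract(f.read())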
Another option would be to include the second image as the thumbnail of the first. Thumbnails are normally very small, but nothing forces them to be. Some software might replace or remove the thumbnail, though.
In general I think using two files and a naming convention would be the simplest and least confusing, but you do have options.

How to quickly infer start/end time of files that only show start time?

I have a huge list of video files from a webcam that look like this:
video_123
video_456
video_789
...
Where each number (123, 456, and 789) represents the start time of the file in seconds since epoch. The files are created based on file size and are not always the same duration. There may also be gaps in the files (eg camera goes down for an hour). It is a custom file format that I cannot change.
I have a tool that can extract out portions of the video given a time range and a set of files. However, it will run MUCH faster if I only give the tool the files that have frames within the given range. It's very costly to determine the duration of each file. Instead, I'd like to use the start timestamp to rule out most files. For example, if I wanted video for 500-600, I know video_123 will not be needed because video_456 is larger. Also, video_789 is larger than 600 so it will not be needed either.
I could do an ls and iterate through each file, converting the timestamp to an int and comparing until we hit a file bigger than the desired range. I have a LOT of files and this is slow. Is there a faster method? I was thinking of having some sort of binary tree that could give log2(n) search time and already have the timestamps parsed out. I am doing most of this work in bash and would prefer to use simple, common tools like grep, awk, etc. However, I will consider Perl or some other scripting language if there is a compelling reason.
If you do several searches over the files, you can pre-process them by loading them into a bash array (note: bash, not sh), sorting them, and then doing a binary search. Assume for a moment that the name of each file is just the time tag; this simplifies the examples (you can always do ${variable/video_/} to strip the prefix).
First, load all the files into an array, sorted numerically:
files=( $(printf '%s\n' * | sort -n) )
Then implement the binary search. This sketch finds the index of the last file whose start time is at or before $min, i.e. the first file that can contain the requested range:
nfiles=${#files[@]}
lo=0
hi=$((nfiles - 1))
while (( lo < hi )); do
    mid=$(( (lo + hi + 1) / 2 ))
    if (( ${files[mid]} <= min )); then
        lo=$mid            # files[mid] starts early enough; keep it
    else
        hi=$((mid - 1))
    fi
done
first=$lo                  # files[first] is the earliest file that can contain $min
Repeat the same idea with $max to find the upper end of the range. Once the files are sorted in the array, repeated lookups are fast.
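Since the question mentions being open to another scripting language if there is a compelling reason, here is a minimal sketch of the same idea in Python using the standard bisect module; the directory path and time range are placeholders.
import bisect, os

def candidate_files(directory, start, end):
    names = [n for n in os.listdir(directory) if n.startswith("video_")]
    names.sort(key=lambda n: int(n.split("_", 1)[1]))    # numeric sort by epoch suffix
    times = [int(n.split("_", 1)[1]) for n in names]
    # the last file starting at or before `start` may still contain it
    lo = max(bisect.bisect_right(times, start) - 1, 0)
    # files starting at or after `end` cannot contain frames before it
    hi = bisect.bisect_left(times, end)
    return names[lo:hi]

# Example with a placeholder path: files that may hold video for seconds 500-600
# print(candidate_files("/var/webcam", 500, 600))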
Because of a quirk of UNIX design, there is no way to search for the name of a file in a directory other than stepping through the filenames one by one. So if you keep all your files in one directory, you're not going to get much faster than using ls.
That said, if you're willing to move your files around, you could turn your flat directory into a tree by splitting on the most significant digits. Instead of:
video_12301234
video_12356789
video_12401234
video_13579123
You could have:
12/video_12301234
12/video_12356789
12/video_12401234
13/video_13579123
or even:
12/30/video_12301234
12/35/video_12356789
12/40/video_12401234
13/57/video_13579123
For best results with this method, you'll want to have your files named with leading zeros so the numbers are all the same length.
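A minimal sketch of that sharding step, assuming the video_<epoch> naming from the question; the directory name and the two-digit prefix length are placeholders.
import os, shutil

def shard(directory, digits=2):
    """Move each video_<epoch> file into a subdirectory named after the leading digits of its timestamp."""
    for name in os.listdir(directory):
        if not name.startswith("video_"):
            continue
        stamp = name.split("_", 1)[1]
        sub = os.path.join(directory, stamp[:digits])
        os.makedirs(sub, exist_ok=True)
        shutil.move(os.path.join(directory, name), os.path.join(sub, name))

# shard("/var/webcam")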

Access MP3 audio data independently of ID3 tags?

This is a 2-part question. First, is it possible to access the audio data in an MP3 independently of the ID3 tags, and second, is there any way to do so using available libraries?
I recently consolidated my music collection from 3 computers and ended up with songs which had changed ID3 tags, but the audio data itself was unmodified. Running a search for duplicate files failed because the file changed with the ID3 tag change, but I think it should be possible to identify duplicate files if I just run a deduplication using the audio data for comparison.
I know that it's possible to seek to a particular position past the ID3 header in the file, and directly read the data, but was wondering if there's a library that would expose the audio data so I could just extract the data, run a checksum on it, and store the computed result somewhere, then look for identical checksums. (Also, I'd probably have to use some kind of library when you take into account variable length headers.)
Coincidentally I wanted to do something similar the other day.
Here is a Ruby script that I whipped up:
http://code.google.com/p/kodebucket/source/browse/trunk/bin/mp3dump.rb
It dumps the MPEG frames to stdout, so one could grab a checksum like so:
# mp3dump.rb file.mp3 | md5sum
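For a library-free alternative along the same lines, here is a minimal Python sketch that skips an ID3v2 tag at the front and an ID3v1 tag at the end before checksumming. It ignores details such as ID3v2 footers and APE tags, so treat it as a rough deduplication aid rather than a full MP3 parser.
import hashlib, sys

def audio_checksum(path):
    data = open(path, "rb").read()
    start, end = 0, len(data)
    if data[:3] == b"ID3":
        # ID3v2 size is stored as 4 "syncsafe" bytes (7 bits each),
        # not counting the 10-byte header itself.
        size = (data[6] << 21) | (data[7] << 14) | (data[8] << 7) | data[9]
        start = 10 + size
    if data[-128:-125] == b"TAG":      # trailing ID3v1 tag
        end = len(data) - 128
    return hashlib.md5(data[start:end]).hexdigest()

# print(audio_checksum(sys.argv[1]))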
