The structure of the uTorrent's uTorrentPartFile.dat - bittorrent

I'm trying to make a small utility that automates some maintenance tasks for uTorrent's pool of torrents. To verify the hashes of partially downloaded shares, I have to retrieve the parts of the pieces that are not completely contained in the downloaded files from the ~uTorrentPartFile_XXX.dat file where uTorrent keeps them. This raises two questions:
Given a certain .torrent file, how do I compute the name of the corresponding ~uTorrentPartFile_XXX.dat file (namely, the hexadecimal string that uTorrent uses instead of my XXX)?
Where can I find information about the inner structure of the file that would allow me to retrieve the required data from it? Google has failed to help.

The BiglyBT team has reverse engineered the ~uTorrentPartFile_XXXX.dat format when they created a migration plugin.
https://www.biglybt.com/download/utMigrate
https://github.com/BiglySoftware/BiglyBT-plugin-migratetorrentapp
From: https://github.com/BiglySoftware/BiglyBT-plugin-migratetorrentapp/blob/master/src/com/biglybt/plugins/migratetorrentapp/utorrent/PartFile.java
/**
* uTorrent partfile
* Basically, torrent data is split into 64k parts. Header has 4 byte index for
* each part, pointing to data if index is > 0.
* After the header is the 64k data chunks, first data chunk is 1, second is 2, etc.
* Last data chunk may be smaller than 64k.
*
* ~uTorrentPartFile_*<hexsize>*.dat
* <Header>, <data>
*
* hexsize
* torrent data length in bytes in hex with no leading 0
*
* Header
* <DataIndex>[<Num64kParts>]
* Raw header length = <Num64kParts> * 4
*
* Num64kParts
* How many parts is required if you split torrent data length into 64k sections.
* ie. Math.ceil(torrent data length in bytes / 64k)
*
* DataIndex
* 4 byte little endian integer. Values:
* 0
* No data for this 64k part
* 1..<num64Parts>
* 1-based positional index in <data>
* Location in part file can be calculated with
* (Header Size) + ((value - 1) * 64k)
*
* data
* <DataPart>[up to <num64kParts>]
*
* DataPart
* 64k byte array containing torrent data.
* Bytes in <DataPart> that are stored elsewhere in real files will be 0x00.
* ie. non-skipped files sharing the 64k part will be 0x00.
* Last <DataPart> may be less than 64k, which means the rest of the 64k would
* be 0x00 (and part of a non-skipped file)
*
*/
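The layout described in that comment can be sketched in Python as follows. This is only an illustration under assumptions, not BiglyBT's code: the function names are hypothetical, the upper-case hex in the file name is a guess (the comment only says "hex with no leading 0"), and header entry i is assumed to describe the i-th 64k slice of the torrent data.

```python
import struct

PART_SIZE = 64 * 1024  # the "64k" part size from the comment above


def part_file_name(torrent_length):
    # Hex of the total torrent data length, no leading zeros.
    # Upper-case digits are an assumption, not confirmed by the source.
    return "~uTorrentPartFile_%X.dat" % torrent_length


def num_64k_parts(torrent_length):
    # Integer form of ceil(torrent_length / 64k)
    return (torrent_length + PART_SIZE - 1) // PART_SIZE


def read_part(part_file_bytes, torrent_length, part_number):
    """Return the stored chunk for a 0-based 64k part, or None if absent."""
    header_size = num_64k_parts(torrent_length) * 4
    # Each header entry is a 4-byte little-endian integer
    (index,) = struct.unpack_from("<I", part_file_bytes, part_number * 4)
    if index == 0:
        return None  # no data stored for this 64k part
    start = header_size + (index - 1) * PART_SIZE
    # The last chunk in the file may be shorter than 64k
    return part_file_bytes[start:start + PART_SIZE]
```

To use it, read the whole part file into memory with open(path, "rb").read() and pass the total torrent data length taken from the .torrent metadata. Bytes that belong to non-skipped files come back as 0x00 and still have to be filled in from the real files before hashing a piece.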
Bonus
There is also some useful information about the content in resume.dat and settings.dat in the code comments here:
https://github.com/BiglySoftware/BiglyBT-plugin-migratetorrentapp/blob/master/src/com/biglybt/plugins/migratetorrentapp/utorrent/ResumeConstants.java
https://github.com/BiglySoftware/BiglyBT-plugin-migratetorrentapp/blob/master/src/com/biglybt/plugins/migratetorrentapp/utorrent/SettingsConstants.java


Where exactly should I write the OpenMP "THREADPRIVATE" directive for common blocks and SAVE variables in FORTRAN?

Can you please tell me exactly where to place the OpenMP THREADPRIVATE directive for a common block? Immediately after the declaration of the block, after the declaration of the block and all other variables of any kind, or only after all DATA statements? Example:
SUBROUTINE SCHNEVPDH (RZ,FLAT,FLON,R,T,L,BN,BE,BV)
PARAMETER (IBO=0,JBO=1,KDIM=8,LDIM=4)
DIMENSION FN(0:KDIM,0:KDIM), CONSTP(0:KDIM,0:KDIM)
DIMENSION CML(KDIM), SML(KDIM)
DIMENSION DELT(0:LDIM)
DIMENSION BINT(0:KDIM,0:KDIM,1-IBO-JBO:LDIM),
* BEXT(0:KDIM,0:KDIM,1-IBO-JBO:LDIM)
COMMON /CONST/UMR,PI
* /AMTB/BINT,BEXT,RE,TZERO,IFIT,IB,KINT,LINT,KEXT,
* LEXT,KMAX,FN
CHARACTER*1 IE,RESP
DATA ((CONSTP(N,M), M=0,N), N=0,KDIM)
* /4*1.,1.73205,0.866025,1.,2.44949,1.93649,0.7905691,1.,
* 3.16228,3.35410,2.09165,0.739510,1.,3.87298,5.12348,
* 4.18330,2.21853,0.701561,1.,4.58258,2*7.24569,4.96078,
* 2.32681,0.671693,1.,5.29150,9.72111,11.4564,9.49918,
* 5.69951,2.42182,0.647260,1.,6.,12.5499,16.9926,16.4531,
* 11.8645,6.40755,2.50683,0.626707/
!$OMP THREADPRIVATE(/CONST/)
!$OMP THREADPRIVATE(/AMTB/)
IBF = 2
T1=1.
T2=12.
CALL TBFIT (T1,T2,IBF,THINT,TZERO)
Is this directive in the right place? The same question applies to SAVE type variables.

Vim: sorting with levels

I have the following file with Markdown markup:
* [B](B)
  * b
  * c
* [C](C)
* [A](a)
  * a
I try to sort it and get the following result:
* [A](a)
  * a
* [B](B)
  * b
  * c
* [C](C)
It is necessary to sort only the main levels, and sub-levels must follow the main levels, i.e., stay at the levels where they were. The first thing that comes to mind is of course :sort; but unfortunately this will also sort the sub-levels. We will get:
  * a
  * b
  * c
* [A](a)
* [B](B)
* [C](C)
Are there any tricks or plugins for this kind of sorting? Thx!
The usual approach for this class of problems is to inline each block, then sort them, and then "de-inline" them back to their original state.
First step: inline each block.
We do this by replacing each EOL followed by SPACE-SPACE-STAR with some fancy symbol unlikely to be found in our document:
:%s/\n\(  \*\)/§\1
Which gives us the following:
* [B](B)§  * b§  * c
* [C](C)
* [A](a)§  * a
Second step: sort the buffer.
We simply use :help :sort:
:sort
to obtain this:
* [A](a)§  * a
* [B](B)§  * b§  * c
* [C](C)
Third step: revert each "block" to its initial state.
We do this by reverting the substitution above with another, much simpler, one:
:%s/§/\r
which gives us the desired outcome:
* [A](a)
  * a
* [B](B)
  * b
  * c
* [C](C)
A couple of notes:
The exact pattern to use in the first substitution depends on the exact structure of your document. That part is, IMO, too highly contextual to be generalisable.
§ is just an example, use whatever symbol you want.
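For comparison, the same group-then-sort idea outside Vim could look roughly like this in Python. It is a sketch that assumes sub-levels are the lines starting with whitespace; the function name is made up for illustration. Unlike the Vim version, which sorts the whole inlined line, it sorts on the top-level line only.

```python
def sort_top_level(lines):
    """Sort top-level items while keeping each item's sub-levels attached."""
    blocks = []
    for line in lines:
        if line.startswith((" ", "\t")) and blocks:
            # Indented line: belongs to the most recent top-level item
            blocks[-1].append(line)
        else:
            # Unindented line: starts a new block
            blocks.append([line])
    # Order the blocks by their top-level line; children stay in place
    blocks.sort(key=lambda block: block[0])
    return [line for block in blocks for line in block]
```

Calling it on the buffer's lines from the question returns the A block, then the B block with its two children, then C.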

Working out sample rate and bit depth of aiff audio from file size

I need some help with Maths/logic here. Working with aif files.
I have written the following:
LnByte = FileLen(ToCheck) 'Returns Filesize in Bytes
LnBit = LnByte * 8 'Get filesize in Bits
Chan = 1 'Channels in audio: mono = 1
BDpth = 24 'Bit Depth
SRate = 48000 'Sample Rate
BRate = 1152000 'Expected Bit Rate
Time_Secs = LnBit / Chan / BDpth / SRate 'Size in Bits / Channels / Bit Depth / Sample Rate
FSize = (BRate / 8) * Time_Secs '(Bitrate / 8) * Length of file in seconds
ToCheck is the current file when looping through a folder of files.
So I'm finding the length of audio based on the file size in bits / channels / bit depth / sample rate. This assumes that the bit depth and sample rate are correct (I need the files to be 24-bit/48kHz).
Time_Secs = Length of the file in seconds.
FSize = File size based on 24/48kHz using the Time_Secs
Probably because the FSize uses Time_Secs, I can't work out how to, from this, work out if the file sample rate and/or bit depth are indeed correct...
Assuming 24/48k should give 144,000 Bytes per second
Assuming 16/48k should give 96,000 Bytes per second
If I check a file that is 16-bit/48 kHz using the above code, it gives the incorrect time in seconds (naturally) but the correct file size, even though the bit rate of 1,152,000 should be wrong.
-- It would seem that the difference in time is making up for the difference in Bit Rate - or I'm looking at it wrong.
How would I adapt my formula, or do the maths to work out if the sample rate/bit depth of a file is actually 48,000 Hz /24-bit? Or is there a different way entirely? Remembering that they are aif files, not wavs.
Hope that makes sense.
Many Thanks in advance!
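The cancellation described above can be reproduced with a short Python sketch of the same arithmetic. The function name is hypothetical and header bytes are ignored. Because the expected bit rate is exactly channels * bit depth * sample rate, the assumed format divides out and the result is always the original file size, so this formula alone cannot reveal a wrong bit depth or sample rate.

```python
def round_trip_size(file_bytes, channels=1, bit_depth=24, sample_rate=48000):
    """Redo the Time_Secs / FSize calculation for an assumed audio format."""
    # Expected bit rate for the assumed format (1,152,000 for mono 24-bit/48 kHz)
    bit_rate = channels * bit_depth * sample_rate
    # Duration computed from the file size under the assumed format
    time_secs = file_bytes * 8 / channels / bit_depth / sample_rate
    # File size computed back from that duration: the assumptions cancel out
    return (bit_rate / 8) * time_secs
```

For example, 10 seconds of mono 16-bit/48 kHz audio is 960,000 bytes; feeding that in with the 24-bit assumption yields about 6.67 "seconds" but still returns roughly 960,000 bytes.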

Apache Spark - shuffle writes more data than the size of the input data

I use Spark 2.1 in local mode and I'm running this simple application.
val N = 10 << 20
sparkSession.conf.set("spark.sql.shuffle.partitions", "5")
sparkSession.conf.set("spark.sql.autoBroadcastJoinThreshold", (N + 1).toString)
sparkSession.conf.set("spark.sql.join.preferSortMergeJoin", "false")
val df1 = sparkSession.range(N).selectExpr(s"id as k1")
val df2 = sparkSession.range(N / 5).selectExpr(s"id * 3 as k2")
df1.join(df2, col("k1") === col("k2")).count()
Here, the range(N) creates a dataset of Long (with unique values), so I assume that the size of
df1 = N * 8 bytes ~ 80MB
df2 = N / 5 * 8 bytes ~ 16MB
Ok now let's take df1 as an example.
df1 consists of 8 partitions and shuffledRDDs of 5, so I assume that
# of mappers (M) = 8
# of reducers (R) = 5
As the # of partitions is low, Spark will use the Hash Shuffle, which will create M * R files on disk, but I haven't understood whether every file contains all the data (each_file_size = data_size, giving M * R * data_size in total) or whether all the files together add up to data_size.
However when executing this app, shuffle write of df1 = 160MB which doesn't match either of the above cases.
Spark UI
What am I missing here? Why has the shuffle write data doubled in size?
First of all, let's see what data size total(min, med, max) means:
According to SQLMetrics.scala#L88 and ShuffleExchange.scala#L43, the data size total(min, med, max) we see is the final value of the dataSize metric of the shuffle. Then, how is it updated? It gets updated each time a record is serialized: UnsafeRowSerializer.scala#L66 calls dataSize.add(row.getSizeInBytes) (UnsafeRow is the internal representation of records in Spark SQL).
Internally, UnsafeRow is backed by a byte[] and is copied directly to the underlying output stream during serialization; its getSizeInBytes() method just returns the length of the byte[]. Therefore, the initial question becomes: why is the byte representation twice as big as the single long column a record has? The UnsafeRow.scala doc gives us the answer:
Each tuple has three parts: [null bit set] [values] [variable length portion]
The bit set is used for null tracking and is aligned to 8-byte word boundaries. It stores one bit per field.
Since the bit set is aligned to 8-byte words, the single null bit takes up a full 8 bytes, the same width as the long column. Therefore, each UnsafeRow represents your one-long-column row using 16 bytes.
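The arithmetic can be checked with a small Python sketch. It only models rows made of fixed-width 8-byte fields, and the bit set formula is written from the layout quoted above (one bit per field, rounded up to 8-byte words); the function name is hypothetical and not part of Spark.

```python
def unsafe_row_size(num_fields):
    """Approximate size in bytes of a row with only fixed-width 8-byte fields."""
    # Null bit set: one bit per field, rounded up to whole 8-byte words
    bitset_bytes = ((num_fields + 63) // 64) * 8
    # Each fixed-width value occupies one 8-byte slot
    values_bytes = num_fields * 8
    return bitset_bytes + values_bytes


N = 10 << 20                                # rows in df1
shuffle_bytes = N * unsafe_row_size(1)      # 16 bytes per row
shuffle_mb = shuffle_bytes / (1 << 20)      # 160.0
```

With one long column the row is 16 bytes, so the 10 << 20 rows of df1 account for the 160MB shown in the UI instead of the expected 80MB.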

Base64: What is the worst possible increase in space usage?

If a server received a base64 string and wanted to check its length before converting, say it wanted to always permit the final byte array to be 16KB. How big could a 16KB byte array possibly become when converted to a Base64 string (assuming one byte per character)?
Base64 encodes each set of three bytes into four bytes. In addition the output is padded to always be a multiple of four.
This means that the size of the base-64 representation of a string of size n is:
ceil(n / 3) * 4
So, for a 16kB array, the base-64 representation will be ceil(16*1024/3)*4 = 21848 bytes long ~= 21.8kB.
A rough approximation would be that the size of the data is increased to 4/3 of the original.
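As a quick check of the formula, here is a Python sketch that compares it with the standard library encoder. The function name is made up for illustration, and it assumes padded output with no line breaks.

```python
import base64


def base64_encoded_length(n):
    """Length of padded base64 output for n input bytes: ceil(n / 3) * 4."""
    return (n + 2) // 3 * 4


# Largest encoded string a server needs to accept for a 16 KB payload
limit = base64_encoded_length(16 * 1024)                     # 21848
# Cross-check against the real encoder
actual = len(base64.b64encode(bytes(16 * 1024)))             # 21848
```

A server could reject any incoming string longer than limit before decoding it.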
From Wikipedia
Note that given an input of n bytes, the output will be (n + 2 - ((n + 2) % 3)) / 3 * 4 bytes long, so that the number of output bytes per input byte converges to 4 / 3 or 1.33333 for large n.
So 16 kB * 4 / 3 gives a little over 21.3 kB, or 21,848 bytes to be exact.
Hope this helps
16kb is 131,072 bits. Base64 packs 24-bit buffers into four 6-bit characters apiece, so you would have 5,462 * 4 = 21,848 bytes.
Since the question was about the worst possible increase, I must add that there are usually line breaks at around each 80 characters. This means that if you are saving base64 encoded data into a text file on Windows it will add 2 bytes, on Linux 1 byte for each line.
The increase from the actual encoding has been described above.
This is a future reference for myself. Since the question is about the worst case, we should take line breaks into account. While RFC 1421 defines the maximum line length to be 64 chars, RFC 2045 (MIME) states there'd be 76 chars in one line at most.
The latter is what the C# library has implemented. So in a Windows environment, where a line break is 2 chars (\r\n), we get this: Length = Floor(Ceiling(N/3) * 4 * 78 / 76)
Note: The floor is there because the last line gets no trailing line break. In my test with C#, if the last line ends at exactly 76 chars, no line break follows it, so in that case the formula overestimates by the 2 chars of one line break.
I can prove it by running the following code:
byte[] bytes = new byte[16 * 1024];
Console.WriteLine(Convert.ToBase64String(bytes, Base64FormattingOptions.InsertLineBreaks).Length);
The answer for 16 kBytes encoded to base64 with 76-char lines: 22422 chars
I assume that on Linux it would be Length = Floor(Ceiling(N/3) * 4 * 77 / 76), but I haven't gotten around to testing it on .NET Core yet.
Also it would depend on actual character encoding, i.e. if we encode to UTF-32 string, each base64 character would consume 3 additional bytes (4 byte per char).
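The line-break arithmetic above can also be written by counting the breaks directly instead of scaling by 78 / 76. This is a Python sketch, not the C# implementation; the function name and defaults are hypothetical, and it assumes a break is inserted only between lines, never after the last one.

```python
def base64_length_with_breaks(n, line_len=76, eol_len=2):
    """Encoded length for n bytes with a line break after every full line
    except the last one (eol_len=2 for \\r\\n, 1 for \\n)."""
    chars = (n + 2) // 3 * 4                      # plain base64 length
    # Number of breaks: one between each pair of consecutive lines
    breaks = (chars - 1) // line_len if chars else 0
    return chars + breaks * eol_len


windows_len = base64_length_with_breaks(16 * 1024)             # 22422
linux_len = base64_length_with_breaks(16 * 1024, eol_len=1)    # 22135
```

For 16 KB this agrees with both floor formulas, and it also covers the case where the last line is exactly 76 chars long.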
