Generating a fake ISBN from book title? (Or: How to hash a string into a 6-digit numeric ID)

Short version: How can I turn an arbitrary string into a 6-digit number with minimal collisions?
Long version:
I'm working with a small library that has a bunch of books with no ISBNs. These are usually older, out-of-print titles from tiny publishers that never got an ISBN to begin with, and I'd like to generate fake ISBNs for them to help with barcode scanning and loans.
Technically, real ISBNs are controlled by commercial entities, but it is possible to use the format to assign numbers that belong to no real publisher (and so shouldn't cause any collisions).
The format is such that:
978-0-01-######-?
Gives you 6 digits to work with, from 000000 to 999999, with the ? at the end being a checksum.
Would it be possible to turn an arbitrary book title into a 6-digit number in this scheme with minimal chance of collisions?

After using code snippets for making a fixed-length hash and calculating the ISBN-13 checksum, I managed to create really ugly C# code that seems to work. It'll take an arbitrary string and convert it into a valid (but fake) ISBN-13:
private const uint MUST_BE_LESS_THAN = 1000000; // one million, so the hash fits in 6 digits

public int GetStableHash(string s)
{
    uint hash = 0;
    // if you care, this can be done much faster with unsafe
    // using fixed char* reinterpreted as a byte*
    foreach (byte b in System.Text.Encoding.Unicode.GetBytes(s))
    {
        hash += b;
        hash += (hash << 10);
        hash ^= (hash >> 6);
    }
    // final avalanche
    hash += (hash << 3);
    hash ^= (hash >> 11);
    hash += (hash << 15);
    // helpfully we only want a positive integer < MUST_BE_LESS_THAN
    // so a simple truncating cast is OK, if not perfect
    return (int)(hash % MUST_BE_LESS_THAN);
}
public int CalculateChecksumDigit(ulong n)
{
    string sTemp = n.ToString();
    int iSum = 0;

    // Calculate the checksum digit here.
    // This appears to be backwards, but the EAN-13 checksum
    // must be calculated this way to be compatible with UPC-A:
    // digits in even positions are weighted 3, digits in odd
    // positions are weighted 1.
    for (int i = sTemp.Length; i >= 1; i--)
    {
        int iDigit = Convert.ToInt32(sTemp.Substring(i - 1, 1));
        if (i % 2 == 0)
        {
            iSum += iDigit * 3; // even position
        }
        else
        {
            iSum += iDigit * 1; // odd position
        }
    }
    return (10 - (iSum % 10)) % 10;
}
private void generateISBN()
{
    string titlehash = GetStableHash(BookTitle.Text).ToString("D6");
    string fakeisbn = "978001" + titlehash;
    string check = CalculateChecksumDigit(Convert.ToUInt64(fakeisbn)).ToString();
    SixDigitID.Text = fakeisbn + check;
}

The 6 digits allow for 1M possible values (000000 through 999999), which should be enough for most internal uses.
I would have used a sequence instead in this case, because a 6-digit hash has a relatively high chance of collisions.
You can insert all the titles into a table and use their index numbers as the ISBNs, either after sorting or without it.
This makes collisions almost impossible, but it requires keeping the set of "allocated" ISBNs to avoid collisions in the future, as well as the list of titles already in store; that is information you would most probably want to keep anyway.
Another option is to break the ISBN standard and use hexadecimal/uuencoded barcodes, which may increase the possible range to the point where a cryptographic hash truncated to fit may work.
Since you are handling old book titles, which may have several editions capitalized and punctuated differently, I would strip punctuation, collapse duplicated whitespace, and convert everything to lowercase before hashing, to minimize the chance of a technical duplicate even though the strings differ (unless you want different editions to have different ISBNs, in which case you can ignore this paragraph).
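A minimal sketch of that normalization (in C++ for illustration, since the original code is C#; it assumes ASCII titles, and the helper name is my own):

#include <cctype>
#include <string>

// Normalize a title before hashing so that "The Hobbit!" and
// "the hobbit " map to the same fake ISBN: lowercase everything,
// drop punctuation, collapse whitespace runs to single spaces.
std::string normalizeTitle(const std::string& title) {
    std::string out;
    bool lastWasSpace = true;  // true at the start, so leading spaces are trimmed
    for (unsigned char c : title) {
        if (std::isalnum(c)) {
            out += static_cast<char>(std::tolower(c));
            lastWasSpace = false;
        } else if (std::isspace(c) && !lastWasSpace) {
            out += ' ';
            lastWasSpace = true;
        }
        // all other characters (punctuation) are dropped
    }
    if (!out.empty() && out.back() == ' ') out.pop_back();
    return out;
}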

Related

Maximum repeating substring of size n

Find the substring of length n that repeats a maximum number of times in a given string.
Input: abbbabbbb# 2
Output: bb
My solution:
import java.util.Arrays;

public static String mrs(String s, int m) {
    int n = s.length();
    String[] suffixes = new String[n - m + 1];
    for (int i = 0; i < n - m + 1; i++) {
        suffixes[i] = s.substring(i, i + m);
    }
    Arrays.sort(suffixes);
    // scan the sorted array for the longest run of equal substrings
    String ans = suffixes[0], tmp = suffixes[0];
    int cnt = 1, max = 1;
    for (int i = 0; i < n - m; i++) {
        if (suffixes[i].equals(suffixes[i + 1])) {
            cnt++;
        } else {
            cnt = 1;
            tmp = suffixes[i + 1];
        }
        if (cnt > max) { // checked inside the loop so the final run is counted too
            max = cnt;
            ans = tmp;
        }
    }
    return ans;
}
Can it be done better than the above solution, which takes O(nm log n) time for the sort and O(nm) space for the substring array?
For a string of length L and a given substring length k (so as not to mix up n and m, which the question uses interchangeably at times), we can compute polynomial hashes of all substrings of length k in O(L) (see Wikipedia for some elaboration on this subproblem).
Now, if we map the hash values to the number of times they occur, we get the most frequent value in O(L) (with a HashMap, with high probability) or in O(L log L) (with a TreeMap).
After that, just take the substring which got the most frequent hash as the answer.
This solution does not take hash collisions into account.
The idea is to just reduce the probability of collisions enough for the application (if it's too high, use multiple hashes, for example).
If the application demands that we absolutely never give a wrong answer, we can check the answer in O(L) with another algorithm (KMP, for example), and re-run the whole solution with a different hash function as long as the answer turns out to be wrong.
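A sketch of that counting approach (in C++ rather than the question's Java; the base and the modulo-2^64-via-unsigned-overflow choice are mine, and collisions are ignored as discussed above):

#include <cstddef>
#include <string>
#include <unordered_map>

// Return the length-k substring that occurs most often, judged by its
// polynomial rolling hash. Assumes 1 <= k <= s.size().
std::string mostRepeated(const std::string& s, std::size_t k) {
    const unsigned long long B = 1315423911ULL;  // arbitrary odd base
    unsigned long long pow = 1, h = 0;
    for (std::size_t i = 0; i < k; ++i) {        // hash of the first window
        h = h * B + (unsigned char)s[i];
        if (i + 1 < k) pow *= B;                 // pow = B^(k-1)
    }
    std::unordered_map<unsigned long long, int> count;
    std::size_t best = 0;
    int bestCount = count[h] = 1;
    for (std::size_t i = k; i < s.size(); ++i) { // roll the window forward
        h = (h - (unsigned char)s[i - k] * pow) * B + (unsigned char)s[i];
        if (++count[h] > bestCount) {
            bestCount = count[h];
            best = i - k + 1;                    // start of the most frequent window
        }
    }
    return s.substr(best, k);
}

On the example input, mostRepeated("abbbabbbb#", 2) returns "bb".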

Generate 50 random numbers and store them into an array (C++)

This is what I have of the function so far. It is only the beginning of the problem: it asks to generate the random numbers in a 10-by-5 group for the output, and after that they are to be sorted by size, but I am just trying to get this first part down.
/* Populate the array with 50 randomly generated integer values
 * in the range 1-50. */
void populateArray(int ar[], const int n) {
    int n;
    for (int i = 1; i <= length - 1; i++) {
        for (int i = 1; i <= ARRAY_SIZE; i++) {
            i = rand() % 10 + 1;
            ar[n]++;
        }
    }
}
First of all, we want to use std::array. It has some nice properties, one of which is that it doesn't decay to a pointer. Another is that it knows its own size. In this case we are going to use templates to make populateArray a generic enough algorithm.
template<std::size_t N>
void populateArray(std::array<int, N>& array) { ... }
Then, we would like to remove all "raw" for loops. std::generate_n in combination with some random generator seems a good option.
For the number generator we can use <random>, specifically std::uniform_int_distribution. For that we need to get a generator up and running:
std::random_device device;
std::mt19937 generator(device());
std::uniform_int_distribution<> dist(1, N);
and use it in our std::generate_n algorithm:
std::generate_n(array.begin(), N, [&dist, &generator]() {
    return dist(generator);
});
Live demo
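Putting those pieces together, a complete sketch might look like this (my own assembly of the snippets above, with a main and the follow-up sorting step added for illustration):

#include <algorithm>
#include <array>
#include <iostream>
#include <random>

template <std::size_t N>
void populateArray(std::array<int, N>& array) {
    std::random_device device;
    std::mt19937 generator(device());
    std::uniform_int_distribution<> dist(1, N);  // values in 1..N
    std::generate_n(array.begin(), N, [&dist, &generator]() {
        return dist(generator);
    });
}

int main() {
    std::array<int, 50> values{};
    populateArray(values);
    std::sort(values.begin(), values.end());  // the "sorted by size" part
    for (int v : values) std::cout << v << ' ';
    std::cout << '\n';
}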

How to find the longest continuous sub-string in a string?

For example, there is a given string consisting of 1s and 0s:
s = "00000000001111111111100001111111110000";
What is the efficient way to get the length of the longest run of 1s in s? (11)
What is the efficient way to get the length of the longest run of 0s in s? (10)
I would appreciate the question being answered from an algorithmic perspective.
I think the most straightforward way is to walk through the bit string while recording the max lengths of the all-0 and all-1 substrings. This is of O(n) complexity, as suggested by others.
If you can afford some sort of data-parallel computation, you might want to look at parallel patterns as explained here. Specifically, take a look at parallel reduction. I think this problem can be solved in O(log n) time if you can afford one of those methods.
I'm trying to think of a parallel reduction for this problem:
On the first level of the reduction, each thread will process a chunk of the bit string (say, 8 bits each, depending on the number of threads you have and the length of the string) and produce a summary of it like: 0 -> x, 1 -> y, 0 -> z, ....
On the next level, each thread will merge two of these summaries into one; any possible joins are performed at this phase (basically, if the previous summary ended with a 0 (or 1) and the next summary begins with a 0 (or 1), the last entry of one and the first entry of the other can be collapsed into a single run). A sketch of this merge step follows below.
On the top level there will be just one structure with the overall summary of the bit string, which you'll have to step through to figure out the largest sequences (but this time they are all in summary form, so it should be faster). Or you can make each summary structure keep track of the largest 0 and 1 substrings, which makes walking the final structure unnecessary.
I guess this approach only makes sense in a very limited scope, but since you seem to be very keen on getting better than O(n)...
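Here is what the merge step of that reduction might look like (my own sketch; a "summary" here is a run-length list, e.g. "0011101" -> {0,2},{1,3},{0,1},{1,1}):

#include <cstddef>
#include <utility>
#include <vector>

using Summary = std::vector<std::pair<int, int>>;  // (bit, run length) pairs

// Merge two adjacent summaries, collapsing the boundary runs
// when they carry the same bit.
Summary merge(const Summary& left, const Summary& right) {
    Summary out = left;
    std::size_t skip = 0;
    if (!out.empty() && !right.empty() && out.back().first == right.front().first) {
        out.back().second += right.front().second;  // join across the boundary
        skip = 1;
    }
    out.insert(out.end(), right.begin() + skip, right.end());
    return out;
}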
OK, here is one solution I came up with; I'm not sure whether it is bug-free. Correct me if you discover a bug or suggest a better way to do it. Vote for it if you agree with this solution. Thanks!
#include <iostream>
using namespace std;

int main() {
    int s[] = {0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0};
    int length = sizeof(s) / sizeof(s[0]);
    int one_start = 0;
    int one_n = 0;
    int max_one_n = 0;
    int zero_start = 0;
    int zero_n = 0;
    int max_zero_n = 0;
    for (int i = 0; i < length; i++) {
        // Track runs of 1s
        if (one_start == 0 && s[i] == 1) {
            one_start = 1;
            one_n++;
        }
        else if (one_start == 1 && s[i] == 1) {
            one_n++;
        }
        else if (one_start == 1 && s[i] == 0) {
            one_start = 0;
            if (one_n > max_one_n) {
                max_one_n = one_n;
            }
            one_n = 0; // Reset
        }
        // Track runs of 0s
        if (zero_start == 0 && s[i] == 0) {
            zero_start = 1;
            zero_n++;
        }
        else if (zero_start == 1 && s[i] == 0) {
            zero_n++;
        }
        else if (zero_start == 1 && s[i] == 1) {
            zero_start = 0;
            if (zero_n > max_zero_n) {
                max_zero_n = zero_n;
            }
            zero_n = 0; // Reset
        }
    }
    // Flush the trailing runs
    if (one_n > max_one_n) {
        max_one_n = one_n;
    }
    if (zero_n > max_zero_n) {
        max_zero_n = zero_n;
    }
    cout << "max_one_n: " << max_one_n << endl;
    cout << "max_zero_n: " << max_zero_n << endl;
    return 0;
}
Worst case is always O(n): you can always construct an input which forces the algorithm to check every bit.
But you can probably get the average case slightly better than that (more simply if you scan for just 0s or just 1s, not both), because you can skip ahead by the length of the currently longest found sequence and scan backwards. At the very least this reduces the constant factor of O(n); and, at least with random input, more items also mean longer sequences, and thus longer and longer skips. But the difference from O(n) will not be much...
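A sketch of that skip-ahead idea for the 1s case (my own; the key fact is that any run longer than best starting in [i, i + best] must cover position i + best, so if that position holds a 0 the whole stretch can be skipped):

#include <algorithm>
#include <cstddef>
#include <string>

std::size_t longestOnes(const std::string& s) {
    std::size_t best = 0, i = 0, n = s.size();
    while (i + best < n) {
        std::size_t probe = i + best;
        if (s[probe] != '1') {         // no run longer than best starts in [i, probe]
            i = probe + 1;
            continue;
        }
        std::size_t start = probe;     // scan backwards to the run's start
        while (start > i && s[start - 1] == '1') --start;
        std::size_t end = probe + 1;   // scan forwards to the run's end
        while (end < n && s[end] == '1') ++end;
        best = std::max(best, end - start);
        i = end + 1;                   // s[end] is a 0 (or the string ended)
    }
    return best;
}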

Checksum Algorithm Producing Unpredictable Results

I'm working on a checksum algorithm, and I'm having some issues. The kicker is: when I hand-craft a "fake" message that is substantially smaller than the "real" data I'm receiving, I get a correct checksum. However, against the real data the checksum does not work properly.
Here's some information on the incoming data/environment:
This is a groovy project (see code below)
All bytes are to be treated as unsigned integers for the purpose of checksum calculation
You'll notice some finagling with shorts and longs in order to make that work.
The size of the real data is 491 bytes.
The size of my sample data (which appears to add correctly) is 26 bytes
None of my hex-to-decimal conversions are producing a negative number, as best I can tell
Some bytes in the file are not added to the checksum. I've verified that the switch controlling this works properly, and fires only when it is supposed to, so that's not the issue.
My calculated checksum, and the checksum packaged with the real transmission always differ by the same amount.
I have manually verified that the checksum packaged with the real data is correct.
Here is the code:
// add bytes to checksum
public void addToChecksum(byte[] bytes) {
    // if the checksum isn't enabled, don't add
    if (!checksumEnabled) {
        return;
    }
    long previouschecksum = this.checksum;
    for (int i = 0; i < bytes.length; i++) {
        byte[] tmpBytes = new byte[2];
        tmpBytes[0] = 0x00;
        tmpBytes[1] = bytes[i];
        ByteBuffer tmpBuf = ByteBuffer.wrap(tmpBytes);
        long computedBytes = tmpBuf.getShort();
        logger.info(getHex(bytes[i]) + " = " + computedBytes);
        this.checksum += computedBytes;
    }
    if (this.checksum < previouschecksum) {
        logger.error("Checksum DECREASED: " + this.checksum);
    }
    //logger.info("Checksum: " + this.checksum);
}
If anyone can find anything in this algorithm that could be causing drift from the expected result, I would greatly appreciate your help in tracking this down.
I don't see a line in your code where you reset this.checksum. That way you should always get this.checksum > previouschecksum, right? Is this intended?
Otherwise I can't find a flaw in the code above. Maybe your this.checksum is of the wrong type (short, for instance), which could roll over so that you get negative values.
Here is an example of such behaviour:
import java.nio.ByteBuffer

short checksum = 0
byte[] bytes = new byte[491]
def count = 260
for (def i = 0; i < count; i++) {
    bytes[i] = 255
}
bytes.each { b ->
    byte[] tmpBytes = new byte[2];
    tmpBytes[0] = 0x00;
    tmpBytes[1] = b;
    ByteBuffer tmpBuf = ByteBuffer.wrap(tmpBytes);
    long computedBytes = tmpBuf.getShort();
    checksum += computedBytes
    println "${b} : ${computedBytes}"
}
println checksum + " != " + 255 * count
Just play around with the value of the count variable, which corresponds roughly to the length of your input.
Your checksum will keep incrementing until it rolls over to negative (as it is a signed long integer).
You can also shorten your method to:
public void addToChecksum(byte[] bytes) {
    // if the checksum isn't enabled, don't add
    if (!checksumEnabled) {
        return;
    }
    long previouschecksum = this.checksum;
    this.checksum += bytes.inject( 0L ) { tot, it -> tot += it & 0xFF }
    if (this.checksum < previouschecksum) {
        logger.error("Checksum DECREASED: " + this.checksum);
    }
    //logger.info("Checksum: " + this.checksum);
}
But that won't stop it rolling over to negative. For the sake of saving 12 bytes per item that you are generating a hash for, I would still suggest something like MD5, which is known to work, rather than rolling your own... However, I understand sometimes there are crazy requirements you have to stick to...
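To see both effects concretely, here is a small standalone demo (my own, in C++ for illustration): masking with 0xFF reinterprets a signed byte as unsigned, and a too-narrow accumulator silently wraps around, which is how a checksum can come out negative or just wrong.

#include <cstdint>
#include <iostream>

int main() {
    int8_t b = -1;                    // the byte 0xFF
    std::cout << (b & 0xFF) << '\n';  // prints 255: masked to unsigned

    int16_t shortSum = 0;             // like accumulating into a short
    int64_t longSum = 0;
    for (int i = 0; i < 260; ++i) {   // 260 bytes of 0xFF, as in the Groovy demo
        shortSum = static_cast<int16_t>(shortSum + 255);  // wraps modulo 2^16
        longSum += 255;
    }
    std::cout << shortSum << " != " << longSum << '\n';  // 764 != 66300
}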

Efficient string sorting algorithm

Sorting strings by comparisons (e.g. standard quicksort plus a strcmp-like function) may be a bit slow, especially for long strings sharing a common prefix (the comparison function takes O(s) time, where s is the length of the string), so a standard solution has complexity O(s * n log n). Are there any known faster algorithms?
If you know that the strings consist only of certain characters (which is almost always the case), you can use a variant of BucketSort or RadixSort.
You could build a trie, which should be O(s*n), I believe.
Please search for "Sedgewick multikey quicksort" (Sedgewick wrote famous algorithms textbooks in C and Java). His algorithm is relatively easy to implement and quite fast, and it avoids the problem you are describing above. There is also the burstsort algorithm, which claims to be faster, but I don't know of any implementation.
There is an article Fast String Sort in C# and F# that describes the algorithm and has a reference to Sedgewick's code as well as to C# code. (disclosure: it's an article and code that I wrote based on Sedgewick's paper).
Summary
I found the string_sorting repo by Tommi Rantala comprehensive: it includes many known efficient (string) sorting algorithms, e.g. MSD radix sort, burstsort and multikey quicksort. In addition, most of them are also cache-efficient.
My Experience
It appears to me that three-way radix/string quicksort is one of the fastest string sorting algorithms. MSD radix sort is also a good one. Both are introduced in Sedgewick's excellent Algorithms book.
Here are some results to sort leipzig1M.txt taken from here:
$ wc leipzig1M.txt
# lines words characters
1'000'000 21'191'455 129'644'797 leipzig1M.txt
Method                  Time
------                  ----
Hoare                   7.8792s
Quick3Way               7.5074s
Fast3Way                5.78015s
RadixSort               4.86149s
Quick3String            4.3685s
Heapsort                32.8318s
MergeSort               16.94s
std::sort/introsort     6.10666s
MSD+Q3S                 3.74214s
The charming thing about three-way radix/string quicksort is that it is really simple to implement; the core partitioning is effectively only about ten source lines of code.
#include <algorithm>    // std::iter_swap
#include <cstring>      // std::strcmp
#include <iterator>     // std::iterator_traits
#include <type_traits>  // std::make_unsigned_t

// Minimal character-type trait so the sort works for both std::string
// and C-style strings (the original repo provides its own version).
template<typename String>
struct string_traits { typedef typename String::value_type value_type; };
template<typename CharT>
struct string_traits<CharT*> { typedef CharT value_type; };

template<typename RandomIt>
void insertion_sort(RandomIt first, RandomIt last, size_t d)
{
    const int len = last - first;
    for (int i = 1; i < len; ++i) {
        // insert a[i] into the sorted sequence a[0..i-1]
        for (int j = i; j > 0 && std::strcmp(&(*(first+j))[d], &(*(first+j-1))[d]) < 0; --j)
            std::iter_swap(first + j, first + j - 1);
    }
}

/*
 * Three-way string quicksort.
 * Similar to MSD radix sort: we first sort the array on the leading character
 * (using quicksort), then apply this method recursively on the subarrays. On
 * each sorting pass, a pivot v is chosen, then the range is partitioned into
 * 3 parts: strings whose d-th character is less than v, equal to v, and
 * greater than v. Just like the partitioning in classic quicksort, but
 * comparing only one character instead of the whole string. After
 * partitioning, only the middle (equal-to-v) part can sort on the following
 * character (index d+1). The other two recurse on the same depth (index d)
 * because they haven't been sorted on the d-th character yet (we just
 * partitioned them: <v or >v).
 *
 * Time complexity: O(N) ~ O(N*lgN), space complexity: O(lgN).
 * Explanation: N * string length (for partitioning, finding the equal-to-v
 * part) + O(N*lgN) (for the quicksort itself) character comparisons
 * (instead of string comparisons in normal quicksort).
 */
template<typename RandomIt>
void quick3string(RandomIt first, RandomIt last, size_t d)
{
    if (last - first < 2) return;
#if 0 // seems not to help much
    if (last - first <= 8) { // change the threshold as you like
        insertion_sort(first, last, d);
        return;
    }
#endif
    typedef typename std::iterator_traits<RandomIt>::value_type String;
    typedef typename string_traits<String>::value_type CharT;
    typedef std::make_unsigned_t<CharT> UCharT;
    RandomIt lt = first, i = first + 1, gt = last - 1;
    /* make lo the median of {lo, mid, hi} */
    RandomIt mid = lt + ((gt - lt) >> 1);
    if ((*mid)[d] < (*lt)[d]) std::iter_swap(lt, mid);
    if ((*mid)[d] < (*gt)[d]) std::iter_swap(gt, mid);
    // now mid is the largest of the three, then make lo the median
    if ((*lt)[d] < (*gt)[d]) std::iter_swap(lt, gt);
    UCharT pivot = (*first)[d];
    while (i <= gt) {
        int diff = (UCharT) (*i)[d] - pivot;
        if (diff < 0) std::iter_swap(lt++, i++);
        else if (diff > 0) std::iter_swap(i, gt--);
        else ++i;
    }
    // Now a[lo..lt-1] < pivot = a[lt..gt] < a[gt+1..hi].
    quick3string(first, lt, d);          // sort a[lo..lt-1]
    if (pivot != '\0')
        quick3string(lt, gt+1, d+1);     // sort a[lt..gt] on the following character
    quick3string(gt+1, last, d);         // sort a[gt+1..hi]
}

template<typename RandomIt>
void str_qsort(RandomIt first, RandomIt last)
{
    quick3string(first, last, 0);
}
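A quick usage sketch (my own, not from the repo; works for std::string since C++11 guarantees null termination and contiguous storage):

#include <iostream>
#include <string>
#include <vector>

int main() {
    std::vector<std::string> words = {"she", "sells", "seashells", "by", "the", "seashore"};
    str_qsort(words.begin(), words.end());
    for (const auto& w : words) std::cout << w << '\n';
}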
NOTE: But if, like me, you search Google for "fastest string sorting algorithm", chances are you'll land on burstsort, a cache-aware MSD radix sort variant (paper). I also found this paper by Bentley and Sedgewick helpful, which uses multikey quicksort.
