I have several questions about ZPL and GS1-128 barcodes.
I thought using subset B is always possible, but sometimes it makes the barcode wider than subset C would (when the data is purely numeric).
So I started switching between the subsets. But when does it make sense to switch? One example:
Plain Barcode: (02)12345678901234(10)00TestTest00
Could be: '>;>802123456789012311000>6TestTest00'
or
'>;>802123456789012311000>6TestTest>500'
What are the advantages of Subset A?
I also couldn't find any information about the maximum number of characters that can be part of a GS1-128 barcode for a specific label size (like DIN A5).
As a rule of thumb, I stick to Code128B with two exceptions:
I switch to Code128C when I know I am going to have at least 6 contiguous numbers embedded in a barcode.
I use Code128A when I can't get around embedding tabs or carriage returns in a single symbol (when I'm trying to simulate a user filling out multiple fields on a form with one scan), but seldom for access to other control codes.
Maximum characters for GS1 fields can be found here: https://en.wikipedia.org/wiki/GS1-128
It appears that most fields are limited, with many allowing up to 30 characters; one exception (Extended Packaging URL) allows up to 70 characters.
As far as label size, that's all about bar density. My tightest scannable 70 character symbol is about 4 inches long, assuming the use of Code128B.
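To make that rule of thumb concrete with the question's own example, here is a hedged sketch in Python. It reuses the invocation codes from the question, which I'm assuming carry the usual ZPL meanings (>; start in subset C, >8 FNC1, >6 switch to subset B, >5 switch to subset C):
numeric_run = "02123456789012311000"  # AI (02)+GTIN, AI (10), plus the lot's leading "00" - all digits -> subset C
stay_in_b = ">;>8" + numeric_run + ">6" + "TestTest00"          # trailing "00" kept in subset B
back_to_c = ">;>8" + numeric_run + ">6" + "TestTest" + ">500"   # trailing "00" switched back to subset C
# Two digits fall far short of the ~6 contiguous digits needed to make a
# switch back to subset C pay off, so by the rule of thumb the first form is fine.
print(stay_in_b)
print(back_to_c)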
I'm working on Unicode support in a Linux console application. I ran into a need to change the screen buffer format to store Unicode glyphs instead of bytes representing ASCII characters. Unicode has combined characters, hence more than one Unicode code point can be rendered into one console cell.
The question is: what is the maximum number of Unicode combined characters that may be needed to render one glyph in real-life languages? Are there any languages in the world that have glyphs that need more than 8 combined characters to render, for example? Let's assume that I don't need "Zalgo text" support at the cost of performance degradation caused by implementing dynamic-length variables to store each console buffer glyph.
Nobody can be an expert in what makes up a "real-life" character in every language, so I might be missing some longer sequences here. But I do know about a lot of emoji! There are a few emojis for flags of geographic subdivisions which are implemented with combining codepoints. For example, the flag for Scotland, 🏴󠁧󠁢󠁳󠁣󠁴󠁿, is 7 codepoints, taking up 28 bytes in UTF-32:
WAVING BLACK FLAG
TAG LATIN SMALL LETTER G
TAG LATIN SMALL LETTER B
TAG LATIN SMALL LETTER S
TAG LATIN SMALL LETTER C
TAG LATIN SMALL LETTER T
CANCEL TAG
Country flags, like 🇯🇵, have just two combining codepoints.
Family emojis with 4 people, like 👩‍👩‍👧‍👧, are also 7 codepoints. The only emojis I'm aware of that are longer are the family emojis with a skin tone specified for each family member, but these don't have a lot of support right now. Here's what one displays as on your device: 👩🏾‍👨🏾‍👧🏾‍👧🏾 (if you just see four heads, then you don't have a font installed that supports this). That emoji has 11 codepoints.
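If you want to check counts like these yourself, here is a tiny Python sketch; the escape sequences spell out the Scotland flag and the four-person family emoji mentioned above:
scotland = "\U0001F3F4\U000E0067\U000E0062\U000E0073\U000E0063\U000E0074\U000E007F"
family = "\U0001F469\u200D\U0001F469\u200D\U0001F467\u200D\U0001F467"
print(len(scotland))                      # 7 codepoints: black flag + 5 tag letters + cancel tag
print(len(family))                        # 7 codepoints: 4 people joined by 3 zero-width joiners
print(len(scotland.encode("utf-32-be")))  # 28 bytes in UTF-32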
That being said, keep in mind that not all languages are rendered as a series of glyphs in sequence: أهلا is segmented using Unicode rules into 4 distinct characters.
I'm working with some text from Twitter, using Tweepy. All that is fine, and at the moment I'm just looking to start with some basic frequency counts for words. However, I'm running into an issue where the ability of users to use different fonts for their tweets makes some words look like unique words, when in reality they are words that were already counted, just written in a different font/font size, like in the picture below (those words were counted previously and appear earlier in the spreadsheet).
This messes up the accuracy of the counts. I'm wondering if there's a package or general solution to make all the words a uniform font/size - either while I'm tokenizing it (just by hand, not using a module) or while writing it to the csv (using the csv module). Or any other solutions for this that I may not be considering. Thanks!
You can (mostly) solve your problem by normalising your input, using unicodedata.normalize('NFKC', str).
The NFKC normalization form ("Normalization Form KC": Compatibility Decomposition followed by Canonical Composition) first does a "compatibility decomposition" on the text, which replaces Unicode characters representing style variants, and then does a canonical composition on the result, so that ñ, which is converted to an n and a separate ~ diacritic by the decomposition, is turned back into an ñ, the canonical composite for that character. (If you don't want the recomposition step, use NFKD normalisation.) See Unicode Standard Annex #15 for a more precise description, with examples.
Unicode contains a number of symbols, mostly used for mathematics, which are simply stylistic variations on some letter or digit. Or, in some cases, on several letters or digits, such as ¼ or ℆. In particular, this includes commonly-used symbols written with font variants which have particular mathematical or other meanings, such as ℒ (the Laplace transform) and ℚ (the set of rational numbers). Compatibility decomposition will strip out the stylistic information, which reduces those four examples to '1/4', 'c/u', 'L' and 'Q', respectively.
The first published Unicode standard defined a Letterlike Symbols block in the Basic Multilingual Plane (BMP). (All of the above examples are drawn from that block.) In Unicode 3.1, complete Latin and Greek alphabets and digits were added in the Mathematical Alphanumeric Symbols block, which includes 13 different font variants of the 52 upper- and lower-case letters of the Roman alphabet, 58 Greek letters in five font variants (some of which could pass for Roman letters, such as 𝝪, which is upsilon, not capital Y), and the 10 digits in five variants (𝟎 𝟙 𝟤 𝟯 𝟺). And a few loose characters which mathematicians apparently asked for.
None of these should be used outside of mathematical typography, but that's not a constraint which most users of social networks care about. So people compensate for the lack of styled text in Twitter (and elsewhere) by using these Unicode characters, despite the fact that they are not properly rendered on all devices, make life difficult for screen readers, cannot readily be searched, and all the other disadvantages of using hacked typography, such as the issue you are running into. (Some of the rendering problems are also visible in your screenshot.)
Compatibility decomposition can go a long way in resolving the problem, but it also tends to erase information which is really useful. For example, x² and H₂O become just x2 and H2O, which might or might not be what you wanted. But it's probably the best you can do.
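As a minimal sketch of the approach (standard library only; the styled sample string is made up for illustration):
import unicodedata
styled = "\U0001D5A7\U0001D5BE\U0001D5C5\U0001D5C5\U0001D5C8"  # "Hello" written with mathematical sans-serif letters
print(unicodedata.normalize("NFKC", styled))    # -> 'Hello'
print(unicodedata.normalize("NFKD", "\u00F1"))  # -> 'n' + combining tilde (no recomposition step)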
I have a large set of names (millions in number). Each of them has a first name, an optional middle name, and a last name. I need to encode these names into a number that uniquely represents the names. The encoding should be one-one, that is a name should be associated with only one number, and a number should be associated with only one name.
What is a smart way of encoding this? I know it is easy to tag each letter of the name according to its position in the alphabet (a -> 1, b -> 2, and so on), so a name like Deepa would become 455161, but then I cannot make out whether the '16' is really 16 or a combination of 1 and 6.
So, I am looking for a smart way of encoding the names.
Furthermore, the output numeral for any name should have a fixed number of digits, i.e., it should be independent of the name's length. Is this possible?
Thanks
Abhishek S
To get the same width numbers, can't you just zero-pad on the left?
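For example, in Python, str(42).zfill(12) gives '000000000042'.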
Some options:
1. Sort them. Count them. The 10th name is number 10.
2. Treat each character as a digit in a base-26 (case-insensitive, no digits), base-52 (case-sensitive, no digits), base-36 (case-insensitive with digits) or base-62 (case-sensitive with digits) number. Compute the value in an int. E.g., for a name of "abc", you'd have 0 * 26^2 + 1 * 26^1 + 2 * 26^0. Sometimes Chinese names may use digits to indicate tonality.
3. Use a "perfect hashing" scheme: http://en.wikipedia.org/wiki/Perfect_hash_function
4. This one's mostly suggested in fun: use Gödel numbering :). So "abc" would be 2^0 * 3^1 * 5^2 - it's a product of powers of primes. Factoring the number gives you back the characters. The numbers could get quite large, though.
5. Convert to ASCII, if you aren't already using it. Then treat the ordinal of each character as a digit in a base-256 numbering system. So "abc" is 97*256^2 + 98*256^1 + 99*256^0 (see the sketch after this list).
If you need to be able to update your list of names and numbers from time to time, #2, #4 and #5 should work. #1 and #3 would have problems. #5 is probably the most future-proofed, though you may find you need Unicode at some point.
I believe you could do Unicode as a variant of #5, using powers of 2^32 instead of 2^8 == 256.
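Here is a minimal sketch of #5 in Python (int.from_bytes does the base-256 interpretation for you; the zero-padding for a fixed width is optional):
def name_to_number(name):
    return int.from_bytes(name.encode("ascii"), "big")  # bytes as base-256 digits

def number_to_name(number, length):
    return number.to_bytes(length, "big").decode("ascii")

n = name_to_number("abc")
print(n)                     # 97*256**2 + 98*256 + 99 = 6382179
print(str(n).zfill(20))      # zero-pad on the left for a fixed width
print(number_to_name(n, 3))  # -> 'abc'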
What you are trying to do there is actually hashing (at least if you have a fixed number of digits). There are some good hashing algorithms with few collisions. Try out SHA-1, for example; it is well tested and available for modern languages (see http://en.wikipedia.org/wiki/Sha1) and it seems to be good enough for git, so it might work for you.
There is of course a small possibility for identical hash values for two different names, but that's always the case with hashing and can be taken care of. With sha1 and such you won't have any obvious connection between names and IDs, which can be a good or a bad thing, depending on your problem.
If you really want unique ids for sure, you will need to do something like NealB suggested, create IDs yourself and connect names and IDs in a Database (you could create them randomly and check for collisions or increment them, starting at 0000000000001 or so).
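A quick sketch of the hashing route with Python's standard library (truncating the digest to a fixed number of decimal digits is my own, hypothetical choice, and it slightly raises the collision risk):
import hashlib

def name_to_id(full_name, digits=18):
    digest = hashlib.sha1(full_name.encode("utf-8")).hexdigest()
    return int(digest, 16) % 10**digits  # fixed-width numeric ID

print(name_to_id("Deepa"))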
(improved answer after giving it some thought and reading the first comments)
You can use BigInteger for encoding arbitrary strings like this:
BigInteger bi = new BigInteger("some string".getBytes());
And for getting the string back use:
String str = new String(bi.toByteArray());
I've been looking for a solution to a problem very similar to the one you proposed and this is what I came up with:
def hash_string(value):
    score = 0
    depth = 1
    for char in value:
        score += ord(char) * depth
        depth /= 256.
    return score
If you are unfamiliar with Python, here's what it does.
The score is initially 0 and the depth is set to 1
For every character add the ord value * the depth
The ord function returns the character's code value (0-255 for single-byte characters)
Then it's multiplied by the 'depth'.
Finally the depth is divided by 256.
Essentially, the way that it works is that the initial characters add more to the score while later characters contribute less and less. If you need an integer, multiply the end score by 2**64. Otherwise you will have a decimal value between 0 and 256. This encoding scheme works for binary data as well, as there are only 256 possible values in a byte/char.
This method works great for smaller string values, however, for longer strings you will notice that the decimal value requires more precision than a regular double (64-bit) can provide. In Java, you can use the 'BigDecimal' and in Python use the 'decimal' module for added precision. A bonus to using this method is that the values returned are in sorted order so they can be searched 'efficiently'.
Take a look at https://en.wikipedia.org/wiki/Huffman_coding. That is the standard approach.
You can translate it if every character (plus blank, at least) occupies a position.
Therefore ABC, which is 1, 2, 3, would be translated to
1*(2*26+1)² + 2*(53) + 3
This way you can encode arbitrary strings, but if the length of the input isn't limited (and why should it be?), you aren't guaranteed an upper limit on the size of the resulting number.
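A small sketch of that positional scheme, assuming blank = 0 and a/A = 1 ... z/Z = 26 (the base of 2*26+1 = 53 leaves room to give each case its own digit if case matters):
BASE = 2 * 26 + 1  # 53

def encode(name):
    number = 0
    for ch in name:
        digit = 0 if ch == " " else ord(ch.lower()) - ord("a") + 1
        number = number * BASE + digit
    return number

print(encode("ABC"))  # 1*53**2 + 2*53 + 3 = 2918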
I know that I can encode numbers in a base like 65 to decrease the size of their character representation (even if the number is smaller in binary).
However, is there a way to encode UTF-8 text in another base with more characters than our standard 26-letter English alphabet? In other words, instead of requiring 4 "characters" for the word "four", could I create a representation or hash using only, maybe, 2 (i.e. "6$")?
I believe the point of Base64 is that you can easily convert any binary data into "human readable" letters and numbers. It makes it easy to transcribe arbitrary data to newsgroups or transmit it over text-based protocols.
If you want to further "compress" this data, you need to figure out how many characters you want to allow. There are only so many combinations of 8 bits. The most efficient would be to use all of them, in which case why not just use gzip?
Your question seems related to order-0 entropy coding:
http://en.wikipedia.org/wiki/Entropy_encoding
The most famous algorithm in this family is Huffman coding:
http://en.wikipedia.org/wiki/Huffman_coding
Huffman will not only tell you that only 64 characters are used and therefore only 6 bits per character are necessary: it will also distinguish between frequent characters, such as the space, and rare ones, such as ';'. It will then create a code in which frequent characters use fewer bits than rarer ones, resulting in better compression (typically around 4.5 bits per character on English text).
Huffman coding is an all-around compression technique, used as part of many compression algorithms, including zip.
You can find a demo program which applies only one pass of Huffman compression here (Huff0); it will help you determine how much can be gained by using this technique on your sample inputs:
http://fastcompression.blogspot.com/p/huff0-range0-entropy-coders.html
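If you just want to see the effect without an external tool, here is a minimal Huffman-code sketch in Python (the sample text is arbitrary; this builds the code table only, not a full encoder):
import heapq
from collections import Counter

def huffman_code(text):
    freq = Counter(text)
    # heap entries: (frequency, tie-breaker, {symbol: code-so-far})
    heap = [(f, i, {ch: ""}) for i, (ch, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {ch: "0" + c for ch, c in left.items()}
        merged.update({ch: "1" + c for ch, c in right.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

codes = huffman_code("this is an example of a huffman code")
print(sorted(codes.items(), key=lambda kv: len(kv[1])))  # frequent characters get shorter codes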
Task:
to cluster a large pool of short DNA fragments into classes that share common sub-sequence patterns and find the consensus sequence of each class.
Pool: ca. 300 sequence fragments
8 - 20 letters per fragment
4 possible letters: a,g,t,c
each fragment is structured in three regions:
5 generic letters
8 or more positions of g's and c's
5 generic letters
(As regex that would be [gcta]{5}[gc]{8,}[gcta]{5})
Plan:
to perform a multiple alignment (e.g. with ClustalW2) to find classes that share common sequences in region 2 and their consensus sequences.
Questions:
Are my fragments too short, and would it help to increase their size?
Is region 2 too homogeneous, with only two allowed letter types, for showing patterns in its sequence?
Which alternative methods or tools can you suggest for this task?
Best regards,
Simon
Yes, 300 is FAR TOO FEW considering that this is the human genome and you're essentially just looking for a particular 8-mer. There are 65,536 possible 8-mers and 3,000,000,000 bases in the genome (assuming you're looking at the entire genome and not just genic or coding regions). You'll find all-G/C 8-mers roughly 3,000,000,000 / 65,536 * 2^8 =~ 12,000,000 times (and probably many more, since the genome is full of CpG islands compared to other things). Why only choose 300?
You don't want to use regexes for this task. Just start at chromosome 1, look for the first CG or GC and extend until you get your first non-G-or-C. Then take that sequence and its context and save it (in a DB). Rinse and repeat.
For this project, Clustal may be overkill -- but I don't know your objectives so I can't be sure. If you're only interested in the GC region, then you can do some simple clustering like so:
Make a database entry for each G/C 8-mer (2^8 = 256 in all).
Take each GC-region and walk it to see which 8-mers it contains.
Tag each GC-region with the sequences it contains.
Now, for each 8-mer, you have thousands of sequences which contain it. I'll leave the analysis of the data up to your own objectives.
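A rough sketch of that clustering idea in Python (an in-memory dict instead of a database; the two fragments are made-up samples matching the regex from the question):
import re
from collections import defaultdict

fragments = ["aacgtggccggccgtact", "ttgacgcgccgcggcatta"]  # hypothetical pool
pattern = re.compile(r"[gcta]{5}([gc]{8,})[gcta]{5}")

clusters = defaultdict(list)  # 8-mer -> fragments whose GC-region contains it
for frag in fragments:
    m = pattern.fullmatch(frag)
    if not m:
        continue
    gc_region = m.group(1)
    for i in range(len(gc_region) - 7):  # walk the GC-region and tag the fragment
        clusters[gc_region[i:i + 8]].append(frag)

for kmer, members in clusters.items():
    print(kmer, members)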
Your region two, with only two allowed letters, may end up a bit too similar; increasing its length or variability (e.g. allowing more letters) could help.