SpamAssassin HTML_IMAGE_ONLY - PHPMailer

My mails get 1.6 points (out of the 2 needed to be classified as spam) from this rule:
SpamAssassin Rule: HTML_IMAGE_ONLY_24
Standard description: HTML: images with 2000-2400 bytes of words
Explanation
This may indicate a message using an image instead of words in order to sidestep text-based filtering.
I have two images embedded in my mail - is that really not allowed? And how am I even supposed to understand the phrase "with 2000-2400 bytes of words"?
Shouldn't it be less strict if you embed the images, since the mail grows larger in size, making it less likely to be spam?

You should take a look here. If you embed the images in your mail you'll reduce the HTML code and you should get a smaller score.
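The question mentions PHPMailer, but as a language-neutral illustration, here is a minimal Python sketch of the same idea: the image is embedded as a related part referenced by Content-ID, while the HTML part keeps enough visible text alongside it and a plain-text alternative is included. The addresses, subject and "logo.png" are placeholder assumptions, not values from the question.

```python
# Minimal sketch (Python stdlib, not PHPMailer): an HTML mail with an embedded
# image referenced via cid:, plus a plain-text alternative and visible HTML text.
# Addresses, subject and "logo.png" are placeholder assumptions.
from email.message import EmailMessage
from email.utils import make_msgid

msg = EmailMessage()
msg["Subject"] = "Newsletter"
msg["From"] = "sender@example.com"
msg["To"] = "recipient@example.com"

msg.set_content("Plain-text version of the newsletter ...")   # text/plain part

image_cid = make_msgid()
msg.add_alternative(f"""\
<html><body>
  <p>Enough real, visible text here keeps the text-to-image ratio reasonable.</p>
  <img src="cid:{image_cid[1:-1]}">
</body></html>""", subtype="html")

with open("logo.png", "rb") as f:
    msg.get_payload()[1].add_related(f.read(), maintype="image",
                                     subtype="png", cid=image_cid)
```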

Related

Extracting attribute-value from fuzzy text

I'm using an OCR library to extract product specifications from images. I'm first focusing on notebooks. For example:
Processor
Processor model: Intel N3540
Clock speed: 2.16 GHz
Memory
Internal: 4 GB
Hard disk
Capacity: 1 TB
or:
TOSHIBA
SATELLITE C50-5302
PENTIUM
TOSHIBA
DISPLAY 15.6
4GB
DDR3
500
The OCR is not perfect and sometimes what would be C10 ends up being CIO and other similar things.
I'd like to extract the attribute-value pairs but I don't know how to approach this problem.
I'm thinking about building a file with all the notebooks and microprocessors I can get (because brand, memory and hard drive capacity are pretty limited) and then using an NLP library to extract the entities from the text. The problem is also that sometimes there are spelling errors, so it's not as easy as comparing exact values.
How would you approach this problem?
As for spelling errors, I'd suggest obtaining, if possible, an ambiguous and probabilistic output from the OCR system. Considering your CIO example, 'I' is graphically much closer to '1' than to other characters. If no such output is available, you may consider using some sort of weighted edit distance between characters.
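As an illustration of that last suggestion, here is a minimal sketch of a weighted edit distance in which substitutions between visually confusable characters are cheap. The confusion pairs and costs are illustrative assumptions, not a complete table:

```python
# Weighted edit distance sketch: substitutions between visually similar
# characters (as an OCR might confuse them) cost less than other edits.
CONFUSABLE = {("I", "1"), ("1", "I"), ("O", "0"), ("0", "O"), ("S", "5"), ("5", "S")}

def weighted_edit_distance(a: str, b: str) -> float:
    d = [[0.0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = float(i)
    for j in range(len(b) + 1):
        d[0][j] = float(j)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                sub = 0.0
            elif (a[i - 1], b[j - 1]) in CONFUSABLE:
                sub = 0.2   # cheap: graphically similar characters
            else:
                sub = 1.0
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution
    return d[len(a)][len(b)]

print(weighted_edit_distance("CIO", "C10"))  # small distance: I~1, O~0
```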
For named entity recognition, work has been done on recognizing named entities from noisy input, mostly for ASR sources (as far as I know). See how word confusion networks handle this, for instance this article.
As a final step, you'll probably need a joint task for OCR correction and named entity recognition. This will probably require defining what entities are likely for your domain: what tokens are expected to describe CPU speed, storage capacity, computer brands, etc. You may either manually implement rules or mine data from existing databases. You'll probably also have to somehow adjust the rate of expected OCR error correction so that you extract correct attribute-value pairs without adding false positives.
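To illustrate the rule/gazetteer idea just mentioned, here is a small sketch that matches OCR tokens against hand-built lists of known values per attribute while tolerating small spelling errors. The gazetteer contents and the similarity cutoff are illustrative assumptions for a notebook domain:

```python
# Gazetteer-based matching sketch: fuzzy-compare an OCR token against known
# values for each attribute; difflib's ratio tolerates minor OCR errors.
import difflib

GAZETTEER = {
    "cpu": ["Intel N3540", "Pentium", "Core i5"],
    "brand": ["Toshiba", "Lenovo", "Asus"],
}

def match_attribute(token: str, cutoff: float = 0.8):
    for attribute, values in GAZETTEER.items():
        hits = difflib.get_close_matches(token.lower(),
                                         [v.lower() for v in values],
                                         n=1, cutoff=cutoff)
        if hits:
            return attribute, hits[0]
    return None

print(match_attribute("TOSHIBA"))   # ('brand', 'toshiba')
print(match_attribute("PENTIUM"))   # ('cpu', 'pentium')
```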
Don't hesitate to keep us informed about the solution you experiment with!

How do computers process ASCII/text & images/color differently?

I've recently been thinking more about what kind of work computer hardware has to do to produce the things we expect.
Comparing text and color, it seems that both rely on combinations of 1's and 0's with 256 possible combinations per byte. ASCII may define a byte such as (01100001) to be the letter 'a'. But then there may be a color R(01100001), G(01100001), B(01100001) representing some random color. Considering that at a low level the computer is just reading these collections of 1's and 0's, what needs to happen to ensure the computer renders the color R(01100001), G(01100001), B(01100001) and not the letter 'a' three times on my screen?
I'm not entirely sure this question is appropriate for Stack Overflow, but I'll go ahead and give a basic answer anyway. It's actually a very complicated question, because depending on how deep you want to go into answering it, I could write an entire book on computer architecture to do so.
So to keep it simple I'll just give you this: It's all a matter of context. First let's just tackle text:
When you open, say, a text editor, the implicit assumption is that the data to be displayed in it is textual in nature. The text to be displayed is some bytes in memory (possibly copied out of some bytes on disk). There's no magical internal context, from the memory's point of view, marking these bytes as text. Instead, the source for the text editor contains some code that points to those bytes and says "these bytes represent 300 characters of text", for example. Then there's a complex sequence of steps, from library code all the way down to hardware, that handles mapping those bytes according to an encoding like ASCII (there are many other ways of encoding text) to characters, finding those characters in a font, writing that font to the screen, etc.
The point is it doesn't have to interpret those bytes as text. It just does because that's what a text editor does. You could hypothetically open it in an image program and tell it to interpret those same 300 bytes as a 10x10 array (or image) of RGB values.
As for colors the same logic applies. They're just bytes in memory. But when the code that's drawing something to the screen has decided what pixels it wants to write with what colors, it will pipe those bytes via a memory mapping to the video card which will then translate them to commands that are sent to the monitor (still in some binary format representing pixels and the colors, though the reality is a lot more complicated), and the monitor itself contains firmware that then handles the detail of mapping those colors to the physical pixels. The numbers that represent the colors themselves will at some point be converted to a specific current to each R/G/B channel to raise or lower its intensity.
That's all I have time for right now, but it's a start.
Update: Just to illustrate my point, I took the text of Flatland from here, which is just 216624 bytes of ASCII text (interpreted as such by your web browser based on context: the .txt extension helps, but the web server also provides a MIME type header informing the browser that it should be interpreted as plain text, and your browser might also analyze the bytes to determine that their pattern looks like plain text and that there isn't an overwhelming number of bytes that don't represent ASCII characters). I appended a few spaces to the end of the text so that its length is 217083, which is 269 * 269 * 3, and then plotted it as a 269 x 269 RGB image:
Not terribly interesting-looking. But the point is that I took those same exact bytes and told the software, "okay, these are RGB values now". That's not to say that looking at plain-text bytes as images can't be useful. For example, it can be a useful way to visualize an encryption algorithm. This shows an image that was encrypted with a pretty insecure algorithm--you can still get a very good sense of the patterns of bytes in the original unencrypted file. If it were text and not an image this would be no different, as text in a specific language like English also has known statistical patterns. A good encryption algorithm would make the encrypted image look more like random noise.
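For anyone who wants to reproduce that experiment, here is a rough sketch of the same byte-reinterpretation trick in Python. The file name and the 269x269 shape mirror the example above and are otherwise arbitrary; any text and a compatible shape would do:

```python
# Read a plain-text file, pad it with spaces up to width * height * 3 bytes,
# and reinterpret the very same bytes as an RGB image.
import numpy as np
from PIL import Image

data = bytearray(open("flatland.txt", "rb").read())
side = 269
data += b" " * (side * side * 3 - len(data))          # pad to 217083 bytes
pixels = np.frombuffer(bytes(data), dtype=np.uint8).reshape(side, side, 3)
Image.fromarray(pixels, mode="RGB").save("flatland.png")
```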
Zero and one are just zero and one, nothing more. A byte is just a collection of 8 bits.
The meaning you assign to information depends on what you need at the moment, on what "language" you use to interpret your information. 65 is either the letter 'A' in ASCII or the number 65 if you're using it in, say, int a = 65 + 3.
At a low level, different (thousands of) machine instructions are executed to ensure that your data is treated properly, depending for example on the type of file you're reading, its headers, which process requests the data, and so on. The different high-level functions you use to handle different kinds of information expand to very different machine code.
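To make the "it's all context" point concrete, a trivial sketch: the same value 65 read either as a character code or as a number to do arithmetic with:

```python
# Nothing about the value 65 itself says which interpretation it "is".
value = 65
print(chr(value))   # 'A'  -- interpreted as an ASCII character
print(value + 3)    # 68   -- interpreted as a number
```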

Gravatar highest resolution

I show a Gravatar image of users on my site. How can I know the highest resolution to use, e.g. what the "s" parameter should be?
https://secure.gravatar.com/avatar/?s=250
Of course it depends on the user's image, but Gravatar must know the resolution of the original image and could advise me on the best size.
It seems the maximum size changed drastically in the last 9 months.
You may request images anywhere from 1px up to 2048px, however note that many users have lower resolution images, so requesting larger sizes may result in pixelation/low-quality images.
quoted from the Gravatar image Request tutorial
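For reference, the request URL is built from an MD5 hash of the trimmed, lower-cased email address, with the "s" parameter controlling the pixel size (1-2048). A minimal sketch, with a placeholder address:

```python
# Build a Gravatar URL for a given email at the requested pixel size.
import hashlib

def gravatar_url(email: str, size: int = 250) -> str:
    email_hash = hashlib.md5(email.strip().lower().encode("utf-8")).hexdigest()
    return f"https://secure.gravatar.com/avatar/{email_hash}?s={size}"

print(gravatar_url("someone@example.com", size=2048))
```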
I don't think it is possible. I just did some research on that matter because I have the same need, but I think Gravatar was designed for websites that show all avatars at the same size, which means small avatars, and if that's too big for some of them, they'll be OK with the automatic upscaling.
They should include a "?s=native" to get the native size.
This is what Gravatar writes about the resolution of the avatars:
You may request images anywhere from 1px up to 2048px, however note
that many users have lower resolution images, so requesting larger
sizes may result in pixelation/low-quality images.
The highest resolution of the default image would be 2048px:)
Read more about Gravatar images (including the default image) on https://en.gravatar.com/site/implement/images/
EDIT: You will see the picture cannot get any bigger than 2048px x 2048px even if you set s=3000:)
EDIT 2: Apparently the maximum size changed from 512px to 2048px

QR code compression

Is it possible to store about 20 000 characters in QR code? (Or even more? http://blog.qr4.nl/page/QR-Code-Data-Capacity.aspx)
I would like to store only ASCII symbols (characters and numbers, with the occasional dash and so on).
As far as I know it's possible to compress simple text with a ratio of 80-98%, which sounds promising: http://www.maximumcompression.com/index.html
Do you have some more experience? Thanks for sharing!
If your question is: "Is it possible to store 20K characters in QR Code?", then the answer is yes, it is possible.
If your question is: "Is it possible to guarantee you'll always be able to store 20K characters in QR Code to compression?", the answer is no. There is no way to guarantee that, due to pigeonhole principle.
If your question is: "Is there a "comfortable zone" where it is highly likely that a text input, whose maximum size is 20K, will most probably fit into a QR Code?", the proper answer is: it depends on your input data. And a more risky answer is: if you're dealing with "normal text" data, such as a book content, you're probably asking for too much.
The 80-90% compression ratio you refer to is possible because the input data is extremely large (several MB) and the decompression algorithms are very slow. For "small" input data, such as 20K characters, the compression ratio for "normal text" will more likely be in the 50-70% range, depending on algorithm strength (PPM, for example, is very suitable for such input data).
Obviously, if your input data is a kind of "log file", with a lot of repetitions, then yes, a compression ratio > 95% is easily achievable.
But compression ratio is not the only thing to take into consideration. For "real-life" usage, you'll also have to consider the QR size and a reasonable level of error correction for the printed QR to survive. Betting on "max capacity with the lowest possible correction" is a fairly bad bet, at least for real-life scenarios. You'll have to ask around to find out what the "reasonable limits" of your QR Code are. Most probably, printing capabilities will get in the way, and you'll have to settle for something less than the maximum.
Last point: don't forget that compressed data is "binary", not "alphanumeric". As a consequence, the final capacity of your QR Code is given by the last (binary) column, which is much lower than the "alphanumeric" column.
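To get a feel for the numbers discussed above, here is a rough feasibility check: compress a 20K sample with zlib and compare it against the byte-mode capacity of the largest QR code, 2953 bytes at version 40 with error correction level L. The sample string below is a highly repetitive placeholder, so it compresses far better than normal prose would:

```python
# Compress sample text and check whether it would fit in one QR code (byte mode).
import zlib

MAX_QR_BYTES = 2953                      # version 40-L, byte mode
text = ("Intel N3540 2.16 GHz 4 GB RAM 1 TB HDD. " * 500)[:20000]

compressed = zlib.compress(text.encode("ascii"), level=9)
print(len(text), "->", len(compressed), "bytes")
print("fits in one QR code:", len(compressed) <= MAX_QR_BYTES)
```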
QR codes have a special encoding mode for alphanumeric data (upper-case only, plus digits and a few symbols). It uses less than 8 bits per character and can store 4,296 characters at most in this mode.
This ought to be close to optimal. For simpler data (like, all alpha), a compression algorithm like gzip might be able to achieve fewer bits per byte. Of course, no standard reader would interpret the gzipped payload as such. Only a special reader would be able to.
Can you get 5x more data into a QR code this way? No, almost surely not, unless it's a trivial case like 20,000 "a"s.
Even if you could, it would create a large complex QR code. Anything holding over a few hundred bytes gets hard to scan in practice. Version 40, the largest, is useless in the real world. Even version 20 is.
In practice, when you want to use a QR code to store huge amounts of data, you simply store a URL pointing to the location of the data.
What is theoretically possible is very different to what is actually possible when you have to support real-life devices. Good luck scanning anything above version 10 (57x57 modules) with a low-end smartphone camera.

Large amount of dataURIs compared to images

I'm trying to compare (for performance) the use of either dataURIs compared to a large number of images. What I've done is setup two tests:
Regular Images (WPT)
Base64 (WPT)
Both pages are exactly the same, other than "how" these images/resources are being served. I've run a WebPageTest against each (noted above - WPT) and it looks like the average load time for base64 is a lot faster -- but the cached view of the regular version is faster. I've implemented HTML5 Boilerplate's .htaccess to make sure resources are properly gzipped, but as you can see I'm getting an F for base64 for not caching static resources (which I'm not sure is right or not). What I'm ultimately trying to figure out here is which is the better way to go (assuming, for argument's sake, there'd be that many resources on a single page). Some things I know:
The GET request for base64 is big
There's 1 resource for base64 compared to 300-odd for the regular version (which is the bigger downer here... the GET request size or the number of resources?). The thing to remember about the regular one is that only so many resources can be loaded in parallel due to browser restrictions -- while for base64 you're really only waiting until the HTML can be read, so nothing is technically loaded other than the page itself.
Really appreciate any help - thanks!
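For context, a base64 test page like the one above essentially inlines each image along these lines; the encoded payload ends up roughly a third larger than the raw bytes. A minimal Python sketch, with "interior-flooring.jpg" standing in for one of the test images:

```python
# Encode an image as a base64 data URI and inline it into an <img> tag.
import base64

with open("interior-flooring.jpg", "rb") as f:
    encoded = base64.b64encode(f.read()).decode("ascii")

img_tag = f'<img src="data:image/jpeg;base64,{encoded}" alt="flooring sample">'
print(img_tag[:80], "...")
```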
For comparison I think you need to run a test with the images sharded across multiple hostnames.
Another option would be to sprite the images into logical sets.
If you're going to go down the BASE64 route, then perhaps you need to find a way to cache them on the client.
If these are the images you're planning on using then there's plenty of room for optimisation in them, for example: http://yhmags.com/profile-test/img_scaled15/interior-flooring.jpg
I converted this to a PNG and ran it through ImageOptim and it came out as 802 bytes (vs 1.7KB for the JPG)
I'd optimise the images and then re-run the tests, including one with multiple hostnames.
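Not ImageOptim itself, but a rough equivalent of that optimisation step can be scripted with Pillow: convert the JPEG to an optimised PNG and compare file sizes. The file names mirror the example image mentioned above:

```python
# Convert the JPEG thumbnail to an optimised PNG and compare the two sizes.
import os
from PIL import Image

Image.open("interior-flooring.jpg").save("interior-flooring.png", optimize=True)

print("jpg:", os.path.getsize("interior-flooring.jpg"), "bytes")
print("png:", os.path.getsize("interior-flooring.png"), "bytes")
```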