I'm new to Haskell so forgive me for not understanding the basics.
When using Crypto.Hash.SHA256 to hash a value, the result looks something like this:
\159\252\170M\NAK\221\189S\n\191{\197y\t\USUx\143\&3\249\198K}]'\195\nU\154\SI3\199
Can anyone explain what the hell it is I'm looking at?
You're looking at the binary representation of the hash. You're probably used to seeing the hexadecimal representation. To get that, import Data.ByteString.Builder and call toLazyByteString . byteStringHex on it. With the hash in your question, the hexadecimal representation will be 9ffcaa4d15ddbd530abf7bc579091f55788f33f9c64b7d5d27c30a559a0f33c7.
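For completeness, here is a small sketch of the round trip from raw digest to hex, assuming the cryptohash-sha256 and bytestring packages (the input string "hello" is just an example):

```haskell
import qualified Crypto.Hash.SHA256 as SHA256
import qualified Data.ByteString.Char8 as BS
import qualified Data.ByteString.Lazy.Char8 as BL
import Data.ByteString.Builder (toLazyByteString, byteStringHex)

main :: IO ()
main = do
  -- SHA256.hash returns the raw 32-byte digest, which is what you saw escaped
  let digest = SHA256.hash (BS.pack "hello")
  -- byteStringHex renders each raw byte as two lowercase hex digits
  BL.putStrLn (toLazyByteString (byteStringHex digest))
```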
I'm trying to figure out the algorithm of the most_similar function in the pymagnitude library.
I have tried looking through their documentation, but the algorithm they use is not described there.
from pymagnitude import Magnitude

vector = Magnitude(magnitude_file)
vector.most_similar("king", topn=10)
I'm guessing they used interpolation search, but I don't know if I'm right.
P.S. Can someone help?
Note that this question is the same as this previously unanswered question.
It is also the same as this PHP question, but I'm looking for the Haskell equivalent.
RFC 2047 defines the standard for "encoded-word" encodings and provides an example of:
=?iso-8859-1?q?this=20is=20some=20text?=
Is there a standard Haskell library for decoding this into its correct Text representation?
It shouldn't be too hard to write a custom parser using parsec and the RFC spec, but this seems like a common, solved problem in other languages for which I cannot find a Haskell equivalent, and I'd rather not reinvent the wheel here.
In the mime package, have a look at decodeWord in the module Codec.MIME.Decode:
ghci> import Codec.MIME.Decode
ghci> decodeWord "=?iso-8859-1?q?this=20is=20some=20text?="
Just ("this is some text","")
From reading the source code, both iso-8859-1 and us-ascii are supported.
There is also decodeWords, which uses the decodeWord function to translate an entire String:
ghci> decodeWords "Foo=?iso-8859-1?q?this=20is=20some=20text?=Bar"
"Foothis is some textBar"
Do you know the fastest way to encode and decode UTF-8 given some extra information about the input? Here are the interesting cases that occur to me:
Serialization
I just want to encode an opaque buffer with no validation so I can decode again later. The fastest would be to use the underlying memory buffer and somehow unsafely coerce it from Text to ByteString without touching the contents.
Probably ASCII
I guess that 99% of the time my UTF-8 is actually ASCII, so it makes sense to do a first pass to confirm this and only do further processing if it turns out not to be true.
Probably not ASCII
The converse of the previous case.
Probably short
A single key in JSON or a database that I'd guess is 1 to 20 characters. It would be silly to pay some upfront cost, like a vectorized SIMD approach.
Probably long
An HTML document. It's worth paying some upfront cost for the highest throughput.
There are some similar variants, such as encoding JSON or a URL when you think there are probably no escape characters.
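The "probably ASCII" case above can be sketched as a two-pass decoder: a cheap scan for any byte with the top bit set, falling back to full UTF-8 validation only when one is found. This is a minimal, unfused version; a serious implementation would combine the check with the decode loop:

```haskell
import Data.ByteString (ByteString)
import qualified Data.ByteString as BS
import qualified Data.ByteString.Char8 as BSC
import Data.Text (Text)
import Data.Text.Encoding (decodeLatin1, decodeUtf8)

decodeAsciiFast :: ByteString -> Text
decodeAsciiFast bs
  | BS.all (< 0x80) bs = decodeLatin1 bs  -- pure ASCII: byte-for-byte, no multi-byte handling
  | otherwise          = decodeUtf8 bs    -- general path with full UTF-8 validation

main :: IO ()
main = print (decodeAsciiFast (BSC.pack "hello"))
```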
I'm asking this question under the [Haskell] tag since Haskell's strong typing makes some techniques that would be easy in, say, C hard to implement. Also, there may be some GHC-specific tricks, like using SSE4 instructions on an Intel platform, that would be interesting. But this is a UTF-8 issue in general, and good ideas would be helpful in any language.
Update
After some research I propose to implement encode and decode for serialization purposes like so:
myEncode :: Text -> ByteString
myEncode = unsafeCoerce
myDecode :: ByteString -> Text
myDecode = unsafeCoerce
This is a great idea if you enjoy segfaults ...
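A safe way to get the serialization behavior, regardless of text's internal representation, is to go through the validated codecs in Data.Text.Encoding. A minimal sketch (the function names myEncode/myDecode mirror the question's proposal):

```haskell
import Data.Text (Text)
import qualified Data.Text as T
import Data.ByteString (ByteString)
import Data.Text.Encoding (encodeUtf8, decodeUtf8')

myEncode :: Text -> ByteString
myEncode = encodeUtf8  -- total: every Text value has a UTF-8 encoding

-- decodeUtf8' rejects invalid UTF-8 instead of throwing or corrupting
myDecode :: ByteString -> Either String Text
myDecode = either (Left . show) Right . decodeUtf8'

main :: IO ()
main = print (myDecode (myEncode (T.pack "héllo")))
```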
This question implicates a sprawling range of issues. I'm going to interpret it as "In Haskell, how should I convert between Unicode and other character encodings?"
In Haskell, the recommended way to convert to and from Unicode is with the functions in text-icu, which provides some basic functions:
fromUnicode :: Converter -> Text -> ByteString
toUnicode :: Converter -> ByteString -> Text
text-icu is a binding to the International Components for Unicode libraries, which do the heavy lifting for, among other things, encoding and decoding to non-Unicode character sets. Its website provides documentation on conversion in general and some specific information on how its converter implementations operate. Note that different character sets require somewhat different converter implementations.
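A minimal sketch of what that looks like in practice, assuming the text-icu package and an ICU installation are available (the converter name "ISO-8859-1" follows ICU's naming conventions):

```haskell
import qualified Data.ByteString.Char8 as BS
import Data.Text.ICU.Convert (open, toUnicode, fromUnicode)
import qualified Data.Text.IO as TIO

main :: IO ()
main = do
  -- open takes a converter name and an optional fallback-mapping flag
  conv <- open "ISO-8859-1" Nothing
  let latin1Bytes = BS.pack "caf\xe9"   -- "café" encoded as Latin-1
      t = toUnicode conv latin1Bytes    -- decode Latin-1 bytes -> Text
  TIO.putStrLn t
  print (fromUnicode conv t)            -- re-encode Text -> Latin-1 bytes
```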
ICU can also attempt to automatically detect the character set of an input. "This is, at best, an imprecise operation using statistics and heuristics." No other implementation could "fix" that characteristic. The Haskell bindings do not expose that functionality as I write; see #8.
I don't know of any character set conversion procedures written in native Haskell. As the ICU documentation indicates, there is a lot of complexity; after all, this is a rich area of international computing history.
Performance
As the ICU FAQ laconically notes, "Most of the time, the memory throughput of the hard drive and RAM is the main performance constraint." Although that comment is not specifically about conversions, I'd expect it to be broadly the case here as well. Is your experience otherwise?
unsafeCoerce is not appropriate here.
In Haskell's Data.Text.Encoding, presuming one presents a pure ASCII ByteString, is decodeLatin1 very much faster than decodeUtf8? Intuitively, it seems like there would be at least one more machine instruction given the nature of UTF-8 (i.e., a test of the top bit). I know I could do my own profiling, but I presume this may have been done already, which is why I ask.
Here is the underlying C code that the text library uses internally for the decoder; see specifically the function _hs_text_decode_latin1:
http://hackage.haskell.org/package/text-1.0.0.1/src/cbits/cbits.c
is decodeLatin1 very much faster than decodeUtf8
The answer is simply that it shouldn't matter: choose the latin1 decoder if you need to work with existing latin1 text data. Anything else is micro-optimization in almost all cases; the text library is already very heavily optimized.
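If you do want concrete numbers anyway, a criterion benchmark along these lines would measure both decoders on the pure-ASCII case (a sketch; the 1 MB input size is an arbitrary choice):

```haskell
import Criterion.Main
import qualified Data.ByteString.Char8 as BS
import Data.Text.Encoding (decodeLatin1, decodeUtf8)

main :: IO ()
main = do
  -- 1 MB of pure ASCII, the case the question asks about
  let ascii = BS.replicate 1000000 'a'
  defaultMain
    [ bench "decodeLatin1" $ nf decodeLatin1 ascii  -- no validation needed
    , bench "decodeUtf8"   $ nf decodeUtf8   ascii  -- validating decoder
    ]
```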
I am just starting out with Haskell and Yesod so please forgive if I am missing something obvious.
I noticed that fileContentType in Yesod.Request.FileInfo is a Text even though Yesod.Content has an explicit ContentType. I'm wondering why it is not a ContentType instead, and what the cleanest conversion is.
Thanks in advance!
This comes down to a larger issue. A lot of the HTTP spec is stated in terms of ASCII. The question is, how do we represent it. There are essentially three choices:
1. Create a special newtype Ascii around a ByteString. This is the most correct, but also very tedious, since it involves a lot of wrapping/unwrapping. We tried this approach and got a lot of negative feedback.
2. Use normal ByteStrings. This is efficient and mostly correct, but allows people to enter non-ASCII binary data.
3. Use Text (or String). This is the most developer-friendly, but allows you to enter non-ASCII character data. It's a bit less efficient than (1) or (2) due to the encoding/decoding overhead.
Overall, we've been moving towards (3) for most operations, especially for things like session keys which are internal to Yesod. You could say that ContentType is an inconsistency and should be changed to Text, but it doesn't seem to bother anyone, is a bit more semantic, and a bit faster.
tl;dr: No good reason :)
You have confused the type of the Content with the type of the ContentType. The type of the fileContent field of FileInfo should be ContentType - and it is, modulo the type alias - while the type of fileContentType is Text, which would be a "ContentTypeType". It might help to imagine the last word as the head of the name, so ContentType == the type of the Content.
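As for the "cleanest conversion" part of the question: assuming ContentType is an alias for a strict ByteString (as in yesod-core), encodeUtf8 is a safe conversion, since MIME types are ASCII. A sketch, with the alias written out locally so it stands alone:

```haskell
import Data.ByteString (ByteString)
import Data.Text (Text)
import qualified Data.Text as T
import Data.Text.Encoding (encodeUtf8, decodeUtf8)

type ContentType = ByteString  -- stand-in for Yesod's alias

toContentType :: Text -> ContentType
toContentType = encodeUtf8     -- MIME types are ASCII, so UTF-8 is safe here

fromContentType :: ContentType -> Text
fromContentType = decodeUtf8

main :: IO ()
main = print (toContentType (T.pack "text/html"))
```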