XSS Validation, Prevention or Cure? - security

I am getting hands-on experience trying to build a hack-proof website and learning about XSS. The process is:
A: Get user input -> B: Store it -> C: Show it again to the client
I am using the Microsoft AntiXSS library to avoid XSS attacks, but I am unsure whether I should apply it at step B or at step C.

You should perform the sanitisation at the point where you are presenting the content, because this is the only point where it matters.
In a more complicated scenario, your data-flow may look like this:
                   /---> C (presentation)
A (get input) -> B (store)
                   \---> D (process)
If you've already sanitised the data at point B, then the processing at point D won't be able to operate on the original data.
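A minimal sketch of that flow in Python, using the standard library's html.escape as a stand-in for the AntiXSS encoder. The function names and the in-memory list are hypothetical, chosen only for illustration: the raw input is stored untouched at B, encoded only when rendered at C, and the processing step D still sees the original text.

```python
import html

stored_comments = []  # hypothetical stand-in for the database (step B)


def store_comment(raw_text):
    """Step B: store the input exactly as the user sent it."""
    stored_comments.append(raw_text)


def render_comment(raw_text):
    """Step C: encode for HTML only at the point of presentation."""
    return "<p>" + html.escape(raw_text) + "</p>"


def character_count(raw_text):
    """Step D: processing that needs the original, unencoded data."""
    return len(raw_text)


store_comment('<script>alert("x")</script>')
store_comment("Tom & Jerry")

rendered = [render_comment(c) for c in stored_comments]
counts = [character_count(c) for c in stored_comments]
```

Had the text been encoded at B, "Tom & Jerry" would have been stored as "Tom &amp; Jerry" and step D would have counted 15 characters instead of 11.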

Have a Strategy that does not uniformly choose between different strategies

I'd like to create a strategy C that, 90% of the time chooses strategy A, and 10% of the time chooses strategy B.
Python's random library does not work for this, even if I seed it, because each time the strategy produces values it gets the same value from random.
I looked at the implementation of OneOfStrategy, which uses
i = cu.integer_range(data, 0, n - 1)
to randomly generate a number, where cu comes from the internals:
import hypothesis.internal.conjecture.utils as cu
Would it be fine for my strategy to use cu.integer_range, or is there another way to implement this?
Hypothesis does not allow users to control the probability of various choices within a strategy. You should not use undocumented interfaces either - hypothesis.internal is for internal use only and could break at any time!
I strongly recommend using C = st.one_of(A, B) and trusting Hypothesis with the details.

US state resolution from unstructured text

I have a database with a "location" field that contains unconstrained user input in the form of a string. I would like to map each entry to either a US state or NULL.
For example:
'Southeastern Massachusetts' -> MA
'Brookhaven, NY' -> NY
'Manitowoc' -> WI
'Blue Springs, MO' -> MO
'A Damp & Cold Corner Of The World.' -> NULL
'Baltimore, Maryland' -> MD
'Indiana' -> IN
I can tolerate some errors, but fewer would obviously be better. What is the best way to go about this?
You may use Geonames which provides very large lists of location names with information about them, and is free. String matching (or approximate string matching) would then be probably not too hard to implement in the simplest cases.
One difficulty you'll probably encounter is ambiguous names, i.e. names with multiple referents (e.g. Washington: is it the state or the city?). If multiple indicators are present, you can check that they agree with each other. Otherwise, you can look at the other words in the input, but this is probably risky.
IMO, this is very close to Entity Linking with a posterior search to the closest state considering entities that have been linked.
For posterity: I just threw a bunch of regexps at it, which worked 'pretty alright'.
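As a rough illustration of the simple string-matching route, here is a sketch in Python. The tables are tiny illustrative subsets, and CITIES is a hypothetical stand-in for a gazetteer such as Geonames; the function name and matching order are assumptions, not what the asker actually wrote.

```python
import re

# Illustrative subset only; a real table would list every state.
STATES = {
    "massachusetts": "MA",
    "new york": "NY",
    "wisconsin": "WI",
    "missouri": "MO",
    "maryland": "MD",
    "indiana": "IN",
}
ABBREVIATIONS = set(STATES.values())

# Hypothetical gazetteer entries standing in for a Geonames lookup.
CITIES = {"manitowoc": "WI"}


def resolve_state(location):
    """Return a two-letter state code, or None if nothing matches."""
    text = location.strip()

    # 1. A trailing two-letter code after a comma, e.g. "Brookhaven, NY".
    match = re.search(r",\s*([A-Za-z]{2})\s*$", text)
    if match and match.group(1).upper() in ABBREVIATIONS:
        return match.group(1).upper()

    lowered = text.lower()

    # 2. A full state name appearing as whole words, longest names first.
    for name in sorted(STATES, key=len, reverse=True):
        if re.search(r"\b" + re.escape(name) + r"\b", lowered):
            return STATES[name]

    # 3. A known city name from the gazetteer.
    for city, code in CITIES.items():
        if re.search(r"\b" + re.escape(city) + r"\b", lowered):
            return code

    return None
```

This does nothing about ambiguous names such as Washington; it simply returns the first rule that fires, so the order of the three checks is itself a design choice.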

How to create a diff of two complex data structures?

Problem specification:
I am currently searching for an elegant but efficient solution to a problem that I guess is quite common. Consider the following situation:
I defined a fileformat based on a BTree that is defined (in a simplified way) like this:
data FileTree = FileNode [Key] [FileOffset]
              | FileLeaf [Key] [Data]
Reading and writing this from a file to a lazy data structure is implemented and works just fine. This results in an instance of:
data MemTree = MemNode [Key] [MemTree]
             | MemLeaf [Key] [Data]
Now my goal is to have a generic function updateFile :: FilePath -> (MemTree -> MemTree) -> IO () that will read in the FileTree and convert it into a MemTree, apply the MemTree -> MemTree function and write back the changes to the tree structure. The problem is that the FileOffsets have to be conserved somehow.
I have two approaches to this problem. Both of them lack in elegance and/or efficiency:
Approach 1: Extend MemTree to contain the offsets
This approach extends the MemTree to contain the offsets:
data MemTree = MemNode [Key] [(MemTree, Maybe FileOffset)]
             | MemLeaf [Key] [Data]
The read function would then read in the FileTree and store the FileOffset alongside the MemTree reference. Writing checks whether a reference already has an associated offset and, if it does, just uses it.
Pros: easy to implement, no overhead to find the offset
Cons: exposes internals to the user, who is responsible for setting the offset to Nothing
Approach 2: Store offsets in a secondary structure
Another way to attack this problem is to read in the FileTree and create a StableName.Map that holds onto the FileOffsets. That way (if I understand the semantics of StableName correctly) it should be possible to take the final MemTree and look up the StableName of each node in the StableName.Map. If there is an entry, the node is clean and doesn't have to be written again.
Pros: doesn't expose the internals to the user
Cons: involves overhead for lookups in the map
Conclusion
These are the two approaches I can think of. The first one should be more efficient; the second one is more pleasant to the eye. I'd like your comments on my ideas. Maybe someone even has a better approach in mind?
[Edit] Rationale
There are two reasons I am searching for a solution like this:
On the one hand, you should try to handle errors before they arise by using the type system. The aforementioned user is of course the designer of the next layer in the system (i.e. me). By working on the pure tree representation, some kinds of bugs won't be able to happen. All changes to the tree in the file should be in one place. That should make reasoning easier.
On the other hand, I could just implement something like insert :: FilePath -> Key -> Value -> IO () and be done with it. But then I'd lose a very nice trait that comes for free when I keep (a kind of) log by updating the tree in place. Transactions (i.e. merging of several inserts) are just a matter of working on the same tree in memory and writing just the differences back to the file.
I think that the package Data.Generic.Diff may do exactly what you wanted. It references somebody's thesis for the idea of how it works.
I am very new at Haskell so I won't be showing code, but hopefully my explanation may help for a solution.
First, why not just expose only the MemTree to the user, since that is what they will update, and the FileTree can be kept completely hidden. That way, later, if you want to change this to be going to a database, for example, the user doesn't see any difference.
So, since the FileTree is hidden, why not just read it in when you are going to update, then you have the offsets, so do the update, and close the file again.
One problem with keeping the offsets is that it prevents another program from making any changes to the file, and in your case that may be fine, but I think as a general rule it is a bad design.
The main change, that I see, is that the MemTree shouldn't be lazy, since the file won't be staying open.

A specific use case of algebraic data types

I was writing a generic enumerator to scrape sites as an exercise. It is complete and works fine, but I have a question. You can find the code here: https://github.com/mindreader/scrape-enumerator
The basic idea is I wanted an enumerator that spits out site defined entries on pages like search engines, blogs, things where you have to fetch a page, and it will have 25 entries, and you want one entry at a time. But at the same time I didn't want to write the plumbing for every site, so I wanted a generic interface. What I came up with is this (this uses type families):
class SiteEnum a where
  type Result a :: *
  urlSource :: a -> InputUrls (Int,Int)
  enumResults :: a -> L.ByteString -> Maybe [Result a]
data InputUrls state =
    UrlSet [URL] |
    UrlFunc state (state -> (state,URL)) |
    UrlPageDependent URL (L.ByteString -> Maybe URL)
In order to do this on every type of site, this requires a url source of some sort, which could be a list (possibly infinite) of pregenerated urls, or it could be an initial state and something to generate urls from it (like if the urls contained &page=1, &page=2, etc), and then for really screwed up pages like google, give an initial url and then provide a function that will search the body for the next link and then use that. Your site makes a data type an instance of SiteEnum and gives a type to Result which is site dependent and now the enumerator deals with all the I/O, and you don't have to think about it. This works perfectly and I implemented one site with it.
My question is about an annoyance with this implementation: the InputUrls datatype. When I use UrlFunc, everything is golden. When I use UrlSet or UrlPageDependent, it isn't all fun and games, because the state type is left undetermined and I have to annotate the value as :: InputUrls () in order for it to compile. This seems totally unnecessary, as that type variable, due to the way the program is written, will never be used for the majority of sites, but I don't know how to get around it. I'm finding that I want to use types like this in a lot of different contexts, and I always end up with stray type variables that are only needed in certain pieces of the datatype, but it doesn't feel like I should be using it this way. Is there a better way of doing this?
Why do you need the UrlFunc case at all? From what I understand, the only thing you're doing with the state function is using it to build a list like the one in UrlSet anyway, so instead of storing the state function, just store the resulting list. That way, you can eliminate the state type variable from your data type, which should eliminate the ambiguity problems.

Managing a stateful computation system in Haskell

So, I have a system of stateful processors that are chained together. For example, a processor might output the average of its last 10 inputs. It requires state to calculate this average.
I would like to submit values to the system and get the outputs. I would also like to jump back and restore the state at any time in the past. For example, I run 1000 values through the system. Now I want to "move" the system back to exactly as it was after I had sent the 500th value through. Then I want to "replay" the system from that point again.
I also need to be able to persist the historical state to disk so I can restore the whole thing again some time in the future (and still have the move back and replay functions work). And of course, I need to do this with gigabytes of data, and have it be extremely fast :)
I had been approaching it using closures to hold state. But I'm wondering if it would make more sense to use a monad. I have only read through 3 or 4 analogies for monads so don't understand them well yet, so feel free to educate me.
Suppose each processor modifies its state in the monad in such a way that its history is kept and tied to an id for each processing step. The monad could then somehow switch its state to a past step id and run the system from that state. The monad would also have some mechanism for (de)serializing itself for storage.
(and given the size of the data... it really shouldn't even all be in memory, which would mean the monad would need to be mapped to disk, cached, etc...)
Is there an existing library/mechanism/approach/concept that has already been done to accomplish or assist in accomplishing what I'm trying to do?
> So, I have a system of stateful processors that are chained together. For example, a processor might output the average of its last 10 inputs. It requires state to calculate this average.
First of all, it sounds like what you have are not just "stateful processors" but something like finite-state machines and/or transducers. This is probably a good place to start for research.
> I would like to submit values to the system, and get the outputs. I also would like to jump back and restore the state at any time in the past. Ie. I run 1000 values through the system. Now I want to "move" the system back to exactly as it was after I had sent the 500th value through. Then I want to "replay" the system from that point again.
The simplest approach here, of course, is to simply keep a log of all prior states. But since it sounds like you have a great deal of data, the storage needed could easily become prohibitive. I would recommend thinking about how you might construct your processors in a way that could avoid this, e.g.:
If a processor's state can be reconstructed easily from the states of its neighbors a few steps prior, you can avoid logging it directly
If a processor is easily reversible in some situations, you don't need to log those immediately; rewinding can be calculated directly, and logging can be done as periodic snapshots
If you can nail a processor down to a very small number of states, make sure to do so.
If a processor behaves in very predictable ways on certain kinds of input, you can record that as such--e.g., if it idles on numeric input below some cutoff, rather than logging each value just log "idled for N steps".
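The periodic-snapshot idea can be sketched as follows. This is written in Python for brevity and is only an illustration under stated assumptions: the inputs themselves are kept in a log, the processor is the "average of the last 10 inputs" example from the question, and the names step and History are hypothetical. Rewinding to step n loads the nearest earlier snapshot and replays the logged inputs from there.

```python
def step(state, value, window=10):
    """Pure step function: state is a tuple of the most recent inputs."""
    new_state = (state + (value,))[-window:]
    return new_state, sum(new_state) / len(new_state)


class History:
    """Input log plus periodic state snapshots (assumed design, not from the post)."""

    def __init__(self, snapshot_every=100):
        self.snapshot_every = snapshot_every
        self.inputs = []          # every input seen so far
        self.snapshots = {0: ()}  # number of inputs consumed -> state
        self.state = ()

    def feed(self, value):
        self.state, output = step(self.state, value)
        self.inputs.append(value)
        consumed = len(self.inputs)
        if consumed % self.snapshot_every == 0:
            self.snapshots[consumed] = self.state
        return output

    def state_at(self, n):
        """Rebuild the state after the first n inputs from the nearest snapshot."""
        base = max(k for k in self.snapshots if k <= n)
        state = self.snapshots[base]
        for value in self.inputs[base:n]:
            state, _ = step(state, value)
        return state


history = History(snapshot_every=100)
for i in range(1000):
    history.feed(i)
```

Because the state is explicit data rather than a closure, both the snapshots and the input log could be written to disk; the snapshot interval trades storage against replay time.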
> I also need to be able to persist the historical state to disk so I can restore the whole thing again some time in the future (and still have the move back and replay functions work). And of course, I need to do this with gigabytes of data, and have it be extremely fast :)
Explicit state is your friend. Functions are a convenient way to represent active state machines, but they can't be serialized in any simple way. You want a clean separation of a (basically static) network of processors vs. a series of internal states used by each processor to calculate the next step.
> Is there an existing library/mechanism/approach/concept that has already been done to accomplish what I'm trying to do? Does the monad approach make sense? Are there other better/special approaches that would help it do this efficiently especially given the enormous amount of data I have to manage?
If most of your processors resemble finite state transducers, and you need to have processors that take inputs of various types and produce different types of outputs, what you probably want is actually something with a structure based on Arrows, which gives you an abstraction for things that compose "like functions" in some sense, e.g., connecting the input of one processor to the output of another.
Furthermore, as long as you avoid the ArrowApply class and make sure that your state machine model only returns an output value and a new state, you'll be guaranteed to avoid implicit state because (unlike functions) Arrows aren't automatically higher-order.
> Given the size of the data... it really shouldn't even all be in memory, which would mean the monad would need to be mapped to disk, cached, etc...
Given a static representation of your processor network, it shouldn't be too difficult to also provide an incremental input/output system that would read the data, serialize/deserialize the state, and write any output.
As a quick, rough starting point, here's an example of probably the simplest version of what I've outlined above, ignoring the logging issue for the moment:
-- A processor with explicit state s, input type a, and output type b.
data Transducer s a b = Transducer { curState :: s
                                   , runStep  :: s -> a -> (s, b)
                                   }
-- Feed a list of inputs through; return the updated transducer and the outputs.
runTransducer :: Transducer s a b -> [a] -> (Transducer s a b, [b])
runTransducer t [] = (t, [])
runTransducer t (x:xs) = let (s, y)   = runStep t (curState t) x
                             (t', ys) = runTransducer (t { curState = s }) xs
                         in  (t', y:ys)
It's a simple, generic processor with explicit internal state of type s, input of type a, and output of type b. The runTransducer function shoves a list of inputs through, updating the state value manually, and returns the final transducer together with the list of outputs.
P.S. -- since you were asking about monads, you might want to know if the example I gave is one. In fact, it's a combination of multiple common monads, though which ones depends on how you look at it. However, I've deliberately avoided treating it as a monad! The thing is, monads capture only abstractions that are in some sense very powerful, but that same power also makes them more resistant in some ways to optimization and static analysis. The main thing that needs to be ruled out is processors that take other processors as input and run them, which (as you can imagine) can create convoluted logic that's nearly impossible to analyze.
So, while the processors probably could be monads, and in some logical sense intrinsically are, it may be more useful to pretend that they aren't; imposing an artificial limitation in order to make static analysis simpler.
