What is the proper way to store API client id/secret information in a separate file? There are many approaches, but there seems to be a lack of convention. If picking an approach for saving volatile strings is highly subjective, what factors should go into the decision, and when is it appropriate to use the String type vs a configuration library?
I see a couple of simple options to implement this that could be adapted to potentially follow the DRY principle:
String variables
-- define string variables in Keys.hs (Haskell module names must be capitalized)
module Keys where
mykey :: String
mykey = "key here"
-- then in the main file import these keys
import Keys
Raw text file
import qualified Data.ByteString as B
keyFile :: String
keyFile = "keys.txt"
getKeyFromFile :: IO B.ByteString
getKeyFromFile = B.readFile keyFile
For managing more keys, one could also use a library:
OAuth Authentication library
-- Define a data structure for each set
myoauth :: OAuth
myoauth =
  newOAuth { oauthServerName = "api.server.com"
           , oauthConsumerKey = "key here"
           , oauthConsumerSecret = "secret here"
           }
Configuration manager and use their config format
--taken from their example
my-group
{
a = 1
# groups support nesting
nested {
b = "yay!"
}
}
I think there are two issues here, and it's important to distinguish them:
1: What is the best practice for storing user secret keys regardless of the language.
2: What is the best way of implementing the above in Haskell.
I would always recommend keeping secrets in a separate file from the rest of the configuration information because it may well need to be treated differently. For instance a config file might be world readable and backed up, but the secrets file must not be world readable and might not be included in the backup. SELinux might also be configured to treat it differently by restricting which programs can read or write it. Designing your program to keep it separate enables the user to make these decisions.
As for data format, personally I would use JSON so you can store structured data such as salts, (user,password) pairs or whatever your application requires. But that is purely a matter of taste. A Binary instance will also work fine.
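As a language-neutral illustration of the first point, here is a minimal Python sketch of reading secrets from their own JSON file and refusing to proceed if that file is accessible to group or others. The helper name `load_secrets`, the file name, and the key names are assumptions made for this example, and the permission check is only meaningful on POSIX systems.

```python
import json
import os
import stat


def load_secrets(path):
    """Load secrets from a JSON file kept apart from the ordinary config.

    Hypothetical helper: raises PermissionError if the file is accessible
    to group or others (POSIX permission bits only).
    """
    mode = os.stat(path).st_mode
    if mode & (stat.S_IRWXG | stat.S_IRWXO):
        raise PermissionError("{} must be accessible only by its owner".format(path))
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

A caller might run `chmod 600 secrets.json` once and then read `load_secrets("secrets.json")["client_secret"]`, leaving the ordinary configuration file free to be world readable and backed up.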
You might also take a look at this previous answer of mine which discusses ways of ensuring that secrets are securely wiped from memory as soon as they are no longer required, although that isn't going to help if you use Haskell to parse a JSON representation of your secret data.
I have a situation where I need to store some intermediate values so I can reuse them in other parts of the root module. I know about local values and I know about null_data_source except I do not know which one is the recommended option for holding re-usable values. Both descriptions look somewhat similar to me
local values (https://www.terraform.io/docs/configuration/locals.html)
Local values can be helpful to avoid repeating the same values or expressions multiple times in a configuration, but if overused they can also make a configuration hard to read by future maintainers by hiding the actual values used.
and null_data_source (https://www.terraform.io/docs/providers/null/data_source.html)
The primary use-case for the null data source is to gather together collections of intermediate values to re-use elsewhere in configuration:
So both appear to be a valid choice for this scenario.
Here is my example code
locals {
  my_string_A = "This is string A"
}
data "null_data_source" "my_string_B" {
  inputs = {
    my_string_B = "This is string B"
  }
}
output "my_output_a" {
  value = "${local.my_string_A}"
}
output "my_output_b" {
  value = "${data.null_data_source.my_string_B.outputs["my_string_B"]}"
}
Could you suggest when to use one over the other for holding intermediate values, and what are the pros and cons of each approach?
Thank you
The null_data_source data source was introduced prior to the local values mechanism as an interim solution to meet that use-case before that capability became first-class in the language. It continues to be supported only for backward-compatibility with existing configurations using it.
All new configurations should use the Local Values mechanism instead. It's fully integrated into the Terraform language, supports values of any type (while null_data_source can support only strings), and has a much more concise/readable syntax.
Javascript's version of f-strings allows for string escaping through use of a somewhat funny API, e.g.
function escape(str) {
  var div = document.createElement('div');
  div.appendChild(document.createTextNode(str));
  return div.innerHTML;
}
function escapes(template, ...expressions) {
  return template.reduce((accumulator, part, i) => {
    return accumulator + escape(expressions[i - 1]) + part
  })
}
var name = "Bobby <img src=x onerr=alert(1)></img> Arson"
element.innerHTML = escapes`Hi, ${name}` // "Hi, Bobby &lt;img src=x onerr=alert(1)&gt;&lt;/img&gt; Arson"
Do Python f-strings allow for a similar mechanism, or do you need to bring your own string.Formatter? Would a more pythonic implementation wrap results into a class with an overridden __str__() method before interpolation?
When you're dealing with text that is going to be interpreted as code (e.g., text that the browser will parse as HTML or text that a database executes as SQL), you don't want to solve security issues by implementing your own escaping mechanism. You want to use the standard, widely tested tools to prevent them. This gives you much greater safety from attacks for several reasons:
The wide adoption means the tools are well tested and much less likely to contain bugs.
You know they have the best available approach to solving the problem.
They will help you avoid the common mistakes associated with generating the strings yourself.
HTML escaping
The standard tools for HTML escaping are templating engines, such as Jinja. The major advantage is that these are designed to escape text by default, rather than requiring you to remember to explicitly convert unsafe strings. (You do need to be cautious about bypassing or disabling the escaping, even temporarily, though. I have seen my share of insecure attempts to construct JSON in templates, but the risk in templates is still lower than in a system that requires explicit escaping everywhere.) Your example is pretty easy to implement with Jinja:
import jinja2
template_str = 'Hi, {{name}}'
name = "Bobby <img src=x onerr=alert(1)></img> Arson"
jinjaenv = jinja2.Environment(autoescape=jinja2.select_autoescape(['html', 'xml']))
template = jinjaenv.from_string(template_str)
print(template.render(name=name))
# Hi, Bobby &lt;img src=x onerr=alert(1)&gt;&lt;/img&gt; Arson
If you're generating HTML, though, chances are you're using a web framework such as Flask or Django. These frameworks include a templating engine and will require less set up than the above example.
MarkupSafe is a useful tool if you're trying to create your own template engine (some Python templating engines, such as Jinja, use it internally), and you could potentially integrate it with a Formatter. But there's no reason to reinvent the wheel. Using a popular engine will result in much simpler, easier to follow, more recognizable code.
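If you did want to try the Formatter route anyway, a minimal sketch might look like the following. It uses the standard library's `html.escape` rather than MarkupSafe so that it stays self-contained, and `EscapingFormatter` is a hypothetical name, not an existing class. A real templating engine remains the safer choice.

```python
import html
import string


class EscapingFormatter(string.Formatter):
    """Formatter that HTML-escapes every interpolated value (hypothetical helper)."""

    def format_field(self, value, format_spec):
        # Apply the normal format spec first, then escape the rendered text.
        rendered = super().format_field(value, format_spec)
        return html.escape(rendered)


name = "Bobby <img src=x onerr=alert(1)></img> Arson"
greeting = EscapingFormatter().format("Hi, {name}", name=name)
print(greeting)
# Hi, Bobby &lt;img src=x onerr=alert(1)&gt;&lt;/img&gt; Arson
```

Only the substituted values are escaped; the literal parts of the format string pass through untouched, which mirrors what the tagged template in the question does.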
SQL injection
SQL injection is not solved through escaping. PHP has a nasty history that everyone has learned from. The lesson is to use parameterized queries instead of trying to escape input. This prevents untrusted user data from ever being parsed as SQL code.
How you do this depends on exactly what libraries you're using for executing your queries, but for an example, doing so with SQLAlchemy's execute method looks like this:
session.execute(text('SELECT * FROM thing WHERE id = :thingid'), {'thingid': id})
Note that SQLAlchemy is not just escaping the text of id to ensure it does not contain attack code. It is actually differentiating between the SQL and the value for the database server. The database will parse the query text as a query, and then it will include the value separately after the query has been parsed. This makes it impossible for the value of id to trigger unintended side effects.
Note also that quoting issues are precluded by parameterized queries:
name = 'blah blah blah'
session.execute(text('SELECT * FROM thing WHERE name = :thingname'), {'thingname': name})
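To see the same idea run end to end without a database server, here is a small sketch using the standard library's `sqlite3` module instead of SQLAlchemy. The table name `thing` mirrors the example above; the hostile value is made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE thing (id INTEGER PRIMARY KEY, name TEXT)")

# A value containing a quote and SQL syntax; it is only ever treated as data.
hostile_name = "O'Brien'; DROP TABLE thing; --"
conn.execute(
    "INSERT INTO thing (name) VALUES (:thingname)",
    {"thingname": hostile_name},
)

rows = conn.execute(
    "SELECT id, name FROM thing WHERE name = :thingname",
    {"thingname": hostile_name},
).fetchall()
print(rows)
```

The value round-trips unchanged and the table is still there afterwards, because the query text and the value reach the database separately.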
If you can't parameterize, whitelist in memory
Sometimes, it's not possible to parameterize something. Maybe you're trying to dynamically select a table name based on the input. In these cases, one thing you can do is have a collection of known valid and safe values. By validating that the input is one of these values and retrieving a known safe representation of it, you avoid sending user input into your query:
# This could also be loaded dynamically if needed.
valid_tables = {
    # Keys are uppercased for lookup
    'TABLE1': 'table1',
    'TABLE2': 'Table2',
    'TABLE3': 'TaBlE3',
    # ...
}
def get_table_name(table_num):
    table_name = 'TABLE' + table_num
    try:
        return valid_tables[table_name]
    except KeyError:
        raise ValueError('Unknown table number: ' + table_num)
def query_for_thing(session, table_num):
    return session.execute(text('SELECT * FROM "{}"'.format(get_table_name(table_num))))
The point is you never want to allow user input to go into your query as something other than a parameter.
Make sure that this whitelisting occurs in application memory. Do not perform the whitelisting in SQL itself. Whitelisting in the SQL is too late; by that time, the input has already been parsed as SQL, which would allow the attacks to be invoked before the whitelisting could take effect.
Make sure you understand your library
In the comments, you mentioned PySpark. Are you sure you're doing this right? If you create a data frame using just a simpler SELECT * FROM thing and then use PySpark filtering functions, are you sure it doesn't properly push those filters down to the query, precluding the need to format values into it unparameterized?
Make sure you understand how data is normally filtered and manipulated with your library, and check if that mechanism will use parameterized queries or otherwise be efficient enough under the hood.
With small data, just filter in memory
If your data isn't at least in the tens of thousands of records, then consider just loading it into memory and then filtering:
filter_name = 'blah blah blah'
results = session.execute(text('SELECT * FROM thing'))
filtered_results = [r for r in results if r.name == filter_name]
If this is fast enough and parameterizing the query is hard, then this approach avoids all the security headaches of trying to make the input safe. Test its performance with somewhat more data than you expect to see in prod. I would use at least double of the maximum you expect; an order of magnitude would be even safer if you can make it perform.
If you're stuck without parameterized query support, the last resort is very strict limits on inputs
If you're stuck with a client that doesn't support parameterized queries, first check if you can use a better client. SQL without parameterized queries is absurd, and it's an indication that the client you're using is very low quality and probably not well maintained; it may not even be widely used.
Doing the following is NOT recommended. I include it only as an absolute last resort. Don't do this if you have any other choice, and spend as much time as you can (even a couple of weeks of research, I dare say) trying to avoid resorting to this. It requires a very high level of diligence on the part of every team member involved, and most developers do not have that level of diligence.
If none of the above is a possibility, then the following approach may be all you can do:
Do not query on text strings coming from the user. There is no way to make this safe. No amount of quoting, escaping, or restricting is guaranteed. I don't know all the details, but I've read of the existence of Unicode abuses that can allow bypassing character restrictions and the like. It's just not worth it to try. The only text strings allowed should be whitelisted in application memory (as opposed to whitelisted via some SQL or database function). Note that even leveraging database level quoting functions (like PostgreSQL's quote_literal) or stored procedures can't help you here because the text has to be parsed as SQL to even reach those functions, which would allow the attacks to be invoked before the whitelisting could take effect.
For all other data types, parse them first and then have the language render them into an appropriate string. Doing so again means avoiding having user input parsed as SQL. This requires you to know the data type of the input, but that's reasonable since you'll need to know that to construct the query. In particular, the available operations with a particular column will be determined by that column's data types, and the operation and column type will determine what data types are valid for the input.
Here's an example for a date:
from datetime import datetime
def fetch_data(start_date, end_date):
    # Check data types to prevent injections
    if not isinstance(start_date, datetime):
        raise ValueError('start_date must be a datetime')
    if not isinstance(end_date, datetime):
        raise ValueError('end_date must be a datetime')
    # WARNING: Using format with SQL queries is bad practice, but we don't
    # have a choice because [client lib] doesn't support parameterized queries.
    # To mitigate this risk, we do not allow arbitrary strings as input.
    # We tightly control the input's data type (to something other than text or binary) and the format used in the query.
    return session.execute(text(
        "SELECT * FROM thing WHERE timestamp BETWEEN CAST('{start}' AS TIMESTAMP) AND CAST('{end}' AS TIMESTAMP)"
        .format(
            # Make the format used explicit
            start=start_date.strftime('%Y-%m-%dT%H:%MZ'),
            end=end_date.strftime('%Y-%m-%dT%H:%MZ')
        )
    ))
user_input_start_date = '2019-05-01T00:00'
user_input_end_date = '2019-06-01T00:00'
parsed_start_date = datetime.strptime(user_input_start_date, "%Y-%m-%dT%H:%M")
parsed_end_date = datetime.strptime(user_input_end_date, "%Y-%m-%dT%H:%M")
data = fetch_data(parsed_start_date, parsed_end_date)
There are several details that you need to be aware of.
Notice that in the same function as the query, we're validating the data type. This is one of the rare exceptions in Python where you don't want to trust duck typing. This is a safety feature that ensures insecure data won't be passed into your function accidentally.
The format of the input when it's rendered into the SQL string is explicit. Again, this is about control and whitelisting. Don't leave it to any other library to decide what format the input will be rendered to; make sure you know exactly what the format is so that you can be certain that injections are impossible. I'm fairly certain that there's no injection possibility with the ISO 8601 date/time format, but I haven't confirmed that explicitly. You should confirm that.
The quoting of the values is manual. That's okay. And the reason it's okay is because you know what data types you're dealing with and you know exactly what the string will look like after it's formatted. This is by design: you're maintaining very strict, very tight control over the input's format to prevent injections. You know whether quotes need to be added or not based on that format.
Don't skip the comment about how bad this practice is. You have no idea who will read this code later and what knowledge or abilities they have. Competent developers who understand the security risks here will appreciate the warning; developers who weren't aware will be warned to use parameterized queries whenever available and to avoid carelessly including new conditions. If at all feasible, require that changes to these areas of code be reviewed by additional developers to further mitigate the risks.
This function should have full control over generating the query. It should not delegate its construction out to other functions. This is because the data type checking needs to be kept very, very close to the construction of the query to avoid mistakes.
The effect of this is a sort of looser whitelisting technique. You can't whitelist specific values, but you can whitelist the kinds of values you're working with and control the format they're delivered in. Forcing callers to parse the values into a known data type reduces the possibility of an attack getting through.
I'll also note that calling code is free to accept the user input in whatever format is convenient and to parse it using whatever tools you wish. That's one of the advantages of requiring a dedicated data type instead of strings for input: you don't lock callers into a particular string format, just the data type. For date/times in particular, you might consider some third party libraries.
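As a rough illustration of that freedom on the caller's side, here is a sketch that accepts a few input formats and always hands back a `datetime`. The helper name `parse_user_datetime` and the list of accepted formats are assumptions for this example; only the standard library is used.

```python
from datetime import datetime

# Formats the caller chooses to accept (an assumption for this sketch).
ACCEPTED_FORMATS = ("%Y-%m-%dT%H:%M", "%Y-%m-%d", "%d/%m/%Y")


def parse_user_datetime(raw):
    """Return a datetime for the first matching format, else raise ValueError."""
    for fmt in ACCEPTED_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt)
        except ValueError:
            continue
    raise ValueError("unrecognized date/time: {!r}".format(raw))
```

Whatever string the user typed, the query function only ever sees the parsed `datetime`, so an input such as `2019-05-01' OR 1=1 --` is rejected here and never reaches the SQL string.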
Here's another example with a Decimal value instead:
from decimal import Decimal
def fetch_data(min_value, max_value):
    # Check data types to prevent injections
    if not isinstance(min_value, Decimal):
        raise ValueError('min_value must be a Decimal')
    if not isinstance(max_value, Decimal):
        raise ValueError('max_value must be a Decimal')
    # WARNING: Using format with SQL queries is bad practice, but we don't
    # have a choice because [client lib] doesn't support parameterized queries.
    # To mitigate this risk, we do not allow arbitrary strings as input.
    # We tightly control the input's data type (to something other than text or binary) and the format used in the query.
    return session.execute(text(
        "SELECT * FROM thing WHERE thing_value BETWEEN CAST('{minv}' AS NUMERIC(26, 16)) AND CAST('{maxv}' AS NUMERIC(26, 16))"
        .format(
            # Make the format used explicit
            # Up to 16 decimal places. Maybe validate that at start of function?
            minv='{:.16f}'.format(min_value),
            maxv='{:.16f}'.format(max_value)
        )
    ))
user_input_min = '78.887'
user_input_max = '89789.78878989'
parsed_min = Decimal(user_input_min)
parsed_max = Decimal(user_input_max)
data = fetch_data(parsed_min, parsed_max)
Everything is basically the same. Just a slightly different data type and format. You're free to use whatever data types your database supports, of course. For example, if your DB does not require specifying a scale and precision on the numeric type or would auto-cast a string or can handle the value unquoted, you can structure your query accordingly.
You do not need to bring your own formatter if you're using Python 3.6 or newer. Python 3.6 introduced formatted string literals, see PEP 498: Formatted string literals.
Your example in Python 3.6 or newer would look like this:
name = "Bobby <img src=x onerr=alert(1)></img> Arson"
print(f"Hi, {name}") # Hi, Bobby <img src=x onerr=alert(1)></img> Arson
The format specification that can be used with str.format() can also be used with formatted string literals.
This example,
my_dict = {'A': 21.3, 'B': 242.12, 'C': 3200.53}
for key, value in my_dict.items():
    print(f"{key}{value:.>15.2f}")
will print the following:
A..........21.30
B.........242.12
C........3200.53
Additionally, since the string is evaluated at runtime, any valid Python expression can be used, for example,
name = "Abby"
print(f"Hello, {name.upper()}!")
will print
Hello, ABBY!
This is not a problem as such, but I would like an opinion on which approach is better. I need to read data from an outside source (TCP) that comes basically in this format:
key: value
okey: enum
stuff: 0.12240
amazin: 1020
And I need to parse it into a Haskell-accessible format. The two solutions I thought about were to parse it into a strict String to String map, or to use type declarations with record syntax.
Initially I thought to make a type synonym for my String => String map and write extractor functions like amazin :: NiceSynonym -> Int, doing the necessary treatment and parsing within each function, but that felt sketchy at the time. Then I tried an actual type declaration with record syntax and a custom Read instance. That was a nightmare, because there are a lot of enums and keys with different types. And it felt... disappointing. It simply wraps the arguments and creates reader functions, not much different from the original: amazin :: TypeDeclaration -> Int.
Now I'm kind of regretting not going with reader functions as I initially envisioned. So, is there anything else I'm forgetting to consider? Any pros and cons of either side to take note of? Is one objectively better than the other?
P.S.: Some considerations that may make one or the other better:
Once read I won't need to change it at all whatsoever, it's basically a status report
No need to compare, add, etc., again just status report no point
Not really a need for performance, I won't be reading hundreds a second or anything
TL;DR: Given that input example, what's the best way to make into a Haskell-readable format? map, data constructor, dependent map...
Both ways are valid in their own respects, but since I was also making an API to interact with this protocol, I preferred the record syntax so I could cover all the properties more easily. Also, I wasn't really going to do any checking or treatment in the getter functions, and no matter how boring making the Read instance for my type might have seemed, I bet writing all the getter functions manually would be worse. Parsing stuff manually is inherently boring; I guess I was just looking for a magical functional one-liner to do all the work for me.
I would like to transition from using puppet to plain old scripts. During this transition I would like for scripts to access the information in hiera. Is there a way for puppet to pass the all key value pairs to a script as an argument through an exec? If I could get puppet to pass a json blob of hiera into a script that would be perfect.
Through experimentation, I found the following. My hiera file contains the JSON below, and the lookups listed after it return these values:
{
"a" : ["a, b"],
"b" : "b",
"c" : {
"a" : {
"b" : "c"
}
}
}
hiera("a"): "ab"
hiera("b"): "b"
hiera("c"): ""
hiera(""): ""
Ideally I'd like to pass the entire JSON string from all hiera data sources into my scripts from puppet's exec. Can anyone confirm whether this is possible, or whether there is some workaround?
Is there a way for puppet to pass the all key value pairs to a script as an argument through an exec?
Not in general, no, because Hiera resolutions can be context-sensitive. There is no guarantee that "all key value pairs" is well-defined on a whole-node basis. Therefore, before even talking about how the data can be exchanged, you need to deal with the problem of what the data are.
Even if you suppose that none of the Hiera facilities are in use that contextualize data more narrowly than on a node-by-node basis, and that you need be concerned only with priority lookups (not array- or hash-merge lookups), Hiera has no built-in facility for compiling a composite of all data pertaining to a node. If you're dealing only with the standard YAML or JSON back-end, however, you could probably create and use your own hacked version that extracts the desired data, maybe as the value of some special key.
Even then, however, passing the data themselves as a command-line argument is highly questionable. You would need, first, to serialize the data into a form that will be interpreted as a single shell word. That surely can be automated, but next you risk running afoul of argument-length limits. A conforming POSIX system can impose a maximum argument length as small as 4096 bytes (though many have much larger limits) and that could easily be too little. And if you're trying to do this with Windows then know that its limits are even smaller.
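To get a feel for the argument-length concern on a particular machine, a script can ask the system for its limit. The sketch below is Unix-only (`os.sysconf` is not available on Windows), and the helper `fits_on_command_line` and its safety margin of half the limit are assumptions made for illustration.

```python
import json
import os

# The system's limit on the combined size of arguments and environment
# for a new process (Unix only).
arg_max = os.sysconf("SC_ARG_MAX")


def fits_on_command_line(data, limit=arg_max):
    """Rough check that `data`, serialized as JSON, fits in one argument.

    Hypothetical helper: leaves headroom because the limit also has to
    cover the environment and any other arguments.
    """
    encoded = json.dumps(data).encode("utf-8")
    return len(encoded) < limit // 2


print(arg_max, fits_on_command_line({"b": "b"}))
```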
As an alternative to passing the data as a command-line argument, you could consider writing them to a file that your script(s) will read. Even that seems a bit silly, however. Hiera has a CLI -- why not just distribute the Hiera data and hierarchy configuration, and have your scripts use Hiera to query the needed data from it?
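If you do go the file route, the consuming side can stay very small. This is a sketch under the assumption that the data have been written out as a single JSON document; `load_exported_data` and `lookup` are hypothetical helper names, and the file path is up to you.

```python
import json


def load_exported_data(path):
    """Read the JSON document that was written out for the scripts."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def lookup(data, *keys, default=None):
    """Walk nested dictionaries, returning `default` if any key is missing."""
    current = data
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current
```

With the sample data from the question, `lookup(data, "c", "a", "b")` would return `"c"`, and a missing key falls back to the default instead of raising.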
In general, this is not supposed to work.
You might be able to write a custom Hiera backend that recognizes a special lookup key and returns a fully merged hash of all data from the hierarchy.
I am not sure I have understood the question correctly, but I assume you are talking about some templating? You can use templates and put values in the placeholders.
Check out this link, if this helps.
http://codingbee.net/tutorials/puppet/puppet-generate-files-templates-using-hiera-data/
In a C function called from my Lua script, I'm using luaL_ref to store a reference to a function. However, if I then try to use the returned integer index to fetch that function from a different thread which isn't derived from the same state, all I get back is nil. Here's the simplest example that seems to demonstrate it:
// Assumes a valid lua_State pL, with a function on top of the stack
int nFunctionRef = luaL_ref(pL, LUA_REGISTRYINDEX);
// Create a new state
lua_State* pL2 = luaL_newstate();
lua_rawgeti(pL2, LUA_REGISTRYINDEX, nFunctionRef);
const char* szType = luaL_typename(pL2, -1);
I'm finding that szType then contains the value 'nil'.
My understanding was that the registry was globally shared between all C code, so can anyone explain why this doesn't work?
If the registry isn't globally shared in that way, how can I get access to my values like I need to from another script?
The registry is just a normal table in a Lua state, therefore two unrelated Lua states can't access the same registry.
As Kknd says, you'll have to provide your own mechanism. A common trick is creating an extra state that doesn't execute any code and is used only as storage. In your case, you'd use that extra state's registry from your C code. Unfortunately, there's no available method to copy arbitrary values between two states, so you'll have to unroll any tables.
Copying functions is especially hard. If you're using the registry for that, you might want to keep track of which state you used to store the function and execute it on the original state, effectively turning it into a cross-state call instead of moving the function.
luaL_newstate() creates another separate state, as the name says. The registry is only shared between 'threads', created with lua_newthread(parent_state);
Edit to match the question edit:
You can run the scripts in the same state, or, if you don't want that, you will need to provide your own mechanism to synchronize the data between the two states.
To use multiple Lua universes (states) you might find Lua Lanes worth a look. There is also a rough comparison of multi-state Lua solutions.
Lanes actually does provide the 'hidden state' that Javier mentions. It also handles locks needed to access such shared data and the ability to wait for such data to change. And it copies anything that is copyable (including functions and closures) between your application states and the hidden state.