Does mogrify fix the concern with injection attacks? - psycopg2

I understand that psycopg2 queries should not be formed by text substitution (f-strings, % formatting, etc.) for fear of injection attacks. The docs make that clear. However, what's not clear to me is whether the cursor.mogrify method is subject to the same concerns.
The docs say,
mogrify(operation[, parameters])
Return a query string after arguments binding. The string
returned is exactly the one that would be sent to the database
running the execute() method or similar.
https://www.psycopg.org/docs/cursor.html#cursor.mogrify
That makes it sound like execute basically runs mogrify behind the scenes. The red box talking about guns is pretty scary, though. I don't know what to trust.
Basically, if this is bad,
# don't do this even at gunpoint
my_id = 1234
my_values = ['a', 'b', 'c']
my_query = ''
for val in my_values:
    insert_statement = f"""INSERT INTO my_table VALUES ({my_id}, {val});"""
    my_query = '\n'.join([my_query, insert_statement])
with self.connection, self.connection.cursor() as cursor:
    cursor.execute(my_query)
is this a good substitute?
# is this a footgun?
my_id = 1234
my_values = ['a', 'b', 'c']
with self.connection, self.connection.cursor() as cursor:
    for val in my_values:
        my_query = cursor.mogrify("INSERT INTO my_table VALUES (%s, %s);", (my_id, val))
        cursor.execute(my_query)

That description is a bit misleading.
The wire protocol of PostgreSQL fully supports parametrization. A parametrized query is sent as a "multipart" packet containing (among other things) the parametrized query plus the values to 'substitute' into it.
Execution of the query is done by the PostgreSQL engine, where the (parametrized) query is tokenized and the tokens indicating substitution are replaced with the actual desired values.
So, although I have never read the actual source code of psycopg, I'm quite certain that behind the scenes cursor.execute() leverages this native support for parametrization rather than doing a .mogrify() first.
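For completeness, a minimal sketch of what the fully parameterized version of the question's insert could look like (assuming the same my_table layout and open self.connection as in the question); the values are handed to the driver separately and no SQL string is built by hand:
# Sketch only: the driver binds each tuple of values; nothing is interpolated manually.
my_id = 1234
my_values = ['a', 'b', 'c']
with self.connection, self.connection.cursor() as cursor:
    cursor.executemany(
        "INSERT INTO my_table VALUES (%s, %s);",
        [(my_id, val) for val in my_values],
    )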

Related

Neo4j graph blind search for any node and relationship containing an expression?

I am trying to build a blind search, given an expression/string.
Using Python Neo4j driver I am running:
from neo4j import GraphDatabase

driver = GraphDatabase.driver("neo4j://localhost:7687")

def query_engine(tx, query):
    res = tx.run(query)
    values = [record for record in res]
    return values

def fuzzy_search(tx, search_expression):
    query = f"MATCH (n) WHERE ANY(x in keys(n) WHERE n[x] =~ '(i?){search_expression}.*') RETURN n"
    res = query_engine(tx, query)
    return res

with driver.session() as session:
    result = session.read_transaction(fuzzy_search, "kuku.*")

driver.close()
I know I need to add a full text index to make it faster; please advise what the best practice is to define the full text index in Neo4j when I want to perform a full graph search on the node/relationship properties.
For example, I am searching for 'kuku' in my graph across all nodes and relations and if there are any nodes/relations that contain kuku, I would like to be able to return it as a result.
Additional info:
I have added an additional label (FTIndex) to all my nodes and I am able to create a full text index, BUT(!) how can I configure it to index ALL available node properties and be sure it will be updated if I add new ones?
You would have to enumerate the properties you want to search for in the full-text index.
Unfortunately there is no way around that.
So basically create an index for your label FTIndex and all properties; that index should then efficiently find your results.
In general, please don't use string interpolation but parameters, i.e. $search_expression, to avoid injection security issues,
and then
MATCH (n)
WHERE ANY(x IN keys(n)
          WHERE n[x] =~ '(?i)' + $search_expression + '.*')
RETURN n
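Applied to the Python driver code from the question, a rough sketch of the parameterized version (keeping the question's function names; the '(?i)' prefix is an assumption about the intended case-insensitive match):
def fuzzy_search(tx, search_expression):
    query = (
        "MATCH (n) "
        "WHERE ANY(x IN keys(n) WHERE n[x] =~ '(?i)' + $search_expression + '.*') "
        "RETURN n"
    )
    # search_expression is sent as a query parameter, never spliced into the query text.
    return [record for record in tx.run(query, search_expression=search_expression)]

with driver.session() as session:
    result = session.read_transaction(fuzzy_search, "kuku.*")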

Is it possible to do lazy formatting of a python string? [duplicate]

I want to use f-string behaviour with a string held in a variable, not with a string defined in a string literal, "...".
Here is my code:
name = ["deep", "mahesh", "nirbhay"]
user_input = r"certi_{element}"  # this string I ask from user
for element in name:
    print(f"{user_input}")
This code gives output:
certi_{element}
certi_{element}
certi_{element}
But I want:
certi_{deep}
certi_{mahesh}
certi_{nirbhay}
How can I do this?
f"..." strings are great when interpolating expression results into a literal, but you don't have a literal, you have a template string in a separate variable.
You can use str.format() to apply values to that template:
name = ["deep", "mahesh", "nirbhay"]
user_input = "certi_{element}"  # this string I ask from user
for value in name:
    print(user_input.format(element=value))
String formatting placeholders that use names (such as {element}) are not variables. You assign a value for each name in the keyword arguments of the str.format() call instead. In the above example, element=value passes in the value variable's value to fill the {element} placeholder.
Unlike f-strings, the {...} placeholders are not expressions and you can't use arbitrary Python expressions in the template. This is a good thing: you wouldn't want end users to be able to execute arbitrary Python code in your program. See the Format String Syntax documentation for details.
You can pass in any number of names; the string template doesn't have to use any of them. If you combine str.format() with the **mapping call convention, you can use any dictionary as the source of values:
template_values = {
    'name': 'Ford Prefect',
    'number': 42,
    'company': 'Sirius Cybernetics Corporation',
    'element': 'Improbability Drive',
}
print(user_input.format(**template_values))
The above would let a user use any of the names in template_values in their template, any number of times they like.
While you can use locals() and globals() to produce dictionaries mapping variable names to values, I'd not recommend that approach. Use a dedicated namespace like the above to limit what names are available, and document those names for your end-users.
If you define:
def fstr(template):
    return eval(f"f'{template}'")
Then you can do:
name = ["deep", "mahesh", "nirbhay"]
user_input = r"certi_{element}"  # this string I ask from user
for element in name:
    print(fstr(user_input))
Which gives as output:
certi_deep
certi_mahesh
certi_nirbhay
But be aware that users can use expressions in the template, e.g.:
import os  # assume you have used os somewhere
user_input = r"certi_{os.environ}"
for element in name:
    print(fstr(user_input))
You definitely don't want this!
Therefore, a much safer option is to define:
def fstr(template, **kwargs):
    return eval(f"f'{template}'", kwargs)
Arbitrary code is no longer possible, but users can still use string expressions like:
user_input = r"certi_{element.upper()*2}"
for element in name:
print(fstr(user_input, element=element))
Gives as output:
certi_DEEPDEEP
certi_MAHESHMAHESH
certi_NIRBHAYNIRBHAY
Which may be desired in some cases.
If you want the user to have access to your namespace, you can do that, but the consequences are entirely on you. Instead of using f-strings, you can use the format method to interpolate dynamically, with a very similar syntax.
If you want the user to have access to only a small number of specific variables, you can do something like
name = ["deep", "mahesh", "nirbhay"]
user_input = "certi_{element}"  # this string I ask from user
for element in name:
    my_str = user_input.format(element=element)
    print(f"{my_str}")
You can of course rename the key that the user inputs vs the variable name that you use:
my_str = user_input.format(element=some_other_variable)
And you can just go and let the user have access to your whole namespace (or at least most of it). Please don't do this, but be aware that you can:
my_str = user_input.format(**locals(), **globals())
The reason that I went with print(f'{my_str}') instead of print(my_str) is to avoid the situation where literal braces get treated as further, erroneous expansions. For example, user_input = 'certi_{{{element}}}'
I was looking for something similar with your problem.
I came across this other question's answer: https://stackoverflow.com/a/54780825/7381826
Using that idea, I tweaked your code:
user_input = r"certi_"
for element in name:
print(f"{user_input}{element}")
And I got this result:
certi_deep
certi_mahesh
certi_nirbhay
If you would rather stick to the layout in the question, then this final edit did the trick:
for element in name:
    print(f"{user_input}" "{" f"{element}" "}")
Reading the security concerns raised in the other answers, I don't think this alternative has serious security risks because it does not define a new function with eval().
I am no security expert so please do correct me if I am wrong.
This is what you’re looking for. Just change the last line of your original code:
name = ["deep", "mahesh", "nirbhay"]
user_input = "certi_{element}"  # this string I ask from user
for element in name:
    print(eval("f'" + f"{user_input}" + "'"))

Lists, tuples or dictionaries: differences and usage. How can I store info in Python?

I'm very new to Python (I usually write in PHP). I want to understand how to store information in an associative array, and it would be wonderful if you could explain to me the difference between "tuples", "arrays", "dictionaries" and "lists" (I tried to read different sources but I'm still not catching it).
So This is my code:
#!/usr/bin/python3.4
import csv
import string

nidless_keys = dict()
nidless_keys = ['test_string1', 'test_string2']  # these are the strings to be
                                                 # searched for in linesreader
data = {'type': [], 'id': []}  # here I want to store my information

with open('path/to/csv/file.csv', newline="") as csvfile:
    linesreader = csv.reader(csvfile, delimiter=',', quotechar="|")
    for row in linesreader:  # every line in this csv has a url like
                             # www.test.com/?test_string1&id=123456
        current_row_string = str(row)
        for needle in nidless_keys:
            current_needle = str(needle)
            if current_needle in current_row_string:
                data[current_needle[current_row_string[-8:]]) += 1  # also I need to
                # count, for every id, how many rows there are.
In conclusion:
my_data_stored = [current_needle][current_row_string[-8]]
current_row_string[-8:] is from a url, where the last 8 characters of the url are an ID.
So the array should look like this at the end of the script:
test_string1 = 123456 = 20
= 256468 = 15
test_string2 = 123155 = 10
Edit 1:
Which type I need here to store the information?
Can you tell me how to resolve this script?
It seems you want to count how many times an ID in combination with a test string occurs.
There can be multiple ID/count combinations associated with every test string.
This suggests that you should use a dictionary indexed by the test strings to store the results. In that dictionary I would suggest storing collections.Counter objects.
Normally you would have to add a special case for when a key isn't found in the results dictionary yet, in order to add an empty Counter first. This is such a common problem that there is a specialized form of dictionary in the collections module called defaultdict, which handles it for you.
import collections
import csv

# Using a tuple for the keys so it cannot be accidentally modified
keys = ('test_string1', 'test_string2')
result = collections.defaultdict(collections.Counter)

with open('path/to/csv/file.csv', newline="") as csvfile:
    linesreader = csv.reader(csvfile, delimiter=',', quotechar="|")
    for row in linesreader:
        for key in keys:
            if key in row:
                id = row[-6:]  # ID's are six digits in your example.
                # The first index is into the dict, the second into the Counter.
                result[key][id] += 1
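As a purely illustrative follow-up, printing result in roughly the layout sketched in the question could look like this:
for key, counter in result.items():
    for id, count in counter.items():
        print(f"{key} = {id} = {count}")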
There is an even easier way, by using regular expressions.
Since you seem to treat every row in a CSV file as a string, there is little need to use the CSV reader, so I'll just read the whole file as text.
import re

with open('path/to/csv/file.csv') as datafile:
    text = datafile.read()

pattern = r'\?(.*)&id=(\d+)'
The pattern is a regular expression. This is a large topic in and of itself, so I'll only cover briefly what it does (you might also want to check out the relevant HOWTO). At first glance it looks like complete gibberish, but it is actually a complete language.
It looks for two things in a line: anything between ? and &id=, and a sequence of digits after &id=.
I'll be using IPython to give an example.
(If you don't know it, check out IPython. It is great for trying things and see if they work.)
In [1]: import re
In [2]: pattern = r'\?(.*)&id=(\d+)'
In [3]: text = """www.test.com/?test_string1&id=123456
....: www.test.com/?test_string1&id=123456
....: www.test.com/?test_string1&id=234567
....: www.test.com/?foo&id=234567
....: www.test.com/?foo&id=123456
....: www.test.com/?foo&id=1234
....: www.test.com/?foo&id=1234
....: www.test.com/?foo&id=1234"""
The text variable points to the string which is a mock-up for the contents of your CSV file.
I am assuming that:
every URL is on its own line
ID's are a sequence of digits.
If these assumptions are wrong, this won't work.
Using findall to extract every match of the pattern from the text.
In [4]: re.findall(pattern, text)
Out[4]:
[('test_string1', '123456'),
('test_string1', '123456'),
('test_string1', '234567'),
('foo', '234567'),
('foo', '123456'),
('foo', '1234'),
('foo', '1234'),
('foo', '1234')]
The findall function returns a list of 2-tuples (that is key, ID pairs). Now we just need to count those.
In [5]: import collections
In [6]: result = collections.defaultdict(collections.Counter)
In [7]: intermediate = re.findall(pattern, text)
Now we fill the result dict from the list of matches that is the intermediate result.
In [8]: for key, id in intermediate:
   ....:     result[key][id] += 1
   ....:
In [9]: print(result)
defaultdict(<class 'collections.Counter'>, {'foo': Counter({'1234': 3, '123456': 1, '234567': 1}), 'test_string1': Counter({'123456': 2, '234567': 1})})
So the complete code would be:
import collections
import re

with open('path/to/csv/file.csv') as datafile:
    text = datafile.read()

result = collections.defaultdict(collections.Counter)
pattern = r'\?(.*)&id=(\d+)'
intermediate = re.findall(pattern, text)

for key, id in intermediate:
    result[key][id] += 1
This approach has two advantages.
You don't have to know the keys in advance.
ID's are not limited to six digits.
A brief summary of the python data types you mentioned:
A dictionary is an associative array, aka hashtable.
A list is a sequence of values.
An array is essentially the same as a list, but limited to basic datatypes. My impression is that they only exist for performance reasons; I don't think I've ever used one. If performance is that critical to you, you probably don't want to use Python in the first place.
A tuple is a fixed-length sequence of values (whereas lists and arrays can grow).
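A quick illustrative sketch of the four types (the names and values below are made up):
from array import array

groceries = ['milk', 'eggs', 'bread']   # list: mutable, heterogeneous sequence
point = (3, 4)                          # tuple: fixed-length, immutable
prices = {'milk': 1.5, 'eggs': 2.0}     # dict: associative array (hashtable)
numbers = array('i', [1, 2, 3])         # array: homogeneous basic types only

groceries.append('tea')   # lists can grow
prices['bread'] = 1.2     # dicts map keys to values
# point[0] = 5            # would raise TypeError: tuples are immutable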
Let's take them one by one.
Lists:
A list is a very basic kind of data structure, similar to arrays in other languages in terms of the way we write them:
['a','b','c']
This is a list in Python, and it looks very similar to an array structure.
However, there is a big difference between the way lists are used in Python and the usual arrays.
Lists are heterogeneous in nature. This means that we can store any kinds of data inside one simultaneously, like:
ls = [1,2,'a','g',True]
As you can see, we have various kinds of data within the list, and it is still a valid list.
One important thing about lists is that we can access their items using zero-based indices. So we can write:
print(ls[0], ls[3])
output: 1 g
Dictionary:
This data structure is similar to a hash map. It contains (key, value) pairs. An empty dictionary looks like:
dc = {}
Now, to store key/value pairs, e.g. ('potato', 3) and ('tomato', 5), we can do:
dc['potato'] = 3
dc['tomato'] = 5
and we saved the data in the dictionary dc.
The important thing is that we can even store another data structure, like a list, within a dictionary:
dc['list1'] = ls  # where ls is the list defined above
This shows the power of using a dictionary.
In your case, you have defined a dictionary like this:
data = {'type':[],'id':[]}
This means that your dictionary will consist of only two keys and each key corresponds to a list, which are empty for now.
Talking a bit about your script, the expression:
current_row_string[-8:]
doesn't make sense. The index should have been -6 instead of -8; that would give you the id part of the current row.
This part is the id and should be stored in a variable, say:
id = current_row_string[-6:]
Further processing can be performed as shown in the answer given by Roland.

Unpredictable behaviour of Scala triple-quoted string

I'm using JUnit in Scala to compare string output from my Scala code. Something like:
val expected = """<Destination id="0" type="EMAIL">
<Address>
me#test.com
</Address>
<TimeZone>
US/Eastern
</TimeZone>
<Message>
</Message>
</Destination>
"""
val actual = getTextToLog()
info(" actual = " + actual)
assert(expected == actual)
The issue is that for some strings, assertions like:
assert(expected == actual)
work and for some strings they don't. Even when I copy actual (logged to the Eclipse console) and paste it into expected just to be sure, the assertion still fails.
What am I missing?
OK, since this turns out to be a whitespace issue, you should sanitise the two strings before comparing them. Look at the RichString methods like .lines, for example, which might let you create a line-ending- or whitespace-agnostic comparison method.
Here is one naive way of doing this with implicit conversions:
import scala.language.implicitConversions

object WhiteSpace {
  implicit def strToWhiteSpace(s: String) = new WhiteSpace(s)
}

class WhiteSpace(val s: String) {
  def `~==`(other: String) = s.lines.toList == other.lines.toList
}
which allows
import WhiteSpace._
assert(expected ~== actual)
Or you could extend the appropriate jutils class to add an agnostic version of assertEquals.
Note that this comparison deconstructs both strings in the same way. This is much safer than sending one of them on a round-trip conversion.
Whitespace/crlf issues are so common that there's no point fighting it by trying to stop the mismatches; just do agnostic comparisons.

A more "pythonic" approach to "check for None and deal with it"

I have a list of dicts with keys ['name', 'content', 'summary', ...]. All the values are strings, but some values are None. I need to remove all the newlines in content, summary and some other keys. So, I do this:
...
...
for item in item_list:
    name = item['name']
    content = item['content']
    if content is not None: content = content.replace('\n', '')
    summary = item['summary']
    if summary is not None: summary = summary.replace('\n', '')
...
...
...
...
I somewhat feel that the if x is not None: x = x.replace('\n', '') idiom is not very intelligent or clean. Is there a more "pythonic" or better way to do it?
Thanks.
The code feels unwieldy to you, but part of the reason is that you are repeating yourself. This is better:
def remove_newlines(text):
    if text is not None:
        return text.replace('\n', '')

for item in item_list:
    name = item['name']
    content = remove_newlines(item['content'])
    summary = remove_newlines(item['summary'])
If you are going to use sentinel values (None) then you will be burdened with checking for them.
There are a lot of different answers to your question, but they seem to be missing this point: don't use sentinel values in a dictionary when the absence of an entry encodes the same information.
For example:
bibliography = [
    {'name': 'bdhar', 'summary': 'questioner'},
    {'name': 'msw', 'content': 'an answer'},
]
then you can
for article in bibliography:
    for key in article:
        ...
and then your loop is nicely ignorant of what keys, if any, are contained in a given article.
In reading your comments, you say that you are getting the dict from somewhere else. So clean it of junk values first. It is much clearer to have a cleaning step than it is to carry that misunderstanding through your code.
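A minimal sketch of such a cleaning step (assuming item_list is the list of dicts from the question and that dropping the None entries entirely is acceptable):
# Build a cleaned copy in which keys whose value is None are simply absent.
cleaned_list = [
    {key: value for key, value in item.items() if value is not None}
    for item in item_list
]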
Python has a ternary operator, so one option is to do this in a more natural word order:
content = content.replace('\n', '') if content is not None else None
Note that if "" and None are equivalent in your case (which appears to be so), you can shorten it to just if content, as non-empty strings evaluate to True.
content = content.replace('\n', '') if content else None
This also follows the Python idiom of explicit is better than implicit: it shows someone following the code very clearly that the value can be None.
It's worth noting that if you repeat this operation a lot, it might be worth encapsulating it as a function.
Another idiom in Python is to ask for forgiveness, not permission. So you could simply try the replacement and catch the AttributeError that follows; however, this becomes a lot more verbose in this case, so it's probably not worth it, especially as the cost of the check is so small.
try:
    content = content.replace('\n', '')
except AttributeError:
    content = None
    # pass  # Also an option, but as mentioned above, explicit is generally clearer than implicit.
One possibility is to use the empty string instead of None. This is not a fully general solution, but in many cases if your data is all of a single type, there will be a sensible "null" value other than None (empty string, empty list, zero, etc.). In this case it looks like you could use the empty string.
The empty string evaluates to False in Python, so the Pythonic way is if content:.
In [2]: bool("")
Out[2]: False
In [3]: bool("hello")
Out[3]: True
Side note but you can make your code a little clearer:
name, content = item["name"], item["content"]
And:
content = content.replace('\n','') if content else None
You might also consider abstracting some of your if clauses into a separate function:
def remove_newlines(mystr):
    if mystr:
        mystr = mystr.replace('\n', '')
    return mystr
(edited to remove the over-complicated solution with dictionaries, etc)
Try:
if content: content = content.replace('\n','')
--
if content will (almost[1]) always be True as long as content contains anything except for 0, False, or None.
[1] As Lattyware correctly points out in the comments, this is not strictly true. There are other things that will evaluate to False in an if statement, for example an empty list.
I think that the "pythonic" thing is to use the fact that None will evaluate to False in an if statement.
So you can just say:
if content: content = content.replace('\n','')
