How to efficiently pass a huge parameter to a function? - python-3.x

Thank you for reading, and please forgive my English; I'm French. Is that a justification?
The aim of my function is to scan many IPv4:PORT pairs via multiprocessing/threading (do not do this, I warn you, do not!) in Python 3, send a SYN, and record any result. Multiprocessing (or threading) comes in because of the timeout spent waiting for any potential reply.
So.
This function is called a producer. It stores potential TCP connections in a database. Consumers
then read the database and do some pentesting independently. So.
I prepared an IPv4 list: a random list of valid IPs, tens of thousands of elements. Remember we have 65K ports to test per IPv4 address.
My method is to shuffle the port list with a new seed for each producer launched, and I have many producers. Each has a valid IPv4 list, but if we consider 65K ports across 100K IPs, we have 6.5G elements to pass.
Why?
IP[] is random.shuffle()-d by construction, and Ports[] is too.
If, for each p in Ports, I join each IPv4 with p and append the pair to params[], I can launch the parallelized job via scan(u) for u in params[].
I launch it via run(rounds) like this:
def run(rounds):
    for r in rounds:
        scan(r)
But the problem is that size(rounds) = size(params) ≈ 6.5G elements.
I cannot find a memory-efficient way to pass such a huge parameter (a big list) to a parallelized job; I keep running out of memory.
I'm not asking how to manage it on a mind-blowingly capable workstation: the function works on paper, but it doesn't fit into a Raspberry Pi 2 (1 GB of RAM).
I don't need Python 3 for myself at all; it's a teaching product. Still, I'm stuck.
My question is: could you help me find an efficient way for a parallelized function to consume a huge list, popping items off it, without receiving the list as a parameter?
I've googled and followed the forums and threads I'm aware of, but however I refactor the program, the same problem stays, laughing at me, at the same place in main().
What I don't want to reproduce:
import ipaddress as i
import random as r

MAX_IPV4 = i.IPv4Address._ALL_ONES
MAX_PORT = 65535

def r_ipv4():
    # draw random addresses until one is publicly routable
    while True:
        a = i.IPv4Address._string_from_ip_int(r.randint(0, MAX_IPV4))
        b = i.IPv4Address(a)
        if b.is_multicast or b.is_private or b.is_loopback or b.is_reserved or b.is_link_local:
            continue
        return a

def generate_target():
    return r_ipv4()

def generate_r_targets(n):
    ips = [generate_target() for _ in range(n)]
    ports = list(range(1, MAX_PORT + 1))  # range(1, 65535) would miss port 65535
    r.shuffle(ports)
    targets = []
    for p in ports:
        for z in ips:
            targets.append(str(z) + "," + str(p))
    return targets
It's not my optimal method, but it's the best way I can explain and show the problem.
I don't want to lose the (shuffled ports) iterable; it is my principal vector and it is at the heart of the design.
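For illustration, a minimal sketch of the lazy alternative (scan() and the IP list below are stand-ins, not the code from the question): keep the shuffled ports as a small list and produce the IPv4:port pairs with a generator, which multiprocessing.Pool.imap_unordered consumes chunk by chunk, so params[] is never materialized.

import itertools
import random
from multiprocessing import Pool

def target_stream(ips, seed):
    # the shuffled ports list stays tiny (65535 ints); pairs are generated lazily
    ports = list(range(1, 65536))
    random.Random(seed).shuffle(ports)
    for p, ip in itertools.product(ports, ips):
        yield ip + "," + str(p)

def scan(target):
    # stand-in for the real SYN probe; record hits in the database instead
    return target

if __name__ == "__main__":
    ips = ["192.0.2.1", "192.0.2.2"]  # stand-in for the real IPv4 list
    with Pool(4) as pool:
        for result in pool.imap_unordered(scan, target_stream(ips, seed=1), chunksize=256):
            pass  # consume results as they arrive

Note that itertools.product fully consumes ips into a tuple up front, which is fine here since the IP list (unlike the cross product) is only ~100K strings.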
I'm testing a dynamic configuration of knockd.
Say Computer_one registers a sequence in knockd.
Computer_one is the only one to know the seed.
The seed is the key to knockd's sequence of keys; that part is done.

We passed it via a plain file :)

Related

Calculating a dynamic delay in AnyLogic

Good day!
Please help me understand how the Delay block works in AnyLogic. Suppose we deal with a multichannel transmission network.
The model has 2 sources. Suppose these sources generate packets every 1 sec. Packets from different sources have different priorities and need different quantities of resources to be served (set up with the Priority and Resource_quantity parameters, respectively). The Priority_queue in the model is priority-based. The model puts packets into the channels according to resource availability in each channel. First it tries to put the packet into the first channel. If there are no available resources, it puts the packet into the second channel. If there are no resources in either channel, it waits (realized with a Hold block).
I noticed that if I set the delays in blocks delay1 and delay2 with static parameters (e.g. 2 sec), the model works fine. But when I try to calculate the delay just before these blocks, the model doesn't take it into account at all, and in that case the model works without any delays.
What did I do wrong here?
I would appreciate any help.
The delay is calculated in the Exit block and is written into the agent's delay variable. I tried adding traceln(agent.delay) right after the delay calculation, as @Jaco-Ben suggested, and it showed zero. In that case it also doesn't seize resources :(
Thanks @Jaco-Ben for the useful comments.
The delay is zero because the result of division in Java depends on the types of the operands. If both operands are integers, the result will be an integer as well. To make Java perform real division (and get a real number as the result), at least one of the operands must be of a real type.
So that was my problem.
To solve it, I cast one of the operands to double:
agent.delay = (double)agent.Resource_quantity/ChannelResources1.idle();
However, it is strange that it showed correct values in the database.

1000-row limit for the chef-api module/wrapper

So I'm using this Node module to connect to Chef from my API.
https://github.com/normanjoyner/chef-api
It exposes a method called "partialSearch" which fetches selected attributes for all nodes that match a given criterion. The problem I have: one of our environments has 1386 nodes attached to it, but the module seems to return at most 1000 results.
There does not seem to be any method to "offset" the results. The module works pretty well otherwise, and it's a shame this feature is not implemented, since its absence really limits the module's usefulness.
Has anyone bumped into a similar issue with this module who can advise how to work around it?
Here is an extract of my code:
chef.config(SetOptions(environment));
console.log("About to search for any servers ...");
chef.partialSearch('node',
  {
    q: "name:*"
  },
  {
    name: ['name'],
    'ipaddress': ['ipaddress'],
    'chef_environment': ['chef_environment'],
    'ip6address': ['ip6address'],
    'run_list': ['run_list'],
    'chef_client': ['chef_client'],
    'ohai_time': ['ohai_time']
  },
  function(err, chefRes) {
    // ...
  });
Regards!
The maximum is 1000 results per page, you can still request pages in order. The Chef API doesn't have a formal cursor system for pagination so it's just separate requests with a different start value, which can sometimes lead to minor desync (as in an item at the end of one page might shift in ordering and also show up at the start of the next page) so just make sure you handle that. That said, the fancy API in the client library you linked doesn't seem to expose that option, so you'll have to add it or otherwise workaround the problem. Check out https://github.com/sethvargo/chef-api/blob/master/lib/chef-api/resources/partial_search.rb#L34 for a Ruby implementation that does handle it.
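To illustrate that paging loop in the abstract (a sketch only: chef_search below is a hypothetical stand-in for whatever signed request your client sends; the Chef search API itself takes start and rows parameters and returns the matching rows plus a total count):

def all_rows(chef_search, q="name:*", rows=1000):
    # chef_search(index, q, start, rows) is assumed to perform one signed
    # request and return the parsed JSON body: {'rows': [...], 'total': N, ...}
    start = 0
    while True:
        page = chef_search("node", q=q, start=start, rows=rows)
        for row in page["rows"]:
            yield row
        start += rows
        if start >= page["total"]:
            break

Because ordering can shift between pages, deduplicate the combined results (for example by node name) before using them.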
We have run into similar issues with Chef libraries. One workaround you might find useful, if you have a suitable node attribute, is to segment all of your nodes into smaller groups of fewer than 1000.
If you have no such natural segmentation already, a simple implementation would be to create a new attribute called segment and, during your Chef runs, set the attribute's value randomly to a number between 1 and 5.
Now you can perform 5 queries (each query searching a single segment) and you should find all your nodes; if the randomness is working, each group will hold about 277 nodes (1386/5).
As your node population grows, you'll need to keep increasing the number of segments to ensure the segment sizes stay under 1000.

Hazelcast High Response Times

We have a Java 1.6 application that uses Hazelcast 3.7.4, with a two-node topology. The application operates mainly on 3 maps. In normal operation, response times when querying the maps are generally around a few tens of milliseconds.
I have observed that in some circumstances, such as network cuts, the response time increases to huge values, for example 20 or 30 seconds!! And this is impacting application performance.
I would like to know whether network micro-cuts can increase query response times in this manner. I don't know whether some concrete configuration could minimize this, or which other factors can provoke such high times.
Here are some examples of the executed queries:
Example 1:
String sqlPredicate = "acui='" + acui + "'";
Collection<Agent> agents =
    (Collection<Agent>) data.getMapAgents().values(new SqlPredicate(sqlPredicate));
Example 2:
boolean exist = data.getMapAgents().containsKey(agent);
Thanks so much for your help.
Best Regards,
Jorge
The map operations are all TCP-socket based and are thus subject to your operating system's TCP driver implementation.
See TCP_NODELAY.

Knot Resolver: How to observe and modify a resolved answer at the right time

Goal
I would like to stitch up a GNU GPL licensed Knot Resolver module, either in C or in CGO, that would examine the client's query and the corresponding resolved answer, with the goal of querying an external API offering a knowledge base of malware-infected hostnames and IP addresses (e.g. GNU AGPL v3 IntelMQ).
If the resolved A (or AAAA) record's IP address matches, it should be logged; likewise, a match on the queried hostname should be logged, or (optionally) it could result in sending the client the IP address of a sinkhole instead of the resolved one.
Means
I studied the layers and came to the conclusion that the phase I'm interested in is consume. I don't want to affect the resolution process; I just want to step in at the last moment, check the results, and possibly modify them.
I ventured to register a consume function
with
static knot_layer_api_t _layer = {
    .consume = &consume,
};
but I'm not sure it is the right place to do the deed.
Furthermore, I also looked into the hints.c module, especially its query method, and into the stats.c module for its _to_wire function usage.
Question(s)
Phase (Layer?)
When is the right time to step in and read/write the answer to the query, before it is sent to the client? Am I in the right spot in the consume layer?
Answer sections
Given that the following attempt at getting the resolved IP address gives me the name server's address instead:
char addr_str[INET6_ADDRSTRLEN];
memset(addr_str, 0, sizeof(addr_str));
const struct sockaddr *src = &(req->answer->sections);
inet_ntop(qry->ns.addr[0].ip.sa_family, kr_inaddr(src), addr_str, sizeof(addr_str));
DEBUG_MSG(NULL, "ADDR: %s\n", addr_str);
how do I get the resolved (A, AAAA) IP address for the query's hostname? I would like to iterate over A/AAAA IP addresses and CNAMEs in the answer and look at the IP addresses they were resolved to.
Modifying the answer
If the module setting demands it, I would like to be able to "ditch" the resolved answer and provide a new one comprising an A record pointed at a sinkhole.
How do I prepare the record so that it can be translated from char* to Knot's wire format and into the proper structure, in the right context, at the right phase?
I guess it might involve functions such as knot_rrset_init and knot_rrset_add_rdata, but I wasn't able to arrive at any successful result.
Thanks for any pointers and suggestions.
If you want to step in at the last moment, when the response is finalised but not yet sent to the requestor, the right place is finish. You can do it in consume as well, but there you would be overwriting responses from the authoritative servers, not the assembled response to the requestor (which means the DNSSEC validator is likely to reject your rewritten answers).
Disclaimer: the Go interface is rough and requires a lot of CGO code to access internal structures. You'd probably be better served by a LuaJIT module; there is another module doing something similar that you may take as an example, and it also has wrappers for creating records from text, etc. If you still want to do it, that's awesome, and improvements to the Go interface are welcome; read on.
What you need to do is roughly this (as CGO).
That will walk you through the RR sets in the packet (C.knot_rrset_t),
where you can match the type (rr.type) and contents (rr.rdata).
The contents are stored in DNS wire format; for address records, that is the address in network byte order, e.g. {0x7f, 0, 0, 1}.
You will have to compare that to the address/subnet you're looking for (there is an example in C code).
When you find a match, you will want to clear the whole packet and insert the sinkhole record (you cannot selectively remove records, because the packet is append-only for performance reasons). This is relatively easy, as there is a helper for it. There is LuaJIT code in the policy module that does this; you'd have to rewrite it in Go, using all the functions mentioned above and using an A/AAAA sinkhole record instead of SOA. Good luck!

Memory leak in IPython.parallel module?

I'm using IPython.parallel to process a large amount of data on a cluster. The remote function I run looks like:
def evalPoint(point, theta):
    # do some complex calculation
    return (cost, grad)
which is invoked by this function:
def eval(theta, client, lview, data):
    async_results = []
    for point in data:
        # evaluate current data point
        ar = lview.apply_async(evalPoint, point, theta)
        async_results.append(ar)
    # wait for all results to come back
    client.wait(async_results)
    # and retrieve their values
    values = [ar.get() for ar in async_results]
    # unzip data from original tuple
    totalCost, totalGrad = zip(*values)
    avgGrad = np.mean(totalGrad, axis=0)
    avgCost = np.mean(totalCost, axis=0)
    return (avgCost, avgGrad)
If I run the code:
client = Client(profile="ssh")
client[:].execute("import numpy as np")
lview = client.load_balanced_view()
for i in xrange(100):
    eval(theta, client, lview, data)
the memory usage keeps growing until I eventually run out (76GB of memory). I've simplified evalPoint to do nothing in order to make sure it wasn't the culprit.
The first part of eval was copied from IPython's documentation on how to use the load balancer. The second part (unzipping and averaging) is fairly straight-forward, so I don't think that's responsible for the memory leak. Additionally, I've tried manually deleting objects in eval and calling gc.collect() with no luck.
I was hoping someone with IPython.parallel experience could point out something obvious I'm doing wrong, or could confirm that this is in fact a memory leak.
Some additional facts:
I'm using Python 2.7.2 on Ubuntu 11.10
I'm using IPython version 0.12
I have engines running on servers 1-3, and the client and hub running on server 1. I get similar results if I keep everything on just server 1.
The only thing I've found similar to a memory leak for IPython had to do with %run, which I believe was fixed in this version of IPython (also, I am not using %run)
Update
Also, I tried switching the logging backend from in-memory to SQLiteDB, in case that was the problem, but I still have the same problem.
Response (1)
The memory consumption is definitely in the controller (I verified this by (a) running the client on another machine and (b) watching top). I hadn't realized that the non-SQLiteDB backends would still consume memory, so I hadn't bothered purging.
If I use DictDB and purge, I still see the memory consumption go up, but at a much slower rate. It was hovering around 2GB for 20 invocations of eval().
If I use MongoDB and purge, it looks like mongod takes around 4.5GB of memory and ipcluster about 2.5GB.
If I use SQLite and try to purge, I get the following error:
File "/usr/local/lib/python2.7/dist-packages/IPython/parallel/controller/hub.py", line 1076, in purge_results
self.db.drop_matching_records(dict(completed={'$ne':None}))
File "/usr/local/lib/python2.7/dist-packages/IPython/parallel/controller/sqlitedb.py", line 359, in drop_matching_records
expr,args = self._render_expression(check)
File "/usr/local/lib/python2.7/dist-packages/IPython/parallel/controller/sqlitedb.py", line 296, in _render_expression
expr = "%s %s"%null_operators[op]
TypeError: not enough arguments for format string
So, I think if I use DictDB, I might be okay (I'm going to try a run tonight). I'm not sure if some memory consumption is still expected or not (I also purge in the client like you suggested).
Is it the controller process that is growing, or the client, or both?
The controller remembers all requests and all results, so the default behavior of storing this information in a simple dict will result in constant growth. Using a db backend (sqlite or preferably mongodb if available) should address this, or the client.purge_results() method can be used to instruct the controller to discard any/all of the result history (this will delete them from the db if you are using one).
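As a minimal sketch of that (assuming purge_results accepts 'all', as described above, to discard the full history):

# at the end of each round, once the values have been retrieved:
values = [ar.get() for ar in async_results]
client.purge_results('all')  # ask the controller to drop its stored request/result history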
The client itself caches all of its own results in its results dict, so this, too, will result in growth over time. Unfortunately, this one is a bit harder to get a handle on, because references can propagate in all sorts of directions, and is not affected by the controller's db backend.
This is a known issue in IPython, but for now you should be able to clear the references manually by deleting the entries in the client's results/metadata dicts; and if your view is sticking around, it has its own results dict:
# ...
# and retrieve their values
values = [ar.get() for ar in async_results]
# clear references to the local cache of results:
for ar in async_results:
    for msg_id in ar.msg_ids:
        del lview.results[msg_id]
        del client.results[msg_id]
        del client.metadata[msg_id]
Or, you can purge the entire client-side cache with a simple dict.clear():
view.results.clear()
client.results.clear()
client.metadata.clear()
Side note:
Views have their own wait() method, so you shouldn't need to pass the Client to your function at all. Everything should be accessible via the View, and if you really need the client (e.g. for purging the cache), you can get it as view.client.
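A sketch of eval along those lines (same logic as the original, combined with the purge step suggested above; np is assumed to be imported locally as in the question):

def eval(theta, lview, data):
    async_results = [lview.apply_async(evalPoint, point, theta) for point in data]
    lview.wait(async_results)  # Views expose wait() directly
    values = [ar.get() for ar in async_results]
    totalCost, totalGrad = zip(*values)
    lview.client.purge_results('all')  # the Client is reachable via the view when needed
    return (np.mean(totalCost, axis=0), np.mean(totalGrad, axis=0))

Note that this still leaves the client-side results dicts to clear, as shown above.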
