Getting an intersection between 2 CIDR spaces when you have huge data sets - python-3.x

Basically, I have a list of IP subnets (supernets) which contains around 100 elements. At the same time, I have another list (ips) which contains around 300k IP addresses/networks.
Example:
supernets = ['10.10.0.0/16', '12.0.0.0/8']
ips = ['10.10.10.1', '10.10.10.8/30', '12.1.1.0/24']
The end goal is to classify the IP addresses based upon where they fall in the supernet.
So what I did is compare every IP address/network element in the second list against each element in the supernets list, and so on.
Basically, I do this:
from netaddr import IPNetwork  # assuming netaddr's IPNetwork here, matching the usage below

for i in range(len(supernets)):
    for x in ips:
        if IPNetwork(x) in IPNetwork(sorted(supernets)[i]):
            print(i, x, sorted(supernets)[i])
            lod[i][sorted(supernets)[i]].append(x)  # lod is a list of dicts of lists
This works fine, but it takes ages and the CPU goes crazy, so my question is: is there any methodology or cleaner code that can achieve this and save time?
UPDATE
I have sorted the lists and used a list comprehension instead, and the script took around 11 minutes to run, which is a good optimization in terms of speed. But the CPU is still at 100% during the whole 11 minutes.
[lod[i][public[i]].append(x) for i in range(len(public)) for x in ips if IPNetwork(x) in IPNetwork((public)[i])]
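One thing that may already help, sketched here as an assumption rather than a tested answer: hoist the repeated sorted() and IPNetwork() conversions out of the loops, since the current code re-sorts and re-parses every supernet on every iteration (IPNetwork is again assumed to be netaddr's).

from netaddr import IPNetwork

sorted_supernets = sorted(supernets)               # sort the strings once
parsed = [IPNetwork(s) for s in sorted_supernets]  # parse each supernet once

for x in ips:
    net = IPNetwork(x)                             # parse each address/network once
    for i, supernet in enumerate(parsed):
        if net in supernet:
            lod[i][sorted_supernets[i]].append(x)
            break  # assumes an IP belongs to at most one supernet

Binary-searching precomputed numeric ranges, as described under "Optimal check if IP is in subnet" below, would reduce the work further.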

Related

How to efficiently pass a huge parameter to a function?

Thank you for reading. Please forgive my English, I'm French. Is that a justification?
The aim of my function is to scan many IPv4:PORT combinations via multiprocessing/threading in Python 3 (don't, I can hear you say, don't!), send a SYN and record any result. Multiprocessing/threading is used here because of the timeout spent waiting for any potential reply.
So.
This function is called a producer. It stores potential TCP connections in a database. Consumers then read the database and do some pentesting independently. So.
I prepared an IPv4 list. It's a random list of valid IPs with tens of thousands of elements. Remember we have 65K ports to test per IPv4.
My method is then to shuffle the port list with a new seed for each producer launched. I have many of them. Each has a valid IPv4 list, but if we consider 65K ports with 100K IPs, we have 6.5G elements to pass.
Why?
IP[] is random.shuffle()-like by construction. Ports[] is too.
If I read p in Ports and, for each p, join IPv4:Ports and append into params[], I can launch the parallelized job via scan(u) for u in params[].
I launch it via run(rounds) like this :
def run(rounds):
    for r in rounds:
        scan(r)
But the problem is that size(rounds) = size(params) ~ 6.5G elements.
I cannot find an efficient (memory-wise) way to pass such a huge parameter (a big list) to a parallelized job. I'm running out of memory.
I'm not asking how to manage it on a mind-blowingly capable workstation; the function works as designed on paper, but it does not fit into the Raspberry Pi 2 (1 GB of RAM).
I do not need Python 3 at all for myself, it's a teaching product. Still, I'm stuck.
My question is: could you help me find an efficient way to attack a huge list with a parallelized function that pops items off the list without receiving it as a parameter?
I have googled and followed the forums and threads I'm aware of, but however I refactor the program, the same problem stays there, laughing at me, in the same place in main().
What I don't want to reproduce:
import ipaddress as i  # assumed aliases, matching the i./r. prefixes used below
import random as r

MAX_IPV4 = i.IPv4Address._ALL_ONES
MAX_PORT = 65535

def r_ipv4():
    # Draw random 32-bit integers until one maps to a public, routable IPv4 address.
    while True:
        a = i.IPv4Address._string_from_ip_int(r.randint(0, MAX_IPV4))
        b = i.IPv4Address(a)
        if b.is_multicast or b.is_private or b.is_loopback or b.is_reserved or b.is_link_local:
            continue
        else:
            return a

def generate_target():
    return r_ipv4()

def generate_r_targets(n):
    ip = []
    for _ in range(0, n):
        ip.append(generate_target())
    ports = []
    for p in range(1, MAX_PORT + 1):  # ports 1..65535
        ports.append(p)
    r.shuffle(ports)
    targets = []
    for p in ports:
        for z in ip:
            targets.append(str(z) + "," + str(p))  # the 6.5G-element list that blows up memory
It's not my optimal method, but it is the best way I can explain and show the problem.
I don't want to lose the (shuffled ports) iterable; it is my principal vector and it is at the heart of the approach.
I'm testing a dynamic configuration of knockd.
Say Computer_one registers a sequence in knockd.
Computer_one is the only one to know the seed.
The seed is the key to knockd's sequence of keys; that part is done.
We passed it via a plain file :)
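One way out, sketched below as a suggestion rather than the definitive design: never materialize params[] at all, and feed the worker pool from a generator that yields IPv4:PORT pairs lazily. The names targets and scan, the sample IPs, the seed and the pool size are illustrative assumptions, not code from the question.

import random
from multiprocessing import Pool

def targets(ips, seed):
    # Lazily yield "ip,port" strings instead of building a 6.5G-element list.
    ports = list(range(1, 65536))
    random.Random(seed).shuffle(ports)  # reproducible shuffle from the shared seed
    for p in ports:
        for ip in ips:
            yield str(ip) + "," + str(p)

def scan(target):
    # Placeholder for the real SYN probe; returns the target so a result can be stored.
    return target

if __name__ == "__main__":
    ips = ["192.0.2.1", "198.51.100.7"]  # tiny illustrative list
    with Pool(4) as pool:
        # imap_unordered pulls items from the generator on demand, so memory use
        # stays bounded by chunksize rather than by the full 6.5G-element list.
        for result in pool.imap_unordered(scan, targets(ips, seed=42), chunksize=1000):
            pass  # store result in the database here

The seed argument keeps the shuffled port order reproducible, which seems to be the property the question wants to preserve.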

Kernel dies when using itertools.permutations

I have a list of strings, and I want to find all possible permutations of that list. I use itertools.permutations and it runs for a while but then crashes saying Kernel died, restarting. I tried running the code through the terminal too, but it crashes there as well. Here is my code:
import itertools
sum_stats = ['pi', 'theta W', 'Tajima D', 'distVar', 'distSkew', 'distKurt', 'nDiplos',
             'diplo_H1', 'diplo_H12', 'diplo_H2/H1', 'diplo_ZnS', 'diplo_Omega']
permuted_sum_stats = list(itertools.permutations(sum_stats))
Can someone show me an efficient way to generate all possible permutations of this list?
Your list has 12 elements. To get all possible permutations, your new list needs 12! entries, which is about 500 million. Each of these 12-element tuples takes about 150 bytes, excluding the space of the strings, which I assume is reused.
This leads to about 75 GB of data, which is probably more than the RAM of your machine.
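If each permutation only needs to be looked at once, a sketch like the following avoids materializing the list entirely; process() is a hypothetical stand-in for whatever is done with each permutation.

import itertools
import math

sum_stats = ['pi', 'theta W', 'Tajima D', 'distVar', 'distSkew', 'distKurt', 'nDiplos',
             'diplo_H1', 'diplo_H12', 'diplo_H2/H1', 'diplo_ZnS', 'diplo_Omega']

print(math.factorial(len(sum_stats)))  # 479001600 permutations in total

def process(perm):
    pass  # hypothetical per-permutation work

# itertools.permutations is a lazy iterator: looping over it keeps only one
# 12-element tuple in memory at a time instead of ~75 GB.
for perm in itertools.permutations(sum_stats):
    process(perm)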

linearK - large time difference between empirical and acceptance envelopes in spatstat

I am interested in the correlation between points at distances of 0 to 2 km on a linear network. I am using the following statement for the empirical data; it finishes in 2 minutes.
obs<-linearK(c, r=seq(0,2,by=0.20))
Now I want to check the acceptance of randomness, so I used envelopes for the same r range.
acceptance_enve<-envelope(c, linearK, nsim=19, fix.n = TRUE, funargs = list(r=seq(0,2,by=0.20)))
But this shows an estimated time of a little less than 3 hours. I just want to ask whether this large time difference is normal. Is my syntax correct for the call to envelope and its extra argument r passed as a sequence?
Is there some efficient way to shorten this 3-hour execution time for the envelopes?
I have a road network for the whole city, so it is quite large, and I have checked that there are no disconnected subgraphs.
c
Point pattern on linear network
96 points
Linear network with 13954 vertices and 19421 lines
Enclosing window: rectangle = [559.653, 575.4999] x [4174.833, 4189.85] Km
thank you.
EDIT AFTER COMMENT
system.time({s <- runiflpp(npoints(c), as.linnet(c));
+ linearK(s, r=seq(0,2,by=0.20))})
user system elapsed
343.047 104.428 449.650
EDIT 2
I made some really small changes by deleting some peripheral network segments that seem to have little or no effect on the overall network. This also led to splitting some long segments into smaller ones. But now, on the same network with a different point pattern, I get an even longer estimated time:
> month1envelope=envelope(months[[1]], linearK ,nsim = 39, r=seq(0,2,0.2))
Generating 39 simulations of CSR ...
1, 2, [etd 12:03:43]
The new network is
> months[[1]]
Point pattern on linear network
310 points
Linear network with 13642 vertices and 18392 lines
Enclosing window: rectangle = [560.0924, 575.4999] x [4175.113, 4189.85] Km
System Config: MacOS 10.9, 2.5 GHz, 16 GB, R 3.3.3, RStudio 1.0.143
You don't need to use funargs in this context. Arguments can be passed directly through the ... argument. So I suggest
acceptance_enve <- envelope(c, linearK, nsim=19,
fix.n = TRUE, r=seq(0,2,by=0.20))
Please try this to see if it accelerates the execution.

Optimal check if IP is in subnet

I want to check if an IP address belongs to a subnet. The pain comes when I must check against 300,000 CIDR blocks, with prefixes ranging from /3 to /31, several million times per second.
Take https://github.com/indutny/node-ip for example:
I could call ip.cidrSubnet('ip/subnet') for each of the 300,000 blocks and check whether the IP I'm looking for is inside the first-to-last address range, but this is very costly.
How can I optimally check whether an IP address belongs to one of these blocks, without looping through all of them every time?
Store the information in a binary tree that is optimized for range checks.
One naive way to do it is to turn each CIDR block into a pair of events, one when you enter the block, one when you exit the block. Then sort the list of events by IP address. Run through it and create a sorted array of IP addresses and how many blocks you are in. For 300,000 CIDR blocks there will be 600,000 events, and your search will be 19-20 lookups.
Now you can do a binary search of that file to find the last transition before your current IP address, and return true/false depending on whether that was in one or more blocks versus in none.
The lookup will be faster if instead of searching a file, you are searching a dedicated index of some sort. (The number of lookups in the search is the same or slightly higher, but you make better use of CPU caches.) I personally have used BerkeleyDB's BTree data structure for this kind of thing in other languages, and have been very happy.
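A rough Python sketch of the event-based index described above, offered as an illustration rather than production code; build_index and covered are made-up names and the node-ip library is not involved.

import bisect
from ipaddress import ip_address, ip_network

def build_index(cidrs):
    # Sweep over block enter/exit events to build sorted boundaries plus coverage depth.
    events = []
    for c in cidrs:
        net = ip_network(c)
        events.append((int(net.network_address), 1))         # entering the block
        events.append((int(net.broadcast_address) + 1, -1))  # leaving the block
    events.sort()
    boundaries, depths, depth = [], [], 0
    for addr, delta in events:
        depth += delta
        if boundaries and boundaries[-1] == addr:
            depths[-1] = depth  # merge events that fall on the same address
        else:
            boundaries.append(addr)
            depths.append(depth)
    return boundaries, depths

def covered(ip, boundaries, depths):
    # Binary-search the last boundary at or before ip; inside a block if depth > 0 there.
    idx = bisect.bisect_right(boundaries, int(ip_address(ip))) - 1
    return idx >= 0 and depths[idx] > 0

boundaries, depths = build_index(['10.10.0.0/16', '12.0.0.0/8'])
print(covered('10.10.10.1', boundaries, depths))  # True
print(covered('11.0.0.1', boundaries, depths))    # False

For 300,000 blocks the two lists hold 600,000 boundaries, and each membership test is a single binary search.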

How to work with the COUNTER in Nagios or RRD?

I have the following problem:
I want to graph statistics for data that should be constantly increasing, for example the number of visits to a link. After some time the visit count is reset and starts again from the beginning, but I still want a continuously increasing statistic, so I need to record it somewhere. For this purpose I use a tool that does this; in its configuration I can choose COUNTER, GAUGE, AVERAGE, and so on. I want to use COUNTER. The system is built on Nagios.
My question is how to use this COUNTER. I guess it is the same as the one in RRD, but I ran into some strange behaviour when creating such a COUNTER.
I submit the value '1', then '2', and expect the chart to end up at 3, but when I do this it doesn't work. After a restart, for example, I submit 1 again and expect the total to become 4.
Can anyone who has dealt with this tell me briefly how this COUNTER works?
I saw that COUNTER is used for traffic on routers and the like, but I want to apply it to a regular graph that simply keeps increasing.
The RRD data type COUNTER will convert the input data into a rate by taking the difference between this sample and the last sample and dividing by the time interval (note that data normalisation also takes place, and this is dependent on the Interval setting of the RRD).
Thus, updating with a constantly increasing count will result in a rate-of-change value to be graphed.
If you want to see your graph actually constantly increasing, IE showing the actual count of packets transferred (for example) rather than the rate of transfer, you would need to use type GAUGE which assumes any rate conversion has already been done.
If you want to submit the rate values (EG, 2 in the last minute), but display the overall constantly increasing total (in other words, the inverse of how the COUNTER data type works), then you would need to store the values as GAUGE, and use a CDEF in your RRDgraph command of the form CDEF:x=y,PREV,+ to obtain the ongoing total. Of course you would only have this relative to the start of the graph time window; maybe a separate call would let you determine what base value to use.
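To make the difference concrete, here is the arithmetic sketched in Python with made-up numbers; this is illustrative only, not RRDtool syntax.

# COUNTER: each update is turned into a rate (delta divided by seconds since the last update).
samples = [(0, 100), (300, 400), (600, 1000)]  # (timestamp, submitted value)
for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
    print("rate at", t1, "=", (v1 - v0) / (t1 - t0), "per second")  # 1.0, then 2.0

# GAUGE plus a running total (the CDEF:x=y,PREV,+ idea): submit per-interval deltas
# and accumulate them to get an ever-increasing curve.
deltas = [1, 2, 1]  # e.g. visits per interval
total = 0
for d in deltas:
    total += d
    print("running total =", total)  # 1, 3, 4 - matching the "1 then 2 gives 3, then 1 gives 4" example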
As you use Nagios, you may like to investigate Nagios add-ons such as pnp4nagios which will handle much of the graphing for you.
