I'm implementing a crawler for a website with a growing number of entities. There is no information available about how many entities exist and no list of all entities. Every entity can be accessed via a URL like this: http://www.somewebsite.com/entity_{i}, where {i} is the number of the entity, starting at 1 and incrementing by 1.
To crawl every entity I run a loop which checks whether an HTTP request returns a 200 or a 404. Once I get a 404 NOT FOUND, the loop stops and I know I have all entities.
The serial way looks like this:
def atTheEnd = false
def i = 0
while(!atTheEnd){
atTheEnd = !crawleWebsite("http://www.somewebsite.com/entity_" + i)
i++
}
crawleWebsite() returns true if it succeeds and false if it gets a 404 NOT FOUND error.
The problem is that crawling those entities can take very long, which is why I want to do it in multiple threads. But I don't know the total number of entities, so the tasks are not independent of each other.
What's the best way to solve this problem?
My approach would be this: use binary search with HTTP HEAD requests to find the total number of entities (somewhere between 500 and 1000) and then split that range across a few threads.
Is there maybe a better way of doing this?
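For reference, the binary-search part of that approach in rough Python-style pseudocode (exists() is a hypothetical helper that sends a HEAD request and checks for a 200):
import requests

def exists(i):
    url = "http://www.somewebsite.com/entity_" + str(i)
    return requests.head(url, allow_redirects=True).status_code == 200

def find_last_entity():
    # Double an upper bound until the first missing entity...
    hi = 1
    while exists(hi):
        hi *= 2
    lo = hi // 2  # last index known to exist (0 if even entity_1 is missing)
    # ...then binary-search between lo (exists) and hi (does not exist)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if exists(mid):
            lo = mid
        else:
            hi = mid
    return lo  # number of the last existing entity == total number of entities
With the total known, the range 1..total could then be split evenly across the threads.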
tl;dr
Basically I want to tell a thread pool to programmatically create new tasks until a condition is satisfied (when the first 404 occurs) and to wait until every task has finished.
Note: I'm implementing this code using Grails 3.
As you said, the total number of entities is not known and can go into the thousands. In this case I would simply go for a fixed-size thread pool and speculatively query URLs even though you may already have passed the end. Consider this example.
@Grab(group = 'org.codehaus.gpars', module = 'gpars', version = '1.2.1')
import groovyx.gpars.GParsPool
//crawling simulation - ignore :-)
def crawleWebsite(url) {
println "$url:${Thread.currentThread().name}"
Thread.sleep (1)
Math.random() * 1000 < 950
}
final Integer step = 50
Boolean atTheEnd = false
Integer i = 0
while (true) {
GParsPool.withPool(step) {
        (i..<(i + step)).eachParallel { atTheEnd = atTheEnd || !crawleWebsite("http://www.somewebsite.com/entity_" + it) }
}
if (atTheEnd) {
break
}
i += step
}
The thread pool is set to 50, and once all 50 URLs of a batch have been crawled we check whether we have reached the end. If not, we carry on.
Obviously, in the worst-case scenario you crawl 50 URLs that return 404. But I'm sure you can get away with that :-)
Related
I am new to SoapUI. I am building the logic for a service stub and I have run into a problem.
I have a simple service stub that returns a parameter with a random number (it is generated in the response with a Groovy script). The problem is that this number is used two times per session and must not change in between, otherwise the session fails. How can I pass the random number on to the next response and only then start randomizing again, and so on?
I could not find anything similar to my case on the Internet, so I am asking the question here. Is it even possible to implement this in SoapUI, for example through a TestSuite and Groovy scripts?
The Groovy code I use in the response script to generate the random number:
requestContext.actreq = (10000000 + Math.abs(new Random().nextInt() % 9999999));
Then I substitute ${actreq} into the response.
If the number 100001 is generated, then I would like to pass it to the next two responses, so that a new random number is only generated every two iterations.
You could try this approach:
class Glob{
static long callCount=0
static long randValue=0
static long rand(){
callCount ++
if(callCount % 2 == 1){
randValue = (10000000 + Math.abs(new Random().nextInt() % 9999999))
}
return randValue
}
}
requestContext.actreq = Glob.rand()
Or the official way, like this:
https://www.soapui.org/docs/functional-testing/working-with-scripts/
Use a Setup Script to assign context variables.
In the script you can access those variables the way I did in the code above: increment the call count and recalculate the random number only when needed.
Currently, I'm trying to improve some code that sends repeated HTTP requests to a webpage until it can capture some text (which the code locates via a known pattern) or until 180 seconds run out (the time we are willing to wait for the page to give us an answer).
This is the part of the code (a little edited for privacy purposes):
if matches == None:
txt = "No answer til now"
print(txt)
Solution = False
start = time.time()
interval = 0
while interval < 180:
response = requests.get("page address")
subject = response.text
matches = re.search(pattern, subject, re.IGNORECASE)
if matches != None:
Solution =matches.group(1)
time = "{:.2f}".format(time.time()-start)
txt = "Found an anwswer "+ Solution + "time needed : "+ time
print(txt)
break
interval = time.time()-start
else:
Solution = matches.group(1)
It runs OK, but I was told that making "infinite requests in a loop" could cause high CPU usage on the server. Do you guys know of something I can use to avoid that?
PS: I heard that in PHP people use curl_multi_select() for things like this. I don't know whether that's correct, though.
Usually an HTTP REST API will specify in the documentation how many requests you can make in a given time period against which endpoint resources.
For a website, if you are not hitting a request limit and getting flagged/banned for too many requests, then you should be okay to continuously loop like this, but you may want to introduce a time.sleep call into your while loop.
An alternative to the 180-second timeout:
Since HTTP requests are I/O operations and can take a variable amount of time, you may want to change the loop's exit condition to a certain number of requests (like 25 or so) and then incorporate the aforementioned sleep call.
That could look like:
# ...
if matches is None:
solution = None
num_requests = 25
start = time.time()
while num_requests:
response = requests.get("page address")
if response.ok: # It's good to attempt to handle potential HTTP/Connectivity errors
subject = response.text
matches = re.search(pattern, subject, re.IGNORECASE)
if matches:
solution = matches.group(1)
elapsed = "{:.2f}".format(time.time()-start)
txt = "Found an anwswer " + solution + "time needed : " + elapsed
print(txt)
break
else:
# Maybe raise an error here?
pass
time.sleep(2)
num_requests -= 1
else:
solution = matches.group(1)
Notes:
Regarding PHP's curl_multi_select (NOT a PHP expert here...): it seems that this function is designed to let you watch multiple connections to different URLs asynchronously. Async doesn't really apply to your use case here, because you are only scraping one webpage (URL) and are just waiting for some data to appear there.
If the response.text you are searching through is HTML and you aren't already parsing it somewhere else in your code, I would recommend Beautiful Soup or Scrapy (before reaching for regex) for searching for string patterns in webpage markup.
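For example, a minimal sketch of that suggestion with Beautiful Soup, assuming the text you need sits inside some identifiable element (the #result selector is only a placeholder for whatever the real page uses):
from bs4 import BeautifulSoup
import re

soup = BeautifulSoup(response.text, "html.parser")
container = soup.select_one("#result")  # placeholder selector for the element holding the answer
if container is not None:
    matches = re.search(pattern, container.get_text(), re.IGNORECASE)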
I have a list of issues (jira issues):
listOfKeys = [id1,id2,id3,id4,id5...id30000]
I want to get the worklogs of these issues; for this I used the jira-python library and this code:
listOfWorklogs = pd.DataFrame()  # I use the pandas (pd) library
lst={} #dictionary for help, where the worklogs will be stored
for i in range(len(listOfKeys)):
worklogs=jira.worklogs(listOfKeys[i]) #getting list of worklogs
if(len(worklogs)) == 0:
i+=1
else:
for j in range(len(worklogs)):
lst = {
'self': worklogs[j].self,
'author': worklogs[j].author,
'started': worklogs[j].started,
'created': worklogs[j].created,
'updated': worklogs[j].updated,
'timespent': worklogs[j].timeSpentSeconds
}
listOfWorklogs = listOfWorklogs.append(lst, ignore_index=True)
########### Below there is the recording to the .xlsx file ################
So I simply fetch the worklog of each issue in a plain loop, which is equivalent to requesting the link
https://jira.mycompany.com/rest/api/2/issue/issueid/worklogs and retrieving the information from there.
The problem is that there are more than 30,000 such issues, and the loop is very slow (approximately 3 seconds per issue).
Can I somehow start multiple loops / processes / threads in parallel to speed up the process of getting the worklogs (maybe even without the jira-python library)?
I recycled a piece of my own code into your code; I hope it helps:
from multiprocessing import Manager, Process, cpu_count
def insert_into_list(worklog, queue):
lst = {
'self': worklog.self,
'author': worklog.author,
'started': worklog.started,
'created': worklog.created,
'updated': worklog.updated,
'timespent': worklog.timeSpentSeconds
}
queue.put(lst)
return
# Number of cpus in the pc
num_cpus = cpu_count()
index = 0
# Manager and queue to hold the results
manager = Manager()
# The queue has controlled insertion, so processes don't step on each other
queue = manager.Queue()
listOfWorklogs=pd.DataFrame()
lst={}
for i in range(len(listOfKeys)):
worklogs=jira.worklogs(listOfKeys[i]) #getting list of worklogs
if(len(worklogs)) == 0:
i+=1
else:
# This loop replaces your "for j in range(len(worklogs))" loop
while index < len(worklogs):
processes = []
elements = min(num_cpus, len(worklogs) - index)
# Create a process for each cpu
for i in range(elements):
process = Process(target=insert_into_list, args=(worklogs[i+index], queue))
processes.append(process)
# Run the processes
for i in range(elements):
processes[i].start()
# Wait for them to finish
for i in range(elements):
processes[i].join(timeout=10)
index += num_cpus
# Dump the queue into the dataframe
while queue.qsize() != 0:
    listOfWorklogs = listOfWorklogs.append(queue.get(), ignore_index=True)
This should work and reduce the time by a factor of a little less than the number of CPUs in your machine. You can try changing that number manually for better performance. In any case, I find it very strange that it takes about 3 seconds per operation.
PS: I couldn't try the code because I have no example data, so it probably has some bugs.
I have run into some problems:
1) The indentation in the code where the first "for" loop appears and the first "if" statement begins (this statement and everything below it should be inside the loop, right?)
for i in range(len(listOfKeys)-99):
worklogs=jira.worklogs(listOfKeys[i]) #getting list of worklogs
if(len(worklogs)) == 0:
....
2) cmd, the conda prompt and Spyder would not run your code, failing with:
Python multiprocessing error: AttributeError: module '__main__' has no attribute '__spec__'
After researching on Google, I had to set __spec__ = None a bit higher up in the code (but I'm not sure whether this is correct) and the error disappeared.
By the way, the code ran in a Jupyter Notebook without this error, but listOfWorklogs ends up empty, and that is not right.
3) When I corrected the indentation and set __spec__ = None, a new error occurred at this line:
processes[i].start()
with an error like this:
"PicklingError: Can't pickle <class 'jira.resources.PropertyHolder'>: attribute lookup PropertyHolder on jira.resources failed"
If I remove the parentheses from the start and join methods (so the processes are never actually started), the code runs, but then there are no entries at all in listOfWorklogs.
So I am asking for your help again!
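The PicklingError comes from multiprocessing itself: everything passed in args has to be pickled to reach the child process, and the jira resource objects (PropertyHolder, the author field, and so on) are not picklable. A minimal sketch of one way around it, keeping the structure of the answer above: convert each worklog to a plain dict in the parent process and pass only that to the Process (worklog_to_dict is a hypothetical helper, not part of the original code).
def worklog_to_dict(worklog):
    # Plain data pickles fine; jira Resource objects do not
    return {
        'self': worklog.self,
        'author': worklog.author.displayName,  # author is itself a jira resource, keep only its display name
        'started': worklog.started,
        'created': worklog.created,
        'updated': worklog.updated,
        'timespent': worklog.timeSpentSeconds,
    }

# In the loop above, pass the dict instead of the raw worklog object:
# process = Process(target=insert_into_list, args=(worklog_to_dict(worklogs[i + index]), queue))
# and reduce insert_into_list to a plain queue.put(worklog_dict).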
How about thinking about it not from a technical standpoint but from a logical one? You know your code works, but at a rate of 3 seconds per issue it would take about 25 hours to complete. If you have a way to split up the set of Jira issues that is passed into the script (by date or issue key, for example), you could create several .py files with basically the same code and pass each one a different list of Jira tickets. If you ran, say, four of them at the same time, you would cut the time down to about 6.25 hours each.
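A minimal sketch of that splitting idea (the chunk count and the command-line argument below are assumptions for illustration, not part of the original script): each copy of the script keeps only its own slice of listOfKeys and then runs the original loop unchanged.
import sys

NUM_SCRIPTS = 4                 # how many copies of the script you plan to run
chunk_index = int(sys.argv[1])  # 0 .. NUM_SCRIPTS-1, a different value per copy

# Each copy keeps every NUM_SCRIPTS-th key, offset by its own index
myKeys = listOfKeys[chunk_index::NUM_SCRIPTS]

for key in myKeys:
    worklogs = jira.worklogs(key)  # same per-issue processing as in the original loop
    # ...
You would then start the copies in parallel, each with a different index argument, and merge their output files afterwards.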
I want to change the number of workers in the pool that are currently used.
My current idea is
while True:
current_connection_number = get_connection_number()
forced_break = False
with mp.Pool(current_connection_number) as p:
for data in p.imap_unordered(fun, some_infinite_generator):
yield data
if current_connection_number != get_connection_number():
forced_break = True
break
if not forced_break:
break
The problem is that it just terminates the workers, so the last items that were taken from some_infinite_generator but weren't processed yet are lost. Is there some standard way of doing this?
Edit: I've tried printing inside some_infinite_generator, and it turns out p.imap_unordered requests 1565 items with just 2 pool workers even before anything is processed. How do I limit the number of items requested from the generator? If I use the code above and change the number of connections after just 2 items, I will lose 1563 items.
The problem is that the Pool consumes the generator internally, in a separate thread. You have no way to control that logic.
What you can do is feed the Pool.imap_unordered method one portion of the generator at a time and let that portion be consumed before re-scaling according to the available connections.
import itertools

CHUNKSIZE = 100

def grouper(n, iterable):
    # Yield successive chunks of n items from the iterable
    it = iter(iterable)
    while True:
        chunk = tuple(itertools.islice(it, n))
        if not chunk:
            return
        yield chunk

# A single shared grouper, so each chunk continues where the previous one stopped
chunks = grouper(CHUNKSIZE, some_infinite_generator)

while True:
    current_connection_number = get_connection_number()
    with mp.Pool(current_connection_number) as p:
        while current_connection_number == get_connection_number():
            # Feed the pool one chunk at a time so the connection count is
            # re-checked (and the pool re-sized) between chunks
            for data in p.imap_unordered(fun, next(chunks)):
                yield data
It's a bit less optimal, as the scaling happens once per chunk instead of once per iteration, but with a bit of fine-tuning of the CHUNKSIZE value you can easily get it right.
The grouper recipe.
I'm writing a program that downloads data from a website (eve-central.com). It returns XML when I send a GET request with some parameters. The problem is that I need to make about 7080 such requests, because I can't specify the typeid parameter more than once.
def get_data_eve_central(typeids, system, hours, minq=1, thread_count=1):
import xmltodict, urllib3
pool = urllib3.HTTPConnectionPool('api.eve-central.com')
for typeid in typeids:
r = pool.request('GET', '/api/quicklook', fields={'typeid': typeid, 'usesystem': system, 'sethours': hours, 'setminQ': minq})
answer = xmltodict.parse(r.data)
It was really slow when I simply connected to the website and made all the requests one after another, so I decided to use multiple threads (I read that if a process involves a lot of waiting (I/O, HTTP requests), it can be sped up a lot with multithreading). I rewrote it to use multiple threads, but it somehow isn't any faster (a bit slower, in fact). Here's the code rewritten using multithreading:
def get_data_eve_central(all_typeids, system, hours, minq=1, thread_count=1):
if thread_count > len(all_typeids): raise NameError('TooManyThreads')
def requester(typeids):
pool = urllib3.HTTPConnectionPool('api.eve-central.com')
for typeid in typeids:
r = pool.request('GET', '/api/quicklook', fields={'typeid': typeid, 'usesystem': system, 'sethours': hours, 'setminQ': minq})
answer = xmltodict.parse(r.data)['evec_api']['quicklook']
answers.append(answer)
def chunkify(items, quantity):
chunk_len = len(items) // quantity
rest_count = len(items) % quantity
chunks = []
for i in range(quantity):
chunk = items[:chunk_len]
items = items[chunk_len:]
if rest_count and items:
chunk.append(items.pop(0))
rest_count -= 1
chunks.append(chunk)
return chunks
t = time.clock()
threads = []
answers = []
for typeids in chunkify(all_typeids, thread_count):
threads.append(threading.Thread(target=requester, args=[typeids]))
threads[-1].start()
threads[-1].join()
print(time.clock()-t)
return answers
What I do is divide all the typeids into as many chunks as the number of threads I want to use and create a thread for each chunk to process it. The question is: what could be slowing it down? (I apologise for my bad English.)
Python has the Global Interpreter Lock, and it can be your problem: it prevents Python from running threads in a genuinely parallel way. You may think about switching to another language, or stay with Python but use process-based parallelism to solve your task. Here is a nice presentation: Inside the Python GIL.
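For completeness, a rough sketch of the process-based variant suggested above, using multiprocessing.Pool (fetch_one is a hypothetical wrapper around the single-typeid request from the question; not tested against the real API):
import multiprocessing as mp
import urllib3
import xmltodict

def fetch_one(args):
    typeid, system, hours, minq = args
    pool = urllib3.HTTPConnectionPool('api.eve-central.com')
    r = pool.request('GET', '/api/quicklook',
                     fields={'typeid': typeid, 'usesystem': system, 'sethours': hours, 'setminQ': minq})
    return xmltodict.parse(r.data)['evec_api']['quicklook']

def get_data_eve_central_mp(all_typeids, system, hours, minq=1, process_count=8):
    args = [(typeid, system, hours, minq) for typeid in all_typeids]
    with mp.Pool(process_count) as p:  # requires the __main__ guard below on Windows
        return p.map(fetch_one, args)

# if __name__ == '__main__':
#     answers = get_data_eve_central_mp(all_typeids, system, hours)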