I have the following function that gets some data from a web page and stores it in a dictionary, with the time stamp as the key and the data (a list) as the value.
import time

import numpy as np
import requests
from bs4 import BeautifulSoup

def getData(d):
    page = requests.get('http://www.transportburgas.bg/bg/%D0%B5%D0%BB%D0%B5%D0%BA%D1%82%D1%80%D0%BE%D0%BD%D0%BD%D0%BE-%D1%82%D0%B0%D0%B1%D0%BB%D0%BE')
    soup = BeautifulSoup(page.content, 'html.parser')
    table = soup.find_all("table", class_="table table-striped table-hover")
    # collect the text of every row in the first matching table
    rows = []
    for tr in table[0].find_all('tr'):
        rows.append(tr.get_text())
    time_stamp = rows[1].split("\n")[1]
    data = []
    for i in rows[1:]:
        a = i.split("\n")
        # keep the value only if its time stamp matches the current year/month/day/hour
        if time.strptime(a[1], '%d/%m/%Y %H:%M:%S')[:4] == time.localtime()[:4]:
            data.append(int(a[3]))
        else:
            data.append(np.nan)
    d[time_stamp] = data
The data on the web page gets updated every 5 minutes, and I would like to make the function run automatically every 5 minutes. I am trying to do it with time.sleep and this function:
def period_fun(it):
    iterations = it
    while iterations != 0:
        getData(dic)
        time.sleep(300)
        iterations = iterations - 1
However, this function runs only once and I end up with only one item in the dictionary. I have tried it with a simple print(1) instead of the function and it works (1 gets printed several times), but when I use it with the function it doesn't.
Would really appreciate any suggestions on the functions or how I could achieve my goal!
How about using a library that runs cron-style jobs?
Schedule looks nice, although it does not use cron-like syntax: https://github.com/dbader/schedule
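For example, a minimal sketch with schedule, reusing getData and the dic dictionary from the question (note this runs indefinitely rather than for a fixed number of iterations):

import schedule
import time

dic = {}

# re-run getData every 5 minutes; schedule takes care of the timing loop
schedule.every(5).minutes.do(getData, dic)

while True:
    schedule.run_pending()
    time.sleep(1)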
Having 500 continuously growing DataFrames, I would like to submit operations on the data (independent for each DataFrame) to dask. My main question is: can dask hold the continuously submitted data, so that I can submit a function over all of the submitted data, not just the newly submitted part?
But let's explain it with an example:
Creating a dask_server.py:
from dask.distributed import Client, LocalCluster

HOST = '127.0.0.1'
SCHEDULER_PORT = 8711
DASHBOARD_PORT = ':8710'

def run_cluster():
    cluster = LocalCluster(dashboard_address=DASHBOARD_PORT, scheduler_port=SCHEDULER_PORT, n_workers=8)
    print("DASK Cluster Dashboard = http://%s%s/status" % (HOST, DASHBOARD_PORT))
    client = Client(cluster)
    print(client)
    print("Press Enter to quit ...")
    input()

if __name__ == '__main__':
    run_cluster()
Now I can connect from my my_stream.py and start to submit and gather data:
import threading
import time

import pandas as pd
from dask.distributed import Client

DASK_CLIENT_IP = '127.0.0.1'
DASK_CLIENT_PORT = 8711  # the scheduler port from dask_server.py
dask_con_string = 'tcp://%s:%s' % (DASK_CLIENT_IP, DASK_CLIENT_PORT)
dask_client = Client(dask_con_string)

def my_dask_function(lines):
    return lines['a'].mean() + lines['b'].mean()

def async_stream_redis_to_d(max_chunk_size=1000):
    while 1:
        # This is a redis queue (defined elsewhere), but can be any queueing/file-stream/syslog or whatever
        lines = queue_IN.get(block=True, max_chunk_size=max_chunk_size)
        futures = []
        df = pd.DataFrame(data=lines, columns=['a', 'b', 'c'])
        futures.append(dask_client.submit(my_dask_function, df))
        result = dask_client.gather(futures)
        print(result)
        time.sleep(0.1)

if __name__ == '__main__':
    max_chunk_size = 1000
    thread_stream_data_from_redis = threading.Thread(target=async_stream_redis_to_d, args=[max_chunk_size])
    # thread_stream_data_from_redis.setDaemon(True)
    thread_stream_data_from_redis.start()
    # Lets go
This works as expected and it is really quick!!!
But next, I would like to actually append the lines before the computation takes place, and I wonder whether this is possible. So in the example here, I would like to calculate the mean over all lines that have ever been submitted, not only the last submitted ones.
Questions / Approaches:
Is this cumulative calculation possible?
Bad Alternative 1: I cache all lines locally and submit all the data to the cluster every time a new row arrives, so the overhead grows with every submission. Tried it, it works, but it is slow!
Golden Option: Python Program 1 pushes the data. Then it would be possible to connect another client (from another Python program) to that cumulated data and move the analysis logic away from the inserting logic. I think Published DataSets are the way to go, but are they applicable for these high-speed appends?
Maybe related: Distributed Variables, Actors Worker
Assigning a list of futures to a published dataset seems ideal to me. This is relatively cheap (everything is metadata) and you'll be up to date to within a few milliseconds:
client.datasets["x"] = list_of_futures

def worker_function(...):
    futures = get_client().datasets["x"]
    data = get_client().gather(futures)
    ... work with data
As you mention, there are other systems like PubSub or Actors. From what you say, though, I suspect that futures + published datasets are the simpler and more pragmatic option.
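To make that concrete, here is a minimal sketch of the producer/consumer split under the setup above; append_df is a hypothetical helper, and it assumes the scheduler from dask_server.py is listening on port 8711:

import pandas as pd
from dask.distributed import Client

# --- producer side (e.g. my_stream.py): publish the growing list of futures ---
producer = Client('tcp://127.0.0.1:8711')
all_futures = []

def append_df(df):
    all_futures.append(producer.scatter(df))        # ship the new chunk to the cluster once
    if "x" in producer.list_datasets():
        producer.unpublish_dataset("x")             # replace the previously published list
    producer.publish_dataset(x=list(all_futures))   # cheap: only metadata is published

append_df(pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]}))
append_df(pd.DataFrame({'a': [7, 8], 'b': [9, 0], 'c': [1, 2]}))

# --- consumer side (a separate Python program): fetch everything submitted so far ---
consumer = Client('tcp://127.0.0.1:8711')
full_df = pd.concat(consumer.gather(consumer.get_dataset("x")))
print(full_df['a'].mean() + full_df['b'].mean())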
I'm trying to send 6 API requests in one session and record how long they took, repeating this n times. I store each elapsed time in a list, then print some summary info and visualize the list data.
What works: downloading as .py, removing ipython references, and running the code as a command line script.
What also works: manually running the cells, running the erroring cell only after the loop cell completes.
What doesn't work: restarting and running all cells within the Jupyter notebook. The last cell seems not to wait for the prior one: it appears to execute first and complains about an empty list. Error in image below.
Cell 1:
# submit 6 models at the same time
# using support / first specified DPE above
auth = aiohttp.BasicAuth(login=USERNAME, password=API_TOKEN)

async def make_posts():
    for i in range(0, 6):
        yield df_input['deployment_id'][i]

async def synch6():
    # url = "%s/predApi/v1.0/deployments/%s/predictions" % (PREDICTIONSENDPOINT, DEPLOYMENT_ID)
    async with aiohttp.ClientSession(auth=auth) as session:
        post_tasks = []
        # prepare the coroutines that post
        async for x in make_posts():
            post_tasks.append(do_post(session, x))
        # now execute them all at once
        await asyncio.gather(*post_tasks)

async def do_post(session, x):
    url = "%s/predApi/v1.0/deployments/%s/predictions" % (PREDICTIONSENDPOINT, x)
    async with session.post(url, data=df_scoreme.to_csv(), headers=PREDICTIONSHEADERS_csv) as response:
        data = await response.text()
        # print(data)
Cell 2:
chonk_start = datetime.datetime.utcnow().strftime("%Y-%m-%dT%H:%M:%S.000Z")
perf1 = []
n = 100
for i in range(0, n):
    start_ts = round(time.time() * 1000)
    loop = asyncio.get_event_loop()
    loop.run_until_complete(synch6())
    end_ts = round(time.time() * 1000)
    perf1.append(end_ts - start_ts)
Cell 3:
perf_string(perf1, 'CHONKS')
The explicit error (see image) appears to simply be the result of trying to work on an empty list. The list appears to be empty because the last cell executes before the loop cell actually populates it, although I don't know why. This appears to only be a problem inside the notebook.
EDIT: In further testing, this appears to work fine in my local (Python 3, Mac) Jupyter notebook. Where it fails is in an AWS Sagemaker conda python3 notebook.
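One environment difference that could matter here is the kernel's own asyncio loop: newer ipykernel builds already run an event loop, and calling loop.run_until_complete there can fail or interleave oddly. A hedged sketch of a loop-agnostic rewrite of Cell 2 (timed_runs is a made-up name):

import asyncio
import time

async def timed_runs(n=100):
    perf = []
    for _ in range(n):
        start_ts = round(time.time() * 1000)
        await synch6()  # awaiting stays inside whatever loop is already running
        perf.append(round(time.time() * 1000) - start_ts)
    return perf

# In a notebook cell (IPython supports top-level await):
#     perf1 = await timed_runs()
# In a plain script:
#     perf1 = asyncio.run(timed_runs())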
I want to use a thread pool for my script. I have working code for HTML table to JSON conversion, using pandas for the table parsing.
html_source2 = str(html_source1)
pool = ThreadPool(4)
table = pd.read_html(html_source2)[0]
table = table.loc[:, ~table.columns.str.startswith('Unnamed')]
d = table.to_dict('records')
print(json.dumps(d, ensure_ascii=False))
results = json.dumps(d, ensure_ascii=False)
I want something like:
html_source2 = str(html_source1)
pool = ThreadPool(4)

def abcd():
    table = pd.read_html(html_source2)[0]
    table = table.loc[:, ~table.columns.str.startswith('Unnamed')]
    d = table.to_dict('records')
    print(json.dumps(d, ensure_ascii=False))
    results = json.dumps(d, ensure_ascii=False)
You are almost there. You need to make the function take an input argument, here html_str, and then have it return the results you need, so you can use them outside the function.
html_source2 = str(html_source1)
pool = ThreadPool(4)

def abcd(html_str):
    table = pd.read_html(html_str)[0]
    table = table.loc[:, ~table.columns.str.startswith('Unnamed')]
    d = table.to_dict('records')
    print(json.dumps(d, ensure_ascii=False))
    results = json.dumps(d, ensure_ascii=False)
    return results

my_results = abcd(html_source2)
And remove the print call if you don't need to see the output inside the function.
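Since the ThreadPool is created but never used, here is a sketch of how abcd could be dispatched through it, assuming you have several HTML strings to convert (another_html_source is a made-up placeholder):

from multiprocessing.pool import ThreadPool

pool = ThreadPool(4)
sources = [html_source2, another_html_source]  # another_html_source is hypothetical
# run abcd on each source in parallel worker threads
all_results = pool.map(abcd, sources)
pool.close()
pool.join()
print(all_results)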
If you don't know much about functions, parameters and how to call functions, read here: https://www.w3schools.com/python/python_functions.asp
Consider reading it, it's a short read.
My program does this:
Get the XML from my website
Run through all the URLs
Get data from my web page (SKU, name, title, price, etc.) with requests
Get the lowest price from another website, by comparing prices for the same SKU, again with requests
I'm making lots of requests, one in each def:
def get_Price(SKU):
    check = 'https://www.XXX=' + SKU
    r = requests.get(check)
    html = requests.get(r.url)
    bsObj = BeautifulSoup(html.content, 'html.parser')
    # ... parse bsObj to get Price ...
    return Price

def get_StoreName(SKU):
    check = 'https://XXX?keyword=' + SKU
    r = requests.get(check)
    html = requests.get(r.url)
    bsObj = BeautifulSoup(html.content, 'html.parser')
    # ... parse bsObj to get storeName ...
    return storeName

def get_h1Tag(u):
    html = requests.get(u)
    bsObj = BeautifulSoup(html.content, 'xml')
    h1 = bsObj.find('h1', attrs={'itemprop': 'name'}).get_text()
    return h1
How can I reduce the number of requests or connections to the URL - and use with one request or one connection throughout the whole program ?
I assume this is a script with a group of methods you call in a particular order.
If so, this is a good use case for a dict. I would write a function that memoizes calls to URLs.
You can then reuse this function across your other functions:
requests_cache = {}

def get_url(url, format_parser):
    if url not in requests_cache:
        r = requests.get(url)
        html = requests.get(r.url)
        requests_cache[url] = BeautifulSoup(html.content, format_parser)
    return requests_cache[url]
def get_Price(makat):
    url = 'https://www.zap.co.il/search.aspx?keyword=' + makat
    bsObj = get_url(url, 'html.parser')
    # your code to find the price
    return zapPrice

def get_zapStoreName(makat):
    url = 'https://www.zap.co.il/search.aspx?keyword=' + makat
    bsObj = get_url(url, 'html.parser')
    # your code to find the store name
    return storeName

def get_h1Tag(u):
    bsObj = get_url(u, 'xml')
    h1 = bsObj.find('h1', attrs={'itemprop': 'name'}).get_text()
    return h1
If you want to avoid a global variable, you can also set requests_cache as an attribute of get_url, or as a default argument in the definition. The latter would also allow you to bypass the cache by passing an empty dict.
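A sketch of that default-argument variant (same behaviour as the version above, just without the global; some_url is a placeholder):

def get_url(url, format_parser, requests_cache={}):
    # the mutable default dict persists across calls and acts as the cache
    if url not in requests_cache:
        r = requests.get(url)
        html = requests.get(r.url)
        requests_cache[url] = BeautifulSoup(html.content, format_parser)
    return requests_cache[url]

# passing a fresh dict bypasses the shared cache for this one call
bsObj = get_url(some_url, 'html.parser', {})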
Again, the assumption here is that you are running this code as a script periodically. In that case, the requests_cache will get cleared every time you run the program.
However, if this is part of a larger program, you would want to 'expire' the cache on a regular basis, otherwise you would get the same results every time.
This is a good use case for the requests-cache library. Example:
from requests_cache import CachedSession
# Save cached responses in a SQLite file (scraper_cache.sqlite), and expire after 6 minutes
session = CachedSession('scraper_cache.sqlite', expire_after=360)
def get_Price(SKU):
    check = 'https://www.XXX=' + SKU
    r = session.get(check)
    html = session.get(r.url)
    bsObj = BeautifulSoup(html.content, 'html.parser')
    return Price

def get_StoreName(SKU):
    check = 'https://XXX?keyword=' + SKU
    r = session.get(check)
    html = session.get(r.url)
    bsObj = BeautifulSoup(html.content, 'html.parser')
    return storeName

def get_h1Tag(u):
    html = session.get(u)
    bsObj = BeautifulSoup(html.content, 'xml')
    h1 = bsObj.find('h1', attrs={'itemprop': 'name'}).get_text()
    return h1
Aside: with or without requests-cache, using sessions is good practice whenever you're making repeated calls to the same host, since it uses connection pooling: https://docs.python-requests.org/en/latest/user/advanced/#session-objects
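For completeness, a minimal sketch of plain-requests session reuse (the URL pattern and SKUs are just the question's placeholders):

import requests

session = requests.Session()  # one shared connection pool for the whole program
for sku in ('111', '222', '333'):
    r = session.get('https://www.XXX=' + sku)
    print(sku, r.status_code)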
I have three functions: two take a string and return a string, and a third takes two strings and returns a string. I am trying to create a simple Tkinter GUI that takes in the parameters for any of the functions, then, based on which button is pressed, runs the right function and returns the result. Tkinter is giving me a hard time. I need four input fields to cover all possible parameters, and then to run the correct function on a button press. The functions will look like:
CalculateStrenghtofBrute(Word, Charset)
CalculateDictionary(Word)
CalculatePassPhrase(Phrase)
All return a string created within the functions.
Below is a Sample Function
def wordTime(Password):
    with open('Dics/dict.txt', 'r') as f:
        Words = f.read().splitlines()
    found = Words.index(Password)
    found += 1
    timeSec = found * .1
    if timeSec > 31536000:      # seconds in a year
        time = timeSec / 31536000
        timeType = 'Years'
    elif timeSec > 86400:       # seconds in a day
        time = timeSec / 86400
        timeType = 'Days'
    elif timeSec > 3600:        # seconds in an hour
        time = timeSec / 3600
        timeType = 'Hours'
    elif timeSec > 60:
        time = timeSec / 60
        timeType = 'Minutes'
    else:
        time = timeSec
        timeType = 'Seconds'
    return 'Cracking %s using dictionary attack will take %s %s.' % (Password, round(time, 2), timeType)
If you want to take input from the user you need to create an entry box. Once you have an entry box you can call its get method to read the string it currently holds. I've taken your example function and made a simple Tk GUI for it:
import Tkinter as tk  # 'tkinter' (lowercase) on Python 3

def wordTime():
    password = input_box.get()
    with open('Dics/dict.txt', 'r') as f:
        Words = f.read().splitlines()
    found = Words.index(password)
    found += 1
    timeSec = found * .1
    if timeSec > 31536000:
        time = timeSec / 31536000
        timeType = 'Years'
    elif timeSec > 86400:
        time = timeSec / 86400
        timeType = 'Days'
    elif timeSec > 3600:
        time = timeSec / 3600
        timeType = 'Hours'
    elif timeSec > 60:
        time = timeSec / 60
        timeType = 'Minutes'
    else:
        time = timeSec
        timeType = 'Seconds'
    print('Cracking', password, 'using dictionary attack will take', round(time, 2), timeType + '.')

# Make a top level Tk window
root = tk.Tk()
root.title("Cracker GUI v.01")

# Set up a Label
grovey_label = tk.Label(text="Enter password:")
grovey_label.pack(side=tk.LEFT, padx=10, pady=10)

# Make an input box
input_box = tk.Entry(root, width=10)
input_box.pack(side=tk.LEFT, padx=10, pady=10)

# Make a button which takes wordTime as command,
# Note that we are not using wordTime()
mega_button = tk.Button(root, text="GO!", command=wordTime)
mega_button.pack(side=tk.LEFT)

# Lets get the show on the road
root.mainloop()
If you wanted to take multiple values you could use multiple entry boxes and multiple buttons, as in the sketch below. I'm not sure about your function itself, but that's not really the question at hand.
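For instance, a minimal sketch wiring entry boxes to the three functions named in the question (it assumes those functions are defined elsewhere; the widget and callback names are made up, and the three boxes cover the Word, Charset and Phrase parameters):

import Tkinter as tk  # 'tkinter' (lowercase) on Python 3

root = tk.Tk()
root.title("Cracker GUI v.02")

# one entry box per possible parameter
word_box = tk.Entry(root, width=20)
charset_box = tk.Entry(root, width=20)
phrase_box = tk.Entry(root, width=20)
result_label = tk.Label(root, text="")

def on_brute():
    result_label.config(text=CalculateStrenghtofBrute(word_box.get(), charset_box.get()))

def on_dictionary():
    result_label.config(text=CalculateDictionary(word_box.get()))

def on_passphrase():
    result_label.config(text=CalculatePassPhrase(phrase_box.get()))

for box in (word_box, charset_box, phrase_box):
    box.pack(padx=10, pady=5)
tk.Button(root, text="Brute force", command=on_brute).pack()
tk.Button(root, text="Dictionary", command=on_dictionary).pack()
tk.Button(root, text="Passphrase", command=on_passphrase).pack()
result_label.pack(pady=10)

root.mainloop()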
For reference the following sites have some good basic examples:
http://effbot.org/tkinterbook/entry.htm
http://effbot.org/tkinterbook/button.htm
http://www.ittc.ku.edu/~niehaus/classes/448-s04/448-standard/simple_gui_examples/