boto3 read bucket files using concurrency method - python-3.x

I am trying to read bucket files without saving them to disk:

import boto3
import botocore
from io import StringIO
import pandas as pd

s3 = boto3.resource('s3', config=botocore.config.Config(signature_version=botocore.UNSIGNED))
bucket = s3.Bucket('deutsche-boerse-xetra-pds')
objects = bucket.objects.filter(Prefix=date)
for obj in objects:
    file = pd.read_csv(StringIO(bucket.Object(key=obj.key).get().get('Body').read().decode('utf-8')))
This code works quite well. However, I would like to use concurrency (Python asyncio) to speed up the reading process. I searched the documentation, but I could only find something for the download function, not for the get function.
Do you have any suggestions?
Thanks in advance.

I found a solution that works with multiprocessing, since my final goal was to reduce the processing time. Here is the code:
import multiprocessing
from io import StringIO
from typing import List

import boto3
import botocore
import pandas as pd


def generate_bucket():
    s3_resource = boto3.resource('s3', config=botocore.config.Config(signature_version=botocore.UNSIGNED))
    xetra_bucket = s3_resource.Bucket('deutsche-boerse-xetra-pds')
    return s3_resource, xetra_bucket

def read_csv(key):
    s3_local, bucket_local = generate_bucket()
    return pd.read_csv(StringIO(bucket_local.Object(key=key).get().get('Body').read().decode('utf-8')))

def import_raw_data(date: List[str]) -> pd.DataFrame:
    s3_local, bucket_local = generate_bucket()
    keys = [obj.key for obj in bucket_local.objects.filter(Prefix=date[0])]
    with multiprocessing.Pool(multiprocessing.cpu_count()) as p:
        df = pd.concat(p.map(read_csv, keys))
    return df
For me it works, but I am sure there is room to improve this code. I'm open to suggestions.
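Since the S3 reads are I/O-bound rather than CPU-bound, a thread pool may be a lighter-weight alternative to multiprocessing, as it avoids pickling the DataFrames between processes. This is only a sketch built on the generate_bucket/read_csv helpers above; the worker count of 16 is an arbitrary starting point, not a tested value:

from concurrent.futures import ThreadPoolExecutor
from typing import List

import pandas as pd


def import_raw_data_threaded(date: List[str]) -> pd.DataFrame:
    # List the object keys once, then fetch and parse them concurrently.
    _, bucket = generate_bucket()
    keys = [obj.key for obj in bucket.objects.filter(Prefix=date[0])]
    # read_csv builds its own boto3 resource per call, which sidesteps most
    # thread-safety concerns; tune the worker count against your bandwidth.
    with ThreadPoolExecutor(max_workers=16) as pool:
        frames = list(pool.map(read_csv, keys))
    return pd.concat(frames)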

Related

Download all of csv files of tensorboard at once

I wanted to download the data of all my runs at once from TensorBoard, but it seems there is no way to download all of them in one click. Does anyone know a solution to this problem?
This might lead to your answer: https://stackoverflow.com/a/73409436/11657898. That's for one file, but it's ready to put into a loop.
I came up with this to solve my problem. First, you'll need to run TensorBoard on localhost and then scrape the data it serves to the browser.
import pandas as pd
import requests
from csv import reader
import os


def URLs(fold, trial):
    URLs_dict = {
        'train_accuracy': f'http://localhost:6006/data/plugin/scalars/scalars?tag=epoch_accuracy&run=fold{fold}%5C{trial}%5Cexecution0%5Ctrain&format=csv',
        'val_accuracy': f'http://localhost:6006/data/plugin/scalars/scalars?tag=epoch_accuracy&run=fold{fold}%5C{trial}%5Cexecution0%5Cvalidation&format=csv',
        'val_loss': f'http://localhost:6006/data/plugin/scalars/scalars?tag=epoch_loss&run=fold{fold}%5C{trial}%5Cexecution0%5Cvalidation&format=csv',
        'train_loss': f'http://localhost:6006/data/plugin/scalars/scalars?tag=epoch_loss&run=fold{fold}%5C{trial}%5Cexecution0%5Ctrain&format=csv'
    }
    return URLs_dict

def tb_data(log_dir, mode, fold, num_trials):
    trials = os.listdir(log_dir)
    fdf = {}
    for i, trial in enumerate(trials[:num_trials]):
        r = requests.get(URLs(fold, trial)[mode], allow_redirects=True)
        data = r.text
        data_csv = reader(data.splitlines())
        data_csv = list(data_csv)
        df = pd.DataFrame(data_csv)
        headers = df.iloc[0]
        df = pd.DataFrame(df.values[1:], columns=headers)
        if i == 0:
            fdf['Step'] = df['Step']
        fdf[f'trial {trial}'] = df['Value']
    fdf = pd.DataFrame(fdf)
    return fdf
P.S.: It might need a little tweaking for a different directory layout.
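A minimal usage sketch, assuming TensorBoard is already serving logs from ./logs on port 6006 and that the run names follow the fold{fold}\{trial}\execution0 layout hard-coded in URLs() (the paths and parameter values below are only placeholders):

# Collect the validation-accuracy curves of the first 5 trials of fold 0.
val_acc = tb_data(log_dir='./logs', mode='val_accuracy', fold=0, num_trials=5)
val_acc.to_csv('fold0_val_accuracy.csv', index=False)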

Understanding argparse to get dynamic maps with Geo-Location of tweets

I have found this Python code online (twitter_map_clustered.py) which (I think) helps create a map using the geodata of different tweets:
from argparse import ArgumentParser
import folium
from folium.plugins import MarkerCluster
import json


def get_parser():
    parser = ArgumentParser()
    parser.add_argument('--geojson')
    parser.add_argument('--map')
    return parser

def make_map(geojson_file, map_file):
    tweet_map = folium.Map(location=[50, 5], max_zoom=20)
    marker_cluster = MarkerCluster().add_to(tweet_map)
    geodata = json.load(open(geojson_file))
    for tweet in geodata['features']:
        tweet['geometry']['coordinates'].reverse()
        marker = folium.Marker(tweet['geometry']['coordinates'], popup=tweet['properties']['text'])
        marker.add_to(marker_cluster)
    # Save to HTML map file
    tweet_map.save(map_file)

if __name__ == '__main__':
    parser = get_parser()
    args = parser.parse_args()
    make_map(args.geojson, args.map)
I managed to extract the geo information of different tweets and save it into a geo_data.json file. However, I have trouble understanding the code, especially the function get_parser().
It seems that we need to pass arguments when running the file from the command prompt. One argument should be geo_data.json. However, it is also asking for a map: parser.add_argument('--map')
Why is that the case? In the code, aren't we already creating the map here?
# Save to HTML map file
tweet_map.save(map_file)
Can you please help me? How would you run the Python script? Is there anything important I am missing?
As explained in the argparse documentation, it simply asks for the name of the geojson file and the name your code will use to save the map.
Therefore, you will run:
python twitter_map_clustered.py --geojson geo_data.json --map mymap.html
and you will get a map named mymap.html.
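If it helps to see what the parser actually produces, parse_args simply turns the two --flags into attributes on a namespace object; the file names below are only examples:

from argparse import ArgumentParser

parser = ArgumentParser()
parser.add_argument('--geojson')  # input: path to the GeoJSON file with the tweets
parser.add_argument('--map')      # output: path where the HTML map will be written

args = parser.parse_args(['--geojson', 'geo_data.json', '--map', 'mymap.html'])
print(args.geojson)  # geo_data.json
print(args.map)      # mymap.html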

PIL and skimage taking too long to load 56000 URL's

My dataframe data3 has 56,000 rows, with an image thumbnail URL as one of its column values. I am evaluating whether each of those images is low contrast or not. I let the code below run for 9 hours, but there was still no result and the kernel was still busy. Can you please let me know what's wrong?
P.S. I tried the code with a subset of my dataframe (100 rows) and it took 3 seconds to run successfully. By that standard, 56,000 rows should take about 30 minutes. Is there a memory overrun happening with temp files or something?
Maybe I need to introduce a try block here to catch any exceptions (even though no error is showing)? I'm not sure how to do that.
from PIL import Image
import urllib.request
import skimage


def f(row):
    URL = row['ThumbnailURL']
    # URL = 'http://www.moma.org/media/W1siZiIsIjU5NDA1Il0sWyJwIiwiY29udmVydCIsIi1yZXNpemUgMzAweDMwMFx1MDAzZSJdXQ.jpg?sha=137b8455b1ec6167'
    with urllib.request.urlopen(URL) as url:
        with open('temp.jpg', 'wb') as f:
            f.write(url.read())
    tutu = Image.open('temp.jpg')
    val = skimage.exposure.is_low_contrast(tutu, fraction_threshold=0.4, lower_percentile=1, upper_percentile=99, method='linear')
    return val

data3['lowcontornot'] = data3.apply(f, axis=1)
The solution below avoids saving temporary images to disk, thus reducing the overhead:
def f(row):
    url = row['ThumbnailURL']
    img = Image.open(BytesIO(requests.get(url).content))
    return is_low_contrast(img, fraction_threshold=0.4, lower_percentile=1,
                           upper_percentile=99, method='linear')
To give this code a try you need to include the following imports:
import requests
from io import BytesIO
from PIL import Image
from skimage.exposure import is_low_contrast
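A likely reason the full run never finished is a single request that hangs; neither urllib.request.urlopen nor requests.get applies a timeout by default. If that turns out to be the issue, a hedged variation of the in-memory version above with a timeout, a shared session and the try block you mentioned might look like this (the 10-second timeout is an arbitrary choice):

import requests
from io import BytesIO
from PIL import Image
from skimage.exposure import is_low_contrast

session = requests.Session()  # reuse one connection pool across all 56,000 requests

def f(row):
    url = row['ThumbnailURL']
    try:
        resp = session.get(url, timeout=10)  # give up on unresponsive hosts instead of blocking forever
        resp.raise_for_status()
        img = Image.open(BytesIO(resp.content))
        return is_low_contrast(img, fraction_threshold=0.4,
                               lower_percentile=1, upper_percentile=99, method='linear')
    except Exception:
        return None  # mark failures so one bad thumbnail doesn't stall the whole run

data3['lowcontornot'] = data3.apply(f, axis=1)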

Python twisted putChild not forwarding expectedly

Code here.
from twisted.web.static import File
from twisted.web.server import Site
from twisted.web.resource import Resource
from twisted.internet import ssl, reactor
from twisted.python.modules import getModule
import secure_aes
import urllib.parse
import cgi
import json
import os
import hashlib
import coserver
import base64
import sim

if not os.path.exists(os.path.join(os.getcwd(), 'images')):
    os.mkdir(os.path.join(os.getcwd(), 'images'))

with open('form.html', 'r') as f:
    fillout_form = f.read()
with open('image.html', 'r') as f:
    image_output = f.read()

port = 80  # int(os.environ.get('PORT', 17995))

class FormPage(Resource):
    # isLeaf = True
    def getChild(self, name, request):
        print('GC')
        if name == '':
            return self
        return Resource.getChild(self, name, request)

    def render_GET(self, request):
        print(request)
        # do stuff and return stuff

root = FormPage()
root.putChild('rcs', File("./images"))
# factory = Site(FormPage())
factory = Site(root)
reactor.listenTCP(port, factory)
reactor.run()
As you can see, I call root.putChild towards the end, expecting that when I go to http://site/rcs I will be given a directory listing of the contents of ./images, but of course that doesn't happen. What am I missing? I've tried many of the things suggested here. Also this one doesn't work, because that's just serving static files anyway. It goes to getChild all the time, regardless of whether I have specified putChild or not.
On Python 3, a bare string literal like "rcs" is a unicode string (which Python 3 calls "str" but which I will call "unicode" to avoid ambiguity).
However, twisted.web.resource.Resource.putChild requires a byte string as its first argument. It misbehaves rather poorly when given unicode instead. Make your path segments byte strings (e.g. b"rcs") and the server will behave better on Python 3.
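Applied to the snippet above, the fix is likely limited to the putChild call, plus the name comparison in getChild for the same reason (Twisted hands getChild the child name as bytes on Python 3). A minimal sketch of just the affected lines:

class FormPage(Resource):
    def getChild(self, name, request):
        # On Python 3 `name` arrives as bytes, so compare against b'' rather than ''.
        if name == b'':
            return self
        return Resource.getChild(self, name, request)

root = FormPage()
# Path segments must be byte strings on Python 3.
root.putChild(b'rcs', File("./images"))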

Multithreading in Python/BeautifulSoup scraping doesn't speed up at all

I have a csv file ("SomeSiteValidURLs.csv") which lists all the links I need to scrape. The code works: it goes through the URLs in the csv, scrapes the information and records/saves it in another csv file ("Output.csv"). However, since I am planning to do this for a large portion of the site (>10,000,000 pages), speed is important. Each link takes about 1 s to crawl and save to the csv, which is too slow for the magnitude of the project. So I incorporated the threading module and, to my surprise, it doesn't speed things up at all; it still takes about 1 s per link. Did I do something wrong? Is there another way to speed up the processing?
Without multithreading:
import urllib2
import csv
from bs4 import BeautifulSoup
import threading

def crawlToCSV(FileName):
    with open(FileName, "rb") as f:
        for URLrecords in f:
            OpenSomeSiteURL = urllib2.urlopen(URLrecords)
            Soup_SomeSite = BeautifulSoup(OpenSomeSiteURL, "lxml")
            OpenSomeSiteURL.close()
            tbodyTags = Soup_SomeSite.find("tbody")
            trTags = tbodyTags.find_all("tr", class_="result-item ")
            placeHolder = []
            for trTag in trTags:
                tdTags = trTag.find("td", class_="result-value")
                tdTags_string = tdTags.string
                placeHolder.append(tdTags_string)
            with open("Output.csv", "ab") as f:
                writeFile = csv.writer(f)
                writeFile.writerow(placeHolder)

crawlToCSV("SomeSiteValidURLs.csv")
With multithreading:
import urllib2
import csv
from bs4 import BeautifulSoup
import threading

def crawlToCSV(FileName):
    with open(FileName, "rb") as f:
        for URLrecords in f:
            OpenSomeSiteURL = urllib2.urlopen(URLrecords)
            Soup_SomeSite = BeautifulSoup(OpenSomeSiteURL, "lxml")
            OpenSomeSiteURL.close()
            tbodyTags = Soup_SomeSite.find("tbody")
            trTags = tbodyTags.find_all("tr", class_="result-item ")
            placeHolder = []
            for trTag in trTags:
                tdTags = trTag.find("td", class_="result-value")
                tdTags_string = tdTags.string
                placeHolder.append(tdTags_string)
            with open("Output.csv", "ab") as f:
                writeFile = csv.writer(f)
                writeFile.writerow(placeHolder)

fileName = "SomeSiteValidURLs.csv"

if __name__ == "__main__":
    t = threading.Thread(target=crawlToCSV, args=(fileName, ))
    t.start()
    t.join()
You're not parallelizing this properly. What you actually want to do is have the work being done inside your for loop happen concurrently across many workers. Right now you're moving all the work into one background thread, which does the whole thing synchronously. That's not going to improve performance at all (it will just slightly hurt it, actually).
Here's an example that uses a ThreadPool to parallelize the network operation and parsing. It's not safe to try to write to the csv file across many threads at once, so instead we return the data that would have been written back to the parent, and have the parent write all the results to the file at the end.
import urllib2
import csv
from bs4 import BeautifulSoup
from multiprocessing.dummy import Pool  # This is a thread-based Pool
from multiprocessing import cpu_count

def crawlToCSV(URLrecord):
    OpenSomeSiteURL = urllib2.urlopen(URLrecord)
    Soup_SomeSite = BeautifulSoup(OpenSomeSiteURL, "lxml")
    OpenSomeSiteURL.close()
    tbodyTags = Soup_SomeSite.find("tbody")
    trTags = tbodyTags.find_all("tr", class_="result-item ")
    placeHolder = []
    for trTag in trTags:
        tdTags = trTag.find("td", class_="result-value")
        tdTags_string = tdTags.string
        placeHolder.append(tdTags_string)
    return placeHolder

if __name__ == "__main__":
    fileName = "SomeSiteValidURLs.csv"
    pool = Pool(cpu_count() * 2)  # Creates a Pool with cpu_count * 2 threads.
    with open(fileName, "rb") as f:
        results = pool.map(crawlToCSV, f)  # results is a list of all the placeHolder lists returned from each call to crawlToCSV
    with open("Output.csv", "ab") as f:
        writeFile = csv.writer(f)
        for result in results:
            writeFile.writerow(result)
Note that in Python, threads only actually speed up I/O operations - because of the GIL, CPU-bound operations (like the parsing/searching BeautifulSoup is doing) can't actually be done in parallel via threads, because only one thread can do CPU-based operations at a time. So you still may not see the speed up you were hoping for with this approach. When you need to speed up CPU-bound operations in Python, you need to use multiple processes instead of threads. Luckily, you can easily see how this script performs with multiple processes instead of multiple threads; just change from multiprocessing.dummy import Pool to from multiprocessing import Pool. No other changes are required.
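Concretely, the swap is a one-line change at the top of the script above; everything else stays the same:

# Thread-backed pool (what the script above uses; fine for I/O-bound work):
from multiprocessing.dummy import Pool

# Process-backed pool (lets the BeautifulSoup parsing use multiple cores);
# replace the import above with this single line instead:
# from multiprocessing import Pool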
Edit:
If you need to scale this up to a file with 10,000,000 lines, you're going to need to adjust this code a bit - Pool.map converts the iterable you pass into it to a list prior to sending it off to your workers, which obviously isn't going to work very well with a 10,000,000 entry list; having that whole thing in memory is probably going to bog down your system. Same issue with storing all the results in a list. Instead, you should use Pool.imap:
imap(func, iterable[, chunksize])
    A lazier version of map(). The chunksize argument is the same as the one used by the map() method. For very long iterables, using a large value for chunksize can make the job complete much faster than using the default value of 1.
if __name__ == "__main__":
    fileName = "SomeSiteValidURLs.csv"
    FILE_LINES = 10000000
    NUM_WORKERS = cpu_count() * 2
    chunksize = FILE_LINES // NUM_WORKERS * 4  # Try to get a good chunksize. You're probably going to have to tweak this, though. Try larger and smaller values and see how performance changes.
    pool = Pool(NUM_WORKERS)
    with open(fileName, "rb") as f:
        result_iter = pool.imap(crawlToCSV, f, chunksize)
    with open("Output.csv", "ab") as f:
        writeFile = csv.writer(f)
        for result in result_iter:  # lazily iterate over results.
            writeFile.writerow(result)
With imap, we never put all of f into memory at once, nor do we store all the results in memory at once. The most we ever have in memory is chunksize lines of f, which should be more manageable.
