I want to get 1 million records from a single REST API URL with pagination. Each page can return up to 100k records in approximately 1 minute. I want to make 10 POST requests concurrently so that I can get all 1 million records in about 1 minute. I used the requests library with ThreadPoolExecutor to make concurrent connections and fetch the data, but it's taking a very long time.
My understanding of aiohttp and grequests is that they call the REST API asynchronously from a single thread, fetching data for one request while the others are still waiting on their network connections.
Sample code
import concurrent.futures
import requests

def post_request(page_number):
    response = requests.post(url, data={**data, "fetchPageNumber": page_number})
    # logic to convert response.text to a pandas df and write to RDBMS

with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    parallel_response = executor.map(post_request, range(1, 11))
Please let me know the best way to fetch the paginated data in parallel.
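For reference, a minimal aiohttp sketch of the single-threaded asynchronous approach described above. It assumes the same url and data variables as the sample code, and it assumes the API accepts the page number as a fetchPageNumber field in the JSON body; adjust to match the real API.

import asyncio
import aiohttp

async def fetch_page(session, page_number):
    # One POST per page; the payload layout is an assumption about the real API.
    async with session.post(url, json={**data, "fetchPageNumber": page_number}) as resp:
        return await resp.json()

async def fetch_all_pages():
    async with aiohttp.ClientSession() as session:
        # All 10 page requests are in flight at once, on a single thread.
        return await asyncio.gather(*(fetch_page(session, p) for p in range(1, 11)))

pages = asyncio.run(fetch_all_pages())

Whether this is faster than 10 threads depends on the server: if each page genuinely takes ~1 minute to produce, the bottleneck is the API itself, not the client-side concurrency model.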
I have a Python Flask page which is extremely slow to generate. It takes about 1 minute to pull all the data from external APIs and process it before returning the page. Fortunately, the data is valid for up to 1 hour, so I can cache the result and return the cached result quickly for most requests.
This works well except for the minute after the cache expires. If 10 requests are made within that single minute, there will be 10 calls to the veryslowpage() function; this eats up the HTTPS connection pool due to the external API calls and eats up memory due to the data processing, affecting other pages on the site. Is there a way to limit this function to a single instance, so that 10 requests result in only 1 call to veryslowpage() while the rest wait until the cached result is ready?
from flask import Flask, request, abort, render_template
from flask_caching import Cache

@app.route('/veryslowpage', methods=['GET'])
@cache.cached(timeout=3600, query_string=True)
def veryslowpage():
    data = callexternalAPIs()
    result = heavydataprocessing(data)
    return render_template("./index.html", content=result)
You could simply create a function that periodically fetches the data from the API (every hour) and stores it in your database. Then, in your route function, read the data from your database instead of the external API.
A better approach is to create a very simple script and call it (in another thread) from your app/__init__.py so that it fetches the data every hour and updates the database.
You could create a file or a database entry that records that you are calculating the response in a different thread. Your method would then check whether such a file exists and, if it does, wait and poll for the response periodically.
You could also proactively create the data once every hour (or every 59 minutes, if that matters) so you always have a fresh response available. You could use something like APScheduler for this.
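A minimal sketch of that proactive-refresh idea, assuming APScheduler and reusing the cache, callexternalAPIs() and heavydataprocessing() names from the question; the fixed cache key is an arbitrary name used for illustration.

from apscheduler.schedulers.background import BackgroundScheduler

def refresh_veryslowpage_data():
    # Do the slow work ahead of time and store the result under a fixed key.
    data = callexternalAPIs()
    result = heavydataprocessing(data)
    cache.set("veryslowpage_result", result, timeout=3600)

scheduler = BackgroundScheduler()
scheduler.add_job(refresh_veryslowpage_data, "interval", minutes=59)
scheduler.start()
refresh_veryslowpage_data()  # prime the cache once at startup

@app.route('/veryslowpage', methods=['GET'])
def veryslowpage():
    result = cache.get("veryslowpage_result")
    if result is None:
        # Cold-cache fallback; only happens if the background refresh hasn't run yet.
        refresh_veryslowpage_data()
        result = cache.get("veryslowpage_result")
    return render_template("./index.html", content=result)

With this pattern the request handler never calls the external APIs itself, so a burst of requests right after expiry can no longer trigger ten parallel recomputations.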
I'm trying to load more than 20 million records into my DynamoDB table using the code below from a 5-node EMR cluster. But it is taking many hours to finish loading. I have much more data to load, and I want to load it in a span of a few minutes. How can I achieve this?
Below is my code. I just changed the original column names; I have 20 columns to insert. The problem here is the slow loading.
import boto3
import json
import decimal

dynamodb = boto3.resource('dynamodb', region_name='us-west')
table = dynamodb.Table('EMP')

s3 = boto3.client('s3')
obj = s3.get_object(Bucket='mybucket', Key='emp-rec.json')
records = json.loads(obj['Body'].read().decode('utf-8'), parse_float=decimal.Decimal)
for rec in records:
    with table.batch_writer() as batch:   # a new batch_writer is opened for every record
        batch.put_item(Item=rec)
First, you should use Amazon CloudWatch to check whether you are hitting the limits of your configured Write Capacity Units on the table. If so, you can increase the capacity, at least for the duration of the load.
Second, the code is creating batches of one record, which wouldn't be very efficient. The batch_writer() can be used to process multiple records, such as in this sample code from the batch_writer() documentation:
with table.batch_writer() as batch:
    for _ in range(1000000):
        batch.put_item(Item={'HashKey': '...',
                             'Otherstuff': '...'})
Notice how the for loop is inside the batch_writer()? That way, multiple records are stored within one batch. Your code sample, however, has the for outside of the batch_writer(), which results in a batch size of one.
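Applied to the code in the question, a minimal corrected version (same table and records as above) just moves the loop inside the context manager, so the writer can buffer and send items in batches of up to 25:

with table.batch_writer() as batch:
    for rec in records:
        batch.put_item(Item=rec)  # buffered and flushed in batches by batch_writer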
I have a script that calls an API (using Requests) to pull financial data for a list of stocks. It reads the data into pandas dataframes, does a transformation, then uploads the pulled data into Postgres using psycopg2. There are 10-150 API calls that need to be made per ticker, and each iteration takes a few seconds.
Given that the list of tickers is over 2k long, this script takes about a day to run while the CPU sits at only 4% utilization. I want to change the script so that it uses aiohttp (https://aiohttp.readthedocs.io/en/stable/client_quickstart.html) to make all the API calls needed for each ticker at one time. Then, as each request returns, it can be transformed and loaded into Postgres. My hope is that this will significantly cut down the time it takes to process each ticker by increasing the work done by the CPU.
I've looked at the documentation for aiohttp and async/await, but I'm having a hard time wrapping my head around how to structure an asynchronous loop. I'm also unsure how to make sure that, as each API request returns, it immediately kicks off the pandas/Postgres upload instead of waiting for all API calls to return before moving on.
# here is a simplified example of the code for one ticker
import asyncio
import json
import pandas as pd
import psycopg2 as pg
import requests as rq

tkr = 'foo'
api_key = 'bar'
url = 'http://api_call_website.com?{0}&{1}&{2}&{3}&api-key:{4}'

list_of_stmnts = [
    (True, 2009, 'Q4', 'pl'),
    (True, 2018, 'Q3', 'cf'),
    (True, 2018, 'Q2', 'cf'),
    (True, 2017, 'Q4', 'cf')]
# these "statements" contain the parameters that get passed into the url

async def async_get_loop(list_of_stmnts, tkr, url, api_key):
    urls = [url.format(tkr, stmt[1], stmt[2], stmt[3], api_key) for stmt in list_of_stmnts]
    # this builds the list of urls that need to be called
    for u in urls:
        data = rq.request("GET", u)  # requests is blocking, so nothing here actually runs concurrently
        results = data.json()
        df = pd.DataFrame(results['values'])
        df.to_sql('table_name', engine, schema='schema', if_exists='append', index=False)
        # the postgres engine is defined earlier in the code using psycopg2
This shows my rudimentary grasp of how async/await should work. I know that to make it asynchronous, I need to use aiohttp instead of Requests, but frankly I'm lost as to how to put these two packages together.
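A rough aiohttp sketch of what the question describes, assuming the engine variable and the response layout are the same as above (fetch_and_store is a made-up helper name, not part of the original code): each coroutine fetches one URL and, as soon as its response arrives, does the pandas transform and the Postgres load without waiting for the other requests.

import asyncio
import aiohttp
import pandas as pd

async def fetch_and_store(session, u, engine):
    async with session.get(u) as resp:
        results = await resp.json()
    df = pd.DataFrame(results['values'])
    # to_sql is blocking, so run it in a thread pool to keep the event loop free
    await asyncio.get_running_loop().run_in_executor(
        None,
        lambda: df.to_sql('table_name', engine, schema='schema',
                          if_exists='append', index=False))

async def async_get_loop(list_of_stmnts, tkr, url, api_key, engine):
    urls = [url.format(tkr, s[1], s[2], s[3], api_key) for s in list_of_stmnts]
    async with aiohttp.ClientSession() as session:
        # all requests are started together; each one is processed as it completes
        await asyncio.gather(*(fetch_and_store(session, u, engine) for u in urls))

# asyncio.run(async_get_loop(list_of_stmnts, tkr, url, api_key, engine))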
I have to make 100,000 sequential HTTP requests with Spark and store the responses in S3. I say sequential because each request returns around 50KB of data, and I have to keep a 1-second interval between requests in order not to exceed the API rate limits.
Where should I make the HTTP calls: from the Spark job's code (executed on the driver/master node) or from a dataset transformation (executed on a worker node)?
Workarounds
Make the HTTP requests from my Spark job (on the driver/master node), create a dataset from each HTTP response (each contains 5,000 JSON items), and save each dataset to S3 with Spark. The dataset does not need to be kept after it has been saved.
Create a dataset from all 100,000 URLs (moving all further computation to the workers), make the HTTP requests inside map or mapPartitions, and save a single dataset to S3.
The first option
It's simpler and it reflects the nature of my computation: it is sequential because of the 1-second delay. But:
* Is it bad to make 100_000 HTTP calls from the driver/master node?
* Is it more efficient to create/save one dataset of 100_000 * 5_000 items than to create/save 100_000 small datasets of 5_000 items each?
* Each time I create a dataset from an HTTP response, the response is moved to a worker and then saved to S3, right? Double shuffling, then...
Second option
Actually it won't benefit from parallel processing, since I have to keep an interval of 1 second between requests. The only bonus is moving the computation (even if it isn't too heavy) off the driver. But:
* Is it worth moving the computation to the workers?
* Is it a good idea to make an API call inside a transformation?
Saving a file <32MB (or whatever fs.s3a.block.size is) to S3 is ~2xGET, 1xLIST and a PUT; you get billed a bit by AWS for each of these calls, plus storage costs.
For larger files, there is a POST to initiate the multipart upload after that first block, one POST per 32 MB block, and a final POST of a JSON file to complete the upload. So: slightly more efficient.
Where small S3 file sizes matter is in the AWS bills and in follow-up Spark queries: for anything you use in Spark, PySpark, SQL, etc., many small files are slower. There's a high cost to listing files in S3, and every task pushed out to a Spark worker has some setup/commit/complete cost.
Regarding doing HTTP API calls inside a worker: well, you can do fun things there. If the result isn't replicable, then task failures and retries can give bad answers, but for a GET it should be OK. What is hard is throttling the work; I'll leave you to come up with a strategy there.
Here's an example of uploading files to S3 or another object store from workers: first an RDD of the copy src/dest operations is built up, then they are pushed out to the workers. The result of the worker code includes upload duration info, in case someone ever wanted to try to aggregate the stats (though you'd probably need a timestamp for any time-series view).
Given you have to serialize the work to one request per second, 100K requests are going to take over a day. If each request takes under 1 second, you may as well run it on a single machine. What's important is to save the work incrementally, so that if your job fails partway through you can restart from the last checkpoint. I'd personally focus on that problem: how to do this operation such that every 15-20 minutes of work is saved, and on a restart you can carry on from there.
Spark does not handle recovery of a failed job, only task failures. Lose the driver and you get to restart your last query. Break things up.
Something which comes to mind could be:
* a first RDD takes the list of queries and some summary info about any existing checkpointed data, and calculates the next 15 minutes of work,
* building up a list of GET calls to delegate to 1+ workers: either 1 URL per row, or multiple URLs in a single row,
* run that job and save the results,
* test that recovery works with a smaller window and by killing things,
* once happy: do the full run.
Maybe also: recognise & react to any throttle events coming off the far end by
1. sleeping in the worker
2. returning a count of throttle events in the results, so that the driver can initially collect aggregate stats and maybe later tune the sleep window for subsequent tasks.
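A rough PySpark sketch of that pattern (the spark session, urls list, S3 path and column names are placeholders, not the answerer's code): each partition walks its URLs sequentially at roughly one request per second, backs off on and counts any HTTP 429 throttle responses, and the batch is written out so a failed run can resume from what has already been saved.

import time
import requests

def fetch_partition(rows):
    throttles = 0
    for row in rows:
        resp = requests.get(row.url)
        if resp.status_code == 429:   # throttled by the remote API
            throttles += 1
            time.sleep(5)             # back off in the worker
            resp = requests.get(row.url)
        yield (row.url, resp.text, throttles)
        time.sleep(1)                 # stay under ~1 request/second

results = (spark.createDataFrame([(u,) for u in urls], ["url"])
           .rdd
           .repartition(1)            # a single partition if the rate limit is global
           .mapPartitions(fetch_partition))
results.toDF(["url", "body", "throttles"]).write.mode("append").json("s3a://bucket/checkpoint/")

Run this over a 15-20 minute slice of URLs at a time, as suggested above, and recompute the remaining slice from whatever is already under the checkpoint path on restart.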
I'm using pyspark with a Kafka receiver to process a stream of tweets. One of the steps of my application includes a call to the Google Natural Language API to get a sentiment score per tweet. However, I'm seeing that the API is getting several calls per processed tweet (I see the number of calls in the Google Cloud Console).
Also, if I print the tweet IDs (inside the mapped function) I get the same ID 3 or 4 times. At the end of my application, tweets are sent to another topic in Kafka, and there I get the correct count of tweets (no repeated IDs), so in principle everything is working correctly, but I don't know how to avoid calling the Google API more than once per tweet.
Does this have to do with some configuration parameters in Spark or Kafka?
Here's an example of my console output:
TIME 21:53:36: Google Response for tweet 801181843500466177 DONE!
TIME 21:53:36: Google Response for tweet 801181854766399489 DONE!
TIME 21:53:36: Google Response for tweet 801181844808966144 DONE!
TIME 21:53:37: Google Response for tweet 801181854372012032 DONE!
TIME 21:53:37: Google Response for tweet 801181843500466177 DONE!
TIME 21:53:37: Google Response for tweet 801181854766399489 DONE!
TIME 21:53:37: Google Response for tweet 801181844808966144 DONE!
TIME 21:53:37: Google Response for tweet 801181854372012032 DONE!
But in the Kafka receiver I only get 4 processed tweets (which is the correct thing to receive, since there are only 4 unique tweets).
The code that does this is:
import json
from collections import defaultdict
from kafka import KafkaProducer

def sendToKafka(rdd, topic, address):
    publish_producer = KafkaProducer(bootstrap_servers=address,
                                     value_serializer=lambda v: json.dumps(v).encode('utf-8'))
    records = rdd.collect()
    msg_dict = defaultdict(list)
    for rec in records:
        msg_dict["results"].append(rec)
    publish_producer.send(topic, msg_dict)
    publish_producer.close()
kafka_stream = KafkaUtils.createStream(ssc, zookeeperAddress, "spark-consumer-" + myTopic, {myTopic: 1})

dstream_tweets = kafka_stream.map(lambda kafka_rec: get_json(kafka_rec[1])) \
    .map(lambda post: add_normalized_text(post)) \
    .map(lambda post: tagKeywords(post, tokenizer, desired_keywords)) \
    .filter(lambda post: post["keywords"] == True) \
    .map(lambda post: googleNLP.complementTweetFeatures(post, job_id))

dstream_tweets.foreachRDD(lambda rdd: sendToKafka(rdd, resultTopic, PRODUCER_ADDRESS))
I already found the solution to this! I just had to cache the DStream with:
dstream_tweets.cache()
The multiple network calls happened because Spark recomputed the RDDs inside that DStream before performing the later operations in my script. With cache(), the DStream only needs to be computed once; since it is kept in memory, later functions can access that information without recomputation (in this case a recomputation meant calling the API again, so it is worth paying the price of the extra memory usage).
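In terms of the code above, the cache() call slots in between building the DStream and the foreachRDD action, so the Google NLP call in complementTweetFeatures() runs once per tweet and its result is reused:

dstream_tweets = kafka_stream.map(lambda kafka_rec: get_json(kafka_rec[1])) \
    .map(lambda post: add_normalized_text(post)) \
    .map(lambda post: tagKeywords(post, tokenizer, desired_keywords)) \
    .filter(lambda post: post["keywords"] == True) \
    .map(lambda post: googleNLP.complementTweetFeatures(post, job_id))
dstream_tweets.cache()  # keep the computed RDDs in memory instead of recomputing them
dstream_tweets.foreachRDD(lambda rdd: sendToKafka(rdd, resultTopic, PRODUCER_ADDRESS))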