I'm using pyspark with a Kafka Receiver to process a stream of tweets. One of the steps of my application includes a call to the Google Natural Language API to get a sentiment score per tweet. However, I'm seeing that the API is receiving several calls per processed tweet (I can see the number of calls in the Google Cloud Console).
Also, if I print the tweet IDs (inside the mapped function) I get the same ID 3 or 4 times. At the end of my application, tweets are sent to another topic in Kafka, and there I get the correct count of tweets (no repeated IDs), so in principle everything is working correctly, but I don't know how to avoid calling the Google API more than once per tweet.
Does this have to do with some configuration parameters in Spark or Kafka?
Here's an example of my console output:
TIME 21:53:36: Google Response for tweet 801181843500466177 DONE!
TIME 21:53:36: Google Response for tweet 801181854766399489 DONE!
TIME 21:53:36: Google Response for tweet 801181844808966144 DONE!
TIME 21:53:37: Google Response for tweet 801181854372012032 DONE!
TIME 21:53:37: Google Response for tweet 801181843500466177 DONE!
TIME 21:53:37: Google Response for tweet 801181854766399489 DONE!
TIME 21:53:37: Google Response for tweet 801181844808966144 DONE!
TIME 21:53:37: Google Response for tweet 801181854372012032 DONE!
But in the Kafka receiver I only get 4 processed tweets (which is the correct count, since there are only 4 unique tweets).
The code that does this is:
def sendToKafka(rdd, topic, address):
    # One producer per batch; messages are serialized as JSON
    publish_producer = KafkaProducer(bootstrap_servers=address,
                                     value_serializer=lambda v: json.dumps(v).encode('utf-8'))
    records = rdd.collect()
    msg_dict = defaultdict(list)
    for rec in records:
        msg_dict["results"].append(rec)
    publish_producer.send(topic, msg_dict)  # send to the topic parameter, not a global
    publish_producer.close()
kafka_stream = KafkaUtils.createStream(ssc, zookeeperAddress, "spark-consumer-" + myTopic, {myTopic: 1})
dstream_tweets = kafka_stream.map(lambda kafka_rec: get_json(kafka_rec[1])) \
    .map(lambda post: add_normalized_text(post)) \
    .map(lambda post: tagKeywords(post, tokenizer, desired_keywords)) \
    .filter(lambda post: post["keywords"] == True) \
    .map(lambda post: googleNLP.complementTweetFeatures(post, job_id))
dstream_tweets.foreachRDD(lambda rdd: sendToKafka(rdd, resultTopic, PRODUCER_ADDRESS))
I already found the solution to this! I just had to cache the DStream with:
dstream_tweets.cache()
The multiple network calls happened because Spark recomputed the RDDs inside that DStream before performing later operations in my script. When I cache() the DStream it only needs to be computed once; and since it is kept in memory, later functions can access that data without recomputation (in this case a recomputation meant calling the API again, so it is worth paying the price of higher memory usage).
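For clarity, a minimal sketch of where the call goes, using the same names as the code above: the cache() sits between the transformation chain and the output action.
# Materialize dstream_tweets once per batch. Without this, every action on the
# DStream (debug prints, foreachRDD, ...) recomputes the full lineage, which
# means another round of Google NLP API calls per action.
dstream_tweets.cache()
dstream_tweets.foreachRDD(lambda rdd: sendToKafka(rdd, resultTopic, PRODUCER_ADDRESS))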
Related
I want to get 1 million records from a single REST API URL with pagination. Each page can return up to 100k records in approximately 1 minute. I want to make 10 POST requests concurrently so that I can get all 1 million records in about 1 minute. I used the requests library with ThreadPoolExecutor to make concurrent connections and fetch the data, but it's taking a very long time.
My understanding of aiohttp and grequests is that they call the REST API asynchronously from a single thread, fetching data while other requests are waiting on the network.
Sample code
import concurrent.futures
import requests

def post_request(page_number):
    # url and data are defined elsewhere; "fetchPageNumber" is the API's pagination field
    response = requests.post(url, data={**data, "fetchPageNumber": page_number})
    # logic to convert response.text to a pandas df and write to an RDBMS

with concurrent.futures.ThreadPoolExecutor(10) as executor:
    parallel_response = executor.map(post_request, range(1, 11))
Please let me know what is the best way to get the paginated data in parallel.
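For comparison, here is a minimal sketch of the aiohttp approach mentioned above; the endpoint URL and the fetchPageNumber payload field are assumptions carried over from the snippet, not a known API.
import asyncio
import aiohttp

URL = "https://example.com/api/records"  # placeholder endpoint

async def fetch_page(session, page_number):
    # One POST per page; "fetchPageNumber" mirrors the field in the question
    async with session.post(URL, json={"fetchPageNumber": page_number}) as resp:
        resp.raise_for_status()
        return await resp.json()

async def fetch_all(pages):
    async with aiohttp.ClientSession() as session:
        # All requests run concurrently on a single thread via the event loop
        return await asyncio.gather(*(fetch_page(session, p) for p in pages))

results = asyncio.run(fetch_all(range(1, 11)))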
In MTurk, we want to show a set of tweets, e.g., 20 tweets, in boxes on one page. Workers then click JUST on the tweets (boxes) that are relevant to a specific concept like "entrepreneurship". For example, for 3 tweets:
Tweet 1: Money and supporting customers are essential for a business
Tweet 2: I like tennis
Tweet 3: I spent my investment on buying my home.
Tweets should be shown in boxes, and Workers should just click on Tweet 1 (instead of clicking Yes or No buttons); MTurk should then return the results in a file (e.g., CSV format) in this way:
Yes (or 1)
No (or 0)
No (or 0)
We want to show multiple tweets (multiple HITs) on one page.
How can we write code so that, for a batch of tweets, MTurk reads 20 tweets from the batch and puts them in their place for the Workers?
Is there such a design? If yes, would you please guide me on how I can do it? With many thanks. Jafar
All 20 tweets would need to be part of the same HIT. There's no way to create one HIT per tweet and then have Mechanical Turk display 20 of them as one task for a worker to complete.
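As an illustration only (not an official MTurk template), one combined HIT could render all tweets as checkboxes inside a single HTMLQuestion. The sketch below uses boto3's create_hit against the sandbox; the tweet texts, reward, and form details are placeholders, and a real HIT would also need the usual snippet of JavaScript to copy assignmentId from the URL into the form.
import boto3

# Sandbox endpoint for testing; drop endpoint_url for production
mturk = boto3.client(
    "mturk",
    endpoint_url="https://mturk-requester-sandbox.us-east-1.amazonaws.com")

tweets = [
    "Money and supporting customers are essential for a business",
    "I like tennis",
    "I spent my investment on buying my home",
]  # ... up to 20 tweets per page

# One checkbox per tweet; checked boxes come back as 1 in the results CSV
boxes = "\n".join(
    '<label><input type="checkbox" name="tweet_{0}" value="1"> {1}</label><br>'
    .format(i, t) for i, t in enumerate(tweets))

question = """<HTMLQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2011-11-11/HTMLQuestion.xsd">
<HTMLContent><![CDATA[
<!DOCTYPE html>
<html><body>
<form name="mturk_form" method="post"
      action="https://workersandbox.mturk.com/mturk/externalSubmit">
  <input type="hidden" id="assignmentId" name="assignmentId" value="">
  <p>Click every tweet relevant to "entrepreneurship":</p>
  {0}
  <input type="submit">
</form>
</body></html>
]]></HTMLContent>
<FrameHeight>600</FrameHeight>
</HTMLQuestion>""".format(boxes)

mturk.create_hit(
    Title="Select tweets relevant to entrepreneurship",
    Description="Click only the tweets related to the concept",
    Reward="0.05",
    MaxAssignments=3,
    LifetimeInSeconds=86400,
    AssignmentDurationInSeconds=600,
    Question=question)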
I am using pyspark and Flask for an interactive Spark-as-a-service application.
My application receives requests with some parameters and returns a response. My code is here:
# first I make a udf function
def dict_list(x, y):
    return dict(zip(map(str, x), map(str, y)))

dict_list_udf = F.udf(lambda x, y: dict_list(x, y),
                      types.MapType(types.StringType(), types.StringType()))

# then I read my table from cassandra
df2 = spark.read \
    .format("org.apache.spark.sql.cassandra") \
    .options(table="property_change", keyspace="strat_keyspace_cassandra_raw2") \
    .load()
@app.route("/test/<serviceMatch>/<matchPattern>")
def getNodeEntries1(serviceMatch, matchPattern):
    result_df = df2.filter(df2.id.like(matchPattern + "%") & (df2.property_name == serviceMatch)) \
        .groupBy("property_name") \
        .agg(F.collect_list("time").alias('time'), F.collect_list("value").alias('value'))
    return json.dumps(result_df.withColumn('values', dict_list_udf(result_df.time, result_df.value))
                      .select('values').take(1))
When I start my server (using spark-submit) and send a request with Postman, the first response takes about 13 seconds, and after that every response takes approximately 3 seconds. Serving users with an initial delay of 13 seconds is not acceptable. I am a new Spark user and I am assuming that this behaviour is due to Spark's nature, but I do not know what exactly is causing it. Maybe something about caching or compiling an execution plan, as with SQL queries. Is there any chance that I could solve this problem? PS: I am a new user, so sorry if my question is not clear enough.
Such a delay is fully expected. Skipping over the simple fact that Spark is not designed to be embedded directly in an interactive application (nor is it suitable for real-time queries), there is simply significant overhead from:
Initializing context.
Acquiring resources from the cluster manager.
Fetching metadata from Cassandra.
The question is whether it makes sense to use Spark here at all: if you need close to real-time responses and you collect full results to the driver, the native Cassandra connector would be a much better choice.
However, if you plan to execute logic that Cassandra itself does not support, then all you can do is accept the cost of this indirect architecture.
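For reference, a hedged sketch of that native-driver alternative with the DataStax Python driver; the table and keyspace names come from the question, while the contact point, the schema assumption (hence ALLOW FILTERING), and the client-side prefix filter are placeholders.
import json
from cassandra.cluster import Cluster
from flask import Flask

app = Flask(__name__)

# Connect once at startup; contact point is a placeholder
cluster = Cluster(["127.0.0.1"])
session = cluster.connect("strat_keyspace_cassandra_raw2")

# Prepared once, reused on every request
query = session.prepare(
    "SELECT id, time, value FROM property_change "
    "WHERE property_name = ? ALLOW FILTERING")

@app.route("/test/<serviceMatch>/<matchPattern>")
def getNodeEntries1(serviceMatch, matchPattern):
    rows = session.execute(query, (serviceMatch,))
    # Prefix match done client-side here; a schema with a suitable
    # partition/clustering key would push this into the query itself
    return json.dumps({str(r.time): str(r.value)
                       for r in rows if str(r.id).startswith(matchPattern)})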
I have to make 100000 sequential HTTP requests with Spark and store the responses in S3. I said sequential because each request returns around 50KB of data, and I have to wait 1 second between requests in order to not exceed API rate limits.
Where should I make the HTTP calls: from the Spark job's code (executed on the driver/master node) or from a dataset transformation (executed on the worker nodes)?
Workarounds
Make the HTTP requests from my Spark job (on the driver/master node), create a dataset from each HTTP response (each contains 5000 JSON items), and save each dataset to S3 with Spark's help. You do not need to keep a dataset after you have saved it.
Create a dataset from all 100000 URLs (moving all further computation to the workers), make the HTTP requests inside map or mapPartitions, and save a single dataset to S3.
The first option
It's simpler and it reflects the nature of my computations: they're sequential because of the 1-second delay. But:
Is it bad to make 100_000 HTTP calls from the driver/master node?
Is it more efficient to create/save one 100_000 * 5_000 dataset than to create/save 100_000 small datasets of size 5_000?
Each time I create a dataset from an HTTP response, I'll move the response to a worker and then save it to S3, right? Double shuffling then...
Second option
Actually it won't benefit from parallel processing, since you have to keep an interval of 1 second between requests. The only bonus is moving the computations (even if they aren't too heavy) off the driver. But:
Is it worth moving the computations to the workers?
Is it a good idea to make an API call inside a transformation?
Saving a file <32MB (or whatever fs.s3a.block.size is) to S3 is ~2 GETs, 1 LIST and a PUT; you get billed a small amount by AWS for each of these calls, plus storage costs.
For larger files: a POST to initiate the multipart upload after that first block, one POST per 32MB block, and a final POST of a JSON file to complete it. So: slightly more efficient.
Where small S3 file sizes matter is in the AWS bills and in follow-up Spark queries: for anything you use in Spark, PySpark, SQL, etc., many small files are slower. There's a high cost to listing files in S3, and every task pushed out to a Spark worker has some setup/commit/complete cost.
Regarding doing HTTP API calls inside a worker: well, you can do fun things there. If the result isn't reproducible then task failures & retries can give bad answers, but for a GET it should be OK. What is hard is throttling the work; I'll leave you to come up with a strategy there.
Here's an example of uploading files to S3 or another object store from workers: first an RDD of the copy src/dest operations is built up, then they are pushed out to the workers. The result of the worker code includes upload duration info, in case anyone ever wanted to try to aggregate the stats (though there you'd probably need a timestamp for a time-series view).
Given that you have to serialize the work to one request/second, 100K requests are going to take over a day. If each request takes <1 second, you may as well run it on a single machine. What's important is to save the work incrementally so that if your job fails partway through you can restart from the last checkpoint. I'd personally focus on that problem: how you could do this operation so that every 15-20 minutes of work is saved, and on a restart you can carry on from there.
Spark does not handle recovery of a failed job, only task failures. Lose the driver and you get to restart your last query. Break things up.
Something which comes to mind could be (a sketch of this follows after the throttling notes below):
* a first RDD takes the list of queries and some summary info about any existing checkpointed data, and calculates the next 15 minutes of work,
* build up a list of GET calls to delegate to 1+ workers: either 1 URL/row, or multiple URLs in a single row,
* run that job, save the results,
* test that recovery works with a smaller window and by killing things,
* once happy: do the full run.
Maybe also: recognise & react to any throttle events coming off the far end by
1. sleeping in the worker,
2. returning a count of throttle events in the results, so that the driver can collect aggregate stats and maybe later tune the sleep window for subsequent tasks.
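A rough sketch of that incremental loop, under stated assumptions: spark is an existing SparkSession, all_urls is the full list of 100K URLs, the S3 paths are placeholders, and rate limiting is a plain sleep inside the worker.
import time
import requests

BATCH = 900  # roughly 15 minutes of work at 1 request/second

def fetch(url):
    resp = requests.get(url)
    time.sleep(1.0)  # crude throttle: stay at ~1 request/second
    return (url, resp.status_code, resp.text)

# Rebuild the set of already-fetched URLs from earlier checkpoints.
# On the very first run this path won't exist yet; guard accordingly.
done = set(spark.read.text("s3a://bucket/checkpoints/*").rdd
           .map(lambda r: r.value.split("\t", 1)[0]).collect())
pending = [u for u in all_urls if u not in done]

for i in range(0, len(pending), BATCH):
    # numSlices=1 keeps the batch on one worker, so requests stay sequential
    chunk = spark.sparkContext.parallelize(pending[i:i + BATCH], numSlices=1)
    chunk.map(fetch) \
         .map(lambda t: "%s\t%d\t%s" % t) \
         .saveAsTextFile("s3a://bucket/checkpoints/batch-%05d" % i)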
Does anyone know how Spark computes its number of records (I think it is the same as the number of events in a batch), as displayed here?
I'm trying to figure out how I can get this value remotely (a REST API does not exist for the Streaming tab of the UI).
Basically what I'm trying to do is get the total number of records processed by my application. I need this information for a web portal.
I tried to count the records for each stage, but it gave me a completely different number than the one in the picture above. Each stage contains the information about its records, as shown here.
I'm using this short Python script to count the "inputRecords" from each stage. This is the source code:
import json, urllib

print "Get stages script started!"
# URL of the stages REST API
url = 'http://10.16.31.211:4040/api/v1/applications/app-20161104125052-0052/stages/'
response = urllib.urlopen(url)
data = json.loads(response.read())
stages = []
print len(data)
inputCounter = 0
for item in data:
    stages.append(item["stageId"])
    inputCounter += item["inputRecords"]
print "Records processed: " + str(inputCounter)
If I understood it correctly: each Batch has one Job, each Job has multiple Stages, and these Stages have multiple Tasks.
So for me it made sense to count the input for each Stage.
Spark offers a metrics endpoint on the driver:
<driver-host>:<ui-port>/metrics/json
A Spark Streaming application will report all metrics available in the UI, and some more. The ones you are probably looking for are:
<driver-id>.driver.<job-id>.StreamingMetrics.streaming.totalProcessedRecords: {
value: 48574640
},
<driver-id>.driver.<job-id>.StreamingMetrics.streaming.totalReceivedRecords: {
value: 48574640
}
This endpoint can be customized. See Spark Metrics for info.
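For example, a small script to poll that endpoint and pull out the streaming totals; the host and UI port are taken from the question's setup, and the gauge-name suffixes follow the pattern shown above.
import requests

# Driver host and UI port from the question's REST-API URL
METRICS_URL = "http://10.16.31.211:4040/metrics/json"

data = requests.get(METRICS_URL).json()
for name, gauge in data.get("gauges", {}).items():
    # Gauge names end with ...StreamingMetrics.streaming.<metric>
    if name.endswith("streaming.totalProcessedRecords") or \
       name.endswith("streaming.totalReceivedRecords"):
        print(name, gauge["value"])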