I'm trying to update a column in Cassandra tables for 40M+ records through REST APIs, since the update involves a few validations. I'm using PySpark to fire multiple parallel requests. But the REST APIs require an auth token that has an expiry time; it expires mid-run, and I'm not able to refresh it while the job is in progress.
Is there a way in PySpark to share a mutable string between executors and update its value when required?
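For context: Spark broadcast variables are read-only, so PySpark has no built-in shared mutable string. One common workaround is to keep a token per executor and refresh it when the API rejects a request. A minimal sketch, where `fetch_token` and `call_rest_api` are hypothetical stand-ins for the real auth and update calls:

```python
# Sketch: per-partition token handling instead of a shared mutable string.
# `fetch_token` and `call_rest_api` are hypothetical stand-ins for the
# real auth endpoint and REST update call.
def update_partition(rows, fetch_token, call_rest_api):
    token = fetch_token()                  # one token per partition/executor
    for row in rows:
        status = call_rest_api(row, token)
        if status == 401:                  # token expired mid-partition
            token = fetch_token()          # refresh and retry once
            status = call_rest_api(row, token)
        yield (row, status)

# Inside the Spark job (assuming `rdd` holds the records to update):
# rdd.mapPartitions(lambda rows: update_partition(rows, fetch_token, call_rest_api))
```

Each executor refreshes independently, so no cross-executor coordination is needed.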
I have an API that fetches airplane schedules from third-party databases. When the frontend shows the data the API fetched, should the application take the data from the local database or from the third party?
I consider this data dynamic in nature and pretty critical too (airplane schedules).
I am also assuming that you are aggregating this data from a number of providers and have transformed it into a generic structure (a common format across all providers).
In my opinion, you should save it into a local database, with a timestamp indicating when the data was last refreshed.
Ideally you should display the last-refreshed info against each provider (or airline) on your site. You could also run a scheduler to refresh the data at regular intervals.
It would be nice to show that the next refresh is in "nn" minutes (with a countdown).
If you can afford to, you can let the user refresh the data, but this is risky if the number of concurrent users is considerable.
This is only my opinion.
If the API data/record is not subject to change, then saving it to a local database can be a good idea.
Users will fetch the data from the local database, and to keep that database current you can create another program (running on the server) that fetches updates from the API. This way only a limited number of connections hit the API.
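The refresher program described above can be sketched as follows. This is a minimal example assuming SQLite as the local database and a hypothetical `fetch_schedules` helper standing in for the third-party API call:

```python
import sqlite3
import time

# Sketch of a server-side refresher: replaces each provider's rows and
# stamps them with the refresh time shown to users as "last refreshed".
# `fetch_schedules` is a hypothetical stand-in returning (flight, departs)
# pairs from the third-party API.
def refresh(conn, fetch_schedules, providers):
    conn.execute("""CREATE TABLE IF NOT EXISTS schedules
                    (provider TEXT, flight TEXT, departs TEXT,
                     refreshed_at REAL)""")
    for provider in providers:
        rows = fetch_schedules(provider)
        now = time.time()
        conn.execute("DELETE FROM schedules WHERE provider = ?", (provider,))
        conn.executemany(
            "INSERT INTO schedules VALUES (?, ?, ?, ?)",
            [(provider, flight, departs, now) for flight, departs in rows])
    conn.commit()
```

Run `refresh` from a scheduler (cron, APScheduler, etc.) at regular intervals; the frontend then reads only from the `schedules` table.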
I am currently running into an issue with concurrent requests in NodeJS, with access points to a cookie that holds information I obtain from a server. The requests being made are asynchronous and need to remain that way, but I am in charge of asking for new data when the cookie is about to become stale. How do I keep updating the cookie without bogging the server down with requests for a new one, when multiple concurrent requests all assume that they are the ones in charge of refreshing the cookie's value?
I.e. Req1 through Req30 are fired off. While handling Req17, the cookie's time-to-live check catches that it is stale, so the refresh command is sent out. The problem is that Req18 through Req30 all assume they should be the ones to refresh the cookie's value, because they also run the staleness check and fail it.
I have limited ability to change the server-side code, and due to the sensitive nature of the data I cannot readily place it in a DB, because at that point I become responsible for securing the data again.
Should I store multiple key/value pairs in the cookie and iterate through them? This could become an expensive operation. It could also overwrite the cookie with invalid data on some request, since updating the cookie to append new key/value pairs requires creating a new one, cookies themselves being immutable.
To handle concurrent access to the cookie:
Use a timestamp; only perform the change if the data is more recent.
To handle cookie data renewal:
Instead of having all workers check for new data concurrently, designate one specific worker to handle the data update, while the other workers use the data in read-only mode.
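The single-refresher idea can be sketched as follows (in Python for brevity; the same pattern applies in Node). Only one caller at a time performs the refresh, while everyone else reads the current value; the TTL and margin values are assumptions to adjust for the real cookie lifetime:

```python
import threading
import time

class TokenCache:
    """Single-refresher pattern: only one thread renews the value at a
    time; all other callers take the fast read-only path."""
    def __init__(self, fetch, ttl=3600.0, refresh_margin=60.0):
        self._fetch = fetch            # callable returning a fresh value
        self._ttl = ttl                # assumed lifetime in seconds
        self._margin = refresh_margin  # renew this long before expiry
        self._lock = threading.Lock()
        self._value = None
        self._expires_at = 0.0

    def get(self):
        if time.time() < self._expires_at - self._margin:
            return self._value         # fast path: no lock, read-only
        with self._lock:
            # Re-check: another thread may have refreshed while we waited.
            if time.time() >= self._expires_at - self._margin:
                self._value = self._fetch()
                self._expires_at = time.time() + self._ttl
            return self._value
```

The re-check after acquiring the lock is what stops Req18 through Req30 from all refreshing: they block briefly, see the fresh value, and return it.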
We have a scenario where multiple clients would like to get updates from DocumentDB inserts, but they are not online all the time.
Example: suppose there are three clients registered with the system, but only one is online at present. When the online client inserts/updates a document, we want each offline client, when it wakes up, to look at the change feed and update itself independently.
Now, is there a way for each client to maintain its own position in the feed for the same partition (where it last synced) and get the changes when it comes online, based on the last sync?
When using the change feed, you use a continuation token per partition. Change feed continuation tokens do not expire, so you can continue from any point. Each client can keep its own continuation token and read changes whenever it wakes up; this essentially means that each client can keep its own feed for each partition.
In IBM-Graph, in order to avoid excessively long authorization for each request, we request a session token first and send it along in the headers of any subsequent requests, exactly as explained in the documentation.
In order to persist this single token across our application's cluster, we are currently storing the active IBM-Graph session token in memcached. Each node of our cluster pulls this token out prior to every request to our graph.
Having monitored this key, it appears not to have changed/expired since we made our first request a couple of days ago. I therefore have some questions:
How long do these session tokens last for?
Is our current method of distributing this single key even required?
Is there a better method?
It would be nice to remove the need to hit memcached for every request altogether. Knowing how long the tokens last would help us devise a more elegant solution than constantly hammering a single small memcached instance.
How long do these session tokens last for?
IBM Graph tokens are intended to last for a long while - you should expect somewhere around a day, though it's subject to change. It shouldn't ever be shorter than an hour.
Is our current method of distributing this single key even required?
No, not really. I'd write some code to automatically acquire a new token on HTTP 403 (i.e., at boot time and whenever a token expires) and use it locally. There's no limit to the number of tokens you can have active at one time.
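That refresh-on-403 suggestion can be sketched as follows. This is a minimal per-process version where `acquire` and `send` are injected stand-ins for the real HTTP calls (the session-token request and the actual graph request), so no shared memcached key is needed:

```python
class SessionTokenClient:
    """Sketch of per-process token handling: acquire a token lazily and
    re-acquire on HTTP 403, instead of sharing one token via memcached.
    `acquire` and `send` are hypothetical stand-ins for the real HTTP
    calls: acquire() -> token, send(token, request) -> (status, body)."""
    def __init__(self, acquire, send):
        self._acquire = acquire
        self._send = send
        self._token = None

    def request(self, req):
        if self._token is None:
            self._token = self._acquire()      # boot-time acquisition
        status, body = self._send(self._token, req)
        if status == 403:                      # token expired
            self._token = self._acquire()      # refresh and retry once
            status, body = self._send(self._token, req)
        return status, body
```

Each process in the cluster holds its own token, which is fine given that there is no limit on concurrently active tokens.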
I'm creating a Flask web site that uses an API to retrieve data.
This API has basic authentication with tokens, and each token is valid for X hours.
I'll probably run this app behind nginx + uWSGI, and the configuration will be something like this:
[uwsgi]
# Some other config....
master = true
processes = 2
enable-threads = true
threads = 4
So I'm trying to figure out the best way to maintain an updated auth token for my processes and their threads.
A common solution is to use a separate script that updates memcached or some Consul-based store and to retrieve the data from there, but that seems like overkill for this specific task...
Is there a nice way in Flask to run a background thread that updates this token?
(Just to be clear, it's OK if the same server ends up with a couple of auth tokens, e.g. one for each running process....)
Save the token to a DB along with its creation time; every now and then check how long it's been and ask for a new API token if that time has expired. If you are using multiple tokens, then record in the DB which token is for what.
If you don't want to use a DB, you could write the token and timestamp to a file instead.