Getstream.io Import follow relations

We have a video portal where users can follow each other and get updates via email when a followed user uploads a new video.
We have about 215,756 follower relations in our database.
I tried to run a cron job on our server to migrate the followers to getstream, but it takes too long and sometimes gives connection timeouts.
Is there another way to migrate our relations to the getstream database, for example by uploading a JSON file somewhere, or anything like that?

Getstream-io provides batch import of data in two ways:
Batch operations
First, you can use batch operations such as batch follow and batch activity add. These operations run significantly faster than standard follow and add-activity calls.
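For illustration, a batch follow with the stream-python client might look roughly like this (the feed names and chunk size are assumptions; check the getstream docs for the actual per-request limit):

# A minimal sketch using stream-python's batch follow; feed names are
# hypothetical placeholders.
import stream

client = stream.connect("API_KEY", "API_SECRET")

# Each entry makes `source` follow `target`
follows = [
    {"source": "timeline:42", "target": "user:7"},
    {"source": "timeline:43", "target": "user:7"},
]

# Send the relations in chunks rather than one follow() call per relation;
# the maximum batch size is documented in the getstream API docs.
CHUNK = 1000
for i in range(0, len(follows), CHUNK):
    client.follow_many(follows[i:i + CHUNK])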
Import
Second, you can send us a data dump (preferred format: JSON), which we will then import into your app. Read more about it on this docs page.

Related

How to improve performance on the backend when data is fetched from multiple APIs in a sequential manner?

I am creating a Nodejs app that consumes the APIs of multiple servers in a sequential manner, as the next request depends on the results of previous requests.
For instance, user registration is done on our platform in a PostgreSQL database. User feeds, chats, and posts are stored on getStream servers. User roles and permissions are managed through a CMS. If on a page we want to display a list of a user's followers with some buttons according to the user's permissions, then first I need to get the list of the current user's followers from getStream, then enrich them from my PostgreSQL DB, then fetch their permissions from the CMS. Since one request has to wait for another, it takes a long time to respond.
I need to serve all that data in a certain format. I have used Promise.all() where requests did not depend on each other.
I thought of storing pre-processed data that is ready to be served, but I am not sure how to do that. What is the best way to solve this problem?
sequential manner as the next request depends on results from previous requests
You could try using async/await so that each request runs in a sequential manner.
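For illustration, here is a minimal asyncio sketch of that idea in Python (the question is about Node.js, but the pattern is the same; the helper coroutines are hypothetical stand-ins for the getStream, PostgreSQL, and CMS calls). Note that the two enrichment calls depend only on the follower list, so they can still run concurrently:

# Hypothetical helpers standing in for the getStream, PostgreSQL and CMS calls
import asyncio

async def get_followers(user_id): ...
async def enrich_from_db(followers): ...
async def fetch_permissions(followers): ...

async def follower_page(user_id):
    # This call must run first: everything depends on its result
    followers = await get_followers(user_id)
    # These two depend only on the follower list, so run them concurrently
    enriched, perms = await asyncio.gather(
        enrich_from_db(followers),
        fetch_permissions(followers),
    )
    return enriched, perms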

Firebase Cloud Express queue for storage resource to be generated

I have a large dataset stored in a Firestore collection and a Nodejs express app (exposed as a firebase functions.https.onRequest) with an endpoint which allows users to query this dataset and download large amounts of data.
I need to return the data in CSV format from the endpoint. Because there is a lot of data, I want to avoid doing large database reads each time the endpoint is hit.
My current endpoint does this (a rough code sketch follows the list):
User hits the endpoint with a database query requesting documents within a range
Query is hashed into a filename, e.g. query_"startRange"_"endRange".csv
Check Firebase storage to see if this query has been run before
if the csv already exists:
return a 302 redirect to the csv file with a signed url
if the csv doesn't exist:
Run the query on the Firestore collection
Transform the data into the appropriate CSV format
upload the new CSV to Firebase storage
return a 302 redirect to the newly generated csv file with a signed url
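A minimal sketch of that flow, assuming a Flask-style handler and the google-cloud-storage client (the bucket name and the query/CSV helper are hypothetical placeholders):

# Sketch of the cache-or-generate flow; run_query_and_build_csv is a
# placeholder for the Firestore read + CSV transform.
from datetime import timedelta
from flask import redirect
from google.cloud import storage

bucket = storage.Client().bucket("csv-exports")  # hypothetical bucket

def handle_query(start, end):
    blob = bucket.blob(f"query_{start}_{end}.csv")
    if not blob.exists():
        # Cache miss: run the Firestore query and build the CSV (placeholder)
        csv_data = run_query_and_build_csv(start, end)
        blob.upload_from_string(csv_data, content_type="text/csv")
    # Either way, redirect to a short-lived signed URL for the file
    url = blob.generate_signed_url(version="v4", expiration=timedelta(hours=1))
    return redirect(url, code=302)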
This process is currently working really well, except I can already foresee an issue. The CSV generation stage takes roughly 20s for large queries, and there is a high possibility of the same request being hit by multiple users at the same time.
I want to build in some sort of queuing system so that if X users hit the endpoint at once, only the first request triggers the generation of the new CSV, and the other (X-1) requests are queued and then resolved once the CSV is generated.
I have looked into firebase-queue, which appears to be deprecated and not intended to be used with Cloud Functions.
I have also seen other libraries like p-queue, but I'm not sure I understand how that would work with Firebase Cloud Functions, given how separate instances are booted for many requests.
I think that in your scenario the queue approach wouldn't work well with Cloud Functions. The queue cannot be implemented in a function, as multiple instances won't know about each other, so the queue would need to be implemented on some kind of dedicated server, which IMO defeats the purpose of using Cloud Functions, since both the queue and the processing could then run on the same server.
I would suggest having a collection in Firestore that keeps track of the queries that have been requested. That way, even if the CSV file isn't yet saved in Storage, you can check whether some function instance is already creating it, then sleep the function until the operation completes and return the signed URL. Overall the algorithm might look somewhat like this:
# Python PseudoCode
if csv_in_storage():
    return signed_url()
if query_in_firestore():
    while True:
        sleep(X)
        if csv_in_storage():
            return signed_url()
try:
    add_query_firestore()
    csv = create_csv()
    upload_csv(csv)
    return signed_url()
except Exception:
    while True:
        sleep(X)
        if csv_in_storage():
            return signed_url()
The final try/except is there because the add_query_firestore operation may fail if two functions make simultaneous attempts to write the same document to Firestore. Nonetheless, that is also good news, since you then know the CSV creation is in progress and you can wait for it to complete.
Please keep in mind the pseudocode above is just to illustrate the idea; leaving the while True loops unbounded may lead to an infinite loop and a function timeout, which is plain bad :).
I ended up solving this with a solution similar to what Happy-Monad suggested. I'm using the Node.js admin SDK, but the idea is similar.
There is a collection in Firestore, Queries, which keeps track of executed queries. When a user hits the endpoint, I call the admin doc("Queries/<queryId>").create() method. This method only creates the query doc if it doesn't already exist, so I avoid the race conditions between parallel requests that would occur if I checked for existing queries first.
Next, the request starts an onSnapshot listener on the query doc it attempted to create. The query has a status field which starts as created. The onSnapshot only resolves once that status has changed to complete.
I have an onCreate database trigger listening to "Queries/*". This trigger handles the requested query and updates the query status to complete. If the query already exists, the status is already complete, so the onSnapshot resolves instantly.
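For illustration, the same create-then-listen idea sketched with the Python Firestore client (the answer above uses the Node.js admin SDK; the document path and timeout here are assumptions):

# Sketch: win-or-wait on a Queries/<queryId> doc using create() + on_snapshot
import threading
from google.api_core.exceptions import AlreadyExists
from google.cloud import firestore

db = firestore.Client()

def wait_for_csv(query_id):
    doc_ref = db.document(f"Queries/{query_id}")
    try:
        # create() fails if the document exists, so exactly one request
        # "wins" and triggers the onCreate function that builds the CSV
        doc_ref.create({"status": "created"})
    except AlreadyExists:
        pass  # another instance is already (or was) handling this query

    done = threading.Event()

    def on_change(snapshots, changes, read_time):
        for snap in snapshots:
            if (snap.to_dict() or {}).get("status") == "complete":
                done.set()

    watch = doc_ref.on_snapshot(on_change)
    done.wait(timeout=540)  # stay under the Cloud Function deadline
    watch.unsubscribe()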

CQRS pattern questions

I'm learning the CQRS pattern, as we are going to use it on our new project, and I have a few questions so far.
Example task: I'll have a cron command to fetch information from different providers (different APIs), and the responsibilities of this cron command are:
fetch data from all providers;
make additional API calls to get images and videos;
process those videos and images (store them to AWS S3) and write them to the uploads table in the DB;
fetch existing data from the DB;
transform new API data into system entities, update existing entities, and delete nonexistent ones;
persist to the DB.
CQRS-related questions:
Can I have a few CQRS commands and queries inside one system request? In the example above I need to get existing data from the DB (a query), persist data (a command), and so on.
What about the logic of fetching data from APIs: can I consider it a CQRS query, since it is a process of getting data, or is a CQRS query only the process of getting data from internal storage, not from an external API?
What about the process of storing videos to S3 and storing information in the uploads table: can I consider storing assets to S3 a CQRS command, where this command returns the data I need to store later in uploads? I do not want to store it immediately, as the upload entity is part of an aggregate whose main-info entity is the root. I know a command should return nothing or an entity ID, but here it would return all the data about the stored assets.
If all the assumptions above are true, I could make:
a query to fetch API data
a query to get existing data
a command to process images/videos
a command to insert/update/delete data
Don't judge me too strictly; I'm in the process of learning the concepts of DDD and related patterns, and I'm just asking about what isn't clear to me. Thank you very much.
Can I have a few CQRS commands and queries inside one system request? In the example above I need to get existing data from the DB (a query), persist data (a command), and so on.
No, you cannot. Each request is either one command or one query.
What about the logic of fetching data from APIs: can I consider it a CQRS query, or is a CQRS query only the process of getting data from internal storage, not from an external API?
Commands and queries refer to the local database. Fetching data from external services through a remote API is an integration with another BC (see the DDD context mapping patterns).
What about the process of storing videos to S3 and storing information in the uploads table: can I consider storing assets to S3 a CQRS command that returns the data I need to store later in uploads?
Storing videos to S3 is not a command; it is an integration with an external service. You will have to integrate (again, context mapping patterns).
I do not want to store it immediately, as the upload entity is part of an aggregate whose main-info entity is the root.
I don't know your domain model, but if uploads is a child entity in an aggregate, then storing things in your uploads table isn't a command either. A command refers to the aggregate; storing info in the uploads table would be part of a command.
AS A CONCLUSION:
A command or a query is a transactional operation at the application layer boundary (an application service). They deal with data from your DB. Each command/query is one transaction.
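For illustration, a minimal Python sketch of that boundary, with hypothetical names: one command handler wrapping a single transaction, and one read-only query method:

# Sketch only: `db` is a hypothetical persistence abstraction
from dataclasses import dataclass

@dataclass
class UpdateProviderData:   # command: mutates the local DB, returns nothing
    provider_id: str
    entities: list

class ProviderAppService:
    def __init__(self, db):
        self.db = db

    def handle(self, cmd: UpdateProviderData) -> None:
        with self.db.transaction():   # one command == one transaction
            self.db.upsert(cmd.provider_id, cmd.entities)

    def existing_entities(self, provider_id: str) -> list:   # query: reads only
        return self.db.fetch(provider_id)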

Automate extraction of events using a REST API endpoint, either frequently or in frequent batches

I have an API endpoint for an event store which I can query with a GET request to
receive a feed of events in ndjson format. I need to automate the collection of these events and store them in a database. As these events are in a nested JSON structure, and some of them have a complex structure, I was thinking of storing them in a document database. Can you please help me with the options I have for capturing and storing these events, with respect to the Python libraries/frameworks I could use? To explore the events I was able to use the requests library and fetch them. I also tried asyncio and aiohttp to get these events asynchronously, but that ran slower than the requests version. Can we create a pipeline to pull these events from the endpoint at frequent intervals?
Also, some of these nested JSON keys have dots, and MongoDB is not allowing me to store them. I tried CosmosDB as well and it worked fine (the only issue there was that if the JSON has a key "ID" it has to be unique; as these JSON feeds have an ID key which is not unique, I had to rename the dict key before storing into CosmosDB).
Thanks,
Srikanth
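For illustration, a minimal sketch of the ingestion loop described above: streaming the ndjson feed with requests and replacing dotted keys before inserting into a document store (the URL and the insert call are hypothetical placeholders):

# Stream the ndjson feed line by line and sanitize keys for the target store
import json
import requests

def sanitize_keys(obj):
    # Recursively replace dots in keys, which some document stores reject
    if isinstance(obj, dict):
        return {k.replace(".", "_"): sanitize_keys(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [sanitize_keys(v) for v in obj]
    return obj

def fetch_events(url):
    with requests.get(url, stream=True, timeout=30) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if line:  # skip keep-alive blank lines
                yield sanitize_keys(json.loads(line))

# e.g. run from cron or a scheduler at frequent intervals:
# for event in fetch_events("https://example.com/event-feed"):
#     collection.insert_one(event)   # pymongo-style insert (placeholder)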

Spark Machine learning design model from web application

I have developed a web application where the user can choose a machine learning framework, a number of iterations, and some other tuning parameters. How can I invoke a Spark job from the user interface, passing all the inputs, and display the response to the user? Depending on the framework (dl4j / Spark MLlib / H2O), the user can either upload an input CSV or the data can be read from Cassandra.
How can I call the Spark job from the user interface?
How can I display the result back to the user?
Please help.
You can take a look at this github repository.
In it, as soon as a GET request arrives, the app pulls the data out of Cassandra, collects it, and sends it back as the response.
So in your case:
As soon as you receive a POST request, you can take the parameters from the request, perform the operations accordingly using those parameters, collect the result on the master, and then send it back to the user as the response.
P.S.: Collecting on the master is a bit tricky, and a lot of data can cause an OOM. What you can do instead is save the results to Hadoop and send back a URL to the results, or something like that.
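For illustration, one way this flow might look, sketched as a Flask endpoint that launches the job through spark-submit (the job script path and parameter names are hypothetical; a production setup might instead use something like the Livy REST API):

# Sketch: forward user-chosen parameters to a Spark job via spark-submit
import subprocess
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/train", methods=["POST"])
def train():
    params = request.get_json()
    cmd = [
        "spark-submit", "/jobs/train_model.py",   # hypothetical job script
        "--framework", params["framework"],        # e.g. mllib / dl4j / h2o
        "--iterations", str(params["iterations"]),
    ]
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        return jsonify({"error": result.stderr}), 500
    # Rather than collecting large results here, the job could write them to
    # HDFS/S3 and print a URL, which we forward to the user:
    return jsonify({"result_url": result.stdout.strip()})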
For more info, look into this blog related to that github repository:
https://blog.knoldus.com/2016/10/12/cassandra-with-spark/
