Using GTFS data, how should I extend it with GTFS-realtime? - node.js

I am building an application using GTFS data. I am a bit confused when it comes to GTFS-realtime.
I have stored all the GTFS information in a database (MongoDB), and I am able to retrieve the stop times of a specific bus stop.
Now I want to integrate GTFS-realtime information with it. What is the best way to deal with the information retrieved? I am using gtfs-realtime-bindings (Node.js library) by Google.
I have the following idea:
Store the GTFS-realtime information in a separate database and query it after getting the stop times from GTFS. I can update that database periodically to make sure the realtime info is up to date.
Also, I know the retrieved data is in Protocol Buffers binary format. Should I store it as ASCII, or is there a better way to deal with it?
I couldn't find much information about how to deal with the realtime data, so I hope someone can give me a direction on what to do next.
Thanks!

In your case GTFS-realtime can be treated as "ephemeral" data, and I would go with an object in memory, with the stop_id/route_id as keys.
For every request:
Check whether the realtime object contains the ID. If it does, present the realtime data; otherwise, load the scheduled data from the database.
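A minimal sketch of that lookup, assuming the feed has already been fetched and decoded (for example with gtfs-realtime-bindings) into plain objects whose trip updates use the camelCase fields `tripUpdate`, `stopTimeUpdate`, `stopId` and `arrival.delay`. The helper names `refreshRealtime` and `getStopTimes`, and the `loadScheduled` callback standing in for the Mongo query, are hypothetical.

```javascript
// In-memory realtime cache keyed by stop_id. It is replaced wholesale on
// every feed refresh, so stale entries never linger.
let realtimeByStop = new Map();

// Rebuild the cache from an already-decoded GTFS-realtime feed message.
function refreshRealtime(feed) {
  const next = new Map();
  for (const entity of feed.entity || []) {
    const update = entity.tripUpdate;
    if (!update) continue; // skip vehicle positions and alerts
    for (const stu of update.stopTimeUpdate || []) {
      const list = next.get(stu.stopId) || [];
      list.push({
        tripId: update.trip ? update.trip.tripId : null,
        delaySeconds: stu.arrival ? stu.arrival.delay : null,
      });
      next.set(stu.stopId, list);
    }
  }
  realtimeByStop = next;
}

// Present realtime data when the stop has any, else fall back to the schedule.
// loadScheduled is a hypothetical callback wrapping the database query.
function getStopTimes(stopId, loadScheduled) {
  if (realtimeByStop.has(stopId)) {
    return { source: 'realtime', times: realtimeByStop.get(stopId) };
  }
  return { source: 'scheduled', times: loadScheduled(stopId) };
}
```

Calling `refreshRealtime` on a timer keeps the object current. Because the decoded objects live in memory, the binary payload itself would not need to be stored in this design. A real database lookup would likely be asynchronous, so `getStopTimes` would return a promise in that case.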

Related

Is it better to prewrite the dashboard data or fetch and do the calculation on demand?

I have some data in my MongoDB that I want to represent in a dashboard.
It takes some time to fetch the selected documents from different collections and do the calculations needed to send the results back to the client.
So I had the idea to pre-write the required data, in the required format, in a dedicated collection. Whenever the client asks for the dashboard, I just fetch its data directly, so that I don't have to wait on fetching data across different collections and doing the calculations at request time.
By the way, this data is not updated frequently: let's say about 100 updates max per day.
Does this idea sound right, or does it have some drawbacks that I didn't think about?
Thank you in advance,
That's caching; your idea sounds just right.
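As a rough illustration of that caching idea, the sketch below recomputes a summary only when the underlying data changes and serves reads from the stored result. The in-memory `orders` array and `dashboardCache` object stand in for the MongoDB collections, and all names and fields are hypothetical.

```javascript
// Stand-ins for the source collection and the dedicated dashboard collection.
const orders = [];
let dashboardCache = null;

// The expensive aggregation: runs on writes, not on dashboard reads.
function rebuildDashboard() {
  const total = orders.reduce((sum, o) => sum + o.amount, 0);
  dashboardCache = {
    orderCount: orders.length,
    totalAmount: total,
    updatedAt: new Date().toISOString(),
  };
}

// Write path: store the record, then refresh the precomputed summary.
function addOrder(order) {
  orders.push(order);
  rebuildDashboard();
}

// Read path: return the precomputed document without recalculating.
function getDashboard() {
  if (dashboardCache === null) rebuildDashboard();
  return dashboardCache;
}
```

With around 100 updates per day, rebuilding on every write should be cheap. The main thing to watch is that every write path triggers the rebuild, otherwise the dashboard serves stale numbers.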

Best way to run a script for large userbase?

I have users stored in a PostgreSQL database (~10 M) and I want to send all of them emails.
Currently I have written a Node.js script which fetches users 1000 at a time (OFFSET and LIMIT in SQL) and queues the requests in RabbitMQ. This seems clumsy to me: if the node process fails at any time, I have to restart it (I am currently keeping track of the number of users skipped per query, and can restart from the last skipped count found in the logs). This might lead to some users receiving duplicate emails and some not receiving any. I could create a new table with a column indicating whether the email has been sent to that person, but in my current situation I can't do so: I can neither create a new table nor add a new column to an existing table. (Seems to me like an idempotency problem?)
How would you approach this problem? Do you think compound indexes might help? Please explain.
The best way to handle this is indeed to store who received an email, so there's no chance of doing it twice.
If you can't add tables or columns to your existing database, just create a new database for this purpose. To be able to recover from crashes, you will need to store who got the email somewhere. So if you are given hard restrictions on not storing this in your main database, get creative with another storage mechanism.
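One way to sketch the "store who received an email" idea, assuming a separate store is available for the sent log. Here a `Set` stands in for that store and `enqueue` for the RabbitMQ publish call; both, along with the function name `queueBatch`, are hypothetical.

```javascript
// Queue emails for a batch of users, skipping anyone already recorded as sent.
// sentLog: a Set of user ids (stand-in for a durable external store).
// enqueue: callback that publishes one message (stand-in for RabbitMQ).
function queueBatch(users, sentLog, enqueue) {
  let queued = 0;
  for (const user of users) {
    if (sentLog.has(user.id)) continue; // already handled in an earlier run
    enqueue({ userId: user.id, email: user.email });
    sentLog.add(user.id); // record only after the publish call returns
    queued += 1;
  }
  return queued;
}
```

After a crash the script can simply be rerun from the start, since users already in the log are skipped. A crash between the publish and the log write could still duplicate one email, so the log needs to be durable and written as close to the publish as possible.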

Bulk Data Transfer through REST API

I have been informed that "REST API is not made / good for Bulk Data Transfer. It's a proven fact". I tried searching Google about this, but was unable to find any useful answer. Can anyone let me know whether this statement is actually true or not? If it's true, then why?
Note: I am not exposing bulk data (50 million rows from a database) over the web. I am saving it on the server in JSON format (approx. 3 GB file size) and transferring it to another system. I am using Node.js for this purpose. The network is not an issue for transferring the file.
There is nothing wrong with exposing an endpoint which returns huge data.
The concern might be how you are sending that data, as memory could be an issue.
Why don't you consider streaming the data? That way the memory needed is only the one chunk of data being streamed at a time.
Node.js has many ways to pipe data into the response object; you can also consider the JSONStream module from npmjs.org.
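A minimal sketch of the streaming approach, using only Node's built-in `stream` module rather than JSONStream. The generator `jsonArrayChunks` and the wrapper `jsonArrayStream` are hypothetical helpers that emit a JSON array one row at a time, so only a single row is serialized in memory at once.

```javascript
const { Readable } = require('stream');

// Yield a JSON array piece by piece: "[", each row, separators, then "]".
function* jsonArrayChunks(rows) {
  yield '[';
  let first = true;
  for (const row of rows) {
    if (!first) yield ',';
    yield JSON.stringify(row);
    first = false;
  }
  yield ']';
}

// Wrap the generator in a readable stream that can be piped to any writable,
// such as an HTTP response.
function jsonArrayStream(rows) {
  return Readable.from(jsonArrayChunks(rows), { objectMode: false });
}
```

In a request handler this could be used as `jsonArrayStream(rows).pipe(res)`, where `rows` is any synchronous iterable. An asynchronous source such as a database cursor would need an `async function*` variant of the generator. For a file that already exists on disk, `fs.createReadStream(path).pipe(res)` achieves the same effect.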

Cloud Functions Http Request return cached Firebase database

I'm new to Node.js and Cloud Functions for Firebase, so I'll try to be specific with my question.
I have a Firebase database with objects that include a "score" field. I want the data to be retrieved based on that field, which can be done easily on the client side.
The issue is that if the database grows big, I'm worried that the query will take too long to return and/or consume a lot of resources. That's why I was thinking of an HTTP service using Cloud Functions that stores a cache of the top N objects and updates itself, via a listener, whenever the score of any object changes.
Then the client side just has to call something like https://myexampleprojectroute/givemethetoplevels to receive a JSON with the top N levels.
Is it reasonable? If so, how can I approach that? Which structures do I need for this cache, and how do I return them in JSON format via HTTP?
At the moment I'll keep doing it client side but I'd really like to have that both for performance and learning purpose.
Thanks in advance.
EDIT:
In the end I did not implement the optimization. First, the Firebase database does not expose a "child count", so with my newbie JavaScript knowledge I didn't find a way to implement it. Second, and more important, I'm pretty sure the data won't scale up to millions of entries (at most about 10K), and Firebase has rules for optimizing sorted reads. For more information please check out this link.
Also, I'll post a simple code snippet that retrieves data from your database via an HTTP request using Cloud Functions, in case someone is looking for it. Hope this helps!
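// Simple test function that returns a JSON object from the DB.
// Warning: no security measures (authentication, request-method checks, etc.) are applied.
const functions = require('firebase-functions');
const admin = require('firebase-admin');
admin.initializeApp();

exports.request_all_levels = functions.https.onRequest((req, res) => {
  const ref = admin.database().ref('CustomLevels');
  // Read the node once and send it back as JSON; report read failures
  // instead of leaving the request hanging.
  return ref.once('value')
    .then((snapshot) => res.status(200).json(snapshot.val()))
    .catch((err) => res.status(500).send(err.message));
});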
You're duplicating data upon writes, to gain better read performance. That's a completely reasonable approach. In fact, it is so common in NoSQL databases to keep such derived data structures that it even has a name: denormalization.
A few things to keep in mind:
While Cloud Functions run in a more predictable environment than the average client, the resources are still limited. So reading a huge list of items to determine the latest 10 items is still a suboptimal approach. For simple operations, you'll want to keep the derived data structure up to date on every write operation.
So if you have a "latest 10" and a new item comes in, you remove the oldest item and add the new one. With this approach you have at most 11 items to consider, compared to having your Cloud Function query the whole list for the latest 10 upon every write, which is an O(something-with-n) operation.
Same for an averaging operation: you'll find a moving average to be most performant, because it doesn't require any of the previous data.
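The incremental updates described above can be sketched as two small pure functions. `pushLatest` and `updateMean` are hypothetical names, and the running mean shown is a cumulative average, which is one reading of the "moving average" mentioned above. In a Cloud Function they would presumably run inside a database write trigger.

```javascript
// Keep only the newest `limit` items: append the new one, drop the oldest.
function pushLatest(latest, item, limit) {
  const next = latest.concat([item]);
  return next.length > limit ? next.slice(next.length - limit) : next;
}

// Update a running mean from the stored { count, mean } alone;
// none of the earlier values are needed.
function updateMean(stats, value) {
  const count = stats.count + 1;
  return { count, mean: stats.mean + (value - stats.mean) / count };
}
```

Both functions touch a constant amount of data per write, regardless of how many items the full list holds.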

web development - deletion of user data?

I have finished my first complex web application and I have found out it is probably better to use "isDeleted" flags in the db than to hard-delete records. But I wonder what the recommended approach is for data stored on the filesystem (e.g. photos). Should I delete the files when their related entity is (soft-)deleted, or keep them as they are? Can junk accumulation cause running out of storage in practice?
It definitely can - you'll need to gather some stats on how much data the typical account generates, and then figure out how many deletions you're seeing to sort out how much junk data will pile up and/or when you'll fill up your storage.
You might also want to try using something like S3 to store your data - at that point, the only reason you would need to delete things would be because it was costing you too much to store it.
