I am relatively new to Node.js and I am trying to get more familiar with it by writing a simple module. The module's purpose is to take an id, scrape a website, and return an array of dictionaries with the data.
The data on the website is scattered across pages, with every page accessed via a different index number in the URI. I've defined a function that takes the id and page_number, scrapes the website via http.request() for this page_number, and on the end event passes the data to another function that applies some RegEx to get the data in a structured way.
In order for the module to have complete functionality, all the available page numbers of the website should be scraped.
Is it OK by Node.js style/philosophy to create a standard for() loop to call the scraping function for every page, aggregate the results of every return, and then return them all at once from the exported function?
EDIT
I figured out a solution based on help from #node.js on Freenode. You can find the working code at http://github.com/attheodo/katina_node
Thank you all for the comments.
The common method, if you don't want to bother with one of the libraries mentioned by #ControlAltDel, is to set a counter equal to the number of pages. As each page is processed (asynchronously, so you don't know in what order, nor do you care), you decrement the counter. When the counter is zero, you know you've processed all pages and can move on to the next part of the process.
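A minimal sketch of that counter pattern (scrapeAll and scrapePage are hypothetical names; scrapePage(id, pageNum, callback) stands in for the per-page scraper described above, calling back with (err, data)):

function scrapeAll(id, totalPages, done) {
    var remaining = totalPages; // counter set to the number of pages
    var results = [];
    for (var page = 1; page <= totalPages; page++) {
        scrapePage(id, page, function (err, data) { // scrapePage is hypothetical
            if (err) return done(err);
            results.push(data);  // completion order doesn't matter here
            remaining -= 1;      // one more page processed
            if (remaining === 0) done(null, results); // counter hit zero: all pages done
        });
    }
}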
The problem you will probably encounter is recombining all of the aggregated results. There are several libraries out there that can help, including Async and Step. You could also use a promises library like Fibers.Promise, but the latter is not really Node philosophy and requires direct changes/additions to the Node executable.
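For comparison, here is roughly what the same aggregation looks like with the Async library's times() helper (again assuming the hypothetical scrapePage; times() hands the iteratee the indices 0..n-1 and collects the results in order):

var async = require('async');

function scrapeAll(id, totalPages, done) {
    async.times(totalPages, function (n, next) {
        scrapePage(id, n + 1, next); // pages are numbered from 1
    }, function (err, results) {
        done(err, results);          // results are ordered by page index
    });
}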
With the helpful comments from #node.js on Freenode I managed to find a solution by sequentially calling the scraping function and attaching callbacks, as Node.js philosophy requires.
You can find the code here: https://github.com/attheodo/katina_node/blob/master/lib/katina.js
The code block of interest lies between lines 87 and 114.
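For readers who don't want to dig through the repo, the sequential-callback idea looks roughly like this (a sketch using the same hypothetical scrapePage helper as above, not the exact code from the file):

function scrapeAll(id, totalPages, done) {
    var results = [];
    (function next(page) {
        if (page > totalPages) return done(null, results); // all pages scraped
        scrapePage(id, page, function (err, data) {        // scrapePage is hypothetical
            if (err) return done(err);
            results.push(data);
            next(page + 1); // only start the next request when this one ends
        });
    })(1);
}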
Thank you all
I am working on a scraper that is growing bigger and bigger and I'm worried about making the wrong design choices.
I have never done more than short scripts in Python and I'm at a loss as to how to design a project of bigger proportions.
The scraper retrieves data from different but similarly themed websites, so an implementation for each site is needed.
The desired raw text of each website is then put through a parser which searches for the required values.
After retrieving the values, they should be stored in a 3NF database.
In its final evolution the scraper should run on a cloud service and check all the different sites periodically for new data. Speed and performance are not of the highest importance, but they are desirable. Most importantly, the required data should be retrieved without unnecessary code duplication.
I'm using the Selenium webdriver and have the driver object implemented as a singleton, so all the requests are done by the same driver object. The website text is then part of the state of that object.
All the other functionality is currently modelled as functions, everything in one file. To add another website to the project, I first copied the script and just changed the retrieval part. Since it soon occurred to me that this is a poor approach, I wanted to ask for design recommendations.
Would you rather implement a Retriever base class and inherit from it for every website, or is there an even better way to go?
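To make the idea concrete, here is a rough sketch of what I mean (written in JavaScript purely for illustration, since my project is in Python; all class and method names are hypothetical):

class Retriever {
    constructor(driver) {
        this.driver = driver; // the shared (singleton) webdriver
    }
    retrieve(url) {
        // site-specific: each website overrides only this step
        throw new Error('retrieve() must be overridden per site');
    }
    parse(rawText) {
        // shared parser that searches the raw text for the required values
        return {};
    }
    store(values) {
        // shared write into the normalized (3NF) database
    }
    run(url) {
        this.store(this.parse(this.retrieve(url)));
    }
}

class SiteARetriever extends Retriever {
    retrieve(url) {
        // placeholder for site A's retrieval via this.driver
        return '';
    }
}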
Many thanks for any ideas!
I am having an issue getting the twit Node.js bot to recognize multiple parameters: I want it to RT only when the tweet is both from a listed user in the code and contains a specified hashtag.
var stream = bot.stream('statuses/filter', { follow: approved_users, track: '#somehashtag' });
While this seems to work on an either/or basis, how can it be modified so that it is an AND condition?
I've been looking through the documentation, and while I assume this should be a relatively simple fix, I am not familiar enough with Node.js to recognize immediately where the issue is.
Any help on this would be greatly appreciated!
Per the Twitter documentation, it is not possible to AND these filter types:
The track, follow, and locations fields should be considered to be combined with an OR operator. track=foo&follow=1234 returns Tweets matching "foo" OR created by user 1234.
The only way you'd be able to do something more complicated in real time would be to use the enterprise PowerTrack API, which is a commercial offering.
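If you only need this for your own bot rather than at the filter level, a common workaround (my suggestion, not something from the Twitter docs) is to stream on follow only and apply the hashtag condition client-side in the tweet handler (bot and approved_users are from the question):

var stream = bot.stream('statuses/filter', { follow: approved_users });

stream.on('tweet', function (tweet) {
    var hasTag = (tweet.entities.hashtags || []).some(function (h) {
        return h.text.toLowerCase() === 'somehashtag'; // entity text has no '#'
    });
    if (hasTag) {
        bot.post('statuses/retweet/:id', { id: tweet.id_str }, function (err) {
            if (err) console.error('retweet failed:', err.message);
        });
    }
});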
It looks like all the Spark examples found on the web are built as one single long function (usually in main).
But it is often the case that it makes sense to break the long call chain into separate functions, to:
Improve readability
Share code between solution paths
A typical signature would look like this (this is Java, but a similar signature would appear in any language):
private static Dataset<Row> myFiltering(Dataset<Row> data) {
    return data
        .filter(functions.col("firstName").isNotNull())
        .filter(functions.col("lastName").isNotNull());
}
The problem here is that there is no safety for the content of the Row: there is no enforcement of the fields, so calling the function becomes a matter of matching not only the signature but also the content of the Row, which obviously may (and in my case does) cause errors.
What is the best practice you enforce in large-scale development environments? Do you leave the code as one long function? Do you suffer every time you change field names?
Yes, you should split your method into smaller methods. Be aware that many small functions can also hurt readability ;)
My rules:
Split a chain of transformations if you can name it: if it has a name, it is some kind of subfunction.
If there are many functions in the same domain, extract a new class or even a package.
KISS: don't split if your call chain is short and already describes its subfunction, even if some lines could be extracted; it is not necessarily more readable to read many, many custom functions.
Any larger filters or maps (not expressible in 1-3 lines) I suggest extracting to a method and using a method reference.
Marked as Community Wiki, because this is only my point of view and it's OT for Stack Overflow. If someone else has any other suggestions, please share them :)
Hello Stack Overflow,
I've been writing APIs for quite a bit of time now, and now I've come to work on one of these bigger APIs. I started wondering how to shape this API, as I've often seen on bigger platforms that one big entity (for example, a product page in a shop) is loaded in parts (we can see the item body loaded, but the comments are still fetching, etc.).
Usually what I've done was attach comments as a relation in the SQL query, so my frontend queried a single API endpoint like:
http://api.example.com/items/:id
And it returned all the necessary data, like seller info, photos, etc.
Logically, seller info and photos are small pieces of data (an item can only have one seller and no more than 10 photos, for example), but the number of comments might be a much larger collection with its own relationship (the comment author).
Does it make sense to separate one endpoint into 2 independent endpoints like:
http://api.example.com/items/:id
http://api.example.com/items/:id/comments
What are the downsides of this approach? Is it common practice? Or have I maybe misunderstood some concept?
One downside might be that two requests are performed, but on the other hand, the first endpoint should return data faster (as it's lighter than also fetching n comments), so the page might be displayed faster, with a spinner shown for the comments section. This way I'd be able to paginate comments too.
Are there any improvements that might be included in this separation of endpoints? Or maybe I'm totally wrong and it should be done a totally different way?
I think it is a good approach if:
The number of comments on one item can be large, because with this approach you can paginate them more easily.
You are going to need to access the comments of one item without needing the rest of the item information.
I think either of the previous conditions justifies this decision, and yes, it is a common approach.
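For illustration, a minimal sketch of the split endpoints in Express (getItem and getComments are hypothetical data-access helpers):

var express = require('express');
var app = express();

// Light payload: the item itself plus its small relations (seller, photos).
app.get('/items/:id', function (req, res) {
    getItem(req.params.id, function (err, item) {
        if (err) return res.sendStatus(500);
        res.json(item);
    });
});

// Heavy collection: comments, paginated independently of the item.
app.get('/items/:id/comments', function (req, res) {
    var page = parseInt(req.query.page, 10) || 1;
    getComments(req.params.id, page, 20, function (err, comments) {
        if (err) return res.sendStatus(500);
        res.json({ page: page, comments: comments });
    });
});

app.listen(3000);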
What are the queries we can use with media/popular? Can we localize it according to country or geolocation?
Also, is there a way to get the discovery feature's results with the API?
This API is no longer supported.
Ref: https://www.instagram.com/developer/endpoints/media/
I was recently struggling with the same problem and came to the conclusion that there is no way other than the hard one.
If you want location-based popular images, you must go with the locations endpoint.
https://api.instagram.com/v1/locations/214413140/media/recent
This link brings up recent media from a custom location, the key being the location id. Your job is now to follow the simple pagination API and merge the returned arrays into one big bunch of JSON. The $response['pagination']['next_max_id'] parameter is responsible for pagination, so you simply send each next request with the max_id of the previous response.
https://api.instagram.com/v1/locations/214413140/media/recent?max_id=1093665959941411696
The end result will depend on the amount of information you gathered. At the end you will just need to sort the array by like count, and you're good to go with whatever you were going to do.
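A sketch of that pagination walk in Node (ACCESS_TOKEN is a placeholder, and per the other answer the endpoint is deprecated, so this only documents the old flow):

var https = require('https');

var base = 'https://api.instagram.com/v1/locations/214413140/media/recent' +
           '?access_token=ACCESS_TOKEN';

function fetchPage(url, collected, done) {
    https.get(url, function (res) {
        var body = '';
        res.on('data', function (chunk) { body += chunk; });
        res.on('end', function () {
            var json = JSON.parse(body);
            collected = collected.concat(json.data); // merge this page's media
            var next = json.pagination && json.pagination.next_max_id;
            if (!next) {
                // sort the merged bunch by like count, most liked first
                collected.sort(function (a, b) { return b.likes.count - a.likes.count; });
                return done(collected);
            }
            fetchPage(base + '&max_id=' + next, collected, done); // follow pagination
        });
    });
}

fetchPage(base, [], function (media) {
    console.log('fetched', media.length, 'items');
});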
Of course, an important part is to save the images locally rather than regenerating the result every time a user opens the webpage, the reason being not only generation time but also the limited number of requests per hour.
Hope someone will come up with a better solution, or that the Instagram API will finally support media/popular by location.