Generating db_gone URLs for fetch - Nutch

In my crawler system, I have set the fetch interval to 30 days. I initially set my user agent to, say, "....", and many URLs were rejected. After changing my user agent to an appropriate name, I want to fetch those URLs that were rejected initially.
But those URLs with db_gone status will have a retry interval of 45 days, so the generator won't pick them. In this case, how would I fetch those URLs with db_gone status?
Does Nutch have any option by default to crawl only those db_gone URLs?
Or do I need to write a separate map-reduce program to collect those URLs and use freegen to generate segments for them?

You just need to configure nutch-site.xml with a different refetch interval. For example:
<property>
  <name>db.fetch.interval.max</name>
  <value>7776000</value>
  <description>The maximum number of seconds between re-fetches of a page
  (90 days). After this period every page in the db will be re-tried, no
  matter what is its status.
  </description>
</property>

Related

Using web scraping to receive all Twitter followers

I want to create giveaways which require the participants to follow the Twitter account of the giveaway creator.
My first idea was to use the Twitter API (endpoint: "/2/users/:id/followers"). This works fine for me; however, I always run into rate limits. The API allows me to send 15 requests every 15 minutes and returns a maximum of 1000 users per request. Since many accounts have more than 15,000 followers, and since many requests happen at the same time (many users want to participate in a giveaway), this solution is not suitable for me.
My second idea was to use web scraping instead (e.g. Node Fetch). I was following along with a tutorial, but doing so I always run into the issue that Twitter uses random strings to name their HTML elements. You can see in the picture that there is no defined class to grab the elements.
So my main question is: how can I access these elements?
(Image: random follower of my Twitter account)
I also have a follow-up question regarding the effectiveness of this method. Assume I have multiple people who want to participate in a short amount of time (e.g. 10 people in 5 minutes) and they all need to follow a big Twitter account (e.g. 100k followers).
Is it efficient to scrape all 100k followers each time, or should I instead fetch the 100k followers once, save them to my database, and use the database to check each user later?
As a side note, I am using Node.js and node-fetch, but I have no problem switching frameworks. In addition, I think both grabbing the elements and the performance considerations should be universal.
Thanks for your help :)
They're going to detect your server's excessive calls. There is a Twitter Developer Portal where you can request elevated access, which may raise the limits for you.
https://developer.twitter.com
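If you stay with the API route, the "fetch once, then check against a cache" idea from the question can be sketched roughly as follows. This is only an illustration, not a drop-in solution: the bearer token, the placeholder IDs, and Node 18+ global fetch are assumptions; the endpoint and max_results=1000 come from the question.

// followers.ts - sketch: fetch all followers of an account once, cache them,
// then answer "does user X follow account Y?" from the cache.
const BEARER_TOKEN = process.env.TWITTER_BEARER_TOKEN ?? ""; // assumption: token supplied via env

interface FollowersPage {
  data?: { id: string; username: string }[];
  meta?: { next_token?: string };
}

async function fetchAllFollowerIds(accountId: string): Promise<Set<string>> {
  const followerIds = new Set<string>();
  let nextToken: string | undefined;
  do {
    const url = new URL(`https://api.twitter.com/2/users/${accountId}/followers`);
    url.searchParams.set("max_results", "1000"); // 1000 users per request, as in the question
    if (nextToken) url.searchParams.set("pagination_token", nextToken);
    const res = await fetch(url, { headers: { Authorization: `Bearer ${BEARER_TOKEN}` } });
    if (!res.ok) throw new Error(`Twitter API error: ${res.status}`);
    const page = (await res.json()) as FollowersPage;
    for (const user of page.data ?? []) followerIds.add(user.id);
    nextToken = page.meta?.next_token; // keep paging until there is no next page
  } while (nextToken);
  return followerIds;
}

// Fetch once when the giveaway starts, then check each participant against the
// cached set instead of re-fetching 100k followers per entry.
async function demo() {
  const followers = await fetchAllFollowerIds("GIVEAWAY_CREATOR_ID"); // assumption: placeholder ID
  console.log(followers.has("PARTICIPANT_ID") ? "is following" : "is not following");
}
demo().catch(console.error);

With this shape, the 15-requests-per-15-minutes window only has to cover the one-off pagination of the creator's followers, not one call per participant.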

Google API dailyLimitExceeded solutions

I've been working with Google Analytics for 2 months now. I created a custom dashboard with NodeJS (Express/serverless) that requests data from the Core Reporting API and the Real Time Reporting API. I've managed to deploy it as a Lambda function on AWS. While I'm very pleased with this, I'm facing some issues right now.
I get the following errors:
{
  "error": {
    "errors": [
      {
        "domain": "global",
        "reason": "dailyLimitExceeded",
        "message": "Quota Error: profileId ga:NNNNN has exceeded the daily request limit."
      }
    ],
    "code": 403,
    "message": "Quota Error: profileId ga:NNNNN has exceeded the daily request limit."
  }
}
and
{
  "error": {
    "errors": [
      {
        "domain": "usageLimits",
        "reason": "userRateLimitExceeded",
        "message": "User Rate Limit Exceeded"
      }
    ],
    "code": 403,
    "message": "User Rate Limit Exceeded"
  }
}
My dashboard looks like this:
When the dashboard gets visited, it calls the Realtime API 9 times (each block in the image is a query call). I think I could combine the 'Online users', 'Users today' and 'Pageviews today' calls into one call. The 'Searches today' and 'Orders today' calls are specified by filters that search for a specific event.
I have built in a time checker which allows the dashboard to be viewed between 07:00 and 19:00. When it's earlier than 07:00 or later than 19:00, a variable checkTime is set to false, which makes the dashboard show a div with text like "dashboard offline". When someone visits the dashboard in the allowed time range, checkTime is set to true and calls to the Google APIs can be made.
The dashboard runs on a TV screen between 07:00 and 19:00, so it is up for 12 hours. Every 20 seconds there is a function call to update all the data (so again 9 requests are made).
So let's say: 3 updates per minute x 60 minutes = 180 per hour, x 12 hours = 2,160 update cycles, x 9 requests = 19,440 requests per day.
I don't think I should reach the 50,000 quota, but I am hitting the per-profile quota of 10,000.
However, when I view the Developers Console, I can see the following:
I think my options are the following:
Increase the interval to 1 minute ((60 x 12) x 9 requests = 6,480 per day); that way the profile quota shouldn't be exceeded. But this doesn't really make the dashboard realtime anymore.
Make a server which runs the queries (with the increased interval of 1 minute) and saves the results to a database. The dashboard then makes a GET request against that database. This way multiple TV screens should be able to request data.
QUESTION: Could I also create multiple service accounts and switch to another service account when the limit has been reached, or doesn't this fix the profileId limit?
dailyLimitExceeded can mean one of two things.
You can only make 10,000 requests against a single view per day. You share this quota with other developers: if I install your app and someone else's app, in total only 10,000 requests a day can be made against my Google Analytics view, and then both apps will get that error. If you are making these requests, you should be storing the data in a database so that you don't need to request the same information again, even when it's a different user trying to view data on the same view. You are probably not going to be able to track this quota hit in the Google Developer console.
The second issue is that by default an application can make a maximum of 50,000 requests a day across all views. That means that if you have 5 users and you are making 10,000 requests a day for each of them, you have reached the limit of your requests. I don't think this is what you are hitting.
For the first quota, the user-based one, there is nothing you can do; you can't extend it. You need to limit your requests so that you don't block a user's account. For the second one, you can apply for an extension in the Google Developer console. It can take a while to get, so you should apply when you have reached around 80% of your current daily quota.
The main thing here is that you should not be requesting the same data twice. If you have made a request, save the result and display the stored data to your users rather than requesting it again. Also, you should not query the Realtime API more than once every 5 minutes, as you will be eating up your quota.
I have suggested to Google several times that the Realtime API should have its own quota rather than sharing it with the Reporting API. I am still waiting for them to add this feature.
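A rough sketch of the caching approach (option 2 in the question): one server process polls Google Analytics on a fixed interval and every TV screen reads the cached result. The fetchDashboardData stub, the one-minute interval, and the Express route are assumptions for illustration, not actual GA query code.

// cache-server.ts - sketch: poll GA on a fixed interval, cache the result in
// memory, and let any number of dashboards read the cache.
import express from "express";

interface DashboardData { onlineUsers: number; usersToday: number; pageviewsToday: number; }

let cached: { data: DashboardData | null; updatedAt: number } = { data: null, updatedAt: 0 };

async function fetchDashboardData(): Promise<DashboardData> {
  // Hypothetical stand-in for the 9 (ideally combined) Core Reporting / Realtime queries.
  return { onlineUsers: 0, usersToday: 0, pageviewsToday: 0 };
}

async function refresh() {
  try {
    cached = { data: await fetchDashboardData(), updatedAt: Date.now() };
  } catch (err) {
    console.error("GA refresh failed, keeping previous data", err); // stale data beats an error screen
  }
}

// One GA round-trip per minute, regardless of how many TV screens are connected.
setInterval(refresh, 60_000);
refresh();

const app = express();
app.get("/dashboard-data", (_req, res) => res.json(cached)); // dashboards read the cache only
app.listen(3000);

The point of this shape is that the number of screens no longer multiplies the GA request count, so the per-profile quota is spent at a fixed, predictable rate.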

Instagram API: getting user information and rate limits

I'm a little confused as to whether what I am trying to do is even possible, given the stated limits of the API.
My app should do this:
1. user logs in, app gets auth token
2. user gets list of their followers
https://api.instagram.com/v1/users/self/follows?access_token=ACCESS-TOKEN
This point is easy to get to, but the next step (3) seems potentially problematic.
3. user gets the number of followers each of those followers has
https://api.instagram.com/v1/users/{user-id}/?access_token=ACCESS-TOKEN
If the user has 5000+ followers (the limit is 5000 requests per hour), do I really need to request each user's information one by one? If so, it looks like I will definitely hit the rate limit.
4. user is able to delete followers having fewer than a certain number of followers (limit 60/hour)
https://api.instagram.com/v1/users/{user-id}/relationship?access_token=ACCESS-TOKEN
So, it seems, given the limits, that such an app would be impossible to create. Is there some channel where I can request a limit increase? This tool would be used sparingly and infrequently.
There is a section "Relationship Endpoints" where you can use the request
GET /users/self/followed-by
https://api.instagram.com/v1/users/self/followed-by?access_token=ACCESS-TOKEN
to get the list of users this user is followed by, so this is only one request (per page of results).
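A minimal sketch of steps 2-3 under the stated limits: page through the followed-by list, then look up each follower's own follower count, throttled to stay under 5000 requests/hour. The endpoints are the v1 ones quoted in the thread; ACCESS_TOKEN, the delay value, and the response field names are assumptions for illustration.

// followers-audit.ts - sketch: list followers, then find the ones below a follower-count threshold.
const ACCESS_TOKEN = process.env.IG_ACCESS_TOKEN ?? "";
const DELAY_MS = 3600_000 / 4500; // aim for ~4500 requests/hour, safely under the 5000/hour limit

const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

async function getFollowedBy(): Promise<{ id: string; username: string }[]> {
  const users: { id: string; username: string }[] = [];
  let url: string | undefined =
    `https://api.instagram.com/v1/users/self/followed-by?access_token=${ACCESS_TOKEN}`;
  while (url) {
    const body = await (await fetch(url)).json();
    users.push(...(body.data ?? []));
    url = body.pagination?.next_url; // keep paging until the list is exhausted
  }
  return users;
}

async function followerCount(userId: string): Promise<number> {
  const body = await (
    await fetch(`https://api.instagram.com/v1/users/${userId}/?access_token=${ACCESS_TOKEN}`)
  ).json();
  return body.data?.counts?.followed_by ?? 0;
}

async function findSmallAccounts(threshold: number): Promise<string[]> {
  const followers = await getFollowedBy();
  const small: string[] = [];
  for (const user of followers) {
    if ((await followerCount(user.id)) < threshold) small.push(user.username);
    await sleep(DELAY_MS); // spread the per-user lookups so the hourly limit is never hit in a burst
  }
  return small;
}

Note that for an account with 5000+ followers one full pass still takes over an hour, which is exactly the constraint raised in the question, so the results would have to be cached rather than recomputed on every run.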

Instagram API: Any way to increase the results returned from a hashtag-based query?

I recently saw a change imposed by Instagram on sandbox accounts that limits returned results to the 20 most recent images: https://www.instagram.com/developer/sandbox/#api-behavior.
I need to fetch the last six tag-based images, but among the last 20 images there are only 3 images with that hashtag.
Is there any way to overcome this?
It's not just the last 20 posts in Sandbox Mode; it's also only the last 20 posts from each of your app's sandbox users. This is a limitation of Sandbox Mode.
The behavior of the API when you are in sandbox mode is the same as
when your app is live, but comes with the following restrictions:
Data is restricted to sandbox users and the 20 most recent media from each sandbox user
Reduced API rate limits
The only way to get everything is to go live.

How do you prevent crawling from your web site?

I am running a website on IIS with more than 1000 page links in pagination, and I want to prevent others from crawling/stealing these pages by running a crawler script and getting the info page by page.
Is there any way to tell whether a request comes from a real user or is being run by a script? Or maybe some filter for this at the highest level, before it reaches the request handler?
You can't prevent automated crawling.
You can make it harder to automatically crawl your content, but if you allow users to see the content it can be automated (i.e. automating browser navigation is not hard, and computers generally don't mind waiting a long time between requests).
One option is to require a single "user" (authenticated or not) to have some minimal delay between requests (e.g. 1-5 seconds). This way generic crawling will not be useful (require some "user id" in the request and a delay between requests), and one would have to write custom crawling code, which is clearly more time-intensive. A sketch of this idea follows below.
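The sketch below shows the per-"user" delay idea as Express middleware. This is illustration only: the question's site runs on IIS, so you would implement the equivalent there; keying on the client IP and the 2-second minimum gap are assumptions (a session or auth token would be a better "user id").

// throttle.ts - sketch: enforce a minimal delay between requests from the same client.
import express, { Request, Response, NextFunction } from "express";

const MIN_DELAY_MS = 2000;                   // assumed 2-second minimum gap per client
const lastSeen = new Map<string, number>();  // client key -> timestamp of last request

function throttle(req: Request, res: Response, next: NextFunction) {
  const key = req.ip ?? "unknown";           // assumption: IP as the "user id"
  const now = Date.now();
  const prev = lastSeen.get(key) ?? 0;
  if (now - prev < MIN_DELAY_MS) {
    res.status(429).send("Too many requests - slow down"); // generic crawlers trip this quickly
    return;
  }
  lastSeen.set(key, now);
  next();
}

const app = express();
app.use(throttle);
app.get("/page/:n", (req, res) => res.send(`page ${req.params.n}`)); // placeholder paginated route
app.listen(3000);

With 1000+ paginated pages and a forced delay of a few seconds each, a naive crawl stretches to hours, which is usually enough to push casual scrapers elsewhere.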
Note that writing a special "crawler" for your site may be considered a "noble" action and can significantly increase the incentive to create one (e.g. check out the "how to make Google Maps available offline" questions).
