I am trying to extract data from a website that refreshes every minute. I have researched web scraping and tried several Chrome extensions, but none of them worked for me.
Some background: the website is where people monitor bid prices for COE (Certificate of Entitlement for cars in Singapore). Every alternate Wednesday, from 1430 to 1600, I have to manually copy and paste the data into an Excel spreadsheet each minute, before the page refreshes.
Details for COE
I have attached screenshots to illustrate further.
This is the website to scrape: https://www.onemotoring.com.sg/1m/coe/coeDetail.html
You can do this at very low cost with AWS Lambda and Node.js.
Create a Lambda function and trigger it on the cron schedule at which you want to crawl the website. You can use a library like
https://github.com/bda-research/node-crawler
to simplify crawling.
Also, to get the exact nodes in the page, use server-side jQuery or any other script that can extract elements from the crawled page.
Once you have the details, you can store them in DynamoDB, a NoSQL database with very low latency.
You can use an ODM like https://github.com/clarkie/dynogels to access DynamoDB with very little code.
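To illustrate the "extract the exact nodes" step, here is a minimal sketch in Python (the answer suggests Node.js; Python is used only to keep the example short and dependency-free). The sample markup, column names and values are made up for illustration. The real page's structure has to be inspected first, and if the page fills its table with JavaScript, the raw HTML may not contain the data at all.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collects the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def extract_rows(html):
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical sample markup, NOT the real page.
SAMPLE_HTML = """
<table>
  <tr><th>Category</th><th>Current price</th></tr>
  <tr><td>A</td><td>100</td></tr>
  <tr><td>B</td><td>200</td></tr>
</table>
"""

rows = extract_rows(SAMPLE_HTML)
for row in rows:
    print(row)
```

Each scheduled run would fetch the page, call something like `extract_rows` on the response body, and append the rows (plus a timestamp) to whatever store you choose.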
Hope it helps.
I want to create giveaways which require the participants to follow the Twitter account of the giveaway creator.
My first idea was to use the Twitter API (endpoint: "/2/users/:id/followers"). This works fine for me, however I always run into rate limits. The API allows me to send 15 requests every 15 minutes and returns a maximum of 1000 users per request. Since many accounts have more than 15000 followers and since many requests happen at the same time (many users want to participate in a giveaway), this solution is not suitable for me.
My second idea was to use web scraping instead (e.g. Node Fetch). I was following along with a tutorial. However, doing so I always run into the issue that Twitter uses random strings to name its HTML elements. You can see in the picture that there is no defined class to grab the elements.
So my main question is: how can I access these elements?
Random Follower of my Twitter Account
I also have a follow-up question regarding the effectiveness of this method. Assume I have multiple people who want to participate in a short amount of time (e.g. 10 people in 5 minutes) and they all need to follow a big Twitter account (e.g. 100k followers).
Is it efficient to scrape all 100k followers each time, or should I instead fetch the 100k followers once, save them to my database and use my database to check for each user later?
As a side note, I am using Node.js and node-fetch, however I have no problem switching frameworks. In addition, I think the grabbing of the elements as well as the performance question should be universal.
Thanks for your help :)
They're going to detect your server's excessive calls. There is a Twitter Developer Portal where you can request elevated access, which may raise the limits for you.
https://developer.twitter.com
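On the follow-up question about fetching the follower list once and checking against your own copy: a minimal sketch of that idea, assuming an in-memory cache with a time-to-live, could look like the following. `fetch_follower_ids` is a hypothetical callable standing in for whatever paginated API calls you end up making; it is not a real Twitter client function, and the ids below are invented.

```python
import time


class FollowerCache:
    """Keeps one follower snapshot per account and refreshes it after a TTL."""

    def __init__(self, fetch_follower_ids, ttl_seconds=900, clock=time.monotonic):
        # fetch_follower_ids: hypothetical callable, account_id -> iterable of user ids
        self._fetch = fetch_follower_ids
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # account_id -> (fetched_at, set of follower ids)

    def is_follower(self, account_id, user_id):
        now = self._clock()
        entry = self._entries.get(account_id)
        if entry is None or now - entry[0] > self._ttl:
            # Snapshot missing or stale: fetch the whole list once.
            entry = (now, set(self._fetch(account_id)))
            self._entries[account_id] = entry
        return user_id in entry[1]


calls = []


def fake_fetch(account_id):
    # Stand-in for the real (rate-limited) follower lookup.
    calls.append(account_id)
    return ["u1", "u2", "u3"]


cache = FollowerCache(fake_fetch, ttl_seconds=900)
results = [cache.is_follower("big_account", u) for u in ("u1", "u9", "u2")]
print(results, "fetches:", len(calls))
```

Three checks cost a single fetch here. The trade-off is freshness: someone who follows the account just after the snapshot was taken is reported as a non-follower until the next refresh, so the TTL has to match how quickly participants expect to be recognized.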
I want to get or buy structured Google search results content, either from Google itself or from any other source that can sell Google data legally. For example, I want all results for a specific keyword from the most recent 6 months.
As a workaround, it would be enough at this stage if I could just get the page content as raw text.
Automated reading / scraping of Google SERPs is against Google's ToS. From this point of view, nobody sells such data legally: any seller violates Google's ToS.
There are many offers on the market where you can get SERP data as JSON or full HTML through API access. Just google for them.
The way every seller does SERP scraping is always the same, and you can do it on your own. Run many proxies with IP addresses from the countries whose SERPs you need, and query Google with some kind of headless browser. Use captcha-solving services to get data even if an IP gets banned. Multithread your queries to get more data at once. That's the whole magic.
Do you think that they plugged directly into the Twitter API, or do they have some sort of backend which is what connects to the Twitter API directly instead? I didn't realize this kind of functionality was available to standard users.
Link: NoHomophobes.com
This site has a (short) piece about the technology used - it does seem like they're using the standard, public API:
"Using Twitter's API, tweets [...] were pulled, tracked and displayed in real time [...] We couldn't simply pull every tweet ... A lot of research and testing was conducted to determine which words and phrases to capture, as well as what parameters the tweets had to follow in order to be funneled onto the site"
Also, the site's own T&Cs mention:
"This website contains a licensed real time display of Tweets"
At a guess, they're effectively continually searching for certain terms in public tweets (as any Twitter client can do) and displaying the results.
Basically, the site uses the Twitter Streaming APIs which allow a persistent connection with Twitter. And as filtered tweets come through, it processes the data and delivers it to website users through web sockets via a 3rd party service called Pusher.
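The filter-and-deliver flow described above can be sketched roughly as follows. This is only an illustration of the idea, not the site's actual code: the tweet dictionaries, the tracked phrases and the subscriber callbacks are all hypothetical stand-ins for a real streaming connection and a real WebSocket/Pusher channel.

```python
def matches(text, tracked_phrases):
    """True if any tracked phrase occurs in the text (case-insensitive)."""
    lowered = text.lower()
    return any(phrase.lower() in lowered for phrase in tracked_phrases)


def relay(stream, tracked_phrases, subscribers):
    """Pushes every matching tweet to every subscriber; returns how many matched."""
    delivered = 0
    for tweet in stream:
        if matches(tweet["text"], tracked_phrases):
            for send in subscribers:
                send(tweet)
            delivered += 1
    return delivered


# Hypothetical stand-in for tweets arriving over a persistent connection.
incoming = [
    {"id": 1, "text": "Just had a great coffee"},
    {"id": 2, "text": "That movie was SO boring"},
    {"id": 3, "text": "so boring, never again"},
]

inbox = []  # stand-in for a WebSocket channel to connected browsers
count = relay(incoming, ["so boring"], [inbox.append])
print(count, [t["id"] for t in inbox])
```

In a real deployment the `stream` would be an endless iterator over the streaming connection, and each subscriber callback would publish to the push service rather than append to a list.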
I want to submit my site to Google. How much time does it take to crawl a new post on the website?
Also, is there a way to feed this post to Google crawler as soon as a post is created?
Google has three modes of entering a website into its results - discover, crawl, index.
In order to 'discover' your site, Google must be made aware of its existence, normally through back-links. If your site is brand new you can use the submit URL form, but this isn't really a trusted method. You're better off signing up for a Google Webmaster Tools account and submitting your site. An additional step is to submit an XML sitemap of your site. If you are publishing to your site in a blogging/posting way, you can always consider PubSubHubbub.
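For the XML sitemap step, a minimal sketch of generating one in Python might look like this. It follows the sitemaps.org 0.9 schema (`urlset` / `url` / `loc` / `lastmod`); the URLs and dates are placeholders, and how you collect your real post URLs is up to your site.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """entries: iterable of (url, last_modified) pairs, dates as YYYY-MM-DD."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


# Placeholder posts, for illustration only.
sitemap_xml = build_sitemap([
    ("https://example.com/posts/1", "2024-01-01"),
    ("https://example.com/posts/2?ref=a&b=c", "2024-01-02"),
])
print(sitemap_xml)
```

Regenerating this file whenever a post is published, and serving it at a stable URL that you submit once in Webmaster Tools, gives the crawler a current list of pages to pick up. ElementTree takes care of escaping characters such as `&` in URLs.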
From there on, crawl frequency is normally based on site popularity (as measured by ye olde PageRank). Depth of crawl (crawl-budget) is also determined by PR.
There are a couple ways to help "feed" the Google Crawler a URL.
The first way is to submit a URL here: www.google.com/webmasters/tools/submit-url/
The second way is to go to your Google Webmaster Tools, click "Fetch as GoogleBot",
and then enter the URL you want to add:
http://i.stack.imgur.com/Q3Iva.png
The URL will then appear similar to this:
http:\\example.site Web Success URL submitted to index 1/22/12 2:51 AM
As for how long it takes for a question on here to appear on Google, many factors come into play.
If the owners of the site use Google Webmasters Tools, the following setting is available:
http://i.stack.imgur.com/RqvOi.png
For a fast crawl, submit your XML sitemap in Google Webmaster Tools and manually request crawling and indexing of your page URLs through the Fetch as Google feature.
I used this crawl-and-index method myself, and it gave me the best results.
This is a great resource that really breaks down all the factors that affect a crawl budget and how to optimize your website to increase it. Cleaning up your broken links and removing outdated content, for example, can work wonders. https://prerender.io/crawl-budget-seo/
I acknowledged the error in my response by adding a comment to the original question a long time ago. Now I am updating this post in the interest of keeping future readers from being misguided as I was. Please see the notes from other users below; they are correct. Google does not make use of the revisit-after meta tag. I am still keeping the original response text here so that anyone else looking for a similar answer will find it along with this note confirming that this meta tag IS NOT VALID! Hope this helps someone.
You may use an HTML meta tag as follows:
<meta name="revisit-after" content="1 day">
Adjust time period as necessary. There is no guarantee that robots will return in given time frame but this is how you are telling robots about how often a given page is likely to change.
The Revisit Meta Tag is used to tell search engines when to come back next.
Comparing Google Analytics results to the One&One hosting monthly statistics shows a huge discrepancy.
For last month:
Google shows 1046 visits.
One&one stats show 15304 unique visits.
The google code is in the footer which appears on every page.
I'm aware GA only works with JS enabled, but can there really be that many non-JS users?
Google Analytics is a good indicator of how many humans are visiting your website.
Here are some things to check:
how many bots are in your monthly stats? You can usually find something that says User-Agent in your stats page. GoogleBot, Slurp, msnbot & others will be visiting every page on your site.
that you've read Google Analytics' definition of a visit.
that you have read what your statistics provider means by unique visit. Does that mean unique visitor, page view or something else?
Raw hits on servers can be misleading for a number of reasons:
If you have external style sheets & JavaScript etc, they could be counted as a hit in the webserver log
RSS feed readers will periodically update without being asked to by a human
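To check how much of the gap comes from bots and static assets, you can run a rough count over the raw access log. The sketch below assumes the log is in the common "combined" format; the bot markers, the asset suffixes and the sample lines are illustrative assumptions, not an exhaustive or authoritative list.

```python
import re

BOT_MARKERS = ("bot", "slurp", "spider", "crawler")  # assumed, not exhaustive
STATIC_SUFFIXES = (".css", ".js", ".png", ".jpg", ".gif", ".ico")

# Combined Log Format:
# ip - - [time] "METHOD path protocol" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"$'
)


def summarize(lines):
    counts = {"raw_hits": 0, "bot_hits": 0, "asset_hits": 0, "page_views": 0}
    visitor_ips = set()
    for line in lines:
        m = LOG_LINE.match(line.strip())
        if not m:
            continue  # skip lines that are not in the expected format
        counts["raw_hits"] += 1
        agent = m.group("agent").lower()
        path = m.group("path").split("?", 1)[0].lower()
        if any(marker in agent for marker in BOT_MARKERS):
            counts["bot_hits"] += 1
        elif path.endswith(STATIC_SUFFIXES):
            counts["asset_hits"] += 1
        else:
            counts["page_views"] += 1
            visitor_ips.add(m.group("ip"))
    counts["distinct_ips"] = len(visitor_ips)
    return counts


# Made-up sample lines using documentation IP ranges.
SAMPLE = [
    '203.0.113.5 - - [01/Jan/2024:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 512 "-" "Mozilla/5.0 (Windows NT 10.0)"',
    '203.0.113.5 - - [01/Jan/2024:10:00:01 +0000] "GET /style.css HTTP/1.1" 200 128 "-" "Mozilla/5.0 (Windows NT 10.0)"',
    '198.51.100.7 - - [01/Jan/2024:10:00:02 +0000] "GET /index.html HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"',
    '198.51.100.9 - - [01/Jan/2024:10:00:03 +0000] "GET /about.html HTTP/1.1" 200 256 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp)"',
]

summary = summarize(SAMPLE)
print(summary)
```

Distinct IPs are only a crude proxy for visitors, so treat the output as a sanity check on the hosting numbers rather than a replacement for either set of statistics.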
Check the page views in Google Analytics - it's possible that 1&1 is tracking unique page views instead of the actual visits.
Google Analytics works for almost all users (I believe less than 5% have JS disabled). I have had the same discrepancy; in my case the difference was zeroed out when I took the bots into account (server-side statistics often count them, as they produce HTTP requests). You probably have the same "problem".
Neither set of stats is wrong, they just count different things. Google Analytics is the more "accurate" one, i.e. it gives the numbers you want to look at. The hosting stats, which look only at HTTP requests, often without filtering, are less interesting.
Blogger, and probably other sites, serve a different page template or skin to mobile visitors. In my case, that template didn't contain the google analytics snippet of code and so those hits were uncounted, until I noticed and fixed it.