Get search results from a website without actually "opening" the site

I'd like to code a website where you can find search results from many other websites.
So my question is whether this is possible and, if so, whether you have any suggestions for how I could do it.
Here's my workflow:
I search for something on my website. For example: "asdf"
My code then executes the search on the other website, for example:
https://www.google.ch/#q=asdf&safe=images
Some results will be shown, of course. But how can I take those results directly and show them on my website, without opening the other website?
I have to say that the websites I'm looking at don't have any API for this.

I probably wouldn't recommend scraping a web page directly in the client.
I'm not even sure you could do it easily without running into cross-domain policy problems anyway.
A solution like APIfy might help you do what you want:
http://apify.heroku.com/resources
Or you could build your own server-side API "layer" for this particular website.
Keep in mind that scraping a web page is always a fragile process, since the format can change at any moment.
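To make the "layer" idea concrete, here is a minimal sketch of such a server-side scraper, assuming Node 18+ (for the built-in fetch) with Express and cheerio; the target URL and the ".result a" selector are placeholders you would replace for the actual site:

    // Minimal server-side "API layer" sketch: the browser talks to this
    // endpoint, and the scraping happens server-side, so the client never
    // hits cross-domain problems.
    const express = require('express');
    const cheerio = require('cheerio');

    const app = express();

    app.get('/search', async (req, res) => {
      const query = req.query.q || '';
      try {
        // Fetch the target site's search page (placeholder URL).
        const response = await fetch(
          'https://example.com/search?q=' + encodeURIComponent(query)
        );
        const html = await response.text();

        // Parse the HTML; this selector is a placeholder and will break
        // whenever the target site changes its markup.
        const $ = cheerio.load(html);
        const results = [];
        $('.result a').each((i, el) => {
          results.push({ title: $(el).text(), url: $(el).attr('href') });
        });

        res.json(results);
      } catch (err) {
        res.status(502).json({ error: 'Upstream fetch failed' });
      }
    });

    app.listen(3000);

Your front end then calls /search?q=asdf on your own domain and renders the JSON however you like.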

Requesting removal of stored website data from search engines

Greetings fellow developers,
I would like to ask for help regarding the following problem: Is there a way to request removal of stored website data from search engines? Most of the links that show up when searching my domain are old and non-existent.
What I've found from personal research regarding this question/problem:
From my personal research I have found that removal requests can be made individually to the well-known search engines such as Google, Yahoo and Bing, but this is not what I am looking for, since I am well aware that it would take a lot of time for the requests to be processed and the data to be removed. Also, I wasn't able to find a removal-request page for the other search engines.
To be more precise/clear...
... I want to request this website-data removal from all (or at least most) search engines at once, so that when I upload my new website (to the same domain), working and functional links (URLs) are displayed. Can this be achieved and, if so, how? Also, how much time would the removal take?
Hope my question is clear enough, and any answer/help would be very much appreciated.
No, there is no way to do this for all search engines at once. You will have to request it from each one individually. As for the smaller search engines, you can try to find contact information or customer support, but there is a chance they will ignore your request (heck, some crawlers ignore the robots.txt file and crawl your site anyway... it's just part of being on the web).

detecting if website has e-commerce in Node.js

I need to detect programmatically whether a website has an e-commerce platform/system.
I don't need to know which one, I just need to know if the website has one.
(I have a big list of websites so I probably need to scrape them)
Any suggestions on how I could do this without using external services (like rescan.io, builtwith, etc.) would be greatly appreciated!
thank you!
You can use a package called Puppeteer, which is used for web scraping in Node.js.
I don't know which platforms you are trying to look for, but I guess you could try something like giving the list of websites you want to check to a Node.js process and asking Puppeteer to scrape them all. Then you look at the content you get back and, for example, look for Shopify's CDN in the script tags or check the meta tags for keywords.
You will definitely need to check each platform, like Magento or Shopify, for unique source code that clearly sets the framework you are looking at apart from other tools.
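A minimal sketch of that approach with Puppeteer is below; the fingerprint strings (the Shopify CDN hostname, "Magento", "woocommerce") are just illustrative examples, not an exhaustive or authoritative list:

    // Sketch: check a list of sites for e-commerce fingerprints.
    const puppeteer = require('puppeteer');

    // Illustrative markers only; real detection needs one or more
    // unique markers per platform you care about.
    const FINGERPRINTS = ['cdn.shopify.com', 'Magento', 'woocommerce'];

    async function hasEcommerce(page, url) {
      await page.goto(url, { waitUntil: 'domcontentloaded' });
      const html = await page.content(); // fully rendered HTML
      return FINGERPRINTS.some((marker) => html.includes(marker));
    }

    (async () => {
      const sites = ['https://example.com', 'https://example.org']; // your list
      const browser = await puppeteer.launch();
      const page = await browser.newPage();
      for (const url of sites) {
        try {
          console.log(url, await hasEcommerce(page, url));
        } catch (err) {
          console.log(url, 'failed:', err.message);
        }
      }
      await browser.close();
    })();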

Retrieving Google Instant Data

I want to develop an application that visualizes the suggestions of Google Instant. It is for a course project and, for now, I don't know much about web programming tools. What I wonder is whether it is possible to retrieve that data from another web page. If you think it is possible, and you know which platform it is possible with, could you please point me in the right direction?
Without more information on what you're actually trying to do, it's difficult to give a proper answer. From what I can understand, you just want a list of the auto-completed items from a Google search, to manipulate however you like?
In which case, using the highest-rated answer from here, you can use http://suggestqueries.google.com/complete/search?client=firefox&q=YOURQUERY to get a JSON object which you can then manipulate to extract the auto-complete results. The client= part is needed, but I haven't looked at the various options you can put in there.
Personally, I've never used JSON before, so I can't give you any help with parsing it, but you can find more information about it on the JSON website and the W3 website.
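For what it's worth, here is a small sketch of calling that endpoint from Node 18+ (built-in fetch, no extra packages); as far as I can tell, with client=firefox the body is a JSON array of the form ["query", [suggestions...]]:

    // Sketch: fetch Google's suggest endpoint and print the suggestions.
    async function suggest(query) {
      const url =
        'http://suggestqueries.google.com/complete/search?client=firefox&q=' +
        encodeURIComponent(query);
      const response = await fetch(url);
      // Observed shape with client=firefox: ["asdf", ["asdf suggestion", ...]]
      const [echoedQuery, suggestions] = await response.json();
      return suggestions;
    }

    suggest('asdf').then((s) => console.log(s));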
You will need to act like the page's JavaScript, run a JavaScript engine, or use a browser add-on and communicate with that add-on.
What happens as you type is that a JavaScript function is called, so you need to call this function yourself or mimic what it does. I guess it programmatically calls a web service / web page (AJAX) with what you have typed, and the server responds with the suggestions. Not very difficult, as long as Google does not deny you once it realizes you're not a browser. I think they allow only about 100 free API calls, but you can Google that.
HttpComponents in Java will help with calling the service, cookies, etc. You should use the dev tools in Firefox to see what happens under the hood when you type in the Google search bar, and look at the code.

Add search feature to simple website without mySQL database

I have a simple HTML site with 100+ pages or so. I want to add a search bar at the top so the user can search the site. I know about Google Custom Search, but it shows ads unless you pay at least $100. Obviously I'd like ad-less search on my site for free if at all possible!
I've also heard about Lucene/Solr, but they do not actually crawl the site. For that I would apparently need Nutch.
Anyway, the site I have runs on a Microsoft IIS6 server, but I have basically no knowledge of how Solr, Nutch, etc. get "installed" on the server.
Also: I'd like to point out that I do have a local copy of the site. Perhaps I can do one big initial Nutch "crawl" locally that will create an XML file for Solr? That would help me get "up and running", but probably wouldn't be a good long-term solution.
...so should I just use Google Custom Search? Or is there a not-extremely-painful-to-implement alternative? The brain hurts, folks.
You did not mention how many search requests you want to handle, but if you use the JSON REST API of Google's Custom Search you get 100 search queries a day for free, and you can display the results without any ads on your page.
A simple example request can be found here.
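As a rough illustration, a request to that API looks something like the sketch below (Node 18+ fetch); YOUR_API_KEY and YOUR_ENGINE_ID are placeholders for the credentials you create in the Google developer console:

    // Sketch: query the Google Custom Search JSON API.
    async function siteSearch(query) {
      const url =
        'https://www.googleapis.com/customsearch/v1' +
        '?key=YOUR_API_KEY&cx=YOUR_ENGINE_ID&q=' +
        encodeURIComponent(query);
      const response = await fetch(url);
      const data = await response.json();
      // Each result item has (among other fields) a title, link, and snippet.
      return (data.items || []).map((item) => ({
        title: item.title,
        link: item.link,
        snippet: item.snippet,
      }));
    }

    siteSearch('asdf').then((results) => console.log(results));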
Here is an easy way that works pretty well, although you may be looking for something more than this.
http://sitecomber.com/getsitecomber/
You can create code to paste into your site in about 2 minutes. It doesn't get easier than that. Search is powered by Google, but results are isolated to your website.
EDIT: This no longer works.

How to get the lists of file and directory names of a site?

How exactly do you do this? The reason I'm asking is that my CMS has been breached, mainly because the username and password were fairly common (my bad). But I had always thought it was safe, since the directory name is pretty uncommon and hard to guess (not the usual /cms/ or /admin/). Brute-forcing from a script? Or maybe some Google tricks?
Update: my CMS is in PHP and I developed it myself. I don't remember putting the link to it anywhere, except once in an email I sent to my friend via Gmail.
Update 2: as this could be used by some people to attack a site, please don't put any scripts in the answer. My intention is just to learn the general ways of doing it, so that I can prevent further attacks like this.
Thanks in advance.
Did you ever surf somewhere via a link from your CMS? Your browser would have sent a Referer (note the misspelling) header, indicating where you came from.
Maybe you had a link to the administrative area somewhere?
Or maybe accessing the main directory without a filename renders a directory index, i.e. you're using mod_autoindex?
My guess is that somebody linked to your CMS URL and an automated (evil) script found it via Google search results while looking for some common patterns.
Search Google using this query
link:http://www.example.com/myCmsFolder
to verify whether your link/pages are in Google's index.
