I want to add a search feature to a site using TYPO3 9.5.13.
No problem so far: just run composer require "typo3/cms-indexed-search:^9", as indexed_search supports 9.5.
The docs then tell me to install the crawler extension (supposedly "aoepeople/crawler": "^6.7").
The catch is that the crawler docs state that it supports TYPO3 up to 8.7.99.
No risk, no fun, so I gave it a try and installed crawler even though it does not explicitly support TYPO3 9.5.
When selecting "Info" on a page in the backend, it tells me:
Fatal error: Class 'TYPO3\CMS\Core\Controller\CommandLineController' not found in /var/www/html/public/typo3conf/ext/crawler/Classes/Command/QueueCommandLineController.php on line 38
Looks like crawler really does not support 9.5.
This raises a few questions:
Is it impossible to use indexed_search on TYPO3 9.5 because crawler does not support it?
Is there a workaround? Do I really need crawler, or is there another option?
Should I opt for an alternative to indexed_search, such as Solr?
What is the best practice here?
The crawler is only needed if you want to update the search index on a regular basis with the scheduler. If you don't install the crawler, the index is updated whenever a page is loaded by a user who is not logged into the backend. For small to medium-sized sites this should work.
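If you go without the crawler, just make sure frontend indexing is switched on in your TypoScript setup. A minimal sketch using indexed_search's standard options:

```
# Index cached pages as they are rendered in the frontend
config.index_enable = 1
# Optionally also index linked external files such as PDFs
config.index_externals = 1
```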
I am prototyping a Shopware App right now, where I want to extend the search with our search API. We already have a working plugin in the store for that.
I found those two references for hooks:
https://developer.shopware.com/docs/resources/references/app-reference/webhook-events-reference
https://developer.shopware.com/docs/resources/references/app-reference/script-reference/script-hooks-reference
Seems like there is no webhook for the search at all, just a script hook for a finished search. In the plugin we could simply extend the ProductSearchRoute and be completely flexible.
Are search extensions not planned right now?
Cheers,
Tobias
I assume you want to alter the criteria for fetching the products. As of today this is not yet possible with non-self-hosted apps. You could use app scripts to enrich or replace the contents of an already loaded page, as you mentioned; obviously that comes with some drawbacks regarding performance. The capabilities of apps are being enhanced continuously, though, so there's a chance that search manipulation will become possible rather soon.
What is another way to crawl the web besides following hyperlinks?
Most major sites use sitemaps. They give your crawler a fast way of discovering URLs and can be used alongside, or instead of, following outlinks.
The crawler-commons project provides a sitemap parser in Java.
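To show how little is involved, here is a minimal sketch in TypeScript (assuming Node 18+ with its global fetch; the regex extraction is a deliberate simplification, and a production crawler should use a proper XML parser and also handle sitemap index files and gzipped sitemaps):

```typescript
// Minimal sitemap-based URL discovery (sketch).
async function discoverUrls(siteRoot: string): Promise<string[]> {
  // Sitemaps conventionally live at /sitemap.xml; robots.txt may list others.
  const res = await fetch(new URL("/sitemap.xml", siteRoot));
  if (!res.ok) return [];
  const xml = await res.text();
  // Naive extraction of <loc> entries from the XML.
  return [...xml.matchAll(/<loc>\s*([^<]+?)\s*<\/loc>/g)].map((m) => m[1]);
}

discoverUrls("https://example.com").then((urls) =>
  console.log(`discovered ${urls.length} URLs`, urls.slice(0, 5))
);
```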
I need to detect programmatically whether a website runs an e-commerce platform/system.
I don't need to know which one, I just need to know if the website has one.
(I have a big list of websites so I probably need to scrape them)
Any suggestions on how I could do this without using external services (like rescan.io, builtwith, etc.) would be greatly appreciated!
Thank you!
You can use Puppeteer, a package for web scraping in Node.js.
I don't know which platforms you are trying to look for, but you could feed your list of websites to a Node.js process and have Puppeteer scrape them all. Then you inspect the content you get back and, for example, look for Shopify's CDN in the script tags or check the meta tags for keywords.
You will definitely need to check each platform, like Magento or Shopify, for unique source code that clearly sets it apart from other tools.
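A rough sketch of that approach in TypeScript (the fingerprint patterns below are illustrative assumptions, not a vetted signature list):

```typescript
import puppeteer from "puppeteer";

// Illustrative fingerprints only; real detection needs researched
// signatures per platform (Shopify, Magento, WooCommerce, ...).
const FINGERPRINTS: Record<string, RegExp> = {
  shopify: /cdn\.shopify\.com|Shopify\.theme/i,
  magento: /Magento|mage\/cookies/i,
  woocommerce: /woocommerce/i,
};

async function detectPlatforms(urls: string[]): Promise<Map<string, string[]>> {
  const browser = await puppeteer.launch();
  const results = new Map<string, string[]>();
  try {
    const page = await browser.newPage();
    for (const url of urls) {
      try {
        await page.goto(url, { waitUntil: "domcontentloaded", timeout: 30000 });
        const html = await page.content();
        // Record every platform whose fingerprint appears in the markup.
        const hits = Object.entries(FINGERPRINTS)
          .filter(([, re]) => re.test(html))
          .map(([name]) => name);
        results.set(url, hits);
      } catch {
        results.set(url, []); // unreachable or timed out
      }
    }
  } finally {
    await browser.close();
  }
  return results;
}
```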
I have a simple HTML site with 100+ pages or so. I want to add a search bar at the top so the user can search the site. I know about Google Custom Search, but it shows ads unless you pay at least $100. Obviously I'd like ad-less search on my site for free if at all possible!
I've also heard about Lucene/Solr, but they do not actually crawl the site. For that I would apparently need Nutch.
Anyway, the site I have runs on a Microsoft IIS6 server, but I have basically no knowledge of how Solr, Nutch, etc. get "installed" on the server.
Also: I'd like to point out that I do have a local copy of the site. Perhaps I can do one big initial Nutch crawl locally that will create an XML file for Solr? That would help me get "up and running", but probably wouldn't be a good long-term solution.
...so should I just use Google Custom Search? Or is there a not-extremely-painful-to-implement alternative? The brain hurts, folks.
You did not mention how many search requests you need to handle, but if you use the JSON REST API of Google's Custom Search, you get 100 search queries a day for free, and you can display the results without any ads on your page.
A simple example request can be found here.
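For illustration, a minimal sketch of such a request in TypeScript (the API key and search engine ID are placeholders you must supply from your own Google account):

```typescript
// Sketch of a Custom Search JSON API call (Node 18+ global fetch).
async function search(query: string) {
  const url = new URL("https://www.googleapis.com/customsearch/v1");
  url.searchParams.set("key", process.env.GOOGLE_API_KEY ?? ""); // placeholder
  url.searchParams.set("cx", process.env.SEARCH_ENGINE_ID ?? ""); // placeholder
  url.searchParams.set("q", query);
  const res = await fetch(url);
  const data = await res.json();
  // Each result item carries title, link and snippet fields to render ad-free.
  return (data.items ?? []).map((i: any) => ({ title: i.title, link: i.link }));
}
```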
Here is an easy way that works pretty well, although you may be looking for something more than this.
http://sitecomber.com/getsitecomber/
You can create code to paste into your site in about 2 minutes. It doesn't get easier than that. Search is powered by Google, but results are isolated to your website.
EDIT: This no longer works.
I am looking into the simplest way to integrate Wikipedia into a node.js app.
The requirements are to be able to search for entries and find entities in each entry.
Any known existing libs/methods for that?
Thanks
There's a newly available open-source parser for wiki text (http://sweble.org/) that might be useful if you roll your own solution. Of course, that would require downloading the Wikipedia data dump, parsing it, and storing entities in a database.
You could also look at DBpedia (http://dbpedia.org/About), though that would require integrating the RDF stack into your app (either running a local RDF repository or communicating with the often-flaky online version via SPARQL).
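For the DBpedia route, queries go over plain HTTP to the public SPARQL endpoint; a minimal sketch in TypeScript (the resource name is just an illustration, and the endpoint's availability is not guaranteed):

```typescript
// Sketch: fetch the English abstract of a DBpedia resource via SPARQL.
async function dbpediaAbstract(resource: string): Promise<string | null> {
  const query = `
    PREFIX dbo: <http://dbpedia.org/ontology/>
    SELECT ?abstract WHERE {
      <http://dbpedia.org/resource/${resource}> dbo:abstract ?abstract .
      FILTER (lang(?abstract) = "en")
    } LIMIT 1`;
  const url = new URL("https://dbpedia.org/sparql");
  url.searchParams.set("query", query);
  url.searchParams.set("format", "application/sparql-results+json");
  const res = await fetch(url);
  const data = await res.json();
  const binding = data.results.bindings[0];
  return binding ? binding.abstract.value : null;
}

dbpediaAbstract("Node.js").then(console.log);
```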
One easy approach is to use a search engine API and restrict results to site:wikipedia.org, e.g.:
http://www.google.com/search?q=node.js+site%3Awikipedia.org
I've found that can work really well.
The spider module for scraping with jQuery is fantastic:
https://github.com/mikeal/spider
Mikeal is the man
Presumably you'd be using this for a side (personal) project, though. Not sure how kosher it is to run wild on Wikipedia with a scraper.