Searching across multiple languages -- how to?

TLDR: I wanna build multi-language search on my website ala Pinterest, how do I do that?
I am starting a website, where people can publish content that gets metadata typed by the user. People can then interact with the content by looking at it, liking it, commenting on it, sharing it to social media. Also content discovery is mostly done through search.
I do not wish to create geographic boundaries on my website. I would like people who speak any language to find content that is relevant to them in any language. This requirement makes sense because the content is highly visual, ala Pinterest. So even if I don't understand that the word "car" is written in French in the description, it's fine because I'll mostly be interested in seeing the car.
Pinterest is really, really good at search across languages. For example, on uk.pinterest.com I typed "coupe carrée", which is the French for "bob haircut", and all the results are visually relevant, even when the pin metadata is in English and the original web site is all in English.
How is that possible? How was Pinterest able to match my French search query to content whose text is all in English? Is there translation at some step: coupe carrée > bob haircut > content containing "bob haircut"?
I looked at their engineering blog and all I found is tech to detect the original country and language of a website. Nothing about managing language in search.
Please let me know if this is the wrong place to ask the how-it-works question.
Thanks in advance for any help/pointers you will be able to share!

The general strategy in this case is to index your content in every language you wish to search.
This requires a language translation API at index time, plus a language identification model to detect the source language of each item. Here's a Solr example.
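As a rough illustration of the index-time idea, here is a minimal Python sketch. Everything in it is an assumption made for the example: `translate` is a hypothetical stand-in for a real translation API (backed here by a made-up two-entry phrase table), the index is a plain in-memory dictionary rather than Solr, and matching is exact-phrase only. Language identification is left out.

```python
from collections import defaultdict

# Hypothetical stand-in for a translation API call; a real system would
# call an external service here. This tiny phrase table is made up.
PHRASES = {
    ("bob haircut", "fr"): "coupe carrée",
    ("coupe carrée", "en"): "bob haircut",
}


def translate(text, target_lang):
    """Return text in target_lang, or the text unchanged if unknown."""
    return PHRASES.get((text.lower(), target_lang), text)


def build_index(docs, languages):
    """Map every translated form of each document's text to its doc ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for lang in languages:
            index[translate(text, lang).lower()].add(doc_id)
    return index


def search(index, query):
    """Exact-phrase lookup; returns matching doc ids in sorted order."""
    return sorted(index.get(query.lower(), set()))


docs = {1: "Bob haircut", 2: "Red car"}
index = build_index(docs, ["en", "fr"])
print(search(index, "coupe carrée"))
```

Because the French form was stored at index time, the French query finds the English document without any translation at query time.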

Related

Does Google engine penalize pages containing (machine or human) translated content?

Google's search engine has a zero-tolerance policy against duplicate and spun content, but I am not sure how it deals with translated text. Any guesses on how it might detect translated content? The first thing that occurs to me is that they use their own Google Translate to back-translate the translated content into the source language, but if that's the case, do they have to try back-translating into all languages? Are there any specific similarity metrics for such a task? Thank you!
According to this video with a Google employee, auto-generated / machine-translated versions of webpages can count against your site as duplicate content. If you append some text of your own to the machine-translated version you might be able to get around this 'Yes, it's duplicated content' flag, but we can't know how much original text needs to be added to a translation for the Google robots to treat the page as original content instead of duplicated content.
Your best bet would be to have an actual human translate the whole web page, or to have a human translator augment or modify a machine-translated version of your webpage so that the human-edited translation is sufficiently different (what 'sufficiently' means we don't know) from the machine-translated version.

Changing the frontend language & multiple countries support in Shopware

So, I'm trying to understand how this E-Commerce solution works. I have installed the Community Edition and added the default shop. Everything is fine. I read in the Shopware documentation that you can add multiple languages to the web shop by creating additional shops and configuring them as Language Shops. All is fine, that worked, I now have two websites.
Problem is - even though the localisation information is set to Romanian - the website is all in German. Do I really need to purchase the language packs that are offered in the Shopware store? Or can I change the text manually? If so, how do you do that? Also, apparently the flag for the selected language is off... I have the Language Shop configured for RO, but it displays the shop as DE (Germany).
Also, can Shopware make a difference between selected languages when talking about product stock, prices and payment method? The idea is that depending on what country is selected, the product stock and price is changed. With this, the product code might get changed. Also, payment methods and accounts have to be changed as well. Can Shopware do that? If so, is there a tutorial or something regarding this? (I didn't really find something like this...)
Thanks for the help !
Crossposted at the Shopware Community Forum.
So many questions ... trying to sort this out:
As far as I know the backend is available in 2 languages only, German and English.
Just changing the localisation will not change the language being used in the shop frontend. Changing the language is a complex thing to do, and it involves many aspects.
First of all, each subshop is assigned to a root category, e.g. 'German' or 'English' or 'Romanian' - this makes it possible to show different stock in different languages. (Btw, do not delete the category Deutsch or German, it will break the system. If you don't need it, just leave it empty.)
Many of the predefined attributes and any custom defined attribute for articles can be translated.
Any other static text can be translated, as Shopware uses the Smarty template technology - very easy and straightforward to do.
Bottom line: No, you do not need to buy the language packs if you want to do the translation yourself.
Prices can be set per customer group, subshops in turn are associated to customer groups, so this way you could show different prices in different regions/languages.

Is there a good way to retrieve company summary from wikipedia?

My question is not about parsing.
I have been looking through the Wikipedia API. I need to search for companies and get a one-sentence summary. It's working well; the only problem I have is when I need to disambiguate. It's hard for my code to know whether "dropbox (service)" or "dropbox (band)" is the Dropbox company my user is looking for.
I tried to put the word "company" in the query, expecting it to work like a Google search, but unfortunately it didn't.
So my question is: is there an easy way to disambiguate the results I get by telling Wikipedia it is a "company" that I want?
If you're looking for companies only, then consider using their full names instead of short forms. In the case of Dropbox, the name of the company is Dropbox, Inc. If you search for Dropbox, Inc in Wikipedia you will be redirected to the page Dropbox (service), which I believe is the page you're looking for.
If you don't have the resources to get the name of the company in the perfect format, then consider using Category:Companies to refine your results further.
When you get to the page, you can retrieve the extract for the company by using the MediaWiki API as follows:
https://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&exintro=&explaintext=&titles=Dropbox%20(service)
Note: The extract is called section0 in MediaWiki
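The request above can be built and its response unpacked with the standard library. This is only a sketch: the helper names `extract_url` and `first_extract` are made up for illustration, the sample response below is invented to show the shape of a `query`/`pages` reply, and no network call is made here.

```python
import json
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"


def extract_url(title):
    """Build the same extracts query as the URL above for any page title."""
    params = {
        "format": "json",
        "action": "query",
        "prop": "extracts",
        "exintro": "",       # intro section only
        "explaintext": "",   # plain text instead of HTML
        "titles": title,
    }
    return API + "?" + urlencode(params)


def first_extract(response_text):
    """Return the extract of the first page in a query response, if any."""
    pages = json.loads(response_text)["query"]["pages"]
    for page in pages.values():
        return page.get("extract")
    return None


# Made-up sample response, only to show the expected JSON shape.
sample = '{"query": {"pages": {"1": {"title": "Example", "extract": "Example text."}}}}'
print(extract_url("Dropbox (service)"))
print(first_extract(sample))
```

In practice the URL would be fetched with an HTTP client of your choice and the response body passed to `first_extract`.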
I recommend trying Wikidata. Wikidata is a multilingual factual database of everything, and it has a query interface at query.wikidata.org. The language the interface uses is called SPARQL. For instance, if you're interested in a list of well-known cats, https://w.wiki/W4W is your query. More details can be found at https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service.
import wikipedia
# One-call summary of the best-matching page.
print(wikipedia.summary("COMPANY_NAME"))
Try filtering the companies by category; the category list appears at the end of each page:
page = wikipedia.page("Dropbox")
print(page.title)
print(page.categories)
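One way to act on the category list is a simple keyword check over each candidate page's categories. The sketch below is an assumption-laden illustration, not a tested disambiguation method: the keywords, the helper names and the sample category names are all made up, and real category names on Wikipedia will differ.

```python
def looks_like_company(categories, keywords=("compan", "corporation")):
    """True if any category name contains one of the keywords."""
    return any(k in c.lower() for c in categories for k in keywords)


def pick_company(candidates):
    """candidates maps page title -> list of category names.

    Returns the first title whose categories look company-related.
    """
    for title, categories in candidates.items():
        if looks_like_company(categories):
            return title
    return None


# Hypothetical category lists for two disambiguation candidates.
candidates = {
    "Dropbox (band)": ["American rock music groups"],
    "Dropbox (service)": ["Companies based in San Francisco", "File hosting"],
}
print(pick_company(candidates))
```

With the `wikipedia` package, the category lists would come from the `categories` attribute of each candidate page.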

Does SharePoint Search support range tags?

I am working on a project to digitize approximately 1 million images for which metadata will be added to facilitate search.
Each image is, for example, a page in a dictionary. But not text. Just a static scanned image. OCR is not an option :(
My objective is to emulate the current search procedure, which consists of looking up the alphabetical entries till the correct page is found. In the absence of machine-readable text, I am looking at tagging each page with a dictionary range tag, for example (Apple-Canada). So if someone searches for "Banana", it should hit the (Apple-Canada) range tag.
Is this supported in SharePoint out of the box? If not, is there an addon product which provides this functionality or am I looking at building a customized extension?
Any help will be appreciated :)
Installing the IFilter for TIF files is done with a couple of clicks and gives you free OCR along the way. Very good for scanned pages.
On your question though: No, SharePoint does not have any kind of "range" tags or fields. The only vaguely similar thing to what you are requesting is the Thesaurus of the search. There you could define acronyms and synonyms for words and it would actually search for something else. So you could enter Banana but it would actually search for Apple. Some examples here: How to: Customize the Thesaurus in SharePoint Search and Search Server.
Other than that I can only think of a custom implemented search provider giving you the flexibility you need.
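The lookup such a custom provider would need is small. Here is a minimal Python sketch of the range-tag idea, independent of SharePoint: the page ids and ranges are made up, and plain lowercase string comparison is assumed to match the dictionary's alphabetical order, which may not hold for accented or hyphenated entries.

```python
from bisect import bisect_right

# Hypothetical range tags: (first_word, last_word, page_id), sorted by first_word.
RANGES = [
    ("apple", "canada", "page-001"),
    ("candle", "eagle", "page-002"),
    ("ear", "grape", "page-003"),
]
STARTS = [first for first, _, _ in RANGES]


def find_page(word):
    """Return the page whose range contains word, or None if no range does."""
    w = word.lower()
    i = bisect_right(STARTS, w) - 1  # last range starting at or before w
    if i >= 0 and w <= RANGES[i][1]:
        return RANGES[i][2]
    return None


print(find_page("Banana"))
```

Binary search over the sorted start words keeps the lookup fast even with around a million tagged pages.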

How to get a description of a URL

I have a list of URLs and am trying to collect their "descriptions." By description I mean what comes up, for example, if you Googled the link. For example, Google: http://stackoverflow.com shows the description as
A language-independent collaboratively
edited question and answer site for
programmers. Questions and answers
displayed by user votes and tags.
This is the data I'm trying to accumulate for the URLs I have.
I tried parsing the URL's meta-descriptions, however most of them are lacking a meta-description (yet Google and other search engines manage to get a description somehow).
Any ideas? Should I just "google" each link and scrape the data? I have a feeling Google wouldn't like this...
Thanks guys.
Different search engines have different algorithms to get the description out of the page if/when it lacks the description meta tag. Some ignore the tag even if it's there.
If you want the description Google has, the most accurate way to get it would be to scrape it. Otherwise, you could write your own or look around on the web for code that does it.
These are called snippets.
Google use proprietary (and possibly patented) methods to garner this information, so there is no simple answer.
As you suggest, they will use meta-description information if it is there. (How to set the meta-information to help Google.)
They will also honour requests from the page authors to NOT include snippets. (How to prevent Google from displaying snippets) You should probably respect this too (as well as robots.txt, of course.)
You may have some luck with existing auto-summary packages, such as OTS.
You may want to check AboutUs.org (e.g. http://www.aboutus.org/StackOverflow.com).
But, there's little chance that the site will have an aboutus page and not have a meta description.
Some info that might explain how google does this:
Webmasters/Site owners Help
Adding a URL to google
I am not familiar with Google APIs, but perhaps there is an official way to get such information.
Interesting. Some sources are better than others.
For "audiotuts.com" Google has a worse description than AboutUs.com.
Google
Nov 18th in General by Joel Falconer ·
1. Recently, an AUDIOTUTS reader asked me about creative process. While this
is a topic that can’t be made into a
...
AboutUs.com:
AUDIOTUTS is a blog/tutorial site for
musicians, producers and audio
junkies! It is the sister site of the
popular PSDTUTS, VECTORTUTS and
NETTUTS.
I hate problems like these... they should be trivial but they aren't!
If you can assume English content, you can first look for Meta Description, and if that doesn't work, you can look for the first two or three sentence-like word sequences.
A product I worked on looked for the first P or DIV that contained more than one sequence of > n "words" delimited by periods. It would use the two or three sentence-like sequences, up to x total words, as a summary paragraph. It wasn't 100% accurate, but good enough for the average case. The number of words was adjusted a few times to eliminate things like navigation elements.
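The heuristic described above could be sketched roughly as follows, using only Python's standard `html.parser`. This is not the product's actual code: the thresholds (`min_words`, `max_sentences`, `max_words`) are guesses standing in for n and x, nested blocks are handled naively (text goes to the innermost open P or DIV), and unclosed tags are not repaired.

```python
from html.parser import HTMLParser


class BlockCollector(HTMLParser):
    """Collects the meta description and the text of each <p>/<div>."""

    def __init__(self):
        super().__init__()
        self.meta_description = None
        self.blocks = []   # text of each closed <p>/<div>, in closing order
        self._stack = []   # text buffers for currently open <p>/<div>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.meta_description = attrs.get("content")
        elif tag in ("p", "div"):
            self._stack.append([])

    def handle_endtag(self, tag):
        if tag in ("p", "div") and self._stack:
            text = "".join(self._stack.pop())
            self.blocks.append(" ".join(text.split()))

    def handle_data(self, data):
        if self._stack:
            self._stack[-1].append(data)


def summarize(html, min_words=4, max_sentences=3, max_words=50):
    """Meta description if present, else the first sentence-like block."""
    parser = BlockCollector()
    parser.feed(html)
    if parser.meta_description:
        return parser.meta_description.strip()
    for block in parser.blocks:
        # Keep period-delimited runs with more than min_words words.
        sentences = [s.strip() for s in block.split(".")
                     if len(s.split()) > min_words]
        if len(sentences) > 1:
            words = (". ".join(sentences[:max_sentences]) + ".").split()
            return " ".join(words[:max_words])
    return None


page = ("<div>Home About Contact</div>"
        "<p>This site hosts questions and answers for programmers. "
        "Answers are ranked by votes from the community.</p>")
print(summarize(page))
```

The short navigation block is skipped because it contains no run of more than `min_words` words, which mirrors the adjustment for navigation elements mentioned above.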

Resources