Effective methods for DBpedia queries - dbpedia

I've been trying to implement a query to find the birth date of a given person using DBpedia. This seems easy enough, but I'm finding a lot of different tags to match names and birth dates, not all of which show up under all entities.
For instance, we have dbp:name, dbp:birthName, foaf:givenName, foaf:surname, dbo:birthDate, dbo:birthYear, dbp:dateOfBirth, etc.
I'm wondering if...
1 - Is there a higher-level library I can use (preferably in Python, but Java is OK too) that "understands" these different tags and can help form appropriate queries to search and combine the different information?
2 - Is there any guidance on how to accomplish this? For instance, is it better to use the ontology (dbo), the property (dbp), or the FOAF (foaf) namespace? I'm also curious whether a common approach is to pre-download all the possible tags, search on all that seem applicable, and then select the "best" returned data. If someone could offer some guidance or point me to a good tutorial, I'd appreciate it. I've found some info on the web but nothing that really addresses my question.
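One way to cope with overlapping properties is to ask for all the candidates in a single query and let SPARQL's COALESCE return the first one that is bound. The sketch below only builds the query string. `build_birth_date_query` is a hypothetical helper, not part of any library, and the preference order (dbo before dbp) and the match on an exact rdfs:label are assumptions.

```python
# Hypothetical helper: builds a SPARQL query that tries several birth-date
# properties and returns the first one that is bound for the entity.
PREFIXES = """\
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbp: <http://dbpedia.org/property/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
"""

# Assumed preference order, using only properties named in the question.
BIRTH_PROPERTIES = ["dbo:birthDate", "dbp:dateOfBirth", "dbo:birthYear"]


def build_birth_date_query(name, lang="en", properties=BIRTH_PROPERTIES):
    # Escape backslashes and quotes so the name is a valid SPARQL string literal.
    literal = name.replace("\\", "\\\\").replace('"', '\\"')
    # One OPTIONAL block per property, each bound to its own variable.
    optionals = "\n".join(
        f"  OPTIONAL {{ ?person {prop} ?b{i} }}" for i, prop in enumerate(properties)
    )
    variables = ", ".join(f"?b{i}" for i in range(len(properties)))
    return (
        f"{PREFIXES}"
        f"SELECT DISTINCT ?person (COALESCE({variables}) AS ?birth) WHERE {{\n"
        f'  ?person a dbo:Person ; rdfs:label "{literal}"@{lang} .\n'
        f"{optionals}\n"
        f"}} LIMIT 10"
    )


print(build_birth_date_query("Ada Lovelace"))
```

The resulting string can be sent to a SPARQL endpoint such as https://dbpedia.org/sparql, for example with the SPARQLWrapper package. Matching names the same way (dbp:name, foaf:givenName plus foaf:surname, and so on) would follow the same OPTIONAL/COALESCE pattern.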

Related

Smart search for acronyms in Salesforce

In Salesforce's Service Cloud one can enable the out-of-the-box search function, where the user enters a term and the system searches all parts of the database for a match. I would like to enable smart searching of acronyms, so that if I spell out an organization's name, the search functionality will also search for associated acronyms in the database. For example, if I type in American Automobile Association, I would get results that contain either "American Automobile Association" or "AAA".
I imagine such a script would involve declaring that if the term being searched contains one or more spaces or periods, take the first letter of the first word and concatenate it with the letters that follow subsequent spaces or periods.
I have unsuccessfully tried to find scripts for this or articles on enabling this functionality in Salesforce. Any guidance would be appreciated.
Interesting question! I don't think there's a straightforward answer, but as it's standard search functionality and not 100% programming related, you might want to cross-post it to salesforce.stackexchange.com.
Let's start with searchable fields list: https://help.salesforce.com/articleView?id=search_fields_business_accounts.htm&type=0
In Setup there's standard functionality for Synonyms, which is quite easy to use. It's not a silver bullet though: it applies only to certain objects like Knowledge Base (if you use it). Still, it claims to work on Cases too, so if there's "AAA" in the Case description it should still be good enough?
You could also check out the trick with marking a text field as indexed and/or external ID and adding there all your variations / acronyms: https://success.salesforce.com/ideaView?id=08730000000H6m2 This is more work, to prepare / sanitize your data upfront but it's not a bad idea.
A similar idea would be to use Tags, although that could explode in size very quickly. It's ridiculous to create a tag for every single company.
You can do some really smart things in data deduplication rules. Too much to write it all here, so check out the Trailhead module: https://trailhead.salesforce.com/en/modules/sales_admin_duplicate_management/units/sales_admin_duplicate_management_unit_2 No idea if it impacts search though.
If you suffer from bad address data there are State & Country picklists, no more mess with CA / California / SoCal... https://resources.docs.salesforce.com/204/latest/en-us/sfdc/pdf/state_country_picklists_impl_guide.pdf Might not help with Name problem...
Data.com cleanup might help. Paid service I think, no idea if it affects search too. But if enabling it can bring these common abbreviations into your org - might be better than reinventing the wheel.
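If you end up pre-computing the acronyms yourself (for the indexed text field or the synonym list), the rule described in the question, first letter of the first word plus the letters that follow spaces or periods, is easy to sketch. This is plain Python rather than Apex, and `acronym` and `expand_search_terms` are hypothetical names, so treat it as an illustration of the rule and not as Salesforce functionality.

```python
import re


def acronym(term):
    # Split on runs of spaces or periods, as described in the question,
    # and drop the empty pieces left by leading or trailing separators.
    words = [w for w in re.split(r"[ .]+", term) if w]
    if len(words) < 2:
        return None  # a single word has no acronym under this rule
    return "".join(w[0].upper() for w in words)


def expand_search_terms(term):
    # Return the original term plus its acronym, if one can be derived.
    short = acronym(term)
    return [term] if short is None else [term, short]


print(expand_search_terms("American Automobile Association"))
```

The rule is deliberately naive: it keeps stop words such as "of" and "the", so real organization names may need an exception list.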

Is there a way to get all complete sentences that a search engine (e.g. Google) has indexed that contain two search terms?

As the question says: "Is there a way to get all complete sentences that a search engine (e.g. Google) has indexed that contain two search terms?"
I would like to use the (e.g. Google) search syntax: BMW AND Toyota. (<-- this is just an example)
And I would then like to have returned all sentences that mention BMW and Toyota. They must be in a single (ideally: short) sentence though.
Is that possible?
Many thanks!
PS.: Sorry - I have difficulties finding the right tags for my question... Please feel free to suggest more appropriate ones and I will update the question.
PPS.: Let me rephrase my question: If it is not readily possible with an existing search engine, are there any programmatic ways to do that? Would one have to write a crawler for that purpose?
No, this may not be possible, as Google stores this info based on keywords and other algorithms.
For any given keyword or set of keywords, Google must be maintaining a reference to one or many matching (some accurate, some not so accurate) titles.
I do not work for Google, but that could be one way they are maintaining their search results.
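On the programmatic side of the question: if you already have the page text (from your own crawler or a downloaded corpus), filtering for sentences that mention both terms is straightforward. The sketch below does not query Google or any other search engine. `sentences_with_terms` is a hypothetical helper, and the sentence splitter is a naive regex that will stumble on abbreviations such as "e.g.".

```python
import re


def sentences_with_terms(text, term_a, term_b, max_words=None):
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    # Whole-word, case-insensitive matchers for the two terms.
    pattern_a = re.compile(rf"\b{re.escape(term_a)}\b", re.IGNORECASE)
    pattern_b = re.compile(rf"\b{re.escape(term_b)}\b", re.IGNORECASE)
    hits = []
    for sentence in sentences:
        if max_words is not None and len(sentence.split()) > max_words:
            continue  # skip long sentences when a limit is given
        if pattern_a.search(sentence) and pattern_b.search(sentence):
            hits.append(sentence)
    return hits


sample = "BMW and Toyota both build cars. Toyota is Japanese. I drive a BMW!"
print(sentences_with_terms(sample, "BMW", "Toyota"))
```

The optional `max_words` argument covers the "ideally: short" requirement from the question.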

Is it better to combine a search engine and a recommender system or not?

In our project we use a search engine, but the results need to be ranked based on each user's interests, similar to a recommendation driven by the user's keywords.
If we keep the two systems separate, it would cost a lot of time.
Is there a better way to combine a search engine and a recommender system?
Or is there a simple way to customize my ranking strategy to achieve this?
This is what we were trying to do in our project as well. There are two things to balance when solving this problem: relevancy vs. personalization. You should look at how much personalization is hurting the relevancy of the query results. For example, if I'm suggesting news, then it makes sense to suggest based on location. I hope you have already analyzed your use cases.
The approach I followed was to get the search results first and then re-rank them to give personal suggestions. For example, if I was searching for a specific algorithm to code, then getting the result set and re-ranking it on my preference, let's say Java (based on my previous history), makes sense. In any case, relevancy is of utmost importance, and only then do we fit in the user's preferences.
Again, the use case is important: if this was for a news search, then directly querying and retrieving on location is the best way to do it.
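The re-ranking step can be sketched as a weighted blend of the engine's relevance score and a per-user preference score. Everything here is an assumption for illustration: the `(doc_id, relevance, tags)` result shape, the interest scores in [0, 1], and the default weight of 0.3, which keeps relevancy dominant.

```python
def rerank(results, user_interests, weight=0.3):
    # results: list of (doc_id, relevance, tags) as returned by the search engine.
    # user_interests: dict mapping a tag to an interest score in [0, 1].
    def personal_score(tags):
        # Use the user's strongest interest among the document's tags.
        return max((user_interests.get(t, 0.0) for t in tags), default=0.0)

    scored = [
        (doc_id, (1 - weight) * relevance + weight * personal_score(tags))
        for doc_id, relevance, tags in results
    ]
    # Highest blended score first; sorted() is stable, so ties keep engine order.
    return sorted(scored, key=lambda item: item[1], reverse=True)


results = [("a", 0.9, ["python"]), ("b", 0.8, ["java"]), ("c", 0.2, ["java"])]
print(rerank(results, {"java": 1.0}))
```

With these numbers the Java document that was already relevant moves to the top, while the barely relevant Java document stays last, which is the "relevancy first, then preferences" behaviour described above.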

What algorithm does Freebase use to match by name?

I'm trying to build a local version of the Freebase search API using their quad dumps. I'm wondering what algorithm they use to match names? As an example, if you go to freebase.com and type in "Hiking" you get
"Apo Hiking Society"
"Hiking"
"Hiking Georgia"
"Hiking Virginia's national forests"
"Hiking trail"
Wow, a lot of guesses! I hope I don't muddy the waters too much by not guessing too.
The auto-complete box is basically powered by Freebase Suggest which is powered, in turn, by the Freebase Search service. Strings which are indexed by the search service for matching include: 1) the name, 2) all aliases in the given language, 3) link anchor text from the associated Wikipedia articles and 4) identifiers (called keys by Freebase), which includes things like Wikipedia article titles (and redirects).
How the various things are weighted/boosted hasn't been disclosed, but you can get a feel for things by playing with it for a while. As you can see from the API, there's also the ability to do filtering/weighting by types and other criteria and this can come into play depending on the context. For example, if you're adding a record label to an album, topics which are typed as record labels will get a boost relative to things which aren't (but you can still get to things of other types to allow for the use case where your target topic hasn't had the appropriate type applied yet).
So that gives you a little insight into how their service works, but why not build a search service that does what you need since you're starting from scratch anyway?
BTW, pre-Google the Metaweb search implementation was based on top of Lucene, so you could definitely do worse than using that as your starting point. You can read some of the details in the mailing list archive.
Probably they use an inverted index over selected fields, such as the English name, aliases and the Wikipedia snippet displayed. In your application you can achieve that using something like Lucene.
For the algorithm side, I find the following paper a good overview
Zobel and Moffat (2006): "Inverted Files for Text Search Engines".
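To make the inverted-index idea concrete, here is a toy in-memory version over the names from the question. It only illustrates the data structure and is not how Freebase or Lucene actually implement it; the tokenizer and the AND semantics are assumptions.

```python
import re
from collections import defaultdict


def tokenize(text):
    # Lowercase and keep runs of letters and digits.
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(names):
    # Map each token to the set of positions (in `names`) that contain it.
    index = defaultdict(set)
    for i, name in enumerate(names):
        for token in tokenize(name):
            index[token].add(i)
    return index


def search(index, names, query):
    # Return names containing every query token (AND semantics), in input order.
    tokens = tokenize(query)
    if not tokens:
        return []
    ids = set.intersection(*(index.get(t, set()) for t in tokens))
    return [names[i] for i in sorted(ids)]


names = [
    "Apo Hiking Society",
    "Hiking",
    "Hiking Georgia",
    "Hiking Virginia's national forests",
    "Hiking trail",
]
index = build_index(names)
print(search(index, names, "hiking trail"))
```

Ranking (which of the five "Hiking" matches comes first) is a separate step on top of this lookup; the index only tells you which names are candidates.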
Most likely it's a trie with lexicographical order.
There are a number of algorithms available: Boyer-Moore, Smith-Waterman-Gotoh, Knuth-Morris-Pratt, etc. You might also want to check up on edit distance algorithms such as Levenshtein. You will need to play around to see which best suits your purpose.
An implementation of such algorithms is the SimMetrics library by the University of Sheffield.
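For reference, Levenshtein distance is short enough to write by hand. This is the textbook dynamic-programming version, not code taken from SimMetrics, and `closest` is a hypothetical helper that picks the candidate with the smallest distance.

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance, keeping only one previous row.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def closest(query, candidates):
    # Case-insensitive nearest candidate by edit distance.
    return min(candidates, key=lambda c: levenshtein(query.lower(), c.lower()))


print(closest("Hikng trail", ["Hiking", "Hiking Georgia", "Hiking trail"]))
```

Comparing the query against every candidate is fine for small lists; for a full dump you would first narrow the candidates, for example with an inverted index.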

Cross Referencing Databases on Fuzzy Data

I am currently working on a project where I have to match up a large quantity of user-generated names with a separate list of the same names in a canonical format. The problem is that the user-generated names contain numerous misspellings, abbreviations, as well as simply invalid data, making it hard to do a cross-reference with the canonical data. Any suggestions on methods to do this?
This does not have to be done in real-time and in this case accuracy is more important than speed.
Current ideas for this are:
Do a fuzzy search for the user entered name in the canonical database using an existing search implementation like Lucene or Sphinx, which I presume use something like the Levenshtein distance for this.
Cross-reference on the SOUNDEX hash (which is supposedly computed on the sound of the name rather than spelling) instead of using the actual name.
Some combination of the above
Anyone have any feedback on any of these or ideas of their own?
One of my concerns is that none of the above methods will handle abbreviations very well. Can anyone point me in a direction for some machine learning methods to actually search on expanded abbreviations (or tell me I'm crazy)? Thanks in advance.
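For the SOUNDEX idea, a sketch of the standard American Soundex rules is below, in case you want to compute the key outside the database. Names with the same key become candidate matches. As the concern above suggests, it does nothing for abbreviations, since it only encodes the letters that are actually present.

```python
# Consonant groups for American Soundex; vowels, y, h and w get no code.
CODES = {}
for digit, letters in enumerate(["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1):
    for letter in letters:
        CODES[letter] = str(digit)


def soundex(name):
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    previous = CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "hw":
            continue  # h and w do not separate letters with equal codes
        code = CODES.get(c, "")
        if code and code != previous:
            result += code
        previous = code  # a vowel resets `previous`, so a repeat is coded again
    # Pad with zeros and truncate to one letter plus three digits.
    return (result + "000")[:4]


print(soundex("Robert"), soundex("Rupert"))
```

Both names in the example map to the same key, which is the behaviour you would rely on when cross-referencing on the hash.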
First, I'd add to your list the techniques discussed in Peter Norvig's post on spelling correction.
Second, I'd ask what kind of "user-generated names" you're talking about. Having dealt with both, I believe that the heuristics you'd use for street names are somewhat different from the heuristics for person names. (As a simple example, does "Dr" expand to "Drive" or "Doctor"?)
Third, I'd look at a combination using testing to establish the set of coefficients for combining the results of the various techniques.
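As a rough illustration of that third point, the sketch below blends two similarity measures with a single coefficient and picks the coefficient by a small grid search over hand-labeled pairs. The two measures (difflib's ratio and word-set Jaccard overlap) and all function names are assumptions chosen to keep the example self-contained; in practice you would plug in whichever techniques you settle on.

```python
from difflib import SequenceMatcher


def char_similarity(a, b):
    # Character-level similarity in [0, 1] from the standard library.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def token_similarity(a, b):
    # Jaccard overlap of lowercase word sets, in [0, 1].
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def best_match(name, canonical, w_char):
    # Blend the two scores; w_char is the coefficient to be tuned.
    def score(c):
        return w_char * char_similarity(name, c) + (1 - w_char) * token_similarity(name, c)
    return max(canonical, key=score)


def tune_weight(labeled_pairs, canonical, steps=11):
    # Grid search: pick the weight that gets the most labeled pairs right.
    def accuracy(w):
        return sum(best_match(raw, canonical, w) == truth for raw, truth in labeled_pairs)
    weights = [i / (steps - 1) for i in range(steps)]
    return max(weights, key=accuracy)


canonical = ["Main Street", "Maine Avenue"]
labeled = [("Main Stret", "Main Street"), ("Maine Ave", "Maine Avenue")]
weight = tune_weight(labeled, canonical)
print(weight, best_match("Main Stret", canonical, weight))
```

Since accuracy matters more than speed here, the labeled set used for tuning should be kept separate from the data you evaluate on.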
