For a project, search issues whose title includes ".*.", or strictly "operator", or through a regular expression - gitlab

I used to use Bugzilla and its very powerful search engine.
But the project and its bug tracker have been moved to GitLab.
When I search (in the online GitLab) for all issues in the project whose title includes an item like "./", ".*." (Kronecker product), or "//" (one-line comments), no issue is returned, although many matching issues actually exist! I also tried "\.\*\." and other variants, with no more success.
What should be the query syntax to return the right list?
When I query "operator" (with the double quotes, for exact matching), the quotes disappear once the query is validated, and I get a list of issues whose title includes operand, operation, oper, etc. How can I get only the issues that match "operator" exactly?
Is it possible to filter issues with a title matching a regular expression?
All this (and much more) was possible and very useful with Bugzilla. For the time being, I am quite handicapped and lose a lot of time when searching for things in the project on GitLab.
Thanks for any hints.
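One possible workaround, assuming the issue titles have already been fetched by some other means (an export or the issues API, for instance), is to filter them locally. This is only a minimal sketch: the function filter_titles and the sample titles are hypothetical, and it does not change how the GitLab search box itself behaves.

```python
import re


def filter_titles(titles, pattern, literal=False, whole_word=False):
    """Return the titles that match `pattern`, keeping the input order.

    literal=True treats the pattern as plain text, so ".*." is not a regex.
    whole_word=True requires word boundaries around the match.
    """
    expr = re.escape(pattern) if literal else pattern
    if whole_word:
        expr = r"\b(?:" + expr + r")\b"
    regex = re.compile(expr)
    return [title for title in titles if regex.search(title)]


# Hypothetical titles, for illustration only.
titles = [
    "Kronecker product .*. fails on sparse input",
    "operator overloading is undocumented",
    "operand order in operations",
    "// comments break the parser",
]

kron = filter_titles(titles, ".*.", literal=True)           # literal ".*."
exact = filter_titles(titles, "operator", whole_word=True)  # not operand, operations
by_regex = filter_titles(titles, r"^//|\.\*\.")             # arbitrary regular expression
```

Escaping the pattern with re.escape is what makes ".*." match literally instead of matching every title.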

Related

NLP Challenge: Automatically removing bibliography/references?

I recently came across the following problem: when applying a topic model to a bunch of parsed PDF files, I discovered that the content of the references unfortunately also counts for the model, i.e. words within the references appear in the tokenized list of words.
Is there any known "best-practice" to solve this problem?
I thought about a search strategy where the Python code automatically removes all content after the last mention of "references" or "bibliography". If I went by the first, or a random, mention of "references" or "bibliography" within the full text, the parser might not capture the true full content.
The input PDFs are all from different journals and thus have different page structures.
The syntax is what makes a bibliography entry distinct from a regular sentence.
Test for the pattern that coincides with whatever (or multiple) reference styles you are trying to remove.
For example: a date, an unquoted string, a string, and page numbers in a certain format.
I'd spend some time searching for a tool that already recognizes bibliographies before doing this, as the pattern will be unique to each style (MLA etc.).
A couple of additional features to consider for detecting the start of the reference section:
Check if the mention of "references" or "bibliography" is in the last pages as opposed to earlier pages
Run entity recognition on some length of words (~50?) after the word and if a high number of tokens in the 50 are entities, that indicates journal names, author names, etc.
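The "last mention" idea combined with the "last pages" check could be sketched roughly as follows. This is only an illustration under stated assumptions: the heading is assumed to sit alone on its own line, the 0.5 threshold is an arbitrary choice, and the name strip_references is hypothetical.

```python
import re

# A heading-like line: the word alone on its line, optionally followed by a colon.
HEADING = re.compile(
    r"^\s*(references|bibliography)\s*:?\s*$",
    re.IGNORECASE | re.MULTILINE,
)


def strip_references(text, min_position=0.5):
    """Cut the text at the last heading-like 'References'/'Bibliography' line.

    The cut only happens if that heading starts after `min_position`
    (a fraction of the text length), so an early mention such as a
    table of contents entry does not remove the body.
    """
    matches = list(HEADING.finditer(text))
    if not matches:
        return text
    last = matches[-1]
    if last.start() < min_position * len(text):
        return text
    return text[: last.start()].rstrip()


# Hypothetical document, for illustration only.
sample = (
    "Intro\n"
    "See the references in section 2.\n"
    + "Body text. " * 20
    + "\nReferences\n"
    + "Smith, J. (2010). Title. Journal, 1-10.\n"
)
cleaned = strip_references(sample)
```

Inline mentions such as "See the references in section 2." are left alone because they are not alone on their line. PDFs whose parser merges the heading into a paragraph would need a looser pattern.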

Code fragment repository search on github.com

How can I search for code fragments on github.com? When I search for MSG_PREPARE in the repository ErikZalm/Marlin, GitHub shows nothing.
I'm using the repository code search syntax described on https://github.com/search with
repo:ErikZalm/Marlin MSG_PREPARE
No results, but MSG_PREPARE can be found inside this repository here. Am I missing something? Is there no code search on github.com?
At the time of writing this answer, about 8 years after the question was asked, GitHub has come a long way, though still not as far as what you are looking for.
GitHub code searches are subject to the following rules: https://docs.github.com/en/github/searching-for-information-on-github/searching-code . Quoting from that page:
Code in forks is only searchable if the fork has more stars than the parent repository.
Forks with fewer stars than the parent repository are not indexed for code search.
To include forks with more stars than their parent in the search results, you will need to add fork:true or fork:only to your query.
For more information, see "Searching in forks."
So we can search within the fork using the fork:true option. However, as expected, since the repo ErikZalm/Marlin has fewer stars than its parent MarlinFirmware/Marlin, the code in the fork is still not indexed. Hence the advanced search returns nothing useful except a match on the repo itself.
Though, if you perform the same search on the parent, it would show the matches on the code. Here are the matches for MSG_PREPARE in the parent repo MarlinFirmware/Marlin
Fortunately, one company I know of that works in this domain is SourceGraph: https://about.sourcegraph.com/
Hence, you can easily search what you intended with SourceGraph:
Here are the matches for MSG_PREPARE in the ErikZalm/Marlin using SourceGraph Cloud
Update July 2013: "Preview the new Search API"
The GitHub search API on code now supports fragments, through text-match metadata.
Some API consumers will want to highlight the matching search terms when displaying search results. The API offers additional metadata to support this use case. To get this metadata in your search results, specify the text-match media type in your Accept header. For example, via curl, the above query would look like this:
curl -H 'Accept: application/vnd.github.preview.text-match+json' \
'https://api.github.com/search/code?q=octokit+in:file+extension:gemspec+-repo:octokit/octokit.rb&sort=indexed'
This produces the same JSON payload as above, with an extra key called text_matches, an array of objects. These objects provide information such as the position of your search terms within the text, as well as the property that included the search term.
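For illustration, the same request can be prepared from Python and the fragments pulled out of a parsed response. This is a sketch, not an official client. The media type is copied from the 2013 announcement above and newer API versions may expect a different value. The "fragment" key and the sample payload are assumptions about the response shape, and the function names are hypothetical. Nothing is sent over the network here.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Media type as quoted in the 2013 preview announcement (may have changed since).
TEXT_MATCH = "application/vnd.github.preview.text-match+json"


def build_code_search_request(query, sort="indexed"):
    """Prepare (but do not send) a code search request asking for text-match metadata."""
    url = "https://api.github.com/search/code?" + urlencode({"q": query, "sort": sort})
    return Request(url, headers={"Accept": TEXT_MATCH})


def extract_fragments(payload):
    """Collect (path, property, fragment) triples from a parsed JSON response."""
    triples = []
    for item in payload.get("items", []):
        for match in item.get("text_matches", []):
            triples.append((item.get("path"), match.get("property"), match.get("fragment")))
    return triples


req = build_code_search_request("octokit in:file extension:gemspec -repo:octokit/octokit.rb")

# Hypothetical payload, for illustration only.
payload = {
    "items": [
        {
            "path": "example.gemspec",
            "text_matches": [{"property": "content", "fragment": "gem 'octokit'"}],
        }
    ]
}
fragments = extract_fragments(payload)
```

urlencode percent-encodes the colons and the slash in the query, which the server decodes back to the same search string as in the curl example.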
Original answer (November 2012)
I don't think there is anything that you would have missed.
If you search for SdFile, you will find results in .pde files, but none in .cpp files such as this SdFile.cpp file.
The search was introduced 4 years ago (November 2008), but, as mentioned in "Search a github repository for the file defining a given function", GitHub repository code is simply not fully indexed.

Solr behind Drupal returns too many results for specific query

We've got Solr sat behind one of our client's Drupal 7 websites, and while it's working well, it returns too many results for what should be quite specific queries. (It also has relevance/weighting problems; but I'm hoping that solving this problem will remove the - literally - irrelevant results.)
For example, searching for the phrase 'particular phrase in london' should return the node with that as its title, quite high up; I don't even think that any other content should be returned. But I find that it's returning lots of content, purely on the fact that it mentions "London"!
Frivolously, searching for the ridiculous phrase 'piecrusts in london' returns a lot of results too, apparently just because they mention London. No content on the site mentions actual piecrusts.
When I search for 'particular phrase in london', here are the parameters that end up in the catalina.out log on the server (whitespace added for clarity):
{spellcheck=false&facet=true&f.im_field_health_topic.facet.mincount=1
&facet.mincount=1&f.ds_created.facet.date.gap=%2B1YEAR
&spellcheck.q=particular+phrase+in+london
&qf=taxonomy_names^2.0&qf=path_alias^5.0&qf=content^40&qf=label^21.0
&qf=tos_content_extra^1.0&qf=ts_comments^20&qf=tm_vid_3_names^200
&facet.date=ds_created
&f.ds_created.facet.date.start=1970-01-01T00:00:00Z/YEAR
&f.bundle.facet.mincount=1&hl.fl=content,ts_comments
&json.nl=map&wt=json&rows=10&fl=id,entity_id,entity_type,bundle,bundle_name,
label,is_comment_count,ds_created,ds_changed,score,path,url,is_uid,
tos_name,tm_node,zs_entity
&start=0&facet.sort=count&f.bundle.facet.limit=50&q=special+phrase+in+london
&f.ds_created.facet.date.end=2012-01-01T00:00:00Z%2B1YEAR/YEAR
&bf=recip(ms(NOW,ds_created),3.16e-11,1,1)^150.0
&facet.field=im_field_health_topic&facet.field=bundle
&f.im_field_health_topic.facet.limit=50&f.ds_created.facet.limit=50}
hits=1998 status=0 QTime=14
Note that these parameters have been built by Drupal's Apache Solr module; I don't believe we've got any particular custom code of our own that's doing anything to it.
This corresponds to the following URL, if entered directly in the browser:
http://example.com:8081/solr/CLIENT/select?spellcheck=false&facet=true&f.im_field_health_topic.facet.mincount=1&facet.mincount=1&f.ds_created.facet.date.gap=%2B1YEAR&spellcheck.q=particular+phrase+in+London&qf=taxonomy_names^2.0&qf=path_alias^5.0&qf=content^40&qf=label^21.0&qf=tos_content_extra^1.0&qf=ts_comments^20&qf=tm_vid_3_names^200&facet.date=ds_created&f.ds_created.facet.date.start=1970-01-01T00:00:00Z/YEAR&f.bundle.facet.mincount=1&hl.fl=content,ts_comments&json.nl=map&wt=json&rows=10&fl=id,entity_id,entity_type,bundle,bundle_name,label,is_comment_count,ds_created,ds_changed,score,path,url,is_uid,tos_name,tm_node,zs_entity&start=0&facet.sort=count&f.bundle.facet.limit=50&q=particular+phrase+in+London&f.ds_created.facet.date.end=2012-01-01T00:00:00Z%2B1YEAR/YEAR&bf=recip(ms(NOW,ds_created),3.16e-11,1,1)^150.0&facet.field=im_field_health_topic&facet.field=bundle&f.im_field_health_topic.facet.limit=50&f.ds_created.facet.limit=50
This URL returns nearly 2000 results - that's most of the content on the site! I've experimented with removing one query parameter at a time, and the only ones that make any difference seem to be qf and q: if I remove qf, zero results; if I remove q, I get more results back!
I guess there are two questions here:
Is there anything in these parameters that tells Solr "don't worry if 'particular phrase', or 'piecrusts' appears: just collate the results for 'london'" and then order by relevancy? I would add that I think 'in' is mentioned in the stopwords file, so we can probably ignore the effect of that (?)
Or is this something in the (standard Drupal) schema that I need to change?
I appreciate that sometimes search is better for the visitor if it's inclusive; Google does return results even if it doesn't find perfect matches. But, stopwords and stemming aside, the client does require that searches return only results where all words appear in the content.
As mentioned in the post at http://drupal.org/node/1783454, the Apache Solr Search Integration module makes use of the mm ("minimum should match") param, which is more or less configured to affect rankings by how closely the keywords match the dataset. Looking through the docs, there are other ways you can use the parameter to affect rankings as well. The results produced by Apache Solr Search Integration are therefore weighted towards the AND operator, even though it will return more results as you add more keywords. The benefit of this param is that in cases where the user enters keywords that are too restrictive, results will still be returned. Displaying no results is a really quick way to guide people away from your site.
How are you displaying the search?
Maybe you could use Solr Views to limit the search range?
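If the client really does need every word to match, one thing to try is sending mm=100% with the query. The sketch below only builds the query string. It assumes the request is handled by a dismax-style parser (which the qf and bf parameters in the log suggest), and the helper name require_all_terms is hypothetical. How to make the Drupal module send this value is not covered here.

```python
from urllib.parse import urlencode


def require_all_terms(params):
    """Return a copy of the Solr params with mm set so that all query terms must match."""
    strict = dict(params)
    strict["mm"] = "100%"  # minimum should match: every optional clause is required
    return strict


# A trimmed-down version of the logged parameters, for illustration only.
# qf is written as one space-separated value instead of repeated qf entries.
base = {
    "q": "piecrusts in london",
    "qf": "content^40 label^21.0",
    "wt": "json",
    "rows": 10,
}
query_string = urlencode(require_all_terms(base))
```

With this setting, a document mentioning only "london" would no longer match the piecrusts query (stopword handling aside), at the cost of returning nothing when the keywords are too restrictive.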
http://drupal.org/project/apachesolr_views
thanks
Nick

API returning crazy results for 2 word searches

Starting fairly recently, the API has started returning crazy results for 2 word searches. For example:
https://api.foursquare.com/v2/venues/search?oauth_token=XXX&query=local_edition&radius=35000&ll=37.8%2C-122.4&limit=20&intent=browse
Will only return things matching 'ion' it seems. If I search for either 'local' or 'edition', the intended location is one of the first few results.
Is it time to stop replacing spaces with underscores? For a while, that was the only way to get reasonable results when searching for multiple words. (see this thread for more information: What's the best way to tune my Foursquare API search queries?)
I'm not sure why you're getting results for "ion", but if you replace the underscore with a plus sign or %20, it seems to work fine for me:
https://developer.foursquare.com/docs/explore#req=venues/search%3Fquery%3Dlocal+edition%26radius%3D35000%26ll%3D37.8%252C-122.4%26limit%3D20%26intent%3Dbrowse
https://developer.foursquare.com/docs/explore#req=venues/search%3Fquery%3Dlocal%2520edition%26radius%3D35000%26ll%3D37.8%252C-122.4%26limit%3D20%26intent%3Dbrowse
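As a small illustration of the two encodings, Python's standard library can produce both forms of the query string. The parameter values are taken from the question, with the oauth_token left out. This is only a sketch of the encoding step and does not call the API.

```python
from urllib.parse import quote, urlencode

params = {
    "query": "local edition",  # a real space, not an underscore
    "radius": 35000,
    "ll": "37.8,-122.4",
    "limit": 20,
    "intent": "browse",
}

# Default form encoding: spaces become "+".
plus_style = urlencode(params)

# Percent encoding: spaces become "%20".
percent_style = urlencode(params, quote_via=quote)
```

Either string can be appended after "venues/search?". The comma in ll is encoded as %2C in both cases, as in the original request.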

How to find a match within a single term using Lucene

I am using the Lucene search engine but it only seems to find matches that occur at the beginning of terms.
For example:
Searching for "one" would match "onematch" or "one day a time" but not "loneranger".
The Lucene doc says it doesn't support wildcards at the front of a search string, so I am not sure whether Lucene can even find matches inside a term, or whether it can only match terms that start with the search string.
Is this a problem with how I have created my index, how I am building my search query or just a limitation of Lucene?
Found some info in another post here on Stack Overflow: "[LUCENE.NET] Leading wildcard throws an error"
You can call SetAllowLeadingWildcard(true) on your QueryParser to allow leading wildcards during your search. This will of course have an obvious, large performance impact, but it will allow users to find matches within a term.
Lucene will find a document if the search term appears anywhere within it, but by default it doesn't allow you to do wildcard queries where the wildcard is at the front of the search term, because they perform horribly. If that is functionality you care about, you will either have to change a config flag (thanks for the interesting link) rather than do low-level Lucene hacking, find a third-party library that has already done that work, or find a different search implementation (for small enough datasets, the built-in search from a lot of RDBMS engines is sufficient).
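To see why a leading wildcard is so much more expensive, here is a toy model of a sorted term dictionary. It is a simplification for illustration and not Lucene's actual data structure. The function names are hypothetical and the terms come from the question.

```python
from bisect import bisect_left


def prefix_matches(sorted_terms, prefix):
    """Terms starting with `prefix`: binary search to the first candidate, then a short scan."""
    start = bisect_left(sorted_terms, prefix)
    found = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break  # sorted order guarantees no later term can match
        found.append(term)
    return found


def infix_matches(sorted_terms, fragment):
    """Terms containing `fragment` anywhere: every term has to be inspected."""
    return [term for term in sorted_terms if fragment in term]


terms = sorted(["day", "loneranger", "one", "onematch", "time"])

starts_with_one = prefix_matches(terms, "one")  # like the query one*
contains_one = infix_matches(terms, "one")      # like the query *one*
```

The prefix lookup touches only the neighbourhood of "one" in the sorted list, while the infix lookup walks the whole dictionary. On a large index that full scan is the performance impact mentioned above.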
Your query should be
Query query = new WildcardQuery(new Term("contents", "*one*"));
where contents is the name of the field you are searching.
The term "one" must be enclosed in asterisks, with no spaces anywhere inside the pattern.
