Searching with Lucene with stemming enabled - search

Suppose I store a set of strings (each document in Lucene would be a single word). Given an input word W, I would like to retrieve not only the documents that match W exactly but also the documents whose stemmed version matches W.
Conversely, when I input a word W, I also want to handle the case where a document matches the stemmed version of W.
Would writing my own custom analyzer that returns a PorterStemFilter suffice? Do I just need to write this class and reference it as the analyzer in the code?

Writing a custom Analyzer that has a stemmer in the analyzer chain should suffice.
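Here is sample code that uses PorterStemFilter in Lucene 4.1:
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseTokenizer;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.util.Version;

class MyAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        // Lowercase and split on non-letters, then stem each token.
        Tokenizer source = new LowerCaseTokenizer(Version.LUCENE_41, reader);
        return new TokenStreamComponents(source, new PorterStemFilter(source));
    }
}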
Please note that you MUST use the same custom Analyzer at query time that you used at index time; otherwise the stemmed terms in the index will not line up with the query terms.
You can find sample code for your version of Lucene in the corresponding PorterStemFilter documentation.
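To see why the same analysis has to run on both sides, here is a minimal plain-Java sketch that does not use Lucene at all. The class name StemmedIndexSketch and the toyStem method are made up for illustration, and toyStem is a crude suffix stripper, not the Porter algorithm. The point is only that the index stores stemmed keys, so a query finds its documents only if it is stemmed the same way.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class StemmedIndexSketch {
    // Hypothetical toy stemmer: strips one common suffix. NOT the Porter algorithm.
    static String toyStem(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        for (String suffix : new String[] {"ing", "ed", "s"}) {
            if (w.endsWith(suffix) && w.length() - suffix.length() >= 3) {
                return w.substring(0, w.length() - suffix.length());
            }
        }
        return w;
    }

    // Maps a stemmed term to the original words that produced it.
    private final Map<String, List<String>> index = new HashMap<>();

    // "Index time": the stored key is the stemmed form of the word.
    void add(String word) {
        index.computeIfAbsent(toyStem(word), k -> new ArrayList<>()).add(word);
    }

    // "Query time": the query goes through the same stemmer before lookup.
    List<String> search(String query) {
        return index.getOrDefault(toyStem(query), Collections.emptyList());
    }
}
```

With "walking", "walked" and "walks" indexed, searching for "walk" or "walked" returns all three, because every form is reduced to the same key on both sides. If search() skipped toyStem, a query for "walked" would find nothing.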

Related

Azure Cosmos DB SQL API - read/query document with unknown data structure

I have a Cosmos DB with a container that contains documents with varying structure.
I am using the Java SQL API for reading the documents from this container.
The issue I am having is that the API methods for querying/reading the container expect a model class as an input parameter and return instances of that model class. Because my container contains documents that vary in both fields and depth, I cannot create a single model class to represent them.
I need to be able to read/query the documents and then parse them myself to extract the values I am looking for.
Any ideas? I have passed "Object" to the API methods (e.g. queryItems), which then return a LinkedHashMap that I can parse myself. Is this the way to do it? It looks a bit "raw", but I have not found a better way.
Below is a typical example from the SDK docs. I cannot create a "Family" model class in my code, because the structure can vary from document to document: both which fields are stored and how deeply they are nested.
private void queryItems() {
    CosmosQueryRequestOptions queryOptions = new CosmosQueryRequestOptions();
    queryOptions.setQueryMetricsEnabled(true);
    CosmosPagedIterable<Family> familiesPagedIterable = container.queryItems(
        "SELECT * FROM Family WHERE Family.lastName IN ('Andersen', 'Wakefield', 'Johnson')", queryOptions, Family.class);
    familiesPagedIterable.iterableByPage(10).forEach(cosmosItemPropertiesFeedResponse -> {
        logger.info("Got a page of query result with {} items(s) and request charge of {}",
            cosmosItemPropertiesFeedResponse.getResults().size(), cosmosItemPropertiesFeedResponse.getRequestCharge());
        logger.info("Item Ids {}", cosmosItemPropertiesFeedResponse
            .getResults()
            .stream()
            .map(Family::getId)
            .collect(Collectors.toList()));
    });
}
As I understand it, this is determined by the SDK function's input parameters and return type. The sample code for both Java and Spring depends on a data model, so using Object in your code is a reasonable choice given how much your documents vary.
It's true that no single data model can cover every property in your documents, but it may still be worth defining a model that contains only the properties you actually need. If a query returns a property you never use, the query model can simply leave it out.
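If you stay with Object, the "raw" result can be made easier to work with by a small path helper. The following is only a sketch in plain Java: the class name RawItemSketch, the valueAt method and the dotted path syntax are all made up here and are not part of the Cosmos SDK. It assumes nested objects arrive as Map instances (such as the LinkedHashMap mentioned in the question) and nested arrays as List instances.

```java
import java.util.List;
import java.util.Map;

class RawItemSketch {
    // Walks nested Maps and Lists along a dotted path such as
    // "address.city" or "children.0.name". Returns null if any step is missing.
    static Object valueAt(Object node, String path) {
        Object current = node;
        for (String key : path.split("\\.")) {
            if (current instanceof Map) {
                current = ((Map<?, ?>) current).get(key);
            } else if (current instanceof List) {
                List<?> list = (List<?>) current;
                int i;
                try {
                    i = Integer.parseInt(key);
                } catch (NumberFormatException e) {
                    return null; // a non-numeric key cannot index into a list
                }
                current = (i >= 0 && i < list.size()) ? list.get(i) : null;
            } else {
                return null; // hit a scalar or null before the path ended
            }
        }
        return current;
    }
}
```

Each item returned by the query could then be passed to valueAt(item, "address.city") without knowing the full document shape up front; a missing field simply yields null.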
I think I found the proper solution:
Create a model class and declare the members with unknown depth and structure as JsonNode.
The model class can then be used with the API as usual, and the values inside each JsonNode can be read through JsonNode's accessor methods.

Create docx file from a template file in java

I need to create docx files based on a template.
The template should contain placeholders, and I should be able to fill the placeholders from Java.
Is this possible? If so, please suggest a good and efficient way to do it.
A little late for the original question, but if anyone else needs to dynamically create docx documents from templates, you might want to have a look at the DocxStamper Java library, which I created on top of docx4j.
It lets you use the Spring Expression Language in docx templates, and you can create a document from a template with a couple of lines like this:
MyData data = ...;           // your own POJO containing the data
InputStream template = ...;  // InputStream to the template file
OutputStream out = ...;      // OutputStream to the resulting document
DocxStamper stamper = new DocxStamperConfiguration()
    .build();
stamper.stamp(template, data, out);
out.close();
As discussed elsewhere before, there are 3 basic approaches:
BEST: content control data binding
cheap/cheerful: Variable replacement (ie magic strings on the document surface), but brittle (the split run problem)
LEGACY: MERGEFIELD with or without other field codes.
Docx4j supports all three approaches, but we generally recommend content control databinding, since it aligns with Microsoft's direction (as best can be ascertained), and is most powerful.
You'll want to consider the technical skills of your template authors.
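The split run problem mentioned for variable replacement is easy to show without any docx library. In the sketch below a paragraph is modelled as a plain list of strings, one per run; the class SplitRunSketch, its methods and the ${name} placeholder syntax are all made up for illustration and are not docx4j API. Word is free to break text into runs wherever it likes, so a placeholder can end up spread over two runs.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class SplitRunSketch {
    // Naive approach: search and replace inside each run separately.
    // A placeholder that straddles two runs is never seen as a whole.
    static List<String> replacePerRun(List<String> runs, String placeholder, String value) {
        List<String> out = new ArrayList<>();
        for (String run : runs) {
            out.add(run.replace(placeholder, value));
        }
        return out;
    }

    // Workaround: join the paragraph's runs, replace, and emit a single run.
    // This finds the placeholder but throws away per-run formatting.
    static List<String> replaceAcrossRuns(List<String> runs, String placeholder, String value) {
        String joined = String.join("", runs);
        return Collections.singletonList(joined.replace(placeholder, value));
    }
}
```

For the runs "Dear ${na" and "me}," the per-run version leaves the text untouched, while the joined version produces "Dear Ann," in one run. That loss of formatting is one reason a carefully crafted template, or content control data binding, tends to be more robust.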
See https://github.com/centic9/poi-mail-merge for a simple "Variable replacement" method. It does not work if one replacement-string has multiple formats applied, but does work well for simple cases where the template is carefully crafted.
Basically it reads the template and data from CSV or an Excel file and then merges it into multiple result files, one for each line of data.
It works on the DOCX XML content, so is not fully using Apache POI XWPF support, but this way formatting and other things from the template are used as expected without the need for full support for everything in Apache POI (which has DOCX support still as part of the "scratchpad" component as support is not considered fully done yet).
You can use a Word template with the following LINQ Reporting syntax to achieve this with Aspose.Words for Java.
<< tag_name [expression] -switch1 -switch2 ...>>
A tag body typically consists of the following elements:
A tag name
An expression surrounded by brackets
A set of switches available for the tag, each of which is preceded by the "-" character
Assume that you have the Sender class defined in your application as follows:
public class Sender {
    public Sender(String name, String message) {
        _name = name;
        _message = message;
    }
    public String getName() {
        return _name;
    }
    public String getMessage() {
        return _message;
    }
    private String _name;
    private String _message;
}
To produce a report containing a message of a concrete sender on its behalf, you can use a template document with the following content.
<<[s.getName()]>> says: "<<[s.getMessage()]>>."
To build a report from the template, you can use the following source code.
Document doc = new Document(getMyDir() + "temp_HelloWorld.docx");
Sender sender = new Sender("LINQ Reporting Engine", "Hello World");
ReportingEngine engine = new ReportingEngine();
engine.buildReport(doc, sender, "s");
doc.save(getMyDir() + "out.docx");
I work with Aspose as Developer evangelist.

Solr plugin capabilities

Currently I'm facing a problem with multi-word synonyms in Solr, so I came up with the following solution.
Steps:
1. A Solr plugin intercepts the search keyword.
2. The plugin gets the list of acronyms, synonyms, etc. from a database table.
3. The plugin compares the search keyword, one entry at a time, against the synonym list retrieved in step 2.
4. If a match exists, the search keyword is converted into the synonym.
5. Depending on the result, the plugin decides which field type/filter/tokenizer to pass as parameters in step 6.
6. The plugin returns (keyword, which field to search, which analyzer to use) for Solr to search.
The questions:
Can a plugin intercept the search keyword so that it can be processed inside the plugin?
Can I access a database and fetch records directly from within a Solr plugin?
Can a plugin tell Solr what to search for, which field to search and which filter/tokenizer to use? Or can the plugin run the search itself and return the result?
Thank you.
You might want to take a look at Nolan Lawson's multi-word synonym solution.
https://github.com/healthonnet/hon-lucene-synonyms
Yes. Here is an example of how to do that by writing your own SearchComponent.
In solrconfig.xml:
<requestHandler name="/myHandler" class="solr.SearchHandler">
    <arr name="first-components">
        <str>myComponent</str>
    </arr>
</requestHandler>
<searchComponent name="myComponent" class="com.xyz.MyComponent" />
And the component class:
public class MyComponent extends SearchComponent {
    ....
    @Override
    public void prepare(ResponseBuilder rb) throws IOException {
        String originalQuery = rb.getQueryString(); // get the original query string
        // access the DB and get records here
        // then construct a new query string and set it on rb
        String newQueryString = originalQuery; // placeholder: replace with your rewritten query
        rb.setQueryString(newQueryString);
    }
}
Use "/myHandler" instead of "/select" for getting the results.
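The "construct a new query string" step inside prepare() could look something like the sketch below. It is plain Java with no Solr or database code: the class SynonymRewriteSketch and its rewrite method are made up for illustration, and the synonym map is assumed to have been loaded already (for example from the database table described in the question). It tries the longest phrase first at each position, so multi-word entries win over single words.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class SynonymRewriteSketch {
    // Replaces the longest matching phrase at each position with its mapped form.
    // Keys of the synonym map are expected to be lowercase, single-space separated.
    static String rewrite(String query, Map<String, String> synonyms, int maxPhraseWords) {
        String[] words = query.trim().toLowerCase(Locale.ROOT).split("\\s+");
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < words.length) {
            boolean matched = false;
            // Try the longest candidate phrase first, then shorter ones.
            for (int len = Math.min(maxPhraseWords, words.length - i); len >= 1; len--) {
                String phrase = String.join(" ", Arrays.copyOfRange(words, i, i + len));
                String replacement = synonyms.get(phrase);
                if (replacement != null) {
                    out.add(replacement);
                    i += len;
                    matched = true;
                    break;
                }
            }
            if (!matched) {
                out.add(words[i]); // no synonym starts here, keep the word as is
                i++;
            }
        }
        return String.join(" ", out);
    }
}
```

The returned string is what you would pass to rb.setQueryString(...). Note that this sketch lowercases the whole query and ignores query syntax such as quotes or field prefixes, which a real component would have to respect.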

Return Only Certain Fields From Lucene Search

I'm using Lucene to search an index and it works fine. My only issue is that I only need one particular field of what is being returned. Can you specify to Lucene to only return a certain field in the results and not the entire document?
This is why the FieldSelector class exists.
You can implement a class like this:
class MyFieldSelector : FieldSelector
{
    public FieldSelectorResult Accept(string fieldName)
    {
        if (fieldName == "field1") return FieldSelectorResult.LOAD_AND_BREAK;
        return FieldSelectorResult.NO_LOAD;
    }
}
and use it as indexReader.Document(docid, new MyFieldSelector());
If you only need a small field, this avoids loading the large fields, which in turn speeds up document loading. You can find much more detailed information with a bit of searching.
What do you mean "return certain fields"? The Document.get() function returns just the field you request.
Yes, you can definitely do what you are asking. All you have to do is include the field name (case-sensitive) in the document.get() method.
string fieldNameText = doc.Get("fieldName");
FYI, it's usually a good idea to include some code in your questions. It makes it easier to provide a good answer.

How to compare 2 xsd schema files for equivalent functionality

I would like to compare two XSD schemas, A and B, to determine whether all instance documents valid against schema A would also be valid against schema B. I hope to use this to prove that even though schemas A and B are "different", they are effectively the same. An example of a difference that should not be flagged: schema A uses named types while schema B declares all of its elements inline.
I have found lots of people talking about "smart" diff tools, but these would report the two files as different because the text differs, even though the resulting structure is the same. I found some references to XSOM, but I'm not sure whether that will help.
Any thoughts on how to proceed?
You can use Membrane SOA Model, a Java API for WSDL and XML Schema:
package sample.schema;

import java.util.List;
import com.predic8.schema.Schema;
import com.predic8.schema.SchemaParser;
import com.predic8.schema.diff.SchemaDiffGenerator;
import com.predic8.soamodel.Difference;

public class CompareSchema {
    public static void main(String[] args) {
        compare();
    }

    private static void compare() {
        SchemaParser parser = new SchemaParser();
        Schema schema1 = parser.parse("resources/diff/1/common.xsd");
        Schema schema2 = parser.parse("resources/diff/2/common.xsd");
        SchemaDiffGenerator diffGen = new SchemaDiffGenerator(schema1, schema2);
        List<Difference> lst = diffGen.compare();
        for (Difference diff : lst) {
            dumpDiff(diff, "");
        }
    }

    private static void dumpDiff(Difference diff, String level) {
        System.out.println(level + diff.getDescription());
        for (Difference localDiff : diff.getDiffs()) {
            dumpDiff(localDiff, level + " ");
        }
    }
}
After executing it, you get output like the following: a list of the differences between the two schema documents.
ComplexType PersonType has changed: Sequence has changed:
Element id has changed:
The type of element id has changed from xsd:string to tns:IdentifierType.
http://www.service-repository.com/ offers an online XML Schema Version Comparator tool that displays a report of the differences between two XSDs; the report appears to be produced with the Membrane SOA Model.
My approach to this was to canonicalize the representation of the XML Schema.
Unfortunately, I can also tell you that, unlike canonicalization of XML documents (used, as an example, to calculate a digital signature), it is not that simple or even standardized.
So basically, you have to transform both XML Schemas to a "canonical form" - whatever the tool you build or use thinks that form is, and then do the compare.
My approach was to create an XML Schema set (could be more than one file if you have more namespaces) for each root element I needed, since I found it easier to compare XSDs authored using the Russian Doll style, starting from the PSVI model.
I then used options such as auto matching substitution group members coupled with replacement of substitution groups with a choice; removal of "superfluous" XML Schema sequences, collapsing of single option choices or moving minOccurs/maxOccurs around for single item compositors, etc.
Depending on what your XSD-aware comparison tool's features are, or you settle to build, you might also have to rearrange particles under compositors such as xsd:choice or xsd:all; etc.
Anyway, what I learned after all of it was that it is extremely hard to build a tool that works nicely for all the "cool" XSD features out there... One test case I remember fondly was dealing with various kinds of xsd:any content.
I do wonder though if things have changed since...
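Short of real canonicalization, a cheap sanity check is to validate a set of sample instance documents against both schemas and see whether the verdicts agree. This is a different technique from the one described above: it can disprove equivalence with a single counterexample, but it can never prove it. The sketch uses only the JDK's javax.xml.validation API; the class SchemaSpotCheck, the two tiny schemas and the sample documents are made up for illustration. Schema A uses a named type and schema B declares the same structure inline, as in the question.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

class SchemaSpotCheck {
    // Hypothetical schema A: the element refers to a named complex type.
    static final String SCHEMA_A =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
        + "<xs:element name='person' type='PersonType'/>"
        + "<xs:complexType name='PersonType'><xs:sequence>"
        + "<xs:element name='id' type='xs:string'/>"
        + "</xs:sequence></xs:complexType></xs:schema>";

    // Hypothetical schema B: the same structure declared inline.
    static final String SCHEMA_B =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
        + "<xs:element name='person'><xs:complexType><xs:sequence>"
        + "<xs:element name='id' type='xs:string'/>"
        + "</xs:sequence></xs:complexType></xs:element></xs:schema>";

    // Returns true if xml validates against xsd. Note that a schema which
    // fails to compile also yields false here; a real tool should separate the two.
    static boolean isValid(String xsd, String xml) {
        try {
            SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(new StreamSource(new StringReader(xsd)));
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException | IOException e) {
            return false;
        }
    }

    // True if both schemas give the same verdict for this sample document.
    static boolean agree(String xml) {
        return isValid(SCHEMA_A, xml) == isValid(SCHEMA_B, xml);
    }
}
```

Running agree(...) over a corpus of real and deliberately broken instance documents gives some confidence that the two schemas behave alike, and any document where it returns false is a concrete difference worth investigating.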
