MarkLogic Search on different values - search

I am working on search in MarkLogic. My requirement is that a search for "NAVIN RAWAT" should return results for:
o N Rawat
o Navin Rawat
o Navin R
o Navin Raw
I am planning to use search:search to get these results, but I don't see how to match all of them, since they differ from each other. Could anybody help or suggest how to get these results using MarkLogic search:search or cts:search?

I think for requirements like this, you'll need to code up some search expansion. You could build a custom constraint for the Search API to do it. As a string query, you could then do something like
expand:"Navin Rawat"
A structured query would look different but convey the same thing. Next step is to do the actual expansion. It's not clear what rules you have in mind -- for the last name, is it any number of the starting letters, or is there a reason you didn't include "Navin Ra"? I'll assume you want any number of letters.
You could build a function that looks like this to provide options for the last name:
declare function local:choices($first, $last)
{
for $i in (1 to fn:string-length($last))
return
fn:substring($last, 1, $i) ! ($first || " " || .)
};
local:choices("Navin", "Rawat")
=>
Navin R
Navin Ra
Navin Raw
Navin Rawa
Navin Rawat
With that done, your parse function could return a cts:word-query() with that sequence of strings. Throw in something for your "N Rawat" case, and you're set.
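For readers who don't use XQuery, the same expansion can be sketched in Python. This is only an illustration of the rule assumed above (any non-empty prefix of the last name, plus the first initial with the full last name); the function names `choices` and `expand` are made up for this sketch and are not part of any MarkLogic API.

```python
def choices(first, last):
    """Return "first + prefix of last" for every non-empty prefix of last."""
    return [f"{first} {last[:i]}" for i in range(1, len(last) + 1)]


def expand(first, last):
    """Prefix variants plus the assumed "initial + full last name" form."""
    return [f"{first[0]} {last}"] + choices(first, last)


if __name__ == "__main__":
    for phrase in expand("Navin", "Rawat"):
        print(phrase)
```

The resulting list of phrases is what the parse function would hand to cts:word-query().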

String matching without using builtin functions

I want to search for a query (a string) in a subject (another string).
The query may appear in whole or in parts, but will not be rearranged. For instance, if the query is 'da', and the subject is 'dura', it is still a match.
I am not allowed to use string functions like strfind or find.
The constraints make this actually quite straightforward with a single loop. Imagine you have two indices initially pointing at the first character of both strings, now compare them - if they don't match, increment the subject index and try again. If they do, increment both. If you've reached the end of the query at that point, you've found it. The actual implementation should be simple enough, and I don't want to do all the work for you ;)
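As a rough sketch of that two-index idea, here it is in Python rather than MATLAB, so it is not a drop-in solution for the original exercise. The name `is_subsequence` is made up for illustration.

```python
def is_subsequence(query, subject):
    """True if the chars of query appear in subject in order, gaps allowed."""
    qi = 0  # index of the next query char we still need to find
    for ch in subject:
        if qi == len(query):
            break  # every query char has already been found
        if ch == query[qi]:
            qi += 1  # matched: advance the query index as well
    return qi == len(query)


if __name__ == "__main__":
    print(is_subsequence("da", "dura"))
    print(is_subsequence("ad", "dura"))
```

Only a single pass over the subject is needed, so the running time is linear in its length.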
If this is homework, I suggest you look at the explanation which precedes the code and then try for yourself, before looking at the actual code.
The code below looks for all occurrences of chars of the query string within the subject string (variables m; and related ii, jj). It then tests all possible orders of those occurrences (variable test). An order is "acceptable" if it contains all desired chars (cond1) in increasing positions (cond2). The result (variable result) is affirmative if there is at least one acceptable order.
subject = 'this is a test string';
query = 'ten';
m = bsxfun(@eq, subject.', query);
% m: test if each char of query equals each char of subject
[ii jj] = find(m);
jj = jj.'; % jj: which char of query is found within subject...
ii = ii.'; % ii: ... and at which position of subject
test = nchoosek(1:numel(jj),numel(query)).'; % test all possible orders
cond1 = all(jj(test) == repmat((1:numel(query)).',1,size(test,2)));
% cond1: for each order, are all chars of query found in subject?
cond2 = all(diff(ii(test))>0);
% cond2: for each order, are the found chars in increasing positions?
result = any(cond1 & cond2); % final result: 1 or 0
The code could be improved by building test more cleverly, i.e. by not testing all possible orders given by nchoosek.
Matlab allows you to view the source of built-in functions, so you could always try reading the code to see how the Matlab developers did it (although it will probably be very complex). (thanks Luis for the correction)
Finding a string in another string is a basic computer science problem. You can read up on it in any number of resources, such as Wikipedia.
Your requirement of non-rearranging partial matches recalls the bioinformatics problem of mapping splice variants to a genomic sequence.
You may solve your problem by using a sequence alignment algorithm such as Smith-Waterman, modified to work with all English characters and not just DNA bases.
Is this question actually from bioinformatics? If so, you should tag it as such.

saving and retrieving string data in matlab

Hi, can anyone help me with handling strings in MATLAB? For example, given the string
A = 'A good looking boy'
how can we store the individual words in an array and later retrieve any of them?
As found here, you could use
>> A = 'A good looking boy';
>> C = regexp(A,'[A-Za-z]+', 'match')
C =
'A' 'good' 'looking' 'boy'
so that
>> C{1}
ans =
A
>> C{4}
ans =
boy
>> [C{:}]
ans =
Agoodlookingboy
The most intuitive way would be to use strsplit:
C = strsplit(A,' ')
However, it is not available in my version, so I suppose it is only a built-in function in MATLAB R2013a and above. You can find the documentation here.
If you are using an older version of MATLAB, you can also get this File Exchange solution, which does basically the same thing.
You can also use the simple function textscan for that:
C = textscan(A,'%s');
C is a 1-by-1 cell array whose single element is a cell array of the words, so the first word is C{1}{1}. This function has been in MATLAB at least since R14.

Lua - How to find a substring with 1 or 2 characters discrepancy

Say I have a string
local a = "Hello universe"
I find the substring "universe" by
a:find("universe")
Now, suppose the string is
local a = "un#verse"
The string to be searched is universe; but the substring differs by a single character.
So obviously Lua ignores it.
How do I make the function find the string even if there is a discrepancy by a single character?
If you know where the character would be, use . instead of that character: a:find("un.verse")
However, it looks like you're looking for a fuzzy string search. That is outside the scope of Lua's string library. You may want to start with this article: http://ntz-develop.blogspot.com/2011/03/fuzzy-string-search.html
As for Lua fuzzy search implementations, I haven't used any, but googling "lua fuzzy search" gives a few results. Some are based on this paper: http://web.archive.org/web/20070518080535/http://www.heise.de/ct/english/97/04/386/
Try https://github.com/ajsher/luafuzzy.
It sounds like you want something along the lines of TRE:
TRE is a lightweight, robust, and efficient POSIX compliant regexp matching library with some exciting features such as approximate (fuzzy) matching.
Approximate pattern matching allows matches to be approximate, that is, allows the matches to be close to the searched pattern under some measure of closeness. TRE uses the edit-distance measure (also known as the Levenshtein distance) where characters can be inserted, deleted, or substituted in the searched text in order to get an exact match. Each insertion, deletion, or substitution adds the distance, or cost, of the match. TRE can report the matches which have a cost lower than some given threshold value. TRE can also be used to search for matches with the lowest cost.
A Lua binding for it is available as part of lrexlib.
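To make the edit-distance idea concrete, here is a minimal Python sketch of approximate substring search by dynamic programming. It is not TRE's implementation or API, just the textbook recurrence with a free starting position in the text; the name `best_fuzzy_match` and its return convention are assumptions made for this example.

```python
def best_fuzzy_match(pattern, text):
    """Return (cost, end): the lowest edit distance between pattern and any
    substring of text, and the number of text chars consumed at that match."""
    m = len(pattern)
    # col[i] = cheapest way to match pattern[:i] against a substring of the
    # text ending at the current position; before any text, that costs i.
    col = list(range(m + 1))
    best_cost, best_end = col[m], 0
    for j, ch in enumerate(text, start=1):
        new = [0] * (m + 1)  # new[0] = 0: a match may start at any position
        for i in range(1, m + 1):
            new[i] = min(
                col[i - 1] + (pattern[i - 1] != ch),  # substitute or match
                col[i] + 1,                           # extra char in text
                new[i - 1] + 1,                       # pattern char missing
            )
        col = new
        if col[m] < best_cost:
            best_cost, best_end = col[m], j
    return best_cost, best_end


if __name__ == "__main__":
    print(best_fuzzy_match("universe", "Hello un#verse"))
```

A caller would accept the match when the returned cost is at or below a threshold (1 or 2 for the question above).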
If you are really looking for a single character difference and do not care about performance, here is a simple approach that should work:
local a = "Hello un#verse"
local myfind = function(s,p)
  local withdot = function(n)
    return p:sub(1,n-1) .. '.' .. p:sub(n+1)
  end
  local a,b
  for i=1,#p do -- try a wildcard at each position of the pattern
    a,b = s:find(withdot(i))
    if a then return a,b end
  end
end
print(myfind(a,"universe"))
A simple roll your own approach (based on the assumption that the pattern keeps the same length):
function hammingdistance(a,b)
  local ta={a:byte(1,-1)}
  local tb={b:byte(1,-1)}
  local res = 0
  for k=1,#a do
    if ta[k]~=tb[k] then
      res=res+1
    end
  end
  print(a,b,res) -- debugging/demonstration print
  return res
end
function fuz(s,pat)
  local best_match=10000
  local best_location
  for k=1,#s-#pat+1 do
    local cur_diff=hammingdistance(s:sub(k,k+#pat-1),pat)
    if cur_diff < best_match then
      best_location = k
      best_match = cur_diff
    end
  end
  local start,ending = math.max(1,best_location),math.min(best_location+#pat-1,#s)
  return start,ending,s:sub(start,ending)
end
s=[[Hello, Universe! UnIvErSe]]
print(fuz(s,'universe'))
Disclaimer: not recommended, just for fun:
If you want a better syntax (and you don't mind messing with standard type's metatables) you could use this:
getmetatable('').__sub=hammingdistance
a='Hello'
b='hello'
print(a-b)
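But note that a-b does not necessarily equal b-a this way: hammingdistance only loops over the length of its first argument, so the two results differ when the strings have different lengths.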

Lua string.match problem?

how can I match following strings with one expression?
local a = "[a 1.001523] <1.7 | [...]> < a123 > < ? 0 ?>";
local b = "[b 2.68] <..>";
local c = "[b 2.68] <>";
local d = "[b 2.68] <> < > < ?>";
local name, netTime, argument1, argument2, argumentX = string:match(?);
-- (string is a or b or c or d)
The problem is, the strings can have various counts of arguments( "<...>" ) and the arguments can have numbers, chars, special chars or spaces in it.
I'm new to Lua and I have to learn string matching, but I cannot learn this in a few hours. I ask YOU, because I need the result tomorrow and I really would appreciate your help!
cheers :)
Lua patterns are very limited: they have no alternation and no optional groups. That means all of your arguments would have to be matched with the same expression, and a single pattern could only handle a fixed number of arguments. Check this tutorial; it doesn't take long to get used to Lua patterns.
You might still be able to parse those strings using multiple patterns. ^%[(%a+)%s(%d+%.%d+)%]%s is the best you can do to get the first part, assuming the name can consist of multiple upper- and lower-case letters. To match the arguments, run multiple patterns on parts of the input, like <%s*> or <(%w+)>, to check each argument individually.
Alternatively get a regex library or a parser, which would be much more useful here.
Lua patterns are indeed limited, but you can get around that if you can make some assumptions. For example, if there are no >'s in the arguments, you can just loop over all matching pairs of <>:
local a = "[a 1.001523] <1.7 | [...]> < a123 > < ? 0 ?>"
local b = "[b 2.68] <..>"
local c = "[b 2.68] <>"
local d = "[b 2.68] <> < > < ?>"
local unpack = table.unpack or unpack -- unpack moved to table.unpack in Lua 5.2+
function parse(str)
  local name,nettime,lastPos = str:match'%[(%a+)%s(%d+%.%d+)%]()'
  local arguments={}
  -- start looking for arguments only after the initial part in [ ]
  for argument in str:sub(lastPos+1):gmatch('(%b<>)') do
    argument=argument:sub(2,-2) -- strip <>
    -- do whatever you need with the argument. Here we'll just put it in a table
    arguments[#arguments+1]=argument
  end
  return name,nettime,unpack(arguments)
end
For more complicated things you'll be better off using something like LPeg, as kapep said.
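For comparison, the same "header first, then loop over the <...> pairs" strategy can be sketched in Python with the standard re module. It rests on the same assumption as above, namely that no < or > appears inside an argument; the names `HEADER`, `ARGUMENT` and `parse` are chosen for this sketch only.

```python
import re

# "[name number]" at the start of the line, e.g. "[a 1.001523]"
HEADER = re.compile(r"\[([A-Za-z]+)\s(\d+\.\d+)\]")
# one "<...>" argument; assumes no '<' or '>' inside the argument itself
ARGUMENT = re.compile(r"<([^<>]*)>")


def parse(line):
    """Return (name, net_time, [arguments]) or None if the header is missing."""
    header = HEADER.match(line)
    if header is None:
        return None
    name, net_time = header.groups()
    # only look for arguments after the initial part in [ ]
    arguments = ARGUMENT.findall(line, header.end())
    return name, net_time, arguments


if __name__ == "__main__":
    print(parse("[a 1.001523] <1.7 | [...]> < a123 > < ? 0 ?>"))
```

Returning the arguments as a list sidesteps the variable number of return values that the Lua version handles with unpack.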

How to "for loop" in J

I tried these, but the code won't work.
for. T do. B end.
for_xyz. T do. B end.
What would be the equivalent of this from C#
for(int i = 0; i < 10; i++)
Console.WriteLine("Hello World!");
And what's a good keyword to Google for J problems?
A more J-ish way to loop is using Power ^:, this
f^:10 y
will apply f 10 times; first to y, then to f(y), ... :
f(f(f(f(f(f(f(f(f(f(y))))))))))
So if p is a print function, eg: p =: (4) 1!:2~ ]:
(p^:10) 'Hello World!'
Hello World!Hello World!...
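If the Power conjunction is unfamiliar, the repeated application it performs can be sketched in Python. This only mirrors the behaviour described above for a non-negative integer count; the helper name `power` is made up for illustration.

```python
def power(f, n):
    """Return a function that applies f to its argument n times in a row."""
    def repeated(y):
        for _ in range(n):
            y = f(y)
        return y
    return repeated


if __name__ == "__main__":
    increment = lambda x: x + 1
    print(power(increment, 10)(0))
```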
In general J (in a way) promotes loop-less code. If you really needed 10 times the string 'Hello World!' for example, you probably would do something like:
10 12 $ 'Hello World!'
Hello World!
Hello World!
Hello World!
...
As noted at the beginning of the Control Structures section, these only apply within Explicit definition. The colon is the key for setting up such a script. The only time 'for.' (or any similar word) can occur is within a script determined by the right parameter to : , i.e. colon, meaning Explicit.
Use the link on the control-word for. on that page to find complete samples. Notice that these special symbols (such as for. and end.) normally occur in multi-line scripts that end with a single lone right-paren. That sort of structure is what you must use if you're to use control words.
Here is the first of the examples given on the Dictionary page documenting the for. structure (http://jsoftware.com/help/dictionary/cfor.htm):
f0=. 3 : 0
s=. 0
for. i. y do. s=. >:s end.
)
Once you have arranged control words inside this sort of structure, they take effect when the script is executed. In this example, when the verb f0 receives an integer as its only (right) parameter (referred to as y in the script) it results in the same integer. It iterates through the for loop to arrive at that number.
