Implementing proximity search in positional inverted index nodejs - search

I am building a positional inverted index from text. The index structure is shown below (suggestions for improvement are welcome):
{
  term: {
    documentID: {pageno: [positions], pageno: [positions]},
    documentID: {pageno: [positions]}
  }
}
I want to implement proximity search.
Proximity search:
Queries of the form X AND Y /3 or X Y /2 are called proximity queries. The expression means: retrieve documents that contain both X and Y, with the two terms 3 words or 2 words apart, respectively.
Reference - Boolean Retrieval Model Using Inverted Index and Positional Index
I want to implement this in NodeJS for 2 or more words. I am confused about how to implement this.
I thought of creating a search result object for each word from the index. This would have a structure like:
word :{document1: {page1:[positions], page2:[positions]}}
and then somehow compare the positions on every intersecting page for all the words and calculate the proximity.
For the search query "nodejs hello world", the proximity in the string "hello extra words world more extra words nodejs" would be 5: count the extra words between the query words (2 between hello and world, 3 between world and nodejs) and sum them, regardless of the order of the search words. Refer to this: Lucene Proximity Search for phrase with more than two words
Is this an efficient index structure? If yes, how do I compare the positions on every intersecting page for all the words?
If the query is "jakarta apache lucene"~3 (3 being the maximum proximity allowed) and the text is 'jakarta jakarta apache lucene', will it match twice?
EDIT:
After some additional processing, I generated this for every document:
{
  pageno: [
    [positions of word 1],
    [positions of word 2],
    [positions of word n]
  ]
}
This makes sure to include only those pages which have all the words present.
For example:
{
  1: [
    [1, 5, 6],
    [2, 41],
    [3, 7, 11]
  ],
  2: [
    [1, 5, 6],
    [2, 41],
    [3, 7, 11]
  ]
}
Now I need to find the total number of occurrences of the query text on a particular page, using the positions of the query words in the array above, such that the difference between their positions is less than the proximity value.
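One possible way to count such occurrences is sketched below in Python; the logic carries over to Node.js directly. It is a brute-force illustration, not a tuned implementation: it tries every combination of one position per query word and measures the "slop" as described in the question (the number of non-query words inside the span, regardless of word order). The names count_proximity_matches and max_slop are hypothetical, and the rule that every qualifying combination counts as one occurrence is an assumption.

```python
from itertools import product


def count_proximity_matches(position_lists, max_slop):
    """Count combinations (one position per query word) within max_slop.

    position_lists: one list of positions per query word, for a single page.
    max_slop: maximum number of extra (non-query) words allowed in the span.
    """
    # A page with no query words, or with a word missing, cannot match.
    if not position_lists or any(not positions for positions in position_lists):
        return 0

    word_count = len(position_lists)
    matches = 0
    for combo in product(*position_lists):
        # Words inside the span that are not one of the chosen query words.
        slop = (max(combo) - min(combo)) - (word_count - 1)
        # A negative slop means two words were assigned the same position.
        if 0 <= slop <= max_slop:
            matches += 1
    return matches


page = [[1, 5, 6], [2, 41], [3, 7, 11]]
print(count_proximity_matches(page, 0))
print(count_proximity_matches(page, 3))
```

Under this counting rule, the text 'jakarta jakarta apache lucene' (positions [[0, 1], [2], [3]]) yields two combinations within a slop of 3. Whether Lucene reports that as two matches is not something this sketch claims. The product grows quickly with long position lists, so a sliding-window pass over the merged, sorted positions would be the next step for larger pages.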

Related

How to find match between two 2D lists in Python?

Let's say I have two 2D lists like this:
list1 = [ ['A', 5], ['X', 7], ['P', 3]]
list2 = [ ['B', 9], ['C', 5], ['A', 3]]
I want to compare these two lists and find where the 2nd item matches between the two lists, e.g. here we can see that the numbers 5 and 3 appear in both lists. The first item is not relevant to the comparison.
How do I compare the lists and copy those values that appear in 2nd column of both lists? Using 'x in list' does not work since these are 2D lists. Do I create another copy of the lists with just the 2nd column copied across?
It is possible that this can be done using list comprehension but I am not sure about it so far.
There might be a duplicate for this but I have not found it yet.
The pursuit of one-liners is a futile exercise. They aren't always more efficient than the regular loopy way, and almost always less readable when you're writing anything more complicated than one or two nested loops. So let's get a multi-line solution first. Once we have a working solution, we can try to convert it to a one-liner.
Now the solution you shared in the comments works, but it doesn't handle duplicate elements and also is O(n^2) because it contains a nested loop. https://wiki.python.org/moin/TimeComplexity
list_common = [x[1] for x in list1 for y in list2 if x[1] == y[1]]
A few key things to remember:
A single loop O(n) is better than a nested loop O(n^2).
Membership lookup in a set O(1) is much quicker than lookup in a list O(n).
Sets also get rid of duplicates for you.
Python includes set operations like union, intersection, etc.
Let's code something using these points:
# Create a set containing all numbers from list1
set1 = set(x[1] for x in list1)
# Create a set containing all numbers from list2
set2 = set(x[1] for x in list2)
# Intersection contains numbers in both sets
intersection = set1.intersection(set2)
# If you want, convert this to a list
list_common = list(intersection)
Now, to convert this to a one-liner:
list_common = list(set(x[1] for x in list1).intersection(x[1] for x in list2))
We don't need to explicitly convert x[1] for x in list2 to a set because the set.intersection() method accepts any iterable, including a generator expression, and handles the conversion internally.
This gives you the result in O(n) time, and also gets rid of duplicates in the process.
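One caveat with the set-based version is that a set has no defined order, so list_common may come back in any order. If the order of list1 matters, a variant along these lines keeps it while still using O(1) set lookups. This is a minimal sketch; the function name common_second_items is hypothetical.

```python
def common_second_items(list1, list2):
    """Return second-column values present in both lists, in list1 order."""
    values_in_list2 = {row[1] for row in list2}  # O(1) membership checks
    already_emitted = set()                      # used to skip duplicates
    result = []
    for row in list1:
        value = row[1]
        if value in values_in_list2 and value not in already_emitted:
            already_emitted.add(value)
            result.append(value)
    return result


list1 = [['A', 5], ['X', 7], ['P', 3]]
list2 = [['B', 9], ['C', 5], ['A', 3]]
print(common_second_items(list1, list2))  # [5, 3]
```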

Getting rid of duplicates from a pair of corresponding lists

This is a program that I recently made. The goal of this code is to remove duplicates from a pair of corresponding lists, so randomStringpt1[0] corresponds to randomStringpt2[0]. I want to compare randomStringpt1[0] and randomStringpt2[0] to the rest of the pairs that the user gave in the randomStrings. But after using this code, it looks like I have duplicated each pair many times, which is the opposite of what I was looking for. I was thinking of using a dictionary, but then realized that a dictionary key could only have one value, which wouldn't help my case if the user used a number twice. Does anyone know how I can reduce the duplicates?
(The tests I have been running have been with the numbers randomStringpt1 = [1,3,1,1,3] and randomStringpt2 = [2,4,2,3,4])
randomStringpt1 = [1, 2, 3, 4, 5]  # Pair of strings that correspond to each other ("1,2,3,4,5" doesn't actually matter)
randomStringpt2 = [1, 2, 3, 4, 5]
for i in range(len(randomStringpt1)):
    randomStringpt1[i] = input("Values for the first string: ")
    randomStringpt2[i] = input("Corresponding value for the second string: ")
print(randomStringpt1)  # numbers that the user chose for the first number of the pair
print(randomStringpt2)  # numbers that the user chose for the second number of the pair
newStart = []
newEnd = []
for num1 in range(len(randomStringpt1)):
    for num2 in range(len(randomStringpt1)):
        if (int(randomStringpt1[num1]) != int(randomStringpt1[num2]) and int(randomStringpt2[num1]) != int(randomStringpt2[num2])):
            newStart.append(randomStringpt1[num1])  # Adding the pairs that aren't equal to each other to a new list
            newEnd.append(randomStringpt2[num1])
            newStart.append(randomStringpt1[num2])
            newEnd.append(randomStringpt2[num2])
        # else:
        #     print("The set of numbers from the randomStrings of num1 are not equal to the ones in num2")
print(newStart)
print(newEnd)
First, let's analyze the two bugs in your code.
The if condition inside the loop is true every time a pair is compared to a different one. This means that for your example it should output
[1, 1, 3, 3, 3, 1, 1, 1, 1, 3, 3, 3]
[2, 2, 4, 4, 4, 2, 2, 3, 3, 4, 4, 4]
since you compare every pair to every other pair that exists. But your output is different because you append both pairs every time, which gives a very big result, so you shouldn't append the num2 pairs.
Now, from what you described that you want, you should loop every pair and check if it already exists in the output list. So the for loop part can change like this
filtered = []
for pair in zip(randomStringpt1, randomStringpt2):
    if pair not in filtered:
        filtered.append(pair)  # Keep only pairs that have not been seen before
The zip function takes the two lists and, on each iteration, returns one value from each list: the first pair of values, then the second pair, and so on. The filtered list will have the following format: [(1, 2), (3, 4), (1, 3)] (the values will be strings if they come straight from input()).
Alternatively, it can be written as a one-liner like this:
filtered = list(dict.fromkeys(zip(randomStringpt1, randomStringpt2)))
This uses the dictionary to identify unique elements (dictionary keys are unique and keep insertion order in Python 3.7+) and then turns the result back into a list.
After all that, you can get back the original format of the lists you had in your code by splitting them like this:
newStart = [pair[0] for pair in filtered]
newEnd = [pair[1] for pair in filtered]
Finally, I would recommend reading a little more about Python and its for loops. Using range(len(yourlist)) is not the intended Python way to loop over lists: Python for loops are equivalent to for-each loops in other languages and iterate over the list for you, instead of relying on an index to get list elements like yourlist[value].
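To make the last point concrete, here is a small comparison of the index-based loop and direct iteration, using the test values from the question as plain integers. The variable names starts, ends, pairs_by_index, pairs_direct and unique_pairs are illustrative only.

```python
starts = [1, 3, 1, 1, 3]
ends = [2, 4, 2, 3, 4]

# Index-based loop, in the style of the question's code.
pairs_by_index = []
for i in range(len(starts)):
    pairs_by_index.append((starts[i], ends[i]))

# Direct iteration: zip yields one element from each list per step.
pairs_direct = []
for start, end in zip(starts, ends):
    pairs_direct.append((start, end))

# Both loops build the same list; the second needs no index at all.
print(pairs_by_index == pairs_direct)

# Remove duplicate pairs while keeping first-seen order.
unique_pairs = list(dict.fromkeys(pairs_direct))
print(unique_pairs)  # [(1, 2), (3, 4), (1, 3)]
```

When the index really is needed alongside the value, enumerate(starts) gives both without resorting to range(len(...)).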

Godot - How do I create a subarray of a list in Gdscript?

I know it's possible to slice an array in Python with array[2:4]. The way I get around this is to just loop through the indexes I want and append them to the new_list. This requires more work. Is there a simple way to do it like in Python?
You can use the Array.slice() method added in Godot 3.2 for this purpose:
Array slice ( int begin, int end, int step=1, bool deep=False )
Duplicates the subset described in the function and returns it in an array, deeply copying the array if deep is true. Lower and upper index are inclusive, with the step describing the change between indices while slicing.
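Example (Godot 3.x; note that in Godot 4 the end index of slice() is exclusive, so the same call would return [4]):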
var array = [2, 4, 6, 8]
var subset = array.slice(1, 2)
print(subset) # Should print [4, 6]

create list from list where values only increase by 1

I have the code below that gets the maximum value from a list. It then compares it to the maximum value of the remaining values in the list, and if it is more than 1 higher than the next greatest value, it replaces the original list maximum with 1 higher than the next greatest value. I would like the code to search the entire list and make sure that any value in the list is at most 1 larger than the next smaller value in the list. I know this isn’t the best-worded explanation; I hope the example lists below make what I’m trying to accomplish clearer.
For example, I don’t want to get a final list like:
[0,2,0,3]
I would want the final list to be
[0,1,0,2]
input:
empt=[0,2,0,0]
Code:
nwEmpt = [i for i in empt if i != max(empt)]
nwEmpt2 = []
for i in range(0, len(empt)):
    if (empt[i] == max(empt)) & (max(empt) > (max(nwEmpt) + 1)):
        nwEmpt2.append((max(nwEmpt) + 1))
    elif (empt[i] == max(empt)) & (max(empt) == (max(nwEmpt) + 1)):
        nwEmpt2.append(max(empt))
    else:
        nwEmpt2.append(empt[i])
output:
nwEmpt2
[0,1,0,0]
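min_value = min(empt)
empt_set = set(empt)  # distinct values only
nwEmpt = []  # result list (must start empty)
for i in empt:
    # minimum + number of distinct values smaller than i
    nwEmpt.append(min_value + len(list(filter(lambda x: x < i, empt_set))))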
This gives e.g. for input empt = [8, 10, 6, 4, 4] output nwEmpt = [6, 7, 5, 4, 4].
It works by mapping each element to (the minimum value) + (the number of distinct values smaller than element).
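The same mapping can also be computed by ranking the distinct values once, which avoids re-scanning the set for every element. This is a minimal sketch of that variant, not the answer's original code; the function name compress_to_consecutive is hypothetical.

```python
def compress_to_consecutive(values):
    """Map each value to min(values) + (number of distinct smaller values)."""
    if not values:
        return []
    distinct = sorted(set(values))
    # rank[v] is the number of distinct values strictly smaller than v.
    rank = {value: index for index, value in enumerate(distinct)}
    base = distinct[0]
    return [base + rank[value] for value in values]


print(compress_to_consecutive([8, 10, 6, 4, 4]))  # [6, 7, 5, 4, 4]
print(compress_to_consecutive([0, 2, 0, 3]))      # [0, 1, 0, 2]
```

Sorting the distinct values costs O(k log k) for k distinct values, after which each element is a single dictionary lookup.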

Underscore GroupBy Sort

I have a question regarding programming in a functional style.
I use the underscore.js library.
Let's consider a use case. I have an array of labels with repetitions. I need to count how many occurrences of each label are in the array and sort the labels by number of occurrences.
For counting the labels I can use countBy:
_.countBy([1, 2, 3, 4, 5], function(num) {
  return num % 2 == 0 ? 'even' : 'odd';
});
=> {odd: 3, even: 2}
But here the result is a hash, which has no meaningful order, so it cannot be sorted. So I need to convert the hash to an array, sort it, and convert it back to a hash.
I am pretty sure there is an elegant way to do so, however I am not aware of it.
I would appreciate any help.
"sort it and convert backward to hash."
No, that would lose the order again.
You could use
var occurences = _.countBy([1, 2, 3, 4, 5], function(num) {
  return num % 2 == 0 ? 'even' : 'odd';
});
// {odd: 3, even: 2}
var order = _.sortBy(_.keys(occurences), function(k){return occurences[k];})
// ["even", "odd"]
or maybe just
_.sortBy(_.pairs(occurences), 1)
// [["even", 2], ["odd", 3]]

Resources