Sequential search

For sequential search, what is the average number of comparisons needed to find a record in a file?

A sequential search starts from the beginning of the file and checks each element one by one until the desired element is found. Assuming that the record you are searching for exists in the file exactly once and is equally likely to be at any position, the average number of comparisons is about half the number of records in the file (more precisely, (n + 1)/2 for a file of n records).
However, if the record does not exist in the file, you will have to examine every single record before discovering this.

For a list with n items, the best case is when the value is equal to the first element of the list, in which case only one comparison is needed. The worst case is when the value is not in the list (or occurs only once at the end of the list), in which case n comparisons are needed.
Asymptotically, therefore, the worst-case cost and the expected cost of linear search are both O(n).
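As a minimal illustration of the above (a TypeScript sketch with made-up sample data, not code from any of the answers), here is a linear search that also reports how many comparisons it performed:

```typescript
// Linear search returning the index of `target` (or -1) together with
// the number of comparisons performed.
function sequentialSearch<T>(items: T[], target: T): { index: number; comparisons: number } {
  let comparisons = 0;
  for (let i = 0; i < items.length; i++) {
    comparisons++;
    if (items[i] === target) {
      return { index: i, comparisons };
    }
  }
  // Unsuccessful search: every one of the n records was examined.
  return { index: -1, comparisons };
}

// Hypothetical example: averaged over all positions, a successful search
// costs (n + 1) / 2 comparisons; a failed one always costs n.
const records = [42, 7, 19, 3, 88];
console.log(sequentialSearch(records, 3));   // { index: 3, comparisons: 4 }
console.log(sequentialSearch(records, 100)); // { index: -1, comparisons: 5 }
```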

I would like to add a few points that the previous answers fail to mention:
We must also consider whether the file is available on one device or spread over multiple devices. In the case of T RAMs, the complexity will be O(T*N/(1+log(T))).
In general, sequential search takes O(N) time complexity.
When combined with data structures such as an R-tree, it can give a best-case time complexity of O(N/log(log(N))) for records in a file.
It also depends on the structure/format of the file: if the data fields are already available in a hash map, a sequential search is a poor choice.

Related

Best way to find if there is a one-typo word from list of given words

How would you efficiently solve this problem?
Suppose we were given a list of words [“apple”, “banana”, “mango”].
If we are given a word that is one typo away from a word in the list, such as
“Dpple”
“Adple”
“Appld”
We output true
If there is more than one typo, we output false.
For optimizations, I’ve tried storing the list in a hashtable keyed by the number of letters in each word, and comparing the given input only against words with the same number of letters, to reduce the search space. Is there a faster optimization we can make for this problem?
One possible optimisation would be to generate all one-typo words for the given list and put them in a map (or some better string lookup structure). Then look up the given word: if found, output true, else false. The total number of one-typo words is 25*L, where L is the total number of letters in the input list (assuming case does not matter).
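A minimal sketch of this precomputation, assuming (as in the examples) that a typo means a single-letter substitution and that case does not matter; the function name is illustrative:

```typescript
const ALPHABET = "abcdefghijklmnopqrstuvwxyz";

// Build a set containing every word from the list plus every single-substitution
// variant of it (25 variants per letter position, i.e. 25*L entries overall).
function buildOneTypoSet(words: string[]): Set<string> {
  const variants = new Set<string>();
  for (const word of words) {
    const w = word.toLowerCase();
    variants.add(w);
    for (let i = 0; i < w.length; i++) {
      for (const c of ALPHABET) {
        if (c !== w[i]) variants.add(w.slice(0, i) + c + w.slice(i + 1));
      }
    }
  }
  return variants;
}

// Usage with the list from the question: each query is then a single set membership test.
const typoSet = buildOneTypoSet(["apple", "banana", "mango"]);
console.log(typoSet.has("dpple")); // true  (one typo away from "apple")
console.log(typoSet.has("xpplz")); // false (two typos away)
```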

Finding similar strings in large datasets

I'm using levenshtein distance to retrieve similar strings from a list. At the moment the list has just a few thousand items, but we'll need to support at least 100k items.
I'm trying to make this more efficient, and one technique I came up with was to calculate the Levenshtein distance only on strings that are of similar length. I thought about also filtering on the initial character, i.e. if the string to search starts with b then I'll run the calculation only on the strings that start with b. But I'm not sure I can assume this will work all the time.
I was wondering if you all have a better way of getting this done?
Thanks
One way to go would be to hope that a match with small edit distance would have within it a short exact match. If you assume this, then, given the string ABCDEF, retrieve all strings containing ABC, BCD, CDE, or DEF, and compute their edit distances. You may even find that the best match among these is so close that any closer match must have a short match inside it, so you would have found it already. You would have to accept that if you are unlucky you may miss some good matches, or be forced to go through all the possibilities one by one.
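A sketch of that idea, using overlapping three-letter substrings (trigrams, as in the ABC/BCD/CDE/DEF example above) to retrieve candidates and then scoring them with the usual dynamic-programming Levenshtein distance; the names `buildTrigramIndex` and `closestMatches` are illustrative, not from the question:

```typescript
// Return the overlapping three-letter substrings of a string.
function trigrams(s: string): string[] {
  const grams: string[] = [];
  for (let i = 0; i + 3 <= s.length; i++) grams.push(s.slice(i, i + 3));
  return grams;
}

// Inverted index: trigram -> indices of stored strings containing it.
function buildTrigramIndex(strings: string[]): Map<string, number[]> {
  const index = new Map<string, number[]>();
  strings.forEach((s, i) => {
    for (const g of new Set(trigrams(s))) {
      const bucket = index.get(g) ?? [];
      bucket.push(i);
      index.set(g, bucket);
    }
  });
  return index;
}

// Standard dynamic-programming Levenshtein distance.
function levenshtein(a: string, b: string): number {
  const dp = Array.from({ length: a.length + 1 }, (_, i) => [i, ...Array(b.length).fill(0)]);
  for (let j = 0; j <= b.length; j++) dp[0][j] = j;
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,                                  // deletion
        dp[i][j - 1] + 1,                                  // insertion
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1) // substitution
      );
    }
  }
  return dp[a.length][b.length];
}

// Only strings sharing at least one trigram with the query are scored.
function closestMatches(query: string, strings: string[], index: Map<string, number[]>) {
  const candidates = new Set<number>();
  for (const g of trigrams(query)) {
    for (const i of index.get(g) ?? []) candidates.add(i);
  }
  return [...candidates]
    .map(i => ({ value: strings[i], distance: levenshtein(query, strings[i]) }))
    .sort((x, y) => x.distance - y.distance);
}

// Hypothetical usage: only ABCDEF is retrieved, while ABXDEF (distance 2) shares
// no trigram with the query, illustrating the "may miss some matches" caveat above.
const names = ["ABCDEF", "ABXDEF", "QWERTY"];
const idx = buildTrigramIndex(names);
console.log(closestMatches("ABCDEG", names, idx)); // [{ value: "ABCDEF", distance: 1 }]
```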
As an alternative to building a database of substrings, you could build a suffix array (http://en.wikipedia.org/wiki/Suffix_array) and an LCP array from a string obtained by concatenating all the stored strings, separating them with a marker character not otherwise used. This takes time and space linear in the input size. You would then search for exact matches by looking for strings in the suffix array starting with ABCDEF, BCDEF, CDEF, and DEF.
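A naive sketch of the suffix-array alternative, with "\u0001" as an assumed separator character; the quadratic construction here is only to illustrate the lookup, whereas a real implementation would use a linear-time construction plus an LCP array as the answer says:

```typescript
// Naive suffix array: sort the start positions of all suffixes of the text.
function buildSuffixArray(text: string): number[] {
  const positions = Array.from({ length: text.length }, (_, i) => i);
  positions.sort((a, b) => (text.slice(a) < text.slice(b) ? -1 : 1));
  return positions;
}

// Return the start positions of every exact occurrence of `query` in `text`,
// found by binary-searching the suffix array for the range of matching suffixes.
function findOccurrences(text: string, suffixArray: number[], query: string): number[] {
  let lo = 0, hi = suffixArray.length;   // lower bound of the matching range
  while (lo < hi) {
    const mid = (lo + hi) >> 1;
    if (text.slice(suffixArray[mid], suffixArray[mid] + query.length) < query) lo = mid + 1;
    else hi = mid;
  }
  const start = lo;
  hi = suffixArray.length;               // upper bound of the matching range
  while (lo < hi) {
    const mid = (lo + hi) >> 1;
    if (text.slice(suffixArray[mid], suffixArray[mid] + query.length) <= query) lo = mid + 1;
    else hi = mid;
  }
  return suffixArray.slice(start, lo);
}

// Usage: "\u0001" is assumed not to occur in the stored strings.
const stored = ["ABCDEF", "XBCDEF", "ABCXEF"];
const text = stored.join("\u0001");
const sa = buildSuffixArray(text);
console.log(findOccurrences(text, sa, "BCDEF")); // start positions in the concatenation
```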

Fast repeated row counting in vast data - what format?

My Node.js app needs to index several gigabytes of timestamped CSV data, in such a way that it can quickly get the row count for any combination of values, either for each minute in a day (1440 queries) or for each hour in a couple of months (also 1440). Let's say in half a second.
The column values will not be read, only the row counts per interval for a given permutation. Reducing time to whole minutes is OK. There are rather few possible values per column, between 2 and 10, and some depend on other columns. It's fine to do preprocessing and store the counts in whatever format is suitable for this single task - but what format would that be?
Storing actual values is probably a bad idea, with millions of rows and little variation.
It might be feasible to generate a short code for each combination and match it with a regex, but since these codes would have to be duplicated for each minute, I'm not sure it's a good approach.
Or I could use an embedded database like SQLite, NeDB or TingoDB, but I'm not entirely convinced, since they don't have native enum-like types and may or may not be made for this kind of counting. But maybe it would work just fine?
This must be a common problem with an idiomatic solution, but I haven't figured out what it might be called. Knowing what to call this and how to think about it would be very helpful!
Will answer with my own findings for now, but I'm still interested to know more theory about this problem.
NeDB was not a good solution here, as it saved my values as plain JSON under the hood, repeating the key names for each row and adding unique IDs. It wasted lots of space and would surely have been too slow, if only because of disk I/O.
SQLite might be better at compressing and indexing data, but I have yet to try it. Will update with my results if I do.
Instead I went with the other approach I mentioned: assign a unique letter to each column value we come across, yielding a short string that represents a permutation. Then, for each minute, add these strings as keys only if they occur, with the number of occurrences as values. We can later use our dictionary to build a regex that matches any set of combinations, and run it over this small index very quickly.
This was easy enough to implement, but it would of course have been trickier if I had had more possible column values than the roughly 70 I found.
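A minimal sketch of this letter-coding approach, in TypeScript since the app is Node.js; the column names, values and timestamps are made up, and a real codebook would need to avoid regex metacharacters once it runs past the plain letters:

```typescript
// Each distinct "column=value" pair gets one character; a row becomes a short
// code string, and each minute maps code -> row count.
const codebook = new Map<string, string>();
let nextCode = 65; // 'A'

function codeFor(column: string, value: string): string {
  const key = `${column}=${value}`;
  if (!codebook.has(key)) codebook.set(key, String.fromCharCode(nextCode++));
  return codebook.get(key)!;
}

// minuteIndex: minute timestamp -> (row code -> count)
const minuteIndex = new Map<string, Map<string, number>>();

function addRow(minute: string, row: Record<string, string>, columns: string[]) {
  const code = columns.map(c => codeFor(c, row[c])).join("");
  const bucket = minuteIndex.get(minute) ?? new Map<string, number>();
  bucket.set(code, (bucket.get(code) ?? 0) + 1);
  minuteIndex.set(minute, bucket);
}

// Count rows in one minute matching a partial specification; unspecified columns
// become "." in the regex, i.e. "any value".
function countMatching(minute: string, spec: Record<string, string>, columns: string[]): number {
  const pattern = new RegExp(
    "^" + columns.map(c => (c in spec ? codeFor(c, spec[c]) : ".")).join("") + "$"
  );
  let total = 0;
  for (const [code, count] of minuteIndex.get(minute) ?? []) {
    if (pattern.test(code)) total += count;
  }
  return total;
}

// Hypothetical usage with two columns.
const columns = ["status", "region"];
addRow("2014-01-01T00:00", { status: "ok", region: "eu" }, columns);
addRow("2014-01-01T00:00", { status: "ok", region: "us" }, columns);
console.log(countMatching("2014-01-01T00:00", { status: "ok" }, columns)); // 2
```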

Count no. of words in O(n)

I am on an interview ride here. One more interview question I had difficulties with.
“A rose is a rose is a rose”
Write an algorithm that prints the number of times a character/word occurs. E.g.
A – 3
Rose – 3
Is – 2
Also ensure that when you are printing the results, they are in the order in which they were present in the original sentence. All this in order n.
I did get a solution that counts the number of occurrences of each word in the sentence, in the order in which they appear in the original sentence; I used a Dictionary<string,int> to do it. However, I did not understand what is meant by "order of n". That is something I need you guys to explain.
There are only 26 characters, so you can use counting sort to count them; inside the counting sort you can keep an index recording when each character was first visited, to preserve the order of occurrence. (They can then be sorted by their count and first occurrence with a sort like radix sort.)
Edit: for words, the first thing everyone thinks of is using a hash table: insert the words into the hash and count them there. They can then be sorted in O(n), because all the counts are within 1..n, so you can sort them with counting sort in O(n); for the order of occurrence, you can traverse the string and record the position at which each value first appears.
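A sketch of the character-counting variant described above: a fixed 26-slot table plus a first-seen index, so both the counting pass and the final ordering stay O(n) (sorting at most 26 entries is constant work). The function name is just for illustration:

```typescript
// Count letter occurrences in one pass with a fixed-size table (the counting-sort
// idea), remembering where each letter was first seen so the output keeps the
// original order of appearance.
function countCharacters(sentence: string): Array<[string, number]> {
  const counts = new Array<number>(26).fill(0);
  const firstSeen = new Array<number>(26).fill(-1);
  const lower = sentence.toLowerCase();
  for (let i = 0; i < lower.length; i++) {
    const c = lower.charCodeAt(i) - 97;   // 'a' is char code 97
    if (c < 0 || c > 25) continue;        // skip spaces and punctuation
    if (firstSeen[c] === -1) firstSeen[c] = i;
    counts[c]++;
  }
  return [...Array(26).keys()]
    .filter(c => counts[c] > 0)
    .sort((x, y) => firstSeen[x] - firstSeen[y])   // order of first appearance
    .map((c): [string, number] => [String.fromCharCode(97 + c), counts[c]]);
}

console.log(countCharacters("A rose is a rose is a rose"));
// [["a", 3], ["r", 3], ["o", 3], ["s", 5], ["e", 3], ["i", 2]]
```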
"Order of n" means you traverse the string only once, or at most a constant number of times, where n is the number of characters in the string.
So your solution, which stores each word and the number of its occurrences, is O(n) (order of n), as you loop through the complete string only once.
However, it uses extra space in the form of the list you created.
Order N refers to Big O computational complexity analysis, where you derive a good upper bound on an algorithm's growth. It is a theory we cover early in a Data Structures class, so we can torment, I mean help, the students gain facility with it as we traverse, in a balanced way, heaps of different trees of knowledge, all different. In your case they want your algorithm's compute time to grow proportionally with the size of the text.
It's a reference to Big O notation. Basically the interviewer means that you have to complete the task with an O(N) algorithm.
"Order n" is referring to Big O notation. Big O is a way for mathematicians and computer scientists to describe the behavior of a function. When someone specifies searching a string "in order n", that means that the time it takes for the function to execute grows linearly as the length of that string increases. In other words, if you plotted time of execution vs length of input, you would see a straight line.
Saying that your function must be of order n does not mean that it must be exactly O(n); a function with a Big O less than O(n) would also be acceptable. In your problem's case that is not possible, because in order to count a letter you must "touch" that letter, so there must be some operation dependent on the input size.
One possible method is to traverse the string linearly. Then create a hash and a list. The idea is to use the word as the hash key and increment the value for each occurrence. If the key does not yet exist in the hash, add the word to the end of the list. After traversing the string, go through the list in order, using the hash values as the counts.
The order of the algorithm is O(n). The hash lookup and list add operations are O(1) (or very close to it).
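A minimal sketch of that hash-plus-list approach (a JavaScript Map already iterates in insertion order, but the separate list is kept here to mirror the description):

```typescript
// One linear pass: the map holds the counts, the list records the order in which
// each distinct word first appeared.
function wordCounts(sentence: string): Array<[string, number]> {
  const counts = new Map<string, number>();
  const order: string[] = [];
  for (const word of sentence.toLowerCase().split(/\s+/).filter(w => w.length > 0)) {
    if (!counts.has(word)) order.push(word);        // first occurrence: remember its place
    counts.set(word, (counts.get(word) ?? 0) + 1);  // O(1) expected per update
  }
  return order.map((word): [string, number] => [word, counts.get(word)!]);
}

console.log(wordCounts("A rose is a rose is a rose"));
// [["a", 3], ["rose", 3], ["is", 2]]
```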

Lucene number extracting

I have this number extracting problem.
I want to get all matches that don't have a certain number in them.
ex : 125501874, 125001873
Every number that has 55 starting at position 2 is not to be considered.
The first number's range is 0 to 9 and the second is 1 to 9, so the real range is [01-99]
(we cannot have 00 as the first two numbers).
With Lucene I wanted to add NOT field:[01-99]55*
But it doesn't seem to work. Is there an easy way to find ??55* and disregard it in a Search("NOT field:[01-99]55*")?
Thank you Lucene guru
Lucene can do this very efficiently if one creates an "index-only" field with only the third and fourth digits in it. The complete value can be "stored" (or stored and indexed if other queries use the whole number) in the original field.
Update: A followup comment asked, "Is [there] a way to create a temporary index on only the second digit?"
Using a ParallelReader "vertically partitions" the fields of an index. One partition could hold the current index, with its fields, while the other is a temporary index with the new field, possibly stored in a RAMDirectory.
Assuming the number is "stored" in the original index, iterate over each document in the original index, retrieve the stored field, parse out the key digits, and add a Document to the temporary index with the new field. As the ParallelReader documentation states, it is imperative that the document numbers match in both indexes.
Thank you erickson, your solution is probably the best. I would use a ParallelReader if only I could use temporary indexes, but because we cache the search query, we will need those indexes later.
But as you said, it is better to start with an index on the relevant digits straight away.
I have another solution.
NOT field:0?55*
NOT field:1?55*
...
NOT field:9?55*
It is efficient enough for the search I'm doing and it bypasses the leading-wildcard limitation. I wouldn't use this approach if there were more digits to check or if they were farther from the start.
I'm now testing this on a million rows and it's pretty efficient for our needs.
