Most efficient way to find the common prefix of many strings


What is the most efficient way to find the common prefix of many strings?
For example, for this set of strings:
/home/texai/www/app/application/cron/logCron.log
/home/texai/www/app/application/jobs/logCron.log
/home/texai/www/app/var/log/application.log
/home/texai/www/app/public/imagick.log
/home/texai/www/app/public/status.log
I want to get /home/texai/www/app/
I want to avoid character-by-character comparisons.

You cannot avoid going through at least the common parts to find the common prefix.
I don't think this needs any fancy algorithm. Just keep track of the current common prefix, then shorten it by comparing it with the next string.
Since this is the common prefix of all the strings, you may end up with an empty string (no common prefix).
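
The running-prefix idea described above could be sketched roughly as follows. This is only an illustration in Python, not the answerer's code; the function name common_prefix is made up for the example.

```python
def common_prefix(strings):
    """Return the longest prefix shared by every string in `strings`."""
    if not strings:
        return ""
    prefix = strings[0]
    for s in strings[1:]:
        # Shrink the candidate prefix to the part that `s` also starts with.
        limit = min(len(prefix), len(s))
        i = 0
        while i < limit and prefix[i] == s[i]:
            i += 1
        prefix = prefix[:i]
        if not prefix:
            break  # nothing in common, so later strings cannot change the result
    return prefix


paths = [
    "/home/texai/www/app/application/cron/logCron.log",
    "/home/texai/www/app/application/jobs/logCron.log",
    "/home/texai/www/app/var/log/application.log",
    "/home/texai/www/app/public/imagick.log",
    "/home/texai/www/app/public/status.log",
]
print(common_prefix(paths))  # /home/texai/www/app/
```

Each string is read only up to the length of the current prefix, which is the minimum work the answer says cannot be avoided.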

I'm not sure what you mean by avoid char by char comparative, but you at least need to read the common prefix from each of the strings, so the following algorithm is the best you can achieve (just iterate over the strings until they deviate or until the current longest prefix count is reached):
List<string> list = new List<string>()
{
    "/home/texai/www/app/application/cron/logCron.log",
    "/home/texai/www/app/application/jobs/logCron.log",
    "/home/texai/www/app/var/log/application.log",
    "/home/texai/www/app/public/imagick.log",
    "/home/texai/www/app/public/status.log"
};
int maxPrefix = list[0].Length;
for (int i = 1; i < list.Count; i++)
{
    int pos = 0;
    // advance while both strings agree and the current prefix limit is not reached
    for (; pos < maxPrefix && pos < list[i].Length && list[0][pos] == list[i][pos]; pos++);
    maxPrefix = pos;
}
// this is the common prefix
string prefix = list[0].Substring(0, maxPrefix);

Related

Longest common prefix - comparing time complexity of two algorithms

If you compare these two solutions, the time complexity of the first is O(array-len * shortest-string-len), which you may shorten to O(n*m) or even O(n^2). The second seems to be O(n * log n): it sorts and then compares the first and last items, and that comparison is O(n), so it has no effect on the overall bound.
But what about comparing the string items in the list? Sorting a list of integer values is O(n * log n), but don't we need to compare the characters in the strings to be able to sort them? So, am I wrong if I say the time complexity of the second solution is O(n * log n * longest-string-len)?
Also, since it does not check the prefixes while sorting, it would do the full sort most of the time anyway, so is its best case far worse than the other option? And for the worst case, if you consider the point I mentioned, would it still be worse than the first solution?
public string longestCommonPrefix(List<string> input) {
    if (input.Count == 0) return "";
    if (input.Count == 1) return input[0];
    var sb = new System.Text.StringBuilder();
    for (var charIndex = 0; charIndex < input[0].Length; charIndex++)
    {
        for (var itemIndex = 1; itemIndex < input.Count; itemIndex++)
        {
            // stop as soon as any string is too short to have this character
            if (input[itemIndex].Length <= charIndex)
                return sb.ToString();
            if (input[0][charIndex] != input[itemIndex][charIndex])
                return sb.ToString();
        }
        sb.Append(input[0][charIndex]);
    }
    return sb.ToString();
}
static string longestCommonPrefix(String[] a)
{
    int size = a.Length;
    /* if size is 0, return empty string */
    if (size == 0)
        return "";
    if (size == 1)
        return a[0];
    /* sort the array of strings */
    Array.Sort(a);
    /* find the minimum length from first
       and last string */
    int end = Math.Min(a[0].Length,
                       a[size-1].Length);
    /* find the common prefix between the
       first and last string */
    int i = 0;
    while (i < end && a[0][i] == a[size-1][i])
        i++;
    string pre = a[0].Substring(0, i);
    return pre;
}

First of all, unless I am missing something obvious, the first method runs in O(N * shortest-string-length); shortest, not longest.
Second, you may not reduce O(n*m) to O(n^2): the number of strings and their length are unrelated.
Finally, you are absolutely right. Sorting indeed takes O(n*log(n)*m), so in no case would it improve the performance.
As a side note, it may be beneficial to find the shortest string beforehand. This would make the per-character length check on input[itemIndex] unnecessary.
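
A rough Python sketch of that side note, assuming the same column-by-column scan as the first method: picking the shortest string first bounds the outer loop, so no length check is needed inside it. The name longest_common_prefix is reused here only for illustration.

```python
def longest_common_prefix(items):
    """Column-by-column scan bounded by the shortest string."""
    if not items:
        return ""
    # One O(n) pass to find the shortest string; every index into it
    # is then valid for all other strings as well.
    shortest = min(items, key=len)
    for char_index, ch in enumerate(shortest):
        for item in items:
            if item[char_index] != ch:
                return shortest[:char_index]
    return shortest


print(longest_common_prefix(["flower", "flow", "flight"]))  # fl
```

The bound stays O(n * m) with m the shortest length; the change removes a branch from the inner loop rather than improving the complexity class.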

Is there a faster method to find dictionary keys with wildcards in the middle?

Let's say I have a dictionary whose keys are strings of 0s and 1s, with '*' as a wildcard.
For example, my dictionary is structured as such:
{'010*10000':'foo', '100*1*000':'bar'......}
Each dictionary key has a fixed string length; however, there are wildcards within the string, represented as '*' characters. Thus, lookups of '010110000' or '010010000' should both return 'foo'.
The problem lies in the size of my dictionary, which has over 500,000 entries. When I iterate over each key in the dict to find whether a key matches, it takes far too long with O(n) complexity.
Ideally, I would like to find a way to just check if a value such as '010110000' is in the dictionary, similar to the .get() function for regular Python dictionaries without wildcards.
I've already tried iterating over my dictionary using fnmatch, as in the question "Wildcard in dictionary key":
for k in my_dict.keys():
    if fnmatch.fnmatch(string_of_1s_and_0s, k):
        print(my_dict[k])
        # Do some operation here if we have found the matching key pair...and then break.
        break
However, it's just too slow with O(n) complexity. Is there any way to implement get() but with wildcards?

dicts are hash code based; the hash code, if implemented correctly, will differ wildly for a difference of just one character. There is no way to make a dict do what you want, but what you're doing is probably best done with something other than a dict in the first place. Have you considered a relational database, where the LIKE operator could do something like this? It might still have to scan a large part of the DB, but ideally it could use anchors at one end or the other to at least narrow the search to matching prefixes/suffixes.

Rotate the original pattern left (by taking characters from the start and putting them at the end) while keeping track of the rotate count; like this:
'010*10000' -> '*10000010', rotate_count = 3
'100*1*000' -> '*1*000100', rotate_count = 3
Then split it into a "complex part" and a "simple part", and determine the length of the simple part, like this:
'010*10000' -> '*10000010', rotate_count = 3
    complex = '*', simple = '10000010', simple_length = 8
'100*1*000' -> '*1*000100', rotate_count = 3
    complex = '*1*', simple = '000100', simple_length = 6
If the fixed length of the strings is 16, then there will be 16 possible values of rotate_count, and for each one there will be 16 - rotate_count possible values of simple_length. This can be described as a nested loop:
for(rotate_count = 0; rotate_count < 16; rotate_count++) {
    for(simple_length = 0; simple_length < 16 - rotate_count; simple_length++) {
    }
}
You can associate an "array of entries" with this, like:
entry_number = 0;
for(rotate_count = 0; rotate_count < 16; rotate_count++) {
    for(simple_length = 0; simple_length < 16 - rotate_count; simple_length++) {
        entry_number++;
    }
}
Then you can use the entry number to find a hash table, like:
entry_number = 0;
for(rotate_count = 0; rotate_count < 16; rotate_count++) {
    for(simple_length = 0; simple_length < 16 - rotate_count; simple_length++) {
        hash_table = array_of_hash_tables[entry_number];
        entry_number++;
    }
}
You can also rotate the string you're looking for by the rotate_count and extract simple_length characters from that, convert those characters into a hash, and use it to find a list of entries from the hash table, like:
entry_number = 0;
for(rotate_count = 0; rotate_count < 16; rotate_count++) {
    rotated_string = rotate_string(original_string, rotate_count);
    for(simple_length = 0; simple_length < 16 - rotate_count; simple_length++) {
        hash_table = array_of_hash_tables[entry_number];
        if(hash_table != NULL) {
            hash = get_simple_hash(rotated_string, simple_length);
            list = hash_table[hash];
            // Use "list" and "original string" to do the hard stuff here...
        }
        entry_number++;
    }
}
This will quickly eliminate lots of entries (where the start and end don't match) and give you a list of "potential matches" where you'd have to check the part containing wild cards against the original string to determine if there is/isn't an actual match.
Note that if the characters are "ones and zeros" this can be improved by converting "strings containing binary digits" into integers.
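
One possible way to use the integer conversion mentioned above is sketched below. It is a different, simpler technique than the rotation scheme: each pattern becomes a (mask, value) pair, patterns are grouped by mask, and a lookup does one dict probe per distinct mask. It assumes all keys have the same length and that the number of distinct wildcard layouts is far smaller than the number of patterns. The names compile_pattern, build_index and lookup are hypothetical.

```python
def compile_pattern(pattern):
    """Turn '010*10000' into (mask, value): mask has 1s at fixed positions."""
    mask = 0
    value = 0
    for ch in pattern:
        mask <<= 1
        value <<= 1
        if ch != "*":
            mask |= 1
            value |= int(ch)
    return mask, value


def build_index(patterns):
    """Group patterns by mask: {mask: {value: result}}."""
    index = {}
    for pattern, result in patterns.items():
        mask, value = compile_pattern(pattern)
        index.setdefault(mask, {})[value] = result
    return index


def lookup(index, bits):
    """Return the result of the first matching pattern, or None."""
    x = int(bits, 2)
    for mask, table in index.items():
        key = x & mask  # wildcard positions are zeroed out
        if key in table:
            return table[key]
    return None


index = build_index({"010*10000": "foo", "100*1*000": "bar"})
print(lookup(index, "010110000"))  # foo
print(lookup(index, "010010000"))  # foo
```

If nearly every pattern has its own wildcard layout, this degrades to a linear scan again, so it is worth counting the distinct masks in the real data before relying on it.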

Char Array Returning Integers

I've been working through this exercise, and my output is not what I expect.
(Check substrings) You can check whether a string is a substring of another string
by using the indexOf method in the String class. Write your own method for
this function. Write a program that prompts the user to enter two strings, and
checks whether the first string is a substring of the second.
** My code compromises on the problem's specifications in two ways: it can only display matching substrings of 3 letters, and it cannot work on strings with fewer than 4 letters. I mistakenly began writing the program without using the suggested method, indexOf. My program's objective (although it shouldn't entirely deviate from the assignment's objective) is to determine whether two strings share at least three consecutive letters.
The program's primary error is that it prints numbers instead of characters. I've run through several unsuccessful ideas to discover what the logical error is. I first tried to identify whether the char values (which, from my understanding, are encoded in Unicode) were converted to integers, considering that the printed numbers are also three digits long. Without consulting a reference, I know this isn't true. A comparison between java and javac printed permutations of 312, and a comparison between abab and ababbab printed combinations of 219; j should be > b. My next thought was that the outputs were indexes into the arrays I used. Once again, this isn't true: a comparison between java and javac would print 0 if my reasoning were correct.
public class Substring {
    public static char [] array;
    public static char [] array2;

    public static void main (String[]args){
        java.util.Scanner input = new java.util.Scanner (System.in);
        System.out.println("Enter your two strings here, the longer one preceding the shorter one");
        String container1 = input.next();
        String container2 = input.next();
        char [] placeholder = container1.toCharArray();
        char [] placeholder2 = container2.toCharArray();
        array = placeholder;
        array2 = placeholder2;
        for (int i = 0; i < placeholder2.length; i++){
            for (int j = 0; j < placeholder.length; j ++){
                if (array[j] == array2[i]) matcher(j,i);
            }
        }
    }

    public static void matcher(int higher, int lower){
        if ((higher < array.length - 2) && (lower < array2.length - 2))
            if (( array[higher+1] == array2[lower+1]) && (array[higher+2] == array2[lower+2]))
                System.out.println(array[higher] + array[higher+1] + array[higher+2] );
    }
}

The + operator promotes short, char, and byte operands to int, so
array[higher] + array[higher+1] + array[higher+2]
has type int, not type char which means that
System.out.println(...)
binds to
System.out.println(int)
which displays its argument as a decimal number, instead of binding to
System.out.println(char)
which outputs the given character using the PrintStream's encoding.
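
The arithmetic behind the printed numbers can be reproduced with a small sketch. Python has no char type and no implicit promotion, so ord() is used here to model what Java's + does to char operands; this illustrates the effect and is not Java code.

```python
chars = ["a", "v", "a"]  # three consecutive letters shared by "java" and "javac"

# Adding code points, which is what char + char + char amounts to in Java.
as_ints = sum(ord(c) for c in chars)  # 97 + 118 + 97

# Concatenating the characters, which is what the program intended.
as_text = "".join(chars)

print(as_ints)  # 312
print(as_text)  # ava
```

In the Java program itself, starting the expression with an empty string ("" + array[higher] + array[higher+1] + array[higher+2]) makes + perform string concatenation instead of integer addition.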

dart efficient string processing techniques?

I have strings in the format name:key:dataLength:data, and these strings can often be chained together. For example, "aNum:n:4:9879aBool:b:1:taString:s:2:Hi" would map to an object something like:
{
aNum: 9879,
aBool: true,
aString: "Hi"
}
I have a method for parsing a string in this format, but I'm not sure whether its use of substring is the most efficient way of processing the string. Is there a more efficient way of processing strings in this fashion (repeatedly chopping off the front section)?
Map<String, dynamic> fromString(String s){
  Map<String, dynamic> _internal = new Map();
  int start = 0;
  while(start < s.length){
    int end;
    // 0 is name, 1 is key, 2 is data length, 3 is data
    List<String> parts = new List<String>.filled(4, '');
    for(var i = 0; i < 4; i++){
      // search from the current position; the data length is relative to start
      end = i < 3 ? s.indexOf(':', start) : start + int.parse(parts[2]);
      parts[i] = s.substring(start, end);
      start = i < 3 ? end + 1 : end;
    }
    // this is just a map to an object which has a function that can convert the data section of the string into an object
    var tranType = _tranTypesByKey[parts[1]];
    _internal[parts[0]] = tranType._fromStr(parts[3]);
  }
  return _internal;
}

I would try s.split(':') and process the resulting list.
If you do a lot of such operations, you should consider creating benchmark tests, trying different techniques, and comparing them.
If you would still need this line
s = i < 3 ? s.substring(idx + 1) : s.substring(idx);
I would avoid creating a new substring in each iteration but instead just keep track of the next position.

You have to decide how important performance is relative to readability and maintainability of the code.
That said, you should not be cutting off the head of the string repeatedly. That is guaranteed to be inefficient - it'll take time that is quadratic in the number of records in your string, just creating those tail strings.
For parsing each field, you can avoid doing substrings on the length and type fields. For the length field, you can build the number yourself:
int index = ...;
// index points to first digit of length.
int length = 0;
int charCode = source.codeUnitAt(index++);
while (charCode != CHAR_COLON) {
length = 10 * length + charCode - 0x30;
charCode = source.codeUnitAt(index++);
}
// index points to the first character of content.
Since lengths are usually small integers (less than 2<<31), this is likely to be more efficient than creating a substring and calling int.parse.
The type field is a single ASCII character, so you could use codeUnitAt to get its ASCII value instead of creating a single-character string (and then your content interpretation lookup will need to switch on character code instead of character string).
For parsing content, you could pass the source string, start index and length instead of creating a substring. Then the boolean parser can also just read the code unit instead of the singleton character string, the string parser can just make the substring, and the number parser will likely have to make a substring too and call double.parse.
It would be convenient if Dart had a double.parseSubstring(source, [int from = 0, int to]) that could parse a substring as a double without creating the substring.
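
The index-tracking approach described in this answer could look roughly like the sketch below. It is written in Python rather than Dart purely as an illustration; parse_records and CONVERTERS are hypothetical names, and the sketch assumes the key is always a single character and the input is well formed.

```python
def parse_records(source, converters):
    """Parse chained name:key:length:data records without re-slicing the tail."""
    result = {}
    index = 0
    while index < len(source):
        # Name: everything up to the next colon, searched from `index`.
        name_end = source.index(":", index)
        name = source[index:name_end]
        # Key: assumed to be exactly one character followed by a colon.
        key = source[name_end + 1]
        index = name_end + 3
        # Length: built digit by digit instead of slicing and parsing.
        length = 0
        while source[index] != ":":
            length = 10 * length + (ord(source[index]) - 0x30)
            index += 1
        index += 1  # skip the colon; index now points at the data
        result[name] = converters[key](source[index:index + length])
        index += length
    return result


CONVERTERS = {"n": int, "b": lambda d: d == "t", "s": str}
print(parse_records("aNum:n:4:9879aBool:b:1:taString:s:2:Hi", CONVERTERS))
```

Only the name and the data are ever sliced out, so the work is linear in the input length rather than quadratic in the number of records.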

Remove single character occurrence from String

I want an algorithm to remove all occurrences of a given character from a string in O(n) complexity or lower. (It should edit the original string in place.)
eg.
String="aadecabaaab";
removeCharacter='a'
Output:"decbb"

Enjoy algo:
j = 0
for i in 0 .. length(a) - 1:
    if a[i] != symbol:
        a[j] = a[i]
        j = j + 1
finalize:
length(a) = j

You can't do it in place with a String because it's immutable, but here's an O(n) algorithm to do it in place with a char[]:
char[] chars = "aadecabaaab".toCharArray();
char removeCharacter = 'a';
int next = 0;
for (int cur = 0; cur < chars.length; ++cur) {
    if (chars[cur] != removeCharacter) {
        chars[next++] = chars[cur];
    }
}
// chars[0] through chars[4] will have {d, e, c, b, b} and next will be 5
System.out.println(new String(chars, 0, next));

Strictly speaking, you can't remove anything from a String because the String class is immutable. But you can construct another String that has all characters from the original String except for the "character to remove".
Create a StringBuilder. Loop through all characters in the original String. If the current character is not the character to remove, then append it to the StringBuilder. After the loop ends, convert the StringBuilder to a String.
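
The build-a-new-string approach described above might be sketched like this. Python strings are immutable too, so a list of kept characters stands in for the StringBuilder; remove_char is a hypothetical name.

```python
def remove_char(text, to_remove):
    """Return a copy of `text` without any occurrence of `to_remove`."""
    kept = []  # plays the role of the StringBuilder
    for ch in text:
        if ch != to_remove:
            kept.append(ch)
    return "".join(kept)


print(remove_char("aadecabaaab", "a"))  # decbb
```

This is a single pass over the input, so it runs in O(n) time, at the cost of O(n) extra space for the new string.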

Yep. In linear time, iterate over the String and check each character with .charAt(): if it is the removeCharacter, don't copy it to the new String; otherwise, copy it. That's it.

This probably shouldn't have the "java" tag since in Java, a String is immutable and you can't edit it in place. For a more general case, if you have an array of characters (in any programming language) and you want to modify the array "in place" without creating another array, it's easy enough to do with two indexes. One goes through every character in the array, and the other starts at the beginning and is incremented only when you see a character that isn't removeCharacter. Since I assume this is a homework assignment, I'll leave it at that and let you figure out the details.

import java.util.*;
import java.io.*;

public class removeA{
    public static void main(String[] args){
        String text = "This is a test string! Wow abcdefg.";
        System.out.println(text.replaceAll("a",""));
    }
}

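Use a lookup table to hold the characters you want to remove. Note that std::map is an ordered tree rather than a hash table, so each lookup costs O(log k) for k characters to remove.
std::string toRemove = "ad";
std::map<char, int> table;
size_t maxR = toRemove.size();
for (size_t n = 0; n < maxR; ++n)
{
    table[toRemove[n]] = 0;
}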
Then parse the whole string and remove when you get a hit (thestring is an array):
size_t max = strlen(thestring); // assumed: max is the original length, excluding the terminator
size_t counter = 0;
while(thestring[counter] != 0)
{
    std::map<char,int>::iterator iter = table.find(thestring[counter]);
    if (iter == table.end()) // we found a valid character!
    {
        ++counter;
    }
    else
    {
        // move the data - don't increment counter
        // memmove rather than memcpy, because the source and destination overlap
        memmove(&thestring[counter], &thestring[counter+1], max-counter);
    }
}
EDIT: I hope this is not a technical test or something like that. =S
