Filtering letter combinations - excel

Hi, I'm looking for help with the following problem.
I have a utility that gives me all the combinations for a set of letters (or values). This is in the form of 8 choose n, i.e. there are 8 letters and I can produce all the combinations for sequences of no more than 4 letters, so n can be 2, 3, or 4.
Now here it gets a bit more complex: the 8 letters are made up of three lists or groups: A,B,C,D; E1,E2; F1,F2.
As I say, I can get all the 2-, 3- and 4-sequences without a problem. But I need to filter the result so that (in the n=2 case) each combination contains at least one letter from A,B,C,D and one from either the E set or the F set.
So, as a few examples, where n=2
AE1 or DF2… is ok but AB or E1E2 or E1F1… is not ok
Where n=3 the rules alter slightly but it’s the same principle
ABE1, ABF1, BDF2 or BE2F1… is ok but ABC, ABD, AE1E2, DF1F2 or E1E2F1… is not ok.
Similarly, where n=4
ABE1F1, ABE1F2… is ok but ABCD, ABE1E2, CDF1F2 or E1E2F1F2… is not ok.
I’ve tried a few things using different formulas such as with Match and Countif but can’t quite figure it out. So would be very grateful for any help.
Jon

I've been trying to find an approach to this problem that takes some of the messiness out of it. There are two factors that make this a bit awkward to deal with:
(a) Combination of single letters and bigrams (digrams?)
(b) Possibility of several different letters / bigrams at each position in the string.
It's possible to deal with both of these issues by classifying the letters or bigrams into three groups or classes:
(1) Letters A-D - let's call this group L
(2) First pair of bigrams E1 & E2 - let's call this group M
(3) Second pair of bigrams F1 & F2 - let's call this group N.
Then we can make a list of the allowed combinations of groups, which as far as I can work out is something like this:
For n=2
LM
LN
For n=3
LLM
LLN
LMN
For n=4
LLMN
(I don't know if LLLM etc. is allowed, but these can be added.)
I'm going to make a big assumption that the utility mentioned in the OP doesn't generate strings like AAAA or E1E1E1E1; otherwise it would be pretty useless and you would be better off starting from scratch.
So you just need a nested SUBSTITUTE that looks like this (with the combination in A2):
=SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(A2,"A","L"),"B","L"),"C","L"),"D","L"),"E1","M"),"E2","M"),"F1","N"),"F2","N")
And a lookup in the list of allowed patterns (here in D2:D10, with the substituted pattern in B2):
=ISNUMBER(MATCH(B2,$D$2:$D$10,0))
Then filter on the lookup value being TRUE.
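Outside Excel, the same classify-then-look-up idea can be sketched in a few lines. This is only an illustration in Python, assuming the eight tokens and the allowed patterns listed above; the names TOKEN_CLASS, ALLOWED, class_pattern and allowed_combinations are made up for the example.

```python
from itertools import combinations

# Class of each token: L = plain letters, M = E set, N = F set (as named above).
TOKEN_CLASS = {"A": "L", "B": "L", "C": "L", "D": "L",
               "E1": "M", "E2": "M", "F1": "N", "F2": "N"}

# Allowed class patterns from the list above; add e.g. "LLLM" if that should pass.
ALLOWED = {"LM", "LN", "LLM", "LLN", "LMN", "LLMN"}


def class_pattern(tokens):
    """Map a combination such as ("A", "E1") to its sorted class pattern "LM"."""
    return "".join(sorted(TOKEN_CLASS[t] for t in tokens))


def allowed_combinations(n):
    """Return every n-token combination whose class pattern is allowed."""
    return ["".join(combo) for combo in combinations(TOKEN_CLASS, n)
            if class_pattern(combo) in ALLOWED]


if __name__ == "__main__":
    for n in (2, 3, 4):
        print(n, len(allowed_combinations(n)))
```

Sorting the classes makes the check independent of the order in which the utility emits the tokens.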

Related

Excel formula that produces one of two options

This is my first StackOverflow question, so apologies if I am unclear.
Currently, my work uses an Excel tracking doc to log project info. The column info is like so:
CELL B1 (Project Number) =IF(B2=""," ",MID(B2,FIND("P2",B2),9))
CELL B2 (Project Name) Client / P2XXXXXXX / Name
Thus, the P2XXXXXXX gets pulled out of B2 and populated into B1.
However, management has recently switched systems, so now, some project numbers have the P2XXXXXXX format and others have a PRJ-XXXXX format.
So we need a formula that produces nothing if the cell is blank and EITHER the P2XXXXXXX number or the PRJ-XXXXX number if the cell is not blank.
Is it possible? If any further details are needed, let me know. Thanks in advance!

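Well, if the / is always there, then this can work:
=IF(B2="","",MID(B2,FIND("/",B2,1)+2,9))
assuming the project number is always 9 characters and follows the first slash and a space. (Both P2XXXXXXX and PRJ-XXXXX are 9 characters long.)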

String Between Two Same Characters
Maybe next month your company will start using a different first letter or add more digits, e.g. SPRXXXXXXXXXX. So you could solve this problem by extracting whatever is between those two slashes.
=IF(B2="","",TRIM(MID(B2,FIND("/",B2)+1,FIND("/",B2,FIND("/",B2)+1)-FIND("/",B2)-1)))
Find the first slash with =FIND("/",B2). The text we want starts one position after it:
=FIND("/",B2)+1
Find the second slash, searching from the position after the first one:
=FIND("/",B2,FIND("/",B2)+1)
Now get the string between them:
=MID(B2,FIND("/",B2)+1,FIND("/",B2,FIND("/",B2)+1)-FIND("/",B2)-1)
(The length is the second position minus the start position, i.e. second - (first + 1), which expands to second - first - 1. That is why the +1 becomes a -1.)
Remove the leading and trailing spaces:
=TRIM(MID(B2,FIND("/",B2)+1,FIND("/",B2,FIND("/",B2)+1)-FIND("/",B2)-1))
Add the condition when the cell is blank:
=IF(B2="","",TRIM(MID(B2,FIND("/",B2)+1,FIND("/",B2,FIND("/",B2)+1)-FIND("/",B2)-1)))
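Here's another way using LEFT and RIGHT. RIGHT keeps everything after the first slash, and LEFT then keeps everything before the next slash in that remainder:
=IF(B2="","",TRIM(LEFT(RIGHT(B2,LEN(B2)-FIND("/",B2)),FIND("/",RIGHT(B2,LEN(B2)-FIND("/",B2)))-1)))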
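For comparison, the "text between the first two slashes" logic can be sketched outside Excel. This is a minimal Python illustration, not part of the original answer; the function name between_slashes is hypothetical, and returning an empty string when there are fewer than two slashes is an assumption (the Excel formulas above would return an error in that case).

```python
def between_slashes(cell):
    """Return the trimmed text between the first two "/" characters.

    Returns "" for a blank cell, and also (by assumption) when the cell
    contains fewer than two slashes.
    """
    if not cell:
        return ""
    parts = cell.split("/")
    if len(parts) < 3:
        return ""
    return parts[1].strip()


if __name__ == "__main__":
    print(between_slashes("Client / P2XXXXXXX / Name"))
    print(between_slashes("Client / PRJ-XXXXX / Name"))
```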

Although you can solve this problem with a combination of slicing, trimming, and complex conditionals, the most expressive and easiest-to-maintain solution is to use regular expressions. Regular expressions have a bit of a learning curve, but there are playground websites where you can experiment with them, and there are good write-ups on how regular expressions work in Excel.
Specifically, this regular expression addresses the two naming conventions you've highlighted, but it can be updated to support more naming conventions as your company inevitably adds more:
P(RJ-)?((\d){9}|(\d){5})
To break that down from left to right:
P: both patterns start with a "P"
(RJ-)?: one pattern follows with "RJ-", but the other doesn't. This is a grouped part of the pattern, and the question mark means that this part of the pattern is optional.
((\d){9}|(\d){5}): by far the nastiest part, but this basically means that there is going to be a sequence of digits (\d), and there will be either nine of them or five of them. Because the whole thing is wrapped in parentheses, the digits are always the second captured group, no matter the length of the sequence. This means that you can always extract the numeric part of the project id by looking at the value of the second capture group.
You can also make the expression more general by replacing ((\d){9}|(\d){5}) with simply (\d+). That just means "one or more digits." That gives you a much simpler overall expression:
P(RJ-)?(\d+)
Depending on whether or not you care about validating strictly that project ids are 5 OR 9 digits long, that pattern above might be suitable, and it has the benefit of being more flexible. Still, the project ID is in the second captured group.
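To see the generalized pattern in action, here is a small sketch using Python's re module (chosen only for illustration; Excel itself needs a regex-capable add-in, VBA, or a recent function set). The names PROJECT_ID and project_number are hypothetical. The whole match is the project number including its prefix, while group 2 holds just the digits.

```python
import re

# Generalized pattern from above: "P", an optional "RJ-", then one or more digits.
PROJECT_ID = re.compile(r"P(RJ-)?(\d+)")


def project_number(cell):
    """Return the first project number found in the cell, or "" if none."""
    match = PROJECT_ID.search(cell or "")
    return match.group(0) if match else ""


if __name__ == "__main__":
    print(project_number("Client / P212345678 / Name"))
    print(project_number("Client / PRJ-12345 / Name"))
```

Note that search returns the first match, so a client name that itself contains "P" followed by digits would be picked up first.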

String parsing in optimal way

Suppose I have a string such as onehhhtwominusthreehhkkseveneightjnine
Now I want to parse this string to get the numbers out of it. For example, this string should return the array [one,two,minusthree,seven,eight,nine].
The order of the integers should be maintained.
Can anyone please suggest an optimal way to do this parsing? Thanks.

(You haven't mentioned a programming language?)
I would probably search for "minus" and check the number(s) that follow it. Then search for "one", then "two", noting their indexes. This would provide enough information to map and output the results, and order, that you need.
Another option is to look at each character in order, comparing each to the 10 choices. I couldn't tell you which is the most efficient - I think it depends on the possible total string length. I'd probably write both and profile them.
If the string to search is not of inordinate length then I suspect that the second approach might be more efficient. This is because, as soon as you have a match, you can eliminate searching the following (known) length of characters.
That is, if you have "abceightd", once you discover the "e" and its "eight" you can skip four characters. You can also skip the a, b, and c anyway, as they are not the beginning character for any of the 10 choices.
I am assuming your choices are:
one, two, three, four, five, six, seven, eight, nine, minus
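The second approach (scan left to right and skip ahead after each match) might look like the sketch below. It is written in Python only because no language was specified, and it assumes the ten choices listed above; the names WORDS and parse_number_words are made up for the example. A "minus" that is not directly followed by a number word is dropped, which is also an assumption.

```python
WORDS = ("one", "two", "three", "four", "five", "six", "seven", "eight", "nine")


def parse_number_words(text):
    """Scan text once, collecting number words (with a leading "minus" if present)."""
    result = []
    i = 0
    pending_minus = False
    while i < len(text):
        if text.startswith("minus", i):
            pending_minus = True
            i += len("minus")          # skip the whole matched word
            continue
        for word in WORDS:
            if text.startswith(word, i):
                result.append(("minus" if pending_minus else "") + word)
                i += len(word)         # skip the whole matched word
                break
        else:
            i += 1                     # not the start of any choice
        pending_minus = False          # a minus only applies to the very next token
    return result


if __name__ == "__main__":
    print(parse_number_words("onehhhtwominusthreehhkkseveneightjnine"))
```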

Assuming that a) you have access to regular expressions in your choice of programming language and b) your possible choices are as Andy G has assumed... then this regular expression can pick out the numbers grouped with their associated minus, if present:
/((?:minus)*(?:one|two|three|four|five|six|seven|eight|nine))/g
Applied to your example string using JavaScript's RegExp.exec(), for example, this extracts:
one
two
minusthree
seven
eight
nine
You could easily place a space after any minus matched if required. Does this help at all?
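The same pattern carries over to other regex engines. As a rough sketch in Python (again only an illustrative choice of language), re.findall returns all non-overlapping matches in order; the outer capture group is dropped here because findall already returns the whole match when the pattern has no capturing groups. The names NUMBER_WORDS and extract_numbers are hypothetical.

```python
import re

# Same alternatives as the pattern above, without the outer capture group.
NUMBER_WORDS = re.compile(
    r"(?:minus)*(?:one|two|three|four|five|six|seven|eight|nine)"
)


def extract_numbers(text):
    """Return the number words in text, in order, keeping any leading "minus"."""
    return NUMBER_WORDS.findall(text)


if __name__ == "__main__":
    print(extract_numbers("onehhhtwominusthreehhkkseveneightjnine"))
```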

List of items find almost duplicates

Within Excel I have a list of artists, songs, and editions.
This list contains over 15000 records.
The problem is that the list contains some "duplicate" records. I say "duplicate" as they aren't a complete match. Some might have a few typos, and I'd like to fix this up and remove those records.
So for example some records:
ABBA - Mamma Mia - Party
ABBA - Mama Mia! - Official
Each dash indicates a separate column (so 3 columns A, B, C are filled in)
How would I mark them as duplicates within Excel?
I've found out about the Fuzzy Lookup tool. Yet I'm working on a Mac, and since it's not available on Mac I'm stuck.
Is there any regex magic or VBA script that can help me out?
It'd also be all right to see how similar each row is (say 80% similar).

One of the common methods for fuzzy text matching is the Levenshtein (distance) algorithm. Several nice implementations of this exist here:
https://stackoverflow.com/a/4243652/1278553
From there, you can use the function directly in your spreadsheet to find similarities between instances.
You didn't ask, but a database would be really nice here. The reason is that you can do a Cartesian join (one of the very few valid uses for this) and compare every single record against every other record. For example:
select
    s1.group, s2.group, s1.song, s2.song,
    levenshtein(s1.group, s2.group) as group_match,
    levenshtein(s1.song, s2.song) as song_match
from
    songs s1
    cross join songs s2
order by
    group_match, song_match
Yes, this would be a very costly query, depending on the number of records (in your example 225,000,000 rows), but it would bubble the most likely duplicates / matches to the top. Not only that, but you can incorporate "reasonable" joins to eliminate obvious mismatches, for example limiting it to cases where the group matches, nearly matches, or begins with the same letter, or pre-filtering out pairs where the Levenshtein distance is greater than x.
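For readers without a database or the VBA function to hand, the distance itself is short to write. The following is a plain-Python sketch of the standard dynamic-programming Levenshtein distance, plus a hypothetical similarity score (1 minus distance divided by the longer length) and a hypothetical near_duplicates helper that compares every pair. It is not the implementation from the linked answer, and the pairwise loop is quadratic just like the cross join, so it would be slow on 15000 records without pre-filtering.

```python
from itertools import combinations


def levenshtein(a, b):
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a                      # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a, b):
    """Score in [0, 1]; 1.0 means identical (an assumed normalization)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


def near_duplicates(titles, threshold=0.8):
    """Return all pairs whose case-insensitive similarity reaches the threshold."""
    return [(a, b) for a, b in combinations(titles, 2)
            if similarity(a.lower(), b.lower()) >= threshold]


if __name__ == "__main__":
    print(levenshtein("Mamma Mia", "Mama Mia!"))
    print(near_duplicates(["Mamma Mia", "Mama Mia!", "Waterloo"], threshold=0.75))
```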

You could use an array formula to flag the duplicates, and you could modify the one below to show the row numbers. It checks the rows beneath each entry for possible 80% duplicates, where 80% means the leading 80% of the text (left to right), not a comparison of the whole string. My data is in A1:A15000.
=IF(NOT(ISERROR(FIND(MID($A1,1,INT(LEN($A1)*0.8)),$A2:$A$15000))),1,0)
This version also looks back up the list, to count the matches found in both directions:
=SUM(IF(ISERROR(FIND(MID($A2,1,INT(LEN($A2)*0.8)),$A3:$A$15000,1)),0,1))+SUM(IF(ISERROR(FIND(MID($A2,1,INT(LEN($A2)*0.8)),$A$1:$A1,1)),0,1))
The first entry (row 1) uses only the first part of the formula, and the last row needs only the part after the +.

Try this worksheet function in your loop:
=COUNTIF(Range,"*yourtexttofind*")

Finding all words in a paragraph whose first three letters are the same?

How can we solve this problem in the best way? Is there any algorithm for solving this?
"In a paragraph we have to find and print all the words which have the same first 3 letters. Example: we input some paragraph and as output we get words like:
a) 1. you 2. your 3. yours 4. yourself
b) 1. early 2. earlier 3. earliest
Like this we get all the words of paragraph which have starting 3 letters common"

A reasonable solution that's not too hard to code up is to maintain a map of some sort where the keys are the first three letters of each word and the values are the sets of words that start with those three letters. You can scan across the words in the paragraph and, for each one you encounter, take its first three letters, look up the map entry corresponding to those letters, and add that word to the set. You can then iterate over the map at the end, find all sets containing at least two words, and print out each cluster you find.
Overall, the runtime of this approach is O(L), where L is the total length of all the words in the paragraph. To see this, notice that for each word, we do a map lookup on a constant-sized prefix of that word, then copy all the characters of the word into the map. Overall, this visits each character at most a constant number of times.
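A minimal sketch of that map-based approach, in Python for illustration. It assumes words are runs of ASCII letters, compares case-insensitively, and counts a three-letter word such as "you" as belonging to its own prefix group; the function name prefix_groups is made up for the example.

```python
import re
from collections import defaultdict


def prefix_groups(paragraph):
    """Group distinct words by their first three letters.

    Only prefixes shared by at least two distinct words are returned.
    """
    groups = defaultdict(dict)           # inner dict acts as an insertion-ordered set
    for word in re.findall(r"[a-z]+", paragraph.lower()):
        if len(word) >= 3:
            groups[word[:3]][word] = None
    return {prefix: list(words)
            for prefix, words in groups.items() if len(words) >= 2}


if __name__ == "__main__":
    text = "You and your friends came early, earlier than yours."
    for prefix, words in prefix_groups(text).items():
        print(prefix, words)
```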

A trie keyed on the first three characters, with the word index as the leaf, should do the trick.

Transform string from a1b2c3d4 to abcd1234

I am given a string which has numbers and letters. Numbers occupy all odd positions and letters even positions. I need to transform this string such that all letters move to the front of the array and all numbers to the end.
The relative order of the letters and numbers needs to be preserved.
I need to do this in O(n) time and O(1) space.
e.g. a1b2c3d4 -> abcd1234, x3y4z6 -> xyz346
This previous question has an explanation of the algorithm, but no matter how hard I try, I can't get a hold of it.
I hope someone can explain this to me with an example test case.

The key is to think of the input array as a matrix like this:
a 1
b 2
c 3
d 4
and realize that you want the transpose of this matrix
a b c d
1 2 3 4
Remember, multi-dimensional arrays are really just single-dimensional arrays in disguise so you can do this.
But you need to do this in-place to satisfy the O(1) space requirement. Fortunately, this is a well-known problem complete with several possible approaches.
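As a concrete illustration of the transpose view, here is a cycle-following sketch in Python. It assumes an even-length list of characters (Python strings are immutable, so a list stands in for the array), and the name letters_first is hypothetical. It uses O(1) extra space, but the check that finds the start of each cycle can push the running time above O(n), so it is a simpler stand-in rather than one of the linear-time algorithms.

```python
def letters_first(chars):
    """Rearrange [a, 1, b, 2, ...] into [a, b, ..., 1, 2, ...] in place."""
    size = len(chars)
    if size < 4:                  # lengths 0 and 2 are already in the target order
        return chars
    half = size // 2              # number of rows in the (half x 2) matrix
    last = size - 1

    def destination(i):
        # Index i is row i // 2, column i % 2; after the transpose it sits at
        # column * half + row, which equals (i * half) % last for i < last.
        return i if i == last else (i * half) % last

    for start in range(1, last):
        # Rotate each cycle only once, starting from its smallest index.
        j = destination(start)
        while j > start:
            j = destination(j)
        if j < start:
            continue
        # Walk the cycle, dropping the carried value and picking up the displaced one.
        carried = chars[start]
        j = destination(start)
        while j != start:
            chars[j], carried = carried, chars[j]
            j = destination(j)
        chars[start] = carried
    return chars


if __name__ == "__main__":
    print("".join(letters_first(list("a1b2c3d4"))))
    print("".join(letters_first(list("x3y4z6"))))
```

For a1b2c3d4 the index cycles are 1 -> 4 -> 2 -> 1 and 3 -> 5 -> 6 -> 3, while indices 0 and 7 stay where they are.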
