PowerQuery remove items from a list matching a pattern

Using PowerQuery in Excel, how can I remove items from a list that match a pattern?
I have a column whose cells contain names and numeric IDs. I want to be left with just a list of names.
LastName, FirstName;#123;#LastName, FirstName;#321;
The numbers are all unique. If I had regex, the pattern would be similar to
/^#\d+$/
I can split the cell into a list using ';' as a separator.
= Text.Split([Consultant],";")
If there was a way to remove every 2nd item until the end that could work too. Unfortunately it seems there is no way to specify patterns to match.
List.RemoveItems({1, 2, 3, 4, 2, 5, 5}, {2, 4, 6})
This would be awesome, but I would have to list every number that exists, so this fails.
List.RemoveMatchingItems(Text.Split([Consultant], ";#"), {1,2,3,4,5,6,7,8,9})
Method 2
I split the text into a list as above, which gave me a column of lists, and then expanded the lists to new rows. My plan was to remove alternate rows. However, removing alternate rows requires an end number, and I would need it to continue until there are no more rows to process.

There are many ways.
One way is to select every other item with List.Select.
In your example, these would be the items at even-numbered positions (positions are zero-based).
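let
    // Split on ";#": names land at positions 0, 2, ... and IDs at 1, 3, ...
    x = Text.Split([Column1], ";#"),
    // Keep only the items whose position in the list is even
    y = List.Select(x, each Number.IsEven(List.PositionOf(x, _)))
in
    y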
Edit Nov 2022
Another method of removing every other item would be:
=List.Alternate(Text.Split([Column1],";#"),1,1,1)
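For readers who want to sanity-check the two ideas outside PowerQuery, here is a minimal Python sketch (not M code). It assumes the cell format shown in the question, and the function names `split_cell`, `names_by_pattern` and `names_by_position` are made up for this illustration.

```python
import re

# Matches an item that consists only of digits, e.g. "123"
ID_PATTERN = re.compile(r"\d+")


def split_cell(cell):
    """Split on ';#' and drop the trailing ';' left on the last item."""
    return [part.strip(";") for part in cell.split(";#")]


def names_by_pattern(cell):
    """Keep every non-empty item that is not purely numeric."""
    return [p for p in split_cell(cell) if p and not ID_PATTERN.fullmatch(p)]


def names_by_position(cell):
    """Keep every other item, starting with the first (positions 0, 2, ...)."""
    return split_cell(cell)[::2]


cell = "LastName, FirstName;#123;#LastName, FirstName;#321;"
print(names_by_pattern(cell))
print(names_by_position(cell))
```

The pattern version does not depend on names and IDs strictly alternating; the position version does.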

Related

Comparing the word-counts of two files, accounting for the number of occurrences

I'm currently working on a program which is supposed to find exploits for vulnerabilities in web-applications by looking at the "Document Object Model" (DOM) of the application.
One approach for narrowing down the number of possible DB-entries follows the strategy of further filtering the entries by comparing the word-count of the DOM and the database entry.
I already have two dicts (actually DataFrames, but I am showing dicts here for better presentation), each containing the top 10 words in descending order of their number of occurrences in the text.
word_count_dom = {"Peter": 10, "is": 6, "eating": 2, ...}
word_count_db = {"eating": 6, "is": 6, "Peter": 1, "breakfast": 1, ...}
Now I would like to calculate some kind of value that represents how similar the two dicts are while accounting for the number of occurrences.
Currently I'm using:
sum(word in word_count_db for word in word_count_dom)
>>> 3
but this does not account for the number of occurrences at all.
Looking at the example, I would like the program to give more weight to the "is" match, because of the generally better "ranking position to number of occurrences" value.

Just an idea:
For each dict, compute the relative probability of each entry occurring (e.g. among all the top counts, "Peter" occurs 20% of the time). Do this for each word occurring in either dict. Then use something like:
https://en.wikipedia.org/wiki/Bhattacharyya_distance
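A minimal sketch of that idea, assuming plain dicts of counts and using only the entries shown in the question (the elided ones are left out). The function name `bhattacharyya_distance` is chosen for this illustration; a word missing from one dict is treated as having probability 0 there.

```python
import math


def bhattacharyya_distance(counts_a, counts_b):
    """Bhattacharyya distance between two word-count dicts (0 = identical)."""
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    if total_a == 0 or total_b == 0:
        raise ValueError("both dicts need at least one non-zero count")

    # Bhattacharyya coefficient: sum of sqrt(p(w) * q(w)) over all words,
    # where a word missing from one dict has probability 0 there.
    coefficient = sum(
        math.sqrt((counts_a.get(word, 0) / total_a) * (counts_b.get(word, 0) / total_b))
        for word in set(counts_a) | set(counts_b)
    )
    # No shared words means coefficient 0, i.e. infinite distance.
    if coefficient == 0:
        return math.inf
    # Clamp at 0 to absorb floating-point rounding for identical inputs.
    return max(0.0, -math.log(coefficient))


word_count_dom = {"Peter": 10, "is": 6, "eating": 2}
word_count_db = {"eating": 6, "is": 6, "Peter": 1, "breakfast": 1}
print(bhattacharyya_distance(word_count_dom, word_count_db))
```

Smaller values mean more similar distributions, so candidate database entries could be ranked by ascending distance.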

pdfplumber - How to extract table with no horizontal lines?

So I have a table like this one, with an unknown number of description lines. Each row can have zero, 1, 2, 5, or more lines:
(I removed all sensitive information.)
and I use:
import pdfplumber

with pdfplumber.open("invoice.pdf") as pdf:
    pages = pdf.pages
    for page in pages:
        page.extract_table()
which does extract all the data from the table, but it treats the second column as one row.
I want to somehow split the lines of the second column (or, better, of all columns) at the small blank gaps, which I highlighted with red rectangles.
I know that I need to use table_settings={}, but I can't figure out yet which properties to use.
What I tried:
print(page.extract_table(table_settings={
    "horizontal_strategy": "text",
    "snap_y_tolerance": 3,
    "keep_blank_chars": True,
}))
Which, again, splits wherever it wants.
So is it possible to extract a mixed, partly borderless table?

How to check for words in two excel columns?

I have two columns in my Excel sheet and want to put the result in a third column. The formula should check the tags in both columns.
If the tags in the two columns do not match at all, the output should be "inaccurate" or "failed".
If every tag matches, the output should be "accurate".
If only some of the tags match, the output should be "incomplete".
EXAMPLE SHOWN:
Another example, where I am unable to get "incomplete" as output. It only checks for inaccurate and accurate.

You can use LET and FILTERXML to test to see if the entire/a partial/no part of the string is in another. For example:
=LET(x, FILTERXML("<t><s>"&SUBSTITUTE(I1, ",", "</s><s>")&"</s></t>", "//s"),
y, FILTERXML("<t><s>"&SUBSTITUTE(J1, ",", "</s><s>")&"</s></t>", "//s"),
z, COUNT(MATCH(x,y,0)),
IF(z=MIN(COUNTA(x), COUNTA(y)), "Accurate", IF(z>0, "Partial", "Failed")))
Here we first create a dynamic array of the contents of each cell and call them x and y respectively. Next, we count how many items of x are found in y. Finally, we test this count: if the count is equal to the size of the smaller dynamic array, then all the tags match and we call it "Accurate". Otherwise, if there is any match between the dynamic arrays, we call it "Partial". Finally, if there are no matches, we call it "Failed".
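The decision logic of the formula can also be written out as a short Python sketch, which may make the three cases easier to follow. This is an illustration rather than a replacement for the formula: the function name `classify_tags` is hypothetical, and trimming whitespace and ignoring case are assumptions added here.

```python
def classify_tags(left, right):
    """Classify two comma-separated tag strings the way the formula does."""
    # Assumption: tags are trimmed and compared case-insensitively.
    x = [tag.strip().casefold() for tag in left.split(",")]
    y = [tag.strip().casefold() for tag in right.split(",")]

    # z mirrors COUNT(MATCH(x, y, 0)): how many items of x are found in y.
    z = sum(1 for tag in x if tag in y)

    if z == min(len(x), len(y)):
        return "Accurate"
    return "Partial" if z > 0 else "Failed"


print(classify_tags("red,green,blue", "red,green,blue"))
print(classify_tags("red,green", "red,yellow"))
print(classify_tags("red", "blue"))
```

Because the count is compared with the size of the smaller list, a cell whose tags are a strict subset of the other cell's tags is also reported as "Accurate".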

powerquery split column by variable field lengths

In PowerQuery I need to import a fixed width txt file (each line is the concatenation of a number of fields, each field has a fixed specific length).
When I import it I get a table with one single column that contains the txt lines, e.g. in the following format:
AAAABBCCCCCDDD
I want to add more columns in this way:
Column1: AAAA
Column2: BB
Column3: CCCCC
Column4: DDD
In other words the fields composing the source column are of known length, but this length is not the same for all fields (in the example above the lengths are: 4,2,5,3).
I'd like to use the "Split Column">"By number of character" utility but I can only insert one single length at a time, and to get the desired output I'd have to repeat the process 3 times, adding one column each time and using the "Once, as far left as possible" option for the "Split Column">"By number of character" utility.
My real-life case has many different line types (files) to import and convert, each with more than 20 fields, so a less manual approach is needed; I'd like to somehow specify the record structure (the length of each field) and get the lines split automagically :)
There would probably be the need for some M code, which I know nothing about: can anybody point me to the right direction?
Thanks!

Create a query with the formula below. Let's call this query SplitText:
let
    SplitText = (text, lengths) =>
        let
            LengthsCount = List.Count(lengths),
            // Keep track of the index in the lengths list and the position in the text to take the next characters from. Use this information to get the next segment and put it into a list.
            Split = List.Generate(() => {0, 0}, each _{0} < LengthsCount, each {_{0} + 1, _{1} + lengths{_{0}}}, each Text.Range(text, _{1}, lengths{_{0}}))
        in
            Split,
    // Convert the list to a record
    ListToRecord = (text, lengths) =>
        let
            List = SplitText(text, lengths),
            Record = Record.FromList(List, List.Transform({1 .. List.Count(List)}, each Number.ToText(_)))
        in
            Record
in
    ListToRecord
Then, in your table, add a custom column that uses this formula:
each SplitText([Column1], {4, 2, 5, 3})
The first argument is the text to split, and the second argument is a list of lengths to split by.
Finally, expand the column to get the split text values into your table. You may want to rename the columns since they will be named 1, 2, etc.
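The core of the approach, a running position that advances by each field length, can be sketched in Python to show what the List.Generate step is doing. This is an illustration only, not M code, and the function name `split_fixed_width` is made up for the example.

```python
def split_fixed_width(text, lengths):
    """Cut text into consecutive fields with the given lengths."""
    fields = []
    position = 0  # start of the next field within text
    for length in lengths:
        fields.append(text[position:position + length])
        position += length
    return fields


print(split_fixed_width("AAAABBCCCCCDDD", [4, 2, 5, 3]))
```

One difference to keep in mind: Python slicing silently returns a shorter string when a line is too short, whereas Text.Range in M raises an error in that case.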

Using VLookUp for a partial search

I have two tables in excel.
In table 1, one column contains a list of order numbers in the format XXXX-YYYY, where each X is a digit and each Y is a letter. For example, 3485-XTIP.
Table 2 also has an order number column but this time it's in the format XXXX-YYYY (ZZ) where Z is the initials of the customer who made the order. Example: 3485-XTIP (KN)
How can I use a VLookUp to search for the order number in Table 2 but only using the XXXX-YYYY part? I tried using TRUE for an approximate search but it still failed for some reason.
This is what I have
=VLOOKUP("I3",'Table2 '!A:B,2,FALSE)
I am open to any alternatives other than VLookup for this situation.
Note that there are hundreds of order numbers and entering the strings manually will take forever.

You can use * as wildcard and add it at the end of the order number so that your VLOOKUP will match any order plus any other characters that come after it:
=VLOOKUP(I3&"*", 'Table2 '!A:B, 2, 0)
* will match anything after the order number.
Note: 0 and False have the same behaviour here.
