Regex function to split paragraphs into sentences for Power query - excel

I am attempting to split an example paragraph into sentences using regex in Power Query:
Mr. and Mrs. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? Dr. Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't.However, this line wont do it. Qr. Test for Website.COM and Labs.ORG looks good.Creatively not working. t and finished. 9 to start
Into:
Mr. and Mrs. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it.
Did he mind? Dr. Adam Jones Jr. thinks he didn't.
In any case, this isn't true...
Well, with a probability of .9 it isn't.
However, this line wont do it.
Qr.
Test for Website.
COM and Labs.
ORG looks good.
Creatively not working. t and finished.
9 to start
Here is a function that enables PQ to utilise regex replace:
FnRegexReplace
// regexReplace
let regexReplace=(text as nullable text,pattern as nullable text,replace as nullable text, optional flags as nullable text) as text =>
let
f=if flags = null or flags ="" then "" else flags,
l1 = List.Transform({text, pattern, replace}, each Text.Replace(_, "\", "\\")),
l2 = List.Transform(l1, each Text.Replace(_, "'", "\'")),
t = Text.Format("<script>var txt='#{0}';document.write(txt.replace(new RegExp('#{1}','#{3}'),'#{2}'));</script>", List.Combine({l2,{f}})),
r=Web.Page(t)[Data]{0}[Children]{0}[Children],
Output=if List.Count(r)>1 then r{1}[Text]{0} else ""
in Output
in regexReplace
I also have the following regex, provided in a previous post, which appears to work on Regex101.
https://regex101.com/r/WEC0M9/6
Pattern: (?<!Mr|Mrs|Dr|Jr)(\.+)(\s+(?![a-z])|(?=[A-Z]))
Replace: $1\r\n (I think this can be anything like *)
flags: gm
The issue I have is that when I attempt this in Power Query, no result is returned:
Alternatively, (?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s can be found here, but the same issue occurs.
The issue appears to lie with the lookbehind and lookahead assertions, as the function at least returns a result when they are removed. If anyone can advise on how best to split this paragraph using regex in PQ, as shown above, that would be great.
M Code:
let
Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("TY9BSwNBDIX/yqPnJVRBiicRetBCD1LBQ+1hdid1Qmcmy0zWpf/e2aLgMXnv5X05Hlf7QnDZY18q4ZDEAnqdvoJhCOzGKsY0aMJZC+7oAUliFM3wGqMrtYMQEwJjdOLhENVuXjHCtm2akiT7J2xb0bN3CTvNXLFrowXJl7pYvPj8Oa3X95sWe82N6IrBVe4WT4XUPxVWJiYifHCMHeaF12Es2rteotgVegY9tvp/IXrRmb+5/F6LkhmzZmtP3DjfGss7V1udTj8=", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [Column1 = _t]),
#"Changed Type" = Table.TransformColumnTypes(Source,{{"Column1", type text}}),
#"Invoked Custom Function" = Table.AddColumn(#"Changed Type", "FnRegexReplace", each FnRegexReplace([Column1], "(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s", "$1\r\n", "gm"))
in
#"Invoked Custom Function"
Update1: M Code with proposed Regex:
let
Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("TY9BSwNBDIX/yqPnJVRBiicRetBCD1LBQ+1hdid1Qmcmy0zWpf/e2aLgMXnv5X05Hlf7QnDZY18q4ZDEAnqdvoJhCOzGKsY0aMJZC+7oAUliFM3wGqMrtYMQEwJjdOLhENVuXjHCtm2akiT7J2xb0bN3CTvNXLFrowXJl7pYvPj8Oa3X95sWe82N6IrBVe4WT4XUPxVWJiYifHCMHeaF12Es2rteotgVegY9tvp/IXrRmb+5/F6LkhmzZmtP3DjfGss7V1udTj8=", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [Column1 = _t]),
#"Changed Type" = Table.TransformColumnTypes(Source,{{"Column1", type text}}),
#"Invoked Custom Function" = Table.AddColumn(#"Changed Type", "FnRegexReplace", each FnRegexReplace([Column1], "((?:\S+\.(?:net|org|com)\b|\b[mdjs]rs?\.|\d*\.\d+|[a-z]\.(?:[a-z]\.)+|[^?.!])+(?:[.?!]+|$))[?!.\s]*)", "$1\n", "gi"))
in
#"Invoked Custom Function"

I think the following will be helpful:
For demonstration purposes I loaded the data directly from Excel. I'm sure you can figure out how to connect your PDF.
Since the JavaScript-based function is a small HTML script, we have to escape the apostrophes in the sample text first, using a replace step. Otherwise they clash with the apostrophes that delimit the strings in the script (see the function below), and the function errors out or shows nothing. The apostrophes are shown correctly again after applying the function.
I edited the pattern to catch a full sentence in the first capture group. For this sample I replaced each match with a backreference to that group plus a pipe symbol to visualize the result. Note that negative lookbehind is no longer used, since the engine does not support it. This resulted in a lengthy pattern, which probably does not yet catch every possible quirk:
\s*((?:\b[MDJS]rs?\.|\d*\.\d+|\S+\.(?:com|net|org)\b|[a-z]\.(?:[a-z]\.)+|[^.?!])+(?:[.?!]+|$))
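Because the function hands the pattern to a JavaScript RegExp, the pattern can be tried in any JavaScript console before wiring it into Power Query. The snippet below is a minimal sketch under that assumption: the helper name splitSentences and the shortened sample text are illustrative only, and a current engine may behave differently from the one embedded in Power Query.

```javascript
// Pattern from above, written as a JavaScript string (backslashes doubled),
// used with the same flags as the fnRegexReplace function (gmi).
const pattern = "\\s*((?:\\b[MDJS]rs?\\.|\\d*\\.\\d+|\\S+\\.(?:com|net|org)\\b|" +
  "[a-z]\\.(?:[a-z]\\.)+|[^.?!])+(?:[.?!]+|$))";

// Hypothetical helper: marks the end of each sentence with a pipe, as the
// M code does with "$1|", then splits on that pipe and drops empty pieces.
function splitSentences(text) {
  const marked = text.replace(new RegExp(pattern, "gmi"), "$1|");
  return marked.split("|").filter(function (s) { return s.length > 0; });
}

// Shortened version of the sample paragraph from the question.
const sample = "Mr. and Mrs. Smith bought cheapsite.com for 1.5 million dollars, " +
  "i.e. he paid a lot for it. Did he mind? Dr. Adam Jones Jr. thinks he didn't.";

const sentences = splitSentences(sample);
console.log(sentences);
```

On this sample the sketch yields three sentences; the abbreviations, the domain name, the decimal number and "i.e." do not trigger a split.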
M-Code:
let
Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
#"Changed Type" = Table.TransformColumnTypes(Source,{{"Kol", type text}}),
#"Replaced Value" = Table.ReplaceValue(#"Changed Type","'","&apos",Replacer.ReplaceText,{"Kol"}),
#"Invoked Custom Function" = Table.AddColumn(#"Replaced Value", "fnRegexReplace", each fnRegexReplace([Kol], "\\s*((?:\\b[MDJS]rs?\\.|\\d*\\.\\d+|\\S+\\.(?:com|net|org)\\b|[a-z]\\.(?:[a-z]\\.)+|[^.?!])+(?:[.?!]+|$))", "$1|"))
in
#"Invoked Custom Function"
Used function fnRegexReplace:
(x,y,z)=>
let
Source = Web.Page(
"<script>var x="&"'"&x&"'"&";var z="&"'"&z&
"'"&";var y=new RegExp('"&y&"','gmi');
var b=x.replace(y,z);document.write(b);</script>")
[Data]{0}[Children]{0}[Children]{1}[Text]{0}
in
Source
An online demo of the regular expression.
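To see why the apostrophe has to be escaped first, it helps to look at the script text the function builds. The sketch below reproduces only the JavaScript part of that text and checks whether it still parses; buildScriptBody and parses are hypothetical helper names, and the pattern and replacement are placeholders.

```javascript
// Hypothetical helper: rebuilds the JavaScript that fnRegexReplace places
// between the <script> tags, using the same concatenation order.
function buildScriptBody(x, y, z) {
  return "var x='" + x + "';var z='" + z + "';var y=new RegExp('" + y + "','gmi');" +
    "var b=x.replace(y,z);document.write(b);";
}

// Returns true when the generated source is syntactically valid JavaScript.
// new Function only parses the source here; the result is never called.
function parses(source) {
  try {
    new Function(source);
    return true;
  } catch (e) {
    return false;
  }
}

const raw = "Dr. Adam Jones Jr. thinks he didn't.";
// Same replacement as the "Replaced Value" step in the M code above.
const escaped = raw.split("'").join("&apos");

const rawParses = parses(buildScriptBody(raw, "\\\\s+", "|"));
const escapedParses = parses(buildScriptBody(escaped, "\\\\s+", "|"));
console.log(rawParses, escapedParses);
```

The unescaped text ends the string literal early at the apostrophe in "didn't", so the script no longer parses, which is consistent with the function showing nothing.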

This regex works for most texts out of the box and can be extended to handle issues as they arise.
\s*((?:\b(?:[djms]rs?|flam|liq|st)\.|\b(?:[a-z]\.){2,}|\.\d|\.(?:com|net|org)\b|[^.?!])+(?:[.?!]+|$)) (gmi as flags)
Here flam|liq|st are examples of abbreviations after which a split would normally occur because they are followed by a period. This section of the regex forces them to be ignored. For example, take the text: St. Bernards typically weigh 80kg. This would usually split after St., but adding st to this section of the regex prevents that, so the sentence is captured as a whole. You can keep adding to this section to accommodate most errors. If you come up with any way of improving on this further, please post a comment or answer.
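The effect of the abbreviation section can be checked outside Power Query. The following sketch builds the pattern above from a list of abbreviations and compares the result with and without flam|liq|st; buildSentencePattern and splitSentences are hypothetical names, the second sample sentence is made up, and the abbreviations are assumed to be plain letters with no regex metacharacters.

```javascript
// Hypothetical helper: builds the pattern above, appending the extra
// abbreviations to the [djms]rs? alternative.
function buildSentencePattern(abbreviations) {
  const extra = abbreviations.map(function (a) { return "|" + a; }).join("");
  return "\\s*((?:\\b(?:[djms]rs?" + extra + ")\\.|\\b(?:[a-z]\\.){2,}|\\.\\d|" +
    "\\.(?:com|net|org)\\b|[^.?!])+(?:[.?!]+|$))";
}

// Marks each sentence end with a pipe, then splits on it.
function splitSentences(text, abbreviations) {
  const re = new RegExp(buildSentencePattern(abbreviations), "gmi");
  return text.replace(re, "$1|").split("|").filter(function (s) { return s.length > 0; });
}

const text = "St. Bernards typically weigh 80kg. They are large dogs.";

const withoutList = splitSentences(text, []);
const withList = splitSentences(text, ["flam", "liq", "st"]);
console.log(withoutList);
console.log(withList);
```

Without the list, "St." is returned as a sentence of its own; with st in the list, the first sentence is kept whole.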

Related

Applying function to each row in Query Custom Column

Summary of problem:
I have a PowerQuery Table in Excel that contains 13 columns. The 13th Column is a custom column "Task Start Week Number". I want the PowerQuery to apply a formula to each of the rows generated for this Query. The formula is as follows:
=IFS(AND('Program Dates'!$B$2<WEEKNUM(New_Items_to_Save[Start Date]),
WEEKNUM(New_Items_to_Save[Start Date])<54),
'Program Dates'!$G$2-('Program Dates'!$D$2-(-53+WEEKNUM(New_Items_to_Save[Start Date]))),
WEEKNUM(New_Items_to_Save[Start Date])<'Program Dates'!$B$2,
'Program Dates'!$G$2-('Program Dates'!$D$2-(-53+WEEKNUM(New_Items_to_Save[Start Date])))+53)
What I've done here is reference a cell which contains the formula, that way I can just run the GetValue() function for a named range. I can't get this to work and I don't know what I'm doing wrong.
Thank you in advance for your help!
Context:
This is the query table I need to add the calculation to.
The last column is the custom column, and those values should be calculated using the following cells:
This is the source of the other info needed to calculate the week number of the program, with reference arrows shown.
Note: The dates referenced in the function have already been converted using the WEEKNUM() operation. I am comparing Week# to Week#, not Date to Week#
Function Logic:
AND: if the date falls within the range of the current year, i.e. week# is less than 54, but after the start of the program, then perform this calc.
IFS: otherwise, if week# is before the end of the program, i.e. 2023, then perform this calculation.
Edit:
Here is the PowerQuery function I want to call for each of the new cells in this custom column:
Parameter2 = Date.WeekOfYear(StartWeek)
let
GetWeek = ()
if GetValue("Start_Week") < Parameter2 < 54
then (GetValue("Program_Duration") - GetValue("End_Week") + 53 - Parameter2))
else
(GetValue("Program_Duration") - GetValue("End_Week") + 53 - Parameter2 +53))
in
GetWeek
I don't know if I need the let statement or if I should just put it in a function
f(x) => [equation]
and then call "...each f([column name])" in power query?
I think that there are actually three different parts to your question, and maybe your confusion is coming from combining them all together.
The way I see it is in these parts:
How to create a custom function.
How to apply a function to a new column.
How to apply a function to an existing column.
How to create a custom function
There are two main ways to create a custom function in Power Query:
Using the UI (follow steps here):
Step | Description
1 | Write your query
2 | Parameterise your query
3 | Create your function
Using only code (follow steps here):
Example to filter a table:
let fun_FilterTable = (tbl_InputTable as table, txt_FilterValue as text) as table =>
    let
        Source = tbl_InputTable,
        // Filter the input table itself; [Column] is the column to search
        Filter = Table.SelectRows(Source, each Text.Contains([Column], txt_FilterValue))
    in
        Filter
in
    fun_FilterTable
Example to check if one string contains another:
let fun_CheckStringContains = (txt_String as text, txt_Check as text) as nullable logical =>
let
Source = txt_String,
Check = Text.Contains(Source, txt_Check)
in
Check
in
fun_CheckStringContains
More resources:
Using custom functions
Custom Functions Made Easy in Power BI Desktop
PowerQuery best practices
DataFlow best practices
How to apply a function to a new column
There are also two ways to do this:
Custom Column (follow steps here):
Step | Description
1 | Create custom column
2 | Add function
Custom Function (follow steps here):
Step | Description
1 | Invoke custom function
Sources:
Add a custom column
Using custom functions
Custom Functions Made Easy in Power BI Desktop
How to apply a function to an existing column
There are also two ways to do this (unfortunately, only possible with pure code):
Using Transformation:
Example to uppercase an entire column:
let
Source = Table,
#"Uppercased text" = Table.TransformColumns(Source, {{"Column", each Text.Upper(_), type nullable text}})
in
#"Uppercased text"
Example to add a prefix to all rows in one column:
let
Source = Table,
#"Added prefix" = Table.TransformColumns(Source, {{"Column", each "test_" & _, type text}})
in
#"Added prefix"
Example to coerce column to date in Australian format:
let
Source = Table,
#"Fix date" = Table.TransformColumns(Source, {{"DateColumn", each Date.From(_, "en-AU"), type date}})
in
#"Fix date"
Using Replacement
Example to replace some text:
let
Source = Table,
#"Replaced value" = Table.ReplaceValue(Source, "Admin", "Administrator", Replacer.ReplaceText, {"Column"})
in
#"Replaced value"
Example to replace with values from another column
let
Source = Table,
#"Replaced value" = Table.ReplaceValue(Source, each [FixThisColumn], each [OtherColumn], Replacer.ReplaceText, {"FixThisColumn"})
in
#"Replaced value"
Your Specific Problem
Since no sample data was provided, I have created some here. In future, please provide some data in a minimal reproducible example (see here), so that we can easily recreate the scenario.
Data:
ID | ProgramStartDate | ProgramEndDate
1 | 1/Jan/2020 | 1/Dec/2021
2 | 1/Jan/2022 | 1/Mar/2023
3 | 1/Mar/2022 | 1/Dec/2022
4 | 1/Sep/2021 | 1/Dec/2023
5 | 1/Jan/2023 | 1/Dec/2023
I think that you should be using a combination of the PowerQuery built-in date functions (see here) and some of the PowerQuery conditional processes (see here).
My code would look something like this:
let
Source = Table.FromColumns({{1,2,3,4,5},{"1/Jan/2020","1/Jan/2022","1/Mar/2022","1/Sep/2021","1/Jan/2023"},{"1/Dec/2021","1/Mar/2023","1/Dec/2022","1/Dec/2023","1/Dec/2023"}},{"ID","ProgramStartDate","ProgramEndDate"}),
fix_Types = Table.TransformColumnTypes(Source,{{"ID", Int64.Type}, {"ProgramStartDate", type date}, {"ProgramEndDate", type date}}),
add_Today = Table.AddColumn(fix_Types, "DateToday", each Date.From(DateTime.LocalNow()), type date),
add_CheckCurrentYear = Table.AddColumn(add_Today, "IsInCurrentYear", each Date.IsInCurrentYear([DateToday]), type logical),
add_CheckProgramRunning = Table.AddColumn(add_CheckCurrentYear, "ProgramIsCurrent", each [DateToday]>[ProgramStartDate] and [DateToday]<[ProgramEndDate], type logical),
add_ConditionalCheck = Table.AddColumn(add_CheckProgramRunning, "DoSomething", each if [IsInCurrentYear] and [ProgramIsCurrent] then "Do Something" else null, type text)
in
add_ConditionalCheck
And the final output would look something like this:
ID | ProgramStartDate | ProgramEndDate | DateToday | IsInCurrentYear | ProgramIsCurrent | DoSomething
1 | 1/01/2020 | 1/12/2021 | 22/12/2022 | TRUE | FALSE | null
2 | 1/01/2022 | 1/03/2023 | 22/12/2022 | TRUE | TRUE | Do Something
3 | 1/03/2022 | 1/12/2022 | 22/12/2022 | TRUE | FALSE | null
4 | 1/09/2021 | 1/12/2023 | 22/12/2022 | TRUE | TRUE | Do Something
5 | 1/01/2023 | 1/12/2023 | 22/12/2022 | TRUE | FALSE | null
This should help you work towards resolving your issue.

Text.Contains for multiple values power query

I am attempting to create the following query:
The idea is to check whether each row in the source query contains any of the keywords in the search list, and to return the found words if present.
Importantly, I need this to be dynamic, i.e. the search list could be a single word or 100+ words. Therefore I need to avoid just stitching together a bunch of Text.Contains calls with or statements, if possible.
In effect, I want to create something like
Text.Contains([Column1], {any value in search list}) then FoundWord else null
Data:
Physical hazards Flam. Liq. 3 - H226 Eliminate all sources of ignition.
Health hazards STOT SE 3 - H336. Avoid inhalation of vapours and contact with skin and eyes.
Environmental hazards Not Classified. Avoid the spillage or runoff entering drains, sewers or watercourses.
Personal precautions Keep unnecessary and unprotected personnel away from the spillage.
clothing as described in Section 8 of this safety data sheet. Provide adequate ventilation.
Search List:
Hazards
Eliminate
ventilation
Avoid
Try this code for query Table2, after creating a query named lookfor from the search list:
let Source = Excel.CurrentWorkbook(){[Name="Table2"]}[Content],
#"Changed Type" = Table.TransformColumnTypes(Source,{{"Column1", type text}}),
Findmatch = Table.AddColumn(#"Changed Type", "Found", (x) => Text.Combine(Table.SelectRows(lookfor, each Text.Contains(x[Column1],[Column1], Comparer.OrdinalIgnoreCase))[Column1],", "))
in Findmatch
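The lookup logic of the Findmatch step is easier to follow when written out on its own. The sketch below mirrors it in JavaScript purely for illustration, using rows and search words from the question; findWords is a hypothetical name, and lowercasing both sides only approximates Comparer.OrdinalIgnoreCase.

```javascript
// Hypothetical helper mirroring the Findmatch step: returns the search words
// found in a row (case-insensitive), joined with ", " in search-list order.
function findWords(rowText, searchList) {
  const haystack = rowText.toLowerCase();
  return searchList
    .filter(function (word) { return haystack.includes(word.toLowerCase()); })
    .join(", ");
}

const searchList = ["Hazards", "Eliminate", "ventilation", "Avoid"];

const rows = [
  "Physical hazards Flam. Liq. 3 - H226 Eliminate all sources of ignition.",
  "Health hazards STOT SE 3 - H336. Avoid inhalation of vapours and contact with skin and eyes.",
  "Personal precautions Keep unnecessary and unprotected personnel away from the spillage."
];

const found = rows.map(function (row) { return findWords(row, searchList); });
console.log(found);
```

Because the search list is just a list, it can hold one word or hundreds without changing the code; a row with no hits gives an empty string.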

Extracting data with an associated tag/unit

I have been attempting to separate out key data hidden within sentences of text e.g:
I have made some progress with the following code however it pulls undesired values too:
let
Source = Excel.CurrentWorkbook(){[Name="Table3"]}[Content],
#"Changed Type" = Table.TransformColumnTypes(Source,{{"Input", type text}, {"Desired OutPut", type any}, {"Bonus", type text}}),
#"Added Custom" = Table.AddColumn(#"Changed Type", "Custom", each if Text.Contains([Input], "mmHg") then Text.Remove([Input],Text.ToList(Text.Remove([Input],{"0".."9","-", " ", "."}))) else null),
#"Trimmed Text" = Table.TransformColumns(#"Added Custom",{{"Custom", Text.Trim, type text}})
in
#"Trimmed Text"
As you can see other numerical data is being pulled.
However, I think following these rules is perhaps the wrong way to go about this, and I wonder if it's possible to use mmHg as a tag to extract 'nearby' data. Ideally the value or range will be touching "mmHg", but there are instances where this isn't the case, hence the idea of nearby logic. I appreciate I could remove all data except numbers and mmHg, but I think this idea of tagging, if possible, will be very useful going forward. I'm thinking of something like: if the text contains mmHg, then search for {0..9,"-"} within X characters (say 10 to the left). Is this possible?
As an extra, I will attempt to extract the eye in which this pressure is found. Here I wish to use some sort of logic on a first come, first served basis. I think it is an okay assumption that the first pressure relates to the first mentioned eye in each sentence. I am unsure how to do this in M code. This may, however, warrant a separate question.
I think you can utilize regular expressions here:
Step 1):
Add a custom function to the group of your table:
In this case I called it 'fnRegexExtr' (much like a previous question you asked). The source function I used came from here and is a regex-replace function.
(x,y,z)=>
let
Source = Web.Page(
"<script>var x="&"'"&x&"'"&";var z="&"'"&z&
"'"&";var y=new RegExp('"&y&"','g');
var b=x.replace(y,z);document.write(b);</script>")
[Data]{0}[Children]{0}[Children]{1}[Text]{0}
in
Source
Step 2):
On the 'Add Column' tab, invoke this custom function. Use the following parameters:
x - Input
y - (\\d+(?:-\\d+)?)\\D*mmHg|.
z - $1
Step 3):
We can add another column using the same function with different parameters:
x - Input
y - \\b(right|left)\\s*eye\\b|.
z - $1
Please note the trailing space in the replacement: the M-code below passes "$1 " rather than "$1". With spaces between the values of capture group 1, PQ auto-trims the result.
Step 4):
Under tab 'Transform' I simply replaced errors with 'null' values.
Step 5):
Edited the M-code to replace the spaces between values with comma-space delimiters.
Result:
M-Code:
let
Source = Excel.CurrentWorkbook(){[Name="Tabel1_2"]}[Content],
#"Changed Type" = Table.TransformColumnTypes(Source,{{"Input", type text}}),
#"Invoked Custom Function" = Table.AddColumn(#"Changed Type", "mmHg", each Text.Replace(fnRegexExtr([Input], "(\\d+(?:-\\d+)?)\\D*mmHg|.", "$1 ")," ",", ")),
#"Invoked Custom Function1" = Table.AddColumn(#"Invoked Custom Function", "Side", each Text.Replace(fnRegexExtr([Input], "\\b(right|left)\\s*eye\\b|.", "$1 ")," ",", ")),
#"Replaced Errors" = Table.ReplaceErrorValues(#"Invoked Custom Function1", {{"mmHg", null}, {"Side", null}})
in
#"Replaced Errors"
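The two patterns can be tried in a JavaScript console to see what the "|." alternative does: every character that is not part of a hit is replaced by a space, so only the captured values remain. This is a sketch under assumptions: the sample sentence is made up, extractValues is a hypothetical name, and the whitespace that Power Query trims for you is collapsed explicitly here.

```javascript
// Hypothetical helper: applies pattern and "$1 " the way fnRegexExtr does,
// then collapses the leftover runs of spaces into a list of values.
function extractValues(input, pattern) {
  const blanked = input.replace(new RegExp(pattern, "g"), "$1 ");
  const trimmed = blanked.trim();
  return trimmed === "" ? [] : trimmed.split(/\s+/);
}

// Made-up sample: one unrelated number, one value with a space before the
// unit, and one range touching the unit.
const input = "Patient aged 65: pressure 21 mmHg in the right eye " +
  "and 18-20mmHg in the left eye.";

const pressures = extractValues(input, "(\\d+(?:-\\d+)?)\\D*mmHg|.");
const sides = extractValues(input, "\\b(right|left)\\s*eye\\b|.");

console.log(pressures.join(", "));
console.log(sides.join(", "));
```

The unrelated number 65 is dropped because \D* cannot skip over the digits of the next number to reach mmHg.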

Extract CAS Number from Downloaded Data

I have downloaded a CSV file from PubChem containing over 5000 records. One of the columns contains a list of computed synonyms, from which I wish to extract the CAS number. Unfortunately, the CAS number isn't necessarily in the same position in this list, which makes splitting by delimiter more difficult. Below is the source data example and the desired output I am trying to achieve.
An older answer to a post a while back used a Regex function to extract strings of Numbers with a given length.
fnRegexExtr
let fx=(text,regex)=>
Web.Page(
"<script>
var x='"&text&"';
var y=new RegExp('"&regex&"','g');
var b=x.match(y);
document.write(b);
</script>")[Data]{0}[Children]{0}[Children]{1}[Text]{0}
in
fx
I am unsure whether this is possible here and am unfamiliar with regex, but I'm wondering if this function can be modified to extract CAS numbers. The difficulty is that CAS numbers can be in various formats: they are up to 10 digits long, using the format xxxxxxx-yy-z.
If anyone has any alternative solutions to extracting CAS numbers with this somewhat complex data feel free to post.
Data:
cid and cmpdname can be anything.
1-Aminopropan-2-ol|1-AMINO-2-PROPANOL|78-96-6|Isopropanolamine|Monoisopropanolamine
1-chloro-2,4-dinitrobenzene|2,4-Dinitrochlorobenzene|97-00-7|Dinitrochlorobenzene|DNCB|Chlorodinitrobenzene|CDNB
1,2-dichloroethane|Ethylene dichloride|107-06-2|Ethylene chloride|Ethane, 1,2-dichloro-|Glycol dichloride|Dutch liquid|Dutch oil|Ethane dichloride|Aethylenchloride
1,2,4-trichlorobenzene|120-82-1|Benzene, 1,2,4-trichloro-|unsym-Trichlorobenzene|Hostetex L-pec|Trojchlorobenzene
CHLOROACETALDEHYDE|2-chloroacetaldehyde|107-20-0|Chloroethanal|2-Chloroethanal|Acetaldehyde, chloro-|Chloroaldehyde|Monochloroacetaldehyde|2-Chloro-1-ethanal
In PQ, this will pull out the first item in cmpdsynonym that does not contain a letter, which I think is basically what you are looking for:
let Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
#"Added Custom" = Table.AddColumn(Source, "Custom.3", each List.RemoveNulls(List.Transform(Text.Split([cmpdsynonym],"|"), each if _ = Text.Remove (_,{"A".."Z","a".."z"}) then _ else null)){0})
in #"Added Custom"
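The same "first item without letters" rule can be sketched outside Power Query to check it against the sample data. firstItemWithoutLetters is a hypothetical name, and unlike the {0} lookup in the M code this version returns null instead of raising an error when nothing qualifies.

```javascript
// Hypothetical helper mirroring the Custom.3 step: returns the first
// pipe-delimited item that contains no ASCII letters, or null if none.
function firstItemWithoutLetters(synonyms) {
  const hit = synonyms.split("|").find(function (item) {
    return !/[A-Za-z]/.test(item);
  });
  return hit === undefined ? null : hit;
}

const first = firstItemWithoutLetters(
  "1-Aminopropan-2-ol|1-AMINO-2-PROPANOL|78-96-6|Isopropanolamine|Monoisopropanolamine");
const second = firstItemWithoutLetters(
  "1,2-dichloroethane|Ethylene dichloride|107-06-2|Ethylene chloride|Ethane, 1,2-dichloro-");
const none = firstItemWithoutLetters("Isopropanolamine|Monoisopropanolamine");

console.log(first, second, none);
```

Note that this rule would also pick up any other letter-free synonym that happens to come before the CAS number.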
Here's one way of doing it in PQ, using fnRegexExtr to return the CAS; and a simple Text.Split to return the chemical compound name:
let
//Read in data and set data type as text
Source = Excel.CurrentWorkbook(){[Name="Compounds"]}[Content],
#"Changed Type" = Table.TransformColumnTypes(Source,{{"Column1", type text}}),
//Transform to desired output
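Result = Table.FromColumns(
    // First column: the first pipe-delimited item of each row (the compound name)
    {List.Transform(#"Changed Type"[Column1], each Text.Split(_,"|"){0})}
    // Second column: the CAS number(s) found by the regex
    & {List.Transform(#"Changed Type"[Column1], each fnRegexExtr(_, "\\b\\d{1,7}-\\d{2}-\\d"))},
    type table[Compound=text, CAS=text]
)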
in
Result
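The CAS pattern itself can be checked in a JavaScript console against rows from the question. This is a minimal sketch: extractCas is a hypothetical name, and the call mirrors the x.match(y) with the g flag used inside fnRegexExtr.

```javascript
// Hypothetical helper: returns every substring that looks like a CAS number
// (1 to 7 digits, 2 digits, 1 digit, separated by hyphens).
function extractCas(text) {
  const matches = text.match(new RegExp("\\b\\d{1,7}-\\d{2}-\\d", "g"));
  return matches === null ? [] : matches;
}

const casA = extractCas(
  "1-Aminopropan-2-ol|1-AMINO-2-PROPANOL|78-96-6|Isopropanolamine|Monoisopropanolamine");
const casB = extractCas(
  "1,2,4-trichlorobenzene|120-82-1|Benzene, 1,2,4-trichloro-|unsym-Trichlorobenzene|Hostetex L-pec|Trojchlorobenzene");
const casC = extractCas(
  "CHLOROACETALDEHYDE|2-chloroacetaldehyde|107-20-0|Chloroethanal|2-Chloroethanal|2-Chloro-1-ethanal");

console.log(casA, casB, casC);
```

Locants such as "1,2,4-" or "2-Chloro-1-" are skipped because the pattern requires exactly two digits between the hyphens.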

Power M query syntax to get the value for a named cell in Excel

I am still learning about Power Query and Power M and I'm trying to get the value of a specific "named" cell in Excel and use this in Power M. It is just a single cell and
=Record.Field(Excel.CurrentWorkbook(){[Name="weekone"]}[Content]{0},Excel.CurrentWorkbook(){[Name="weekone"]}[Content]{0})
Maybe I am not understanding the syntax of how to reach information in a particular field correctly, or I am getting mixed up on how to use the Record.Field() function.
Any help or guidance that can be provided would be greatly appreciated! Thanks!
Record.Field gives the value of a field in a record.
It takes the record as the first argument and the name of the field as the second argument.
In a step by step approach it will be clearer:
let
Source = Excel.CurrentWorkbook(){[Name="weekone"]}[Content],
#"Changed Type" = Table.TransformColumnTypes(Source,{{"Column1", type date}}),
FirstRecord = #"Changed Type"{0},
RecordValue = Record.Field(FirstRecord,"Column1")
in
RecordValue
Or, in 1 line:
= DateTime.Date(Record.Field(Excel.CurrentWorkbook(){[Name="weekone"]}[Content]{0},"Column1"))
This would be an alternative:
= DateTime.Date(Excel.CurrentWorkbook(){[Name="weekone"]}[Content]{0}[Column1])
My preference would be:
= DateTime.Date(Table.FirstValue(Excel.CurrentWorkbook(){[Name="weekone"]}[Content]))

Resources