A Better way to Extract Sentences from PDFs using Power Query/BI


I have been exploring the idea of using Power BI to dynamically extract information from any PDF, given its URL.
In this case the URL is https://hpvchemicals.oecd.org/ui/handler.axd?id=fae8d1b1-406b-4287-8a05-f81aa1b16d3f, which is a safety assessment profile for formaldehyde. I want to extract as many sentences from the PDF document as possible.
Assuming ". " marks the end of a sentence works really well where paragraphs are concerned and, along with some other trickery, splits the entire PDF into sentences which I can then search.
M Code (updated for improved Extraction):
let
Source = Pdf.Tables(Web.Contents("https://hpvchemicals.oecd.org/ui/handler.axd?id=fae8d1b1-406b-4287-8a05-f81aa1b16d3f"), [Implementation="1.3"]),
#"Expanded Data" = Table.ExpandTableColumn(Source, "Data", {"Column1", "Column2", "Column3", "Column4", "Column5", "Column6", "Column7", "Column8", "Column9", "Column10", "Column11", "Column12", "Column13", "Column14", "Column15", "Column16", "Column17", "Column18", "Column19", "Column20", "Column21", "Column22", "Column23", "Column24", "Column25", "Column26", "Column27", "Column28", "Column29", "Column30", "Column31", "Column32", "Column33", "Column34", "Column35", "Column36", "Column37", "Column38", "Column39", "Column40", "Column41", "Column42"}, {"Column1", "Column2", "Column3", "Column4", "Column5", "Column6", "Column7", "Column8", "Column9", "Column10", "Column11", "Column12", "Column13", "Column14", "Column15", "Column16", "Column17", "Column18", "Column19", "Column20", "Column21", "Column22", "Column23", "Column24", "Column25", "Column26", "Column27", "Column28", "Column29", "Column30", "Column31", "Column32", "Column33", "Column34", "Column35", "Column36", "Column37", "Column38", "Column39", "Column40", "Column41", "Column42"}),
#"Filtered Rows" = Table.SelectRows(#"Expanded Data", each ([Kind] = "Page")),
#"Removed Columns1" = Table.RemoveColumns(#"Filtered Rows",{"Name", "Kind"}),
#"Added Index" = Table.AddIndexColumn(#"Removed Columns1", "Index", 0, 1, Int64.Type),
Exclude={"Id"},
List=List.Difference(Table.ColumnNames(#"Removed Columns1"),Exclude),
MergeAllColumns= Table.AddColumn(#"Added Index","Custom", each Text.Combine(Record.ToList( Table.SelectColumns(#"Added Index",List){[Index]}), " ")),
#"Removed Other Columns" = Table.SelectColumns(MergeAllColumns,{"Id", "Custom"}),
#"Split Column by Delimiter" = Table.ExpandListColumn(Table.TransformColumns(#"Removed Other Columns", {{"Custom", Splitter.SplitTextByDelimiter("#(lf)", QuoteStyle.None), let itemType = (type nullable text) meta [Serialized.Text = true] in type {itemType}}}), "Custom"),
#"Replaced Value" = Table.ReplaceValue(#"Split Column by Delimiter",". ",".. #",Replacer.ReplaceText,{"Custom"}),
#"Split Column by Delimiter1" = Table.ExpandListColumn(Table.TransformColumns(#"Replaced Value", {{"Custom", Splitter.SplitTextByDelimiter(". ", QuoteStyle.None), let itemType = (type nullable text) meta [Serialized.Text = true] in type {itemType}}}), "Custom"),
#"Cleaned Text" = Table.TransformColumns(#"Split Column by Delimiter1",{{"Custom", Text.Clean, type text}}),
#"Added Index1" = Table.AddIndexColumn(#"Cleaned Text", "Index", 0, 1, Int64.Type),
#"Added Custom" = Table.AddColumn(#"Added Index1", "Custom.1", each if Text.Contains([Custom], "#") then [Index] else null),
#"Filled Down" = Table.FillDown(#"Added Custom",{"Custom.1"}),
#"Grouped Rows" = Table.Group(#"Filled Down", {"Custom.1"}, {{"Custom", each Text.Combine([Custom], " "), type text}}),
#"Split Column by Delimiter2" = Table.ExpandListColumn(Table.TransformColumns(#"Grouped Rows", {{"Custom", Splitter.SplitTextByDelimiter(". ", QuoteStyle.None), let itemType = (type nullable text) meta [Serialized.Text = true] in type {itemType}}}), "Custom"),
#"Added Custom1" = Table.AddColumn(#"Split Column by Delimiter2", "Custom.2", each if Text.Contains([Custom], "NOAEL") then [Custom] else null)
in
#"Added Custom1"
Although splitting sentences on ". " works pretty well, I'm now being greedy and wondering whether this can be done in an even better way.
There are some cases where sentences don't split correctly. For example, if the end of one sentence is joined directly to the next, as in "Hello.How are you", no split occurs. Conversely, "...e.g. Apple itself is recognised" is wrongly split after "e.g.", just before "Apple".
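Both failure cases can be reduced without regex. The following is only a rough sketch, not a tested drop-in for the query above: the function name SplitSentences, the abbreviation list and the placeholder character are all assumptions made for illustration. It hides the dots inside known abbreviations, inserts the missing space after a full stop that is directly followed by a capital letter, and only then splits on ". ".

```powerquery
let
    // Hypothetical helper, not part of the original query
    SplitSentences = (txt as text) as list =>
        let
            // Assumed list of abbreviations that must not end a sentence
            Abbreviations = {"e.g.", "i.e."},
            // Assumed placeholder (middle dot) that does not occur in the text
            Placeholder = "#(00B7)",
            // Hide the dots inside abbreviations so they are not treated as sentence ends
            Protected = List.Accumulate(Abbreviations, txt,
                (state, abbr) => Text.Replace(state, abbr, Text.Replace(abbr, ".", Placeholder))),
            // Turn "Hello.How" into "Hello. How" for every capital letter A..Z
            Spaced = List.Accumulate({"A".."Z"}, Protected,
                (state, letter) => Text.Replace(state, "." & letter, ". " & letter)),
            // Split on ". " and restore the hidden dots
            Pieces = Text.Split(Spaced, ". "),
            Restored = List.Transform(Pieces, each Text.Replace(_, Placeholder, "."))
        in
            Restored
in
    SplitSentences("Hello.How are you. This is e.g. Apple itself.")
    // expected: {"Hello", "How are you", "This is e.g. Apple itself."}
```

A known limitation of this sketch is that initials such as "U.S.A" would also be spaced out and split, so the abbreviation list would need extending for real documents.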
The question "Python - RegEx for splitting text into sentences (sentence-tokenizing)" appears to propose doing this with regex; however, that doesn't appear to work with this Excel regex replace function.
fnRegexExtr3 (doesn't require \\ just \):
// regexReplace
let regexReplace=(text as nullable text,pattern as nullable text,replace as nullable text, optional flags as nullable text) as text =>
let
f=if flags = null or flags ="" then "" else flags,
l1 = List.Transform({text, pattern, replace}, each Text.Replace(_, "\", "\\")),
l2 = List.Transform(l1, each Text.Replace(_, "'", "\'")),
t = Text.Format("<script>var txt='#{0}';document.write(txt.replace(new RegExp('#{1}','#{3}'),'#{2}'));</script>", List.Combine({l2,{f}})),
r=Web.Page(t)[Data]{0}[Children]{0}[Children],
Output=if List.Count(r)>1 then r{1}[Text]{0} else ""
in Output
in regexReplace
Please feel free to look at the PDF link and suggest any improvements for capturing the data.
I will continue to update here with any progress.

Related

Extracting Text Between boundaries by applying logical parameters

First, I think this is a complicated question to follow. Please see the steps of my M code, which I think will make it clearer.
So I am trying to achieve the following:
The idea is that any text in the input box can be limited to relevant sections using the parameters table, whenever the text in the parameters box is contained in the text being searched. For me, at least, the question is more complicated than it first appears. If required or interested, please see my explanation below:
Desired Output
Fundamentally, I want a way of filtering the input text using the parameters box, such that each line returned lies only within the relevant sections Start1-End1 and Start2-End2.
Ideally, you can use any part of the text to set the limits. So I could say everything between SECTION 1-2 and between lines 2 and 5, which returns lines 2-5.
Or you could say everything between SECTION 1-3, lines 4-8, which returns lines 4-8. Note you could even say SECTION 1-3, line 4 to SECTION 4, which would return lines 4 up to 14.
Finally, you could even overlap sections; these should still be captured separately, and the lines where the overlap occurs should repeat in the output.
M Code:
Parameters:
let
Source = Excel.CurrentWorkbook(){[Name="Parameters"]}[Content],
#"Changed Type" = Table.TransformColumnTypes(Source,{{"Start1", type text}, {"End1", type text}, {"Start2", type text}, {"End2", type text}}),
#"Filled Down" = Table.FillDown(#"Changed Type",{"Start1", "End1"}),
#"Duplicated Column1" = Table.DuplicateColumn(#"Filled Down", "Start1", "Start1 - Copy"),
#"Duplicated Column2" = Table.DuplicateColumn(#"Duplicated Column1", "End1", "End1 - Copy"),
#"Duplicated Column3" = Table.DuplicateColumn(#"Duplicated Column2", "Start2", "Start2 - Copy"),
#"Duplicated Column4" = Table.DuplicateColumn(#"Duplicated Column3", "End2", "End2 - Copy"),
#"Added Custom" = Table.AddColumn(#"Duplicated Column4", "Custom", each "X"),
#"Merged Columns" = Table.CombineColumns(#"Added Custom",{"Start1", "Custom"},Combiner.CombineTextByDelimiter("+++", QuoteStyle.None),"Start1"),
#"Duplicated Column" = Table.DuplicateColumn(#"Merged Columns", "End1", "End1 - Copy.1"),
#"Merged Columns3" = Table.CombineColumns(#"Duplicated Column",{"Start1", "End1"},Combiner.CombineTextByDelimiter(",", QuoteStyle.None),"Search1"),
#"Added Custom1" = Table.AddColumn(#"Merged Columns3", "Custom", each "X"),
#"Added Custom3" = Table.AddColumn(#"Added Custom1", "Custom.1", each "X"),
#"Merged Columns1" = Table.CombineColumns(#"Added Custom3",{"Start2", "Custom"},Combiner.CombineTextByDelimiter("+++", QuoteStyle.None),"Start2"),
#"Merged Columns4" = Table.CombineColumns(#"Merged Columns1",{"End2", "Custom.1"},Combiner.CombineTextByDelimiter("+++", QuoteStyle.None),"End2"),
#"Merged Columns2" = Table.CombineColumns(#"Merged Columns4",{"Start2", "End2", "End1 - Copy.1"},Combiner.CombineTextByDelimiter(",", QuoteStyle.None),"Search2"),
#"Added Custom2" = Table.AddColumn(#"Merged Columns2", "Custom", each Table.FromColumns({Text.Split([Search1], ","), Text.Split([Search2], ",")})),
#"Removed Other Columns" = Table.SelectColumns(#"Added Custom2",{"Start1 - Copy", "End1 - Copy","Start2 - Copy","End2 - Copy","Custom"}),
#"Expanded Custom" = Table.ExpandTableColumn(#"Removed Other Columns", "Custom", {"Column1", "Column2"}, {"Search1", "Search2"}),
#"Split Column by Delimiter" = Table.SplitColumn(#"Expanded Custom", "Search1", Splitter.SplitTextByEachDelimiter({"+++"}, QuoteStyle.None, true), {"Search1", "Filter1"}),
#"Split Column by Delimiter1" = Table.SplitColumn(#"Split Column by Delimiter", "Search2", Splitter.SplitTextByEachDelimiter({"+++"}, QuoteStyle.None, true), {"Search2", "Filter2"}),
#"Changed Type1" = Table.TransformColumnTypes(#"Split Column by Delimiter1",{{"Search1", type text}, {"Filter1", type text}, {"Search2", type text}, {"Filter2", type text}}),
#"Filled Down1" = Table.FillDown(#"Changed Type1",{"Search1"}),
#"Sorted Rows" = Table.Sort(#"Filled Down1",{{"Filter1", Order.Ascending}}),
#"Replaced Value" = Table.ReplaceValue(#"Sorted Rows",null,"",Replacer.ReplaceValue,{"Filter1", "Search2", "Filter2"})
in
#"Replaced Value"
Text:
let
Source = Excel.CurrentWorkbook(){[Name="TextToSearch"]}[Content],
#"Changed Type" = Table.TransformColumnTypes(Source,{{"Text", type text}}),
Search1 = Table.AddColumn(#"Changed Type", "Search1", (x) => Table.SelectRows(Parameters, each Text.Contains(x[Text],[Search1], Comparer.OrdinalIgnoreCase))),
#"Expanded Search1" = Table.ExpandTableColumn(Search1, "Search1", {"Search1", "Filter1"}, {"Search1", "Filter1"}),
#"Filled Down" = Table.FillDown(#"Expanded Search1",{"Search1", "Filter1"}),
#"Filtered Rows" = Table.SelectRows(#"Filled Down", each ([Filter1] = "X")),
#"Removed Other Columns" = Table.SelectColumns(#"Filtered Rows",{"Text", "Search1"}),
Search2 = Table.AddColumn(#"Removed Other Columns", "Search2", (x) => Table.SelectRows(Parameters, each Text.Contains(x[Search1],[Search1], Comparer.OrdinalIgnoreCase) and Text.Contains(x[Text],[Search2], Comparer.OrdinalIgnoreCase))),
#"Removed Other Columns1" = Table.SelectColumns(Search2,{"Text", "Search2"}),
#"Expanded Search2" = Table.ExpandTableColumn(#"Removed Other Columns1", "Search2", {"Start1 - Copy", "End1 - Copy", "Start2 - Copy", "End2 - Copy", "Search2", "Filter2"}, {"Start1 - Copy", "End1 - Copy", "Start2 - Copy", "End2 - Copy", "Search2.1", "Filter2"}),
#"Filled Down1" = Table.FillDown(#"Expanded Search2",{"Search2.1", "Filter2"}),
#"Filtered Rows1" = Table.SelectRows(#"Filled Down1", each ([Filter2] = "X")),
#"Removed Other Columns2" = Table.SelectColumns(#"Filtered Rows1",{"Start1 - Copy", "End1 - Copy", "Start2 - Copy", "End2 - Copy", "Text"}),
#"Filled Down2" = Table.FillDown(#"Removed Other Columns2",{"Start1 - Copy", "End1 - Copy", "Start2 - Copy", "End2 - Copy", "Text"}),
#"Renamed Columns" = Table.RenameColumns(#"Filled Down2",{{"Start1 - Copy", "Start1"}, {"End1 - Copy", "End1"}, {"Start2 - Copy", "Start2"}, {"End2 - Copy", "End2"}})
in
#"Renamed Columns"
Real Example:
https://1drv.ms/x/s!AsrLaUgt0KCLvUgQQctfMtFe057l?e=AkbeP3

Here you go.
let
Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("i45WCnZ1DvH091MwVNKBs42A7JzMvFSwIJhhohSrE60E5MEE4EpMwTLIOmFsY7gpBnCWIYpqUyTVpnCT4aqNjJRiYwE=", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [Start1 = _t, End1 = _t, Start2 = _t, End2 = _t]),
#"Changed Type" = Table.TransformColumnTypes(Source,{{"Start1", type text}, {"End1", type text}, {"Start2", type text}, {"End2", type text}}),
#"Replaced Value" = Table.ReplaceValue(#"Changed Type","",null,Replacer.ReplaceValue,{"Start1", "End1"}),
#"Filled Down" = Table.FillDown(#"Replaced Value",{"Start1", "End1"}),
#"Added Custom" = Table.AddColumn(#"Filled Down", "Custom", each let
input = Input[Text],
s1 = List.PositionOf(input, [Start1]),
e1 = List.PositionOf(input, [End1]),
r1 = if s1=e1 then List.Range(input,s1) else List.Range(input,s1,e1-s1+1),
s2 = List.PositionOf(r1, [Start2]),
e2 = List.PositionOf(r1, [End2]),
r2 = List.Range(r1,s2,e2-s2+1)
in r2),
#"Expanded Custom" = Table.ExpandListColumn(#"Added Custom", "Custom")
in
#"Expanded Custom"

Here it is with partial matches.
let
Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("i45W8kjNUdJRCnZ1DvH091MwArJzMvNSFQxhDBOlWJ1oJSAPJgBXYgqWQdYJYxvDTTGAswxRVJsiqTaFmwxXbWSkFBsLAA==", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [Start1 = _t, End1 = _t, Start2 = _t, End2 = _t]),
#"Changed Type" = Table.TransformColumnTypes(Source,{{"Start1", type text}, {"End1", type text}, {"Start2", type text}, {"End2", type text}}),
#"Replaced Value" = Table.ReplaceValue(#"Changed Type","",null,Replacer.ReplaceValue,{"Start1", "End1"}),
#"Filled Down" = Table.FillDown(#"Replaced Value",{"Start1", "End1"}),
#"Added Custom" = Table.AddColumn(#"Filled Down", "Custom", each let
input = Input[Text],
s1 = List.PositionOf(input, List.FindText(input,[Start1]){0}),
e1 = List.PositionOf(input, List.FindText(input,[End1]){0}),
r1 = if s1=e1 then List.Range(input,s1) else List.Range(input,s1,e1-s1+1),
// search within the Start1-End1 range (r1) so the matched item is always an item of r1
s2 = List.PositionOf(r1, List.FindText(r1,[Start2]){0}),
e2 = List.PositionOf(r1, List.FindText(r1,[End2]){0}),
r2 = List.Range(r1,s2,e2-s2+1)
in r2),
#"Expanded Custom" = Table.ExpandListColumn(#"Added Custom", "Custom")
in
#"Expanded Custom"

A better way to extract Subheading numbers using power query

I am attempting to extract section heading numbers from a column in Excel using Power Query.
I have already achieved this by matching against an existing list. However, I wonder if there is a better way to achieve this in fewer steps.
M Code:
let
Source = Excel.CurrentWorkbook(){[Name="Table7"]}[Content],
#"Trimmed Text1" = Table.TransformColumns(Source,{{"Column1", PowerTrim, type text}}),
SectionNumbers = Table.AddColumn(#"Trimmed Text1", "SectionNumber", (x) => Text.Combine(Table.SelectRows(SectionNumbers, each Text.Contains(x[Column1],[SectionNumbers], Comparer.OrdinalIgnoreCase))[SectionNumbers],", ")),
#"Split Column by Delimiter2" = Table.SplitColumn(SectionNumbers, "SectionNumber", Splitter.SplitTextByEachDelimiter({","}, QuoteStyle.None, true), {"SectionNumber.1", "SectionNumber.2"}),
#"Added Custom1" = Table.AddColumn(#"Split Column by Delimiter2", "Custom", each if [SectionNumber.2] = null then [SectionNumber.1] else [SectionNumber.2]),
#"Removed Other Columns" = Table.SelectColumns(#"Added Custom1",{"Column1", "Custom"})
in
#"Removed Other Columns"
The Section numbers being matched to can be generated using:
SectionNumbers
let
Source = {1..16},
#"Converted to Table" = Table.FromList(Source, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
#"Added Custom" = Table.AddColumn(#"Converted to Table", "Custom", each {1..9}),
#"Expanded Custom" = Table.ExpandListColumn(#"Added Custom", "Custom"),
#"Added Custom1" = Table.AddColumn(#"Expanded Custom", "Custom.1", each "."),
#"Merged Columns" = Table.CombineColumns(Table.TransformColumnTypes(#"Added Custom1", {{"Custom", type text}}, "en-GB"),{"Custom", "Custom.1"},Combiner.CombineTextByDelimiter("", QuoteStyle.None),"Merged"),
#"Merged Columns1" = Table.CombineColumns(Table.TransformColumnTypes(#"Merged Columns", {{"Column1", type text}}, "en-GB"),{"Column1", "Merged"},Combiner.CombineTextByDelimiter(".", QuoteStyle.None),"SectionNumbers")
in
#"Merged Columns1"
Essentially, I would like a way of extracting any decimal number at the start of a cell: 15.0. or 15.0, or even 15.0.1, etc.
I have considered using regex, i.e. \d+\.\d+[.], which should work. However, I have many rows and find that regex can be computationally intensive in this case, so it takes much longer to load than the above M code.
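A regex-free check along these lines could look like the sketch below. It is only an illustration: the helper name LeadingSectionNumber and the sample rows are assumptions, and it treats the first space-separated token as a section number only when that token starts with a digit, contains a dot, and consists solely of digits and dots.

```powerquery
let
    // Sample data (assumed) standing in for the worksheet table
    Source = #table({"Column1"}, {{"15.0. Scope"}, {"15.0.1 Details"}, {"No number here"}, {"3 items"}}),
    // Hypothetical helper: returns the leading section number, or null if there is none
    LeadingSectionNumber = (cell as nullable text) as nullable text =>
        let
            trimmed = if cell = null then "" else Text.Trim(cell),
            // First space-separated token of the cell
            token = Text.Split(trimmed, " "){0},
            // True when nothing is left after removing digits and dots
            onlyDigitsAndDots = Text.Remove(token, {"0".."9", "."}) = "",
            startsWithDigit = List.Contains({"0".."9"}, Text.Start(token, 1)),
            hasDot = Text.Contains(token, ".")
        in
            if onlyDigitsAndDots and startsWithDigit and hasDot then token else null,
    Result = Table.AddColumn(Source, "SectionNumber", each LeadingSectionNumber([Column1]), type nullable text)
in
    Result
```

With the sample rows, the first two should yield "15.0." and "15.0.1", and the last two null. No lookup list is needed, so section numbers outside a pre-generated range are still picked up.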

Another power query method
Since you know your section numbers, you could:
Generate a (buffered) list of all the section numbers.
See if the first space-separated part of the string in Column1 exists in the section number list.
let
Source = Excel.CurrentWorkbook(){[Name="Table3"]}[Content],
#"Changed Type" = Table.TransformColumnTypes(Source,{{"Column1", type text}}),
//create list of all serial numbers
SerialNumbers = List.Buffer(
let
Part1 = List.Transform({1..16}, each Text.From(_) & "."),
Part2 =List.Transform({1..10}, each Text.From(_) & "."),
sn = List.Accumulate(Part1,{}, (state, current)=> state &
List.Generate(
()=>[s=current & Part2{0}, idx=0],
each [idx] < List.Count(Part2),
each [s=current & Part2{[idx]+1}, idx=[idx]+1],
each [s]))
in
sn),
#"Added Custom" = Table.AddColumn(#"Changed Type", "Custom", each
let
x = Text.Split([Column1]," "){0}
in
if List.Contains(SerialNumbers,x) then x else null, type text)
in
#"Added Custom"

How about
let Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
#"Added Custom" = Table.AddColumn(Source, "Custom", each if Text.PositionOfAny([Column1], {"0".."9"})>=0 then Text.BeforeDelimiter(Text.From([Column1])," ") else null)
in #"Added Custom"

Appending incomplete sentences together using grammatical rules (Mainly solved But answer needs simplifying)

Pulling data into Excel from a text file or PDF occasionally splits sentences.
Following a previous post, using PQ I can group text together and then use regex to identify a . that belongs to the end of a sentence and split on it. This works really well, except where a subheading exists without punctuation. This results in a minor error where sentences are occasionally joined together. For example:
SUBHEADING 1
Non-sensical
sentence 1. Sentence 2.
Returns as:
SUBHEADING 1 Non-sensical sentence 1.
Sentence 2.
I initially thought this was the best that could be done; however, the focus so far has been on identifying the end of a sentence with a full stop. We could also use the fact that sentences tend to start with a capital letter to refine this further. The way Excel interprets PDF pages means that prior transformations are required to group the text, which I am pretty certain rules out using regex.
Here is an example of what I am trying to do (orange) and where I currently am (green):
Notice how each sentence starts with a capital letter and ends with a full stop. I think this will be a logical way to separate SUBHEADINGS/CHAPTERS, etc., from the rest of the text. From this, I can then apply the remainder of the transformations from the previous post and, if I'm right, this should give better separation.
So far:
I have identified which rows start with a capital letter or lowercase, and which end in lowercase or a full stop (where ending in a number is also considered lowercase). Using this, I think it's safe to assume that:
If a row ends in lowercase AND the next row also starts with lowercase, both belong to the same sentence and can be appended.
If a row starts with uppercase and ends in a full stop, it is a complete sentence and requires no changes.
If a row starts with uppercase without a full stop at the end, and the next row also starts with uppercase, it is a subheading/extra info, not a typical sentence.
I think it is possible to generate the desired table using these rules. I am having trouble determining the transformations required to append the sentences following the above rules. I need some way to identify the upper/lower/. status of the rows above and below, and I just can't see how.
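One possible way to see the rows above and below is to shift the column by one position in each direction and put the shifted copies next to the original. This is a minimal sketch under assumptions: the sample rows and the column names Previous and Next are made up for illustration, and the upper/lower/full-stop checks would then be applied to those columns.

```powerquery
let
    // Sample rows (assumed) in the shape of the source data in this post
    Source = #table({"Column1"}, {{"SUBHEADING 1"}, {"Non-sensical"}, {"sentence 1. Sentence 2."}}),
    Rows = Source[Column1],
    // Shift the column down by one to get the row above, and up by one to get the row below
    Previous = {null} & List.RemoveLastN(Rows, 1),
    Next = List.RemoveFirstN(Rows, 1) & {null},
    // Rebuild a table with each row next to its neighbours
    Combined = Table.FromColumns({Rows, Previous, Next}, {"Column1", "Previous", "Next"})
in
    Combined
```

The first row gets null for Previous and the last row gets null for Next, so any rule that inspects a neighbour needs to handle null.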
I will update as I progress; however, if anyone could comment on how to achieve this using these rules, that would be great.
Note there is an error with the current M code where 3d. is recognised as uppercase.
M Code:
let
Source = Excel.CurrentWorkbook(){[Name="Table5"]}[Content],
#"Added Custom" = Table.AddColumn(Source, "Custom", each if Text.Start([Column1], 1) = Text.Lower(Text.Start([Column1], 1)) then "Lowercase" else "Uppercase"),
#"Added Custom1" = Table.AddColumn(#"Added Custom", "Custom.1", each if Text.End([Column1], 1) = "." then Text.End([Column1], 1) else if Text.End([Column1], 1) = Text.Lower(Text.End([Column1], 1)) then "Lowercase" else null)
in
#"Added Custom1"
Source Data:
SUBHEADING 1
Non-sensical
sentence 1. Sentence 2.
Sentence 3 part 3a,
part 3b, and
part 3c.
SUBHEADING 1.1.
Non-sensical
sentence 4. Sentence 5.
Sentence 6 part 3a,
part 3b, 3c and
3d.
2.0. SUBHEADING
Extra Info 1 (Not a proper sentence)
Extra Info 2
Sentence 7.
SUBHEADING 3.0
Sentence 8.
SOLUTION SO FAR (Minor error if Subheading starts with number)
let
Source = Excel.CurrentWorkbook(){[Name="Table5"]}[Content],
#"Added Custom" = Table.AddColumn(Source, "Start", each if Text.Start([Column1], 1) = Text.Lower(Text.Start([Column1], 1)) then "Lowercase" else "Uppercase"),
#"Added Custom1" = Table.AddColumn(#"Added Custom", "End", each if Text.End([Column1], 1) = "." then Text.End([Column1], 1) else if Text.End([Column1], 1) = Text.Lower(Text.End([Column1], 1)) then "Lowercase" else null),
#"Added Custom4" = Table.AddColumn(#"Added Custom1", "Complete Sentences", each if [Start] = "Uppercase" and [End] = "." then "COMPLETE" else null),
#"Added Index" = Table.AddIndexColumn(#"Added Custom4", "Index", 0, 1, Int64.Type),
#"Added Custom2" = Table.AddColumn(#"Added Index", "StartCheckCaseBelow", each try #"Added Index" [Start] { [Index] + 1 } otherwise "COMPLETE"),
#"Removed Columns1" = Table.RemoveColumns(#"Added Custom2",{"Index"}),
#"Added Custom3" = Table.AddColumn(#"Removed Columns1", "CompleteHeadings_CompleteText", each if [StartCheckCaseBelow]= "COMPLETE" then "COMPLETE" else if [Start] = "Uppercase" and [StartCheckCaseBelow] = "Uppercase" then "COMPLETE" else null),
#"Added Custom5" = Table.AddColumn(#"Added Custom3", "Custom", each if [CompleteHeadings_CompleteText] = null then if [Start] = "Uppercase" and [End] = "Lowercase" then "START" else if [End] = "." then "END" else null else null),
#"Added Index1" = Table.AddIndexColumn(#"Added Custom5", "Index", 0, 1, Int64.Type),
#"Merged Columns" = Table.CombineColumns(Table.TransformColumnTypes(#"Added Index1", {{"Index", type text}}, "en-GB"),{"Index", "CompleteHeadings_CompleteText"},Combiner.CombineTextByDelimiter(" ", QuoteStyle.None),"CompleteHeadings_CompleteText"),
#"Added Index3" = Table.AddIndexColumn(#"Merged Columns", "Index", 0, 1, Int64.Type),
#"Filtered Rows" = Table.SelectRows(#"Added Index3", each ([Custom] = "START")),
#"Added Index2" = Table.AddIndexColumn(#"Filtered Rows", "temp", 0, 1, Int64.Type),
combined = #"Added Index2" & Table.SelectRows(#"Added Index3", each [Custom] <> "START"),
#"Sorted Rows" = Table.Sort(combined,{{"Index", Order.Ascending}}),
#"Removed Columns" = Table.RemoveColumns(#"Sorted Rows",{"Index"}),
#"Added Custom6" = Table.AddColumn(#"Removed Columns", "Custom.1", each if Text.Contains([CompleteHeadings_CompleteText], "COMPLETE") then [CompleteHeadings_CompleteText] else if [Complete Sentences] <> null then [Complete Sentences] else [temp]),
#"Filled Down" = Table.FillDown(#"Added Custom6",{"Custom.1"}),
#"Removed Other Columns" = Table.SelectColumns(#"Filled Down",{"Column1", "Custom.1"}),
#"Grouped Rows" = Table.Group(#"Removed Other Columns", {"Custom.1"}, {{"Text", each Text.Combine([Column1], " "), type text}}),
#"Removed Other Columns1" = Table.SelectColumns(#"Grouped Rows",{"Text"})
in
#"Removed Other Columns1"
This produces the table in orange; however, it requires many steps, which need to be simplified. If you can advise on this or know of a better way, please let me know.

let
Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
#"Added Custom" = Table.AddColumn(Source, "Custom", each if Text.Start([Column1], 1) = Text.Lower(Text.Start([Column1], 1)) then " " else "| "),
#"Merged Columns" = Table.CombineColumns(#"Added Custom",{ "Custom", "Column1"},Combiner.CombineTextByDelimiter("", QuoteStyle.None),"Merged"),
#"Transposed Table" = Table.Transpose(#"Merged Columns"),
Custom1 = Table.ColumnNames( #"Transposed Table"),
Custom2 = #"Transposed Table",
#"Merged Columns1" = Table.CombineColumns(Custom2,Custom1,Combiner.CombineTextByDelimiter("", QuoteStyle.None),"Merged"),
#"Split Column by Delimiter" = Table.ExpandListColumn(Table.TransformColumns(#"Merged Columns1", {{"Merged", Splitter.SplitTextByDelimiter("| ", QuoteStyle.Csv), let itemType = (type nullable text) meta [Serialized.Text = true] in type {itemType}}}), "Merged"),
#"Split Column by Delimiter1" = Table.ExpandListColumn(Table.TransformColumns(#"Split Column by Delimiter", {{"Merged", Splitter.SplitTextByDelimiter(". ", QuoteStyle.Csv), let itemType = (type nullable text) meta [Serialized.Text = true] in type {itemType}}}), "Merged"),
#"Changed Type" = Table.TransformColumnTypes(#"Split Column by Delimiter1",{{"Merged", type text}})
in
#"Changed Type"
Will be interested to see if there are more elegant solutions out there.

Power Query - Variable Column and Header location from 30+ workbooks

I am trying to combine many workbooks with multiple sheets. The issue is that on sheet 1 there is a large information header before the data I need to extract, as well as many merged cells that return a large number of nulls and push data into variable columns depending on the date and version of the source workbook.
Currently, sorting and promoting headers allows me to match up the first two columns of information needed, but subsequent info is pushed right into other fields.
Is there a way to delete nulls and shift the data left to match fields? Or better yet, identify dynamic header changes and return data to match the selected headers?
Below is an outline of the issue. Unfortunately, cleaning the data by hand across this many sheets and workbooks isn't really an option. I'm fairly new to Power Query and can't seem to figure this one out.
c1    c2    c3    c4    c5    c6    c7
A     B     Null  C     D     Null  E
a     b     c     D     Null  E     Null
A     B     C     Null  D     G     E
Need A-B-C-D-E only.
= () => let
Source = Folder.Files("C:\Users\XXXXXXXX\Desktop\Log"),
#"Filtered Hidden Files1" = Table.SelectRows(Source, each [Attributes]?[Hidden]? <> true),
#"Invoke Custom Function1" = Table.AddColumn(#"Filtered Hidden Files1", "Transform File from Log", each #"Transform File from Log"([Content])),
#"Renamed Columns1" = Table.RenameColumns(#"Invoke Custom Function1", {"Name", "Source.Name"}),
#"Removed Other Columns1" = Table.SelectColumns(#"Renamed Columns1", {"Source.Name", "Transform File from Log"}),
#"Expanded Table Column1" = Table.ExpandTableColumn(#"Removed Other Columns1", "Transform File from Log", Table.ColumnNames(#"Transform File from Log"(#"Sample File"))),
#"Changed Type" = Table.TransformColumnTypes(#"Expanded Table Column1",{{"Source.Name", type text}, {"Name", type text}, {"Data", type any}, {"Item", type text}, {"Kind", type text}, {"Hidden", type logical}}),
#"Removed Other Columns" = Table.SelectColumns(#"Changed Type",{"Data", "Name", "Source.Name"}),
#"Filtered Rows" = Table.SelectRows(#"Removed Other Columns", each ([Name] = "page 1" or [Name] = "page 2" or [Name] = "page 2 +" or [Name] = "page 3 +" or [Name] = "page 4 +" or [Name] = "page 5 +" or [Name] = "page 6 +" or [Name] = "page 7 +" or [Name] = "page 8 +")),
#"Reordered Columns" = Table.ReorderColumns(#"Filtered Rows",{"Source.Name", "Name", "Data"}),
#"Expanded Data" = Table.ExpandTableColumn(#"Reordered Columns", "Data", {"Column1", "Column2", "Column3", "Column4", "Column5", "Column6", "Column7", "Column8", "Column9", "Column10", "Column11", "Column12", "Column13", "Column14", "Column15", "Column16", "Column17", "Column18", "Column19", "Column20", "Column21", "Column22", "Column23", "Column24", "Column25", "Column26", "Column27", "Column28", "Column29", "Column30", "Column31", "Column32", "Column33", "Column34", "Column35", "Column36", "Column37", "Column38", "Column39", "Column40", "Column41", "Column42", "Column43", "Column44", "Column45", "Column46", "Column47", "Column48", "Column49", "Column50", "Column51", "Column52", "Column53", "Column54", "Column55", "Column56", "Column57", "Column58", "Column59", "Column60", "Column61", "Column62", "Column63", "Column64", "Column65", "Column66", "Column67", "Column68", "Column69", "Column70", "Column71", "Column72", "Column73", "Column74", "Column75", "Column76", "Column77", "Column78", "Column79", "Column80", "Column81", "Column82", "Column83", "Column84"}, {"Data.Column1", "Data.Column2", "Data.Column3", "Data.Column4", "Data.Column5", "Data.Column6", "Data.Column7", "Data.Column8", "Data.Column9", "Data.Column10", "Data.Column11", "Data.Column12", "Data.Column13", "Data.Column14", "Data.Column15", "Data.Column16", "Data.Column17", "Data.Column18", "Data.Column19", "Data.Column20", "Data.Column21", "Data.Column22", "Data.Column23", "Data.Column24", "Data.Column25", "Data.Column26", "Data.Column27", "Data.Column28", "Data.Column29", "Data.Column30", "Data.Column31", "Data.Column32", "Data.Column33", "Data.Column34", "Data.Column35", "Data.Column36", "Data.Column37", "Data.Column38", "Data.Column39", "Data.Column40", "Data.Column41", "Data.Column42", "Data.Column43", "Data.Column44", "Data.Column45", "Data.Column46", "Data.Column47", "Data.Column48", "Data.Column49", "Data.Column50", "Data.Column51", "Data.Column52", "Data.Column53", "Data.Column54", 
"Data.Column55", "Data.Column56", "Data.Column57", "Data.Column58", "Data.Column59", "Data.Column60", "Data.Column61", "Data.Column62", "Data.Column63", "Data.Column64", "Data.Column65", "Data.Column66", "Data.Column67", "Data.Column68", "Data.Column69", "Data.Column70", "Data.Column71", "Data.Column72", "Data.Column73", "Data.Column74", "Data.Column75", "Data.Column76", "Data.Column77", "Data.Column78", "Data.Column79", "Data.Column80", "Data.Column81", "Data.Column82", "Data.Column83", "Data.Column84"}),
#"Filtered Rows1" = Table.SelectRows(#"Expanded Data", each ([Data.Column2] <> null and [Data.Column2] <> 16 and [Data.Column2] <> "16" and [Data.Column2] <> "LOCATION")),
#"Promoted Headers" = Table.PromoteHeaders(#"Filtered Rows1", [PromoteAllScalars=true])
in
#"Promoted Headers"

To get rid of nulls and slide everything to the left:
Add column .. Index column
Right-click the Index column, Unpivot Other Columns
Right-click and remove the Attribute column
Group on Index, and add another index within each group by modifying the code to end with
each Table.AddIndexColumn(_, "Index2", 1, 1), type table}})
Expand the column using the arrows at the top, selecting the [x] Value and [x] Index2 fields
Click the Index2 field and Transform .. Pivot Column, with Value as the values column; under Advanced, choose Don't Aggregate
Sample code for transforming above BEFORE table to AFTER table
let Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
#"Added Index" = Table.AddIndexColumn(Source, "Index", 0, 1),
#"Unpivoted Other Columns" = Table.UnpivotOtherColumns(#"Added Index", {"Index"}, "Attribute", "Value"),
#"Removed Columns" = Table.RemoveColumns(#"Unpivoted Other Columns",{"Attribute"}),
#"Grouped Rows" = Table.Group(#"Removed Columns", {"Index"}, {{"GRP", each Table.AddIndexColumn(_, "Index2", 1, 1), type table}}),
#"Expanded GRP" = Table.ExpandTableColumn(#"Grouped Rows", "GRP", {"Value", "Index2"}, {"Value", "Index2"}),
#"Pivoted Column" = Table.Pivot(Table.TransformColumnTypes(#"Expanded GRP", {{"Index2", type text}}, "en-US"), List.Distinct(Table.TransformColumnTypes(#"Expanded GRP", {{"Index2", type text}}, "en-US")[Index2]), "Index2", "Value"),
#"Removed Columns1" = Table.RemoveColumns(#"Pivoted Column",{"Index"})
in #"Removed Columns1"
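A shorter alternative to the unpivot/pivot route might be to work row by row: drop the nulls from each row and pad the row back to the original width. This is a sketch under assumptions (the sample table and step names are made up), and it only slides values left; it does not drop unwanted extra values such as the G in the question.

```powerquery
let
    // Sample data (assumed) with nulls in different positions per row
    Source = #table({"c1", "c2", "c3", "c4"}, {{"A", null, "B", "C"}, {"A", "B", null, "C"}}),
    Width = Table.ColumnCount(Source),
    // For each row: remove nulls, then pad with nulls on the right to keep the width
    Shifted = List.Transform(Table.ToRows(Source), (row) =>
        let
            kept = List.RemoveNulls(row)
        in
            kept & List.Repeat({null}, Width - List.Count(kept))),
    // Rebuild the table with the original column names
    Result = Table.FromRows(Shifted, Table.ColumnNames(Source))
in
    Result
```

Both sample rows should come out as A, B, C followed by a null in c4. Column types are lost when rebuilding from rows, so a Table.TransformColumnTypes step may be needed afterwards.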

Remove duplicate texts in one cell

I am currently shaping my data. I have one column called "Centro". However, there are many duplicate texts within a single cell. How can I remove the duplicates and show only distinct texts? Can anyone help with this? Thanks!
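As a rough illustration of what is being asked, one way to de-duplicate the texts inside a cell is to split the cell, keep the distinct items, and join them again. The sample table below is made up, and the ", " separator is an assumption based on how the columns are combined elsewhere in this query.

```powerquery
let
    // Sample data (assumed); the real column is "Centro" in the asker's table
    Source = #table({"Centro"}, {{"A, B, A, C, B"}, {"X, X"}}),
    // Assumes the texts inside a cell are separated by ", "
    Deduplicated = Table.TransformColumns(Source, {{"Centro",
        each Text.Combine(List.Distinct(Text.Split(_, ", ")), ", "), type text}})
in
    Deduplicated
```

With the sample rows this should give "A, B, C" and "X". Null cells would raise an error in Text.Split, so they would need a guard first.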
This is my code:
let
Source = Etapa_2_Caricam,
#"Grouped Rows" = Table.Group(Source, {"Material"}, {{"mynewtable", each _, type table [Material=number, Num Form=text, Created on=date, FechaCreac=date, Initiator=text, Texto tarea=text]}}),
#"Added Custom" = Table.AddColumn(#"Grouped Rows", "NoForm", each Table.Column([mynewtable],"Num Form")),
#"Extracted Values" = Table.TransformColumns(#"Added Custom", {"NoForm", each Text.Combine(List.Transform(_, Text.From), ", "), type text}),
#"Added Custom1" = Table.AddColumn(#"Extracted Values", "Iniciador", each Table.Column([mynewtable],"Initiator")),
#"Extracted Values1" = Table.TransformColumns(#"Added Custom1", {"Iniciador", each Text.Combine(List.Transform(_, Text.From), ", "), type text}),
#"Extracted Text Before Delimiter" = Table.TransformColumns(#"Extracted Values1", {{"Iniciador", each Text.BeforeDelimiter(_, ", "), type text}}),
#"Added Custom2" = Table.AddColumn(#"Extracted Text Before Delimiter", "FechaInicio", each Table.Column([mynewtable],"Created on")),
#"Extracted Values2" = Table.TransformColumns(#"Added Custom2", {"FechaInicio", each Text.Combine(List.Transform(_, Text.From), ", "), type text}),
#"Extracted Text Before Delimiter1" = Table.TransformColumns(#"Extracted Values2", {{"FechaInicio", each Text.BeforeDelimiter(_, ", "), type text}}),
#"Added Custom3" = Table.AddColumn(#"Extracted Text Before Delimiter1", "FechaFinalTarea", each let dates = Table.Column([mynewtable],"FechaCreac") in [min = List.Min(dates), max = List.Max(dates)]),
expanded = Table.ExpandRecordColumn(#"Added Custom3", "FechaFinalTarea", {"min", "max"}),
#"Changed Type" = Table.TransformColumnTypes(expanded,{{"min", type date}, {"max", type date}, {"FechaInicio", type date}}),
#"Added Custom4" = Table.AddColumn(#"Changed Type", "TextoTarea", each Table.Column([mynewtable],"Texto tarea")),
#"Extracted Values3" = Table.TransformColumns(#"Added Custom4", {"TextoTarea", each Text.Combine(List.Transform(_, Text.From), ", "), type text}),
#"Changed Type1" = Table.TransformColumnTypes(#"Extracted Values3",{{"Material", type text}}),
#"Split Column by Delimiter" = Table.SplitColumn(Table.TransformColumnTypes(#"Changed Type1", {{"FechaInicio", type text}}, "en-US"), "FechaInicio", Splitter.SplitTextByDelimiter("/", QuoteStyle.Csv), {"FechaInicio.1", "FechaInicio.2", "FechaInicio.3"}),
#"Changed Type2" = Table.TransformColumnTypes(#"Split Column by Delimiter",{{"FechaInicio.1", Int64.Type}, {"FechaInicio.2", Int64.Type}, {"FechaInicio.3", Int64.Type}}),
#"Renamed Columns" = Table.RenameColumns(#"Changed Type2",{{"FechaInicio.2", "FID"}, {"FechaInicio.1", "FIM"}, {"FechaInicio.3", "FIA"}}),
#"Changed Type3" = Table.TransformColumnTypes(#"Renamed Columns",{{"FIM", type text}, {"FID", type text}, {"FIA", type text}}),
#"Added Custom5" = Table.AddColumn(#"Changed Type3", "FechaInicio", each [FID]&"/"&[FIM]&"/"&[FIA]),
#"Changed Type4" = Table.TransformColumnTypes(#"Added Custom5",{{"FechaInicio", type date}}),
#"Removed Columns1" = Table.RemoveColumns(#"Changed Type4",{"FIM", "FID", "FIA"}),
#"Reordered Columns" = Table.ReorderColumns(#"Removed Columns1",{"Material", "mynewtable", "NoForm", "Iniciador", "FechaInicio", "min", "max", "TextoTarea"}),
#"Split Column by Delimiter1" = Table.SplitColumn(Table.TransformColumnTypes(#"Reordered Columns", {{"max", type text}}, "en-US"), "max", Splitter.SplitTextByDelimiter("/", QuoteStyle.Csv), {"max.1", "max.2", "max.3"}),
#"Changed Type5" = Table.TransformColumnTypes(#"Split Column by Delimiter1",{{"max.1", type text}, {"max.2", type text}, {"max.3", type text}}),
#"Added Custom6" = Table.AddColumn(#"Changed Type5", "FechaFinal", each [max.2]&"/"&[max.1]&"/"&[max.3]),
#"Reordered Columns1" = Table.ReorderColumns(#"Added Custom6",{"Material", "mynewtable", "NoForm", "Iniciador", "FechaInicio", "FechaFinal", "min", "max.1", "max.2", "max.3", "TextoTarea"}),
#"Changed Type6" = Table.TransformColumnTypes(#"Reordered Columns1",{{"FechaFinal", type date}}),
#"Removed Columns2" = Table.RemoveColumns(#"Changed Type6",{"max.1", "max.2", "max.3"}),
#"Changed Type7" = Table.TransformColumnTypes(#"Removed Columns2",{{"FechaFinal", type text}, {"FechaInicio", type text}}),
#"Added Custom7" = Table.AddColumn(#"Changed Type7", "Table", each Table.Column([mynewtable],"NoMatAnt")),
#"Extracted Values4" = Table.TransformColumns(#"Added Custom7", {"Table", each Text.Combine(List.Transform(_, Text.From), ", "), type text}),
#"Inserted Text Before Delimiter" = Table.AddColumn(#"Extracted Values4", "Text Before Delimiter", each Text.BeforeDelimiter([Table], ", "), type text),
#"Removed Columns3" = Table.RemoveColumns(#"Inserted Text Before Delimiter",{"Text Before Delimiter"}),
#"Extracted Text Before Delimiter2" = Table.TransformColumns(#"Removed Columns3", {{"Table", each Text.BeforeDelimiter(_, ", "), type text}}),
#"Duplicated Column" = Table.DuplicateColumn(#"Extracted Text Before Delimiter2", "Table", "Table - Copy"),
#"Replaced Value" = Table.ReplaceValue(#"Duplicated Column","CR","",Replacer.ReplaceText,{"Table - Copy"}),
#"Replaced Value1" = Table.ReplaceValue(#"Replaced Value","DO","",Replacer.ReplaceText,{"Table - Copy"}),
#"Replaced Value2" = Table.ReplaceValue(#"Replaced Value1","GT","",Replacer.ReplaceText,{"Table - Copy"}),
#"Replaced Value3" = Table.ReplaceValue(#"Replaced Value2","PR","",Replacer.ReplaceText,{"Table - Copy"}),
#"Renamed Columns1" = Table.RenameColumns(#"Replaced Value3",{{"Table - Copy", "No.FormAnt"}, {"Table", "No.MatAnt"}}),
#"Trimmed Text" = Table.TransformColumns(#"Renamed Columns1",{{"TextoTarea", Text.Trim, type text}}),
#"Added Custom8" = Table.AddColumn(#"Trimmed Text", "Centro", each Table.Column([mynewtable],"Ce")),
#"Extracted Values5" = Table.TransformColumns(#"Added Custom8", {"Centro", each Text.Combine(List.Transform(_, Text.From), ", "), type text})
in
#"Extracted Values5"

The easiest way to do this is to split the comma-separated values in the Centro field into separate rows, remove the duplicates, and then group the rows again to rebuild the comma-separated values. To demonstrate, I created a table with two fields, PrimaryKey and Centro, and used the following steps to get to the desired output:
Source Information:
= Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("i45WMlTSUUrUSdJJUorViVYyAvKSdVJ0knWSlWJjAQ==", BinaryEncoding.Base64), Compression.Deflate)),
let _t = ((type text) meta [Serialized.Text = true]) in type table [PrimaryKey = _t, Centro = _t])
Split Column by Delimiter (select the Centro field first, then choose the "Split Column - By Delimiter" option on the Transform tab; under the advanced options, split into rows rather than columns):
= Table.ExpandListColumn(Table.TransformColumns(Source, {{"Centro", Splitter.SplitTextByDelimiter(",")}}), "Centro")
Remove Duplicates (select all the columns first, then choose "Remove Rows - Remove Duplicates" on the Home tab):
= Table.Distinct(#"Split Column by Delimiter")
Grouped Rows (use the "Group By" option on the Transform tab, then edit the generated step so that the aggregation combines the values with a delimiter):
= Table.Group(#"Removed Duplicates", {"PrimaryKey"}, {{"Centro_New", each Text.Combine([Centro],","), type text}})
This should give you the desired output. Hope this helps.
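Put together as one query, the steps might look like the sketch below. The sample rows are hypothetical stand-ins for the demonstration table. The Trimmed step is an addition that is not in the steps above: the question's query joins values with ", ", so splitting on "," alone would leave a leading space on every value after the first, and " B" would not count as a duplicate of "B".

```powerquery
let
    // Hypothetical sample standing in for the demonstration table
    Source = Table.FromRecords({
        [PrimaryKey = "1", Centro = "A, B, B"],
        [PrimaryKey = "2", Centro = "C, D, C, C"]
    }),
    // Split each Centro cell on commas and expand the pieces into rows
    Split = Table.ExpandListColumn(
        Table.TransformColumns(Source, {{"Centro", Splitter.SplitTextByDelimiter(",")}}),
        "Centro"
    ),
    // Added step: strip the space left over from the ", " separator
    Trimmed = Table.TransformColumns(Split, {{"Centro", Text.Trim, type text}}),
    // Remove rows that repeat the same PrimaryKey/Centro pair
    Deduplicated = Table.Distinct(Trimmed),
    // Rebuild one comma-separated cell per PrimaryKey
    Regrouped = Table.Group(
        Deduplicated,
        {"PrimaryKey"},
        {{"Centro_New", each Text.Combine([Centro], ", "), type text}}
    )
in
    Regrouped
```

With this sample, the result should hold "A, B" for key 1 and "C, D" for key 2.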
Edit: You can combine all these into a single step and use the following formula:
= Table.Group(Table.Distinct(Table.ExpandListColumn(Table.TransformColumns(Source, {{"Centro", Splitter.SplitTextByDelimiter(",")}}), "Centro")),{"PrimaryKey"}, {{"Centro_New", each Text.Combine([Centro],","), type text}})
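In the question's own query, the values are still a list at the point where they are combined into text, so another option is to deduplicate that list before joining it. The sketch below shows the idea with List.Distinct on a hypothetical table; the column names Material and Ce come from the question, but the rows are invented, and this is a variation on the answer rather than the method it describes.

```powerquery
let
    // Hypothetical rows; Material and Ce mirror the question's column names
    Source = Table.FromRecords({
        [Material = 1, Ce = "A"],
        [Material = 1, Ce = "B"],
        [Material = 1, Ce = "A"],
        [Material = 2, Ce = "C"],
        [Material = 2, Ce = "C"]
    }),
    // For each Material, take the Ce values, convert them to text,
    // keep only the distinct ones, and join them with ", "
    Grouped = Table.Group(
        Source,
        {"Material"},
        {{"Centro", each Text.Combine(List.Distinct(List.Transform([Ce], Text.From)), ", "), type text}}
    )
in
    Grouped
```

Applied to the question's last step, this would amount to wrapping List.Transform(_, Text.From) in List.Distinct inside the existing Text.Combine call.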
