powerquery split column by variable field lengths - text

In PowerQuery I need to import a fixed width txt file (each line is the concatenation of a number of fields, each field has a fixed specific length).
When I import it I get a table with one single column that contains the txt lines, e.g. in the following format:
AAAABBCCCCCDDD
I want to add more columns in this way:
Column1: AAAA
Column2: BB
Column3: CCCCC
Column4: DDD
In other words the fields composing the source column are of known length, but this length is not the same for all fields (in the example above the lengths are: 4,2,5,3).
I'd like to use the "Split Column" > "By Number of Characters" utility, but I can only enter a single length at a time. To get the desired output I'd have to repeat the process three times, adding one column each time and using the "Once, as far left as possible" option.
My real-life case has many different line types (files) to import and convert, each with more than 20 fields, so a less manual approach is needed. I'd like to somehow specify the record structure (the length of each field) and have the lines split automagically :)
This would probably need some M code, which I know nothing about. Can anybody point me in the right direction?
Thanks!

Create a query with the formula below. Let's call this query SplitText:
let
    SplitText = (text, lengths) =>
        let
            LengthsCount = List.Count(lengths),
            // Keep track of the index in the lengths list and the position in the text
            // to take the next characters from. Use this information to get the next
            // segment and put it into a list.
            Split = List.Generate(
                () => {0, 0},
                each _{0} < LengthsCount,
                each {_{0} + 1, _{1} + lengths{_{0}}},
                each Text.Range(text, _{1}, lengths{_{0}})
            )
        in
            Split,
    // Convert the list to a record so that it can be expanded into columns
    ListToRecord = (text, lengths) =>
        let
            List = SplitText(text, lengths),
            Record = Record.FromList(List, List.Transform({1 .. List.Count(List)}, each Number.ToText(_)))
        in
            Record
in
    ListToRecord
Then, in your table, add a custom column that uses this formula:
each SplitText([Column1], {4, 2, 5, 3})
The first argument is the text to split, and the second argument is a list of lengths to split by.
Finally, expand the column to get the split text values into your table. You may want to rename the columns since they will be named 1, 2, etc.
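For readers who find the List.Generate call hard to follow, the same offset-tracking logic can be sketched in Python. This is only an illustration of the idea, not M code; the function names split_fixed_width and to_record are made up for this sketch.

```python
from itertools import accumulate


def split_fixed_width(text, lengths):
    """Split text into consecutive fields of the given lengths."""
    # Start offsets are the running totals of the lengths,
    # e.g. 0, 4, 6, 11 for the lengths 4, 2, 5, 3.
    starts = [0, *accumulate(lengths)][:-1]
    return [text[start:start + length] for start, length in zip(starts, lengths)]


def to_record(text, lengths):
    """Name the fields "1", "2", ... as the Record.FromList call above does."""
    fields = split_fixed_width(text, lengths)
    return {str(i): field for i, field in enumerate(fields, start=1)}


print(split_fixed_width("AAAABBCCCCCDDD", [4, 2, 5, 3]))
# ['AAAA', 'BB', 'CCCCC', 'DDD']
```

One difference to keep in mind: Python slicing silently returns shorter or empty fields when a line is shorter than the sum of the lengths, so a real import would probably want an explicit length check.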

Related

pdfplumber - How to extract table with no horizontal lines?

So I have a table like this one, with an unknown number of description lines. Some can have 1, 2, 5, even zero, or more lines:
(I removed all sensitive information.)
and I use:
import pdfplumber

with pdfplumber.open("invoice.pdf") as pdf:
    pages = pdf.pages
    for page in pages:
        page.extract_table()
which does extract all the data from the table, but it treats the second column as one row.
I want to somehow split the lines of the second column (or better, all columns) at the small blank gaps between rows, which I have highlighted with red rectangles.
I know that I need to use table_settings={}, but I can't figure out yet which property (or properties) to use.
What I tried:
print(page.extract_table(table_settings={
    "horizontal_strategy": "text",
    "snap_y_tolerance": 3,
    "keep_blank_chars": True,
}))
which, again, splits the rows wherever it wants.
So is it possible to extract a partly borderless table?

Convert comma separated string/list into table/Matrix format using PDI

Using Pentaho data integration (Kettle), I read a long string from a text file:
a, 1, 2, b, 3, 4, c, 5, 6, ...
Are there any PDI/Kettle steps or methods to split this string into an n-column table like the one below (the column names can be defined freely)?
column1  column2  column3
a        1        2
b        3        4
c        5        6
The above is just a simplified example; my real case has a different separator character and a larger number of columns (n). But I just want to get the main problem solved first.
I have prepared a SOLUTION for you. In my solution I have set N=3, but you can set it as high as you want. You also need to enter the column names in the 'Row denormaliser' step for whatever N you choose (3, 4, 5, ...).
Alternatively, you can easily make the column names dynamic using the 'ETL Metadata Injection' step.
I didn't understand what you meant by "different separator character". If you have different separators on the same line, such as a comma and a semicolon, this is a tricky task for PDI. You then need to convert all delimiters to the same character first, for example with a find-and-replace in Notepad++, which handles large CSV files well.
After that, PDI has a standard step for splitting on a separator: "Split Fields".
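To make the target shape concrete, here is a minimal Python sketch of the reshaping itself (split, normalise mixed delimiters, regroup into rows of n values). It is not a PDI transformation; the function name to_rows and its separators parameter are assumptions made for illustration.

```python
import re


def to_rows(line, n, separators=","):
    """Split a delimited line and regroup the values into rows of n columns."""
    # Treat every character in `separators` as a delimiter, so mixed
    # delimiters such as "," and ";" are handled in one pass.
    pattern = "[" + re.escape(separators) + "]"
    values = [value.strip() for value in re.split(pattern, line)]
    # Take n values at a time; the last row is shorter if the number
    # of values is not a multiple of n.
    return [values[i:i + n] for i in range(0, len(values), n)]


rows = to_rows("a, 1, 2, b, 3, 4, c, 5, 6", 3)
for row in rows:
    print(row)
```

Passing separators=",;" would cover the mixed comma and semicolon case mentioned above without a separate find-and-replace pass.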

How to extract text from a string between where there are multiple entries that meet the criteria and return all values

This is an example of the string, and it can be longer:
1160752 Meranji Oil Sats -Mt(MA) (000600007056 0001), PE:Toolachee Gas Sats -Mt(MA) (000600007070 0003)GL: Contract Services (510000), COT: Network (N), CO: OM-A00009.0723,Oil Sats -Mt(MA) (000600007053 0003)
The result needs to be: column1 600007056, column2 600007070, column3 600007053
I am working in Spotfire and creating calculated columns through transformations, as I need the columns to join to other data sets.
I have tried the expression below, but it only picks up the first 600... number and not the others, and there can be any number of them.
Account is the column with the string.
Mid([Account],
Find("(000",[Account]) + Len("(000"),
Find("0001)",[Account]) - Find("(000",[Account]) - Len("(000"))
Thank you!
Assuming my guess is correct, the pattern to look for is:
9 digits, starting with 6, preceded by an opening parenthesis and 3 zeros, and followed by a space, 4 digits and a closing parenthesis.
You can grab individual occurrences with:
column1: RXExtract([Account],'(?<=\\(000)6\\d{8}(?=\\s\\d{4}\\))',1)
column2: RXExtract([Account],'(?<=\\(000)6\\d{8}(?=\\s\\d{4}\\))',2)
etc.
The tricky bit is knowing how many columns to define, since you say there can be many. One way to find out is to first calculate the maximum number of occurrences like this:
maxn: Max((Len([Account]) - Len(RXReplace([Account],'(?<=\\(000)6\\d{8}(?=\\s\\d{4}\\))','','g'))) / 9)
This still assumes that each number to extract has 9 digits. It takes the difference in length between the original [Account] and a copy with the matched patterns replaced by an empty string, and divides it by 9.
You then know you can define up to maxn columns; the extra ones will be empty for rows with fewer occurrences.
Note that Spotfire always wants two backslashes for escaping (I had to add more in the editor to make it render correctly; I hope I have not missed any).
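The regular expression can be checked outside Spotfire. The following is a small Python sketch that runs the same pattern (with single backslashes, as Python raw strings expect) against the sample string from the question. It also reproduces the length-difference trick for counting occurrences. The variable names are made up for this sketch.

```python
import re

# Same pattern as in the RXExtract calls: 9 digits starting with 6,
# preceded by "(000" and followed by a space, 4 digits and ")".
PATTERN = re.compile(r"(?<=\(000)6\d{8}(?=\s\d{4}\))")

account = (
    "1160752 Meranji Oil Sats -Mt(MA) (000600007056 0001), "
    "PE:Toolachee Gas Sats -Mt(MA) (000600007070 0003)"
    "GL: Contract Services (510000), COT: Network (N), "
    "CO: OM-A00009.0723,Oil Sats -Mt(MA) (000600007053 0003)"
)

matches = PATTERN.findall(account)
print(matches)  # ['600007056', '600007070', '600007053']

# Count occurrences the same way as the maxn expression: compare the
# length before and after removing the matches, then divide by 9.
count = (len(account) - len(PATTERN.sub("", account))) // 9
print(count)  # 3
```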

PowerQuery remove items from a list matching a pattern

Using Power Query in Excel, how can I remove items from a list that match a pattern?
I have a column with cells that contain names and numeric IDs. I want to be left with just a list of names.
LastName, FirstName;#123;#LastName, FirstName;#321;
The numbers are all unique. So if I had regex the pattern would be similar to
/^\#ddd+$/
I can split the cell into a list using ';' as a separator.
= Text.Split([Consultant],";")
If there was a way to remove every 2nd item until the end, that could work too. Unfortunately it seems there is no way to specify patterns to match.
List.RemoveItems({1, 2, 3, 4, 2, 5, 5}, {2, 4, 6})
This would be awesome, but I would have to list every number that exists, so this fails:
List.RemoveMatchingItems(Text.Split([Consultant], ";#"), {1,2,3,4,5,6,7,8,9})
Method 2
I split the text into a list as above. This gave me a column of lists, so I expanded the lists into new rows. My plan was to remove alternate rows. However, removing alternate rows requires an end number; I would need a way to continue until there are no more rows to process.
There are many ways.
One way is to select every other item with List.Select.
In your example, these would be the items at an even-numbered position (positions are zero-based, so the first item is at position 0).
let
    x = Text.Split([Column1],";#"),
    y = List.Select(x, each Number.IsEven(List.PositionOf(x, _)))
in
    y
Edit Nov 2022
Another method for removing every other item would be:
=List.Alternate(Text.Split([Column1],";#"),1,1,1)
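Both ideas from this thread (keep every other item, or drop items that match a number pattern) can be sketched in Python for comparison. This is only an illustration, not M code; the function names are hypothetical, and the pattern allows for the trailing ";" in the sample cell.

```python
import re


def names_by_position(cell, sep=";#"):
    """Keep the items at even (zero-based) positions: the names."""
    return cell.split(sep)[::2]


def names_by_pattern(cell, sep=";#"):
    """Drop items that consist only of digits, optionally followed by ';'."""
    return [part for part in cell.split(sep) if not re.fullmatch(r"\d+;?", part)]


cell = "LastName, FirstName;#123;#LastName, FirstName;#321;"
print(names_by_position(cell))
print(names_by_pattern(cell))
```

The position-based version relies on names and IDs strictly alternating; the pattern-based version does not, which may be safer if some cells have a missing ID.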

How to remove the FIRST whitespace from a python dataframe on a certain column

I extracted a PDF table using tabula.read_pdf, but some of the data entries a) show a whitespace between the values and b) combine two sets of values into one column, as shown in the columns "Sports 2019/2018" and "Total 2019/2018": https://imgur.com/a/MviV6N9
In order for me to use df_1=df1["Sports 2019/2018"].str.split(expand=True) to split the two values, which are separated by a space, I need to remove the FIRST space shown in the first value so that it doesn't split into three columns.
I've tried df1["Sports 2019/2018"] = df1["Sports 2019/2018"].str.replace(" ", "") but this removes all the spaces, which would then combine the two values.
Is there a way to remove the first whitespace in column "Sports 2019/2018" so that it resembles the values in "Internet 2019/2018"?
df1["Sports 2019/2018"] = df1["Sports 2019/2018"].str.replace(" ", "", n = 1)
n=1 limits the replacement to the first match it finds, so only the first space is removed.
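The effect is easy to see on a single string in plain Python, where str.replace takes the same kind of count argument. The sample value below is made up, since the real data is only visible in the screenshot, and split_pair is a hypothetical helper name.

```python
def split_pair(cell):
    """Remove only the first space, then split on the remaining whitespace."""
    # The third argument limits the replacement to one occurrence.
    return cell.replace(" ", "", 1).split()


# Hypothetical cell: the first value "1 234" contains a stray space.
print(split_pair("1 234 5678"))  # ['1234', '5678']
```

Without the count argument, all spaces would be removed and the two values would merge into one.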
