pdfplumber - How to extract table with no horizontal lines? - python-3.x

I have a table like this one, where each row has an unknown number of description lines. A row can have zero, 1, 2, 5, or more lines:
(I removed all sensitive information.)
and I use:
import pdfplumber

with pdfplumber.open("invoice.pdf") as pdf:
    pages = pdf.pages
    for page in pages:
        page.extract_table()
This does extract all the data from the table, but it treats the second column as a single row.
I want to split the lines of the second column (or, better, of all columns) at the small blank rows, which I have highlighted with red rectangles.
I know that I need to use table_settings={}, but I can't figure out yet which property (or properties) to use.
What I tried:
print(page.extract_table(table_settings={
    "horizontal_strategy": "text",
    "snap_y_tolerance": 3,
    "keep_blank_chars": True,
}))
This, again, splits the rows wherever it wants.
So is it possible to extract a partially borderless table?
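One possible approach, sketched here under assumptions rather than taken from the question: find the blank strips yourself and hand their positions to pdfplumber as explicit horizontal lines. The helper below is hypothetical (the name `find_row_gaps` and the `min_gap` threshold are not part of pdfplumber). It only relies on word dictionaries carrying "top" and "bottom" keys, which is the shape `page.extract_words()` returns.

```python
def find_row_gaps(words, min_gap=4.0):
    """Return y-coordinates in the middle of vertical gaps between text rows.

    `words` is a list of dicts with "top" and "bottom" keys. `min_gap` is a
    hypothetical threshold in PDF points; tune it to the height of the
    blank strip in your own document.
    """
    if not words:
        return []
    # Sort by vertical position so the sweep runs from top to bottom.
    ordered = sorted(words, key=lambda w: w["top"])
    gaps = []
    current_bottom = ordered[0]["bottom"]
    for word in ordered[1:]:
        gap = word["top"] - current_bottom
        if gap >= min_gap:
            # Place the separator halfway through the blank strip.
            gaps.append(current_bottom + gap / 2)
        current_bottom = max(current_bottom, word["bottom"])
    return gaps


# Synthetic words: two tightly spaced lines, then a blank strip, then a new row.
words = [
    {"text": "Item", "top": 100.0, "bottom": 110.0},
    {"text": "more", "top": 111.0, "bottom": 121.0},
    {"text": "Next", "top": 130.0, "bottom": 140.0},
]
print(find_row_gaps(words))
```

With a real page, the returned list (plus the top and bottom edges of the table, which this sketch does not compute) could be passed as "explicit_horizontal_lines" together with "horizontal_strategy": "explicit" in table_settings. Whether a single threshold separates the rows cleanly depends on the layout of the PDF.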

Related

How do I drop complete rows (including all values in it) that contain a certain value in my Pandas dataframe?

I'm trying to write a Python script that finds unique values (names) and reports how often each occurs, using the pandas library. There are around 90 unique names in total, which I've anonymised in the head of the dataframe pasted below.
,1,2,3,4,5
0,monday09-01-2022,tuesday10-01-2022,wednesday11-01-2022,thursday12-01-2022,friday13-01-2022
1,Anonymous 1,Anonymous 1,Anonymous 1,Anonymous 1,
2,Anonymous 2,Anonymous 4,Anonymous 5,Anonymous 5,Anonymous 5
3,Anonymous 3,Anonymous 3,,Anonymous 6,Anonymous 3
4,,,,,
I'm trying to drop any row (the full row) that matches the regular expression "^monday.*", meaning the word "monday" followed by any number of other characters. I want to drop every cell/value within that row.
To achieve this goal, I've tried using the line of code below (and many other approaches I found on SO).
df = df[df[1].str.contains("^monday.*", case = True, regex=True) == False]
To clarify, I'm trying to search the values of column "1" for the pattern "^monday.*" and then drop the matching rows along with every value in them. I've successfully removed "monday09-01-2022", "tuesday10-01-2022", etc., but I'm also losing random names that are not in the matching rows.
Any help would be very much appreciated! Thank you!
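One possible cause, offered as a guess since the full data is not shown: `str.contains` returns a missing value for empty cells, and comparing a missing value with `== False` yields False, so rows with an empty cell in column 1 are dropped along with the header row. A minimal sketch with made-up data, using the `na=False` argument of `str.contains` and the `~` operator instead of `== False`:

```python
import pandas as pd

# Made-up frame: row 2 has an empty cell in column 1 but a name in column 2.
df = pd.DataFrame({
    1: ["monday09-01-2022", "Anonymous 1", None, "Anonymous 3"],
    2: ["tuesday10-01-2022", "Anonymous 1", "Anonymous 4", "Anonymous 3"],
})

# True only where column 1 starts with "monday"; empty cells count as no match.
is_header = df[1].str.contains(r"^monday", case=True, regex=True, na=False)

# Keep every row that is not a header row.
cleaned = df[~is_header]
print(cleaned)
```

Here the row holding "Anonymous 4" survives even though its column 1 cell is empty. Note also that a frame read from CSV usually has string column labels, so `df["1"]` rather than `df[1]` may be needed.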

PowerQuery remove items from a list matching a pattern

Using PowerQuery in Excel, how can I remove items from a list that match a pattern?
I have a column with cells that contain names and numeric IDs. I want to be left with just a list of names.
LastName, FirstName;#123;#LastName, FirstName;#321;
The numbers are all unique, so if I had regex the pattern would be similar to
/^\#\d\d\d+$/
I can split the cell into a list using ';' as a separator.
= Text.Split([Consultant],";")
If there was a way to remove every 2nd item until the end that could work too. Unfortunately it seems there is no way to specify patterns to match.
List.RemoveItems({1, 2, 3, 4, 2, 5, 5}, {2, 4, 6})
This would be awesome; however, I would have to define every number pattern that exists, so this fails:
List.RemoveMatchingItems(Text.Split([Consultant], ";#"), {1,2,3,4,5,6,7,8,9})
Method 2
I split the text into a list as above, which gave me a column of lists. I then expanded the lists into new rows. My plan was to remove alternate rows. However, Remove Alternate Rows requires an end number, and I would need an argument that means "keep going until there is nothing left to process".
There are many ways.
One way is to select every other item with List.Select
In your example, these would be the items with an even number position.
let
    x = Text.Split([Column1],";#"),
    y = List.Select(x, each Number.IsEven(List.PositionOf(x, _)))
in
    y
Edit Nov 2022
Another method for removing every other item would be:
=List.Alternate(Text.Split([Column1],";#"),1,1,1)
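For readers who want to check the logic outside Power Query, here is a rough Python sketch of the pattern-based filter the question asks for. It is not M code and not part of the answer above; the function name `names_only` is made up for illustration.

```python
import re


def names_only(cell):
    """Split a cell on ';#' and drop the items that are purely numeric IDs."""
    # Strip stray ';' characters, such as the one trailing the last ID.
    items = [part.strip(";") for part in cell.split(";#")]
    return [item for item in items if item and not re.fullmatch(r"\d+", item)]


print(names_only("LastName, FirstName;#123;#LastName, FirstName;#321;"))
```

Unlike the position-based approaches, this filter does not depend on names and IDs strictly alternating, but it would also drop a name that happened to consist only of digits.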

(Excel 2013/Non-VBA) Format Data column based on value of another cell?

We have a column that is query driven, and the query partially formats the values in the column using math based off the value of a "user entry cell" on another sheet.
For the really curious, our query looks like this:
DECLARE @rotationsNum INT
SET @rotationsNum = ?
SELECT t.Piece_ID, t.Linear_Location,
       ((ROW_NUMBER() OVER (ORDER BY Linear_Location) - 1) % @rotationsNum) * (360 / @rotationsNum) AS Rotation
FROM (
    SELECT Position.Feature_Key, Piece_ID, ((Place - 1) % (Places / @rotationsNum)) + 1 AS Linear_Location, Place, Measured_Value, Places
    FROM Fake.dbo.Position
    LEFT JOIN Fake.dbo.Features ON Position.Feature_Key = Features.Feature_Key
    WHERE Position.Inspection_Key_FK = (SELECT Inspection_Key FROM Fake.dbo.Inspection WHERE Op_Key = ?)
) AS t
ORDER BY Piece_ID, Linear_Location
The first parameter, "@rotationsNum", comes from a cell that will always hold a value between 1 and 4. If the value is 1, the entire column shows 0s, which we want to display as "N/A". However, it isn't as simple as "how to hide zero data", because if "@rotationsNum" is 2, 3, or 4, there will still be 0 values in the column that need to be shown.
A "@rotationsNum" value of 2 will have the query write the column as such: example
So I am trying to come up with a way to format the column along the lines of =IF(cell>1, do nothing, overwrite entire column to say "NA"). But I don't think it is that straightforward, since the column is query driven.
My resolution was to format the column so that if the cell that drives the "@rotationsNum" parameter is below 2, the whole column just gets "grayed out". It looks a bit like a redaction and isn't as desirable as "NA", but it works for our purposes. Hopefully this solution helps someone else who stumbles upon this problem.
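To see why a value of 1 zeroes the whole column while larger values still produce some legitimate zeros, the Rotation expression from the query can be reproduced in a few lines of Python. This is only an illustration of the arithmetic; the function name is made up, and integer division is assumed to match the query's INT parameter.

```python
def rotation_column(row_count, rotations_num):
    """Mirror ((ROW_NUMBER() - 1) % n) * (360 / n) for rows 1..row_count."""
    return [
        ((row - 1) % rotations_num) * (360 // rotations_num)
        for row in range(1, row_count + 1)
    ]


print(rotation_column(4, 1))  # every row is 0, because anything % 1 is 0
print(rotation_column(4, 2))  # zeros alternate with 180
```

Because zeros still appear when the parameter is 2, 3, or 4, a rule keyed on the cell value alone cannot tell the two cases apart; it has to look at the parameter cell, as the conditional-formatting resolution does.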

powerquery split column by variable field lengths

In PowerQuery I need to import a fixed width txt file (each line is the concatenation of a number of fields, each field has a fixed specific length).
When I import it I get a table with one single column that contains the txt lines, e.g. in the following format:
AAAABBCCCCCDDD
I want to add more columns in this way:
Column1: AAAA
Column2: BB
Column3: CCCCC
Column4: DDD
In other words the fields composing the source column are of known length, but this length is not the same for all fields (in the example above the lengths are: 4,2,5,3).
I'd like to use the "Split Column" > "By number of character" utility, but I can only enter one length at a time. To get the desired output I'd have to repeat the process 3 times, adding one column each time with the "Once, as far left as possible" option.
My real-life case has many different line types (files) to import and convert, each with more than 20 fields, so a less manual approach is needed. I'd like to somehow specify the record structure (the length of each field) and get the lines split automagically :)
This will probably need some M code, which I know nothing about. Can anybody point me in the right direction?
Thanks!
Create a query with the formula below. Let's call this query SplitText:
let
    SplitText = (text, lengths) =>
        let
            LengthsCount = List.Count(lengths),
            // Keep track of the index in the lengths list and the position in the text to take the next characters from. Use this information to get the next segment and put it into a list.
            Split = List.Generate(() => {0, 0}, each _{0} < LengthsCount, each {_{0} + 1, _{1} + lengths{_{0}}}, each Text.Range(text, _{1}, lengths{_{0}}))
        in
            Split,
    // Convert the list to a record so it can be expanded into columns
    ListToRecord = (text, lengths) =>
        let
            List = SplitText(text, lengths),
            Record = Record.FromList(List, List.Transform({1 .. List.Count(List)}, each Number.ToText(_)))
        in
            Record
in
    ListToRecord
Then, in your table, add a custom column that uses this formula:
each SplitText([Column1], {4, 2, 5, 3})
The first argument is the text to split, and the second argument is a list of lengths to split by.
Finally, expand the column to get the split text values into your table. You may want to rename the columns since they will be named 1, 2, etc.
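The bookkeeping that List.Generate does above (a running position plus the next length) may be easier to follow in a plain loop. The following Python sketch shows the same idea; it is an illustration only, not M code, and the function name is made up.

```python
def split_fixed_width(text, lengths):
    """Cut `text` into consecutive fields whose sizes are given by `lengths`."""
    fields = []
    position = 0
    for length in lengths:
        # Take `length` characters starting at the current position.
        fields.append(text[position:position + length])
        position += length
    return fields


print(split_fixed_width("AAAABBCCCCCDDD", [4, 2, 5, 3]))
# ['AAAA', 'BB', 'CCCCC', 'DDD']
```

One behavioural difference to keep in mind: Python slicing returns a shorter (or empty) field when the line is too short, rather than raising an error.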

Conversion of single explicit label to multiple binary labels in a .CSV file

I have a .csv file of the following form :
NameOfFile, Type
Example, Word Document;
Picture;
PDF;
Example2, Word Document;
Example3, Picture;
I would like to convert it to this form using a program:
Name of File, Word Document, Picture, PDF
Example, 1, 1, 1
Example2, 1, 0, 0
Example3, 0, 1, 0
So we're going from explicitly writing down the type, to having a binary feature which shows the type.
I assume there must be some clever way of doing this, since I'd imagine people fairly commonly need to do it.
What is the best-practice method for doing this?
Fill the blanks (select the entire column, HOME > Editing > Find & Select > Go To Special..., check Blanks, then type =, press Up, and press Ctrl+Enter). Then create a PivotTable with NameOfFile for ROWS, Type for COLUMNS, and Count of Type for VALUES. Order to suit and, under PivotTable Options..., check "For empty cells show:" and enter 0.
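If a program is preferred over Excel, the same fill-down-then-pivot idea can be sketched with pandas. This is a swapped-in technique, not the answer's method, and it assumes the CSV has already been parsed into the two columns shown in the question, with blank names on the continuation rows.

```python
import pandas as pd

# Assumed parsed form of the question's data; None marks a blank name cell.
df = pd.DataFrame({
    "NameOfFile": ["Example", None, None, "Example2", "Example3"],
    "Type": ["Word Document", "Picture", "PDF", "Word Document", "Picture"],
})

# Step 1: fill the blanks downward, like Go To Special > Blanks in Excel.
df["NameOfFile"] = df["NameOfFile"].ffill()

# Step 2: count each type per file, like the PivotTable; absent pairs become 0.
table = pd.crosstab(df["NameOfFile"], df["Type"])

# Put the columns in the order used in the question.
table = table[["Word Document", "Picture", "PDF"]]
print(table)
```

If a file could list the same type twice, the counts would exceed 1; clipping the table to a maximum of 1 would restore strictly binary labels.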
