Convert line separated data into excel data from pdf - excel

I have a pdf which has word meanings in the following format:
word1 meaning1
_____________________
word2 meaning2 in
multiple lines.
_____________________
word3 meaning3
_____________________
I want to store this information in Excel as
| column1 | column2                     |
+---------+-----------------------------+
| word1   | meaning1                    |
| word2   | meaning2 in multiple lines. |
| word3   | meaning3                    |
For each pair, I want only one cell for the word and one for the meaning. I tried converting the PDF to Excel with online tools, but the entries are not identified correctly. Copy-pasting does not work either, because multiple lines in the PDF become multiple rows in Excel, and merging them (via a macro) with a full stop as the delimiter fails because some meanings contain full stops. Is there an easy way to achieve this?
Update screenshot of Excel needed:
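One possible approach, as a minimal sketch rather than a tested solution: copy the text out of the PDF, split it on the underscore lines instead of on full stops, and write the result as CSV, which Excel can open. The function names `parse_entries` and `to_csv` are hypothetical, and the sketch assumes each word is a single token with no spaces in it.

```python
import csv
import io
import re


def parse_entries(text):
    """Split text copied from the PDF into (word, meaning) pairs."""
    entries = []
    # A line made up only of underscores separates one entry from the next.
    for chunk in re.split(r"^[ \t]*_{3,}[ \t]*$", text, flags=re.MULTILINE):
        # Collapse line breaks inside a multi-line meaning into single spaces.
        flat = " ".join(chunk.split())
        if not flat:
            continue
        # Assumption: the word is the first whitespace-delimited token.
        word, _, meaning = flat.partition(" ")
        entries.append((word, meaning))
    return entries


def to_csv(entries):
    """Render the pairs as CSV text with one row per word."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["column1", "column2"])
    writer.writerows(entries)
    return buffer.getvalue()


sample = """word1 meaning1
_____________________
word2 meaning2 in
multiple lines.
_____________________
word3 meaning3
_____________________
"""
entries = parse_entries(sample)
csv_text = to_csv(entries)
```

Because the split key is the separator line, full stops inside a meaning no longer matter.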

Related

Use pandas to conditionally format substrings in Excel

I have an Excel table like so:
+------------+-----------------------+
| String1 | String2 |
+------------+-----------------------+
| Example 1 | This is example 1 |
| Example 2 | The second Example, 2 |
+------------+-----------------------+
I'm trying to compare the two strings, and format them conditionally. Ideally, I'd be able to create a third column, with the string difference in bold (or whatever formatting I want, applied) like so:
+-----------+-----------------------+-----------------------+
| String1   | String2               | Formatted String      |
+-----------+-----------------------+-----------------------+
| Example 1 | This is Example 1     | This is Example 1     |
| Example 2 | The second Example, 2 | The second Example, 2 |
+-----------+-----------------------+-----------------------+
I know that using XlsxWriter I can apply conditional formatting to a df as I'm writing to excel, but it seems I can only do that to an entire cell. Is there any way to apply my formatting to some contents of each cell?
Alternatively, could I insert HTML tags into my df to produce say, "<b>This is</b> Example 1" and then render those tags in excel?
For anyone encountering this problem: XlsxWriter can output rich strings. You have to define all your formats explicitly, but it works.
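A rough sketch of how that could look, assuming the "difference" is whatever part of String2 does not appear in String1. The helper names `diff_segments` and `write_diff_column` are hypothetical. The diff uses the standard-library `difflib`, and the writing step uses XlsxWriter's `write_rich_string`, which takes format objects interleaved with the string fragments they apply to. As far as I can tell, XlsxWriter ignores a rich string with too few fragments, so cells with a single segment fall back to a plain write.

```python
from difflib import SequenceMatcher


def diff_segments(reference, text):
    """Return [(fragment, is_new), ...] covering `text`.

    is_new is True for fragments that do not occur in `reference`.
    Adjacent fragments with the same flag are merged.
    """
    segments = []
    matcher = SequenceMatcher(None, reference, text, autojunk=False)
    for tag, _, _, j1, j2 in matcher.get_opcodes():
        if j2 == j1:
            continue  # 'delete' opcodes consume nothing from `text`
        is_new = tag != "equal"
        if segments and segments[-1][1] == is_new:
            segments[-1] = (segments[-1][0] + text[j1:j2], is_new)
        else:
            segments.append((text[j1:j2], is_new))
    return segments


def write_diff_column(path, pairs):
    """Write String1, String2 and a third column with the new parts in bold."""
    import xlsxwriter  # third-party package: pip install XlsxWriter

    workbook = xlsxwriter.Workbook(path)
    worksheet = workbook.add_worksheet()
    bold = workbook.add_format({"bold": True})
    for row, (string1, string2) in enumerate(pairs):
        worksheet.write_string(row, 0, string1)
        worksheet.write_string(row, 1, string2)
        segments = diff_segments(string1, string2)
        if len(segments) < 2:
            # Single segment: format the whole cell instead of a rich string.
            whole_cell = bold if segments and segments[0][1] else None
            worksheet.write_string(row, 2, string2, whole_cell)
            continue
        fragments = []
        for fragment, is_new in segments:
            if is_new:
                fragments.append(bold)  # a format applies to the next fragment
            fragments.append(fragment)
        worksheet.write_rich_string(row, 2, *fragments)
    workbook.close()


example = diff_segments("Example 1", "This is Example 1")
```

Calling `write_diff_column("diff.xlsx", [("Example 1", "This is Example 1")])` should then produce a third column with "This is " in bold.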

excel filter data with many headers

I have a really long excel spreadsheet which I need to sort in a really unusual way:
I have many columns, one of which is full of numbers and blank spaces. The column is cut into many parts and is separated by blank spaces. The blank spaces act as the beginning and the end of two areas.
What I need to do is to leave only the numbers that are bigger than 999999999 and smaller than 2000000000 while keeping only the blank spaces adjacent to them. (and filtering all other columns the same way this one column is filtered)
--- Example Table:
Name | ID................. | other data
Bob.. |......................|~-~-~---~-``~
Taxes | 1000077008 | ~~ -`~ `~ ~--
Alice |......................| ~~--~-~ ~_~
Carel |......................|~~ ~ ~--_ ~~
Beans | 2000007804 | ~ ~_~ `~ ~~ `
Coffee| 1000078363 | ~ ~-`--`-` `_~-
--- Example Filtered Table:
Name | ID................. | other data
Bob.. |......................|~-~-~---~-``~
Taxes | 1000077008 | ~~ -`~ `~ ~--
Carel.|......................|~~ ~ ~--_ ~~
Coffee| 1000078363 | ~ ~-`--`-` `_~-
The spaces in front of the numbers show that the format is Text. Change the format in Excel to General or to Numeric and use a custom filter to achieve what you want.
This is an example of a custom filter:
If you cannot change the format, then use the =INT(TRIM(A1)) formula in Excel and sort them.
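For comparison, the same filter can be sketched outside Excel. This is only one reading of the example tables: each blank-ID row is treated as the header of the rows that follow it, and a header is kept only if at least one row in its section passes the range check. The function name `filter_sections` is hypothetical, and `strip()` plus `int()` play the role of `INT(TRIM(...))`.

```python
def filter_sections(rows, low=999_999_999, high=2_000_000_000):
    """Keep rows whose ID is strictly between low and high, plus their header.

    rows is a list of (name, id_text) tuples; id_text may be blank or
    padded with spaces because the column is stored as text.
    """
    kept = []
    header = None  # most recent blank-ID row that has not been emitted yet
    for name, id_text in rows:
        text = id_text.strip()
        if not text:
            header = (name, id_text)
            continue
        if low < int(text) < high:
            if header is not None:
                kept.append(header)
                header = None
            kept.append((name, id_text))
    return kept


table = [
    ("Bob", ""),
    ("Taxes", " 1000077008"),
    ("Alice", ""),
    ("Carel", ""),
    ("Beans", " 2000007804"),
    ("Coffee", " 1000078363"),
]
filtered_names = [name for name, _ in filter_sections(table)]
```

On the example data this keeps Bob, Taxes, Carel and Coffee, which matches the filtered table in the question.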

Pre-pend required number of characters in Excel

I need to prepend field values in an Excel sheet with the required number of characters to equal 5 characters in the field, then concatenate two fields and have all of the characters show in the new field.
Example:
Field 1 | Field 2 | Show as
abc | 123 | 00abc00123
d | 5678 | 0000d05678
Ideas?
I think what you may want is something like:
=REPT(0,5-LEN(A1))&A1&REPT(0,5-LEN(B1))&B1
You could also use TEXT on a number field.
=REPT(0,5-LEN(A1))&A1&TEXT(B1,"00000")
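To sanity-check the expected output outside the spreadsheet, the same padding can be sketched in Python with `str.rjust`. The function name `pad_and_join` is hypothetical. Unlike `REPT` with a negative count, `rjust` simply leaves values longer than five characters unchanged.

```python
def pad_and_join(field1, field2, width=5, fill="0"):
    """Left-pad both fields to `width` characters and concatenate them."""
    return str(field1).rjust(width, fill) + str(field2).rjust(width, fill)


first = pad_and_join("abc", 123)
second = pad_and_join("d", 5678)
```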

Specific concatenation of text cells by formula

I'm having difficulty producing a CONCATENATE formula that combines text cells in the way that I want. There are five fields that I want to concatenate: Title, Forename, RegnalNumber, Surname, and Alias, in that order. I'm no regex expert, so excuse the poor formatting, but this is a rough way of expressing what I'm trying to achieve:
(title)? (forename) (regnalnumber)? (surname)?, (alias).
The only field that can't be null is the forename field, although it might have the value "?", in which case it shouldn't output anything in the concatenation, i.e. it should be treated as blank. Hopefully the following test cases should demonstrate the output I'm trying to achieve: the output on the right is what it should look like:
| Title  | Forename | RN | Surname   | Alias        | CONCATENATE                           |
+--------+----------+----+-----------+--------------+---------------------------------------+
| Ser    | Jaime    |    | Lannister | Kingslayer   | Ser Jaime Lannister, Kingslayer       |
|        | Pate     |    |           |              | Pate                                  |
| Lord   | ?        |    | Vance     |              | Lord Vance                            |
| King   | Aerys    | II | Targaryen | The Mad King | King Aerys II Targaryen, The Mad King |
| Lord   | Jon      |    | Arryn     |              | Lord Jon Arryn                        |
|        | Garth    |    |           | Of Oldtown   | Garth, Of Oldtown                     |
I've experimented for ages trying to make this concatenation work, but haven't been able to get it right. This is the current formula, with cell references replaced by the field name for comprehensibility:
=CONCATENATE(IF(Title<>"",Title&" ",""),IF(AND(Forename<>"",Forename<>"?"),Forename,""),IF(RN<>""," "&RN,""), IF(OR(AND(Forename<>"", Forename<>"?"), Surname<>"", RN<>""), " ",""), IF(Surname<>"",Surname,""),IF(AND(Alias<>"",OR(Alias<>"",AND(Forename<>"", Forename<>"?"),Surname<>"")),", "&Alias, Alias))
There is one case where it doesn't work: if the Surname and RN are null but the Forename and Alias are non-null. For example, if the Forename is Garth and the Alias is Of Oldtown, the concatenation outputs: Garth , Of Oldtown. It's the same if the Title is non-null. There shouldn't be a space before the comma.
Can you help me to fix this formula so it works as expected? If you can find a way to simplify it, even better! I know I'm probably overcomplicating this a great deal. I'm using LibreOffice Calc 4.3.1.2, not Excel.
The best way imho to solve situations like this is to divide the problem over multiple simple columns, rather than 1 huge complex formula. Remember you can always hide the columns that you don't want to see.
So create a column for Title that says =if(a2="","",a2&" ").
That can be extended for all the other columns, except:
for Forename, where you want to include the "?" as follows: =if(b2="?","",b2&" ")
for Alias, where you want to include the leading ",": =if(e2="","",", "&e2)
Lastly just concatenate each of your working columns with something like: =f2&g2&h2&i2&j2.
This breaks the problem down into very simple components, and makes it easy to debug. If you want to add extra functionality at a later stage, it is easy to swap out one of your formulae for something else.
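The same decomposition (normalise each field, then join only the non-empty pieces) can be checked against the test cases outside the spreadsheet. This is a minimal Python sketch, and the function name `display_name` is hypothetical. Joining the non-empty parts with a single space, rather than appending a space to each part, avoids a trailing space or a space before the comma.

```python
def display_name(title, forename, regnal_number, surname, alias):
    """Build 'Title Forename RN Surname, Alias', skipping blank fields."""
    # The question treats a forename of "?" as blank.
    if forename.strip() == "?":
        forename = ""
    parts = [part.strip() for part in (title, forename, regnal_number, surname)]
    name = " ".join(part for part in parts if part)
    alias = alias.strip()
    if name and alias:
        return f"{name}, {alias}"
    return name or alias


cases = [
    (("Ser", "Jaime", "", "Lannister", "Kingslayer"), "Ser Jaime Lannister, Kingslayer"),
    (("", "Pate", "", "", ""), "Pate"),
    (("Lord", "?", "", "Vance", ""), "Lord Vance"),
    (("King", "Aerys", "II", "Targaryen", "The Mad King"), "King Aerys II Targaryen, The Mad King"),
    (("Lord", "Jon", "", "Arryn", ""), "Lord Jon Arryn"),
    (("", "Garth", "", "", "Of Oldtown"), "Garth, Of Oldtown"),
]
results = [display_name(*fields) for fields, _ in cases]
```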
I know this is only a bit of fun, but can I suggest a more algorithmic approach?
The algorithm is:-
If a field is empty or ?, do nothing
Else
    If concatenation so far is empty, add field to concatenation
    Else
        Add a space followed by the field to concatenation
which leads to this formula in G2 :-
=IF(OR(A2="",A2="?"),F2,IF(F2="",A2,F2&" "&A2))
(need to put single apostrophes in column F to make it work)
which when copied across looks like this:-

Highlighting ascii tables

Some (ascii) reports I produce contain ascii tables, like this one:
+------+------+------+
| col1 | col2 | col3 |
+======+======+======+
| bla | bla | bla |
| bla | bla | bla |
| bla | bla | bla |
+------+------+------+
I am trying to find a way to highlight such tables using a vim syntax file. A simple highlighting should suffice - no need to distinguish between the |, the =, the + and the -. However, I do not want to highlight the words inside the table (only the skeleton), and I do not want to highlight -, = signs (etc.) outside of the table.
The problem with Vim syntax files is that they have no way of determining what is "up" or "down" relative to a given point. I would be OK with just highlighting per line, for example, lines like this:
+------+------+------+
even if they do not form nice tables, but the problem is with lines like this:
| col1 | col2 | col3 |
which may be mixed with non-tabular code, like this Python code:
x = y\
| z | u | v # | is here for 'or'
Can you think of a more elegant way of doing this? I've seen some highlighters (other than Vim) which highlight tables quite well...
You can solve this with containment, cf. :help :syn-contains. First, define a region that spans the entire range of lines that make up the table. I'm using a simplistic pattern for the header / footer line here, and assert that there's no | immediately above / below (in the neighboring line); refine this as needed:
syn region tableRegion start="|\@<!\n+[-+]*+$" end="^+[-+]*+\n|\@!" contains=tableRow
Then, define the (again, simplistic here) pattern to match the table rows, and mark this contained, so it will only match inside other syntax regions that contains= it.
syn match tableRow "^|.*|$" contained
