I have a project where I need to highlight text in a structured PDF document and classify each highlighted part, so that I can run regexes on multiple substrings and assign the results to the proper variables. Is there a way to display a PDF on screen, let the user highlight multiple parts, and automatically classify each one into a field that I can then use to build regular expressions, without first extracting the text from the PDF and manually writing regexes for every substring of interest?
Right now I'm using the pdfplumber library in Python to extract text from PDFs line by line and append it to a string so that I can run regexes on it.
I would like to be able to highlight multiple lines of text in the PDF file, classify each of them individually, and send them automatically as arguments to whichever regular expression library I'm using, getting one or more regular expressions in return.
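The last step of this can be sketched without any GUI. Assuming each highlighted snippet has already been captured together with a user-chosen label, a rough pattern can be derived per field. Everything here is an assumption for illustration: `snippet_to_pattern`, `build_field_patterns`, and the sample labels are hypothetical, and the generalization rule (digit runs become `[0-9]+`, letter runs become `[A-Za-z]+`, anything else is matched literally) is deliberately naive.

```python
import re

# Tokenizer: a run of ASCII digits, a run of ASCII letters, or any single other character.
TOKEN_RE = re.compile(r"(?P<digits>[0-9]+)|(?P<letters>[A-Za-z]+)|(?P<other>.)", re.DOTALL)


def snippet_to_pattern(snippet):
    """Generalize a highlighted snippet into a naive regex (hypothetical helper)."""
    parts = []
    for match in TOKEN_RE.finditer(snippet):
        if match.lastgroup == "digits":
            parts.append("[0-9]+")
        elif match.lastgroup == "letters":
            parts.append("[A-Za-z]+")
        else:
            # Punctuation and whitespace are kept as literal characters.
            parts.append(re.escape(match.group(0)))
    return "".join(parts)


def build_field_patterns(highlights):
    """Map each user-assigned label to a compiled pattern for its snippet."""
    return {label: re.compile(snippet_to_pattern(text)) for label, text in highlights.items()}


# Hypothetical labels and snippets, standing in for what a user highlighted.
highlights = {"invoice_no": "INV-20431", "date": "12/03/2021"}
patterns = build_field_patterns(highlights)

# Apply the derived patterns to text extracted from another page.
page_text = "Ref INV-99817 issued 01/11/2022"
extracted = {}
for label, pattern in patterns.items():
    found = pattern.search(page_text)
    if found:
        extracted[label] = found.group(0)
```

In practice the generated patterns would need review, since a rule this simple can easily match more than intended.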
Highlight text in a PDF with Python
These might help: https://towardsdatascience.com/extracting-text-from-scanned-pdf-using-pytesseract-open-cv-cd670ee38052
https://www.thepythoncode.com/article/redact-and-highlight-text-in-pdf-with-python
For the GUI you could use GTK:
https://python-gtk-3-tutorial.readthedocs.io/en/latest/textview.html
Can Excel interpret the URLs in my CSV as hyperlinks? If so, how?
You can actually do this and have Excel show a clickable link. Use this format in the CSV file:
=HYPERLINK("URL")
So the CSV would look like:
1,23.4,=HYPERLINK("http://www.google.com")
However, I'm trying to get some links with commas in them to work properly and it doesn't look like there's a way to escape them and still have Excel make the link clickable.
Does anyone know how?
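One option, consistent with the quoted examples elsewhere in this thread, is to let a CSV writer do the quoting: a field that contains commas or double quotes is wrapped in double quotes, and its inner quotes are doubled, so the comma no longer acts as a separator. The sketch below uses Python's standard `csv` module. The helper `hyperlink_formula` and the label are hypothetical, and it assumes neither the URL nor the label contains a double quote.

```python
import csv
import io


def hyperlink_formula(url, label):
    """Build the formula text as it would be typed into a cell (hypothetical helper)."""
    # Assumption: url and label contain no double quotes.
    return f'=HYPERLINK("{url}","{label}")'


buffer = io.StringIO(newline="")
# The default dialect quotes any field containing a comma or a double quote
# and doubles the quotes inside it.
writer = csv.writer(buffer)
writer.writerow([1, 23.4, hyperlink_formula("http://www.xyz.com/file,comma.pdf", "file")])

csv_line = buffer.getvalue()
print(csv_line)
```

The printed row keeps the formula in a single quoted field even though the URL and the formula both contain commas.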
With embedding the hyperlink function you need to watch the quotes. Below is an example of a CSV file created that lists an error and a link to view the documentation on the method that failed. (Bit esoteric but that's what I am working on)
"Details","Failing Method (click to view)"
"Method failed","=HYPERLINK(""http://some_url_with_documentation"",""Method_name"")"
I read all of these answers and some others but it still took a while to work it out in Excel 2014.
The result in the csv should look like this
"=HYPERLINK(""http://www.Google.com"",""Google"")"
Note: If you are trying to set this from MSSQL server then
'"=HYPERLINK(""http://www.' + baseurl + '.com"",""' + baseurl + '"")"' AS url
You can URL-encode the commas inside the URL so that the URL is not split across multiple cells.
Just replace commas with %2c
http://www.xyz.com/file,comma.pdf
becomes
=hyperlink("http://www.xyz.com/file%2ccomma.pdf")
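The replacement can be automated when the CSV is generated. This is a minimal sketch; the function names `encode_commas` and `hyperlink_cell` are made up for illustration, and only commas are encoded, exactly as described above.

```python
def encode_commas(url):
    """Percent-encode commas so the URL cannot be split into several cells."""
    return url.replace(",", "%2c")


def hyperlink_cell(url):
    """Return the unquoted cell text for a clickable link (hypothetical helper)."""
    return f'=hyperlink("{encode_commas(url)}")'


cell = hyperlink_cell("http://www.xyz.com/file,comma.pdf")
line = "1,23.4," + cell
print(line)
```

Because the encoded URL no longer contains a comma, splitting `line` on commas still yields three fields.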
Yes, but it's not possible to link them automatically. CSV files are just text files - whatever opens and reads them is responsible for allowing you to click the link.
As to how Excel seems to handle CSV files: everything between commas is interpreted as if it had been typed into the cell. Therefore, a CSV file containing ="http://google.com",=A1 will display as http://google.com,http://google.com in Excel. It's important to note, however, that hyperlinks in Excel are metadata, and not the result of anything in the actual cell (i.e., a cell hyperlinked to Google still contains http://google.com, not <a>http://google.com</a> or anything of that sort).
Since that's the case, and all metadata is lost when converting to a CSV, it's impossible to tell Excel you wish for something to be hyperlinked merely by changing the cell value. Normally, Excel interprets your input when you hit 'Enter' and links URLs then, but since CSV data is not being entered, but rather already exists, this does not happen.
Your best bet is to write some sort of addon or macro to run when you open up a CSV which parses every cell and hyperlinks them if they match a URL format.
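The "parse every cell and link anything that looks like a URL" idea can also be applied before Excel ever opens the file. The sketch below is a preprocessing alternative in Python, not the Excel add-in or macro suggested above, and it relies on the `=HYPERLINK(...)` formula that other answers in this thread report as working. The URL pattern and the name `link_urls` are assumptions for illustration.

```python
import csv
import io
import re

# Assumption: a cell counts as a URL if it is a single http(s) token without quotes.
URL_RE = re.compile(r'^https?://[^\s"]+$')


def link_urls(csv_text):
    """Rewrite every cell that looks like a URL as a HYPERLINK formula."""
    out = io.StringIO(newline="")
    writer = csv.writer(out)
    for row in csv.reader(io.StringIO(csv_text, newline="")):
        writer.writerow(
            [f'=HYPERLINK("{cell}")' if URL_RE.match(cell) else cell for cell in row]
        )
    return out.getvalue()


converted = link_urls("name,url\r\nGoogle,http://google.com\r\n")
print(converted)
```

Cells that do not match the pattern are written back unchanged, so the step can be run over a whole file.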
Use this format:
=HYPERLINK(""<URL>"";""<LABEL>"")
e.g.:
=HYPERLINK(""http://stackoverflow.com"";""I love stackoverflow!"")
P.S. The same format works in LibreOffice Calc as well.
"=HYPERLINK(\"\" " + "http://www.mywebsite.com"+ "\"\")"
Use this format when building the string, before writing it to the CSV.
As described above, "=HYPERLINK(""http://www.google.com"", ""Google"")" is what worked for me.
However, in Excel Version 2204 (Click-to-Run), I couldn't have leading white space.
For example:
FirstName, "=HYPERLINK(""http://www.google.com"", ""Google"")" fails
FirstName,"=HYPERLINK(""http://www.google.com"", ""Google"")" success
The issue for me was that, because a .CSV is by its nature comma separated, any commas in the text file are interpreted as separators. It worked for me to use tab characters as separators and save the file as .TXT, so that when it is opened in Excel you choose the TAB character rather than ',' as the delimiter.
In the text file …
## ensure that the file is TAB separated
Item 1 A file Name data.txt
Item 2 Col 2 =HYPERLINK("http://www.ilexuk.com","ILEX")
"ILEX" is then shown in the cell and "http://www.ilexuk.com" is the hyperlink for the cell.
I have a Word document that needs to be converted to a table.
The catch, however, is that the document contains a thousand pages, and each page needs to be an individual cell in the Excel sheet. When I copy and paste from Word, each line gets converted to one cell, which I don't want. I need all the content between two page breaks to be part of one cell.
To give some background, I basically need to create a CSV from the Word file such that each page of the document is one value, hence I am trying to create a table.
Is there a way to automate this?
Found my solution here :
https://superuser.com/questions/747197/how-do-i-copy-word-tables-into-excel-without-splitting-cells-into-multiple-rows
It basically involved replacing the line breaks in my file with 'pilcrow' characters and doing the reverse in Excel.
One important thing, though: the article says to type 'alt+0010' (the key combination for a line break) when replacing the pilcrows in Excel. However, that did not work for me. Ctrl+J does the trick, though; it inserts a line break character in Excel's Replace box.
Cheers :)
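For the original goal (a CSV with one value per page), a script is another possible route. The sketch below assumes the document has already been saved as plain text and that each page break appears in that text as a form feed character (`\f`). That is an assumption worth checking against the actual export. The name `pages_to_csv` is hypothetical. Python's `csv` module quotes fields containing line breaks, so a whole page stays in one cell.

```python
import csv
import io

PAGE_BREAK = "\f"  # Assumption: page breaks are exported as form feed characters.


def pages_to_csv(document_text):
    """Write one CSV row per page, keeping each page's line breaks inside one field."""
    pages = [page.strip() for page in document_text.split(PAGE_BREAK)]
    out = io.StringIO(newline="")
    writer = csv.writer(out)
    writer.writerow(["page", "content"])
    for number, page in enumerate(pages, start=1):
        writer.writerow([number, page])
    return out.getvalue()


sample = "First line\nSecond line\fPage two text"
csv_text = pages_to_csv(sample)
print(csv_text)
```

Reading `csv_text` back with `csv.reader` gives one row per page, with the first page's two lines preserved inside a single field.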
I have created a form letter using an Excel spreadsheet as a forming tool connected to a database, using paste-link to connect the results to an MS Word document.
Each section of the document draws from a single cell, which uses a formula to assemble its text from several other cells, based on logic that depends on the data from the database queries.
All of this functions perfectly well.
The problem arises when the generated blocks of text from Excel include two carriage-returns in a row, creating what MS Word thinks is a new paragraph (and technically it is). The rest of the letter is justified, and I have attempted to set justified text as the default alignment. But no matter what I try, any newly formed paragraphs generated inside of linked text from Excel will be left-aligned.
For this form letter to function properly it must have justified text throughout. Inconsistent formatting won't be accepted by management.
To be clear, I have attempted to modify the settings of the "Normal" style of the document in Word, as well as creating a new style based on Normal called "Justified" and setting that as the default by selecting it and clicking "Change Styles" -> "Set as Default".
The first paragraph of any given block will always remain justified-aligned, it is only subsequent, newly-created (as far as MS Word knows) paragraphs that aren't. So I suspect I am just not setting the default properly or...I don't know, something.
I tried linking as unformatted text but that, for some maddening reason, includes QUOTATION MARKS bookending the text! I'm baffled and frustrated.
Please help. I don't like to look the fool at work.
While I still do not know how to make Word insert new paragraphs into linked blocks of text without left-aligning them, I have a working solution to my particular problem.
By forcing my spreadsheet to create blocks of text with the maximum number of paragraphs, then forcibly justifying the output in MS Word, I was able to ensure that, as long as I close the document between updates, the text blocks will only shrink in size rather than grow. This way, Word does not treat the updated text as a "new" paragraph, as there was already a paragraph in that block.
I saved the Word document with this overabundance of paragraphs, and put the Excel spreadsheet back the way it was.
I have transcripts in MS Word that I want to read into a stats program called R. The problem is that these documents contain special characters (not plain text). My process so far has been: substitute the characters in MS Word, save as a .txt document, read it into MS Excel (using the import wizard to make one column for person and one for dialogue), convert to .csv, and read that into R. This works but is time consuming. I found out how to read text with special characters directly into R (R generally wants plain text), but this requires the document to be an Excel file. That is desirable, because once the special characters are in R it's rather simple to substitute them all at once. The problem is that I can't get the MS Word document into Excel directly: I have to save it as a text file first (which I don't mind doing) and then read it in, and this turns the special characters into boxes and question marks. I need to get the MS Word doc into Excel as a data frame with 2 columns (person, dialogue) without destroying the special characters (“, ”, —, ’, ‘, …, etc.).
I can do this by substituting in Word with Replace, but again, if I could get the document into Excel, doing this in R would be much easier.
Here is a sample MS Word doc of what my data looks like (tab separated columns)
https://dl.dropbox.com/u/61803503/TEST.doc
Excel and Word versions 2010 on a Win 7 machine.
One way: use Edit->Copy in Word and Edit->Paste in Excel. A simple tabular structure should be preserved if you do that, with preservation of Unicode characters. Not so sure about non-Unicode stuff such as Wingdings. Haven't tried VBA-ing that, either.
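Once the characters survive the transfer, the "substitute them all at once" step is a single table lookup. The question wants to do this in R. The sketch below shows the same idea in Python, assuming the transcript is available as Unicode text with one tab-separated person/dialogue pair per line. The replacement choices and the names `to_plain` and `parse_transcript` are illustrative assumptions.

```python
# Map each special character to a plain-text stand-in (choices are assumptions).
SUBSTITUTIONS = str.maketrans({
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark
    "\u2014": "-",    # em dash
    "\u2026": "...",  # horizontal ellipsis
})


def to_plain(text):
    """Replace all mapped special characters in one pass."""
    return text.translate(SUBSTITUTIONS)


def parse_transcript(text):
    """Split tab-separated lines into (person, dialogue) pairs with plain characters."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        person, _, dialogue = line.partition("\t")
        rows.append((person.strip(), to_plain(dialogue.strip())))
    return rows


sample = "Sam\t\u201cHello\u201d \u2014 it\u2019s fine\u2026\nAnn\tOk"
rows = parse_transcript(sample)
```

The resulting pairs can be written out as a two-column CSV for R to read.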