How to extract links/text from lines containing a specified string? - text

I have a large .htm file (106 MB). Links make up about 99% of its content.
I need only the links whose URL contains a specified string or phrase.
What is an easy way to do this? The extracted links can be in HTML or plain-text format.
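
One possible approach, as a minimal Python sketch: read the file line by line, pull out `href` values with a regular expression, and keep only the URLs that contain the search string. The function name `iter_matching_links` and the file names in the usage note are hypothetical, and the regex is a rough heuristic rather than a full HTML parser. It also assumes the links sit in ordinary quoted `href` attributes.

```python
import re
from typing import Iterable, Iterator

# Matches href="..." or href='...'. This is a heuristic, not a real HTML parser.
HREF_RE = re.compile(r'href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)


def iter_matching_links(lines: Iterable[str], needle: str) -> Iterator[str]:
    """Yield every href value that contains `needle`, scanning line by line."""
    for line in lines:
        if needle not in line:  # cheap pre-filter before running the regex
            continue
        for url in HREF_RE.findall(line):
            if needle in url:
                yield url
```

To use it on a file, open the file (for example `open("links.htm", encoding="utf-8", errors="replace")`, a placeholder name), pass the file object as `lines`, and write each yielded URL to an output file. If the whole document is on a single line, the file is simply processed in one pass in memory, which should still be workable at 106 MB.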

Related

Can the Excel PowerPivot data model process string lengths of >32,767 characters in a field?

I'm working with a csv file that contains 1 field (out of 10 total fields) with very large strings (50,000+ characters). The other 9 fields contain strings of normal length (<100).
I imported this file into the PowerPivot data model and then copied the data directly from the table view in PowerPivot into Notepad++.
All of the strings in Notepad++ are 32,767 characters, which suggests that PowerPivot has the same limitations as standard Excel in this respect.
Is there something I can do in PowerPivot to enable a field to hold more than 32,767 characters, or am I going to have to find another solution?
Fyi, the objective is to extract this long string (which is a base64-encoded jpeg) from the csv and save it as a separate text file (which would then be converted back to a jpeg with PowerShell...a script I've already developed).
The remainder of the data in the original csv would be saved as a table and combined with some other data from a few other sources to create one table to upload into our Salsify PIM.
I've asked the provider of this csv if it's possible to export the very long strings as individual text files with names that I could relate back to the original dataset (which would solve my problem instantly), but there is resistance. They are insisting on putting everything in one csv.
Note that I do have some experience in Python (and of course PowerShell) and am open to learning tools like PowerAutomate or any other tool that you'd recommend for something like this.
edit: Note that the jpeg files I'm working with range in size from 10KB all the way up to ~16MB, so the base64 string can get very long (in the range of 3.5M characters).
You will have to split each image into multiple rows of 30,000 characters and concatenate them back together in DAX. Images up to about 2.1MB should be supported this way.
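
The splitting step could be done before import, for instance in Python. This is only a sketch: the row layout `(image_id, sequence, chunk)` and the function names are assumptions, not something PowerPivot prescribes. The sequence number is there so the chunks can be put back in order when they are concatenated.

```python
from typing import Iterable, List, Tuple

CHUNK_SIZE = 30_000  # characters per row, as suggested in the answer above

Row = Tuple[str, int, str]


def split_into_rows(image_id: str, b64_text: str, size: int = CHUNK_SIZE) -> List[Row]:
    """Split one long base64 string into (image_id, sequence, chunk) rows."""
    return [
        (image_id, seq, b64_text[start:start + size])
        for seq, start in enumerate(range(0, len(b64_text), size))
    ]


def join_rows(rows: Iterable[Row]) -> str:
    """Rebuild the original string by concatenating chunks in sequence order."""
    return "".join(chunk for _, _, chunk in sorted(rows, key=lambda r: r[1]))
```

`join_rows` is included only to check that the split is lossless; in the data model the same reassembly would be done in DAX.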

How does ms word vba detect the end of a paragraph

I converted an html text into a docx document using several different online converters.
Then I analysed the number of paragraphs using an Excel vba macro which opens the document and examines it. Supplied with an original docx document (ie one not converted from another format) this macro always gives the correct number of paragraphs.
Only one converter yielded a docx from which the number of paragraphs could be determined. All the others simply said there was a single paragraph with hundreds of words in it.
Somehow the html to docx converters are missing something. What is missing? Can I dob it in?
Tools / Options / View.
Examine the characters that Word uses to delimit paragraphs in the docx and the translated html.
I suspect that "paragraphs" in the translated html might be manual line breaks. If so, that would account for the fact that the paragraph count in the translated html is incorrect.
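
One way to test that suspicion outside Word is to look inside the docx itself, which is a zip archive containing `word/document.xml`. The Python sketch below counts paragraph elements (`w:p`) and break elements (`w:br`). The function name is hypothetical. Note that `w:br` also covers page and column breaks, and `w:p` elements inside tables are counted too, so the numbers are only a rough indicator.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def count_paragraphs_and_breaks(docx_file):
    """Return (paragraph_count, break_count) for a .docx path or file object."""
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = sum(1 for _ in root.iter(f"{{{W_NS}}}p"))
    breaks = sum(1 for _ in root.iter(f"{{{W_NS}}}br"))  # includes page/column breaks
    return paragraphs, breaks
```

A converted document that reports one paragraph and hundreds of breaks would fit the manual-line-break explanation.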

office 365 excel csv hyperlink not displaying correctly when imported to excel [duplicate]

Can Excel interpret the URLs in my CSV as hyperlinks? If so, how?
You can actually do this and have Excel show a clickable link. Use this format in the CSV file:
=HYPERLINK("URL")
So the CSV would look like:
1,23.4,=HYPERLINK("http://www.google.com")
However, I'm trying to get some links with commas in them to work properly and it doesn't look like there's a way to escape them and still have Excel make the link clickable.
Does anyone know how?
With embedding the hyperlink function you need to watch the quotes. Below is an example of a CSV file created that lists an error and a link to view the documentation on the method that failed. (Bit esoteric but that's what I am working on)
"Details","Failing Method (click to view)"
"Method failed","=HYPERLINK(""http://some_url_with_documentation"",""Method_name"")"
I read all of these answers and some others but it still took a while to work it out in Excel 2014.
The result in the csv should look like this
"=HYPERLINK(""http://www.Google.com"",""Google"")"
Note: If you are trying to set this from MSSQL server then
'"=HYPERLINK(""http://www.' + baseurl + '.com"",""' + baseurl + '"")"' AS url
you can URL Encode your commas inside the URL so the URL is not split across multiple cells.
Just replace commas with %2c
http://www.xyz.com/file,comma.pdf
becomes
=hyperlink("http://www.xyz.com/file%2ccomma.pdf")
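
If the CSV is generated by a script, the quote doubling and comma encoding described in the answers above do not have to be done by hand. A minimal Python sketch, with hypothetical function names: the standard `csv` module doubles the embedded quotes itself, and commas in the URL are replaced with `%2C`. It assumes neither the URL nor the label contains a double quote.

```python
import csv
import io


def hyperlink_cell(url: str, label: str) -> str:
    """Build the formula text as it would be typed into an Excel cell."""
    safe_url = url.replace(",", "%2C")  # percent-encode commas, as suggested above
    return f'=HYPERLINK("{safe_url}","{label}")'


def rows_to_csv(rows) -> str:
    """Serialise rows to CSV text; csv.writer quotes and doubles quotes as needed."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)  # default dialect: comma delimiter, "" escaping
    writer.writerows(rows)
    return buffer.getvalue()
```

For the row `["Google", hyperlink_cell("http://www.Google.com", "Google")]` this produces the same quoted form shown in the answers, with no leading white space before the quoted field.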
Yes, but it's not possible to link them automatically. CSV files are just text files - whatever opens and reads them is responsible for allowing you to click the link.
As to how Excel seems to handle CSV files - everything between commas is interpreted as if it already had been typed into the cell. Therefore, the CSV file containing ="http://google.com",=A1 will display as http://google.com,http://google.com in Excel. It's important to note, however, that hyperlinks in Excel are metadata, and not the result of anything in the actual cell (ie, a hyperlinked cell to Google still contains http://google.com not <a>http://google.com</a> or anything of that sort.)
Since that's the case, and all metadata is lost when converting to a CSV, it's impossible to tell Excel you wish for something to be hyperlinked merely by changing the cell value. Normally, Excel interprets your input when you hit 'Enter' and links URLs then, but since CSV data is not being entered, but rather already exists, this does not happen.
Your best bet is to write some sort of addon or macro to run when you open up a CSV which parses every cell and hyperlinks them if they match a URL format.
Use this format:
=HYPERLINK(""<URL>"";""<LABEL>"")
e.g.:
=HYPERLINK(""http://stackoverflow.com"";""I love stackoverflow!"")
P.S. The same format works in LibreOffice Calc as well.
"=HYPERLINK(\"\" " + "http://www.mywebsite.com"+ "\"\")"
use this format before writing to CSV.
As described above, "=HYPERLINK(""http://www.google.com"", ""Google"")" is what worked for me.
However, in Excel Version 2204 (Click-to-Run), I couldn't have leading white space.
For example:
FirstName, "=HYPERLINK(""http://www.google.com"", ""Google"")" fails
FirstName,"=HYPERLINK(""http://www.google.com"", ""Google"")" success
The issue here for me was that because a .CSV is by its nature comma separated, any commas in the text file are interpreted as separators. It worked for me to use tab characters as separators and save the file as .TXT, so that when it is opened in Excel you choose the TAB character rather than ','.
In the text file …
## ensure that the file is TAB separated
Item 1 A file Name data.txt
Item 2 Col 2 =HYPERLINK("http:\www.ilexuk.com","ILEX")
"ILEX" then is shown in the cell and "http:\www.ilexuk.com" is the hyperlink for the cell.

Downloading CSV data into Excel from a Browser

So I have a script in PHP that creates tab separated CSV output.
I have a button in my HTML that works like so:
Export Data
Ideally I want the user to open this CSV file in Excel.
The issue I have here is with tab separated CSVs, the file extension, and how Excel handles all of this. For example:
download="export.csv"
Results in the Browser asking me to open this in Excel (wanted behaviour), but then once in Excel none of the columns are respected as they are tab separated (not comma separated, which Excel is obviously expecting).
download="export.xls"
Results in the Browser asking me to open this in Excel (again, wanted behaviour), but then Excel complains that the file extension and the contents do not match and gives the user a warning. If the user goes past this warning the data displays as expected, but I could do without the warning.
download="export.txt"
Results in the Browser downloading the file as a text file. Once imported into Excel, the columns are respected, but I could do with this being thought of as an Excel file like CSV files are.
download="export.tsv"
Results in the Browser downloading the file, but as this extension isn't recognized, it will need to be imported into Excel manually, which isn't what I am after. In fact, even though TSV is the most correct file extension for tab separated values, the TXT extension seems to work more smoothly.
I am unable to set file associations on the end user's machine, and I would like to avoid going down the "export your data as an actual XLSX file" route if at all possible. I would prefer to use tab separated CSVs over comma separated CSVs because the exported data naturally contains lots of commas.
EDIT:
As Ron Rosenfeld suggested, I tried outputting a comma separated CSV file with quotes around the data. The file loads into Excel with columns preserved; however, the quotes appear on every piece of data in every column that uses quotes.
Is it possible to not have the quotes appear?
Ideally I would prefer to have the content tab separated, but at this stage anything that allows me to open a CSV file from a browser into Excel would be great.
I want a way to download a tab separated CSV file from a browser to Excel with as little fuss as possible. How can this be achieved?
The difference between CSV and TSV files, as long as the creator followed the rules, is that a CSV file has comma separated values and a TSV file has tab separated values.
For TXT files, there is no formatting specified.
CSV files are comma-delimited, so you have to use this:
sep=,
And TSV files are tab-delimited, so you have to use this:
sep=\t
If you have MS Excel installed on your computer, CSV files are closely associated with Excel.
Please, look at this post to find out what the use of sep=; for UTF-8 and UTF-16LE leads to.
It's very important to properly output UTF-8 and UTF-16LE CSV files in PHP.
So THIS POST will be informative and useful for you.
CSV means "comma separated values", so the default separator is a ,.
To change that separator to a tab, put
sep=\t
as the first line in your .csv-file (yes, you can still name it .csv). That tells excel what the delimiter character should be.
Note that if you open the .csv with an actual text editor, it should read like
sep= (an actual tabulator character here, it's just not visible...)
This feature is not defined in RFC 4180, the .csv specification, so whether it works with any software other than Excel depends on that software's implementation.
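
The question uses PHP, but the shape of the output is the same in any language. As a rough Python sketch (the function name is hypothetical): write `sep=` followed by a literal tab as the first line, then the tab-separated rows. Because the delimiter is a tab, commas in the data need no quoting.

```python
import csv
import io


def tab_separated_for_excel(rows) -> str:
    """Return tab-separated text whose first line is Excel's `sep=` hint."""
    buffer = io.StringIO()
    buffer.write("sep=\t\r\n")  # a literal tab character follows "sep="
    writer = csv.writer(buffer, delimiter="\t")
    writer.writerows(rows)
    return buffer.getvalue()
```

The returned text would then be served with the `export.csv` file name from the question. Programs other than Excel may show the `sep=` line as an ordinary first row.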
I have done this before. A painful experience, which I'd rather not relive, but since you asked (and bountied):
Make sure your http-headers read: Content-Type: application/x-www-form-urlencoded
Make ; your separator
Don't enclose by " (This is a magic I have yet to understand).
Fingers crossed

Excel to CSV with special characters

This question has been asked before. The solution given on this post works only for some of the special characters. When I save the file as UTF-8, some characters are saved as question marks in the CSV file. For example, the character ∑ is saved as ? in the CSV file.
Is there another encoding method that I should use?
The file that I want to save includes these special characters : ∏∑€₭₮₲₽£¥§ßØæŒƂƌƜȹȻɄɅɸͶΏΨπλЖяӨ
