How can I compile a list of unique image filenames in a set of html files?

I have ~3,600 html files with a ton of image tags in them. I'd like to be able to capture all the src attribute values used in these files and aggregate them into a text file where I can then remove duplicates and see how many unique image filenames there are overall.
I use BBEdit and I can easily use regex and multi-file search to find all the image references (18,673), but I don't want to replace them with anything -- instead, I want to capture them from the BBEdit search results 'Notes' and push them into another file.
Is this something that can be AppleScripted? Or are there other means to the same end that would be appropriate?

You've got a tall task there because there are many parts you have to solve. To give you a start, here's some advice on reading one HTML file and putting all of its image src values into an AppleScript list. You'll have to do much more than that, but it's a beginning.
First, you can read an HTML file into AppleScript as plain text. Something like this will get the text of one HTML file...
set theFile to choose file
set htmlText to read theFile
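Once you have the text in AppleScript, you can use text item delimiters to grab the src values. Here's an example. It should work however complex the HTML is, as long as each src value is wrapped in double quotes; note that it picks up every src attribute, not only those on img tags...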
set htmlText to "<img src=\"smiley.gif\" alt=\"Smiley face\" height=\"42\" width=\"42\" />
<img src=\"smiley.gif\" alt=\"Smiley face\" height=\"42\" width=\"42\" />
<img src=\"smiley.gif\" alt=\"Smiley face\" height=\"42\" width=\"42\" />"
-- split the text at every src=" marker
set text item delimiters to "src=\""
set a to text items of htmlText
set imageList to {}
-- every chunk after the first starts with a src value that ends at the next quote
set text item delimiters to "\""
repeat with i from 2 to count of a
	set thisImage to first text item of (item i of a)
	set end of imageList to thisImage
end repeat
-- restore the default delimiters before returning, even when nothing was found
set text item delimiters to ""
return imageList
I hope that helps!
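For the remaining parts of the task (walking all ~3,600 files and removing duplicates), here is a minimal sketch in Python rather than AppleScript. It assumes the files end in .html and sit under one folder; the names ImgSrcCollector, image_sources and unique_image_names are made up for this example. It uses the standard library's HTML parser, so only src attributes on img tags are collected.

```python
from html.parser import HTMLParser
from pathlib import Path


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <img ... /> are routed here as well.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


def image_sources(html_text):
    """Return every img src found in one HTML string, duplicates included."""
    parser = ImgSrcCollector()
    parser.feed(html_text)
    parser.close()
    return parser.sources


def unique_image_names(folder):
    """Return the sorted, de-duplicated src values of all .html files under folder."""
    names = set()
    for path in Path(folder).rglob("*.html"):
        # Assumption: UTF-8 files; undecodable bytes are replaced, not fatal.
        text = path.read_text(encoding="utf-8", errors="replace")
        names.update(image_sources(text))
    return sorted(names)
```

Calling unique_image_names on the folder and writing the result with "\n".join(...) gives one filename per line; len() of the list is the number of unique images.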


How to search part of a text

I need to extract parts of a larger text. The parts I want always begin with the same start word and finish with the same end word. How can I do that?
I'm trying to write a program that searches for these parts; each one is marked by a start key and an end key.
The text format is this:
my random text this my random text this my random text this my random
text this my random text this MY_START_WORD_KEY my text this my text
this my text this my text this my text this MY_END_WORD_KEY my random text this my random text this my random text this my random text this my random text this MY_START_WORD_KEY my text this my text
this my text this my text this my text this MY_END_WORD_KEY my random text this my random text this my random text this my random text this my random text this
I created this code:
txt = "my_text.txt"
with open(txt, encoding="utf8") as text:
all_text =
for part in temp:
if end in part:
But that way the start and end words are lost in my final text.
I need everything from the start keyword through the end keyword.
You can try something like this:
import re
START_KEY = "START_KEY"
END_KEY = "END_KEY"
txt = '''my random text this my random text this my random text this my random text this my random text this START_KEY my text this my text this my text this my text this my text this END_KEY my text this my text this my text this my text this my text this my text this my text this
more random START_KEY text END_KEY.'''
# the non-greedy .*? stops at the nearest END_KEY; re.DOTALL lets a match span several lines
matches = re.findall(re.escape(START_KEY) + r"\s.*?\s" + re.escape(END_KEY), txt, flags=re.DOTALL)
The result will be:
matches = ['START_KEY my text this my text this my text this my text this my text this END_KEY',
 'START_KEY text END_KEY']
Regex can be very useful for things like this; the documentation for Python's re module covers it in more detail.
Assuming you only see one occurrence of the start and end word in the text:
text = ...
result = text.split(start)[1].split(end)[0].strip()
You can split and select the middle portion.
Try this:
keys = ['start','end']
text = "Everything between key[0] and key[1]+3 is supposed to return as substring. start extracting here and stop at the end. This part should not appear in extracted substring."
start = text.index(keys[0])
end = text.index(keys[1])
print(text[:start], '\n', text[start:end+3], '\n', text[end+3:])
To keep the START and END words, use a regexp like:
To remove the limiting words, use the following regexp:
I hope it helps.
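The two variants mentioned in the answers (keeping the key words or dropping them) could look roughly like this. It is a sketch, not the regexps from the original answer: the function name between and the keep_keys flag are made up here, and the key names are taken from the question.

```python
import re

START = "MY_START_WORD_KEY"
END = "MY_END_WORD_KEY"


def between(text, start=START, end=END, keep_keys=True):
    """Return every section of text that runs from start to the nearest end."""
    # re.escape guards against keys that contain regex metacharacters.
    # (.*?) is non-greedy, so each start pairs with the closest following end.
    # re.DOTALL lets a section run across line breaks.
    pattern = re.compile(re.escape(start) + r"(.*?)" + re.escape(end), re.DOTALL)
    if keep_keys:
        # group(0) is the whole match, key words included.
        return [m.group(0) for m in pattern.finditer(text)]
    # group(1) is only the part between the key words.
    return [m.group(1).strip() for m in pattern.finditer(text)]
```

With keep_keys=False the surrounding whitespace is stripped as well, which matches the split-based answer above.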

I need help in Python with displaying the contents of a 2D Set into a Tkinter Textbox

Disclaimer: I have only begun to learn Python. I took a crash course about a month ago just to learn the very basics, and the rest of what I know comes from research through Google and from looking at solutions here on Stack Overflow.
I am trying to create an application that will read all PDF files stored in a folder and extract their filenames, page numbers, and the contents of the first page, and store this information into a 2D set. Once this is done, the application will create a tkinter GUI with 2 listboxes and 1 text box.
The application should display the PDF filenames in the first listbox, and the corresponding page numbers of each file in the second listbox. Both listboxes are synched in scrolling.
The text box should display the text contents on the first page of the PDF.
What I want to happen is that each time I click a PDF filename in the first listbox with the mouse or with up or down arrow keys, the application should display the contents of the first page of the selected file in the text box.
This is how my GUI looks and how it should function
I have been successful in all other requirements so far except the part where when I select a filename in the first listbox, the contents of the first page of the PDF should be displayed in the text box.
Here is my code for populating the listboxes and text box. The contents of my 2D set pdfFiles is [['PDF1 filename', 'PDF1 total pages', 'PDF1 text content of first page'], ['PDF2 filename', 'PDF2 total pages', 'PDF2 text content of first page'], ... etc.
===========Setting the Listboxes and Textbox=========
scrollbar = Scrollbar(list_2)
scrollbar.pack(side=RIGHT, fill=Y)
list_1.bind("<MouseWheel>", scrolllistbox2)
list_2.bind("<MouseWheel>", scrolllistbox1)
txt_3 = tk.Text(my_window, font='Arial 10', wrap=WORD)
txt_3.place(rely=0.12, relwidth=0.472, relheight=0.86)
scrollbar = Scrollbar(txt_3)
scrollbar.pack(side=RIGHT, fill=Y)
list_1.bind("<<ListboxSelect>>", CurSelect)
============Populating the Listboxes with the content of the 2D Set===
i = 0
while i < count:
    list_1.insert(tk.END, pdfFiles[i][0])
    list_2.insert(tk.END, pdfFiles[i][1])
    i = i + 1
============Here is my code for CurSelect function========
def CurSelect(evt):
    values = [list_1.get(idx) for idx in list_1.curselection()]
    print(", ".join(values))  # ????
The print command above is just my test command to show that I have successfully extracted the selected item in the listbox. What I need now is to somehow link that information to its corresponding page content in my 2D list and display it in the text box.
Something like:
1) select the filename in the listbox
2) link the selected filename to the filenames stored in the pdfFiles 2D set
3) once filename is found, identify the corresponding text of the first page
4) display the text of the first page of the selected file in the text box
I hope I am making sense. Please help.
You don't need much to finish it. You just need some small things:
1. Get the selected item of your listbox:
selected_indexes = list_1.curselection()
first_selected = selected_indexes[0] # it's possible to select multiple items
2. Get the corresponding PDF text:
pdf_text = pdfFiles[first_selected][2]
3. Change the text of your Text widget:
txt_3.delete("1.0", tk.END)
txt_3.insert(tk.END, pdf_text)
So replace your CurSelect(evt) function with this:
def CurSelect(evt):
    selected_indexes = list_1.curselection()
    if not selected_indexes:  # the event also fires when the selection is cleared
        return
    first_selected = selected_indexes[0]
    pdf_text = pdfFiles[first_selected][2]
    txt_3.delete("1.0", tk.END)
    txt_3.insert(tk.END, pdf_text)
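The same lookup can be split into a small helper that is testable without opening a window. This is only a sketch: first_page_text and show_selected are names invented here, and pdf_files is assumed to have the [filename, total pages, first-page text] layout described in the question.

```python
def first_page_text(pdf_files, selection):
    """Return the first-page text for the first selected row, or None."""
    if not selection:
        # <<ListboxSelect>> can fire with an empty selection.
        return None
    row = selection[0]
    # Column 2 holds the first-page text in the layout assumed above.
    return pdf_files[row][2]


def show_selected(listbox, text_widget, pdf_files):
    """Copy the selected file's first-page text into the text widget."""
    text = first_page_text(pdf_files, listbox.curselection())
    if text is None:
        return  # nothing selected, so leave the text box unchanged
    text_widget.delete("1.0", "end")  # "end" is the value of tk.END
    text_widget.insert("end", text)
```

Inside the handler this would be called as show_selected(list_1, txt_3, pdfFiles). Because the listbox rows were inserted in the same order as pdfFiles, the row index is enough and no search by filename is needed.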

Python Selenium, the text of the next element

I can get the text of the first element. But I do not know how to go through the entire list and get the text of each element. Here is the tree from the site:
So I get the text of the first element:
print(driver.find_element_by_xpath("//span[@ng-bind=\"'#' + user.username\"]").text)
In each
div class="md_modal_list_peer_wrap clearfix" ng-repeat="participant in
is contained
div class="md_modal_list_peer_name"
which contains
a class="md_modal_list_peer_name"
which you need to press. That is, execute:
After that, a new window opens, from which I get the text of the element:
driver.find_element_by_xpath("//span[@ng-bind=\"'#' + user.username\"]").text
But there are several of these elements, and I need to get the text from every one of them:
div class="md_modal_list_peer_wrap clearfix" ng-repeat="participant in
How to do it?
Vladimir, I haven't done a careful analysis of this problem; however, could it be as simple as this?
Rather than using
driver.find_element_by_xpath("//span[@ng-bind=\"'#' + user.username\"]").text
could you use:
for span in driver.find_elements_by_xpath("//span[@ng-bind=\"'#' + user.username\"]"):
    print(span.text)
(Notice the plural in `find_elements_by_xpath`.)
You'll need to click each element and store the text in a list.
First, use
elements_to_click = driver.find_elements_by_xpath("//a[@my-peer-link='participant.user_id']")
This will return a list of elements.
Then loop through those elements:
1. click on the element
2. switch to the new window (only if a new window handle was created; otherwise skip this step)
3. get the text via
driver.find_element_by_xpath("//span[@ng-bind=\"'#' + user.username\"]").text
4. store it in a list
5. close the window (only if a new window handle was created; otherwise skip this step)
6. switch back to the previous window
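The loop described above might be sketched like this. The function name collect_texts is invented, both XPath expressions are passed in rather than hard-coded, and the page is assumed to return to its list after each click. It uses find_elements("xpath", ...), the form newer Selenium releases expect (the string "xpath" is the value of By.XPATH) in place of the find_element_by_xpath helpers used in this thread.

```python
def collect_texts(driver, link_xpath, text_xpath):
    """Click every element matching link_xpath and collect the text at text_xpath."""
    texts = []
    original = driver.current_window_handle
    count = len(driver.find_elements("xpath", link_xpath))
    for index in range(count):
        # Re-locate the links on every pass; earlier references may have gone stale.
        links = driver.find_elements("xpath", link_xpath)
        if index >= len(links):
            break
        known = set(driver.window_handles)
        links[index].click()
        # Switch only if the click really opened a new window handle.
        new_handles = [h for h in driver.window_handles if h not in known]
        if new_handles:
            driver.switch_to.window(new_handles[0])
        texts.append(driver.find_element("xpath", text_xpath).text)
        if new_handles:
            driver.close()
            driver.switch_to.window(original)
    return texts
```

If the click opens an in-page dialog instead of a window, the dialog probably has to be dismissed before the next pass; that step depends on the site and is not shown.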

Create Colour in InDesign CS5 with CMYK values from user input

I'm trying to create a script which, in the process of opening a template file, asks the user for a set of CMYK values.
The idea is to then either change the value of an existing colour (called "Primary Colour") therefore changing the colour of every item to which it is applied...or add this new colour and delete 'Primary Colour' replacing it with new colour.
The problem is I can't get past creating a new colour with user input values. I can create a new colour with;
set New_Swatch to make color with properties {name:"New Primary Colour", model:process, color value:{82,72,49,46}}
however as soon as I try to replace the color value with a variable I get the error;
"Adobe InDesign CS5 got an error: Invalid parameter."
Here is a snippet of code in context;
set primaryColour to text returned of (display dialog "Enter CMYK values of Primary Colour (separated by commas e.g. 0,0,0,0)" default answer "") as string
tell application "Adobe InDesign CS5"
tell active document
set New_Swatch to make color with properties {name:"new", model:process, color value:primaryColour}
end tell
end tell
Any help gratefully received.
I currently use this:
set primaryColor to text returned of (display dialog "Enter CMYK values of Primary Colour (separated by commas e.g. 0,0,0,0)" default answer "") as string
set text item delimiters to ","
set colorvalue to {}
repeat with i from 1 to count of text items of primaryColor
copy (text item i of primaryColor as number) to end of colorvalue
end repeat
set colorname to "TEST"
tell application "Adobe InDesign CS5"
tell active document
set newcolor to make color with properties {name:colorname, space:CMYK, model:process, color value:colorvalue}
end tell
end tell
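Why? Because it works. It is not pretty, and it was not my first, or even my tenth, method to get the job done. Why does it work? My best guess is that each text item is coerced to a number, so color value receives a list of four numbers rather than text.
You would think that: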
set text item delimiters to ","
set {C,M,Y,K} to text items of primaryColor
set newcolor to make color with properties {name:colorname, space:CMYK, model:process, color value:{C,M,Y,K}}
would do the trick, but it doesn't, most likely because the text items are still strings at that point. I'm sure your attempts so far have proven just how much of a pain this particular function is.
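The parsing step is the part that matters here: the dialog returns one string, while the colour needs four separate numbers. As an illustration only, here is the same idea sketched in Python rather than AppleScript. The function name parse_cmyk is made up, and the 0 to 100 range check is an assumption about how CMYK percentages are entered.

```python
def parse_cmyk(raw):
    """Turn a string such as "82,72,49,46" into a list of four numbers."""
    parts = [part.strip() for part in raw.split(",")]
    if len(parts) != 4:
        raise ValueError(f"expected 4 comma-separated values, got {len(parts)}")
    # Each piece is still text at this point; convert it explicitly.
    values = [float(part) for part in parts]
    for value in values:
        # Assumption: CMYK components are percentages between 0 and 100.
        if not 0 <= value <= 100:
            raise ValueError(f"CMYK value out of range: {value}")
    return values
```

Validating before handing the list to the application turns a typo in the dialog into a readable error instead of "Invalid parameter."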
You may also want to use the AppleScript “choose color” command, which presents a color picker, rather than presenting your user with a dialog into which they have to enter numeric color values.
This example inserts RGB colors as text into a BBEdit window, but you would use the same principle to insert CMYK colors as text into InDesign.
tell application "BBEdit"
set theColorValues to choose color
set theR to round (the first item of theColorValues) / 257
set theG to round (the second item of theColorValues) / 257
set theB to round (the third item of theColorValues) / 257
set theRGBColor to "rgb(" & theR & ", " & theG & ", " & theB & ")"
set selection to theRGBColor
end tell
set primaryColor to text returned of (display dialog "Enter CMYK values of Primary Color (separated by commas e.g. 0,0,0,0)" default answer "") as string
set text item delimiters to ","
set colorvalue to {}
repeat with color from 1 to count of text items of primaryColor
copy (text item color of primaryColor as number) to end of colorvalue
end repeat
set colorname to primaryColor
tell application "Adobe InDesign CC 2017"
tell active document
set newcolor to make color with properties {name:colorname, space:CMYK, model:process, color value:colorvalue}
end tell
end tell

How to print R graphics to multiple pages of a PDF and multiple PDFs?

I know that pdf() will print to a PDF in R. What if I want to
1) Make a loop that prints subsequent graphs on new pages of a PDF file (appending to the end)?
2) Make a loop that prints subsequent graphs to new PDF files (one graph per file)?
Did you look at help(pdf) ?
pdf(file = ifelse(onefile, "Rplots.pdf", "Rplot%03d.pdf"),
width, height, onefile, family, title, fonts, version,
paper, encoding, bg, fg, pointsize, pagecentre, colormodel,
useDingbats, useKerning)
file: a character string giving the name of the file. For use with
'onefile=FALSE' give a C integer format such as
'"Rplot%03d.pdf"' (the default in that case). (See
'postscript' for further details.)
For 1), you keep onefile at the default value of TRUE. Several plots go into the same file.
For 2), you set onefile to FALSE and choose a filename with the C integer format and R will create a set of files.
Not sure I understand.
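Appending to same file (one plot per page):
pdf("plots.pdf") # placeholder file name
for (i in 1:10){
  plot(rnorm(10)) # any plotting call
}
dev.off()
New file for each loop:
for (i in 1:10){
  pdf(sprintf("plot%02d.pdf", i)) # placeholder names: plot01.pdf, plot02.pdf, ...
  plot(rnorm(10))
  dev.off()
}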
pdf(file = "Location_where_you_want_the_file/name_of_file.pdf", title="if you want any")
plot() # Or other graphics you want to have printed in your pdf
You can plot as many things as you want; each plot is added to the PDF on its own page. dev.off() closes the connection to the file, at which point the PDF is created and you will see something like
null device 1
