How can I extract info from downloaded webpages? - linux

I have to extract info such as the college name, contact number, email IDs, etc. in a systematic order from thousands of webpages. Is there any script for doing it?
Complete scenario: I downloaded the webpages containing individual colleges' info using wget. There are about 5000 webpages, each with information about one college, but I am interested in just the name, email IDs, website and contact numbers. And I need the extracted info saved to a suitable file in a systematic order.
How can one extract this info? How can I use grep to do it? Is there a better way of doing it? What scripts are available for pulling out info?
PS: I use Ubuntu and Kali Linux. I am a newbie and need an expert's help.

I assume you have all the files in one directory; cd to it and run:
grep -i -e "Name" -e "email" -e "http" ./*
and refine that once you see the result. That will write to your screen; when it looks right, append
>> my_collected_things.txt
to the command to get the output into a file.
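If you want the output split into separate, cleaner lists, something along these lines may help (a rough sketch; the regular expressions are only approximations and the output file names are made up):
# email addresses found in any of the downloaded pages
grep -E -o -h '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}' ./* | sort -u > emails.txt
# anything that looks like a website address
grep -E -o -h 'https?://[^"<> ]+' ./* | sort -u > websites.txt
# phone-like numbers: long digit runs, allowing spaces, dashes and a leading +
grep -E -o -h '\+?[0-9][0-9 -]{8,}[0-9]' ./* | sort -u > phones.txt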

Related

shell script to check if images in a folder are being used by a set of HTML files

Some time ago I worked on a team that developed a number of educational software packages, and now they are being reviewed for bugs and updates.
During this process, I noticed that the "imgs" folder has accumulated too many files. Probably one of the developers decided to put all the images used by each of the packages into that folder. However, because there are so many packages, it would be too painful to check all of them manually (and some of the images are part of the layout, almost invisible).
Is there a way to write a shell script in Linux to check if the files in a given folder are being used by a set of HTML and JS files in another folder?
Go to the images folder and try this:
for name in *; do grep -ril "$name" /path/to/soft/* || echo "$name not used"; done
I'm not sure I understood your question correctly, but maybe this will help you:
ls -1 your_source_path | while read -r file
do
    # report every file under the destination that mentions this name
    grep -wnr "$file" your_destination_path ||
        echo "no match for file $file"
    # you can add any extra action here
done
In your_source_path you put the directory whose file names should be listed, and your_destination_path is the directory in which the script should search for them.
It is not possible to check this for the generic case, since HTML and JavaScript are too dynamic (e.g. the JavaScript code could build the image file name on the fly). Likewise, images can be specified in CSS style sheets, inline styles, etc.
You want to review the HTML/JS files and see whether it is possible to identify the tags that are actually used to specify images. That will hopefully reduce the number of tags and attribute names that need to be extracted; see the sketch below.
As an alternative, if you have access to the web server's access log, you can find out which images have been accessed over time and focus the search on the images not referenced in the log file.
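If you do go down the tag-extraction route, here is a rough sketch of the idea (the folder paths, the attributes searched and the image extensions are all assumptions and will need adjusting; dynamically built file names will still slip through):
# collect every value that appears in src="...", href="..." or url(...)
grep -E -r -o -h 'src="[^"]+"|href="[^"]+"|url\([^)]+\)' /path/to/soft \
    | grep -E -o '[A-Za-z0-9_./-]+\.(png|jpe?g|gif|svg)' \
    | xargs -r -n1 basename | sort -u > referenced.txt
# list the images that are never referenced
for img in /path/to/imgs/*; do
    grep -qxF "$(basename "$img")" referenced.txt || echo "$img not referenced"
done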

A Study on the Modification of PDF in nodejs

Project Environment
Our development environment is Windows 10, with Node.js 10.16.0 and the Express web framework. The actual deployment environment is a Linux Ubuntu server; the rest is the same.
What technology do you want to implement?
What I want to implement is to take the information a user entered when registering for membership, for example name, age, address and phone number, and automatically put it into the input text boxes of a PDF form, so that the user only needs to fill in the remaining information. (The PDF is on some of the webpages.)
Once all the information is entered, the PDF is saved and the document is sent to another vendor, which is the end of the process.
Current Problems
We spent about four days looking at PDFs, and we tried to create PDFs by implementing the outline, structure, and code just as described at https://web.archive.org/web/20141010035745/http://gnupdf.org/Introduction_to_PDF
However, most real PDFs are not this simple; their streams are compressed with /FlateDecode. So I also looked at "Data extraction from /Filter /FlateDecode PDF stream in PHP" and tried to decompress them using QPDF.
After decompressing, I thought it would be easy to find the differences between the PDF with "Kim" entered in the name field and the PDF without it.
However, there are far too many differences even though only three characters were added... and the PDF structure itself is too difficult and complex to work through by hand.
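For reference, the decompression step looks roughly like this with QPDF (a sketch; the file names are placeholders):
qpdf --qdf --object-streams=disable before.pdf before-qdf.pdf
qpdf --qdf --object-streams=disable after.pdf after-qdf.pdf
# QDF output is mostly plain text, so an ordinary diff shows what changed
diff before-qdf.pdf after-qdf.pdf | less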
Note: https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf (the official PDF specification, in English)
Is there a way to solve the problem now?
It sounds like you want to create a PDF from scratch and possibly extract data from it, and you are finding this a more difficult prospect than you first imagined.
Check out my answer here on why PDF creation and reading is non-trivial and why you should reach for a tool to help you do this:
https://stackoverflow.com/a/53357682/1669243

how to show contents of the file rather than filename when searching by solr

I have a lot of PDF files (with text inside), and I want to build a simple search engine that finds the sentences containing given keywords. After several hours of searching, I chose Solr as the tool.
I am new to Solr. I downloaded the latest Solr, 6.5.0, and set it up on Windows 7.
I used the following commands to create a collection called gettingstarted, and I can search by visiting the link http://localhost:8983/solr/gettingstarted/browse
bin\solr.cmd start
bin\solr.cmd create -c gettingstarted
java -Dauto -Dc=gettingstarted -Drecursive -jar example/exampledocs/post.jar *.pdf
However, it only shows the filename that contains the keyword rather than the matching lines of the file. The following picture shows this case:
I also tried the integrated example called techproducts and, to my surprise, it can show the exact sentences that contain the keywords. The following picture shows this case:
So my question is whether I can do something so that the sentences containing the keywords are also shown in the first case. I don't know about Velocity, the config files, or even the underlying principles. I just want it to work and give detailed search results. I do not care about security issues, and I also do not care how it looks (ugliness is OK).
This is my first day playing with Solr, so maybe I have made some mistakes in the description. Thanks for your patience; I need your help.
http://localhost:8983/solr/gettingstarted/browse
This is the example UI application (Solritas) which comes with Solr by default.
You should use the /select request handler to query; it handles your query and retrieves the results.
http://localhost:8983/solr/gettingstarted/select?q=keyword
For indexing PDFs:
When you index a PDF, all the content inside it goes to a field called content by default.
Example:
Assuming you have already created the gettingstarted collection, navigate to the directory example/exampledocs/ and run this command:
java -Dauto -Dc=gettingstarted -jar post.jar solr-word.pdf
If it indexed successfully, go to the admin UI and search for a keyword from inside the PDF; it should return the content field with the text of the PDF as its value.
Example query request URL:
http://localhost:8983/solr/gettingstarted/select?q=solr&wt=json&indent=on
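For example, from the command line (a sketch, assuming the default schema really did put the extracted text into the content field):
# query the content field directly and limit the returned fields
curl "http://localhost:8983/solr/gettingstarted/select?q=content:solr&fl=id,content&wt=json&indent=on"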

VBScript: Extracting data from a spreadsheet to import it into a config file

A little background on what I need assistance with. I am currently working on a project that requires configuring over 500 Cisco 2911 routers. Part of the job is to take a premade config file provided by our customer and make it location specific. That involves changing the hostname and the second and third IP octets.
I have recently written a script that will actually configure the device, but now I need another script that makes the changes in the default config file and saves it as a location-specific copy. The goal is that when they launch the script, it will prompt for the location, open the config file, change the hostname, open a spreadsheet containing an IP calculator, look up the location, extract the IP from it, plug it into the necessary places, then save the file as location specific and let my first script run.
Until about a week ago I had done little to no coding for over a decade, so I am rusty and my knowledge is not that great. Any assistance would be greatly appreciated.
Too much to type in here, but let's make it simple for you.
Make a CSV file with hostname, Interface1, IP1, Mask1, Interface2, IP2, Mask2, Location, etc.
Try to put all the data you will need for each device on a single row of the CSV; then, using Python with the pexpect or telnet modules plus some bash, you should be able to do all of that in just one script (see the sketch below).
It will be a lot of work, but at the end you will have a script that will help you with future global configs like this one for 500+ devices.
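For the config-file side alone, a minimal bash sketch (the CSV columns, the placeholder names and the file names are all assumptions, not taken from your spreadsheet):
# read devices.csv (hostname,octet2,octet3,...) and fill a template config
# in which HOSTNAME, OCTET2 and OCTET3 are placeholder strings
while IFS=, read -r hostname octet2 octet3 rest; do
    sed -e "s/HOSTNAME/$hostname/g" \
        -e "s/OCTET2/$octet2/g" \
        -e "s/OCTET3/$octet3/g" \
        template.cfg > "config_${hostname}.cfg"
done < devices.csv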
Use these sites as references:
http://linux.byexamples.com/archives/346/python-how-to-access-ssh-with-pexpect/
http://pexpect.sourceforge.net/doc/
I built one with those same tools; it runs commands on 1000+ different Cisco routers, switches and ASAs and prints a report on screen at the end.
Try those tools.
Regards,

fetching a specific category / specific person photos using wget

I want to gather a large picture database for running an application. I have seen wget commands for fetching pictures from websites in general, but not for a specific person's name/folder. I was trying to fetch pictures of a specific person from Flickr, like this:
wget -r -A jpeg,jpg,bmp,gif,png https://www.flickr.com/search/?q=obama
It looks as if something is being fetched, with a lot of folders being created, but they are actually empty; no pictures are really being fetched. Am I doing something wrong?
Does anybody know how to do this, i.e. downloading a specific person's photos from Google and Flickr type websites using wget?
By default, wget does not --span-hosts. But on Flickr the bitmap files are stored on servers with a different DNS name than www.flickr.com (typically something with "static" in its name).
You may grep for such URLs in the files you retrieved during your first run. Then extend the wget parameters with --span-hosts and a corresponding list of directory names via --include-directories.
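Something along these lines (a sketch; the exact static host names are an assumption, so check the URLs you actually find in the first run):
wget -r --span-hosts --domains=flickr.com,staticflickr.com \
     -A jpeg,jpg,bmp,gif,png "https://www.flickr.com/search/?q=obama"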
Another alternative is to follow the lines of http://blog.fedora-fr.org/shaiton/post/How-to-download-whole-Flickr-album.
