How do I get all the web links from a website?

I want to get all the links (web posts) available on a website, and if a new post is added to the website I should be able to get its link. I will have a list of 10 websites, and the link extraction process needs to run periodically.
Can someone help me figure out how to get only the post links, and also any new post link that is added?

I would suggest writing a PHP script (since you mentioned PHP) which is called periodically by a cron job. Inside the script you can:
Option 1: Use a curl request which automatically fetches the full content of a URL. (This may be better if you have to send some information to the website with the POST method.)
Option 2: Use the file_get_contents function to get the full contents.
Then you can parse the result with a regular expression to extract the parts you are interested in (for example, search for something like <div class="post">...</div>). After that you can add the information to your database, or just check whether it is already there.
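A minimal sketch of that idea (shown in Python purely for illustration; the URL, the "post" class name and the link pattern are assumptions about the target markup):

import re
import urllib.request

# Hypothetical list of sites to check; in practice this script would be run by cron.
sites = ["https://example.com/blog"]
seen_links = set()  # in practice, look these up in your database instead

for url in sites:
    html = urllib.request.urlopen(url).read().decode("utf-8", errors="replace")
    # Very rough pattern: grab href values of links marked with a "post" class.
    # Adjust the expression to the real markup of each site.
    for link in re.findall(r'<a[^>]*class="post"[^>]*href="([^"]+)"', html):
        if link not in seen_links:
            seen_links.add(link)       # remember it so it is reported only once
            print("New post:", link)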

Related

How to download a file from website by using logic app?

How are you doing?
I'm trying to download an Excel file from a website (specifically DataCamp) in order to use its data in an automated process, but before getting the file it is necessary to sign in on the page. I was thinking this would be possible with a JSON query on the HTTP action, but to be honest I don't know where to start (I'm new to Azure).
The process that I need to emulate to extract the file would be as follows (I know this could be done with an API or RPA, but I don't have either available for now).
Could you give me some advice (how to get the desired result, or at least where to start researching)? Is this even possible?
Best regards.
If you don't have other options (e.g. your source being available on an SFTP server), then using an HTTP action should work; pass the body to your next action (e.g. you might want to persist it to a blob if the content is binary).
If your content is "readable" (e.g. JSON or CSV) and you want to load it for processing, you need to ensure, for large files, that you read it in chunks so it is loaded completely before processing.
Detailed explanation at https://learn.microsoft.com/en-us/azure/logic-apps/logic-apps-handle-large-messages#download-content-in-chunks
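Outside of Logic Apps, the chunked-download idea looks roughly like this (a sketch in Python using the requests library; the URL and file name are placeholders, and the sign-in step is not shown):

import requests

url = "https://example.com/exports/report.xlsx"  # placeholder export URL

with requests.get(url, stream=True) as resp:
    resp.raise_for_status()
    with open("report.xlsx", "wb") as f:
        # Read the response in 1 MB chunks instead of loading it all into memory.
        for chunk in resp.iter_content(chunk_size=1024 * 1024):
            f.write(chunk)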

Is it possible to have a link to the raw content of a file in Azure DevOps?

It's possible to generate a link to the raw content of a file in GitHub; is it possible to do this with VSTS/Azure DevOps?
Even after reading the existing answers, I still struggled with this a bit, so I wanted to leave a more thorough response.
As others have said, the pattern is (query split onto separate lines for ease of reading):
https://dev.azure.com/{{organization}}/{{project}}/_apis/sourceProviders/{{providerName}}/filecontents
?repository={{repository}}
&path={{path}}
&commitOrBranch={{commitOrBranch}}
&api-version=5.0-preview.1
But how do you find the values for these variables? If you go into your Azure DevOps, choose Repos > Files from the left navigation, and select a particular file, your current url should look something like this:
https://dev.azure.com/{{organization}}/{{project}}/_git/{{repository}}?path=%2Fpackage.json
You should use those values for organization, project, and repository. For path, you'll see a URL-encoded version of the unix file path. %2F is the URL encoding for /, so that path is actually just /package.json (a tool like Postman will do that encoding for you).
Commit or branch is pretty self-explanatory; you either know what you want for this value or you should use master. I have "hard-coded" the API version in the above URL because that's what the documentation currently points to.
For the last variable, you need providerName. In short, you should probably use TfsGit. I got this value from looking through the list of source providers and looking for one with a value of true for supportedCapabilities.queryFileContents.
However, if you just request this URL you'll get a "203 Non-Authoritative Information" response back because you still need to authenticate yourself. Referring again to the same documentation, it says to use Basic auth with any value for the username and a personal access token for the password. You can create a personal access token at https://dev.azure.com/{{organization}}/_usersSettings/tokens; ensure that it has the Token Administration - Read & Manage permission.
If you're unfamiliar with this sort of thing, again Postman is super helpful with getting these requests working before you get into the code.
So if you have a repository with a src directory at the root, and you're trying to get the file contents of src/package.json, your URL should look something like:
https://dev.azure.com/{{organization}}/{{project}}/_apis/sourceProviders/TfsGit/filecontents?repository={{repository}}&commitOrBranch=master&api-version={{api-version}}&path=src%2Fpackage.json
And don't forget the basic auth!
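As a rough illustration, the request described above might look like this in Python (the organization, project, repository, path and token values are placeholders):

import requests

organization = "myOrg"          # placeholder values
project = "myProject"
repository = "myRepo"
pat = "<personal-access-token>"

url = (
    f"https://dev.azure.com/{organization}/{project}"
    "/_apis/sourceProviders/TfsGit/filecontents"
)
params = {
    "repository": repository,
    "path": "/src/package.json",
    "commitOrBranch": "master",
    "api-version": "5.0-preview.1",
}

# Basic auth: any username plus the personal access token as the password.
resp = requests.get(url, params=params, auth=("user", pat))
resp.raise_for_status()
print(resp.text)  # raw file contents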
Sure, here's the REST call needed:
GET https://feeds.dev.azure.com/{organization}/_apis/packaging/Feeds/{feedId}/packages/{packageId}?includeAllVersions={includeAllVersions}&includeUrls={includeUrls}&isListed={isListed}&isRelease={isRelease}&includeDeleted={includeDeleted}&includeDescription={includeDescription}&api-version=5.0-preview.1
https://learn.microsoft.com/en-us/rest/api/azure/devops/artifacts/artifact%20%20details/get%20package?view=azure-devops-rest-5.0#package
I was able to get the raw contents of a file using this URL.
GET https://dev.azure.com/{organization}/{project}/_apis/sourceProviders/{providerName}/filecontents?serviceEndpointId={serviceEndpointId}&repository={repository}&commitOrBranch={commitOrBranch}&path={path}&api-version=5.0-preview.1
I got this from here.
https://learn.microsoft.com/en-us/rest/api/azure/devops/build/source%20providers/get%20file%20contents?view=azure-devops-rest-5.0
You can obtain the raw URL using Chrome.
Turn on Developer tools and view the Network tab.
Navigate to the required file in the DevOps portal (Content panel). Once the content view is visible, check the Network tab again and find the URL which starts with "Items?Path"; this is a JSON response which contains the required "url:" element.
Drag the filename from the attachments window and drop it into any other MS application to get the raw URL or linked filename.
Most answers address this well, but in the context of a public repo with anonymous access the API is different. Here is the one that works in such a scenario:
https://dev.azure.com/{{your_user_name}}/{{project_name}}/_apis/git/repositories/{{repo_name_encoded}}/items?scopePath={{path_to_your_file}}&api-version=6.0
This is the exact equivalent of the "raw" URL provided by GitHub.
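A quick sketch of calling that endpoint anonymously from Python (organization, project, repository name and file path are placeholders):

import requests

url = "https://dev.azure.com/myOrg/myProject/_apis/git/repositories/myRepo/items"
params = {"scopePath": "/README.md", "api-version": "6.0"}

resp = requests.get(url, params=params)  # no authentication needed for a public repo
resp.raise_for_status()
print(resp.text)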
Another way that may be helpful if you want to quickly get the raw URL for a specific file that you are browsing:
install the browser extension named "Undisposition"
from the dot menu (top right) choose "Download": the file will open in a new browser tab from which you can copy the URL
(edit: unfortunately this will only work for file types that the browser knows how to open, otherwise it will still offer to download it...)
I am fairly new to this and had an issue accessing a raw file in an Azure DevOps repo. It's straightforward in GitHub.
I wanted to download a file in CMD and Bash using curl.
First I browsed to the file contents in the browser and made a note of the bold sections:
https://dev.azure.com/**myOrg**/_git/**myProjectName**?path=%2F**MyFileName.ps1**
I then constructed a URL similar to what Zach posted above.
https://dev.azure.com/**myOrg**/**myProjectName**/_apis/sourceProviders/TfsGit/filecontents?repository=**myProjectName**&commitOrBranch=**master**&api-version=5.0-preview.1&path=%2F**MyFileName.ps1**
Now when I paste the above URL in the browser it displays the content in RAW form similar to GitHub.
The difference was I had to setup a PAT (Personal Access Token) in My Azure DevOps account then authenticate the URL in DOS/BASH example below:
curl -u "<username>:<password>" "https://dev.azure.com/myOrg/myProjectName/_apis/sourceProviders/TfsGit/filecontents?repository=myProjectName&commitOrBranch=master&api-version=5.0-preview.1&path=%2FMyFileName.ps1" -# -L -o MyFileName.ps1

Posting Blog Entries to a Community

Our tool is submitting blog entries to the ideation blog of a configured community by using the Connections API.
To do this, I use the following workflow, given only a community ID:
1) Query /blogs/api/blogs?commUuid=<ID_HERE>&blogType=ideationblog
2) Retrieve the link to the community's ideation blog from the XML result of the above query. The XPath for this is "/app:service/app:workspace/app:collection[a:category[@term='entries']][1]/@href"
3) Post the created blog entry payload to this URL.
This all worked fine in our environment. However, when I deployed it at a customer, it did not work anymore. The URL from the first step returns an empty XML document, and the following steps thus cannot be executed. I tried to query different URLs on the customer's server, such as /blogs/{homepageHandle}/api/blogs?commUuid=&blogType=ideationblog, which work fine; however, the API service document queried above is the only one which contains the collection element with the link I need.
Is there any other API call I can make to get this URL? Do you know of any reason why the call works just fine in our environment but fails at the customer? Might this be an access rights problem?
I am aware that I could probably just construct a URL like "blogs/<handle>/api/entries" and post to it; however, I would prefer the above way, since I only have the communityUuid configured, and also because it is exactly the way the API documentation describes:
http://www-10.lotus.com/ldd/appdevwiki.nsf/xpDocViewer.xsp?lookupName=IBM+Connections+4.5+API+Documentation#action=openDocument&res_title=Creating_blog_posts_ic45&content=pdcontent
ServiceDoc -> Collection -> href
UPDATE:
This might really be a problem with the SBT. My assumption that an empty XML document was returned was wrong; rather, calls via the SBT Endpoint classes are returning null.
Endpoint endpoint = EndpointFactory.getEndpoint("connections");
Object result = endpoint.xhrGet("/blogs/api"); // also tried for /blogs/<homepage>/api
When I tried those URLs in the browser again, I got the complete results. The problem with all this is that I can neither reproduce it in our own environment nor debug it at the customer. I tried to catch possible exceptions, but none are thrown; the result is simply null.
To clarify: the same requests work perfectly fine in our own (Connections 4.0) environment, and also from the browser at the customer. I am of course using the same user to authenticate in the browser as in the API calls.
endpoint.isAuthenticationValid();
also returns true, so seemingly no problem there...
I gave up long ago on trying to follow IBM's documented REST API instructions (not least because it always ends in a myriad of REST requests just to get to the URL I need to send my request to).
I tried both your URLs (/blogs/api/blogs?commUuid=... and /blogs/<homepage>/api/blogs...) against all our Connections 4.5 systems, but although I do get an XML document back, it doesn't contain a reference to the ideation blog anywhere (and yes, I made sure to query against a community that does contain an ideation blog).
This is a dirty workaround, which you mentioned you did not want to do, but which I do use because the documented way doesn't work:
To post blog entries, you need to POST against
/blogs/<bloghandle>/api/entries
To find out the handle (<snx:handle>) of the ideation blog in your community, you can do the following:
1.) Get the widgets-feed for the community: /communities/service/atom/community/widgets?communityUuid=...
2.) Navigate to the entry of the Ideation Blog widget: <snx:widgetDefId>IdeationBlog</snx:widgetDefId>.
Unless someone in your customer system has messed with the widgets-config.xml, the widgetDefId will be IdeationBlog.
3.) Take the <snx:widgetInstanceId> text of the Ideation Blog entry.
That is the handle of your ideation blog. (Yes, community ideation blogs are created with the widgetInstanceId of the Ideation Blog widget as their handle; normal blogs are created with some mashup of their title as the handle.) You can now construct the URL to post the entries to.
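A rough sketch of that workaround in Python (the base URL, credentials, community UUID and the entry payload are assumptions; the namespace URIs are the ones Connections Atom feeds normally use):

import requests
import xml.etree.ElementTree as ET

base = "https://connections.example.com"   # placeholder Connections host
auth = ("user", "password")                # placeholder credentials
community_uuid = "<community-uuid>"

NS = {
    "a": "http://www.w3.org/2005/Atom",
    "snx": "http://www.ibm.com/xmlns/prod/sn",  # assumed snx namespace
}

# 1.) Get the widgets feed for the community.
feed = requests.get(
    f"{base}/communities/service/atom/community/widgets",
    params={"communityUuid": community_uuid},
    auth=auth,
)
feed.raise_for_status()
root = ET.fromstring(feed.content)

# 2.) + 3.) Find the Ideation Blog widget entry and take its widgetInstanceId.
handle = None
for entry in root.findall("a:entry", NS):
    if entry.findtext("snx:widgetDefId", namespaces=NS) == "IdeationBlog":
        handle = entry.findtext("snx:widgetInstanceId", namespaces=NS)
        break

# POST a minimal Atom entry to /blogs/<handle>/api/entries.
entry_xml = """<entry xmlns="http://www.w3.org/2005/Atom">
  <title type="text">My idea</title>
  <content type="html">Posted via the API</content>
</entry>"""

resp = requests.post(
    f"{base}/blogs/{handle}/api/entries",
    data=entry_xml.encode("utf-8"),
    headers={"Content-Type": "application/atom+xml"},
    auth=auth,
)
print(resp.status_code)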

Can I capture JSON data already being sent with a userscript/Chrome extension?

I'm trying to write a userscript/Chrome extension to capture JSON data being sent while using a web service, so that I can reformat it and display a selected portion on the page. Currently the JSON is sent as the application loads (as I've observed by watching traffic with Fiddler 2). Is my only option to request the JSON again, or is capturing it possible? As I'm not providing a code example, even some guidance on what method or topic to research, or whether I'm barking up the wrong tree, would be a welcome answer.
No easy way.
If it is for a specific site, you might look into intercepting and overwriting the part of the code which sends the request. For example, if it is sent on a button click, you can replace the existing click handler with your own implementation.
You can also try to make a proxy for XMLHttpRequest. I'm not sure if this is even possible, and I've never seen a working example, but you can look at some attempts here.
For all of these tasks you would probably need to run your JavaScript code outside of the sandboxed content script to be able to access the parent page's variables, so you would need to inject a <script> tag with your code directly into the page from the content script.

Download images containing a specific tag with likes from Instagram

I would like to download images with a certain tag from Instagram, together with their likes. With this post I hope to get some advice or tips on how to do this. I have no experience with web scraping or with using web APIs. One of my questions is: can you create a program like this in Python code, or can you only do this using a webpage?
So far I have understood the following. To get images with a certain tag you have to:
You need a valid access_token to even gain access to images by tag, which can be done like this. However, when I sign in I need to give a website. Does this indicate that you can only use the APIs on websites, rather than in a Python program for instance?
You use a media Tag Endpoint to search for tags by name.
I have no idea what the last step will return exactly, but I expect it will give me the IDs of specific images that carry the tag. Correct? Now I will also need to get the likes belonging to these images. Just like the previous step:
You use a likes endpoint to get a list of users that liked the image, of which you can of course get the length.
If I can accomplish all of these steps, it seems like I can achieve my original goal. I googled to see whether something like this was already out there. The only thing I could find was InstaRaider, but this did not seem to fit my description because it only scrapes the images from a specific user, not by tag or with likes. Any suggestions or ideas would be very helpful; I have only programmed in Python and Java before.
I can only tell you that for the URL you can use localhost, like this:
http://127.0.0.1
OR
http://localhost
I have also tried to do exactly the same before, but I could not, so I used a website to search for tags and images:
http://iconosquare.com/search/[HASHTAG]
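For what it's worth, the two calls described in the question might look roughly like this in Python (the endpoint paths and response fields are assumptions based on the old Instagram API and may no longer work; the token and tag are placeholders):

import requests

ACCESS_TOKEN = "<access-token>"   # placeholder token obtained after signing in
TAG = "sunset"                    # placeholder tag

# Step 1: recent media tagged with TAG.
media = requests.get(
    f"https://api.instagram.com/v1/tags/{TAG}/media/recent",
    params={"access_token": ACCESS_TOKEN},
).json()

for item in media.get("data", []):
    media_id = item["id"]
    image_url = item["images"]["standard_resolution"]["url"]

    # Step 2: users who liked this media item; the count is just the list length.
    likes = requests.get(
        f"https://api.instagram.com/v1/media/{media_id}/likes",
        params={"access_token": ACCESS_TOKEN},
    ).json()
    like_count = len(likes.get("data", []))

    print(image_url, like_count)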
