Extracting rootdomains from URL string in Google Sheets

Hi, I am trying to extract the root domain from a URL string in Google Sheets. I know how to get the domain, and I have a formula that removes www., but I now realize it does not strip subdomain prefixes: for mysite.site.com, the mysite part is left in the domain name.
Question: How can I retrieve the domain.com root domain, where the domain string contains alphanumeric characters, then one dot, then alphanumeric characters (and nothing more)?
Formula so far in Google Sheets:
=REGEXREPLACE(REGEXREPLACE(D3923;"(http(s)?://)?(www\.)?";"");"/.*";"")
Maybe this can be simplified ...
Test cases
https://www.domain.com/ => domain.com
https://domain.com/ => domain.com
http://www.domain.nl/ => domain.com
http://domain.de/ => domain.com
http://www.domain.co.uk/ => domain.co.uk
http://domain.co.au/ => domain.co.au
sub.domain.org/ => sub.domain.com
sub.domain.org => sub.domain.com
domain.com => domain.com
http://www.domain.nl?par=1 => domain.com
https://www.domain.nl/test/?par=1 => domain.com
http2://sub2.startpagina.nl/test/?par=1 => domain.com

Currently using:
=trim(REGEXEXTRACT(REGEXREPLACE(REGEXREPLACE(A2;"https?://";"");"^(w{3}\.)?";"")&"/";"([^/?]+)"))
Seems to work fine
Updated:7-7-2016
(thanks for all the help!)

I think the most reliable way is to check against a list of TLDs, because suffixes such as co.uk and gov.uk are impossible to extract with a simple regex.
You can define these functions in Tools -> Script editor:
function endsWith(str, searchString) {
  // True when searchString sits at the very end of str.
  var position = str.length - searchString.length;
  var lastIndex = str.lastIndexOf(searchString);
  return lastIndex !== -1 && lastIndex === position;
}
function rawToTlds(raw) {
  // Keep lines that start with a word character; anything else
  // (comments, blank lines, wildcard and exception rules) is dropped.
  var letter = /^\w/;
  return raw.split(/\n/).filter(function (t) { return letter.test(t); });
}
function compressString(s) {
  var zippedBlob = Utilities.gzip(Utilities.newBlob(s));
  return Utilities.base64Encode(zippedBlob.getBytes());
}
function uncompressString(x) {
  var zippedBytes = Utilities.base64Decode(x);
  var zippedBlob = Utilities.newBlob(zippedBytes, 'application/x-gzip');
  var stringBlob = Utilities.ungzip(zippedBlob);
  return stringBlob.getDataAsString();
}
function getTlds() {
  var cacheName = 'TLDs';
  var cache = CacheService.getScriptCache();
  var base64Encoded = cache.get(cacheName);
  if (base64Encoded != null) {
    return uncompressString(base64Encoded).split(',');
  }
  var raw = UrlFetchApp.fetch('https://publicsuffix.org/list/public_suffix_list.dat').getContentText();
  var tlds = rawToTlds(raw);
  // Cache the gzipped list for 21600 seconds (6 hours).
  cache.put(cacheName, compressString(tlds.join()), 21600);
  return tlds;
}
function getDomainName(url, level) {
  var tlds = getTlds();
  // Strip the scheme, a leading "www.", the path, and the query string.
  var domain = url
    .replace(/^http(s)?:\/\//i, "")
    .replace(/^www\./i, "")
    .replace(/\/.*$/, "")
    .replace(/\?.*/, "");
  if (typeof level === 'undefined') {
    return domain;
  }
  var result = domain;
  var longest = 0;
  // Find the longest matching suffix, then keep level - 1 labels in front of it.
  for (var i = 0; i < tlds.length; i++) {
    var tld = '.' + tlds[i];
    if (endsWith(domain, tld) && tld.length > longest) {
      var parts = domain.substring(0, domain.length - tld.length).split('.');
      result = parts.slice(parts.length - level + 1, parts.length).join('.') + tld;
      longest = tld.length;
    }
  }
  return result;
}
To get the second-level domain of A1, use it like this:
=getDomainName(A1, 2)
To get the full domain of A1, just use:
=getDomainName(A1)
EDIT
The Public Suffix List has exceeded 100 KB and no longer fits in the Apps Script cache, so I'm gzipping it now.
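The longest-suffix idea behind getDomainName can also be tried outside Apps Script. The following is a minimal sketch in plain JavaScript, not the answer's code: SAMPLE_SUFFIXES is a tiny hard-coded stand-in for the real Public Suffix List, and hostOf and rootDomain are hypothetical helper names introduced only for this illustration.

```javascript
// Hypothetical, hard-coded sample of public suffixes; the real list is far longer.
const SAMPLE_SUFFIXES = ['com', 'org', 'nl', 'de', 'uk', 'co.uk', 'au', 'co.au'];

// Strip the scheme, a leading "www.", and everything from the first "/", "?" or "#".
function hostOf(url) {
  return url
    .replace(/^[a-z][a-z0-9+.-]*:\/\//i, '')
    .replace(/^www\./i, '')
    .replace(/[\/?#].*$/, '');
}

// Return the label directly left of the longest matching suffix, plus that suffix.
function rootDomain(url, suffixes = SAMPLE_SUFFIXES) {
  const host = hostOf(url).toLowerCase();
  let best = '';
  for (const suffix of suffixes) {
    if (host.endsWith('.' + suffix) && suffix.length > best.length) {
      best = suffix;
    }
  }
  if (best === '') return host; // no known suffix: return the host unchanged
  const labels = host.slice(0, host.length - best.length - 1).split('.');
  return labels[labels.length - 1] + '.' + best;
}

console.log(rootDomain('http://www.domain.co.uk/'));          // domain.co.uk
console.log(rootDomain('sub.domain.org/'));                   // domain.org
console.log(rootDomain('https://www.domain.nl/test/?par=1')); // domain.nl
```

Picking the longest suffix is what keeps domain.co.uk from being cut down to co.uk; with only 'uk' in the list, the result would be co.uk.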

try:
=INDEX(IFERROR(REGEXEXTRACT(A1:A,
"^(?:https?:\/\/)?(?:ftp:\/\/)?(?:www\.)?([^\/]+)")))

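To see what the pattern in the formula above captures, it can be run as a JavaScript regex. This is only a sketch, assuming the pattern behaves the same way in JavaScript as in Sheets, which seems reasonable because it uses only basic constructs; extractHost is a hypothetical helper name.

```javascript
// Same pattern as in the Sheets formula, written as a JavaScript regex literal.
const HOST_PATTERN = /^(?:https?:\/\/)?(?:ftp:\/\/)?(?:www\.)?([^\/]+)/;

// Return the first capture group, or an empty string when nothing matches
// (mirroring what IFERROR does in the formula).
function extractHost(url) {
  const match = HOST_PATTERN.exec(url);
  return match ? match[1] : '';
}

console.log(extractHost('https://www.domain.com/'));    // domain.com
console.log(extractHost('sub.domain.org/'));            // sub.domain.org
console.log(extractHost('http://www.domain.nl?par=1')); // domain.nl?par=1
```

The last two lines show the limits: subdomains are kept, and a query string that is not preceded by a slash stays attached to the host.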
Related

Is there a regex to be able to match two URLs, one that has a wildcard and one that doesn't?

I am writing a program in Node.js with the following scenario.
I have an array of URLs that include wildcards, such as the following:
https://*.example.com/example/login
http://www.example2.com/*/example2/callback
Secondly, I have an incoming redirect URL that I need to validate against the array of URLs above. I was wondering if there is a way, using regex or anything else, to do something like arr.includes(incomingRedirectUrl) and compare the two.
I can match non-wildcard URLs using array.includes(incomingRedirectUrl), but when it comes to matching the entries that have wildcards, I cannot think of a solution.
For example,
https://x.example.com/example/login should work because it matches the first URL in the above example, only replacing the "*" with the x.
Is there a way I can achieve this? Or do I have to break down the URLs using something like slice at the "*" to compare the two?
Thanks in advance for any help.
for (let i = 0; i < arr.length; i++) {
  if (arr[i].indexOf('*') !== -1) {
    wildcardArr.push(arr[i]);
  } else {
    noWildcardArr.push(arr[i]);
  }
}
// Note: I check noWildcardArr first because most of the valid redirect URLs do not contain a wildcard
if (noWildcardArr.includes(incomingRedirectUrl)) {
  // Validated correct URL, proceed with the next part of my code (this part already works)
} else if (wildcardArr.includes(incomingRedirectUrl)) {
  // Need to figure out this logic here; not sure if the above is possible without formatting wildcardArr, but the URL should be validated if it matches with the wildcard
} else {
  log.error('authorize: Bad Request - Invalid Redirect URL');
  context.res = {
    status: 400,
    body: 'Bad Request - Invalid Redirect URL',
  };
}

You could compile your URL array into proper regular expressions and then iterate over them to see whether any of them matches, similar to how a web framework handles URL path parameters such as /users/:id.
function makeMatcher(urls) {
  const compiled = urls.map(url => {
    // regex-escape the url but don't escape *
    let exp = url.replace(/[-[\]{}()+?.,\\^$|#\s]/g, '\\$&');
    // replace * with .+ for the wildcard
    exp = exp.replaceAll('*', '.+');
    // the expression is used to create the match function
    return new RegExp(`^${exp}$`);
  });
  // return the match function, which returns true on the first match,
  // or false if there is no match at all
  return function match(url) {
    return compiled.some(regex => regex.test(url));
  };
}
const matches = makeMatcher([
  'https://*.example.com/example/login',
  'http://www.example2.com/*/example2/callback'
]);
// these 2 should match
console.log(matches('https://x.example.com/example/login'));
console.log(matches('http://www.example2.com/foo/example2/callback'));
// this one not
console.log(matches('http://nope.example2.com/foo/example2/callback'));
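One caveat when the matcher guards redirect URLs: a .+ wildcard also matches slashes, so a pattern like https://*.example.com/example/login accepts https://evil.test/.example.com/example/login. The following is a sketch of a stricter variant, not part of the answer above, in which * is assumed to stand for one or more characters other than "/". The names escapeExceptStar and makeStrictMatcher, and the evil.test host, are made up for this illustration.

```javascript
// Escape regex metacharacters, leaving "*" untouched for the next step.
function escapeExceptStar(pattern) {
  return pattern.replace(/[-[\]{}()+?.,\\^$|#\s]/g, '\\$&');
}

// Compile patterns so that "*" matches one or more characters but never "/".
function makeStrictMatcher(patterns) {
  const compiled = patterns.map(
    p => new RegExp('^' + escapeExceptStar(p).split('*').join('[^/]+') + '$')
  );
  return url => compiled.some(regex => regex.test(url));
}

const strictMatches = makeStrictMatcher([
  'https://*.example.com/example/login',
  'http://www.example2.com/*/example2/callback'
]);

console.log(strictMatches('https://x.example.com/example/login'));          // true
console.log(strictMatches('https://evil.test/.example.com/example/login')); // false
```

Whether a wildcard should be allowed to span several path segments depends on the application; if it should, [^/]+ is too strict and the host part would need to be checked separately.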

How do I train Bixby to recognize a wild card search term?

I have an action FindPage.js that finds pages and retrieves them for display as results. I understand how to train it to find pages with utterances like "Read the Twitter Search page" or "Read the Searchable Text page". The training treats "Twitter Search" as SearchTerm and the code below matches SearchTerm to the tag field in the data. But how would I train to understand a command like "Read all pages"? I want the code to carry out a search on the wildcard and bring back all available pages.
// search for informational pages
var console = require('console');
const PAGES = require('./content/pages')
var pages = PAGES
console.log('pages are', pages)
exports.function = function findPage (searchTerm) {
  console.log('searchTerm is', searchTerm)
  var matches = []
  pages = PAGES
  for (var i = 0; i < pages.length; i++) {
    if (searchTerm == pages[i].tag) {
      matches.push(pages[i])
    } else {
      console.log('no tag matches')
    }
  }
  console.log('matches are', matches)
  return matches
}
Training:
[g:Page] Read the (Twitter Search)[v:SearchTerm] page.

This works, although I feel it is somewhat clunky to hardcode a conversion from "all" to the string that includes() treats as a wildcard, which is ''.
exports.function = function findPage (searchTerm) {
  //console.log('searchTerm is', searchTerm)
  if (searchTerm == 'all') {
    searchTerm = ''
    console.log('searchTerm is all', searchTerm)
  } else {
    console.log('searchTerm is not all', searchTerm)
  }
  var pages = PAGES
  // includes('') is true for every string, so an empty searchTerm keeps every page
  var matches = pages.filter(function (page) {
    return page.tag.includes(searchTerm)
  })
  return matches
}
See the examples at https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/includes#Examples:
const str = 'To be, or not to be, that is the question.';
console.log(str.includes('To be')); // true
console.log(str.includes('')) // true
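The empty-string trick can be tried on its own in plain JavaScript. This is a sketch under assumptions, not tested in the Bixby runtime: SAMPLE_PAGES stands in for ./content/pages, MATCH_ALL_WORDS is a made-up list of words meaning "everything", and unlike the code above the comparison is case-insensitive and a missing search term returns all pages.

```javascript
// Hypothetical sample data standing in for ./content/pages.
const SAMPLE_PAGES = [
  { tag: 'Twitter Search' },
  { tag: 'Searchable Text' }
];

// Words treated as "match everything"; this list is an assumption.
const MATCH_ALL_WORDS = ['all', 'every', 'everything'];

function findPages(pages, searchTerm) {
  const term = (searchTerm || '').trim().toLowerCase();
  const needle = MATCH_ALL_WORDS.includes(term) ? '' : term;
  // includes('') is true for every string, so an empty needle returns all pages.
  return pages.filter(page => page.tag.toLowerCase().includes(needle));
}

console.log(findPages(SAMPLE_PAGES, 'all').length);            // 2
console.log(findPages(SAMPLE_PAGES, 'twitter search').length); // 1
```

Keeping the match-all words in one array at least confines the hardcoding to a single place.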

Hybris prettyURL showing PK instead of the real file name

The file name of every image appears in the URL like this:
/de-de/medias/sys_master/images/images/h9c/h5f/8796178743326/8796178743326.jpg
Instead of 8796178743326.jpg, it should be file-name.jpg.
I have already set media.legacy.prettyURL=true.
8796178743326 is the PK of the image.
Any help would be appreciated!

With prettyURL, if the media instance has no realFileName value, the URL ends with the PK instead of the real file name:
/medias/sys_master/images/images/h9c/h5f/8796178743326/8796178743326.jpg
If you really want the file name in the URL, you have to edit the respective media item from Backoffice or ImpEx and assign a value to the realFileName attribute.
Have a look at the assembleLegacyURL method of the LocalMediaWebURLStrategy class:
String realFileName = this.getRealFileNameForMedia(mediaSource);
if (realFileName == null) {
    basePath = mediaSource.getLocation().substring(0, lastDotIdx);
    lastDotIndexForRealFileName = StringUtils.lastIndexOf(basePath, '/');
    String fileName = basePath.substring(lastDotIndexForRealFileName + 1);
    sb.append(basePath).append("/").append(fileName).append('.').append(fileExtension);
} else {
    basePath = location.substring(0, lastDotIdx);
    lastDotIndexForRealFileName = realFileName.lastIndexOf(46);
    if (lastDotIndexForRealFileName != -1) {
        realFileName = realFileName.substring(0, lastDotIndexForRealFileName);
    }
    sb.append(basePath).append("/").append(realFileName).append('.').append(fileExtension);
}

Node JS: Add "/" on end of all url if there is not

How can I add "/" to the end of every URL that does not already have one in Node.js / Express?
Thank you in advance.

All you need to do is check whether the last character of the string is "/", and if it's not, add it.
Like this:
var addSlash = function (str) {
  return str.slice(-1) !== "/" ? (str + "/") : str;
};

var url = require('url');
function addSlash(str) {
  // Parse the URL so the slash is added to the path, not after the query string.
  var u = url.parse(str);
  if (u.pathname.substr(-1) !== "/") {
    u.pathname += "/";
  }
  return url.format(u);
}

lastIndexOf returns the last position of a slash; if that is not the end of the string, we add a slash to the URL.
function addSlash(url) {
  return url.lastIndexOf("/") === url.length - 1 ? url : url + "/";
}
No modules required.
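The string-only checks above put the slash after any query string. As a sketch of an alternative, assuming absolute URLs and a Node.js version that provides the global URL class, the slash can be appended to the path alone; addTrailingSlash is a hypothetical name used only here.

```javascript
// Append "/" to the path only, leaving the query string and fragment in place.
function addTrailingSlash(str) {
  const u = new URL(str); // throws on relative or malformed URLs
  if (!u.pathname.endsWith('/')) {
    u.pathname += '/';
  }
  return u.href;
}

console.log(addTrailingSlash('http://example.com/a?x=1')); // http://example.com/a/?x=1
console.log(addTrailingSlash('http://example.com/a/'));    // http://example.com/a/
```

For relative paths such as the ones Express exposes on a request, the same idea would need a base URL passed as the second argument to the URL constructor.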

Codeigniter Route multiple domains to same controller and function but with different parameters

Is there a way to route multiple domains to a single controller/function but with different parameters?
For example:
some_domain.com -> sites/display/site_slug_1
other_domain.com -> sites/display/site_slug_2
"sites" is the controller and "display" is the function.
Is it possible to just add new domains to the routes.php file and have them redirected to the proper URIs?
Can't answer my own question so I'm posting the solution here:
I ended up adding something like this to the routes.php file
//define each domain and its route
$sites_routes = array();
$sites_routes['domain1.com'] = 'sites/display/site_slug_1';
$sites_routes['domain2.com'] = 'sites/display/site_slug_2';
//get domain name
$host = $_SERVER['HTTP_HOST'];
preg_match("/[^\.\/]+\.[^\.\/]+$/", $host, $matches);
//build the routes
if(isset($sites_routes[$matches[0]]))
{
    $route['default_controller'] = $sites_routes[$matches[0]];
    $route['(:any)'] = $sites_routes[$matches[0]].'/$1';
}
else
{
    $route['default_controller'] = 'home';
}

You could use .htaccess on the different domains to do the mapping

//define each domain and its route
$sites_routes = array();
$sites_routes['meilibosi.com'] = 'mlbs';
$sites_routes['qunar.ir'] = 'longyueco';
//get domain name
$host = $_SERVER['HTTP_HOST'];
preg_match("/[^\.\/]+\.[^\.\/]+$/", $host, $matches);
//build the routes
if(isset($sites_routes[$matches[0]]))
{
    $route['default_controller'] = $sites_routes[$matches[0]];
    $route['(:any)'] = $sites_routes[$matches[0]]."/$1";
}
else
{
    $route['default_controller'] = "mlbs";
}
