HiveQL string function questions

I'm using HiveQL to run the below query.
The intention is that the case statement removes the last XX characters from the end of the domain, dependent on the suffix (.com, .co.uk).
This doesn't seem to work as there is no change to the strings in the 'domainnew' column in the output.
Can anyone advise how I would make this work?
I then also need to take the output of 'domainnew' and keep only the characters to the right of the first '.' when reading from the right-hand side. For example:
domain = mobile.domain.facebook.com
domainnew = mobile.domain.facebook
newcalc = facebook
Any advice on this would be brilliant!!
Thank you
select domain, catid, apnid, sum(optimisedsize) as bytes,
CASE domain
WHEN instr(domain, '.co.uk') THEN substr(domain,LENGTH(domain)-6)
WHEN instr(domain, '.com') THEN substr(domain,LENGTH(domain)-6)
ELSE domain
END as domainnew
from udsapp.web
where dt = 20170330 and hour = 04 and loc = 'FAR1' and catid <> "0:0" group by domain, catid, apnid sort by bytes desc;

with t as (select 'mobile.domain.facebook.com' as domain)
select regexp_extract(domain,'(.*?)(\\.com|\\.co\\.uk|)$',1) as domainnew
,regexp_extract(domain,'.*?([^.]+)(\\.com|\\.co\\.uk|)$',1) as new_calc
from t
;
+------------------------+----------+
| domainnew | new_calc |
+------------------------+----------+
| mobile.domain.facebook | facebook |
+------------------------+----------+
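To see how the two patterns in the answer behave outside Hive, here is a minimal Python sketch that applies the same regular expressions with the standard re module. The function names strip_suffix and last_label are made up for illustration, and the suffix list (.com, .co.uk) is only what the question mentions. The lazy (.*?) group takes as little as possible, so the optional suffix group gets the first chance to consume the ending.

```python
import re

# Suffixes to strip; this list is an assumption taken from the question.
# The trailing empty alternative lets domains without a known suffix match too.
SUFFIX = r"(\.com|\.co\.uk|)$"


def strip_suffix(domain: str) -> str:
    """Return the domain without a trailing .com or .co.uk (hypothetical helper)."""
    return re.search(r"(.*?)" + SUFFIX, domain).group(1)


def last_label(domain: str) -> str:
    """Return the dot-separated label just before the suffix (hypothetical helper)."""
    return re.search(r".*?([^.]+)" + SUFFIX, domain).group(1)


print(strip_suffix("mobile.domain.facebook.com"))  # mobile.domain.facebook
print(last_label("mobile.domain.facebook.com"))    # facebook
```

A domain with an unlisted suffix, such as example.org, is returned unchanged by strip_suffix, which mirrors the ELSE branch of the original CASE expression.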

Related

azure app insights get a part of a string from url and display it in the name column

I am using Azure Application Insights and I want to parse part of a string from the URL and show that part in the name column:
requests
| where user_AuthenticatedId != ""
and url contains "reports" and user_AuthenticatedId == "xxx"
| project timestamp, user_AuthenticatedId, client_CountryOrRegion, client_OS, url,name
| order by timestamp asc nulls last
For example, I am getting a URL such as https://localhost:80/api/external-reports/blob/39/test 01b/false, and I want to take the test 01b part from it and show it in the name column.
I am not sure how to do this.
There are some functions that might be helpful.
First of all, you can get the different URL parts using the parse_url() function. For example, given the URL https://localhost:80/api/external-reports/blob/39/test 01b/false :
requests
| project parse_url(url)
output:
{"Scheme":"https","Host":"localhost","Port":"80","Path":"/api/external-reports/blob/39/test 01b/false","Username":"","Password":"","Query Parameters":{},"Fragment":""}
You can split the result even further using the split() method:
requests
| project split(parse_url(url).Path, "/")
output:
["","api","external-reports","blob","39","test 01b","false"]
To get the part you want, use its index:
requests
| project mycolumn = split(parse_url(url).Path, "/")[5]
output:
test 01b
When the index is greater than the number of parts, an empty result is returned. You can replace it with a value of your own using the coalesce() function (the tostring() call makes both arguments the same type):
requests
| project mycolumn = coalesce(tostring(split(parse_url(url).Path, "/")[5]), "unknown")
This shows unknown when the index is out of range or the part is empty.
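The same split-then-index idea, with a fallback for a missing part, can be sketched in Python for readers who want to try it outside Application Insights. This is only an illustration and not part of the Kusto answer: the helper name path_segment and its default value are assumptions.

```python
from urllib.parse import urlsplit


def path_segment(url: str, index: int, default: str = "unknown") -> str:
    """Return the path part at `index`, or `default` if it is missing or empty.

    Hypothetical helper mirroring split(parse_url(url).Path, "/")[index]
    combined with coalesce(..., "unknown").
    """
    parts = urlsplit(url).path.split("/")  # the leading "/" yields an empty first part
    if 0 <= index < len(parts) and parts[index]:
        return parts[index]
    return default


url = "https://localhost:80/api/external-reports/blob/39/test 01b/false"
print(path_segment(url, 5))  # test 01b
print(path_segment(url, 9))  # unknown
```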

How to extract specific sub directory names from URL

Given the following request URLs:
https://example.com/api/foos/123/bars/456
https://example.com/api/foos/123/bars/456
https://example.com/api/foos/123/bars/456/details
Common structure: https://example.com/api/foos/{foo-id}/bars/{bar-id}
I wish to get separate columns for the values of {foo-id} and {bar-id}
What I tried
requests
| where timestamp > ago(1d)
| extend parsed_url=parse_url(url)
| extend path = tostring(parsed_url["Path"])
| extend: foo = "value of foo-id"
| extend: bar = "value of bar-id"
This gives me /api/foos/{foo-id}/bars/{bar-id} as a new path column.
Can I solve this question without using regular expressions?
Related, but not the same question:
Application Insights: Analytics - how to extract string at specific position
Splitting on the '/' character gives you an array, and you can then extract the elements you are looking for as long as the path structure stays consistent. Using parse_url() is optional; you could use substring() or just adjust the indexes you retrieve.
requests
| extend path = parse_url(url)
| extend elements = split(substring(tostring(path.Path), 1), "/") // substring(..., 1) drops the leading slash
| extend foo=tostring(elements[2]), bar=tostring(elements[4])
| summarize count() by foo, bar
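As a rough cross-check of the indexes used above, the following Python sketch performs the same steps (drop the leading slash, split on '/', take elements 2 and 4, count per pair) on the URLs from the question. The function name foo_bar_ids is hypothetical, and the code assumes the /api/foos/{foo-id}/bars/{bar-id} layout stays fixed.

```python
from collections import Counter
from urllib.parse import urlsplit


def foo_bar_ids(url: str):
    """Return (foo-id, bar-id) from /api/foos/{foo-id}/bars/{bar-id}[/...].

    Hypothetical helper; returns None when the path is too short.
    """
    # After stripping the leading slash: ["api", "foos", foo-id, "bars", bar-id, ...]
    elements = urlsplit(url).path.lstrip("/").split("/")
    if len(elements) < 5:
        return None
    return elements[2], elements[4]


urls = [
    "https://example.com/api/foos/123/bars/456",
    "https://example.com/api/foos/123/bars/456",
    "https://example.com/api/foos/123/bars/456/details",
]

# Equivalent in spirit to: summarize count() by foo, bar
counts = Counter(foo_bar_ids(u) for u in urls)
print(counts)
```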

How to replace repeated words with single word

I have a string variable response:
where where where is it
I'm going there
where where did you say
sometimes it is where you think
i think its where where you go
its everywhere where you are
i am planning on going where where where i want to
As you can see, the word "where" is repeated quite often. I want to replace strings "where where" and "where where where" (or even "where where where where") with "where".
However, I don't want to replace "everywhere where" with "where".
I know I can do this manually, but I was hoping to condense the code into as few lines as possible.
This is what I have been trying so far:
gen temp = regexr(response, " (where)+ where ", " where ")
replace temp = regexr(response, "^(where)+ where ", "where ")
These are my results after running the code above:
where where is it
I'm going there
where did you say
sometimes it is where you think
i think its where where you go
its everywhere where you are
i am planning on going where where where i want to
Instead, I want the final data to look like this:
where is it
I'm going there
where did you say
sometimes it is where you think
i think its where you go
its everywhere where you are
i am planning on going where i want to
I have been using "(where)+" to capture both "where where" and "where where where" but it doesn't seem to work. I also split the code into two commands, one begins with "^(where)" and the other with " (where)" in order to avoid capturing the 'where' in "everywhere" but it seems as if the code does not capture "where where" when it occurs in the middle of the sentence.
A quick fix using Stata's string functions is the following:
clear
input str50 string1
"where where where is it"
"I'm going there"
"where where did you say"
"sometimes it is where you think"
"i think its where where you go"
"its everywhere where you are"
"i am planning on going where where where i want to"
end
generate tag1 = !strmatch(string1, "*everywhere where*")
generate tag2 = ( length(string1) - length(subinstr(string1, "where", "", .)) ) / 5
generate string2 = cond(tag1 == 1, stritrim(subinstr(string1, "where", "", tag2-1)), string1)
list string2, separator(0)
+----------------------------------------+
| string2 |
|----------------------------------------|
1. | where is it |
2. | I'm going there |
3. | where did you say |
4. | sometimes it is where you think |
5. | i think its where you go |
6. | its everywhere where you are |
7. | i am planning on going where i want to |
+----------------------------------------+
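For comparison, the behaviour asked for can also be expressed as a single regular expression with word boundaries. The sketch below uses Python's re module rather than Stata, so it is an illustration of the pattern and not a drop-in replacement for the code above. Stata's Unicode regex functions such as ustrregexra() may accept a similar pattern, but that has not been verified here.

```python
import re

# \b keeps the match from starting inside "everywhere";
# (?:\s+\1\b)+ consumes one or more further repetitions of the same word.
REPEATED = re.compile(r"\b(where)(?:\s+\1\b)+")


def collapse(text: str) -> str:
    """Replace runs of 'where where ...' with a single 'where'."""
    return REPEATED.sub(r"\1", text)


responses = [
    "where where where is it",
    "I'm going there",
    "where where did you say",
    "sometimes it is where you think",
    "i think its where where you go",
    "its everywhere where you are",
    "i am planning on going where where where i want to",
]

for line in responses:
    print(collapse(line))
```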

Using Match in a sqlite fts5 query but need more control over ranking?

I have a virtual table created using fts5:
import sqlite3
# create a db in memory
con = sqlite3.connect(':memory:')
con.execute('create virtual table operators using fts5(family, operator, label, summary, tokenize=porter)')
# some sample data
samples = {'insideTOP':
    {'label':'Inside',
     'family':'TOP',
     'summary':'The Inside TOP places Input1 inside Input2.'
    },
    'inTOP':
    {'label':'In',
     'family':'TOP',
     'summary':'The In TOP is used to create a TOP input.'
    },
    'fileinSOP':
    {'label':'File In',
     'family':'SOP',
     'summary':'The File In SOP allows you to read a file'
    }
}
# fill db with those values
for operator in samples.keys():
    opDescr = samples[operator]
    con.executescript("insert into operators (family, operator, label, summary) values ('{0}','{1}','{2}','{3}');".format(opDescr['family'],operator,opDescr['label'],opDescr['summary']))
with following columns
+--------+-----------+------------+----------------------------------------------+
| family | operator | label | summary |
+--------+-----------+------------+----------------------------------------------+
| TOP | insideTOP | Inside | The Inside TOP places Input1 inside Input2.|
| TOP | inTOP | In | The In TOP is used to create a TOP input. |
| SOP | fileinSOP | File In | The File In SOP allows you to read a file |
+--------+-----------+------------+----------------------------------------------+
an example query is:
# query the db
query = "select operator from operators where operators match 'operator:In*' or operators match 'label:In*' order by family, bm25(operators)"
result = con.execute(query)
for row in result:
    print(row)
And as a result I get
fileinSOP
insideTOP
inTOP
For this particular case though, I'd actually like the 'inTOP' to appear before the 'insideTOP' as the label is a perfect match.
What would be a good technique to be able to massage these results the way I'd like them?
Thank you very much
Markus
It would help to state your ordering rule in the question.
If you order the results only by bm25, you can't achieve the result you want.
I suggest using a custom rank function instead, as in the SQL below:
query = "select operator from operators where operators match 'operator:In*' or operators match 'label:In*' order by myrank(family, operators)"
Defining a custom rank function is very easy in FTS5; you can follow the guide on the FTS5 website.
If you also want the bm25 result as part of the rank score, you can get that score inside the rank function and use it to calculate your final score.
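If writing a custom rank function is more than is needed, one simpler alternative (a different technique from the one suggested above) is to add an exact-match expression to the ORDER BY clause, ahead of bm25. The sketch below assumes the ordering rule is "family first, then exact label matches, then bm25", and that the Python build of SQLite includes FTS5. The search term passed as a parameter is hard-coded for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "create virtual table operators using "
    "fts5(family, operator, label, summary, tokenize='porter')"
)
rows = [
    ("TOP", "insideTOP", "Inside", "The Inside TOP places Input1 inside Input2."),
    ("TOP", "inTOP", "In", "The In TOP is used to create a TOP input."),
    ("SOP", "fileinSOP", "File In", "The File In SOP allows you to read a file"),
]
con.executemany(
    "insert into operators (family, operator, label, summary) values (?, ?, ?, ?)",
    rows,
)

# Assumed ordering rule: family, then exact label match, then bm25 relevance.
# The comparison yields 1 for an exact match, so "desc" puts those rows first.
query = """
    select operator
    from operators
    where operators match 'operator:In* OR label:In*'
    order by family, (lower(label) = lower(?)) desc, bm25(operators)
"""
result = [row[0] for row in con.execute(query, ("In",))]
print(result)
```

With the three sample rows this lists inTOP ahead of insideTOP within the TOP family, because its label equals the search term exactly.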

Jade/Pug if else condition usage

I'm sending a date to a .jade file from my .js file using Node.js. When the #{date} field is false, it executes the else branch and prints man as its answer. What could be going wrong?
if #{date} == false
  | #{date}
else
  | man
If date is false, do you want to output the string 'man'?
If yes, your if and else statements are the wrong way around...
How about:
if date
  = date
else
  | man
or even:
| #{date ? date : 'man'}
or simply:
| #{date || 'man'}
Within an if expression, write plain variable names, without #{...}:
if date == false
  | #{date}
else
  | man
Your statement was backwards. As for the syntax, this style works:
p Date:
  if date
    | #{date}
  else
    | man
It's correct that you don't need the #{} within the if expression. I was not able to get the = form, or the other approaches from the other answers, to work.
Ternary Style
I was also looking for a way to do this on one line with the ternary operator. I whittled it down to this:
p Date: #{(date ? date : "man")}
Alternatively, you can use a var, which adds one more line but is still fewer lines than the OP's version:
- var myDate = (date ? date : "man")
p Date: #{myDate}
I was not able to get the following to work, as suggested in another answer.
| #{date ? date : 'man'}
