Regular expression to match any tiktok video id and url - node.js

I'm trying to match all these with just one regex:
https://m.tiktok.com/h5/share/usr/6641141594707361797.html
https://m.tiktok.com/v/6749869095467945218.html
https://www.tiktok.com/embed/6567659045795758085
https://www.tiktok.com/share/user/6567659045795758085
https://www.tiktok.com/trending?shareId=6744531482393545985
https://www.tiktok.com/#burntpizza89/video/7067695578729221378?is_copy_url=1&is_from_webapp=v1
https://www.tiktok.com/#burntpizza89/video/is_copy_url=1&is_from_webapp=v1&item_id=7067695578729221378
https://vm.tiktok.com/ZMF6rgvXY/
And it works fine except for the last one. The current regex is:
"\bhttps?:\/\/(?:m|www|vm)\.tiktok\.com\/.*\b(?:(?:usr|v|embed|user|video)\/|\?shareId=|\&item_id=)(\d+)\b"gm
It handles all the digit-only ids perfectly (.tiktok.com/#burntpizza89/video/7067695578729221378), but I also need the same regex to match short links that carry an alphanumeric code instead (.tiktok.com/ZMF6rgvXY/). So for each match I would get either the digit-only id or the code made of digits and letters.

Try adding an alternative for the last part (the ~ characters are pattern delimiters, not part of the regex):
~https?://(?:www\.)?tiktok\.com/\S*/video/(\d+)|https?://(?:www\.)?vm\.tiktok\.com/\S*/~
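One way to fold the short-link case into the question's original pattern is to add a second alternative with its own capture group and return whichever group matched. This is only a minimal Node.js sketch: it assumes the vm.tiktok.com code consists of letters and digits only, and the names TIKTOK_RE and extractTikTokId are made up for illustration.

```javascript
// Group 1: numeric id after usr/, v/, embed/, user/, video/, ?shareId= or &item_id=
// Group 2: short code on vm.tiktok.com (assumed here to be letters and digits only)
const TIKTOK_RE =
  /\bhttps?:\/\/(?:(?:m|www)\.tiktok\.com\/\S*?\b(?:(?:usr|v|embed|user|video)\/|\?shareId=|&item_id=)(\d+)\b|vm\.tiktok\.com\/([A-Za-z0-9]+))/;

// Returns the numeric id or the short code, or null when the URL does not match.
function extractTikTokId(url) {
  const m = TIKTOK_RE.exec(url);
  if (m === null) return null;
  return m[1] || m[2];
}

console.log(extractTikTokId("https://m.tiktok.com/v/6749869095467945218.html")); // 6749869095467945218
console.log(extractTikTokId("https://www.tiktok.com/trending?shareId=6744531482393545985")); // 6744531482393545985
console.log(extractTikTokId("https://vm.tiktok.com/ZMF6rgvXY/")); // ZMF6rgvXY
```

Only one of the two groups can be set for a given match, so `m[1] || m[2]` picks the one that is not undefined.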


Use RegEx in Python to extract URL and optional query string from web server log data

Disclosure: very much a regex newbie, so I'm trying to tweak some example code I found which parses web server log data into named groups. The snippet of my modified regex thus far that deals with the URL and query string groups:
(?P<url>.+)(?P<querystr>\?.*)
This works just fine when the string against which it's applied actually does have a query string on the URL (each group gets the expected bit of the string) but fails to match if there is none. So I tried adding a '?' after the "querystr" group to indicate that it was optional, i.e. (?P<querystr>\?.*)? ... if there's no query string then it works as expected (nothing is extracted into querystr), but when there is one, it is still extracted as part of url rather than separately into querystr.
What's the best way to identify optional groups (assuming that's even the right approach in this case)? Thanks in advance.
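With .+, the greedy url group swallows the whole string, so the optional querystr group is left matching nothing. Use a negated character class so that the url group stops at the first ?: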
^(?P<url>[^?]+)(?P<querystr>\?.*)?$
Details
^ - start of string
(?P<url>[^?]+) - Group "url": any one or more chars other than ?
(?P<querystr>\?.*)? - an optional Group "querystr": a ? char and then any zero or more chars other than line break chars as many as possible
$ - end of string.

re.split but leaving in the condition

I have an example text string text_var = 'ndTail7-40512-1' and I want to split the first time I see a number followed by a - BUT I want to keep the number. Currently, I have print(re.split('\d*(?=-)',text_var,1)) and my output is ['ndTail', '-40512-1']. But I want to keep that number which is the trigger so it should look like ['ndTail', '7-40512-1']. Any help?
We can try using re.findall here:
import re
text_var = 'ndTail7-40512-1'
matches = re.findall(r'(.*?)(\d-.*$)', text_var)
print(matches[0])
This prints:
('ndTail', '7-40512-1')
Sometimes it can be easier to use re.findall rather than re.split.
The regex pattern used here says to:
(.*?) match AND capture all content up to, but not including,
(\d-.*$) the first digit which is followed by a hyphen;
match and capture this content all the way to the end of the input
Note that we are using re.findall which typically has the potential to return multiple matches. However, in this case, our pattern matches to the end of the input, so we are left with just a single tuple containing the two desired capture groups.

Negative Look-ahead in Regex doesn't seem to be working

I'm attempting to use Regex to extract a sub-domain from a url that follows a strict pattern. I want to only match urls with subdomains specified, so I'm using a negative look-ahead. This seems to work in many regex evaluators, but when I run in node, both strings get matched. Here's the code:
const defaultDomain = 'https://xyz.domain.com';
const scopedDomain = 'https://xyz.subdomain.domain.com';
const regex = /^https:\/\/xyz\.([^.]+(?!domain))\./
const matchPrefix1 = defaultDomain.match(regex);
const matchPrefix2 = scopedDomain.match(regex);
console.log(matchPrefix1);
console.log(matchPrefix2);
Expected: matchPrefix1 is null and matchPrefix2 results in a match where the first capture group is 'subdomain'
Actual: both matchPrefix1 and matchPrefix2 contain data, with the capture groups coming back as 'domain' and 'subdomain' respectively
Link to regexr (works!): https://regexr.com/42bfn
Link to repl (does not work): https://repl.it/#tomismore/SpiffyFrivolousLaws
What's going on here?
Regexr shows your code working because you didn't add the multiline flag. This causes the start of the second line to not match ^, so the whole second line is ignored. Add the multiline flag to see your regex not working.
I would use this regex:
^https:\/\/xyz\.(?!domain)([^.]+)\.
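The change I made is to move the [^.]+ part to after the (?!domain) check. In your version, [^.]+ has already consumed domain by the time the look-ahead runs, so the look-ahead only sees .com and succeeds. Basically, you should check for (?!domain) immediately after matching xyz\..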
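As a quick check, the corrected pattern can be run against the two URLs from the question. This is a small sketch; the helper name getSubdomain is hypothetical.

```javascript
// The look-ahead runs before [^.]+ consumes anything, so it inspects the
// label that directly follows "xyz.".
const regex = /^https:\/\/xyz\.(?!domain)([^.]+)\./;

// Returns the captured sub-domain label, or null when there is no match.
function getSubdomain(url) {
  const m = url.match(regex);
  return m ? m[1] : null;
}

console.log(getSubdomain("https://xyz.domain.com"));           // null
console.log(getSubdomain("https://xyz.subdomain.domain.com")); // subdomain
```

Note that (?!domain) rejects any label that merely starts with "domain" (for example domainfoo). If only the exact label should be excluded, (?!domain\.) is a stricter alternative.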

Get a value from the string with regex

I have this for example:
<#445288012218368010>
And I want to get the value between the <# and > symbols.
I tried this:
string.replace(/^(?:\<\#)(?:.*)(?:\>)$/gim, '');
But I don't get any result; it removes the whole string.
I want only this part: 445288012218368010 (it is dynamic, so the numbers will not always be the same).
This is for a Discord chat bot. I know there are other methods for checking mentioned names, but I want to do it with a regex because what I'm trying to do can't use the common method.
So how can I get the value from between those symbols?
I need this as a Node.js regex.
You can use String#match, which returns the regular expression matches for the string. In this case the RegExp would be <#(\d+)>; the parentheses around \d+ make it its own capturing group. This way you can use <string>.match(/<#(\d+)>/) to get the match result and <string>.match(/<#(\d+)>/)[1] to get the first group of the regex (in this case the number).
Your regex matches, but every group in it is non-capturing, including (?:.*), so replace substitutes the empty string for the full match, which is the whole string. Note that you could omit the first and the third non-capturing groups and use <# and > instead.
You could match what is between the brackets using a capturing group, ([^>]+) or (\d+), and refer to that first capturing group as $1 in the replacement.
console.log("<#445288012218368010>".replace(/^<#([^>]+)>$/gim, '$1'));
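String#match returns null when nothing matches, so indexing [1] directly on the result throws for input that has no mention. A minimal sketch with a guard follows; the function name extractMentionId is made up for illustration.

```javascript
// Returns the digits between "<#" and ">", or null if the text has no such token.
function extractMentionId(text) {
  const match = text.match(/<#(\d+)>/);
  return match ? match[1] : null;
}

console.log(extractMentionId("<#445288012218368010>"));       // 445288012218368010
console.log(extractMentionId("hi <#445288012218368010> !"));  // 445288012218368010
console.log(extractMentionId("no mention here"));             // null
```

Because the pattern is not anchored with ^ and $, it also finds the token inside a longer message.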

Regex for a specific email for a specific domain

I'm looking for a regex to match this: a_*_*@example.com, where each * is any text of any length. I'm doing this in Node.js.
Additionally, I'm looking for a regex that matches any string not including the @ symbol.
For the first: ^a_.*_.*@example\.com$ (the ^ and $ anchors keep it from matching inside a longer string)
For the second: ^[^@]*$
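A rough Node.js sketch of how the two patterns might be used. It assumes the separator in the address is the usual @ sign and that the whole string must be the address, which is why ^ and $ anchors are added to the first pattern. The constant and function names are hypothetical.

```javascript
// Assumption: the address uses "@" and must span the entire string.
const SPECIFIC_EMAIL = /^a_.*_.*@example\.com$/;

// Matches only strings that contain no "@" at all (including the empty string).
const NO_AT_SIGN = /^[^@]*$/;

function isSpecificEmail(s) {
  return SPECIFIC_EMAIL.test(s);
}

function hasNoAtSign(s) {
  return NO_AT_SIGN.test(s);
}

console.log(isSpecificEmail("a_foo_bar@example.com"));      // true
console.log(isSpecificEmail("a_foo_bar@example.com.evil")); // false
console.log(hasNoAtSign("plain text"));                     // true
console.log(hasNoAtSign("user@host"));                      // false
```

Without the anchors, the first pattern would also accept strings that merely contain a matching address somewhere inside them.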
