Parse a line to get specific string in python - python-3.x

I'm very new to Python and am trying to parse the URL out of the line below. How can I get it?
application_url: https://hafaf.daff.io
I tried to use split, but I could not get it to work.

So split works as such:
mystring = "Hello, my name is Sam!"
print(mystring.split('Hello')[1])
That will output:
, my name is Sam!
split does quite literally what it sounds like: it splits a string on a specific substring or character.
So to get the url there you'd do the following:
my_url = "application_url: https://hafaf.daff.io".split("application_url: ")[1]
Which would result in the variable my_url being "https://hafaf.daff.io"
Do note the inclusion of the spaces when splitting.
split breaks a string into a list, which you can then access by index. Here the list is ['', 'https://hafaf.daff.io']: index 0 holds the (empty) text before "application_url: ", and index 1 holds everything after it, which is the URL.
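As a hedged alternative to indexing the result of split, str.partition splits only at the first occurrence of the separator and does not raise an IndexError when the separator is missing. The helper name value_after_key is made up for illustration; this is a minimal sketch, not the only way to do it:

```python
def value_after_key(line, sep=": "):
    """Return the text after the first `sep`, or None if `sep` is absent.

    `value_after_key` is a hypothetical helper name, not a library function.
    """
    # partition returns (before, separator, after); separator is "" when not found
    _key, found, value = line.partition(sep)
    return value.strip() if found else None


line = "application_url: https://hafaf.daff.io"
my_url = value_after_key(line)
print(my_url)
```

Because only the first ": " is used, a value that itself contains the separator later on is left intact.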

Related

How to use Python Regex to match url

I have a string:
test_string="lots of other html tags ,'https://news.sky.net/upload_files/image/2022/202209_166293.png',and still 'https://news.sky.net/upload_files/image/2022/202209_166293.jpg'"
How can I get both complete URLs from the string using a Python regex?
I tried:
pattern = 'https://news.sky.net/upload_files/image'
result = re.findall(pattern, test_string)
I can get a list:
['https://news.sky.net/upload_files/image','https://news.sky.net/upload_files/image']
but not the whole URL, so I tried:
pattern = 'https://news.sky.net/upload_files/image...$png'
result = re.findall(pattern, test_string)
But received an empty list.
You could match a minimal number of characters after image up to a . and either png or jpg:
import re
test_string = "lots of other html tags ,'https://news.sky.net/upload_files/image/2022/202209_166293.png',and still 'https://news.sky.net/upload_files/image/2022/202209_166293.jpg'"
pattern = r'https://news.sky.net/upload_files/image.*?\.(?:png|jpg)'
re.findall(pattern, test_string)
Output:
[
'https://news.sky.net/upload_files/image/2022/202209_166293.png',
'https://news.sky.net/upload_files/image/2022/202209_166293.jpg'
]
Assuming you would always expect the URLs to appear inside single quotes, we can use re.findall as follows:
import re
test_string = "lots of other html tags ,'https://news.sky.net/upload_files/image/2022/202209_166293.png',and still 'https://news.sky.net/upload_files/image/2022/202209_166293.jpg'"
urls = re.findall(r"'(https?:\S+?)'", test_string)
print(urls)
This prints:
['https://news.sky.net/upload_files/image/2022/202209_166293.png',
'https://news.sky.net/upload_files/image/2022/202209_166293.jpg']
You could match any URL inside your string with the regex https?://[^\s']+
Applied to your code, it would look something like this:
import re
string = "Some string here'https://news.sky.net/upload_files/image/2022/202209_166293.png' And here as well 'https://news.sky.net/upload_files/image/2022/202209_166293.jpg' that's it tho"
res = re.findall(r"https?://[^\s']+", string)
print(res)
This returns a list of the URLs collected from the string:
[
'https://news.sky.net/upload_files/image/2022/202209_166293.png',
'https://news.sky.net/upload_files/image/2022/202209_166293.jpg'
]
Regex explanation:
https?://[^\s']+
https? - matches either http or https
[^\s']+ - one or more characters that are neither whitespace nor a single quote
So this matches http or https, then ://, then every following character up to the next whitespace or single quote. Note that \S+ alone would also swallow the closing quote, and that adding capturing groups, as in (http(s)?://\S+), makes re.findall return tuples instead of strings.
Hope you find this helpful.
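To illustrate how capturing groups change what re.findall returns, here is a small sketch. The example URLs and variable names are made up for illustration and are not taken from the question:

```python
import re

# Hypothetical sample text with two quoted URLs.
text = "see 'http://a.example/x.png' and 'https://b.example/y.jpg'"

# Two capturing groups: findall returns one tuple per match,
# with "" for a group that did not take part in the match.
with_groups = re.findall(r"(http(s)?://[^\s']+)", text)

# No capturing groups: findall returns the full matches as plain strings.
without_groups = re.findall(r"https?://[^\s']+", text)

print(with_groups)
print(without_groups)
```

If a group is needed only for alternation or repetition, a non-capturing group (?:...) keeps the result a flat list of strings.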

Get number from string in Python

I have a string, I have to get digits only from that string.
url = "www.mylocalurl.com/edit/1987"
Now from that string, I need to get 1987 only.
I have been trying this approach,
id = [int(i) for i in url.split() if i.isdigit()]
But I am getting [] list only.
You can use regex and get the digit alone in the list.
import re
url = "www.mylocalurl.com/edit/1987"
digit = re.findall(r'\d+', url)
output:
['1987']
Replace all non-digits with blank (effectively "deleting" them):
import re
num = re.sub(r'\D', '', url)
You aren't getting anything because by default the .split() method splits a sentence up where there are spaces. Since you are trying to split a hyperlink that has no spaces, it is not splitting anything up. What you can do is called a capture using regex. For example:
import re
url = "www.mylocalurl.com/edit/1987"
regex = r'(\d+)'
numbers = re.search(regex, url)
captured = numbers.groups()[0]
If you do not know what regular expressions are, the code is basically saying: using the regex r'(\d+)', which means "capture one or more consecutive digits", search through the url. captured then holds the first captured group, which is 1987.
If you don't want to use regex, you can still use .split(), but this time with / as the separator, for example url.split('/').
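The split-on-slash idea could look roughly like the sketch below. The variable names last_segment and record_id are illustrative, and the sketch assumes the number is always the last path segment:

```python
url = "www.mylocalurl.com/edit/1987"

# rsplit with maxsplit=1 splits once, from the right, so only the last
# path segment is separated from the rest of the URL.
last_segment = url.rstrip("/").rsplit("/", 1)[-1]

# Convert only if the segment really consists of digits.
record_id = int(last_segment) if last_segment.isdigit() else None
print(record_id)
```

The rstrip("/") call guards against a trailing slash, which would otherwise leave an empty last segment.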

How to find a substring in a line from a text file and add that line or the characters after the searched string into a list using Python?

I have a MIB dataset which is around 10k lines. I want to find a certain string (for eg: "SNMPv2-MIB::sysORID") in the text file and add the whole line into a list. I am using Jupyter Notebooks for running the code.
I used the code below to search for the string; it prints the search string along with the next two strings.
basic = open('mibdata.txt')
file = basic.read()
city_name = re.search(r"SNMPv2-MIB::sysORID(?:[^a-zA-Z'-]+[a-zA-Z'-]+){1,2}", file)
city_name = city_name.group()
print(city_name)
Sample lines in file:
SNMPv2-MIB::sysORID.10 = OID: NOTIFICATION-LOG-MIB::notificationLogMIB
SNMPv2-MIB::sysORDescr.1 = STRING: The MIB for Message Processing and Dispatching.
The output expected is
SNMPv2-MIB::sysORID.10 = OID: NOTIFICATION-LOG-MIB::notificationLogMIB
but i get only
SNMPv2-MIB::sysORID.10 = OID: NOTIFICATION-LOG-MIB
The problem with changing the number of strings matched after the search string is that the number of strings in each line varies, so I cannot specify a constant. Instead I want to use '\n' as a delimiter, but I could not find a post that covers this.
P.S. Any other solution is also welcome
EDIT
You can read the file line by line and look for a regex that matches your case.
r"(SNMPv2-MIB::sysORID).*" finds the string in the parentheses and then matches everything that follows it on the line.
import re
basic = open('file.txt')
entries = map(lambda x: re.search(r"(SNMPv2-MIB::sysORID).*", x).group() if re.search(r"(SNMPv2-MIB::sysORID).*", x) is not None else "", basic.readlines())
non_empty_entries = list(filter(lambda x: x != "", entries))
print(non_empty_entries)
If you are not comfortable with lambdas, what the above script does is
take the text from the file, split it into lines, and check each line individually for a regex match.
entries holds the matched text for every line where a match was found.
EDIT vol2
Now, when the regex doesn't match, an empty string is added instead, and the empty strings are filtered out afterwards.
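Since the search string is a fixed piece of text, a plain loop without regex may be enough. The sketch below collects both the whole matching lines and the characters after the search string. It uses io.StringIO as a stand-in for the real file so that it runs on its own, and the names needle, whole_lines and remainders are illustrative:

```python
import io

# Stand-in for open('mibdata.txt'); the two lines are the samples from the question.
sample = io.StringIO(
    "SNMPv2-MIB::sysORID.10 = OID: NOTIFICATION-LOG-MIB::notificationLogMIB\n"
    "SNMPv2-MIB::sysORDescr.1 = STRING: The MIB for Message Processing and Dispatching.\n"
)

needle = "SNMPv2-MIB::sysORID"
whole_lines = []   # complete lines that contain the search string
remainders = []    # only the characters after the search string

for line in sample:
    line = line.rstrip("\n")
    # partition returns (before, needle, after); the middle part is "" if absent
    _before, found, after = line.partition(needle)
    if found:
        whole_lines.append(line)
        remainders.append(after)

print(whole_lines)
print(remainders)
```

Iterating over the file object reads one line at a time, so '\n' acts as the delimiter without any extra work.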

how can i split a full name to first name and last name in python?

I'm a novice in Python programming and I'm trying to split a full name into first name and last name. Can someone assist me with this? My example file is:
Sarah Simpson
I expect the output like this : Sarah,Simpson
You can use the split() function like so:
fullname=" Sarah Simpson"
fullname.split()
which will give you: ['Sarah', 'Simpson']
Building on that, you can do:
first=fullname.split()[0]
last=fullname.split()[-1]
print(first + ',' + last)
which would give you Sarah,Simpson with no spaces
This comes in handy: nameparser 1.0.6 - https://pypi.org/project/nameparser/
>>> from nameparser import HumanName
>>> name = "Sarah Simpson"
>>> name = HumanName(name)
>>> name.last
'Simpson'
>>> name.first
'Sarah'
>>> name.last+', '+name.first
'Simpson, Sarah'
You can try the .split() method, which returns a list of strings after splitting by a separator. In this case the separator is a space character.
First remove leading and trailing spaces using .strip(), then split by the separator.
first_name, last_name=fullname.strip().split()
Strings in Python are immutable. Create a new String to get the desired output.
You can use split() method of string class.
name = "Sarah Simpson"
name.split()
By default split() splits on whitespace; it also accepts an optional separator as a parameter. It returns a list:
["Sarah", "Simpson"]
Then just concatenate the strings. For reference, see https://docs.python.org/3.7/library/stdtypes.html?highlight=split#str.split
Output = "Sarah", "Simpson"
name = "Thomas Winter"
LastName = name.split()[1]
(Note the parentheses on the call to split.)
split() creates a list where each element is from your original string, delimited by whitespace. You can now grab the second element using name.split()[1] or the last element using name.split()[-1]
split() is the obvious function to use here;
it can be called with a separator argument or with no arguments:
fullname="Sarah Simpson"
ls=fullname.split()
ls=fullname.split(" ") # splits on each single space character
Optional extra
If you want the split name to be shown as a string delimited by a comma, you can use join() or replace():
print(",".join(ls)) # outputs Sarah,Simpson
print(fullname.replace(" ",",")) # also outputs Sarah,Simpson
Input: Sarah Simpson => suppose it is a string.
Then, to output: Sarah, Simpson. Do the following:
name_surname = "Sarah Simpson".split(" ")
to_output = name_surname[0] + ", " + name_surname[-1]
print(to_output)
The split function is called on a string and splits it at the separator passed to it. It returns a list of the pieces that were split off.
In your case the string is "Sarah Simpson", so when you call split with the argument " " (a single space), the output is ["Sarah", "Simpson"].
Now, to combine the names or to access either of them, you can write the name of the list followed by square brackets containing the index of the desired word. For example, name_surname[0] returns "Sarah", since its index in the list is 0.
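One caveat with unpacking the result of split() into exactly two variables is that it raises a ValueError for names with more or fewer than two words. A hedged sketch of one way around that is to split only once. The helper name split_full_name is made up, and treating everything after the first word as the last name is an assumption rather than a general rule:

```python
def split_full_name(full_name):
    """Split a name at the first run of whitespace only.

    `split_full_name` is a hypothetical helper; everything after the first
    word is treated as the last name, which is an assumption.
    """
    parts = full_name.strip().split(maxsplit=1)
    first = parts[0] if parts else ""
    last = parts[1] if len(parts) > 1 else ""
    return first, last


first, last = split_full_name(" Sarah Simpson")
print(first + "," + last)
```

For names with titles, suffixes or multi-word surnames, a dedicated parser such as the nameparser package mentioned above is likely a better fit.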

matlab function replacing last part of strings between known characters

I have a text file TF including a set of the following kind of strings:
"linStru.twoZoneBuildingStructure.north.airLeakage.senTem.T",
"linStru.twoZoneBuildingStructure.north.vol.Xi[1]",
"linStru.twoZoneBuildingStructure.south.airLeakage.senTem.T",
"linStru.twoZoneBuildingStructure.south.vol.Xi[1]", "
"linStru.twoZoneBuildingStructure.north_ext.layMul.nMat[1].monoLayer1Nf.T[1]",
"linStru.twoZoneBuildingStructure.north_ext.layMul.nMat[1].monoLayer2Nf.T[2]",
Given a line L, starting from the end let the substring s denote the portion of the string between ," and the first .
To make it clearer, for L=1: s=T, for L=2: s=Xi[1], for L=5: s=T[1], etc.
Given a text file TF in the above format, I want to write a MATLAB function which takes TF and replaces the corresponding s on each line with der(s).
For example, the function should change the above strings as follows:
"linStru.twoZoneBuildingStructure.north.airLeakage.senTem.der(T)",
"linStru.twoZoneBuildingStructure.north.vol.der(Xi[1])",
"linStru.twoZoneBuildingStructure.south.airLeakage.senTem.der(T)",
"linStru.twoZoneBuildingStructure.south.vol.der(Xi[1])", "
"linStru.twoZoneBuildingStructure.north_ext.layMul.nMat[1].monoLayer1Nf.der(T[1])",
"linStru.twoZoneBuildingStructure.north_ext.layMul.nMat[1].monoLayer2Nf.der(T[2])",
How can such a function be written?
Something like
regexprep(TF, '\.([^.]+)",$', '.der($1)",', 'dotexceptnewline', 'lineanchors')
It finds the longest sequence of non-dot characters appearing between a dot before and quote-comma-endline after, and encloses that inside der( ).
I see there is a small " typo on the fourth line of your text file. I'm going to remove this to make things simpler.
As such, the simplest way I can see to do this is to iterate through all of your strings, remove the surrounding double quotes and the trailing comma, then find the point in each string where the last . occurs. Extract the substring after it, then manually wrap it in der(). Assuming that those strings are in a text file called functions.txt, you would read in your text file using textread to read in individual strings. As such:
names = textread('functions.txt', '%s');
names should now be a cell array where each element is one string, still enclosed in double quotes. Use findstr to find where the . characters are located, then take the last of those locations. Extract the substring after it, then wrap it in der(). In other words:
out_strings = cell(1, numel(names)); %// To store output strings
for idx = 1 : numel(names)
    %// Extract actual string without quotes and comma
    name_str = names{idx}(2:end-2);
    %// Find the last dot
    dot_locs = findstr(name_str, '.');
    %// Last dot location
    last_dot_loc = dot_locs(end);
    %// Extract substring after dot
    last_string = name_str(last_dot_loc+1:end);
    %// Create new string
    out_strings{idx} = ['"' name_str(1:last_dot_loc) 'der(' last_string ')",'];
end
This is the output I get:
celldisp(out_strings)
out_strings{1} =
"linStru.twoZoneBuildingStructure.north.airLeakage.senTem.der(T)",
out_strings{2} =
"linStru.twoZoneBuildingStructure.north.vol.der(Xi[1])",
out_strings{3} =
"linStru.twoZoneBuildingStructure.south.airLeakage.senTem.der(T)",
out_strings{4} =
"linStru.twoZoneBuildingStructure.south.vol.der(Xi[1])",
out_strings{5} =
"linStru.twoZoneBuildingStructure.north_ext.layMul.nMat[1].monoLayer1Nf.der(T[1])",
out_strings{6} =
"linStru.twoZoneBuildingStructure.north_ext.layMul.nMat[1].monoLayer2Nf.der(T[2])",
The last thing you want to do is write each line of text to your text file. You can use fopen to open a file for writing. fopen returns a file ID that is associated with the file you want to write to. You then use fprintf with this file ID to print each string followed by a newline. Finally, you close the file using fclose with the same file ID. As such, if we wanted to output a text file called functions_new.txt, we would do:
%// Open up the file and get ID
fid = fopen('functions_new.txt', 'w');
%// For each string we have...
for idx = 1 : numel(out_strings)
    %// Write the string to file and make a new line
    fprintf(fid, '%s\n', out_strings{idx});
end
%// Close the file
fclose(fid);
Another way to do it with regexprep:
str_out = regexprep(str_in, '\.([^\.]+)"$','\.der($1)"');
Example: for
str_in = {'"linStru.twoZoneBuildingStructure.north.airLeakage.senTem.T"'
'"linStru.twoZoneBuildingStructure.north.vol.Xi[1]"'};
this gives
str_out =
'"linStru.twoZoneBuildingStructure.north.airLeakage.senTem.der(T)"'
'"linStru.twoZoneBuildingStructure.north.vol.der(Xi[1])"'

Resources