Python regex group sentences by person's name

I have the following text:
"- Nike: Hey, where are you?\n10/6/20, 8:51 - Mike: Soon\n10/6/20, 8:55 - Nike: how are you guy?\n10/4/20, 8:55 - Mike: It's okay\n10/4/20, 9:05"
I'd like to make 2 lists like the following:
nike = ["Hey, where are you?", "how are you guy?"]
mike = ["Soon", "It's okay"]
Any idea how I could do this, please?
Thanks guys!

import re
s = "- Nike: Hey, where are you?\n10/6/20, 8:51 - Mike: Soon\n10/6/20, 8:55 - Nike: how are you guy?\n10/4/20, 8:55 - Mike: It's okay\n10/4/20, 9:05"
out = {}
for name, sentence in re.findall(r'([A-Za-z]+):\s*(.*)$', s, flags=re.M):
    out.setdefault(name, []).append(sentence)
print(out)
Prints:
{'Nike': ['Hey, where are you?', 'how are you guy?'], 'Mike': ['Soon', "It's okay"]}

Related

Extracting countries from string

I am trying to go through a column of a data frame in Python 3. For each row, I need to extract every country that is mentioned, once for each time it is mentioned.
For example, if I have this row:
['[Aydemir, Deniz', ' Gunduz, Gokhan', ' Asik, Nejla] Bartin Univ, Fac Forestry, Dept Forest Ind Engn, TR-74100 Bartin, Turkey', ' [Wang, Alice] Lulea Univ Technol, Wood Technol, Skelleftea, Sweden']
it needs to output a list: ['Turkey', 'Sweden']
and if I have this row:
['[Fang, Qun', ' Cui, Hui-Wang] Zhejiang A&F Univ, Sch Engn, Linan 311300, Peoples R China', ' [Du, Guan-Ben] Southwest Forestry Univ, Kunming 650224, Yunnan, Peoples R China']
the output should be: ['China', 'China'].
I have written this code but it is not working as I want to:
from geotext import GeoText
sentence = df.iloc[0,0]
places = GeoText(sentence)
print(places.countries)
It prints each country only once, and in some cases it doesn't recognize the abbreviation USA. Can you help me figure out what to do?
import pandas as pd
l = [['[Aydemir, Deniz\', \' Gunduz, Gokhan\', \' Asik, Nejla] Bartin Univ, Fac Forestry, Dept Forest Ind Engn, TR-74100 Bartin, Turkey\', \' [Wang, Alice] Lulea Univ Technol, Wood Technol, Skelleftea, Sweden',1990],
['[Fang, Qun\', \' Cui, Hui-Wang] Zhejiang A&F Univ, Sch Engn, Linan 311300, Peoples R China\', \' [Du, Guan-Ben] Southwest Forestry Univ, Kunming 650224, Yunnan, Peoples R China',2005],
['[Blumentritt, Melanie\', \' Gardner, Douglas J.\', \' Shaler, Stephen M.] Univ Maine, Sch Resources, Orono, ME USA\', \' [Cole, Barbara J. W.] Univ Maine, Dept Chem, Orono, ME 04469 USA',2012]]
dataf = pd.DataFrame(l, columns = ['Authors', 'Year'])
I tried this code but I have the same problem: it doesn't give all the countries, only one per row:
import pycountry

def find_country(n):
    for c in pycountry.countries:
        if str(c.name).lower() in n.lower():
            return c.name

country1 = (dataf['Authors']
            .replace(r"\bUSA\b", "United States", regex=True)
            .apply(lambda x: find_country(x)))

USA does not seem to be detected correctly by geotext; it may be worth raising an issue with that package. As a workaround here, I replace USA with United States, which is correctly detected.
import geotext

df = (dataf['Authors']
      .replace(r"\bUSA\b", "United States", regex=True)
      .apply(lambda x: geotext.GeoText(x).countries)
     )
I'm not sure what you were doing before, but this will get the list of countries for each of the rows in Authors, including duplicates.
0 [Turkey, Sweden]
1 [China, China]
2 [United States, United States]
Name: Authors, dtype: object
As mentioned in the comment, if you want to have an actual list of lists, just add tolist() to the end.
df.tolist()
[['Turkey', 'Sweden'], ['China', 'China'], ['United States', 'United States']]
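The find_country function in the question returns as soon as it hits the first match, which is why it yields only one country per row. As a rough sketch of collecting every mention instead, the snippet below uses re.findall over a small lookup list. KNOWN_COUNTRIES, ALIASES and find_countries are made-up names for illustration only; a real list would come from a package such as pycountry or geotext.

```python
import re

# Hypothetical lookup data for illustration; not taken from any package.
KNOWN_COUNTRIES = ("Turkey", "Sweden", "China", "United States")
ALIASES = {"USA": "United States"}

def find_countries(text):
    """Return every known country mentioned in text, in order, duplicates included."""
    names = list(KNOWN_COUNTRIES) + list(ALIASES)
    # One alternation of all names, matched on word boundaries
    pattern = r"\b(" + "|".join(map(re.escape, names)) + r")\b"
    # Map aliases such as "USA" onto the full country name
    return [ALIASES.get(name, name) for name in re.findall(pattern, text)]

row_china = "Linan 311300, Peoples R China', ' [Du, Guan-Ben] Southwest Forestry Univ, Kunming 650224, Yunnan, Peoples R China"
row_usa = "Univ Maine, Sch Resources, Orono, ME USA', ' [Cole, Barbara J. W.] Univ Maine, Dept Chem, Orono, ME 04469 USA"

print(find_countries(row_china))
print(find_countries(row_usa))
```

With pandas this could be applied per row via dataf['Authors'].apply(find_countries), assuming the column holds plain strings as in the question.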

How to fix my RE to get my expected arguments of group

I'm new to Python and am learning regex at the moment.
I wrote a regex that is supposed to find all phone numbers.
I think I did it right, but my code doesn't seem to be working correctly.
phoneRegex = re.compile(r'''(
(\d{2,3}|\(\d(2,3)\))? # first 2-3 digits
(\s|-|\.)? # -
(\d{3,4}) # second 3-4 digits
(\s|-|\.) # -
(\d{4}) # last 4 digits.
(\s*(ext|x|ext.)\s*(\d{3,4}))? # extension
)''', re.VERBOSE)
phoneRegex.findall('010 1234 5678 ext1234')
I am working through the automatetheboringstuff tutorials and have read the regex chapter three times.
If I have missed something minor that I should have read or considered, sorry for my haste; I have spent roughly 2 hours on this and would welcome any suggested reading material or help.
Thanks in advance.
Expected result:
[('010 1234 5678 ext1234', '010', ' ', '1234', ' ', '5678', ' ext1234', 'ext', '1234')]
Actual result:
[('010 1234 5678 ext1234', '010', '', ' ', '1234', ' ', '5678', ' ext1234', 'ext', '1234')]
What is the 3rd item ('') and where did it come from?
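One thing that stands out in the pattern as posted: `\(\d(2,3)\)` uses parentheses where the first alternative uses braces. `(2,3)` is a capturing group that matches the literal text "2,3", not a repeat count, so it adds one more group to every result tuple, and that group is empty whenever it does not take part in the match. The sketch below compares the posted pattern with a variant that uses `{2,3}`; the names posted and fixed are just labels for this comparison.

```python
import re

text = '010 1234 5678 ext1234'

# As posted: "(2,3)" is a capturing group matching the literal text "2,3"
posted = re.compile(r'''(
    (\d{2,3}|\(\d(2,3)\))?          # first 2-3 digits
    (\s|-|\.)?                      # separator
    (\d{3,4})                       # second 3-4 digits
    (\s|-|\.)                       # separator
    (\d{4})                         # last 4 digits
    (\s*(ext|x|ext.)\s*(\d{3,4}))?  # extension
    )''', re.VERBOSE)

# With braces, "{2,3}" is a quantifier on \d and creates no extra group
fixed = re.compile(r'''(
    (\d{2,3}|\(\d{2,3}\))?          # first 2-3 digits
    (\s|-|\.)?                      # separator
    (\d{3,4})                       # second 3-4 digits
    (\s|-|\.)                       # separator
    (\d{4})                         # last 4 digits
    (\s*(ext|x|ext.)\s*(\d{3,4}))?  # extension
    )''', re.VERBOSE)

print(posted.groups, fixed.groups)   # number of capturing groups in each pattern
print(fixed.findall(text))
```

The posted pattern has 10 capturing groups and the variant has 9, which matches the 9-item tuple in the expected result.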

Python Print Table for Term and Definition with Handled Overflow

I'm trying to make a program that prints out a two column table (Term and Definition) something like this: (table width should be 80 characters)
+--------------------------------------------------------------------------+
| Term                                 | Definition                        |
+--------------------------------------+-----------------------------------+
| this is the first term.              |This is the definition for thefirst|
|                                      |term that wraps around because the |
|                                      |definition is longer than the width|
|                                      |of the column.                     |
+--------------------------------------+-----------------------------------+
|The term may also be longer than the  |This is the definition for the     |
|width of the column and should wrap   |second term.                       |
|around as well.                       |                                   |
+--------------------------------------+-----------------------------------+
I have existing code for this, but it prints out "this is the first term" on every line because I have used a nested for loop. (Also tried implementing the textwrap module) Here is the code that I have:
# read file
with open(setsList[selectedSet-1], "r", newline="") as setFile:
    cardList = list(csv.reader(setFile))
    setFile.close()
for i in range(len(cardList)):
    wrapped_term = textwrap.wrap(cardList[i][0], 30)
    wrapped_definition = textwrap.wrap(cardList[i][1], 30)
    for line in wrapped_term:
        for line2 in wrapped_definition:
            print(line, " ",line2)
    print("- - - - - - - - - - - - - - - - - - - - - - - - - - -")
Can anyone suggest a solution? Thank you.

After a lot of trial and error and some random YouTube videos, here is the solution, in case anyone has a similar problem:
import csv
import textwrap

with open("table.csv", "r", newline="") as setFile:
    cardList = list(csv.reader(setFile))

print("+------------------------------------------------------------------------------+")
print(f"| {'Term':<37}| {'Definition':<38}|")
print("+------------------------------------------------------------------------------+")
print()
for x in range(len(cardList)):
    wrapped_term = textwrap.wrap(cardList[x][0], 30)
    wrapped_definition = textwrap.wrap(cardList[x][1], 30)
    wrapped_list = []
    # Walk as many lines as the longer side has, so no wrapped line is dropped
    for i in range(max(len(wrapped_term), len(wrapped_definition))):
        try:
            wrapped_list.append([wrapped_term[i], wrapped_definition[i]])
        except IndexError:
            if len(wrapped_term) > len(wrapped_definition):
                wrapped_list.append([wrapped_term[i], ""])
            elif len(wrapped_term) < len(wrapped_definition):
                wrapped_list.append(["", wrapped_definition[i]])
    # Text widths of the two columns; print() adds 3 separator spaces per cell,
    # so 35 + 3 and 36 + 3 line up with the 38- and 39-dash border segments
    column1 = 35
    column2 = 36
    print("+--------------------------------------+---------------------------------------+")
    for item in wrapped_list:
        print("|", item[0], " "*(column1 - len(item[0])),"|", item[1], " "*(column2-len(item[1])), "|")
print("+--------------------------------------+---------------------------------------+")
print("* *")
Basically, I created a wrapped version of each of my terms and definitions.
Then the try/except block checks whether the term is longer than the definition (in terms of lines) and, if so, puts blank lines in for the definition, and vice versa.
The combined term and definition lines are stored in wrapped_list.
With help from this video: (https://www.youtube.com/watch?v=B9BRuhqEb2Q), I formatted the table.
I hope this helps anyone struggling with a similar problem. The same approach can be applied to any number of columns in a table and any length of CSV file.
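For the "any number of columns" case, here is a rough sketch of the same idea written with itertools.zip_longest instead of try/except, which is a swapped-in technique and not the code from the answer. The helper names format_row and format_table, and the widths 38 and 35 (taken from the example table in the question), are assumptions for illustration.

```python
import textwrap
from itertools import zip_longest

def format_row(cells, widths):
    """Wrap each cell to its column width and return the printed lines for one row."""
    # textwrap.wrap("") returns [], so fall back to a single empty line
    wrapped = [textwrap.wrap(cell, width) or [""] for cell, width in zip(cells, widths)]
    lines = []
    # zip_longest pads the shorter columns with "" so all columns have equal height
    for parts in zip_longest(*wrapped, fillvalue=""):
        padded = (part.ljust(width) for part, width in zip(parts, widths))
        lines.append("|" + "|".join(padded) + "|")
    return lines

def format_table(rows, widths):
    """Return all lines of the table, with a border after every row."""
    border = "+" + "+".join("-" * width for width in widths) + "+"
    lines = [border]
    for row in rows:
        lines.extend(format_row(row, widths))
        lines.append(border)
    return lines

rows = [
    ("Term", "Definition"),
    ("this is the first term.",
     "This is the definition for the first term that wraps around because "
     "the definition is longer than the width of the column."),
]
table = format_table(rows, widths=(38, 35))
print("\n".join(table))
```

Every printed line comes out the same length (the column widths plus one character per border), so the right-hand edge stays aligned however the cells wrap.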

How to capture words spread through multiple lines which have any whitespace (newline, space, tab)

import re
c = """
class_monitor std4:
Name: xyz
Roll number: 123
Age: 9
Badge: Blue
class_monitor std5:
Name: abc
Roll number: 456
Age: 10
Badge: Red
"""
I want to print Name, Roll number and age for std4 and Name, roll number and badge for std5.
pat = r"(class_monitor)(.*4:)(\n|\s|\t)*(Name:)(.*)(\s|\n|\t)*(Roll number:)(.*)(\s|\n|\t)*(Age:)(.*)(\s|\n|\t)*(Badge:)(.*)"
It matches the respective std if I toggle the second group (.*4:) to (.*5:) in pythex.
However, in a script it is not working. Am I missing something here?
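The question doesn't show how the pattern is applied in the script, so the cause can only be guessed. One common difference is that re.match only matches at the very start of the string (and c starts with a newline), while re.search scans the whole text. Below is a minimal sketch of one way to pull the fields out per block with re.search and re.findall; fields_for is a hypothetical helper name, and the approach (grab the block, then split it into key/value lines) is an alternative to the single long pattern above.

```python
import re

c = """
class_monitor std4:
Name: xyz
Roll number: 123
Age: 9
Badge: Blue
class_monitor std5:
Name: abc
Roll number: 456
Age: 10
Badge: Red
"""

def fields_for(std, text):
    """Return the 'key: value' fields listed under one class_monitor header."""
    # Capture everything after the header, up to the next header or end of text
    block = re.search(
        rf"class_monitor {re.escape(std)}:\s*(.*?)(?=class_monitor|\Z)",
        text,
        flags=re.S,
    )
    if block is None:
        return {}
    # Each line looks like "Key: value"; \s* absorbs spaces and tabs around it
    pairs = re.findall(r"^\s*([^:\n]+):\s*(.*?)\s*$", block.group(1), flags=re.M)
    return dict(pairs)

std4 = fields_for("std4", c)
std5 = fields_for("std5", c)
print(std4["Name"], std4["Roll number"], std4["Age"])
print(std5["Name"], std5["Roll number"], std5["Badge"])
```

Because each block ends up as a dict, picking Age for std4 and Badge for std5 is just a matter of which keys are read.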

Scraping youtube playlist

I've been trying to write a Python script that fetches the names of the songs in a playlist whose link is passed from the terminal, e.g. https://www.youtube.com/watch?v=foE1mO2yM04&list=RDGMEMYH9CUrFO7CfLJpaD7UR85wVMfoE1mO2yM04
I've found out that names could be extracted by using "li" tag or "h4" tag.
I wrote the following code,
import sys
link = sys.argv[1]
from bs4 import BeautifulSoup
import requests
req = requests.get(link)
try:
    req.raise_for_status()
except Exception as exc:
    print('There was a problem:',exc)
soup = BeautifulSoup(req.text,"html.parser")
Then I tried using li-tag as:
i=soup.findAll('li')
print(type(i))
for o in i:
    print(o.get('data-video-title'))
But it printed "None" that many times. I believe it is not able to reach the li tags that contain the data-video-title attribute.
Then I tried using div and h4 tags as,
for i in soup.findAll('div', attrs={'class':'playlist-video-description'}):
    o = i.find('h4')
    print(o.text)
But again nothing happens.

import requests
from bs4 import BeautifulSoup
url = 'https://www.youtube.com/watch?v=foE1mO2yM04&list=RDGMEMYH9CUrFO7CfLJpaD7UR85wVMfoE1mO2yM04'
data = requests.get(url)
data = data.text
soup = BeautifulSoup(data, "html.parser")  # name the parser explicitly to avoid a warning
h4 = soup.find_all("h4")
for h in h4:
    print(h.text)
output:
Mike Posner - I Took A Pill In Ibiza (Seeb Remix) (Explicit)
Alan Walker - Faded
Calvin Harris - This Is What You Came For (Official Video) ft. Rihanna
Coldplay - Hymn For The Weekend (Official video)
Jonas Blue - Fast Car ft. Dakota
Calvin Harris & Disciples - How Deep Is Your Love
Galantis - No Money (Official Video)
Kungs vs Cookin’ on 3 Burners - This Girl
Clean Bandit - Rockabye ft. Sean Paul & Anne-Marie [Official Video]
Major Lazer - Light It Up (feat. Nyla & Fuse ODG) [Remix] (Official Lyric Video)
Robin Schulz - Sugar (feat. Francesco Yates) (OFFICIAL MUSIC VIDEO)
DJ Snake - Middle ft. Bipolar Sunshine
Jonas Blue - Perfect Strangers ft. JP Cooper
David Guetta ft. Zara Larsson - This One's For You (Music Video) (UEFA EURO 2016™ Official Song)
DJ Snake - Let Me Love You ft. Justin Bieber
Duke Dumont - Ocean Drive
Galantis - Runaway (U & I) (Official Video)
Sigala - Sweet Lovin' (Official Video) ft. Bryn Christopher
Martin Garrix - Animals (Official Video)
David Guetta & Showtek - Bad ft.Vassy (Lyrics Video)
DVBBS & Borgeous - TSUNAMI (Original Mix)
AronChupa - I'm an Albatraoz | OFFICIAL VIDEO
Lilly Wood & The Prick and Robin Schulz - Prayer In C (Robin Schulz Remix) (Official)
Kygo - Firestone ft. Conrad Sewell
DEAF KEV - Invincible [NCS Release]
Eiffel 65 - Blue (KNY Factory Remix)

OK guys, I have figured out what was happening. My code was fine; the problem was that I was passing the link as an argument from the terminal and, coincidentally, the link contained characters that the shell interprets specially, e.g. '&'.
Now I pass the link as a quoted string in the terminal and everything works fine. Such a dumb yet time-consuming mistake.
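To make this failure easier to spot: in most shells an unquoted '&' ends the command, so the script only receives the part of the link before it. A small sanity check like the sketch below could warn about that before any request is made. has_playlist_id is a hypothetical helper name, and the check assumes the playlist is identified by the 'list' query parameter, as in the link from the question.

```python
from urllib.parse import urlparse, parse_qs

def has_playlist_id(link):
    """Return True if the link still carries a non-empty 'list' query parameter."""
    query = parse_qs(urlparse(link).query)
    return bool(query.get("list"))

full = "https://www.youtube.com/watch?v=foE1mO2yM04&list=RDGMEMYH9CUrFO7CfLJpaD7UR85wVMfoE1mO2yM04"
# Roughly what the script receives when the link is passed unquoted
truncated = "https://www.youtube.com/watch?v=foE1mO2yM04"

print(has_playlist_id(full))
print(has_playlist_id(truncated))
```

In the script from the question this could run on sys.argv[1] right after it is read, and on the command line the link would be wrapped in quotes so the shell passes it through unchanged.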
