Python regex group sentences by person's name - python-3.x

I have the following text:
"- Nike: Hey, where are you?\n10/6/20, 8:51 - Mike: Soon\n10/6/20, 8:55 - Nike: how are you guy?\n10/4/20, 8:55 - Mike: It's okay\n10/4/20, 9:05"
I'd like to make 2 lists like the following:
nike = ["Hey, where are you?", "how are you guy?"]
mike = ["Soon", "It's okay"]
Any idea how I could do such thing please?
Thanks guys!

import re
s = "- Nike: Hey, where are you?\n10/6/20, 8:51 - Mike: Soon\n10/6/20, 8:55 - Nike: how are you guy?\n10/4/20, 8:55 - Mike: It's okay\n10/4/20, 9:05"
out = {}
for name, sentence in re.findall(r'([A-Za-z]+):\s*(.*)$', s, flags=re.M):
out.setdefault(name, []).append(sentence)
print(out)
Prints:
{'Nike': ['Hey, where are you?', 'how are you guy?'], 'Mike': ['Soon', "It's okay"]}

Related

Extracting countries from string

I am trying to go through a column of data frame in python 3. What I need to do is take from each row the country that it is mentioned and the number of times that country is mentioned.
i.e. if I have this row:
['[Aydemir, Deniz', ' Gunduz, Gokhan', ' Asik, Nejla] Bartin Univ, Fac Forestry, Dept Forest Ind Engn, TR-74100 Bartin, Turkey', ' [Wang, Alice] Lulea Univ Technol, Wood Technol, Skelleftea, Sweden']
it needs to output a list: ['Turkey', 'Sweden']
and if I have this row:
['[Fang, Qun', ' Cui, Hui-Wang] Zhejiang A&F Univ, Sch Engn, Linan 311300, Peoples R China', ' [Du, Guan-Ben] Southwest Forestry Univ, Kunming 650224, Yunnan, Peoples R China']
the output should be: ['China', 'China'].
I have written this code but it is not working as I want to:
from geotext import GeoText
sentence = df.iloc[0,0]
places = GeoText(sentence)
print(places.countries)
It prints only the country once and in some cases when it is USA it doesn't recognize the abbreviation. Can you help me figure out what to do?
l = [['[Aydemir, Deniz\', \' Gunduz, Gokhan\', \' Asik, Nejla] Bartin Univ, Fac Forestry, Dept Forest Ind Engn, TR-74100 Bartin, Turkey\', \' [Wang, Alice] Lulea Univ Technol, Wood Technol, Skelleftea, Sweden',1990],
['[Fang, Qun\', \' Cui, Hui-Wang] Zhejiang A&F Univ, Sch Engn, Linan 311300, Peoples R China\', \' [Du, Guan-Ben] Southwest Forestry Univ, Kunming 650224, Yunnan, Peoples R China',2005],
['[Blumentritt, Melanie\', \' Gardner, Douglas J.\', \' Shaler, Stephen M.] Univ Maine, Sch Resources, Orono, ME USA\', \' [Cole, Barbara J. W.] Univ Maine, Dept Chem, Orono, ME 04469 USA',2012]]
dataf = pd.DataFrame(l, columns = ['Authors', 'Year'])
I tried to do this code but I have the same problem, it doesn't give all the counties only one per row:
def find_country(n):
for c in pycountry.countries:
if str(c.name).lower() in n.lower():
return c.name
country1 = (dataf['Authors']
.replace(r"\bUSA\b", "United States", regex=True)
.apply(lambda x: find_country(x)))
USA does not seem to be detected correctly by geotext - it's worth trying to raise an issue with that package. As a workaround here, I replace USA with United States, which is correctly detected.
df = (dataf['Authors']
.replace(r"\bUSA\b", "United States", regex=True)
.apply(lambda x: geotext.GeoText(x).countries)
)
I'm not sure what you were doing before, but this will get the list of countries for each of the rows in Author, including duplicates.
0 [Turkey, Sweden]
1 [China, China]
2 [United States, United States]
Name: Authors, dtype: object
As mentioned in the comment, if you want to have an actual list of lists, just add tolist() to the end.
df.tolist()
[['Turkey', 'Sweden'], ['China', 'China'], ['United States', 'United States']]

How to fix my RE to get my expected arguments of group

I m new learner to python and learning Regex at this moment.
I made a Regex that is supposed to find all phone numbers.
I think I did it right but my code doesn't seem to be working correctly.
phoneRegex = re.compile(r'''(
(\d{2,3}|\(\d(2,3)\))? # first 2-3 digits
(\s|-|\.)? # -
(\d{3,4}) # second 3-4 digits
(\s|-|\.) # -
(\d{4}) # last 4 digits.
(\s*(ext|x|ext.)\s*(\d{3,4}))? # extension
)''', re.VERBOSE)
phoneRegex.findall('010 1234 5678 ext1234')
I am working on automatetheboringstuff tutorials, and read the Regex chapter through for 3 times.
If there are some minor things that I should read or consider, sorry for my hasty, but I spent roughly 2hrs, and I am happy to any of your suggesting reading materials and help.
I appreciate in advance.
Expected result:
[('010 1234 5678 ext1234', '010', ' ', '1234', ' ', '5678', ' ext1234', 'ext', '1234')]
Actual result:
[('010 1234 5678 ext1234', '010', '', ' ', '1234', ' ', '5678', ' ext1234', 'ext', '1234')]
what is the 3rd thing ('') and where did it come from?

Python Print Table for Term and Definition with Handled Overflow

I'm trying to make a program that prints out a two column table (Term and Definition) something like this: (table width should be 80 characters)
+--------------------------------------------------------------------------+
| Term | Definition
|
+--------------------------------------+-----------------------------------+
| this is the first term. |This is the definition for thefirst|
| |term that wraps around because the |
| |definition is longer than the width|
| |of the column. |
+--------------------------------------+-----------------------------------+
|The term may also be longer than the |This is the definition for the |
|width of the column and should wrap |second term. |
|around as well. | |
+--------------------------------------+-----------------------------------+
I have existing code for this, but it prints out "this is the first term" on every line because I have used a nested for loop. (Also tried implementing the textwrap module) Here is the code that I have:
# read file
with open(setsList[selectedSet-1], "r", newline="") as setFile:
cardList = list(csv.reader(setFile))
setFile.close()
for i in range(len(cardList)):
wrapped_term = textwrap.wrap(cardList[i][0], 30)
wrapped_definition = textwrap.wrap(cardList[i][1], 30)
for line in wrapped_term:
for line2 in wrapped_definition:
print(line, " ",line2)
print("- - - - - - - - - - - - - - - - - - - - - - - - - - -")
Can anyone suggest a solution? Thank you.
After a lot of (trial) & error & random youtube videos, the solution: (if anyone has a similar problem)
with open("table.csv", "r", newline="") as setFile:
cardList = list(csv.reader(setFile))
setFile.close()
print("+------------------------------------------------------------------------------+")
print("| Term | Definition |")
print("+------------------------------------------------------------------------------+")
print()
for x in range(len(cardList)):
wrapped_term = textwrap.wrap(cardList[x][0], 30)
wrapped_definition = textwrap.wrap(cardList[x][1], 30)
wrapped_list = []
for i in range(len(wrapped_term)):
try:
wrapped_list.append([wrapped_term[i], wrapped_definition[i]])
except IndexError:
if len(wrapped_term) > len(wrapped_definition):
wrapped_list.append([wrapped_term[i], ""])
elif len(wrapped_term) < len(wrapped_definition):
wrapped_list.append(["", wrapped_definition[i]])
column1 = len(" Term ")
column2 = len(" Definition ")
print("+--------------------------------------+---------------------------------------+")
for item in wrapped_list:
print("|", item[0], " "*(column1 - len(item[0])),"|", item[1], " "*(column2-len(item[1])), "|")
print("+--------------------------------------+---------------------------------------+")
print("* *")
Basically, I created a wrapped version of each of my terms and definitions.
Then the try-catch stuff checks whether the term is longer than the definition (in terms of lines) and if so puts blank lines for the definition and vice versa.
I then created a wrapped_list (combined terms and definitions) to store this the above.
With help from this video: (https://www.youtube.com/watch?v=B9BRuhqEb2Q), I formatted the table.
Hope this helped anyone struggling with a similar problem - this can be applied to any number of columns in a table, and any length of csv file.

How to capture words spread through multiple lines which have anywhite space(newline, space, tab)

import re
c = """
class_monitor std4:
Name: xyz
Roll number: 123
Age: 9
Badge: Blue
class_monitor std5:
Name: abc
Roll number: 456
Age: 10
Badge: Red
"""
I want to print Name, Roll number and age for std4 and Name, roll number and badge for std5.
pat = (class_monitor)(.*4:)(\n|\s|\t)*(Name:)(.*)(\s|\n|\t)*(Roll number:)(.*)(\s|\n|\t)*(Age:)(.*)(\s|\n|\t)*(Badge:)(.*)
it matches the respective std if I toggle the second group (.*4:) to (.*5:) in pythex.
However, in a script mode, it is not working. Am I missing something here?

Scraping youtube playlist

I've been trying to write a python script which will fetch me the name of the songs contained in the playlist whose link will be provided. for eg.https://www.youtube.com/watch?v=foE1mO2yM04&list=RDGMEMYH9CUrFO7CfLJpaD7UR85wVMfoE1mO2yM04 from the terminal.
I've found out that names could be extracted by using "li" tag or "h4" tag.
I wrote the following code,
import sys
link = sys.argv[1]
from bs4 import BeautifulSoup
import requests
req = requests.get(link)
try:
req.raise_for_status()
except Exception as exc:
print('There was a problem:',exc)
soup = BeautifulSoup(req.text,"html.parser")
Then I tried using li-tag as:
i=soup.findAll('li')
print(type(i))
for o in i:
print(o.get('data-video-title'))
But it printed "None" those number of time. I belive it is not able to reach those li tags which contains data-video-title attribute.
Then I tried using div and h4 tags as,
for i in soup.findAll('div', attrs={'class':'playlist-video-description'}):
o = i.find('h4')
print(o.text)
But nothing happens again..
import requests
from bs4 import BeautifulSoup
url = 'https://www.youtube.com/watch?v=foE1mO2yM04&list=RDGMEMYH9CUrFO7CfLJpaD7UR85wVMfoE1mO2yM04'
data = requests.get(url)
data = data.text
soup = BeautifulSoup(data)
h4 = soup.find_all("h4")
for h in h4:
print(h.text)
output:
Mike Posner - I Took A Pill In Ibiza (Seeb Remix) (Explicit)
Alan Walker - Faded
Calvin Harris - This Is What You Came For (Official Video) ft. Rihanna
Coldplay - Hymn For The Weekend (Official video)
Jonas Blue - Fast Car ft. Dakota
Calvin Harris & Disciples - How Deep Is Your Love
Galantis - No Money (Official Video)
Kungs vs Cookin’ on 3 Burners - This Girl
Clean Bandit - Rockabye ft. Sean Paul & Anne-Marie [Official Video]
Major Lazer - Light It Up (feat. Nyla & Fuse ODG) [Remix] (Official Lyric Video)
Robin Schulz - Sugar (feat. Francesco Yates) (OFFICIAL MUSIC VIDEO)
DJ Snake - Middle ft. Bipolar Sunshine
Jonas Blue - Perfect Strangers ft. JP Cooper
David Guetta ft. Zara Larsson - This One's For You (Music Video) (UEFA EURO 2016™ Official Song)
DJ Snake - Let Me Love You ft. Justin Bieber
Duke Dumont - Ocean Drive
Galantis - Runaway (U & I) (Official Video)
Sigala - Sweet Lovin' (Official Video) ft. Bryn Christopher
Martin Garrix - Animals (Official Video)
David Guetta & Showtek - Bad ft.Vassy (Lyrics Video)
DVBBS & Borgeous - TSUNAMI (Original Mix)
AronChupa - I'm an Albatraoz | OFFICIAL VIDEO
Lilly Wood & The Prick and Robin Schulz - Prayer In C (Robin Schulz Remix) (Official)
Kygo - Firestone ft. Conrad Sewell
DEAF KEV - Invincible [NCS Release]
Eiffel 65 - Blue (KNY Factory Remix)
Ok guys, I have figured out what was happening. My code was perfect and it works fine, the problem was that I was passing the link as an argument from the terminal and co-incidentally, the link contained some symbols which were interpreted in some other fashion for eg. ('&').
Now I am passing the link as a string in the terminal and everything works fine. So dumb yet time-consuming mistake.

Resources