Is there a method to detect a person and associate a text? - python-3.x

I have a text like:
Take a look at some of the first confirmed Forum speakers: John
Sequiera Graduated in Biology at Facultad de Ciencias Exactas y
Naturales,University of Buenos Aires, Argentina. In 2004 obtained a
PhD in Biology (Molecular Neuroscience), at University of Buenos
Aires, mentored by Prof. Marcelo Rubinstein. Between 2005 and 2008
pursued postdoctoral training at Pasteur Institute (Paris) mentored by
Prof Jean-Pierre Changeux, to investigate the role of nicotinic
receptors in executive behaviors. Motivated by a deep interest in
investigating human neurological diseases, in 2009 joined the
Institute of Psychiatry at King’s College London where she performed
basic research with a translational perspective in the field of
neurodegeneration.
Since 2016 has been chief of instructors / Adjunct professor at University of Buenos Aires, Facultad de Ciencias Exactas y Naturales.
Tom Gonzalez is a professor of Neuroscience at the Sussex Neuroscience, School of Life Sciences, University of Sussex. Prof.
Baden studies how neurons and networks compute, using the beautiful
collection of circuits that make up the vertebrate retina as a model.
I want to have as output:
[{"person" : "John Sequiera" , "content": "Graduated in Biology at Facultad...."},{"person" : "Tom Gonzalez" , "content": "is a professor of Neuroscience at the Sussex..."}]
So I want to use NER to detect PER (person) entities, and for content I put everything after the detected person until a new person is found in the text.
Is it possible?
I tried to use spaCy to extract the entities, but I had difficulty getting the content:
import spacy

nlp = spacy.load("en_core_web_lg")
doc = nlp(text)

for ent in doc.ents:
    print(ent.text, ent.label_)
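One possible approach (a sketch, not a full solution): collect the PERSON entities in document order and slice the raw text between consecutive mentions. Be aware that the model will likely also tag mentors such as "Marcelo Rubinstein" or "Jean-Pierre Changeux" as PERSON, so you may need to filter the hits.
import spacy

nlp = spacy.load("en_core_web_lg")

text = "Take a look at some of the first confirmed Forum speakers: ..."  # the full text from the question

doc = nlp(text)

# PERSON entities, in the order they appear in the document
persons = [ent for ent in doc.ents if ent.label_ == "PERSON"]

results = []
for i, ent in enumerate(persons):
    # content runs from the end of this mention to the start of the next one
    start = ent.end_char
    end = persons[i + 1].start_char if i + 1 < len(persons) else len(text)
    results.append({"person": ent.text, "content": text[start:end].strip()})

print(results)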

Related

Dealing with timeout error while web scraping

The link I am trying to scrape is https://www.zomato.com/lucknow/skyhilton-1-alambagh/reviews, specifically the 'Names' and 'Reviews'.
I keep getting a timeout error while requesting the URL; I haven't defined a timeout limit in this case. Is there a way to make my request more reliable, or should I use some other library/module for this purpose?
Error message: TimeoutError: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond
Here is my code:
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq
from urllib.request import Request

url = 'https://www.zomato.com/lucknow/skyhilton-1-alambagh/reviews'
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
uClient = uReq(req)
page = uClient.read()
page_html = soup(page, "html.parser")
containers = page_html.findAll("div", {"class": "sc-eetwQk hAcPWO"})
print(containers[0].prettify())
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101 Firefox/78.0'
}

params = {
    'sort': 'dd',
    'filter': 'reviews-dd',
    'res_id': 18439027
}

def main(url):
    with requests.Session() as req:
        for page in range(1, 11):
            print(f"{'*' * 30} Extracting Page# {page} {'*' * 30}")
            params['page'] = page
            r = req.get(url, params=params, headers=headers).json()
            for x in r['entities']['REVIEWS'].values():
                print("Username: {:<20}, Comment: {}".format(
                    x['userName'], x['reviewText']))

main("https://www.zomato.com/webroutes/reviews/loadMore")
Output:
****************************** Extracting Page# 1 ******************************
Username: Nitisha Dwivedi , Comment: I tried veg manchurian, veg noodles, pasta arrabiata, virgin mojito from sky hilton and the food was great, freshly cooked, very tasty and well presented. The waiters were attentive and service was overall good but some of the waiters are rude even. I would suggest you to come to this place and enjoy food!
Username: Sakshi Jaiswal , Comment: this place is lovely rocking<br/>the party place <br/>best saturday place<br/>live music great taste<br/><br/>
Username: Atul , Comment: They send me gravy chicken while I ordered tava chicken listed under DRY snacks and beforehand informed Mr. Dinesh Dixit as well about the order.But this is what I received. Gravy can be clearly seen with Oil spilling all over. When I informed ZOMATO and the manager about it, ZOMATO said they will give the feedback to the restaurant (unsatisfactory resolution) and the manager said this is how dry tava chicken looks like. Do not order online as no one will listen to you even if you are right . Pathetic experience from SkyHilton and ZOMATO as well this time.
Username: Sweety Singh , Comment:
Username: Aman Bhardwaj , Comment: the food is awesome you can even visit here with your family the taste of food is ❤️❤️❤️
****************************** Extracting Page# 2 ******************************
Username: Akanksha Singh , Comment:
Username: Harsh Mehrotra , Comment:
Username: Sheetal Kapoor , Comment: I have given them 4 stars because of the service as the restaurant need to really work upon that. <br/>
Username: Mâñvéñdrâ Singh , Comment:
Username: Vishal Yadav , Comment: Good food
****************************** Extracting Page# 3 ******************************
Username: Avni Singh , Comment:
Username: AVNI SINGH , Comment:
Username: Avni Singh , Comment:
Username: Vanshika Shukla , Comment:
Username: Anushka Singh , Comment:
****************************** Extracting Page# 4 ******************************
Username: Anushka Singh , Comment:
Username: Priya Singh , Comment:
Username: Chandramohan Yadav , Comment: Staff and ambience is too good and a healthy and friendly environment
Username: Kavita Vishwakarma , Comment:
Username: Govind Bahadur , Comment: Good food good taste all time to choose this place for ordering and dining very good place and serve very good
****************************** Extracting Page# 5 ******************************
Username: Govind Bahadur , Comment: Such a recommended place to all.<br/>Here food serve good with hygiene way and with a superior good taste.
Username: Anuj Kashyap , Comment: My friend was suggested me to order from here and i verry surprised by there taste and food quality.<br/>Thank you Sky Hilton to serve us verry well.
Username: Anuj Kashyap , Comment: My friend was suggested me to order from here and i verry surprised by there taste and food quality.<br/>Thank you Sky Hilton to serve us verry well.
Username: Akash Choubey , Comment: Good food and very good taste my friend was suggested to order from here and I will totally appreciate the food taste and hygiene packing 😋😋
Username: Akshat Anand , Comment: fine dine restaurant is very excellent and the service person is very kind ; dheer singh who is so amazing and overall experience is very amazing
****************************** Extracting Page# 9 ******************************
Username: Anushree Nigam , Comment: Dheer was very courteous while serving!
Username: Agosh Baranwal , Comment: Amazing staff. Very dedicated and polite.
Username: Sameer Agarwal , Comment:
Username: Vinod Kushwaha , Comment: Awesome food and service by Dheer singh
Username: Manendra Singh , Comment: Nice ambiance music was soo good i love skyhilton exxillent service given by ajeet patel
****************************** Extracting Page# 10 ******************************
Username: Vandana Singh , Comment: awesome service very nice food 🎎🎎🎎🎎🎎🎎🎎🎎<br/>Service dheer singh
Username: Harpreet Singh , Comment: Best service by shubham kanchan
Username: Shruti Mirchandani , Comment: Heard about this restaurant cum bar as one of the trending outlet in Lucknow in Alambagh.<br/>Can visit for good food simply served in very simple way have tried Handi Mutton, Pasta, Paneer Tikka Masala, jeera rice and of course drinks.<br/><br/>Paneer tikka masala is actually good not like any other restaurant who put All that capsicum and onion in paneer tikka masala it's blend of flavor full masala and taste good.<br/><br/>Handi masala was also tasted good.<br/><br/>Though didn't like the ambience at all as it could be done better.<br/><br/>Also can eenjoy Hukka in open sitting area at fifth floor.
Username: Anmol Kacker , Comment: I am writing this review after my fourth visit to this place. Rest all my visits were on a weekday afternoon so, decided to give this place a try on a Saturday night.<br/>The ambiance though a lot changed than before, was good but that's pretty much it. The music was way too loud and deafening. They are currently having a street food festival
so, decided to try a couple of items. Vada pav and omelette were the items ordered. Despite repeatedly telling the person who took the order to give a masala omelette, he got a plain one and that too cold. The crispy corn ordered was awful. Sweet corn soup, was simply cold water with some half boiled vegetables and traces of corn. It had just no taste at all. Tried finishing the items, unfortunately, couldn't even eat half of the meal.<br/>Finally, decided to get up and walk away. Saturday nights can be maddening with overflowing crowd but atleast do something about your food. Quick service stands nowhere if what you are delivering is so bad. Really thought the place would be a bit different this time. Unfortunately, it was as bad as before. Done with this place ! Never again !<br/><br/>Also, I am not big a fan of saturday night parties but all I have to say here is that people were coming in, looking around and walking away. Seeing heavily drunken men dancing is not what a saturday night means to me or I guess to anyone.
Username: Piyush Kumar , Comment: Please increase the salary of mr. Bhanu he is good in f&b Service skills
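As a side note on the original timeout question: requests lets you pass an explicit timeout so a stalled connection fails fast instead of hanging. A minimal sketch (the 10-second value is an arbitrary example):
import requests

try:
    r = requests.get(
        'https://www.zomato.com/lucknow/skyhilton-1-alambagh/reviews',
        headers={'User-Agent': 'Mozilla/5.0'},
        timeout=10,  # seconds; arbitrary example value
    )
except requests.exceptions.Timeout:
    print('Request timed out; retry or raise the limit.')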

Need to get actor name out of the Json file

I want to get the actor name out of this JSON file's page_title and then match it against a database. I tried using NLTK and spaCy, but there I have to train the data. Do I have to train for each and every sentence? I have more than 100k sentences; if I sit down to train the data it will take a month or more. Is there any way I can dump the K_actor database to train spaCy, NLTK, or any other tool?
{"page_title": "Sonakshi Sinha To Auction Sketch Of Buddha To Help Migrant Labourers", "description": "Sonakshi Sinha took to Instagram to share a timelapse video of a sketch of Buddha that she made to auction to raise funds for migrant workers affected by Covid-19 crisis. ", "image_url": "https://images.news18.com/ibnlive/uploads/2020/05/1589815261_1589815196489_copy_875x583.jpg", "post_url": "https://www.news18.com/news/movies/sonakshi-sinha-to-auction-sketch-of-buddha-to-help-migrant-labourers-2626123.html"}
{"page_title": "Anushka Sharma Calls Virat Kohli 'A Liar' on IG Live, Nushrat Bharucha Gets Propositioned on Twitter", "description": "In an Instagram live interaction with Sunil Chhetri, Virat Kohli was left embarrassed after Anushka Sharma called him a 'jhootha' from behind the camera. This and more in today's wrap.", "image_url": "https://images.news18.com/ibnlive/uploads/2020/05/1589813980_1589813933996_copy_875x583.jpg", "post_url": "https://www.news18.com/news/movies/anushka-sharma-calls-virat-kohli-a-liar-on-ig-live-nushrat-bharucha-gets-propositioned-on-twitter-2626093.html"}
{"page_title": "Ranveer Singh Shares a Throwback to the Days When WWF was His Life", "description": "Ranveer Singh shared a throwback picture from his childhood where he could be seen posing in front of a poster of WWE legend Hulk Hogan.", "image_url": "https://images.news18.com/ibnlive/uploads/2020/05/1589812401_screenshot_20200518-195906_chrome_copy_875x583.jpg", "post_url": "https://www.news18.com/news/movies/ranveer-singh-shares-a-throwback-to-the-days-when-wwf-was-his-life-2626067.html"}
{"page_title": "Salman Khan's Love Song 'Tere Bina' Gets 26 Million Views", "description": "Salman Khan's song Tere Bina, which was launched a few days ago, had garnered 12 million views within 24 hours. As it continues to trend, it has garnered 26 million views in less than a week.", "image_url": "https://images.news18.com/ibnlive/uploads/2020/05/1589099778_screenshot_20200510-135934_chrome_copy_875x583.jpg", "post_url": "https://www.news18.com/news/movies/salman-khans-love-song-tere-bina-gets-26-million-views-2626077.html"}
{"page_title": "Yash And Radhika Pandit Pose With Their Kids For a Perfect Family Picture", "description": "Kannada actor Yash tied the knot with actress Radhika Pandit in 2016. The couple shares two kids together.", "image_url": "https://images.news18.com/ibnlive/uploads/2020/05/1589812187_yash.jpg", "post_url": "https://www.news18.com/news/movies/yash-and-radhika-pandit-pose-with-their-kids-for-a-perfect-family-picture-2626055.html"}
{"page_title": "Malaika Arora Shares Beach Vacay Boomerang With Hopeful Note", "description": "Malaika Arora shared a throwback boomerang from a beach vacation where she could be seen playfully spinning. She also shared a hopeful message along with it.", "image_url": "https://images.news18.com/ibnlive/uploads/2020/05/1589810291_screenshot_20200518-192603_chrome_copy_875x583.jpg", "post_url": "https://www.news18.com/news/movies/malaika-arora-shares-beach-vacay-boomerang-with-hopeful-note-2626019.html"}
{"page_title": "Actor Nawazuddin Siddiqui's Wife Aaliya Sends Legal Notice To Him Demanding Divorce, Maintenance", "description": "The notice was sent to the ", "image_url": "https://images.news18.com/ibnlive/uploads/2019/10/Nawazuddin-Siddiqui.jpg", "post_url": "https://www.news18.com/news/movies/actor-nawazuddin-siddiquis-wife-aaliya-sends-legal-notice-to-him-demanding-divorce-maintenance-2626035.html"}
{"page_title": "Lisa Haydon Celebrates Son Zack\u2019s 3rd Birthday With Homemade Cake And 'Spiderman' Surprise", "description": "Lisa Haydon took to Instagram to share some glimpses from the special day. In the pictures, we can spot a man wearing a Spiderman costume.", "image_url": "https://images.news18.com/ibnlive/uploads/2020/05/1589807960_lisa-rey.jpg", "post_url": "https://www.news18.com/news/movies/lisa-haydon-celebrates-son-zacks-3rd-birthday-with-homemade-cake-and-spiderman-surprise-2625953.html"}
{"page_title": "Chiranjeevi Recreates Old Picture with Wife, Says 'Time Has Changed'", "description": "Chiranjeevi was last seen in historical-drama Sye Raa Narasimha Reddy. He was shooting for his next film, Acharya, before the coronavirus lockdown.", "image_url": "https://images.news18.com/ibnlive/uploads/2020/05/1589808242_pjimage.jpg", "post_url": "https://www.news18.com/news/movies/chiranjeevi-recreates-old-picture-with-wife-says-time-has-changed-2625973.html"}
{"page_title": "Amitabh Bachchan, Rishi Kapoor\u2019s Pout Selfie Recreated By Abhishek, Ranbir is Priceless", "description": "A throwback picture that has gone viral on the internet shows Ranbir Kapoor and Abhishek Bachchan recreating a selfie of their fathers Rishi Kapoor and Amitabh Bachchan.", "image_url": "https://images.news18.com/ibnlive/uploads/2020/05/1589807772_screenshot_20200518-184521_chrome_copy_875x583.jpg", "post_url": "https://www.news18.com/news/movies/amitabh-bachchan-rishi-kapoors-pout-selfie-recreated-by-abhishek-ranbir-is-priceless-2625867.html"}
Something that you can do is to create an annotator script wherein you replace actor names with '###' or some other placeholder string, which is later replaced with the actor names (entities) to generate training data.
I trained 68K sentences in 9 hours on my i3 laptop. You can dump your data like this, and the output file can be used for training the model.
That will save time and also give you a ready-made training data format for spaCy.
from nltk import word_tokenize
from pandas import read_csv
import re
import os.path

def annot(Label, entity, textlist):
    finaldict = []
    for text_token in textlist:
        textbk = text_token
        for value in entity:
            # substitute the '###' placeholder with the entity value
            text = str(textbk).replace('###', value)
            text = text.lower()
            text = re.sub(r'[^a-zA-Z0-9\n\.]', ' ', text)
            if len(word_tokenize(value)) < 2:
                # single-token entity: walk the tokens and track character offsets
                newtext = word_tokenize(text)
                traindata = []
                prev_length = 0
                prev_pos = 0
                k = 0
                while k != len(newtext):
                    if k == 0:
                        prev_pos = 0
                        prev_length = len(newtext[k])
                    else:
                        prev_pos = prev_length + 1
                        prev_length = prev_length + len(newtext[k]) + 1
                    if value.lower() == str(newtext[k]):
                        tup = (prev_pos, prev_length, Label)
                        traindata.append(tup)
                    k = k + 1
                mydict = {'entities': traindata}
                finaldict.append((text, mydict))
            else:
                # multi-token entity: locate it directly in the string
                traindata = []
                try:
                    begin = text.index(value.lower())
                    # end offset is start + length of the entity
                    tup = (begin, begin + len(value.lower()), Label)
                    traindata.append(tup)
                except ValueError:
                    pass
                mydict = {'entities': traindata}
                finaldict.append((text, mydict))
    return finaldict
def getEntities(csv_file, column):
    df = read_csv(csv_file)
    return df[column].to_list()

def getSentences(file_name):
    with open(file_name) as file1:
        sentences = [line1.rstrip('\n') for line1 in file1]
    return sentences

def saveData(data, filename, path):
    filename = os.path.join(path, filename)
    with open(filename, 'a') as file:
        for sent in data:
            file.write("{}\n".format(sent))
ents = getEntities(csv_file, column_name)  # actor names in your case
entities = [ent for ent in ents if str(ent) != 'nan']
sentences = getSentences(filepathandname)  # considering you have the sentences in a text file
label = 'ACTOR_NAMES'
data = annot(label, entities, sentences)
saveData(data, 'train_data.txt', path)
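For reference, each tuple appended to finaldict follows spaCy's offset-based training format, (text, {'entities': [(start, end, label)]}). A hypothetical example of one output item:
train_item = (
    'ranveer singh shares a throwback to the days when wwf was his life',
    {'entities': [(0, 13, 'ACTOR_NAMES')]},  # start offset, end offset, label
)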
Hope this is a relevant answer for your question.

How to reconstruct original text from spaCy tokens, even in cases with complicated whitespacing and punctuation

' '.join(token_list) does not reconstruct the original text in cases with multiple whitespaces and punctuation in a row.
For example:
from spacy.tokenizer import Tokenizer
from spacy.lang.en import English
nlp = English()
# Create a blank Tokenizer with just the English vocab
tokenizerSpaCy = Tokenizer(nlp.vocab)
context_text = 'this is a test \n \n \t\t test for \n testing - ./l \t'
contextSpaCyToksSpaCyObj = tokenizerSpaCy(context_text)
spaCy_toks = [i.text for i in contextSpaCyToksSpaCyObj]
reconstruct = ' '.join(spaCy_toks)
reconstruct == context_text
>False
Is there an established way of reconstructing original text from spaCy tokens?
An established answer should work with this edge-case text:
" UNCLASSIFIED U.S. Department of State Case No. F-2014-20439 Doc No. C05795279 Date: 01/07/2016\n\n\n RELEASE IN PART\n B5, B6\n\n\n\n\nFrom: H <hrod17#clintonemail.com>\nSent: Monday, July 23, 2012 7:26 AM\nTo: 'millscd #state.gov'\nCc: 'DanielJJ#state.gov.; 'hanleymr#state.gov'\nSubject Re: S speech this morning\n\n\n\n Waiting to hear if Monica can come by and pick up at 8 to take to Josh. If I don't hear from her, can you send B5\nsomeone else?\n\n Original Message ----\nFrom: Mills, Cheryl D [MillsCD#state.gov]\nSent: Monday, July 23, 2012 07:23 AM\nTo: H\nCc: Daniel, Joshua J <Daniel1.1#state.gov>\nSubject: FW: S speech this morning\n\nSee below\n\n B5\n\ncdm\n\n Original Message\nFrom: Shah, Rajiv (AID/A) B6\nSent: Monday, July 23, 2012 7:19 AM\nTo: Mills, Cheryl D\nCc: Daniel, Joshua.'\nSubject: S speech this morning\n\nHi cheryl,\n\nI look fwd to attending the speech this morning.\n\nI had one last minute request - I understand that in the final version there is no reference to the child survival call to\naction, but their is a reference to family planning efforts. Could you and josh try to make sure there is some specific\nreference to the call to action?\n\nAlso, in terms of acknowledgements it would be good to note torn friedan's leadership as everyone is sensitive to our ghi\ntransition and we want to continue to send the usaid-pepfar-cdc working together public message. I don't know if he is\nthere, but wanted to flag.\n\nLook forward to it.\n\nRaj\n\n\n\n\n UNCLASSIFIED U.S. Department of State Case No. F-2014-20439 Doc No. C05795279 Date: 01/07/2016\n\x0c"
You can very easily accomplish this by changing two lines in your code:
spaCy_toks = [i.text + i.whitespace_ for i in contextSpaCyToksSpaCyObj]
reconstruct = ''.join(spaCy_toks)
Basically, each token in spaCy knows whether it is followed by whitespace or not. So you call token.whitespace_ instead of joining them on space by default.
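Here is the complete round trip under the question's own setup, using token.text plus token.whitespace_ (equivalently, token.text_with_ws). This should print True, since spaCy's tokenization is non-destructive:
from spacy.tokenizer import Tokenizer
from spacy.lang.en import English

nlp = English()
tokenizerSpaCy = Tokenizer(nlp.vocab)

context_text = 'this is a test \n \n \t\t test for \n testing - ./l \t'
doc = tokenizerSpaCy(context_text)

# each token knows its own trailing whitespace, so plain concatenation
# restores the original string exactly
reconstruct = ''.join(token.text + token.whitespace_ for token in doc)
print(reconstruct == context_text)  # True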

unable to get rid of all emojis

I need help removing emojis. I looked at some other Stack Overflow questions and this is what I am doing, but for some reason my code doesn't get rid of all the emojis.
d= {'alexveachfashion': 'Fashion Style * Haute Couture * Wearable Tech * VR\n👓👜⌚👠\nSoundCloud is Live #alexveach\n👇New YouTube Episodes ▶️👇', 'andrewvng': 'Family | Fitness | Friends | Gym | Food', 'runvi.official': 'Accurate measurement via SMART insoles & real-time AI coaching. Improve your technique & BOOST your performance with every run.\nSoon on Kickstarter!', 'triing': 'Augmented Jewellery™️ • Montreal. Canada.', 'gedeanekenshima': 'Prof na Etec Albert Einstein, Mestranda em Automação e Controle de Processos, Engenheira de Controle e Automação, Técnica em Automação Industrial.', 'jetyourdaddy': '', 'lavonne_sun': '☄️🔭 ✨\n°●°。Visual Narrative\nA creative heart with a poetic soul.\n————————————\nPARSONS —— Design & Technology', 'taysearch': 'All the World’s Information At Your Fingertips. (Literally) Est. 1991🇺🇸 🎀#PrincessofSearch 🔎Sample 👇🏽 the Search Engine Here 🗽', 'hijewellery': 'Fine 3D printed jewellery for tech lovers #3dprintedjewelry #wearabletech #jewellery', 'yhanchristian': 'Estudante de Engenharia, Maker e viciado em café.', 'femka': 'Fashion Futurist + Fashion Tech Lab Founder #technoirlab + Fashion Designer / Parsons & CSM Grad / Obsessed with #fashiontech #future #cryptocurrency', 'sinhbisen': 'Creator, TRiiNG, augmented jewellery label ⭕️ Transhumanist ⭕️ Corporeal cartographer ⭕️', 'stellawearables': '#StellaWearables ✉️Info#StellaWearables.com Premium Wearable Technology That Monitors Personal Health & Environments ☀️🏝🏜🏔', 'ivoomi_india': 'We are the manufacturers of the most innovative technologies and user-friendly gadgets with a global presence.', 'bgutenschwager': "When it comes to life, it's all about the experience.\nGoogle Mapper 🗺\n360 Photographer 📷\nBrand Rep #QuickTutor", 'storiesofdesign': 'Putting stories at the heart of brands and businesses | Cornwall and London | #storiesofdesign', 'trume.jp': '草創期から国産ウオッチの製造に取り組み、挑戦を続けてきたエプソンが世界に放つ新ブランド「TRUME」(トゥルーム)。目指すのは、最先端技術でアナログウオッチを極めるブランド。', 'themarinesss': "I didn't choose the blog life, the blog life chose me | Aspiring Children's Book Author | www.slayathomemum.com", 'ayowearable': 'The world’s first light-based wearable that helps you sleep better, beat jet lag and have more energy! #goAYO Get yours at:', 'wearyourowntechs': 'Bringing you the latest trends, Current Products and Reviews of Wearable Technology. Discover how they can enhance your Life and Lifestyle', 'roxfordwatches': 'The Roxford | The most stylish and customizable fitness smartwatch. Tracks your steps/calories/dist/sleep. Comes with FOUR bands, and a travel case!', 'playertek': "Track your entire performance - every training session, every match. \nBecause the best players don't hide.", '_kate_hartman_': '', 'hmsmc10': 'Health & Wellness 🍎\nBoston, MA 🏙\nSuffolk MPA ‘17 🎓 \n.\nJust Strong Ambassador 🏋🏻\u200d♀️', 'gadgetxtreme': 'Dedicated to reviewing gadgets, technologies, internet products and breaking tech news. Follow us to see daily vblogs on all the disruptive tech..', 'freedom.journey.leader': '📍MN\n🍃Wife • Homeschooling Mom to 5 🐵 • D Y I lover 🔨 • Small town living in MN. 🌿 \n📧Ashleybp5#gmail.com \n#homeschool #bossmom #builder #momlife', 'arts_food_life': 'Life through my phone.', 'medgizmo': 'Wearable #tech: #health #healthcare #wellness #gadgets #apps. 
Images/links provided as information resource only; doesn’t mean we endorse referenced', 'sawearables': 'The home of wearable tech in South Africa!\n--> #WearableTech #WearableTechnology #FitnessTech Find your wearable #', 'shop.mercury': 'Changing the way you charge.⚡️\nGet exclusive product discounts, and help us reach our goal below!🔋', 'invisawear': 'PRE-ORDERS NOW AVAILABLE! Get yours 25% OFF here: #girlboss #wearabletech'}
for key in d:
    print("---with emojis----")
    print(d[key])
    print("---emojis removed----")
    x = ''.join(c for c in d[key] if c <= '\uFFFF')
    print(x)
Output example:
---with emojis----
📍MN
🍃Wife • Homeschooling Mom to 5 🐵 • D Y I lover 🔨 • Small town living in MN. 🌿
📧Ashleybp5#gmail.com
#homeschool #bossmom #builder #momlife
---emojis removed----
MN
Wife • Homeschooling Mom to 5 • D Y I lover • Small town living in MN.
Ashleybp5#gmail.com
#homeschool #bossmom #builder #momlife
---with emojis----
Changing the way you charge.⚡️
Get exclusive product discounts, and help us reach our goal below!🔋
---emojis removed----
Changing the way you charge.⚡️
Get exclusive product discounts, and help us reach our goal below!
There is no technical definition of what an "emoji" is. Various glyphs may be used to render printable characters, symbols, control characters and the like. What seems like an "emoji" to you may be part of normal script to others.
What you probably want to do is to look at the Unicode category of each character and filter out various categories. While this does not solve the "emoji"-definition-problem per se, you get much better control over what you are actually doing without removing, for example, literally all characters of languages spoken by 2/3 of the planet.
Instead of filtering out certain categories, you may filter everything except the lower- and uppercase letters (and numbers). However, be aware that ꙭ is not "the googly eyes emoji" but the CYRILLIC SMALL LETTER DOUBLE MONOCULAR O, which is a normal lowercase letter to millions of people.
For example:
import unicodedata
s = "🍃Wife • Homeschooling Mom to 5 🐵 • D Y I lover 🔨 • Small town living in MN. 🌿"
# Just filter category "symbol"
t = ''.join(c for c in s if unicodedata.category(c) not in ('So', ))
print(t)
...results in
Wife • Homeschooling Mom to 5 • D Y I lover • Small town living in MN.
This may not be emoji-free enough; the •, for example, is technically a form of punctuation, so filter that as well:
# Filter symbols and punctuations. You may want 'Cc' as well,
# to get rid of control characters. Beware that newlines are a
# form of control-character.
t = ''.join(c for c in s if unicodedata.category(c) not in ('So', 'Po'))
print(t)
And you get
Wife Homeschooling Mom to 5 D Y I lover Small town living in MN
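If you want the stricter variant mentioned above, keeping only letters and numbers (plus spaces), you can test the first letter of the Unicode category instead; a sketch:
import unicodedata

s = "🍃Wife • Homeschooling Mom to 5 🐵 • D Y I lover 🔨 • Small town living in MN. 🌿"
# keep letters (L*), numbers (N*) and separators (Z*, i.e. spaces)
t = ''.join(c for c in s if unicodedata.category(c)[0] in ('L', 'N', 'Z'))
print(t)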

How to download pubmed articles and read them?

I'm having trouble saving PubMed articles and reading them. I've seen on this page that there are some special file types, but none of them worked for me. I want to save them in a way that lets me keep using the keys to get at the data. I don't know if that is possible if I save it as a text file. My code is this:
import sys
from Bio import Entrez
import re
import os
from Bio import Medline
from Bio import SeqIO

'''Class Crawler is responsible for browsing the biological databases.
from DownloadArticles import DownloadArticles
c = DownloadArticles()
c.articles_dataset_list
'''

class DownloadArticles():
    def __init__(self):
        Entrez.email = 'myemail@gmail.com'
        self.dataC = self.saveArticlesFilesInXMLMode('pubmed', '26837606')

    '''Method 4: read data as text.'''
    def saveArticlesFilesInXMLMode(self, dbs, ids):
        net_handle = Entrez.efetch(db=dbs, id=ids, rettype="medline", retmode="txt")
        directory = "/dataset/Pubmed/DatasetArticles/" + ids + ".fasta"
        # if not os.path.exists(directory):
        #     os.makedirs(directory)
        # filename = directory + '/'
        # if not os.path.exists(filename):
        out_handle = open(directory, "w+")
        out_handle.write(net_handle.read())
        out_handle.close()
        net_handle.close()
        print("Saved")
        print("Parsing...")
        record = SeqIO.read(directory, "fasta")
        print(record)
        return record.read()
I'm getting this error: ValueError: No records found in handle
Please, can someone help me?
Now my code is like this. I am trying to write a function to save in .fasta like you did, and one to read the .fasta files like in the answer above.
import sys
from Bio import Entrez
import re
import os
from Bio import Medline
from Bio import SeqIO

def save_Articles_Files(dbName, idNum, rettypeName):
    net_handle = Entrez.efetch(db=dbName, id=idNum, rettype=rettypeName, retmode="txt")
    filename = path + idNum + ".fasta"
    out_handle = open(filename, "w")
    out_handle.write(net_handle.read())
    out_handle.close()
    net_handle.close()
    print("Saved")

Entrez.email = 'myemail@gmail.com'
dbName = 'pubmed'
idNum = '26837606'
rettypeName = "medline"
path = "/run/media/Dropbox/codigos/Codes/" + dbName
save_Articles_Files(dbName, idNum, rettypeName)
But my function is not working. I need some help, please!
You're mixing up two concepts.
1) Entrez.efetch() is used to access NCBI. In your case you are downloading an article from PubMed. The result that you get from net_handle.read() looks like:
PMID- 26837606
OWN - NLM
STAT- In-Process
DA - 20160203
LR - 20160210
IS - 2045-2322 (Electronic)
IS - 2045-2322 (Linking)
VI - 6
DP - 2016 Feb 03
TI - Exploiting the CRISPR/Cas9 System for Targeted Genome Mutagenesis in Petunia.
PG - 20315
LID - 10.1038/srep20315 [doi]
AB - Recently, CRISPR/Cas9 technology has emerged as a powerful approach for targeted
genome modification in eukaryotic organisms from yeast to human cell lines. Its
successful application in several plant species promises enormous potential for
basic and applied plant research. However, extensive studies are still needed to
assess this system in other important plant species, to broaden its fields of
application and to improve methods. Here we showed that the CRISPR/Cas9 system is
efficient in petunia (Petunia hybrid), an important ornamental plant and a model
for comparative research. When PDS was used as target gene, transgenic shoot
lines with albino phenotype accounted for 55.6%-87.5% of the total regenerated T0
Basta-resistant lines. A homozygous deletion close to 1 kb in length can be
readily generated and identified in the first generation. A sequential
transformation strategy--introducing Cas9 and sgRNA expression cassettes
sequentially into petunia--can be used to make targeted mutations with short
indels or chromosomal fragment deletions. Our results present a new plant species
amenable to CRIPR/Cas9 technology and provide an alternative procedure for its
exploitation.
FAU - Zhang, Bin
AU - Zhang B
AD - Chongqing Engineering Research Centre for Floriculture, Key Laboratory of
Horticulture Science for Southern Mountainous Regions, Ministry of Education,
College of Horticulture and Landscape Architecture, Southwest University,
Chongqing 400716, China.
FAU - Yang, Xia
AU - Yang X
AD - Chongqing Engineering Research Centre for Floriculture, Key Laboratory of
Horticulture Science for Southern Mountainous Regions, Ministry of Education,
College of Horticulture and Landscape Architecture, Southwest University,
Chongqing 400716, China.
FAU - Yang, Chunping
AU - Yang C
AD - Chongqing Engineering Research Centre for Floriculture, Key Laboratory of
Horticulture Science for Southern Mountainous Regions, Ministry of Education,
College of Horticulture and Landscape Architecture, Southwest University,
Chongqing 400716, China.
FAU - Li, Mingyang
AU - Li M
AD - Chongqing Engineering Research Centre for Floriculture, Key Laboratory of
Horticulture Science for Southern Mountainous Regions, Ministry of Education,
College of Horticulture and Landscape Architecture, Southwest University,
Chongqing 400716, China.
FAU - Guo, Yulong
AU - Guo Y
AD - Chongqing Engineering Research Centre for Floriculture, Key Laboratory of
Horticulture Science for Southern Mountainous Regions, Ministry of Education,
College of Horticulture and Landscape Architecture, Southwest University,
Chongqing 400716, China.
LA - eng
PT - Journal Article
PT - Research Support, Non-U.S. Gov't
DEP - 20160203
PL - England
TA - Sci Rep
JT - Scientific reports
JID - 101563288
SB - IM
PMC - PMC4738242
OID - NLM: PMC4738242
EDAT- 2016/02/04 06:00
MHDA- 2016/02/04 06:00
CRDT- 2016/02/04 06:00
PHST- 2015/09/21 [received]
PHST- 2015/12/30 [accepted]
AID - srep20315 [pii]
AID - 10.1038/srep20315 [doi]
PST - epublish
SO - Sci Rep. 2016 Feb 3;6:20315. doi: 10.1038/srep20315.
2) SeqIO.read() is used to read and parse FASTA files. This is a format that is used to store sequences. A sequence in FASTA format is represented as a series of lines. The first line in a FASTA file starts with a ">" (greater-than) symbol. Following the initial line (used for a unique description of the sequence) is the actual sequence itself in standard one-letter code.
As you can see, the result that you get back from Entrez.efetch() (which I pasted above) doesn't look like a FASTA file. So SeqIO.read() gives the error that it can't find any sequence records in the file.
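Since what you download is a MEDLINE-format text record, one sketch of how you could keep key-based access is to parse it with Bio.Medline (which your code already imports) instead of SeqIO; the email address below is a placeholder:
from Bio import Entrez, Medline

Entrez.email = 'myemail@gmail.com'  # placeholder

net_handle = Entrez.efetch(db='pubmed', id='26837606', rettype='medline', retmode='text')
record = Medline.read(net_handle)  # dict-like, keyed by MEDLINE tags
net_handle.close()

print(record['TI'])  # title
print(record['AB'])  # abstract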