Hi, I have the following string. How can I get the B000RMTGUQ out of this entire string using regex in Python?
{'asin': 'B000RMTGUQ', 'imUrl': 'http://ecx.images-amazon.com/images/I/515KlX4dEUL._BO2,204,203,200_PIsitb-sticker-v3-big,TopRight,0,-55_SX278_SY278_PIkin4,BottomRight,1,22_AA300_SH20_OU01_.jpg', 'related': {'also_bought': ['B009QJMXI8', 'B00CNQ7MJG']}, 'categories': [['Books', 'History', 'World', 'Jewish', 'Holocaust'], ['Books', 'History', 'World', 'Religious', 'Judaism'], ['Books', 'Politics & Social Sciences', 'Social Sciences'], ['Books', 'Religion & Spirituality', 'Judaism'], ['Kindle Store', 'Kindle eBooks', 'History', 'World', 'Jewish', 'Holocaust'], ['Kindle Store', 'Kindle eBooks', 'Politics & Social Sciences', 'Social Sciences'], ['Kindle Store', 'Kindle eBooks', 'Religion & Spirituality', 'Judaism']], 'description': "Kibbutz Buchenwald was founded in Germany in 1945 by 16 survivors of Buchenwald concentration camp. The Zionist training farm was organized to prepare Jews for emigration to Palestine. One of the founders was Yeohezkel Tydor, the author's father, who died in 1993. Baumel's narration of the kibbutz's history is divided into two sections. Part one examines the kibbutz from its creation until the departure of the founding group to Palestine in late summer 1945. Part two traces the kibbutz's subsequent history in Palestine and Germany, from the autumn of 1945 until the mid-1950s. Kibbutz Buchenwald was abolished in Germany in 1948; the kibbutz as it was founded in what is now Israel--named Netzer Sereni--still exists today. The story of these pioneers and their physical, psychological, ideological, and political struggles forms the nucleus of this absorbing book.George Cohen"}
You could use json to parse your string into a valid dictionary:
First note that valid JSON uses double quotes. Also note that apostrophes such as the one in author's must stay as single quotes. Hence you could do:
import json, re

# swap every single quote for a double quote, then restore apostrophes such as "author's"
dct = json.loads(re.sub('"s', "'s", re.sub("'", '"', string)))
dct['asin']
'B000RMTGUQ'
EDIT
From the comments below, it seems you do not have a JSON string but rather a valid Python dictionary in string format.
Therefore you could directly do:
dc = eval(string)
dc['asin']
Furthermore, consider using ast.literal_eval rather than eval, since it only evaluates literals and refuses to execute arbitrary code.
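For example, a minimal sketch, assuming string holds the dictionary literal shown under data below:

import ast

dc = ast.literal_eval(string)  # parses only Python literals; raises on anything else
print(dc['asin'])  # B000RMTGUQ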
data
string = """{'asin': 'B000RMTGUQ', 'imUrl': 'http://ecx.images-amazon.com/images/I/515KlX4dEUL._BO2,204,203,200_PIsitb-sticker-v3-big,TopRight,0,-55_SX278_SY278_PIkin4,BottomRight,1,22_AA300_SH20_OU01_.jpg', 'related': {'also_bought': ['B009QJMXI8', 'B00CNQ7MJG']}, 'categories': [['Books', 'History', 'World', 'Jewish', 'Holocaust'], ['Books', 'History', 'World', 'Religious', 'Judaism'], ['Books', 'Politics & Social Sciences', 'Social Sciences'], ['Books', 'Religion & Spirituality', 'Judaism'], ['Kindle Store', 'Kindle eBooks', 'History', 'World', 'Jewish', 'Holocaust'], ['Kindle Store', 'Kindle eBooks', 'Politics & Social Sciences', 'Social Sciences'], ['Kindle Store', 'Kindle eBooks', 'Religion & Spirituality', 'Judaism']], 'description': "Kibbutz Buchenwald was founded in Germany in 1945 by 16 survivors of Buchenwald concentration camp. The Zionist training farm was organized to prepare Jews for emigration to Palestine. One of the founders was Yeohezkel Tydor, the author's father, who died in 1993. Baumel's narration of the kibbutz's history is divided into two sections. Part one examines the kibbutz from its creation until the departure of the founding group to Palestine in late summer 1945. Part two traces the kibbutz's subsequent history in Palestine and Germany, from the autumn of 1945 until the mid-1950s. Kibbutz Buchenwald was abolished in Germany in 1948; the kibbutz as it was founded in what is now Israel--named Netzer Sereni--still exists today. The story of these pioneers and their physical, psychological, ideological, and political struggles forms the nucleus of this absorbing book.George Cohen"}"""
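Since the question specifically asks about regex: if you only ever need the asin value, a targeted regex over the raw string also works; a sketch, assuming the key and value are always single-quoted as in the data above:

import re

match = re.search(r"'asin':\s*'([^']+)'", string)
if match:
    print(match.group(1))  # B000RMTGUQ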
Related
I have a text like:
Take a loot at some of the first confirmed Forum speakers: John
Sequiera Graduated in Biology at Facultad de Ciencias Exactas y
Naturales,University of Buenos Aires, Argentina. In 2004 obtained a
PhD in Biology (Molecular Neuroscience), at University of Buenos
Aires, mentored by Prof. Marcelo Rubinstein. Between 2005 and 2008
pursued postdoctoral training at Pasteur Institute (Paris) mentored by
Prof Jean-Pierre Changeux, to investigate the role of nicotinic
receptors in executive behaviors. Motivated by a deep interest in
investigating human neurological diseases, in 2009 joined the
Institute of Psychiatry at King’s College London where she performed
basic research with a translational perspective in the field of
neurodegeneration.
Since 2016 has been chief of instructors / Adjunct professor at University of Buenos Aires, Facultad de Ciencias Exactas y Naturales.
Tom Gonzalez is a professor of Neuroscience at the Sussex Neuroscience, School of Life Sciences, University of Sussex. Prof.
Baden studies how neurons and networks compute, using the beautiful
collection of circuits that make up the vertebrate retina as a model.
I want the output to be:
[{"person" : "John Sequiera" , "content": "Graduated in Biology at Facultad...."},{"person" : "Tom Gonzalez" , "content": "is a professor of Neuroscience at the Sussex..."}]
So we want NER to detect PER (person) entities, and for content we take everything after a detected person until the next person is found in the text.
Is this possible?
I tried to use spaCy to extract the NER entities, but I have difficulty getting the content:
import spacy
nlp = spacy.load("en_core_web_lg")
doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent.label_)
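One possible way to also capture the content, pairing each person entity with the text that follows it up to the next person (a sketch, not a tested solution; assumes text holds the sample above):

import spacy

nlp = spacy.load("en_core_web_lg")
doc = nlp(text)

# keep only person entities, in document order
# (en_core_web_lg labels persons as PERSON; some models use PER)
persons = [ent for ent in doc.ents if ent.label_ == "PERSON"]

output = []
for i, ent in enumerate(persons):
    # content runs from the end of this mention to the start of the next one
    start = ent.end_char
    end = persons[i + 1].start_char if i + 1 < len(persons) else len(doc.text)
    output.append({"person": ent.text, "content": doc.text[start:end].strip()})

print(output)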
I want to get the actor name out of the page_title field of this JSON file and then match it against my database. I tried using NLTK and spaCy, but they require training data. Do I have to train on each and every sentence? I have more than 100k sentences; if I sit down to prepare training data it will take a month or more. Is there any way I can dump my K_actor database to train spaCy, NLTK, or something else?
{"page_title": "Sonakshi Sinha To Auction Sketch Of Buddha To Help Migrant Labourers", "description": "Sonakshi Sinha took to Instagram to share a timelapse video of a sketch of Buddha that she made to auction to raise funds for migrant workers affected by Covid-19 crisis. ", "image_url": "https://images.news18.com/ibnlive/uploads/2020/05/1589815261_1589815196489_copy_875x583.jpg", "post_url": "https://www.news18.com/news/movies/sonakshi-sinha-to-auction-sketch-of-buddha-to-help-migrant-labourers-2626123.html"}
{"page_title": "Anushka Sharma Calls Virat Kohli 'A Liar' on IG Live, Nushrat Bharucha Gets Propositioned on Twitter", "description": "In an Instagram live interaction with Sunil Chhetri, Virat Kohli was left embarrassed after Anushka Sharma called him a 'jhootha' from behind the camera. This and more in today's wrap.", "image_url": "https://images.news18.com/ibnlive/uploads/2020/05/1589813980_1589813933996_copy_875x583.jpg", "post_url": "https://www.news18.com/news/movies/anushka-sharma-calls-virat-kohli-a-liar-on-ig-live-nushrat-bharucha-gets-propositioned-on-twitter-2626093.html"}
{"page_title": "Ranveer Singh Shares a Throwback to the Days When WWF was His Life", "description": "Ranveer Singh shared a throwback picture from his childhood where he could be seen posing in front of a poster of WWE legend Hulk Hogan.", "image_url": "https://images.news18.com/ibnlive/uploads/2020/05/1589812401_screenshot_20200518-195906_chrome_copy_875x583.jpg", "post_url": "https://www.news18.com/news/movies/ranveer-singh-shares-a-throwback-to-the-days-when-wwf-was-his-life-2626067.html"}
{"page_title": "Salman Khan's Love Song 'Tere Bina' Gets 26 Million Views", "description": "Salman Khan's song Tere Bina, which was launched a few days ago, had garnered 12 million views within 24 hours. As it continues to trend, it has garnered 26 million views in less than a week.", "image_url": "https://images.news18.com/ibnlive/uploads/2020/05/1589099778_screenshot_20200510-135934_chrome_copy_875x583.jpg", "post_url": "https://www.news18.com/news/movies/salman-khans-love-song-tere-bina-gets-26-million-views-2626077.html"}
{"page_title": "Yash And Radhika Pandit Pose With Their Kids For a Perfect Family Picture", "description": "Kannada actor Yash tied the knot with actress Radhika Pandit in 2016. The couple shares two kids together.", "image_url": "https://images.news18.com/ibnlive/uploads/2020/05/1589812187_yash.jpg", "post_url": "https://www.news18.com/news/movies/yash-and-radhika-pandit-pose-with-their-kids-for-a-perfect-family-picture-2626055.html"}
{"page_title": "Malaika Arora Shares Beach Vacay Boomerang With Hopeful Note", "description": "Malaika Arora shared a throwback boomerang from a beach vacation where she could be seen playfully spinning. She also shared a hopeful message along with it.", "image_url": "https://images.news18.com/ibnlive/uploads/2020/05/1589810291_screenshot_20200518-192603_chrome_copy_875x583.jpg", "post_url": "https://www.news18.com/news/movies/malaika-arora-shares-beach-vacay-boomerang-with-hopeful-note-2626019.html"}
{"page_title": "Actor Nawazuddin Siddiqui's Wife Aaliya Sends Legal Notice To Him Demanding Divorce, Maintenance", "description": "The notice was sent to the ", "image_url": "https://images.news18.com/ibnlive/uploads/2019/10/Nawazuddin-Siddiqui.jpg", "post_url": "https://www.news18.com/news/movies/actor-nawazuddin-siddiquis-wife-aaliya-sends-legal-notice-to-him-demanding-divorce-maintenance-2626035.html"}
{"page_title": "Lisa Haydon Celebrates Son Zack\u2019s 3rd Birthday With Homemade Cake And 'Spiderman' Surprise", "description": "Lisa Haydon took to Instagram to share some glimpses from the special day. In the pictures, we can spot a man wearing a Spiderman costume.", "image_url": "https://images.news18.com/ibnlive/uploads/2020/05/1589807960_lisa-rey.jpg", "post_url": "https://www.news18.com/news/movies/lisa-haydon-celebrates-son-zacks-3rd-birthday-with-homemade-cake-and-spiderman-surprise-2625953.html"}
{"page_title": "Chiranjeevi Recreates Old Picture with Wife, Says 'Time Has Changed'", "description": "Chiranjeevi was last seen in historical-drama Sye Raa Narasimha Reddy. He was shooting for his next film, Acharya, before the coronavirus lockdown.", "image_url": "https://images.news18.com/ibnlive/uploads/2020/05/1589808242_pjimage.jpg", "post_url": "https://www.news18.com/news/movies/chiranjeevi-recreates-old-picture-with-wife-says-time-has-changed-2625973.html"}
{"page_title": "Amitabh Bachchan, Rishi Kapoor\u2019s Pout Selfie Recreated By Abhishek, Ranbir is Priceless", "description": "A throwback picture that has gone viral on the internet shows Ranbir Kapoor and Abhishek Bachchan recreating a selfie of their fathers Rishi Kapoor and Amitabh Bachchan.", "image_url": "https://images.news18.com/ibnlive/uploads/2020/05/1589807772_screenshot_20200518-184521_chrome_copy_875x583.jpg", "post_url": "https://www.news18.com/news/movies/amitabh-bachchan-rishi-kapoors-pout-selfie-recreated-by-abhishek-ranbir-is-priceless-2625867.html"}
Something that you can do is to create an annotator script wherein you replace actor names with '###' or some other placeholder string (which will be replaced later with actor names (entities) for training).
I trained 68K sentences in 9 hrs with my i3 laptop. You can dump data like this, and the output file can be used for training the model.
That will save time and also give you a ready-made training data format for spaCy.
from nltk import word_tokenize
from pandas import read_csv
import re
import os.path
def annot(Label, entity, textlist):
    finaldict = []
    for text_token in textlist:
        for value in entity:
            # substitute the '###' placeholder with the actual entity value
            text = str(text_token).replace('###', value)
            text = text.lower()
            text = re.sub(r'[^a-zA-Z0-9\n\.]', ' ', text)
            if len(word_tokenize(value)) < 2:
                # single-token entity: walk the tokens, tracking character
                # offsets (assumes tokens are separated by single spaces
                # after the cleanup above)
                newtext = word_tokenize(text)
                traindata = []
                prev_length = 0
                prev_pos = 0
                k = 0
                while k != len(newtext):
                    if k == 0:
                        prev_pos = 0
                        prev_length = len(newtext[k])
                    else:
                        prev_pos = prev_length + 1
                        prev_length = prev_length + len(newtext[k]) + 1
                    if value.lower() == str(newtext[k]):
                        # spaCy entity span: (start offset, end offset, label)
                        traindata.append((prev_pos, prev_length, Label))
                    k = k + 1
                mydict = {'entities': traindata}
                finaldict.append((text, mydict))
            else:
                # multi-token entity: locate it directly in the cleaned text
                traindata = []
                try:
                    begin = text.index(value.lower())
                    # end offset is start + entity length
                    traindata.append((begin, begin + len(value), Label))
                except ValueError:
                    pass
                mydict = {'entities': traindata}
                finaldict.append((text, mydict))
    return finaldict
def getEntities(csv_file, column):
    df = read_csv(csv_file)
    return df[column].to_list()

def getSentences(file_name):
    with open(file_name) as file1:
        sentences = [line1.rstrip('\n') for line1 in file1]
    return sentences

def saveData(data, filename, path):
    filename = os.path.join(path, filename)
    with open(filename, 'a') as file:
        for sent in data:
            file.write("{}\n".format(sent))
ents = getEntities(csv_file, column_name) #Actor names in your case
entities = [ent for ent in ents if str(ent) != 'nan']
sentences = getSentences(filepathandname) #Considering you have the sentences in a text file
label = 'ACTOR_NAMES'
data = annot(label, entities, sentences)
saveData(data, 'train_data.txt', path)
Hope this is a relevant answer for your question.
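For reference, a minimal spaCy v2-style training sketch, assuming data holds the (text, {'entities': [(start, end, label)]}) tuples produced by annot() above (the iteration count and model path are illustrative):

import random
import spacy

nlp = spacy.blank('en')
ner = nlp.create_pipe('ner')
nlp.add_pipe(ner)
ner.add_label('ACTOR_NAMES')

optimizer = nlp.begin_training()
for itn in range(10):
    random.shuffle(data)
    losses = {}
    for text, annotations in data:
        # spaCy v2 API; spaCy v3 uses Example objects and a config-driven pipeline
        nlp.update([text], [annotations], sgd=optimizer, losses=losses)
    print(itn, losses)

nlp.to_disk('actor_ner_model')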
So I am attempting to scrape a website's staff roster, and I want the end product to be a dictionary in the format of {staff: position}. I am currently stuck with it returning every staff name and position as a separate string. It is hard to post the output clearly, but it essentially goes down the list of names, then the positions. So, for example, the first name on the list is to be paired with the first position, and so on. I have determined that each name and position is a bs4.element.Tag. I believe I need to take the names and the positions, make a list out of each, then use zip to put the elements in a dictionary. I have tried implementing this but nothing so far has worked. The lowest I could get to the text I need by using the class_ parameter was the individual div that the p is contained in. I am still inexperienced with Python and new to web scraping, but I am relatively well versed in HTML and CSS, so help would be greatly appreciated.
# Simple script attempting to scrape
# the staff roster off of the
# Greenville Drive website
import requests
from bs4 import BeautifulSoup

URL = 'https://www.milb.com/greenville/ballpark/frontoffice'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')

staff_divs = soup.find_all('div', class_='l-grid__col l-grid__col--xs-12 l-grid__col--sm-4 l-grid__col--md-3 l-grid__col--lg-3 l-grid__col--xl-3')
for staff in staff_divs:
    data = staff.find('p')
    if data:
        print(data.text.strip())

position_divs = soup.find_all('div', class_='l-grid__col l-grid__col--xs-12 l-grid__col--sm-4 l-grid__col--md-6 l-grid__col--lg-6 l-grid__col--xl-6')
for position in position_divs:
    data = position.find('p')
    if data:
        print(data.text.strip())

# This code so far provides the needed data, but I need it in a dict()
BeautifulSoup has find_next(), which can be used to get the next tag matching the specified filters. Find the "staff" div and then use find_next() to get the adjacent "position" div.
import requests
from bs4 import BeautifulSoup

URL = 'https://www.milb.com/greenville/ballpark/frontoffice'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')

staff_class = 'l-grid__col l-grid__col--xs-12 l-grid__col--sm-4 l-grid__col--md-3 l-grid__col--lg-3 l-grid__col--xl-3'
position_class = 'l-grid__col l-grid__col--xs-12 l-grid__col--sm-4 l-grid__col--md-6 l-grid__col--lg-6 l-grid__col--xl-6'

result = {}
for staff in soup.find_all('div', class_=staff_class):
    data = staff.find('p')
    if data:
        staff_name = data.text.strip()
        position_div = staff.find_next('div', class_=position_class)
        position_name = position_div.text.strip()
        result[staff_name] = position_name

print(result)
Output
{'Craig Brown': 'Owner/Team President', 'Eric Jarinko': 'General Manager', 'Nate Lipscomb': 'Special Advisor to the President', 'Phil Bargardi': 'Vice President of Sales', 'Jeff Brown': 'Vice President of Marketing', 'Greg Burgess, CSFM': 'Vice President of Operations/Grounds', 'Jordan Smith': 'Vice President of Finance', 'Ned Kennedy': 'Director of Inside Sales', 'Patrick Innes': 'Director of Ticket Operations', 'Micah Gold': 'Senior Account Executive', 'Molly Mains': 'Senior Account Executive', 'Houghton Flanagan': 'Account Executive', 'Jeb Maloney': 'Account Executive', 'Olivia Adams': 'Inside Sales Representative', 'Tyler Melson': 'Inside Sales Representative', 'Toby Sandblom': 'Inside Sales Representative', 'Katie Batista': 'Director of Sponsorships and Community Engagement', 'Matthew Tezza': 'Sponsor Services and Activations Manager', 'Melissa Welch': 'Sponsorship and Community Events Manager', 'Beth Rusch': 'Director of West End Events', 'Kristin Kipper': 'Events Manager', 'Grant Witham': 'Events Manager', 'Alex Guest': 'Director of Game Entertainment & Production', 'Lance Fowler': 'Director of Video Production', 'Davis Simpson': 'Director of Media and Creative Services', 'Cameron White': 'Media Relations Manager', 'Ed Jenson': 'Broadcaster', 'Adam Baird': 'Accountant', 'Mike Agostino': 'Director of Food and Beverage', 'Roger Campana': 'Assistant Director of Food and Beverage', 'Wilbert Sauceda': 'Executive Chef', 'Elise Parish': 'Premium Services Manager', 'Timmy Hinds': 'Director of Facility Operations', 'Zack Pagans': 'Assistant Groundskeeper', 'Amanda Medlin': 'Business and Team Operations Manager', 'Allison Roedell': 'Office Manager'}
Solution using CSS selectors and zip():
import requests
from bs4 import BeautifulSoup
from pprint import pprint

url = 'https://www.milb.com/greenville/ballpark/frontoffice'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

out = {}
for name, position in zip(soup.select('div:has(+ div p) b'),
                          soup.select('div:has(> div b) + div p')):
    out[name.text] = position.text

pprint(out)
Prints:
{'Adam Baird': 'Accountant',
'Alex Guest': 'Director of Game Entertainment & Production',
'Allison Roedell': 'Office Manager',
'Amanda Medlin': 'Business and Team Operations Manager',
'Beth Rusch': 'Director of West End Events',
'Brady Andrews': 'Assistant Director of Facility Operations',
'Brooks Henderson': 'Merchandise Manager',
'Bryan Jones': 'Facilities Cleanliness Manager',
'Cameron White': 'Media Relations Manager',
'Craig Brown': 'Owner/Team President',
'Davis Simpson': 'Director of Media and Creative Services',
'Ed Jenson': 'Broadcaster',
'Elise Parish': 'Premium Services Manager',
'Eric Jarinko': 'General Manager',
'Grant Witham': 'Events Manager',
'Greg Burgess, CSFM': 'Vice President of Operations/Grounds',
'Houghton Flanagan': 'Account Executive',
'Jeb Maloney': 'Account Executive',
'Jeff Brown': 'Vice President of Marketing',
'Jenny Burgdorfer': 'Director of Merchandise',
'Jordan Smith ': 'Vice President of Finance',
'Katie Batista': 'Director of Sponsorships and Community Engagement',
'Kristin Kipper': 'Events Manager',
'Lance Fowler': 'Director of Video Production',
'Matthew Tezza': 'Sponsor Services and Activations Manager',
'Melissa Welch': 'Sponsorship and Community Events Manager',
'Micah Gold': 'Senior Account Executive',
'Mike Agostino': 'Director of Food and Beverage',
'Molly Mains': 'Senior Account Executive',
'Nate Lipscomb': 'Special Advisor to the President',
'Ned Kennedy': 'Director of Inside Sales',
'Olivia Adams': 'Inside Sales Representative',
'Patrick Innes': 'Director of Ticket Operations',
'Phil Bargardi': 'Vice President of Sales',
'Roger Campana': 'Assistant Director of Food and Beverage',
'Steve Seman': 'Merchandise / Ticketing Advisor',
'Timmy Hinds': 'Director of Facility Operations',
'Toby Sandblom': 'Inside Sales Representative',
'Tyler Melson': 'Inside Sales Representative',
'Wilbert Sauceda': 'Executive Chef',
'Zack Pagans': 'Assistant Groundskeeper'}
The following code checks for the presence of the company name (from tickersList), or a fragment of it, in the text of the news (from newsList).
When a company is found in the news, print gives out the expected ticker of the company, but after the news is added to the list, something nonsensical happens :(
It looks like the previous elements of the list (tickersNews) get overwritten when a dictionary (news) is appended to it. Why?
It should be noted that when the news is converted to a string before appending, everything works as it should.
import re
tickersList = [('ATI', 'Allegheny rporated', 'Allegheny Technologies Incorporated'), ('ATIS', 'Attis', 'Attis Industries, Inc.'), ('ATKR', 'Atkore International Group', 'Atkore International Group Inc.'), ('ATMP', 'Barclays + Select M', 'Barclays ETN+ Select MLP'), ('ATNM', 'Actinium', 'Actinium Pharmaceuticals, Inc.'), ('ATNX', 'Athenex', 'Athenex, Inc.'), ('ATOS', 'Atossa Genetics', 'Atossa Genetics Inc.'), ('ATRA', 'Atara Biotherapeutics', 'Atara Biotherapeutics, Inc.'), ('ATRC', 'AtriCure', 'AtriCure, Inc.'), ('ATRO', 'Astronics', 'Astronics Corporation'), ('ATRS', 'Antares Pharma', 'Antares Pharma, Inc.'), ('ATSG', 'Air Transport Services Group', 'Air Transport Services Group, Inc.'), ('CJ', 'C&J Energy', 'C&J Energy Services, Inc.'), ('CJJD', 'China Jo-Jo Drugstores', 'China Jo-Jo Drugstores, Inc.'), ('CLAR', 'Clarus', 'Clarus Corporation'), ('CLD', 'Cloud Peak Energy', 'Cloud Peak Energy Inc.'), ('CLDC', 'China Lending', 'China Lending Corporation'), ('CLDR', 'Cloudera', 'Cloudera, Inc.')]
newsList = [
{'title':'Atara Biotherapeutics Announces Planned Chief Executive Officer Transition'},
{'title':'Chongqing Jingdong Pharmaceutical and Athenex Announce a Strategic Partnership and Licensing Agreement to Develop and Commercialize KX2-391 in China'}
]
tickersNews = []
for news in newsList:
    # pass through the list of companies looking for their mention in the news
    for ticker, company, company_full in tickersList:
        # clear the full name of the company of brackets, extra spaces,
        # articles, periods and commas, and save the fragments of the full name
        companyFullFragments = company_full.replace(',', '')\
            .replace('.', '').replace('The ', ' ')\
            .replace('(', ' ').replace(')', ' ')\
            .replace('  ', ' ').strip().split()
        # look for the company in the news, each time cutting off
        # the last fragment from the full company name
        for i in range(len(companyFullFragments), 0, -1):
            companyFullFragmentsString = ' '.join(companyFullFragments[:i]).strip()
            lookFor_company = r'(^|\s){0}(\s|$)'.format(companyFullFragmentsString)
            results_company = re.findall(lookFor_company, news['title'])
            # if the title of the news contains the name of the company,
            # then we add the ticker to the news, print it
            # and add the news to the list
            if results_company:
                news['ticker'] = ticker  # , companyFullFragmentsString, company_full
                print(news['ticker'], 'found')
                # tickersNews.append(str(news))
                # ----------------------------- Here is the problem!(?)
                tickersNews.append(news)
                # move on to the next company
                break

print(20 * '-', 'appended:')
for news in tickersNews:
    print(news['ticker'])
Output (list of dict):
ATRA found
ATNX found
CJJD found
CLDC found
-------------------- appended:
ATRA
CLDC
CLDC
CLDC
Output (list of strings):
ATRA found
ATNX found
CJJD found
CLDC found
-------------------- appended as a strings:
["{'title': 'Atara Biotherapeutics Announces Planned Chief Executive Officer Transition', 'ticker': 'ATRA'}", "{'title': 'Chongqing Jingdong Pharmaceutical and Athenex Announce a Strategic Partnership and Licensing Agreement to Develop and Commercialize KX2-391 in China', 'ticker': 'ATNX'}", "{'title': 'Chongqing Jingdong Pharmaceutical and Athenex Announce a Strategic Partnership and Licensing Agreement to Develop and Commercialize KX2-391 in China', 'ticker': 'CJJD'}", "{'title': 'Chongqing Jingdong Pharmaceutical and Athenex Announce a Strategic Partnership and Licensing Agreement to Develop and Commercialize KX2-391 in China', 'ticker': 'CLDC'}"]
The problem originates from two lines, news['ticker'] = ticker and tickersNews.append(news), which are located inside the for loop. A much simpler version of your problem is:
a = 10
a = 20
a = 30
print(a, a, a)
Output will be 30 30 30. I guess it's obvious.
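Closer to your actual code: the list stores a reference to the dict, not a snapshot of it, so mutating the dict after appending changes what you see through every element:

news = {'title': 'some title'}
tickersNews = []

news['ticker'] = 'ATRA'
tickersNews.append(news)   # appends a reference to the same dict

news['ticker'] = 'CLDC'    # mutates that same dict
tickersNews.append(news)

print([n['ticker'] for n in tickersNews])  # ['CLDC', 'CLDC']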
To solve the problem you may use several approaches.
First possibility (easiest). Replace tickersNews.append(news) with tickersNews.append(news.copy()).
Second possibility (preferable). Don't use tickersNews. For every news item create an empty list: news['ticker_list'] = list(). For every matching ticker, append it to news['ticker_list']:
import re
tickersList = [('ATI', 'Allegheny rporated', 'Allegheny Technologies Incorporated'), ('ATIS', 'Attis', 'Attis Industries, Inc.'), ('ATKR', 'Atkore International Group', 'Atkore International Group Inc.'), ('ATMP', 'Barclays + Select M', 'Barclays ETN+ Select MLP'), ('ATNM', 'Actinium', 'Actinium Pharmaceuticals, Inc.'), ('ATNX', 'Athenex', 'Athenex, Inc.'), ('ATOS', 'Atossa Genetics', 'Atossa Genetics Inc.'), ('ATRA', 'Atara Biotherapeutics', 'Atara Biotherapeutics, Inc.'), ('ATRC', 'AtriCure', 'AtriCure, Inc.'), ('ATRO', 'Astronics', 'Astronics Corporation'), ('ATRS', 'Antares Pharma', 'Antares Pharma, Inc.'), ('ATSG', 'Air Transport Services Group', 'Air Transport Services Group, Inc.'), ('CJ', 'C&J Energy', 'C&J Energy Services, Inc.'), ('CJJD', 'China Jo-Jo Drugstores', 'China Jo-Jo Drugstores, Inc.'), ('CLAR', 'Clarus', 'Clarus Corporation'), ('CLD', 'Cloud Peak Energy', 'Cloud Peak Energy Inc.'), ('CLDC', 'China Lending', 'China Lending Corporation'), ('CLDR', 'Cloudera', 'Cloudera, Inc.')]
newsList = [
{'title':'Atara Biotherapeutics Announces Planned Chief Executive Officer Transition'},
{'title':'Chongqing Jingdong Pharmaceutical and Athenex Announce a Strategic Partnership and Licensing Agreement to Develop and Commercialize KX2-391 in China'}
]
for news in newsList:
    news['ticker_list'] = list()
    for ticker, company, company_full in tickersList:
        companyFullFragments = company_full.replace(',', '')\
            .replace('.', '').replace('The ', ' ')\
            .replace('(', ' ').replace(')', ' ')\
            .replace('  ', ' ').strip().split()
        for i in range(len(companyFullFragments), 0, -1):
            companyFullFragmentsString = ' '.join(companyFullFragments[:i]).strip()
            lookFor_company = r'(^|\s){0}(\s|$)'.format(companyFullFragmentsString)
            results_company = re.findall(lookFor_company, news['title'])
            if results_company:
                news['ticker_list'].append(ticker)
                # print(ticker, 'found')
                break

print('tickers for news:')
for news in newsList:
    print(news['ticker_list'])
Output will be:
tickers for news:
['ATRA']
['ATNX', 'CJJD', 'CLDC']
I have this csv file with only two entries. Here it is:
Meat One,['Abattoirs', 'Exporters', 'Food Delivery', 'Butchers Retail', 'Meat Dealers-Retail', 'Meat Freezer', 'Meat Packers']
The first one is a title and the second is a list of business headings.
The problem lies with entry two.
Here is my code:
import csv

with open('phonebookCOMPK-Directory.csv', "rt") as textfile:
    reader = csv.reader(textfile)
    for row in reader:
        row5 = row[5].replace("[", "").replace("]", "")
        listt = [(''.join(row5))]
        print(listt[0])
It prints:
'Abattoirs', 'Exporters', 'Food Delivery', 'Butchers Retail', 'Meat Dealers-Retail', 'Meat Freezer', 'Meat Packers'
What I need to do is create a list containing these words and then print them like this, using a for loop to print every item separately:
Abattoirs
Exporters
Food Delivery
Butchers Retail
Meat Dealers-Retail
Meat Freezer
Meat Packers
Actually, I am trying to reformat my current CSV file and clean it so it can be more precise and understandable.
The complete first line of the CSV is this:
Meat One,+92-21-111163281,Al Shaheer Corporation,Retailers,2008,"['Abattoirs', 'Exporters', 'Food Delivery', 'Butchers Retail', 'Meat Dealers-Retail', 'Meat Freezer', 'Meat Packers']","[[' Outlets Address : Shop No. Z-10, Station Shopping Complex, MES Market, Malir-Cantt, Karachi. Landmarks : MES Market, Station Shopping Complex City : Karachi UAN : +92-21-111163281 '], [' Outlets Address : Shop 13, Ground Floor, Plot 14-D, Sky Garden, Main Tipu Sultan Road, KDA Scheme No.1, Karachi. Landmarks : Nadra Chowrangi, Sky Garden, Tipu Sultan Road City : Karachi UAN : +92-21-111163281 '], ["" Outlets Address : Near Jan's Broast, Boat Basin, Khayaban-e-Roomi, Block 5, Clifton, Karachi. Landmarks : Boat Basin, Jans Broast, Khayaban-e-Roomi City : Karachi UAN : +92-21-111163281 View Map ""], [' Outlets Address : Gulistan-e-Johar, Karachi. Landmarks : Perfume Chowk City : Karachi UAN : +92-21-111163281 '], [' Outlets Address : Tee Emm Mart, Creek Vista Appartments, Khayaban-e-Shaheen, Phase VIII, DHA, Karachi. Landmarks : Creek Vista Appartments, Nueplex Cinema, Tee Emm Mart, The Place City : Karachi Mobile : 0302-8333666 '], [' Outlets Address : Y-Block, DHA, Lahore. Landmarks : Y-Block City : Lahore UAN : +92-42-111163281 '], [' Outlets Address : Adj. PSO, Main Bhittai Road, Jinnah Supermarket, F-7 Markaz, Islamabad. Landmarks : Bhittai Road, Jinnah Super Market, PSO Petrol Pump City : Islamabad UAN : +92-51-111163281 ']]","Agriculture, fishing & Forestry > Farming equipment & services > Abattoirs in Pakistan"
First column is Name
Second column is Number
Third column is Owner
Fourth column is Business type
Fifth column is Y.O.E
Sixth column is Business Headings
Seventh column is Outlets (List of lists containing every branch address)
Eighth column is classification
There is no restriction on using csv.reader; I am open to any technique available to clean my file.
Think of it in terms of two separate tasks:
Collect some data items from a ‘dirty’ source (this CSV file)
Store that data somewhere so that it’s easy to access and manipulate programmatically (according to what you want to do with it)
Processing dirty CSV
One way to do this is to have a function deserialize_business() to distill structured business information from each incoming line in your CSV. This function can be complex because that's the nature of the task, but it's still advisable to split it into self-contained smaller functions (such as get_outlets(), get_headings(), and so on). This function can return a dictionary, but depending on what you want it can be a [named] tuple, a custom object, etc.
This function would be an ‘adapter’ for this particular CSV data source.
Example of deserialization function:
def deserialize_business(csv_line):
    """
    Distills structured business information from given raw CSV line.
    Returns a dictionary like {name, phone, owner,
    btype, yoe, headings[], outlets[], category}.
    """
    pieces = [piece.strip("[[\"\']] ") for piece in csv_line.strip().split(',')]
    name = pieces[0]
    phone = pieces[1]
    owner = pieces[2]
    btype = pieces[3]
    yoe = pieces[4]
    # after yoe the headings begin, until substring Outlets Address
    headings = pieces[5:pieces.index("Outlets Address")]
    # outlets go from substring Outlets Address until category
    outlet_pieces = pieces[pieces.index("Outlets Address"):-1]
    # combine each individual outlet information into a string
    # and let ``deserialize_outlet()`` deal with that
    raw_outlets = ', '.join(outlet_pieces).split("Outlets Address")
    outlets = [deserialize_outlet(outlet) for outlet in raw_outlets]
    # category is the last piece
    category = pieces[-1]
    return {
        'name': name,
        'phone': phone,
        'owner': owner,
        'btype': btype,
        'yoe': yoe,
        'headings': headings,
        'outlets': outlets,
        'category': category,
    }
Example of calling it:
with open("phonebookCOMPK-Directory.csv") as f:
    lineno = 0
    for line in f:
        lineno += 1
        try:
            business = deserialize_business(line)
        except Exception:
            # Bad line formatting?
            log.exception(u"Failed to deserialize line #%s!", lineno)
        else:
            # All is well
            store_business(business)
Storing the data
You’ll have the store_business() function take your data structure and write it somewhere. Maybe it’ll be another CSV that’s better structured, maybe multiple CSVs, a JSON file, or you can make use of SQLite relational database facilities since Python has it built-in.
It all depends on what you want to do later.
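For instance, a minimal store_business() that appends each record as one JSON object per line (JSON Lines; just one of the possible storage choices, and the filename is illustrative):

import json

def store_business(business, path='businesses.jsonl'):
    # one JSON object per line keeps the file appendable and easy to stream
    with open(path, 'a', encoding='utf-8') as f:
        f.write(json.dumps(business) + '\n')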
Relational example
In this case your data would be split across multiple tables. (I’m using the word “table” but it can be a CSV file, although you can as well make use of an SQLite DB since Python has that built-in.)
Table identifying all possible business headings:
business heading ID, name
1, Abattoirs
2, Exporters
3, Food Delivery
4, Butchers Retail
5, Meat Dealers-Retail
6, Meat Freezer
7, Meat Packers
Table identifying all possible categories:
category ID, parent category, name
1, NULL, "Agriculture, fishing & Forestry"
2, 1, "Farming equipment & services"
3, 2, "Abattoirs in Pakistan"
Table identifying businesses:
business ID, name, phone, owner, type, yoe, category
1, Meat One, +92-21-111163281, Al Shaheer Corporation, Retailers, 2008, 3
Table describing their outlets:
business ID, city, address, landmarks, phone
1, Karachi UAN, "Shop 13, Ground Floor, Plot 14-D, Sky Garden, Main Tipu Sultan Road, KDA Scheme No.1, Karachi", "Nadra Chowrangi, Sky Garden, Tipu Sultan Road", +92-21-111163281
1, Karachi UAN, "Near Jan's Broast, Boat Basin, Khayaban-e-Roomi, Block 5, Clifton, Karachi", "Boat Basin, Jans Broast, Khayaban-e-Roomi", +92-21-111163281
Table describing their headings:
business ID, business heading ID
1, 1
1, 2
1, 3
…
Handling all this would require a complex store_business() function. It may be worth looking into SQLite and some ORM framework, if going with relational way of keeping the data.
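As a rough illustration of that relational layout with the built-in sqlite3 module (table and column names follow the sketch above and are, of course, adjustable):

import sqlite3

conn = sqlite3.connect('directory.db')
conn.execute("""
    CREATE TABLE IF NOT EXISTS business (
        id INTEGER PRIMARY KEY,
        name TEXT, phone TEXT, owner TEXT,
        type TEXT, yoe TEXT, category_id INTEGER
    )
""")
conn.execute("""
    CREATE TABLE IF NOT EXISTS business_heading (
        business_id INTEGER,
        heading_id INTEGER
    )
""")
conn.commit()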
You can just replace the line:
print(listt[0])
with:
print(*(item.strip(" '") for item in listt[0].split(',')), sep='\n')
Note that listt[0] is a single string, so it must be split into items and stripped of the leftover quotes first; unpacking the string directly with *listt[0] would print one character per line.
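Alternatively, since row[5] is a Python list literal, ast.literal_eval (mentioned earlier in this thread) can parse it into a real list directly; a sketch, meant to go inside the csv.reader loop from the question:

import ast

items = ast.literal_eval(row[5])  # "['Abattoirs', 'Exporters', ...]" -> list of strings
for item in items:
    print(item)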