How do you form a compound query using the Bing Marketplace API?

I have an application that uses the new Bing API from the Azure DataMarket. I used to be able to query the Bing API with simple operators such as OR and AND. That syntax doesn't seem to work in the new API.
Old syntax:
"Jacksonville Jaguars" OR "NFL Jaguars" OR "Atlanta Falcons"
That would return results matching any of those phrases (I am making an rt_Sports query for news).
I am calling HttpEncode on the query first, but I am still not getting results. It works if I remove all the " marks, but then I sometimes get results about falcons and jaguars (the animals)... not what I wanted.
Anyone have any idea how you can form a query that takes multiple phrases?
I have tried omitting the OR, dropping the ', using a ", and using | instead of OR. All of these work against Bing the website, just not in the API.
I just tried this via Bing and got 36 million results:
NFL Football | Seattle Seahawks | New York Giants | Dallas Cowboys | New Orleans Saints | New England Patriots | Jacksonville Jaguars
Same thing in the API returns 0.
I emailed this question to a friend as well, and his thought was that I was going about it wrong: that there should be a way to form a LINQ query off the Bing object with multiple where clauses.
But I don't see how that would be possible. You create a BingSearchContainer and then call the News method on the container. The News method has only a single Query parameter.
var bingContainer = new Bing.BingSearchContainer(new Uri("https://api.datamarket.azure.com/Bing/Search"));
bingContainer.Credentials = new NetworkCredential(App.BingAPIAccountKey, App.BingAPIAccountKey);
string unifiedQuery = "NFL Football | Jacksonville Jaguars | Atlanta Falcons";
var newsQuery = bingContainer.News(unifiedQuery, null, "en-US", "Strict", null, null, null, "rt_Sports", "Relevance");
newsQuery.BeginExecute(BingNewsResultLoadedCallback, newsQuery);

Try changing unifiedQuery to the following:
var unifiedQuery = "'NFL Football' or 'Jacksonville Jaguars' or 'Atlanta Falcons'";
I tried something very similar to your sample code, using this format for the query string, and it worked for me:
var bingUri = new Uri("https://api.datamarket.azure.com/Bing/Search/v1/", UriKind.Absolute);
var bingContainer = new BingSearchContainer(bingUri);
bingContainer.Credentials = new NetworkCredential(BingAPIUserName, BingAPIAccountKey);
var unifiedQuery = "'NFL Football' or 'Jacksonville Jaguars' or 'Atlanta Falcons'";
var newsQuery = bingContainer.News(unifiedQuery, null, "en-US", "Strict", null, null, null, "rt_Sports", "Relevance");
var results = newsQuery.Execute();
foreach (var item in results)
{
    Console.WriteLine(item.Title);
}
Here are my results:
Fantasy Football 2012: Ranking the Top 25 RBs
NFL Football No Longer Just a Sunday Game
Ravens Notebook: Ed Reed decided to play in game vs. Falcons since he 'wasn't doing anything else'
PrimeSport Partners with Jacksonville Jaguars to Offer Tickets and Official Fan
Packages for all Home and Away Games in 2012 Season
Jaguars cut former Ravens wide receiver Lee Evans
Falcons left tackle Baker finally feels healthy
Jaguars release veteran WR Lee Evans
NFC West: 2012 NFL Training Camp
Atlanta Falcons 2012 NFL TV Schedule
Jaguars training camp: Veteran WR Lee Evans released
Jaguars score 18 points in second half to beat Giants 32-31
Jacksonville Jaguars put running back Maurice Jones-Drew on reserve/did not report list
Postcard from camp: Falcons
Questions abound as NFL preseason opens in earnest
NFL fantasy football: Ryan Mathews loses value
The format for the unifiedQuery string is basically the OData URI query string format. For a full description of how these query strings work, check out the OData URI conventions documentation at http://www.odata.org/documentation/uri-conventions.
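The generated C# service reference takes care of URL encoding for you, but if you ever call the REST endpoint directly it helps to see what the encoded Query parameter looks like. A quick sketch in Python (the endpoint path is the one from the code above; the encoding shown is standard percent-encoding, not anything Bing-specific):

```python
from urllib.parse import quote

# OData wraps the whole expression in single quotes, with "or" between phrases.
unified_query = "'NFL Football' or 'Jacksonville Jaguars' or 'Atlanta Falcons'"

# Percent-encode the parameter value: apostrophes become %27, spaces %20.
encoded = quote(unified_query, safe="")
url = "https://api.datamarket.azure.com/Bing/Search/v1/News?Query=" + encoded

print(url)
```

If you build the URL by hand and skip this step, the apostrophes and spaces in the query tend to break the request.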

Related

Iterate through Financial Times pages when Web Scraping

I'm using BeautifulSoup to scrape the titles of the news on the Financial Times website. The website URL is the kind that ends with page=1, page=2 and so on, and I would like to scrape the news headlines of each page. My current code is:
import requests
from bs4 import BeautifulSoup

news_titles=[]
for page in range(5):
    url="https://www.ft.com/world?page=".format(page)
    result=requests.get(url)
    reshult=result.content
    soup=BeautifulSoup(reshult, "lxml")
for title in soup.findAll("div",{"class":"o-teaser__heading"}):
    titles=title.find(text=True)
    news_titles.append(titles)
with open('hug_file.txt', 'w') as f:
    for item in news_titles:
        f.write("%s\n" % item)
However, I'm only getting the headlines from the first page. Can someone help me fix my code?
Use the following code.
import requests
from bs4 import BeautifulSoup

news_titles=[]
for page in range(1,6):
    url="https://www.ft.com/world?page={}".format(page)
    result=requests.get(url)
    reshult=result.content
    soup=BeautifulSoup(reshult, "lxml")
    for title in soup.findAll("div",{"class":"o-teaser__heading"}):
        titles=title.find(text=True)
        news_titles.append(titles)

print(news_titles)
Output:
['US warns Boris Johnson that UK secrets are at risk', 'Algeria’s powerful army chief Ahmed Gaid Salah dies', 'Philippines seeks to relaunch nuclear power ambitions', 'Saudi Arabia sentences five to death for Khashoggi murder', 'Thousands flee renewed offensive by Syrian regime', 'Voters turn on India’s ruling party over Hindu-first agenda', 'Australia’s bushfires have exposed leaders’ failings', 'Japan is wondering if the Olympics are really worth it', 'What is India’s citizenship law and why has it stirred such anger?', '‘Afghanistan Papers’ shed light on Biden', 'The case for public research spending', 'Help fight the illegal wildlife trade', 'Ukrainegate: a guide to the US impeachment inquiry', 'FT’s foreign affairs podcast with Gideon Rachman', 'China’s global spending spree will collapse, says top US official', 'US warns Boris Johnson that UK secrets are at risk', 'Bank of Canada deputy governor leads race for top job', 'Saudi Arabia sentences five to death for Khashoggi murder', 'Johnson pledges to stand up for Christians', 'Turkish court defies European ruling over activist', 'Squaring the Brexit circle', 'Year in a word: Greenland', 'Will the lights go out on Sark this Christmas?', 'Best of our weekday letters 2019', 'Bank of Canada deputy governor leads race for top job', 'Italy seeks to end shoppers’ reliance on cash', 'Why 2019 was not as bad as you think', 'UK companies risk being uninsured for data losses', 'Packing T-shirts? 
There’s a Uniqlo robot for that', 'Can the new UK government end homelessness?', 'Cuadrilla pushes for progress on UK fracking', 'UK ministers under fire for vague audit reform pledge', 'Citigroup set to post record revenues in Hong Kong', 'China companies push US listings as appetite wanes', 'Johnson pledges to stand up for Christians', 'Football bodies under pressure over racist chants', 'FirstFT: 2019 in review', 'The spirit of endeavour has not dimmed in 2019', 'Fears for Vimto sales after UAE and Saudi Arabia impose sugar tax', 'Smaller banks turn to currency derivatives for short term liquidity', 'China banks: still standing', 'Big Ben should remain silent on Brexit Day', 'Pound drops as post-election glow evaporates', 'UK’s military seeks new place in world after Brexit', 'Thousands flee renewed offensive by Syrian regime', 'China banks: still standing', 'Big Ben should remain silent on Brexit Day', 'The case for public research spending', 'Pound drops as post-election glow evaporates', 'Voters turn on India’s ruling party over Hindu-first agenda', 'UK’s military seeks new place in world after Brexit', 'Boris Johnson faces a battle to save the union', '2019: the year of street protest', 'Croatia president reaches run-off in re-election bid', 'Trade Secrets: a year in charts', 'Japan is wondering if the Olympics are really worth it', 'Spain’s businesses worried by prospect of radical left', 'What I want for Xmas — more empathy', 'Trump exposed', 'Machine learning: the big risks and how to manage them', 'Xi turns peacemaker amid dispute between Tokyo and Seoul', 'Australia’s bushfires have exposed leaders’ failings', '‘Afghanistan Papers’ shed light on Biden', 'Can social pacts spur inflation?', 'Further reading', 'The European (In)stability Mechanism', 'Ethiopia seizes crown as fastest-growing country in the 2010s', 'China’s new foreign investment law is a missed opportunity', 'Latam renewable energy investment hits record high', 'Help fight the illegal 
wildlife trade', 'Ukrainegate: a guide to the US impeachment inquiry', 'FT’s foreign affairs podcast with Gideon Rachman', 'The big market moments of 2019', 'FT poll: Christine Lagarde expected to change ECB inflation target', 'Bermuda’s status as insurance safe harbour under threat', 'Plunder of the Commons, by Guy Standing', 'China’s global spending spree will collapse, says top US official', 'George Mitchell, transformer of the energy market', 'China’s might damps criticism of Uighur crackdown', 'How 2019’s mammoth bond rally buoyed entire eurozone', 'Rolls-Royce cuts apprentice and graduate schemes by almost 30%', 'Hong Kong protests loom large over Taiwan election', 'Battered chipmakers look forward to a better 2020', 'What is India’s citizenship law and why has it stirred such anger?', 'Southern manufacturing outpaces north and Midlands', 'Productivity growth of 0.3% is ‘statistic of decade’', 'Heathrow shows detailed costings for third runway', 'UK visa numbers to be raised in science research push', 'Why China’s AI companies are struggling to evolve beyond surveillance', 'UK corporate pension transfer market set for record year', 'FirstFT: Today’s top stories\xa0', 'Bond wobble shrinks global pile of negative yields', 'Corporate Japan posts record number of M&A deals', 'Carney leaves a BoE more in tune with the modern world', 'Interim candidate lined up to succeed Bailey at FCA', 'US threats to Afghan war probe ‘troubling’, says ICC', 'Modi castigates protesters as death toll rises', 'Help fight the illegal wildlife trade', 'Ukrainegate: a guide to the US impeachment inquiry', 'FT’s foreign affairs podcast with Gideon Rachman', 'Pentagon wants open-source 5G plan to take on Huawei', 'Year in a word: Be water', 'New battle over Scottish independence has begun', 'Macron vows to forgo presidential pension amid strikes', 'Tesco pulls Christmas cards over China forced labour claim', 'Austerity, not the populists, destroyed Europe’s centre ground', 'US envoy 
defends Nord Stream 2 sanctions as ‘pro-European’', 'France loosens ties with west African currency', 'Brazilian consumers feel festive cheer as economy returns to life', 'Global food supply chains caught in honey trap', 'America’s competitiveness problem', 'How asset managers turned into business agitators', 'Fear of Russian attack hangs over Germany’s Chechens', 'India is at risk of sliding into a second Emergency', 'Pantomimes: Dame Fortune', 'UK election: how the Tories ‘got it done’', 'Political life shows managers need a stable team', 'Xinjiang security crackdown sparks Han Chinese exodus', 'Britain’s homeless crisis can be solved — here’s how', 'Political nous helps Bailey win race to head Bank of England', 'BuzzFeed’s international business losses quadruple', 'Death toll rises as anti-Modi protests intensify', 'Scars of Romania’s revolution still to heal', 'Thousands of new homes to be built on England’s floodplains', 'Mexico plans crackdown on private electricity market', 'Help fight the illegal wildlife trade', 'Ukrainegate: a guide to the US impeachment inquiry', 'FT’s foreign affairs podcast with Gideon Rachman', 'What lies ahead for Boris Johnson’s government?', 'Extinction/Chinese medicine: diseconomies of scales', 'Treasury rakes in £9m in Lifetime Isa penalties', 'India’s youth voice anger at Modi’s citizenship law', 'The economy is king in Trump’s re-election bid', 'Ukraine and Russia sign deal on gas supply to Europe', 'N Ireland talks on resuming Stormont put on pause', 'Johnson’s bill victory breaks Brexit gridlock', 'Argentina delays payments on $9bn in debt', 'Wall Street adds to record run', 'Thousands face lower tax bills after ‘loan charge’ ruling', 'Facebook bans pro-Trump media outlet over fake accounts', 'FT Asia-Pacific Innovative Lawyers 2020 open for submissions', 'Johnson clears path for Brexit with draft bill victory', 'Trump tries to spin impeachment into re-election gold', 'Corporate year in review: deals, drama, spies and 
successes', 'Violence escalates in Libya following Turkey pledge', 'New Bank of England governor offers stability', 'UK seeks extradition of US diplomat’s wife over teen death', 'CC Land and Meyer Bergman invest in £1.25bn London property', 'Puigdemont prepares to take seat as MEP after court win', 'Defence minister says UK military faces shortfall', 'So what happened to the Boris bounce?', 'Andrew Bailey wins race to be Bank of England governor', 'Colombia approves tax reforms despite protests', 'Help fight the illegal wildlife trade', 'Ukrainegate: a guide to the US impeachment inquiry', 'FT’s foreign affairs podcast with Gideon Rachman']
The first mistake is that you need to put the title for-loop inside the page for-loop.
Second, format(page) does not append the page number because the URL string contains no {} placeholder, so there is nothing to substitute. Either add the placeholder or simply build the URL by concatenating the base URL and the string version of the page number.
Third, make the range (1,6), since page= and page=1 return the same page.
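The placeholder point can be seen in isolation (a quick sketch, independent of the scraping code):

```python
base = "https://www.ft.com/world?page="

# No {} placeholder: format() has nothing to substitute, the URL is unchanged.
assert base.format(3) == "https://www.ft.com/world?page="

# With a placeholder, or with plain concatenation, the page number appears.
assert (base + "{}").format(3) == "https://www.ft.com/world?page=3"
assert base + str(3) == "https://www.ft.com/world?page=3"
```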
I have revised your code below. It should work.
import requests
from bs4 import BeautifulSoup

news_titles=[]
for page in range(1,6):
    url="https://www.ft.com/world?page=" + str(page)
    result=requests.get(url)
    reshult=result.content
    soup=BeautifulSoup(reshult, "lxml")
    for title in soup.findAll("div",{"class":"o-teaser__heading"}):
        titles=title.find(text=True)
        news_titles.append(titles)

for item in news_titles:
    print(item)

Find all references to a supplied noun in StanfordNLP

I'm trying to parse some text to find all references to a particular item. So, for example, if my item was The Bridge on the River Kwai and I passed it this text, I'd like it to find all the instances I've put in bold.
The Bridge on the River Kwai is a 1957 British-American epic war film
directed by David Lean and starring William Holden, Jack Hawkins, Alec
Guinness, and Sessue Hayakawa. The film is a work of fiction, but
borrows the construction of the Burma Railway in 1942–1943 for its
historical setting. The movie was filmed in Ceylon (now Sri Lanka).
The bridge in the film was near Kitulgala.
So far my attempt has been to go through all the mentions attached to each CorefChain and loop through those hunting for my target string. If I find the target string, I add the whole CorefChain, as I think this means the other items in that CorefChain also refer to the same thing.
List<CorefChain> gotRefs = new ArrayList<CorefChain>();
String pQuery = "The Bridge on the River Kwai";
for (CorefChain cc : document.get(CorefCoreAnnotations.CorefChainAnnotation.class).values()) {
    List<CorefChain.CorefMention> corefMentions = cc.getMentionsInTextualOrder();
    boolean addedChain = false;
    for (CorefChain.CorefMention cm : corefMentions) {
        if ((!addedChain) && (pQuery.equals(cm.mentionSpan))) {
            gotRefs.add(cc);
            addedChain = true;
        }
    }
}
I then loop through this second list of CorefChains, re-retrieve the mentions for each chain and step through them. In that loop I show which sentences have a likely mention of my item in a sentence.
for (CorefChain gr : gotRefs) {
    List<CorefChain.CorefMention> corefMentionsUsing = gr.getMentionsInTextualOrder();
    for (CorefChain.CorefMention cm : corefMentionsUsing) {
        System.out.println("Got reference to " + cm.mentionSpan + " in sentence #" + cm.sentNum);
    }
}
It finds some of my references, but not that many, and it produces a lot of false positives. As is probably entirely apparent from reading this, I don't really know the first thing about NLP - am I going about this entirely the wrong way? Is there a StanfordNLP parser that will already do some of what I'm after? Should I be training a model in some way?
I think a problem with your example is that you are looking for references to a movie title, and there isn't support in Stanford CoreNLP for recognizing movie titles, book titles, and so on.
If you look at this example:
"Joe bought a laptop. He is happy with it."
You will notice that it connects:
"Joe" -> "He"
and
"a laptop" -> "it"
Coreference is an active research area and even the best system can only really be expected to produce an F1 of around 60.0 on general text, meaning it will often make errors.
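One pragmatic workaround on top of the code above (not something CoreNLP provides for you) is to relax the exact pQuery.equals(cm.mentionSpan) test: match mention spans case-insensitively, or accept spans containing a distinctive keyword from the title. A minimal sketch of the idea, with a hypothetical keyword list and sample mention strings:

```python
def matches_target(mention, target, keywords=("kwai",)):
    # Case-insensitive equality, or presence of a distinctive keyword,
    # counts as a match; the keyword list must be tuned per target.
    m = mention.lower()
    return m == target.lower() or any(k in m for k in keywords)

target = "The Bridge on the River Kwai"
mentions = ["The Bridge on the River Kwai", "the bridge on the River Kwai",
            "the River Kwai bridge", "The film"]
hits = [m for m in mentions if matches_target(m, target)]
print(hits)  # "The film" is only reachable through coreference, not matching
```

Looser matching trades some false positives for recall, so it complements, rather than replaces, the coreference chains.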

making link between a person name and pronoun in GATE

Is it possible to make a link between a person name and its PRP (pronoun) in GATE? E.g. I have the document "Maria Sharapova is a tennis player from Russia. She participates in international tennis tournaments. She is known for winning Wimbledon, US Open and Australian Open titles as well as for her looks, decibel-breaking grunts and commercial savvy - all of which made her the world's highest-paid female athlete." I want to annotate "she" as "Maria Sharapova". I have written the following JAPE rule, which identifies a pattern having a PRP after a person name:
Phase: Simple
Input: Lookup Token Split
Options: control = appelt
Rule:joblookup
(
({Lookup.majorType == person_first}|
{Lookup.majorType == person_full})
({Token.kind==word})+
{Split.kind==internal}
{Token.category==PRP}
):sameathlete
-->
:sameathlete.sameAthlete1 = {kind="athlete", rule="same-athlete"}
How can I make the annotation so that "She" is linked to the same person whose name is mentioned one or two sentences before?
Please help.
Have you tried Co-reference PR for gate?

Train model using Named entity

I am looking at Stanford CoreNLP's Named Entity Recognizer. I have different kinds of input text and I need to tag them with my own entity, so I started training my own model, but it doesn't seem to be working.
For example, my input text string is "Book of 49 Magazine Articles on Toyota Land Cruiser 1956-1987 Gold Portfolio http://t.co/EqxmY1VmLg http://t.co/F0Vefuoj9Q"
I went through the examples for training your own models, and I look for only some words that I am interested in.
My jane-austen-emma-ch1.tsv looks like this:
Toyota PERS
Land Cruiser PERS
From the above input text I am only interested in those two words: one is Toyota and the other is Land Cruiser.
The austen.prop file looks like this:
trainFile = jane-austen-emma-ch1.tsv
serializeTo = ner-model.ser.gz
map = word=0,answer=1
useClassFeature=true
useWord=true
useNGrams=true
noMidNGrams=true
useDisjunctive=true
maxNGramLeng=6
usePrev=true
useNext=true
useSequences=true
usePrevSequences=true
maxLeft=1
useTypeSeqs=true
useTypeSeqs2=true
useTypeySequences=true
wordShape=chris2useLC
Run the following command to generate the ner-model.ser.gz file
java -cp stanford-corenlp-3.4.1.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop austen.prop
public static void main(String[] args) {
    String serializedClassifier = "edu/stanford/nlp/models/ner/english.muc.7class.distsim.crf.ser.gz";
    String serializedClassifier2 = "C:/standford-ner/ner-model.ser.gz";
    try {
        NERClassifierCombiner classifier = new NERClassifierCombiner(false, false,
                serializedClassifier2, serializedClassifier);
        String ss = "Book of 49 Magazine Articles on Toyota Land Cruiser 1956-1987 Gold Portfolio http://t.co/EqxmY1VmLg http://t.co/F0Vefuoj9Q";
        System.out.println("---");
        List<List<CoreLabel>> out = classifier.classify(ss);
        for (List<CoreLabel> sentence : out) {
            for (CoreLabel word : sentence) {
                System.out.print(word.word() + '/' + word.get(AnswerAnnotation.class) + ' ');
            }
            System.out.println();
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
}
Here is the output I am getting
Book/PERS of/PERS 49/O Magazine/PERS Articles/PERS on/O Toyota/PERS Land/PERS Cruiser/PERS 1956-1987/PERS Gold/O Portfolio/PERS http://t.co/EqxmY1VmLg/PERS http://t.co/F0Vefuoj9Q/PERS
which I think is wrong. I am looking for Toyota/PERS and Land Cruiser/PERS (which is a multi-word field).
Thanks for the help. Any help is really appreciated.
I believe you should also put examples of O (non-entity) tokens in your trainFile. As you gave it, the trainFile is just too simple for the learning to be done; it needs both O and PERS examples so it doesn't annotate everything as PERS. You're not teaching it about your not-of-interest tokens. Say, like this:
Toyota PERS
of O
Portfolio O
49 O
and so on.
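Generating such a file can be sketched like this (the helper and the token set are hypothetical, just to show the shape of the tab-separated lines the trainer expects):

```python
def to_tsv_lines(tokens, entity_tokens, label="PERS"):
    # Tokens belonging to the phrases of interest get the entity label;
    # every other token gets the background label O.
    return [tok + "\t" + (label if tok in entity_tokens else "O")
            for tok in tokens]

tokens = "Book of 49 Magazine Articles on Toyota Land Cruiser".split()
print("\n".join(to_tsv_lines(tokens, {"Toyota", "Land", "Cruiser"})))
```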
Also, for phrase-level recognition you should look into RegexNER, where you can define patterns. I'm working on this with the API and I have the following code:
Properties props = new Properties();
props.put("annotators", "tokenize, ssplit, pos, lemma, ner, regexner");
props.put("regexner.mapping", customLocationFilename);
with the following customLocationFilename:
Make Believe Town figure of speech ORGANIZATION
( /Hello/ [{ ner:PERSON }]+ ) salut PERSON
Bachelor of (Arts|Laws|Science|Engineering) DEGREE
( /University/ /of/ [{ ner:LOCATION }] ) SCHOOL
and text: Hello Mary Keller was born on 4th of July and took a Bachelor of Science. Partial invoice (€100,000, so roughly 40%) for the consignment C27655 we shipped on 15th August to University of London from the Make Believe Town depot. INV2345 is for the balance.. Customer contact (Sigourney Weaver) says they will pay this on the usual credit terms (30 days).
The output I get
Hello Mary Keller is a salut
4th of July is a DATE
Bachelor of Science is a DEGREE
$ 100,000 is a MONEY
40 % is a PERCENT
15th August is a DATE
University of London is a ORGANIZATION
Make Believe Town is a figure of speech
Sigourney Weaver is a PERSON
30 days is a DURATION
For more info on how to do this you can look at the example that got me going.
The NERClassifier* is word level, that is, it labels words, not phrases. Given that, the classifier seems to be performing fine. If you want, you can hyphenate words that form phrases. So in your labeled examples and in your test examples, you would make "Land Cruiser" to "Land_Cruiser".
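That hyphenation idea can be sketched as a simple preprocessing step (the phrase list is an assumption; you would list your own multi-word entities):

```python
def join_phrases(text, phrases):
    # Replace spaces inside known multi-word phrases with underscores so a
    # word-level classifier sees each phrase as a single token.
    for p in phrases:
        text = text.replace(p, p.replace(" ", "_"))
    return text

s = "Book of 49 Magazine Articles on Toyota Land Cruiser 1956-1987"
print(join_phrases(s, ["Land Cruiser"]))
```

The same transformation has to be applied to both the training file and the test text, so the model only ever sees the joined form.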

shell script to extract multiline data out of a text file with regex

I am trying to extract some specific data out of a text file using regular expressions with a shell script,
that is, using a multiline grep. The tool I am using is pcregrep, so that I get compatibility with Perl's regular expressions.
[58]Walid Chamoun Architects WLL
* [59]Map
* [60]Website
* [61]Email
* [62]Profile
* [63]Display Ad
Walid Chamoun Architects WLL
PO Box:
55803, Doha, Qatar
Location:
D-Ring Road, New Salata Shamail 40, Villa 340, Doha, Qatar
Tel:
(00974) 44568833
Fax:
(00974) 44568811
Mob:
(00974) 44568822
* Accurate Budget Costing
* Eco-Friendly Structural Design
* Exclusive & Unique Design
* Quality Architecture & Design
Company Profile
Walid Chamoun Architects (WCA) was founded in Beirut, Lebanon, in 1992,
committed to the concept of fully integrated design-build delivery of
projects. In late '90s, company established in-house architectural and
engineering services. As a full service provider, WCA expanded from
multi-family projects to industrial and office construction, which
added development services, including site acquisition and financing.
In 2001, WCA had opportunity and facilities to experience European
market and establish office in Puerto Banus, Marbella, Spain. By 2005,
WCA refined its structure to focus on specific market segments and new
office was opened in Doha, state of Qatar. From a solid foundation and
reputation built over eighteen years, WCA continually to provide
leadership in design-build through promotion of benefits and education
to its practitioners.
Project Planning: Project planning and investigation occurs before
design begins has greatest impact on cost, schedule and ultimately the
success of project. Creativity in Design: You can rely on our in-house
designers for design excellence in all aspects of the project. Our
designs have received recommendations and appreciations on national and
international levels. Creativity in Execution: Experienced in close
collaboration with the designers as part of the integrated team, our
construction managers, superintendents and field staff create value
throughout the project. Post Completion Services: Your needs can be
served through our skills and experience long after the last
construction crew has left the site. Performance: Corporate and
institutional clients, developers and public agencies repeatedly select
WCA on the basis of its consistent record of performance excellence.
Serving clients throughout the Middle East and GCC, WCA provides
complete planning for architectural, interior design and construction
on a single-responsibility basis. Our expertise spans industrial,
commercial, institutional, public and residential projects. Benefits of
Design-Build: Design-build is a system of contracting under which one
entity performs both design and construction. Benefits of design-build
project delivery include: Single point responsibility Early knowledge
of cost Time and Cost savings
Classification:
Architects - [64]Architects
[65]Al Ali Consulting & Engineering
* [66]Map
* Website
* Email
* Profile
* Display Ad
Is this your company?
[67]Upgrade this free listing here
PO Box:
467, Doha, Qatar
Tel:
(00974) 44360011
Company Profile
Classification:
Architects - [68]Architects
[69]Al Gazeerah Consulting Engineering
* [70]Map
* Website
* Email
* Profile
* Display Ad
Is this your company?
[71]Upgrade this free listing here
PO Box:
22414, Doha, Qatar
Tel:
(00974) 44352126
Company Profile
Classification:
Architects - [72]Architects
[73]Al Murgab Consulting Engineering
* [74]Map
* Website
* Email
* Profile
* Display Ad
Is this your company?
[75]Upgrade this free listing here
PO Box:
2856, Doha, Qatar
Tel:
(00974) 44448623
Company Profile
Classification:
Architects - [76]Architects
References
Visible links
1. http://www.qatcom.com/useraccounts/login
2. http://www.qatcom.com/useraccounts/register
3. http://www.qatcom.com/
4. http://www.qatcom.com/
5. http://www.qatcom.com/qataryellowpages/map-of-doha
6. http://www.qatcom.com/qataryellowpages/about-qatcom
7. http://www.qatcom.com/qataryellowpages/advertise-with-qatcom
8. http://www.qatcom.com/qataryellowpages/advertiser_testimonials
9. http://www.qatcom.com/useraccounts/login
10. http://www.qatcom.com/useraccounts/register
11. http://www.qatcom.com/contact-qatcom
12. http://www.qatcom.com/qataryellowpages/companies
13. http://www.qatcom.com/classifications/index/A
14. http://www.qatcom.com/classifications/index/B
15. http://www.qatcom.com/classifications/index/C
16. http://www.qatcom.com/classifications/index/D
17. http://www.qatcom.com/classifications/index/E
18. http://www.qatcom.com/classifications/index/F
19. http://www.qatcom.com/classifications/index/G
20. http://www.qatcom.com/classifications/index/H
21. http://www.qatcom.com/classifications/index/I
22. http://www.qatcom.com/classifications/index/J
23. http://www.qatcom.com/classifications/index/K
24. http://www.qatcom.com/classifications/index/L
25. http://www.qatcom.com/classifications/index/M
26. http://www.qatcom.com/classifications/index/N
27. http://www.qatcom.com/classifications/index/O
28. http://www.qatcom.com/classifications/index/P
For sample data like this, I am trying to grab the details of the companies, namely:
company name
PO box
Tel
Fax
Mobile
company profile
into a .csv file.
I am new to regular expressions and to Linux, too.
All I could manage to get was something like this:
\[\d*\][^\.]*[\(\d*\)\s\d*)]
Can anyone help me out with this, please?
Improvements:
I figured out something like this:
$ awk '/^\[/ && ! /Upgrade this free listing/ {print $0} /:$/ && ! /Classification/ {printf $0 ; getline x ; print x}' file
but that still isn't what I want it to be...
You can do this in awk, but you'll be better off parsing the HTML instead. A good tool to do that with would be Python using the Beautiful Soup module. But that's not very exciting, so here's how to do it the awkward (hah!) way:
#!/usr/bin/awk -f

function trim(s) {
    gsub(/(^ +)|( +$)/, "", s)
    return s
}

BEGIN {
    count = 0
    fields[0] = "company"
    fields[1] = "pobox"
    fields[2] = "tel"
    fields[3] = "fax"
    fields[4] = "mob"
    fields[5] = "profile"
}

# company name
/^ +\[[0-9]+\].*$/ {
    sub(/^ +\[[0-9]+\]/, "")  # get rid of the Lynx reference
    # this is a bit naughty: our regex also matches one other link,
    # but there's only one of them, so we just filter it
    if ($0 != "Upgrade this free listing here") data[count,"company"] = $0
}

# two-line fields, easy!
/ +PO Box:$/ { getline; data[count,"pobox"] = $0 }
/ +Tel:$/    { getline; data[count,"tel"] = $0 }
/ +Fax:$/    { getline; data[count,"fax"] = $0 }
/ +Mob:$/    { getline; data[count,"mob"] = $0 }

# multi-line field, tricky because it can be empty
/^Company Profile$/ {
    getline  # skip empty line
    # accumulate lines until encountering the Classification field
    s = ""
    do {
        s = s $0
        getline
    } while ($0 !~ / +Classification:$/)
    data[count,"profile"] = s
    count++  # the Classification field denotes the end of the company record
}

END {
    OFS = ","
    # output CSV header row
    for (key in fields) {
        printf "\"" fields[key] "\","
    }
    printf "\n"
    # output data rows
    for (i = 0; i < count; i++) {
        for (key in fields) {
            printf "\"" trim(data[i,fields[key]]) "\","
        }
        printf "\n"
    }
}
Save it as parse.awk, make it executable (chmod +x parse.awk), and then invoke it with ./parse.awk < sample.txt. Out comes a CSV, like this:
"tel","fax","mob","profile","company","pobox",
"(00974) 44568833","(00974) 44568811","(00974) 44568822","Walid Chamoun Architects (WCA) was founded in Beirut, Lebanon, in 1992, committed to the blablabla","Walid Chamoun Architects WLL","55803, Doha, Qatar",
"(00974) 44360011","","","","Al Ali Consulting & Engineering","467, Doha, Qatar",
"(00974) 44352126","","","","Al Gazeerah Consulting Engineering","22414, Doha, Qatar",
"(00974) 44448623","","","","Al Murgab Consulting Engineering","2856, Doha, Qatar",
There are comments that should hopefully explain what's going on. This will run in plain old awk and doesn't require fancy gawk features. Keep in mind that awk iterates over arrays in arbitrary order, which is why the columns come out shuffled (though, in practice, consistently within one run). This is prone to breaking with varying input data, which is just one of the many reasons why you really should parse the HTML instead of such lynx -dump shenanigans.
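If you do switch to Python, the same record-extraction idea also works on the lynx dump directly; a minimal sketch (the field handling mirrors the awk script above, and the sample string is abbreviated from the question's data):

```python
import re

sample = """\
[58]Walid Chamoun Architects WLL
PO Box:
55803, Doha, Qatar
Tel:
(00974) 44568833
Fax:
(00974) 44568811
"""

record = {}
lines = [l.strip() for l in sample.splitlines()]
for i, line in enumerate(lines):
    m = re.match(r"\[\d+\](.+)", line)
    if m and m.group(1) != "Upgrade this free listing here":
        record["company"] = m.group(1)
    elif line in ("PO Box:", "Tel:", "Fax:", "Mob:") and i + 1 < len(lines):
        # field label on one line, value on the next
        record[line.rstrip(":").lower().replace(" ", "")] = lines[i + 1]

print(record)
```

A real version would split the dump into one record per company (e.g. at each company-name link) before applying this, but parsing the original HTML with Beautiful Soup remains the more robust route.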
