I've been trying to read an xml file and convert it to json using groovy's JsonBuilder. The problem is that when I print with
def builder = new JsonBuilder(jsonObject)
println builder.toPrettyString()
I got a Caught: java.lang.StackOverflowError
Here is the whole stacktrace
Exception in thread "main" java.lang.StackOverflowError
at groovy.json.JsonOutput.writeObject(JsonOutput.java:259)
at groovy.json.JsonOutput.writeIterator(JsonOutput.java:442)
at groovy.json.JsonOutput.writeObject(JsonOutput.java:272)
at groovy.json.JsonOutput.writeIterator(JsonOutput.java:442)
at groovy.json.JsonOutput.writeObject(JsonOutput.java:272)
at groovy.json.JsonOutput.writeIterator(JsonOutput.java:442)
at groovy.json.JsonOutput.writeObject(JsonOutput.java:272)
at groovy.json.JsonOutput.writeIterator(JsonOutput.java:442)
at groovy.json.JsonOutput.writeObject(JsonOutput.java:272)
at groovy.json.JsonOutput.writeIterator(JsonOutput.java:442)
at groovy.json.JsonOutput.writeObject(JsonOutput.java:272)
at groovy.json.JsonOutput.writeIterator(JsonOutput.java:442)
at groovy.json.JsonOutput.writeObject(JsonOutput.java:272)
at groovy.json.JsonOutput.writeIterator(JsonOutput.java:442)
Here the code.
package firstgroovyproject
import groovy.json.JsonBuilder
class XmlToJsonII {
static void main(def args){
def carRecords = '''
<records>
<car name='HSV Maloo' make='Holden' year='2006'>
<countries>
<country>
Austria
</country>
<country>
Spain
</country>
</countries>
<record type='speed'>Production Pickup Truck with speed of 271kph
</record>
</car>
<car name='P50' make='Peel' year='1962'>
<countries>
<country>
Monaco
</country>
</countries>
<record type='size'>Smallest Street-Legal Car at 99cm wide and 59 kg
in weight</record>
</car>
<car name='Royale' make='Bugatti' year='1931'>
<record type='price'>Most Valuable Car at $15 million</record>
<countries>
<country>
Italia
</country>
</countries>
</car>
<car name='Seat' make='Ibiza' year='1985'>
<record type='price'>barato</record>
<countries>
<country>
Spain
</country>
</countries>
</car>
</records>
'''
def xmlRecords = new XmlSlurper().parseText(carRecords)
def jsonObject = [:]
jsonObject.records = []
def records = jsonObject.records
xmlRecords.car.each {xmlCar ->
records.add([
countries:
xmlCar.countries.children().each{ country ->
println "country : ${country.text()}"
[country: country.text()]
},
])
}
def builder = new JsonBuilder(jsonObject)
println builder.toPrettyString()
//println builder.toString()
}
}
tl;dr: Your second (inner) each should be a collect instead.
The real answer: The return value of each is the original Iterable that invoked it. In this case that would be the XML object collection defined by the expression xmlCar.countries.children(). Since the objects in that collection contain references back to their own parents your JSON build causes an infinite regress that results in a stack overflow.
This doesn't happen with your first (outer) use of each, because you are not using the return value. Instead you are adding to a preexisting list (records).
By changing the second (inner) each into a collect, you are still iterating through the children of the countries elements. However, instead of returning the original XML children (the result of each) you are compiling and returning a list of maps of the form [country:"country_string_from_xml"], which seems to be the desired behavior.
Confusing each and collect is a common rookie mistake in Groovy... which only made it all the more galling when it happened to ME last week. :-)
Related
I'm looking how to grab all sub elements in a XML file, but something is going wrong.
import xml.etree.ElementTree as ET
tree = ET.parse('C:\\Users\\f6792150\\Documents\profile.xml')
root = tree.getroot()
for child in root:
print(child.tag, child.attrib)
Here is what I got:
INFO {}
INFO {}
INFO {}
INFO {}
INFO {}
INFO {}
I was expecting to get all childs inside the INFO tag, e.g. (TICKER, NAME, ADDRESS, PHONE, etc) but it comes empty. Below is the XML file i'm using:
<?xml version="1.0"?>
<collection shelf = 'profile'>
<INFO>
<TICKER>AAPL</TICKER>
<NAME> Apple Inc.</NAME>
<ADDRESS>1 Infinite Loop;Cupertino, CA 95014;United State</ADDRESS>
<PHONE>408-996-1010</PHONE>
<WEBSITE>http://www.apple.com</WEBSITE>
<SECTOR>Technology</SECTOR>
<INDUSTRY>Consumer Electronics</INDUSTRY>
<FULL_TIME>100,000</FULL_TIME>
<BUS_SUMM>Apple</BUS_SUMM>
<SOURCE>https://finance.yahoo.com/quote/AAPL/profile?p=AAPL</SOURCE>
</INFO>
<INFO>
<TICKER>T</TICKER>
<NAME> AT and T Inc.</NAME>
<ADDRESS>208 South Akard Street;Dallas, TX 75202;United States</ADDRESS>
<PHONE>210-821-4105</PHONE>
<WEBSITE>http://www.att.com</WEBSITE>
<SECTOR>Communication Services</SECTOR>
<INDUSTRY> Telecom Services</INDUSTRY>
<FULL_TIME>254,000</FULL_TIME>
<BUS_SUMM>at and t</BUS_SUMM>
<SOURCE>https://finance.yahoo.com/quote/T/profile?p=T</SOURCE>
</INFO>
</collection>
Cheers!
Try something like:
for child in root.findall('./INFO//'):
print(child.tag,child.text)
Output:
TICKER AAPL
NAME Apple Inc.
ADDRESS 1 Infinite Loop;Cupertino, CA 95014;United State
PHONE 408-996-1010
etc.
i want to get actor name out of this json file page_title and then match this with database i tried using nltk and spacy but there i have to train data. Do i have train for each and ever sentence i have more than 100k sentences. If i sit to train data it will takes a month or more. Is there any way that i can dump K_actor database to train spacy, nltk or any other way.
{"page_title": "Sonakshi Sinha To Auction Sketch Of Buddha To Help Migrant Labourers", "description": "Sonakshi Sinha took to Instagram to share a timelapse video of a sketch of Buddha that she made to auction to raise funds for migrant workers affected by Covid-19 crisis. ", "image_url": "https://images.news18.com/ibnlive/uploads/2020/05/1589815261_1589815196489_copy_875x583.jpg", "post_url": "https://www.news18.com/news/movies/sonakshi-sinha-to-auction-sketch-of-buddha-to-help-migrant-labourers-2626123.html"}
{"page_title": "Anushka Sharma Calls Virat Kohli 'A Liar' on IG Live, Nushrat Bharucha Gets Propositioned on Twitter", "description": "In an Instagram live interaction with Sunil Chhetri, Virat Kohli was left embarrassed after Anushka Sharma called him a 'jhootha' from behind the camera. This and more in today's wrap.", "image_url": "https://images.news18.com/ibnlive/uploads/2020/05/1589813980_1589813933996_copy_875x583.jpg", "post_url": "https://www.news18.com/news/movies/anushka-sharma-calls-virat-kohli-a-liar-on-ig-live-nushrat-bharucha-gets-propositioned-on-twitter-2626093.html"}
{"page_title": "Ranveer Singh Shares a Throwback to the Days When WWF was His Life", "description": "Ranveer Singh shared a throwback picture from his childhood where he could be seen posing in front of a poster of WWE legend Hulk Hogan.", "image_url": "https://images.news18.com/ibnlive/uploads/2020/05/1589812401_screenshot_20200518-195906_chrome_copy_875x583.jpg", "post_url": "https://www.news18.com/news/movies/ranveer-singh-shares-a-throwback-to-the-days-when-wwf-was-his-life-2626067.html"}
{"page_title": "Salman Khan's Love Song 'Tere Bina' Gets 26 Million Views", "description": "Salman Khan's song Tere Bina, which was launched a few days ago, had garnered 12 million views within 24 hours. As it continues to trend, it has garnered 26 million views in less than a week.", "image_url": "https://images.news18.com/ibnlive/uploads/2020/05/1589099778_screenshot_20200510-135934_chrome_copy_875x583.jpg", "post_url": "https://www.news18.com/news/movies/salman-khans-love-song-tere-bina-gets-26-million-views-2626077.html"}
{"page_title": "Yash And Radhika Pandit Pose With Their Kids For a Perfect Family Picture", "description": "Kannada actor Yash tied the knot with actress Radhika Pandit in 2016. The couple shares two kids together.", "image_url": "https://images.news18.com/ibnlive/uploads/2020/05/1589812187_yash.jpg", "post_url": "https://www.news18.com/news/movies/yash-and-radhika-pandit-pose-with-their-kids-for-a-perfect-family-picture-2626055.html"}
{"page_title": "Malaika Arora Shares Beach Vacay Boomerang With Hopeful Note", "description": "Malaika Arora shared a throwback boomerang from a beach vacation where she could be seen playfully spinning. She also shared a hopeful message along with it.", "image_url": "https://images.news18.com/ibnlive/uploads/2020/05/1589810291_screenshot_20200518-192603_chrome_copy_875x583.jpg", "post_url": "https://www.news18.com/news/movies/malaika-arora-shares-beach-vacay-boomerang-with-hopeful-note-2626019.html"}
{"page_title": "Actor Nawazuddin Siddiqui's Wife Aaliya Sends Legal Notice To Him Demanding Divorce, Maintenance", "description": "The notice was sent to the ", "image_url": "https://images.news18.com/ibnlive/uploads/2019/10/Nawazuddin-Siddiqui.jpg", "post_url": "https://www.news18.com/news/movies/actor-nawazuddin-siddiquis-wife-aaliya-sends-legal-notice-to-him-demanding-divorce-maintenance-2626035.html"}
{"page_title": "Lisa Haydon Celebrates Son Zack\u2019s 3rd Birthday With Homemade Cake And 'Spiderman' Surprise", "description": "Lisa Haydon took to Instagram to share some glimpses from the special day. In the pictures, we can spot a man wearing a Spiderman costume.", "image_url": "https://images.news18.com/ibnlive/uploads/2020/05/1589807960_lisa-rey.jpg", "post_url": "https://www.news18.com/news/movies/lisa-haydon-celebrates-son-zacks-3rd-birthday-with-homemade-cake-and-spiderman-surprise-2625953.html"}
{"page_title": "Chiranjeevi Recreates Old Picture with Wife, Says 'Time Has Changed'", "description": "Chiranjeevi was last seen in historical-drama Sye Raa Narasimha Reddy. He was shooting for his next film, Acharya, before the coronavirus lockdown.", "image_url": "https://images.news18.com/ibnlive/uploads/2020/05/1589808242_pjimage.jpg", "post_url": "https://www.news18.com/news/movies/chiranjeevi-recreates-old-picture-with-wife-says-time-has-changed-2625973.html"}
{"page_title": "Amitabh Bachchan, Rishi Kapoor\u2019s Pout Selfie Recreated By Abhishek, Ranbir is Priceless", "description": "A throwback picture that has gone viral on the internet shows Ranbir Kapoor and Abhishek Bachchan recreating a selfie of their fathers Rishi Kapoor and Amitabh Bachchan.", "image_url": "https://images.news18.com/ibnlive/uploads/2020/05/1589807772_screenshot_20200518-184521_chrome_copy_875x583.jpg", "post_url": "https://www.news18.com/news/movies/amitabh-bachchan-rishi-kapoors-pout-selfie-recreated-by-abhishek-ranbir-is-priceless-2625867.html"}
Something that you can do is to create an annoter script wherein you can replace actor names with '###' or some other string (which will be replaced later with actor names (entities) for training).
I trained 68K data/sentences in 9 hrs with my i3 laptop. You can dump data like this and the output file can be used for training the model.
That will save time and also give you ready made training data format for SpaCy.
from nltk import word_tokenize
from pandas import read_csv
import re
import os.path
def annot(Label, entity, textlist) :
finaldict = []
for text_token in textlist:
textbk=text_token
for value in entity:
#if entity has multi tokens
text=textbk
text=text_token
text=str(text).replace('###',value)
text=text.lower()
text = re.sub('[^a-zA-Z0-9\n\.]',' ', text)
if len(word_tokenize(value))<2:
#print('I am here')
newtext=word_tokenize(text)
traindata=[]
prev_length=0
prev_pos=0
k=0
while k != len(newtext):
if k == 0:
prev_pos=0
prev_length=len(newtext[k])
if value.lower()== str(newtext[k]):
ent=Label
tup=(prev_pos,prev_length,ent)
traindata.append(tup)
else:
pass
else :
prev_pos=prev_length+1
prev_length=prev_length+len(newtext[k])+1
if value.lower()==str(newtext[k]):
ent=Label
tup=(prev_pos,prev_length,ent)
traindata.append(tup)
else:
pass
k=k+1
mydict={'entities':traindata}
finaldict.append((text,mydict))
else:
traindata=[]
try:
begin=text.index(value.lower())
ent=Label
tup=(begin,len(value.lower()),ent)
traindata.append(tup)
except ValueError:
pass
mydict={'entities':traindata}
finaldict.append((text,mydict))
return finaldict
def getEntities(csv_file, column) :
df = read_csv(csv_file)
return df[column].to_list()
def getSentences(file_name) :
with open(file_name) as file1 :
sentences = [line1.rstrip('\n') for line1 in file1]
return sentences
def saveData (data, filename, path) :
filename = os.path.join(path, filename)
with open(filename, 'a') as file :
for sent in data :
file.write("{}\n".format(sent))
ents = getEntities(csv_file, column_name) #Actor names in your case
entities = [ent for ent in ents if str(ent) != 'nan']
sentences = getSentences(filepathandname) #Considering you have the sentences in a text file
label = 'ACTOR_NAMES'
data = annot(label, entities, sentences)
saveData(data, 'train_data.txt', path)
Hope this is a relevant answer for your question.
i am using geopy library for my Flask web app. i want to save user location which i am getting from my modal(html form) in my database(i am using mongodb), but every single time i am getting this error:
TypeError: Object of type 'Location' is not JSON serializable
Here's the code:
#app.route('/register', methods=['GET', 'POST'])
def register_user():
if request.method == 'POST':
login_user = mongo.db.mylogin
existing_user = login_user.find_one({'email': request.form['email']})
# final_location = geolocator.geocode(session['address'].encode('utf-8'))
if existing_user is None:
hashpass = bcrypt.hashpw(
request.form['pass'].encode('utf-8'), bcrypt.gensalt())
login_user.insert({'name': request.form['username'], 'email': request.form['email'], 'password': hashpass, 'address': request.form['add'], 'location' : session['location'] })
session['password'] = request.form['pass']
session['username'] = request.form['username']
session['address'] = request.form['add']
session['location'] = geolocator.geocode(session['address'])
flash(f"You are Registerd as {session['username']}")
return redirect(url_for('home'))
flash('Username is taken !')
return redirect(url_for('home'))
return render_template('index.html')
Please Help, let me know if you want more info..
According to the geolocator documentation the geocode function "Return a location point by address" geopy.location.Location objcet.
Json serialize support by default the following types:
Python | JSON
dict | object
list, tuple | array
str, unicode | string
int, long, float | number
True | true
False | false
None | null
All the other objects/types are not json serialized by default and there for you need to defined it.
geopy.location.Location.raw
Location’s raw, unparsed geocoder response. For details on this,
consult the service’s documentation.
Return type: dict or None
You might be able to call the raw function of the Location (the geolocator.geocode return value) and this value will be json serializable.
Location is indeed not json serializable: there are many properties in this object and there is no single way to represent a location, so you'd have to choose one by yourself.
What type of value do you expect to see in the location key of the response?
Here are some examples:
Textual address
In [9]: json.dumps({'location': geolocator.geocode("175 5th Avenue NYC").address})
Out[9]: '{"location": "Flatiron Building, 175, 5th Avenue, Flatiron District, Manhattan Community Board 5, Manhattan, New York County, New York, 10010, United States of America"}'
Point coordinates
In [10]: json.dumps({'location': list(geolocator.geocode("175 5th Avenue NYC").point)})
Out[10]: '{"location": [40.7410861, -73.9896298241625, 0.0]}'
Raw Nominatim response
(That's probably not what you want to expose in your API, assuming you want to preserve an ability to change geocoding service to another one in future, which might have a different raw response schema).
In [11]: json.dumps({'location': geolocator.geocode("175 5th Avenue NYC").raw})
Out[11]: '{"location": {"place_id": 138642704, "licence": "Data \\u00a9 OpenStreetMap contributors, ODbL 1.0. https://osm.org/copyright", "osm_type": "way", "osm_id": 264768896, "boundingbox": ["40.7407597", "40.7413004", "-73.9898715", "-73.9895014"], "lat": "40.7410861", "lon": "-73.9896298241625", "display_name": "Flatiron Building, 175, 5th Avenue, Flatiron District, Manhattan Community Board 5, Manhattan, New York County, New York, 10010, United States of America", "class": "tourism", "type": "attraction", "importance": 0.74059885426854, "icon": "https://nominatim.openstreetmap.org/images/mapicons/poi_point_of_interest.p.20.png"}}'
Textual address + point coordinates
In [12]: location = geolocator.geocode("175 5th Avenue NYC")
...: json.dumps({'location': {
...: 'address': location.address,
...: 'point': list(location.point),
...: }})
Out[12]: '{"location": {"address": "Flatiron Building, 175, 5th Avenue, Flatiron District, Manhattan Community Board 5, Manhattan, New York County, New York, 10010, United States of America", "point": [40.7410861, -73.9896298241625, 0.0]}}'
I am using Stanford CoreNLP. I need to detect and identify the "Coreference set"s and "representative mention"s for each CorefChain in my input text:
For example:
Input:
Obama was elected to the Illinois state senate in 1996 and served there for eight years. In 2004, he was elected by a record majority to the U.S. Senate from Illinois and, in February 2007, announced his candidacy for President.
Output: With "Pretty Print" I can get the output below:
**Coreference set:
(2,4,[4,5]) -> (1,1,[1,2]), that is: "he" -> "Obama"
(2,24,[24,25]) -> (1,1,[1,2]), that is: "his" -> "Obama"
(3,22,[22,23]) -> (1,1,[1,2]), that is: "Obama" -> "Obama"**
However, I need to programmatically identify and detect the output above, which is called the "Coreference set". (I mean I need to identify all the pairs like: "he" -> "Obama")
Note: My base code is the one below (it is from http://stanfordnlp.github.io/CoreNLP/coref.html):
import edu.stanford.nlp.hcoref.CorefCoreAnnotations;
import edu.stanford.nlp.hcoref.data.CorefChain;
import edu.stanford.nlp.hcoref.data.Mention;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;
import java.util.Properties;
public class CorefExample {
public static void main(String[] args) throws Exception {
Annotation document = new Annotation("Obama was elected to the Illinois state senate in 1996 and served there for eight years. In 2004, he was elected by a record majority to the U.S. Senate from Illinois and, in February 2007, announced his candidacy for President.");
Properties props = new Properties();
props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,parse,mention,coref");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
pipeline.annotate(document);
System.out.println("---");
System.out.println("coref chains");
for (CorefChain cc : document.get(CorefCoreAnnotations.CorefChainAnnotation.class).values()) {
System.out.println("\t"+cc);
}
for (CoreMap sentence : document.get(CoreAnnotations.SentencesAnnotation.class)) {
System.out.println("---");
System.out.println("mentions");
for (Mention m : sentence.get(CorefCoreAnnotations.CorefMentionsAnnotation.class)) {
System.out.println("\t"+m);
}
}
}
}
///// Any Idea? THANK YOU in ADVANCE
A CorefChain contains that information.
For instance you can get a:
List<CorefChain.CorefMention>
by using this method:
cc.getMentionsInTextualOrder();
This will give you all the CorefChain.CorefMention's in the document for that particular cluster.
You can get the representative mention with this method:
cc.getRepresentativeMention();
A CorefChain.CorefMention represents a particular mention in the coref cluster. You can get information such as the full string and position from a CorefChain.CorefMention (sentence number, mention number in sentence):
for (CorefChain.CorefMention cm : cc.getMentionsInTextualOrder()) {
String textOfMention = cm.mentionSpan;
IntTuple positionOfMention = cm.position;
}
Here is the link to the javadoc for CorefChain:
http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/dcoref/CorefChain.html
Here is the link to the javadoc for CorefChain.CorefMention:
http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/dcoref/CorefChain.CorefMention.html
Is it possible to access a node of Xml using an arbitary path?
Eg: Given the xml:
<records>
<bike name='Chopper' />
<car name='HSV Maloo' make='Holden' year='2006'>
<country>Australia</country>
<record type='speed'>Production Pickup Truck with speed of 271kph</record>
</car>
<car name='P50' make='Peel' year='1962'>
<country>Isle of Man</country>
<record type='size'>Smallest Street-Legal Car at 99cm wide and 59 kg in weight</record>
</car>
</records>
How do I access the contents of the xml using an arbitrary path, provided as a string -- eg:
XmlSlurper xml = new XmlSlurper.parse(theXml)
assert xml['bike.#name'] == 'Chopper'
assert xml['car[0].country'] == 'Australia'
One method is to use the Eval.x static method to evaluate a String;
def xml = '''| <records>
| <bike name='Chopper' />
| <car name='HSV Maloo' make='Holden' year='2006'>
| <country>Australia</country>
| <record type='speed'>Production Pickup Truck with speed of 271kph</record>
| </car>
| <car name='P50' make='Peel' year='1962'>
| <country>Isle of Man</country>
| <record type='size'>Smallest Street-Legal Car at 99cm wide and 59 kg in weight</record>
| </car>
| </records>'''.stripMargin()
// Make our GPathResult
def slurper = new XmlSlurper().parseText( xml )
// Define our tests
def tests = [
[ query:'bike.#name', expected:'Chopper' ],
[ query:'car[0].country', expected:'Australia' ]
]
// For each test
tests.each { test ->
// assert that we get the expected result
assert Eval.x( slurper, "x.$test.query" ) == test.expected
}