Beautiful Soup: help with defining the search criteria - python-3.x

I am trying to scrape a website that lists terms with their English translation and explanation. With Beautiful Soup and Requests I have managed to get the tr or td entries, but could someone suggest how to refine my request to extract only the term and the explanation? I suspect it is impossible to separate the English translation from the rest, though. The 'table' on the actual website is a combination of 19 tables (http://www.nodimarinari.it/Protopage10.html). The following is one entry:
<tr>
<td valign="top" width="124">
<div align="center"><b><font color="#000066">tabella delle maree</font></b></div>
</td>
<td align="left" width="616"><font color="#000066"> (s.f.) (Maree) tide
table Tabella che riporta, per una determinata zona, l'andamento della
marea, cioe' giorno ed ora delle massime/minime e le relative variazioni
dei livelli delle acque. </font></td>
</tr>
<tr>
<td valign="top" width="124">
<div align="center"><font color="#000066"><b></b></font></div>
</td>
<td align="left" width="616"><font color="#000066"></font></td>
</tr>
Ideally I would like to get 'tabella delle maree','tide table', 'Tabella che riporta... '

Here are my results, based on my understanding of your question. Let me know if this is what you're trying to achieve.
Code:
import requests
from bs4 import BeautifulSoup
url = requests.get("http://www.nodimarinari.it/Protopage10.html")
soup = BeautifulSoup(url.text, 'html.parser')
words = soup.find_all("font", color="#000066")
num = 0
term = []
explanation = []
for word in words:
    if len(word.text) > 1:
        if num % 2 == 0:
            term.append(word.text)
        elif num % 2 == 1:
            explanation.append(str(word.text).replace("\n", ""))
        num += 1
for i in range(0, len(term)):
    print(term[i] + ": " + explanation[i] + "\n")
Output:
abbandonare: (v.) (Emergenze) to abandon L'attodi lasciare la imbarcazione pericolante da parte del capitano e dell'equipaggio,dopo aver esaurito tutti i tentativi suggeriti dall'arte nautica.
abbisciare: (v.) (Cavi e nodi) to range, tojag Preparare su uno spazio piano un cavo od una catena, ad ampie spire,in modo che il cavo o la catena possano svolgersi liberamente e scorreresenza impedimenti.
abbittare: (v.) (Cavi e nodi) to bitt Legareun cavo o una catena ad una bitta.
abbonacciare: (v.) (Vento e Mare) to becalm, tofall calm Il calmarsi del vento e del mare
abbordaggio: (s.m.) [A]. (Abbordi) collisionUrto o collisione accidentale tra imbarcazioni. La legislazione marittimaprevede un'ampio corredo di "norme per evitare l'abbordo in mare" checostituiscono la regolamentazione di base della navigazione [B]. a. intenzionaleboarding, running foul Investire intenzionalmente un'altra imbarcazionecon l'obiettivo di danneggiarla o affondarla. Anticamente si chiamavaanche "arrembaggio"
abbordare: (v.) (A bordo) to collide, to board,to run into vedi "abbordaggio".
abbordo: (s.m.) (Abbordi) collision Equivalentemoderno del termine "abbordaggio".
abbozzare: (v.) (Cavi e nodi) to stop Trattenerecon una legatura provvisoria, detta bozza, un cavo od una catena tesa,per evitarne lo scorrimento durante il tempo necessario per legarla definitivamente.
...
...
...
vogare: (v.) (Remi) to row Sinonimodi "remare".
volta: (s.f.) [A]. (Cavi enodi) bitter, wrap Indica comunemente un giro di un cavo o di una catenaattorno ad una bitta o ad un altro attrezzo atto a trattenere la cimastessa. [B]. dar v. (Manovre) to cleat, to belay, to make fast Legareuna cima od una catena attorno ad una bitta o una galloccia, con o senzanodi.
zattera di salvataggio: (s.f.) (A bordo) liferaftVedi "autogonfiabile".
zavorra: (s.f.) (Terminologia)ballast Pesi che vengono disposti a bordo dell'imbarcazione, normalmenteil piu' in basso possibile nella chiglia, per equilibrare la spinta lateraledel vento ed il conseguente movimento di rollio e di sbandata. Normalmentetali pesi sono in piombo e nei velieri moderni e' la lama di deriva stessa,costruita con metalli pesanti, ad essere zavorrata. In alcune particolarivelieri da regata la zavorra e' costituita da acqua di mare che puo' esserepompata in apposite casse (dal lato sopravvento) quando necessario.
zenit: (s.m.) (Geografia) zenitPunto della sfera celeste che si trova esattamente sulla verticale aldi sopra del luogo di osservazione.
zinco: (s.m.) (Varie) zinc,zinc plate Metallo dotato di particolari caratteristiche elettrolitichecol quale si realizzano piastre ed elementi (chiamati anodi) che vengonoutilizzati per evitare la corrosione di elementi metallici immersi adopera delle correnti galvaniche.Tali elementi vengono posizionati nellaparte immersa dello scafo, in corrispondenza di parti metalliche, in particolaredell'elica e della zavorra, e nel tempo si consumano evitando la corrosionedelle parti metalliche contigue. Per questo vengono anche chiamati zinchisacrificali.
Explanation:
I used BeautifulSoup to collect all the terms and their meanings. To do that, I took the text of every <font color="#000066"> element in the HTML.
The words list contains the data in the following format:
[<font color="#000066">A</font>, <font color="#000066"> </font>, <font color="#000066"><b></b></font>, <font color="#000066"></font>, <font color="#000066">abbandonare</font>, <font color="#000066"> (v.) (Emergenze) to abandon L'atto
di lasciare la imbarcazione pericolante da parte del capitano e dell'equipaggio,
dopo aver esaurito tutti i tentativi suggeriti dall'arte nautica. </font>, <font color="#000066"><b></b></font>, <font color="#000066"></font>, <font color="#000066">abbisciare</font>, <font color="#000066"> (v.) (Cavi e nodi) to range, to
jag Preparare su uno spazio piano un cavo od una catena, ad ampie spire, ETC.]
The condition if len(word.text) > 1: skips the single-letter section headings (e.g. 'A', 'B', ..., 'Z') and the empty elements, so that only the terms and the meanings are processed.
The next condition, if num % 2 == 0:, copies every term from the words list into a list called term, which is meant to contain only terms.
The last condition, elif num % 2 == 1:, copies every explanation from the words list into a list called explanation, which is meant to contain only explanations.
The final for loop is for printing purposes only.
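The question also asks whether the English translation can be separated from the explanation. The following is only a minimal sketch, assuming every entry follows the layout "(grammar) (category) english translation Italian explanation" and that the Italian explanation starts at the first capital letter. The name split_entry and the pattern are hypothetical, and entries that deviate from that layout (for example the ones with [A]. and [B]. sub-senses) are returned as None rather than guessed at.

```python
import re

# Hypothetical pattern: it encodes a guess about the entry layout,
# "(grammar) (category) english translation Italian explanation".
ENTRY = re.compile(
    r"^\s*\((?P<grammar>[^)]*)\)\s*"   # e.g. (s.f.)
    r"\((?P<category>[^)]*)\)\s*"      # e.g. (Maree)
    r"(?P<english>[^A-Z]*?)\s+"        # run without capitals = English translation
    r"(?P<italian>[A-Z].*)$"           # first capital letter starts the explanation
)


def split_entry(text):
    """Return (english, italian), or None if the entry does not fit the guess."""
    normalized = " ".join(text.split())  # collapse newlines and repeated spaces
    match = ENTRY.match(normalized)
    if match is None:
        return None
    return match.group("english"), match.group("italian")


sample = """ (s.f.) (Maree) tide
table Tabella che riporta, per una determinata zona, l'andamento della
marea. """
print(split_entry(sample))
```

Collapsing whitespace before matching also avoids the run-together words visible in the output above, which come from replacing "\n" with an empty string.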

Related

Cannot import SongLyrics from lyrics_extractor

I wrote this code:
from lyrics_extractor import SongLyrics
apiKey = 'AIzaSyBDPVEi1OtzB3Nm6i9fd8HTkMCjsselIpM'
engineID = '35df92fbe0cad839c'
extract_lyrics = SongLyrics(apikey, engineID)
lyrics = extract_lyrics.get_lyrics("Reyes de la noche")
print(lyrics)
It is not working, I obtained this import error:
ImportError: cannot import name 'Songlyrics' from 'lyrics_extractor' (C:\Users\mica\AppData\Local\Programs\Python\Python39\lib\site-packages\lyrics_extractor\__init__.py)
What could be wrong?
I found two problems:
1. You need to install the library with pip:
   pip install lyrics-extractor
2. apiKey has the wrong capitalization in line 4.
A further suggestion: the returned value is a dict, which looks nicer when printed like this:
from lyrics_extractor import SongLyrics
apiKey = 'AIzaSyBDPVEi1OtzB3Nm6i9fd8HTkMCjsselIpM'
engineID = '35df92fbe0cad839c'
extract_lyrics = SongLyrics(apiKey, engineID)
lyrics = extract_lyrics.get_lyrics("Reyes de la noche")
print(lyrics['title'])
print(lyrics['lyrics'])
Output:
Guasones – Reyes De La Noche Lyrics
Fuimos mucho mas que nada
Fuimos la mentira
Fuimos lo peor
Fuimos los soldados a la madrugada
Con esta ambición
Y ahora estoy en libertad
Y ahora que puedo pensar
En no volver hacer ese
El mismo de antes
Y que tristeza hay en la ciudad, amor
Sábado soleado
Y en el centro de la estatua del dolor
Me sentí parado
Fuimos muchos más que todos
Reyes de la noche
De esta tempestad
Si te vendí, si te robe, te traicione
Fui por uno más
Fuimos perros de la noche
Oxidados en tristeza
Y querer lo que querer
Sin tener que lastimar
Recordando que tu amor
Se robo la dignidad
Ahora olvidemos los dos
No volvamos a empezar
¿Para que?...
PS: I was manually writing out lyrics this morning for a song called "Brothers and Sisters" by Les Nubians. When I tried to use this library to get the lyrics for that song, like this:
extract_lyrics.get_lyrics("Brothers and Sisters")
... I got lyrics for a more popular song of the same title by another musical group. However, after adding the name of the group I wanted, I got the correct song's lyrics:
extract_lyrics.get_lyrics("Brothers and Sisters Nubians")
Installation
pip install lyrics-extractor
For more details go to this

can't get the right text when webscraping a webpage (with Python 3)

This is an example of what I get for one row when I execute my Python code:
<a href "https://www.dekrantenkoppen.be/detail/1520347/Vulkaanuitbarsting-en-aardbevingen-dichtbij-de-hoofdstad-van-IJsland.html" rel="dofollow" title="In het zuidwesten van IJsland is de vulkaan Fagradalsfjall uitgebarsten. Dat heeft de meteorologische dienst van het land laten weten. De uitbarsting is vooralsnog beperkt, maar er zijn wel twee grote lavastromen. De voorbije weken was IJsland al opgeschrikt door tienduizenden aardbevingen. In de loop van de dag is de kracht van de uitbarsting afgenomen.">Vulkaanuitbarsting en aardbevingen dichtbij de hoofdstad van IJsland
When I execute my code, I currently get the text 'Vulkaanuitbarsting en aardbevingen dichtbij de hoofdstad van IJsland'
What I really want is the part after 'title=', i.e. this text: 'In het zuidwesten van IJsland is de vulkaan Fagradalsfjall uitgebarsten. Dat heeft de meteorologische dienst van het land laten weten. De uitbarsting is vooralsnog beperkt, maar er zijn wel twee grote lavastromen. De voorbije weken was IJsland al opgeschrikt door tienduizenden aardbevingen. In de loop van de dag is de kracht van de uitbarsting afgenomen.'
I'm new to this, and I find it very difficult to understand. Can someone point me in the right direction?
This is my code at the moment:
import requests
from bs4 import BeautifulSoup
url = 'https://www.dekrantenkoppen.be/full/de_redactie'
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
#print(soup)
content=''
rows = soup.find_all('a')
tel=0
for row in rows:
    if tel != 0:
        #'tel' is only used to skip the first returned result. The first resultline is something that i don't need. I only need all the text after every 'title ='
        print(row)
        extract=row.get_text()
        print('')
        print(extract)
        content=content+extract+'\n'
    if tel == 10:
        #a loop of max 10 times gives me enough information, i only need the first 10 articles
        break
    else:
        tel=tel+1
#show me the result-text of the crawling
print('')
print('RESULT TEXT OF THE FIRST 10 ARTICLES:')
print(content)
You want the title attribute. Based on your description, the following use of nth-child will get you the first 10 descriptions. I also needed the 'lxml' parser; run pip3 install lxml if it is not installed.
import requests
from bs4 import BeautifulSoup
url = 'https://www.dekrantenkoppen.be/full/de_redactie'
response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")
print([i['title'] for i in soup.select('li:nth-child(-n+10) > a:nth-child(2)')])
CSS:
li:nth-child(-n+10) > a:nth-child(2)
This selects the li elements that are among the first 10 children of their parent and, within each of them, the a tag that is the li's second child. The > is a child combinator specifying that what is on the right must be a direct child of what is on the left.
Read about:
:nth-child ranges
Child combinators, pseudo class selectors and type selectors
Following an additional request, a for loop that also prints the published date/time info:
import requests
from bs4 import BeautifulSoup
url = 'https://www.dekrantenkoppen.be/full/de_redactie'
response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")
for i in soup.select('li:nth-child(-n+10) > a:nth-child(2)'):
    print(i.parent['title'])
    print(i.parent.a.text.strip())
    print(i['title'])
    print()
If I understand your question correctly, you should try to modify your code as follows:
...
for row in rows:
    if tel != 0:
        print(row)
        extract=row["title"]
        content=content+extract.replace("Meer informatie", "")+'\n'
...
From a quick look at what you're doing: simply replace get_text() with .get('title'); that should work:
>>> data = '''<a href "https://www.dekrantenkoppen.be/detail/1520347/Vulkaanuitbarsting-en-aardbevingen-dichtbij-de-hoofdstad-van-IJsland.html" rel="dofollow" title="In het zuidwesten van IJsland is de vulkaan Fagradalsfjall uitgebarsten. Dat heeft de meteorologische dienst van het land laten weten. De uitbarsting is vooralsnog beperkt, maar er zijn wel twee grote lavastromen. De voorbije weken was IJsland al opgeschrikt door tienduizenden aardbevingen. In de loop van de dag is de kracht van de uitbarsting afgenomen.">Vulkaanuitbarsting en aardbevingen dichtbij de hoofdstad van IJsland'''
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(data)
>>> soup.find('a').get('title')
'In het zuidwesten van IJsland is de vulkaan Fagradalsfjall uitgebarsten. Dat heeft de meteorologische dienst van het land laten weten. De uitbarsting is vooralsnog beperkt, maar er zijn wel twee grote lavastromen. De voorbije weken was IJsland al opgeschrikt door tienduizenden aardbevingen. In de loop van de dag is de kracht van de uitbarsting afgenomen.'
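The answers above all rest on the same point: the description is stored in the a tag's title attribute, not in its text. The sketch below illustrates that distinction with the standard library's html.parser rather than Beautiful Soup, so it runs without extra packages. TitleCollector and the sample markup are made up for illustration and are not taken from the real page.

```python
from html.parser import HTMLParser


class TitleCollector(HTMLParser):
    """Collect the title attribute of every <a> tag and, separately, its text."""

    def __init__(self):
        super().__init__()
        self.titles = []      # attribute values: the long descriptions
        self.link_texts = []  # text between <a> and </a>: the headlines
        self._inside_a = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._inside_a = True
            title = dict(attrs).get("title")
            if title is not None:
                self.titles.append(title)

    def handle_endtag(self, tag):
        if tag == "a":
            self._inside_a = False

    def handle_data(self, data):
        if self._inside_a and data.strip():
            self.link_texts.append(data.strip())


# Made-up markup in the same shape as the row shown in the question.
sample = (
    '<li><a href="/detail/1.html" rel="dofollow" '
    'title="Long description of the first article.">Short headline one</a></li>'
    '<li><a href="/detail/2.html" rel="dofollow" '
    'title="Long description of the second article.">Short headline two</a></li>'
)

collector = TitleCollector()
collector.feed(sample)
print(collector.titles)
print(collector.link_texts)
```

With Beautiful Soup the same split is row.get('title') versus row.get_text(), as the answers show.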

ERROR when Parsing JSON Array as a JSON Object

I'm trying to convert news into json in EJS but I'm getting this error:
JSON.parse: bad control character in string literal at line 1 column 491 of the JSON data
My code in EJS:
var news = JSON.parse('<%- JSON.stringify(news) %>');
My JSON when I just use JSON.stringify(news):
[{"link":"https://www.google.com/appserve/mkt/p/AL8lKjMTbG3Cwo5S_mSyJIgnnTSnG5CShXJluGQmCXZQem2HZJYO3Xni6oi7YkO1bf0AiPEzctP9t7p4iXohF_oSmh4lekT6N4_Xi_3FsOwFL7v0aGYESdS7Hwy0ljNy94xgSIa-p44jadT8K72ncdZLathEWT9VLdgFzKPyIbWNdfzXjna8ZZ-l0MnFGvP7ctxpUmlnS0bdioEzr6Vea3RXuJ9ql7G66mK3Mmk8b88uTsEU0DgQibiPnhj57-Xddg","title":"PT anuncia candidatura de Fernando Haddad à Presidência no ...","snippet":"11 set. 2018 ... O prazo dado pelo TSE para o partido apresentar à Justiça Eleitoral o substituto
de Lula terminava às 19h desta terça. Na chapa original ..."},{"link":"https://www.google.com/appserve/mkt/p/AL8lKjOk4AvSk26yeM_CLTwL-pbhemUMarCpKtz2cjmDF5u-c1MKiPSNST_LKzGUhr5gIxpym10fbAyKebM_k5ztcDT4ccJLm7CiDn51AF8esXzqtzkfStIpvDMLDSzBEiJsY_vZ-n16Bq1rVLol6NPlA84ZHjMcdBFsRqyvAn1uWquy2kWhGINO24sm7grPZssXqNuqn3kE2VQWxUZrqU76ZfKon7_2sP_JnVoX616a7NqZ87w","title":"Pesquisa Ibope: Lula, 37%; Bolsonaro, 18%; Marina, 6%; Ciro, 5%","snippet":"20 ago. 2018 ... Em razão desse quadro jurídico, o Ibope pesquisou outro cenário, com o atual
candidato a vice na chapa de Lula, Fernando Haddad."},{"link":"https://www.google.com/appserve/mkt/p/APDk4sNLn4v5wG7jZc1IMxmNqeoSmmu3CoLrgwpbMBTtGaGIPP6qNZCYoZJE-h7JvLCpES_LHdTgnGxzrSzi7qXlTfpcw_NWkH9lhvjaWJZmZCbHZM5jPivEEvXVa-370EEkKothZUYVdHjcJ_7Gd0FtSViBkVPdM1eplOHC2c-cxE5l-6iUJv-eIvsaBoTNH8m-N7KeL55C2RHIhWox4t_O_J6TePdYShasXr2p9CljniJyBnJg2gJqpDnpEg","title":"PT registra candidatura de Lula a presidente com ato em frente ao ...","snippet":"15 ago. 2018 ... Presidente do PT, Gleisi Hoffmann, entrega registro da candidatura de Lula a
servidor do TSE — Foto: Nelson Jr./ASCOM/TSE ..."},{"link":"https://www.google.com/appserve/mkt/p/AL8lKjPgDq40q5PA8lJ4zEjCmTIcjEId9kKJi_u79_HEROFIYaeDAza3mH3HEG5yW1W_0o0zUQTvFVUd9gRF64I5kZ9WUqwlrXBv3N0-ficFsPjhfdxo4H3CqkPvbjICafkJlZM2LnCTC7sKH_Wu7emMCZEki0XlYNuHyOOp114XJK5GbvR33QyBK3e9sVl9hoOpXsP44bNFtPy2ntiTo5ew58YJgSaB59sg1xrtW7PdaGKbcHDsg6bRyw","title":"Fachin nega pedido da defesa de Lula para suspender inelegibilidade","snippet":"6 set. 2018 ... Fachin nega pedido para suspender condenação de Lula no caso do ... o pedido
da defesa do ex-presidente Luiz Inácio Lula da Silva para ..."},{"link":"https://www.google.com/appserve/mkt/p/APDk4sNLSBs3Q0r_tjyeUzsiMNvPILeN0DuMwwFLp7LWs11DJltsjOoj6lALGoBQ_HdCP5Pxa8rTJIe6SmXbEyi5iZ6CFNdrHxF0dwF2pf3J2UX0uIREc_MoNTGwEj53rHgTA21KV0gXTwNC4U-SPsRqFhI322S4AjnWeRr1h1ThAl1gJEEDEMSWm5-JEWPkSH4zrzjmbltEp_mKtMqG7PTLQdat0d_wIowh1wGQ3oMDrSM95XpVBOzvm2yg_t4wd5lh","title":"Entenda o que significa a impugnação da candidatura de Lula feita ...","snippet":"16 ago. 2018 ... As ações de impugnação ainda não impedem Lula de concorrer. O ministro Luís
Roberto Barroso foi sorteado como relator do registro de ..."},{"link":"https://www.google.com/appserve/mkt/p/AL8lKjMNo82Rv1wcvOpdbrQX7C7zD2sW7lFk0EBtJZZNs8rx0-tq2bYfyB_BsCGjt9fNa82lkSBH8Y4P_Of_uj9ryfcFxjIt_pbf49sLtNeAR529DWuje9j8Ps8-Z3qH2Wrh6N_Sb8P9eS88o26QKNLi94PDrPvFg7Q5Hv_i02L95dF_ZOV_avLu9sE0zzaX2l6hCQ1t-lXxegaf","title":"Comitê da ONU reafirma direito de Lula ser candidato, diz defesa ...","snippet":"10 set. 2018 ... Zanin afirmou que a defesa de Lula informará ao STF a nova decisão. Uma nova
decisão do Comitê de Direitos Humanos da ONU reafirmou ..."},{"link":"https://www.google.com/appserve/mkt/p/AFOm0uEggpSZolqYBXndC11CTFh4Bx4ojWZe-LQ4LpX3DXG401x0u8rk7AS41mhCRQqX6Ze7qXrfe0blgoqgeKkBvl2Roik39be5d8XX2HlZpL4bYGlC18o9U650kXjAkUn_vWc6YJRH5bUf1Hw15Zz4IlOE5idUbGmd0rnPl43DKMDReRgIHGc72dtcdXrz-ihJualpVoNpmfdaEQ","title":"Datafolha: sem Lula, Bolsonaro lidera e disputa fica acirrada ...","snippet":"31 jan. 2018 ... RIO — Pesquisa Datafolha divulgada nesta quarta-feira mostra que o ex-
presidente Luiz Inácio Lula da Silva , mesmo condenado pelo ..."},{"link":"https://www.google.com/appserve/mkt/p/AFIPhzVJ3RSCVztRel4V7tABx7g_RC0G4uE43Wb1CdAxSf810y0LN7XN0iU1Ub5cj7-anDSrgzKJX9k1sx0tGFtzZhnTSyvCbAjhgqEHA-rrGWKMVAaHUXewZYj5IBY","title":"Após tiros e sob tensão, Lula encerra caravana em Curitiba horas ...","snippet":"28 mar. 2018 ... Poucas horas depois de a caravana do ex-presidente Luiz Inácio Lula da Silva
chegar a Curitiba, ainda sob abalo pelo ataque que atingiu ..."},{"link":"https://www.google.com/appserve/mkt/p/AJ-PF7wwGAYu07EXXfFG9u-ipk-azIf0t1m96KDvmxh5uuiAjuDg9QFlysO2nHCs9x0IRYCtpd_CphjEsDCZGMtxpcpzgeQdP-rJumaQVCIx6a2YCXteM6qFB6cerc5JSvbVfp5qTEwyDuB_R8gR9ZOK44cZxY2VIjPIAS4qsX-Wb2y0nSVXTtuM4BOZdPthrzxE8fXHoRc_zmYheufyS3FGXg","title":"Presidenciáveis e políticos comentam decisão de desembargador ...","snippet":"8 jul. 2018 ... O STF já havia negado HC para soltar Lula. Ele será solto neste domingo. A
decisão foi monocrática, de um desembargador que foi filiado ao ..."},{"link":"https://www.google.com/appserve/mkt/p/AIQrb_7rn0-8BzJy4lp5vhkK_EOS2rsg2pNbSKHMcSAv7Zxb_T8GoBAYLmUZidgqaFo8Q0sgjTMTLhJe8ouJ0iutSoUi_6XA04aBJe0-d6df3YTmby4CBj7BfXDOYO4mHhO7iFv9i2pOooL4TWu-A5KQ7KqAk-GQP72jmClTQuBmZ5Frl6rlGiaW1A","title":"STJ nega habeas corpus preventivo por unanimidade e decide que ...","snippet":"6 mar. 2018 ... Na mesma decisão, os cinco ministros da Quinta Turma do STJ negaram um
pedido extra da defesa para suspender a inelegibilidade de Lula ..."}]
EDIT: I have found out the error is the "snippet" tag, but I need it.
I think that JSON.parse() is going to have a tough time with what you are feeding it as it just sees a simple string, not evaluated EJS or even valid JSON.
Perhaps something more along the lines of:
var json = <your-JSON-here>;
var news = ejs.render('<%- JSON.stringify(news) %>', {news: json});
Then if you want, you can always pull it back in as an object with JSON.parse():
news = JSON.parse(news);
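One possible reading of the error, offered as an assumption rather than a confirmed diagnosis: when already-stringified JSON is placed between quotes in source code, the string literal interprets the \n escape once more, so the parser receives a raw line break inside a string and rejects it. The sketch below reproduces that effect with Python's json module as a stand-in for JSON.parse; the news value is made up, and try_parse is a hypothetical helper.

```python
import json

# Made-up stand-in for the question's news array; the snippet holds a line break.
news = [{"title": "Example", "snippet": "first line\nsecond line"}]

# In valid JSON text the line break is written as the two characters \ and n.
encoded = json.dumps(news)

# Simulate pasting that text between quotes in source code, where the string
# literal turns the two-character escape back into a raw line break.
pasted = encoded.replace("\\n", "\n")


def try_parse(text):
    """Return the parsed value, or the parser's error message as a string."""
    try:
        return json.loads(text)
    except json.JSONDecodeError as exc:
        return f"error: {exc.msg}"


print(try_parse(encoded))  # parses back to the original list
print(try_parse(pasted))   # rejected: raw control character inside a string
```

If this is indeed the cause, it would suggest emitting the stringified value directly as a JavaScript expression in the EJS template, without the surrounding quotes and the JSON.parse call, though that is untested here.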

use of Scrapy's regular expressions

I'm practicing scraping newspaper articles with Scrapy. I have some problems extracting substrings of text from web pages. With the built-in re and re_first functions I can set where the search starts, but I can't find how to set where it ends.
Here follows the code:
import scrapy
from spider.items import Articles
from scrapy.selector import Selector
from scrapy.http import HtmlResponse

class QuotesSpider(scrapy.Spider):
    name = "lastampa"
    allowed_domains = ['lastampa.it']

    def start_requests(self):
        urls = [
            'http://www.lastampa.it/2017/10/26/economia/lavoro/datalogic-cerca-laureati-e-laureandi-per-la-ricerca-e-sviluppo-rPsS8gVM5ZX7gEZcugklwJ/pagina.html'
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        items = []
        item = Articles()
        item['date'] = response.xpath('//div[contains(@class, "ls-articoloDataPubblicazione")]').re_first(r'content=\s*(.*)')
        item['author'] = response.xpath('//div[contains(@class, "ls-articoloAutore")]').re_first(r'">\s*(.*)')
        item['title'] = response.xpath('//div[contains(@class, "ls-articoloTitolo")]').re_first(r'<h3>\s*(.*)')
        item['subtitle'] = response.xpath('//div[contains(@class, "ls-articoloCatenaccio")]').re_first(r'">\s*(.*)')
        item['text'] = response.xpath('//div[contains(@class, "ls-articoloTesto")]').re_first(r'<p>\s*(.*)')
        items.append(item)
With this code I can get the text I need, but I also get all the tags that follow it, up to the end of the matched element.
e.g.
'subtitle': 'Gli inserimenti saranno in Italia, Stati Uniti, Cina, Vietnam</div>'
How can I exclude the trailing </div> (or any other characters after a defined point)?
Can someone shed some light on this? Thanks.
You don't have to mess with regular expressions like this if you only need to extract the text from elements. Just use text() in your XPath expressions. For example, for the subtitle field, extract it using:
item['subtitle'] = response.xpath('//div[contains(@class, "ls-articoloCatenaccio")]/text()').extract_first().strip()
which, in your example, will produce:
u'Gli inserimenti saranno in Italia, Stati Uniti, Cina, Vietnam'
Article text is a bit more complicated, as it's spread over child elements. However, you can extract it from there, post-process it and store as a single string like this:
item['text'] = ''.join([x.strip() for x in response.xpath('//div[contains(@class, "ls-articoloTesto")]//text()').extract()])
which will produce:
u'Datalogic, leader mondiale nelle tecnologie di identificazione automatica dei dati e dei processi di automazione industriale, ha lanciato la selezione di giovani talenti laureati o laureandi in ingegneria e/o in altre materie scientifiche che abbiano voglia di confrontarsi con una realt\xe0 internazionale. Il primo step delle selezioni si \xe8 tenuto in occasione dell\u2019Open Day: i candidati hanno sostenuto un primo colloquio con senior manager nazionali e internazionali, arrivati per l\u2019occasione anche da sedi estere. Durante l\u2019Open Day \xe8 stata presentata l\u2019azienda insieme alle 80 posizioni aperte di cui una cinquantina in Italia. La ricerca dei candidati \xe8 rivolta a laureandi e neolaureati di tutta Italia interessati a far parte di una realt\xe0 dinamica e internazionale, provenienti da Facolt\xe0 tecnico-scientifiche quali Ingegneria, Fisica, Informatica.Ai giovani verr\xe0 proposto un interessante percorso di carriera che permetter\xe0 di approfondire e sviluppare tecnologie all\u2019avanguardia nei 10 centri di Ricerca dell\u2019azienda dislocati tra Italia, Stati Uniti, Cina e Vietnam, dove si ha l\u2019opportunit\xe0 di approfondire tutte le tematiche relative all\u2019elettronica e al software design: dallo sviluppo di sistemi operativi mobili, di sensori laser e tecnologie ottiche, allo studio di sistemi di intelligenza artificiale. Informazioni sul sito www.datalogic.com.Alcuni diritti riservati.'
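The idea in this answer, reading text nodes instead of running a regex over serialized markup, can be tried without Scrapy. The sketch below swaps in the standard library's xml.etree.ElementTree for Scrapy's selectors, so it only works on well-formed markup and supports a much smaller XPath subset. text_of is a hypothetical helper, and the article paragraphs in the sample are invented.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed stand-in for the article page.
markup = """
<html><body>
<div class="ls-articoloCatenaccio">Gli inserimenti saranno in Italia, Stati Uniti, Cina, Vietnam</div>
<div class="ls-articoloTesto"><p>First paragraph.</p><p>Second <b>bold</b> paragraph.</p></div>
</body></html>
""".strip()

root = ET.fromstring(markup)


def text_of(tree, class_name):
    """Join every text node under the first div with exactly this class value."""
    node = tree.find(f".//div[@class='{class_name}']")
    if node is None:
        return None
    # itertext() walks the element and all of its descendants, text only.
    parts = (part.strip() for part in node.itertext())
    return " ".join(part for part in parts if part)


print(text_of(root, "ls-articoloCatenaccio"))
print(text_of(root, "ls-articoloTesto"))
```

Joining with a space rather than an empty string keeps sentences from running together where one paragraph ends and the next begins.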

Accentuation in PHP, showing Latin symbols

I'm writing a program to parse HTML data from a Portuguese website. The problem is that when I echo the data I read, I get these weird symbols:
Meu PC estragou e tenho um netbook que u
so para assuntos acadΩmicos. Que raiva, nπo roda nem CS aqui. Aff que raiva! Com
o pode? Jß coloquei todas as mais baixar configuraτ⌡es e nao roda!
the original text is:
Meu PC estragou e tenho um netbook que u
so para assuntos acadêmicos. Que raiva, não roda nem CS aqui. Aff que raiva! Com
o pode? Já coloquei todas as mais baixar configurações e nao roda!
Notice the accented characters:
acadêmicos -> acadΩmicos
Já -> Jß
How do I fix this? I already tried:
echo utf8_decode($assunto);
but it didn't work. Help!
Set the encoding in the header.
header( 'Content-Type: text/html; charset=UTF-8' );
http://php.net/manual/en/function.header.php
Then, to make sure the browser uses UTF-8, add a meta tag in the HTML head:
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
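As a diagnostic aid, the substitutions listed in the question (ê shown as Ω, á shown as ß) are consistent with Latin-1 bytes being displayed by something that uses code page 437, such as a Windows console. That is an assumption, since the question does not say where the output is viewed. The sketch below reproduces the symptom in Python; both function names are hypothetical.

```python
def as_seen_on_cp437(text):
    """Encode text as Latin-1 bytes, then misread those bytes as code page 437."""
    return text.encode("latin-1").decode("cp437")


def repair_from_cp437(garbled):
    """Undo the misreading: back to the raw bytes, then decode them as Latin-1."""
    return garbled.encode("cp437").decode("latin-1")


# Words taken from the question's original text.
for word in ["acadêmicos", "Já", "não", "configurações"]:
    garbled = as_seen_on_cp437(word)
    print(word, "->", garbled, "->", repair_from_cp437(garbled))
```

If the output is being read in such a console, the bytes themselves may already be correct and only the display encoding is off, which would be worth checking before changing the PHP code further.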
