Difficulty reading a table on a web page - rvest

I am trying to read a table from
https://www.cdc.gov/nchs/pressroom/sosmap/cancer_mortality/cancer.htm.
I tried two ways, reading the CSV file directly and using the R package "rvest", but neither worked.
The following is my code:
url = "https://www.cdc.gov/nchs/pressroom/sosmap/cancer_mortality/cancer.htm"
## The csv file link
f = "blob:https://www.cdc.gov/32faf737-1355-4e1a-8b4f-a970820a290b"
# Way #1
rates1 = read.csv(file = f, header=TRUE, sep=",")
rates1
# Way #2
library("rvest")
rates2 = url %>% read_html() %>%
    html_nodes(xpath = '//*[@id="root"]/section/section[2]/div[2]/div[1]/div[2]') %>%
    html_table(fill = TRUE) %>%
    .[[1]]
rates2
Error messages:
cannot open file 'blob:https://www.cdc.gov/32faf737-1355-4e1a-8b4f-a970820a290b': No such file or directory
Error in file(file, "rt") : cannot open the connection
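On Way #1: a `blob:` URL is not a downloadable address. It is an object URL that the page's JavaScript creates inside your browser session from data it already holds, so nothing outside that session (including `read.csv`) can open it; you would instead need the underlying network request (browser DevTools, Network tab) that delivers the data to the page. Way #2 likely fails for a related reason: the table is rendered client-side by JavaScript, so the static HTML that `read_html()` downloads does not contain it. The scheme point is easy to check (a Python sketch, since it is language-independent):

```python
from urllib.parse import urlparse

# A blob: URL only names an in-browser object; its scheme is not http/https,
# so no external HTTP client can fetch it.
u = urlparse("blob:https://www.cdc.gov/32faf737-1355-4e1a-8b4f-a970820a290b")
print(u.scheme)  # 'blob'
```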

Related

How to get XML from xml.p7m in Python

Short context/goal:
We receive these xml.p7m files from our Italian creditors via email and have to process them so that we can pay them.
The goal is to automate this process, so people don't have to manually validate the data.
If we don't get all the data we need to automatically process the invoice, we create a PDF so people can validate the data.
The problem:
(I can't attach a full xml.p7m because of all the characters in there that shouldn't be there, so I attached a picture instead, but the relevant tag can be copied and pasted.)
I don't speak Italian, so it's hard for me to find the information I need to continue with this task.
Since we can't parse the .p7m files with an XML parser (they are not well formed, obviously), I have to remove all the unneeded bytes so we can parse the XML part. This works for ~450 of 532 mails so far. With "manually replacing broken tags" (see below) we reach 100% for those, but adding exceptions every now and then isn't the correct way to handle this.
The relevant code:
# Some p7m have some things like ♦♥ in it, they need to be removed (and while we're at it, remove the signature so it's well formed)
# Special Characters which translate to --> \x04\xc2\x82\x03\xc3\xa8
# <Nazione>IT</Nazione>♦♥è
# <Descrizione>nr ordine 9♦♥è303067091</Descrizione>
# <NumeroLinea>6<\Numero♦♥èLinea>
# <Quant♦♥èita>0.00</Quantita>
# </Anagraf♦♥èica>
import base64
import re
import xml.etree.ElementTree as ET

def getXmlTextRemovedSignature(path, filePath, m_iv_id, mailSubject, amountOfAttachments, logger):
    txt = ""
    parseError = 0
    try:
        with open(filePath, encoding='latin-1') as f:
            try:
                txt = f.read()
            except Exception as e:
                logger.errorLog("Couldn't read file text: {}".format(e))
        # no opening tag to find --> no xml --> decode the file, save it, and get the text
        if not re.findall('<', txt):
            image_64_decode = base64.decodebytes(txt.encode())
            image_result = open(path + 'decoded.xml', 'wb')  # create a writable file and write the decoding result
            image_result.write(image_64_decode)
            image_result.close()
            txt = open(path + 'decoded.xml', encoding='latin-1').read()
        try:
            txt = re.sub(r'[^A-Za-z0-9<>\'\+\-\,\?\"\.\=\/\:\r\n\t\!\[\] {1,}]', '', txt)
            if re.findall(r'<\?xml', txt):
                txt = '<?xml' + re.split(r'<\?xml', txt, 1)[1]
            else:
                tag = re.findall(r'<.{0,5}FatturaElettronica', txt)[0]
                txt = tag + re.split(tag, txt, 1)[1]
            if re.findall(r'<\/.{0,5}FatturaElettronica>', txt):
                tag = re.findall(r'<\/.{0,5}FatturaElettronica>', txt)[0]
                txt = re.split(tag, txt, 1)[0] + tag
        except Exception as e:
            logger.errorLog("Error while removing the signature, m_iv_id = {}, mailsubject: {}, error: {}".format(m_iv_id, mailSubject, e))
        # Things to replace
        # txt = txt.replace('PDatiRiepilogo', 'DatiRiepilogo')
        # txt = txt.replace('DatiRiepiDlogo', 'DatiRiepilogo')
        # txt = txt.replace('DatiRiepiloBgo', 'DatiRiepilogo')
        # txt = txt.replace('j/DatiRiepilogo', '/DatiRiepilogo')
        # txt = txt.replace('AliquotIaIVA', 'AliquotaIVA')
        # txt = txt.replace('AliquotaIVA', 'AliquotaIVA')
        # txt = txt.replace('AliquoJtaIVA', 'AliquotaIVA')
        # txt = txt.replace('AliquotaHIVA', 'AliquotaIVA')
        # txt = txt.replace('AliquotaGIVA', 'AliquotaIVA')
        # txt = txt.replace('cAliquotaIVA', 'AliquotaIVA')
        # txt = txt.replace('UnitaMiEsura', 'UnitaMisura')
        # txt = txt.replace('UnitaMisuraE', 'UnitaMisura')
        # txt = txt.replace('De.ttaglioLinee', 'DettaglioLinee')
        # txt = txt.replace('DettaglioLin-ee', 'DettaglioLinee')
        # txt = txt.replace('Dett?aglioLinee', 'DettaglioLinee')
        # txt = txt.replace('DePttaglioLinee', 'DettaglioLinee')
        # txt = txt.replace('DettaglioLiEnee', 'DettaglioLinee')
        # txt = txt.replace('DettaglioLinee+', 'DettaglioLinee')
        # txt = txt.replace('Dettaglio0Linee', 'DettaglioLinee')
        # txt = txt.replace('ImponibileImportoS', 'ImponibileImporto')
        # txt = txt.replace('SNumeroLinea', 'NumeroLinea')
        # txt = txt.replace('FatturaElettronic!aBody', 'FatturaElettronicaBody')
        # txt = txt.replace('/FatturaEle ttronica', '/FatturaElettronica')
        # txt = txt.replace('SpeseAccessorXie', 'SpeseAccessorie')
        # txt = txt.replace('Quantit]a', 'Quantita')
        # txt = txt.replace('DatiPaga4mento]a', 'DatiPagamento')
        # txt = txt.replace('DatiPaga4mento', 'DatiPagamento')
        # txt = txt.replace('IXmposta', 'Imposta')
        # txt = txt.replace('DataScfadenzaPagamento', 'DataScadenzaPagamento')
        # txt = txt.replace('DataScadenzaPagame/nto', 'DataScadenzaPagamento')
        # txt = txt.replace('Dat=aScadenzaPagamento', 'DataScadenzaPagamento')
        # txt = txt.replace('RiferimLentoTesto', 'RiferimentoTesto')
        # txt = txt.replace('CodiceTipoN', 'CodiceTipo')
        # txt = txt.replace('D[ettaglioPagamento', 'DettaglioPagamento')
        # txt = txt.replace('DatiBeniServizi', 'DatiBeniServizi')
        # txt = txt.replace('AttachmeBnt', 'Attachment')
        # txt = txt.replace('ModalitaPagamhento', 'ModalitaPagamento')
        # txt = txt.replace('ModalXitaPagamento', 'ModalitaPagamento')
        try:
            ET.fromstring(txt)
        except Exception as e:
            print(parseError)
            parseError = 3
            logger.errorLog("Couldn't parse xml, m_iv_id = {}, mailsubject: {}, error: {}".format(m_iv_id, mailSubject, e))
        if txt:
            amountOfAttachments[0] += 1
        return {"xml": txt, "parseError": parseError}
    except Exception as e:
        logger.errorLog("Error while removing the signature, m_iv_id = {}, mailsubject: {}, path: {}, error: {}".format(m_iv_id, mailSubject, path, e))
p7m example:
0� *�H��
��0�10
`�He
<?xml version="1.0" encoding="utf-8"?>
<p:FatturaElettronica xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:ds="http://www.w3.org/2000/09/xmldsig#" versione="FPR12" xmlns:p="http://ivaservizi.agenziaentrate.gov.it/docs/xsd/fatture/v1.2">
<FatturaElettronicaHeader>
<DatiTrasmissione>
<IdTrasmittente>
<IdPaese>IT</IdPaese>
<IdCodice>02002750483</IdCodice>
</IdTrasmittente>
<ProgressivoInvio>000YP</ProgressivoInvio>
<FormatoTrasmissione>FPR12</FormatoTrasmissione>
<CodiceDestinatario>0000000</CodiceDestinatario>
<PECDestinatario>CENSORED</PECDestinatario>
</DatiTrasmissione>
<CedentePrestatore>
<DatiAnagrafici>
<IdFiscaleIVA>
<IdPaese>IT</IdPaese>
<IdCodice>CENSORED</IdCodice>
</IdFiscaleIVA>
<Anagrafica>
<Denominazione>CENSORED</Denominazione>
</Anagrafica>
<RegimeFiscale>RF01��</RegimeFiscale>
</DatiAnagrafici>
<Sede>
<Indirizzo>CENSORED</Indirizzo>
<CAP>70128</CAP>
<Comune>Bari</Comune>
<Provincia>BA</Provincia>
<Nazione>IT</Nazione>
</Sede>
<IscrizioneREA>
<Ufficio>BA</Ufficio>
<NumeroREA>482324</NumeroREA>
<CapitaleSociale>300000.00</CapitaleSociale>
<SocioUnico>SM</SocioUnico>
<StatoLiquidazione>LN</StatoLiquidazione>
</IscrizioneREA>
</CedentePrestatore>
<CessionarioCommittente>
<DatiAnagrafici>
<IdFiscaleIVA>
<IdPaese>IT</IdPaese>
<IdCodice>CENSORED</IdCodice>
</IdFiscaleIVA>
<Anagrafica>
<Denominazione>CENSORED </Denominazione>
</Anagrafica>
</DatiAnagrafici>
<Sede>
<Indirizzo>CENSORED</Indirizzo>
<CAP>39100</CAP>
<Comune>Bolzano</Comune>
<Pro��vincia>BZ</Provincia>
<Nazione>IT</Nazione>
</Sede>
</CessionarioCommittente>
<TerzoIntermediarioOSoggettoEmittente>
<DatiAnagrafici>
<IdFiscaleIVA>
<IdPaese>IT</IdPaese>
<IdCodice>CENSORED</IdCodice>
</IdFiscaleIVA>
<CodiceFiscale>CENSORED</CodiceFiscale>
<Anagrafica>
<Denominazione>CENSORED</Denominazione>
</Anagrafica>
</DatiAnagrafici>
</TerzoIntermediarioOSoggettoEmittente>
<SoggettoEmittente>TZ</SoggettoEmittente>
</FatturaElettronicaHeader>
<FatturaElettronicaBody>
<DatiGenerali>
<DatiGeneraliDocumento>
<TipoDocumento>TD01</TipoDocumento>
<Divisa>EUR</Divisa>
<Data>2021-09-07</Data>
<Numero>288/2021</Numero>
<ImportoTotaleDocumento>101.27</ImportoTotaleDocumento>
<Causale>OR. 9303258741
COD. 43583</Causale>
</DatiGeneraliDocumento>
</DatiGenerali>
�� <DatiBeniServizi>
<DettaglioLinee>
<NumeroLinea>1</NumeroLinea>
<CodiceArticolo>
<CodiceTipo>INTERNO</CodiceTipo>
<CodiceValore>OFF</CodiceValore>
</CodiceArticolo>
<Descrizione>LAVORO D'OFFICINA JEEP RENEGADE GE721ZJ KM 8763 </Descrizione>
<UnitaMisura>NR</UnitaMisura>
<PrezzoUnitario>0.0000</PrezzoUnitario>
<PrezzoTotale>0.0000</PrezzoTotale>
<AliquotaIVA>22.00</AliquotaIVA>
</DettaglioLinee>
<DettaglioLinee>
<NumeroLinea>2</NumeroLinea>
<CodiceArticolo>
<CodiceTipo>INTERNO</CodiceTipo>
<CodiceValore>OFF</CodiceValore>
</CodiceArticolo>
<Descrizione>SOSTITUZIONE FILTRO OLIO COD. 46337528 </Descrizione>
<Quantita>1.00</Quantita>
<UnitaMisura>NR</UnitaMisura>
<PrezzoUnitario>8.4900</PrezzoUnitario>
<PrezzoTotale>8.4900</PrezzoTotale>
<AliquotaIVA>22.00</AliquotaIVA>
�� </DettaglioLinee>
<DettaglioLinee>
<NumeroLinea>3</NumeroLinea>
<CodiceArticolo>
<CodiceTipo>INTERNO</CodiceTipo>
<CodiceValore>OFF</CodiceValore>
</CodiceArticolo>
<Descrizione>OLIO MOTORE 5W-30 4,5 LITRI </Descrizione>
<Quantita>1.00</Quantita>
<UnitaMisura>NR</UnitaMisura>
<PrezzoUnitario>51.5200</PrezzoUnitario>
<PrezzoTotale>51.5200</PrezzoTotale>
<AliquotaIVA>22.00</AliquotaIVA>
</DettaglioLinee>
<DettaglioLinee>
<NumeroLinea>4</NumeroLinea>
<CodiceArticolo>
<CodiceTipo>INTERNO</CodiceTipo>
<CodiceValore>OFF</CodiceValore>
</CodiceArticolo>
<Descrizione>LAVORO DI MESSA IN OPERA </Descrizione>
<Quantita>1.00</Quantita>
<UnitaMisura>NR</UnitaMisura>
<PrezzoUnitario>23.0000</PrezzoUnitario>
<PrezzoTotale>23.0000</PrezzoTotale>
<AliquotaIVA>22.00</Aliquo�JtaIVA>
</DettaglioLinee>
<DatiRiepilogo>
<AliquotaIVA>22.00</AliquotaIVA>
<ImponibileImporto>83.01</ImponibileImporto>
<Imposta>18.26</Imposta>
<EsigibilitaIVA>I</EsigibilitaIVA>
</DatiRiepilogo>
</DatiBeniServizi>
</FatturaElettronicaBody>
</p:FatturaElettronica>
Sadly, I can't copy and paste the full .p7m text; there is a lot more stuff like this after the closing tag.
After I used my function, I still get something like this:
<AliquotaIVA>22.00</AliquoJtaIVA>
AliquoJtaIVA
after I replaced the unwanted stuff from the original:
<AliquotaIVA>22.00</Aliquo�JtaIVA>
Can someone help me figure out how to parse these xml.p7m files with Python (or another language, I really don't care at this point)?
I don't see a logic there, and I really don't want to maintain all the special cases for the next year or so (those would be in a database table, btw., so we don't have to adjust the code all the time, but for testing purposes they are hard-coded).
Edit:
Another problem would be if those special characters appeared in a tag that contains numbers.
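A possible lead on the "logic" behind the corruption: the `♦♥è` sequence in the comments corresponds to the bytes `\x04\xc2\x82\x03\xc3\xa8`, which after undoing the latin-1/UTF-8 round trip appear to be the raw bytes `04 82 03 e8`. That looks like an ASN.1 OCTET STRING header (tag `0x04`, long-form length `0x03e8` = 1000 bytes): in a BER-encoded CMS/PKCS#7 envelope the signed payload may be split into fixed-size chunks, each with its own header, which would explain why the garbage lands at arbitrary positions inside tags. This is an assumption based on the byte pattern, not verified against your files; a more robust route is a real CMS tool such as `openssl smime -verify -noverify -inform DER -in invoice.xml.p7m`. If the chunking hypothesis holds, joining the chunk payloads at the byte level (before any text decoding) avoids the regex whack-a-mole. A minimal sketch, not a full BER parser:

```python
def join_ber_octet_chunks(data: bytes) -> bytes:
    """Concatenate the payloads of consecutive BER OCTET STRING chunks
    (tag 0x04 followed by a short- or long-form length)."""
    out = bytearray()
    i = 0
    while i < len(data) and data[i] == 0x04:
        i += 1
        length = data[i]
        if length & 0x80:                       # long form: low bits give the
            n = length & 0x7F                   # number of following length bytes
            length = int.from_bytes(data[i + 1:i + 1 + n], "big")
            i += 1 + n
        else:                                   # short form: the byte is the length
            i += 1
        out += data[i:i + length]
        i += length
    return bytes(out)

# Two chunks: a 5-byte short-form chunk and a long-form chunk with a
# 04 82 xx xx header like the ones seen inside the invoices.
raw = b"\x04\x05<Nazi\x04\x82\x00\x06one>IT"
print(join_ber_octet_chunks(raw))  # b'<Nazione>IT'
```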

pytorch-kaldi timit tutorial error shared_list[0] IndexError: list index out of range

# ***** Reading the Data********
if processed_first:
    # Reading all the features and labels for this chunk
    shared_list = []
    p = threading.Thread(target=read_lab_fea, args=(cfg_file, is_production, shared_list, output_folder))
    p.start()
    p.join()

    data_name = shared_list[0]
    data_end_index = shared_list[1]
    fea_dict = shared_list[2]
    lab_dict = shared_list[3]
    arch_dict = shared_list[4]
    data_set = shared_list[5]
First I ran Kaldi's run.sh file.
When I did that, I corrected cmd.sh's contents:
Original --> call queue.pl
to --> call run.pl
because I hit a bug when I ran the original source.
Reference: https://groups.google.com/g/kaldi-help/c/tokwXTLdGFY?pli=1
I found that fea_dict and lab_dict in data_io.py have no shared elements. How can I proceed with the TIMIT tutorial experiments?
I'm running the experiment using the cfg/TIMIT_baselines/TIMIT_MLP_mfcc_basic.cfg file, only correcting the absolute Linux directories.
I referred to https://github.com/mravanelli/pytorch-kaldi/issues/185 and ran copy-feats.
I saw --> kaldierror::KaldiFatalError
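The IndexError itself only means that `shared_list` is still empty after `p.join()`: `read_lab_fea` died inside the worker thread before appending anything, and `Thread.join()` does not re-raise the worker's exception, so the real failure (here, the copy-feats KaldiFatalError) is only visible in the log. A minimal illustration of the pattern, with the failing worker simulated:

```python
import threading

def read_lab_fea_stub(shared_list):
    # Simulate read_lab_fea crashing before it appends anything
    raise RuntimeError("copy-feats failed")
    shared_list.append("data_name")  # never reached

shared_list = []
p = threading.Thread(target=read_lab_fea_stub, args=(shared_list,))
p.start()
p.join()                 # returns normally even though the target raised
print(len(shared_list))  # 0 -> shared_list[0] raises IndexError
```

So the fix is not in this snippet but upstream: make the copy-feats/Kaldi step succeed (or check/log the thread's outcome before indexing).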

ipywidgets - widgets.FileUpload: read the uploaded CSV file

I am using JupyterHub, with the .ipynb file hosted on a server. My use case is to upload a CSV file from the local drive and read it for further dataframe tasks.
uploader = widgets.FileUpload(
    accept='*.csv',  # Accepted file extension e.g. '.txt', '.pdf', 'image/*', 'image/*,.pdf'
    multiple=False   # True to accept multiple files upload else False
)
display(uploader)

[input_file] = uploader.value
print(input_file)
pd.read_csv(input_file)
print(input_file) prints Test.csv, which is the CSV file name.
I am able to print [input_file], but `pd.read_csv(input_file)` throws the error below:
FileNotFoundError: [Errno 2] No such file or directory: 'Test.csv'
I am not sure where the CSV was uploaded or how I can read its data. Please help.
I don't have your exact ipywidgets version, but can you try this:
import io
import pandas as pd

input_file = list(uploader.value.values())[0]
content = input_file['content']
content = io.StringIO(content.decode('utf-8'))
df = pd.read_csv(content)
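The underlying point is that `FileUpload` never writes the file to disk; the upload lives in memory as bytes, which is why `pd.read_csv('Test.csv')` cannot find a file. Once you have the `content` bytes (however your ipywidgets version exposes them), pandas can read them through an in-memory buffer. A standalone sketch, with the CSV bytes made up for illustration:

```python
import io
import pandas as pd

# Stand-in for the raw upload bytes from uploader.value (the 'content' field)
content = b"name,score\nalice,10\nbob,7\n"

df = pd.read_csv(io.BytesIO(content))
print(df.shape)  # (2, 2)
```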

Customer Error: imread read blank (None) image for file- Sagemaker AWS

I am following this tutorial with my custom data and my custom S3 buckets where the train and validation data are stored. I am getting the following error:
Customer Error: imread read blank (None) image for file: /opt/ml/input/data/train/s3://image-classification/image_classification_model_data/train/img-001.png
All my training data is in one folder named 'train'. I have set up my lst file like this, as suggested by the docs:
22 1 s3://image-classification/image_classification_model_data/train/img-001.png
86 0 s3://image-classification/image_classification_model_data/train/img-002.png
...
My other configurations:
s3_bucket = 'image-classification'
prefix = 'image_classification_model_data'
s3train = 's3://{}/{}/train/'.format(s3_bucket, prefix)
s3validation = 's3://{}/{}/validation/'.format(s3_bucket, prefix)
s3train_lst = 's3://{}/{}/train_lst/'.format(s3_bucket, prefix)
s3validation_lst = 's3://{}/{}/validation_lst/'.format(s3_bucket, prefix)
train_data = sagemaker.inputs.TrainingInput(s3train, distribution='FullyReplicated',
                                            content_type='application/x-image', s3_data_type='S3Prefix')
validation_data = sagemaker.inputs.TrainingInput(s3validation, distribution='FullyReplicated',
                                                 content_type='application/x-image', s3_data_type='S3Prefix')
train_data_lst = sagemaker.inputs.TrainingInput(s3train_lst, distribution='FullyReplicated',
                                                content_type='application/x-image', s3_data_type='S3Prefix')
validation_data_lst = sagemaker.inputs.TrainingInput(s3validation_lst, distribution='FullyReplicated',
                                                     content_type='application/x-image', s3_data_type='S3Prefix')
data_channels = {'train': train_data, 'validation': validation_data, 'train_lst': train_data_lst,
                 'validation_lst': validation_data_lst}
I downloaded the images and checked them manually; I can see the image. Not sure why this error reports the image as blank. Any suggestion would be great.
SageMaker copies the input data you specify in s3train into the instance at /opt/ml/input/data/train/, and that's why you get the error: as you can see from the error message, it is trying to concatenate the filename in the lst file with the path where it expects the image to be. So put only the filenames in your lst file (remove the S3 path) and it should be fine.
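Concretely, the lst entries from the question would then look like this (filenames taken from the question; the im2rec-style lst format expects tab-separated columns of index, label, and a path relative to the train channel):

```
22	1	img-001.png
86	0	img-002.png
```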

Issues reading NetCDF4 LST files from Copernicus

I have a series of hourly LST data files from Copernicus. These can be read and displayed fine in Panoply (Windows version 4.10.6), but they:
Crash QGIS 2.18.13 when opened with the 'NetCDF4 Browser' plugin (V0.3)
Read the LST variable incorrectly using this R ncdf4 code:
file1 <- nc_open('g2_BIOPAR_LST_201901140400_E114.60S35.43E119.09S30.22_GEO_V1.2.nc')
file2 <- nc_open('g2_BIOPAR_LST_201901140500_E114.60S35.43E119.09S30.22_GEO_V1.2.nc')
# attributes(file1)$names
# Just for one variable for now
dat_new <- cbind(
  ncvar_get(file1, 'LST'),
  ncvar_get(file2, 'LST'))
dim(dat_new)
print(dim(dat_new))
var <- file1$var['LST']$LST
# Create a new file
file_new <- nc_create(
  filename = 'Copernicus_LST.nc',
  # We need to define the variables here
  vars = ncvar_def(
    name = 'LST',
    units = var$units,
    dim = dim(dat_new)))
# And write to it
ncvar_put(
  nc = file_new,
  varid = 'LST',
  vals = dat_new)
# Finally, close the file
nc_close(file_new)
Returns:
[1] "Error, passed variable has a dim that is NOT of class ncdim4!"
[1] "Error occurred when processing dim number 1 of variable LST"
Error in ncvar_def(name = "LST", units = var$units, dim = dim(dat_new)) :
This dim has class: integer
Similarly, using the python netCDF4 approach
compiled = netCDF4.MFDataset(['g2_BIOPAR_LST_201901140400_E114.60S35.43E119.09S30.22_GEO_V1.2.nc',
                              'g2_BIOPAR_LST_201901140500_E114.60S35.43E119.09S30.22_GEO_V1.2.nc'])
Returns
ValueError: MFNetCDF4 only works with NETCDF3_* and NETCDF4_CLASSIC formatted files, not NETCDF4
I'm presuming that this is an issue with the file formatting from Copernicus... Has anyone else encountered this? I have put the two example files here.
Thanks!