Recently I wanted to try the newly implemented read_xml function in pandas. I thought I would test the feature with SEPA camt-format XML. I'm stuck on the function's parameters, as I'm unfamiliar with the lxml logic. I tried pointing to the transaction values as rows (the "Ntry" tag), expecting the function to loop through those rows and create the dataframe. Leaving xpath at its default returns an empty dataframe with the columns "GrpHdr" and "Rpt", but the relevant data is one level below "Rpt". Setting xpath='//*' creates a huge dataframe with every tag as a column and the values seemingly randomly sorted.
If anyone is familiar with using pandas read_xml on nested XML, I'd appreciate any hints.
The XML file looks like this (fake values):
<Document>
  <BkToCstmrAcctRpt>
    <GrpHdr>
      <MsgId>Account</MsgId>
      <CreDtTm>2021-08-05T14:20:23.077+02:00</CreDtTm>
      <MsgRcpt>
        <Nm> Name</Nm>
      </MsgRcpt>
    </GrpHdr>
    <Rpt>
      <Id>Account ID</Id>
      <CreDtTm>2021-08-05T14:20:23.077+02:00</CreDtTm>
      <Acct>
        <Id>
          <IBAN>DEXXXXX</IBAN>
        </Id>
      </Acct>
      <Bal>
        <Tp>
          <CdOrPrtry>
          </CdOrPrtry>
        </Tp>
        <Amt Ccy="EUR">161651651651</Amt>
        <CdtDbtInd>CRDT</CdtDbtInd>
        <Dt>
          <DtTm>2021-08-05T14:20:23.077+02:00</DtTm>
        </Dt>
      </Bal>
      <Ntry>
        <Amt Ccy="EUR">11465165</Amt>
        <CdtDbtInd>CRDT</CdtDbtInd>
        <Sts>BOOK</Sts>
        <BookgDt>
          <Dt>2021-08-02</Dt>
        </BookgDt>
        <ValDt>
          <Dt>2021-08-02</Dt>
        </ValDt>
        <BkTxCd>
          <Domn>
            <Cd>PMNT</Cd>
            <Fmly>
              <Cd>RCDT</Cd>
              <SubFmlyCd>ESCT</SubFmlyCd>
            </Fmly>
          </Domn>
          <Prtry>
            <Cd>NTRF+65454</Cd>
            <Issr>DFE</Issr>
          </Prtry>
        </BkTxCd>
        <NtryDtls>
          <TxDtls>
            <Amt Ccy="EUR">4945141.0</Amt>
            <CdtDbtInd>CRDT</CdtDbtInd>
            <BkTxCd>
              <Domn>
                <Cd>PMNT</Cd>
                <Fmly>
                  <Cd>RCDT</Cd>
                  <SubFmlyCd>ESCT</SubFmlyCd>
                </Fmly>
              </Domn>
              <Prtry>
                <Cd>NTRF+55155</Cd>
                <Issr>DFEsds</Issr>
              </Prtry>
            </BkTxCd>
            <RltdPties>
              <Dbtr>
                <Nm>Name</Nm>
              </Dbtr>
              <Cdtr>
                <Nm>Name</Nm>
              </Cdtr>
            </RltdPties>
            <RmtInf>
              <Ustrd>Referenz NOTPROVIDED</Ustrd>
              <Ustrd> Buchug</Ustrd>
            </RmtInf>
          </TxDtls>
        </NtryDtls>
      </Ntry>
    </Rpt>
  </BkToCstmrAcctRpt>
</Document>
The bank statement is not a shallow XML document and is therefore not well suited to pandas.read_xml (as indicated in the documentation).
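To illustrate the limitation, a minimal sketch (assuming pandas >= 1.3 and the example saved as example.xml): read_xml flattens only the direct children of each node matched by xpath, so pointing it at the repeating "Ntry" element loses nested values such as BookgDt/Dt:

import pandas as pd

# Each <Ntry> becomes one row, but only its direct children become
# columns; purely nested children (BookgDt, ValDt, BkTxCd, NtryDtls)
# carry no text of their own and come back as NaN.
entries = pd.read_xml("example.xml", xpath=".//Ntry")
print(entries.columns.tolist())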
Instead, I suggest using the sepa library:
from sepa import parser
import re
import pandas as pd

# Utility function to remove additional namespaces from the XML
def strip_namespace(xml):
    return re.sub(' xmlns="[^"]+"', '', xml, count=1)

# Read file
with open('example.xml', 'r') as f:
    input_data = f.read()

# Parse the bank statement XML to dictionary
camt_dict = parser.parse_string(parser.bank_to_customer_statement, bytes(strip_namespace(input_data), 'utf8'))

statements = pd.DataFrame.from_dict(camt_dict['statements'])

all_entries = []
for i, _ in statements.iterrows():
    if 'entries' in camt_dict['statements'][i]:
        df = pd.DataFrame()
        dd = pd.DataFrame.from_records(camt_dict['statements'][i]['entries'])
        df['Date'] = dd['value_date'].str['date']
        df['Date'] = pd.to_datetime(df['Date']).dt.strftime('%Y-%m-%d')
        iban = camt_dict['statements'][i]['account']['id']['iban']
        df['IBAN'] = iban
        df['Currency'] = dd['amount'].str['currency']
        all_entries.append(df)

df_entries = pd.concat(all_entries)
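df_entries then holds one row per booked entry (Date, IBAN, Currency) across all statements; more fields can be pulled from dd the same way. Note that strip_namespace only removes the default xmlns declaration, so the parser can match tag names without a namespace prefix.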
How do I select rows where (pseudocode): columns['Name'] == 'Name_A' (Name_A is just an example) & columns['Time'] is in (2021-11-21 00:00:00, 2021-11-22 00:00:00)?
I have stored about 4 billion rows of data in an HDF5 file.
Now I want to select some of that data.
My code looks like this:
import pandas as pd

ss = pd.HDFStore("xh_data_L9.hdf5")
print(type(ss))  # <class 'pandas.io.pytables.HDFStore'>
print(ss.keys())
s_1 = ss.select('alldata', start=0, stop=500)  # data example
print(s_1)
ss.close()
I found that the HDFStore.select signature looks like this:
HDFStore.select(key, where=None, start=None, stop=None, columns=None, iterator=False, chunksize=None, auto_close=False)
# Neither of these runs successfully:
s_3 = ss.select('alldata',where="Time>2021-11-21 00:00:00 & Time<2021-11-22 00:00:00)")
s_3 = ss.select('alldata',['Name'] == 'Name_A')
I have googled for methods, but I don't understand how to use "where".
I found that the cause was whether data_columns was set when the file was created.
# this method of creating the hdf5 file does not set data_columns
ss.append('store', df_temp, index=True)
# this method of creating the hdf5 file does set data_columns
store.append("store", df_temp, format="table", data_columns=True)
# check whether the file has data_columns
import pandas as pd
ss = pd.HDFStore("store.hdf5")
print(ss.info())
If the result includes "dc->[Time,Name,Value]", the data columns exist and can be queried:
ss.select("store", where="Name='Name_A'")
# Single quotation marks are required around the values.
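To answer the original question, both conditions can go into one where string (a sketch, assuming the store was written with data_columns=True so that Name and Time are both queryable); datetime values are quoted the same way:

ss.select(
    "store",
    where="Name='Name_A' & Time>'2021-11-21 00:00:00' & Time<'2021-11-22 00:00:00'",
)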
The following is the official documentation for data_columns:
data_columns : list of columns, or True, default None
    List of columns to create as indexed data columns for on-disk
    queries, or True to use all columns. By default only the axes
    of the object are indexed.
See here <https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#query-via-data-columns>.
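For completeness, a minimal sketch of writing a store so that Time, Name and Value become queryable data columns (a hypothetical frame, with column names matching the dc list above):

import pandas as pd

df = pd.DataFrame({
    "Time": pd.date_range("2021-11-21", periods=3, freq="h"),
    "Name": ["Name_A", "Name_B", "Name_A"],
    "Value": [1.0, 2.0, 3.0],
})
with pd.HDFStore("store.hdf5") as store:
    # data_columns=True makes every column queryable on disk
    store.append("store", df, format="table", data_columns=True)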
Please, I need help: I'm having trouble trying to put my scraped data into a data frame that has 3 columns, i.e. date, source and keywords extracted from each scraped website, for further text analysis. My code is borrowed from https://stackoverflow.com/users/12229253/foreverlearning and is given below:
from newspaper import Article
import nltk
nltk.download('punkt')

urls = ['https://dailypost.ng/2022/02/02/securing-nigeria-duty-of-all-tiers-of-government-oyo-senator-balogun/', 'https://guardian.ng/news/declare-bandits-as-terrorists-senate-tells-buhari/', 'https://www.thisdaylive.com/index.php/2021/10/24/when-will-fg-declare-bandits-as-terrorists/', 'https://punchng.com/rep-wants-buhari-to-name-lawmaker-sponsoring-terrorism/', 'https://punchng.com/national-assembly-plans-to-meet-us-congress-over-875m-weapons-deal-stoppage/']

results = {}
for url in urls:
    article = Article(url)
    article.download()
    article.parse()
    article.nlp()
    results[url] = article

for url in urls:
    print(url)
    article = results[url]
    print(article.authors)
    print(article.publish_date)
    print(article.keywords)
I played around with it, and here is how you can make it into a data frame, assuming that you wanted to use pandas in the first place:
import nltk
import pandas as pd
from newspaper import Article

nltk.download('punkt')

urls = ['https://dailypost.ng/2022/02/02/securing-nigeria-duty-of-all-tiers-of-government-oyo-senator-balogun/', 'https://guardian.ng/news/declare-bandits-as-terrorists-senate-tells-buhari/', 'https://www.thisdaylive.com/index.php/2021/10/24/when-will-fg-declare-bandits-as-terrorists/', 'https://punchng.com/rep-wants-buhari-to-name-lawmaker-sponsoring-terrorism/', 'https://punchng.com/national-assembly-plans-to-meet-us-congress-over-875m-weapons-deal-stoppage/']

# create a data frame with the needed columns
saved_data = pd.DataFrame(columns=['Date', 'Source', 'KeyWords'])

# put the results into a data frame that has 3 columns, i.e. date, source and keywords
def add_data_to_df(urls, saved_data):
    for url in urls:  # process each url separately
        article = Article(url)
        article.download()
        article.parse()
        article.nlp()
        # create a row with the data you need using attributes
        record = {'Date': article.publish_date, 'Source': url, 'KeyWords': article.keywords}
        # append info about each url as a new row
        # (DataFrame.append was removed in pandas 2.0, so use pd.concat)
        saved_data = pd.concat([saved_data, pd.DataFrame([record])], ignore_index=True)
    return saved_data
Now, when you run this function as add_data_to_df(urls, saved_data), you should see a data frame with contents similar to the ones I got below during testing:
Date Source KeyWords
0 2022-02-02 00:00:00 https://dailypost.ng/2022/02/02/securing-niger... [nigeria, securing, oyo, state, senator, prote...
1 2021-09-30 04:25:24+00:00 https://guardian.ng/news/declare-bandits-as-te... [shutdown, terrorists, nigeria, guardian, decl...
2 2021-10-24 00:00:00 https://www.thisdaylive.com/index.php/2021/10/... [terrorists, nigerian, declare, state, militar...
3 2021-10-05 14:41:48+00:00 https://punchng.com/rep-wants-buhari-to-name-l... [president, buhari, house, national, lawmaker,...
4 2021-07-31 00:30:47+00:00 https://punchng.com/national-assembly-plans-to... [plans, congress, deal, nigeria, nigerian, rig...
(Sorry for the format, I am showing the output as plain text since I am not allowed to attach screenshots, but you will have a nice pandas format)
Edit: adding a function to save the data frame to a csv file. Note that this is one of the shortest ways of doing this, and it will save the file to the current working directory, i.e., where you are executing your code:
# this function saves the given data frame to csv
def save_to_csv(saved_data):
    saved_data.to_csv('output.csv', index=False, sep=',')

# process the articles and create a csv
save_to_csv(add_data_to_df(urls, saved_data))
When invoking Azure ML Batch Endpoints (creating jobs for inferencing), the run() method should return a pandas DataFrame or an array, as explained here.
However, the example shown there doesn't produce output with headers for a csv, as is often needed.
The first thing I tried was to return the data as a pandas DataFrame, and the result is just a simple csv with a single column and without the headers.
When trying to pass the values with several columns and their corresponding headers, to be later saved as csv, the result contains awkward square brackets (representing Python lists) and apostrophes (representing strings).
I haven't been able to find documentation elsewhere to fix this.
This is the way I found to create a clean output in csv format using Python, from a batch endpoint invocation in Azure ML:
def run(mini_batch):
    batch = []
    for file_path in mini_batch:
        df = pd.read_csv(file_path)
        # Do any data quality verification here:
        if 'id' not in df.columns:
            logger.error("ERROR: CSV file uploaded without id column")
            return None
        else:
            df['id'] = df['id'].astype(str)
        # Now we create the predictions, with the model previously loaded in init():
        df['prediction'] = model.predict(df)
        # or alternatively: df[MULTILABEL_LIST] = model.predict(df)
        batch.append(df)

    batch_df = pd.concat(batch)

    # After joining all data, we create the column headers as a string,
    # removing the square brackets and apostrophes:
    azureml_columns = str(batch_df.columns.tolist())[1:-1].replace('\'', '')
    result = []
    result.append(azureml_columns)

    # Now we have to render all values as strings, row by row,
    # adding a comma between each value:
    for row in batch_df.iterrows():
        azureml_row = str(row[1].values).replace(' ', ',')[1:-1].replace('\'', '').replace('\n', '')
        result.append(azureml_row)

    logger.info("Finished Run")
    return result
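A note on the row-serialization step above: replacing spaces in str(row[1].values) is fragile when a value itself contains spaces. A variation of my own (not part of the original endpoint code) joins the values explicitly:

# hypothetical alternative for building each output row
azureml_row = ','.join(str(v) for v in row[1].values)
result.append(azureml_row)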
Using a csv extract from a registration system, I am attempting to format the data for use as a contact/distribution list import into a virtual meeting application. Using the following function I am able to pull the needed data into a nested list ([name1, email1], [name2, email2], ...).
def createDistributionList():
    with open(fileOpen) as readFile, open('test2.txt', 'w') as writeFile:
        data = pd.read_csv(readFile)
        df = pd.DataFrame(data, columns=['Attendee Name', 'Attendee Email'])
        distList = df.values.tolist()
        print(' '.join(map(str, distList)))
The format I need the data in is one long string - name1(email1);name2(email2);...
I have been unable to get the output that I am looking for. Any assistance or a pointer to a relevant reference would be greatly appreciated.
You can use a list comprehension for that:
tup = (["name1", "email1"], ["name2", "email2"], ["name3", "email3"])
print(";".join(["{}({})".format(l[0], l[1]) for l in tup]))
I am trying to create/add a new column to the following dataframe (df_new).
I want this new column (df['category']) to be fed from df['tags'].
The tags column is a list of objects, and the value that I want to retrieve is the category; if there is no category, I want to set it as unknown.
This is a sample of my JSON file
{"submissionTime":"2019-02-25T09:26:00","b_data":{"bName":"Masato","b_Acc":[{"id":0,"transactions":[{"date":"2019-12-19","text":"PERIODICAL PAYMENT","amount":3397,"type":"","tags":[{"institution":"University of MC"},{"lenderType":"private"},{"category":"birdy"},{"creditDebit":"credit"}]},{"date":"2019-12-03","text":"LINE FEE","amount":-460.21,"type":"Overdrawn Fees","tags":[{"category":"Overdrawn"},{"creditDebit":"debit"}]},{"date":"2019-12-31","text":"INTEREST","amount":-871.62,"type":"Interest Charge","tags":[{"category":"Fees"},{"creditDebit":"debit"}]},{"date":"2019-12-31","text":"LOAN SERVICE FEE","amount":-120,"type":"Loan Related Fees","tags":[{"category":"Fees"},{"creditDebit":"debit"}]},{"date":"2019-12-18","text":"PERIODICAL PAYMENT","amount":3397,"type":"","tags":[{"institution":"University of MC"},{"lenderType":"private"},{"category":"birdy"},{"creditDebit":"credit"}]},{"date":"2019-12-02","text":"LINE FEE","amount":-498.34,"type":"Overdrawn Fees","tags":[{"category":"Overdrawn"},{"creditDebit":"debit"}]},{"date":"2019-11-29","text":"INTEREST","amount":-794.4,"type":"Interest Charge","tags":[{"category":"Fees"},{"creditDebit":"debit"}]},{"date":"2019-11-19","text":"PERIODICAL PAYMENT","amount":3397,"type":"","tags":[{"institution":"University of MC"},{"lenderType":"private"},{"category":"birdy"},{"creditDebit":"credit"}]},{"date":"2019-11-01","text":"LINE FEE","amount":-484.87,"type":"Overdrawn Fees","tags":[{"category":"Overdrawn"},{"creditDebit":"debit"}]},{"date":"2019-10-31","text":"INTEREST","amount":-882.04,"type":"Interest Charge","tags":[{"category":"Fees"},{"creditDebit":"debit"}]},{"date":"2019-10-21","text":"PERIODICAL PAYMENT","amount":3397,"type":"","tags":[{"institution":"University of MC"},{"lenderType":"private"},{"category":"birdy"},{"creditDebit":"credit"}]},{"date":"2019-10-01","text":"LINE FEE","amount":-503.59,"type":"Overdrawn Fees","tags":[{"category":"Overdrawn"},{"creditDebit":"debit"}]},{"date":"2019-09-30","text":"INTEREST","amount":-916.98,"type":"Interest Charge","tags":[{"category":"Fees"},{"creditDebit":"debit"}]},{"date":"2019-09-30","text":"LOAN SERVICE FEE","amount":-120,"type":"Loan Related Fees","tags":[{"category":"Fees"},{"creditDebit":"debit"}]},{"date":"2019-09-19","text":"PERIODICAL PAYMENT","amount":3397,"type":"","tags":[{"institution":"University of MC"},{"lenderType":"private"},{"category":"birdy"},{"creditDebit":"credit"}]},{"date":"2019-09-02","text":"LINE FEE","amount":-489.65,"type":"Overdrawn Fees","tags":[{"category":"Overdrawn"},{"creditDebit":"debit"}]},{"date":"2019-08-30","text":"INTEREST","amount":-892.13,"type":"Interest Charge","tags":[{"category":"Fees"},{"creditDebit":"debit"}]}]}]}}
and this is what I have been able to do so far:
import json
import numpy as np
import pandas as pd

with open('question.json') as json_data:
    d = json.load(json_data)

df = pd.json_normalize(d['b_data']['b_Acc'])

frames = []
# https://pandas.pydata.org/pandas-docs/stable/merging.html
for index, row in df.iterrows():
    frames = frames + row['transactions']

df_new = pd.DataFrame(frames)
df['category'] = df_new['tags'].apply(pd.Series)[0]
This could potentially work if category were always the first element of that array; however, in row 0 the first element is institution, and in the second row it is creditDebit (which I would like to be unknown, as there is no category).
This will do what you did in the for loop:
s = pd.DataFrame(pd.DataFrame(df.transactions.tolist()).stack().str['tags'].tolist())
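To get the category column with the "unknown" fallback the question asks for, here is a sketch of my own on top of df_new from the question's code:

# pull the category out of each tags list, falling back to 'unknown'
# when no tag in the list carries a 'category' key
def get_category(tags):
    for tag in tags:
        if 'category' in tag:
            return tag['category']
    return 'unknown'

df_new['category'] = df_new['tags'].apply(get_category)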