I have different data files that are mapped onto relational stores. I have a formatter which contains the separators used by the different data files (most of them CSV). Here is an example of what such a file looks like:
DQKI 435741198746445 45879645422727JHUFHGLOBAL COLLATERAL SERVICES AGGREGATOR V9
The rule to read this file is as follows: from index 0 to 3 it's the code name, from index 8 to 11 it's the PID, from index 11 to 20 it's the account number, and so on...
How do you specify such rule in ActivePivot Relational Stores?
The Relational Store of ActivePivot ships with a high-performance, multithreaded CSV source to parse files and load them into data stores. I suppose that's what you hope to use for your fixed-length-field file.
But fixed-length fields are not supported in the current version of the Relational Store (1.5.x).
You could pre-process your file with a small script to add a separator character at the end of each of the fields. Then the entire CSV Source can be reused immediately.
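That pre-processing idea can be sketched in Python. Note that the (start, end) slice offsets below are illustrative guesses taken from the rule in the question, not a verified record layout:

```python
import csv

# Illustrative (start, end) slice offsets based on the rule in the question;
# adjust these to the real record layout.
FIELDS = [(0, 4), (8, 11), (11, 20)]  # code name, PID, account number

def fixed_width_to_csv(in_path, out_path, fields=FIELDS):
    """Cut each fixed-width line into slices and write them as CSV rows."""
    with open(in_path) as src, open(out_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        for line in src:
            line = line.rstrip("\n")
            writer.writerow([line[start:end].strip() for start, end in fields])
```

Once the file is rewritten this way, the existing CSV Source can consume it unchanged.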
You could write your own data source that defines fields as offsets in the text line. If you do that, you can reuse all of the fast field parsers available in the CSV Source project (they work on any char sequence):
com.quartetfs.fwk.format.impl.DoubleParser
com.quartetfs.fwk.format.impl.FloatParser
com.quartetfs.fwk.format.impl.DoubleVectorParser
com.quartetfs.fwk.format.impl.FloatVectorParser
com.quartetfs.fwk.format.impl.IntegerParser
com.quartetfs.fwk.format.impl.IntegerVectorParser
com.quartetfs.fwk.format.impl.LongParser
com.quartetfs.fwk.format.impl.ShortParser
com.quartetfs.fwk.format.impl.StringParser
com.quartetfs.fwk.format.impl.DateParser
Related
I have a set of 100K XML-ish (more on that later) legacy files with a consistent structure - an Archive wrapper with multiple Date and Data pair records.
I need to extract the individual records and write them to individual text files, but am having trouble parsing the data due to illegal characters and random CR/space/tab leading and trailing data.
About the XML Files
The files are inherited from a retired system and can't be regenerated. Each file is pretty small (less than 5 MB).
There is one Date record for every Data record:
vendor-1-records.xml
<Archive>
<Date>10 Jan 2019</Date>
<Data>Vendor 1 Record 1</Data>
<Date>12 Jan 2019</Date>
<Data>Vendor 1 Record 2</Data>
(etc)
</Archive>
vendor-2-records.xml
<Archive>
<Date>22 September 2019</Date>
<Data>Vendor 2 Record 1</Data>
<Date>24 September 2019</Date>
<Data>Vendor 2 Record 2</Data>
(etc)
</Archive>
...
vendor-100000-records.xml
<Archive>
<Date>12 April 2019</Date>
<Data>Vendor 100000 Record 1</Data>
<Date>24 October 2019</Date>
<Data>Vendor 100000 Record 2</Data>
(etc)
</Archive>
I would like to extract each Data record and use the Date entry to define a unique file name, then write the contents of the Data record to that file, like so:
filename: vendor-1-record-1-2019-1Jan-10.txt contains
file contents: Vendor 1 record 1
(no tags, just the record terminated by CR)
filename: vendor-1-record-2-2019-1Jan-12.txt contains
file contents: Vendor 1 record 2
filename: vendor-2-record-1-2019-9Sep-22.txt contains
file contents: Vendor 2 record 1
filename: vendor-2-record-2-2019-9Sep-24.txt contains
file contents: Vendor 2 record 2
Issue 1 : illegal characters in XML Data records
One issue is that the elements contain multiple characters that XML libraries like ElementTree terminate on, including control characters, formatting characters, and various Alt+XXX type characters.
I've searched online and found all manner of workarounds, regexes, and search-and-replace scripts, but the only thing that seems to work in Python is lxml's etree with recover=True.
However, that doesn't even always work because some of the files are apparently not UTF-8, so I get the error:
lxml.etree.XMLSyntaxError: Input is not proper UTF-8, indicate encoding !
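A way around that error is to normalize the bytes before handing them to the parser. This is a hedged sketch: the fallback encoding (cp1252 here) is an assumption, so substitute whatever legacy codepage the retired system actually used:

```python
def decode_with_fallback(raw: bytes, encodings=("utf-8", "cp1252")) -> str:
    """Try each encoding in order; as a last resort, replace undecodable bytes."""
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    return raw.decode("utf-8", errors="replace")

def normalize_file_to_utf8(path: str) -> str:
    """Return decoded text that can safely be re-encoded as UTF-8 for parsing."""
    with open(path, "rb") as f:
        return decode_with_fallback(f.read())
```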
Issue 2 - Data records have random amounts of leading and trailing CRs and spaces
For the files I can parse with lxml.etree, the actual Data records are also wrapped in CRs and random spaces:
<Data>
(random numbers of CR + spaces and sometimes tabs)
*content<CR>*
(random numbers of CR + spaces and sometimes tabs)
</Data>
and therefore when I run
parser = etree.XMLParser(recover=True)
tree = etree.parse('vendor-1-records.xml', parser=parser)
tags_needed = tree.iter('Data')
for it in tags_needed:
    print(it.tag, it.attrib)
I get a collection of empty Data tags (one for each data record in the file) like
Data {}
Data {}
Questions
Is there a more efficient language/module than Python's lxml for ignoring the illegal characters? As I said, I've dug through a number of cookbook blog posts, SE articles, etc. for pre-processing the XML, and nothing seems to really work - there's always one more control character that hangs the parser.
SE suggested a post about cleaning XML which references an old Atlassian tool (Stripping Invalid XML characters in Java). I did some basic tests and it seems like it might work, but I'm open to other suggestions.
I have not used regex with Python much - any suggestions on how to handle cleaning the leading/trailing CR/space/tab randomness in the Data tags? The actual record string I want from that Data tag also has a CR at the end and may contain tabs, so I can't just search and replace. Maybe there is a regex way to pull that out, but my regex-fu is pretty weak.
For my issues 1 and 2, I kind of solved my own problem:
Issue 1 (parsing and invalid characters)
I ran the entire set of files through the Atlassian jar referenced in (Stripping Invalid XML characters in Java) with a batch script:
for %%f in (*.xml) do (
java -jar atlassian-xml-cleaner-0.1.jar %%f > clean\%%~f
)
This utility standardized all of the XML files and made them parseable by lxml.
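If you'd rather stay in Python than shell out to the Java jar, a similar cleanup can be sketched with a regex that removes everything outside the XML 1.0 character range. This is offered as an alternative approach, not a claim about what the Atlassian jar literally does internally:

```python
import re

# Characters legal in XML 1.0: tab, LF, CR, and the listed Unicode ranges.
_ILLEGAL_XML_CHARS = re.compile(
    "[^\x09\x0a\x0d\x20-\ud7ff\ue000-\ufffd\U00010000-\U0010ffff]"
)

def strip_illegal_xml_chars(text: str) -> str:
    """Remove control characters and other code points XML 1.0 forbids."""
    return _ILLEGAL_XML_CHARS.sub("", text)
```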
Issue 2 (CR, spaces, tabs inside the Data element)
This configuration for lxml stripped all whitespace and handled the invalid character issue
from lxml import etree
parser = etree.XMLParser(encoding='utf-8', recover=True, remove_blank_text=True)
tree = etree.parse(filepath, parser=parser)
With these two steps I'm now able to start extracting records and writing them to individual files:
# for each date, finding the next item gives me the Data element and I can strip the tab/CR/whitespace:
for item in tree.findall('Date'):
    dt = parse_datestamp(item.text.strip())
    content = item.getnext().text.strip()
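From there, the target file names can be derived from the Date text. A sketch, assuming dates come in either abbreviated or full month form as in the samples; `parse_datestamp` is the author's own helper, so `datetime.strptime` is used here instead, and the vendor prefix is assumed to be derived from the input file name:

```python
from datetime import datetime

def date_stamp(text: str) -> str:
    """'10 Jan 2019' -> '2019-1Jan-10'; '22 September 2019' -> '2019-9Sep-22'."""
    for fmt in ("%d %b %Y", "%d %B %Y"):
        try:
            d = datetime.strptime(text.strip(), fmt)
            break
        except ValueError:
            continue
    else:
        raise ValueError(f"unrecognized date: {text!r}")
    return f"{d.year}-{d.month}{d:%b}-{d.day}"

def write_records(tree, prefix="vendor-1"):
    """Walk Date/Data pairs (lxml tree, as in the loop above) and write one file per record."""
    for n, item in enumerate(tree.findall("Date"), start=1):
        stamp = date_stamp(item.text)
        content = item.getnext().text.strip()
        with open(f"{prefix}-record-{n}-{stamp}.txt", "w") as out:
            out.write(content + "\n")  # record terminated by a newline
```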
I've been tasked with mapping an input XML (actually an SAP IDoc XML) and generating a number of flat files. Each input XML may yield multiple output files (one output file per lot number), so I will be using xsl:key and the key() function in my mapping, based on the lot number.
The thing is, the lot number will not appear in the file content, but the output file name needs to contain it.
So the question really is: can I map the lot number into the XML and have the flat file assembler skip it when it produces the file? Or is there another way the lot number can be applied as the file name by the assembler without having it inside the file itself?
In your orchestration you can set a context property for each output message:
msgOutput(FILE.ReceivedFileName) = "DynamicStuff";
msgOutput then goes to the send shape.
In your send port you set the output file like this:
FixedStuff_%SourceFileName%.xml
The result:
FixedStuff_DynamicStuff.xml
If the value is not required in the message content, don't map it. That's it.
To insert a value into the file name, the lot number in this case, you will need to promote that value to the FILE.ReceivedFileName Context Property. Then you can use the %SourceFileName% Macro as part of the name setting in the Send Port. You can set FILE.ReceivedFileName by either Property Promotion or xpath() in an Orchestration.
Bonus: sorting and grouping in XSLT is rather unwieldy, which is why I don't do that anymore. Instead, you can use SQL: BizTalk: Sorting and Grouping Flat File Data In SQL Instead of XSL
I have a set of multiple APIs I need to source data from, and I need four different data categories. This data is then used for reporting purposes in Excel.
I initially created web queries in Excel, but my laptop just crashes because there are too many queries which have to be updated. Do you guys know a smart workaround?
This is an example of the API I will source data from (40 different ones in total)
https://api.similarweb.com/SimilarWebAddon/id.priceprice.com/all
The data points I need are:
EstimatedMonthlyVisits, TopOrganicKeywords, OrganicSearchShare, TrafficSources
Any ideas how I can create an automated report which queries the above data on request?
Thanks so much.
If Excel is crashing due to the demand, and that doesn't surprise me, you should consider using Python or R for this task.
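If you go the Python route, a minimal sketch might look like the following. The JSON shape of the SimilarWeb response is an assumption here; the field names are copied from the question, and the endpoint may require an API key or return a different structure:

```python
import json
from urllib.request import urlopen

WANTED = ("EstimatedMonthlyVisits", "TopOrganicKeywords",
          "OrganicSearchShare", "TrafficSources")

def extract_metrics(payload: dict, wanted=WANTED) -> dict:
    """Pull only the needed data points out of one decoded API response."""
    return {key: payload.get(key) for key in wanted}

def fetch_site(url: str) -> dict:
    # Assumes the endpoint returns a JSON object; adjust if it does not.
    with urlopen(url) as resp:
        return extract_metrics(json.load(resp))

# Looping fetch_site over the 40 endpoint URLs and dumping the results
# to CSV (which Excel can open) would replace the in-sheet web queries.
```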
install.packages("XML")
install.packages("plyr")
install.packages("ggplot2")
install.packages("gridExtra")
require("XML")
require("plyr")
require("ggplot2")
require("gridExtra")
Next we need to set our working directory and parse the XML file as a matter of practice, so we're sure that R can access the data within the file. This is basically reading the file into R. Then, just to confirm that R knows our file is in XML, we check the class. Indeed, R is aware that it's XML.
setwd("C:/Users/Tobi/Documents/R/InformIT") #you will need to change the filepath on your machine
xmlfile=xmlParse("pubmed_sample.xml")
class(xmlfile) #"XMLInternalDocument" "XMLAbstractDocument"
Now we can begin to explore our XML. Perhaps we want to confirm that our HTTP query on Entrez pulled the correct results, just as when we query PubMed's website. We start by looking at the contents of the first node or root, PubmedArticleSet. We can also find out how many child nodes the root has and their names. This process corresponds to checking how many entries are in the XML file. The root's child nodes are all named PubmedArticle.
xmltop = xmlRoot(xmlfile) #gives content of root
class(xmltop)#"XMLInternalElementNode" "XMLInternalNode" "XMLAbstractNode"
xmlName(xmltop) #give name of node, PubmedArticleSet
xmlSize(xmltop) #how many children in node, 19
xmlName(xmltop[[1]]) #name of root's children
To see the first two entries, we can do the following.
# have a look at the content of the first child entry
xmltop[[1]]
# have a look at the content of the 2nd child entry
xmltop[[2]]
Our exploration continues by looking at subnodes of the root. As with the root node, we can list the name and size of the subnodes as well as their attributes. In this case, the subnodes are MedlineCitation and PubmedData.
#Root Node's children
xmlSize(xmltop[[1]]) #number of nodes in each child
xmlSApply(xmltop[[1]], xmlName) #name(s)
xmlSApply(xmltop[[1]], xmlAttrs) #attribute(s)
xmlSApply(xmltop[[1]], xmlSize) #size
We can also separate each of the 19 entries by these subnodes. Here we do so for the first and second entries:
#take a look at the MedlineCitation subnode of 1st child
xmltop[[1]][[1]]
#take a look at the PubmedData subnode of 1st child
xmltop[[1]][[2]]
#subnodes of 2nd child
xmltop[[2]][[1]]
xmltop[[2]][[2]]
The separation of entries is really just us, indexing into the tree structure of the XML. We can continue to do this until we exhaust a path—or, in XML terminology, reach the end of the branch. We can do this via the numbers of the child nodes or their actual names:
#we can keep going till we reach the end of a branch
xmltop[[1]][[1]][[5]][[2]] #title of first article
xmltop[['PubmedArticle']][['MedlineCitation']][['Article']][['ArticleTitle']] #same command, but more readable
Finally, we can transform the XML into a more familiar structure—a dataframe. Our command completes with errors due to non-uniform formatting of data and nodes. So we must check that all the data from the XML is properly inputted into our dataframe. Indeed, there are duplicate rows, due to the creation of separate rows for tag attributes. For instance, the ELocationID node has two attributes, ValidYN and EIDType. Take the time to note how the duplicates arise from this separation.
#Turning XML into a dataframe
Madhu2012=ldply(xmlToList("pubmed_sample.xml"), data.frame) #completes with errors: "row names were found from a short variable and have been discarded"
View(Madhu2012) #for easy checking that the data is properly formatted
Madhu2012.Clean=Madhu2012[Madhu2012[25]=='Y',] #gets rid of duplicated rows
Here is a link that should help you get started.
http://www.informit.com/articles/article.aspx?p=2215520
If you have never used R before, it will take a little getting used to, but it's worth it. I've been using it for a few years now and when compared to Excel, I have seen R perform anywhere from a couple hundred percent faster to many thousands of percent faster than Excel. Good luck.
I have a bibliographic record in RUSMARC (Russian UNIMARC) standard. For further processing I need to convert this record into MARCXML (MARC21 in XML) format.
How can I accomplish such a transformation programmatically?
UPDATE
I have some routine to read and parse ISO 2709 format. However, RUSMARC (and UNIMARC in general) is different to MARC21 in terms of fields meaning.
UNIMARC records should be converted to MARC21 according to specifications published by Library of Congress (http://www.loc.gov/marc/unimarctomarc21.html).
First, you need to read your RUSMARC (UNIMARC) record into memory and construct XML according to UNISlim schema (http://www.rusmarc.ru/shema/UNISlim.xsd).
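The "read the record into memory" step can be sketched in Python from the ISO 2709 structure: a 24-byte leader (with the data base address at positions 12-16), 12-byte directory entries of tag/length/start, \x1e field terminators, and a \x1d record terminator. This sketch only cuts a record into tagged fields; RUSMARC subfield handling and the UNISlim XML construction are deliberately left out:

```python
def parse_iso2709(record: bytes, encoding: str = "utf-8"):
    """Split one ISO 2709 record into a list of (tag, data) pairs."""
    leader = record[:24]
    base = int(leader[12:17])                     # base address of data section
    directory = record[24:record.index(b"\x1e")]  # directory ends at first \x1e
    fields = []
    for i in range(0, len(directory), 12):
        tag = directory[i:i + 3].decode("ascii")
        length = int(directory[i + 3:i + 7])      # includes the field terminator
        start = int(directory[i + 7:i + 12])
        data = record[base + start:base + start + length - 1]
        fields.append((tag, data.decode(encoding)))
    return fields
```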
Then you can use XSL transformation that converts UNIMARC XML (in UNISlim schema) into MARCXML.
You can get this XSL transformation here: https://github.com/edsd/biblio-metadata
I have exported the contents of a table with transaction SE16, by selecting all the entries and going selecting Download, unconverted.
I'd like to import these entries into another system (where the same table exists and is active).
Furthermore, when I import, there's a possibility that the specific key already exists for a number of entries (old entries).
Other entries won't have a field with the same key present in the table where they're to be imported (new entries).
Is there a way to easily update my table in the second system with the file provided from the first system? If needed, I can export the data in the 3 other format types (Spreadsheet, Rich text format and HTML format). It seems to me, though, that the spreadsheet and rich text formats sometimes corrupt the data, and the HTML is far too verbose.
[EDIT]
As per popular demand, the table I'm trying to export / import is a Z table whose fields are all numeric, character, date or time fields (flat data types).
I'm trying to do it like this because the clients don't have any Basis resource to help them with transports, and they would like to kind of automate the process of updating one of the tables in one system.
At the moment it's a business requirement to do it like this, but I'm open to suggestions (and the clients are open too).
Edit
Ok, I doubt that what you describe in your comment exists out of the box, but you can easily write something like that:
Create a method (or function module if that floats your boat) that accepts the following:
iv_table_name TYPE string and
iv_filename TYPE string
This would be the method:
method upload_table.

  data: lt_table type ref to data,
        lx_root  type ref to cx_root.
  field-symbols: <table> type standard table.

  try.
      create data lt_table type table of (iv_table_name).
      assign lt_table->* to <table>.
      call method cl_gui_frontend_services=>gui_upload
        exporting
          filename            = iv_filename
          has_field_separator = abap_true
        changing
          data_tab            = <table>
        exceptions
          others              = 4.
      if sy-subrc <> 0.
        " some appropriate error handling
        " message id sy-msgid type 'I'
        "   number sy-msgno
        "   with sy-msgv1 sy-msgv2
        "   sy-msgv3 sy-msgv4.
        return.
      endif.
      modify (iv_table_name) from table <table>.
      " write: / sy-tabix, ' entries updated'.
    catch cx_root into lx_root.
      " lv_text = lx_root->get_text( ).
      " some appropriate error handling
      return.
  endtry.

endmethod.
This would still require that you make sure that the exported file matches the table that you want to import. However cl_gui_frontend_services=>gui_upload should return sy-subrc > 0 in that case, so you can bail out before you corrupt any data.
Original Answer:
I'll assume that you want to update a z-table and not a SAP standard table.
You will probably have to format your datafile a little bit to make it tab or comma delimited.
You can then upload the data file using cl_gui_frontend_services=>gui_upload
Then if you want to overwrite the existing data in the table you can use
modify zmydbtab from table it_importeddata.
If you do not want to overwrite existing entries you can use:
insert zmydbtab from table it_importeddata.
You will get a return code of sy-subrc = 4 if any of the keys already exists, but any new entries will be inserted.
Note
There are many reasons why you would NOT do this for an SAP standard table. Most prominent is that there is almost always more to the data model than what we are aware of. Also, when creating transactional data, there are often follow-on events or workflows that kick off, which will not happen if you're updating the database directly. As a rule of thumb, it is usually a bad idea to update SAP standard tables directly.
In that case try to find a BADI, or if that's not available, record a BDC and do the updates that way.
If the system landscape was set up correctly, your client would not need any kind of Basis operations support whatsoever to perform the transports. So instead of re-inventing the wheel, I'd strongly suggest catching up on what the CTS and TMS can do once they're set up with sensible settings.