I have to extract data from XML files with the size of several hundreds of MB in a Google Cloud Function and I was wondering if there are any best practices?
Since I am used to nodejs I was looking at some popular libraries like fast-xml-parser but it seems cumbersome if you only want specific data from a huge xml. I am also not sure if there are any performance issues when the XML is too big. Overall this does not feel like the best solution to parse and extract data from huge XMLs.
Then I was wondering if I could use BigQuery for this task where I simple convert the xml to json and throw it into a Dataset where I then can use a query to retrieve the data I want.
Another solution could be to use python for the job since it is good in parsing and extracting data from a XML so even though I have no experience in python I was wondering if this path could still be
the best solution?
If anything above does not make sense or if one solution is preferable to the other or if anyone can share any insights I would highly appreciate it!
I suggest you to check this article in which they discuss how to load XML data into BigQuery using Python Dataflow. I think that this approach may work in your situation.
Basically what they suggest is:
To parse the xml into a Python dictionary using the package xmltodict.
Specify a schema of the output table in BigQuery.
Use a Beam pipeline to take an XML file and use it to populate a BigQuery table.
Related
I am using turtle files containing biographical information for historical research. Those files are provided by a major library and most of the information in the files is not explicit. While people's professions, for instance, are sometimes stated alongside links to the library's URIs, I only have URIs in the majority of cases. This is why I will need to retrieve the information behind them at some point in my workflow, and I would appreciate some advice.
I want to use Python's RDFLib for parsing the .ttl files. What is your recommended workflow? Should I read the prefixes I am interested in first, then store the results in .txt (?) and then write a script to retrieve the actual information from the web, replacing the URIs?
I have also seen that there are ways to convert RDFs directly to CSV, but although CSV is nice to work with, I would get a lot of unwanted "background noise" by simply converting all the data.
What would you recommend?
RDFlib's all about working with RDF data. If you have RDF data, my suggestion is to do as much RDF-native stuff that you can and then only export to CSV if you want to do something like print tabular results or load into Pandas DataFrames. Of course there are always more than one way to do things, so you could manipulate data in CSV, but RDF, by design, has far more information in it than a CSV file can so when you're manipulating RDF data, you have more things to get hold of.
most of the information in the files is not explicit
Better phrased: most of the information is indicated with objects identified by URIs, not given as literal values.
I want to use Python's RDFLib for parsing the .ttl files. What is your recommended workflow? Should I read the prefixes I am interested in first, then store the results in .txt (?) and then write a script to retrieve the actual information from the web, replacing the URIs?
No! You should store the ttl files you can get and then you may indeed retrieve all the other data referred to by URI but, presumably, that data is also in RDF form so you should download it into the same graph you loaded the initial ttl files in to and then you can have the full graph with links and literal values it it as your disposal to manipulate with SPARQL queries.
I need to create data driven unit tests for different APIs in karate framework. The various elements to be passed in the JSON payload should be taken as input from an excel file.
A few points:
I recommend you look at Karate's built-in data-table capabilities, it is far more readable, integrates into your test-script and you won't need to depend on other software. Refer these examples: call-table.feature and dynamic-params.feature
Next I would recommend using JSON instead of an Excel or CSV file, it is natively supported by Karate: call-json-array.feature
Finally, if you really wanted to, you can call any Java code and if you return data in a Map / List form, it will be ready for Karate to use. This example shows how to read a database via JDBC: dogs.feature. So although this is not built into Karate, just write a simple utility to read a CSV or Excel file and you can do pretty much anything Java can do.
EDIT: Karate now supports CSV files that can be used to even do data-driven testing: https://github.com/intuit/karate#csv-files
I am working on an exploratory data analysis using python on a huge Dataset (~20 Million records and 10 columns). I would be segmenting, aggregating data and create some visualizations, I might as well create some decision trees liner regression models using that dataset.
Because of the large data set I need to use a data-frame that allows out of core data storage. Since I am relatively new to Python and working with large data-sets, i want to use a method which would allow me to easily use sklearn on my data-sets. I'm confused weather to use Py-tables, Blaze or s-Frame for this exercise. If someone could help me understand what are their pros and cons. What are the factors that are important in this kind of decision making that would be much appreciated.
good question! one option you may consider is to not use any of the libraries aformentioned, but instead read and process your file chunk-by-chunk, something like this:
csv="""\path\to\file.csv"""
pandas allows to read data from (large) files chunk-wise via a file-iterator:
it = pd.read_csv(csv, iterator=True, chunksize=20000000 / 10)
for i, chunk in enumerate(it):
...
I have an API service where user push arbitrary json objects, these json objects can be nested and others just normal. I am facing some challenges on how to effectively convert the incoming json objects into something much more suitable for storage on cassandra. Advice on how to handle is highly appreciated.
Thanks.
As suggested by #omnibear you should take a look at the linked answer. Basically what you need to figure out before deciding on a solution is the answer to the question: "how do you process each JSON after storing it?". Some possible scenarios:
if you process it as it is, then you can store it as a blob
if you have situations where you need to modify a predefined subset of the attributes of the JSON, then you might want to store those as columns and the rest of the JSON as a blob
I am using Hadoop to process text messages(SMS). but I am not sure of the best way to pre-process these data so that I can do an efficient search. for example, after preprocessing the data if someone searches for 'NY' I will be able to display the messages containing the word 'NY'.
Is it advisable to write the pre-processed data to an xml file and not to a database.
NOTE: I have around 200K text messages in an .csv file.
The way I import preprocessed data to hdfs is to first import the data (csv file in your case) into a database and then create a table view that fine-tunes it to your needs. Then I import the data into hdfs using Sqoop. More Information on sqoop can be found here
http://www.cloudera.com/blog/2009/06/introducing-sqoop/
for doing a sqoop import from a database take a look at
http://archive.cloudera.com/cdh/3/sqoop/SqoopUserGuide.html#_connecting_to_a_database_server
You probably want to index the text messages, maybe using something like Lucene.
Go for
Solr (Especially used for text mining)
Powerful full-text search
Provides dynamic clustering
Provides database integration as well
Supports .csv,.xml,word,pdf..
Highly scalable