Bulk load XML files into Cassandra

I'm looking into using Cassandra to store 50M+ documents that I currently have in XML format. I've been hunting around but I can't seem to find anything I can really follow on how to bulk load this data into Cassandra without needing to write some Java (not high on my list of language skills!).
I can happily write a script to convert this data into any format if it would make the loading easier, although CSV might be tricky given that the body of a document could contain just about anything!
Any suggestions welcome.
Thanks
Si

If you're willing to convert the XML to a delimited format of some kind (e.g. CSV), then here are a couple of options:
The COPY command in cqlsh. This actually got a big performance boost in a recent version of Cassandra; see the example after this list.
The cassandra-loader utility. This is a lot more flexible and has a bunch of different options you can tweak depending on the file format.
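For example, a COPY invocation might look like this (the keyspace, table, column names, and delimiter are placeholders to adapt to your data):

    cqlsh> COPY mykeyspace.documents (doc_id, title, body)
           FROM 'documents.csv'
           WITH HEADER = TRUE AND DELIMITER = '|';

Picking an uncommon delimiter and quoting fields properly in your conversion script helps when document bodies can contain just about anything.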
If you're willing to write code other than Java (for example, Python), there are Cassandra drivers available for a bunch of programming languages. No need to learn Java if you've got another language you're better with.
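As a rough sketch of the Python route, a bare-bones loader with the DataStax Python driver might look like this; the keyspace, table, and XML element names are made up for illustration:

    # A minimal sketch, assuming a keyspace "docs" with a table created
    # roughly like:
    #   CREATE TABLE docs.documents (doc_id text PRIMARY KEY, title text, body text);
    import glob
    import xml.etree.ElementTree as ET

    from cassandra.cluster import Cluster  # pip install cassandra-driver

    cluster = Cluster(['127.0.0.1'])
    session = cluster.connect('docs')

    # A prepared statement avoids re-parsing the query for every insert.
    insert = session.prepare(
        "INSERT INTO documents (doc_id, title, body) VALUES (?, ?, ?)")

    for path in glob.glob('xml/*.xml'):
        root = ET.parse(path).getroot()
        # The element names here are hypothetical; adjust to your XML layout.
        session.execute(insert, (root.findtext('id'),
                                 root.findtext('title'),
                                 root.findtext('body')))

    cluster.shutdown()

For 50M documents you'd want to parallelize the inserts (the driver has execute_async and helpers in cassandra.concurrent), but the shape of the code stays the same.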

Related

Binary file conversion in distributed manner - Spark, Flume or any other option?

We have a scenario where there will be a continuous incoming set of binary files (ASN.1 type, to be exact). We want to convert these binary files to a different format, say XML or JSON, and write them to a different location. I was wondering what the best architectural design would be to handle this kind of problem. I know we could use a Spark cluster for CSV, JSON, and Parquet kinds of files, but I'm not sure we could use it for binary file processing; alternatively, we could use Apache Flume to move files from one place to another, and even use an interceptor to convert the contents.
Ideally, we'd be able to switch the ASN.1 decoder whenever we have performance concerns without changing the underlying distributed-processing framework (e.g., to use a C++-, Python-, or Java-based decoder library).
In terms of scalability, reliability, and future-proofing your solution, I'd look at Apache NiFi rather than Flume. You can start by developing your own ASN.1 processor, or try using the patch that's already available but not yet part of a released version.
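On the decoder-swapping concern: whichever framework you choose, it helps to hide the decoder behind a small interface so a C++-, Python-, or Java-based implementation can be dropped in later. A minimal Python sketch, assuming the asn1tools library and a hypothetical schema file:

    # A minimal sketch of decoupling the decoder from the pipeline,
    # assuming the asn1tools library and a hypothetical "MySchema.asn".
    import json
    import asn1tools

    class Asn1Decoder:
        """Anything with a decode(bytes) -> dict method can be swapped in,
        including a thin wrapper around a C++ or Java decoder."""
        def __init__(self, schema_path, type_name):
            self._spec = asn1tools.compile_files(schema_path)
            self._type = type_name

        def decode(self, payload):
            return self._spec.decode(self._type, payload)

    def to_json(decoder, payload):
        # The pipeline depends only on the decode() contract, not the library.
        # default=str guards against non-JSON-serializable values like bytes.
        return json.dumps(decoder.decode(payload), default=str)

The distributed framework then only ever calls to_json (or its equivalent), so replacing the decoder doesn't touch the pipeline.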

Statistics with HDFS data

In our company, we use HDFS. So far everything works out and we can extract data by using queries.
In the past I had worked a lot with Project R. It was always great for my analyses. So I checked Project R and the support of HDFS (rbase, rhdfs,...).
Nevertheless, I am a little bit confused, since I found tons of tutorials where they do analyses with simple data saved in CSV files. Don't get me wrong, that's fine, but I want to ask if there is a possibility to write queries, extract the data, and do some statistics in one run.
Or in other words: When we talk about statistics for data stored in HDFS, how do you handle this?
Thanks a lot, and hopefully some of you can help me see the pros and cons here.
All the best -
Peter
You might like to check out Apache Hive and Apache Spark. There are many other options, but I am not sure whether you are asking how to work on data from HDFS when it is not handed to you as a file.
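To make the "one run" part concrete, here's a minimal PySpark sketch; the HDFS path and column names are placeholders:

    # A minimal PySpark sketch; the HDFS path and column names are placeholders.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hdfs-stats").getOrCreate()

    # Read directly from HDFS; no intermediate CSV export step.
    df = spark.read.parquet("hdfs:///data/events")

    # Query and summarize in the same job.
    df.createOrReplaceTempView("events")
    result = spark.sql("SELECT category, AVG(value) AS mean_value "
                       "FROM events GROUP BY category")
    result.show()

    # Or hand a sample to pandas/R for the statistics you already know.
    sample = df.sample(fraction=0.01).toPandas()

There's also SparkR, which would let you stay closer to your R background while working against the same HDFS data.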

Sorting data from Excel spreadsheets into new files

So my issue comes from the Excel data I currently have, which I need to convert into 4 separate forms, each with different details. The specifics don't really matter, but what I'm trying to do is write some kind of script that would go into this data and extract the stuff I need, saving me tons of time copying and pasting.
The problem is, I'm not really sure where to start. I have done some research, so I am familiar with CSV files, and I already have a pretty good grasp of Java. What would be the best way to approach this problem? From what I have researched, Python is very helpful for this kind of string manipulation, but I also know it could be done in Java using buffered reads/file writes, though I feel like that could get really clunky.
Thanks
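For illustration, the Python route could be as simple as this sketch, assuming the spreadsheet is exported to CSV and rows are routed on a hypothetical form_type column:

    # A minimal sketch: split one exported CSV into separate files by a
    # hypothetical "form_type" column. File and column names are placeholders.
    import csv

    writers = {}
    with open('master.csv', newline='') as src:
        reader = csv.DictReader(src)
        for row in reader:
            form = row['form_type']
            if form not in writers:
                # Open one output file per form, writing the header once.
                out = open(f'{form}.csv', 'w', newline='')
                writer = csv.DictWriter(out, fieldnames=reader.fieldnames)
                writer.writeheader()
                writers[form] = (out, writer)
            writers[form][1].writerow(row)

    for out, _ in writers.values():
        out.close()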

Is there any open-source tool for converting an XML schema to a database schema on Linux?

Is there any open-source tool that converts an XML schema to a database schema on Linux? All I need is for it to read the XML schema, generate the corresponding database schema, and create tables from it. I tried Google, and all I could find was xsd2db, which is written in C# and of no use to me. I am using CentOS and my database is PostgreSQL. Any help is appreciated. Thanks in advance.
Native support appears to be on the way, but I can't find anything native yet. I'm also not finding any decent tools to do the job.
So, I thought this would be a neat weekend project to learn a bit more about XSD, and I created xsd2pgsql to handle this. It's still pretty rough around the edges, so I'd like you to try it out and let me know of any problems you have. Or fork it if you'd like to help.
XML isn't the greatest format to represent a database, since it's 3-D and a DB is pretty much 2-D. So this script makes some assumptions: all element children of the root become the primary table, and any complexType after that will be a table. That said, this should work on most XML Schemas (or at least the few I've tested).
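To illustrate the kind of mapping those assumptions imply, here's a hypothetical example (not the script's exact output):

    <!-- A complexType in the schema... -->
    <xs:complexType name="bookType">
      <xs:sequence>
        <xs:element name="title" type="xs:string"/>
        <xs:element name="price" type="xs:decimal"/>
      </xs:sequence>
    </xs:complexType>

    -- ...becomes a table along these lines:
    CREATE TABLE book (
        title VARCHAR,
        price NUMERIC
    );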
You can get all the options with the -h flag. Basically, you provide the XSD file(s) as arguments, and you can use the options to change the behavior slightly or to have it run the SQL directly on your DB. If it's a production system, I'd recommend not connecting directly to the DB; instead, check whether the SQL output is good to go and make any adjustments first.
Here's an example usage with the sample files in the repository:

    python xsd2pgsql.py -f sample-2.xsd sample.xsd
NOTE: Currently this doesn't handle any relations/references between tables/XML complex types. You'll have to add those and any indexes you want after the fact. Custom namespaces aren't yet supported either.
Hope this helps.

Data manipulation environment

I am looking for something* to aid me in manipulating and interpreting data.
Data like names, addresses, and that sort of thing.
Currently, I am making heavy use of Python to find whether one piece of information relates to another, but I am noticing that a lot of my code could easily be substituted with some sort of query language.
Mainly, I need an environment where I can import data in any format, be it XML, HTML, CSV, Excel, or database files. And I want the software to read it and tell me what columns there are, etc., so that I only have to worry about writing code that interprets it.
Does this sound concrete enough? If so, is anyone in possession of such elegant software?
*Can be a programming language, IDE, combination of those.
Have you looked at the Pandas module in Python? http://pandas.pydata.org/pandas-docs/stable/
When combined with the IPython notebook, it makes a great data manipulation platform.
I think it may let you do a lot of what you want to do. I am not sure how well it handles HTML, but it's built to handle CSV, Excel, and database files.
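For example (the file and column names here are made up):

    # A minimal pandas sketch; file names and column names are made up.
    import pandas as pd

    people = pd.read_csv('people.csv')         # also: pd.read_excel, pd.read_html,
    contacts = pd.read_excel('contacts.xlsx')  # pd.read_sql for database tables

    # Pandas tells you what columns you have...
    print(people.columns.tolist())
    print(people.dtypes)

    # ...and relating one piece of information to another is just a join:
    merged = people.merge(contacts, on='name', how='left')
    print(merged.head())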
