What is the best way to load Excel files to a Hive table?
Is there a command to change them to tab-delimited format?
You could look at something with Apache Tika parsing, or Apache POI parsing for xls/xlsx spreadsheets.
https://poi.apache.org/
https://tika.apache.org/
You'll need a Java-ish language to use this stuff, so consider Groovy, Jython, Clojure, Scala, or, if you know it, Java.
I'm doing something similar with a bunch of xlsx files already in HDFS, with this sort of pre-processing before the output ends up in Hive. Hopefully your xlsx sheets are somewhat straightforward and just resemble 2-D datasets. (Embedded pivot tables, charts, etc. don't come across into Hive with any context.)
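If you'd rather avoid the Java route, the same pre-processing can be sketched in Python with openpyxl instead of POI. A minimal sketch (file names are made up) that flattens one sheet to tab-delimited text, ready to load into Hive:

    # Minimal sketch: flatten an xlsx sheet to tab-delimited text for Hive.
    # openpyxl is a Python alternative to Apache POI; file names are made up.
    import openpyxl

    wb = openpyxl.load_workbook("input.xlsx", read_only=True, data_only=True)
    ws = wb.active  # assumes the data is a plain 2-D grid on the first sheet

    with open("output.tsv", "w", encoding="utf-8") as out:
        for row in ws.iter_rows(values_only=True):
            # Empty cells come back as None; Hive wants an empty field instead.
            fields = ["" if cell is None else str(cell) for cell in row]
            out.write("\t".join(fields) + "\n")

From there, a plain tab-delimited Hive table and LOAD DATA INPATH will pick it up.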
Good luck, it's not pretty... xls is tough to work with because it's just so flexible.
You can try the newest version of the HadoopOffice library, which has a HiveSerde for Excel files: https://github.com/ZuInnoTe/hadoopoffice/wiki
I am using the box/spout library for exporting simple Excel files. It is no longer maintained, and I wonder what solution I should choose for current and future projects.
Box/spout was much faster than the library I used before, and as long as you don't need fancy formatting, it did what was needed, and sufficiently fast.
I wonder which library to use instead now. An export to CSV isn't an option, since my users are used to the comfort of an Excel file, and most are not able to open a CSV file in Excel and convert it to Excel format.
I am currently using Symfony 5 and PHP 8.1 in an Alpine Linux container.
I know it isn't a direct code question, but I would be glad to hear your experience with or approach to Excel exports in the year 2022.
P.S.: Before that I used PHPExcel, which was very slow when you had many rows to export. It got a major refactoring and is now called PhpSpreadsheet, but I don't know if they fixed the performance issues with many rows.
In our company, we use HDFS. So far everything has worked out, and we can extract data using queries.
In the past I worked a lot with Project R. It was always great for my analyses. So I checked Project R's support for HDFS (rbase, rhdfs, ...).
Nevertheless, I am a little bit confused, since I found tons of tutorials where they do analyses on simple data saved in CSV files. Don't get me wrong, that's fine, but I want to ask whether there is a possibility to write queries, extract the data, and do some statistics in one run.
Or in other words: When we talk about statistics for data stored in HDFS, how do you handle this?
Thanks a lot, and hopefully some of you can help me see the pros and cons here.
All the best -
Peter
You might like to check out Apache Hive and Apache Spark. There are many other options, but I am not sure whether you are asking how to work on data from HDFS when it is not handed down to you as a file.
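If it helps, here is a rough PySpark sketch of the "write queries, extract the data, and do statistics in one run" idea (the path, view, and column names are made up):

    # Rough sketch: query data in HDFS and compute statistics in one job.
    # Assumes a working Spark installation; names below are placeholders.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hdfs-stats").getOrCreate()

    # Read straight from HDFS (CSV here; Parquet, ORC, or Hive tables work too).
    df = spark.read.csv("hdfs:///data/measurements.csv",
                        header=True, inferSchema=True)

    # A SQL-style query plus summary statistics, all in the same run.
    df.createOrReplaceTempView("measurements")
    subset = spark.sql(
        "SELECT sensor, value FROM measurements WHERE value IS NOT NULL")
    subset.groupBy("sensor").agg({"value": "mean"}).show()
    subset.describe("value").show()

    spark.stop()

And since you already know R, SparkR gives you much the same from R.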
I'm looking into using Cassandra to store 50M+ documents that I currently have in XML format. I've been hunting around, but I can't seem to find anything I can really follow on how to bulk-load this data into Cassandra without needing to write some Java (not high on my list of language skills!).
I can happily write a script to convert this data into any format if it would make the loading easier, although CSV might be tricky, given that the body of a document could contain just about anything!
Any suggestions welcome.
Thanks
Si
If you're willing to convert the XML to a delimited format of some kind (e.g. CSV), then here are a couple of options:
The COPY command in cqlsh. This actually got a big performance boost in a recent version of Cassandra.
The cassandra-loader utility. This is a lot more flexible and has a bunch of different options you can tweak depending on the file format.
If you're willing to write code other than Java (for example, Python), there are Cassandra drivers available for a bunch of programming languages. No need to learn Java if you've got another language you're better with.
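To give a flavour of the Python route: with the DataStax Python driver (pip install cassandra-driver), a simple loader is only a few lines. A rough sketch, with the keyspace, table, and column names made up:

    # Rough sketch of a loader using the DataStax Python driver.
    # Keyspace, table, and column names are placeholders.
    import csv
    from cassandra.cluster import Cluster

    cluster = Cluster(["127.0.0.1"])
    session = cluster.connect("mykeyspace")

    insert = session.prepare(
        "INSERT INTO documents (doc_id, title, body) VALUES (?, ?, ?)")

    with open("documents.csv", newline="", encoding="utf-8") as f:
        for doc_id, title, body in csv.reader(f):
            session.execute(insert, (doc_id, title, body))

    cluster.shutdown()

For 50M+ rows you'd want execute_async() or batching rather than one synchronous insert per row, but the shape is the same.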
So my issue comes from the Excel data I currently have, which I need to convert into 4 separate forms, each with different details. The specifics don't really matter, but what I'm trying to do is code some kind of script that would dig into this data and extract the stuff I need, therefore saving me tons of time copying and pasting.
The problem is, I'm not really sure where to start. I have done some research, so I am familiar with CSV files, and I already have a pretty good grasp of Java. What would be the best way to approach this problem? From what I have researched, Python is very helpful at these string-type manipulations, but I also know that it could be done in Java using buffered reads/file writes, though I feel like that could get really clunky.
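For context, here's roughly the shape I'm imagining in Python (a rough sketch that assumes the data is exported to CSV first; the column names and the four forms are made-up placeholders):

    # Rough sketch: read one exported CSV, split fields across four files.
    # Column names and the four "forms" are made-up placeholders.
    import csv

    with open("master.csv", newline="", encoding="utf-8") as src:
        rows = list(csv.DictReader(src))

    # Each form gets its own subset of columns.
    forms = {
        "form_a.csv": ["name", "address"],
        "form_b.csv": ["name", "order_id"],
        "form_c.csv": ["order_id", "amount"],
        "form_d.csv": ["name", "amount"],
    }

    for filename, columns in forms.items():
        with open(filename, "w", newline="", encoding="utf-8") as out:
            writer = csv.DictWriter(out, fieldnames=columns,
                                    extrasaction="ignore")
            writer.writeheader()
            writer.writerows(rows)

csv.DictWriter with extrasaction="ignore" does the column picking, so there is no manual string slicing.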
Thanks
I am looking for something* to aid me in manipulating and interpreting data.
Data such as names, addresses, and that sort of thing.
Currently, I am making heavy use of Python to find whether one piece of information relates to another, but I am noticing that a lot of my code could easily be substituted with some sort of query language.
Mainly, I need an environment where I can import data in any format, be it XML, HTML, CSV, Excel, or database files. And I wish for the software to read it and tell me what columns there are etc., so that I only have to worry about writing code that interprets it.
Does this sound concrete enough? If so, is anyone in possession of such elegant software?
*It can be a programming language, an IDE, or a combination of those.
Have you looked at the Pandas module in Python? http://pandas.pydata.org/pandas-docs/stable/
When combined with the IPython Notebook, it makes a great data-manipulation platform.
I think it may let you do a lot of what you want to do. I am not sure how well it handles HTML, but it is built to handle CSV, Excel, and database files.
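For instance, loading a file and seeing its structure is only a couple of lines. A small sketch with made-up file and column names:

    # Small sketch: load data and let pandas report the structure.
    # File and column names are made up.
    import pandas as pd

    df = pd.read_csv("people.csv")     # read_excel / read_sql work similarly
    print(df.columns.tolist())         # "tell me what columns there are"
    print(df.dtypes)                   # inferred type of each column

    # SQL-like operations replace hand-rolled matching code:
    other = pd.read_excel("addresses.xlsx")
    merged = df.merge(other, on="name", how="left")
    print(merged.head())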