How to move HDFS files as ORC files to S3 using DistCp? - apache-spark

I have a requirement to move text files from HDFS to AWS S3. The files in HDFS are plain text and non-partitioned. After migration, the data in S3 should be stored as ORC and partitioned on a specific column. Finally, a Hive table is created on top of this data.
One way to achieve this is with Spark. But I would like to know: is it possible to use DistCp to copy the files and convert them to ORC?
I would also like to know whether there is a better option to accomplish this task.
Thanks in Advance.

DistCp is just a copy command; it doesn't convert anything. What you are describing is a query that generates ORC-formatted output, so you will have to use a tool like Hive, Spark, or Hadoop MapReduce to do it.
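Since DistCp cannot do the conversion, the copy-and-convert step is typically a small Spark job. A minimal PySpark sketch, where the input/output paths, the partition column `event_date`, the table name `my_table`, and the header/delimiter options are all assumptions:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hdfs-text-to-s3-orc")
         .enableHiveSupport()          # needed so the table DDL below registers in Hive
         .getOrCreate())

# Read the non-partitioned text files from HDFS (header/delimiter are assumptions).
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("hdfs:///data/input/"))

# Write ORC to S3, partitioned on the chosen column (must exist in the data).
(df.write
   .mode("overwrite")
   .partitionBy("event_date")
   .orc("s3a://my-bucket/orc/my_table/"))

# Finally, create a table over the ORC output so Hive can query it.
spark.sql("""
    CREATE TABLE IF NOT EXISTS my_table
    USING ORC
    LOCATION 's3a://my-bucket/orc/my_table/'
""")
```

DistCp can still be useful afterwards for bulk-copying the already-converted ORC files, but the format change itself has to happen in a job like this.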

Related

Transform CSV into Parquet using Apache Flume?

I have a question: is it possible to do ETL on data using Flume?
To be more specific, I have Flume configured with a spoolDir source that contains CSV files, and I want to convert those files into Parquet before storing them in Hadoop. Is that possible?
If it's not possible, would you recommend transforming them before storing them in Hadoop, or transforming them with Spark on Hadoop?
I'd probably suggest using NiFi to move the files around. Here's a specific tutorial on how to do that with Parquet. NiFi is generally regarded as the successor to Apache Flume.
Flume partial answers (not Parquet):
If you are flexible on format, you can use an Avro sink. You can also use a Hive sink, which will create a table in ORC format. (You can check whether the sink definition also allows Parquet, but as far as I know ORC is the only supported format.)
You could then use a simple Hive script to move the data from the ORC table into a Parquet table, converting the files into the Parquet files you asked for.
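The ORC-to-Parquet hop suggested above can be sketched in a few lines of PySpark running a Hive CTAS; the table names here are assumptions:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("orc-to-parquet")
         .enableHiveSupport()          # required to run Hive DDL against the metastore
         .getOrCreate())

# Copy the ORC table the Flume Hive sink produced (name assumed) into a
# new Parquet-backed table via a Hive CREATE TABLE ... AS SELECT.
spark.sql("""
    CREATE TABLE flume_parquet_table
    STORED AS PARQUET
    AS SELECT * FROM flume_orc_table
""")
```

The same `CREATE TABLE ... STORED AS PARQUET AS SELECT` statement could also be run directly in Hive on a schedule, which is the "simple script" the answer refers to.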

Is there a way to read a parquet file with apache flink?

I'm new to Apache Flink and I cannot find a way to read a Parquet file from the file system.
I came from Spark, where a simple spark.read.parquet("...") did the job.
Is it possible?
Thank you in advance.
Actually, it depends on the way you are going to read the Parquet files.
If you are trying to simply read Parquet files and want to leverage a DataStream connector, this Stack Overflow question can serve as an entry point and a working example.
If you prefer the Table API, Table & SQL Connectors - Parquet Format can be helpful to start from.
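With the Table API route, reading Parquet typically means declaring a filesystem table with the `parquet` format. A minimal PyFlink sketch, where the path and schema are assumptions and the flink-parquet dependency must be on the classpath:

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

# Batch mode is the natural fit for reading a static Parquet file.
t_env = TableEnvironment.create(EnvironmentSettings.in_batch_mode())

# Declare a table over the Parquet files (columns and path are assumed).
t_env.execute_sql("""
    CREATE TABLE parquet_input (
        id BIGINT,
        name STRING
    ) WITH (
        'connector' = 'filesystem',
        'path' = 'file:///data/input.parquet',
        'format' = 'parquet'
    )
""")

# Query it like any other table.
t_env.execute_sql("SELECT * FROM parquet_input").print()
```

Unlike Spark's `spark.read.parquet(...)`, Flink's Table API wants the schema declared up front in the DDL.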

Renaming Exported files from Spark Job

We are currently running a Spark job on Databricks which processes our data lake in S3.
Once the processing is done we export our result to an S3 bucket using a normal
df.write()
The issue is that when we write the dataframe to S3, the file names are controlled by Spark, but per our agreement we need to rename these files to meaningful names.
Since S3 doesn't have a rename feature, we are currently using boto3 to copy the file under the expected name and delete the original.
This process is very complex and does not scale as more clients are onboarded.
Is there a better solution to rename files exported from Spark to S3?
It's not possible to do it directly in Spark's save.
Spark uses the Hadoop file output format, which requires data to be written in partitions - that's why you get part- files. If the data is small enough to fit into memory, one workaround is to convert it to a pandas DataFrame and save it as CSV from there, which lets you choose the file name.
df_pd = df.toPandas()
df_pd.to_csv("path")
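If you stay with the copy-and-delete approach for larger data, the boto3 side can at least be isolated into a small helper. A sketch (the bucket, keys, and naming scheme are assumptions); the key-mapping logic is kept as pure Python so it can be reused and tested without S3 access:

```python
import posixpath

def renamed_key(part_key: str, new_name: str) -> str:
    """Map a Spark part- file key to a meaningful name in the same prefix."""
    prefix = posixpath.dirname(part_key)
    return posixpath.join(prefix, new_name)

def rename_s3_object(bucket: str, part_key: str, new_name: str) -> str:
    """S3 has no rename: copy the object to the new key, then delete the old one."""
    import boto3  # imported here so the pure helper above works without boto3 installed
    s3 = boto3.client("s3")
    target = renamed_key(part_key, new_name)
    s3.copy_object(Bucket=bucket,
                   CopySource={"Bucket": bucket, "Key": part_key},
                   Key=target)
    s3.delete_object(Bucket=bucket, Key=part_key)
    return target
```

Usage would look like `rename_s3_object("my-bucket", "exports/part-00000-uuid.csv", "report.csv")`. Note that for objects over 5 GB, `copy_object` must be replaced with a multipart copy.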

hadoop: In which format data is stored in HDFS

I am loading data into HDFS using Spark. How is the data stored in HDFS? Is it encrypted? Is it possible for the data to be read by unauthorized users? What about security for the existing data?
I want to understand in detail how the system behaves.
HDFS is a distributed file system which supports various formats: plain text (csv, tsv files) as well as Parquet, ORC, JSON, etc.
While saving the data to HDFS from Spark, you need to specify the format.
You can't read Parquet files as plain text without Parquet tools, but Spark can read them.
The security of HDFS is governed by Kerberos authentication, which you need to set up explicitly.
Note that the default format Spark uses to read and write data is Parquet.
HDFS can store data in many formats, and Spark can read them (csv, json, parquet, etc.). While writing back, specify the format you wish to save the file in.
Reading up on the commands below will also help:
hadoop fs -ls /user/hive/warehouse
hadoop fs -get (this will get files from HDFS to your local file system)
hadoop fs -put (this will put files from your local file system into HDFS)
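Specifying the format at write time, as described above, looks like this in PySpark; the HDFS paths and the sample data are assumptions:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-formats").getOrCreate()

# A small sample dataframe to illustrate the different on-disk formats.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])

# Same dataframe, three different formats in HDFS.
df.write.mode("overwrite").csv("hdfs:///data/demo_csv")
df.write.mode("overwrite").json("hdfs:///data/demo_json")
df.write.mode("overwrite").parquet("hdfs:///data/demo_parquet")  # Spark's default format
```

Listing the output directories with `hadoop fs -ls` afterwards shows the part- files each writer produced.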

Does presto require a hive metastore to read parquet files from S3?

I am trying to generate Parquet files in S3 using Spark, with the goal that Presto can later be used to query the Parquet. Basically, this is how it looks:
Kafka-->Spark-->Parquet<--Presto
I am able to generate Parquet in S3 using Spark, and it's working fine. Now I am looking at Presto, and what I think I found is that it needs a Hive metastore in order to query Parquet. I could not make Presto read my Parquet files even though Parquet stores the schema. So, does it mean that at the time of creating the Parquet files, the Spark job also has to store metadata in a Hive metastore?
If that is the case, can someone help me find an example of how it's done? To add to the problem, my data schema is changing, so to handle it I am creating a programmatic schema in the Spark job and applying it while creating the Parquet files. If I am creating the schema in the Hive metastore, it needs to be done with this in mind.
Or could you shed light on it if there is any better alternative way?
You keep the Parquet files on S3. Presto's S3 capability is a subcomponent of the Hive connector. You can let Spark define the tables in the Hive metastore, or you can use Presto for that, e.g.
create table hive.default.xxx (<columns>)
with (format = 'parquet', external_location = 's3://s3-bucket/path/to/table/dir');
(Depending on Hive metastore version and its configuration, you might need to use s3a instead of s3.)
Technically, it should be possible to create a connector that infers tables' schemata from Parquet headers, but I'm not aware of an existing one.
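If you prefer registering the table from the Spark side instead, `saveAsTable` writes the Parquet files and records the schema in the Hive metastore in one step, which also absorbs the evolving programmatic schema. A sketch; the source path, table name, and S3 location are assumptions:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("parquet-for-presto")
         .enableHiveSupport()      # point Spark at the same metastore Presto uses
         .getOrCreate())

df = spark.read.json("s3a://s3-bucket/raw/")   # assumed source; schema may evolve

(df.write
   .mode("overwrite")
   .format("parquet")
   .option("path", "s3a://s3-bucket/path/to/table/dir")  # external table location
   .saveAsTable("default.xxx"))    # then visible to Presto as hive.default.xxx
```

Because the schema in the metastore is taken from the dataframe at write time, re-running the job after a schema change updates the table definition along with the data.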
