Is there a way to read a parquet file with apache flink? - apache-spark

I'm new on Apache Flink and I cannot find a way to read a parquet file from the file system.
I came from Spark where a simple "spark.read.parquet("...")" did the job.
Is it possible?
Thank you in advance

Actually, it depends on the way you are going to read the Parquet files.
If you simply want to read Parquet files and leverage a DataStream connector, this Stack Overflow question can be the entry point and a working example.
If you prefer the Table API, Table & SQL Connectors - Parquet Format is a helpful place to start from; see the sketch below.
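For the Table API route, here is a minimal PyFlink sketch of a filesystem source with the Parquet format. The schema and path are placeholders, and it assumes the flink-parquet format dependency is available on the classpath.

from pyflink.table import EnvironmentSettings, TableEnvironment

# Batch Table environment; a streaming one works the same way for this source.
t_env = TableEnvironment.create(EnvironmentSettings.in_batch_mode())

# Placeholder schema and path; 'format' = 'parquet' needs the flink-parquet dependency.
t_env.execute_sql("""
    CREATE TABLE parquet_source (
        id   INT,
        name STRING
    ) WITH (
        'connector' = 'filesystem',
        'path'      = 'file:///tmp/parquet-input',
        'format'    = 'parquet'
    )
""")

t_env.sql_query("SELECT * FROM parquet_source").execute().print()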

Related

Transform CSV into Parquet using Apache Flume?

I have a question: is it possible to perform ETL on data using Flume?
To be more specific, I have Flume configured with a spoolDir source that contains CSV files, and I want to convert those files into Parquet files before storing them in Hadoop. Is that possible?
If it's not possible, would you recommend transforming them before storing them in Hadoop, or transforming them with Spark on Hadoop?
I'd probably suggest using NiFi to move the files around. Here's a specific tutorial on how to do that with Parquet. I consider NiFi the replacement for Apache Flume.
Flume partial answers (not Parquet):
If you are flexible on format, you can use an Avro sink. You can also use a Hive sink, which will create a table in ORC format. (You can check whether the definition also allows Parquet, but as far as I have heard, ORC is the only supported format.)
You could then likely use a simple script with Hive to move the data from the ORC table into a Parquet table, converting the files into the Parquet files you asked for; see the sketch below.
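As a sketch of that last step under some assumptions: this runs the ORC-to-Parquet copy as a CREATE TABLE ... AS SELECT through a Hive-enabled SparkSession rather than the Hive CLI, and the table names are placeholders.

from pyspark.sql import SparkSession

# Hypothetical table names; assumes the ORC table is already registered in the metastore.
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
spark.sql("""
    CREATE TABLE events_parquet
    STORED AS PARQUET
    AS SELECT * FROM events_orc
""")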

Most optimal method to check length of a parquet table in dbfs with pyspark?

I have a table on dbfs that I can read with pyspark, but I only need to know its length (number of rows). I know I could just read the file and call table.count() to get it, but that would take some time.
Is there a better way to solve this?
I am afraid not.
Since you are using dbfs, I suppose you are using Delta format with Databricks. So, theoretically, you could check the metastore, but:
"The metastore is not the source of truth about the latest information of a Delta table."
https://docs.delta.io/latest/delta-batch.html#control-data-location
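So in practice the reliable option is still a plain count. A minimal sketch, assuming a Databricks SparkSession named spark, the Delta format, and a placeholder path:

# Full count over the table; catalog/metastore statistics, when present, may be stale.
n_rows = spark.read.format("delta").load("dbfs:/path/to/table").count()
print(n_rows)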

Read file while it is being written by spark structured streaming

I am using Spark Structured Streaming for my application. I have a use case where I need to read a file while it is being written.
I tried Spark Structured Streaming as below:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

sch = StructType([StructField("ID", IntegerType(), True), StructField("COUNTRY", StringType(), True)])
df_str = spark.readStream.format("csv").schema(sch).option("header", True).option("delimiter", ",").load("<Load Path>")
query = df_str.writeStream.format("parquet").outputMode("append").trigger(processingTime="10 seconds").option("path", "<HDFS location>").option("checkpointLocation", "<chckpoint_loc>").start()
But it only reads the file initially; after that, the file is not read incrementally. As a workaround, I am thinking of writing the file to a temp directory, creating a new file after some time, and copying it into the directory the Spark Structured Streaming job reads from, but this introduces latency.
Is there any other way to handle this (I cannot use Kafka)?
Sorry if this question is not a good fit for Stack Overflow, but I did not find any other place to ask it.
Unfortunately, Spark doesn't support it. The unit of the file stream source is a "file": Spark assumes that the files it reads are immutable, meaning they shouldn't be changed once they're placed in the source path. This makes offset management much simpler (there is no need to track offsets within files); only the number of files in the source path keeps growing. A reasonable limitation, but still a limitation.
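As a sketch of the workaround described in the question, under stated assumptions: finish writing the file elsewhere, then make it appear in the watched directory in a single step. This assumes a local filesystem (on HDFS you would do the equivalent with hdfs dfs -mv or FileSystem.rename); it avoids Spark picking up a half-written file, but it does not remove the latency of waiting for the file to be complete.

import os
import shutil
import uuid

def publish_atomically(finished_file, watched_dir):
    # Copy to a dot-prefixed temporary name first; Spark's file listing
    # typically skips hidden files starting with "." or "_".
    tmp_path = os.path.join(watched_dir, "." + uuid.uuid4().hex + ".tmp")
    shutil.copy(finished_file, tmp_path)
    # A same-filesystem rename then makes the complete file visible in one step.
    final_path = os.path.join(watched_dir, os.path.basename(finished_file))
    os.rename(tmp_path, final_path)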

how to move hdfs files as ORC files in S3 using distcp?

I have a requirement to move text files from HDFS to AWS S3. The files in HDFS are plain text and non-partitioned. After migration, the output in S3 should be in ORC format and partitioned on a specific column. Finally, a Hive table is created on top of this data.
One way to achieve this is using Spark, but I would like to know: is it possible to use DistCp to copy the files as ORC?
I would also like to know whether any better option is available to accomplish this task.
Thanks in Advance.
DistCp is just a copy command; it doesn't convert anything. What you are trying to do is execute a query that generates ORC-formatted output, so you will have to use a tool like Hive, Spark, or Hadoop MapReduce to do it.
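If you do go the Spark route mentioned in the question, a hypothetical PySpark sketch of the text-to-ORC conversion could look like the following, assuming an existing SparkSession named spark; the paths, delimiter, and partition column are placeholders. A Hive table can then be created on top of the S3 location.

# Read the non-partitioned text files from HDFS (header/delimiter are assumptions).
df = (spark.read
        .option("header", True)
        .option("delimiter", "\t")
        .csv("hdfs:///data/input/text"))

# Write back out as ORC, partitioned on a chosen column, directly to S3.
(df.write
   .mode("overwrite")
   .partitionBy("partition_col")
   .orc("s3a://my-bucket/warehouse/my_table/"))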

Spark Streaming : source HBase

Is it possible to set up a Spark Streaming job to keep track of an HBase table and read new/updated rows every batch? The blog here says that HDFS files are among the supported sources, but they seem to be using the following static API:
sc.newAPIHadoopRDD(..)
I can't find any documentation on this. Is it possible to stream from HBase using the Spark streaming context? Any help is appreciated.
Thanks!
The link provided does the following:
It reads the streaming data, converts it into HBase Puts, and then adds them to an HBase table. Up to that point it is streaming, which means your ingestion process is streaming.
The stats calculation part, I think, is batch: it uses newAPIHadoopRDD. This method treats the data-reading part as if it were reading files; in this case the "files" come from HBase, which is the reason for the following input formats:
val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
  classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
  classOf[org.apache.hadoop.hbase.client.Result])
If you want to read the updates in HBase as a stream, then you need a handle on HBase's WAL (write-ahead logs) at the back end and then perform your operations on it. hbase-indexer is a good place to start for reading any updates in HBase.
I have used hbase-indexer to read HBase updates at the back end and direct them to Solr as they arrive. Hope this helps.
