read binary file and store into hive using spark - apache-spark

I would like to read a binary file from local disk and write it into Hive as a key → value (binary blob) pair using Spark.
The target format is as below:
table XXX
MyKey ---> BinaryBlob
Looking at other answers, I found the following for reading the file, but it is not clear how to insert the result into Hive:
sc.binaryFiles("file:///path/to/local/file")
http://www.informit.com/articles/article.aspx?p=2756471&seqNum=4
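A minimal Scala sketch of one approach, assuming a Hive-enabled SparkSession and a target table named xxx with columns (mykey STRING, blob BINARY); the table and column names are placeholders based on the format above:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("binary-to-hive")
  .enableHiveSupport()            // requires a working Hive metastore
  .getOrCreate()
import spark.implicits._

// binaryFiles yields (path, PortableDataStream); toArray materialises the bytes
val blobs = spark.sparkContext
  .binaryFiles("file:///path/to/local/file")
  .map { case (path, stream) => (path, stream.toArray()) }
  .toDF("mykey", "blob")

// Array[Byte] maps to Spark's BinaryType, which Hive exposes as BINARY.
// saveAsTable creates a Spark-managed table; use insertInto for a pre-existing Hive table.
blobs.write.mode("append").saveAsTable("xxx")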

Related

Load Data using Spark deletes source file from S3

I'm using Spark to load data into a Hive table through PySpark, and when I load data from a path in Amazon S3, the original file gets wiped from the directory. The file is found and does populate the table with data. I also tried adding the LOCAL clause, but that throws an error when looking for the file. Looking through the documentation, it doesn't explicitly state that this is the intended behavior.
spark.sql("CREATE TABLE src (key INT, value STRING) STORED AS textfile")
spark.sql("LOAD DATA INPATH 's3://bucket/kv1.txt' OVERWRITE INTO TABLE src")

How do I load gzipped JSON data into a table using Spark SQL's CREATE TABLE query

I want to connect Apache Superset with Apache Spark (I have Spark 3.1.2) and query the data in Superset's SQL Lab using Apache Spark SQL.
On Spark's master, I started the Thrift server using this command: spark-submit --class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.
Then I added the Spark cluster as a database in Superset using the SQLAlchemy URI hive://hive@spark:10000/. I am able to access the Spark cluster from Superset.
I can load JSON data as a table using this SQL:
CREATE TABLE IF NOT EXISTS test_table
USING JSON
LOCATION "/path/to/data.json"
and I am able to query the data using simple SQL statements like SELECT * FROM test_table LIMIT 10.
BUT the problem is that the JSON data is compressed into gzipped files.
So I tried
CREATE TABLE IF NOT EXISTS test_table
USING JSON
LOCATION "/path/to/data.json.gz"
but it did not work. I want to know how to load gzipped JSON data into a table.
Compressed JSON storage
If you have large JSON text, you can explicitly compress it using the built-in COMPRESS function. In the following example the compressed JSON content is stored as binary data, and we have a computed column that decompresses the JSON back to the original text using the DECOMPRESS function:
CREATE TABLE Person
( _id int identity constraint PK_JSON_ID primary key,
  data varbinary(max),
  value AS CAST(DECOMPRESS(data) AS nvarchar(max))
)
INSERT INTO Person(data)
VALUES (COMPRESS(@json))
COMPRESS and DECOMPRESS functions use standard GZip compression.
Another example:
CREATE EXTENSION json_fdw;
CREATE SERVER json_server FOREIGN DATA WRAPPER json_fdw;
CREATE FOREIGN TABLE customer_reviews
(
customer_id TEXT,
"review.date" DATE,
"review.rating" INTEGER,
"product.id" CHAR(10),
"product.group" TEXT,
"product.title" TEXT,
"product.similar_ids" CHAR(10)[]
)
SERVER json_server
OPTIONS (filename '/home/citusdata/customer_reviews_nested_1998.json.gz');
Note: This example was taken from https://www.citusdata.com/blog/2013/05/30/run-sql-on-json-files-without-any-data-loads
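Closer to the Spark setup in the question itself: Spark's JSON source decompresses .gz files transparently based on the file extension, so a hedged thing to try is pointing LOCATION at the directory that contains the gzipped files rather than at a single file. A sketch, runnable in spark-shell against the same metastore the Thrift server uses (all paths are placeholders):

// Create the table over the directory holding data.json.gz
spark.sql("""
  CREATE TABLE IF NOT EXISTS test_table
  USING JSON
  LOCATION '/path/to/json_dir'
""")

// DataFrame-side check that the gzipped file itself is readable as-is
spark.read.json("/path/to/json_dir/data.json.gz").show(10)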

How to write data to hive table with snappy compression in Spark SQL

I have an ORC Hive table that was created using this Hive command:
create table orc1(line string) stored as orcfile
I want to write some data to this table using Spark SQL. I use the following code and want the data to be Snappy-compressed on HDFS:
test("test spark orc file format with compression") {
import SESSION.implicits._
Seq("Hello Spark", "Hello Hadoop").toDF("a").createOrReplaceTempView("tmp")
SESSION.sql("set hive.exec.compress.output=true")
SESSION.sql("set mapred.output.compress=true")
SESSION.sql("set mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec")
SESSION.sql("set io.compression.codecs=org.apache.hadoop.io.compress.SnappyCodec")
SESSION.sql("set mapred.output.compression.type=BLOCK")
SESSION.sql("insert overwrite table orc1 select a from tmp ")
}
The data is written, but it is NOT compressed with Snappy.
If I run the same insert overwrite in Hive/Beeline with the above set commands, then I can see that the table's files are compressed with Snappy.
So I would like to ask how to write data with Snappy compression to ORC tables created by Hive, using Spark SQL 2.1.
You can set the compression to Snappy in the create table command, like so:
create table orc1(line string) stored as orc tblproperties ("orc.compress"="SNAPPY");
Then any inserts into the table will be Snappy-compressed (I also corrected orcfile to orc in the command).
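If writing through the DataFrame API is an option, a hedged alternative sketch is to let Spark's native ORC writer do the compression; the table name below is a placeholder for a new, Spark-managed table, and SESSION is the same SparkSession used in the snippet above:

import SESSION.implicits._

val df = Seq("Hello Spark", "Hello Hadoop").toDF("line")

df.write
  .mode("overwrite")
  .format("orc")
  .option("compression", "snappy")   // compression option of Spark's ORC data source
  .saveAsTable("orc1_spark")         // placeholder name; not the Hive-created orc1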

Cassandra CQL: insert data from an existing file

I have a JSON file that I want to insert into a Cassandra table using CQL.
According to the DataStax documentation, you can insert JSON with the following command:
INSERT INTO data JSON '{My_Json}';
But I can't find a way to do that directly from an existing JSON file. Is this possible, or do I need to write some Java code to do the insert?
Note: I am using Cassandra 3.9
The only file format supported for importing is CSV. It is possible to convert your JSON file to CSV format and import it with the COPY command. If that is not an option for you, Java code is needed to parse your file and insert it into Cassandra.
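If a little code is acceptable, a minimal sketch using the DataStax Java driver (3.x, matching Cassandra 3.9), written here in Scala; the contact point, the keyspace name, and the assumption that the file holds exactly one JSON document are all placeholders:

import com.datastax.driver.core.Cluster
import scala.io.Source

// Read the whole file; assumes it contains a single JSON document
val json = Source.fromFile("/path/to/my_json.json").mkString

val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
val session = cluster.connect("my_keyspace")   // placeholder keyspace

// Bind the file contents into the INSERT ... JSON statement
session.execute("INSERT INTO data JSON ?", json)

cluster.close()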

How can we export a Cassandra table into CSV format using its snapshot files

I have taken a snapshot of a Cassandra table. The following files were generated:
manifest.json mc-10-big-Filter.db mc-10-big-TOC.txt mc-11-big-Filter.db mc-11-big-TOC.txt mc-9-big-Filter.db mc-9-big-TOC.txt
mc-10-big-CompressionInfo.db mc-10-big-Index.db mc-11-big-CompressionInfo.db mc-11-big-Index.db mc-9-big-CompressionInfo.db mc-9-big-Index.db schema.cql
mc-10-big-Data.db mc-10-big-Statistics.db mc-11-big-Data.db mc-11-big-Statistics.db mc-9-big-Data.db mc-9-big-Statistics.db
mc-10-big-Digest.crc32 mc-10-big-Summary.db mc-11-big-Digest.crc32 mc-11-big-Summary.db mc-9-big-Digest.crc32 mc-9-big-Summary.db
Is there a way to use these files to extract the table's data into a CSV file?
Yes, you can do that with the sstable2json tool.
Run the tool against the *-Data.db files.
It outputs JSON, which you then need to convert to CSV.
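Staying with the tools covered elsewhere on this page, once sstable2json has produced the JSON dump, Spark can flatten it to CSV; a hedged sketch (paths are placeholders, and nested columns may need explicit flattening depending on the schema in schema.cql):

// sstable2json emits a multi-line JSON array, hence multiLine
val dump = spark.read.option("multiLine", "true").json("/path/to/output.json")
dump.write.option("header", "true").csv("/path/to/output_csv")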
