Cassandra CQL: insert data from an existing file

I have a JSON file that I want to insert into a Cassandra table using CQL.
According to the DataStax documentation, you can insert JSON with the following command:
INSERT INTO data JSON '{My_Json}';
But I can't find a way to do that directly from an existing JSON file. Is this possible, or do I need to write some Java code to do the insert?
Note: I am using Cassandra 3.9.

The only file format supported for importing is CSV. It is possible to convert your JSON file to CSV and import it with the COPY command. If that is not an option for you, code (for example in Java) is needed to parse your file and insert it into Cassandra.
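If a small script is acceptable, a driver can read the file and reuse the INSERT ... JSON form directly. Below is a minimal sketch using the Python driver rather than Java; the contact point, keyspace name, and file path are placeholders, and it assumes the file holds one JSON object per line:
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])            # placeholder contact point
session = cluster.connect("my_keyspace")    # placeholder keyspace

with open("data.json") as f:
    for line in f:                          # assumes one JSON object per line
        doc = line.strip()
        if doc:
            # INSERT ... JSON takes the raw JSON string as a bind value.
            session.execute("INSERT INTO data JSON %s", [doc])

cluster.shutdown()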

Related

How do I load gzipped JSON data into a table using Spark SQL's CREATE TABLE query

I want to connect Apache Superset with Apache Spark (I have Spark 3.1.2) and query the data in Superset's SQL Lab using Apache Spark SQL.
On Spark's master, I started the Thrift server using this command: spark-submit --class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.
Then I added the Spark cluster as a database in Superset using the SQLAlchemy URI hive://hive@spark:10000/. I am able to access the Spark cluster from Superset.
I can load JSON data as a table using this SQL:
CREATE table IF NOT EXISTS test_table
USING JSON
LOCATION "/path/to/data.json"
and I am able to query data using simple SQL statements like SELECT * FROM test_table LIMIT 10.
But the problem is that the JSON data is compressed as gzipped files.
So I tried
CREATE table IF NOT EXISTS test_table
USING JSON
LOCATION "/path/to/data.json.gz"
but it did not work. I want to know how to load gzipped JSON data into a table.
Compressed JSON storage
If you have large JSON text, you can explicitly compress it using the built-in COMPRESS function. In the following example (SQL Server), the compressed JSON content is stored as binary data, and a computed column decompresses the JSON back to the original text using the DECOMPRESS function:
CREATE TABLE Person
( _id int identity constraint PK_JSON_ID primary key,
  data varbinary(max),
  value AS CAST(DECOMPRESS(data) AS nvarchar(max))
)
INSERT INTO Person(data)
VALUES (COMPRESS(@json))
COMPRESS and DECOMPRESS functions use standard GZip compression.
Another example, using PostgreSQL's json_fdw foreign data wrapper:
postgres=# CREATE EXTENSION json_fdw;
postgres=# CREATE SERVER json_server FOREIGN DATA WRAPPER json_fdw;
postgres=# CREATE FOREIGN TABLE customer_reviews
(
customer_id TEXT,
"review.date" DATE,
"review.rating" INTEGER,
"product.id" CHAR(10),
"product.group" TEXT,
"product.title" TEXT,
"product.similar_ids" CHAR(10)[]
)
SERVER json_server
OPTIONS (filename '/home/citusdata/customer_reviews_nested_1998.json.gz');
Note: This example was taken from https://www.citusdata.com/blog/2013/05/30/run-sql-on-json-files-without-any-data-loads
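For the Spark SQL case asked about above, note that Spark's JSON data source can normally read gzip-compressed files transparently via Hadoop's compression codecs, so one workaround is to load the file through the DataFrame reader and save it as a table. A minimal PySpark sketch under that assumption (the path and table name are placeholders):
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Spark infers the gzip codec from the .gz extension when reading JSON.
df = spark.read.json("/path/to/data.json.gz")

# Persist it as a table so it is visible to the Thrift server / Superset.
df.write.mode("overwrite").saveAsTable("test_table")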

Azure Data Factory - Google BigQuery Copy Data activity not returning nested column names

I have a copy activity in Azure Data Factory with a Google BigQuery source.
I need to import the whole table (which contains nested fields - Records in BigQuery).
Nested fields get imported as follows (a string containing only data values):
"{\"v\":{\"f\":[{\"v\":\"1\"},{\"v\":\"1\"},{\"v\":\"1\"},{\"v\":null},{\"v\":\"1\"},{\"v\":null},{\"v\":null},{\"v\":\"1\"},{\"v\":null},{\"v\":null},{\"v\":null},{\"v\":null},{\"v\":\"0\"}]}}"
Expected output would be something like:
{"nestedColName" : [{"subNestedColName": 1}, {"subNestedColName": 1}, {"subNestedColName": 1}, {"subNestedColName": null}, ...] }
I think this is a connector issue from Data Factory's side but am not sure how to proceed.
Have considered using Databricks to import data from GBQ directly and then saving the DataFrame to sink.
Have also considered querying for a subset of columns and using UNNEST where required but would rather not do this as Parquet handles both Array and Map types.
Anyone encountered this before / what did you do?
Solution used:
Databricks (Spark) connector for Google BigQuery:
https://docs.databricks.com/data/data-sources/google/bigquery.html
This preserves schemas and nested field names.
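For reference, reading a table through that connector looks roughly like this (a sketch based on the linked Databricks documentation; the project, dataset, table, and output path are placeholders):
from pyspark.sql import SparkSession

# In a Databricks notebook the `spark` session already exists.
spark = SparkSession.builder.getOrCreate()

# The spark-bigquery connector preserves nested (RECORD) fields as
# Spark struct/array columns instead of flattening them to strings.
df = (spark.read
      .format("bigquery")
      .option("table", "my-project.my_dataset.my_table")  # placeholder
      .load())

df.write.mode("overwrite").parquet("/mnt/output/my_table")  # placeholder sink path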
Preferring the simpler setup of the ADF BigQuery connector over Databricks's BigQuery support, I opted for a solution where I extract the data as JSON and 'massage' it into Parquet using Databricks:
Use a Copy activity to get data from BigQuery with all the data packed into a single JSON string field. Output format can be Parquet or JSON (I'm using Parquet). Use a BigQuery query like this:
select TO_JSON_STRING(t) as value from `<your BigQuery table>` as t
NOTE: The name of the field must be value. The df.write.text() text file writer writes the contents of the value column into each row of the text file, which is a JSON string in this case.
Run a Databricks notebook activity with code like this:
# Read the data and write it out as a text file to get the JSON. (Compression is optional.)
dfInput = spark.read.parquet(inputpath)
dfInput.write.mode("overwrite").option("compression", "gzip").text(tmppath)
# Read it back as JSON to extract the correct schema.
dfTemp = spark.read.json(tmppath)
dfTemp.write.mode("overwrite").parquet(outputpath)
Use the output as is, or use a Copy activity to copy it to where you like.

How can we export a Cassandra table into CSV format using its snapshot files

I have taken a snapshot of a Cassandra table. The following files were generated:
manifest.json mc-10-big-Filter.db mc-10-big-TOC.txt mc-11-big-Filter.db mc-11-big-TOC.txt mc-9-big-Filter.db mc-9-big-TOC.txt
mc-10-big-CompressionInfo.db mc-10-big-Index.db mc-11-big-CompressionInfo.db mc-11-big-Index.db mc-9-big-CompressionInfo.db mc-9-big-Index.db schema.cql
mc-10-big-Data.db mc-10-big-Statistics.db mc-11-big-Data.db mc-11-big-Statistics.db mc-9-big-Data.db mc-9-big-Statistics.db
mc-10-big-Digest.crc32 mc-10-big-Summary.db mc-11-big-Digest.crc32 mc-11-big-Summary.db mc-9-big-Digest.crc32 mc-9-big-Summary.db
Is there a way to use these files to extract the table's data into a CSV file?
Yes, you can do that with the sstable2json tool.
Run the tool against the *Data.db files.
The output is in JSON format, so you will need to convert it to CSV afterwards.
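As a rough illustration of that last step, here is a small Python sketch that flattens a JSON dump into CSV. It assumes the dump file is a JSON array of flat row objects; the actual structure depends on the tool and Cassandra version (the 3.x tools ship sstabledump instead of sstable2json), so adapt the flattening to your schema:
import csv
import json

# Assumption: dump.json holds a JSON array of flat row objects.
with open("dump.json") as f:
    rows = json.load(f)

# Build a complete CSV header from every key seen across the rows.
fieldnames = sorted({key for row in rows for key in row})

with open("table.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)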

Read a binary file and store it into Hive using Spark

I would like to read a binary file from the local disk and write it into Hive as a key/value (blob) pair using Spark.
The format is as below:
table XXX
MyKey ---> BinaryBlob
Looking at other answers, I found the following for reading the file, but it is not clear how to insert it into Hive:
sc.binaryFiles("file:///path/to/local/file")
http://www.informit.com/articles/article.aspx?p=2756471&seqNum=4
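A minimal PySpark sketch of one way to do this (the table name xxx and the local path are placeholders, and the key column here simply reuses the file path):
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, BinaryType

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# binaryFiles returns (path, content) pairs; the content arrives as raw bytes.
rdd = spark.sparkContext.binaryFiles("file:///path/to/local/file")

schema = StructType([
    StructField("mykey", StringType(), False),
    StructField("binaryblob", BinaryType(), True),
])

# Wrap the bytes in a bytearray so they map cleanly to BinaryType.
df = spark.createDataFrame(rdd.map(lambda kv: (kv[0], bytearray(kv[1]))), schema)

# Write into Hive as a key -> blob table.
df.write.mode("overwrite").saveAsTable("xxx")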

Read CSV with linebreaks in pyspark

I want to read with PySpark a "legal" CSV (it follows RFC 4180) that has line breaks (CRLF) inside some of the rows; the embedded line breaks are visible when the file is opened in Notepad++.
I try to read it with sqlCtx.read.load using format='com.databricks.spark.csv', and the resulting dataset shows two rows instead of one in these specific cases. I am using Spark version 2.1.0.2.
Is there any option or alternative way of reading the CSV that allows me to read these two lines as only one?
You can use "csv" instead of Databricks CSV - the last one redirects now to default Spark reader. But, it's only a hint :)
Spark 2.2 added a new option - wholeFile. If you write this:
spark.read.option("wholeFile", "true").csv("file.csv")
it will read the whole file and handle multi-line CSV records.
There is no such option in Spark 2.1. You can read the file using sparkContext.wholeTextFiles, or just use a newer version.
wholeFile does not exist (anymore?) in the Spark API documentation:
https://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html
This solution will work:
spark.read.option("multiLine", "true").csv("file.csv")
From the API documentation:
multiLine – parse records, which may span multiple lines. If None is set, it uses the default value, false
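Putting it together, a short sketch for reading an RFC 4180 CSV with embedded CRLFs (assuming Spark 2.2+; the path and the header option are illustrative). RFC 4180 escapes quotes by doubling them, which is why the escape character is set to a double quote here instead of Spark's default backslash:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (spark.read
      .option("header", "true")     # first line contains column names
      .option("multiLine", "true")  # keep CRLFs inside quoted fields in one record
      .option("escape", '"')        # RFC 4180 doubles quotes rather than backslash-escaping
      .csv("file.csv"))

df.show(truncate=False)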
