Spark read json with partial schema - apache-spark

I need to process a fairly large JSON file using Spark. I don't need all the fields in the JSON and would actually like to read only some of them (rather than reading all fields and then projecting).
I was wondering if I could use the JSON connector and give it a partial read schema containing only the fields I'm interested in loading.

It depends on whether your JSON is multi-line. Currently Spark only supports single-line JSON when reading into a DataFrame; the upcoming Spark 2.3 release will support multi-line JSON.
As for your question: I don't think you can use a partial schema to read in JSON. You can first provide the full schema to read it in as a DataFrame, then select the specific columns you need to build your partial view as a separate DataFrame. Since Spark uses lazy evaluation and the SQL engine can push the projection down, the performance won't be bad.
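For illustration, a minimal sketch of that approach (the schema fields and input path are hypothetical):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder.appName("partial-json").getOrCreate()

// Hypothetical full schema for the file; field names and types are illustrative only.
val fullSchema = StructType(Seq(
  StructField("id", LongType),
  StructField("name", StringType),
  StructField("payload", StringType),
  StructField("timestamp", StringType)
))

val df = spark.read.schema(fullSchema).json("/path/to/input.json")

// Keep only the fields of interest; thanks to lazy evaluation the projection
// is part of the plan before anything is materialized.
val partial = df.select("id", "name")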

Related

JOOQ generator for Apache Spark parquet dataframes?

I work in a place where we use jOOQ for SQL query generation in some parts of the backend code, and lots of code has been written to work with it. On my side of things, I would like to map these features onto Spark, and in particular generate queries in Spark SQL over DataFrames loaded from a bunch of Parquet files.
Is there any tooling to generate DSL classes from a Parquet (or Spark) schema? I could not find any. Have other approaches been successful in this area?
Ideally, I would like to generate tables and fields dynamically from a possibly evolving schema.
I know this is a broad question and I will close it if it is deemed out of scope for SO.
jOOQ doesn't officially support Spark, but you have a variety of options to reverse engineer any schema metadata that you have in your Spark database:
Using the JDBCDatabase
Like any other jooq-meta Database implementation, you can use the JDBCDatabase that reverse engineers anything it can find through the JDBC DatabaseMetaData API, if your JDBC driver supports that.
Using files as a meta data source
As of jOOQ version 3.10, there are three different types of "offline" meta data sources that you can use to generate code:
The XMLDatabase will generate code from an XML file.
The JPADatabase will generate code from JPA-annotated entities.
The DDLDatabase will parse DDL file(s) and reverse engineer their contents (this probably won't work well for Spark, as its syntax is not officially supported).
Not using the code generator
Of course, you don't have to generate any code. You can get meta data information directly from your JDBC driver (again through the DatabaseMetaData API), which is abstracted through DSLContext.meta(), or you can supply the schema to jOOQ dynamically as XML content through DSLContext.meta(InformationSchema).
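As a rough sketch of that last option (reading metadata at runtime rather than generating code), assuming a JDBC connection to something Spark exposes, e.g. the Thrift JDBC server; the URL is hypothetical and this only works to the extent the driver implements DatabaseMetaData:
import java.sql.DriverManager
import org.jooq.impl.DSL

// Hypothetical connection; adjust to whatever JDBC endpoint your Spark setup exposes.
val connection = DriverManager.getConnection("jdbc:hive2://localhost:10000/default")
val ctx = DSL.using(connection)

// Inspect the tables jOOQ can see at runtime; no code generation involved.
ctx.meta().getTables().forEach { table =>
  println(s"${table.getName}: ${table.fields().map(_.getName).mkString(", ")}")
}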

How to use from_json to allow for messages to have different fields?

I am trying to process data from Kafka using Spark Structured Streaming. The code for ingesting the data is as follows:
val enriched = df.select($"value" cast "string" as "json")
.select(from_json($"json", schema) as "data")
.select("data.*")
df is a DataFrame with the data consumed from Kafka.
The problem comes when I try to read it as JSON in order to run faster queries. The from_json() function from org.apache.spark.sql.functions requires a schema. What if the messages have somewhat different fields?
As #zero323 and the answer he or she referenced suggest, you are asking a contradictory question: essentially, how does one impose a schema when one doesn't know the schema? One can't, of course. I think the idea of using open-ended collection types is your best option.
Ultimately though, it is almost certainly true that you can represent your data with a case class, even if it means using a lot of Options, strings you need to parse, and maps you need to interrogate. Invest the effort to define that case class. Otherwise, your Spark jobs will essentially be a lot of ad hoc, time-consuming busywork.
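For illustration, a minimal sketch of the open-ended collection idea, assuming Spark 2.2+ (where from_json accepts a MapType), that df is the DataFrame read from Kafka, and that spark is the active SparkSession:
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{MapType, StringType}
import spark.implicits._

// Parse each message into a string-to-string map instead of a fixed struct,
// so messages with different fields all fit the same column type.
val openSchema = MapType(StringType, StringType)

val enriched = df
  .select($"value".cast("string").as("json"))
  .select(from_json($"json", openSchema).as("data"))

// Individual fields are pulled out on demand and come back null when absent.
val users = enriched.select($"data".getItem("user").as("user"))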

Complex json log data transformation using?

I am new to data science tools and have a use case to transform JSON logs into flattened columnar data, something that could be treated as a normal CSV. I was looking into a lot of alternative tools to approach this problem and found that I can solve it fairly easily using Apache Spark SQL, but the problem is that my JSON logs can be complex data structures with hierarchical arrays, i.e. I would have to explode the dataset multiple times to transform it.
The problem is that I don't want to hard-code the logic for the data transformation, as I wish to reuse the same chunk of code with different transformation logic; to put it a better way, I want my transformation to be driven by configuration rather than code.
For the same reason I was looking into Apache Avro, which gives me the liberty to define my own schema for the input, but the problem here is that I don't know whether I can define the output schema as well. If not, it will be the same as reading and filtering the generated Avro data structure in my code logic.
One possible solution I can think of is to define my schema along with the array fields and some flags that tell my parser to explode on them, possibly recursively, until the input schema is transformed into the output, i.e. generating the transformation logic based on my input and output schemas.
Is there any better approach that I am unaware of or have not been able to think of?
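As a rough, schema-driven sketch of the recursive explode/flatten idea described above (the flatten helper is hypothetical, assumes the leaf fields are atomic types, and uses explode_outer from Spark 2.2+):
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, explode_outer}
import org.apache.spark.sql.types.{ArrayType, StructType}

// Recursively explode array columns and promote struct fields to top-level
// columns until only atomic columns remain.
def flatten(df: DataFrame): DataFrame = {
  val nested = df.schema.fields.find(f =>
    f.dataType.isInstanceOf[ArrayType] || f.dataType.isInstanceOf[StructType])

  nested match {
    case None => df // nothing left to flatten
    case Some(f) => f.dataType match {
      case _: ArrayType =>
        // one row per array element, then keep flattening
        flatten(df.withColumn(f.name, explode_outer(col(f.name))))
      case st: StructType =>
        // lift nested struct fields to top-level columns, then keep flattening
        val lifted = st.fieldNames.map(n => col(s"${f.name}.$n").as(s"${f.name}_$n"))
        val others = df.columns.filter(_ != f.name).map(col)
        flatten(df.select(others ++ lifted: _*))
    }
  }
}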

Avoid the use of Java data structures in Apache Spark to avoid copying the data

I have a MySQL database with a single table containing about 100 million records (~25GB, ~5 columns). Using Apache Spark, I extract this data via a JDBC connector and store it in a DataFrame.
From here, I do some pre-processing of the data (e.g. replacing the NULL values), so I absolutely need to go through each record.
Then I would like to perform dimensionality reduction and feature selection (e.g. using PCA), perform clustering (e.g. K-Means) and later on do the testing of the model on new data.
I have implemented this in Spark's Java API, but it is too slow (for my purposes) since I do a lot of copying of the data from a DataFrame to a java.util.Vector and java.util.List (to be able to iterate over all records and do the pre-processing), and later back to a DataFrame (since PCA in Spark expects a DataFrame as input).
I have tried extracting information from the database into an org.apache.spark.sql.Column but cannot find a way to iterate over it.
I also tried avoiding the use of Java data structures (such as List and Vector) by using the org.apache.spark.mllib.linalg.{DenseVector, SparseVector}, but cannot get that to work either.
Finally, I also considered using JavaRDD (by creating it from a DataFrame and a custom schema), but couldn't work it out entirely.
After a lengthy description, my question is: is there a way to do all steps mentioned in the first paragraph, without copying all the data into a Java data structure?
Maybe one of the options I tried could actually work, but I just can't seem to find out how, as the docs and literature on Spark are a bit scarce.
From the wording of your question, it seems there is some confusion about the stages of Spark processing.
First, we tell Spark what to do by specifying inputs and transformations. At this point, the only things that are known are (a) the number of partitions at various stages of processing and (b) the schema of the data. org.apache.spark.sql.Column is used at this stage to identify the metadata associated with a column. However, it doesn't contain any of the data. In fact, there is no data at all at this stage.
Second, we tell Spark to execute an action on a dataframe/dataset. This is what kicks off processing. The input is read and flows through the various transformations and into the final action operation, be it collect or save or something else.
So, that explains why you cannot "extract information from the database into" a Column.
As for the core of your question, it's hard to comment without seeing your code and knowing exactly what it is you are trying to accomplish, but it is safe to say that this much migrating between types is a bad idea.
Here are a couple of questions that might help guide you to a better outcome:
Why can't you perform the data transformations you need by operating directly on the Row instances?
Would it be convenient to wrap some of your transformation code into a UDF or UDAF?
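For example, here is a hedged sketch of NULL handling done entirely inside the DataFrame API (the JDBC URL, table, and column names are hypothetical), so nothing is copied into java.util.List or java.util.Vector:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

val spark = SparkSession.builder.appName("preprocess").getOrCreate()

// Hypothetical JDBC source mirroring the setup described in the question.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://host/db")
  .option("dbtable", "records")
  .load()

// Replace NULLs with defaults using built-in DataFrame functions...
val cleaned = df.na.fill(Map("amount" -> 0.0, "category" -> "unknown"))

// ...or apply a custom per-value transformation as a UDF, still column-wise.
val normalize = udf((s: String) => Option(s).map(_.trim.toLowerCase).getOrElse(""))
val prepared = cleaned.withColumn("category", normalize(col("category")))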
Hope this helps.

Efficient way to store a JSON string in a Cassandra column?

Cassandra newbie question. I'm collecting some data from a social networking site using REST calls. So I end up with the data coming back in JSON format.
The JSON is only one of the columns in my table. I'm trying to figure out what the "best practice" is for storing the JSON string.
First I thought of using the map type, but the JSON contains a mix of strings, numerical types, etc. It doesn't seem like I can declare wildcard types for the map key/value. The JSON string can be quite large, probably over 10KB in size. I could potentially store it as a string, but it seems like that would be inefficient. I would assume this is a common task, so I'm sure there are some general guidelines for how to do this.
I know Cassandra has native support for JSON, but from what I understand, that's mostly used when the entire JSON map matches 1-1 with the database schema. That's not the case for me. The schema has a bunch of columns and the JSON string is just a sort of "payload". Is it better to store the JSON string as a blob or as text? BTW, the Cassandra version is 2.1.5.
Any hints appreciated. Thanks in advance.
In the Cassandra storage engine there's really not a big difference between blob and text, since Cassandra essentially stores text as blobs. And yes, the "native" JSON support you speak of applies only when your data model matches your JSON model, and it's only available in Cassandra 2.2+.
I would store it as a text type. You shouldn't have to implement anything to compress your JSON data when sending it (or handle decompressing), since Cassandra's binary protocol supports transport compression. Also make sure your table stores the data compressed with the same compression algorithm (I suggest LZ4 since it's the fastest algorithm implemented) to save on compression work for each read request. So if you configure the table to store the data compressed and use transport compression, you don't have to implement either yourself.
You didn't say which client driver you're using, but here's the documentation on how to set up transport compression for the DataStax Java client driver.
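For what it's worth, a hedged sketch of enabling LZ4 transport compression with the DataStax Java driver from Scala (the contact point, keyspace, and table are hypothetical, and the lz4 jar must be on the classpath):
import com.datastax.driver.core.{Cluster, ProtocolOptions}

// Enable LZ4 compression on the wire when building the Cluster.
val cluster = Cluster.builder()
  .addContactPoint("127.0.0.1")
  .withCompression(ProtocolOptions.Compression.LZ4)
  .build()

val session = cluster.connect("my_keyspace")

// The JSON payload is stored as plain text; compression happens on the wire
// and in the SSTables, not in application code.
session.execute(
  "INSERT INTO events (id, payload) VALUES (?, ?)",
  java.util.UUID.randomUUID(),
  """{"user": "alice", "action": "login"}"""
)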
It depends on how you want to query your JSON. There are 3 possible strategies:
Store as a string
Store as a compressed blob
Store as a blob
Option 1 has the advantage of being human readable when you query your data on the command line with cqlsh, or if you want to debug live data directly. The drawback is the size of this JSON column (~10 KB).
Option 2 has the advantage of keeping the JSON payload small, because text compresses at a pretty decent ratio. The drawbacks are: (a) you need to take care of compression/decompression on the client side, and (b) it's not directly human readable.
Option 3 has the drawbacks of both option 1 (size) and option 2 (not human readable).
