How to write a PySpark dataframe to JSON with partitions and without a column - apache-spark

I know something similar has been asked before, but I can't seem to figure it out. I partition by the listed columns to create directories in S3, but my payload column contains the actual data I need. Previous answers suggested saving as text, but that doesn't work since I have a MapType column and need the output file to be JSON.
Code Used:
(
df
.write
.format('json')
.partitionBy("country_code", "state_code", "size")
.mode("append")
.save('/mnt/dev/test')
)
Current Output:
{
  "payload": {
    "100": {
      "cumulative_ttl_sold": 11,
      "cumulative_ttl_returned": 1
    }
  }
}
Expected Output:
{
  "100": {
    "cumulative_ttl_sold": 11,
    "cumulative_ttl_returned": 1
  }
}
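
One way to get this shape, sketched below on the assumption that payload is a MapType (or struct) column: serialize payload to a JSON string with to_json and write with the text writer, so each output line is just the map contents without the wrapping "payload" key. The path and column names are the ones from the question.
from pyspark.sql.functions import to_json, col

# Sketch: turn the payload column itself into a JSON string,
# keep the partition columns, and write json-lines via the text writer.
(
    df
    .select(
        "country_code", "state_code", "size",
        to_json(col("payload")).alias("value")  # only the map contents, no "payload" wrapper
    )
    .write
    .format("text")
    .partitionBy("country_code", "state_code", "size")
    .mode("append")
    .save("/mnt/dev/test")
)
Each output line is then exactly the serialized map, matching the expected output above.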

Related

Shareplex CDC output - complete after image possible?

Shareplex CDC offers three JSON sub-structs per CDC record:
meta - the operation type (insert, update, delete, ...)
data - the actual changed data, with column names
key - the before image, i.e. all fields, including those that changed in "data"
This is what data engineers state, and the documentation seems to describe only this possibility as well.
My question is: how can we get the complete after image of the record, including both changed and non-changed data? Maybe it is simply not possible.
{
  "meta": {
    "op": "upd",
    "table": "BILL.PRODUCTS"
  },
  "data": {
    "PRICE": "3599"
  },
  "key": {
    "PRODUCT_ID": "230117",
    "DESCRIPTION": "Hamsberry vintage tee, cherry",
    "PRICE": "4099"
  }
}
The above approach is awkward when Spark schemas are computed in batch, or when the complete schema has to be defined up front and NULL-value issues come into play, as far as I can see.
No, this is not possible out of the box.
What you can do is read the Kafka JSON, do as per below, set the after image on a new Kafka topic, and proceed:
import org.json4s._
import org.json4s.jackson.JsonMethods._

val jsonS =
  """
    |{
    |  "meta": {
    |    "op": "upd",
    |    "table": "BILL.PRODUCTS"
    |  },
    |  "data": {
    |    "PRICE": "3599"
    |  },
    |  "key": {
    |    "PRODUCT_ID": "230117",
    |    "DESCRIPTION": "Hamsberry vintage tee, cherry",
    |    "PRICE": "4099"
    |  }
    |}
    |""".stripMargin

val jsonNN = parse(jsonS)
val meta = jsonNN \ "meta"
val data = jsonNN \ "data"
val key  = jsonNN \ "key"

// diff: "changed" holds the new values coming from data,
// "deleted" holds the before-image fields that only exist in key.
val Diff(changed, added, deleted) = key diff data

// after image = unchanged before-image fields overlaid with the new values
val afterImage = changed merge deleted

// Convert to JSON
println(pretty(render(afterImage)))
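
For intuition, the same merge in plain Python (a sketch, not part of the original answer; json_s is assumed to hold the CDC record shown above): the after image is just the before image (key) overlaid with the changed columns (data).
import json

record = json.loads(json_s)  # json_s: the CDC record string from above

# after image = before image (key) overridden by the changed fields (data)
after_image = {**record["key"], **record["data"]}

print(after_image)
# PRODUCT_ID and DESCRIPTION come from key, PRICE "3599" comes from data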

Spark 3.1 - java dataset filter using FilterFunction<Row>

I want to filter a dataset based on some condition in Spark 3.1 with Java.
This is my input dataset:
+------+------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|id    |zip   |fulldetails                                                                                                                                                                                            |
+------+------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|123541|123541|{"id":"123","name":"aaaa","nBr":1,"type":{"id":5},"status":{"desc":"NONE"},"mainData":{"proceed":"N","savings":{"processed":{"amount":111}}},"findings":[{"newData":[{"place":{"no":1,"doorNo":2}}]}]} |
+------+------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Below is the value of the 3rd column, which is in JSON format:
{
  "id": "123",
  "name": "aaaa",
  "nBr": 1,
  "type": {
    "id": 5
  },
  "status": {
    "desc": "NONE"
  },
  "mainData": {
    "proceed": "N",
    "savings": {
      "processed": {
        "amount": 111.0
      }
    }
  },
  "findings": [
    {
      "newData": [
        {
          "place": {
            "no": 1,
            "doorNo": 2
          }
        }
      ]
    }
  ]
}
I want to filter this dataset based on the value of the 3rd column,
i.e. mainData.savings.processed.amount < 100.0 and name != "aaaa".
If we convert this JSON data to a struct, we can filter with the usual dataset.filter(col("col_name").geq(lit(0))),
but I am looking for a plain filter method in Spark 3.1 with Java.
I saw we can do it using FilterFunction. If I use it like below, how can I get each value out of this JSON?
dataset.filter((FilterFunction<Row>) row -> {
    // how do I read the individual JSON fields here?
    return true;
});
Can anyone help me with this? Thanks in advance :)
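
One way to avoid hand-parsing inside a FilterFunction is the struct route the question already mentions: parse the JSON column with from_json and filter on ordinary columns. Below is a sketch in PySpark (the question asks for Java, but from_json, col and the struct schema builders exist in the Java API as well); the schema only covers the fields the filter needs, and dataset / fulldetails are the names from the question.
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Minimal schema: only the fields the filter actually uses.
schema = StructType([
    StructField("name", StringType()),
    StructField("mainData", StructType([
        StructField("savings", StructType([
            StructField("processed", StructType([
                StructField("amount", DoubleType()),
            ])),
        ])),
    ])),
])

filtered = (
    dataset
    .withColumn("details", from_json(col("fulldetails"), schema))
    .filter(
        (col("details.mainData.savings.processed.amount") < 100.0)
        & (col("details.name") != "aaaa")
    )
)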

Is it possible to build Spark code on the fly and execute it?

I am trying to create a generic function to read a CSV file using the Databricks CSV reader. But the options are not mandatory; they can differ based on my input JSON configuration file.
Example 1:
"ReaderOption":{
"delimiter":";",
"header":"true",
"inferSchema":"true",
"schema":"""some custome schema.."""
},
Example 2:
"ReaderOption":{
"delimiter":";",
"schema":"""some custome schema.."""
},
Is it possible to construct the options, or the entire read statement, at runtime and run it in Spark?
Like below:
def readCsvWithOptions(): DataFrame=
{
val options:Map[String,String]= Map("inferSchema"->"true")
val readDF = jobContext.spark.read.format("com.databricks.spark.csv")
.option(options)
.load(inputPath)
readDF
}
The difference is options (plural) instead of option:
def readCsvWithOptions(): DataFrame=
{
val options:Map[String,String]= Map("inferSchema"->"true")
val readDF = jobContext.spark.read.format("com.databricks.spark.csv")
.options(options)
.load(inputPath)
readDF
}
There is an options method which takes key/value pairs (a Map), unlike option, which sets a single key and value, so the reader can be configured entirely at runtime.
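
To make the "options from a config file" part concrete, here is a PySpark sketch along the same lines (the config path, keys and variable names are made up for illustration, and it uses the built-in csv source that replaced com.databricks.spark.csv in Spark 2.x): load the ReaderOption object from the JSON config and pass it straight to options.
import json
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical config file containing a "ReaderOption" object like the examples above.
with open("job_config.json") as f:
    config = json.load(f)

reader_options = config.get("ReaderOption", {})   # e.g. {"delimiter": ";", "header": "true"}
schema_str = reader_options.pop("schema", None)   # schema is applied separately, not as an option

reader = spark.read.format("csv").options(**reader_options)
if schema_str:
    reader = reader.schema(schema_str)            # DDL string, e.g. "id INT, name STRING"

df = reader.load("/path/to/input.csv")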

Filter JavaRDD into multiple JavaRDDs based on a condition

I have one JavaRDD, records.
I would like to create 3 JavaRDDs from records depending on a condition:
JavaRDD<MyClass> records1 = records.filter(record -> "A".equals(record.getName()));
JavaRDD<MyClass> records2 = records.filter(record -> "B".equals(record.getName()));
JavaRDD<MyClass> records3 = records.filter(record -> "C".equals(record.getName()));
The problem is that I can do it as shown above, but my RDD may have millions of records and I don't want to scan all of them 3 times.
So I want to do it in one iteration over the records.
I need something like this:
records
    .forEach(record -> {
        if ("A".equals(record.getName())) {
            records1(record);   // i.e. route the record into records1
        } else if ("B".equals(record.getName())) {
            records2(record);
        } else if ("C".equals(record.getName())) {
            records3(record);
        }
    });
How can I achieve this in Spark using JavaRDD?
In my view you can use mapToPair and create a new Tuple2 object in each branch of your if/else. The Tuple2's key then tells you which RDD each object belongs to; in other words, the key marks the type of record you want to store in one RDD, and the value is your main data.
Your code would be something like below:
JavaPairRDD<String, MyClass> keyedRecords = records.mapToPair(record -> {
    String key = "";
    if ("A".equals(record.getName())) {
        key = "A";
    } else if ("B".equals(record.getName())) {
        key = "B";
    } else if ("C".equals(record.getName())) {
        key = "C";
    }
    return new Tuple2<>(key, record);   // scala.Tuple2
});
The resulting pair RDD (keyedRecords) can then be split by the keys assigned in mapToPair, for example with one filter per key.
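Worth noting: Spark has no single-pass operator that splits one RDD into several outputs, so the usual compromise is to tag the records once, cache the tagged RDD, and run one cheap filter per key over the cached data. A PySpark sketch of that idea (the getName accessor is assumed from the question's MyClass):
keyed = records.map(lambda r: (r.getName(), r)).cache()   # tag once, cache the single scan

records_a = keyed.filter(lambda kv: kv[0] == "A").values()
records_b = keyed.filter(lambda kv: kv[0] == "B").values()
records_c = keyed.filter(lambda kv: kv[0] == "C").values()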

How to decide whether to use Spark RDD filter or not

I am using Spark to read and analyse a data file; the file contains data like the following:
1,unit1,category1_1,100
2,unit1,category1_2,150
3,unit2,category2_1,200
4,unit3,category3_1,200
5,unit3,category3_2,300
The file contains around 20 million records. If the user inputs a unit or a category, Spark needs to filter the data by inputUnit or inputCategory.
Solution 1:
sc.textFile(file).map(line => {
val Array(id,unit,category,amount) = line.split(",")
if ( (StringUtils.isNotBlank(inputUnit) && unit != inputUnit ) ||
(StringUtils.isNotBlank(inputCategory) && category != inputCategory)){
null
} else {
val obj = new MyObj(id,unit,category,amount)
(id,obj)
}
}).filter(_!=null).collectAsMap()
Solution 2:
var rdd = sc.textFile(file).map(line => {
val Array(id,unit,category,amount) = line.split(",")
(id,unit,category,amount)
})
if (StringUtils.isNotBlank(inputUnit)) {
rdd = rdd.filter(_._2 == inputUnit)
}
if (StringUtils.isNotBlank(inputCategory)) {
rdd = rdd.filter(_._3 == inputCategory)
}
rdd.map(e => {
val obj = new MyObj(e._1, e._2, e._3, e._4)
(e._1, obj)
}).collectAsMap
I want to understand which solution is better, or whether both of them are poor. If both are poor, how do I make a good one? Personally, I think the second one is better, but I am not quite sure whether it is nice to declare an RDD as a var. (I am new to Spark; I am using Spark 1.5.0 and Scala 2.10.4 to write the code, and this is my first time asking a question on Stack Overflow, so feel free to edit if it is not well formatted.) Thanks.
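
A third option, sketched here in PySpark under the question's assumptions (comma-separated lines, optional inputUnit / inputCategory strings, and the MyObj class from the question): parse once, then apply a single filter whose predicate only checks the inputs that were actually provided. This keeps one pass over the data and avoids reassigning a var.
def matches(unit, category):
    # Each condition applies only when the corresponding input was provided
    # (truthiness here plays the role of StringUtils.isNotBlank).
    if inputUnit and unit != inputUnit:
        return False
    if inputCategory and category != inputCategory:
        return False
    return True

result = (
    sc.textFile(file)
      .map(lambda line: line.split(","))
      .filter(lambda f: matches(f[1], f[2]))
      .map(lambda f: (f[0], MyObj(f[0], f[1], f[2], f[3])))
      .collectAsMap()
)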
