Merging duplicate columns in seq JSON HDFS files in Spark - apache-spark

I am reading a seq JSON file from HDFS using Spark like this:
val data = spark.read.json(
  spark.sparkContext.sequenceFile[String, String]("/prod/data/class1/20190114/2019011413/class2/part-*")
    .map { case (x, y) => y.toString })
data.registerTempTable("data")
val filteredData = data.filter("sourceInfo='Web'")
val explodedData = filteredData.withColumn("A", explode(filteredData("payload.adCsm.vfrd")))
val explodedDataDbg = explodedData.withColumn("B", explode(filteredData("payload.adCsm.dbg"))).drop("payload")
Running this, I get the following error:
org.apache.spark.sql.AnalysisException:
Ambiguous reference to fields StructField(adCsm,ArrayType(StructType(StructField(atfComp,StringType,true), StructField(csmTot,StringType,true), StructField(dbc,ArrayType(LongType,true),true), StructField(dbcx,LongType,true), StructField(dbg,StringType,true), StructField(dbv,LongType,true), StructField(fv,LongType,true), StructField(hdr,LongType,true), StructField(hidden,StructType(StructField(duration,LongType,true), StructField(stime,StringType,true)),true), StructField(hvrx,DoubleType,true), StructField(hvry,DoubleType,true), StructField(inf,StringType,true), StructField(isP,LongType,true), StructField(ltav,StringType,true), StructField(ltdb,StringType,true), StructField(ltdm,StringType,true), StructField(lteu,StringType,true), StructField(ltfm,StringType,true), StructField(ltfs,StringType,true), StructField(lths,StringType,true), StructField(ltpm,StringType,true), StructField(ltpq,StringType,true), StructField(ltts,StringType,true), StructField(ltut,StringType,true), StructField(ltvd,StringType,true), StructField(ltvv,StringType,true), StructField(msg,StringType,true), StructField(nl,LongType,true), StructField(prerender,StructType(StructField(duration,LongType,true), StructField(stime,LongType,true)),true), StructField(pt,StringType,true), StructField(src,StringType,true), StructField(states,StringType,true), StructField(tdr,StringType,true), StructField(tld,StringType,true), StructField(trusted,BooleanType,true), StructField(tsc,LongType,true), StructField(tsd,DoubleType,true), StructField(tsz,DoubleType,true), StructField(type,StringType,true), StructField(unloaded,StructType(StructField(duration,LongType,true), StructField(stime,LongType,true)),true), StructField(vdr,StringType,true), StructField(vfrd,LongType,true), StructField(visible,StructType(StructField(duration,LongType,true), StructField(stime,StringType,true)),true), StructField(xpath,StringType,true)),true),true), StructField(adcsm,ArrayType(StructType(StructField(tdr,DoubleType,true), StructField(vdr,DoubleType,true)),true),true);
Not sure how, but ONLY SOMETIMES there are two structs with the same name "adCsm" inside "payload". Since I am interested in fields present in one of them, I need to deal with this ambiguity.
I know one way is to check for fields A and B and drop the column if those fields are absent, and hence choose the other adCsm. I was wondering if there is a better way to handle this. Can I perhaps just merge the duplicate columns (which hold different data) instead of this explicit filtering?
I am also not sure how duplicate structs can even be present in a seq "json" file.
TIA!

I think the ambiguity is caused by a case-sensitivity issue with Spark DataFrame column names. In the last part of the schema I see
StructField(adcsm,
  ArrayType(StructType(
    StructField(tdr,DoubleType,true),
    StructField(vdr,DoubleType,true)),true),true)
So there are two StructFields whose names differ only by case (adCsm and adcsm) inside the parent StructType.
First enable case sensitivity in Spark SQL:
sqlContext.sql("set spark.sql.caseSensitive=true")
Then it will differentiate the two fields. Here are more details on how to solve the case-sensitivity issue: solve case sensitivity issue. Hopefully it'll help you.
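To make that concrete, here is a minimal sketch, assuming a Spark 2.x SparkSession named spark and the field names from the schema in the error (adjust them to your actual data):
import org.apache.spark.sql.functions.{col, explode}

// Enable case sensitivity so payload.adCsm and payload.adcsm resolve as two
// distinct fields instead of an ambiguous reference.
spark.conf.set("spark.sql.caseSensitive", "true")

val filteredData = data.filter("sourceInfo = 'Web'")

// These references now resolve unambiguously to the upper-case adCsm struct.
val explodedData = filteredData
  .withColumn("A", explode(col("payload.adCsm.vfrd")))
  .withColumn("B", explode(col("payload.adCsm.dbg")))
  .drop("payload")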

Related

passing array into isin() function in Databricks

I have a requirement where I have to filter records from a df if they are present in an array. So I have an array holding the distinct values from another df's column, like below.
dist_eventCodes = Event_code.select('Value').distinct().collect()
Now I am passing dist_eventCodes to a filter like below.
ADT_df_select = ADT_df.filter(ADT_df.eventTypeCode.isin(dist_eventCodes))
When I run this code I get the error message below:
"AttributeError: 'DataFrame' object has no attribute '_get_object_id'"
Can somebody please help me understand what I am doing wrong?
Thanks in advance
If I understood correctly, you want to retain only those rows whose eventTypeCode appears in the Event_code dataframe's values.
Let me know if this is not the case.
This can be achieved with a simple left-semi join in Spark. That way you don't need to collect the dataframe, which is the right approach in a distributed environment.
from pyspark.sql import functions as F
ADT_df.alias("df1").join(Event_code.select("value").distinct().alias("df2"), [F.col("df1.eventTypeCode") == F.col("df2.value")], "leftsemi")
Or if there is a specific need to use isin, this would work (collect_set will take care of distinct):
dist_eventCodes = Event_code.select("value").groupBy(F.lit("dummy")).agg(F.collect_set("value").alias("value")).first().asDict()
ADT_df_select = ADT_df.filter(ADT_df["eventTypeCode"].isin(dist_eventCodes["value"]))

SPARK Combining Neighbouring Records in a text file

Very new to Spark.
I need to read a very large input dataset, but I fear the format of the input files would not be amenable to reading in Spark. The format is as follows:
RECORD,record1identifier
SUBRECORD,value1
SUBRECORD2,value2
RECORD,record2identifier
RECORD,record3identifier
SUBRECORD,value3
SUBRECORD,value4
SUBRECORD,value5
...
Ideally what I would like to do is pull the lines of the file into a Spark RDD, and then transform it into an RDD that only has one item per record (with the subrecords becoming part of their associated record item).
So if the example above was read in, I'd want to wind up with an RDD containing 3 objects: [record1, record2, record3]. Each object would contain the data from its RECORD and any associated SUBRECORD entries.
The unfortunate bit is that the only thing in this data that links subrecords to records is their position in the file, underneath their record. That means the problem is sequentially dependent and might not lend itself to Spark.
Is there a sensible way to do this using Spark (and if so, what could that be, i.e. what transform could be used to collapse the subrecords into their associated record)? Or is this the sort of problem one needs to do off Spark?
There is a somewhat hackish way to identify the sequence of records and sub-records. This method assumes that each new "record" is identifiable in some way.
import org.apache.spark.sql.types.LongType
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{sum, collect_list}
import spark.implicits._  // toDS/toDF and $"..." syntax (spark = your SparkSession)
val df = Seq(
("RECORD","record1identifier"),
("SUBRECORD","value1"),
("SUBRECORD2","value2"),
("RECORD","record2identifier"),
("RECORD","record3identifier"),
("SUBRECORD","value3"),
("SUBRECORD","value4"),
("SUBRECORD","value5")
).toDS().rdd.zipWithIndex.map(r => (r._1._1, r._1._2, r._2)).toDF("record", "value", "id")
val win = Window.orderBy("id")
val recids = df.withColumn("newrec", ($"record" === "RECORD").cast(LongType))
.withColumn("recid", sum($"newrec").over(win))
.select($"recid", $"record", $"value")
val recs = recids.where($"record"==="RECORD").select($"recid", $"value".as("recname"))
val subrecs = recids.where($"record" =!= "RECORD").select($"recid", $"value".as("attr"))
recs.join(subrecs, Seq("recid"), "left").groupBy("recname").agg(collect_list("attr").as("attrs")).show()
This snippet will first zipWithIndex to identify each row, in order, then add a boolean column that is true every time a "record" is identified, and false otherwise. We then cast that boolean to a long, and can then do a running sum, which has the neat side-effect of essentially labeling every record and its sub-records with a common identifier.
In this particular case, we then split to get the record identifiers, re-join only the sub-records, group by the record ids, and collect the sub-record values to a list.
The above snippet results in this:
+-----------------+--------------------+
| recname| attrs|
+-----------------+--------------------+
|record1identifier| [value1, value2]|
|record2identifier| []|
|record3identifier|[value3, value4, ...|
+-----------------+--------------------+

spark save taking lot of time

I have 2 dataframes and I want to find the records where all columns are equal except 2 (surrogate_key, current).
And then I want to save those records with new surrogate_key value.
Following is my code :
val seq = csvDataFrame.columns.toSeq
var exceptDF = csvDataFrame.except(csvDataFrame.as('a).join(table.as('b),seq).drop("surrogate_key","current"))
exceptDF.show()
exceptDF = exceptDF.withColumn("surrogate_key", makeSurrogate(csvDataFrame("name"), lit("ecc")))
exceptDF = exceptDF.withColumn("current", lit("Y"))
exceptDF.show()
exceptDF.write.option("driver","org.postgresql.Driver").mode(SaveMode.Append).jdbc(postgreSQLProp.getProperty("url"), tableName, postgreSQLProp)
This code gives correct results, but it gets stuck while writing those results to Postgres.
Not sure what the issue is. Also, is there any better approach for this?
Regards,
Sorabh
By default Spark SQL creates 200 shuffle partitions, which means that when you try to save the dataframe the write runs as 200 parallel tasks (here, 200 concurrent JDBC writes to Postgres). You can reduce the number of partitions for a DataFrame using the techniques below.
At the application level, set the parameter "spark.sql.shuffle.partitions" as follows:
sqlContext.setConf("spark.sql.shuffle.partitions", "10")
Or reduce the number of partitions for a particular DataFrame as follows:
df.coalesce(10).write.save(...)
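Applied to the JDBC write in the question, a sketch could look like this (the coalesce value 10 is illustrative; tune it to your data volume and what Postgres can handle):
import org.apache.spark.sql.SaveMode

// Fewer partitions means fewer concurrent JDBC connections writing to Postgres.
val toWrite = exceptDF.coalesce(10)

toWrite.write
  .option("driver", "org.postgresql.Driver")
  .mode(SaveMode.Append)
  .jdbc(postgreSQLProp.getProperty("url"), tableName, postgreSQLProp)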
Using var for dataframes is not recommended. You should always use val and create a new DataFrame after performing a transformation.
Please remove all the vars and replace them with vals.
Hope this helps!

How to process tab-separated files in Spark?

I have a file which is tab separated. The third column should be my key and the entire record should be my value (as per Map reduce concept).
val cefFile = sc.textFile("C:\\text1.txt")
val cefDim1 = cefFile.filter { line => line.startsWith("1") }
val joinedRDD = cefFile.map(x => x.split("\\t"))
joinedRDD.first().foreach { println }
I am able to get the value of the first column but not the third. Can anyone suggest how I could accomplish this?
After you've done the split with x.split("\\t"), your RDD (which in your example you called joinedRDD, but I'm going to call it parsedRDD since we haven't joined it with anything yet) is going to be an RDD of arrays. We could turn this into an RDD of key/value tuples by doing parsedRDD.map(r => (r(2), r)). That being said, you aren't limited to just map & reduce operations in Spark, so it's possible that another data structure might be better suited. Also, for tab-separated files, you could use spark-csv along with Spark DataFrames if that is a good fit for the eventual problem you are looking to solve.
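Putting that together, a minimal sketch (using the path from the question; index 2 is the third, zero-based column):
val cefFile = sc.textFile("C:\\text1.txt")

// Split each line on tabs; keep only lines that actually have a third column.
val parsedRDD = cefFile.map(_.split("\\t")).filter(_.length > 2)

// Key each record by its third column, keeping the whole record as the value.
val keyedRDD = parsedRDD.map(r => (r(2), r))

// Example usage: group all records that share the same key and inspect a few keys.
val grouped = keyedRDD.groupByKey()
grouped.keys.take(5).foreach(println)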

non-ordinal access to rows returned by Spark SQL query

In the Spark documentation, it is stated that the result of a Spark SQL query is a SchemaRDD. Each row of this SchemaRDD can in turn be accessed by ordinal. I am wondering if there is any way to access the columns using the field names of the case class on top of which the SQL query was built. I appreciate the fact that the case class is not associated with the result, especially if I have selected individual columns and/or aliased them: however, some way to access fields by name rather than ordinal would be convenient.
A simple way is to use the "language-integrated" select method on the resulting SchemaRDD to select the column(s) you want -- this still gives you a SchemaRDD, and if you select more than one column then you will still need to use ordinals, but you can always select one column at a time. Example:
// setup and some data
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._
case class Score(name: String, value: Int)
val scores =
sc.textFile("data.txt").map(_.split(",")).map(s => Score(s(0),s(1).trim.toInt))
scores.registerAsTable("scores")
// initial query
val original =
sqlContext.sql("Select value AS myVal, name FROM scores WHERE name = 'foo'")
// now a simple "language-integrated" query -- no registration required
val secondary = original.select('myVal)
secondary.collect().foreach(println)
Now secondary is a SchemaRDD with just one column, and it works despite the alias in the original query.
Edit: but note that you can register the resulting SchemaRDD and query it with straight SQL syntax without needing another case class.
original.registerAsTable("original")
val secondary = sqlContext.sql("select myVal from original")
secondary.collect().foreach(println)
Second edit: when processing an RDD one row at a time, it's possible to access the columns by name by using pattern-matching syntax:
val secondary = original.map { case Row(myVal: Int, _) => myVal }
although this could get cumbersome if the right-hand side of the '=>' requires access to a lot of the columns, as they would each need to be matched on the left. (This comes from a very useful comment in the source code for the Row companion object.)
