I want to combine only specific data from two files and output a file with columns for these data and how many times they appeared in both data files for an ssn. Should I be doing something like
val data = spark.read.json("file")
val ssn = data.filter(line => line.contains("ssn")).count()
val tickets = data.filter(line => line.contains("tickets")).count()
Is there a way to join the data based off the ssn and then keep incrementing up the tickets if I see the same ssn in the output file?
Also ssn is in a json key value pair in both files but on each line, they are in their own little json message and when I tried to read the files, I got a "_corrupt_record" Should I just be reading it as a plain text?
Related
very new to SPARK.
I need to read a very large input dataset, but I fear the format of the input files would not be amenable to read on SPARK. Format is as follows:
RECORD,record1identifier
SUBRECORD,value1
SUBRECORD2,value2
RECORD,record2identifier
RECORD,record3identifier
SUBRECORD,value3
SUBRECORD,value4
SUBRECORD,value5
...
Ideally what I would like to do is pull the lines of the file into a SPARK RDD, and then transform it into an RDD that only has one item per record (with the subrecords becoming part of their associated record item).
So if the example above was read in, I'd want to wind up with an RDD containing 3 objects: [record1,record2,record3]. Each object would contain the data from their RECORD and any associated SUBRECORD entries.
The unfortunate bit is that the only thing in this data that links subrecords to records is their position in the file, underneath their record. That means the problem is sequentially dependent and might not lend itself to SPARK.
Is there a sensible way to do this using SPARK (and if so, what could that be, what transform could be used to collapse the subrecords into their associated record)? Or is this the sort of problem one needs to do off spark?
There is a somewhat hackish way to identify the sequence of records and sub-records. This method assumes that each new "record" is identifiable in some way.
import org.apache.spark.sql.types.LongType
import org.apache.spark.sql.expressions.Window
val df = Seq(
("RECORD","record1identifier"),
("SUBRECORD","value1"),
("SUBRECORD2","value2"),
("RECORD","record2identifier"),
("RECORD","record3identifier"),
("SUBRECORD","value3"),
("SUBRECORD","value4"),
("SUBRECORD","value5")
).toDS().rdd.zipWithIndex.map(r => (r._1._1, r._1._2, r._2)).toDF("record", "value", "id")
val win = Window.orderBy("id")
val recids = df.withColumn("newrec", ($"record" === "RECORD").cast(LongType))
.withColumn("recid", sum($"newrec").over(win))
.select($"recid", $"record", $"value")
val recs = recids.where($"record"==="RECORD").select($"recid", $"value".as("recname"))
val subrecs = recids.where($"record" =!= "RECORD").select($"recid", $"value".as("attr"))
recs.join(subrecs, Seq("recid"), "left").groupBy("recname").agg(collect_list("attr").as("attrs")).show()
This snippet will first zipWithIndex to identify each row, in order, then add a boolean column that is true every time a "record" is identified, and false otherwise. We then cast that boolean to a long, and then can do a running sum, which has the neat side-effect of essentially labeling every record and it's sub-records with a common identifier.
In this particular case, we then split to get the record identifiers, re-join only the sub-records, group by the record ids, and collect the sub-record values to a list.
The above snippet results in this:
+-----------------+--------------------+
| recname| attrs|
+-----------------+--------------------+
|record1identifier| [value1, value2]|
|record2identifier| []|
|record3identifier|[value3, value4, ...|
+-----------------+--------------------+
I have many (relatively small) AVRO files with different schema, each in one location like this:
Object Name: A
/path/to/A
A_1.avro
A_2.avro
...
A_N.avro
Object Name: B
/path/to/B
B_1.avro
B_2.avro
...
B_N.avro
Object Name: C
/path/to/C
C_1.avro
C_2.avro
...
C_N.avro
...
and my goal is to read them in parallel via Spark and store each row as a blob in one column of the output. As a result my output data will have a consistent schema, something like the following columns:
ID, objectName, RecordDate, Data
Where the 'Data' field contains a string JSON of the original record.
My initial thought was to put the spark read statements in a loop, create the fields shown above for each dataframe, and then apply a union operation to get my final dataframe, like this:
all_df = []
for obj_name in all_object_names:
file_path = get_file_path(object_name)
df = spark.read.format(DATABRIKS_FORMAT).load(file_path)
all_df.append(df)
df_o = all_df.drop()
for df in all_df:
df_o = df_o.union(df)
# write df_o to the output
However I'm not sure if the read operations are going to be parallelized.
I also came across the sc.textFile() function to read all the AVRO files in one shot as string, but couldn't make it work.
So I have 2 questions:
Would the multiple read statements in a loop be parallelized by
Spark? Or is there a more efficient way to achieve this?
Can sc.textFile() be used to read the AVRO files as a string JSON in one column?
I'd appreciate your thoughts and suggestions.
I'm having 100,000 files, each contains data for M items, and each item has N attributes, with the format of each file is like this:
itemID,key,value
item1_1,key1,value1
item1_1,key2,value2
...
item1_1,keyN,valueN
item1_2,key1,value1
...
item1_2,keyN,valueN
...
item1_M,keyN,valueN
(1st file has data for items from item1_1 to item1_M. 2nd file has data for items from item2_1 to item2_M,...)
For further processing, I want to transform the data into a DataFrame like:
item1_2,v1,v2,...,vN
...
item1000000_M,v1,v2,...,vN
I tried to achieve this by using Spark's pivot:
val df = spark.read.csv("/path/*")
df.groupBy("itemID").pivot("key",Seq("key1","key2",...,"keyN")).agg(first("value"))
For 10GB of zipped data, it took hours to complete. I believe that using the pivot function is not a good idea here, but don't know of any alternative options yet.
What can be a better approach?
Thanks for your support.
Here is how I use Spark-SQL in a little application I am working with.
I have two Hbase tables say t1,t2.
My input being a csv file, I parse each and every line and query(SparkSQL) the table t1. I write the output to another file.
Now I parse the second file and query the second table and I apply certain functions over the result and I output the data.
the table t1 hast the purchase details and t2 has the list of items that were added to cart along with the time frame by each user.
Input -> CustomerID(list of it in a csv file)
Output - > A csv file in a particular format mentioned below.
CustomerID, Details of the item he brought,First item he added to cart,All the items he added to cart until purchase.
For a input of 1100 records, It takes two hours to complete the whole process!
I was wondering if I could speed up the process but I am struck.
Any help?
How about this DataFrame approach...
1) Create a dataframe from CSV.
how-to-read-csv-file-as-dataframe
or something like this in example.
val csv = sqlContext.sparkContext.textFile(csvPath).map {
case(txt) =>
try {
val reader = new CSVReader(new StringReader(txt), delimiter, quote, escape, headerLines)
val parsedRow = reader.readNext()
Row(mapSchema(parsedRow, schema) : _*)
} catch {
case e: IllegalArgumentException => throw new UnsupportedOperationException("converted from Arg to Op except")
}
}
2) Create Another DataFrame from Hbase data (if you are using Hortonworks) or phoenix.
3) do join and apply functions(may be udf or when othewise.. etc..) and resultant file could be a dataframe again
4) join result dataframe with second table & output data as CSV as in pseudo code as an example below...
It should be possible to prepare dataframe with custom columns and corresponding values and save as CSV file.
you can this kind in spark shell as well.
val df = sqlContext.read.format("com.databricks.spark.csv").
option("header", "true").
option("inferSchema","true").
load("cars93.csv")
val df2=df.filter("quantity <= 4.0")
val col=df2.col("cost")*0.453592
val df3=df2.withColumn("finalcost",col)
df3.write.format("com.databricks.spark.csv").
option("header","true").
save("output-csv")
Hope this helps.. Good luck.
I am new in spark. I wanted to do spark streaming setup to retrieve key value pairs of below format files:
file: info1
Note: Each info file will have around of 1000 of these records. And our system is continuously generating these info files. Through, spark streaming i wanted to do mapping of line numbers and info files and wanted to get aggregate result.
Can we give input to spark cluster these kind of files? I am interested in the "SF" and "DA" delimiters only, "SF" corresponds to source file and "DA" corresponds the ( line number, count).
As this input data is not the line format, so is this the good idea to use these files for the spark input or should i need to do some intermediary stage where i need to clean these files to generate new files which will have each record information in line instead of blocks?
Or can we achieve this in Spark itself? What should be the right approach?
What i wanted to achieve?
I wanted to get line level information. Means, to get line (As a key) and info files (as values)
Final output i wanted is like below:
line178 -> (info1, info2, info7.................)
line 2908 -> (info3, info90, ..., ... ,)
Do let me know if my explanation is not clear or if i am missing something.
Thanks & Regards,
Vinti
You could do something like this. Having your DStream stream:
// this gives you DA & FP lines, with the line number as the key
val validLines = stream.map(_.split(":")).
filter(line => Seq("DA", "FP").contains(line._1)).
map(_._2.split(","))
map(line => (line._1, line._2))
// now you should accumulate values
val state = validLines.updateStateByKey[Seq[String]](updateFunction _)
def updateFunction(newValues: Seq[Seq[String]], runningValues: Option[Seq[String]]): Option[Seq[String]] = {
// add the new values
val newVals = runnigValues match {
case Some(list) => list :: newValues
case _ => newValues
}
Some(newVals)
}
This should accumulate for each key a sequence with the values associated, storing it in state