Changing values of a JSON with RDD - apache-spark

How do you set a value in an RDD once you have transformed it?
I am modifying a JSON file with pyspark and I have this case:
categories: [ {alias, title}, {alias, title}, {alias, title} ]
I have made a transformation that creates a list of titles for each row:
[title, title, title].
But how do I set the result back to the key categories?
At the end I want to get:
categories: [title, title, title]
This is the transformation that I am doing:
restaurantRDD.map(lambda x: x.data).flatMap(lambda x: x).map(lambda x: [row.title for row in x.categories])
There are also several other transformations on restaurantRDD, similar to this one, that modify other parts of the JSON. How can I apply them all at once and then write the result to a new JSON file?
Should I use something else instead of RDD?
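If moving from the raw RDD to a DataFrame is an option (the last question above hints at this), a minimal sketch of the in-place rewrite could look like the following, assuming Spark 2.4+ for the transform higher-order function and a hypothetical file path:
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.json("restaurants.json")  # hypothetical path

# Overwrite `categories` in place with an array of just the titles.
df = df.withColumn("categories", F.expr("transform(categories, c -> c.title)"))

# Other per-column rewrites can be chained here before a single write.
df.write.mode("overwrite").json("restaurants_out")
Because all the column rewrites are applied to the same DataFrame before the write, they are executed together in one job rather than as separate passes over the data.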

Related

Parse URL parameters into their own columns in Spark dataframe or Glue DynamicFrame?

For example if I have these two items on amazon:
Item 1) https://www.amazon.com/Rush-Creek-Creations-Fishing-American/dp/B07MD1VC54/?_encoding=UTF8&pd_rd_w=G1bHU&pf_rd_p=bbb6bbd8-d236-47cb-b42f-734cb0cacc1f&pf_rd_r=Q2BYAHBV16RTAGT2EB8E&pd_rd_r=af1ccd81-be7e-4412-b205-2b87b3f174fa&pd_rd_wg=yYmbt&ref_=pd_gw_ci_mcx_mi
Item 2) https://www.amazon.com/AmazonBasics-Performance-Alkaline-Batteries-Count/dp/B00LH3DMUO/ref=sr_1_1?crid=2KPI7WO52EB53&keywords=amzn1.osp&qid=1649190321&sprefix=amzn1.osp%2Caps%2C166&sr=8-1
I want to parse the URL parameters into their own columns:
Item 1 parameters: encoding, pd_rd_w, pf_rd_p, pf_rd_r, pd_rd_r, pd_rd_wg, ref
Item 2 parameters: crid, keywords, qid, sprefix, sr
I would like to continuously update a dataframe, adding new columns whenever my ETL job reads in a stream of new rows from a source dataframe.
So my schema so far would look like this:
encoding, pd_rd_w, pf_rd_p, pf_rd_r, pd_rd_r, pd_rd_wg, ref, crid, keywords, qid, sprefix, sr
(More columns would be added whenever we get a new URL with a parameter we've never seen before.)
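A rough sketch of one way to do the extraction (not a full incremental-schema solution): the built-in parse_url and str_to_map SQL functions split the query string into a map, and the keys seen so far become columns. The item IDs and shortened URLs below are only illustrative:
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("item1", "https://www.amazon.com/dp/B07MD1VC54/?pd_rd_w=G1bHU&ref_=pd_gw_ci_mcx_mi"),
     ("item2", "https://www.amazon.com/dp/B00LH3DMUO/ref=sr_1_1?crid=2KPI7WO52EB53&qid=1649190321&sr=8-1")],
    ["item", "url"])

# parse_url(..., 'QUERY') extracts "k1=v1&k2=v2"; str_to_map turns it into a map column.
params = df.withColumn("params", F.expr("str_to_map(parse_url(url, 'QUERY'), '&', '=')"))

# One column per parameter seen so far; a new, unseen key means re-deriving this list.
keys = sorted({k for (ks,) in params.select(F.map_keys("params")).collect() for k in ks})
result = params.select("item", *[F.col("params")[k].alias(k) for k in keys])
Keeping the parameters in a single map column and only pivoting the keys out at the end avoids rewriting the schema on every micro-batch.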

Apache pyspark remove stopwords and calculate

I have the following .csv file (ID, title, book title, author, etc.):
I want to compute all the n-word combinations (with n=4, i.e. every 4-word combination from each title) from the titles of the articles (column 2), after removing the stopwords.
I have created the dataframe:
df_hdfs = spark.read.option('delimiter', ',').option('header', 'true').csv("/user/articles.csv")
I have created an rdd with the titles column:
rdd = df_hdfs.rdd.map(lambda x: (x[1]))
and it seems like this:
Now, I realize that I have to tokenize each string of the RDD into words and then remove the stopwords. I would need a little help on how to do this and how to compute the combinations.
Thanks.
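A possible sketch of those two steps, building on the rdd of titles above; the stopword list here is only a placeholder:
from itertools import combinations

# Placeholder stopword list; a fuller list (e.g. from
# pyspark.ml.feature.StopWordsRemover.loadDefaultStopWords('english')) could be used instead.
stopwords = {"a", "an", "the", "of", "and", "in", "on", "for", "to", "with"}

def clean(title):
    # Lowercase, split on whitespace, drop stopwords.
    return [w for w in title.lower().split() if w and w not in stopwords]

combos = rdd.map(clean).flatMap(lambda words: combinations(words, 4))
# e.g. count each 4-word combination:
counts = combos.map(lambda c: (c, 1)).reduceByKey(lambda a, b: a + b)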

Spark: Read multiple AVRO files with different schema in parallel

I have many (relatively small) AVRO files with different schemas, each set stored in its own location like this:
Object Name: A
/path/to/A
A_1.avro
A_2.avro
...
A_N.avro
Object Name: B
/path/to/B
B_1.avro
B_2.avro
...
B_N.avro
Object Name: C
/path/to/C
C_1.avro
C_2.avro
...
C_N.avro
...
and my goal is to read them in parallel via Spark and store each row as a blob in one column of the output. As a result my output data will have a consistent schema, something like the following columns:
ID, objectName, RecordDate, Data
where the 'Data' field contains the original record as a JSON string.
My initial thought was to put the spark read statements in a loop, create the fields shown above for each dataframe, and then apply a union operation to get my final dataframe, like this:
all_df = []
for obj_name in all_object_names:
    file_path = get_file_path(obj_name)
    df = spark.read.format(DATABRIKS_FORMAT).load(file_path)
    # ... project df onto the common (ID, objectName, RecordDate, Data) schema here ...
    all_df.append(df)

df_o = all_df[0]
for df in all_df[1:]:
    df_o = df_o.union(df)
# write df_o to the output
However I'm not sure if the read operations are going to be parallelized.
I also came across the sc.textFile() function for reading all the AVRO files in one shot as strings, but couldn't make it work.
So I have 2 questions:
Would the multiple read statements in a loop be parallelized by Spark? Or is there a more efficient way to achieve this?
Can sc.textFile() be used to read the AVRO files as JSON strings into one column?
I'd appreciate your thoughts and suggestions.
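A hedged sketch of that loop-and-union idea, assuming a Spark 2.4+ built-in Avro reader (older versions need the com.databricks.spark.avro package) and that every schema carries ID and RecordDate fields; to_json packs the whole original record into the Data column:
from functools import reduce
from pyspark.sql import functions as F

frames = []
for obj_name in all_object_names:
    raw = spark.read.format("avro").load(get_file_path(obj_name))
    frames.append(raw.select(
        F.col("ID"),
        F.lit(obj_name).alias("objectName"),
        F.col("RecordDate"),
        F.to_json(F.struct([raw[c] for c in raw.columns])).alias("Data")))

# All frames now share one schema, so a plain union works.
df_o = reduce(lambda a, b: a.union(b), frames)
Since DataFrame reads are lazy, the loop only builds the plan on the driver; the actual file scans run in parallel across the executors when df_o is finally written.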

SparkSQL: Am I doing it right?

Here is how I use Spark SQL in a little application I am working on.
I have two HBase tables, say t1 and t2.
My input is a CSV file; I parse every line and query table t1 via SparkSQL, writing the output to another file.
Then I parse that second file, query the second table, apply certain functions over the result, and output the data.
Table t1 has the purchase details, and t2 has the list of items each user added to the cart along with the time frame.
Input -> CustomerID (a list of them in a CSV file)
Output -> A CSV file in the particular format mentioned below:
CustomerID, details of the item he bought, first item he added to cart, all the items he added to cart until purchase.
For an input of 1100 records, it takes two hours to complete the whole process!
I was wondering if I could speed up the process, but I am stuck.
Any help?
How about this DataFrame approach...
1) Create a dataframe from CSV.
how-to-read-csv-file-as-dataframe
or something like the following example:
val csv = sqlContext.sparkContext.textFile(csvPath).map {
  case (txt) =>
    try {
      val reader = new CSVReader(new StringReader(txt), delimiter, quote, escape, headerLines)
      val parsedRow = reader.readNext()
      Row(mapSchema(parsedRow, schema) : _*)
    } catch {
      case e: IllegalArgumentException => throw new UnsupportedOperationException("converted from Arg to Op except")
    }
}
2) Create another DataFrame from the HBase data (e.g. via the Hortonworks connector) or Phoenix.
3) Do the join and apply your functions (maybe UDFs, or when/otherwise, etc.); the result can be a dataframe again (see the pyspark sketch after the example below).
4) Join the resulting dataframe with the second table and output the data as CSV, as in the pseudo-code example below...
It should be possible to prepare a dataframe with custom columns and corresponding values and save it as a CSV file.
You can run this kind of thing in spark-shell as well.
val df = sqlContext.read.format("com.databricks.spark.csv").
  option("header", "true").
  option("inferSchema", "true").
  load("cars93.csv")
val df2 = df.filter("quantity <= 4.0")
val col = df2.col("cost") * 0.453592
val df3 = df2.withColumn("finalcost", col)
df3.write.format("com.databricks.spark.csv").
  option("header", "true").
  save("output-csv")
Hope this helps.. Good luck.

Pyspark: groupby and then count true values

My data structure is in JSON format:
"header"{"studentId":"1234","time":"2016-06-23","homeworkSubmitted":True}
"header"{"studentId":"1234","time":"2016-06-24","homeworkSubmitted":True}
"header"{"studentId":"1234","time":"2016-06-25","homeworkSubmitted":True}
"header"{"studentId":"1236","time":"2016-06-23","homeworkSubmitted":False}
"header"{"studentId":"1236","time":"2016-06-24","homeworkSubmitted":True}
....
I need to plot a histogram that shows the number of homeworkSubmitted: True over all studentIds. I wrote code that flattens the data structure, so my keys are header.studentId, header.time and header.homeworkSubmitted.
I used keyBy to group by studentId:
(initialRDD.keyBy(lambda row: row['header.studentId'])
    .map(lambda (k, v): (k, v['header.homeworkSubmitted']))
    .map(mapTF).groupByKey().mapValues(lambda x: Counter(x)).collect())
This gives me result like this:
("1234", Counter({0:0, 1:3}),
("1236", Counter(0:1, 1:1))
I need only number of counts of 1, possibly mapped to a list so that I can plot a histogram using matplotlib. I am not sure how to proceed and filter everything.
Edit: in the end I iterated through the dictionary, added the counts to a list, and then plotted a histogram of the list. I am wondering if there is a more elegant way to do the whole process I outlined in my code.
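A compact version of the workaround described in this edit might look like the following, where results stands for the [(studentId, Counter), ...] list collected above:
import matplotlib.pyplot as plt

# `results` is assumed to be the [(studentId, Counter), ...] list collected above.
true_counts = [counter[1] for _, counter in results]  # count of True (mapped to 1) per student
plt.hist(true_counts)
plt.xlabel("homeworkSubmitted == True per student")
plt.ylabel("number of students")
plt.show()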
df = sqlContext.read.json('/path/to/your/dataset/')
df.filter(df.homeworkSubmitted == True).groupby(df.studentId).count()
Note that this is not valid JSON if there is a "header" prefix, or True instead of true.
I don't have Spark in front of me right now, though I can edit this tomorrow when I do.
But if I'm understanding this, you have three key-value RDDs and need to filter by homeworkSubmitted=True. I would think you'd turn this into a dataframe, then use:
df.where(df.homeworkSubmitted==True).count()
You could then use group by operations if you wanted to explore subsets based on the other columns.
You can filter out the False rows, keeping the data in an RDD, then count the True ones:
initialRDD.filter(lambda row : row['header.homeworkSubmitted'])
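Building on that filter, the counting step could be a simple countByValue, which returns a plain Python dict keyed by studentId (a sketch):
submitted = initialRDD.filter(lambda row: row['header.homeworkSubmitted'])
counts = submitted.map(lambda row: row['header.studentId']).countByValue()  # e.g. {'1234': 3, '1236': 1}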
Another solution would be to sum the booleans
data = sc.parallelize([('id1',True),('id1',True),
('id2',False),
('id2',False),('id3',False),('id3',True) ])
data.reduceByKey(lambda x,y:x+y).collect()
Outputs
[('id2', 0), ('id3', 1), ('id1', 2)]
