Create XML request from each record of a dataframe - apache-spark

I have tried many options including withColumn, udf, lambda, foreach, and map, but I am not getting the expected output. At most, I am able to transform only the first record. The inputfile.json will keep on growing, and the expected output should give the XML in the desired structure. I will later produce the expected output on Kafka.
Spark 2.3, Python 2.7. The need is to do it in PySpark.
Edit 1:
I am able to add a column to the main dataframe that holds the required XML. I used withColumn with functions.format_string to add the strings (the XML structures) to columns of the dataframe.
Now my next target is to produce just the value of that new column to Kafka. I am using df.foreachPartition(send_to_kafka) and have created the function below:
def send_to_kafka(rows):
    kafka = SimpleClient('localhost:9092')
    producer = SimpleProducer(kafka)
    for row in rows:
        producer.send_messages('test', str(row.asDict()))
But unfortunately it does two things:
a. It produces the record on Kafka as {'newColumn': u'myXMLPayload'}. I do not want that; I want only myXMLPayload to be produced on Kafka.
b. It adds the u' prefix to the value because the value is a unicode string.
I want to get rid of these two parts and I would be good to go.
Any help would be appreciated.
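A minimal sketch of the extraction step, assuming the column is named newColumn as above; a plain dict stands in for a pyspark Row here, since Row supports the same [] access by column name:

```python
def payload_bytes(row):
    # row['newColumn'] yields the XML string itself, while
    # str(row.asDict()) wraps it as {'newColumn': u'...'} --
    # which is exactly the unwanted output described above.
    return row['newColumn'].encode('utf-8')

# Plain dict as a stand-in for a pyspark Row (same [] access).
sample = {'newColumn': u'<record><id>1</id></record>'}
print(payload_bytes(sample))  # b'<record><id>1</id></record>'
```

Inside send_to_kafka, replacing str(row.asDict()) with payload_bytes(row) would then produce only the XML payload, without the dict wrapper and without the u' prefix.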

Related

Write each row of a dataframe to a separate json file in s3 with pyspark

In one of my projects, I need to write each row of a dataframe to a separate S3 file in JSON format. In the actual implementation, the input to map/foreach is a Row, but I can't find any member function on Row that could transform a row into JSON format.
I'm using a Spark df and don't want to convert it to pandas (as that involves sending everything to the driver?), hence I cannot use the to_json function. Is there any other way to do it? I can definitely write my own JSON converter based on my specific df schema, but I am wondering if there is a readily available module.
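For what it's worth, a hand-rolled converter can be a one-liner over the stdlib; row_to_json and the FakeRow stand-in below are illustrative only, the real input would be a pyspark Row (which exposes asDict()):

```python
import json

def row_to_json(row):
    # pyspark's Row.asDict() gives a plain dict that json.dumps can
    # serialize; asDict(recursive=True) would handle nested Rows.
    return json.dumps(row.asDict())

# Minimal stand-in for a pyspark Row, for illustration only.
class FakeRow(object):
    def __init__(self, data):
        self._data = data
    def asDict(self):
        return dict(self._data)

print(row_to_json(FakeRow({'a': 1})))  # {"a": 1}
```

Note also that pyspark DataFrames expose df.toJSON(), which returns an RDD of JSON strings (one per row), so a custom converter may not be needed at all.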

GroupByKey to fill values and then ungroup apache beam

I have CSV files with missing values within groups formed by primary keys (for each group, only one record has a value populated for a certain field, and I need that field to be populated for all records of the group). I'm processing the entire file with apache beam, and therefore I want to use GroupByKey to fill up the field for each group, and then ungroup it to restore the original data, now with filled data. The equivalent in pandas would be:
dataframe[column_to_be_filled] = dataframe.groupby(primary_key)[column_to_be_filled].ffill().bfill()
I don't know how to achieve this with apache beam. I first used apache beam dataframe, but that'd take a lot of memory.
It's better to process your elements with a pcollection instead of a dataframe to avoid memory issues.
First read your CSV as a pcollection and then you can use GroupByKey and process the grouped elements and yield the results with a separate transformation.
It could be something like this
(pcollection | 'Group by key' >> beam.GroupByKey()
| 'Process grouped elements' >> beam.ParDo(UngroupElements()))
The input pcollection should be list of tuples each one contains the key you want to group with and the element.
And the DoFn passed to ParDo would look like this (note it must subclass beam.DoFn, not beam.ParDo, and process takes self as its first argument):
class UngroupElements(beam.DoFn):
    def process(self, element):
        k, v = element
        for elem in v:
            # process your element
            yield elem
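The fill logic inside process could look like this pure-Python sketch (assuming, for illustration, that each record is a dict with a possibly-None 'field' entry that is populated in exactly one record per group):

```python
def fill_group(element):
    # element is one (key, records) pair as emitted by GroupByKey.
    key, records = element
    records = list(records)
    # Find the single populated value for this group...
    fill = next(r['field'] for r in records if r['field'] is not None)
    # ...and copy it onto every record, restoring the original shape.
    for r in records:
        yield dict(r, field=fill)

filled = list(fill_group(('k1', [{'field': None}, {'field': 'x'}])))
print(filled)  # [{'field': 'x'}, {'field': 'x'}]
```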
You can try to use exactly the same code as Pandas in Beam: https://beam.apache.org/documentation/dsls/dataframes/overview/
You can use read_csv to read your data into a dataframe, and then apply the same code that you would use in Pandas. Not all Pandas operations are supported (https://beam.apache.org/documentation/dsls/dataframes/differences-from-pandas/), but that specific case with the group by key should work.

Pyspark groupBy with custom partitioner

I want to apply some custom partitioning when working with a given DataFrame. I found that the RDD groupBy provides me with the desired functionality. Now when I say
dataframe.rdd.groupBy(lambda row: row[1:3], numPartitions, partitioner)
I end up with a PythonRDD that has a tuple as a key and a ResultIterator as the value. What I want to do next is convert this back to a DataFrame since I want to use the apply function of GroupedData. I have attempted multiple things but have been unlucky so far.
Any help would be appreciated!
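One way back (a sketch, not tested against the question's exact setup): flatten the grouped values with flatMap and rebuild the DataFrame from the original schema, i.e. dataframe.rdd.groupBy(...).flatMap(lambda kv: kv[1]) followed by spark.createDataFrame(rdd, dataframe.schema). The ungrouping step itself is just:

```python
def ungroup(pairs):
    # Mirrors flatMap(lambda kv: kv[1]): drop each key, emit its rows.
    for _key, rows in pairs:
        for row in rows:
            yield row

# Toy (key, iterable-of-rows) pairs, like groupBy's output shape.
grouped = [(('a', 1), ['row1', 'row2']), (('b', 2), ['row3'])]
print(list(ungroup(grouped)))  # ['row1', 'row2', 'row3']
```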

Spark Python: Converting multiple lines from inside a loop into a dataframe

I have a loop that is going to create multiple rows of data which I want to convert into a dataframe.
Currently I am creating a CSV-format string, and inside the loop I keep appending rows to it, separated by a newline. I am creating a CSV file so that I can also save it as a text file for other processing.
File Header:
output_str="Col1,Col2,Col3,Col4\n"
Inside for loop:
output_str += "Val1,Val2,Val3,Val4\n"
I then create an RDD by splitting it on the newline and then convert it into a dataframe as follows.
output_rdd = sc.parallelize(output_str.split("\n"))
output_df = output_rdd.map(lambda x: (x, )).toDF()
It creates a dataframe, but it has only 1 column. I know that is because of the map function, where I am making each line into a tuple with only 1 item. What I need is a tuple with multiple items, so perhaps I should call the split() function on every line to get a list. But I have a feeling there should be a much more straightforward way. Appreciate any help. Thanks.
Edit: To give more information: using Spark SQL I have filtered my dataset to those rows that contain the problem. However, the rows contain information in the following format (separated by '|'), and I need to extract the values from column 3 whose corresponding flag is set to 1 in column 4 (here it is 0xcd):
Field1|Field2|0xab,0xcd,0xef|0x00,0x01,0x00
So I am collecting the output at the driver and then parsing the last 2 columns after which I am left with regular strings that I want to put back in a dataframe. I am not sure if I can achieve the same using Spark SQL to parse the output in the manner I want.
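The parsing itself can be done per line without collecting to the driver, e.g. inside a map or a udf; a plain-Python sketch (extract_flagged is a hypothetical helper, matching the sample row above, and 0x01 is taken as the "flag set" marker):

```python
def extract_flagged(line):
    # Split 'Field1|Field2|0xab,0xcd,0xef|0x00,0x01,0x00' into fields,
    # then keep the column-3 values whose column-4 flag is 0x01.
    fields = line.split('|')
    values = fields[2].split(',')
    flags = fields[3].split(',')
    return [v for v, f in zip(values, flags) if f == '0x01']

print(extract_flagged('Field1|Field2|0xab,0xcd,0xef|0x00,0x01,0x00'))
# ['0xcd']
```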
Yes, indeed your current approach seems a little too complicated... Creating a large string in the Spark driver and then parallelizing it with Spark is not really performant.
First of all, where are you getting your input data from? In my opinion you should use one of the existing Spark readers to read it. For example you can use:
CSV -> http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.csv
jdbc -> http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.jdbc
json -> http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.json
parquet -> http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.parquet
not structured text file -> http://spark.apache.org/docs/2.1.0/api/python/pyspark.html#pyspark.SparkContext.textFile
As a next step, you can preprocess it using the Spark DataFrame or RDD API, depending on your use case.
A bit late, but currently you're applying a map to create a tuple for each row, containing the string as the first element. Instead, you probably want to split the string, which can easily be done inside the map step. Assuming all of your rows have the same number of elements, you can replace:
output_df = output_rdd.map(lambda x: (x, )).toDF()
with
output_df = output_rdd.map(lambda x: x.split(",")).toDF()
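Outside Spark, the difference between the two map functions can be seen with plain Python. (Note that split("\n") on a string with a trailing newline leaves an empty last element, which is worth filtering out before toDF(); and since the data is comma-separated, the split must be on "," rather than the default whitespace.)

```python
output_str = "Col1,Col2,Col3,Col4\nVal1,Val2,Val3,Val4\n"
lines = [l for l in output_str.split("\n") if l]  # drop trailing empty string
single_col = [(l,) for l in lines]          # one column holding the whole line
multi_col = [l.split(",") for l in lines]   # one column per comma-separated value
print(multi_col[1])  # ['Val1', 'Val2', 'Val3', 'Val4']
```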

spark dataset : how to get count of occurence of unique values from a column

I am trying the Spark Dataset API, reading a CSV file and counting the occurrences of unique values in a particular field. One approach which I think should work is not behaving as expected. Let me know what I am overlooking. I have posted both the working and the buggy approach below.
// get all records from a column
val professionColumn = data.select("profession")
// breakdown by professions in descending order
// ***** DOES NOT WORK ***** //
val breakdownByProfession = professionColumn.groupBy().count().collect()
// ***** WORKS ***** //
val breakdownByProfessiond = data.groupBy("profession").count().sort("count")
println ( s"\n\nbreakdown by profession \n")
breakdownByProfession.show()
Also please let me know which approach is more efficient. My guess would be the first one ( the reason to attempt that in first place )
Also what is the best way to save output of such an operation in a text file using dataset APIs
In the first case, since there are no grouping columns specified, the entire dataset is considered as one group -- this behavior holds even though there is only one column present in the dataset. So, you should always pass the list of columns to groupBy().
Now the two options would be: data.select("profession").groupBy("profession").count vs. data.groupBy("profession").count. In most cases, the performance of these two alternatives will be exactly the same since Spark tries to push projections (i.e., column selection) down the operators as much as possible. So, even in the case of data.groupBy("profession").count, Spark first selects the profession column before it does the grouping. You can verify this by looking at the execution plan -- org.apache.spark.sql.Dataset.explain()
In the groupBy transformation you need to provide the column name, as below:
val breakdownByProfession = professionColumn.groupBy("profession").count().collect()