Pyspark Pair RDD from Text File - apache-spark

I have a local text file kv_pair.log formatted such that key-value pairs are comma delimited and each record is on its own line:
"A"="foo","B"="bar","C"="baz"
"A"="oof","B"="rab","C"="zab"
"A"="aaa","B"="bbb","C"="zzz"
I am trying to read this into a Pair RDD using PySpark as follows:
from pyspark import SparkContext

sc = SparkContext()
# Read raw text to RDD
lines = sc.textFile('kv_pair.log')
# How to turn this into a Pair RDD?
pairs = lines.map(lambda x: x.replace('"', '').split(","))
print type(pairs)
print pairs.take(2)
I feel I am close! The output of the above is:
[[u'A=foo', u'B=bar', u'C=baz'], [u'A=oof', u'B=rab', u'C=zab']]
So it looks like pairs is an RDD of records, where each record is a list of the kv pairs as strings.
How can I use PySpark to transform this into a Pair RDD such that the keys and values are properly separated?
The ultimate goal is to transform this Pair RDD into a DataFrame to perform SQL operations, but one step at a time: please help me transform this into a Pair RDD first.

You can use flatMap with a custom function, as a lambda can't contain multiple statements:
def tranfrm(x):
    lst = x.replace('"', '').split(",")
    return [(kv.split("=")[0], kv.split("=")[1]) for kv in lst]

pairs = lines.flatMap(tranfrm)
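For the three sample lines in the question, the resulting pair RDD should look roughly like this (a sketch; the u'' prefixes and the ordering after groupByKey depend on the Python/Spark version):
print(pairs.take(3))
# roughly [(u'A', u'foo'), (u'B', u'bar'), (u'C', u'baz')]
# pair-RDD operations such as groupByKey are now available
print(pairs.groupByKey().mapValues(list).collect())
# roughly [(u'A', [u'foo', u'oof', u'aaa']), (u'B', [u'bar', u'rab', u'bbb']), (u'C', [u'baz', u'zab', u'zzz'])]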

This is really bad practice for a parser, but I believe your example could be done with something like this:
from pyspark import SparkContext
from pyspark.sql import Row

sc = SparkContext()
# Read raw text to RDD
lines = sc.textFile('kv_pair.log')
# Split each record on commas, then build a Row per record from the "key=value" parts
pairs = lines.map(lambda x: x.replace('"', '').split(",")) \
    .map(lambda r: Row(A=r[0].split('=')[1], B=r[1].split('=')[1], C=r[2].split('=')[1]))
print type(pairs)
print pairs.take(2)
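Since the stated ultimate goal is SQL, one way to continue from the Row RDD above (a sketch, assuming the sc and pairs variables from the snippet and a Spark 1.x-style SQLContext) is:
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)
# infer the schema from the Row objects and register a temp table for SQL
df = sqlContext.createDataFrame(pairs)
df.registerTempTable("kv_pairs")
sqlContext.sql("SELECT A, C FROM kv_pairs WHERE B = 'bar'").show()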

Related

How to convert a pyspark column (pyspark.sql.column.Column) to a pyspark dataframe?

I have a use case to map the elements of a pyspark column based on a condition.
Going through this documentation for pyspark Column, I could not find a map function on the column.
So I tried to use the pyspark DataFrame map function, but I am not able to convert the pyspark column to a dataframe.
Note: The reason I am using the pyspark column is that I get it as an input from a library (Great Expectations) which I use.
@column_condition_partial(engine=SparkDFExecutionEngine)
def _spark(cls, column, ts_formats, **kwargs):
    return column.isin([3])
    # need to replace the above logic with a map function
    # like column.map(lambda x: __valid_date(x))
The _spark function arguments are passed in from the library.
What I have: a pyspark column with timestamp strings.
What I require: a pyspark column with a boolean (True/False) for each element, based on validating the timestamp format.
Example for a dataframe:
df.rdd.map(lambda x: __valid_date(x)).toDF()
The __valid_date function returns True/False.
So I either need to convert the pyspark column into a dataframe to use the above map function, or is there a map function available for the pyspark column?
It looks like you need to return a column object that the framework will use for validation.
I have not used Great Expectations, but maybe you can define a UDF for transforming your column. Something like this:
import pyspark.sql.functions as F
import pyspark.sql.types as T

valid_date_udf = F.udf(lambda x: __valid_date(x), T.BooleanType())

@column_condition_partial(engine=SparkDFExecutionEngine)
def _spark(cls, column, ts_formats, **kwargs):
    return valid_date_udf(column)
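Outside of Great Expectations you can sanity-check the same idea on a plain DataFrame. This is only a sketch, with a hypothetical __valid_date that accepts a single timestamp format:
from datetime import datetime

from pyspark.sql import SparkSession
import pyspark.sql.functions as F
import pyspark.sql.types as T

spark = SparkSession.builder.getOrCreate()

def __valid_date(ts):
    # hypothetical validator: True if ts matches '%Y-%m-%d %H:%M:%S'
    try:
        datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
        return True
    except (ValueError, TypeError):
        return False

valid_date_udf = F.udf(__valid_date, T.BooleanType())

df = spark.createDataFrame([("2021-01-01 10:00:00",), ("not a timestamp",)], ["ts"])
df.withColumn("is_valid", valid_date_udf(F.col("ts"))).show()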

A quick way to get the mean of each position in a large RDD

I have a large RDD (more than 1,000,000 rows), where each row is a tuple of four elements A, B, C, D. A head scan of the RDD looks like:
[(492,3440,4215,794),
(6507,6163,2196,1332),
(7561,124,8558,3975),
(423,1190,2619,9823)]
Now I want to find the mean of each position in this RDD. For example, for the data above I need an output list with the values:
(492+6507+7561+423)/4
(3440+6163+124+1190)/4
(4215+2196+8558+2619)/4
(794+1332+3975+9823)/4
which is:
[(3745.75,2729.25,4397.0,3981.0)]
Since the RDD is very large, it is not convenient to calculate the sum of each position and then divide by the length of the RDD. Is there any quick way for me to get the output? Thank you very much.
I don't think there is anything faster than calculating the mean (or sum) for each column.
If you are using the DataFrame API you can simply aggregate multiple columns:
import os

from pyspark.sql import functions as f
from pyspark.sql import SparkSession

# start local spark session
spark = SparkSession.builder.getOrCreate()

# helper to build a file:// URI for a local relative path
def localpath(path):
    return 'file://' + os.path.join(os.path.abspath(os.path.curdir), path)

# load as rdd and parse each line into a tuple of numbers
# (assumes comma-separated numeric rows, so createDataFrame can infer a schema)
rdd = spark.sparkContext.textFile(localpath('myPosts/')) \
    .map(lambda line: tuple(float(x) for x in line.split(',')))

# create data frame from rdd and average every column in one aggregation
df = spark.createDataFrame(rdd)
means_df = df.agg(*[f.avg(c) for c in df.columns])
means_dict = means_df.first().asDict()
print(means_dict)
Note that the dictionary keys will be based on the default Spark column names ('_1', '_2', ...). If you want more meaningful column names you can pass them as an argument to the createDataFrame call.
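As an end-to-end sketch using the four sample rows from the question (here simply parallelized into an RDD) with explicit column names:
from pyspark.sql import SparkSession
from pyspark.sql import functions as f

spark = SparkSession.builder.getOrCreate()

rdd = spark.sparkContext.parallelize([
    (492, 3440, 4215, 794),
    (6507, 6163, 2196, 1332),
    (7561, 124, 8558, 3975),
    (423, 1190, 2619, 9823),
])

# name the columns explicitly instead of the default _1, _2, ...
df = spark.createDataFrame(rdd, ["A", "B", "C", "D"])
means_row = df.agg(*[f.avg(c) for c in df.columns]).first()
print(list(means_row))
# [3745.75, 2729.25, 4397.0, 3981.0]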

How to convert a column in H2OFrame to a python list?

I've read the PythonBooklet.pdf by H2O.ai and the python API documentation, but still can't find a clean way to do this. I know I can do either of the following:
Convert H2OFrame to Spark DataFrame and do a flatMap + collect or collect + list comprehension.
Use H2O's get_frame_data, which gives me a string of header and data separated by \n; then convert it to a list (a numeric list in my case).
Is there a better way to do this? Thank you.
You can try something like this: bring an H2OFrame into python as a pandas dataframe by calling .as_data_frame(), then call .tolist() on the column of interest.
A self-contained example with iris:
import h2o
h2o.init()
df = h2o.import_file("iris_wheader.csv")
pd = df.as_data_frame()
pd['sepal_len'].tolist()
You can (1) convert the H2O frame to a pandas dataframe and (2) convert the pandas dataframe to a list, as follows:
pd = h2o.as_list(h2oFrame)
l = pd["column"].tolist()
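Combining the two answers into a self-contained sketch (assuming a small H2OFrame built from an in-memory dict instead of the iris file):
import h2o

h2o.init()

# build a tiny H2OFrame in memory
frame = h2o.H2OFrame({"sepal_len": [5.1, 4.9, 4.7]})

# route 1: via as_data_frame(), which returns a pandas DataFrame
print(frame.as_data_frame()["sepal_len"].tolist())

# route 2: via h2o.as_list(), which also returns a pandas DataFrame by default
print(h2o.as_list(frame)["sepal_len"].tolist())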

createDataFrame() returning a list instead of DataFrame in Spark

I am running Spark 1.5.1. On startup a HiveContext is available as sqlContext, but I also set:
sqlContext2 = SQLContext(sc)
I create a pipelined RDD by parsing a list of strings into JSON:
data = points.map(lambda line: json.loads(line))
I then try to convert this into a dataframe using
DF = sqlContext2.createDataFrame(data).collect()
This runs perfectly, but when I run type(DF) it says that it is a list.
How is this possible? How can a list come out of createDataFrame()?
That's because when you apply collect() to a DataFrame, it returns a list that contains all of the elements (Rows) of that DataFrame.
If you want just a DataFrame, df = sqlContext.createDataFrame(data) is enough.
There is no need for sqlContext2 here.
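As a minimal sketch of the corrected flow (using the same data variable): create the DataFrame first, and only call collect() when you actually want the rows back in the driver as a list:
# create the DataFrame; no collect() here
DF = sqlContext.createDataFrame(data)
print(type(DF))    # pyspark.sql.dataframe.DataFrame

# collect() returns the contents as a list of Row objects
rows = DF.collect()
print(type(rows))  # list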

Generate a single JSON file from a pyspark RDD

I am building a Python script in which I need to generate a JSON file from a JSON RDD.
The following is the code snippet for saving the JSON file:
jsonRDD.map(lambda x: json.loads(x)) \
    .coalesce(1, shuffle=True) \
    .saveAsTextFile('examples/src/main/resources/demo.json')
But I need to write the JSON data to a single file instead of having it distributed across several partitions.
Please suggest an appropriate solution for this.
Without the use of additional libraries like pandas, you could save your RDD of JSON objects by reducing them to one big string, with one JSON object per line:
# perform your operation
# note that you do not need a lambda expression for json.loads
jsonRDD = jsonRDD.map(json.loads).coalesce(1, shuffle=True)
# map jsons back to string
jsonRDD = jsonRDD.map(json.dumps)
# reduce to one big string with one json on each line
json_string = jsonRDD.reduce(lambda x, y: x + "\n" + y)
# write your string to a file
with open("path/to/your.json", "w") as f:
    f.write(json_string.encode("utf-8"))
I have had issues with pyspark saving off JSON files once I have them in an RDD or dataframe, so what I do is convert them to a pandas dataframe and save it to a non-distributed directory.
import pandas
df1 = sqlContext.createDataFrame(yourRDD)
df2 = df1.toPandas()
df2.to_json(yourpath)
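If the data is already a Spark DataFrame (df1 above), another option sometimes used, sketched here with a hypothetical output path, is to coalesce to one partition and use the DataFrame writer; note that Spark still creates an output directory containing a single part file rather than one named file:
# one partition -> one part-* file inside the output directory
df1.coalesce(1).write.mode("overwrite").json("examples/src/main/resources/demo_json_dir")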
