Creating a Spark DataFrame from an RDD of lists - apache-spark

I have an rdd (we can call it myrdd) where each record in the rdd is of the form:
[('column 1',value), ('column 2',value), ('column 3',value), ... , ('column 100',value)]
I would like to convert this into a DataFrame in pyspark - what is the easiest way to do this?

How about use the toDF method? You only need add the field names.
df = rdd.toDF(['column', 'value'])

The answer by #dapangmao got me to this solution:
my_df = my_rdd.map(lambda l: Row(**dict(l))).toDF()

Take a look at the DataFrame documentation to make this example work for you, but this should work. I'm assuming your RDD is called my_rdd
from pyspark.sql import SQLContext, Row
sqlContext = SQLContext(sc)
# You have a ton of columns and each one should be an argument to Row
# Use a dictionary comprehension to make this easier
def record_to_row(record):
schema = {'column{i:d}'.format(i = col_idx):record[col_idx] for col_idx in range(1,100+1)}
return Row(**schema)
row_rdd = my_rdd.map(lambda x: record_to_row(x))
# Now infer the schema and you have a DataFrame
schema_my_rdd = sqlContext.inferSchema(row_rdd)
# Now you have a DataFrame you can register as a table
schema_my_rdd.registerTempTable("my_table")
I haven't worked much with DataFrames in Spark but this should do the trick

In pyspark, let's say you have a dataframe named as userDF.
>>> type(userDF)
<class 'pyspark.sql.dataframe.DataFrame'>
Lets just convert it to RDD (
userRDD = userDF.rdd
>>> type(userRDD)
<class 'pyspark.rdd.RDD'>
and now you can do some manipulations and call for example map function :
newRDD = userRDD.map(lambda x:{"food":x['favorite_food'], "name":x['name']})
Finally, lets create a DataFrame from resilient distributed dataset (RDD).
newDF = sqlContext.createDataFrame(newRDD, ["food", "name"])
>>> type(ffDF)
<class 'pyspark.sql.dataframe.DataFrame'>
That's all.
I was hitting this warning message before when I tried to call :
newDF = sc.parallelize(newRDD, ["food","name"] :
.../spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/session.py:336: UserWarning: Using RDD of dict to inferSchema is deprecated. Use pyspark.sql.Row inst warnings.warn("Using RDD of dict to inferSchema is deprecated. "
So no need to do this anymore...

Related

Convert RDD to DataFrame using pyspark

I have a file in spark with following data
Property ID|Location|Price|Bedrooms|Bathrooms|Size|Price SQ Ft|Status
i have read this file as rdd using
a=sc.textFile("/FileStore/tables/realestate.txt")
Now I need to convert this rdd into dataframe. I am using the below mentioned command
d=spark.createDataFrame(a).toDF("Property ID","Location","Price","Bedrooms","Bathrooms","Size","Price SQ Ft","Status")
But i am getting an error as
TypeError: Can not infer schema for type: <class 'str'>
You can split the column first:
d = spark.createDataFrame(a.map(lambda x: x.split('|'))).toDF("Property ID","Location","Price","Bedrooms","Bathrooms","Size","Price SQ Ft","Status")
Or equivalently, calling toDF on the RDD directly
d = a.map(lambda x: x.split('|')).toDF(["Property ID","Location","Price","Bedrooms","Bathrooms","Size","Price SQ Ft","Status"])
In fact, I'd recommend using the Spark CSV reader for this purpose, which could handle the header appropriately too:
df = spark.read.csv('/FileStore/tables/realestate.txt', header=True, inferSchema=True, sep='|')

A quick way to get the mean of each position in large RDD

I have a large RDD (more than 1,000,000 lines), while each line has four elements A,B,C,D in a tuple. A head scan of the RDD looks like
[(492,3440,4215,794),
(6507,6163,2196,1332),
(7561,124,8558,3975),
(423,1190,2619,9823)]
Now I want to find the mean of each position in this RDD. For example for the data above I need an output list has values:
(492+6507+7561+423)/4
(3440+6163+124+1190)/4
(4215+2196+8558+2619)/4
(794+1332+3975+9823)/4
which is:
[(3745.75,2729.25,4397.0,3981.0)]
Since the RDD is very large, it is not convenient to calculate the sum of each position and then divide by the length of RDD. Are there any quick way for me to get the output? Thank you very much.
I don't think there is anything faster than calculating the mean (or sum) for each column
If you are using the DataFrame API you can simply aggregate multiple columns:
import os
import time
from pyspark.sql import functions as f
from pyspark.sql import SparkSession
# start local spark session
spark = SparkSession.builder.getOrCreate()
# load as rdd
def localpath(path):
return 'file://' + os.path.join(os.path.abspath(os.path.curdir), path)
rdd = spark._sc.textFile(localpath('myPosts/'))
# create data frame from rdd
df = spark.createDataFrame(rdd)
means_df = df.agg(*[f.avg(c) for c in df.columns])
means_dict = means_df.first().asDict()
print(means_dict)
Note that the dictionary keys will be the default spark column names ('0', '1', ...). If you want more speaking column names you can give them as an argument to the createDataFrame command

Apply a function to a single column of a csv in Spark

Using Spark I'm reading a csv and want to apply a function to a column on the csv. I have some code that works but it's very hacky. What is the proper way to do this?
My code
SparkContext().addPyFile("myfile.py")
spark = SparkSession\
.builder\
.appName("myApp")\
.getOrCreate()
from myfile import myFunction
df = spark.read.csv(sys.argv[1], header=True,
mode="DROPMALFORMED",)
a = df.rdd.map(lambda line: Row(id=line[0], user_id=line[1], message_id=line[2], message=myFunction(line[3]))).toDF()
I would like to be able to just call the function on the column name instead of mapping each row to line and then calling the function on line[index].
I'm using Spark version 2.0.1
You can simply use User Defined Functions (udf) combined with a withColumn :
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import udf
udf_myFunction = udf(myFunction, IntegerType()) # if the function returns an int
df = df.withColumn("message", udf_myFunction("_3")) #"_3" being the column name of the column you want to consider
This will add a new column to the dataframe df containing the result of myFunction(line[3]).

createDataFrame() returning a list instead of DataFrame in Spark

I am running Spark 1.5.1. On startup I have HiveContext available as sqlContext but set
sqlContext2 = SQLContext(sc)
I create a pipelined RDD by parsing a list of strings to JSON
data = points.map(lambda line: json.loads(line))
I then try to convert this into a dataframe using
DF = sqlContext2.createDataFrame(data).collect()
This runs perfectly, but then when i run type(DF) it says that it is a list.
How is this possible? How is a list coming out of a createDataFrame()
That's because when you apply collect() on a DataFrame, it return a list that contains all of the elements (Rows) in this DataFrame.
if you want just a DatFrame, df = sqlContext.createDataFrame(data) is enough.
There is no need for sqlContext2 here.

Get CSV to Spark dataframe

I'm using python on Spark and would like to get a csv into a dataframe.
The documentation for Spark SQL strangely does not provide explanations for CSV as a source.
I have found Spark-CSV, however I have issues with two parts of the documentation:
"This package can be added to Spark using the --jars command line option. For example, to include it when starting the spark shell: $ bin/spark-shell --packages com.databricks:spark-csv_2.10:1.0.3"
Do I really need to add this argument everytime I launch pyspark or spark-submit? It seems very inelegant. Isn't there a way to import it in python rather than redownloading it each time?
df = sqlContext.load(source="com.databricks.spark.csv", header="true", path = "cars.csv") Even if I do the above, this won't work. What does the "source" argument stand for in this line of code? How do I simply load a local file on linux, say "/Spark_Hadoop/spark-1.3.1-bin-cdh4/cars.csv"?
With more recent versions of Spark (as of, I believe, 1.4) this has become a lot easier. The expression sqlContext.read gives you a DataFrameReader instance, with a .csv() method:
df = sqlContext.read.csv("/path/to/your.csv")
Note that you can also indicate that the csv file has a header by adding the keyword argument header=True to the .csv() call. A handful of other options are available, and described in the link above.
from pyspark.sql.types import StringType
from pyspark import SQLContext
sqlContext = SQLContext(sc)
Employee_rdd = sc.textFile("\..\Employee.csv")
.map(lambda line: line.split(","))
Employee_df = Employee_rdd.toDF(['Employee_ID','Employee_name'])
Employee_df.show()
for Pyspark, assuming that the first row of the csv file contains a header
spark = SparkSession.builder.appName('chosenName').getOrCreate()
df=spark.read.csv('fileNameWithPath', mode="DROPMALFORMED",inferSchema=True, header = True)
Read the csv file in to a RDD and then generate a RowRDD from the original RDD.
Create the schema represented by a StructType matching the structure of Rows in the RDD created in Step 1.
Apply the schema to the RDD of Rows via createDataFrame method provided by SQLContext.
lines = sc.textFile("examples/src/main/resources/people.txt")
parts = lines.map(lambda l: l.split(","))
# Each line is converted to a tuple.
people = parts.map(lambda p: (p[0], p[1].strip()))
# The schema is encoded in a string.
schemaString = "name age"
fields = [StructField(field_name, StringType(), True) for field_name in schemaString.split()]
schema = StructType(fields)
# Apply the schema to the RDD.
schemaPeople = spark.createDataFrame(people, schema)
source: SPARK PROGRAMMING GUIDE
If you do not mind the extra package dependency, you could use Pandas to parse the CSV file. It handles internal commas just fine.
Dependencies:
from pyspark import SparkContext
from pyspark.sql import SQLContext
import pandas as pd
Read the whole file at once into a Spark DataFrame:
sc = SparkContext('local','example') # if using locally
sql_sc = SQLContext(sc)
pandas_df = pd.read_csv('file.csv') # assuming the file contains a header
# If no header:
# pandas_df = pd.read_csv('file.csv', names = ['column 1','column 2'])
s_df = sql_sc.createDataFrame(pandas_df)
Or, even more data-consciously, you can chunk the data into a Spark RDD then DF:
chunk_100k = pd.read_csv('file.csv', chunksize=100000)
for chunky in chunk_100k:
Spark_temp_rdd = sc.parallelize(chunky.values.tolist())
try:
Spark_full_rdd += Spark_temp_rdd
except NameError:
Spark_full_rdd = Spark_temp_rdd
del Spark_temp_rdd
Spark_DF = Spark_full_rdd.toDF(['column 1','column 2'])
Following Spark 2.0, it is recommended to use a Spark Session:
from pyspark.sql import SparkSession
from pyspark.sql import Row
# Create a SparkSession
spark = SparkSession \
.builder \
.appName("basic example") \
.config("spark.some.config.option", "some-value") \
.getOrCreate()
def mapper(line):
fields = line.split(',')
return Row(ID=int(fields[0]), field1=str(fields[1].encode("utf-8")), field2=int(fields[2]), field3=int(fields[3]))
lines = spark.sparkContext.textFile("file.csv")
df = lines.map(mapper)
# Infer the schema, and register the DataFrame as a table.
schemaDf = spark.createDataFrame(df).cache()
schemaDf.createOrReplaceTempView("tablename")
I ran into similar problem. The solution is to add an environment variable named as "PYSPARK_SUBMIT_ARGS" and set its value to "--packages com.databricks:spark-csv_2.10:1.4.0 pyspark-shell". This works with Spark's Python interactive shell.
Make sure you match the version of spark-csv with the version of Scala installed. With Scala 2.11, it is spark-csv_2.11 and with Scala 2.10 or 2.10.5 it is spark-csv_2.10.
Hope it works.
Based on the answer by Aravind, but much shorter, e.g. :
lines = sc.textFile("/path/to/file").map(lambda x: x.split(","))
df = lines.toDF(["year", "month", "day", "count"])
With the current implementation(spark 2.X) you dont need to add the packages argument, You can use the inbuilt csv implementation
Additionally as the accepted answer you dont need to create an rdd then enforce schema that has 1 potential problem
When you read the csv as then it will mark all the fields as string and when you enforce the schema with an integer column you will get exception.
A better way to do the above would be
spark.read.format("csv").schema(schema).option("header", "true").load(input_path).show()

Resources