PySpark: Transforming PythonRDD to DataFrame - apache-spark

Could someone guide me on converting a PythonRDD to a DataFrame?
As per my understanding, reading a file should create a DataFrame, but in my case it has created a PythonRDD. I am finding it hard to convert the PythonRDD to a DataFrame; I could not find createDataFrame() or toDF().
Here is my code to read a tab-separated text file:
rdd1 = sparkCxt.textFile(setting.REFRESH_HDFS_DIR + "/Refresh")
rdd2 = rdd1.map(lambda row: unicode(row).lower().strip()
                if type(row) == unicode else row)
Now, I want to convert this PythonRDD to a DataFrame.
I want the DataFrame so that I can map a schema to it and do further processing at the column level.
Also, please suggest if you think there is a better approach.
Please reply if more details are required.
Thank you.

Spark DataFrames can be created directly from a text file, but you should use sqlContext instead of sc (SparkContext), since sqlContext is an entry point for working with DataFrames.
df = sqlContext.read.text('path/to/my/file')
This will create a DataFrame with a single column named value. You can then use built-in functions or UDFs to split it into the required columns.
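For instance, a minimal sketch (column names are placeholders) that splits the value column on tabs with the built-in split function:
from pyspark.sql.functions import split

df = sqlContext.read.text('path/to/my/file')
parts = split(df['value'], '\t')
# pull the pieces out into named columns
df2 = df.select(parts.getItem(0).alias('col1'),
                parts.getItem(1).alias('col2'),
                parts.getItem(2).alias('col3'))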
Another approach would be to read the text files to an RDD, split it into columns using map, reduce, filter and other operations, and then convert the final RDD to a DataFrame.
For example, let's say we have an RDD named my_rdd with the following structure:
[(1, 'Alice', 23), (2, 'Bob', 25)]
We can easily convert it to a DataFrame:
df = sqlContext.createDataFrame(my_rdd, ['id', 'name', 'age'])
where id, name and age are names for our columns.
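Applied to the original tab-separated file, that RDD route might look roughly like this, assuming every line has the same number of fields (the column names are placeholders):
rdd1 = sparkCxt.textFile(setting.REFRESH_HDFS_DIR + "/Refresh")
# lower-case, trim and split each line on tabs
rows = rdd1.map(lambda line: line.lower().strip().split('\t'))
df = sqlContext.createDataFrame(rows, ['col1', 'col2', 'col3'])
df.printSchema()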

You can try using toPandas(), although you should be cautious when doing so: converting to a pandas DataFrame brings all the data into the driver's memory, which might cause an OOM error if your distributed data is large.
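A minimal sketch, assuming df is a DataFrame built from the RDD as shown above; the limit() call is just a precaution against pulling an unbounded amount of data to the driver:
pandas_df = df.limit(1000).toPandas()
print(pandas_df.head())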

I would use the Spark-csv package (Spark-csv Github) and import directly into a dataframe after defining the schema.
For example:
from pyspark.sql import SQLContext
from pyspark.sql.types import *
sqlContext = SQLContext(sc)
customSchema = StructType([
    StructField("year", IntegerType(), True),
    StructField("make", StringType(), True),
    StructField("model", StringType(), True),
    StructField("comment", StringType(), True),
    StructField("blank", StringType(), True)])

df = sqlContext.read \
    .format('com.databricks.spark.csv') \
    .options(header='true') \
    .load('cars.csv', schema=customSchema)
This defaults to a comma for the delimiter, but you can change that to a tab with something like:
df = sqlContext.read \
    .format('com.databricks.spark.csv') \
    .options(header='true', delimiter='\t') \
    .load('cars.csv', schema=customSchema)
Note that it is possible to infer the schema using another option, but this does require reading the entire file prior to loading the dataframe.
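If you do want Spark to infer it, a sketch using the inferSchema option instead of the hand-built schema (this is what triggers the extra pass over the file mentioned above):
df = sqlContext.read \
    .format('com.databricks.spark.csv') \
    .options(header='true', delimiter='\t', inferSchema='true') \
    .load('cars.csv')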

Related

Reading csv files in PySpark

I am trying to read a csv file and convert it into a dataframe.
input.txt
4324,'Andy',43.5,20.3,53.21
2342,'Sam',22.1
3248,'Jane',11.05,12.87
6457,'Bob',32.1,75.23,71.6
Schema: Id, Name, Jan, Feb, March
As you can see, the csv file has no trailing "," separators when the later expense values are missing, so rows have different numbers of fields.
Code:
from pyspark.sql.types import *
input1 = sc.textFile('/FileStore/tables/input.txt').map(lambda x: x.split(","))
schema = StructType([StructField('Id', StringType(), True),
                     StructField('Name', StringType(), True),
                     StructField('Jan', StringType(), True),
                     StructField('Feb', StringType(), True),
                     StructField('Mar', StringType(), True)])
df3 = sqlContext.createDataFrame(input1, schema)
I get ValueError: Length of object (4) does not match with length of fields (5). How do I resolve this?
I would first import the file using pandas which should handle everything for you. From there you can then convert the pandas DataFrame to spark and do all your usual stuff. I copied your example txt file and quickly wrote up some code to confirm that it would all work:
import pandas as pd

# Reading in txt file as csv
df_pandas = pd.read_csv('<your location>/test.txt', sep=",")

# Converting to spark dataframe and displaying
df_spark = spark.createDataFrame(df_pandas)
display(df_spark)
This produced the expected dataframe.
The faster method would be to import through spark:
# Importing csv file using pyspark
csv_import = sqlContext.read \
    .format('csv') \
    .options(sep=',', header='true', inferSchema='true') \
    .load('<your location>/test.txt')
display(csv_import)
Which gives the same output.
from pyspark.sql.types import *
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Test").getOrCreate()
fields = [StructField('Id', StringType(), True), StructField('Name', StringType(), True),
          StructField('Jan', StringType(), True), StructField('Feb', StringType(), True),
          StructField('Mar', StringType(), True)]
schema = StructType(fields)
data = spark.read.format("csv").load("test2.txt")
df3 = spark.createDataFrame(data.rdd, schema)
df3.show()
Output:
+----+------+-----+-----+-----+
| Id| Name| Jan| Feb| Mar|
+----+------+-----+-----+-----+
|4324|'Andy'| 43.5| 20.3|53.21|
|2342| 'Sam'| 22.1| null| null|
|3248|'Jane'|11.05|12.87| null|
|6457| 'Bob'| 32.1|75.23| 71.6|
+----+------+-----+-----+-----+
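A slightly shorter variant of the same idea, just a sketch: hand the schema straight to the CSV reader so no intermediate RDD is needed; missing trailing fields come back as null under the default permissive mode:
df3 = spark.read.format("csv").schema(schema).load("test2.txt")
df3.show()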
Here are a couple of options for you to consider. These use the wildcard character, so you can loop through all folders and sub-folders, look for files with names that match a specific pattern, and merge everything into a single dataframe.
val myDFCsv = spark.read.format("csv")
.option("sep",",")
.option("inferSchema","true")
.option("header","true")
.load("mnt/rawdata/2019/01/01/client/ABC*.gz")
myDFCsv.show()
myDFCsv.head()
myDFCsv.count()
//////////////////////////////////////////
// If you also need to load the filename
import org.apache.spark.sql.functions.input_file_name
val myDFCsv = spark.read.format("csv")
.option("sep",",")
.option("inferSchema","true")
.option("header","true")
.load("mnt/rawdata/2019/01/01/client/ABC*.gz")
.withColumn("file_name",input_file_name())
myDFCsv.show(false)
myDFCsv.head()
myDFCsv.count()
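The snippets above are Scala; a rough PySpark equivalent of the same wildcard-plus-filename pattern, sketched against the same assumed paths:
from pyspark.sql.functions import input_file_name

myDFCsv = (spark.read.format("csv")
           .option("sep", ",")
           .option("inferSchema", "true")
           .option("header", "true")
           .load("mnt/rawdata/2019/01/01/client/ABC*.gz")
           .withColumn("file_name", input_file_name()))
myDFCsv.show(truncate=False)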

When reading a CSV is there an option to start on row 2 or below?

I'm reading a bunch of CSV files into a dataframe using the sample code below.
val df = spark.read.format("csv")
.option("sep","|")
.option("inferSchema","true")
.option("header","false")
.load("mnt/rawdata/corp/ABC*.gz")
I'm hoping there is a way to start on row 2 or below, because row 1 contains some basic metadata about these files, and the first row has 4 pipe characters, so Spark thinks the file has 4 columns, but it actually has over 100 columns in the actual data.
I tried playing with the inferSchema and header but I couldn't get anything to work.
If the first line in the CSV doesn't match the actual column count and names, you may need to define your schema by hand and then try this combination:
val df = spark.read.format("csv")
.option("sep","|")
.option("inferSchema","false")
.option("header","true")
.schema(mySchema)
.option("enforceSchema","true")
.load(...
Full list of CSV options.
Note that for Spark 2.3 and above, you can use a shorthand, SQL-style notation for the schema definition: a simple string like "column1 type1, column2 type2, ...".
If, however, your header has more than one line, you will probably be forced to ignore all "errors" by adding the option .option("mode", "DROPMALFORMED").
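As a sketch of that shorthand (shown here in PySpark, with column names invented purely for illustration), the DDL string goes straight into .schema():
df = (spark.read.format("csv")
      .option("sep", "|")
      .option("header", "true")
      .schema("id INT, name STRING, amount DOUBLE")
      .load("mnt/rawdata/corp/ABC*.gz"))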
You are right! You need to define a custom schema! I ended up going with this.
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType};
import org.apache.spark.sql.functions.input_file_name
val customSchema = StructType(Array(
StructField("field1", StringType, true),
StructField("field2", StringType, true),
StructField("field3", StringType, true),
StructField("field4", StringType, true),
StructField("field5", StringType, true),
StructField("field6", StringType, true),
StructField("field7", StringType, true)))
val df = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "false")
.option("sep", "|")
.schema(customSchema)
.load("mnt/rawdata/corp/ABC*.gz")
.withColumn("file_name", input_file_name())

Convert Spark Structured DataFrame to Pandas using pandas_udf

I need to read csv files as a stream and then convert them to a pandas dataframe.
Here is what I have done so far
DataShema = StructType([StructField("TimeStamp", LongType(), True),
                        StructField("Count", IntegerType(), True),
                        StructField("Reading", FloatType(), True)])

group_columns = ['TimeStamp', 'Count', 'Reading']

@pandas_udf(DataShema, PandasUDFType.GROUPED_MAP)
def get_pdf(pdf):
    return pd.DataFrame([pdf[group_columns]], columns=[group_columns])
# getting Surge data from the files
SrgDF = spark \
    .readStream \
    .schema(DataShema) \
    .csv("ProcessdedData/SurgeAcc")

mydf = SrgDF.groupby(group_columns).apply(get_pdf)

qrySrg = SrgDF \
    .writeStream \
    .format("console") \
    .start() \
    .awaitTermination()
I believe, from another source (Convert Spark Structure Streaming DataFrames to Pandas DataFrame), that converting a structured streaming dataframe to pandas is not directly possible, and it seems that pandas_udf is the right approach, but I cannot figure out exactly how to achieve this. I need the pandas dataframe to pass into my functions.
Edit
when I run the code (changing the query to mydf rather than SrgDF) then I get the following error: pyspark.sql.utils.StreamingQueryException: 'Writing job aborted.\n=== Streaming Query ===\nIdentifier: [id = 18a15e9e-9762-4464-b6d1-cb2db8d0ac41, runId = e3da131e-00d1-4fed-82fc-65bf377c3f99]\nCurrent Committed Offsets: {}\nCurrent Available Offsets: {FileStreamSource[file:/home/mls5/Work_Research/Codes/Misc/Python/MachineLearning_ArtificialIntelligence/00_Examples/01_ApacheSpark/01_ComfortApp/ProcessdedData/SurgeAcc]: {"logOffset":0}}\n\nCurrent State: ACTIVE\nThread State: RUNNABLE\n\nLogical Plan:\nFlatMapGroupsInPandas [Count#1], get_pdf(TimeStamp#0L, Count#1, Reading#2), [TimeStamp#10L, Count#11, Reading#12]\n+- Project [Count#1, TimeStamp#0L, Count#1, Reading#2]\n +- StreamingExecutionRelation FileStreamSource[file:/home/mls5/Work_Research/Codes/Misc/Python/MachineLearning_ArtificialIntelligence/00_Examples/01_ApacheSpark/01_ComfortApp/ProcessdedData/SurgeAcc], [TimeStamp#0L, Count#1, Reading#2]\n'
19/05/20 18:32:29 ERROR ReceiverTracker: Deregistered receiver for stream 0: Stopped by driver
/usr/local/lib/python3.6/dist-packages/pyarrow/__init__.py:152: UserWarning: pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream
warnings.warn("pyarrow.open_stream is deprecated, please use ".
EDIT-2
Here is the code to reproduce the error:
import sys
from pyspark import SparkContext
from pyspark.sql import Row, SparkSession, SQLContext
from pyspark.sql.functions import explode
from pyspark.sql.functions import split
from pyspark.streaming import StreamingContext
from pyspark.sql.types import *
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType
import pyarrow as pa
import glob

#####################################################################################
if __name__ == '__main__':

    spark = SparkSession \
        .builder \
        .appName("RealTimeIMUAnalysis") \
        .getOrCreate()

    spark.conf.set("spark.sql.execution.arrow.enabled", "true")

    # reduce verbosity
    sc = spark.sparkContext
    sc.setLogLevel("ERROR")

    ##############################################################################
    # using the saved files to do the Analysis
    DataShema = StructType([StructField("TimeStamp", LongType(), True),
                            StructField("Count", IntegerType(), True),
                            StructField("Reading", FloatType(), True)])

    group_columns = ['TimeStamp', 'Count', 'Reading']

    @pandas_udf(DataShema, PandasUDFType.GROUPED_MAP)
    def get_pdf(pdf):
        return pd.DataFrame([pdf[group_columns]], columns=[group_columns])

    # getting Surge data from the files
    SrgDF = spark \
        .readStream \
        .schema(DataShema) \
        .csv("SurgeAcc")

    mydf = SrgDF.groupby('Count').apply(get_pdf)
    # print(mydf)

    qrySrg = mydf \
        .writeStream \
        .format("console") \
        .start() \
        .awaitTermination()
To run, you need to create a folder named SurgeAcc where the code is and create a csv file inside with the following format:
TimeStamp,Count,Reading
1557011317299,45148,-0.015494
1557011317299,45153,-0.015963
1557011319511,45201,-0.015494
1557011319511,45221,-0.015494
1557011315134,45092,-0.015494
1557011315135,45107,-0.014085
1557011317299,45158,-0.015963
1557011317299,45163,-0.015494
1557011317299,45168,-0.015024
The dataframe returned by your pandas_udf does not match the schema specified.
Please note that the input to the pandas_udf is a pandas dataframe, and it must also return a pandas dataframe.
You can use all pandas functions inside the pandas_udf. The only thing you have to make sure of is that ReturnDataShema matches the actual output of the function.
ReturnDataShema = StructType([StructField("TimeStamp", LongType(), True),
                              StructField("Count", IntegerType(), True),
                              StructField("Reading", FloatType(), True),
                              StructField("TotalCount", FloatType(), True)])

@pandas_udf(ReturnDataShema, PandasUDFType.GROUPED_MAP)
def get_pdf(pdf):
    # The following stmt is causing the schema mismatch:
    # return pd.DataFrame([pdf[group_columns]], columns=[group_columns])

    # If you want to return all the rows of the pandas dataframe,
    # you can simply
    # return pdf

    # If you want to do any aggregations, you can do it like below, or use pandas query,
    # but make sure the returned pandas dataframe complies with ReturnDataShema
    total_count = pdf['Count'].sum()
    return pd.DataFrame([(pdf.TimeStamp[0], pdf.Count[0], pdf.Reading[0], total_count)])
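For completeness, a rough sketch of wiring this corrected UDF back into the streaming pipeline from the question (path and grouping column taken from the original code):
SrgDF = spark.readStream.schema(DataShema).csv("SurgeAcc")
mydf = SrgDF.groupby('Count').apply(get_pdf)

qrySrg = mydf.writeStream.format("console").start()
qrySrg.awaitTermination()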

How to create spark dataframe with column name which contains dot/period?

I have data in a list and want to convert it to a spark dataframe with one of the column names containing a "."
I wrote the below code which ran without any errors.
input_data = [('retail', '2017-01-03T13:21:00', 134),
              ('retail', '2017-01-03T13:21:00', 100)]

rdd_schema = StructType([StructField('business', StringType(), True),
                         StructField('date', StringType(), True),
                         StructField("`US.sales`", FloatType(), True)])

input_mock_df = spark.createDataFrame(input_mock_rdd_map, rdd_schema)
The below code returns the column names
input_mock_df.columns
But any operation on this dataframe gives an error, for example:
input_mock_df.count()
How do I make a valid spark dataframe which contains a "."?
Note:
If I don't put "." in the column name, the code works perfectly.
I want to solve it using native Spark and not use pandas etc.
I have run the below code
input_data = [('retail', '2017-01-03T13:21:00', 134),
              ('retail', '2017-01-03T13:21:00', 100)]

rdd_schema = StructType([StructField('business', StringType(), True),
                         StructField('date', StringType(), True),
                         StructField("US.sales", IntegerType(), True)])

input_mock_df = sqlContext.createDataFrame(input_data, rdd_schema)
input_mock_df.count()
and it works fine, returning the count as 2. Please try it and reply.
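As a side note, not part of the original answer: once a column name contains a literal dot, wrap it in backticks when you refer to it by string, otherwise Spark parses the dot as struct-field access; or simply rename it to avoid the issue. A minimal sketch:
input_mock_df.select("`US.sales`").show()
input_mock_df.withColumnRenamed("US.sales", "US_sales").show()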

How to create an empty DataFrame? Why "ValueError: RDD is empty"?

I am trying to create an empty dataframe in Spark (Pyspark).
I am using a similar approach to the one discussed here, but it is not working.
This is my code:
df = sqlContext.createDataFrame(sc.emptyRDD(), schema)
This is the error
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/Me/Desktop/spark-1.5.1-bin-hadoop2.6/python/pyspark/sql/context.py", line 404, in createDataFrame
rdd, schema = self._createFromRDD(data, schema, samplingRatio)
File "/Users/Me/Desktop/spark-1.5.1-bin-hadoop2.6/python/pyspark/sql/context.py", line 285, in _createFromRDD
struct = self._inferSchema(rdd, samplingRatio)
File "/Users/Me/Desktop/spark-1.5.1-bin-hadoop2.6/python/pyspark/sql/context.py", line 229, in _inferSchema
first = rdd.first()
File "/Users/Me/Desktop/spark-1.5.1-bin-hadoop2.6/python/pyspark/rdd.py", line 1320, in first
raise ValueError("RDD is empty")
ValueError: RDD is empty
Extending Joe Widen's answer, you can actually create the schema with no fields like so:
schema = StructType([])
So when you create the DataFrame using that as your schema, you'll end up with a DataFrame[].
>>> empty = sqlContext.createDataFrame(sc.emptyRDD(), schema)
DataFrame[]
>>> empty.schema
StructType(List())
In Scala, if you choose to use sqlContext.emptyDataFrame and check out the schema, it will return StructType().
scala> val empty = sqlContext.emptyDataFrame
empty: org.apache.spark.sql.DataFrame = []
scala> empty.schema
res2: org.apache.spark.sql.types.StructType = StructType()
At the time this answer was written, it looks like you need some sort of schema:
from pyspark.sql import SQLContext
from pyspark.sql.types import *

field = [StructField("field1", StringType(), True)]
schema = StructType(field)

sc = spark.sparkContext
sqlContext = SQLContext(sc)
sqlContext.createDataFrame(sc.emptyRDD(), schema)
This will work with Spark version 2.0.0 or later:
from pyspark.sql import SQLContext
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

sc = spark.sparkContext
sqlContext = SQLContext(sc)
schema = StructType([StructField('col1', StringType(), False),
                     StructField('col2', IntegerType(), True)])
sqlContext.createDataFrame(sc.emptyRDD(), schema)
spark.range(0).drop("id")
This creates a DataFrame with an "id" column and no rows then drops the "id" column, leaving you with a truly empty DataFrame.
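A quick sketch of what that one-liner leaves you with:
empty_df = spark.range(0).drop("id")
empty_df.printSchema()   # prints just "root" -- no columns
print(empty_df.count())  # 0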
You can just use something like this:
pivot_table = sparkSession.createDataFrame([("99","99")], ["col1","col2"])
If you want an empty dataframe based on an existing one, simply limit rows to 0.
In PySpark:
emptyDf = existingDf.limit(0)
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StructField, StringType
spark = SparkSession.builder.appName('SparkPractice').getOrCreate()
schema = StructType([
    StructField('firstname', StringType(), True),
    StructField('middlename', StringType(), True),
    StructField('lastname', StringType(), True)
])
df = spark.createDataFrame(spark.sparkContext.emptyRDD(),schema)
df.printSchema()
This is a roundabout but simple way to create an empty spark df with an inferred schema:
from pyspark.sql.functions import col

# Initialize a spark df using one row of data with the desired schema
init_sdf = spark.createDataFrame([('a_string', 0, 0)], ['name', 'index', 'seq_#'])
# Remove the row; the schema remains
empty_sdf = init_sdf.where(col('name') == 'not_match')
empty_sdf.printSchema()
# Output
root
|-- name: string (nullable = true)
|-- index: long (nullable = true)
|-- seq_#: long (nullable = true)
Seq.empty[String].toDF()
This will create an empty df. Helpful for testing purposes and the like. (Scala Spark)
In Spark 3.1.2, the spark.sparkContext.emptyRDD() function throws an error. Passing an empty list together with the schema works instead:
df = spark.createDataFrame([], schema)
You can do it by loading an empty file (parquet, json etc.) like this:
df = sqlContext.read.json("my_empty_file.json")
Then when you try to check the schema you'll see:
>>> df.printSchema()
root
In Scala/Java, not passing a path should work too; in Python it throws an exception. Also, if you ever switch to Scala/Python, you can use this method to create one.
You can create an empty data frame by using the following syntax in PySpark:
df = spark.createDataFrame([], ["col1", "col2", ...])
where [] represents an empty list of rows for col1 and col2. Then you can register it as a temp view for your SQL queries:
df.createOrReplaceTempView("artist")
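A quick usage sketch of that register-and-query flow, shown with an explicit schema so the empty frame is unambiguous (the schema and view name are illustrative):
from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([StructField("col1", StringType(), True),
                     StructField("col2", StringType(), True)])
df = spark.createDataFrame([], schema)
df.createOrReplaceTempView("artist")
spark.sql("SELECT * FROM artist").show()   # empty result, just the col1/col2 header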
