Enforcing a schema on an RDD while converting it to a DataFrame - apache-spark

I am very new to Apache Spark. I am trying to load a CSV file into a Spark RDD and DataFrame.
I use the RDD to manipulate the data and the DataFrame for SQL-like operations on the data.
While converting the RDD into a Spark DataFrame I run into a problem. The problem statement is given below.
import csv
# to load the data
dataRDD = sc.textFile(trackfilepath)
# to parse it as CSV
dataRDD = dataRDD.mapPartitions(lambda x: csv.reader(x))
# to load into a DataFrame and capture the schema
dataDF = sqlContext.read.load(trackfilepath,
                              format='com.databricks.spark.csv',
                              header='true',
                              inferSchema='true')
schema = dataDF.schema
The data looks like this:
print (dataRDD.take(3))
[['Name', 'f1', 'f2', 'f3', 'f4'], ['Joe', '5', '7', '8', '3'], ['Jill', '3', '2', '2', '23']]
print (dataDF.take(3))
[Row(_c0='Name', _c1='f1', _c2='f2', _c3='f3', _c4='f4'), Row(_c0='Joe', _c1='5', _c2='7', _c3='8', _c4='3'), Row(_c0='Jill', _c1='3', _c2='2', _c3='2', _c4='23')]
print(schema)
StructType(List(StructField(Name,StringType,true),StructField(f1,IntegerType,true),StructField(f2,IntegerType,true),StructField(f3,IntegerType,true),StructField(f4,IntegerType,true)))
Data Manipulation
def splitWords(line):
    return ['Jillwa' if item == 'Jill' else item for item in line]

dataCleanRDD = dataRDD.map(splitWords)
The Problem:
Now I am trying to store the manipulated RDD in a DataFrame using the code below and the captured schema.
dataCleanDF = sqlContext.createDataFrame(dataCleanRDD, schema=schema)
This gives me the below error:
TypeError: IntegerType can not accept object 'f1' in type <class 'str'>
The error is due to the mismatch between the datatypes of the values in the RDD and the schema. The RDD treats everything as a string, while the schema has integers for f1, f2, and so on. This is a dummy dataset; my real dataset consists of 200 columns and 100,000 rows, so it is difficult for me to manually change the RDD values to integers.
I was wondering if there is a way to force the schema on the RDD values. Any help would be appreciated.

If you want to read a CSV with a schema, I would suggest doing something like:
df = (sqlContext.read.format("com.databricks.spark.csv")
      .schema(dataSchema)
      .option("header", "false")
      .option("delimiter", ",")
      .option("inferSchema", "true")
      .option("treatEmptyValuesAsNulls", "true")
      .option("nullValue", "null")
      .load("data.csv"))
That way you will have your data loaded with the schema and you can operate on it. Instead of map, use withColumn with a UDF inside it, so you always keep the column names with you (see the sketch below).
Also, if you have a bigger dataset, save it in Parquet or ORC format first and then read it back to perform your operations; that will save you a lot of errors and give you much better performance.
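As a rough sketch of that withColumn-plus-UDF approach (a sketch only, assuming dataDF was read with the inferred schema and the columns shown above):
from pyspark.sql import functions as F
from pyspark.sql.types import StringType
# Replace 'Jill' with 'Jillwa' in the Name column while keeping every other
# column and its inferred type untouched.
replace_name = F.udf(lambda name: 'Jillwa' if name == 'Jill' else name, StringType())
dataCleanDF = dataDF.withColumn('Name', replace_name(F.col('Name')))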

Related

Passing Schema Manually to a Spark dataframe

Question: Is there a way to just pass the column names to a Spark DataFrame and expect Spark to infer the schema types?
My scenario: I'm trying to fire a Spark job using Kubernetes that basically reads CSV files from AWS S3 and creates a Spark DataFrame using spark.read.csv().
If there is no header in the CSV file, I need to pass the schema manually to the Spark DataFrame, which I can achieve with the following approach.
from pyspark.sql.types import StructType, StructField, StringType
schema = StructType([
    StructField('column_name', StringType(), True),
    StructField('column_name1', StringType(), True)
])
df = spark.read.csv(csv_file, header=False, schema=schema)
That's all fine.
But there is a problem: I'm passing the required parameters such as S3_access_key, secret_key, column_names, etc. as environment variables to the executor pods. Refer to the snippet below.
ArgoDriverV2.ArgoDriver.create_spark_job(
    's3-connector',
    'WriteS3',
    namespace='default',
    executors=2,
    args={
        "USER": self.user.id,
        "COLUMN_SCHEMA": json.dumps(column_names),
        "S3_FILE_KEYS": json.dumps(s3_file_keys),
        "S3_ACCESS_KEY": params['access_key'],
        "S3_SECRET_KEY": params['secret_key'],
        "N_EXECUTORS": 2,
    })
Using the column_names, I can generate the schema in the Spark job and pass it to the data frame, roughly as in the sketch below. But I find this approach a bit complicated.
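For context, the schema generation inside the Spark job looks roughly like this (a sketch only; it assumes COLUMN_SCHEMA holds a JSON list of column names and that every column can be read as a string):
import json
import os
from pyspark.sql.types import StructType, StructField, StringType
# COLUMN_SCHEMA is the JSON-encoded list of column names passed to the pod.
column_names = json.loads(os.environ["COLUMN_SCHEMA"])
# Only the names are known, so treat every column as a string.
schema = StructType([StructField(name, StringType(), True) for name in column_names])
df = spark.read.csv(csv_file, header=False, schema=schema)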
Is there a way to just pass the column names to a Spark DataFrame and expect Spark to infer the schema types?
You could read the csv using inferSchema=true and then simply rename the columns like this:
# let's say that we have a list of desired column names
cols = ['a', 'b', 'c']
df = spark.read.option("inferSchema", True).csv("test")
df = df.select([df[x].alias(y) for x,y in zip(df.columns, cols)])
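Alternatively, a small sketch under the same assumption (cols has one entry per CSV column): DataFrame.toDF accepts the new names directly.
# Rename all columns at once; the order of `cols` must match df.columns.
df = spark.read.option("inferSchema", True).csv("test").toDF(*cols)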

Cast datatype from array to String for multiple columns in Spark throwing error

I have a dataframe df that contains three columns of type array. I am trying to save the output
to CSV, so I converted the data types to string.
import org.apache.spark.sql.functions._
val df2 = df.withColumn("Total", col("total").cast("string")),
("BOOKID", col("BOOKID").cast("string"),
"PublisherID", col("PublisherID").cast("string")
.write
.csv(path="D:/pennymac/SOLUTION1/OUTPUT")
But I am getting an error:
"Cannot resolve symbol write"
Spark 2.2, Scala
Try the code below.
It's not possible to add multiple columns inside a single withColumn call.
val df2 = df
  .withColumn("Total", col("total").cast("string"))
  .withColumn("BOOKID", col("BOOKID").cast("string"))
  .withColumn("PublisherID", col("PublisherID").cast("string"))
  .write
  .csv(path="D:/pennymac/SOLUTION1/OUTPUT")

How to process a pyspark dataframe grouped by a column value

I have a huge dataframe of different item_id values and their related data. I need to process each item_id group separately, in parallel. I tried to repartition the dataframe by item_id using the code below, but it seems it is still being processed as a whole, not in chunks.
data = sqlContext.read.csv(path='/user/data', header=True)
columns = data.columns
result = data.repartition('ITEM_ID') \
    .rdd \
    .mapPartitions(lambda iter: pd.DataFrame(list(iter), columns=columns)) \
    .mapPartitions(scan_item_best_model) \
    .collect()
Also, is repartition the correct approach, or is there something I am doing wrong?
After looking around I found this, which addresses a similar problem. In the end I had to solve it like this:
import pandas as pd
from pyspark.sql import functions as F
data = sqlContext.read.csv(path='/user/data', header=True)
columns = data.columns
df = data.select("ITEM_ID", F.struct(columns).alias("df"))
df = df.groupBy('ITEM_ID').agg(F.collect_list('df').alias('data'))
df = df.rdd.map(lambda big_df: (big_df['ITEM_ID'],
                                pd.DataFrame.from_records(big_df['data'], columns=columns))) \
    .map(scan_item_best_model)
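If Spark 3.x is available, a cleaner way to run a function once per ITEM_ID group is groupBy().applyInPandas. A sketch only, assuming scan_item_best_model can be adapted to take and return a pandas DataFrame and that the output schema is known:
import pandas as pd

def process_group(pdf: pd.DataFrame) -> pd.DataFrame:
    # pdf contains every row for a single ITEM_ID as a pandas DataFrame;
    # hypothetically, the existing per-group logic is wrapped here.
    return scan_item_best_model(pdf)

# "ITEM_ID string, score double" is a placeholder for whatever the model returns.
result = data.groupBy("ITEM_ID").applyInPandas(process_group, schema="ITEM_ID string, score double")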

Save and append a file in HDFS using PySpark

I have a data frame in PySpark called df. I have registered this df as a temp table as shown below.
from datetime import datetime
df.registerTempTable('mytempTable')
date = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
Now from this temp table I will get certain values, like the min_id and max_id of a column id.
min_id = sqlContext.sql("select nvl(min(id),0) as minval from mytempTable").collect()[0].asDict()['minval']
max_id = sqlContext.sql("select nvl(max(id),0) as maxval from mytempTable").collect()[0].asDict()['maxval']
Now I will combine all these values as below.
test = ("{},{},{}".format(date,min_id,max_id))
I found that test is not a data frame but a str (string):
>>> type(test)
<type 'str'>
Now I want to save this test as a file in HDFS. I would also like to append data to the same file in HDFS.
How can I do that using PySpark?
FYI, I am using Spark 1.6 and don't have access to the Databricks spark-csv package.
Here you go, you'll just need to concat your data with concat_ws and write it as text:
query = """select concat_ws(',', date, nvl(min(id), 0), nvl(max(id), 0))
from mytempTable"""
sqlContext.sql(query).write("text").mode("append").save("/tmp/fooo")
Or an even better alternative:
from pyspark.sql import functions as f
(sqlContext
    .table("myTempTable")
    .select(f.concat_ws(",", f.first(f.lit(date)), f.min("id"), f.max("id")))
    .coalesce(1)
    .write.format("text").mode("append").save("/tmp/fooo"))

Creating a Spark DataFrame from an RDD of lists

I have an rdd (we can call it myrdd) where each record in the rdd is of the form:
[('column 1',value), ('column 2',value), ('column 3',value), ... , ('column 100',value)]
I would like to convert this into a DataFrame in pyspark - what is the easiest way to do this?
How about using the toDF method? You only need to add the field names.
df = rdd.toDF(['column', 'value'])
The answer by @dapangmao got me to this solution:
from pyspark.sql import Row
my_df = my_rdd.map(lambda l: Row(**dict(l))).toDF()
Take a look at the DataFrame documentation to make this example work for you, but this should work. I'm assuming your RDD is called my_rdd.
from pyspark.sql import SQLContext, Row
sqlContext = SQLContext(sc)
# You have a ton of columns and each one should be an argument to Row.
# Use a dictionary comprehension to make this easier; each record is a list
# of (column name, value) tuples, so take the value part of each pair.
def record_to_row(record):
    schema = {'column{i:d}'.format(i=col_idx + 1): record[col_idx][1]
              for col_idx in range(100)}
    return Row(**schema)
row_rdd = my_rdd.map(lambda x: record_to_row(x))
# Now infer the schema and you have a DataFrame
schema_my_rdd = sqlContext.inferSchema(row_rdd)
# Now you have a DataFrame you can register as a table
schema_my_rdd.registerTempTable("my_table")
I haven't worked much with DataFrames in Spark, but this should do the trick.
In PySpark, let's say you have a DataFrame named userDF.
>>> type(userDF)
<class 'pyspark.sql.dataframe.DataFrame'>
Let's just convert it to an RDD:
userRDD = userDF.rdd
>>> type(userRDD)
<class 'pyspark.rdd.RDD'>
Now you can do some manipulations and call, for example, the map function:
newRDD = userRDD.map(lambda x:{"food":x['favorite_food'], "name":x['name']})
Finally, let's create a DataFrame from the resilient distributed dataset (RDD).
newDF = sqlContext.createDataFrame(newRDD, ["food", "name"])
>>> type(newDF)
<class 'pyspark.sql.dataframe.DataFrame'>
That's all.
I was hitting this warning message before, when I tried to call:
newDF = sc.parallelize(newRDD, ["food","name"])
.../spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/session.py:336: UserWarning: Using RDD of dict to inferSchema is deprecated. Use pyspark.sql.Row instead
  warnings.warn("Using RDD of dict to inferSchema is deprecated. ")
So no need to do this anymore...
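Following that warning's advice, a minimal sketch of the Row-based path (same userRDD as above):
from pyspark.sql import Row
# Map to Row objects instead of dicts, as the deprecation warning suggests.
rowRDD = userRDD.map(lambda x: Row(food=x['favorite_food'], name=x['name']))
newDF = sqlContext.createDataFrame(rowRDD)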
