Passing Schema Manually to a Spark dataframe - apache-spark

Question: Is there a way to just pass the column names to a Spark DataFrame and have Spark infer the schema types?
My scenario: I'm launching a Spark job on Kubernetes that reads CSV files from AWS S3 and creates a Spark DataFrame using spark.read.csv().
If the CSV file has no header, I need to pass the schema manually to the DataFrame, which I can do with the following approach.
schema = StructType([
    StructField('column_name', StringType(), True),
    StructField('column_name1', StringType(), True)
])
df = spark.read.csv(csv_file, header=False, schema=schema)
That's all fine.
But
Problem: I'm passing the required parameters, such as the S3 access key, secret key, column names, etc., as environment variables to the executor pods. Refer to the snippet below.
ArgoDriverV2.ArgoDriver.create_spark_job(
    's3-connector',
    'WriteS3',
    namespace='default',
    executors=2,
    args={
        "USER": self.user.id,
        "COLUMN_SCHEMA": json.dumps(column_names),
        "S3_FILE_KEYS": json.dumps(s3_file_keys),
        "S3_ACCESS_KEY": params['access_key'],
        "S3_SECRET_KEY": params['secret_key'],
        "N_EXECUTORS": 2,
    })
Using column_names, I can generate the schema inside the Spark job and pass it to the DataFrame, but I find this approach a bit convoluted.
Is there a way to just pass the column names to a Spark DataFrame and have Spark infer the schema types?

You could read the csv using inferSchema=true and then simply rename the columns like this:
# let's say that we have a list of desired column names
cols = ['a', 'b', 'c']
df = spark.read.option("inferSchema", True).csv("test")
df = df.select([df[x].alias(y) for x,y in zip(df.columns, cols)])
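If the column names arrive as a JSON list in the COLUMN_SCHEMA environment variable (as in the snippet in the question), a minimal sketch of the same idea, using toDF to rename all the columns at once, could look like this (the os.environ lookup is an assumption about how the variable reaches the job):
import json
import os

# Assumes COLUMN_SCHEMA holds a JSON-encoded list of column names,
# e.g. '["column_name", "column_name1"]'.
cols = json.loads(os.environ["COLUMN_SCHEMA"])

# Let Spark infer the types, then rename every column in one call.
df = spark.read.csv(csv_file, header=False, inferSchema=True)
df = df.toDF(*cols)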

Related

Read csv that contains array of string in pyspark

I'm trying to read a csv that has the following data:
name,date,win,stops,cost
a,2020-1-1,true,"[""x"", ""y"", ""z""]", 2.3
b,2021-3-1,true,, 1.3
c,2023-2-1,true,"[""x""]", 0.3
d,2021-3-1,true,"[""z""]", 2.3
Using inferSchema results in the stops field spilling over into the next columns and messing up the DataFrame.
If I give my own schema like:
schema = StructType([
    StructField('name', StringType()),
    StructField('date', TimestampType()),
    StructField('win', BooleanType()),
    StructField('stops', ArrayType(StringType())),
    StructField('cost', DoubleType())])
it results in this exception:
pyspark.sql.utils.AnalysisException: CSV data source does not support array<string> data type.
So how would I properly read the CSV without this failure?
Since CSV doesn't support arrays, you need to read the column as a string first and then convert it.
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType

# You need to set the escape option to ", since it is not the default escape character (\).
df = spark.read.csv('file.csv', header=True, escape='"')
df = df.withColumn('stops', F.from_json('stops', ArrayType(StringType())))
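If you also want the other columns typed as in the schema from the question, a small follow-up sketch (building on the df above) could cast them afterwards:
# Hypothetical follow-up: cast the remaining string columns to the intended types.
# Depending on your Spark version, the non-padded dates (e.g. 2020-1-1) may instead
# need F.to_timestamp with an explicit format.
df = (df
      .withColumn('date', F.col('date').cast('timestamp'))
      .withColumn('win', F.col('win').cast('boolean'))
      .withColumn('cost', F.col('cost').cast('double')))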
I guess this is what you are looking for:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('abc').getOrCreate()
dataframe = spark.read.options(header='True', delimiter=",").csv("file_name.csv")
dataframe.printSchema()
Let me know if it helps

Appending data to an empty dataframe

I am creating an empty DataFrame and later trying to append another DataFrame to it. In fact, I want to append many DataFrames to the initially empty DataFrame dynamically, depending on the number of RDDs coming in.
The union() function works fine if I assign the result to a third DataFrame:
val df3 = df1.union(df2)
But I want to keep appending to the initial (empty) DataFrame, because I want to store all the RDDs in one DataFrame. The code below, however, does not show the right count; it seems the data was simply not appended:
df1.union(df2)
df1.count() // this shows 0 although df2 has some data, which does show up if I assign the union to a third DataFrame
If I do the following, I get a reassignment error since df1 is a val. And if I change it to a var, I get a "Kafka multithreading not safe" error:
df1 = df1.union(df2)
Any idea how to add all the dynamically created DataFrames to one initially created DataFrame?
Not sure if this is what you are looking for!
# Import pyspark schema types
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
# Define your schema
field = [StructField("Col1", StringType(), True), StructField("Col2", IntegerType(), True)]
schema = StructType(field)
# Your empty data frame
df = spark.createDataFrame(sc.emptyRDD(), schema)
l = []
for i in range(5):
    # Build and append to the list dynamically
    l = l + [[str(i), i]]
    # Create a temporary data frame similar to your original schema
    temp_df = spark.createDataFrame(l, schema)
    # Do the union with the original data frame
    df = df.union(temp_df)
df.show()
DataFrames, like other distributed data structures, are immutable, so methods that operate on them always return a new object. There is no appending, no modification in place, and no ALTER TABLE equivalent.
And if I change it to var type, I get kafka multithreading not safe error.
Without the actual code it is impossible to give you a definitive answer, but it is unlikely to be related to the union code.
There are a number of known Spark bugs caused by incorrect internal implementations (SPARK-19185 and SPARK-23623, to name just a couple).
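If the goal is simply to end up with one DataFrame holding everything, a minimal PySpark sketch (assuming the dynamically created frames can be collected into a list first) is to fold them with union:
from functools import reduce
from pyspark.sql import DataFrame

# Hypothetical list of the dynamically created DataFrames (one per incoming batch).
dfs = [df1, df2, df3]

# DataFrames are immutable, so build the combined result as a new object
# instead of trying to append in place.
combined = reduce(DataFrame.union, dfs)
combined.count()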

Enforcing a schema on RDD while converting them to DataFrame

I am very new to Apache Spark. I am trying to load a CSV file into a Spark RDD and a DataFrame.
I use the RDD to manipulate the data and the DataFrame for SQL-like operations on it.
While converting the RDD into a Spark DataFrame I run into a problem. The problem statement is given below.
# To load the data
dataRDD = sc.textFile(trackfilepath)
# To parse it as a CSV
dataRDD = dataRDD.mapPartitions(lambda x: csv.reader(x))
# To load into a data frame and capture the schema
dataDF = sqlContext.read.load(trackfilepath,
                              format='com.databricks.spark.csv',
                              header='true',
                              inferSchema='true')
schema = dataDF.schema
The data looks like:
print (dataRDD.take(3))
[['Name', 'f1', 'f2', 'f3', 'f4'], ['Joe', '5', '7', '8', '3'], ['Jill', '3', '2', '2', '23']]
print (dataDF.take(3))
[Row(_c0='Name', _c1='f1', _c2='f2', _c3='f3', _c4='f4'), Row(_c0='Joe', _c1='5', _c2='7', _c3='8', _c4='3'), Row(_c0='Jill', _c1='3', _c2='2', _c3='2', _c4='23')]
print(schema)
StructType(List(StructField(Name,StringType,true),StructField(f1,IntegerType,true),StructField(f2,IntegerType,true),StructField(f3,IntegerType,true),StructField(f4,IntegerType,true)))
Data Manipulation
def splitWords(line):
    return ['Jillwa' if item == 'Jill' else item for item in line]
dataCleanRDD = dataRDD.map(splitWords)
The Problem:
Now I am trying to store the manipulated RDD into a DataFrame using the below code and the schema.
dataCleanDF = sqlContext.createDataFrame(dataCleanRDD, schema=schema)
This gives me the below error:
TypeError: IntegerType can not accept object 'f1' in type <class 'str'>
The error is due to the mismatch between the data types of the values in the RDD and the schema. The RDD treats everything as a string, while the schema expects integers for f1, f2, and so on. This is a dummy dataset; my real dataset has 200 columns and 100,000 rows, so it is difficult for me to manually cast the RDD values to integers.
I was wondering if there is a way to force the schema onto the RDD values. Any help would be appreciated.
If you want to read a CSV with a schema, I would suggest doing something like this:
df = (sqlContext.read.format("com.databricks.spark.csv")
      .schema(dataSchema)
      .option("header", "false")
      .option("delimiter", ",")
      .option("inferSchema", "true")
      .option("treatEmptyValuesAsNulls", "true")
      .option("nullValue", "null")
      .load("data.csv"))
So you will have your data with the schema applied and can operate on it; instead of map, use withColumn with a UDF inside it, so you always have the column names available.
Also, if you have a bigger dataset, save it as Parquet or ORC first and then read it back to perform the operations; that will save you a lot of errors and performance will be much better.
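For example, the splitWords logic from the question could be rewritten as a withColumn call with a UDF, roughly like this (a sketch that reuses the df read above and the Name column from the sample data):
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# Same replacement as splitWords, but applied by column name instead of by position.
replace_jill = F.udf(lambda name: 'Jillwa' if name == 'Jill' else name, StringType())

dataCleanDF = df.withColumn('Name', replace_jill(df['Name']))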

Get CSV to Spark dataframe

I'm using python on Spark and would like to get a csv into a dataframe.
The documentation for Spark SQL strangely does not provide explanations for CSV as a source.
I have found Spark-CSV, however I have issues with two parts of the documentation:
"This package can be added to Spark using the --jars command line option. For example, to include it when starting the spark shell: $ bin/spark-shell --packages com.databricks:spark-csv_2.10:1.0.3"
Do I really need to add this argument every time I launch pyspark or spark-submit? It seems very inelegant. Isn't there a way to import it in Python rather than redownloading it each time?
df = sqlContext.load(source="com.databricks.spark.csv", header="true", path="cars.csv") won't work even if I do the above. What does the "source" argument stand for in this line of code? How do I simply load a local file on Linux, say "/Spark_Hadoop/spark-1.3.1-bin-cdh4/cars.csv"?
With more recent versions of Spark (as of, I believe, 1.4) this has become a lot easier. The expression sqlContext.read gives you a DataFrameReader instance, with a .csv() method:
df = sqlContext.read.csv("/path/to/your.csv")
Note that you can also indicate that the CSV file has a header by adding the keyword argument header=True to the .csv() call. A handful of other options are available and are described in the documentation.
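For example (assuming a Spark version where these keyword arguments are supported):
# Read with a header row and let Spark infer the column types.
df = sqlContext.read.csv("/path/to/your.csv", header=True, inferSchema=True)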
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
Employee_rdd = sc.textFile("\..\Employee.csv") \
    .map(lambda line: line.split(","))
Employee_df = Employee_rdd.toDF(['Employee_ID','Employee_name'])
Employee_df.show()
For PySpark, assuming that the first row of the CSV file contains a header:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('chosenName').getOrCreate()
df = spark.read.csv('fileNameWithPath', mode="DROPMALFORMED", inferSchema=True, header=True)
1. Read the CSV file into an RDD and then generate a RowRDD from the original RDD.
2. Create the schema, represented by a StructType, matching the structure of the rows in the RDD created in step 1.
3. Apply the schema to the RDD of rows via the createDataFrame method provided by SQLContext.
from pyspark.sql.types import StructType, StructField, StringType

lines = sc.textFile("examples/src/main/resources/people.txt")
parts = lines.map(lambda l: l.split(","))
# Each line is converted to a tuple.
people = parts.map(lambda p: (p[0], p[1].strip()))
# The schema is encoded in a string.
schemaString = "name age"
fields = [StructField(field_name, StringType(), True) for field_name in schemaString.split()]
schema = StructType(fields)
# Apply the schema to the RDD.
schemaPeople = spark.createDataFrame(people, schema)
Source: Spark Programming Guide
If you do not mind the extra package dependency, you could use Pandas to parse the CSV file. It handles internal commas just fine.
Dependencies:
from pyspark import SparkContext
from pyspark.sql import SQLContext
import pandas as pd
Read the whole file at once into a Spark DataFrame:
sc = SparkContext('local','example') # if using locally
sql_sc = SQLContext(sc)
pandas_df = pd.read_csv('file.csv') # assuming the file contains a header
# If no header:
# pandas_df = pd.read_csv('file.csv', names = ['column 1','column 2'])
s_df = sql_sc.createDataFrame(pandas_df)
Or, to be more memory-conscious, you can chunk the data into a Spark RDD and then a DataFrame:
chunk_100k = pd.read_csv('file.csv', chunksize=100000)
for chunky in chunk_100k:
    Spark_temp_rdd = sc.parallelize(chunky.values.tolist())
    try:
        Spark_full_rdd += Spark_temp_rdd
    except NameError:
        Spark_full_rdd = Spark_temp_rdd
    del Spark_temp_rdd
Spark_DF = Spark_full_rdd.toDF(['column 1', 'column 2'])
Since Spark 2.0, it is recommended to use a SparkSession:
from pyspark.sql import SparkSession
from pyspark.sql import Row
# Create a SparkSession
spark = SparkSession \
.builder \
.appName("basic example") \
.config("spark.some.config.option", "some-value") \
.getOrCreate()
def mapper(line):
    fields = line.split(',')
    return Row(ID=int(fields[0]), field1=str(fields[1].encode("utf-8")), field2=int(fields[2]), field3=int(fields[3]))
lines = spark.sparkContext.textFile("file.csv")
df = lines.map(mapper)
# Infer the schema, and register the DataFrame as a table.
schemaDf = spark.createDataFrame(df).cache()
schemaDf.createOrReplaceTempView("tablename")
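Once the view is registered, it can be queried with ordinary Spark SQL, for example:
# Query the temp view registered above.
results = spark.sql("SELECT ID, field1, field2 FROM tablename")
results.show()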
I ran into a similar problem. The solution is to add an environment variable named "PYSPARK_SUBMIT_ARGS" and set its value to "--packages com.databricks:spark-csv_2.10:1.4.0 pyspark-shell". This works with Spark's Python interactive shell.
Make sure you match the version of spark-csv with the version of Scala installed. With Scala 2.11, it is spark-csv_2.11 and with Scala 2.10 or 2.10.5 it is spark-csv_2.10.
Hope it works.
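If you prefer setting it from Python rather than in your shell profile, a small sketch (it has to run before the SparkContext is created) would be:
import os

# Must be set before the SparkContext / SparkSession is created,
# otherwise the extra package is not picked up.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages com.databricks:spark-csv_2.10:1.4.0 pyspark-shell"
)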
Based on the answer by Aravind, but much shorter, e.g.:
lines = sc.textFile("/path/to/file").map(lambda x: x.split(","))
df = lines.toDF(["year", "month", "day", "count"])
With the current implementation (Spark 2.x) you don't need to add the packages argument; you can use the built-in CSV reader.
Additionally, unlike the accepted answer, you don't need to create an RDD and then enforce a schema, which has one potential problem:
when you read the CSV that way, it will mark all the fields as strings, and when you then enforce a schema with an integer column you will get an exception.
A better way to do the above would be:
spark.read.format("csv").schema(schema).option("header", "true").load(input_path).show()
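The schema object referenced above can be built with StructType; a hypothetical example (column names borrowed from the earlier year/month/day/count answer, purely for illustration):
from pyspark.sql.types import StructType, StructField, IntegerType

# Hypothetical schema; adjust the names and types to your file.
schema = StructType([
    StructField("year", IntegerType(), True),
    StructField("month", IntegerType(), True),
    StructField("day", IntegerType(), True),
    StructField("count", IntegerType(), True),
])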

Creating a Spark DataFrame from an RDD of lists

I have an rdd (we can call it myrdd) where each record in the rdd is of the form:
[('column 1',value), ('column 2',value), ('column 3',value), ... , ('column 100',value)]
I would like to convert this into a DataFrame in pyspark - what is the easiest way to do this?
How about using the toDF method? You only need to add the field names.
df = rdd.toDF(['column', 'value'])
The answer by #dapangmao got me to this solution:
my_df = my_rdd.map(lambda l: Row(**dict(l))).toDF()
Take a look at the DataFrame documentation to adapt this example to your case, but this should work. I'm assuming your RDD is called my_rdd.
from pyspark.sql import SQLContext, Row
sqlContext = SQLContext(sc)
# You have a ton of columns and each one should be an argument to Row
# Use a dictionary comprehension to make this easier
def record_to_row(record):
    schema = {'column{i:d}'.format(i=col_idx): record[col_idx] for col_idx in range(1, 100 + 1)}
    return Row(**schema)
row_rdd = my_rdd.map(lambda x: record_to_row(x))
# Now infer the schema and you have a DataFrame
schema_my_rdd = sqlContext.inferSchema(row_rdd)
# Now you have a DataFrame you can register as a table
schema_my_rdd.registerTempTable("my_table")
I haven't worked much with DataFrames in Spark but this should do the trick
In PySpark, let's say you have a DataFrame named userDF.
>>> type(userDF)
<class 'pyspark.sql.dataframe.DataFrame'>
Let's just convert it to an RDD:
userRDD = userDF.rdd
>>> type(userRDD)
<class 'pyspark.rdd.RDD'>
Now you can do some manipulation and call, for example, the map function:
newRDD = userRDD.map(lambda x:{"food":x['favorite_food'], "name":x['name']})
Finally, let's create a DataFrame from the resilient distributed dataset (RDD):
newDF = sqlContext.createDataFrame(newRDD, ["food", "name"])
>>> type(newDF)
<class 'pyspark.sql.dataframe.DataFrame'>
That's all.
I was hitting this warning message before, when I tried to call:
newDF = sc.parallelize(newRDD, ["food", "name"])
.../spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/session.py:336: UserWarning: Using RDD of dict to inferSchema is deprecated. Use pyspark.sql.Row instead
  warnings.warn("Using RDD of dict to inferSchema is deprecated. "
So no need to do this anymore...
