Get CSV to Spark dataframe - apache-spark

I'm using python on Spark and would like to get a csv into a dataframe.
The documentation for Spark SQL strangely does not provide explanations for CSV as a source.
I have found Spark-CSV, however I have issues with two parts of the documentation:
"This package can be added to Spark using the --jars command line option. For example, to include it when starting the spark shell: $ bin/spark-shell --packages com.databricks:spark-csv_2.10:1.0.3"
Do I really need to add this argument everytime I launch pyspark or spark-submit? It seems very inelegant. Isn't there a way to import it in python rather than redownloading it each time?
df = sqlContext.load(source="com.databricks.spark.csv", header="true", path = "cars.csv") Even if I do the above, this won't work. What does the "source" argument stand for in this line of code? How do I simply load a local file on linux, say "/Spark_Hadoop/spark-1.3.1-bin-cdh4/cars.csv"?

With more recent versions of Spark (as of, I believe, 1.4) this has become a lot easier. The expression sqlContext.read gives you a DataFrameReader instance, with a .csv() method:
df = sqlContext.read.csv("/path/to/your.csv")
Note that you can also indicate that the csv file has a header by adding the keyword argument header=True to the .csv() call. A handful of other options are available, and described in the link above.

from pyspark.sql.types import StringType
from pyspark import SQLContext
sqlContext = SQLContext(sc)
Employee_rdd = sc.textFile("\..\Employee.csv")
.map(lambda line: line.split(","))
Employee_df = Employee_rdd.toDF(['Employee_ID','Employee_name'])
Employee_df.show()

for Pyspark, assuming that the first row of the csv file contains a header
spark = SparkSession.builder.appName('chosenName').getOrCreate()
df=spark.read.csv('fileNameWithPath', mode="DROPMALFORMED",inferSchema=True, header = True)

Read the csv file in to a RDD and then generate a RowRDD from the original RDD.
Create the schema represented by a StructType matching the structure of Rows in the RDD created in Step 1.
Apply the schema to the RDD of Rows via createDataFrame method provided by SQLContext.
lines = sc.textFile("examples/src/main/resources/people.txt")
parts = lines.map(lambda l: l.split(","))
# Each line is converted to a tuple.
people = parts.map(lambda p: (p[0], p[1].strip()))
# The schema is encoded in a string.
schemaString = "name age"
fields = [StructField(field_name, StringType(), True) for field_name in schemaString.split()]
schema = StructType(fields)
# Apply the schema to the RDD.
schemaPeople = spark.createDataFrame(people, schema)
source: SPARK PROGRAMMING GUIDE

If you do not mind the extra package dependency, you could use Pandas to parse the CSV file. It handles internal commas just fine.
Dependencies:
from pyspark import SparkContext
from pyspark.sql import SQLContext
import pandas as pd
Read the whole file at once into a Spark DataFrame:
sc = SparkContext('local','example') # if using locally
sql_sc = SQLContext(sc)
pandas_df = pd.read_csv('file.csv') # assuming the file contains a header
# If no header:
# pandas_df = pd.read_csv('file.csv', names = ['column 1','column 2'])
s_df = sql_sc.createDataFrame(pandas_df)
Or, even more data-consciously, you can chunk the data into a Spark RDD then DF:
chunk_100k = pd.read_csv('file.csv', chunksize=100000)
for chunky in chunk_100k:
Spark_temp_rdd = sc.parallelize(chunky.values.tolist())
try:
Spark_full_rdd += Spark_temp_rdd
except NameError:
Spark_full_rdd = Spark_temp_rdd
del Spark_temp_rdd
Spark_DF = Spark_full_rdd.toDF(['column 1','column 2'])

Following Spark 2.0, it is recommended to use a Spark Session:
from pyspark.sql import SparkSession
from pyspark.sql import Row
# Create a SparkSession
spark = SparkSession \
.builder \
.appName("basic example") \
.config("spark.some.config.option", "some-value") \
.getOrCreate()
def mapper(line):
fields = line.split(',')
return Row(ID=int(fields[0]), field1=str(fields[1].encode("utf-8")), field2=int(fields[2]), field3=int(fields[3]))
lines = spark.sparkContext.textFile("file.csv")
df = lines.map(mapper)
# Infer the schema, and register the DataFrame as a table.
schemaDf = spark.createDataFrame(df).cache()
schemaDf.createOrReplaceTempView("tablename")

I ran into similar problem. The solution is to add an environment variable named as "PYSPARK_SUBMIT_ARGS" and set its value to "--packages com.databricks:spark-csv_2.10:1.4.0 pyspark-shell". This works with Spark's Python interactive shell.
Make sure you match the version of spark-csv with the version of Scala installed. With Scala 2.11, it is spark-csv_2.11 and with Scala 2.10 or 2.10.5 it is spark-csv_2.10.
Hope it works.

Based on the answer by Aravind, but much shorter, e.g. :
lines = sc.textFile("/path/to/file").map(lambda x: x.split(","))
df = lines.toDF(["year", "month", "day", "count"])

With the current implementation(spark 2.X) you dont need to add the packages argument, You can use the inbuilt csv implementation
Additionally as the accepted answer you dont need to create an rdd then enforce schema that has 1 potential problem
When you read the csv as then it will mark all the fields as string and when you enforce the schema with an integer column you will get exception.
A better way to do the above would be
spark.read.format("csv").schema(schema).option("header", "true").load(input_path).show()

Related

Read csv that contains array of string in pyspark

I'm trying to read a csv that has the following data:
name,date,win,stops,cost
a,2020-1-1,true,"[""x"", ""y"", ""z""]", 2.3
b,2021-3-1,true,, 1.3
c,2023-2-1,true,"[""x""]", 0.3
d,2021-3-1,true,"[""z""]", 2.3
using inferSchema results in the stops field spilling over to the next columns and messing up the dataframe
If I give my own schema like:
schema = StructType([
StructField('name', StringType()),
StructField('date', TimestampType()),
StructField('win', Booleantype()),
StructField('stops', ArrayType(StringType())),
StructField('cost', DoubleType())])
results in this exception:
pyspark.sql.utils.AnalysisException: CSV data source does not support array<string> data type.
so how would I properly read the csv without this failure?
Since csv doesn't support array, you need to first read as string, then convert it.
# You need to set escape option to ", since it is not the default escape character (\).
df = spark.read.csv('file.csv', header=True, escape='"')
df = df.withColumn('stops', F.from_json('stops', ArrayType(StringType())))
I guess this is what you are looking for:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('abc').getOrCreate()
dataframe = spark.read.options(header='True', delimiter=",").csv("file_name.csv")
dataframe.printSchema()
Let me know if it helps

Defining Schema for a large dataframe in spark

I am trying to load a large dataset from a txt file (1000 columns, > 1M Rows) into the spark environment and my dataset has no headers, as a concequence I have run into this error:
TypeError: Can not infer schema for type:
The challenge: looking at the examples given in the documentation, both examples show how to infer the schema using reflection and programmatically demonstrate the idea with few (two) columns that can be easily typed. No special column names are needed since the data represents a matrix.
How would I go about inferring from a larger set of columns, hopefully w/o typing out. Or can the data be loaded in an alternative way that does not require these definitions.
PS: Spark newbie and using pyspark
EDITED (Added information)
dataset = "./data.txt"
conf = (SparkConf()
.setAppName("myApp")
.setMaster("host")
.set("spark.cores.max", "15")
.set("spark.rdd.compress", "true")
.set("spark.broadcast.compress", "true"))
sc = SparkContext(conf=conf)
spark = SparkSession \
.builder \
.appName("myApp") \
.config(conf=SparkConf()) \
.getOrCreate()
data = sc.textFile(dataset)
df = spark.createDataFrame(data)
data.txt contains 1Mn rows and 1000 columns similar to what would be obtained for example by the following code:
np.random.randint(20, size=(1000000, 1000))

PySpark How to read CSV into Dataframe, and manipulate it

I'm quite new to pyspark and am trying to use it to process a large dataset which is saved as a csv file.
I'd like to read CSV file into spark dataframe, drop some columns, and add new columns.
How should I do that?
I am having trouble getting this data into a dataframe. This is a stripped down version of what I have so far:
def make_dataframe(data_portion, schema, sql):
fields = data_portion.split(",")
return sql.createDateFrame([(fields[0], fields[1])], schema=schema)
if __name__ == "__main__":
sc = SparkContext(appName="Test")
sql = SQLContext(sc)
...
big_frame = data.flatMap(lambda line: make_dataframe(line, schema, sql))
.reduce(lambda a, b: a.union(b))
big_frame.write \
.format("com.databricks.spark.redshift") \
.option("url", "jdbc:redshift://<...>") \
.option("dbtable", "my_table_copy") \
.option("tempdir", "s3n://path/for/temp/data") \
.mode("append") \
.save()
sc.stop()
This produces an error TypeError: 'JavaPackage' object is not callable at the reduce step.
Is it possible to do this? The idea with reducing to a dataframe is to be able to write the resulting data to a database (Redshift, using the spark-redshift package).
I have also tried using unionAll(), and map() with partial() but can't get it to work.
I am running this on Amazon's EMR, with spark-redshift_2.10:2.0.0, and Amazon's JDBC driver RedshiftJDBC41-1.1.17.1017.jar.
Update - answering also your question in comments:
Read data from CSV to dataframe:
It seems that you only try to read CSV file into a spark dataframe.
If so - my answer here: https://stackoverflow.com/a/37640154/5088142 cover this.
The following code should read CSV into a spark-data-frame
import pyspark
sc = pyspark.SparkContext()
sql = SQLContext(sc)
df = (sql.read
.format("com.databricks.spark.csv")
.option("header", "true")
.load("/path/to_csv.csv"))
// these lines are equivalent in Spark 2.0 - using [SparkSession][1]
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName("Python Spark SQL basic example") \
.config("spark.some.config.option", "some-value") \
.getOrCreate()
spark.read.format("csv").option("header", "true").load("/path/to_csv.csv")
spark.read.option("header", "true").csv("/path/to_csv.csv")
drop column
you can drop column using "drop(col)"
https://spark.apache.org/docs/1.6.2/api/python/pyspark.sql.html
drop(col)
Returns a new DataFrame that drops the specified column.
Parameters: col – a string name of the column to drop, or a Column to drop.
>>> df.drop('age').collect()
[Row(name=u'Alice'), Row(name=u'Bob')]
>>> df.drop(df.age).collect()
[Row(name=u'Alice'), Row(name=u'Bob')]
>>> df.join(df2, df.name == df2.name, 'inner').drop(df.name).collect()
[Row(age=5, height=85, name=u'Bob')]
>>> df.join(df2, df.name == df2.name, 'inner').drop(df2.name).collect()
[Row(age=5, name=u'Bob', height=85)]
add column
You can use "withColumn"
https://spark.apache.org/docs/1.6.2/api/python/pyspark.sql.html
withColumn(colName, col)
Returns a new DataFrame by adding a column or replacing the existing column that has the same name.
Parameters:
colName – string, name of the new column.
col – a Column expression for the new column.
>>> df.withColumn('age2', df.age + 2).collect()
[Row(age=2, name=u'Alice', age2=4), Row(age=5, name=u'Bob', age2=7)]
Note: spark has a lot of other functions which can be used (e.g. you can use "select" instead of "drop")

Can I read a CSV represented as a string into Apache Spark using spark-csv

I know how to read a csv file into spark using spark-csv (https://github.com/databricks/spark-csv), but I already have the csv file represented as a string and would like to convert this string directly to dataframe. Is this possible?
Update : Starting from Spark 2.2.x
there is finally a proper way to do it using Dataset.
import org.apache.spark.sql.{Dataset, SparkSession}
val spark = SparkSession.builder().appName("CsvExample").master("local").getOrCreate()
import spark.implicits._
val csvData: Dataset[String] = spark.sparkContext.parallelize(
"""
|id, date, timedump
|1, "2014/01/01 23:00:01",1499959917383
|2, "2014/11/31 12:40:32",1198138008843
""".stripMargin.lines.toList).toDS()
val frame = spark.read.option("header", true).option("inferSchema",true).csv(csvData)
frame.show()
frame.printSchema()
Old spark versions
Actually you can, though it's using library internals and not widely advertised. Just create and use your own CsvParser instance.
Example that works for me on spark 1.6.0 and spark-csv_2.10-1.4.0 below
import com.databricks.spark.csv.CsvParser
val csvData = """
|userid,organizationid,userfirstname,usermiddlename,userlastname,usertitle
|1,1,user1,m1,l1,mr
|2,2,user2,m2,l2,mr
|3,3,user3,m3,l3,mr
|""".stripMargin
val rdd = sc.parallelize(csvData.lines.toList)
val csvParser = new CsvParser()
.withUseHeader(true)
.withInferSchema(true)
val csvDataFrame: DataFrame = csvParser.csvRdd(sqlContext, rdd)
You can parse your string into a csv using, e.g. scala-csv:
val myCSVdata : Array[List[String]] =
myCSVString.split('\n').flatMap(CSVParser.parseLine(_))
Here you can do a bit more processing, data cleaning, verifying that every line parses well and has the same number of fields, etc ...
You can then make this an RDD of records:
val myCSVRDD : RDD[List[String]] = sparkContext.parallelize(msCSVdata)
Here you can massage your lists of Strings into a case class, to reflect the fields of your csv data better. You should get some inspiration from the creations of Persons in this example:
https://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection
I omit this step.
You can then convert to a DataFrame:
import spark.implicits._
myCSVDataframe = myCSVRDD.toDF()
The accepted answer wasn't working for me in spark 2.2.0 but lead me to what I needed with csvData.lines.toList
val fileUrl = getClass.getResource(s"/file_in_resources.csv")
val stream = fileUrl.getContent.asInstanceOf[InputStream]
val streamString = Source.fromInputStream(stream).mkString
val csvList = streamString.lines.toList
spark.read
.option("header", "true")
.option("inferSchema", "true")
.csv(csvList.toDS())
.as[SomeCaseClass]

Creating a Spark DataFrame from an RDD of lists

I have an rdd (we can call it myrdd) where each record in the rdd is of the form:
[('column 1',value), ('column 2',value), ('column 3',value), ... , ('column 100',value)]
I would like to convert this into a DataFrame in pyspark - what is the easiest way to do this?
How about use the toDF method? You only need add the field names.
df = rdd.toDF(['column', 'value'])
The answer by #dapangmao got me to this solution:
my_df = my_rdd.map(lambda l: Row(**dict(l))).toDF()
Take a look at the DataFrame documentation to make this example work for you, but this should work. I'm assuming your RDD is called my_rdd
from pyspark.sql import SQLContext, Row
sqlContext = SQLContext(sc)
# You have a ton of columns and each one should be an argument to Row
# Use a dictionary comprehension to make this easier
def record_to_row(record):
schema = {'column{i:d}'.format(i = col_idx):record[col_idx] for col_idx in range(1,100+1)}
return Row(**schema)
row_rdd = my_rdd.map(lambda x: record_to_row(x))
# Now infer the schema and you have a DataFrame
schema_my_rdd = sqlContext.inferSchema(row_rdd)
# Now you have a DataFrame you can register as a table
schema_my_rdd.registerTempTable("my_table")
I haven't worked much with DataFrames in Spark but this should do the trick
In pyspark, let's say you have a dataframe named as userDF.
>>> type(userDF)
<class 'pyspark.sql.dataframe.DataFrame'>
Lets just convert it to RDD (
userRDD = userDF.rdd
>>> type(userRDD)
<class 'pyspark.rdd.RDD'>
and now you can do some manipulations and call for example map function :
newRDD = userRDD.map(lambda x:{"food":x['favorite_food'], "name":x['name']})
Finally, lets create a DataFrame from resilient distributed dataset (RDD).
newDF = sqlContext.createDataFrame(newRDD, ["food", "name"])
>>> type(ffDF)
<class 'pyspark.sql.dataframe.DataFrame'>
That's all.
I was hitting this warning message before when I tried to call :
newDF = sc.parallelize(newRDD, ["food","name"] :
.../spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/session.py:336: UserWarning: Using RDD of dict to inferSchema is deprecated. Use pyspark.sql.Row inst warnings.warn("Using RDD of dict to inferSchema is deprecated. "
So no need to do this anymore...

Resources