Apache Spark with SQLContext: IndexError

I am trying to run a basic example from the "Inferring the Schema Using Reflection" section of the Apache Spark documentation.
I'm doing this on the Cloudera QuickStart VM (CDH5).
The example I'm trying to run is below:
# sc is an existing SparkContext.
from pyspark.sql import SQLContext, Row
sqlContext = SQLContext(sc)
# Load a text file and convert each line to a Row.
lines = sc.textFile("/user/cloudera/analytics/book6_sample.csv")
parts = lines.map(lambda l: l.split(","))
people = parts.map(lambda p: Row(name=p[0], age=int(p[1])))
# Infer the schema, and register the DataFrame as a table.
schemaPeople = sqlContext.createDataFrame(people)
schemaPeople.registerTempTable("people")
# SQL can be run over DataFrames that have been registered as a table.
teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
# The results of SQL queries are RDDs and support all the normal RDD operations.
teenNames = teenagers.map(lambda p: "Name: " + p.name)
for teenName in teenNames.collect():
    print(teenName)
I ran the code exactly as shown above, but I always get the error "IndexError: list index out of range" when I execute the last command (the for loop).
The input file book6_sample is available at
Please suggest pointers on where I'm going wrong.
Thanks in advance.

Your file has one empty line at the end, which is causing this error. Open the file in a text editor and remove that line; it should then work.
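A minimal sketch of the failure and of a defensive alternative to hand-editing the file, written in plain Python so it runs without Spark. The sample lines and the helper name parse_person are made up for illustration; they are not from the question.

```python
# Hypothetical sample data: two valid records followed by an empty string,
# standing in for the trailing blank line the answer describes.
lines = ["Alice,15", "Bob,42", ""]


def parse_person(line):
    """Split one CSV line into (name, age), like the map steps in the question."""
    parts = line.split(",")
    # For an empty line, parts is [''], so parts[1] does not exist.
    return parts[0], int(parts[1])


# Reproduce the error on the empty line.
try:
    parse_person(lines[-1])
    error = None
except IndexError as exc:
    error = exc

# Guard: skip lines that are empty or whitespace-only before parsing.
people = [parse_person(line) for line in lines if line.strip()]
print(repr(error))
print(people)
```

On an RDD the same guard would be a filter placed ahead of the split, for example lines.filter(lambda l: l.strip()).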


Pyspark dataframe: load from csv and then remove the first line

I am able to load a CSV file from Azure Data Lake into a PySpark DataFrame.
How do I remove the first line and use the second line as my header?
I have seen some RDD-based solutions, but I cannot load the file that way: the following code fails with the error "RDD is empty".
items = sc.textFile(f"abfss://{container}@{storage_account_name}.dfs.core.windows.net/tmp/items.csv")
Hence I prefer to load it with the standard Spark reader as below. I can display the DataFrame contents. I need to drop the first line and use the second row as the header. Thanks.
items = spark.read.format("csv").load(f"abfss://{container}@{storage_account_name}.dfs.core.windows.net/tmp/items.csv", header=True)
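Try this. It's not an optimized solution, but it will meet the requirement.
from pyspark.sql import functions as F
df = spark.createDataFrame([(1,2,3),(4,5,6),(7,8,9)],['a','b','c'])
# zipWithIndex pairs each row ('_1') with its position ('_2'); keep rows whose index is > 0
df1 = df.rdd.zipWithIndex().toDF().where(F.col('_2') > 0).drop('_2')
# Copy each field of the '_1' struct back out into a top-level column
for each_col in df.columns:
    df1 = df1.withColumn(each_col, F.col('_1.'+each_col))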
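The index-and-filter idea can be tried on plain Python lists, which makes it easy to check without a cluster. This is only a sketch: the file contents are invented, and enumerate stands in for what zipWithIndex does on an RDD.

```python
import csv
import io

# Hypothetical file contents: an unwanted first line, then the real header, then data.
raw = "exported by some tool\nid,name,price\n1,apple,0.5\n2,pear,0.75\n"

rows = list(csv.reader(io.StringIO(raw)))

# Pair each row with its index and drop the row at index 0.
kept = [row for index, row in enumerate(rows) if index > 0]

# The first remaining row is the header; everything after it is data.
header, data = kept[0], kept[1:]
records = [dict(zip(header, row)) for row in data]
print(header)
print(records)
```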

How to avoid excessive dataframe query

Consider a Spark job that has multiple DataFrame transformations:
val baseDF1 = spark.sql(s"select * from db.table1 where condition1='blah'")
val baseDF2 = spark.sql(s"select * from db.table2 where condition2='blah'")
val df3 = baseDF1.join(baseDF2, baseDF1("col1") <=> baseDF2("col2"))
// the column expression passed to withColumn is omitted in the original
val df4 = df3.withColumn("col3").withColumnRenamed("col4", "newcol4")
val df5 = df4.groupBy("groupbycol").agg(expr("coalesce(first(col5, false))"))
val df6 = df5.withColumn("level1", col("coalesce(first(col5, false))")(0))
  .withColumn("level2", col("coalesce(first(col5, false))")(1))
  .withColumn("level3", col("coalesce(first(col5, false))")(2))
  .withColumn("level4", col("coalesce(first(col5, false))")(3))
  .withColumn("level5", col("coalesce(first(col5, false))")(4))
  .drop("coalesce(first(col5, false))")
I am wondering how Spark builds the Spark SQL logic. Does it generate a query-like transaction for each DataFrame, i.e.
df1 = select * ....
df2 = select * ....
df3 = df1.join.df2 // Spark takes the content of df1/df2 instead of running each query again for the join
df6 = ...
or does it generate one large query at the last DataFrame?
df6 = select coalesce(first(col5, false)).. from ((select * from table1) join (select * from table2 ) on blah ) group by blah 2...
What I am trying to figure out is how to stop Spark from generating one huge query-like plan, and instead let Spark "commit" somewhere along the way to avoid a huge, long transaction.
The reason for asking is that the current Spark job threw the following exception:
19/12/17 10:57:55 ERROR CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 567, Column 28: Redefinition of parameter "agg_expr_21"
Spark has two kinds of operations: transformations and actions.
A transformation happens when a DataFrame is built up using operations such as select, join or filter. It is ready to be executed but has not done any work yet; it is lazy. Transformations can be composed into new transformations, which is what you do when operating on predefined DataFrames, as in baseDF1.join(baseDF2, baseDF1("col1") <=> baseDF2("col2")). But again, nothing has run.
An action happens when certain operations are called, such as save, collect or show. This is when the real work happens: each transformation defined earlier is either executed or retrieved from cache. You can save Spark a lot of work by caching some of the complex steps. This can also simplify the plan.
val baseDF1 = spark.sql(s"select * from db.table1 where condition1='blah'").cache()
val baseDF2 = spark.sql(s"select * from db.table2 where condition2='blah'").cache()
val df3 = baseDF1.join(baseDF2, baseDF1("col1") <=> baseDF2("col2"))
val df4 = baseDF1.join(baseDF2, baseDF1("col2") === baseDF2("col3")) // different join
When df4 is executed after df3, it won't select from db.table1 and db.table2 again but will read baseDF1 and baseDF2 from cache. The plan will look simpler too.
If for some reason the cache is gone, Spark will recompute baseDF1 and baseDF2 as they were defined: it knows their lineage even though it has not executed it.
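As a rough analogy in plain Python (not Spark's actual implementation), the toy class below separates lazy transformations from actions and shows how caching a shared base avoids repeating the expensive read. The names LazyFrame and read_table are invented for this sketch.

```python
reads = []  # records every time the "table" is actually scanned


def read_table():
    """Stand-in for an expensive scan of db.table1."""
    reads.append("table1")
    return list(range(10))


class LazyFrame:
    """Toy stand-in for a DataFrame: stores how to build its rows, builds them on demand."""

    def __init__(self, compute):
        self._compute = compute   # the lineage: a function that produces the rows
        self._use_cache = False
        self._cached = None

    def cache(self):
        # Only marks the frame as cacheable; nothing is computed here.
        self._use_cache = True
        return self

    def filter(self, predicate):
        # Transformation: returns a new lazy object and does no work.
        return LazyFrame(lambda: [r for r in self._rows() if predicate(r)])

    def _rows(self):
        if self._use_cache and self._cached is not None:
            return self._cached
        rows = self._compute()
        if self._use_cache:
            self._cached = rows
        return rows

    def count(self):
        # Action: forces evaluation of the whole lineage.
        return len(self._rows())


base = LazyFrame(read_table).cache()
evens = base.filter(lambda r: r % 2 == 0)
large = base.filter(lambda r: r > 6)

reads_before_action = list(reads)   # transformations alone trigger no read
counts = (evens.count(), large.count())
print(reads_before_action, counts, reads)
```

Dropping the .cache() call makes each count() scan the table again, which mirrors the recomputation from lineage described above.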
You can also use checkpoint to break up the lineage of the overall execution and so simplify the plan. I think this can help in your case.
I have also saved an intermediate DataFrame to a temporary file, read it back as a DataFrame and used that down the line. This breaks up the complexity at the cost of extra I/O. I wouldn't recommend it unless the other methods don't work.
I am not sure about the error you are getting.

Facing Issue while Loading Data in through PySpark and performing joins

My problem is as follows:
I have a large dataframe called customer_data_pk containing 230M rows and the other one containing 200M rows named customer_data_pk_isb.
Both have a column callID on which I would like to do a left join, the left dataframe being customer_data_pk.
What is the best possible way to achieve the join operation?
What have I tried?
The simple join i.e.
customer_data_pk.join(customer_data_pk_isb, customer_data_pk.btn ==
customer_data_pk_isb.btn, 'left')
either runs out of memory or just times out with the error: Removing executor driver with no recent heartbeats: 468990 ms exceeds timeout 120000 ms.
After all this, the join still doesn't work. I am still learning PySpark, so I might have misunderstood the fundamentals. If someone could shed light on this, it would be great.
I have tried this as well, but it didn't work and the code gets stuck:
Furthermore, on the configuration side I am using: --conf spark.sql.shuffle.partitions=5000
My complete code is below:
from pyspark import SparkContext
from pyspark import SQLContext
import time
import pyspark
sc = SparkContext("local", "Example")
sqlContext = SQLContext(sc)
customer_data_pk = sqlContext.read.format('jdbc').options(
customer_data_pk_isb = sqlContext.read.format('jdbc').options(
print('###########################', customer_data_pk.join(customer_data_pk_isb, customer_data_pk.btn == customer_data_pk_isb.btn, 'left').count(),

Pyspark sql count returns different number of rows than pure sql

I've started using pyspark in one of my projects. I was testing different commands to explore functionalities of the library and I found something that I don't understand.
Take this code:
from pyspark import SparkContext
from pyspark.sql import HiveContext
from pyspark.sql.dataframe import DataFrame
sc = SparkContext(sc)
hc = HiveContext(sc)
hc.sql("use test_schema")
The last count() operation returns 53941 records. If I instead run select count(*) from diamonds in Hive, I get 53940.
Is the PySpark count including the header?
I looked at the collected rows:
df = hc.sql("select * from diamonds").collect()
to see whether the header was included:
df[0] --> Row(carat=None, cut='cut', color='color', clarity='clarity', depth=None, table=None, price=None, x=None, y=None, z=None)
df[1] --> Row(carat=0.23, cut='Ideal', color='E', clarity='SI2', depth=61.5, table=55, price=326, x=3.95, y=3.98, z=2.43)
The 0th element doesn't look like the header.
Does anyone have an explanation for this?
Hive can give incorrect counts when stale statistics are used to speed up calculations. To see if this is the problem, in Hive try:
SET hive.compute.query.using.stats=false;
SELECT COUNT(*) FROM diamonds;
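Alternatively, refresh the statistics. If your table is not partitioned:
ANALYZE TABLE diamonds COMPUTE STATISTICS;
SELECT COUNT(*) FROM diamonds;
If it is partitioned, run ANALYZE TABLE with a PARTITION clause naming the table's partition columns before counting again.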
Also take another look at your first row (df[0] in your question). It does look like an improperly formatted header row.
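One way to read that first row: if the header line is being treated as data, the string columns keep the header text while the numeric columns fail to convert and come back as NULL (None in PySpark). The plain-Python sketch below imitates that effect. The three-column excerpt and the to_float helper are assumptions made for illustration, not the real table definition.

```python
import csv
import io

# Hypothetical two-line excerpt of the backing file: header plus one record.
raw = "carat,cut,price\n0.23,Ideal,326\n"


def to_float(text):
    """Return a float, or None when the text is not numeric (like a failed cast giving NULL)."""
    try:
        return float(text)
    except ValueError:
        return None


# Column types as a table definition might declare them (assumed for this sketch).
casts = {"carat": to_float, "cut": str, "price": to_float}
names = list(casts)

# Treat every line as data, including the header line.
rows = [
    {name: casts[name](value) for name, value in zip(names, line)}
    for line in csv.reader(io.StringIO(raw))
]
print(rows[0])
print(len(rows))
```

The first parsed row has the same shape as df[0] in the question, and the row count is one higher than the number of real records.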

Get CSV to Spark dataframe

I'm using python on Spark and would like to get a csv into a dataframe.
The documentation for Spark SQL strangely does not provide explanations for CSV as a source.
I have found Spark-CSV, however I have issues with two parts of the documentation:
"This package can be added to Spark using the --jars command line option. For example, to include it when starting the spark shell: $ bin/spark-shell --packages com.databricks:spark-csv_2.10:1.0.3"
Do I really need to add this argument every time I launch pyspark or spark-submit? It seems very inelegant. Isn't there a way to import it in Python rather than redownloading it each time?
df = sqlContext.load(source="com.databricks.spark.csv", header="true", path = "cars.csv")
Even if I do the above, this won't work. What does the "source" argument stand for in this line of code? How do I simply load a local file on Linux, say "/Spark_Hadoop/spark-1.3.1-bin-cdh4/cars.csv"?
With more recent versions of Spark this has become a lot easier. The expression sqlContext.read (available as of, I believe, 1.4) gives you a DataFrameReader instance, which from Spark 2.0 has a built-in .csv() method:
df = sqlContext.read.csv("/path/to/your.csv")
Note that you can also indicate that the csv file has a header by adding the keyword argument header=True to the .csv() call. A handful of other options are available, and described in the link above.
from pyspark.sql.types import StringType
from pyspark import SQLContext
sqlContext = SQLContext(sc)
# Split each line on commas; the parentheses let the call chain span two lines
Employee_rdd = (sc.textFile(r"\..\Employee.csv")
                .map(lambda line: line.split(",")))
Employee_df = Employee_rdd.toDF(['Employee_ID','Employee_name'])
For PySpark, assuming that the first row of the CSV file contains a header:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('chosenName').getOrCreate()
df = spark.read.csv('fileNameWithPath', mode="DROPMALFORMED", inferSchema=True, header=True)
1. Read the CSV file into an RDD and then generate a Row RDD from the original RDD.
2. Create the schema, represented by a StructType, matching the structure of the Rows in the RDD created in Step 1.
3. Apply the schema to the RDD of Rows via the createDataFrame method provided by SQLContext.
from pyspark.sql.types import StructField, StructType, StringType
lines = sc.textFile("examples/src/main/resources/people.txt")
parts = lines.map(lambda l: l.split(","))
# Each line is converted to a tuple.
people = parts.map(lambda p: (p[0], p[1].strip()))
# The schema is encoded in a string.
schemaString = "name age"
fields = [StructField(field_name, StringType(), True) for field_name in schemaString.split()]
schema = StructType(fields)
# Apply the schema to the RDD.
schemaPeople = spark.createDataFrame(people, schema)
If you do not mind the extra package dependency, you could use Pandas to parse the CSV file. It handles internal commas just fine.
from pyspark import SparkContext
from pyspark.sql import SQLContext
import pandas as pd
Read the whole file at once into a Spark DataFrame:
sc = SparkContext('local','example') # if using locally
sql_sc = SQLContext(sc)
pandas_df = pd.read_csv('file.csv') # assuming the file contains a header
# If no header:
# pandas_df = pd.read_csv('file.csv', names = ['column 1','column 2'])
s_df = sql_sc.createDataFrame(pandas_df)
Or, even more data-consciously, you can chunk the data into a Spark RDD then DF:
chunk_100k = pd.read_csv('file.csv', chunksize=100000)
for chunky in chunk_100k:
    Spark_temp_rdd = sc.parallelize(chunky.values.tolist())
    try:
        # Append this chunk to the RDD built so far
        Spark_full_rdd += Spark_temp_rdd
    except NameError:
        # First chunk: Spark_full_rdd does not exist yet
        Spark_full_rdd = Spark_temp_rdd
    del Spark_temp_rdd
Spark_DF = Spark_full_rdd.toDF(['column 1','column 2'])
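To see why internal commas matter, the sketch below contrasts a naive split(",") with a real CSV parser. It uses Python's standard csv module rather than pandas so that it has no extra dependency; the sample line is made up.

```python
import csv
import io

# Hypothetical line whose second field is quoted and contains a comma.
line = '1,"Smith, John",42'

# Naive splitting cuts the quoted field in two and keeps the quote characters.
naive = line.split(",")

# A CSV parser honours the quotes and returns three fields.
parsed = next(csv.reader(io.StringIO(line)))

print(naive)
print(parsed)
```

The RDD-based answers in this thread that call line.split(",") would hit the same problem on such a line.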
Following Spark 2.0, it is recommended to use a Spark Session:
from pyspark.sql import SparkSession
from pyspark.sql import Row
# Create a SparkSession
spark = SparkSession \
    .builder \
    .appName("basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()
def mapper(line):
    # Convert one comma-separated line into a typed Row
    fields = line.split(',')
    return Row(ID=int(fields[0]), field1=str(fields[1].encode("utf-8")), field2=int(fields[2]), field3=int(fields[3]))
lines = spark.sparkContext.textFile("file.csv")
df = lines.map(mapper)
# Infer the schema, and register the DataFrame as a table.
schemaDf = spark.createDataFrame(df).cache()
I ran into a similar problem. The solution is to add an environment variable named "PYSPARK_SUBMIT_ARGS" and set its value to "--packages com.databricks:spark-csv_2.10:1.4.0 pyspark-shell". This works with Spark's Python interactive shell.
Make sure you match the version of spark-csv with the version of Scala installed. With Scala 2.11, it is spark-csv_2.11 and with Scala 2.10 or 2.10.5 it is spark-csv_2.10.
Hope it works.
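If you would rather set the variable from Python than in the shell profile, a sketch along these lines should work, provided it runs before a SparkContext is created. The helper spark_csv_submit_args is hypothetical and only assembles the string quoted above from the Scala and package versions.

```python
import os


def spark_csv_submit_args(scala_version, package_version):
    """Build the PYSPARK_SUBMIT_ARGS value for spark-csv (hypothetical helper)."""
    coordinate = "com.databricks:spark-csv_{}:{}".format(scala_version, package_version)
    return "--packages {} pyspark-shell".format(coordinate)


# Set this before pyspark launches its JVM, i.e. before creating a SparkContext.
os.environ["PYSPARK_SUBMIT_ARGS"] = spark_csv_submit_args("2.10", "1.4.0")
print(os.environ["PYSPARK_SUBMIT_ARGS"])
```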
Based on the answer by Aravind, but much shorter, e.g. :
lines = sc.textFile("/path/to/file").map(lambda x: x.split(","))
df = lines.toDF(["year", "month", "day", "count"])
With the current implementation (Spark 2.x) you don't need to add the packages argument; you can use the built-in CSV implementation.
Additionally, unlike the accepted answer, you don't need to create an RDD and then enforce a schema, which has one potential problem:
when you read the CSV that way, every field is marked as a string, and when you enforce a schema with an integer column you will get an exception.
A better way to do the above would be:
spark.read.format("csv").schema(schema).option("header", "true").load(input_path).show()
