DataFrame object has no attribute 'col' - apache-spark

In Spark: The Definitive Guide it says:
If you need to refer to a specific DataFrame’s column, you can use the
col method on the specific DataFrame.
For example (in Python/Pyspark):
df.col("count")
However, when I run the latter code on a dataframe containing a column count I get the error 'DataFrame' object has no attribute 'col'. If I try column I get a similar error.
Is the book wrong, or how should I go about doing this?
I'm on Spark 2.3.1. The dataframe was created with the following:
df = spark.read.format("json").load("/Users/me/Documents/Books/Spark-The-Definitive-Guide/data/flight-data/json/2015-summary.json")

The book you're referring to describes Scala / Java API. In PySpark use []
df["count"]

The book combines the Scala and PySpark API's.
In Scala / Java API, df.col("column_name") or df.apply("column_name") return the Column.
Whereas in pyspark use the below to get the column from DF.
df.colName
df["colName"]

Applicable to Python Only
Given a DataFrame such as
>>> df
DataFrame[DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string, count: bigint]
You can access any column with dot notation
>>> df.DEST_COUNTRY_NAME
Column<'DEST_COUNTRY_NAME'>
You can also use key based indexing to do the same
>>> df['DEST_COUNTRY_NAME']
Column<'DEST_COUNTRY_NAME'>
However, in case your column name and a method name on DataFrame clashes,
your column name will be shadowed when using dot notation.
>>> df['count']
Column<'count'>
>>> df.count
<bound method DataFrame.count of DataFrame[DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string, count: bigint]>

from pyspark.sql.functions import col
... then continue

In PySpark col can be used in this way:
df.select(col("count")).show()

Related

How to convert str to list in Pyspark dataframe?

My input is
In pandas dataframe I used literal_eval to convert 'hashtags' column from str to list, but I don't know how to do that in Spark Dataframe. Any solution for my case?
Tks all supports for my case!
You can use:
df1.withColumn("list_hashtags", expr("from_json(hashtags, 'array<string>')"))
Therefore, list_hashtags[0] will give you #ucl.
Hope this is what you want!

How to convert a pyspark column(pyspark.sql.column.Column) to pyspark dataframe?

I have an use case to map the elements of a pyspark column based on a condition.
Going through this documentation pyspark column, i could not find a function for pyspark column to execute map function.
So tried to use the pyspark dataFrame map function, but not being able to convert the pyspark column to a dataframe
Note: The reason i am using the pyspark column is because i get that as an input from a library(Great expectations) which i use.
#column_condition_partial(engine=SparkDFExecutionEngine)
def _spark(cls, column, ts_formats, **kwargs):
return column.isin([3])
# need to replace the above logic with a map function
# like column.map(lambda x: __valid_date(x))
_spark function arguments are passed from the library
What i have,
A pyspark column with timestamp strings
What i require,
A Pyspark column with boolean(True/False) for each element based on validating the timestamp format
example for dataframe,
df.rdd.map(lambda x: __valid_date(x)).toDF()
__valid_date function returns True/False
So, i either need to convert the pyspark column into dataframe to use the above map function or is there any map function available for the pyspark column?
Looks like you need to return a column object that the framework will use for validation.
I have not used Great expectations, but maybe you can define an UDF for transforming your column. Something like this:
import pyspark.sql.functions as F
import pyspark.sql.types as T
valid_date_udf = udf(lambda x: __valid_date(x), T.BooleanType())
#column_condition_partial(engine=SparkDFExecutionEngine)
def _spark(cls, column, ts_formats, **kwargs):
return valid_date_udf(column)

How to remove special characters from dataframe using udf function

I am a learner in spark sql. Could anyone please help with below scenario?
package name: sparksql,class name:custommethod, method name:removespecialchar
create custom method in scala which takes 1 string as argument and 1 return on type string
Method has to remove all special characters numbers 0 to 9 - ? , / _ ( ) [ ] from dataframe one column using replaceall function.
input: windows-X64 (os system)
output : windows x os system
I have a dataframe called df1 with 6 columns inside another class called sparksql2
3.Import the package, instantiate the custommethod method inside sparksql2 class and register the method generated in above step as a udf for invoking spark sql dataframe.
Call the above udf in the DSL by passing single columnname as an argument to get the special characters removed from dataframe and save the result as json into hdfs location
You don't need UDFs for that you can just use plain spark and define it in a function with regexp_replace.
take this example:
import org.apache.spark.sql.{SparkSession,DataFrame}
import org.apache.spark.sql.functions.regexp_replace
def removeFromColumn(spark: SparkSession, columnName: String, df: DataFrame) =
df.select(regexp_replace(
df(columnName),
"[0-9]|\\[|\\]|\\-|\\?|\\(|\\)|\\,|_|/",
""
).as(columnName))
with this you can use it on a DataFrame without going into the trouble of registering a UDF:
import spark.implicits._
val df = Seq("2res012-?,/_()[]ult").toDF("columnName")
removeFromColumn(spark, "columnName", df)
Output:
+----------+
|columnName|
+----------+
| result|
+----------+

Python Spark na.fill does not work

I'm working with spark 1.6 and Python.
I merged 2 dataframe:
df = df_1.join(df_2, df_1.id == df_2.id, 'left').drop(df_2.id)
I get new data frame with correct value and "Null" when the key don't match.
I would like to replace all "Null" values in my dataframe.
I used this function but it does not replace null value:
new_df = df.na.fill(0.0)
Does someone know why it does not work?
Many thanks for your answer.

Creating a Spark DataFrame from an RDD of lists

I have an rdd (we can call it myrdd) where each record in the rdd is of the form:
[('column 1',value), ('column 2',value), ('column 3',value), ... , ('column 100',value)]
I would like to convert this into a DataFrame in pyspark - what is the easiest way to do this?
How about use the toDF method? You only need add the field names.
df = rdd.toDF(['column', 'value'])
The answer by #dapangmao got me to this solution:
my_df = my_rdd.map(lambda l: Row(**dict(l))).toDF()
Take a look at the DataFrame documentation to make this example work for you, but this should work. I'm assuming your RDD is called my_rdd
from pyspark.sql import SQLContext, Row
sqlContext = SQLContext(sc)
# You have a ton of columns and each one should be an argument to Row
# Use a dictionary comprehension to make this easier
def record_to_row(record):
schema = {'column{i:d}'.format(i = col_idx):record[col_idx] for col_idx in range(1,100+1)}
return Row(**schema)
row_rdd = my_rdd.map(lambda x: record_to_row(x))
# Now infer the schema and you have a DataFrame
schema_my_rdd = sqlContext.inferSchema(row_rdd)
# Now you have a DataFrame you can register as a table
schema_my_rdd.registerTempTable("my_table")
I haven't worked much with DataFrames in Spark but this should do the trick
In pyspark, let's say you have a dataframe named as userDF.
>>> type(userDF)
<class 'pyspark.sql.dataframe.DataFrame'>
Lets just convert it to RDD (
userRDD = userDF.rdd
>>> type(userRDD)
<class 'pyspark.rdd.RDD'>
and now you can do some manipulations and call for example map function :
newRDD = userRDD.map(lambda x:{"food":x['favorite_food'], "name":x['name']})
Finally, lets create a DataFrame from resilient distributed dataset (RDD).
newDF = sqlContext.createDataFrame(newRDD, ["food", "name"])
>>> type(ffDF)
<class 'pyspark.sql.dataframe.DataFrame'>
That's all.
I was hitting this warning message before when I tried to call :
newDF = sc.parallelize(newRDD, ["food","name"] :
.../spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/session.py:336: UserWarning: Using RDD of dict to inferSchema is deprecated. Use pyspark.sql.Row inst warnings.warn("Using RDD of dict to inferSchema is deprecated. "
So no need to do this anymore...

Resources