Pyspark create dataframe from pyspark.sql.column.Column - apache-spark

I have data whose type is pyspark.sql.column.Column.
I would like to convert it into a DataFrame.
When I run createDataFrame(column), I get this error:
TypeError: Column is not iterable
How can I solve it?
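A Column is a lazily evaluated expression tied to a DataFrame, not a standalone container of values, which is why createDataFrame cannot iterate over it. A minimal sketch of the usual workaround, assuming the column was derived from an existing DataFrame (the df and expression below are illustrative, not from the question):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# hypothetical source DataFrame; the derived expression has type pyspark.sql.column.Column
df = spark.createDataFrame([(1,), (2,), (3,)], ["id"])
col_expr = F.col("id") * 2

# select the Column expression from its source DataFrame instead of passing it to createDataFrame
new_df = df.select(col_expr.alias("doubled"))
new_df.show()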

Related

Unexpected error when using selectExpr in databricks

I already have a PySpark DataFrame. I'm passing a variable to selectExpr in Databricks to create a new DataFrame with the column names that I need.
When I pass the column names with aliases directly into selectExpr, there is no error and the new DataFrame is created successfully (the screenshot showed the columns with their aliases).
But when I try to pass exactly the same columns to selectExpr through a variable, I get an error.
Where did I go wrong? Am I missing something?
Assuming sql_pattern is a single string containing all the column expressions with aliases, you can split it into a list of strings using split:
pivotDF = clean_df.selectExpr(sql_pattern.split(","))
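For example, a minimal sketch (the pattern string and column names below are hypothetical stand-ins for the variable in the question):

# hypothetical pattern string; the column names and aliases are illustrative
sql_pattern = "first_name as fname, last_name as lname, amount as total"

# selectExpr accepts a list of SQL expression strings
pivotDF = clean_df.selectExpr(sql_pattern.split(","))
pivotDF.printSchema()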

In Python, how do I apply astype(str) to the entire dataframe

So I got my API data pulled and sorted. I'm looping through my URLs like a breeze, with all the data going to a .csv file and then to MySQL using df.to_sql. I was on a roll, then things changed.
I got this error during the push to the db:
TypeError: sequence item 0: expected str instance, dict found
After researching/googling, I found that the data type needs to be adjusted.
I want to set the datatype of the entire DataFrame to str, but I can't figure out how to apply it to the code below that sends everything to my db.
df.to_sql("contractList",engine, if_exists="replace", dtype=None, index=False)
How do I apply this df.astype(str) to the line of code above?
Thank you in advance.
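A minimal sketch, reusing the table name and engine from the question: astype(str) returns a new DataFrame, so it can be chained directly into to_sql.

# astype(str) returns a copy of the DataFrame with every column converted to string,
# so it can be chained straight into to_sql
df.astype(str).to_sql("contractList", engine, if_exists="replace", dtype=None, index=False)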

Pyspark most reliable way to verify column type

If I read data from a CSV, all the columns will be of "String" type by default. Generally I inspect the data using the following functions, which give an overview of the data and its types:
df.dtypes
df.show()
df.printSchema()
df.distinct().count()
df.describe().show()
But if there is a column that I believe is of a particular type, e.g. Double, I cannot be sure that all the values are doubles if I don't have business knowledge, because:
1. I cannot see all the values (there are millions of unique values).
2. If I explicitly cast it to double, Spark quietly converts the type without throwing any exception, and the values which are not doubles are converted to null - for example:
from pyspark.sql.types import DoubleType
changedTypedf = df_original.withColumn('label', df_original['id'].cast(DoubleType()))
What could be the best way to confirm the type of a column, then?
In Scala, DataFrame has a schema field; I'd guess Python has the same:
df.schema.fields.find(_.name == "label").get.dataType
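If the goal is to confirm that the values themselves are valid doubles (rather than just reading the declared schema), one sketch, assuming a hypothetical column name "label", is to cast the column and count how many non-null values turn into null:

from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

# count values that are present but become null after the cast;
# anything counted here is not a valid double ("label" is a hypothetical column name)
bad_count = df.filter(
    F.col("label").isNotNull() & F.col("label").cast(DoubleType()).isNull()
).count()
print("values that are not valid doubles:", bad_count)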

Is there a way to get the column data type in pyspark?

It has been discussed that the way to find a column's datatype in PySpark is using df.dtypes (Get datatype of column using pyspark). The problem with this is that for datatypes like an array or struct you get something like array<string> or array<integer>.
Question: Is there a native way to get the PySpark data type? Like ArrayType(StringType,true)?
Just use the schema:
df.schema[column_name].dataType
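For example, a minimal sketch with a hypothetical array column (assumes an active SparkSession named spark):

from pyspark.sql.types import ArrayType

# hypothetical DataFrame with an array<string> column named "tags"
df = spark.createDataFrame([(["a", "b"],)], ["tags"])

dt = df.schema["tags"].dataType
print(dt)                          # e.g. ArrayType(StringType,true); the exact repr varies by Spark version
print(isinstance(dt, ArrayType))   # True
print(dt.elementType)              # StringType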

Unable to read timestamp value in pyspark from Hive (spark 1.6.1)

I am trying to read a Hive table that has a date column with datatype timestamp, length=9.
My code looks something like the following:
df = hc.sql("select * from schema.table")
It can read all the other columns (datatype = varchar).
It reads either null or None in the date column.
I have printed df.dtypes and found that the DataFrame schema was inferred correctly and the date column has the timestamp datatype.
Surprisingly, the same code works in a notebook and only fails in the spark-shell environment.
Can someone tell me what could be wrong, or what limitation causes this error, and how I could rectify it?
I have seen this problem in Spark, where the value is displayed as null when the datatype is timestamp; it's a bug. There is a way to get around it: read that date column as a string, using something like to_char(column_name, 'YYYY-MM-DD HH-MM-SS') as column_name, and then cast it to timestamp. If you can tell me the source type and the tool you used to pull the data, such as Sqoop, or whether you are getting the data in some form of files, I can help you better.
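A minimal PySpark 1.6 sketch of that read-as-string-then-cast workaround, assuming a hypothetical column name date_col and the usual 'yyyy-MM-dd HH:mm:ss' format:

from pyspark.sql import functions as F

# read the timestamp column as a string on the Hive side ("date_col" is a hypothetical column name)
df = hc.sql("select *, cast(date_col as string) as date_col_str from schema.table")

# convert the string back into a proper timestamp in Spark
df = df.withColumn(
    "date_col_fixed",
    F.unix_timestamp("date_col_str", "yyyy-MM-dd HH:mm:ss").cast("timestamp")
)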
