How to convert str to list in a PySpark DataFrame? - python-3.x

My input is a Spark DataFrame in which the 'hashtags' column is stored as a string rather than a list.
With a pandas DataFrame I used literal_eval to convert the 'hashtags' column from str to list, but I don't know how to do that with a Spark DataFrame. Is there a solution for my case?
Thanks for any help!
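For reference, the pandas step described above would look roughly like this (the sample row is hypothetical; it just has to be a string that encodes a list):
import ast
import pandas as pd

pdf = pd.DataFrame({"hashtags": ["['#ucl', '#football']"]})  # hypothetical input row
pdf["hashtags"] = pdf["hashtags"].apply(ast.literal_eval)    # str -> list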

You can use:
from pyspark.sql.functions import expr

df1 = df1.withColumn("list_hashtags", expr("from_json(hashtags, 'array<string>')"))
Therefore, list_hashtags[0] will give you #ucl.
Hope this is what you want!
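A minimal end-to-end sketch of that answer (the input row is made up; any string that encodes a JSON-style array will do):
from pyspark.sql.functions import expr

df1 = spark.createDataFrame([('["#ucl", "#football"]',)], ["hashtags"])  # hypothetical input
df1 = df1.withColumn("list_hashtags", expr("from_json(hashtags, 'array<string>')"))
df1.select(expr("list_hashtags[0]")).show()  # -> #ucl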

Related

spark convert datetime to timestamp

I have a column in a pyspark dataframe which is in the format 2021-10-28T22:19:03.0030059Z (string datatype). How can I convert this into a timestamp datatype in pyspark?
I'm using the code snippet below, but it returns nulls because it cannot parse the value. Can someone please advise on how to convert this?
df3.select(to_timestamp(df.DateTime, 'yyyy-MM-ddHH:mm:ss:SSS').alias('dt'),col('DateTime')).show()
You have to escape the literal T and Z by wrapping them in single quotes:
import pyspark.sql.functions as F
df = spark.createDataFrame([{"DateTime": "2021-10-28T22:19:03.0030059Z"}])
df.select(F.to_timestamp(df.DateTime, "yyyy-MM-dd'T'HH:mm:ss.SSSSSSS'Z'").alias('dt'), F.col('DateTime')).show(truncate=False)

Python Dask: Cannot convert non-finite values (NA or inf) to integer

I am trying to pull a very large structured table from a Postgres database. It has approximately 200,000,000 records. I am using dask instead of pandas because loading the data into a dataframe is significantly faster.
When I try to convert the dask DataFrame into a pandas DataFrame using compute, it keeps raising a ValueError about NA/inf values.
I have passed dtype='object', but it is not working. Is there any way to fix it?
import dask.dataframe as dd

df = dd.read_sql_table('mytable1',
                       index_col='mytable_id', schema='books',
                       uri='postgresql://myusername:mypassword@my-address-here12345678.us-west-1.rds.amazonaws.com:12345/BigDatabaseName')
pandas_df = df.compute(dtype='object')
Gives error:
ValueError: Cannot convert non-finite values (NA or inf) to integer
I would guess that one of your columns has nulls but dask inferred it as an integer. Dask looks at a sample of the data to infer dtypes, so it may not pick up sporadic nulls. Before you call compute, inspect the dtypes and use astype to cast the column you suspect is the problem to object.
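A minimal sketch of that check (ddf stands for the dask frame from the question; the column name 'a' is just a stand-in for whichever column is suspect):
print(ddf.dtypes)                     # look for int columns that might actually contain NA
ddf['a'] = ddf['a'].astype('object')  # cast the suspect column before materialising
pandas_df = ddf.compute()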
Here is the code that works for unknown column types!
my_cols = ['a', 'b', ...]                                 # columns whose types are unknown or mixed
meta_dict = dict(zip(my_cols, [object] * len(my_cols)))   # force them to object in the meta
ddf = dd.read_sql_table(..., meta=meta_dict, ...)
df = ddf.compute()
df['a_int'] = df['a'].astype('int64', errors='ignore')    # cast back once the data is in pandas

DataFrame object has no attribute 'col'

In Spark: The Definitive Guide it says:
If you need to refer to a specific DataFrame’s column, you can use the
col method on the specific DataFrame.
For example (in Python/Pyspark):
df.col("count")
However, when I run the latter code on a dataframe containing a column count I get the error 'DataFrame' object has no attribute 'col'. If I try column I get a similar error.
Is the book wrong, or how should I go about doing this?
I'm on Spark 2.3.1. The dataframe was created with the following:
df = spark.read.format("json").load("/Users/me/Documents/Books/Spark-The-Definitive-Guide/data/flight-data/json/2015-summary.json")
The book you're referring to describes the Scala / Java API. In PySpark, use bracket indexing instead:
df["count"]
The book covers both the Scala and PySpark APIs.
In the Scala / Java API, df.col("column_name") or df.apply("column_name") returns the Column.
In PySpark, use one of the following to get a column from a DataFrame:
df.colName
df["colName"]
Applicable to Python Only
Given a DataFrame such as
>>> df
DataFrame[DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string, count: bigint]
You can access any column with dot notation
>>> df.DEST_COUNTRY_NAME
Column<'DEST_COUNTRY_NAME'>
You can also use key based indexing to do the same
>>> df['DEST_COUNTRY_NAME']
Column<'DEST_COUNTRY_NAME'>
However, if your column name clashes with a method name on DataFrame,
the column will be shadowed when you use dot notation.
>>> df['count']
Column<'count'>
>>> df.count
<bound method DataFrame.count of DataFrame[DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string, count: bigint]>
from pyspark.sql.functions import col
... and then continue as usual. In PySpark, col can be used like this:
df.select(col("count")).show()

pyspark split stringtype from integertypes for explorative analysis

I would like to separate the string-typed columns of my df from the integer-typed ones in PySpark and run some descriptive analyses on the integer columns. I wrote the loop below; is there a more efficient way?
import pyspark.sql.functions as F

for item, dtype in df.dtypes:
    if dtype == 'string':
        print("this column is a string")
    else:
        df.agg(F.min(df[item])).show()
        max_df = df.agg(F.max(df[item]))
        max_df.show()
To analyze a dataframe's structure you could use:
df.describe().show()
df.printSchema()
df.toPandas().info()
df.cube("col1", "col2").count().show()  # one or more grouping columns
And last but not least, DataFrameStatFunctions (available as df.stat).
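A minimal sketch of the dtype-based split the question asks about (assuming the goal is summary statistics for every non-string column):
import pyspark.sql.functions as F

numeric_cols = [name for name, dtype in df.dtypes if dtype != 'string']

# min/max of all numeric columns in a single pass over the data
aggs = [F.min(c).alias('min_' + c) for c in numeric_cols] + \
       [F.max(c).alias('max_' + c) for c in numeric_cols]
df.agg(*aggs).show()

# or let describe() do the work
df.select(numeric_cols).describe().show()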

How to convert a column in H2OFrame to a python list?

I've read the PythonBooklet.pdf by H2O.ai and the python API documentation, but still can't find a clean way to do this. I know I can do either of the following:
Convert H2OFrame to Spark DataFrame and do a flatMap + collect or collect + list comprehension.
Use H2O's get_frame_data, which gives me a string of header and data separated by \n; then convert it to a list (a numeric list in my case).
Is there a better way to do this? Thank you.
You can try something like this: bring an H2OFrame into python as a pandas dataframe by calling .as_data_frame(), then call .tolist() on the column of interest.
A self-contained example with the iris dataset:
import h2o
h2o.init()
df = h2o.import_file("iris_wheader.csv")
pd = df.as_data_frame()
pd['sepal_len'].tolist()
You can (1) convert the H2O frame to a pandas dataframe and (2) convert the pandas dataframe to a list as follows:
pd = h2o.as_list(h2oFrame)
l = pd["column"].tolist()
