PySpark toPandas() conversion problem, ValueError: ordinal must be >= 1 - python-3.x

Hello everyone!
I am reading data from a Data Lake (that holds database tables) using PySpark. After applying some filters I put the results in a Spark DataFrame, but when I convert it to a pandas DataFrame using toPandas(), I get this error in Jupyter: ValueError: ordinal must be >= 1.
all_columns = list(spark_df.columns)
df = spark_df.select(all_columns)
new_df = df.toPandas()
ValueError: ordinal must be >= 1
Does anyone have an idea how to fix this, please?
Thank you in advance!
I tried sparkDataFrame.toPandas()
I expected to get a pandas DataFrame

Check out this StackOverflow question. Double-check whether there are strange date values in your PySpark DataFrame before converting to pandas: pandas can only represent timestamps between pandas.Timestamp.min (1677-09-21) and pandas.Timestamp.max (2262-04-11), and dates outside that window (a common culprit is a placeholder like 0001-01-01) can trigger exactly this error in toPandas().
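A minimal sketch of that check, assuming the spark_df from the question and a hypothetical timestamp column named my_date_col:
import pyspark.sql.functions as F
# Approximate bounds of the pandas-supported range
# (see pandas.Timestamp.min and pandas.Timestamp.max)
lo, hi = '1677-09-22', '2262-04-11'
# Any rows surviving this filter will break toPandas()
out_of_range = spark_df.filter(
    (F.col('my_date_col') < F.lit(lo)) | (F.col('my_date_col') > F.lit(hi))
)
out_of_range.show()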

Related

spark convert datetime to timestamp

I have a column in a pyspark dataframe which is in the format 2021-10-28T22:19:03.0030059Z (string datatype). How can I convert this into a timestamp datatype in pyspark?
I'm using the code snippet below, but it returns nulls because it can't parse the string. Can someone please recommend how to convert this?
df3.select(to_timestamp(df.DateTime, 'yyyy-MM-ddHH:mm:ss:SSS').alias('dt'), col('DateTime')).show()
You have to escape the literal T and Z by wrapping them in single quotes; otherwise Spark tries to read them as format pattern letters and the parse returns null:
import pyspark.sql.functions as F
df = spark.createDataFrame([{"DateTime": "2021-10-28T22:19:03.0030059Z"}])
df.select(F.to_timestamp(df.DateTime, "yyyy-MM-dd'T'HH:mm:ss.SSSSSSS'Z'").alias('dt'), F.col('DateTime')).show(truncate=False)
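As a quick sanity check (a sketch, reusing the df defined above), the parsed column should come back non-null once the literals are quoted:
df.select(F.to_timestamp(df.DateTime, "yyyy-MM-dd'T'HH:mm:ss.SSSSSSS'Z'").isNotNull().alias('parsed_ok')).show()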

Python Dask: Cannot convert non-finite values (NA or inf) to integer

I am trying to read a very large structured table from a Postgres database; it has approximately 200,000,000 records. I am using Dask instead of pandas because loading the data into the dataframe is significantly faster.
I am trying to convert the Dask DataFrame into a pandas DataFrame using compute(), but it keeps giving me a ValueError about NA/inf.
I have passed dtype='object', but it is not working. Is there any way to fix it?
df = dd.read_sql_table('mytable1',
                       index_col='mytable_id', schema='books',
                       uri='postgresql://myusername:mypassword@my-address-here12345678.us-west-1.rds.amazonaws.com:12345/BigDatabaseName')
pandas_df = df.compute(dtype='object')
Gives error:
ValueError: Cannot convert non-finite values (NA or inf) to integer
I would guess that one of your columns has nulls but Dask inferred it as an integer. Dask looks at a sample of the data to infer dtypes, so it may not pick up sporadic nulls. Before you call compute(), you can inspect the dtypes and use astype to convert any suspect column to object.
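A minimal sketch of that inspection, assuming the ddf returned by read_sql_table and a hypothetical column named flaky_col:
print(ddf.dtypes)  # see what Dask inferred from its sample
# An int64 column that actually contains NAs will fail on compute();
# widening it to object (or pandas' nullable 'Int64') avoids the clash
ddf['flaky_col'] = ddf['flaky_col'].astype('object')
df = ddf.compute()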
Here is the code that works for unknown column types (the ... ellipses are arguments elided in the original):
import dask.dataframe as dd

my_cols = ['a', 'b', ...]
meta_dict = dict(zip(my_cols, [object] * len(my_cols)))  # force every column to object
ddf = dd.read_sql_table(..., meta=meta_dict, ...)
df = ddf.compute()
# errors='ignore' leaves the column unchanged if the cast fails
df['a_int'] = df['a'].astype('int64', errors='ignore')

How to convert a column in H2OFrame to a python list?

I've read the PythonBooklet.pdf by H2O.ai and the Python API documentation, but still can't find a clean way to do this. I know I can do either of the following:
Convert the H2OFrame to a Spark DataFrame and do a flatMap + collect, or a collect + list comprehension.
Use H2O's get_frame_data, which gives me a string of header and data separated by \n, then convert it to a list (a numeric list in my case).
Is there a better way to do this? Thank you.
You can try something like this: bring the H2OFrame into Python as a pandas DataFrame by calling .as_data_frame(), then call .tolist() on the column of interest.
A self-contained example with iris:
import h2o
h2o.init()
df = h2o.import_file("iris_wheader.csv")
pandas_df = df.as_data_frame()   # H2OFrame -> pandas DataFrame
pandas_df['sepal_len'].tolist()  # column -> python list
You can (1) convert the H2OFrame to a pandas DataFrame and (2) convert the pandas DataFrame to a list, as follows:
pd = h2o.as_list(h2oFrame)
l = pd["column"].tolist()
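A variant that is not in the original answers, offered as an assumption about the same API: slice the column on the H2O side first, so only that column is pulled into Python (reusing the iris df from above):
sepal_len = df['sepal_len'].as_data_frame()['sepal_len'].tolist()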

Python Spark na.fill does not work

I'm working with Spark 1.6 and Python.
I merged 2 dataframes:
df = df_1.join(df_2, df_1.id == df_2.id, 'left').drop(df_2.id)
I get a new dataframe with the correct values, and null where the keys don't match.
I would like to replace all the null values in my dataframe.
I used this function, but it does not replace the null values:
new_df = df.na.fill(0.0)
Does someone know why it does not work?
Many thanks for your answer.
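No answer was recorded for this one, but a likely explanation, offered here as an assumption rather than from the original thread: na.fill matches columns by the type of the fill value, so fill(0.0) only replaces nulls in numeric (double/float) columns, and nulls that a left join leaves in string columns stay untouched. A minimal sketch with hypothetical column names:
# fill(0.0) touches numeric columns only; fill string columns separately
new_df = df.na.fill(0.0).na.fill('unknown')
# or pass a per-column dict ('score' and 'label' are hypothetical names)
new_df = df.na.fill({'score': 0.0, 'label': 'unknown'})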

Timestamp parsing in pyspark

df1:
Timestamp:
1995-08-01T00:00:01.000+0000
Is there a way to extract the day of the month from the timestamp column of the data frame using pyspark? I am not able to provide code; I am new to Spark and do not have a clue how to proceed.
You can parse this timestamp using unix_timestamp:
from pyspark.sql import functions as F
format = "yyyy-MM-dd'T'HH:mm:ss.SSSZ"
df2 = df1.withColumn('Timestamp2', F.unix_timestamp('Timestamp', format).cast('timestamp'))
Then, you can use dayofmonth in the new Timestamp column:
df2.select(F.dayofmonth('Timestamp2'))
More details about these functions can be found in the pyspark functions documentation.
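As an aside that was not part of the original answer: on Spark 2.2+ the same parse can be written with to_timestamp, which returns a timestamp directly and skips the unix_timestamp/cast step:
df2 = df1.withColumn('Timestamp2', F.to_timestamp('Timestamp', "yyyy-MM-dd'T'HH:mm:ss.SSSZ"))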
Code:
from pyspark.sql.functions import dayofmonth
df1.select(dayofmonth('Timestamp').alias('day'))
