Python Dask: Cannot convert non-finite values (NA or inf) to integer - python-3.x

I am trying to read a very large structured table from a Postgres database. It has approximately 200,000,000 records. I am using dask instead of pandas because it is faster: loading the data into a dataframe is significantly quicker than with pandas.
When I try to convert the dask DataFrame into a pandas DataFrame using compute, it keeps raising a ValueError about NA/inf.
I have passed dtype='object', but it is not working. Is there any way to fix it?
df = dd.read_sql_table('mytable1',
                       index_col='mytable_id', schema='books',
                       uri='postgresql://myusername:mypassword@my-address-here12345678.us-west-1.rds.amazonaws.com:12345/BigDatabaseName')
pandas_df = df.compute(dtype='object')
Gives error:
ValueError: Cannot convert non-finite values (NA or inf) to integer

My guess is that one of your columns contains nulls but dask inferred it as an integer. Dask infers dtypes from a sample of the data, so it may not pick up sporadic nulls. Before you call compute, inspect the dtypes and use astype to convert the suspect column to object.
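For example, a minimal sketch of that workflow (the column name some_int_col and the connection details are placeholders, not taken from the question):
import dask.dataframe as dd

ddf = dd.read_sql_table('mytable1', index_col='mytable_id', schema='books',
                        uri='postgresql://user:password@host:5432/BigDatabaseName')

print(ddf.dtypes)                             # see what dask inferred from its sample
ddf = ddf.astype({'some_int_col': 'object'})  # hypothetical column with sporadic nulls
pandas_df = ddf.compute()                     # note: compute() itself takes no dtype argument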

Here is code that works when the column types are unknown:
my_cols = ['a', 'b', ...]
meta_dict = dict(zip(my_cols, [object] * len(my_cols)))
ddf = dd.read_sql_table(..., meta=meta_dict, ...)
df = ddf.compute()
df['a_int'] = df['a'].astype('int64', errors='ignore')
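As a side note (not part of the original answer): astype('int64', errors='ignore') will silently leave the column as object if any nulls remain. If your pandas version supports nullable integers, casting to the 'Int64' extension dtype keeps both the integer values and the missing entries:
# 'Int64' (capital I) is pandas' nullable integer dtype and tolerates missing values;
# going through float first makes the conversion from an object column robust
df['a_int'] = df['a'].astype('float64').astype('Int64')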

Related

Pyspark conversion toPandas() problem, ValueError: ordinal must be >= 1

Hello everyone!
I am reading data from a Data Lake (which holds database tables) using PySpark. After applying some filters I put the result in a Spark DataFrame, but when I convert it to a pandas DataFrame using toPandas() I get this error in Jupyter: ValueError: ordinal must be >= 1.
all_columns = list(spark_df.columns)
df = spark_df.select(all_columns)
new_df = df.toPandas()
ValueError: ordinal must be >= 1
Does anyone have an idea how to fix this bug, please?
Thank you in advance!
I tried sparkDataFrame.toPandas()
I expected to get a pandas DataFrame
Check out this StackOverflow question. Double-check whether there are strange date values in your PySpark DataFrame before converting to pandas; pandas timestamps only cover a limited range, so compare your data against the MIN and MAX dates pandas supports (pd.Timestamp.min and pd.Timestamp.max).
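A rough sketch of that check, assuming the offending column is a date or timestamp column named my_date_col (a hypothetical name):
import pandas as pd
from pyspark.sql import functions as F

# pandas timestamps only cover roughly 1677-09-21 to 2262-04-11
lo = F.lit(str(pd.Timestamp.min.date())).cast('date')
hi = F.lit(str(pd.Timestamp.max.date())).cast('date')

# inspect the extreme values in the Spark DataFrame
spark_df.agg(F.min('my_date_col'), F.max('my_date_col')).show()

# rows that toPandas() cannot represent
spark_df.filter((F.col('my_date_col') < lo) | (F.col('my_date_col') > hi)).show()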

Concatenate Excel Files using Dask

I have 20 Excel files and need to concatenate them using Dask (I have already done it with pandas, but the data will grow in the future). I used the following solution found here: Reading multiple Excel files with Dask
But it throws an error: cannot concatenate object of type '<class 'list'>'; only Series and DataFrame objs are valid
I am assuming that it does not create a DataFrame. I tried the following code:
df = pd.DataFrame()
files = glob.glob(r"D:\XX\XX\XX\XX\XXX\*.xlsx")
# note we are wrapping in delayed only the function, not the arguments
delayeds = [dask.delayed(pd.read_excel)(i, skiprows=0) for i in files]
# the line below launches actual computations
results = dask.compute(delayeds)
# after computation is over the results object will
# contain a list of pandas dataframes
df = pd.concat(results, ignore_index=True)
The original solution did not include df=pd.DataFrame(). Where is the mistake?
Thank you!
Solved it using the following solution: Build a dask dataframe from a list of dask delayed objects
I realized that the last line was using pandas rather than dask, so I replaced the pd.concat call with dd.from_delayed and wrote the result out through dask.
Here is the code:
import glob
import dask
import dask.dataframe as dd
import pandas as pd

files = glob.glob(r"D:\XX\XX\XX\XX\XXX\*.xlsx")
# note we are wrapping in delayed only the function, not the arguments
delayeds = [dask.delayed(pd.read_excel)(i, skiprows=0) for i in files]
# build a dask dataframe from the delayed objects (here instead of pd.concat);
# no separate dask.compute() call is needed for this
ddf = dd.from_delayed(delayeds)
# compute and write out -- please be aware of the dtypes in your Excel files
ddf.compute().to_csv(r"D:\XX\XX\XX\XX\XXX\*.csv")
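For what it is worth, the original pandas version also works with one small change: dask.compute(delayeds) returns a tuple whose first element is the list of dataframes, so pd.concat needs that first element rather than the tuple itself:
results = dask.compute(delayeds)               # returns a tuple: (list_of_dataframes,)
df = pd.concat(results[0], ignore_index=True)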

A quick way to get the mean of each position in large RDD

I have a large RDD (more than 1,000,000 rows), where each row is a tuple of four elements A, B, C, D. The first few rows of the RDD look like
[(492,3440,4215,794),
(6507,6163,2196,1332),
(7561,124,8558,3975),
(423,1190,2619,9823)]
Now I want to find the mean of each position in this RDD. For example, for the data above I need an output list with the values:
(492+6507+7561+423)/4
(3440+6163+124+1190)/4
(4215+2196+8558+2619)/4
(794+1332+3975+9823)/4
which is:
[(3745.75,2729.25,4397.0,3981.0)]
Since the RDD is very large, it is not convenient to calculate the sum of each position and then divide by the length of the RDD. Is there any quick way for me to get the output? Thank you very much.
I don't think there is anything faster than calculating the mean (or sum) for each column.
If you are using the DataFrame API you can simply aggregate multiple columns:
import os
from pyspark.sql import functions as f
from pyspark.sql import SparkSession

# start local spark session
spark = SparkSession.builder.getOrCreate()

# helper to build a local file URI
def localpath(path):
    return 'file://' + os.path.join(os.path.abspath(os.path.curdir), path)

# load as rdd (the text lines still need to be parsed into numeric tuples
# before createDataFrame can infer a schema)
rdd = spark.sparkContext.textFile(localpath('myPosts/'))

# create data frame from rdd
df = spark.createDataFrame(rdd)

# average every column in a single aggregation
means_df = df.agg(*[f.avg(c) for c in df.columns])
means_dict = means_df.first().asDict()
print(means_dict)
Note that the dictionary keys will be the default Spark column names ('_1', '_2', ...). If you want more descriptive column names you can pass them as the schema argument to the createDataFrame call.
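If you prefer to stay at the RDD level, a minimal sketch (assuming the RDD already holds numeric 4-tuples, as in the question) is a single reduce that accumulates per-position sums together with a row count:
# each element becomes (values_tuple, 1); the reduce adds tuples position-wise and counts rows
sums, count = rdd.map(lambda t: (t, 1)).reduce(
    lambda a, b: (tuple(x + y for x, y in zip(a[0], b[0])), a[1] + b[1]))
means = [s / count for s in sums]
print(means)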

DataFrame constructor not properly called when trying to create a dataframe from two datasets

Greetings data scientists.
I am currently trying to create a DataFrame by extracting one column from a dataset and combining it with my predicted data. The predicted data is an ndarray, and I can't just concatenate the two; I have to create a DataFrame. My code for creating this DataFrame is:
df = pd.DataFrame(data = (test_set['SK_ID_CURR_x'],pred_prob), columns = ['SK_ID_CURR','TARGET'])
I am currently getting this error and need help, please.
ValueError: DataFrame constructor not properly called!
If the length of pred_prob is the same as test_set, use DataFrame.assign:
df = test_set[['SK_ID_CURR_x']].assign(TARGET = pred_prob)
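If you also want the exact column names from the question ('SK_ID_CURR' and 'TARGET'), a small variation renames the extracted column in the same chain:
df = (test_set[['SK_ID_CURR_x']]
      .assign(TARGET=pred_prob)
      .rename(columns={'SK_ID_CURR_x': 'SK_ID_CURR'}))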

How to convert a column in H2OFrame to a python list?

I've read the PythonBooklet.pdf by H2O.ai and the python API documentation, but still can't find a clean way to do this. I know I can do either of the following:
Convert H2OFrame to Spark DataFrame and do a flatMap + collect or collect + list comprehension.
Use H2O's get_frame_data, which gives me a string of the header and data separated by \n; then convert it to a list (a numeric list in my case).
Is there a better way to do this? Thank you.
You can try something like this: bring an H2OFrame into python as a pandas dataframe by calling .as_data_frame(), then call .tolist() on the column of interest.
A self-contained example with iris:
import h2o
h2o.init()
df = h2o.import_file("iris_wheader.csv")
iris_pdf = df.as_data_frame()   # returns a pandas DataFrame (not the pandas module)
iris_pdf['sepal_len'].tolist()
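If the frame is large, it may be cheaper to convert only the column you need rather than the whole frame; a sketch using the same iris column:
# select the single column as an H2OFrame first, then convert just that column
sepal_len_list = df['sepal_len'].as_data_frame()['sepal_len'].tolist()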
You can (1) convert the H2O frame to a pandas dataframe and (2) convert the pandas dataframe to a list as follows:
pdf = h2o.as_list(h2oFrame)   # returns a pandas DataFrame by default
l = pdf["column"].tolist()
