How to convert a column in H2OFrame to a python list? - apache-spark

I've read the PythonBooklet.pdf by H2O.ai and the python API documentation, but still can't find a clean way to do this. I know I can do either of the following:
Convert H2OFrame to Spark DataFrame and do a flatMap + collect or collect + list comprehension.
Use H2O's get_frame_data, which gives me a string of header and data separated by \n; then convert it a list (a numeric list in my case).
Is there a better way to do this? Thank you.

You can try something like this: bring an H2OFrame into python as a pandas dataframe by calling .as_data_frame(), then call .tolist() on the column of interest.
A self contained example w/ iris
import h2o
h2o.init()
df = h2o.import_file("iris_wheader.csv")
pd = df.as_data_frame()
pd['sepal_len'].tolist()

You can (1) convert the H2o frame to pandas dataframe and (2) convert pandas dataframe to list as follows:
pd=h2o.as_list(h2oFrame)
l=pd["column"].tolist()

Related

Concatenate Excel Files using Dask

I have 20 Excel files and need to concatenate them using Dask (I have already done it using pandas, but it will grow in the future). I have used the following solution found here: Reading multiple Excel files with Dask
But throws me an error: cannot concatenate object of type '<class 'list'>'; only Series and DataFrame objs are valid
What I am assuming is that it does not create a Dataframe, tried the following code:
df = pd.DataFrame()
files = glob.glob(r"D:\XX\XX\XX\XX\XXX\*.xlsx")
# note we are wrapping in delayed only the function, not the arguments
delayeds = [dask.delayed(pd.read_excel)(i, skiprows=0) for i in files]
# the line below launches actual computations
results = dask.compute(delayeds)
# after computation is over the results object will
# contain a list of pandas dataframes
df = pd.concat(results, ignore_index=True)
The original solution did not include df=pd.DataFrame(). Where is the mistake?
Thank you!
Using the following solution: Build a dask dataframe from a list of dask delayed objects
Realized that the last line was not using dask but pandas. Changed the data to a numpy array to pandas.
Here is the code:
files = glob.glob(r"D:\XX\XX\XX\XX\XXX\*.xlsx")
# note we are wrapping in delayed only the function, not the arguments
delayeds = [dask.delayed(pd.read_excel)(i, skiprows=0) for i in files]
# the line below launches actual computations
results = dask.compute(delayeds)
# after computation is over the results object will
# contain a list of pandas dataframes
dask_array = dd.from_delayed(delayeds) # here instead of pd.concat
dask_array.compute().to_csv(r"D:\XX\XX\XX\XX\XXX\*.csv") # Please be aware of the dtypes on your Excel.

How to convert str to list in Pyspark dataframe?

My input is
In pandas dataframe I used literal_eval to convert 'hashtags' column from str to list, but I don't know how to do that in Spark Dataframe. Any solution for my case?
Tks all supports for my case!
You can use:
df1.withColumn("list_hashtags", expr("from_json(hashtags, 'array<string>')"))
Therefore, list_hashtags[0] will give you #ucl.
Hope this is what you want!

How to convert a pyspark column(pyspark.sql.column.Column) to pyspark dataframe?

I have an use case to map the elements of a pyspark column based on a condition.
Going through this documentation pyspark column, i could not find a function for pyspark column to execute map function.
So tried to use the pyspark dataFrame map function, but not being able to convert the pyspark column to a dataframe
Note: The reason i am using the pyspark column is because i get that as an input from a library(Great expectations) which i use.
#column_condition_partial(engine=SparkDFExecutionEngine)
def _spark(cls, column, ts_formats, **kwargs):
return column.isin([3])
# need to replace the above logic with a map function
# like column.map(lambda x: __valid_date(x))
_spark function arguments are passed from the library
What i have,
A pyspark column with timestamp strings
What i require,
A Pyspark column with boolean(True/False) for each element based on validating the timestamp format
example for dataframe,
df.rdd.map(lambda x: __valid_date(x)).toDF()
__valid_date function returns True/False
So, i either need to convert the pyspark column into dataframe to use the above map function or is there any map function available for the pyspark column?
Looks like you need to return a column object that the framework will use for validation.
I have not used Great expectations, but maybe you can define an UDF for transforming your column. Something like this:
import pyspark.sql.functions as F
import pyspark.sql.types as T
valid_date_udf = udf(lambda x: __valid_date(x), T.BooleanType())
#column_condition_partial(engine=SparkDFExecutionEngine)
def _spark(cls, column, ts_formats, **kwargs):
return valid_date_udf(column)

Python Dask: Cannot convert non-finite values (NA or inf) to integer

I am trying to capture a very large structured table from a postregres table. It has approximately: 200,000,000 records. I am using dask instead of pandas, because it is faster. When I am loading the data into df it is significantly faster than pandas.
I am trying to convert dask DataFrame into Pandas dataframe using compute, it keeps giving me ValueError NA/inf.
I have passed dtype='object', but it is not working. Any way to fix it?
df = dd.read_sql_table('mytable1',
index_col='mytable_id', schema='books',
uri='postgresql://myusername:mypassword#my-address-here12345678.us-west-1.rds.amazonaws.com:12345/BigDatabaseName')
pandas_df = df.compute(dtype='object')
Gives error:
ValueError: Cannot convert non-finite values (NA or inf) to integer
I would guess that one of your columns has nulls but dask inferred it as an integer. Dask looks at a sample of the data to infer dtypes so may not pick up sporadic nulls. Before you call compute you can inspect the dtypes and convert the column type using astype to object for the column you think may be the issue.
Here is the code that works for unknown column types!
my_cols = ['a', 'b',...]
meta_dict = dict(zip(my_cols, [object]*len(my_cols)))
ddf = dd.read_sql_table(..., meta=meta_dict, ....)
df = ddf.compute()
df['a_int'] = df['a'].astype('int64', errors='ignore')

Pyspark Pair RDD from Text File

I have a local text file kv_pair.log formatted such as that key value pairs are comma delimited and records begin and terminate with a new line:
"A"="foo","B"="bar","C"="baz"
"A"="oof","B"="rab","C"="zab"
"A"="aaa","B"="bbb","C"="zzz"
I am trying to read this to a Pair RDD using pySpark as follows:
from pyspark import SparkContext
sc=sparkContext()
# Read raw text to RDD
lines=sc.textFile('kv_pair.log')
# How to turn this into a Pair RDD?
pairs=lines.map(lambda x: (x.replace('"', '').split(",")))
print type(pairs)
print pairs.take(2)
I feel I am close! The output of above is:
[[u'A=foo', u'B=bar', u'C=baz'], [u'A=oof', u'B=rab', u'C=zab']]
So it looks like pairs is a list of records, which contains a list of the kv pairs as strings.
How can I use pySpark to transform this into a Pair RDD such as that the keys and values are properly separated?
Ultimate goal is to transform this Pair RDD into a DataFrame to perform SQL operations - but one step at a time, please help transforming this into a Pair RDD.
You can use flatMap with a custom function as lambda can't be used for multiple statements
def tranfrm(x):
lst = x.replace('"', '').split(",")
return [(x.split("=")[0], x.split("=")[1]) for x in lst]
pairs = lines.map(tranfrm)
This is really bad practice for a parser, but I believe your example could be done with something like this:
from pyspark import SparkContext
from pyspark.sql import Row
sc=sparkContext()
# Read raw text to RDD
lines=sc.textFile('kv_pair.log')
# How to turn this into a Pair RDD?
pairs=lines.map(lambda x: (x.replace('"', '').split(",")))\
.map(lambda r: Row(A=r[0].split('=')[1], B=r[1].split('=')[1], C=r[2].split('=')[1] )
print type(pairs)
print pairs.take(2)

Resources