DataFrame constructor not properly called when trying to create a dataframe from two datasets - python-3.x

Greetings data scientists.
I am currently trying to create a DataFrame by extracting one column from a dataset and combining it with my predicted data. My predicted data is an ndarray, so I can't just concatenate the two; I have to create a DataFrame. My code for creating this DataFrame is:
df = pd.DataFrame(data = (test_set['SK_ID_CURR_x'],pred_prob), columns = ['SK_ID_CURR','TARGET'])
I am currently getting this error and need help, please.
ValueError: DataFrame constructor not properly called!

If the length of pred_prob is the same as the length of test_set, use DataFrame.assign:
df = test_set[['SK_ID_CURR_x']].assign(TARGET = pred_prob)
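If you prefer the DataFrame constructor itself, a minimal sketch (assuming pred_prob is a one-dimensional array aligned row by row with test_set) is to pass the columns as a dict:
import pandas as pd

# map each output column name to its values explicitly
df = pd.DataFrame({'SK_ID_CURR': test_set['SK_ID_CURR_x'].to_numpy(),
                   'TARGET': pred_prob})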

Related

Concatenate Excel Files using Dask

I have 20 Excel files and need to concatenate them using Dask (I have already done it using pandas, but the data will grow in the future). I have used the following solution found here: Reading multiple Excel files with Dask
But it throws an error: cannot concatenate object of type '<class 'list'>'; only Series and DataFrame objs are valid
My assumption is that it does not create a DataFrame. I tried the following code:
df = pd.DataFrame()
files = glob.glob(r"D:\XX\XX\XX\XX\XXX\*.xlsx")
# note we are wrapping in delayed only the function, not the arguments
delayeds = [dask.delayed(pd.read_excel)(i, skiprows=0) for i in files]
# the line below launches actual computations
results = dask.compute(delayeds)
# after computation is over the results object will
# contain a list of pandas dataframes
df = pd.concat(results, ignore_index=True)
The original solution did not include df=pd.DataFrame(). Where is the mistake?
Thank you!
Using the following solution: Build a dask dataframe from a list of dask delayed objects
I realized that the last line was not using Dask but pandas, so I changed it to build the result with Dask instead of pandas.
Here is the code:
import glob

import dask
import dask.dataframe as dd
import pandas as pd

files = glob.glob(r"D:\XX\XX\XX\XX\XXX\*.xlsx")
# note we are wrapping in delayed only the function, not the arguments
delayeds = [dask.delayed(pd.read_excel)(i, skiprows=0) for i in files]
# build a Dask DataFrame from the delayed objects, here instead of pd.concat;
# a separate dask.compute(delayeds) call is no longer needed
dask_array = dd.from_delayed(delayeds)
# compute() materialises a single pandas DataFrame before writing it out
dask_array.compute().to_csv(r"D:\XX\XX\XX\XX\XXX\*.csv")  # Please be aware of the dtypes in your Excel files.
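For reference, the original error comes from the fact that dask.compute(delayeds) returns a one-element tuple wrapping the list of pandas DataFrames, so pd.concat received a list rather than DataFrames. A minimal sketch of the pure-pandas variant, assuming the same delayeds list as above:
import dask
import pandas as pd

# dask.compute wraps its results in a tuple, so unpack it before concatenating
(frames,) = dask.compute(delayeds)
df = pd.concat(frames, ignore_index=True)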

Python Dask: Cannot convert non-finite values (NA or inf) to integer

I am trying to capture a very large structured table from a Postgres database. It has approximately 200,000,000 records. I am using Dask instead of pandas because it is faster; loading the data into the DataFrame is significantly faster than with pandas.
When I try to convert the Dask DataFrame into a pandas DataFrame using compute, it keeps giving me a ValueError about NA/inf.
I have passed dtype='object', but it is not working. Is there any way to fix it?
df = dd.read_sql_table('mytable1',
                       index_col='mytable_id', schema='books',
                       uri='postgresql://myusername:mypassword@my-address-here12345678.us-west-1.rds.amazonaws.com:12345/BigDatabaseName')
pandas_df = df.compute(dtype='object')
Gives error:
ValueError: Cannot convert non-finite values (NA or inf) to integer
I would guess that one of your columns has nulls but Dask inferred it as an integer. Dask looks at a sample of the data to infer dtypes, so it may not pick up sporadic nulls. Before you call compute, you can inspect the dtypes and use astype to convert the column you think may be the issue to object.
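A minimal sketch of that approach, where 'amount' stands in for whichever column you suspect (the name is hypothetical):
# inspect the dtypes Dask inferred from its sample of the table
print(df.dtypes)

# lazily cast the suspect column to object before materialising the frame
df = df.astype({'amount': 'object'})
pandas_df = df.compute()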
Here is the code that works for unknown column types!
my_cols = ['a', 'b',...]
meta_dict = dict(zip(my_cols, [object]*len(my_cols)))
ddf = dd.read_sql_table(..., meta=meta_dict, ....)
df = ddf.compute()
df['a_int'] = df['a'].astype('int64', errors='ignore')

Databricks Create a list of dataFrames with their size

I'm working on Databricks and I want to have a list of all my dataframes with their number of observations.
Is it possible to have the size (number of rows) for each dataframe in the DataLake?
I found how to list all dataframes:
display(dbutils.fs.ls("dbfs:/mnt/adls/fraud/qal/landing"))
I know how to count it.
Is it possible to have a list of my dataframes and the size?
Thank you,
You can create a DataFrame from the file listing and the row counts. The following code assumes all your tables are in Parquet format. If that's not the case, you need to change the reading code.
def namesAndRowCounts(root: String) =
  spark.createDataFrame(
    dbutils.fs.ls(root).map { info =>
      (info.name, spark.read.load(info.path).count)
    }
  ).toDF("name", "rows").orderBy('name)
display(namesAndRowCounts("/mnt/adls/fraud/qal/landing"))
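If you work in PySpark rather than Scala, an equivalent sketch (again assuming every folder under the root holds a Parquet table; the function name is illustrative):
def names_and_row_counts(root):
    # list the folder and count the rows of each table (Parquet is spark.read.load's default)
    rows = [(info.name, spark.read.load(info.path).count())
            for info in dbutils.fs.ls(root)]
    return spark.createDataFrame(rows, ["name", "rows"]).orderBy("name")

display(names_and_row_counts("/mnt/adls/fraud/qal/landing"))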

Appending data to an empty dataframe

I am creating an empty dataframe and later trying to append another dataframe to it. In fact, I want to append many dataframes to the initially empty dataframe dynamically, depending on the number of RDDs coming in.
The union() function works fine if I assign the result to a third dataframe.
val df3=df1.union(df2)
But I want to keep appending to the initial (empty) dataframe I created, because I want to store all the RDDs in one dataframe. The code below, however, does not show the right counts. It seems that it simply did not append:
df1.union(df2)
df1.count() // this shows 0 although df2 has some data, which is shown if I assign the union to a third dataframe.
If I do the below, I get a reassignment error since df1 is a val. And if I change it to var, I get a Kafka multithreading-not-safe error:
df1 = df1.union(df2)
Any idea how to add all the dynamically created dataframes to one initially created data frame?
Not sure if this is what you are looking for!
# Import pyspark functions
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
# Define your schema
field = [StructField("Col1",StringType(), True), StructField("Col2", IntegerType(), True)]
schema = StructType(field)
# Your empty data frame
df = spark.createDataFrame(sc.emptyRDD(), schema)
l = []
for i in range(5):
    # Build and append to the list dynamically
    l = l + [[str(i), i]]
    # Create a temporary data frame similar to your original schema
    temp_df = spark.createDataFrame(l, schema)
    # Do the union with the original data frame
    df = df.union(temp_df)
df.show()
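A slightly more compact variant of the same idea (just a sketch, assuming all frames share the schema defined above) builds the per-iteration frames first and folds union over them with functools.reduce:
from functools import reduce

# create one small frame per iteration, then union them all in one pass
frames = [spark.createDataFrame([[str(i), i]], schema) for i in range(5)]
df = reduce(lambda left, right: left.union(right), frames)
df.show()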
DataFrames and other distributed data structures are immutable, therefore methods which operate on them always return a new object. There is no appending, no modification in place, and no ALTER TABLE equivalent.
And if I change it to var, I get a Kafka multithreading-not-safe error.
Without the actual code it is impossible to give you a definitive answer, but it is unlikely to be related to the union code.
There are a number of known Spark bugs caused by incorrect internal implementations (SPARK-19185 and SPARK-23623, to name just a few).

how to convert Dataset Row type to Dataset String type

I am using Spark 2.2 with Java 8. I have a dataset of Row type (Dataset<Row>) and I want to use this dataset in an ML model, so I want to convert Dataset<Row> into Dataset<String>. When I pass the Dataset<Row> into the model, it shows the error below.
Type mismatch: cannot convert from Dataset<Row> to Dataset<String>
I found the solution below for Scala, but I want to do this in Java.
df.map(row => row.mkString())
val strings = df.map(row => row.mkString()).collect
First convert the Row dataset into a list and then convert that list into a String dataset. Try this:
Dataset<Row> df = spark.read()...
List<String> list = df.as(Encoders.STRING()).collectAsList();
Dataset<String> df1 = spark.createDataset(list, Encoders.STRING());
If you are planning to read the dataset line by line, then you can use the iterator over the dataset:
Dataset<Row> csv = session.read().format("csv").option("sep", ",").option("inferSchema", true)
        .option("escape", "\"").option("header", true).option("multiline", true).load("users/abc/....");
for (Iterator<Row> iter = csv.toLocalIterator(); iter.hasNext();) {
    String item = iter.next().toString();
    System.out.println(item);
}
