databricks.koalas has no attribute 'qcut' for decile - databricks

I am using Koalas in Databricks and am trying to decile the data, so I used:
df['Decile'] = ks.qcut(df['Id'], q=10, labels=False)
but I am getting AttributeError: module 'databricks.koalas' has no attribute 'qcut'.
Is there a workaround?

Koalas does not support qcut. You can, however, do it natively in PySpark; see "What are alternative methods for pandas quantile and cut in pyspark 1.6".
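As a hedged workaround, you can drop to PySpark for this one step; a minimal sketch, assuming the Koalas DataFrame can be converted with to_spark() and that ranking by Id is the desired ordering (ntile(10) produces equal-frequency buckets, roughly what qcut(q=10) does):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

sdf = df.to_spark()                              # Koalas -> Spark DataFrame
w = Window.orderBy(F.col('Id'))                  # note: an unpartitioned window pulls all rows into one partition
sdf = sdf.withColumn('Decile', F.ntile(10).over(w) - 1)  # values 0-9, analogous to labels=False
# kdf = sdf.to_koalas()                          # convert back to Koalas if needed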

Related

Changing the magic tags in same cell - Azure Databricks

I am working on Azure Databricks, have fetched a Spark DataFrame, and need to convert it to an R data.frame. I get a syntax error when I use as.data.frame in the same cell.
When I try it in a different cell, after the %r magic tag, the same command throws a different error saying the object is not found.
You can register the Spark DataFrame as a TempView using createOrReplaceTempView
Registering a Temp View
sparkDF.createOrReplaceTempView('TempView')
Once you have done this, the temp view will be accessible throughout your notebook.
Then, using %r, you can create a DataFrame from it:
SparkR
%r
library(SparkR)
sparkR <- sql('select * from TempView')
R DataFrame
%r
library(SparkR)
sparkR <- collect(sql('select * from TempView'))

Loading Data from Azure Synapse Database into a DataFrame with Notebook

I am attempting to load data from Azure Synapse DW into a dataframe as shown in the image.
However, I'm getting the following error:
AttributeError: 'DataFrameReader' object has no attribute 'sqlanalytics'
Any thoughts on what I'm doing wrong?
That particular method has been renamed to synapsesql (as per the notes here) and, as I understand it, is currently Scala-only. The correct syntax would therefore be:
%%spark
val df = spark.read.synapsesql("yourDb.yourSchema.yourTable")
It is possible to share the Scala dataframe with Python via the createOrReplaceTempView method, but I'm not sure how efficient that is. Mixing and matching is described here. So for your example you could mix and match Scala and Python like this:
Cell 1
%%spark
// Get table from dedicated SQL pool and assign it to a dataframe with Scala
val df = spark.read.synapsesql("yourDb.yourSchema.yourTable")
// Save the dataframe as a temp view so it's accessible from PySpark
df.createOrReplaceTempView("someTable")
Cell 2
%%pyspark
## Scala dataframe is now accessible from PySpark
df = spark.sql("select * from someTable")
## !!TODO do some work in PySpark
## ...
The example linked above also shows how to write the dataframe back to the dedicated SQL pool, if required.
This is a good article on importing/exporting data with Synapse notebooks, and the limitation is described in its Constraints section:
https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/synapse-spark-sql-pool-import-export#constraints

Are Apache Spark 2.0 parquet files incompatible with Apache Arrow?

The problem
I have written an Apache Spark DataFrame as a parquet file for a deep learning application in a Python environment; I am currently experiencing issues reading that file in basic examples of both the petastorm (following this notebook) and horovod frameworks. The DataFrame has the following type: DataFrame[features: array<float>, next: int, weight: int] (much like in Databricks' notebook, I had features as a VectorUDT, which I converted to an array).
In both cases, Apache Arrow throws an ArrowIOError: Invalid parquet file. Corrupt footer. error.
What I found until now
I discovered in this question and in this PR that, as of version 2.0, Spark doesn't write _metadata or _common_metadata files unless spark.hadoop.parquet.enable.summary-metadata is set to true in Spark's configuration; those files are indeed missing.
I thus tried rewriting my DataFrame with that setting enabled, but there is still no _common_metadata file. What also works is to explicitly pass a schema to petastorm when constructing a reader (passing schema_fields to make_batch_reader, for instance), which is a problem with horovod, as there is no such parameter in horovod.spark.keras.KerasEstimator's constructor.
How could I, if it is at all possible, either make Spark output those files, or make Arrow infer the schema, just as Spark seems to do?
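For reference, a minimal sketch of the petastorm workaround mentioned above, assuming the parquet data sits at a local path (the 'file://' prefix and the field names are taken from this question):
from petastorm import make_batch_reader

# Explicitly declare the fields to read instead of relying on schema inference
with make_batch_reader('file://' + path, schema_fields=['features', 'next', 'weight']) as reader:
    for batch in reader:
        pass  # each batch holds numpy arrays for the requested columns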
Minimal example with horovod
# Saving df
print(spark.conf.get('spark.hadoop.parquet.enable.summary-metadata'))  # outputs 'true'
df.repartition(10).write.mode('overwrite').parquet(path)
# ...
# Training
import horovod.spark.keras as hvd
from horovod.spark.common.store import Store

model = build_model()
opti = Adadelta(learning_rate=0.015)
loss = 'sparse_categorical_crossentropy'

store = Store().create(prefix_path=prefix_path,
                       train_path=train_path,
                       val_path=val_path)
keras_estimator = hvd.KerasEstimator(
    num_proc=16,
    store=store,
    model=model,
    optimizer=opti,
    loss=loss,
    feature_cols=['features'],
    label_cols=['next'],
    batch_size=auto_steps_per_epoch,
    epochs=auto_nb_epochs,
    sample_weight_col='weight'
)
keras_model = keras_estimator.fit_on_parquet()  # Fails here with ArrowIOError
The problem is solved in pyarrow 0.14+ (issues.apache.org/jira/browse/ARROW-4723); be sure to install the updated version with pip (up until Databricks Runtime 6.5, the included version is 0.13).
Thanks to @joris' comment for pointing this out.
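For example, in a notebook cell (a hedged sketch; the %pip magic assumes a recent Databricks Runtime, while older runtimes would need a cluster library or init script instead):
%pip install --upgrade pyarrow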

Pyspark UDF AttributeError: 'NoneType' object has no attribute '_jvm'

I have a udf function:
@staticmethod
@F.udf("array<int>")
def create_users_array(val):
    """Takes a column of ints, returns a column of arrays containing ints."""
    return [val for _ in range(val)]
I call it like so:
df.withColumn("myArray", create_users_array(df["myNumber"]))
I pass it a dataframe column of integers, and it returns an array of that integer.
E.g.
4 --> [4,4,4,4]
It was working until we upgraded from Python 2.7 and upgraded our EMR version (which I believe uses PySpark 2.3).
Anyone know what is causing this?
Looks like this had something to do with the improvements made to UDFs in the newer version (or rather, the deprecation of the old syntax). Changing the udf decorator worked for me: @F.udf("array<int>") --> @F.udf(ArrayType(IntegerType()))
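For reference, a hedged sketch of the corrected UDF as a standalone function (outside the class), with the imports it needs; the names are taken from the question:
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, IntegerType

@F.udf(ArrayType(IntegerType()))
def create_users_array(val):
    """Takes a column of ints, returns a column of arrays containing ints."""
    return [val for _ in range(val)]

df = df.withColumn("myArray", create_users_array(df["myNumber"]))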

Existing column can't be found by DataFrame#filter in PySpark

I am using PySpark to perform SparkSQL on my Hive tables.
records = sqlContext.sql("SELECT * FROM my_table")
which retrieves the contents of the table.
When I use the filter argument as a string, it works okay:
records.filter("field_i = 3")
However, when I try to use the filter method, as documented here
records.filter(records.field_i == 3)
I am encountering this error
py4j.protocol.Py4JJavaError: An error occurred while calling o19.filter.
: org.apache.spark.sql.AnalysisException: resolved attributes field_i missing from field_1,field_2,...,field_i,...field_n
even though the field_i column clearly exists in the DataFrame object.
I prefer to use the second way because I need to use Python functions to perform record and field manipulations.
I am using Spark 1.3.0 in Cloudera Quickstart CDH-5.4.0 and Python 2.6.
From Spark DataFrame documentation
In Python it’s possible to access a DataFrame’s columns either by attribute (df.age) or by indexing (df['age']). While the former is convenient for interactive data exploration, users are highly encouraged to use the latter form, which is future proof and won’t break with column names that are also attributes on the DataFrame class.
It seems that the name of your field may be a reserved word; try:
records.filter(records['field_i'] == 3)
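An equivalent form uses pyspark.sql.functions.col, which likewise avoids attribute access (a hedged sketch, not from the original answer):
from pyspark.sql import functions as F
records.filter(F.col('field_i') == 3)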
What I did was upgrade Spark from 1.3.0 to 1.4.0 in Cloudera Quickstart CDH-5.4.0, and the second filtering approach works. I still can't explain why 1.3.0 has problems with it, though.
