I am trying to convert the date column in my Spark DataFrame from date to np.datetime64. How can I achieve that?
# this snippet converts the string column to date format
from pyspark.sql.functions import to_date, col

df1 = df.withColumn("data_date", to_date(col("data_date"), "yyyy-MM-dd"))
As you can see in the Spark docs (https://spark.apache.org/docs/latest/sql-reference.html), the only types supported for time values are TimestampType and DateType. Spark does not know how to handle an np.datetime64 type (think about it: what could Spark know about NumPy? Nothing).
You have already converted your string to a date format that Spark knows. My advice is to work with it as a date from there, which is what Spark understands, and not to worry: there is a whole set of built-in functions for dealing with this type. Anything you can do with np.datetime64 in NumPy you can do in Spark. Take a look at this post for more detail: https://mungingdata.com/apache-spark/dates-times/
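For instance, a small sketch of working directly with the converted column using built-in date functions (the column name comes from the snippet above; the derived column names are just for illustration):

from pyspark.sql.functions import col, year, month, datediff, current_date

# derive calendar parts and an age-in-days column from the converted date column
df2 = (df1
    .withColumn("year", year(col("data_date")))
    .withColumn("month", month(col("data_date")))
    .withColumn("days_since", datediff(current_date(), col("data_date"))))

df2.show()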
Why do you want to do this? Spark does not support the datetime64 data type, and the option of creating a user-defined data type is not available any more. You could probably create a pandas DataFrame and then do this conversion there; Spark won't support it.
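If you really do need np.datetime64 values, a minimal sketch of the pandas route suggested above, assuming the data is small enough to collect to the driver:

import pandas as pd

# convert on the driver via pandas; only sensible if the data fits in memory
pdf = df1.toPandas()
pdf['data_date'] = pd.to_datetime(pdf['data_date'])  # dtype becomes datetime64[ns]
print(pdf.dtypes)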
I'm currently working on research into heart disease detection and want to use Spark to process big data, as it is part of the solution for my work. But I'm having difficulty using Spark with Python because I cannot grasp how to use it. I can convert the CSV file to an RDD, but then I don't understand how to work with the RDD to implement classification algorithms like kNN, logistic regression, etc.
So I would really appreciate it if anyone could help me in any way.
I have tried to learn PySpark from the internet, but there is very little example code available, and what is available is either too basic or too hard to understand. I cannot find any proper example of classification in PySpark.
To read the CSV into a DataFrame you can just call spark.read.option('header', 'true').csv('path/to/csv').
The DataFrame will contain the columns and rows of your CSV, and you can convert it into an RDD of rows with df.rdd.
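For classification you can actually stay at the DataFrame level and use the spark.ml API. A minimal end-to-end sketch, where the feature and label column names ('age', 'chol', 'label') are hypothetical placeholders for your CSV's columns:

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("heart-disease").getOrCreate()

# read the CSV and let Spark infer numeric column types
df = spark.read.option('header', 'true').option('inferSchema', 'true').csv('path/to/csv')

# spark.ml expects all features assembled into a single vector column
assembler = VectorAssembler(inputCols=['age', 'chol'], outputCol='features')  # replace with your feature columns
data = assembler.transform(df)

train, test = data.randomSplit([0.8, 0.2], seed=42)

# fit a logistic regression classifier on the training split
lr = LogisticRegression(featuresCol='features', labelCol='label')
model = lr.fit(train)

# evaluate predictions on the held-out split
model.transform(test).select('label', 'prediction').show()

Note that spark.ml has no built-in kNN, but it does ship classifiers such as LogisticRegression, DecisionTreeClassifier, and RandomForestClassifier.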
This question already has answers here: Difference between DataFrame, Dataset, and RDD in Spark (14 answers). Closed 4 years ago.
In Spark, there are often operations like this:
hiveContext.sql("select * from demoTable").show()
When I look up the show() method in the official Spark API documentation, and then change the search keyword to 'Dataset', I find that the method used on DataFrame actually belongs to Dataset. How does that happen? Is there any implication?
According to the documentation:
A Dataset is a distributed collection of data.
And:
A DataFrame is a Dataset organized into named columns.
So, technically:
DataFrame is equivalent to Dataset<Row>
And one last quote:
In the Scala API, DataFrame is simply a type alias of Dataset[Row]. While, in Java API, users need to use Dataset to represent a DataFrame.
In short, the concrete type is Dataset.
I have a very large dataframe with shape (16 million, 147). I am trying to persist that dataframe using to_pickle, but the program neither serializes the dataframe nor throws any exception. Does anyone have any idea why? Please suggest another format for persisting the dataframe. I have tried HDF5, but I have mixed-type object columns, so I cannot use that format.
How should I serialize a Dataset? Is there a way to use the Encoder to create a binary file, or should I convert it to a DataFrame and then save it as Parquet?
How should I serialize a Dataset?
dataset.toDF().write.parquet("/path/to/output") // placeholder output path
I believe it will automatically adhere to the schema used by the dataset.
Is there a way to use the Encoder to create a binary file
Based on the source code of Encoder (for 1.6.0), it is designed to convert an input data source into a Dataset (to and from InternalRow, to be precise, but that's a very low-level detail). The default implementation maps every column of a DataFrame onto a case class (for Scala), a tuple, or a primitive in order to generate a Dataset.
I think you are using Java or Scala, right? PySpark doesn't have support for Dataset yet. In my experience, the best you can do is to save your data as a Parquet file in HDFS, because I have noticed that the time required to read the file is reduced compared with other formats like CSV.
Sorry for the digression, but I thought it was important. As you can see in the documentation of the Dataset class, there is no method for saving the data, so my suggestion is to use the toDF method of Dataset and then the write method of DataFrame, which gives you a DataFrameWriter whose parquet method does the saving.
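The answer above assumes Java or Scala; for PySpark users, a rough sketch of the same "save it as Parquet" advice, with placeholder paths:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-roundtrip").getOrCreate()

# placeholder input: any DataFrame will do
df = spark.read.option('header', 'true').csv('path/to/csv')

# write as Parquet; the schema is stored with the data
df.write.mode('overwrite').parquet('path/to/output.parquet')

# reading it back is typically much faster than re-parsing CSV
df2 = spark.read.parquet('path/to/output.parquet')
df2.printSchema()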
This question already has answers here: Stratified sampling in Spark (2 answers). Closed 4 years ago.
I'm on Spark 1.3.0 and my data is in DataFrames.
I need operations like sampleByKey(), sampleByKeyExact().
I saw the JIRA "Add approximate stratified sampling to DataFrame" (https://issues.apache.org/jira/browse/SPARK-7157).
That's targeted for Spark 1.5; until that comes through, what's the easiest way to accomplish the equivalent of sampleByKey() and sampleByKeyExact() on DataFrames?
Thanks & Regards
MK
Spark 1.1 added the stratified sampling routines sampleByKey and sampleByKeyExact to Spark Core, so since then they have been available without any MLlib dependencies.
These two functions live in PairRDDFunctions and operate on key-value RDD[(K, T)], and DataFrames do not have keys. You'd have to use the underlying RDD, something like below:
val df = ... // your dataframe
val fractions: Map[K, Double] = ... // specify the exact fraction desired from each key
// key each Row by its first column, then sample without replacement
val sample = df.rdd.keyBy(x => x(0)).sampleByKey(false, fractions)
Note that sample is now an RDD, not a DataFrame, but you can easily convert it back to a DataFrame since you already have the schema defined for df.
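For PySpark users, a rough sketch of the same approach, keying by the first column and then rebuilding a DataFrame from the sampled rows (the fractions map is a hypothetical placeholder):

# hypothetical key values and sampling fractions
fractions = {'group_a': 0.1, 'group_b': 0.5}

# key each row by its first column and take a stratified sample without replacement
keyed = df.rdd.keyBy(lambda row: row[0])
sampled = keyed.sampleByKey(False, fractions)

# drop the keys and reuse the original schema to get a DataFrame back
sampled_df = sqlContext.createDataFrame(sampled.values(), df.schema)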