Using to_pickle for very large dataframes - python-3.x

I have a very large dataframe with shape (16 million, 147). I am trying to persist it using to_pickle, but the program neither serializes the dataframe nor throws any exception. Does anyone have any idea why? Alternatively, can you suggest another format for persisting the dataframe? I have tried HDF5, but my dataframe contains mixed-type object columns, so I cannot use that format.
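For reference, a minimal sketch of the two options in play here; the file paths, the explicit pickle protocol, and the idea of casting mixed-type object columns to strings before writing Parquet are assumptions, not part of the original question:
import pandas as pd

# Tiny stand-in for the 16-million-row, 147-column frame in the question.
df = pd.DataFrame({"ints": [1, 2, 3], "mixed": ["x", 4, None]})

# Pickle with an explicit, recent protocol; protocol 4 was added for very large objects.
df.to_pickle("frame.pkl", protocol=4)

# Alternative: cast mixed-type object columns to strings so the frame fits a typed
# columnar format such as Parquet (requires pyarrow or fastparquet).
obj_cols = df.select_dtypes(include="object").columns
df[obj_cols] = df[obj_cols].astype(str)
df.to_parquet("frame.parquet")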

Related

Converting between datetime64 and datetime in pyspark

I am trying to convert the date column in my Spark dataframe from date to np.datetime64. How can I achieve that?
from pyspark.sql.functions import col, to_date

# this snippet converts the string column to a date column
df1 = df.withColumn("data_date", to_date(col("data_date"), "yyyy-MM-dd"))
As you can see in the Spark SQL reference (https://spark.apache.org/docs/latest/sql-reference.html), the only types supported for time values are TimestampType and DateType. Spark does not know how to handle an np.datetime64 type (think about what Spark could know about numpy: nothing).
You have already converted your string to a date format that Spark understands. My advice is to keep working with it as a date from there, which is what Spark expects, and not to worry: there is a whole range of built-in functions for dealing with this type. Anything you can do with np.datetime64 in numpy, you can do in Spark. Take a look at this post for more detail: https://mungingdata.com/apache-spark/dates-times/
Why do you want to do this? Spark does not support the datetime64 data type, and the ability to create a user-defined data type is no longer available. You can probably create a pandas DataFrame and do the conversion there, but Spark itself won't support it.
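For what it's worth, a minimal sketch of that pandas round-trip; the sample data and column name are assumptions, and toPandas() is only reasonable if the result fits in driver memory:
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date

spark = SparkSession.builder.getOrCreate()

# Small stand-in for the dataframe in the question.
df = spark.createDataFrame([("2020-01-15",), ("2020-02-01",)], ["data_date"])
df1 = df.withColumn("data_date", to_date(col("data_date"), "yyyy-MM-dd"))

# Only if numpy's datetime64 is truly needed: pull the result into pandas,
# where the date column becomes datetime64[ns].
pdf = df1.toPandas()
pdf["data_date"] = pd.to_datetime(pdf["data_date"])
print(pdf["data_date"].dtype)   # datetime64[ns]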

How to convert csv to RDD and use RDD in pyspark for some detection?

I'm currently working on research into heart disease detection and want to use Spark to process big data, as it is part of the solution for my work. But I'm having difficulty using Spark with Python because I cannot grasp how to use it. I can convert a CSV file to an RDD, but then I don't understand how to work with the RDD to implement classification algorithms like kNN, logistic regression, etc.
So I would really appreciate it if anyone could help me in any way.
I have tried to learn PySpark from the internet, but there are very few code examples available, and those that exist are either too simple or too hard to follow. I cannot find a proper example of classification in PySpark.
To read the CSV into a dataframe you can just call spark.read.option('header', 'true').csv('path/to/csv').
The dataframe will contain the columns and rows of your CSV, and you can convert it into an RDD of rows with df.rdd, as in the sketch below.
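A sketch of that, extended with a basic spark.ml classifier; the CSV path and the "target" label column are assumptions, so adjust them to the actual heart-disease data:
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.getOrCreate()

# Read the CSV with a header row; inferSchema gives numeric columns real types.
df = spark.read.option("header", "true").option("inferSchema", "true").csv("path/to/heart.csv")

rdd = df.rdd            # an RDD of Row objects, if you really need the RDD API
print(rdd.take(2))

# For classification, the DataFrame-based spark.ml API is usually simpler than
# hand-rolling algorithms on the raw RDD (assumes all feature columns are numeric).
feature_cols = [c for c in df.columns if c != "target"]
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="target")
model = lr.fit(assembler.transform(df))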

Using Spark DataFrame directly in Keras (databricks)

I have some text that I am looking to classify with Keras. I have created a pipeline that takes the text, applies some transformations, and eventually one-hot encodes it.
Now I want to pass that one-hot encoded column, along with the label column, directly into Keras on Databricks, but I cannot seem to do it. All of the examples I see start with a pandas dataframe and then convert it to a numpy array, and it seems counterproductive to convert my PySpark dataframe that way.
model.fit(trainingData.select('featuresFirst'), trainingData.select('label'))
gives me:
AttributeError: 'DataFrame' object has no attribute 'values'
model.fit(trainingData.select('featuresFirst').collect(), trainingData.select('label').collect())
gives me:
AttributeError: ndim
What am I missing here?
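One workaround, sketched under the assumption that featuresFirst holds Spark ML vectors and that the training set fits in driver memory, is to collect to pandas and densify the vectors before handing NumPy arrays to Keras, which is what model.fit expects:
import numpy as np

# trainingData and model are the objects from the question.
pdf = trainingData.select("featuresFirst", "label").toPandas()
X = np.stack(pdf["featuresFirst"].apply(lambda v: v.toArray()).to_numpy())
y = pdf["label"].to_numpy()

model.fit(X, y)   # Keras wants NumPy arrays, not Spark DataFrames or Rows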

How to serialize Dataset to binary file/parquet?

How should I serialize a Dataset? Is there a way to use the Encoder to create a binary file, or should I convert it to a DataFrame and then save it as parquet?
How should I serialize a Dataset?
dataset.toDF().write.parquet("")
I believe it will automatically adhere to the schema used by the dataset.
Is there a way to use the Encoder to create a binary file
Based on the source code of Encoder (for 1.6.0), it is designed to convert an input data source into a Dataset (to and from InternalRow, to be precise, but that is a very low-level detail). The default implementation maps every column of a dataframe to a case class (in Scala), a tuple, or a primitive in order to produce a Dataset.
I think you are using Java or Scala, right? PySpark doesn't support Dataset yet. In my experience the best you can do is to save your data as a parquet file in HDFS, because I have noticed that the time required to read the file is reduced compared with other formats such as CSV.
Sorry for the digression, but I thought it was important. As you can see in the documentation of the Dataset class, there is no method for saving the data directly, so my suggestion is to use the toDF method of Dataset and then the write method of DataFrame, or the parquet method of the DataFrameWriter class, as in the sketch below.
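A minimal PySpark version of that suggestion; the output path is just a placeholder:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# DataFrameWriter.parquet stores the schema alongside the data.
df.write.mode("overwrite").parquet("/tmp/example_parquet")

# Reading it back restores the same schema and types.
spark.read.parquet("/tmp/example_parquet").printSchema()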

How does spark handle missing values?

Apache Spark supports sparse data.
For example, we can use MLUtils.loadLibSVMFile(...) to load data into an RDD.
I was wondering how Spark deals with the missing values.
Spark creates an RDD of labeled points, and each labeled point has a label and a vector of features. Note that this is a Spark Vector, which does support sparse data: a sparse vector is represented by an array holding the indices of the non-zero entries and a parallel array of doubles holding their values.
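A short PySpark illustration of both points; the libsvm path is a placeholder and sc is the usual SparkContext:
from pyspark.mllib.util import MLUtils
from pyspark.mllib.linalg import Vectors

# loadLibSVMFile returns an RDD of LabeledPoint(label, features),
# where features is a (possibly sparse) Spark Vector.
data = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")
first = data.first()
print(first.label, first.features)

# A SparseVector stores only the indices of the non-zero entries and their values.
sv = Vectors.sparse(5, [0, 3], [1.0, 7.0])   # size 5, non-zeros at positions 0 and 3
print(sv.toArray())                          # [1. 0. 0. 7. 0.]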
