Does MLLib only accept the libsvm data format? - apache-spark

I have a training-set table in Hive. There are 600 columns: columns 0~599 are features such as age, gender, and so on, and the last column is the label (0 or 1).
I read the table as a DataFrame, and the df also has 600 columns.
But I find that in the Spark (Python) docs, models like random forest only seem to accept data in libsvm format:
data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
So I wonder: does MLlib only accept the libsvm data format?
If so, how can I transform my dataset into libsvm format, given that it is distributed data stored as a Hive table?
Thanks

If your data is stored in Hive, you can read it with Spark SQL to get a DataFrame, and then train on that DataFrame directly with Spark ML; example code is available in the Spark ML documentation.
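For example, here is a minimal PySpark sketch of that workflow, assuming a Hive table named train_set whose label column is called "label" (adjust the names to your schema):

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier

# Hive-enabled session so spark.sql can see the table.
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Read the Hive table straight into a DataFrame -- no libsvm file needed.
df = spark.sql("SELECT * FROM train_set")

# spark.ml estimators expect a single vector column of features,
# so assemble the feature columns into one.
feature_cols = [c for c in df.columns if c != "label"]
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
train = assembler.transform(df)

# Train a random forest directly on the DataFrame.
rf = RandomForestClassifier(labelCol="label", featuresCol="features")
model = rf.fit(train)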

Related

Spark SQL `sampleBy` function

The Spark DataFrame class has a sampleBy method that can perform stratified sampling on a column, given a dictionary of sampling fractions whose keys correspond to values in that column. Is there an equivalent way to do this sampling using raw Spark SQL?
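For reference, the DataFrame call being described looks roughly like this (the table name, column name, values, and fractions below are made up for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.table("events")  # hypothetical source table

# Stratified sample: keep 10% of rows where key == 0 and 50% where key == 1.
sampled = df.sampleBy("key", fractions={0: 0.1, 1: 0.5}, seed=42)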

Spark iteration logic to write the dataset filtered by date to parquet format failing OOM

I have a scenario where I have a dataset with a date column, and I later iterate over that dataset to save it into multiple partition folders in Parquet format. I iterate over the date list and, while writing to Parquet into that date's partition folder, I filter the dataset by the date.
I was able to write for a certain number of iterations, but after that it fails with Spark out-of-memory exceptions.
What is the best way to optimise this so the data is persisted without OOM?
dataset = ...  # dataset with some transformations
for date in date_list:
    pd.write_part_file("part-data-file", dataset.filter(archive_date == date))
My code looks roughly like the above.
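One common alternative, sketched below, is to skip the per-date loop and let Spark write all the date partitions in a single pass with partitionBy; the column name and output path are placeholders taken from the question:

# Write every date in one pass; Spark creates one folder per archive_date value.
(dataset
    .repartition("archive_date")            # co-locate rows of the same date
    .write
    .mode("overwrite")
    .partitionBy("archive_date")
    .parquet("/output/part-data"))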

how to add an incremental column ID for a table in Spark SQL

I'm working on a Spark MLlib algorithm. The dataset I have is in this form:
Company":"XXXX","CurrentTitle":"XYZ","Edu_Title":"ABC","Exp_mnth":.(there are more values similar to these)
I'm trying to recode the string values as numeric values, so I tried using zipWithUniqueId to get a unique value for each string value. For some reason I'm not able to save the modified dataset to disk. Can I do this in some way using Spark SQL, or what would be a better approach?
Scala
import org.apache.spark.sql.functions.monotonically_increasing_id
val dataFrame1 = dataFrame0.withColumn("index",monotonically_increasing_id())
Java
import org.apache.spark.sql.functions;
Dataset<Row> dataFrame1 = dataFrame0.withColumn("index", functions.monotonically_increasing_id());
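For completeness, a PySpark equivalent of the same idea; note that monotonically_increasing_id gives IDs that are unique and increasing but not consecutive:

from pyspark.sql.functions import monotonically_increasing_id

# IDs are unique per row and increasing, but may contain gaps.
dataFrame1 = dataFrame0.withColumn("index", monotonically_increasing_id())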

how to create a dataframe from a relational database with sparse columns in Spark?

I have to read some data from a relational database to do some machine learning in Spark. However, the table in question has some sparse columns. It also has a column called "SpecialPurposeColumns" which contains the non-zero data in XML format, like:
<Age>76</Age><ID>78</ID><Income>87000</Income> ... <ZIP>96733</ZIP>
What is a good way to create a DataFrame in Spark from this data?
Thanks in advance
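One possible sketch in PySpark, assuming the table is reachable over JDBC and that the fields of interest can be pulled out of the XML-like string with regular expressions (the connection URL, table name, and extracted tags below are placeholders):

from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_extract, col

spark = SparkSession.builder.getOrCreate()

# Read the relational table over JDBC (connection details are placeholders).
df = spark.read.jdbc(
    url="jdbc:sqlserver://host:1433;databaseName=mydb",
    table="dbo.MyTable",
    properties={"user": "username", "password": "password"})

# Pull individual values out of the SpecialPurposeColumns XML string.
parsed = (df
    .withColumn("Age", regexp_extract(col("SpecialPurposeColumns"), r"<Age>(\d+)</Age>", 1).cast("int"))
    .withColumn("Income", regexp_extract(col("SpecialPurposeColumns"), r"<Income>(\d+)</Income>", 1).cast("int")))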

Appending a transformed column in pyspark

I am running a logistic regression on a data frame, and since the logistic regression function in Spark does not take categorical variables, I am transforming them.
I am using the StringIndexer transformer:
indexer = StringIndexer(inputCol="classname", outputCol="ClassCategory")
I want to append this transformed column back to the DataFrame.
df.withColumn does not let me do that, because the indexer object is not a column.
Is there a way to transform and append?
As can be seen in the examples of the Spark ML Documentation, you can try the following:
# Original data is in "df"
indexer = StringIndexer(inputCol="classname", outputCol="ClassCategory")
indexed = indexer.fit(df).transform(df)
indexed.show()
The indexed object will be a dataframe with a new column called "ClassCategory" (the name passed as outputCol).
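If the goal is to feed the indexed column into logistic regression, one option (a sketch, assuming a label column named "label" and using only the indexed feature for brevity) is to chain everything in a Pipeline so the transform happens as part of fitting:

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import StringIndexer, VectorAssembler

indexer = StringIndexer(inputCol="classname", outputCol="ClassCategory")
# Assemble the model's input features; here only the indexed column is used.
assembler = VectorAssembler(inputCols=["ClassCategory"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")

pipeline = Pipeline(stages=[indexer, assembler, lr])
model = pipeline.fit(df)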
