Appending a transformed column in pyspark - apache-spark

I am running a logistic regression on a data frame, and since the logistic regression function in Spark does not accept categorical variables, I am transforming them.
I am using the StringIndexer transformer:
indexer = StringIndexer(inputCol="classname", outputCol="ClassCategory")
I want to append this transformed column back to the dataframe.
df.withColumn does not let me do that, because the indexer object is not a column.
Is there a way to transform and append?

As can be seen in the examples of the Spark ML Documentation, you can try the following:
# Original data is in "df"
from pyspark.ml.feature import StringIndexer

indexer = StringIndexer(inputCol="classname", outputCol="ClassCategory")
indexed = indexer.fit(df).transform(df)
indexed.show()
The indexed object will be a dataframe with a new column called "ClassCategory" (the name passed as outputCol).
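Since the original goal was logistic regression, a minimal follow-up sketch (assuming the label column is named "label" and, for simplicity, only the indexed column is used as a feature) could look like this:
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

# Assemble the indexed column (plus any other numeric columns) into a feature vector
assembler = VectorAssembler(inputCols=["ClassCategory"], outputCol="features")
assembled = assembler.transform(indexed)

# "label" is an assumed column name for the target
lr = LogisticRegression(featuresCol="features", labelCol="label")
model = lr.fit(assembled)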

Related

Spark SQL `sampleBy` function

The Spark DataFrame class has a sampleBy method that performs stratified sampling on a column, given a dictionary of sampling fractions keyed by the values in that column. Is there an equivalent way to do this sampling using raw Spark SQL?
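For reference, the DataFrame API call and a rough raw-SQL approximation might look like the following (a sketch, assuming a strata column named "key" and a temp view named "people"; the SQL version is not an exact equivalent of sampleBy):
# DataFrame API: keep ~10% of rows where key == 0 and ~60% where key == 1
sampled = df.sampleBy("key", fractions={0: 0.1, 1: 0.6}, seed=42)

# Rough SQL approximation: filter each stratum with rand()
df.createOrReplaceTempView("people")
sampled_sql = spark.sql(
    "SELECT * FROM people "
    "WHERE rand(42) < CASE key WHEN 0 THEN 0.1 WHEN 1 THEN 0.6 ELSE 0.0 END"
)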

Create XML request from each record of a dataframe

I have tried many options including withColumn, udf, lambda, foreach, and map, but I am not getting the expected output. At most, I am able to transform only the first record. The inputfile.json will keep on growing, and the expected output should give the XML in the desired structure. I will later produce that output to Kafka.
Spark 2.3, Python 2.7. This needs to be done in PySpark.
Edit 1:
I am able to add a column in the main dataframe which has the required XML. I used withColumn and functions.format_string and was able to add the strings (the XML structures) as a column of the dataframe, as sketched below.
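A minimal sketch of that step (the field names "id" and "name" here are placeholders, and the column name "newColumn" matches the output shown below):
from pyspark.sql import functions as F

# Build an XML string per record with format_string; real field names will differ
df = df.withColumn(
    "newColumn",
    F.format_string("<record><id>%s</id><name>%s</name></record>",
                    F.col("id"), F.col("name"))
)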
Now my next target is to produce just the value of that new column to Kafka. I am using df.foreachPartition(send_to_kafka) and have created a function as below:
from kafka import SimpleClient, SimpleProducer  # legacy kafka-python API

def send_to_kafka(rows):
    kafka = SimpleClient('localhost:9092')
    producer = SimpleProducer(kafka)
    for row in rows:
        producer.send_messages('test', str(row.asDict()))
But unfortunately it does two things:
a. It produces the record on Kafka as {'newColumn':u'myXMLPayload'}. I do not want that; I want only myXMLPayload to be produced on Kafka.
b. It adds the u'' prefix to the value because it is a unicode string.
If I can get rid of these two parts I will be good to go.
Any help would be appreciated.
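One way to address both points, keeping the same foreachPartition approach, is to send only the column value, encoded to bytes, instead of the whole row dict. A sketch, assuming the column is the "newColumn" shown above:
def send_to_kafka(rows):
    kafka = SimpleClient('localhost:9092')
    producer = SimpleProducer(kafka)
    for row in rows:
        # Send just the XML payload, UTF-8 encoded, rather than str(row.asDict())
        producer.send_messages('test', row['newColumn'].encode('utf-8'))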

Mapping words to vector sparkml CountVectorizerModel

I used CountVectorizerModel in Spark ML and got the tf-idf of the data.
The output column of the df looks like:
(63709,[0,1,2,3,6,7,8,10,11,13],[0.6095235999680518,0.9946971867717818,0.5151611294911758,0.4371112749198506,3.4968901993588046,0.06806241719930584,1.1156025996012633,3.0425756717399217,0.3760235829400124])
I want to get the top n words that map to the highest of these scores.
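One approach is to look up the sparse vector's indices in the fitted model's vocabulary and sort by score. A sketch, assuming the fitted CountVectorizerModel is called cv_model and the scored column is named "features":
# vocabulary is a list of words whose positions correspond to vector indices
vocab = cv_model.vocabulary

def top_n_words(vector, n=10):
    # Pair each index with its score, sort descending, and map indices to words
    pairs = sorted(zip(vector.indices, vector.values), key=lambda p: -p[1])
    return [(vocab[int(i)], float(v)) for i, v in pairs[:n]]

first = df.select("features").first()
print(top_n_words(first["features"]))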

Does MLLib only accept the libsvm data format?

I have a training set table in Hive. There are 600 columns; columns 0~599 are features such as age, gender, ..., and the last column is the label of 0 and 1.
I read the table as df and the df also has 600 columns.
But I find that in the Spark (Python) docs, models like random forest seem to only accept libsvm-format data:
data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
So I wonder whether MLlib only accepts the libsvm data format?
If so, how can I transform my dataset to libsvm format, given that my dataset is distributed data stored as a Hive table?
Thanks
If your data is stored in Hive, you can read it with Spark SQL to get a dataframe, and then train on that dataframe with Spark ML. Example code can be found in the Spark ML documentation.
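Concretely, the Hive table can be read into a dataframe and its columns assembled into the features/label shape that pyspark.ml estimators expect. A sketch, assuming the table is named "train_set", the label column is named "label", and all other columns are numeric:
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier

# Read the Hive table as a dataframe (requires a Hive-enabled SparkSession)
df = spark.sql("SELECT * FROM train_set")
feature_cols = [c for c in df.columns if c != "label"]

# Assemble the feature columns into a single vector column
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
train = assembler.transform(df).select("features", "label")

rf = RandomForestClassifier(featuresCol="features", labelCol="label")
model = rf.fit(train)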

How to add an incremental column ID for a table in Spark SQL

I'm working on a spark mllib algorithm. The dataset I have is in this form
Company":"XXXX","CurrentTitle":"XYZ","Edu_Title":"ABC","Exp_mnth":.(there are more values similar to these)
I'm trying to recode the string values to numeric values, so I tried using zipWithUniqueId to get a unique value for each string value. For some reason I'm not able to save the modified dataset to disk. Can I do this in any way using Spark SQL, or what would be a better approach for this?
Scala
import org.apache.spark.sql.functions.monotonically_increasing_id
val dataFrame1 = dataFrame0.withColumn("index",monotonically_increasing_id())
Java
import org.apache.spark.sql.functions;
Dataset<Row> dataFrame1 = dataFrame0.withColumn("index",functions.monotonically_increasing_id());
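Since the rest of this page is PySpark, the Python equivalent (a sketch, assuming dataFrame0 already exists) is:
Python
from pyspark.sql.functions import monotonically_increasing_id

dataFrame1 = dataFrame0.withColumn("index", monotonically_increasing_id())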
