Spark: assign a number to each word in collect - apache-spark

I have collected the data of a DataFrame column in Spark:
temp = df.select('item_code').collect()
Result:
[Row(item_code=u'I0938'),
Row(item_code=u'I0009'),
Row(item_code=u'I0010'),
Row(item_code=u'I0010'),
Row(item_code=u'C0723'),
Row(item_code=u'I1097'),
Row(item_code=u'C0117'),
Row(item_code=u'I0009'),
Row(item_code=u'I0009'),
Row(item_code=u'I0009'),
Row(item_code=u'I0010'),
Row(item_code=u'I0009'),
Row(item_code=u'C0117'),
Row(item_code=u'I0009'),
Row(item_code=u'I0596')]
Now I would like to assign a number to each word; if a word is duplicated, it should get the same number.
I am using Spark and RDDs, not Pandas.
Please help me resolve this problem!

You could create a new dataframe which has distinct values.
val data = temp.distinct()
Now you can assign a unique ID using
import org.apache.spark.sql.functions._
val dataWithId = data.withColumn("uniqueID",monotonicallyIncreasingId)
Now you can join this new dataframe with the original dataframe and select the unique id.
val tempWithId = temp.join(dataWithId, "item_code").select("item_code", "uniqueID")
The code assumes Scala, but something similar exists for PySpark as well; consider this a pointer, with a PySpark sketch below.
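For reference, a minimal PySpark sketch of the same approach, assuming the original DataFrame is df and the column is item_code as in the question (monotonically_increasing_id yields unique but not necessarily consecutive numbers):
from pyspark.sql.functions import monotonically_increasing_id
# one row per distinct item_code, each tagged with its own id
distinct_codes = df.select('item_code').distinct().withColumn('uniqueID', monotonically_increasing_id())
# join back so that duplicate item_codes carry the same number
temp_with_id = df.join(distinct_codes, 'item_code').select('item_code', 'uniqueID')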

Related

How to groupBy a few columns from a dataset while keeping the full column selection?

I want to groupBy a dataset using a groupByAttributes List; I do it like so:
Dataset<Row> groupedRows = initDataset.select(initDataset.col("*")).groupBy(groupByAttributes.toArray(new Column[groupByAttributes.size()])).agg(count("*"));
How do I return groupedRows with all the columns of initDataset?
PS: joins are not of great help here.
In Scala it would be something like:
import org.apache.spark.sql.expressions.Window
val groupedRows = initDataset.withColumn("count", count(lit(1)).over(Window.partitionBy(groupByAttributes: _*)))
It should be more or less similar in Java.

Save and append a file in HDFS using PySpark

I have a DataFrame in PySpark called df. I have registered this df as a temp table like below.
df.registerTempTable('mytempTable')
date=datetime.now().strftime('%Y-%m-%d %H:%M:%S')
Now from this temp table I will get certain values, like the min and max of a column id:
min_id = sqlContext.sql("select nvl(min(id),0) as minval from mytempTable").collect()[0].asDict()['minval']
max_id = sqlContext.sql("select nvl(max(id),0) as maxval from mytempTable").collect()[0].asDict()['maxval']
Now I will collect all these values like below.
test = ("{},{},{}".format(date,min_id,max_id))
I found that test is not a DataFrame, but a str string:
>>> type(test)
<type 'str'>
Now I want to save this test as a file in HDFS. I would also like to append data to the same file in HDFS.
How can I do that using PySpark?
FYI I am using Spark 1.6 and don't have access to Databricks spark-csv package.
Here you go, you'll just need to concatenate your data with concat_ws and write it out as text:
query = """select concat_ws(',', date, nvl(min(id), 0), nvl(max(id), 0))
from mytempTable"""
sqlContext.sql(query).write("text").mode("append").save("/tmp/fooo")
Or an even better alternative:
from pyspark.sql import functions as f
(sqlContext
    .table("myTempTable")
    .select(f.concat_ws(",", f.first(f.lit(date)), f.min("id"), f.max("id")))
    .coalesce(1)
    .write.format("text").mode("append").save("/tmp/fooo"))

First element of each dataframe partition Spark 2.0

I need to retrieve the first element of each dataframe partition.
I know that I need to use mapPartitions, but it is not clear to me how to use it.
Note: I am using Spark 2.0; the DataFrame is sorted.
I believe it should look something like the following:
import org.apache.spark.sql.catalyst.encoders.RowEncoder
...
implicit val encoder = RowEncoder(df.schema)
val newDf = df.mapPartitions(iterator => iterator.take(1))
This will take 1 element from each partition of the DataFrame. Then you can collect all the data to your driver, i.e.:
newDf.collect()
This will return you an array with a number of elements equal to number of your partitions.
UPD: updated in order to support Spark 2.0.
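If you are working from PySpark, a minimal sketch of the same idea over the underlying RDD (assuming df is your sorted DataFrame; islice simply yields nothing for empty partitions):
from itertools import islice
# take at most one Row from each partition, then bring them to the driver
first_rows = df.rdd.mapPartitions(lambda it: islice(it, 1)).collect()
As with the Scala version, first_rows will contain at most one element per partition.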

How to convert a table into a Spark DataFrame

In Spark SQL, a dataframe can be queried as a table using this:
sqlContext.registerDataFrameAsTable(df, "mytable")
Assuming what I have is mytable, how can I get or access this as a DataFrame?
The cleanest way:
df = sqlContext.table("mytable")
Documentation
Well, you can query it and save the result into a variable; note that SQLContext's sql method returns a DataFrame.
df = sqlContext.sql("SELECT * FROM mytable")

Python Spark na.fill does not work

I'm working with Spark 1.6 and Python.
I merged two DataFrames:
df = df_1.join(df_2, df_1.id == df_2.id, 'left').drop(df_2.id)
I get a new DataFrame with the correct values, and "Null" where the keys don't match.
I would like to replace all "Null" values in my DataFrame.
I used this function, but it does not replace the null values:
new_df = df.na.fill(0.0)
Does someone know why it does not work?
Many thanks for your answer.
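As a hedged sketch (the column names below are hypothetical, not from the question): na.fill with a float value only targets numeric columns, so nulls in string columns need a string fill value or a per-column dict:
new_df = df.na.fill(0.0)                                 # fills nulls in numeric columns only
new_df = new_df.na.fill('missing')                       # fills nulls in string columns
new_df = df.na.fill({'amount': 0.0, 'category': 'n/a'})  # or per column, with a dict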
