read percentage values in spark - apache-spark

I have an xlsx file which has a single column:
percentage
30%
40%
50%
-10%
0.00%
0%
0.10%
110%
99.99%
99.98%
-99.99%
-99.98%
When I read this using Apache Spark, the output I get is:
+----------+
|percentage|
+----------+
|       0.3|
|       0.4|
|       0.5|
|      -0.1|
|       0.0|
|       0.0|
|     0.001|
|       1.1|
|    0.9999|
|    0.9998|
+----------+
Expected output is:
+----------+
|percentage|
+----------+
|       30%|
|       40%|
|       50%|
|      -10%|
|     0.00%|
|        0%|
|     0.10%|
|      110%|
|    99.99%|
|    99.98%|
+----------+
My code -
val spark = SparkSession
  .builder
  .appName("trimTest")
  .master("local[*]")
  .getOrCreate()

val df = spark.read
  .format("com.crealytics.spark.excel")
  .option("header", "true")
  .option("maxRowsInMemory", 1000)
  .option("inferSchema", "true")
  .load("data/percentage.xlsx")
df.printSchema()
df.show(10)
I don't want to use casting or turn inferSchema to false; I want a way to read the percentage values as percentages, not as doubles or strings.

Well, percentages ARE doubles: 30% = 0.3
The only difference is the way it is displayed and, as @Artem_Aliev wrote in a comment, there is no percentage type in Spark that would print out as you expect. But once again: percentages are doubles; same thing, different notation.
The question is, what do you want to do with those percentages?
To "apply" them to something else, i.e. multiply, just use the double column.
To get a nice print, convert to a suitable string before showing:
val percentString = format_string("%.2f%%", $"percentage" * 100)
df.withColumn("percentage", percentString).show()
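For completeness, a rough PySpark equivalent (a sketch, assuming the column was read as a double, as in the question):
from pyspark.sql import functions as F

df_pct = df.withColumn("percentage", F.format_string("%.2f%%", F.col("percentage") * 100))
df_pct.show()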

Related

show first occurrence(s) of a column

I want to use PySpark to create a new dataframe based on the input that prints out the first occurrence of each different value in a column. Would row_number() work, or window()? I'm not sure of the best way to approach this, or whether Spark SQL would be best. Basically, the second table is what I want the output to be: it prints out just the first occurrence of each value in the "VALUE" column from the input. I am only interested in the first occurrence of the "VALUE" column; if a value is repeated, only show the first one seen.
+--------+--------+--------+
|   VALUE|     DAY|   Color|
+--------+--------+--------+
|      20|     MON|    BLUE|
|      20|    TUES|    BLUE|
|      30|     WED|    BLUE|
+--------+--------+--------+

+--------+--------+--------+
|   VALUE|     DAY|   Color|
+--------+--------+--------+
|      20|     MON|    BLUE|
|      30|     WED|    BLUE|
+--------+--------+--------+
Here's how I'd do this without using a window. It will likely perform better on large data sets as it can use more of the cluster to do the work. For your case you would use 'VALUE' in place of Department and 'DAY' in place of Salary.
from pyspark.sql import SparkSession, Row
from pyspark.sql import functions as f
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
data = [("James","Sales",3000),("Michael","Sales",4600),
("Robert","Sales",4100),("Maria","Finance",3000),
("Raman","Finance",3000),("Scott","Finance",3300),
("Jen","Finance",3900),("Jeff","Marketing",3000),
("Kumar","Marketing",2000)]
df = spark.createDataFrame(data,["Name","Department","Salary"])
# Make a struct with all the record elements; the array sort below will order
# on Salary first because it is the first field in the struct.
unGroupedDf = df.select(
    df["Department"],
    f.struct(
        df["Salary"].alias("Salary"),       # will be sorted on Salary first
        df["Department"].alias("Dept"),
        df["Name"].alias("Name")
    ).alias("record"))

(unGroupedDf.groupBy("Department")              # group
    .agg(f.collect_list("record")               # gather all the elements in a group
         .alias("record"))
    .select(
        f.reverse(                              # make the sort descending
            f.array_sort(                       # sort the array ascending
                f.col("record")                 # the struct column
            )
        )[0].alias("record"))                   # grab the "max" element of the array
    .select(f.col("record.*"))                  # use the struct fields as columns
    .show())
+------+---------+-------+
|Salary|     Dept|   Name|
+------+---------+-------+
|  4600|    Sales|Michael|
|  3900|  Finance|    Jen|
|  3000|Marketing|   Jeff|
+------+---------+-------+
It appears to me you want to drop duplicated items by VALUE. If so, use dropDuplicates:
df.dropDuplicates(['VALUE']).show()
+-----+---+-----+
|VALUE|DAY|Color|
+-----+---+-----+
|   20|MON| BLUE|
|   30|WED| BLUE|
+-----+---+-----+
Here's how to do it with a window. This example uses salary; in your case I think you'd use 'DAY' for orderBy and 'VALUE' for partitionBy.
from pyspark.sql import SparkSession,Row
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
data = [("James","Sales",3000),("Michael","Sales",4600),
("Robert","Sales",4100),("Maria","Finance",3000),
("Raman","Finance",3000),("Scott","Finance",3300),
("Jen","Finance",3900),("Jeff","Marketing",3000),
("Kumar","Marketing",2000)]
df = spark.createDataFrame(data,["Name","Department","Salary"])
df.show()
from pyspark.sql.window import Window
from pyspark.sql.functions import col, row_number
w2 = Window.partitionBy("department").orderBy(col("salary"))
df.withColumn("row",row_number().over(w2)) \
.filter(col("row") == 1).drop("row") \
.show()
+-----+----------+------+
| Name|Department|Salary|
+-----+----------+------+
|James|     Sales|  3000|
|Maria|   Finance|  3000|
|Kumar| Marketing|  2000|
+-----+----------+------+
Yes, you'd need to develop a way of ordering days, but I think you get that it's possible and you picked the correct tool. I always like to warn people: this uses a window, and a window pulls all the data for a partition onto one executor to complete the work, which is not particularly efficient. On small datasets this is likely performant; on larger data sets it may take way too long to complete. One way to order the days is sketched below.
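A minimal sketch of one way to order the days, not from the original answer: it assumes the DAY column holds abbreviations like MON/TUES/WED (the names beyond those three are guesses, adjust them to whatever your data uses), and that your input DataFrame is called input_df.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Map each day abbreviation to an ordinal so it can be used in orderBy.
day_num = (F.when(F.col("DAY") == "MON", 1)
            .when(F.col("DAY") == "TUES", 2)
            .when(F.col("DAY") == "WED", 3)
            .when(F.col("DAY") == "THURS", 4)
            .when(F.col("DAY") == "FRI", 5)
            .when(F.col("DAY") == "SAT", 6)
            .when(F.col("DAY") == "SUN", 7))

# Keep the first row per VALUE when ordered by day: the "first occurrence".
w = Window.partitionBy("VALUE").orderBy(day_num)
input_df.withColumn("row", F.row_number().over(w)) \
    .filter(F.col("row") == 1).drop("row") \
    .show()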

PySpark: Error in describe() function when summarizing distribution of negative numbers -- min and max values flipped

I'm trying to perform some exploratory data analysis by summarizing the distribution of measurements within my dataset using the PySpark describe() function. However, for the measurements that have a negative distribution, the min and max values appear to be flipped.
chicago_crime.describe('latitude', 'longitude').show()
+-------+-------------------+--------------------+
|summary| latitude| longitude|
+-------+-------------------+--------------------+
| count| 6811141| 6811141|
| mean| 41.84203025139101| -87.67177837500668|
| stddev|0.08994460772003067|0.062086304377221284|
| min| 36.619446395| -87.524529378|
| max| 42.022910333| -91.686565684|
+-------+-------------------+--------------------+
The longitude measurement has a negative distribution. I expected the min for longitude to be -91.686565684 and the max to be -87.524529378.
Has anyone else noticed this error? Can the PySpark developers correct this error?
As per request below, here is the printSchema() output.
chicago_crime.printSchema()
root
|-- latitude: string (nullable = true)
|-- longitude: string (nullable = true)
And converting to float then shows the expected result, since min and max on a string column are computed by lexicographic comparison rather than numerically.
chicago_crime = chicago_crime.withColumn('latitude', chicago_crime.latitude.astype('float'))
chicago_crime = chicago_crime.withColumn('longitude', chicago_crime.longitude.astype('float'))
chicago_crime.describe('latitude', 'longitude').show()
+-------+-------------------+--------------------+
|summary| latitude| longitude|
+-------+-------------------+--------------------+
| count| 6810978| 6810978|
| mean| 41.84215369600549| -87.6716834892099|
| stddev|0.08628712634075986|0.058938763393995654|
| min| 41.644585| -87.934326|
| max| 42.02291| -87.52453|
+-------+-------------------+--------------------+
I tried the code below:
from pyspark.sql import Row
df = spark.sparkContext.parallelize([Row(-1),Row(-2), Row(-3)]).toDF()
df.describe().show()
I got the expected result as below (note that here the column is numeric, not string):
+-------+----+
|summary| _1|
+-------+----+
| count| 3|
| mean|-2.0|
| stddev| 1.0|
| min| -3|
| max| -1|
+-------+----+
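For completeness, here is a minimal sketch (with two made-up longitude values, not the Chicago data) showing why a string column flips min and max: string comparison is lexicographic, so "-91..." sorts after "-87..." and becomes the "max".
from pyspark.sql import functions as F

str_df = spark.createDataFrame([("-87.524529378",), ("-91.686565684",)], ["longitude"])

# As strings: min is "-87.524529378", max is "-91.686565684" (lexicographic order).
str_df.select(F.min("longitude"), F.max("longitude")).show()

# After casting to double the comparison is numeric and min/max come out as expected.
str_df.withColumn("longitude", F.col("longitude").cast("double")) \
      .select(F.min("longitude"), F.max("longitude")).show()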

How do I change a string to HH:mm:ss only in Spark

I am getting the time as a string like 134455 and I need to convert it into 13:44:55 using Spark SQL. How can I get this in the right format?
You can try the regexp_replace function.
scala> val df = Seq((134455 )).toDF("ts_str")
df: org.apache.spark.sql.DataFrame = [ts_str: int]
scala> df.show(false)
+------+
|ts_str|
+------+
|134455|
+------+
scala> df.withColumn("ts",regexp_replace('ts_str,"""(\d\d)""","$1:")).show(false)
+------+---------+
|ts_str|ts |
+------+---------+
|134455|13:44:55:|
+------+---------+
scala> df.withColumn("ts",trim(regexp_replace('ts_str,"""(\d\d)""","$1:"),":")).show(false)
+------+--------+
|ts_str|ts |
+------+--------+
|134455|13:44:55|
+------+--------+
val df = Seq("133456").toDF
+------+
| value|
+------+
|133456|
+------+
df.withColumn("value", unix_timestamp('value, "HHmmss"))
.withColumn("value", from_unixtime('value, "HH:mm:ss"))
.show
+--------+
| value|
+--------+
|13:34:56|
+--------+
Note that a unix timestamp is stored as the number of seconds since 00:00:00, 1 January 1970. If you try to convert a time with millisecond accuracy to a timestamp, you will lose the millisecond part of the time. For times including milliseconds, you will need to use a different approach.
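For example, here is a rough PySpark sketch (the 9-digit HHmmssSSS input "134455123" is hypothetical, not from the question) that keeps the milliseconds by rebuilding the formatted time from substrings instead of going through a unix timestamp:
from pyspark.sql import functions as F

ms_df = spark.createDataFrame([("134455123",)], ["value"])
ms_df.withColumn(
    "ts",
    F.concat_ws(
        ":",
        F.substring("value", 1, 2),            # hours
        F.substring("value", 3, 2),            # minutes
        F.concat(F.substring("value", 5, 2),   # seconds
                 F.lit("."),
                 F.substring("value", 7, 3)),  # milliseconds
    ),
).show(truncate=False)
# |134455123|13:44:55.123|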

MySQL sum over a window that contains a null value returns null

I am trying to get the sum of Revenue over the last 3 Month rows (excluding the current row) for each Client. Minimal example with current attempt in Databricks:
import numpy as np
import pandas as pd

cols = ['Client','Month','Revenue']
df_pd = pd.DataFrame([['A',201701,100],
['A',201702,101],
['A',201703,102],
['A',201704,103],
['A',201705,104],
['B',201701,201],
['B',201702,np.nan],
['B',201703,203],
['B',201704,204],
['B',201705,205],
['B',201706,206],
['B',201707,207]
])
df_pd.columns = cols
spark_df = spark.createDataFrame(df_pd)
spark_df.createOrReplaceTempView('df_sql')
df_out = sqlContext.sql("""
select *, (sum(ifnull(Revenue,0)) over (partition by Client
order by Client,Month
rows between 3 preceding and 1 preceding)) as Total_Sum3
from df_sql
""")
df_out.show()
+------+------+-------+----------+
|Client| Month|Revenue|Total_Sum3|
+------+------+-------+----------+
| A|201701| 100.0| null|
| A|201702| 101.0| 100.0|
| A|201703| 102.0| 201.0|
| A|201704| 103.0| 303.0|
| A|201705| 104.0| 306.0|
| B|201701| 201.0| null|
| B|201702| NaN| 201.0|
| B|201703| 203.0| NaN|
| B|201704| 204.0| NaN|
| B|201705| 205.0| NaN|
| B|201706| 206.0| 612.0|
| B|201707| 207.0| 615.0|
+------+------+-------+----------+
As you can see, if a null value exists anywhere in the 3 month window, a null value is returned. I would like to treat nulls as 0, hence the ifnull attempt, but this does not seem to work. I have also tried a case statement to change NULL to 0, with no luck.
Just coalesce outside sum:
df_out = sqlContext.sql("""
select *, coalesce(sum(Revenue) over (partition by Client
order by Client,Month
rows between 3 preceding and 1 preceding), 0) as Total_Sum3
from df_sql
""")
It is Apache Spark, my bad! (I am working in Databricks and I thought it was MySQL under the hood.) Is it too late to change the title?
@Barmar, you are right in that IFNULL() doesn't treat NaN as null. I managed to figure out the fix thanks to @user6910411 from here: SO link. I had to change the numpy NaNs to Spark nulls. The correct code, run after the sample df_pd is created:
spark_df = spark.createDataFrame(df_pd)
from pyspark.sql.functions import isnan, col, when
#this converts all NaNs in numeric columns to null:
spark_df = spark_df.select([
    when(~isnan(c), col(c)).alias(c) if t in ("double", "float") else c
    for c, t in spark_df.dtypes])
spark_df.createOrReplaceTempView('df_sql')
df_out = sqlContext.sql("""
select *, (sum(ifnull(Revenue,0)) over (partition by Client
order by Client,Month
rows between 3 preceding and 1 preceding)) as Total_Sum3
from df_sql order by Client,Month
""")
df_out.show()
which then gives the desired output:
+------+------+-------+----------+
|Client| Month|Revenue|Total_Sum3|
+------+------+-------+----------+
| A|201701| 100.0| null|
| A|201702| 101.0| 100.0|
| A|201703| 102.0| 201.0|
| A|201704| 103.0| 303.0|
| A|201705| 104.0| 306.0|
| B|201701| 201.0| null|
| B|201702| null| 201.0|
| B|201703| 203.0| 201.0|
| B|201704| 204.0| 404.0|
| B|201705| 205.0| 407.0|
| B|201706| 206.0| 612.0|
| B|201707| 207.0| 615.0|
+------+------+-------+----------+
Is sqlContext the best way to approach this or would it be better / more elegant to achieve the same result via pyspark.sql.window?
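As a follow-up, here is a minimal sketch (not from the original thread) of the same rolling sum through the DataFrame API with pyspark.sql.window instead of a SQL string; it assumes spark_df already has its NaNs converted to nulls as shown above:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Last 3 rows before the current one, per client, ordered by month.
w = Window.partitionBy("Client").orderBy("Month").rowsBetween(-3, -1)

df_out = spark_df.withColumn(
    "Total_Sum3",
    F.sum(F.coalesce(F.col("Revenue"), F.lit(0))).over(w))
df_out.orderBy("Client", "Month").show()
The first row of each client still comes out null because its window frame is empty, which matches the output above.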

Calculate quantile on grouped data in a Spark DataFrame

I have the following Spark dataframe:
+--------+--------------+
|agent_id|payment_amount|
+--------+--------------+
|       a|          1000|
|       b|          1100|
|       a|          1100|
|       a|          1200|
|       b|          1200|
|       b|          1250|
|       a|         10000|
|       b|          9000|
+--------+--------------+
My desired output would be something like:
agent_id 95_quantile
a        whatever the 95th quantile is for agent a's payments
b        whatever the 95th quantile is for agent b's payments
For each group of agent_id I need to calculate the 0.95 quantile. I take the following approach:
test_df.groupby('agent_id').approxQuantile('payment_amount',0.95)
but I get the following error:
'GroupedData' object has no attribute 'approxQuantile'
I need to have the 0.95 quantile (percentile) in a new column so it can later be used for filtering purposes.
I am using Spark 2.0.0
One solution would be to use percentile_approx:
>>> test_df.registerTempTable("df")
>>> df2 = sqlContext.sql("select agent_id, percentile_approx(payment_amount,0.95) as approxQuantile from df group by agent_id")
>>> df2.show()
# +--------+-----------------+
# |agent_id| approxQuantile|
# +--------+-----------------+
# | a|8239.999999999998|
# | b|7449.999999999998|
# +--------+-----------------+
Note 1: This solution was tested with Spark 1.6.2 and requires a HiveContext.
Note 2: approxQuantile isn't available in Spark < 2.0 for pyspark.
Note 3: percentile_approx returns an approximate pth percentile of a numeric column (including floating point types) in the group. When the number of distinct values in col is smaller than the second argument value, this gives an exact percentile value.
EDIT: From Spark 2+, HiveContext is not required.
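For reference, a minimal sketch (not part of the original answer) of the same aggregation through the DataFrame API, calling percentile_approx via expr so no temp table or SQL string is needed on Spark 2+:
from pyspark.sql import functions as F

df2 = (test_df.groupBy("agent_id")
              .agg(F.expr("percentile_approx(payment_amount, 0.95)")
                    .alias("approxQuantile")))
df2.show()
If the quantile is needed next to every row for filtering, the result can be joined back to test_df on agent_id.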
