Spark dataframe decimal precision - apache-spark

I have one dataframe:
val groupby = df.groupBy($"column1",$"Date")
.agg(sum("amount").as("amount"))
.orderBy($"column1",desc("cob_date"))
When applying the window function to add the new column difference:
val windowspec = Window.partitionBy("column1").orderBy(desc("DATE"))
groupby.withColumn("difference", lead($"amount", 1, 0).over(windowspec)).show()
+--------+------------+-----------+--------------------------+
| Column | Date       | Amount    | Difference               |
+--------+------------+-----------+--------------------------+
| A      | 3/31/2017  | 12345.45  | 3456.540000000000000000  |
| A      | 2/28/2017  | 3456.54   | 34289.430000000000000000 |
| A      | 1/31/2017  | 34289.43  | 45673.987000000000000000 |
| A      | 12/31/2016 | 45673.987 | 0.00E+00                 |
+--------+------------+-----------+--------------------------+
I'm getting the decimal values with trailing zeros. When I ran printSchema() on the above dataframe, the datatype for difference is decimal(38,18). Can someone tell me how to change the datatype to decimal(38,2) or remove the trailing zeros?

You can cast the data to a specific decimal precision and scale, like below:
lead($"amount", 1,0).over(windowspec).cast(DataTypes.createDecimalType(32,2))

In pure SQL, you can use the well-known technique:
SELECT ceil(100 * column_name_double)/100 AS cost ...
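Run through Spark SQL from PySpark, that could look roughly like this (a sketch; the view name t is hypothetical, and ceil(100 * x)/100 rounds the value up to the nearest 0.01):
# Register the DataFrame as a temporary view, then round up to two decimal places in SQL.
df.createOrReplaceTempView("t")
spark.sql("SELECT column1, Date, ceil(100 * amount) / 100 AS amount FROM t").show()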

from pyspark.sql.types import DecimalType
df=df.withColumn(column_name, df[column_name].cast(DecimalType(10,2)))
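Putting the window, lead and cast together in PySpark (a minimal sketch; df, the column names, and the decimal(38,2) target type are assumed from the question):
from pyspark.sql import functions as F
from pyspark.sql.types import DecimalType
from pyspark.sql.window import Window

# Aggregate, then take the next row's amount within each partition and
# cast it to two decimal places so the trailing zeros disappear.
grouped = df.groupBy("column1", "Date").agg(F.sum("amount").alias("amount"))
w = Window.partitionBy("column1").orderBy(F.desc("Date"))
result = grouped.withColumn(
    "difference",
    F.lead("amount", 1, 0).over(w).cast(DecimalType(38, 2)),
)
result.printSchema()  # difference: decimal(38,2)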

Related

How to add two arrays in a dataframe array of same schema and add the values within the column for int/long values?

I have a dataframe as below.
Input dataframe -
+-------+-----+-------------------+-------------------+
| name  | Age | Marks_1           | Marks_2           |
+-------+-----+-------------------+-------------------+
| Harry |     | Physics - [50,30] | Physics - [40,40] |
|       |     | Math - [70,30]    | Math - [20,40]    |
+-------+-----+-------------------+-------------------+
Expected Output -
+-------+-----+-------------------+-------------------+-------------------+
| name  | Age | Marks_1           | Marks_2           | Marks_3           |
+-------+-----+-------------------+-------------------+-------------------+
| Harry | 25  | Physics - [50,30] | Physics - [40,40] | Physics - [90,70] |
|       |     | Math - [70,30]    | Math - [20,40]    | Math - [90,70]    |
+-------+-----+-------------------+-------------------+-------------------+
Basically I wish to sum the two array columns (Marks_1 and Marks_2), which share the same schema, and create a new column (Marks_3) holding the element-wise sum, in Spark Scala.
I tried the approach below, but it fails with a type mismatch.
df.withColumn(col("marks_3"),col("marks_1.physics") + col("marks_2.physics"))
Can anyone help with the best approach for this?
Thanks in advance!
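The question asks for Scala, but the core idea is Spark's zip_with higher-order function (available since Spark 2.4); below is a PySpark sketch that assumes Marks_1 and Marks_2 are structs holding one integer array per subject. The same zip_with expressions can be used from Scala via expr(...).
from pyspark.sql import functions as F

# Element-wise sum of the two arrays for each subject, packaged as a new struct column.
df2 = df.withColumn(
    "Marks_3",
    F.struct(
        F.expr("zip_with(Marks_1.physics, Marks_2.physics, (x, y) -> x + y)").alias("physics"),
        F.expr("zip_with(Marks_1.math, Marks_2.math, (x, y) -> x + y)").alias("math"),
    ),
)
df2.show(truncate=False)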

Explode a dataframe column of csv text into columns

+-------------+-------------------+--------------------+----------------------+
|serial_number| test_date| s3_path| table_csv_data |
+-------------+-------------------+--------------------+----------------------+
| 1050D1B0|2019-05-07 15:41:11|s3://test-bucket-...|col1,col2,col3,col4...|
| 1050D1B0|2019-05-07 15:41:11|s3://test-bucket-...|col1,col2,col3,col4...|
| 1050D1BE|2019-05-08 09:26:55|s3://test-bucket-...|col1,col2,col3,col4...|
| A0123456|2019-07-25 06:54:28|s3://test-bucket-...|col1,col2,col3,col4...|
| A0123456|2019-07-22 21:07:21|s3://test-bucket-...|col1,col2,col3,col4...|
| A0123456|2019-07-22 21:07:21|s3://test-bucket-...|col1,col2,col3,col4...|
| A0123456|2019-07-25 00:19:52|s3://test-bucket-...|col1,col2,col3,col4...|
| A0123456|2019-07-24 22:24:40|s3://test-bucket-...|col1,col2,col3,col4...|
| A0123456|2019-09-12 22:15:19|s3://test-bucket-...|col1,col2,col3,col4...|
| A0123456|2019-07-22 21:27:56|s3://test-bucket-...|col1,col2,col3,col4...|
+-------------+-------------------+--------------------+----------------------+
sample table_csv_data column contains:
timestamp,partition,offset,key,value
1625218801350,97,33009,2CKXTKAT_20210701193302_6400_UCMP,458969040
1625218801349,41,33018,3FGW9S6T_20210701193210_6400_UCMP,17569160
Trying to achieve the final dataframe as below, please help
+-------------+-------------------+--------------------+-----------------+-----------+-----------------------------------+--------------+
|serial_number| test_date| timestamp| partition | offset | key | value |
+-------------+-------------------+--------------------+-----------------+-----------+-----------------------------------+--------------+
| 1050D1B0|2019-05-07 15:41:11| 1625218801350 | 97 | 33009 | 2CKXTKAT_20210701193302_6400_UCMP | 458969040 |
| 1050D1B0|2019-05-07 15:41:11| 1625218801349 | 41 | 33018 | 3FGW9S6T_20210701193210_6400_UCMP | 17569160 |
..
..
..
I cannot think of an approach, kindly help with some suggestions.
As an alternative, I converted the string CSV data into a list using csv_reader, as below, but past that point I have been blocked:
[[timestamp,partition,offset,key,value],
[1625218801350, 97, 33009, 2CKXTKAT_20210701193302_6400_UCMP, 458969040]
[1625218801349, 41,33018, 3FGW9S6T_20210701193210_6400_UCMP, 17569160]]
You just need to use split:
from pyspark.sql import functions as F

df = df.withColumn("table_csv_data", F.split("table_csv_data", ",")).select(
    "serial_number",
    "test_date",
    F.col("table_csv_data").getItem(0).alias("timestamp"),
    F.col("table_csv_data").getItem(1).alias("partition"),
    ...  # Do the same for all the columns you need
)
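Spelling out all five fields named in the sample (a sketch; it assumes each table_csv_data cell holds exactly one comma-separated record with these fields):
from pyspark.sql import functions as F

csv_cols = ["timestamp", "partition", "offset", "key", "value"]
parts = F.split("table_csv_data", ",")

df_flat = df.select(
    "serial_number",
    "test_date",
    # One output column per CSV field, extracted by position.
    *[parts.getItem(i).alias(name) for i, name in enumerate(csv_cols)],
)
df_flat.show(truncate=False)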

Split column on condition in dataframe

The data frame I am working on has a column named "Phone" and I want to split it on / or , in a way such that I get the data frame as shown below in separate columns. For example, the first row is 0674-2537100/101 and I want to split it on "/" into two columns having the values 0674-2537100 and 0674-2537101.
Input:
+-------------------------------+
| Phone |
+-------------------------------+
| 0674-2537100/101 |
| 0674-2725627 |
| 0671 – 2647509 |
| 2392229 |
| 2586198/2583361 |
| 0663-2542855/2405168 |
| 0674 – 2563832/0674-2590796 |
| 0671-6520579/3200479 |
+-------------------------------+
Output:
+----------------+--------------+
| Phone          | Phone1       |
+----------------+--------------+
| 0674-2537100   | 0674-2537101 |
| 0674-2725627   |              |
| 0671 – 2647509 |              |
| 2392229        |              |
| 2586198        | 2583361      |
| 0663-2542855   | 0663-2405168 |
| 0674 – 2563832 | 0674-2590796 |
| 0671-6520579   | 0671-3200479 |
+----------------+--------------+
Here I came up with an approach: take the lengths of the strings on both sides of the separator (/), compute their difference, and copy the substring of the first column up to character position [:difference-1] into the second column.
So far my progress is,
df['Phone'] = df['Phone'].str.replace(' ', '')
df['Phone'] = df['Phone'].str.replace('–', '-')
df[['Phone','Phone1']] = df['Phone'].str.split("/",expand=True)
df["Phone1"].fillna(value=np.nan, inplace=True)
m2 = (df["Phone1"].str.len() < 12) & (df["Phone"].str.len() > 7)
m3 = df["Phone"].str.len() - df["Phonenew"].str.len()
df.loc[m2, "Phone1"] = df["Phone"].str[:m3-1] + df["Phonenew"]
It gives an error and the column has only NaN values after I run this. Please help me out here.
Considering you're only going to have at most two values separated by '/' in the 'Phone' column, here's what you can do:
import pandas as pd

def split_phone_number(row):
    '''
    This function takes in a row of the dataframe as input and returns the row with appropriate values.
    '''
    split_str = row['Phone'].split('/')
    # Considering that you're only going to have 2 or fewer values, update
    # the passed row's columns with appropriate values.
    if len(split_str) > 1:
        row['Phone'] = split_str[0]
        row['Phone1'] = split_str[1]
    else:
        row['Phone'] = split_str[0]
        row['Phone1'] = ''
    # Return the updated row.
    return row

# Making a dummy dataframe.
d = {'Phone': ['0674-2537100/101', '0674-257349', '0671-257349', '257349', '257349/100', '101/100', '5688343/438934']}
dataFrame = pd.DataFrame(data=d)
# Considering you're only going to have one extra column, adding that column to the dataframe.
dataFrame = dataFrame.assign(Phone1=['' for i in range(dataFrame.shape[0])])
# Applying the split_phone_number function to the dataframe.
dataFrame = dataFrame.apply(split_phone_number, axis=1)
# Printing the dataframe.
print(dataFrame)
Input:
+---------------------+
| Phone |
+---------------------+
| 0 0674-2537100/101 |
| 1 0674-257349 |
| 2 0671-257349 |
| 3 257349 |
| 4 257349/100 |
| 5 101/100 |
| 6 5688343/438934 |
+---------------------+
Output:
+----------------------------+
| Phone Phone1 |
+----------------------------+
| 0 0674-2537100 101 |
| 1 0674-257349 |
| 2 0671-257349 |
| 3 257349 |
| 4 257349 100 |
| 5 101 100 |
| 6 5688343 438934 |
+----------------------------+
For further reading:
dataframe.apply()
Hope this helps. Cheers!
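Note that the expected output in the question also restores the shared prefix (e.g. 101 becomes 0674-2537101). A minimal sketch of that extra step, assuming Phone and Phone1 have already been split as above:
import pandas as pd

def complete_prefix(row):
    phone, phone1 = row['Phone'], row['Phone1']
    # If the second number is shorter, borrow the missing leading characters
    # from the first number, e.g. '101' becomes '0674-2537101'.
    if phone1 and len(phone1) < len(phone):
        row['Phone1'] = phone[:len(phone) - len(phone1)] + phone1
    return row

df = pd.DataFrame({'Phone': ['0674-2537100', '2586198', '0663-2542855'],
                   'Phone1': ['101', '2583361', '2405168']})
print(df.apply(complete_prefix, axis=1))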

How to plot values associated to a string array in a pandas df?

I think my question is easy to solve.
I have a simple dataframe with this shape:
+------------+-----------+--------+
| Age_Group  | Gene_Name | Degree |
+------------+-----------+--------+
| pediatric  | JAK2      | 17     |
| adult      | JAK2      | 14     |
| AYA        | JAK2      | 11     |
| pediatric  | ETV6      | 52     |
| adult      | ETV6      | 7      |
| AYA        | ETV6      | 4      |
+------------+-----------+--------+
Then it continues repeating for other genes.
My goal is to plot the Degree values on the y-axis, colored by Age_Group, with the gene names on the x-axis, but I have no idea how to make the gene names suitable for a Python plotting function.
You can pivot the data frame and plot. If you want to rename gene names, that can be done beforehand using replace or map.
df.pivot(index = 'Gene_Name', columns = 'Age_Group',values = 'Degree').plot.bar()
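For reference, a minimal runnable version using the sample rows from the question (matplotlib is assumed for rendering):
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    'Age_Group': ['pediatric', 'adult', 'AYA', 'pediatric', 'adult', 'AYA'],
    'Gene_Name': ['JAK2', 'JAK2', 'JAK2', 'ETV6', 'ETV6', 'ETV6'],
    'Degree':    [17, 14, 11, 52, 7, 4],
})

# One bar group per gene on the x-axis, one colored bar per age group.
df.pivot(index='Gene_Name', columns='Age_Group', values='Degree').plot.bar()
plt.ylabel('Degree')
plt.show()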

PySpark getting distinct values over a wide range of columns

I have data with a large number of custom columns, the content of which I poorly understand. The columns are named evar1 to evar250. What I'd like to get is a single table with all distinct values, and a count how often these occur and the name of the column.
------------------------------------------------
| columnname | value            | count   |
|------------|------------------|---------|
| evar1      | en-GB            | 7654321 |
| evar1      | en-US            | 1234567 |
| evar2      | www.myclient.com | 123     |
| evar2      | app.myclient.com | 456     |
| ...
The best way I can think of doing this feels terrible, as I believe I have to read this data once per column (there are actually about 400 such columns).
import pyspark.sql.functions as fn

i = 1
df_evars = None
while i <= 30:
    colname = "evar" + str(i)
    df_temp = df.groupBy(colname).agg(fn.count("*").alias("rows"))\
        .withColumn("colName", fn.lit(colname))
    if df_evars:
        df_evars = df_evars.union(df_temp)
    else:
        df_evars = df_temp
    i += 1
display(df_evars)
Am I missing a better solution?
Update
This has been marked as a duplicate but the two responses IMO only solve part of my question.
I am looking at potentially very wide tables with a potentially large number of values. I need a simple way (i.e. 3 columns that show the source column, the value, and the count of that value in the source column).
The first of the responses only gives me an approximation of the number of distinct values. Which is pretty useless to me.
The second response seems less relevant than the first. To clarify, source data like this:
-----------------------
| evar1 | evar2 | ... |
|-------|-------|-----|
| A     | A     | ... |
| B     | A     | ... |
| B     | B     | ... |
| B     | B     | ... |
| ...
Should result in the output
--------------------------------
| columnname | value | count |
|------------|-------|-------|
| evar1      | A     | 1     |
| evar1      | B     | 3     |
| evar2      | A     | 2     |
| evar2      | B     | 2     |
| ...
Using melt borrowed from here:
from pyspark.sql.functions import col

melt(
    df.select([col(c).cast("string") for c in df.columns]),
    id_vars=[], value_vars=df.columns
).groupBy("variable", "value").count()
Adapted from the answer by user6910411.
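For context, melt is not a built-in DataFrame method in older PySpark releases (Spark 3.4+ ships DataFrame.unpivot/melt natively); the helper referenced above is typically defined along these lines, as a sketch:
from pyspark.sql.functions import array, col, explode, lit, struct

def melt(df, id_vars, value_vars, var_name="variable", value_name="value"):
    # One (column name, column value) struct per melted column.
    vars_and_vals = array(*(
        struct(lit(c).alias(var_name), col(c).alias(value_name))
        for c in value_vars
    ))
    # Explode into long format and keep the id columns alongside.
    tmp = df.withColumn("_vars_and_vals", explode(vars_and_vals))
    cols = list(id_vars) + [
        col("_vars_and_vals")[x].alias(x) for x in [var_name, value_name]
    ]
    return tmp.select(*cols)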
