How to remove scientific notation while generating CSV files as output? - azure

I have a Delta table as my input. After reading it and generating the output as a CSV file, I see scientific notation whenever a value has more than 7 digits.
E.g., the Delta table has a column value
a = 22409595, which is of double data type.
Output: the CSV is generated with
a = 2.2409595E7
I have tried all the methods I could find, such as format_number, casting, etc., but unfortunately I haven't succeeded. format_number only works when there is a single record in my output; it does not work for multiple records.
Any help on this will be appreciated ☺️ Thanks in advance.

I reproduced this and was able to remove the scientific notation by converting the columns to decimal in the dataframe.
Please follow the demonstration below:
This is my Delta table:
You can see that every column contains numbers with more than 7 digits.
This generates scientific notation in the dataframe:
Cast the columns to DecimalType in the dataframe, which lets you specify the precision.
Give the maximum number of digits as the precision. Here I have given 10, as my largest number has 10 digits.
Save this dataframe as CSV, which gives you the desired values.
My Source code:
%sql
CREATE TABLE table1 (col1 double, col2 double, col3 double);
INSERT INTO table1 VALUES (22409595, 12241226, 17161224), (191919213, 191919213, 191919213);
%python
sqldf = spark.sql("select * from table1")
from pyspark.sql.types import DecimalType

# Cast every double column to DecimalType(10, 0) so the values are no
# longer rendered in scientific notation.
for c in sqldf.columns:
    sqldf = sqldf.withColumn(c, sqldf[c].cast(DecimalType(10, 0)))
sqldf.show()
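To complete the final step mentioned above, a minimal sketch of writing the cast dataframe out as CSV; the output path is a placeholder, adjust it to your own storage mount:
%python
# Write the decimal-cast dataframe as CSV; the values keep their plain,
# non-scientific representation.
sqldf.write.mode("overwrite").option("header", "true").csv("/mnt/output/table1_csv")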

Related

Generate a 6-digit random number as a new column in my dataframe in pyspark

I have huge data in my dataframe, and now I want to add a new column containing a 6-digit random number.
I've tried lit(randrange(99999)), but it is not working as expected: it produces numbers with fewer than 6 digits, and the same static value is applied to the entire dataframe.
You can use the built-in Spark function rand() to get the desired result.
spark.sql("select ceil(rand() * 1000000)").show()
To add this as a new column, assuming df is your dataframe:
from pyspark.sql.functions import ceil, rand
df.withColumn("random6digit", ceil(rand() * 1000000))

Azure SQL: join of 2 tables with 2 unicode fields returns empty when matching records exist

I have a table with a few key columns created as nvarchar(80) => Unicode.
I can list the full dataset with a SELECT * statement (Table1) and can confirm the values I need to filter on are there.
However, I can't get any results from that table if I filter rows using alphabetic characters as input on any column.
The columns in Table1 store values in Cyrillic characters.
I know it must have to do with character encoding => what I see in the result list is not what I use as input characters.
The Unicode nvarchar type should resolve this character type mismatch automatically.
What do you suggest I do in order to get results?
Thank you very much.
Paulo
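One common cause of this symptom is that a string literal without the N prefix is converted to the database's non-Unicode code page before the comparison, which mangles Cyrillic text. A minimal sketch of checking this from Python with pyodbc; the connection string, column name and value are placeholders, not taken from the question:

import pyodbc

# Placeholder connection string.
conn = pyodbc.connect("DRIVER={ODBC Driver 18 for SQL Server};SERVER=myserver;DATABASE=mydb;UID=user;PWD=pass")
cursor = conn.cursor()

# N'...' keeps the literal as nvarchar, so Cyrillic characters survive the comparison.
cursor.execute("SELECT * FROM Table1 WHERE KeyColumn = N'Москва'")
print(cursor.fetchall())

# A parameterized query also binds the Python string as a Unicode value.
cursor.execute("SELECT * FROM Table1 WHERE KeyColumn = ?", "Москва")
print(cursor.fetchall())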

Order by ascending utcstamp not working -- missing zero from behind the numbers (Pyspark)

I need to order a PySpark SQL dataframe in ascending order of day and month. However, due to the format of the UTC stamp, this is happening:
How can I add the missing zero in front of the single-digit numbers and solve this? I'm programming in PySpark. This is the code I used:
data_grouped = data.groupby('month','day').agg(mean('parameter')).orderBy(["month", "day"], ascending=[1, 1])
data_grouped.show()
You can cast the ordering columns to integer:
import pyspark.sql.functions as F
data_grouped = data.groupby('month','day').agg(F.mean('parameter')) \
.orderBy(F.col("month").cast("int"), F.col("day").cast("int"))
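If you also want the missing leading zeros added to the values themselves (not just the sort order fixed), a small sketch using lpad, assuming month and day are string columns:
import pyspark.sql.functions as F

# Left-pad single-digit values to two characters, e.g. "2" -> "02",
# so that string ordering matches numeric ordering.
data_padded = data.withColumn("month", F.lpad("month", 2, "0")) \
                  .withColumn("day", F.lpad("day", 2, "0"))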

spark - get average of past N records excluding the current record

Given the Spark dataframe that I have:
val df = Seq(
("2019-01-01",100),
("2019-01-02",101),
("2019-01-03",102),
("2019-01-04",103),
("2019-01-05",102),
("2019-01-06",99),
("2019-01-07",98),
("2019-01-08",100),
("2019-01-09",47)
).toDF("day","records")
I want to add a new column so that on a given day I get the average of the last N records. For example, if N=3, then on a given day that value should be the average of the last 3 values EXCLUDING the current record.
For example, for day 2019-01-05, it would be (103+102+101)/3.
How can I efficiently use the over() clause to do this in Spark?
PySpark solution.
The window frame should be 3 PRECEDING AND 1 PRECEDING, which translates to positions (-3, -1) with both boundaries included.
from pyspark.sql import Window
from pyspark.sql.functions import avg

# Frame covering the previous 3 rows and excluding the current row.
w = Window.orderBy(df.day).rowsBetween(-3, -1)
df_with_rsum = df.withColumn("rsum_prev_3_days", avg(df.records).over(w))
df_with_rsum.show()
The solution assumes there is one row per date in the dataframe without missing dates in between. If not, aggregate the rows by date before applying the window function.
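For completeness, a minimal sketch that recreates the question's (Scala) dataframe in PySpark so the snippet above can be run as-is; column names follow the question:
df = spark.createDataFrame(
    [("2019-01-01", 100), ("2019-01-02", 101), ("2019-01-03", 102),
     ("2019-01-04", 103), ("2019-01-05", 102), ("2019-01-06", 99),
     ("2019-01-07", 98), ("2019-01-08", 100), ("2019-01-09", 47)],
    ["day", "records"],
)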

Read CSV file without scientific notation

I have two dataframes with one column; the values are as follows.
df1:
0000D3447E
0000D3447E39S
5052722161014
5052722161021
5052722161038
00000E2377
00000E2378
0000F2892E
0000F2892EDI1
8718934652999
8718934653095
8718934653002
8718934653118
8718934653019
8718934653125
8718934653132
00000E2387
df2:
0000D3447E
0000D3447E39S
5052722161014
5052722161021
5052722161038
00000E2377
00000E2378
0000F2892E
00000E2387
When I merge these two dataframes, I don't get the merged values which start with "00000E".
I used:
df1[['PartNumber']] = df2[['PartNumber']].astype(str)
This was meant to convert the column dtype to string, but I'm still not able to get the merged values.
My question is: how can I get the values which start with 00000E when I merge the two dataframes?
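Since values such as 00000E2377 look like scientific notation, pandas parses them as floats when the CSV is read, and converting to str afterwards cannot recover the original text. A minimal sketch of forcing the column to be read as text; the file names are placeholders:
import pandas as pd

# dtype=str keeps values such as "00000E2377" as text instead of
# letting them be parsed as scientific-notation floats.
df1 = pd.read_csv("file1.csv", dtype={"PartNumber": str})
df2 = pd.read_csv("file2.csv", dtype={"PartNumber": str})

merged = df1.merge(df2, on="PartNumber", how="inner")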
