How do I achieve the following pivot in Spark SQL? - apache-spark

I have data of the following form:
Date        User_ID  Data_Desc        Data_Value
2022-01-01  1        Submission Time  124600
2022-01-01  1        E-mail Address   john@doe
2022-01-02  2        Submission Time  142200
2022-01-02  3        Phone Number     000-000-0000
I would like to pivot the data so that the 'Data_Desc' values become columns, with the corresponding 'Data_Value' as each column's value. I'm not an expert in pivots and have only used them in textbook cases with aggregate values, so I'm having some trouble figuring out how to do this.
The key point to note is that if a particular 'Date' and 'User_ID' combination doesn't have a row for a particular field, I'd just want the value to be null. So the above example would look like:
Date        User_ID  Submission Time  E-mail Address  Phone Number
2022-01-01  1        124600           john@doe        (null)
2022-01-02  2        142200           (null)          (null)
2022-01-02  3        (null)           (null)          000-000-0000
How can this be achieved using a Spark SQL expression?
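One way to write it (a sketch, assuming the data has been registered as a temporary view named events; the view name and the spark session variable are assumptions): Spark SQL's PIVOT clause with first() as the aggregate groups implicitly on the remaining columns (Date, User_ID) and leaves missing combinations as null.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Pivot Data_Desc values into columns, taking the single Data_Value per group.
pivoted = spark.sql("""
    SELECT *
    FROM events
    PIVOT (
        first(Data_Value)
        FOR Data_Desc IN ('Submission Time', 'E-mail Address', 'Phone Number')
    )
""")
pivoted.show()

The DataFrame API equivalent would be df.groupBy("Date", "User_ID").pivot("Data_Desc").agg(first("Data_Value")); in both forms, a Date/User_ID combination with no row for a given Data_Desc comes out as null.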

Related

Replacing Null Values with Mean Value of the Column in Grid DB

So, I was working with the GridDB NodeJs Connector. I know the query to find the null values, which shows the records/rows:
SELECT * FROM employees where employee_salary = NaN;
But I want to replace the null values of the column with the mean value of the column, in order to maintain data consistency for data analysis. How do I do that in GridDB?
The Employee table looks like the following:
employee_id  employee_salary  first_name  department
-----------  ---------------  ----------  -----------
0                             John        Sales
1            60000            Lisa        Development
2            45000            Richard     Sales
3            50000            Lina        Marketing
4            55000            Anderson    Development
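Not a GridDB-specific answer, but a minimal sketch of the mean-imputation step in pandas, assuming the rows have been fetched into a DataFrame first (the df variable and the fetch/write-back steps are assumptions; the GridDB connector calls themselves are left out):

import pandas as pd

# Assumed: df holds the employees table fetched from GridDB,
# with the missing salary represented as NaN.
df = pd.DataFrame({
    "employee_id": [0, 1, 2, 3, 4],
    "employee_salary": [None, 60000, 45000, 50000, 55000],
    "first_name": ["John", "Lisa", "Richard", "Lina", "Anderson"],
    "department": ["Sales", "Development", "Sales", "Marketing", "Development"],
})

# mean() ignores NaN, so the mean is computed over the known salaries only.
mean_salary = df["employee_salary"].mean()
df["employee_salary"] = df["employee_salary"].fillna(mean_salary)
print(df)

The imputed frame can then be written back to GridDB, or the same AVG-and-update logic can be expressed in SQL if the GridDB SQL dialect supports subqueries in UPDATE.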

LEAD function with date scenario

I have multiple files, but let's consider two files which have filename and start date columns.
Start_Date  FileName
2022-01-01  product 1
2022-02-02  product 2
Please consider each row as data from a separate file.
Now I want to generate a dim table that meets the requirement below.
The first time, when I read file 1, I am looking for a dim table like this:
Start_date  End_date  file_name
2022-01-01  null      product 1
The second time, when I read file 2, I am looking for a dim table like this:
Start_date  End_date    file_name
2022-01-01  2022-02-01  product 1
2022-02-02  null        product 2
Basically, I want to change the null in the earlier row to the 2nd file's start_date - 1. Please help.
What I am planning: on the first load I insert the data using the query below, and on the second load I use the same query to insert the data first, and then I plan to update the table using the LEAD function.
select first_value(Start_date) as EFFECTIVE_START_DATE,
       null as End_date,
       first_value(Source) as SRCE_FILE_NAME
from pqdf_view
I am able to write the LEAD function, which works, but how can I update the already inserted column using the LEAD function? Using the query below I am able to create a new column, but I want to update the existing column that I already populated with the query above.
select *, LEAD(date_sub(EFFECTIVE_START_DATE,1)) OVER(ORDER BY PRODUCT_QUALITY_SK ASC) as EFFECTIVE_END_DATE from edp_silver.dim_product_quality
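One common pattern, sketched here in PySpark rather than as an in-place UPDATE, is to recompute the end dates for the whole dim table with a window function and write the result back out; with Delta Lake, a MERGE INTO would be the in-place alternative. Table and column names are taken from the question; the target table name for the rewrite is an assumption.

from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()
dim = spark.table("edp_silver.dim_product_quality")

# Next row's start date minus one day becomes this row's end date;
# the last row keeps null because lead() has no following row.
w = Window.orderBy(F.col("PRODUCT_QUALITY_SK").asc())
recomputed = dim.withColumn(
    "EFFECTIVE_END_DATE",
    F.date_sub(F.lead("EFFECTIVE_START_DATE").over(w), 1),
)

# Spark cannot overwrite a table it is reading from in the same job,
# so write to a staging table (name is an assumption) and swap afterwards.
recomputed.write.mode("overwrite").saveAsTable("edp_silver.dim_product_quality_staged")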

finding if n out of m columns are null over each row using calculated column functions in Spotfire

I have the following table and I would like to get the number of nulls for each SEQ_ID
SEQ_ID zScore for 7d zScore for 14d zScore for 21d zScore for 28d zScore for 35d
456 11.353 13.2922 9.0162 8.8533
789 8.5991 8.8244 5.7394
So for SEQ_ID 456 I would have 1 null
For SEQ_ID 789 I would have 2 nulls
Is there a way to do this without writing complicated case statements with brute-force combinations in the Calculated Column area in Spotfire?
I guess you are looking for a Spotfire custom expression not involving R.
The expression below gives you the number of columns that are not null; if you know the total number of columns, you can easily turn it into the number of null columns:
Len(RXReplace(Concatenate($map("[yourtable].$esc($csearch([yourtable],"*"))",",'-',")),'\\w+','Z','g')) -
Len(RXReplace(Concatenate($map("[yourtable].$esc($csearch([yourtable],"*"))",",'-',")),'\\w+','','g'))
[yourtable] would be the name of your data table. This acts on all columns.

lookup within date range excel

Table 1: the date and time the booking was made.
Table 2: the values remain in place until the next date/time they are modified.
What I want to know: what was the value from Table 2 at the time the booking was made?
Result: I want value '1' from Table 2, because the booking was made on 22/06/21 11:00, and at that time the value '1' from Table 2 was in place (until 23/06).
=INDEX(Table2[value],MATCH(LARGE(IF(Table2[date modified]<[@[Date Time]],Table2[date modified]),1),Table2[date modified],0))
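For reference, the same "latest value at or before the booking time" lookup can be sketched in pandas with merge_asof; the frame and column names below are made up, since the original tables are not included in the text above.

import pandas as pd

# Hypothetical stand-ins for the two tables from the question.
bookings = pd.DataFrame({
    "booking_time": pd.to_datetime(["2021-06-22 11:00"]),
})
values = pd.DataFrame({
    "modified_time": pd.to_datetime(["2021-06-20 09:00", "2021-06-23 08:00"]),
    "value": [1, 2],
})

# For each booking, take the last value row at or before booking_time,
# which is what the INDEX/MATCH/LARGE formula above does in Excel.
result = pd.merge_asof(
    bookings.sort_values("booking_time"),
    values.sort_values("modified_time"),
    left_on="booking_time",
    right_on="modified_time",
    direction="backward",
)
print(result)  # the value column holds 1 for this booking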

How to aggregate a column's dates into a list of dates per person with Python Pandas?

I have the following data, with one row per ID and DATE. A person with the same ID can occupy multiple rows, hence multiple dates. I want to aggregate it into one person (or ID) per row, with the dates aggregated into a list of dates.
From this
ID DATE
1 2012-03-04
1 2013-04-15
1 2019-01-09
2 2013-04-09
2 2016-01-01
2 2018-05-09
To this
ID DATE
1 [2012-03-04, 2013-04-15, 2019-01-09]
2 [2013-04-09, 2016-01-01, 2018-05-09]
Here is my attempt
df.sort_values(by=['ID', 'DATE'], ascending=True, inplace=True)
df = df[['ID', 'DATE']]
df_pivot = df.groupby('ID').aggregate(lambda tdf: tdf.unique().tolist())
df_pivot = pd.DataFrame(df_pivot.to_records())
The problem is it returns something like this
ID DATE
1 [1375228800000000000, 1411948800000000000, 1484524800000000000]
2 [1524528000000000000, 1529539200000000000, 1529542200000000000]
What kind of date format is this? I can't seem to find the right function to convert it back to the typical date format.
If you need unique values in the lists, use DataFrame.drop_duplicates before aggregating into lists:
df = (df.sort_values(by=['ID', 'DATE'], ascending=True)
.drop_duplicates(['ID', 'DATE'])
.groupby('ID')['DATE']
.agg(list))
Your solution should also work, but it is slow:
df_pivot = df.groupby('ID')['DATE'].aggregate(lambda tdf: tdf.drop_duplicates().tolist())
What kind of date format is this?
These are native datetimes, also called Unix timestamps, expressed in nanoseconds.
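If the column has already ended up as those integers, one way back (a sketch, assuming pandas) is pd.to_datetime with unit="ns":

import pandas as pd

# One of the integer lists from the output above, read as nanoseconds since the Unix epoch.
ns_values = [1375228800000000000, 1411948800000000000, 1484524800000000000]
dates = pd.to_datetime(pd.Series(ns_values), unit="ns")
print(dates.dt.date.tolist())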
There are many ways; agg is preferred because apply can be very slow:
df.groupby('ID')['DATE'].agg(list)
Or
df.groupby('ID')['DATE'].apply(lambda x: x.to_list())
Simply use the groupby() and apply() methods:
result=df.groupby('ID')['DATE'].apply(list)
OR
result=df.groupby('ID')['DATE'].agg(list)
Now if you print result, you will get your desired output:
ID
1 [ 2012-03-04, 2013-04-15, 2019-01-09]
2 [ 2013-04-09, 2016-01-01, 2018-05-09]
Name: DATE, dtype: object
The above code gives you a Series. If you want a DataFrame, then use:
result=df.groupby('ID')['DATE'].apply(list).reset_index()
