How to insert a value into a column using lit in a Spark dataframe? - apache-spark

I have a Spark dataframe. I am trying to insert a value into a new column using lit, but the value is not being inserted.
Example:
I am trying the code below:
df:
+--------------------+----------+---------+
| Programname|Projectnum| Drug|
+--------------------+----------+---------+
|Non-Oncology Phar...|SR0480-000|Invokamet|
+--------------------+----------+---------+
from pyspark.sql.functions import lit
df=df.withColumn("CDE_rec_crdt_dt", lit([str(x.CDE_rec_crdt_dt) for x in df_active.select('CDE_rec_crdt_dt').distinct().collect()][0]))
The value of
[str(x.CDE_rec_crdt_dt) for x in df_active.select('CDE_rec_crdt_dt').distinct().collect()][0]
is '2020-12-03'.
Desired output:
df:
+--------------------+----------+---------+----------------+
| Programname|Projectnum| Drug|CDE_rec_crdt_dt |
+--------------------+----------+---------+----------------+
|Non-Oncology Phar...|SR0480-000|Invokamet|2020-12-03 |
+--------------------+----------+---------+----------------+

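Collect the distinct date to the driver first, then pass the resulting Python string to lit(), which creates a constant column: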
val = str(df_active.select('CDE_rec_crdt_dt').distinct().collect()[0][0])
df = df.withColumn(
    "CDE_rec_crdt_dt",
    lit(val)
)
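If there is exactly one distinct date, first() is a slightly lighter way to fetch it than collect() (a minimal sketch reusing df_active from the question):
val = str(df_active.select('CDE_rec_crdt_dt').distinct().first()[0])
df = df.withColumn("CDE_rec_crdt_dt", lit(val))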

Related

Split column with JSON string to columns each containing one key-value pair from the string

I have a data frame that looks like this (one column named "value" with a JSON string in it). I send it to an Event Hub using the Kafka API and then I want to read that data from the Event Hub and apply some transformations to it. The data is received in binary format, as described in the Kafka documentation.
Here are a few rows in CSV format:
value
"{""id"":""e52f247c-f46c-4021-bc62-e28e56db1ad8"",""latitude"":""34.5016064725731"",""longitude"":""123.43996453687777""}"
"{""id"":""32782100-9b59-49c7-9d56-bb4dfc368a86"",""latitude"":""49.938541626415144"",""longitude"":""111.88360885971986""}"
"{""id"":""a72a600f-2b99-4c41-a388-9a24c00545c0"",""latitude"":""4.988768300413497"",""longitude"":""-141.92727675177588""}"
"{""id"":""5a5f056a-cdfd-4957-8e84-4d5271253509"",""latitude"":""41.802942545247134"",""longitude"":""90.45164573613573""}"
"{""id"":""d00d0926-46eb-45dd-9e35-ab765804340d"",""latitude"":""70.60161063520081"",""longitude"":""20.566520665122482""}"
"{""id"":""dda14397-6922-4bb6-9be3-a1546f08169d"",""latitude"":""68.400462882435"",""longitude"":""135.7167027587489""}"
"{""id"":""c7f13b8a-3468-4bc6-9db4-e0b1b34bf9ea"",""latitude"":""26.04757722355835"",""longitude"":""175.20227554031783""}"
"{""id"":""97f8f1cf-3aa0-49bb-b3d5-05b736e0c883"",""latitude"":""35.52624182094499"",""longitude"":""-164.18066699972852""}"
"{""id"":""6bed49bc-ee93-4ed9-893f-4f51c7b7f703"",""latitude"":""-24.319581484353847"",""longitude"":""85.27338980948076""}"
What I want to do is apply a transformation and create a data frame with 3 columns: one with id, one with latitude and one with longitude.
This is what I tried but the result is not what I expected:
from pyspark.sql.types import StructType
from pyspark.sql.functions import from_json
from pyspark.sql import functions as F
# df is the data frame received from Kafka
location_schema = StructType().add("id", "string").add("latitude", "float").add("longitude", "float")
string_df = df.selectExpr("CAST(value AS STRING)").withColumn("value", from_json(F.col("value"), location_schema))
string_df.printSchema()
string_df.show()
And this is the result:
It created a single "value" column with a struct as its value. Any idea how to obtain 3 different columns, as I described?
Your df:
df = spark.createDataFrame(
[
(1, '{"id":"e52f247c-f46c-4021-bc62-e28e56db1ad8","latitude":"34.5016064725731","longitude":"123.43996453687777"}'),
(2, '{"id":"32782100-9b59-49c7-9d56-bb4dfc368a86","latitude":"49.938541626415144","longitude":"111.88360885971986"}'),
(3, '{"id":"a72a600f-2b99-4c41-a388-9a24c00545c0","latitude":"4.988768300413497","longitude":"-141.92727675177588"}'),
(4, '{"id":"5a5f056a-cdfd-4957-8e84-4d5271253509","latitude":"41.802942545247134","longitude":"90.45164573613573"}'),
(5, '{"id":"d00d0926-46eb-45dd-9e35-ab765804340d","latitude":"70.60161063520081","longitude":"20.566520665122482"}'),
(6, '{"id":"dda14397-6922-4bb6-9be3-a1546f08169d","latitude":"68.400462882435","longitude":"135.7167027587489"}'),
(7, '{"id":"c7f13b8a-3468-4bc6-9db4-e0b1b34bf9ea","latitude":"26.04757722355835","longitude":"175.20227554031783"}'),
(8, '{"id":"97f8f1cf-3aa0-49bb-b3d5-05b736e0c883","latitude":"35.52624182094499","longitude":"-164.18066699972852"}'),
(9, '{"id":"6bed49bc-ee93-4ed9-893f-4f51c7b7f703","latitude":"-24.319581484353847","longitude":"85.27338980948076"}')
],
['id', 'value']
).drop('id')
+--------------------------------------------------------------------------------------------------------------+
|value |
+--------------------------------------------------------------------------------------------------------------+
|{"id":"e52f247c-f46c-4021-bc62-e28e56db1ad8","latitude":"34.5016064725731","longitude":"123.43996453687777"} |
|{"id":"32782100-9b59-49c7-9d56-bb4dfc368a86","latitude":"49.938541626415144","longitude":"111.88360885971986"}|
|{"id":"a72a600f-2b99-4c41-a388-9a24c00545c0","latitude":"4.988768300413497","longitude":"-141.92727675177588"}|
|{"id":"5a5f056a-cdfd-4957-8e84-4d5271253509","latitude":"41.802942545247134","longitude":"90.45164573613573"} |
|{"id":"d00d0926-46eb-45dd-9e35-ab765804340d","latitude":"70.60161063520081","longitude":"20.566520665122482"} |
|{"id":"dda14397-6922-4bb6-9be3-a1546f08169d","latitude":"68.400462882435","longitude":"135.7167027587489"} |
|{"id":"c7f13b8a-3468-4bc6-9db4-e0b1b34bf9ea","latitude":"26.04757722355835","longitude":"175.20227554031783"} |
|{"id":"97f8f1cf-3aa0-49bb-b3d5-05b736e0c883","latitude":"35.52624182094499","longitude":"-164.18066699972852"}|
|{"id":"6bed49bc-ee93-4ed9-893f-4f51c7b7f703","latitude":"-24.319581484353847","longitude":"85.27338980948076"}|
+--------------------------------------------------------------------------------------------------------------+
Then:
from pyspark.sql import functions as F
from pyspark.sql.types import *
json_schema = StructType([
StructField("id", StringType(), True),
StructField("latitude", FloatType(), True),
StructField("longitude", FloatType(), True)
])
df\
.withColumn('json', F.from_json(F.col('value'), json_schema))\
.select(F.col('json').getItem('id').alias('id'),
F.col('json').getItem('latitude').alias('latitude'),
F.col('json').getItem('longitude').alias('longitude')
)\
.show(truncate=False)
+------------------------------------+-------------------+-------------------+
|id |latitude |longitude |
+------------------------------------+-------------------+-------------------+
|e52f247c-f46c-4021-bc62-e28e56db1ad8|34.5016064725731 |123.43996453687777 |
|32782100-9b59-49c7-9d56-bb4dfc368a86|49.938541626415144 |111.88360885971986 |
|a72a600f-2b99-4c41-a388-9a24c00545c0|4.988768300413497 |-141.92727675177588|
|5a5f056a-cdfd-4957-8e84-4d5271253509|41.802942545247134 |90.45164573613573 |
|d00d0926-46eb-45dd-9e35-ab765804340d|70.60161063520081 |20.566520665122482 |
|dda14397-6922-4bb6-9be3-a1546f08169d|68.400462882435 |135.7167027587489 |
|c7f13b8a-3468-4bc6-9db4-e0b1b34bf9ea|26.04757722355835 |175.20227554031783 |
|97f8f1cf-3aa0-49bb-b3d5-05b736e0c883|35.52624182094499 |-164.18066699972852|
|6bed49bc-ee93-4ed9-893f-4f51c7b7f703|-24.319581484353847|85.27338980948076 |
+------------------------------------+-------------------+-------------------+
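If you would rather not list every field, the parsed struct can also be expanded in one go with a star select (same json_schema as above):
df\
    .withColumn('json', F.from_json(F.col('value'), json_schema))\
    .select('json.*')\
    .show(truncate=False)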
If the pattern remains unchanged, then you can use regexp_replace():
>>> df = spark.read.option("header",False).option("inferSchema",True).csv("/dir1/dir2/Sample2.csv")
>>> df.show(truncate=False)
+-------------------------------------------------+------------------------------------+---------------------------------------+
|_c0 |_c1 |_c2 |
+-------------------------------------------------+------------------------------------+---------------------------------------+
|"{""id"":""e52f247c-f46c-4021-bc62-e28e56db1ad8""|""latitude"":""34.5016064725731"" |""longitude"":""123.43996453687777""}" |
|"{""id"":""32782100-9b59-49c7-9d56-bb4dfc368a86""|""latitude"":""49.938541626415144"" |""longitude"":""111.88360885971986""}" |
|"{""id"":""a72a600f-2b99-4c41-a388-9a24c00545c0""|""latitude"":""4.988768300413497"" |""longitude"":""-141.92727675177588""}"|
|"{""id"":""5a5f056a-cdfd-4957-8e84-4d5271253509""|""latitude"":""41.802942545247134"" |""longitude"":""90.45164573613573""}" |
|"{""id"":""d00d0926-46eb-45dd-9e35-ab765804340d""|""latitude"":""70.60161063520081"" |""longitude"":""20.566520665122482""}" |
|"{""id"":""dda14397-6922-4bb6-9be3-a1546f08169d""|""latitude"":""68.400462882435"" |""longitude"":""135.7167027587489""}" |
|"{""id"":""c7f13b8a-3468-4bc6-9db4-e0b1b34bf9ea""|""latitude"":""26.04757722355835"" |""longitude"":""175.20227554031783""}" |
|"{""id"":""97f8f1cf-3aa0-49bb-b3d5-05b736e0c883""|""latitude"":""35.52624182094499"" |""longitude"":""-164.18066699972852""}"|
|"{""id"":""6bed49bc-ee93-4ed9-893f-4f51c7b7f703""|""latitude"":""-24.319581484353847""|""longitude"":""85.27338980948076""}" |
+-------------------------------------------------+------------------------------------+---------------------------------------+
>>> from pyspark.sql.functions import regexp_replace
>>> df.withColumn("id", regexp_replace('_c0','\"\{\"\"id\"\":\"\"','')) \
...   .withColumn("id", regexp_replace('id','\"\"','')) \
...   .withColumn("latitude", regexp_replace('_c1','\"\"latitude\"\":\"\"','')) \
...   .withColumn("latitude", regexp_replace('latitude','\"\"','')) \
...   .withColumn("longitude", regexp_replace('_c2','\"\"longitude\"\":\"\"','')) \
...   .withColumn("longitude", regexp_replace('longitude','\"\"\}\"','')) \
...   .drop("_c0").drop("_c1").drop("_c2").show()
+--------------------+-------------------+-------------------+
| id| latitude| longitude|
+--------------------+-------------------+-------------------+
|e52f247c-f46c-402...| 34.5016064725731| 123.43996453687777|
|32782100-9b59-49c...| 49.938541626415144| 111.88360885971986|
|a72a600f-2b99-4c4...| 4.988768300413497|-141.92727675177588|
|5a5f056a-cdfd-495...| 41.802942545247134| 90.45164573613573|
|d00d0926-46eb-45d...| 70.60161063520081| 20.566520665122482|
|dda14397-6922-4bb...| 68.400462882435| 135.7167027587489|
|c7f13b8a-3468-4bc...| 26.04757722355835| 175.20227554031783|
|97f8f1cf-3aa0-49b...| 35.52624182094499|-164.18066699972852|
|6bed49bc-ee93-4ed...|-24.319581484353847| 85.27338980948076|
+--------------------+-------------------+-------------------+
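An alternative that avoids the chain of replacements is regexp_extract with a capture group (a sketch against the same _c0/_c1/_c2 columns; the patterns are assumptions based on the sample above):
>>> from pyspark.sql.functions import regexp_extract
>>> df.select(regexp_extract('_c0', r'""id"":""([^"]+)""', 1).alias('id'),
...           regexp_extract('_c1', r'""latitude"":""([^"]+)""', 1).alias('latitude'),
...           regexp_extract('_c2', r'""longitude"":""([^"]+)""', 1).alias('longitude')).show()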
You can use json_tuple to extract values from the JSON string.
Input:
from pyspark.sql import functions as F
df = spark.createDataFrame(
[('{"id":"e52f247c-f46c-4021-bc62-e28e56db1ad8","latitude":"34.5016064725731","longitude":"123.43996453687777"}',)],
['value'])
Script:
cols = ['id', 'latitude', 'longitude']
df = df.select(F.json_tuple('value', *cols)).toDF(*cols)
df.show(truncate=0)
# +------------------------------------+----------------+------------------+
# |id |latitude |longitude |
# +------------------------------------+----------------+------------------+
# |e52f247c-f46c-4021-bc62-e28e56db1ad8|34.5016064725731|123.43996453687777|
# +------------------------------------+----------------+------------------+
If needed, cast to double:
df = df.withColumn('latitude', F.col('latitude').cast('double')) \
       .withColumn('longitude', F.col('longitude').cast('double'))
It's easy to extract the JSON string into columns using inline and from_json:
df = spark.createDataFrame(
[('{"id":"e52f247c-f46c-4021-bc62-e28e56db1ad8","latitude":"34.5016064725731","longitude":"123.43996453687777"}',)],
['value'])
df = df.selectExpr(
"inline(array(from_json(value, 'struct<id:string, latitude:string, longitude:string>')))"
)
df.show(truncate=0)
# +------------------------------------+----------------+------------------+
# |id |latitude |longitude |
# +------------------------------------+----------------+------------------+
# |e52f247c-f46c-4021-bc62-e28e56db1ad8|34.5016064725731|123.43996453687777|
# +------------------------------------+----------------+------------------+
I used the sample data provided, created a dataframe called df and proceeded to use the same method as you.
The following image shows the rows present inside the df dataframe.
The fields are not displayed as required because of their datatype. The values for latitude and longitude are present as string types in the dataframe df, but while creating the schema location_schema you have specified their type as float. Instead, try changing their type to string and later convert them to double type. The code looks as shown below:
location_schema = StructType().add("id", "string").add("latitude", "string").add("longitude", "string")
string_df = df.selectExpr('CAST(value AS STRING)').withColumn("value", from_json(F.col("value"), location_schema))
string_df.printSchema()
string_df.show(truncate=False)
Now, using DataFrame.withColumn(), Column.withField() and cast(), convert the string-typed fields latitude and longitude to DoubleType.
from pyspark.sql.functions import col
from pyspark.sql.types import DoubleType
string_df = string_df.withColumn("value", col("value").withField("latitude", col("value.latitude").cast(DoubleType())))\
    .withColumn("value", col("value").withField("longitude", col("value.longitude").cast(DoubleType())))
string_df.printSchema()
string_df.show(truncate=False)
So you can get the desired output.
Update:
To get separate columns you can simply use the json_tuple() method. Refer to the official Spark documentation:
pyspark.sql.functions.json_tuple — PySpark 3.3.0 documentation (apache.org)
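For example, a minimal sketch on the raw value column (before applying from_json):
from pyspark.sql import functions as F
cols = ['id', 'latitude', 'longitude']
raw_df = df.selectExpr("CAST(value AS STRING)")
raw_df.select(F.json_tuple('value', *cols)).toDF(*cols).show(truncate=False)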

How to extract time from timestamp in pyspark?

I have a requirement to extract the time from a timestamp (this is a column in a dataframe) using pyspark.
Let's say the timestamp is 2019-01-03T18:21:39; I want to extract only the time "18:21:39" so that it always appears in the format "01:01:01".
from pyspark.sql.types import StringType, TimestampType
df = spark.createDataFrame(["2020-06-17T00:44:30","2020-06-17T06:06:56","2020-06-17T15:04:34"], StringType()).toDF('datetime')
df = df.select(df['datetime'].cast(TimestampType()))
I tried the following but did not get the expected result:
from pyspark.sql.functions import concat, hour, minute, second, lit
df1 = df.withColumn('time', concat(hour(df['datetime']), lit(":"), minute(df['datetime']), lit(":"), second(df['datetime'])))
display(df1)
+-------------------+-------+
| datetime| time|
+-------------------+-------+
|2020-06-17 00:44:30|0:44:30|
|2020-06-17 06:06:56| 6:6:56|
|2020-06-17 15:04:34|15:4:34|
+-------------------+-------+
My results are like 6:6:56, but I want them to be 06:06:56.
Use the date_format function.
from pyspark.sql.types import StringType
df = spark \
.createDataFrame(["2020-06-17T00:44:30","2020-06-17T06:06:56","2020-06-17T15:04:34"], StringType()) \
.toDF('datetime')
from pyspark.sql.functions import date_format
q = df.withColumn('time', date_format('datetime', 'HH:mm:ss'))
>>> q.show()
+-------------------+--------+
| datetime| time|
+-------------------+--------+
|2020-06-17T00:44:30|00:44:30|
|2020-06-17T06:06:56|06:06:56|
|2020-06-17T15:04:34|15:04:34|
+-------------------+--------+
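If datetime has already been cast to TimestampType, as in the question, the same call works unchanged (a minimal sketch):
from pyspark.sql.functions import date_format
from pyspark.sql.types import TimestampType
q = df.select(df['datetime'].cast(TimestampType())).withColumn('time', date_format('datetime', 'HH:mm:ss'))
q.show()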

Pyspark Change String Order

I have the dataframe below:
Leadtime
303400
333430
1234111
2356788
258
I padded all the strings in the data to 7 digits.
from pyspark.sql.functions import udf
filler = udf(lambda x: str(x).zfill(7))
df = df.withColumn('Leadtime', filler('Leadtime'))
The output is:
Leadtime
0303400
0333430
1234111
2356788
0000258
After that,
I want to write a method that moves the first character of each string to the end, as follows:
Leadtime
3034000
3334300
2341111
3567882
0002580
Could you please help me with this?
You can select a substring with substr and concatenate strings with concat:
# change string order
import pyspark.sql.functions as F

l = [('303400',),
     ('333430',),
     ('1234111',),
     ('2356788',),
     ('258',)]
df = spark.createDataFrame(l, ['Leadtime'])
filler = F.udf(lambda x: str(x).zfill(7))
df = df.withColumn('Leadtime', filler('Leadtime'))
df.withColumn('Leadtime', F.concat(df.Leadtime.substr(2, 6), df.Leadtime.substr(1, 1))).show()
Output:
+--------+
|Leadtime|
+--------+
| 3034000|
| 3334300|
| 2341111|
| 3567882|
| 0002580|
+--------+
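As a side note, the zero-padding can also be done without a UDF by using the built-in lpad function (a sketch on the same sample data):
df = spark.createDataFrame(l, ['Leadtime'])
df = df.withColumn('Leadtime', F.lpad('Leadtime', 7, '0'))
df.withColumn('Leadtime', F.concat(df.Leadtime.substr(2, 6), df.Leadtime.substr(1, 1))).show()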

Spark order by second field to perform timeseries function

I have a csv with a timeseries:
timestamp, measure-name, value, type, quality
1503377580,x.x-2.A,0.5281250,Float,GOOD
1503377340,x.x-1.B,0.0000000,Float,GOOD
1503377400,x.x-1.B,0.0000000,Float,GOOD
The measure-name should be my partition key and I would like to calculate a moving average with pyspark. Here is my code (for instance) to calculate the max:
def mysplit(line):
    ll = line.split(",")
    return (ll[1], float(ll[2]))

text_file.map(lambda line: mysplit(line)).reduceByKey(lambda a, b: max(a, b)).foreach(print)
However, for the average I would like to respect the timestamp ordering.
How to order by a second column?
You need to use a window function on pyspark dataframes:
First you should transform your rdd to a dataframe:
from pyspark.sql import HiveContext
hc = HiveContext(sc)
df = hc.createDataFrame(text_file.map(lambda l: l.split(',')), ['timestamp', 'measure-name', 'value', 'type', 'quality'])
Or load it directly as a dataframe:
local:
import pandas as pd
df = hc.createDataFrame(pd.read_csv(path_to_csv, sep=",", header=0))
from hdfs:
df = hc.read.format("com.databricks.spark.csv").option("delimiter", ",").load(path_to_csv)
Then use a window function:
from pyspark.sql import Window
import pyspark.sql.functions as psf
w = Window.orderBy('timestamp')
df.withColumn('value_rol_mean', psf.mean('value').over(w))
+----------+------------+--------+-----+-------+-------------------+
| timestamp|measure_name| value| type|quality| value_rol_mean|
+----------+------------+--------+-----+-------+-------------------+
|1503377340| x.x-1.B| 0.0|Float| GOOD| 0.0|
|1503377400| x.x-1.B| 0.0|Float| GOOD| 0.0|
|1503377580| x.x-2.A|0.528125|Float| GOOD|0.17604166666666665|
+----------+------------+--------+-----+-------+-------------------+
In .orderBy you can order by as many columns as you want.
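If the moving average should be computed per measure-name, as the question suggests, you can add a partition and a bounded row frame to the window (a sketch; the 3-row frame is an assumption):
w = Window.partitionBy('measure-name').orderBy('timestamp').rowsBetween(-2, 0)
df.withColumn('value_mov_avg', psf.mean('value').over(w)).show()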

Spark: Merge 2 dataframes by adding row index/number on both dataframes

Q: Is there any way to merge two dataframes or copy a column of one dataframe to another in PySpark?
For example, I have two Dataframes:
DF1
C1 C2
23397414 20875.7353
5213970 20497.5582
41323308 20935.7956
123276113 18884.0477
76456078 18389.9269
the second dataframe
DF2
C3 C4
2008-02-04 262.00
2008-02-05 257.25
2008-02-06 262.75
2008-02-07 237.00
2008-02-08 231.00
Then I want to add C3 of DF2 to DF1 like this:
New DF
C1 C2 C3
23397414 20875.7353 2008-02-04
5213970 20497.5582 2008-02-05
41323308 20935.7956 2008-02-06
123276113 18884.0477 2008-02-07
76456078 18389.9269 2008-02-08
I hope this example was clear.
A row number + window function (solution 1) or zipWithIndex.map (solution 2) should help in this case.
Solution 1: You can use window functions to get this kind of result.
I would suggest adding a row number as an additional column, say columnindex, to dataframe df1.
DF1
C1 C2 columnindex
23397414 20875.7353 1
5213970 20497.5582 2
41323308 20935.7956 3
123276113 18884.0477 4
76456078 18389.9269 5
the second dataframe
DF2
C3 C4 columnindex
2008-02-04 262.00 1
2008-02-05 257.25 2
2008-02-06 262.75 3
2008-02-07 237.00 4
2008-02-08 231.00 5
Now do an inner join of df1 and df2, that's all.
You will get the output below; the code looks something like this:
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number  # called rowNumber() in very old Spark releases
w = Window().orderBy()  # note: recent Spark versions require an ordering column here
df1 = ....  # as shown above (df1)
df2 = ....  # as shown above (df2)
df11 = df1.withColumn("columnindex", row_number().over(w))
df22 = df2.withColumn("columnindex", row_number().over(w))
newDF = df11.join(df22, df11.columnindex == df22.columnindex, 'inner').drop(df22.columnindex)
newDF.show()
New DF
C1 C2 C3
23397414 20875.7353 2008-02-04
5213970 20497.5582 2008-02-05
41323308 20935.7956 2008-02-06
123276113 18884.0477 2008-02-07
76456078 18389.9269 2008-02-08
Solution 2: Another good way (probably the best :)) is in Scala, which you can translate to pyspark:
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

/**
 * Add Column Index to dataframe
 */
def addColumnIndex(df: DataFrame) = sqlContext.createDataFrame(
  // Add Column index
  df.rdd.zipWithIndex.map { case (row, columnindex) => Row.fromSeq(row.toSeq :+ columnindex) },
  // Create schema
  StructType(df.schema.fields :+ StructField("columnindex", LongType, false))
)
// Add index now...
val df1WithIndex = addColumnIndex(df1)
val df2WithIndex = addColumnIndex(df2)
// Now time to join ...
val newone = df1WithIndex
.join(df2WithIndex , Seq("columnindex"))
.drop("columnindex")
I thought I would share the Python (pyspark) translation of solution 2 above from @Ram Ghadiyaram (note that it uses Python 2 syntax; a Python 3 version follows below):
from pyspark.sql.functions import col
def addColumnIndex(df):
# Create new column names
oldColumns = df.schema.names
newColumns = oldColumns + ["columnindex"]
# Add Column index
df_indexed = df.rdd.zipWithIndex().map(lambda (row, columnindex): \
row + (columnindex,)).toDF()
#Rename all the columns
new_df = reduce(lambda data, idx: data.withColumnRenamed(oldColumns[idx],
newColumns[idx]), xrange(len(oldColumns)), df_indexed)
return new_df
# Add index now...
df1WithIndex = addColumnIndex(df1)
df2WithIndex = addColumnIndex(df2)
#Now time to join ...
newone = df1WithIndex.join(df2WithIndex, "columnindex", 'inner').drop("columnindex")
For the Python 3 version:
from pyspark.sql.types import StructType, StructField, LongType
def with_column_index(sdf):
new_schema = StructType(sdf.schema.fields + [StructField("ColumnIndex", LongType(), False),])
return sdf.rdd.zipWithIndex().map(lambda row: row[0] + (row[1],)).toDF(schema=new_schema)
df1_ci = with_column_index(df1)
df2_ci = with_column_index(df2)
join_on_index = df1_ci.join(df2_ci, df1_ci.ColumnIndex == df2_ci.ColumnIndex, 'inner').drop("ColumnIndex")
I referred to his (@Jed) answer:
from pyspark.sql.functions import col
def addColumnIndex(df):
# Get old columns names and add a column "columnindex"
oldColumns = df.columns
newColumns = oldColumns + ["columnindex"]
# Add Column index
df_indexed = df.rdd.zipWithIndex().map(lambda (row, columnindex): \
row + (columnindex,)).toDF()
#Rename all the columns
oldColumns = df_indexed.columns
new_df = reduce(lambda data, idx:data.withColumnRenamed(oldColumns[idx],
newColumns[idx]), xrange(len(oldColumns)), df_indexed)
return new_df
# Add index now...
df1WithIndex = addColumnIndex(df1)
df2WithIndex = addColumnIndex(df2)
#Now time to join ...
newone = df1WithIndex.join(df2WithIndex, "columnindex", 'inner').drop("columnindex")
This answer solved it for me:
import pyspark.sql.functions as sparkf
# This will return a new DF with all the columns + id
res = df.withColumn('id', sparkf.monotonically_increasing_id())
Credit to Arkadi T
Here is a simple example that can help you even if you have already solved the issue.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.StructType

// create the first dataframe
val df1 = spark.sparkContext.parallelize(Seq(1,2,1)).toDF("lavel1")
// create the second dataframe
val df2 = spark.sparkContext.parallelize(Seq((1.0, 12.1), (12.1, 1.3), (1.1, 0.3))).toDF("f1", "f2")
// combine both dataframes
val combinedRow = df1.rdd.zip(df2.rdd).map({
  // convert both rows to Seq, concatenate them and return a single Row
  case (df1Data, df2Data) => Row.fromSeq(df1Data.toSeq ++ df2Data.toSeq)
})
// create a new schema from both dataframes' schemas
val combinedschema = StructType(df1.schema.fields ++ df2.schema.fields)
// create a new dataframe from the new rows and new schema
val finalDF = spark.sqlContext.createDataFrame(combinedRow, combinedschema)
finalDF.show
Expanding on Jed's answer, in response to Ajinkya's comment:
To get the same old column names, you need to replace "old_cols" with a column list of the newly named indexed columns. See my modified version of the function below
def add_column_index(df):
new_cols = df.schema.names + ['ix']
ix_df = df.rdd.zipWithIndex().map(lambda (row, ix): row + (ix,)).toDF()
tmp_cols = ix_df.schema.names
return reduce(lambda data, idx: data.withColumnRenamed(tmp_cols[idx], new_cols[idx]), xrange(len(tmp_cols)), ix_df)
Not the best way performance-wise:
df3 = df1.crossJoin(df2)
df3.show(3)
To merge columns from two different dataframes you first have to create a column index and then join the two dataframes. Indeed, two dataframes are similar to two SQL tables; to make a connection you have to join them.
If you don't care about the final order of the rows you can generate the index column with monotonically_increasing_id().
Using the following code you can check that monotonically_increasing_id generates the same index column in both dataframes (at least up to a billion rows), so you won't have any error in the merged dataframe.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
sample_size = int(1E9)
sdf1 = spark.range(1, sample_size).select(F.col("id").alias("id1"))
sdf2 = spark.range(1, sample_size).select(F.col("id").alias("id2"))
sdf1 = sdf1.withColumn("idx", F.monotonically_increasing_id())
sdf2 = sdf2.withColumn("idx", F.monotonically_increasing_id())
sdf3 = sdf1.join(sdf2, 'idx', 'inner')
sdf3 = sdf3.withColumn("diff", F.col("id1") - F.col("id2")).select("diff")
sdf3.filter(F.col("diff") != 0).show()
You can use a combination of monotonically_increasing_id (guaranteed to always be increasing) and row_number (guaranteed to always give the same sequence). You cannot use row_number alone because it needs to be ordered by something. So here we order by monotonically_increasing_id. I am using Spark 2.3.1 and Python 2.7.13.
from pandas import DataFrame
from pyspark.sql.functions import (
monotonically_increasing_id,
row_number)
from pyspark.sql import Window
DF1 = spark.createDataFrame(DataFrame({
'C1': [23397414, 5213970, 41323308, 123276113, 76456078],
'C2': [20875.7353, 20497.5582, 20935.7956, 18884.0477, 18389.9269]}))
DF2 = spark.createDataFrame(DataFrame({
'C3':['2008-02-04', '2008-02-05', '2008-02-06', '2008-02-07', '2008-02-08']}))
DF1_idx = (
DF1
.withColumn('id', monotonically_increasing_id())
.withColumn('columnindex', row_number().over(Window.orderBy('id')))
.select('columnindex', 'C1', 'C2'))
DF2_idx = (
DF2
.withColumn('id', monotonically_increasing_id())
.withColumn('columnindex', row_number().over(Window.orderBy('id')))
.select('columnindex', 'C3'))
DF_complete = (
DF1_idx
.join(
other=DF2_idx,
on=['columnindex'],
how='inner')
.select('C1', 'C2', 'C3'))
DF_complete.show()
+---------+----------+----------+
| C1| C2| C3|
+---------+----------+----------+
| 23397414|20875.7353|2008-02-04|
| 5213970|20497.5582|2008-02-05|
| 41323308|20935.7956|2008-02-06|
|123276113|18884.0477|2008-02-07|
| 76456078|18389.9269|2008-02-08|
+---------+----------+----------+
