Remove duplicated fields in RDD from input data - apache-spark

In my input CSV data, some rows contain repeated fields and some are missing fields. From this data I want to remove the duplicate fields from each row, and then every row should contain all the fields, with NULL as the value wherever a row does not have that field.

Try this:
def transform(line):
    """
    >>> s = 'id:111|name:dave|age:33|city:london'
    >>> transform(s)
    ('id:111', {'age': '33', 'name': 'dave', 'city': 'london'})
    """
    bits = line.split("|")
    key = bits[0]
    pairs = [v.split(":") for v in bits[1:]]
    return key, {kv[0].strip(): kv[1].strip() for kv in pairs if len(kv) == 2}

rdd = (sc
    .textFile("/tmp/sample")
    .map(transform))
Find keys:
keys = rdd.values().flatMap(lambda d: d.keys()).distinct().collect()
Create data frame:
df = rdd.toDF(["id", "map"])
And expand:
df.select(["id"] + [df["map"][k] for k in keys]).show()

So I assume that you already have an RDD from the text file. I create one here:
rdd = spark.sparkContext.parallelize([(u'id:111', u'name:dave', u'dept:marketing', u'age:33', u'city:london'),
(u'id:123', u'name:jhon', u'dept:hr', u'city:newyork'),
(u'id:100', u'name:peter', u'dept:marketing', u'name:peter', u'age:30', u'city:london'),
(u'id:222', u'name:smith', u'dept:finance', u'city:boston'),
(u'id:234', u'name:peter', u'dept:service', u'name:peter', u'dept:service', u'age:32', u'city:richmond')])
I just write a function that maps the RDD into key/value pairs and also removes the duplicates:
from pyspark.sql import Row
from pyspark.sql.types import *

def split_to_dict(l):
    l = list(set(l))  # drop duplicates here
    kv_list = []
    for e in l:
        k, v = e.split(':')
        kv_list.append({'key': k, 'value': v})
    return kv_list

rdd_map = rdd.flatMap(lambda l: split_to_dict(l)).map(lambda x: Row(**x))
df = rdd_map.toDF()
Example output (first 5 rows):
+----+---------+
| key| value|
+----+---------+
|city| london|
|dept|marketing|
|name| dave|
| age| 33|
| id| 111|
+----+---------+
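The frame above is still in long (key/value) form. To finish answering the question, one row per record with NULL for missing fields, it can be pivoted on the key. A minimal sketch under the assumption that every record carries an id field (the helper name split_to_rows is illustrative, not from the answer):
from pyspark.sql import Row
from pyspark.sql import functions as F

def split_to_rows(l):
    pairs = dict(e.split(':') for e in set(l))   # drop duplicated fields
    rec_id = pairs.pop('id')                     # keep the id as the grouping key
    return [{'id': rec_id, 'key': k, 'value': v} for k, v in pairs.items()]

wide_df = (rdd.flatMap(split_to_rows)
           .map(lambda x: Row(**x))
           .toDF()
           .groupBy('id')
           .pivot('key')
           .agg(F.first('value')))
wide_df.show()   # missing fields appear as null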

Related

Pyspark - Split a column and take n elements

I want to take a column and split a string using a character. As per usual, I understood that the method split would return a list, but when coding I found that the returning object had only the methods getItem or getField with the following descriptions from the API:
@since(1.3)
def getItem(self, key):
    """
    An expression that gets an item at position ``ordinal`` out of a list,
    or gets an item by key out of a dict.
    """

@since(1.3)
def getField(self, name):
    """
    An expression that gets a field by name in a StructField.
    """
Obviously this doesn't meet my requirements; for example, for the text "A_B_C_D" within the column I would like to split between "A_B_C_" and "D" into two different columns.
This is the code I'm using
from pyspark.sql.functions import regexp_extract, col, split
df_test=spark.sql("SELECT * FROM db_test.table_test")
#Applying the transformations to the data
split_col=split(df_test['Full_text'],'_')
df_split=df_test.withColumn('Last_Item',split_col.getItem(3))
Here is an example:
from pyspark.sql import Row
from pyspark.sql.functions import regexp_extract, col, split
l = [("Item1_Item2_ItemN"),("FirstItem_SecondItem_LastItem"),("ThisShouldBeInTheFirstColumn_ThisShouldBeInTheLastColumn")]
rdd = sc.parallelize(l)
datax = rdd.map(lambda x: Row(fullString=x))
df = sqlContext.createDataFrame(datax)
split_col=split(df['fullString'],'_')
df=df.withColumn('LastItemOfSplit',split_col.getItem(2))
Result:
fullString LastItemOfSplit
Item1_Item2_ItemN ItemN
FirstItem_SecondItem_LastItem LastItem
ThisShouldBeInTheFirstColumn_ThisShouldBeInTheLastColumn null
My expected result would be to always have the last item:
fullString LastItemOfSplit
Item1_Item2_ItemN ItemN
FirstItem_SecondItem_LastItem LastItem
ThisShouldBeInTheFirstColumn_ThisShouldBeInTheLastColumn ThisShouldBeInTheLastColumn
You can use getItem(size - 1) to get the last item from the arrays:
Example:
df = spark.createDataFrame([[['A', 'B', 'C', 'D']], [['E', 'F']]], ['split'])
df.show()
+------------+
| split|
+------------+
|[A, B, C, D]|
| [E, F]|
+------------+
import pyspark.sql.functions as F
df.withColumn('lastItem', df.split.getItem(F.size(df.split) - 1)).show()
+------------+--------+
| split|lastItem|
+------------+--------+
|[A, B, C, D]| D|
| [E, F]| F|
+------------+--------+
For your case:
from pyspark.sql.functions import regexp_extract, col, split, size
df_test=spark.sql("SELECT * FROM db_test.table_test")
#Applying the transformations to the data
split_col=split(df_test['Full_text'],'_')
df_split=df_test.withColumn('Last_Item',split_col.getItem(size(split_col) - 1))
You can pass in a regular expression pattern to split.
The following would work for your example:
from pyspark.sql.functions import split
split_col = split(df['fullString'], r"_(?=[^_]+$)")
df = df.withColumn('LastItemOfSplit', split_col.getItem(1))
df.show(truncate=False)
#+--------------------------------------------------------+---------------------------+
#|fullString                                              |LastItemOfSplit            |
#+--------------------------------------------------------+---------------------------+
#|Item1_Item2_ItemN                                       |ItemN                      |
#|FirstItem_SecondItem_LastItem                           |LastItem                   |
#|ThisShouldBeInTheFirstColumn_ThisShouldBeInTheLastColumn|ThisShouldBeInTheLastColumn|
#+--------------------------------------------------------+---------------------------+
The pattern means the following:
_ the literal underscore
(?=[^_]+$) a positive look-ahead requiring that only non-underscore characters ([^_]+) remain until the end of the string ($), i.e. that this is the last underscore
This will split the string on the last underscore. Then call .getItem(1) to get the item at index 1 in the resulting list.
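As a side note (my addition, not part of the original answers; assumes Spark 2.4+), the last element can also be taken directly with element_at and a negative index, which avoids both the size arithmetic and the regex:
from pyspark.sql.functions import split, element_at

# element_at with a negative index counts from the end of the array (Spark >= 2.4)
df = df.withColumn('LastItemOfSplit', element_at(split(df['fullString'], '_'), -1))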

Getting the table name from a Spark Dataframe

If I have a dataframe created as follows:
df = spark.table("tblName")
Is there any way that I can get tblName back from df?
You can extract it from the plan:
df.logicalPlan().argString().replace("`","")
We can extract the table name from a DataFrame by parsing its unresolved logical plan.
Please follow the method below:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.catalyst.catalog.{CatalogTable, HiveTableRelation}
import org.apache.spark.sql.execution.datasources.LogicalRelation

def getTableName(df: DataFrame): String = {
  Seq(df.queryExecution.logical, df.queryExecution.optimizedPlan).flatMap {
    _.collect {
      case LogicalRelation(_, _, catalogTable: Option[CatalogTable], _) =>
        if (catalogTable.isDefined) Some(catalogTable.get.identifier.toString()) else None
      case hive: HiveTableRelation => Some(hive.tableMeta.identifier.toString())
    }
  }.flatten.head
}
scala> val df = spark.table("db.table")
scala> getTableName(df)
res: String = `db`.`table`
The following utility function may be helpful for determining the table name from a given DataFrame.
import re
import typing
import pyspark.sql


def get_dataframe_tablename(df: pyspark.sql.DataFrame) -> typing.Optional[str]:
    """
    If the dataframe was created from an underlying table (e.g. spark.table('dual') or
    spark.sql("select * from dual")), this function will return the
    fully qualified table name (e.g. `default`.`dual`) as output, otherwise it will return None.
    Tested on: Python 3.7, Spark 3.0.1, but it should work with Spark >= 2.x and Python >= 3.4 too.
    Examples:
    >>> get_dataframe_tablename(spark.table('dual'))
    `default`.`dual`
    >>> get_dataframe_tablename(spark.sql("select * from dual"))
    `default`.`dual`
    It inspects the output of `df.explain()` to determine whether the df was created from a table or not.
    :param df: input dataframe whose underlying table name will be returned
    :return: table name or None
    """
    def _explain(_df: pyspark.sql.DataFrame) -> str:
        # df.explain() does not take a parameter for the output; it dumps to stdout by default,
        # so capture stdout here
        import contextlib
        import io
        with contextlib.redirect_stdout(io.StringIO()) as f:
            _df.explain()
        f.seek(0)  # rewind stream position
        explanation = f.readlines()[1]  # ignore the first output line (== Physical Plan ==)
        return explanation

    pattern = re.compile("Scan hive (.+), HiveTableRelation (.+?), (.+)")
    output = _explain(df)
    match = pattern.search(output)
    return match.group(2) if match else None
The following three lines of code will give the table and database name:
import org.apache.spark.sql.execution.FileSourceScanExec
val df = session.table("dealer")
df.queryExecution.sparkPlan.asInstanceOf[FileSourceScanExec].tableIdentifier
Any answer on this one yet? I found a way but it's probably not the prettiest. You can access the tablename by retrieving the physical execution plan and then doing some string splitting magic on it.
Let's say you have a table from database_name.tblName. The following should work:
execution_plan = df._jdf.queryExecution().simpleString()
table_name = execution_plan.split('FileScan')[1].split('[')[0].split('.')[1]
The first line will return your execution plan in a string format. That will look similar to this:
== Physical Plan ==\n*(1) ColumnarToRow\n+- FileScan parquet database_name.tblName[column1#2880,column2ban#2881] Batched: true, DataFilters: [], Format: Parquet, Location: PreparedDeltaFileIndex[dbfs:/mnt/lake/database_name/table_name], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<column1:string,column2:string...\n\n'
After that you can run some string splitting to access the relevant information. The first split, on 'FileScan', gives you two elements and you are interested in the second one; the next split, on '[', keeps only the part before the column list; finally, splitting on '.' and taking the second element returns tblName.
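A slightly more targeted variant (my own sketch, not from the answer) pulls both the database and table name out of the FileScan fragment with a regular expression instead of chained splits:
import re

# Matches e.g. "FileScan parquet database_name.tblName[column1#2880,..." in the plan string
m = re.search(r'FileScan \w+ (\w+)\.(\w+)\[', execution_plan)
database_name, table_name = (m.group(1), m.group(2)) if m else (None, None)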
You can create a table (view) from df. But if the table is a local temporary view or a global temporary view, you should drop it (sqlContext.dropTempTable) before creating one with the same name, or use the create-or-replace functions (df.createOrReplaceGlobalTempView or df.createOrReplaceTempView). If the view was registered with registerTempTable, you can register it again under the same name without an error.
#Create data frame
>>> d = [('Alice', 1)]
>>> test_df = spark.createDataFrame(sc.parallelize(d), ['name','age'])
>>> test_df.show()
+-----+---+
| name|age|
+-----+---+
|Alice| 1|
+-----+---+
#create tables
>>> test_df.createTempView("tbl1")
>>> test_df.registerTempTable("tbl2")
>>> sqlContext.tables().show()
+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
| | tbl1| true|
| | tbl2| true|
+--------+---------+-----------+
#create data frame from tbl1
>>> df = spark.table("tbl1")
>>> df.show()
+-----+---+
| name|age|
+-----+---+
|Alice| 1|
+-----+---+
#create tbl1 again using the df data frame. It will raise an error
>>> df.createTempView("tbl1")
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: "Temporary view 'tbl1' already exists;"
#drop and create again
>>> sqlContext.dropTempTable('tbl1')
>>> df.createTempView("tbl1")
>>> spark.sql('select * from tbl1').show()
+-----+---+
| name|age|
+-----+---+
|Alice| 1|
+-----+---+
#create data frame from tbl2 and replace name value
>>> df = spark.table("tbl2")
>>> df = df.replace('Alice', 'Bob')
>>> df.show()
+----+---+
|name|age|
+----+---+
| Bob| 1|
+----+---+
#create tbl2 again using the df data frame
>>> df.registerTempTable("tbl2")
>>> spark.sql('select * from tbl2').show()
+----+---+
|name|age|
+----+---+
| Bob| 1|
+----+---+
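As a side note (my addition, assuming Spark 2.0+), the same drop-and-recreate flow can be written against the catalog API, which also avoids the deprecated registerTempTable:
# Modern equivalents of the calls used above (Spark 2.0+)
spark.catalog.dropTempView("tbl1")      # instead of sqlContext.dropTempTable("tbl1")
df.createOrReplaceTempView("tbl1")      # create or replace in a single call
df.createOrReplaceTempView("tbl2")      # instead of df.registerTempTable("tbl2")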

efficiently expand array of Row to separate columns

I have a Spark dataframe and one of its fields is an array of Row structures. I need to expand it into separate columns. One of the problems is that in the array, a field is sometimes missing.
The following is an example:
from pyspark.sql import SparkSession
from pyspark.sql.types import *
from pyspark.sql import Row
from pyspark.sql import functions as udf
spark = SparkSession.builder.getOrCreate()
# data
rows = [{'status':'active','member_since':1990,'info':[Row(tag='name',value='John'),Row(tag='age',value='50'),Row(tag='phone',value='1234567')]},
{'status':'inactive','member_since':2000,'info':[Row(tag='name',value='Tom'),Row(tag='phone',value='1234567')]},
{'status':'active','member_since':2015,'info':[Row(tag='name',value='Steve'),Row(tag='age',value='28')]}]
# create dataframe
df = spark.createDataFrame(rows)
# transform info to dict
to_dict = udf.UserDefinedFunction(lambda s:dict(s),MapType(StringType(),StringType()))
df = df.withColumn("info_dict",to_dict("info"))
# extract name, NA if not exists
extract_name = udf.UserDefinedFunction(lambda s:s.get("name","NA"))
df = df.withColumn("name",extract_name("info_dict"))
# extract age, NA if not exists
extract_age = udf.UserDefinedFunction(lambda s:s.get("age","NA"))
df = df.withColumn("age",extract_age("info_dict"))
# extract phone, NA if not exists
extract_phone = udf.UserDefinedFunction(lambda s:s.get("phone","NA"))
df = df.withColumn("phone",extract_phone("info_dict"))
df.show()
You can see that for 'Tom', 'age' is missing, and for 'Steve', 'phone' is missing. As in the code snippet above, my current solution is to first transform the array into a dict and then parse each individual field into its own column. The result looks like this:
+--------------------+------------+--------+--------------------+-----+---+-------+
| info|member_since| status| info_dict| name|age| phone|
+--------------------+------------+--------+--------------------+-----+---+-------+
|[[name, John], [a...| 1990| active|[name -> John, ph...| John| 50|1234567|
|[[name, Tom], [ph...| 2000|inactive|[name -> Tom, pho...| Tom| NA|1234567|
|[[name, Steve], [...| 2015| active|[name -> Steve, a...|Steve| 28| NA|
+--------------------+------------+--------+--------------------+-----+---+-------+
I really just want the columns 'status', 'member_since', 'name', 'age' and 'phone'. This solution works but is rather slow because of the UDFs. Are there any faster alternatives? Thanks.
I can think of 2 ways to do this using DataFrame functions. I believe the first one should be faster, but the code is much less elegant. The second is more compact, but probably slower.
Method 1: Create Map Dynamically
The heart of this method is to turn your Row into a MapType(). This can be achieved using pyspark.sql.functions.create_map() and some magic using functools.reduce() and operator.add().
from functools import reduce
from operator import add
import pyspark.sql.functions as f

f.create_map(
    *reduce(
        add,
        [[f.col('info')['tag'].getItem(k), f.col('info')['value'].getItem(k)]
         for k in range(3)]
    )
)
The problem is that there isn't a way (AFAIK) to dynamically determine the length of the WrappedArray or to iterate through it easily. If a value is missing, this will cause an error because map keys cannot be null. However, since we know that the list can only contain 1, 2, or 3 elements, we can simply test for each of these cases.
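The keys list used in the select at the end of the next block is assumed to have been collected beforehand, for example from the distinct tags (a sketch, not part of the original answer):
import pyspark.sql.functions as f

# Collect the distinct tags once; reused as `keys` in the final select below
keys = [r[0] for r in df.select(f.explode(f.col('info')['tag'])).distinct().collect()]
# e.g. ['name', 'age', 'phone']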
df.withColumn(
    'map',
    f.when(f.size(f.col('info')) == 1,
        f.create_map(
            *reduce(
                add,
                [[f.col('info')['tag'].getItem(k), f.col('info')['value'].getItem(k)]
                 for k in range(1)]
            )
        )
    ).otherwise(
        f.when(f.size(f.col('info')) == 2,
            f.create_map(
                *reduce(
                    add,
                    [[f.col('info')['tag'].getItem(k), f.col('info')['value'].getItem(k)]
                     for k in range(2)]
                )
            )
        ).otherwise(
            f.when(f.size(f.col('info')) == 3,
                f.create_map(
                    *reduce(
                        add,
                        [[f.col('info')['tag'].getItem(k), f.col('info')['value'].getItem(k)]
                         for k in range(3)]
                    )
                )
            )
        )
    )
).select(
    ['member_since', 'status'] + [f.col("map").getItem(k).alias(k) for k in keys]
).show(truncate=False)
The last step turns the 'map' keys into columns using the method described in this answer.
This produces the following output:
+------------+--------+-----+----+-------+
|member_since|status |name |age |phone |
+------------+--------+-----+----+-------+
|1990 |active |John |50 |1234567|
|2000 |inactive|Tom |null|1234567|
|2015 |active |Steve|28 |null |
+------------+--------+-----+----+-------+
Method 2: Use explode, groupBy and pivot
First use pyspark.sql.functions.explode() on the column 'info', and then use the 'tag' and 'value' columns as arguments to create_map():
df.withColumn('id', f.monotonically_increasing_id())\
    .withColumn('exploded', f.explode(f.col('info')))\
    .withColumn(
        'map',
        f.create_map(*[f.col('exploded')['tag'], f.col('exploded')['value']]).alias('map')
    )\
    .select('id', 'member_since', 'status', 'map')\
    .show(truncate=False)
#+------------+------------+--------+---------------------+
#|id |member_since|status |map |
#+------------+------------+--------+---------------------+
#|85899345920 |1990 |active |Map(name -> John) |
#|85899345920 |1990 |active |Map(age -> 50) |
#|85899345920 |1990 |active |Map(phone -> 1234567)|
#|180388626432|2000 |inactive|Map(name -> Tom) |
#|180388626432|2000 |inactive|Map(phone -> 1234567)|
#|266287972352|2015 |active |Map(name -> Steve) |
#|266287972352|2015 |active |Map(age -> 28) |
#+------------+------------+--------+---------------------+
I also added a column 'id' using pyspark.sql.functions.monotonically_increasing_id() to make sure we can keep track of which rows belong to the same record.
Now we can explode the map column, groupBy(), and pivot(). We can use pyspark.sql.functions.first() as the aggregate function for the groupBy() because we know there will only be one 'value' in each group.
df.withColumn('id', f.monotonically_increasing_id())\
    .withColumn('exploded', f.explode(f.col('info')))\
    .withColumn(
        'map',
        f.create_map(*[f.col('exploded')['tag'], f.col('exploded')['value']]).alias('map')
    )\
    .select('id', 'member_since', 'status', f.explode('map'))\
    .groupBy('id', 'member_since', 'status').pivot('key').agg(f.first('value'))\
    .select('member_since', 'status', 'age', 'name', 'phone')\
    .show()
#+------------+--------+----+-----+-------+
#|member_since| status| age| name| phone|
#+------------+--------+----+-----+-------+
#| 1990| active| 50| John|1234567|
#| 2000|inactive|null| Tom|1234567|
#| 2015| active| 28|Steve| null|
#+------------+--------+----+-----+-------+
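As a third option (my addition, assuming Spark 2.4+), map_from_entries turns the array of (tag, value) structs into a map in a single expression, with no UDF, no size-based branching, and no explode:
import pyspark.sql.functions as f

keys = ['name', 'age', 'phone']   # assumed known or collected beforehand
df.withColumn('map', f.map_from_entries('info'))\
    .select(['member_since', 'status'] + [f.col('map').getItem(k).alias(k) for k in keys])\
    .show()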

How to change case of whole pyspark dataframe to lower or upper

I am trying to apply a PySpark SQL hash function to every row of two dataframes to identify the differences. The hash is case sensitive, i.e. if a column contains 'APPLE' and 'Apple' they are treated as two different values, so I want to change the case of both dataframes to either upper or lower. I was able to do this only for the dataframe headers but not for the dataframe values. Please help.
#Code for Dataframe column headers
self.df_db1 =self.df_db1.toDF(*[c.lower() for c in self.df_db1.columns])
Assuming df is your dataframe, this should do the job:
from pyspark.sql import functions as F
for col in df.columns:
    df = df.withColumn(col, F.lower(F.col(col)))
Both answers seem to be OK, with one exception: if you have a numeric column, it will be converted to a string column. To avoid this, try:
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._

val fields = sourceDF.schema.fields
val stringFields = fields.filter(f => f.dataType == StringType)
val nonStringFields = fields.filter(f => f.dataType != StringType).map(f => f.name).map(f => col(f))
val stringFieldsTransformed = stringFields.map(f => f.name).map(f => upper(col(f)).as(f))
val df = sourceDF.select(stringFieldsTransformed ++ nonStringFields: _*)
Now the types are also correct when you have non-string (i.e. numeric) fields.
If you know that each column is of String type, use one of the other answers - they are correct in those cases :)
Python code in PySpark:
from pyspark.sql.functions import *
from pyspark.sql.types import *
sourceDF = spark.createDataFrame([(1, "a")], ['n', 'n1'])
fields = sourceDF.schema.fields
stringFields = filter(lambda f: isinstance(f.dataType, StringType), fields)
nonStringFields = map(lambda f: col(f.name), filter(lambda f: not isinstance(f.dataType, StringType), fields))
stringFieldsTransformed = map(lambda f: upper(col(f.name)), stringFields)
allFields = [*stringFieldsTransformed, *nonStringFields]
df = sourceDF.select(allFields)
You can generate an expression using list comprehension:
from pyspark.sql import functions as psf
select_expression = [psf.lower(psf.col(x)).alias(x) for x in df.columns]
And then just apply it to your existing dataframe:
>>> df.show()
+---+---+---+---+
| c1| c2| c3| c4|
+---+---+---+---+
| A| B| C| D|
+---+---+---+---+
>>> df.select(*select_expression).show()
+---+---+---+---+
| c1| c2| c3| c4|
+---+---+---+---+
| a| b| c| d|
+---+---+---+---+
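Tying this back to the original hashing use case (my own sketch; df_db1 follows the question, everything else is illustrative), one can normalize the case of every column and then hash the concatenated row. Note that lower() casts numeric columns to string, which is usually acceptable when the hash is only used for comparison:
from pyspark.sql import functions as F

def lower_and_hash(df):
    # Lower-case every value, then hash the '||'-joined row for comparison
    lowered = df.select([F.lower(F.col(c)).alias(c) for c in df.columns])
    return lowered.withColumn('row_hash', F.sha2(F.concat_ws('||', *lowered.columns), 256))

hashed_db1 = lower_and_hash(df_db1)
hashed_db2 = lower_and_hash(df_db2)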

How to detect null column in pyspark

I have a dataframe defined with some null values. Some columns are entirely null.
>> df.show()
+---+---+---+----+
| A| B| C| D|
+---+---+---+----+
|1.0|4.0|7.0|null|
|2.0|5.0|7.0|null|
|3.0|6.0|5.0|null|
+---+---+---+----+
In my case, I want to return a list of column names that are filled with null values. My idea was to detect the constant columns (since the whole column contains the same null value).
This is how I did it:
nullCoulumns = [c for c, const in df.select([(min(c) == max(c)).alias(c) for c in df.columns]).first().asDict().items() if const]
but this does not consider null columns as constant; it only works with actual values.
How should I do it then?
Extend the condition to
from pyspark.sql.functions import min, max
((min(c).isNull() & max(c).isNull()) | (min(c) == max(c))).alias(c)
or use eqNullSafe (PySpark 2.3):
(min(c).eqNullSafe(max(c))).alias(c)
One way would be to do it implicitly: select each column, count its NULL values, and then compare this with the total number of rows. With your data, this would be:
spark.version
# u'2.2.0'
from pyspark.sql.functions import col
nullColumns = []
numRows = df.count()
for k in df.columns:
    nullRows = df.where(col(k).isNull()).count()
    if nullRows == numRows:  # i.e. if ALL values are NULL
        nullColumns.append(k)
nullColumns
# ['D']
But there is a simpler way: it turns out that the function countDistinct, when applied to a column with all NULL values, returns zero (0):
from pyspark.sql.functions import countDistinct
df.agg(countDistinct(df.D).alias('distinct')).collect()
# [Row(distinct=0)]
So the for loop now can be:
nullColumns = []
for k in df.columns:
    if df.agg(countDistinct(df[k])).collect()[0][0] == 0:
        nullColumns.append(k)
nullColumns
# ['D']
UPDATE (after comments): It seems possible to avoid collect in the second solution; since df.agg returns a dataframe with only one row, replacing collect with take(1) will safely do the job:
nullColumns = []
for k in df.columns:
    if df.agg(countDistinct(df[k])).take(1)[0][0] == 0:
        nullColumns.append(k)
nullColumns
# ['D']
How about this? In order to guarantee that a column is all nulls, two properties must be satisfied:
(1) The min value is equal to the max value
(2) The min or max is null
Or, equivalently
(1) The min AND max are both equal to None
Note that if property (2) is not satisfied, the case where column values are [null, 1, null, 1] would be incorrectly reported since the min and max will be 1.
import pyspark.sql.functions as F
def get_null_column_names(df):
    column_names = []
    for col_name in df.columns:
        min_ = df.select(F.min(col_name)).first()[0]
        max_ = df.select(F.max(col_name)).first()[0]
        if min_ is None and max_ is None:
            column_names.append(col_name)
    return column_names
Here's an example in practice:
>>> rows = [(None, 18, None, None),
(1, None, None, None),
(1, 9, 4.0, None),
(None, 0, 0., None)]
>>> schema = "a: int, b: int, c: float, d:int"
>>> df = spark.createDataFrame(data=rows, schema=schema)
>>> df.show()
+----+----+----+----+
| a| b| c| d|
+----+----+----+----+
|null| 18|null|null|
| 1|null|null|null|
| 1| 9| 4.0|null|
|null| 0| 0.0|null|
+----+----+----+----+
>>> get_null_column_names(df)
['d']
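As an aside (my addition, not from the original answers), all columns can also be checked in a single pass by counting the non-null values per column in one aggregation:
import pyspark.sql.functions as F

# F.count(column) counts only non-null values, so 0 means the column is entirely null
non_null_counts = df.select([F.count(F.col(c)).alias(c) for c in df.columns]).first().asDict()
null_columns = [c for c, cnt in non_null_counts.items() if cnt == 0]
# e.g. ['d'] for the example above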
