Replace with withColumn in pyspark - apache-spark

Can you help me understand the following error message and the reason behind it:
Create a dummy dataset:
import numpy as np

df_ = spark.createDataFrame([(1, np.nan, 'x'), (None, 2.0, 'y'), (3, 4.0, None)], ("a", "b", "c"))
df_.show()
+----+---+----+
|   a|  b|   c|
+----+---+----+
|   1|NaN|   x|
|null|2.0|   y|
|   3|4.0|null|
+----+---+----+
Now, I attempt to replace the NaN in the column 'b' the following way:
df_.withColumn("b", df_.select("b").replace({float("nan"):5}).b)
The expression df_.select("b").replace({float("nan"):5}).b runs fine on its own and returns a column with the expected values, yet the withColumn call above fails and I cannot make sense of the error.
The error that I am getting is:
AnalysisException Traceback (most recent call last)
Cell In[170], line 1
----> 1 df_.withColumn("b", df_.select("b").replace({float("nan"):5}).b)
File /usr/lib/spark/python/pyspark/sql/dataframe.py:2455, in DataFrame.withColumn(self, colName, col)
2425 """
2426 Returns a new :class:`DataFrame` by adding a column or replacing the
2427 existing column that has the same name.
(...)
2452
2453 """
2454 assert isinstance(col, Column), "col should be Column"
-> 2455 return DataFrame(self._jdf.withColumn(colName, col._jc), self.sql_ctx)
File /opt/conda/miniconda3/lib/python3.8/site-packages/py4j/java_gateway.py:1304, in JavaMember.__call__(self, *args)
1298 command = proto.CALL_COMMAND_NAME +\
1299 self.command_header +\
1300 args_command +\
1301 proto.END_COMMAND_PART
1303 answer = self.gateway_client.send_command(command)
-> 1304 return_value = get_return_value(
1305 answer, self.gateway_client, self.target_id, self.name)
1307 for temp_arg in temp_args:
1308 temp_arg._detach()
File /usr/lib/spark/python/pyspark/sql/utils.py:117, in capture_sql_exception.<locals>.deco(*a, **kw)
113 converted = convert_exception(e.java_exception)
114 if not isinstance(converted, UnknownException):
115 # Hide where the exception came from that shows a non-Pythonic
116 # JVM exception message.
--> 117 raise converted from None
118 else:
119 raise
AnalysisException: Resolved attribute(s) b#1083 missing from a#930L,b#931,c#932 in operator !Project [a#930L, b#1083 AS b#1085, c#932]. Attribute(s) with the same name appear in the operation: b. Please check if the right attribute(s) are used.;
!Project [a#930L, b#1083 AS b#1085, c#932]
+- LogicalRDD [a#930L, b#931, c#932], false
I can achieve the same result with the subset argument of the replace API, i.e. df_.replace({float("nan"): 5}, subset=['b']). However, I would like to understand the error I am seeing and its cause.

Based on the documentation of df.withColumn:
Returns a new DataFrame by adding a column or replacing the existing
column that has the same name.
The column expression must be an expression over this DataFrame;
attempting to add a column from some other DataFrame will raise an
error.
So when you do df_.select("b").replace({float("nan"):5}).b, the select call returns a new DataFrame, and the b column of that new DataFrame carries a different attribute ID. That attribute ID does not exist in the original DataFrame, which is why withColumn cannot resolve it.
You should instead use replace with subset, which builds the expression over the same DataFrame (and therefore over attribute IDs that do exist in it):
new_df = df_.replace({float("nan"):5},subset='b')
new_df.explain()
== Physical Plan ==
*(1) Project [a#2131L, CASE WHEN (b#2132 = NaN) THEN 5.0 ELSE b#2132 END AS b#2351, c#2133]
+- *(1) Scan ExistingRDD[a#2131L,b#2132,c#2133]
Note how only the attribute ID of the projected column changes between runs (b#2351 above vs b#2378 below), while the input attributes (a#2131L, b#2132, c#2133) stay the same:
df1 = df_
df1.replace({float("nan"):5},subset='b').explain()
== Physical Plan ==
*(1) Project [a#2131L, CASE WHEN (b#2132 = NaN) THEN 5.0 ELSE b#2132 END AS b#2378, c#2133]
+- *(1) Scan ExistingRDD[a#2131L,b#2132,c#2133]
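If you specifically want withColumn, one option that resolves cleanly (a sketch, not from the original answer) is to build the replacement expression over the same DataFrame:
from pyspark.sql import functions as F

# The when/isnan expression references df_'s own column b, so its attribute ID resolves
df_fixed = df_.withColumn("b", F.when(F.isnan("b"), F.lit(5.0)).otherwise(F.col("b")))
df_fixed.show()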

Related

How to use WHEN clause to check Null condition on a String Column of a Pyspark dataframe?

I am trying to check for NULL or an empty string on a string column of a data frame, and for 0 on an integer column, as given below.
emp_ext = emp_ext.withColumn('emp_header', when((F.col('emp_header').isNull()) | (F.col('emp_header') == '0'), 'UNKNOWN')) \
.withColumn('emp_item', when((F.col('emp_item').isNull()) | (F.col('emp_item') == 0), -1)) \
.withColumn('emp_lease', when((F.col('emp_header').isNull() | F.col('emp_header') == '0') & (F.col('emp_item').isNull() | F.col('emp_item') == 0), -1)))
The column emp_header is a String column, emp_item is an Integer column and emp_lease is an Integer column.
When I run the above piece of code, I get an error saying there is a data type mismatch in the column emp_header between NULL & STRING as given below.
2022-03-03 07:17:41,931 - src.emp_load - 76 - ERROR - Failed to load history data into emp_data table with the exception: cannot resolve '((`emp_header` IS NULL) OR `emp_header`)' due to data type mismatch: differing types in '((`emp_header` IS NULL) OR `emp_header`)' (boolean and string).;;
So I tried a different syntax in place of the NULL check, as given below.
emp_ext = emp_ext.withColumn('emp_header', when((F.col('emp_header') == '') | (F.col('emp_header') == '0'), 'UNKOWN')) \
.withColumn('emp_item', when((F.col('emp_item') == '') | (F.col('emp_item') == 0), -1)) \
.withColumn('emp_lease', when((F.col('emp_header') == '' | F.col('emp_header') == '0') & (F.col('emp_item') == '' | F.col('emp_item') == 0), -1))
This time the PyCharm type checker says: Expected type 'Column', got 'str' instead (the offending line was marked in a screenshot).
If I go ahead and run the code with the above changes, the code fails with a different exception
2022-03-03 07:46:08,227 - src.emp_load - 76 - ERROR - Failed to load history data into emp_table with the exception: An error occurred while calling o336.or. Trace:
py4j.Py4JException: Method or([class java.lang.String]) does not exist
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:251)
at java.lang.Thread.run(Thread.java:748)
Traceback (most recent call last):
File "/local_disk0/tmp/spark-a2890f41-1167-481a-85a9-6984c04d05c2/template_python-1.0.0-py3-none-any.whl/src/emp_load.py", line 71, in main
so_lookup_insert(spark=spark, df=df, years=years, columns=columns)
File "/local_disk0/tmp/spark-a2890f41-1167-481a-85a9-6984c04d05c2/template_python-1.0.0-py3-none-any.whl/src/emp_load.py", line 100, in so_lookup_insert
.withColumn('emp_lease', when((F.col('emp_header') == '' | F.col('emp_header') == '0') & (F.col('emp_item') == '' | F.col('emp_item') == 0), -1))
File "/databricks/spark/python/pyspark/sql/column.py", line 118, in _
njc = getattr(self._jc, name)(jc)
File "/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/databricks/spark/python/pyspark/sql/utils.py", line 127, in deco
return f(*a, **kw)
File "/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 332, in get_return_value
format(target_id, ".", name, value))
Update 1:
I added parentheses to the when condition on the third line, as suggested in a comment, and I no longer hit the second exception.
I also changed the isNull() checks to comparisons with '' as below.
emp_ext = emp_ext.withColumn('emp_header', when((F.col('emp_header') == '') | (F.col('emp_header') == '0'), 'UNKNOWN')) \
.withColumn('emp_item', when((F.col('emp_item') == '') | (F.col('emp_item') == 0), -1)) \
.withColumn('emp_lease', when(((F.col('emp_header') == '') | F.col('emp_header') == '0') & ((F.col('emp_item') == '') | F.col('emp_item') == 0), -1)))
But I still see the exception:
2022-03-03 08:41:31,295 - src.emp_load - 76 - ERROR - Failed to load history data into emp_table with the exception: cannot resolve '((`emp_header` = '') OR `emp_header`)' due to data type mismatch: differing types in '((`emp_header` = '') OR `emp_header`)' (boolean and string).;;
Can anyone tell me what mistake I am making here and how I can fix it?
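For what it's worth, both reported errors are consistent with Python operator precedence: | binds more tightly than ==, so F.col('emp_header').isNull() | F.col('emp_header') == '0' parses as (isNull() | emp_header) == '0', which is exactly the boolean-vs-string mismatch Spark reports. Below is a hedged sketch with every comparison fully parenthesized and with .otherwise() added so non-matching rows keep their current value instead of becoming null; note that because the calls are chained, the emp_lease condition sees the already-transformed emp_header and emp_item values.
from pyspark.sql import functions as F
from pyspark.sql.functions import when

emp_ext = (
    emp_ext
    .withColumn(
        'emp_header',
        when(F.col('emp_header').isNull() | (F.col('emp_header') == '0'), 'UNKNOWN')
        .otherwise(F.col('emp_header'))
    )
    .withColumn(
        'emp_item',
        when(F.col('emp_item').isNull() | (F.col('emp_item') == 0), -1)
        .otherwise(F.col('emp_item'))
    )
    .withColumn(
        'emp_lease',
        when(
            (F.col('emp_header').isNull() | (F.col('emp_header') == '0'))
            & (F.col('emp_item').isNull() | (F.col('emp_item') == 0)),
            -1
        )
        .otherwise(F.col('emp_lease'))
    )
)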

How to create a PySpark pandas-on-Spark DataFrame from Snowflake SQL query?

NOTE: I need distributed processing, which is why I am using the pandas API on Spark.
To create the pandas-on-Spark DataFrame, I attempted 2 different methods (outlined below: "OPTION 1", "OPTION 2").
Are either of these options feasible? If so, how do I proceed given errors (outlined below in "ISSUE(S)" and error log for "OPTION 2")?
Alternatively, should I start with PySpark SQL Pandas UDFs for the query, and then convert to pandas-on-Spark DataFrame?
# (Spark 3.2.0, Scala 2.12, DBR 10.0)
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~##
## I. Import libraries & dependencies
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~##
import numpy as np
import pandas as pd
import pyspark.pandas as ps
from pyspark.sql import SparkSession
from pyspark.sql import DataFrame
from pyspark.sql import Column
from pyspark.sql.functions import *
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~##
## II. Load data + create Spark DataFrames
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~##
df_1 = spark.read.format("snowflake").options(**options).option("query","SELECT PROPERTY_ID,AVGRENT_MARKET FROM schema_1").load()
df_2 = spark.read.format("snowflake").options(**options).option("query","SELECT PROPERTY_ID,PROPERTY_ZIPCODE FROM schema_2").load()
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~##
## III. OPTION 1: Union Spark DataFrames
## ISSUE(S): Results in 'None' values in PROPERTY_ZIPCODE column
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~##
## Create merged dataframe from two Spark Dataframes
# df_3 = df_1.unionByName(df_2, allowMissingColumns=True)
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~##
## III. OPTION 2: Create Spark SQL DataFrame from SQL tables
## ISSUE(S): "AnalysisException: Reference 'PROPERTY_ID' is ambiguous, could be: table_1.PROPERTY_ID, table_2.PROPERTY_ID."
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~##
## Create tables from two Spark DataFrames
df_1.createOrReplaceTempView("Table_1")
df_2.createOrReplaceTempView("Table_2")
## Specify SQL Snowflake query to merge tables
merge_tables = '''
SELECT Table_1.PROPERTY_ID,
Table_1.AVGRENT_MARKET,
Table_2.PROPERTY_ID,
Table_2.PROPERTY_ZIPCODE
FROM Table_2 INNER JOIN Table_1
ON Table_2.PROPERTY_ID=Table_1.PROPERTY_ID
LIMIT 25
'''
## Create merged Spark SQL dataframe based on query
df_3 = spark.sql(merge_tables)
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~##
## Create a pandas-on-Spark DataFrame
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~##
df_3 = ps.DataFrame(df_3)
# df_3 = df_3.to_pandas_on_spark() # Alternative conversion option
Error log for "OPTION 2":
---------------------------------------------------------------------------
AnalysisException Traceback (most recent call last)
<command-2142959205032388> in <module>
52 ##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~##
53 # df_3 = ps.DataFrame(df_3)
---> 54 df_3 = df_3.to_pandas_on_spark() # Alternative conversion option
/databricks/spark/python/pyspark/sql/dataframe.py in to_pandas_on_spark(self, index_col)
2777
2778 index_spark_columns, index_names = _get_index_map(self, index_col)
-> 2779 internal = InternalFrame(
2780 spark_frame=self, index_spark_columns=index_spark_columns, index_names=index_names
2781 )
/databricks/spark/python/pyspark/pandas/internal.py in __init__(self, spark_frame, index_spark_columns, index_names, index_fields, column_labels, data_spark_columns, data_fields, column_label_names)
633
634 # Create default index.
--> 635 spark_frame = InternalFrame.attach_default_index(spark_frame)
636 index_spark_columns = [scol_for(spark_frame, SPARK_DEFAULT_INDEX_NAME)]
637
/databricks/spark/python/pyspark/pandas/internal.py in attach_default_index(sdf, default_index_type)
865
866 if default_index_type == "sequence":
--> 867 return InternalFrame.attach_sequence_column(sdf, column_name=index_column)
868 elif default_index_type == "distributed-sequence":
869 return InternalFrame.attach_distributed_sequence_column(sdf, column_name=index_column)
/databricks/spark/python/pyspark/pandas/internal.py in attach_sequence_column(sdf, column_name)
878 @staticmethod
879 def attach_sequence_column(sdf: SparkDataFrame, column_name: str) -> SparkDataFrame:
--> 880 scols = [scol_for(sdf, column) for column in sdf.columns]
881 sequential_index = (
882 F.row_number().over(Window.orderBy(F.monotonically_increasing_id())).cast("long") - 1
/databricks/spark/python/pyspark/pandas/internal.py in <listcomp>(.0)
878 @staticmethod
879 def attach_sequence_column(sdf: SparkDataFrame, column_name: str) -> SparkDataFrame:
--> 880 scols = [scol_for(sdf, column) for column in sdf.columns]
881 sequential_index = (
882 F.row_number().over(Window.orderBy(F.monotonically_increasing_id())).cast("long") - 1
/databricks/spark/python/pyspark/pandas/utils.py in scol_for(sdf, column_name)
590 def scol_for(sdf: SparkDataFrame, column_name: str) -> Column:
591 """Return Spark Column for the given column name."""
--> 592 return sdf["`{}`".format(column_name)]
593
594
/databricks/spark/python/pyspark/sql/dataframe.py in __getitem__(self, item)
1657 """
1658 if isinstance(item, str):
-> 1659 jc = self._jdf.apply(item)
1660 return Column(jc)
1661 elif isinstance(item, Column):
/databricks/spark/python/lib/py4j-0.10.9.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
1302
1303 answer = self.gateway_client.send_command(command)
-> 1304 return_value = get_return_value(
1305 answer, self.gateway_client, self.target_id, self.name)
1306
/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
121 # Hide where the exception came from that shows a non-Pythonic
122 # JVM exception message.
--> 123 raise converted from None
124 else:
125 raise
AnalysisException: Reference 'PROPERTY_ID' is ambiguous, could be: table_1.PROPERTY_ID, table_2.PROPERTY_ID.
The query selects PROPERTY_ID from both tables, so df_3 ends up with two columns of the same name; when pandas-on-Spark attaches its default index it looks every column up by name, which is where the ambiguous-reference error comes from. If all you want is a join, use the DataFrame join function instead. It's much cleaner and more maintainable, and it keeps a single PROPERTY_ID column.
df_1 = spark.read...load()
df_2 = spark.read...load()
df_3 = df_1.join(df_2, on=['PROPERTY_ID'], how='inner')
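After the join, the conversion step from the question should go through, because df_3 no longer has two columns named PROPERTY_ID; a sketch reusing the question's own calls:
import pyspark.pandas as ps

# With a single PROPERTY_ID column, attaching the default index no longer
# hits an ambiguous column reference.
df_3_ps = df_3.to_pandas_on_spark()
# df_3_ps = ps.DataFrame(df_3)   # equivalent alternative from the question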

Running subqueries in pyspark using where or filter statement

I am trying to run a subquery in pyspark. I see that it is possible using SQL statements. But is there any inherent support using "where" or "filter" operations?
Consider the test data frame:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

sqlContext = SparkSession.builder.appName('test').enableHiveSupport().getOrCreate()
tst = sqlContext.createDataFrame([(1, 2), (4, 3), (1, 4), (1, 5), (1, 6)], schema=['sample', 'time'])
tst_sub = sqlContext.createDataFrame([(1, 2), (4, 3), (1, 4)], schema=['sample', 'time'])
#%% using where to query the df
tst.where(F.col('time')>4).show()
+------+----+
|sample|time|
+------+----+
|     1|   5|
|     1|   6|
+------+----+
Here you can see that the where function is working fine.
When I try to do the same using a subquery, like this:
#%% using where with subquery
tst.where(F.col('time')>F.max(tst_sub.select('time'))).show()
I get this error:
AttributeError                            Traceback (most recent call last)
in <module>
----> 1 tst.where(F.col('time')>F.max(tst_sub.select('time'))).show()

/opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4744.12781922/lib/spark/python/pyspark/sql/functions.py in _(col)
     42 def _(col):
     43     sc = SparkContext._active_spark_context
---> 44     jc = getattr(sc._jvm.functions, name)(col._jc if isinstance(col, Column) else col)
     45     return Column(jc)
     46 _.__name__ = name

/opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4744.12781922/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1246
   1247 def __call__(self, *args):
-> 1248     args_command, temp_args = self._build_args(*args)
   1249
   1250     command = proto.CALL_COMMAND_NAME +\

/opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4744.12781922/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py in _build_args(self, *args)
   1216
   1217     args_command = "".join(
-> 1218         [get_command_part(arg, self.pool) for arg in new_args])
   1219
   1220     return args_command, temp_args

/opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4744.12781922/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py in <listcomp>(.0)
   1216
   1217     args_command = "".join(
-> 1218         [get_command_part(arg, self.pool) for arg in new_args])
   1219
   1220     return args_command, temp_args

/opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4744.12781922/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py in get_command_part(parameter, python_proxy_pool)
    296     command_part += ";" + interface
    297 else:
--> 298     command_part = REFERENCE_TYPE + parameter._get_object_id()
    299
    300 command_part += "\n"

/opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4744.12781922/lib/spark/python/pyspark/sql/dataframe.py in __getattr__(self, name)
   1298     if name not in self.columns:
   1299         raise AttributeError(
-> 1300             "'%s' object has no attribute '%s'" % (self.__class__.__name__, name))
   1301     jc = self._jdf.apply(name)
   1302     return Column(jc)

AttributeError: 'DataFrame' object has no attribute '_get_object_id'
When I register the dataframes as tables and run a SQL query, it works fine:
tst.createOrReplaceTempView("tst")
tst_sub.createOrReplaceTempView("tst_sub")
sqlContext.sql("SELECT * FROM tst WHERE time>(SELECT(max(time)) FROM tst_sub)").show()
Is there any method to perform a subquery in pyspark on the dataframes directly using filter, where or any other methods?
You need to collect the max time into a numerical variable in Python before putting it in the filter:
tst.where(F.col('time') > tst_sub.select(F.max('time')).head()[0]).show()
+------+----+
|sample|time|
+------+----+
|     1|   5|
|     1|   6|
+------+----+
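If you would rather keep the whole computation lazy (no head() on the driver), one alternative sketch, not from the original answer, is to compute the max as a one-row DataFrame and cross join it in:
from pyspark.sql import functions as F

max_time = tst_sub.select(F.max('time').alias('max_time'))  # one-row DataFrame

(tst.crossJoin(max_time)               # cheap: the right side has a single row
    .where(F.col('time') > F.col('max_time'))
    .drop('max_time')
    .show())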

How to impute values in a column and overwrite existing values

I'm trying to learn machine learning and I need to fill in missing values during the cleaning stage of the workflow. I have 13 columns and need to impute values for 8 of them. One column is called Dependents; I want to fill the blanks with the word missing and change the cells that do contain data as follows: 1 to one, 2 to two, 3 to three and 3+ to threePlus.
I'm running the program in Anaconda, and the DataFrame is named train.
train.columns
this gives me
Index(['Loan_ID', 'Gender', 'Married', 'Dependents', 'Education',
'Self_Employed', 'ApplicantIncome', 'CoapplicantIncome', 'LoanAmount',
'Loan_Amount_Term', 'Credit_History', 'Property_Area', 'Loan_Status'],
dtype='object')
next
print("Dependents")
print(train['Dependents'].unique())
this gives me
Dependents
['0' '1' '2' '3+' nan]
Now I try imputing values as described:
def impute_dependent():
    my_dict = {'1': 'one', '2': 'two', '3': 'three', '3+': 'threePlus'}
    return train.Dependents.map(my_dict).fillna('missing')

def convert_data(dataset):
    temp_data = dataset.copy()
    temp_data['Dependents'] = temp_data[['Dependents']].apply(impute_dependent, axis=1)
    return temp_data
This gives the error:
TypeError Traceback (most recent call last)
<ipython-input-46-ccb1a5ea7edd> in <module>()
4 return temp_data
5
----> 6 train_dataset = convert_data(train)
7 #test_dataset = convert_data(test)
<ipython-input-46-ccb1a5ea7edd> in convert_data(dataset)
1 def convert_data(dataset):
2 temp_data = dataset.copy()
----> 3 temp_data['Dependents'] =
temp_data[['Dependents']].apply(impute_dependent,axis=1)
4 return temp_data
5
D:\Anaconda2\lib\site-packages\pandas\core\frame.py in apply(self, func,
axis, broadcast, raw, reduce, result_type, args, **kwds)
6002 args=args,
6003 kwds=kwds)
-> 6004 return op.get_result()
6005
6006 def applymap(self, func):
D:\Anaconda2\lib\site-packages\pandas\core\apply.py in get_result(self)
140 return self.apply_raw()
141
--> 142 return self.apply_standard()
143
144 def apply_empty_result(self):
D:\Anaconda2\lib\site-packages\pandas\core\apply.py in apply_standard(self)
246
247 # compute the result using the series generator
--> 248 self.apply_series_generator()
249
250 # wrap results
D:\Anaconda2\lib\site-packages\pandas\core\apply.py in
apply_series_generator(self)
275 try:
276 for i, v in enumerate(series_gen):
--> 277 results[i] = self.f(v)
278 keys.append(v.name)
279 except Exception as e:
TypeError: ('impute_dependent() takes 0 positional arguments but 1 was
given', 'occurred at index 0')
I expected one, two, three and threePlus to replace the existing values, and missing to fill in the blanks.
Would this do?
my_dict = {'1': 'one', '2': 'two', '3': 'three', '3+': 'threePlus', np.nan: 'missing'}

def convert_data(dataset):
    temp_data = dataset.copy()
    temp_data.Dependents = temp_data.Dependents.map(my_dict)
    return temp_data
As a side note, part of your problem might be the use of apply: essentially, apply passes each value through a function and stores what comes out. I might be wrong, but I think your function needs to accept the input that apply gives it, e.g.:
def impute_dependent(dep):
    my_dict = {'1': 'one', '2': 'two', '3': 'three', '3+': 'threePlus', np.nan: 'missing'}
    return my_dict.get(dep, dep)  # values with no mapping (e.g. '0') pass through unchanged

df.Dependents = df.Dependents.apply(impute_dependent)
This way, for every value in df.Dependents, apply takes that value and passes it to impute_dependent as an argument, then uses the returned value as the output. As written, when I try your code I get an error because impute_dependent takes no arguments.
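For reference, a minimal self-contained sketch of the map-based idea on the sample values from the question; keeping '0' unchanged is my assumption, since the question does not say what should happen to it:
import numpy as np
import pandas as pd

my_dict = {'1': 'one', '2': 'two', '3': 'three', '3+': 'threePlus'}

dependents = pd.Series(['0', '1', '2', '3+', np.nan])

mapped = dependents.map(my_dict)                      # unmapped values (e.g. '0') become NaN
result = mapped.fillna(dependents).fillna('missing')  # restore unmapped originals, label true blanks
print(result.tolist())                                # ['0', 'one', 'two', 'threePlus', 'missing']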

Getting Type Error Expected Strings or Bytes Like Object

I am working on a dataset of tweets and I am trying to find the mentions of other users in each tweet; a tweet can mention no users, one user, or several.
The head of the DataFrame (shown as an image in the original post) includes a text column holding the raw tweet text.
The following is the function that I created to extract the list of mentions in a tweet:
def getMention(text):
    mention = re.findall('(^|[^@\w])@(\w{1,15})', text)
    if len(mention) > 0:
        return [x[1] for x in mention]
    else:
        return None
I'm trying to create a new column in the DataFrame and apply the function with the following code:
df['mention'] = df['text'].apply(getMention)
On running this code I get the following error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-43-426da09a8770> in <module>
----> 1 df['mention'] = df['text'].apply(getMention)
~/anaconda3_501/lib/python3.6/site-packages/pandas/core/series.py in apply(self, func, convert_dtype, args, **kwds)
3192 else:
3193 values = self.astype(object).values
-> 3194 mapped = lib.map_infer(values, f, convert=convert_dtype)
3195
3196 if len(mapped) and isinstance(mapped[0], Series):
pandas/_libs/src/inference.pyx in pandas._libs.lib.map_infer()
<ipython-input-42-d27373022afd> in getMention(text)
1 def getMention(text):
2
----> 3 mention = re.findall('(^|[^@\w])@(\w{1,15})', text)
4 if len(mention) > 0:
5 return [x[1] for x in mention]
~/anaconda3_501/lib/python3.6/re.py in findall(pattern, string, flags)
220
221 Empty matches are included in the result."""
--> 222 return _compile(pattern, flags).findall(string)
223
224 def finditer(pattern, string, flags=0):
TypeError: expected string or bytes-like object
I can't comment (not enough rep) so here's what I suggest to troubleshoot the error.
It seems findall raises an exception because text is not a string, so you might want to check what type text actually is, using this:
def getMention(text):
    print(type(text))
    mention = re.findall(r'(^|[^@\w])@(\w{1,15})', text)
    if len(mention) > 0:
        return [x[1] for x in mention]
    else:
        return None
(or use the debugger, if you know how)
And if text can be converted to a string, maybe try this?
def getMention(text):
    mention = re.findall(r'(^|[^@\w])@(\w{1,15})', str(text))
    if len(mention) > 0:
        return [x[1] for x in mention]
    else:
        return None
P.S.: don't forget the r'...' prefix on your regex, so that backslash escapes are not interpreted by Python before they reach re.
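As a quick check of the str() variant, here is a runnable sketch under the assumption, consistent with the error, that some rows of df['text'] are NaN (floats) rather than strings:
import re

import numpy as np
import pandas as pd

df = pd.DataFrame({'text': ['hi @alice and @bob!', 'no mentions here', np.nan]})

def getMention(text):
    # str() keeps re.findall from failing when text is NaN (a float)
    mention = re.findall(r'(^|[^@\w])@(\w{1,15})', str(text))
    if len(mention) > 0:
        return [x[1] for x in mention]
    return None

df['mention'] = df['text'].apply(getMention)
print(df['mention'].tolist())   # [['alice', 'bob'], None, None]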
