I have the following dict and pandas DataFrame.
import numpy as np
import pandas as pd
from pandas import Timestamp

sample_dict = {'isDuplicate': {'1051681551': False, '1037545402': True, '1035390559': False},
               'dateTime': {'1051681551': Timestamp('2019-01-29 09:09:00+0000', tz='UTC'),
                            '1037545402': Timestamp('2019-01-11 02:06:00+0000', tz='UTC'),
                            '1035390559': Timestamp('2019-01-08 14:35:00+0000', tz='UTC')},
               'dateTimePub': {'1051681551': None, '1037545402': None, '1035390559': None}}
df = pd.DataFrame.from_dict(sample_dict)
I want to apply an np.where() function to the dateTime and dateTimePub columns like this:
def _replace_datetime_with_datetime_pub(news_dataframe):
    news_dataframe.dateTime = np.where(news_dataframe.dateTimePub, news_dataframe.dateTimePub, news_dataframe.dateTime)
    return pd.to_datetime(news_dataframe.dateTime)

df.apply(_replace_datetime_with_datetime_pub)
But I get the following error,
AttributeError: 'Series' object has no attribute 'dateTimePub'
It's possible to do df = _replace_datetime_with_datetime_pub(df). But my questions are:
how can I apply this function via either the pd.DataFrame.apply or pd.DataFrame.transform method, and
why do I get this error?
I have already checked many similar questions, but none of them dealt with this AttributeError.
With apply, you're breaking down your DataFrame into series to pass to your function. Since you don't specify the axis keyword argument, pandas assumes you want to pass each column as a series. This is the source of the AttributeError you're getting. For pandas to pass each row as a series, you want to specify axis=1 in your apply call.
Even then, you'll need to adapt the function somewhat to fit the apply paradigm. In particular, you want to think about how the function should process each row it encounters. The function you pass to apply (if you specify axis=1) will work on each row in isolation from every other row. The return value from each row will then be stitched together into a series.
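For example, a minimal row-wise sketch (my adaptation of the question's function, not the original author's code):

def _replace_datetime_with_datetime_pub(row):
    # apply with axis=1 passes each row in as a Series, so the attribute
    # access now works; pick dateTimePub when set, else keep dateTime.
    return row.dateTimePub if row.dateTimePub else row.dateTime

df['dateTime'] = pd.to_datetime(df.apply(_replace_datetime_with_datetime_pub, axis=1), utc=True)

That said, row-wise apply is slow on large frames; the vectorized np.where version called directly on the DataFrame, as in df = _replace_datetime_with_datetime_pub(df), is usually the better choice.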
Summary
I am trying to loop through a pandas dataframe, and to run a secondary loop at each iteration. The secondary loop calculates something that I want to append into the original dataframe, so that when the primary loop advances, some of the rows are recalculated based on the changed values. (For those interested, this is a simple advective model of carbon accumulation in soils. When a new layer of soil is deposited, mixing processes penetrate into older layers and transform their properties to a set depth. Thus, each layer deposited changes those below it incrementally, until a former layer lies below the mixing depth.)
I have produced an example of how I want this to work, however it is generating the common error message:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
self._setitem_single_block(indexer, value, name)
I have looked into the linked information in the error message as well as myriad posts on this forum, but none get into the continual looping through a changed dataframe.
What I've tried, and some possible solutions
Below is some example code. It works more or less as I want it to, but it produces the warning. Should I:
Suppress the warning and continue working with this architecture? In that case, am I asking for trouble with irreproducible results?
Try a different architecture altogether, like a numpy array from the original dataframe?
Try df.append() or df.copy() to avoid the warning?
I have tried df.copy() to no avail; the warning was still thrown.
Example code:
import pandas as pd

a = pd.DataFrame(
    {
        'a': [x/2 for x in range(1, 11)],
        'b': ['hot dog', 'slider', 'watermelon', 'funnel cake', 'cotton candy', 'lemonade', 'fried oreo', 'ice cream', 'corn', 'sausage'],
        'c': ['meat', 'meat', 'vegan', 'vegan', 'vegan', 'vegan', 'dairy', 'dairy', 'vegan', 'meat']
    }
)
print(a)

z = [x/(x+2) for x in range(1, 5)]
print(z)

# Primary loop through rows of the main dataframe
for ind, row in a.iterrows():
    # Pull out a chunk of the dataframe. This is the portion that will be
    # modified. What is below this is already modified and locked into the
    # geological record. What is above has not yet been deposited.
    b = a.iloc[ind:(ind + len(z)), :]
    # Define the size of the secondary loop. Taking the minimum avoids the
    # model mixing below the boundary layer (KeyError).
    loop = min([len(z), len(b)])
    # Now loop through the sub-dataframe and change accordingly.
    for fraction in range(loop):
        b['a'].iloc[fraction] = b['a'].iloc[fraction] * z[fraction]
    # Overwrite the original dataframe with the new data:
    a.iloc[ind:(ind + loop), :] = b
    # Tried b.copy(), but it still throws the warning!
    # a.iloc[ind:(ind + loop), :] = b.copy()
print(a)
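For reference, option 2 from the list above might look like the following sketch (my own illustration using the same a and z, not tested beyond this toy case). Working on a plain NumPy array sidesteps pandas' view-versus-copy machinery, so no warning is raised:

# Pull column 'a' out as a standalone array, mix in place, write it back once.
vals = a['a'].to_numpy().copy()
for ind in range(len(vals)):
    # Each new layer mixes into at most len(z) layers, clipped at the bottom.
    loop = min(len(z), len(vals) - ind)
    for fraction in range(loop):
        vals[ind + fraction] *= z[fraction]
a['a'] = vals
print(a)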
I want to create a method/function that queries a database and returns 3 values if the searched value exists in the database table.
If there are no results or an error occurs, I need to identify that so I can use it in a conditional statement.
So far I have this:
def read_log(self, ftpsource):
    try:
        conn = pyodbc.connect(...)
        sql_query = f"""Select ID,file_modif_date,size FROM [dbo].[Table] WHERE Source='{ftpsource}'"""
        cursor = conn.cursor()
        cursor.execute(sql_query)
        for row in cursor:
            id = row[0]
            file_mtime = row[1]
            file_size = row[2]
            return ('1', id, file_mtime, file_size)
    except Exception as e:
        return ('0', '0', '0', '0')

# Call the function
sql_status, id, time, size = log.read_log('hol1a')
if sql_status != 0:
    print(id, time, size)
elif sql_status == 0:
    print("Error")
But it doesn't look good to me... So I wonder: what is the best practice for a function to return multiple values, and to identify whether there are no results or an error occurred?
Thanks!
There is no single best way to return values.
Do you return coordinates or vectors? Then it is obvious that the function returns more than one value, and it is obvious what values to expect:
def get_position2d():
    return x, y
If the returned values are not obvious/ordered, you may need names: use a dictionary!
def get_ingredients():
    return {'sugar': 10, 'salt': 30, 'flour': 150, 'turbo_reactor': 1}
Do you return big datasets for machine learning? Use Pandas or numpy arrays!
import pandas as pd

def get_input_data():
    # Imagine this is a 5k+ entries dataset...
    return pd.DataFrame(dict(
        speed=[1, 2, 3, 4, 5], heat=[50, 50, 35, 24, 10], tire_type=[1, 1, 1, 1, 2]))
The one rule is that it should make sense and should be considered as "one element". For a position, x and y are one element in the sense that they are "the position". Same for ingredients, they are the "ingredients for one specific recipe".
So avoid mixing nonsensical return values. One function, one job, one output (within the current context).
Finally, you may want to consider whether the data you return should be mutable or immutable. If you are going to modify the data afterwards, you may need to use lists instead of tuples.
Now, about errors!
If you are talking about Python errors, you can only have one error at a time. So the following code should allow you to catch errors and play with them.
def i_will_throw_an_error():
    try:
        do_something_bad()
    except Exception:
        do_something_if_a_bad_thing_happens()
Now if you talk about error messages and values, you may (or may not) want to trigger a Python error if the data given to you was wrong!
def i_get_bad_data_and_i_raise_it():
    bad_data = get_bad_data()
    if bad_data['is_really_bad'] is True:
        raise ValueError('Oh no, the data we got is really bad!')
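Applied to the question, a sketch combining these ideas (the connection details are elided as in the question, and the names are placeholders): return None when nothing matches, and let real database errors propagate instead of masking them with '0' strings:

def read_log(self, ftpsource):
    conn = pyodbc.connect(...)  # elided, as in the question
    cursor = conn.cursor()
    # Parameterized query; safer than formatting ftpsource into the SQL string.
    cursor.execute(
        "SELECT ID, file_modif_date, size FROM [dbo].[Table] WHERE Source = ?",
        ftpsource)
    row = cursor.fetchone()
    if row is None:
        return None  # no match; the caller tests for None
    return row[0], row[1], row[2]

# Call the function
result = log.read_log('hol1a')
if result is None:
    print("No entry found")
else:
    id, time, size = result
    print(id, time, size)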
I want to create a function that creates 3 data frames and then takes the element-wise average of the three. The data frames are generated in a loop, using a dictionary that was defined in an earlier step, like this:
# extracting and organizing data
def density_dataP(filenames):
    datasets = ["df_1", "df_2", "df_3"]
    for num in filenames:
        for index in range(len(datasets)):
            datasets[index] = pd.DataFrame({
                # excluding the edges b/c nothing interesting happens there
                "z-coordinate (nm)": mda.auxiliary.XVG.XVGReader(filenames[num]["water"])._auxdata_values[7:43:1, 0],
                "water": mda.auxiliary.XVG.XVGReader(filenames[num]["water"])._auxdata_values[7:43:1, 1],
                "acyl": mda.auxiliary.XVG.XVGReader(filenames[num]["acyl"])._auxdata_values[7:43:1, 1],
                "headgroups": mda.auxiliary.XVG.XVGReader(filenames[num]["head"])._auxdata_values[7:43:1, 1],
                "ester": mda.auxiliary.XVG.XVGReader(filenames[num]["ester"])._auxdata_values[7:43:1, 1],
                "protein": mda.auxiliary.XVG.XVGReader(filenames[num]["proa"])._auxdata_values[7:43:1, 1]
            })
    master_data = (df_1 + df_2 + df_3)/3
    return master_data
However, when I try to run the function with a valid input I get the following error:
---> 16 master_data = (df_1 + df_2 + df_3)/3
17 return master_data
NameError: name 'df_1' is not defined
The XVGReader method needs the path to an XVG file as input, and I have those paths contained in the dictionary. The first layer of the dictionary has a number; the second layer has the paths to the files. Each number is associated with all the paths for one of the three dataframes (i.e., all paths under key 1 are for df_1, etc.). The dictionary I am using looks roughly like this:
{1: {'water': '$PATH_TO_water1.xvg', 'acyl': '$PATH_TO_acyl1.xvg', 'head': '$PATH_TO_head1.xvg', 'ester': '$PATH_TO_ester1.xvg', 'proa': '$PATH_TO_proa1.xvg'},
 2: {'water': '$PATH_TO_water2.xvg', 'acyl': '$PATH_TO_acyl2.xvg', 'head': '$PATH_TO_head2.xvg', 'ester': '$PATH_TO_ester2.xvg', 'proa': '$PATH_TO_proa2.xvg'},
 3: {'water': '$PATH_TO_water3.xvg', 'acyl': '$PATH_TO_acyl3.xvg', 'head': '$PATH_TO_head3.xvg', 'ester': '$PATH_TO_ester3.xvg', 'proa': '$PATH_TO_proa3.xvg'}}
How do I get python to recognize the DataFrames created in this loop or at least get the final result of master_data?
In your example, "df_1" is a string in the list datasets, not a variable. If you want to access by name, then you would want datasets to be a dict with a key of df_1, etc. and the value a dataframe.
But you don't need to name items here because all you want is an average.
So I think you should simplify the function. The inner loop over datasets just builds the same value three times, so it can be omitted. Also, since filenames is a dict, you can iterate over its values (the inner dicts of paths) directly.
def density_dataP(filenames):
    datasets = []
    for paths in filenames.values():
        datasets.append(pd.DataFrame({
            "z-coordinate (nm)": mda.auxiliary.XVG.XVGReader(paths["water"])._auxdata_values[7:43:1, 0],
            "water": mda.auxiliary.XVG.XVGReader(paths["water"])._auxdata_values[7:43:1, 1],
            "acyl": mda.auxiliary.XVG.XVGReader(paths["acyl"])._auxdata_values[7:43:1, 1],
            "headgroups": mda.auxiliary.XVG.XVGReader(paths["head"])._auxdata_values[7:43:1, 1],
            "ester": mda.auxiliary.XVG.XVGReader(paths["ester"])._auxdata_values[7:43:1, 1],
            "protein": mda.auxiliary.XVG.XVGReader(paths["proa"])._auxdata_values[7:43:1, 1]
        }))
    # Element-wise average: stack the frames and average rows sharing an index label.
    return pd.concat(datasets).groupby(level=0).mean()
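As a sanity check of the averaging step, here is a tiny self-contained example (toy data, no MDAnalysis required):

import pandas as pd

f1 = pd.DataFrame({'x': [1.0, 2.0]})
f2 = pd.DataFrame({'x': [3.0, 4.0]})
f3 = pd.DataFrame({'x': [5.0, 6.0]})

# concat stacks the frames; grouping on the index averages matching rows.
print(pd.concat([f1, f2, f3]).groupby(level=0).mean())
# x -> 3.0, 4.0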
I'm using a map function to generate a new column whose value depends on the value of a column that already exists in the dataframe.
def computeTechFields(row):
    if row.col1 != VALUE_TO_COMPARE:
        tech1 = 0
    else:
        tech1 = 1
    return (row.col1, row.col2, row.col3, tech1)

delta2rdd = delta.map(computeTechFields)
The problem is that my main dataframe has more than 150 columns that I have to return with the map function so in the end I have something like this :
return (row.col1, row.col2, row.col3, row.col4, row.col5, row.col6, row.col7, row.col8, row.col9, row.col10, row.col11, row.col12, row.col13, row.col14, row.col15, row.col16, row.col17, row.col18 ..... row.col149, row.col150, row.col151, tech1)
As you can see, it is really long to write and difficult to read. So I tried to do something like this :
return (row.*, tech1)
But of course it did not work.
I know that the "withColumn" function exists but I don't know much about its performance and could not make it work anyway.
Edit (what happened with the withColumn function):
def computeTech1(row):
    if row.col1 != VALUE_TO_COMPARE:
        tech1 = 0
    else:
        tech1 = 1
    return tech1

delta2 = delta.withColumn("tech1", computeTech1)
And it gave me this error :
AssertionError: col should be Column
I tried to do something like this :
return col(tech1)
The error was the same
I also tried :
delta2 = delta.withColumn("tech1", col(computeTech1))
This time, the error was :
AttributeError: 'function' object has no attribute '_get_object_id'
End of the edit
So my question is, how can I return all the columns + a few more within my UDF used by the map function ?
Thanks !
I'm not super firm with Python, so people might correct me on the syntax here, but the general idea is to make your function a UDF with a column as input, then call that UDF inside withColumn. I used a lambda here, but with some fiddling it should also work with a named function.
from pyspark.sql.functions import udf, col
from pyspark.sql.types import IntegerType

computeTech1UDF = udf(
    lambda c: 0 if c != VALUE_TO_COMPARE else 1, IntegerType())
delta2 = delta.withColumn("tech1", computeTech1UDF(col("col1")))
What you tried did not work since you did not provide withColumn with a column expression (see http://spark.apache.org/docs/1.6.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.withColumn). Using the UDF wrapper achieves exactly that.
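If performance matters, note that you may not need a Python UDF at all here: the same logic can be written with the built-in when/otherwise column expressions, which run inside the JVM and avoid per-row Python serialization. A sketch under the same assumptions about delta and VALUE_TO_COMPARE:

from pyspark.sql.functions import when, col

delta2 = delta.withColumn(
    "tech1",
    when(col("col1") != VALUE_TO_COMPARE, 0).otherwise(1))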
I am following the Intro to Spark course on edX. However, I can't understand a few things; the following is from a lab assignment. FYI, I am not looking for the solution.
I am not able to understand as why I am receiving the error
TypeError: 'Column' object is not callable
Following is the code
from pyspark.sql.functions import regexp_replace, trim, col, lower

def removePunctuation(column):
    """
    Args:
        column (Column): A Column containing a sentence.
    """
    # The following line is giving the error. I believe I am selecting all
    # the rows from the dataframe 'column' where the attribute is named 'sentence'.
    result = column.select('sentence')
    return result

sentenceDF = sqlContext.createDataFrame([('Hi, you!',),
                                         (' No under_score!',),
                                         (' * Remove punctuation then spaces * ',)], ['sentence'])
sentenceDF.show(truncate=False)

(sentenceDF
    .select(removePunctuation(col('sentence')))
    .show(truncate=False))
Can you elaborate a little? TIA.
The column parameter is not a DataFrame object and, therefore, does not have access to the select method. You'll need to use other functions to solve this problem.
Hint: Look at the import statement.
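To illustrate the distinction without giving the solution away: the functions imported at the top each take a Column and return a new Column, and select is called on the DataFrame, not inside the helper. A sketch with an unrelated transformation (lowercasing instead of removing punctuation; my example, not the assignment's answer):

from pyspark.sql.functions import lower, col

def lowercaseColumn(column):
    # A Column comes in, a transformed Column goes out; no .select here.
    return lower(column)

sentenceDF.select(lowercaseColumn(col('sentence'))).show(truncate=False)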