Pandas Drop and Replace functions won't work within a UDF - python-3.x

I looked around at other questions but couldn't find out that addresses the issue I'm having. I am cleaning a data set in an ipython notebook. When I run the cleaning tasks individually they work as expected, but I am having trouble with the replace() and drop() functions when they are included in a UDF. Specifically, these lines aren't doing anything within the UDF, however, a dataframe is returned that completes the other tasks as expected (i.e. reads in the file, sets the index, and filters select dates out).
Any help is much appreciated!
Note that in this problem the df.drop() and df.replace() commands both work as expected when executed outside of the UDF. The function is below for your reference. The issue is with the last two lines "station.replace()" and "station.drop()".
def read_file(file_path):
'''Function to read in daily x data'''
if os.path.exists(os.getcwd()+'/'+file_path) == True:
station = pd.read_csv(file_path)
else:
!unzip alldata.zip
station = pd.read_csv(file_path)
station.set_index('date',inplace=True) #put date in the index
station = station_data[station_data.index > '1984-09-29'] #removes days where there is no y-data
station.replace('---','0',inplace=True)
station.drop(columns=['Unnamed: 0'],axis=1,inplace=True) #drop non-station columns

There was a mistake here:
station = station_data[station_data.index > '1984-09-29']
I was using an old table index. I corrected it to:
station = station[station.index > '1984-09-29']
Note, I had to restart the notebook and re-run it from the top for it to work. I believe it was an issue with conflicting table names in the UDF vs. what was already stored in memory.

Related

Parametrize and loop KQL queries in JupyterLab

My question is how to assign variables within a loop in KQL magic command in Jupyter lab. I refer to Microsoft's document on this subject and will base my question on the code given here:
https://learn.microsoft.com/en-us/azure/data-explorer/kqlmagic
1. First query below
%%kql
StormEvents
| summarize max(DamageProperty) by State
| order by max_DamageProperty desc
| limit 10
2. Second: Convert the resultant query to a dataframe and assign a variable to 'statefilter'
df = _kql_raw_result_.to_dataframe()
statefilter =df.loc[0].State
statefilter
3. This is where I would like to modify the above query and let statefilter have multiple variables (i.e. consist of different states):
df = _kql_raw_result_.to_dataframe()
statefilter =df.loc[0:3].State
statefilter
4. And finally I would like to run my kql query within a for loop for each of the variables within statefilter. This below syntax may not be correct but it can give an example for what I am looking for:
dfs = [] # an empty list to store dataframes
for state in statefilters:
%%kql
let _state = state;
StormEvents
| where State in (_state)
| do some operations here for that specific state
df = _kql_raw_result_.to_dataframe()
dfs.append(df) # store the df specific to state in the list
The reason why I am not querying all the desired states within the KQL query is to prevent resulting in really large query outcomes being assigned to dataframes. This is not for this sample StormEvents table which has a reasonable size but for my research data which consists of many sites and is really big. Therefore I would like to be able to run a KQL query/analysis for each site within a for loop and assign each site's query results to a dataframe. Please let me know if this is possible or perhaps there may other logical ways to do this within KQL...
There are few ways to do it.
The simplest is to refractor your %%kql cell magic to a %kql line magic.
Line magic can be embedded in python cell.
Other option is to: from Kqlmagic import kql
The Kqlmagic kql method, accept as a string a kql cell or line.
You can call kql from python.
Third way is to call the kql magic via the ipython method:
ip.run_cell_magic('kql', {your kql magic cell text})
You can call it from python.
Example of using the single line magic mentioned by Michael and a return statement that converted the result to JSON. Without the conversion to JSON I wasn't getting anything back.
def testKQL():
%kql DatabaseName | take 10000
return _kql_raw_result_.to_dataframe().to_json(orient='records')

If timestamp includes certain time, then trigger an action

I am trying to trigger an action, based on part of the timestamp.
Code (error is in line 3):
for ticker in tickers:
for i in range(1,len(ohlc_dict[ticker])):
if ohlc_dict[ticker]['Date'][i] == "21:50:00":
ohlc_dict:
Now there are 2 issues with this:
1: It doesn't recognize 'Date' (KeyError: 'Date')
2: it should get triggered at this time on every single day (if timestamp includes "21:50:00" THAN ...)
Just would like to answer an answer to my own question. I was over-complicating thing by wanting to do it all in that line. In stead, I went back to creating the dataframe and added this line:
ohlc_dict[ticker]['time'] = ohlc_dict[ticker].index.str[10:]
Causing 1 extra column in my DF with just the time, hence solving both issues.

Is this example data cleaning code updating the pandas dataframe? [duplicate]

This question already has answers here:
How to deal with SettingWithCopyWarning in Pandas
(20 answers)
Closed 3 years ago.
In this article on predicting values with linear regression there's a cleaning step
# For beginning, transform train['FullDescription'] to lowercase using text.lower()
train['FullDescription'].str.lower()
# Then replace everything except the letters and numbers in the spaces.
# it will facilitate the further division of the text into words.
train['FullDescription'].replace('[^a-zA-Z0-9]', ' ', regex = True)
This isn't actually assigning the changes to the dataframe, is it? But if I try something like this...
train['FullDescription'] = train['FullDescription'].str.lower()
train['FullDescription'] = train['FullDescription'].replace('[^a-zA-Z0-9]', ' ', regex = True)
Then I get a warning...
SettingWithCopyWarning: A value is trying to be set on a copy of a
slice from a DataFrame
See the caveats in the documentation:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
What's the right way to apply these transformations? Are they in fact already being applied? A print(train['FullDescription']) seems to show me they're not.
EDIT: #EdChum and #jezrael are very much onto something about missing code. When I'm actually trying to run this, my data needs to be split into test and train sets.
from sklearn.model_selection import train_test_split
all_data = pandas.read_csv('salary.csv')
train, test = train_test_split(all_data, test_size=0.1)
That's what seems to be causing this error. If I make the next line
train = train.copy()
test = test.copy()
then everything is happy.
You may be wondering if I shouldn't then just apply this step to all_data, which works, but then lower down in the code train['Body'].fillna('nan', inplace=True) still causes an error. So it seems indeed the problem is with train_test_split not creating copies.
The right way to apply these transformations would be...
df.loc[:, 'FullDescription'] = ...
More informations about this would be here. This is a page from the pandas documentation, all the way to the bottom. Quoting...
def do_something(df):
foo = df[['bar', 'baz']] # Is foo a view? A copy? Nobody knows!
# ... many lines here ...
# We don't know whether this will modify df or not!
foo['quux'] = value
return foo
You can also find extra-reasons of Why to use .loc here. Long story short : Explicit is better than implicit. And while df['some_column'] is not immediatly clear about the intent, usingdf.loc['some_column'] is.
I don't really know how to explain it in a simple way, but if you have further questions or if you think I could make my answer more explicit/eloquent, tell me. :)

How to solve this data set

I am importing a Data set from quandl using API. Everything is perfect, however the time series I am importing is reversed. By this I mean if I used the .head method to print the first elements in the data set, I will get the latest Data set figures and printing the tail will get oldest figures
df = pd.read_csv("https://www.quandl.com/api/v3/datasets/CHRIS/CME_CD4.csv?api_key=H32H8imfVNVm9fcEX6kB",parse_dates=['Date'],index_col='Date')
df.head()
This should be a pretty easy fix if I understand. Credit to behzad.nouri on this answer Right way to reverse pandas.DataFrame?.
You just need to reverse the order of your dataframe using the line below.
df = df.reindex(index=df.index[::-1])

How to use a conditional statement while scraping?

I'm trying to scrape the MTA website and need a little help scraping the "Train Lines Row." (Website for reference: https://advisory.mtanyct.info/EEoutage/EEOutageReport.aspx?StationID=All
The train line information is stored as image files (1 line subway, A line subway, etc) describing each line that's accessible through a particular station. I've had success scraping info out of rows in which only one train passes through, but I'm having difficulty figuring out how to iterate through the columns which have multiple trains passing through it...using a conditional statement to test for whether it has one line or multiple lines.
tableElements = table.find_elements_by_tag_name('tr')
that's the table i'm iterating through
tableElements[2].find_elements_by_tag_name('td')[1].find_element_by_tag_name('h4').find_element_by_tag_name('img').get_attribute('alt')
this successfully gives me the values if only one value exists in the particular column
tableElements[8].find_elements_by_tag_name('td')[1].find_element_by_tag_name('h4').find_elements_by_tag_name('img')
this successfully gives me a list of values I can successfully iterate through to extract my needed values.
Now I try and combine these lines of code together in a forloop to extract all the information without stopping.
for info in tableElements[1:]:
if info.find_elements_by_tag_name('td')[1].find_element_by_tag_name('h4').find_elements_by_tag_name('img')[1] == True:
for images in info.find_elements_by_tag_name('td')[1].find_element_by_tag_name('h4').find_elements_by_tag_name('img'):
print(images.get_attribute('alt'))
else:
print(info.find_elements_by_tag_name('td')[1].find_element_by_tag_name('h4').find_element_by_tag_name('img').get_attribute('alt'))
I'm getting the error message: "list index out of range." I dont know why, as every iteration done in isolation seems to work. My hunch is I haven't correctly used the boolean operation properly here. My idea was that if find_elements_by_tag_name had an index of [1] that would mean multiple image text for me to iterate through. Hence, why I want to use this boolean operation.
Hi All, thanks so much for your help. I've uploaded my full code to Github and attached a link for your reference: https://github.com/tsp2123/MTA-Scraping/blob/master/MTA.ElevatorData.ipynb
The end goal is going to be to put this information into a dataframe using some formulation of and having a for loop that will extract the image information that I want.
dataframe = []
for elements in tableElements:
row = {}
columnName1 = find_element_by_class_name('td')
..
Your logic isn't off here.
"My hunch is I haven't correctly used the boolean operation properly here. My idea was that if find_elements_by_tag_name had an index of [1] that would mean multiple image text for me to iterate through."
The problem is it can't check if the statement is True if there's nothing in index position [1]. Hence the error at this point.
if info.find_elements_by_tag_name('td')[1].find_element_by_tag_name('h4').find_elements_by_tag_name('img')[1] == True:
What you want to do is use try: So something like:
for info in tableElements[1:]:
try:
if info.find_elements_by_tag_name('td')[1].find_element_by_tag_name('h4').find_elements_by_tag_name('img')[1] == True:
for images in info.find_elements_by_tag_name('td')[1].find_element_by_tag_name('h4').find_elements_by_tag_name('img'):
print(images.get_attribute('alt'))
else:
print(info.find_elements_by_tag_name('td')[1].find_element_by_tag_name('h4').find_element_by_tag_name('img').get_attribute('alt'))
except:
#do something else
print ('Nothing found in index position.')
Is it also possible to back to your question and provide the full code? When I try this, I'm getting 11 table elements, so want to test it with the specific table you're trying to scrape.

Resources