Correlation matrix in Spark DataFrame with categorical values - apache-spark

I am working with the Fire Incidents dataset and was asked to find the correlation between the columns neighborhood_district, zipcode, and the number of fire calls. Since neighborhood_district and zipcode are categorical string columns, the built-in df.stat.corr('zipcode', 'count') throws the following error:
IllegalArgumentException Traceback (most recent call last)
<ipython-input-36-4c6eebe04755> in <module>
----> 1 df3.stat.corr('zipcode','count')
3 frames
/content/spark-3.2.1-bin-hadoop2.7/python/pyspark/sql/utils.py in deco(*a, **kw)
115 # Hide where the exception came from that shows a non-Pythonic
116 # JVM exception message.
--> 117 raise converted from None
118 else:
119 raise
IllegalArgumentException: requirement failed: Currently correlation calculation for columns with dataType string not supported.
How can I compute correlations over categorical string columns in PySpark DataFrames? Note that I'm working in PySpark, not Scala.
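One common workaround (a sketch only, assuming the df3 DataFrame from the traceback above) is to encode the string column into numeric indices with StringIndexer before calling corr; note that a Pearson correlation over arbitrary category codes is at best a rough signal, not a statistically meaningful one:
from pyspark.ml.feature import StringIndexer

# Encode the categorical string column as numeric label indices
indexer = StringIndexer(inputCol='zipcode', outputCol='zipcode_idx')
df3_indexed = indexer.fit(df3).transform(df3)

# corr now works because both columns are numeric
print(df3_indexed.stat.corr('zipcode_idx', 'count'))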

Related

Writing a DataFrame to Excel without the index gives the wrong format

Here is my dataframe:
df
year 2022 2021
0 return on equity (roe) 160.90% 144.10%
1 average equity 62027.9677 65704.372
2 net profit margin 0.2531 0.2588
3 turnover 1.1179 1.0422
4 leverage 5.687 5.3421
I want to write it to Excel without the index:
df.to_excel('/tmp/test.xlsx',index=False)
Why is there an empty cell in the top-left corner of the test.xlsx file?
How can I get the format below with the to_excel method?
Adding the header argument doesn't help:
df.to_excel('/tmp/test.xlsx', index=False, header=True)
Now, reading back from the Excel file:
new_df = pd.read_excel('/tmp/test.xlsx',index_col=False)
new_df
Unnamed: 0 year 2022 2021
0 return on equity (roe) 160.90% 144.10% NaN
1 average equity 62027.9677 65704.372 NaN
2 net profit margin 0.2531 0.2588 NaN
3 turnover 1.1179 1.0422 NaN
4 leverage 5.687 5.3421 NaN
And passing a bool for the header argument when reading raises an error:
new_df = pd.read_excel('/tmp/test.xlsx',index_col=False,header=True)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/debian/.local/lib/python3.9/site-packages/pandas/util/_decorators.py", line 211, in wrapper
return func(*args, **kwargs)
File "/home/debian/.local/lib/python3.9/site-packages/pandas/util/_decorators.py", line 331, in wrapper
return func(*args, **kwargs)
File "/home/debian/.local/lib/python3.9/site-packages/pandas/io/excel/_base.py", line 490, in read_excel
data = io.parse(
File "/home/debian/.local/lib/python3.9/site-packages/pandas/io/excel/_base.py", line 1734, in parse
return self._reader.parse(
File "/home/debian/.local/lib/python3.9/site-packages/pandas/io/excel/_base.py", line 732, in parse
validate_header_arg(header)
File "/home/debian/.local/lib/python3.9/site-packages/pandas/io/common.py", line 203, in validate_header_arg
raise TypeError(
TypeError: Passing a bool to header is invalid. Use header=None for no header or header=int or list-like of ints to specify the row(s) making up the column names
Include the header parameter as True:
df.to_excel('test.xlsx', index=False, header=True)
To read it back into a DataFrame, set the index_col parameter to None:
new_df = pd.read_excel('test.xlsx',index_col=None)
print(new_df)
year 2022 2021
0 return on equity (roe) 160.90% 144.10%
1 average equity 62027.9677 65704.372
2 net profit margin 0.2531 0.2588
3 turnover 1.1179 1.0422
4 leverage 5.687 5.3421
I found the reason: the DataFrame in this example is special:
df.columns
MultiIndex([('year',),
('2022',),
('2021',)],
)
It's not a single-level index. Flattening the columns fixes it:
df.columns = ['year', '2022', '2021']
df.to_excel('/tmp/test.xlsx', index=False)
The strange behaviour finally disappeared.
In my case, a DataFrame with the MultiIndex [('year',), ('2022',), ('2021',)] displays exactly the same as one with the single-level index ['year', '2022', '2021'].
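For illustration, here is a minimal reproduction with made-up data (a sketch; the tuple construction below is just one way such columns can arise):
import pandas as pd

# Columns built from one-element tuples form a (single-level) MultiIndex
df = pd.DataFrame([['roe', 1.609, 1.441]],
                  columns=pd.MultiIndex.from_tuples([('year',), ('2022',), ('2021',)]))
print(type(df.columns))  # <class 'pandas.core.indexes.multi.MultiIndex'>

# Flatten to a plain Index before exporting; to_excel then writes a normal header row
df.columns = [c[0] for c in df.columns]
df.to_excel('/tmp/test.xlsx', index=False)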

Azure AutoML with historical data context forecast

I have a question about the peculiar behaviour of Azure AutoML when forecasting with historical data context.
Basically, I want to apply this use case from the documentation.
The idea is to train a model on historical data (say, 3 months) and then feed the model the current prediction context (for example, the last two weeks) in order to predict over a certain forecast horizon.
According to the documentation, to train the model with historical data you need a configuration like this:
forecasting_parameters = ForecastingParameters(
    time_column_name='Timestamp',
    target_aggregation_function='mean',
    freq='H',
    forecast_horizon=prediction_horizon_hours,
    target_lags='auto',
)
label = signalTags
automl_config = AutoMLConfig(
    task='forecasting',
    primary_metric='normalized_root_mean_squared_error',
    experiment_timeout_minutes=30,
    blocked_models=["AutoArima"],
    enable_early_stopping=True,
    training_data=Data,
    label_column_name=label,
    n_cross_validations=3,
    enable_ensembling=False,
    verbosity=logging.INFO,
    forecasting_parameters=forecasting_parameters,
)
After training, in order to perform a prediction I need to feed the model the "context" for what I want to predict, in the form of a DataFrame where the target column is filled in for the context rows and left empty for the rows I want predicted, and then just call forecast. Something like this:
Timestamp Signal
0 2022-08-07T23:00:00Z 63.16
1 2022-08-08T00:00:00Z 62.92
2 2022-08-08T01:00:00Z 62.89
3 2022-08-08T02:00:00Z 62.79
4 2022-08-08T03:00:00Z 62.75
.. ... ...
233 2022-08-23T17:00:00Z nan
234 2022-08-23T18:00:00Z nan
235 2022-08-23T19:00:00Z nan
236 2022-08-23T20:00:00Z nan
237 2022-08-23T21:00:00Z nan
After all this context (pun intended) here is the question/problem.
When I use the above dataframe to forecast ahead I get an error that mentions the following:
ForecastingConfigException:
Message: Expected column(s) target value column not found in y_pred.
InnerException: None
ErrorResponse
{
    "error": {
        "code": "UserError",
        "message": "Expected column(s) target value column not found in y_pred.",
        "target": "y_pred",
        "inner_error": {
            "code": "BadArgument",
            "inner_error": {
                "code": "MissingColumnsInData"
            }
        },
        "reference_code": "ac316505-87e4-4877-a855-65a24c3a796b"
    }
}
However, if I feed it a slightly different DataFrame, where the timestamps to be forecast fall at any time except exactly on the hour (e.g. 10h30, 11h01, 10h23), it works normally. If I give it something like this:
Timestamp Signal
0 2022-08-07T23:00:00Z 63.16
1 2022-08-08T00:00:00Z 62.92
2 2022-08-08T01:00:00Z 62.89
3 2022-08-08T02:00:00Z 62.79
4 2022-08-08T03:00:00Z 62.75
.. ... ...
233 2022-08-23T17:00:01Z nan
234 2022-08-23T18:00:01Z nan
235 2022-08-23T19:00:01Z nan
236 2022-08-23T20:00:01Z nan
237 2022-08-23T21:00:01Z nan
It outputs good results. What gives?
I have tried resetting the index of the DataFrame and replacing None with NaN, but nothing seems to work. Azure AutoML can predict any date except ones that fall exactly on the hour.
What can I do to fix this?
Thanks!
I managed to get it to work by changing how I call the forecast model.
Taking into account these variables:
X : the time values
y : the target values
df : a DataFrame holding both columns, df[[X, y]]
For a univariate series, instead of using this:
model.forecast(X, y)
I need to call:
model.forecast(df, y)
Remember that to call forecast you need to supply the arguments as a DataFrame or a NumPy array.
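For reference, a minimal sketch of the working call (variable names are illustrative, and the two-element return value reflects the SDK v1 forecast interface as I understand it, so treat it as an assumption):
# df holds both the Timestamp column and the Signal target column,
# with NaN in Signal marking the horizon rows to be predicted
y = df['Signal'].values

# Pass the full frame plus the target array, not just the time values;
# this avoided the "target value column not found in y_pred" error for me
y_pred, X_trans = model.forecast(df, y)
print(y_pred[-5:])  # the last few forecast values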

Pivot Table Function not working - AttributeError: 'numpy.ndarray' object has no attribute 'name'

(Screenshots: first and last columns of the dataset)
I'm unsure what this error means. I'm creating a pivot table of my dataset (the Canadian Labour Force Survey, which details changes in employment statistics over time), and it works for every year and province combination except one in particular.
The code I'm running is:
no_row_df = pd.pivot_table(no_row_data, values="FINALWT", index=["SURVDATE", "PROV"], aggfunc=np.sum)
for col in count_columns:
    table = pd.pivot_table(no_row_data, values="FINALWT", index=["SURVDATE", "PROV"], columns=[col], aggfunc=np.sum)
    no_row_df = df.merge(table, left_index=True, right_on=["SURVDATE", "PROV"])
no_row_df
where SURVDATE is the date of the survey (in year/month form) and PROV is the province. FINALWT is the weighting and represents the number of individuals with a particular labour characteristic. There is a whole host of other columns, but they essentially boil down to things like Industry, Occupation, Firm Size, etc.
The full error is:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-17-ade72c6b3106> in <module>
2 for col in count_columns:
3 table = pd.pivot_table(no_row_data, values = "FINALWT", index = ["SURVDATE", "PROV"], columns=[col], aggfunc=np.sum)
----> 4 no_row_df = df.merge(table, left_index=True, right_on=["SURVDATE", "PROV"])
5 no_row_df
~/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py in merge(self, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
7347 copy=copy,
7348 indicator=indicator,
-> 7349 validate=validate,
7350 )
7351
~/anaconda3/lib/python3.7/site-packages/pandas/core/reshape/merge.py in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
81 validate=validate,
82 )
---> 83 return op.get_result()
84
85
~/anaconda3/lib/python3.7/site-packages/pandas/core/reshape/merge.py in get_result(self)
665 result = self._indicator_post_merge(result)
666
--> 667 self._maybe_add_join_keys(result, left_indexer, right_indexer)
668
669 self._maybe_restore_index_levels(result)
~/anaconda3/lib/python3.7/site-packages/pandas/core/reshape/merge.py in _maybe_add_join_keys(self, result, left_indexer, right_indexer)
819 elif result._is_level_reference(name):
820 if isinstance(result.index, MultiIndex):
--> 821 key_col.name = name
822 idx_list = [
823 result.index.get_level_values(level_name)
AttributeError: 'numpy.ndarray' object has no attribute 'name'
Thanks in advance.
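One likely fix (a sketch, untested against the original data): every pivot table here is indexed by the same (SURVDATE, PROV) MultiIndex, so merge index-to-index instead of mixing left_index with right_on, which is the code path that trips over the MultiIndex in _maybe_add_join_keys. Note the loop should also merge into no_row_df rather than df:
import numpy as np
import pandas as pd

no_row_df = pd.pivot_table(no_row_data, values="FINALWT",
                           index=["SURVDATE", "PROV"], aggfunc=np.sum)
for col in count_columns:
    table = pd.pivot_table(no_row_data, values="FINALWT",
                           index=["SURVDATE", "PROV"], columns=[col], aggfunc=np.sum)
    # Both frames share the same MultiIndex, so join on it directly
    no_row_df = no_row_df.merge(table, left_index=True, right_index=True)
no_row_df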

Dropping rows with missing data from a DataFrame column of lists

I have a DataFrame column with list type, but some of the rows in the column have missing data, so I want to drop the rows with missing data.
for i in range(len(data_Re)):
    if type(data_Re.ix[i, 'Actual CoQ Value1']) == float:
        data_Re.ix[i, 'Actual CoQ Value1'] = data_Re.ix[i, 'Actual CoQ Value1'].dropna(data_Re.ix[i, 'Actual CoQ Value1'], axis=0)
        print(i)
AttributeError Traceback (most recent call last)
in
2
3 if type(data_Re.ix[i,'Actual CoQ Value1']) == float:
----> 4 data_Re.ix[i,'Actual CoQ Value1'] = data_Re.ix[i,'Actual CoQ Value1'].dropna(data_Re.ix[i,'Actual CoQ Value1'], axis=0)
5 print(i)
AttributeError: 'float' object has no attribute 'dropna'
There are a thousand ways to do this in a single line. Here are some examples:
# Option 1
data_Re.dropna(subset=['Actual CoQ Value1'], inplace=True)
# Option 2
data_Re = data_Re[data_Re['Actual CoQ Value1'].notnull()]
# Option 3
data_Re = data_Re[~data_Re['Actual CoQ Value1'].isnull()]
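As a tiny demonstration with made-up data: the missing entries in an object column are NaN floats, which is exactly what the type check in the question was detecting, and dropna removes those rows in one shot:
import numpy as np
import pandas as pd

# A list-valued column where some rows are missing (NaN, i.e. a float)
data_Re = pd.DataFrame({'Actual CoQ Value1': [[1, 2], np.nan, [3], np.nan]})
data_Re = data_Re.dropna(subset=['Actual CoQ Value1'])
print(data_Re)
#   Actual CoQ Value1
# 0            [1, 2]
# 2               [3]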

ValueError: DataFrame constructor not properly called when converting JSON data to a DataFrame

I've run into a problem when trying to convert JSON data to a DataFrame using the pandas and json packages.
My raw data from the JSON file looks like this:
{"Number":41,"Type":["A1","A2","A3","A4","A5"],"Percent":{"Very Good":1.2,"Good":2.1,"OK":1.1,"Bad":1.3,"Very Bad":1.7}}
And my code is:
import pandas as pd
import json

with open('Test.json', 'r') as filename:
    json_file = json.load(filename)

df = pd.DataFrame(json_file['Type'], columns=['Type'])
When I read only Type from the JSON file, it gives the correct result:
Type
0 A1
1 A2
2 A3
3 A4
4 A5
However, when I read only Number from the JSON file:
df = pd.DataFrame(json_file['Number'], columns=['Number'])
It gives me the error: ValueError: DataFrame constructor not properly called!
Also, if I use:
df = pd.DataFrame.from_dict(json_file)
I get the error:
ValueError: Mixing dicts with non-Series may lead to ambiguous ordering.
I have done some research on Google but still cannot figure out why.
My goal is to break this JSON data into two DataFrames. The first combines Number and Type:
Number Type
0 41 A1
1 41 A2
2 41 A3
3 41 A4
4 41 A5
The other DataFrame I want holds the data in Percent, which may look like:
Very Good 1.2
Good 2.1
OK 1.1
Bad 1.3
Very Bad 1.7
The error happens because json_file['Number'] is a scalar (41), and the DataFrame constructor needs list-like or dict-like data. This should give your desired output:
with open('Test.json', 'r') as filename:
    json_file = json.load(filename)

df = pd.DataFrame({'Number': json_file.get('Number'), 'Type': json_file.get('Type')})
df2 = pd.DataFrame({'Percent': json_file.get('Percent')})
Number Type
0 41 A1
1 41 A2
2 41 A3
3 41 A4
4 41 A5
Percent
Bad 1.3
Good 2.1
OK 1.1
Very Bad 1.7
Very Good 1.2
You can generalize this into a function:
def json_to_df(d, ks):
    return pd.DataFrame({k: d.get(k) for k in ks})

df = json_to_df(json_file, ['Number', 'Type'])
If you wish to avoid using the json package, you can do it directly in pandas:
_df = pd.read_json('Test.json', typ='series')
df = pd.DataFrame.from_dict(dict(_df[['Number', 'Type']]))
df2 = pd.DataFrame.from_dict(_df['Percent'], orient='index', columns=['Percent'])
