Spark VectorAssembler creating a dictionary from features - apache-spark

I'm having trouble with VectorAssembler in PySpark.
Task
I have a bunch of data (all numerical) that describes the customer base for a phone company (monthly bills, gender, etc.). One datum can be seen here:
My goal is to try to predict if a customer will churn (leave the service).
Method
In pursuit of this goal, I implemented the following code:
from pyspark.ml.feature import VectorAssembler
ignore = ['Churn_indexed', 'customerID']
vectorAssembler = VectorAssembler(inputCols=[x for x in df_num.columns if x not in ignore],
                                  outputCol='Independent Features')
df_final = vectorAssembler.transform(df)
Where I have made a VectorAssembler to ignore customerID (irrelevant) and churn (to be predicted).
Error
I then printed df_final to ensure it looked as expected and saw the following, where the red dots indicate rows that are of the expected form. The remaining rows do not have the expected form:
For some reason some of the rows have "20" at the beginning followed by a list. I should note that there are 20 features (including Churn and excluding CustomerID), which is perhaps where this prepended 20 comes from? I printed an incorrect row and a correct row, and it looks like for the incorrect ones the assembler turned the features into a dictionary.
CSV
I opened the CSV and tried looking for extra spaces or bad formatting and could not recognize a pattern for determining what ended up being VectorAssembled properly and what did not.
Has anyone else run into this issue? I've been troubleshooting for a long time to no avail.
Thanks for any help.

It looks like the VectorAssembler chooses, per row, whichever output format uses less memory, so rows with enough zero entries come out in sparse form, as described in the VectorAssembler documentation. The printed sparse form is (size, [indices], [values]): the leading 20 is the length of the vector, followed by the positions and values of the non-zero entries, so nothing has actually been turned into a dictionary.
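As a rough illustration of how the two print formats relate, here is a small sketch using pyspark.ml.linalg directly rather than the question's data:

from pyspark.ml.linalg import Vectors

# Dense form: every value is stored and printed as a plain list of numbers.
dense = Vectors.dense([1.0, 0.0, 0.0, 2.0])

# Sparse form: (size, [indices], [values]) -- the leading number is the
# length of the vector, not an extra feature and not a dictionary key.
sparse = Vectors.sparse(4, [0, 3], [1.0, 2.0])

print(dense)             # [1.0,0.0,0.0,2.0]
print(sparse)            # (4,[0,3],[1.0,2.0])
print(sparse.toArray())  # [1. 0. 0. 2.] -- same values, back in dense form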

Related

Pandas DataFrame indexing problem from 40,000 to 49,999

I have a strange problem with my code (At least it is strange for me!).
I have a Pandas DataFrame called "A". One of the columns is named "asin". I want to select every row that matches a specific value in that column, so I write this simple code:
df2 = A[A['asin']=='B0000UYZG0']
And it works normally as expected, except for rows from 40,000 to 49,999!
It doesn't work on those rows at all!
Referring to the picture, df2 = A[A['asin']=='0077614992'] (row 50,000) works, but df2 = A[A['asin']=='B0006O0WW6'] (row 49,999) does not!
I have not tried all 10,000 rows, but the ones I tested at random returned nothing.
I have grown accustomed to fixing bugs like this one; usually when it happens it is because of an unexpected dtype, or because the string you see displayed isn't actually THE string itself. It seems your issue is mostly the second case.
So let's first clear your "string" column of any whitespace:
df2['asin'] = df2.asin.str.strip()
# assuming df2 is the frame whose filter is not working
After that, try rerunning your filter:
df2[df2['asin'].eq('0077614992')]
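If the stray characters are in the original frame, here is a minimal sketch of the same idea applied directly to A (the DataFrame name from the question; the row label 49999 below assumes a default integer index):

# Strip whitespace from the source column, then filter the original frame again.
A['asin'] = A['asin'].str.strip()
df2 = A[A['asin'] == 'B0006O0WW6']

# If it still comes back empty, check the dtype and use repr() to expose
# hidden characters that str.strip() would not remove.
print(A['asin'].dtype)
print(repr(A.loc[49999, 'asin']))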

How to check where rows in a dataframe are missing?

I'm working on a rather lengthy data manipulation project, and one of the datasets that I merged into my dataframe is 96 rows shorter than the others (6867 vs 6963), and I cannot for the life of me figure out where.
Checking .head() and .tail() shows that all datasets in the dataframe start and end on the same date, but I have reason to think it's interfering with my other analysis, for obvious reasons.
Checking crypto_df[['doge_close', 'doge_open']] shows those columns are the correct length of 6963 (the total number of dates in the dataframe), but I cannot find where in the dataset the rows are missing.
Is there a function to check this? I don't think I'm being specific enough on google, if there is.
If you are looking for rows that have NaN, do this:
crypto_df[crypto_df['doge_close'].isnull()]
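If the gap could be in either column, the same idea extends to checking both at once (column names taken from the question):

# Rows where either doge column is missing
missing = crypto_df[crypto_df[['doge_close', 'doge_open']].isnull().any(axis=1)]
print(len(missing))   # how many rows are affected
print(missing.index)  # which rows (dates) they are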

How to assign new column to Pandas Data Frame while keeping index in place?

This is my first time working with the Pandas library so if this a silly mistake I apologize.
For a Kaggle competition, I'm creating a new column, Date_Diff, in a data frame based on the difference between two columns. Both columns are initially strings; I convert them to datetime types, subtraction then yields a timedelta, and I convert that to a float by taking the days of the timedelta object.
All of the features of the dataset are thrown into a matrix in a function. I keep getting the following error when running that function:
ValueError: negative dimensions are not allowed
When assigning Date_Diff to the data frame, how can I keep the index in place so we can keep the number of rows (if this isn't the right terminology please let me know)?
I thought this post would answer my question, but unfortunately it didn't.
I've been looking up how to do this for two days now and still haven't found exactly what I'm looking for.
Converting to float and assigning the column:
date_diff = []  # built up as a plain Python list
for i in range(len(cur_date)):
    date_diff.append(abs(cur_date[i] - iss_date[i]))
    date_diff[i] = float(date_diff[i].days)

# The first three column assignments work fine, the Date_Diff one does not
raw_data_set = raw_data_set.assign(Date_Diff=pd.Series(date_diff).values)
After tracing through the program, it looks like the matrix is getting negative dimensions because one of the parameters used to build the matrix is set to the total number of rows in the feature matrix minus 1. The feature matrix isn't getting any rows, so this is returning 0 - 1, for a value of -1.
The function where the matrix is being created was in the starter code for the competition so I don't think it's wrong. I also added three columns to the data frame before Date_Diff and I had no issues so I really think it's with how I'm assigning Date_Diff.
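For what it's worth, here is a minimal vectorized sketch of the same Date_Diff computation; it keeps everything aligned with raw_data_set's index, so no rows are lost on assignment (cur_date and iss_date stand in for whatever the real column names are):

import pandas as pd

# Parse both string columns, subtract to get a timedelta Series that shares
# raw_data_set's index, then convert to float days.
cur = pd.to_datetime(raw_data_set['cur_date'])
iss = pd.to_datetime(raw_data_set['iss_date'])
raw_data_set['Date_Diff'] = (cur - iss).abs().dt.days.astype(float)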

How to perform group by on multiple columns in a pandas dataframe and predict future values using fbProphet in python?

My dataframe looks like the following. I am trying to aggregate (sum) my Amount column based on the Date and Group columns in the pandas dataframe. I was able to aggregate the column successfully. However, I am not sure how to pass the grouped data to fbprophet to predict future values per Date and Group. Below is the code for the aggregation.
Note: I am a beginner in Python, so please provide an explanation with the code.
Data Frame
import pandas as pd
data = {'Date': ['2017-01-01', '2017-01-01', '2017-01-01', '2017-01-01',
                 '2017-01-01', '2017-01-01', '2017-01-01', '2017-01-01',
                 '2017-02-01', '2017-02-01', '2017-02-01', '2017-02-01',
                 '2017-02-01', '2017-02-01', '2017-02-01', '2017-02-01'],
        'Group': ['A', 'A', 'B', 'B', 'C', 'C', 'D', 'D',
                  'A', 'A', 'B', 'B', 'C', 'C', 'D', 'D'],
        'Amount': ['12.1', '13.2', '15.1', '10.7', '12.9', '9.0', '5.6', '6.7',
                   '4.3', '2.3', '4.0', '5.6', '7.8', '2.3', '5.6', '8.9']}
df = pd.DataFrame(data)
Code tried so far:
grouped = df.groupby(['Group','Date'])[['Amount']].sum()
You're suffering from a few problems.
numeric
The 3rd line of your data initialization should pass in float rather than str.
Elide the quotes.
Or, this will fix it:
'Amount':[float(n) for n in ['12.1','13.2','15.1','10.7','12.9','9.0','5.6','6.7','4.3','2.3','4.0','5.6','7.8','2.3','5.6','8.9']]}
We do this because you really don't want .sum() to put together 12.1 and 13.2
and come up with '12.113.2'.
You'd much prefer 25.3.
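A two-line illustration of the difference, in plain Python:

print('12.1' + '13.2')   # 12.113.2 -- string concatenation
print(12.1 + 13.2)       # 25.299999999999997 -- i.e. 25.3, up to float rounding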
index
The grouped object you computed looks good on the face of it,
but if you inspect the .dtypes attribute you'll see that Group and Date
have moved into the index, so Amount is the only column it would offer to Facebook Prophet.
To remedy that, use .reset_index():
>>> grouped.reset_index(inplace=True)
>>> grouped.dtypes
Group object
Date object
Amount float64
dtype: object
But now we see one last fly in the ointment.
dates
Having opaque categories of 'A' or 'B' is fine,
but for Date we probably want to know that February or March
comes a certain number of days after January,
rather than leaving opaque str labels in that column.
We might have done the type conversion back when we presented the data input,
but it's fine to clean it up at this stage, too:
import datetime as dt

def to_timestamp(day: str):
    return dt.datetime.strptime(day, '%Y-%m-%d')

grouped['Date'] = grouped.Date.apply(to_timestamp)
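Equivalently, pandas can do that conversion in one vectorized call, as a small alternative to the apply above:

grouped['Date'] = pd.to_datetime(grouped['Date'], format='%Y-%m-%d')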
Having successfully wrangled the shape and types of your data,
you should now be in a good position to let libraries further analyze it.
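As for actually handing this to fbprophet (the part the question asked about), here is a minimal sketch, assuming the fbprophet package is installed and fitting one model per Group. Prophet expects its input columns to be named ds and y; note that with only two months per group in the toy data the fit will not be meaningful, so a real run needs more history.

from fbprophet import Prophet

forecasts = {}
for name, g in grouped.groupby('Group'):
    # Prophet wants exactly two columns: ds (the date) and y (the value).
    train = g.rename(columns={'Date': 'ds', 'Amount': 'y'})[['ds', 'y']]

    m = Prophet()
    m.fit(train)

    # Forecast, e.g., three more months for this Group.
    future = m.make_future_dataframe(periods=3, freq='MS')
    forecasts[name] = m.predict(future)

print(forecasts['A'][['ds', 'yhat']])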

NetworkXError: Input is not a correct Pandas DataFrame

I'm trying to create a network map in Python using NetworkX, with labels, to find potential connections between people. Currently I have a 1,316 x 1,316 matrix of 1's and 0's that represents whether there is a connection or not. I have been able to import this data into Python as a dataframe (Screenshot of Dataframe shows a small part of it). You can see the column and row names are numbers, but at the end of the list they turn into actual names. If I remove the names and make a grid of just 1's and 0's, NetworkX will let me turn this into a connection graph, but it is almost useless without the labels to know who is connected to whom. When I include the labels in the dataframe and run the code listed below, I receive an error.
Error
NetworkXError: Input is not a correct Pandas DataFrame.
Code
Network = pd.DataFrame.from_csv('H:\\Network.csv')
G1 = nx.to_networkx_graph(Network)
I will admit to being quite new at this so any and all help will be appreciated. If you have any advice on why I'm getting this error or a better way to go about this, I am open to suggestions.
Thanks for your help.
I had the same problem. What helped me solve it was simply transforming the dataframe to a numpy array. (This only works if shape[0] == shape[1].)
Here is what I did:
# Note: DataFrame.from_csv has since been removed from pandas;
# pd.read_csv('H:\\Network.csv', index_col=0) is the modern equivalent.
Network = pd.DataFrame.from_csv('H:\\Network.csv')
G1 = nx.to_networkx_graph(Network.to_numpy())
This should work for you.
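One caveat: converting to a numpy array throws away the row and column names, which were the reason for the question. If keeping the labels matters, here is a sketch using NetworkX's adjacency helper (NetworkX 2.x; it requires the DataFrame's index and columns to carry matching labels, and pd.read_csv stands in for the removed from_csv):

import networkx as nx
import pandas as pd

Network = pd.read_csv('H:\\Network.csv', index_col=0)

# Ensure the row labels match the column labels, then build the graph
# straight from the labeled adjacency matrix.
Network.index = Network.columns
G1 = nx.from_pandas_adjacency(Network)

print(list(G1.nodes()))  # node names are the people, not bare integers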
