getting a groupby error through a for loop - pandas-groupby

fashion = [1,1,2,3,3,3,21,1,1,1,5,5,5,5,3,3,2,6]
for key,group in groupby(fashion):
    print(key,':',list(group))
I have written the above code to group by certain numbers and get a list. For example, I want an outcome such as:
1 : [1,1,1,1,1]
2 : [2,2]
Can someone please tell me what's wrong with my code?

fashion is neither a sorted list for itertools.groupby() nor a Pandas DataFrame. itertools.groupby() only groups consecutive equal items, so an unsorted list yields a separate group for every run of the same value. Please post the fully functioning code and clarify whether you are attempting to use Pandas to group a Series or a DataFrame.
See "How to make good reproducible pandas examples", which contains good advice for getting help with Pandas features.
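If plain itertools is what was intended, a minimal sketch of the fix is to sort the list before grouping. The `grouped` dictionary is only an illustrative name and is not part of the original post.

```python
from itertools import groupby

fashion = [1, 1, 2, 3, 3, 3, 21, 1, 1, 1, 5, 5, 5, 5, 3, 3, 2, 6]

# groupby() only merges adjacent equal items, so sort first.
grouped = {key: list(group) for key, group in groupby(sorted(fashion))}

for key, group in grouped.items():
    print(key, ':', group)
```

With the sorted input, the key 1 maps to [1, 1, 1, 1, 1] and the key 2 maps to [2, 2], which matches the outcome asked for in the question.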

Related

Spark VectorAssembler creating a dictionary from features

I'm having trouble with VectorAssembler in PySpark.
Task
I have a bunch of data (all numerical) that describes the customer base for a phone company (monthly bills, gender, etc.). One datum can be seen here:
My goal is to try to predict if a customer will churn (leave the service).
Method
In pursuit of this goal, I implemented the following code:
from pyspark.ml.feature import VectorAssembler
ignore = ['Churn_indexed', 'customerID']
vectorAssembler = VectorAssembler(inputCols=[x for x in df_num.columns if x not in ignore],
                                  outputCol='Independent Features')
df_final = vectorAssembler.transform(df)
Where I have made a VectorAssembler to ignore customerID (irrelevant) and churn (to be predicted).
Error
I then printed df_final to ensure it looked as expected and saw the following, where the red dots indicate rows that are of the expected form. The remaining rows do not have the expected form:
For some reason some of the rows have "20" at the beginning followed by a list. I should note that there are 20 features (including Churn and excluding CustomerID), which is perhaps where this prepended 20 comes from? I printed an incorrect row and a correct row, and it looks like for the incorrect ones the assembler turned the features into a dictionary?
CSV
I opened the CSV and tried looking for extra spaces or bad formatting and could not recognize a pattern for determining what ended up being VectorAssembled properly and what did not.
Has anyone else run into this issue? I've been troubleshooting for a long time to no avail.
Thanks for any help.
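It looks like the VectorAssembler automatically stores rows with many zeros (roughly half or more, from what I can tell) in their sparse form, choosing whichever representation takes less storage. In the sparse form the leading 20 is the vector size, followed by the indices of the non-zero entries and then their values, so these rows are not malformed.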
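To make the sparse form easier to read, here is a small pure-Python sketch of the idea. The helpers `to_sparse` and `to_dense` are hypothetical illustrations, not Spark APIs, and the example row is made up.

```python
def to_sparse(dense):
    """Return (size, indices, values) for the non-zero entries of a dense row."""
    indices = [i for i, v in enumerate(dense) if v != 0]
    values = [dense[i] for i in indices]
    return len(dense), indices, values


def to_dense(size, indices, values):
    """Rebuild the dense row from its (size, indices, values) description."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


# Made-up feature row with mostly zeros.
row = [0.0, 1.0, 0.0, 0.0, 29.85, 0.0]
sparse = to_sparse(row)
print(sparse)
print(to_dense(*sparse) == row)
```

Both forms describe the same feature vector, so downstream estimators should treat the rows identically.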

How to count unique values in aggregation in koalas

New to koalas and trying to do something really basic. I am simply trying to count the unique values in a column in an aggregation. In pandas I would do:
df.groupby('columnname').agg({'column_i_want_count_of_unique_values' : pd.Series.nunique})
But for example ks.Series.nunique doesn't work and 'count' also does not seem to give the right answer.
Pretty frustrating for something so simple and common, and annoying that I can't seem to find it in the documentation for something that bills itself as porting pandas to spark.
You can use the nunique function:
df.groupby('columnname')['column_i_want_count_of_unique_values'].nunique()
I suppose the correct syntax is:
df.groupby('columnname').agg({'column_i_want_count_of_unique_values' : 'nunique'})
Source: https://github.com/databricks/koalas/pull/512
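As a quick check of the two spellings, here is a sketch using plain pandas on a made-up frame. It assumes koalas mirrors this part of the pandas API, as the linked pull request suggests. The data values are invented for illustration.

```python
import pandas as pd

col = "column_i_want_count_of_unique_values"
df = pd.DataFrame({
    "columnname": ["a", "a", "a", "b", "b"],
    col: [1, 1, 2, 3, 3],
})

# Series-style call on the selected column.
by_series = df.groupby("columnname")[col].nunique()

# Dictionary-style aggregation using the string name of the function.
by_agg = df.groupby("columnname").agg({col: "nunique"})

print(by_series)
print(by_agg)
```

Both calls report two distinct values for group "a" and one for group "b".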

pandas method .isin() automatically converts datatypes in element comparisons?

Wondering if someone can help me here. When I take a regular Python list containing strings and check whether a pandas Series (pla.id) has a value that matches a value in that list, it works.
Which is great and all, but I wonder how it's able to compare strings to ints. Is there documentation somewhere that states that it will convert under the hood before comparing those values?
I wasn't able to find anything on the .isin() page of the pandas documentation.
Also super interesting is that when I try pandas indexing, it fails due to a type comparison.
So my two questions:
1. Does the pandas.Series.isin(some_list_of_strings) method automatically convert the values in the series (which are int values) to strings before doing a value comparison?
2. If so, why doesn't pandas indexing, i.e. df[df.series == 'some value'], do the same thing? What is the thinking behind this? If I wanted to accomplish the same thing I would have to do df[df.series.astype(str) == ''] or df[df.series.astype(str).isin(some_list_of_strings)] to access those values in the df that match.
After some digging I think this might be due to the pandas object datatype, but I have no understanding of why this works. Also, this doesn't explain why the below works, since it is an int dtype.
Thanks in advance!
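Whatever .isin() does internally, the explicit conversion mentioned in the question avoids relying on it. A minimal sketch, with a made-up `ids` Series standing in for pla.id and a made-up `wanted` list:

```python
import pandas as pd

ids = pd.Series([101, 202, 303], name="id")   # int dtype, stand-in for pla.id
wanted = ["101", "303"]                       # list of strings

# Convert explicitly so both sides of the comparison are strings.
mask = ids.astype(str).isin(wanted)
print(ids[mask].tolist())
```

Making the conversion explicit keeps the result independent of how a given pandas version handles mixed int and string comparisons.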

How to identify and remove observations with ValueError in Pandas

While working on a dataset with pandas I am getting a ValueError:
ValueError: could not convert string to float: <some_string>
I want to identify and remove such observations from my data set, but I am unable to find a way to do it. Please suggest one.
The code that I am running :
X_scaled = preprocessing.scale(X_final)
Edit 1: After searching online and looking at some other posts on SO, I have understood that this might be happening because one or more columns in my data are supposed to hold numbers/floats but contain strings in some observations. How can I identify such columns?
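One possible way to locate the offending columns and rows is to coerce everything with pd.to_numeric and look for values that became NaN. This is only a sketch. The frame below is invented, and `bad_cells`, `bad_columns` and `X_clean` are illustrative names.

```python
import pandas as pd

# Made-up stand-in for X_final: "age" contains one non-numeric string.
X_final = pd.DataFrame({
    "age": ["34", "51", "n/a", "29"],
    "income": [52000.0, 61000.0, 58000.0, 47000.0],
})

# Unparseable values become NaN instead of raising ValueError.
as_numbers = X_final.apply(pd.to_numeric, errors="coerce")

# A cell is bad if it was present originally but could not be converted.
bad_cells = as_numbers.isna() & X_final.notna()
bad_columns = [c for c in X_final.columns if bad_cells[c].any()]

# Keep only the rows in which every cell converted cleanly.
X_clean = as_numbers[~bad_cells.any(axis=1)]

print(bad_columns)
print(X_clean)
```

The cleaned frame contains only numeric values, so it could then be passed to preprocessing.scale. Whether dropping the rows is appropriate depends on the data.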

NetworkXError: Input is not a correct Pandas DataFrame

I'm trying to create a network map in Python using NetworkX, with labels, to find potential connections between people. Currently I have a 1,316 x 1,316 matrix of 1's and 0's that represent whether there is a connection or not. I have been able to import this data into Python using a DataFrame. This is a small screenshot of the Python DataFrame. You can see the column and row names are numbers, but at the end of the list they turn into actual names. If I remove the names and make a grid of just 1's and 0's, NetworkX will allow me to turn this into a connection graph, but it is almost useless without the labels to know who is connected to whom. When I include the labels in the DataFrame and run the code listed below, I receive an error.
Error
NetworkXError: Input is not a correct Pandas DataFrame.
Code
Network = pd.DataFrame.from_csv('H:\\Network.csv')
G1 = nx.to_networkx_graph(Network)
I will admit to being quite new at this so any and all help will be appreciated. If you have any advice on why I'm getting this error or a better way to go about this, I am open to suggestions.
Thanks for your help.
I had the same problem. What solved it for me was converting the DataFrame to a NumPy array. (This only works if the matrix is square, i.e. shape[0] == shape[1].)
Here is what I did:
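# pd.DataFrame.from_csv was removed in pandas 1.0; read_csv with index_col=0 is the closest replacement
Network = pd.read_csv('H:\\Network.csv', index_col=0)
G1 = nx.to_networkx_graph(Network.to_numpy())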
This should work for you.
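Converting to a NumPy array drops the row and column labels, which the question wanted to keep. One way to restore them, sketched here on a made-up 3 x 3 matrix, is to relabel the integer nodes afterwards with nx.relabel_nodes. The names and the `adjacency` frame are invented for illustration.

```python
import networkx as nx
import pandas as pd

# Hypothetical small adjacency matrix standing in for the 1,316 x 1,316 one.
names = ["Alice", "Bob", "Carol"]
adjacency = pd.DataFrame(
    [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]],
    index=names,
    columns=names,
)

# Build the graph from the raw array; nodes are numbered 0..n-1.
graph = nx.to_networkx_graph(adjacency.to_numpy())

# Map each integer node id back to the corresponding row label.
graph = nx.relabel_nodes(graph, dict(enumerate(adjacency.index)))

print(sorted(graph.nodes()))
print(graph.has_edge("Alice", "Bob"))
```

This assumes the row order of the DataFrame matches its column order, as it does for a symmetric adjacency matrix.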
