NetworkXError: Input is not a correct Pandas DataFrame - python-3.x

I'm trying to create a network map in Python using NetworkX, with labels, to find potential connections between people. Currently I have a 1,316 x 1,316 matrix of 1's and 0's that represent whether there is a connection or not. I have been able to import this data into Python as a DataFrame. The column and row names are numbers, but at the end of the list they turn into actual names. If I remove the names and make a grid of just 1's and 0's, NetworkX will let me turn this into a connection graph, but it is almost useless without the labels to know who is connected to whom. When I include the labels in the DataFrame and run the code listed below, I receive an error.
Error
NetworkXError: Input is not a correct Pandas DataFrame.
Code
Network = pd.DataFrame.from_csv('H:\\Network.csv')
G1 = nx.to_networkx_graph(Network)
I will admit to being quite new at this so any and all help will be appreciated. If you have any advice on why I'm getting this error or a better way to go about this, I am open to suggestions.
Thanks for your help.

I had the same problem. What solved it for me was converting the DataFrame to a NumPy array. (This only works if shape[0] == shape[1].)
Here is what I did:
Network = pd.DataFrame.from_csv('H:\\Network.csv')
G1 = nx.to_networkx_graph(Network.to_numpy())
This should work for you.
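Converting to a NumPy array drops the row and column names, which the question wants to keep as node labels. A minimal sketch of one way to keep them, assuming the row labels and the column labels are the same set of names, is `nx.from_pandas_adjacency`. The tiny matrix and the names below are made up for illustration. Note also that `pd.DataFrame.from_csv` was removed in pandas 1.0; `pd.read_csv(path, index_col=0)` is the usual replacement.

```python
import networkx as nx
import pandas as pd

# Made-up 3 x 3 adjacency matrix standing in for the real CSV data,
# which could be loaded with pd.read_csv(path, index_col=0).
names = ["Alice", "Bob", "Carol"]
network = pd.DataFrame(
    [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]],
    index=names,
    columns=names,
)

# Row/column labels become the node names of the graph.
G1 = nx.from_pandas_adjacency(network)

print(sorted(G1.nodes()))
print(sorted(G1.edges()))
```

If the row labels and column headers do not match exactly (for example, integers on one side and strings on the other), they probably need to be aligned before this call will succeed.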

Related

Spark VectorAssembler creating a dictionary from features

I'm having trouble with VectorAssembler in PySpark.
Task
I have a bunch of data (all numerical) that describes the customer base for a phone company (monthly bills, gender, etc.). One datum can be seen here:
My goal is to try to predict if a customer will churn (leave the service).
Method
In pursuit of this goal, I implemented the following code:
from pyspark.ml.feature import VectorAssembler
ignore = ['Churn_indexed', 'customerID']
vectorAssembler = VectorAssembler(inputCols=[x for x in df_num.columns if x not in ignore],
outputCol='Independent Features')
df_final = vectorAssembler.transform(df)
Where I have made a VectorAssembler to ignore customerID (irrelevant) and churn (to be predicted).
Error
I then printed df_final to ensure it looked as expected and saw the following, where the red dots indicate rows that are of the expected form. The remaining rows do not have the expected form:
For some reason, some of the rows have "20" at the beginning followed by a list. I should note that there are 20 features (including Churn and excluding CustomerID), which is perhaps where this prepended 20 comes from? I printed an incorrect row and a correct row, and it looks like, for the incorrect ones, the assembler turned the features into a dictionary?
CSV
I opened the CSV and tried looking for extra spaces or bad formatting and could not recognize a pattern for determining what ended up being VectorAssembled properly and what did not.
Has anyone else run into this issue? I've been troubleshooting for a long time to no avail.
Thanks for any help.
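It looks like the VectorAssembler automatically converts rows with many zeros (at least half, from what I can tell) to their sparse form. In that form the leading 20 is the vector size, followed by the indices and the values of the non-zero entries, so the rows hold the same data and are not dictionaries.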
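To read such a sparse row, it can help to expand it by hand. The sketch below is plain Python, not Spark API: `sparse_to_dense` is a hypothetical helper and the sample numbers are invented. It assumes the printed form is (size, [indices], [values]).

```python
def sparse_to_dense(size, indices, values):
    """Expand a (size, indices, values) triple into a full list of floats."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v  # every position not listed in `indices` stays 0.0
    return dense


# A 6-feature row with non-zero values only at positions 1 and 4.
row = sparse_to_dense(6, [1, 4], [29.85, 1.0])
print(row)
```

Both forms describe the same feature vector, so downstream Spark ML estimators should accept either.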

Countif function in Python looping in every cell

Hey everyone, I am used to working in Excel, but I recently got a dataset of about 500k rows that needs to be worked on in a single worksheet. I ran into huge capacity issues and was advised to move my functions to a Python environment. The Excel function "=IF(COUNTIF($J$2:J3,J3)>1,0,1)" (J is the Asset ID column) goes to each cell and returns 0 if the value has already been encountered in the cells above, and 1 if it is unique so far.
How would that be possible in a Python environment if I load my table as a DataFrame?
Thanks in advance
You can use pandas to achieve this very easily:
import pandas as pd # import pandas
df = pd.read_excel('your_file.xlsx') # use appropriate function according to your file type
df['Unique'] = (~df['Asset_Id'].duplicated()).astype(int) # 1 for the first occurrence of a value, 0 for later repeats
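A quick check of the idea on made-up data (the column name `Asset_Id` and the values are assumptions). The parentheses matter: the negation has to happen before `astype(int)`, otherwise `~` flips the bits of the integers instead of negating the booleans.

```python
import pandas as pd

# Invented sample standing in for the Asset ID column.
df = pd.DataFrame({"Asset_Id": ["A1", "B2", "A1", "C3", "B2"]})

# duplicated() is True from the second occurrence onward; negate it first,
# then convert the booleans to 1/0.
df["Unique"] = (~df["Asset_Id"].duplicated()).astype(int)

print(df)
```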

Converting Numpy array to Dataframe using pandas

I am struggling to convert a NumPy array to a data frame.
This has been imported from a CSV file.
I have tried code like
carsdf = pd.DataFrame(newdf=newdf[1:,1:],
index=newdf[1:,0],
columns=newdf[0,1:])
However, I get a type error.
I have tried reshaping:
print(mydata.reshape((213, 26)))
and get value errors.
Now I am flustered: I know how to fix typos and errors, capitalize, fetch data and all the other stuff, but my coding is pretty raw and I am stuck on this right at the beginning of the assignment.
I would appreciate any advice.
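One likely cause of the type error: `pd.DataFrame` has no `newdf` keyword, and the values are passed as the first argument (`data`). A minimal sketch, assuming the array's first row holds the column names and its first column holds the row labels (the sample array is invented):

```python
import numpy as np
import pandas as pd

# Invented stand-in for the array loaded from the CSV file.
newdf = np.array([
    ["", "mpg", "cyl"],
    ["Mazda RX4", "21.0", "6"],
    ["Datsun 710", "22.8", "4"],
])

carsdf = pd.DataFrame(
    data=newdf[1:, 1:],    # body of the table
    index=newdf[1:, 0],    # first column as row labels
    columns=newdf[0, 1:],  # first row as column names
)

# Values taken from a string array are still text; convert explicitly.
carsdf = carsdf.astype({"mpg": float, "cyl": int})
print(carsdf)
```

As for the value error, `reshape((213, 26))` can only succeed when the array already holds exactly 213 * 26 elements.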

getting a groupby error through a for loop

fashion = [1,1,2,3,3,3,21,1,1,1,5,5,5,5,3,3,2,6]
for key,group in groupby(fashion):
    print(key,':',list(group))
I have written the above code to group equal numbers together and get a list for each. For example, I want an outcome such as:
1 : [1,1,1,1,1]
2 : [2,2]
Can someone please tell me what's wrong with my code?
fashion is neither a sorted list for itertools.groupby() nor a Pandas DataFrame. Please post the fully functioning code with clarification on if you are attempting to use Pandas for grouping a series or DataFrame.
"How to make good reproducible pandas examples" contains good advice for getting help with Pandas features.
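If `itertools.groupby` is what was intended, a minimal sketch of the sorted variant follows. `groupby` only merges consecutive equal items, so sorting first collects every occurrence of a number into one group.

```python
from itertools import groupby

fashion = [1, 1, 2, 3, 3, 3, 21, 1, 1, 1, 5, 5, 5, 5, 3, 3, 2, 6]

# Sort so that equal values sit next to each other before grouping.
grouped = {key: list(group) for key, group in groupby(sorted(fashion))}

for key, group in grouped.items():
    print(key, ":", group)
```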

Hide wrong values of a graph

I am building graphs using Matplotlib, and I sometimes have wrong values in my CSV files. They create spikes in my graph that I would like to suppress. I also sometimes have lots of zeros (when the sensor is disconnected), and I would prefer the graph to show blank spaces rather than wrong zeros that could be interpreted as real values.
Forgive me, I'm not familiar with matplotlib, but I'm presuming that you're reading the CSV file directly into matplotlib. If so, is there an option to read the CSV file into your app as a list of ints or as a string, and then do the data validation before passing the data to the library?
Apologies if my idea is not applicable.
I found a way that works:
I used xlim to set my max and min x values, and then I set all the values that I didn't want to NaN.
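The NaN step could be sketched as below. This is only an illustration: `mask_bad_values`, the threshold `max_valid`, and the sample readings are hypothetical, and it assumes a "wrong" value is either an exact zero or anything above the threshold.

```python
import numpy as np


def mask_bad_values(values, max_valid):
    """Return a float copy with zeros and values above max_valid set to NaN."""
    cleaned = np.asarray(values, dtype=float).copy()
    cleaned[(cleaned == 0) | (cleaned > max_valid)] = np.nan
    return cleaned


# Invented sensor readings: two dropouts (zeros) and one spike.
readings = [20.1, 20.3, 0, 0, 20.2, 950.0, 20.4]
cleaned = mask_bad_values(readings, max_valid=100.0)
print(cleaned)
```

Matplotlib line plots break at NaN entries, so plotting `cleaned` leaves blank gaps where the zeros and the spike used to be.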
