Python sort data and load to excel - python-3.x

I am working on a task in which I scan 1000+ candidate emails and assign points based on how relevant each profile is to the requirement. I want to export this data to Excel in sorted order: the profile with the most points (the most relevant profile) should go on top (order by points desc). I have Python 3.3.5 on Windows 7 32-bit.
I searched and understood that I might need the pandas module to store the data in a DataFrame, sort it on my column, and load it into an Excel file. I then tried installing pandas using
pip install pandas
in both cmd and cmd (run as administrator), but it gives an error:
Command "python setup.py egg_info" failed with error code 1 in c:\users\sanket~1\appdata\local\temp\pip-build-ihqwe4\pandas\
Can someone please help me with this sorting issue and suggest how I can resolve the pandas installation error? Or is there another way to sort the data?

I'm not sure why you need pandas to do this. I would just store them as a sorted list of (score,candidate) tuples and then just write as a CSV:
tuples = ... # read your dataset as a list of (score, candidate) tuples
tuples.sort(reverse=True) # sorts by the first element of each tuple (score) in descending order
with open('output.csv', 'w') as f:
    for e in tuples:
        f.write("%s,%s\n" % e) # one "score,candidate" row per line
The resulting CSV file can be opened in Excel.
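If a candidate name or address can contain a comma, plain "%s,%s" formatting would produce a malformed row. Here is a minimal sketch using the standard-library csv module, which takes care of quoting. The function name, the header row and the sample data are made up for illustration.

```python
import csv

def write_ranked_csv(candidates, path):
    """Write (score, candidate) pairs to `path`, highest score first."""
    # Sort on the score only, so ties keep their original relative order.
    ranked = sorted(candidates, key=lambda pair: pair[0], reverse=True)
    # newline="" lets the csv module control line endings (matters on Windows).
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["score", "candidate"])  # hypothetical header row
        writer.writerows(ranked)
    return ranked

# Hypothetical sample data standing in for the scanned emails.
sample = [(12, "a@example.com"), (30, "b@example.com"), (21, "c@example.com")]
```

Calling write_ranked_csv(sample, "output.csv") would then produce a file that Excel opens directly, with the highest score in the first data row.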

Related

How to reformat the output from iexfinance stock.get_financials(). The current output is a 3D nested dictionary, not a dataframe

According to the documentation (https://addisonlynch.github.io/iexfinance/stable/), endpoint GET requests return DataFrames by default.
However, the following sample code returns a nested dictionary (2x1x70):
from iexfinance.stocks import Stock
stocks=Stock(['NEM','FCX'],token='my_live_token',output_format='pandas')
fins_data=stocks.get_financials(period='annual')
print(fins_data)
Is this a standard nested dictionary?
The target output should be a DataFrame of two rows indexed on the first-level keys (which are stock tickers, in this case 'NEM' and 'FCX'). The remaining text in the dictionary output is the column headings (there appears to be no heading for the ticker, as it is the index/key).
I would expect the same format as you get when you run the following:
from iexfinance.stocks import Stock
stocks=Stock(['NEM','FCX'],token='my_live_token')
co_info=stocks.get_company()
print(co_info)
Any ideas on how to convert the output from get_financials() to a usable DataFrame format?
I added the following:
fins_dict=stocks.get_financials(period='annual')
fins_data = pd.concat(fins_dict, axis=0, keys=fins_dict.keys())
fins_data.index = fins_data.index.droplevel(1)
print(type(fins_data))
This seems to strip away the outer dictionary, leave the inner DataFrame, and remove the date index. IEX Cloud has moved its 'financials' data into a time-series library, which appears to want to add a time index to everything. I'm not sure this is the correct solution, as there still appears to be some inconsistency within the data structures. Any insights appreciated.
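For anyone who wants to try the reshaping step without an IEX token, here is a small sketch on stand-in data. It assumes that get_financials() hands back a dictionary of one-row DataFrames keyed by ticker, as described above. The column names and values are invented and are not real iexfinance output.

```python
import pandas as pd

# Stand-in for the dictionary described above: one small DataFrame per ticker,
# each indexed by a report date. Columns and numbers are made up.
fins_dict = {
    "NEM": pd.DataFrame({"revenue": [10.0], "netIncome": [2.0]}, index=["2020-12-31"]),
    "FCX": pd.DataFrame({"revenue": [14.0], "netIncome": [1.0]}, index=["2020-12-31"]),
}

# Concatenating a dict stacks the frames and uses the dict keys as an outer index level.
fins_data = pd.concat(fins_dict, axis=0)

# Drop the inner (date) level so that only the ticker remains as the index.
fins_data.index = fins_data.index.droplevel(1)
print(fins_data)
```

Passing the dictionary straight to pd.concat already uses its keys for the outer level, so the explicit keys= argument is probably not needed.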

Countif function in Python looping in every cell

Hey everyone, I am used to working in Excel, but after recently getting a dataset of about 500k rows that need to be worked on in the same worksheet, I have huge capacity issues and was advised to try moving my functions to a Python environment. The Excel function "=IF(COUNTIF($J$2:J3,J3)>1,0,1)" (J is the Asset ID column) goes to each cell and returns 0 if the value has already been encountered in the cells above, or 1 if this is its first occurrence.
How would that be possible in a Python environment if I load my table as a DataFrame?
Thanks in advance.
You can use pandas to achieve this very easily:
import pandas as pd # import pandas
df = pd.read_excel('your_file.xlsx') # use appropriate function according to your file type
df['Unique'] = (~df['Asset_Id'].duplicated()).astype(int) # 1 for the first occurrence of a value, 0 for later repeats; replace 'Asset_Id' with your column name
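A quick way to check the idea is to run it on a tiny made-up column (the values below are invented). The parentheses matter: ~ has to be applied to the boolean mask before astype(int). Applied to integers it gives -1 and -2 instead of 1 and 0.

```python
import pandas as pd

# Made-up Asset IDs; "A1" and "B2" each appear twice.
df = pd.DataFrame({"Asset_Id": ["A1", "B2", "A1", "C3", "B2"]})

# duplicated() is True for every repeat of an earlier value.
# Negating it gives True for first occurrences, and astype(int) turns that into 1/0.
df["Unique"] = (~df["Asset_Id"].duplicated()).astype(int)
print(df)
```

This matches the Excel formula row by row: the first time a value appears it gets 1, and every later repeat gets 0.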

how to search a text file in python 3

I have this text file that has lists in it. How would I search for an individual list? I have tried using loops to find it, but every time it gives me an error since I don't know what to search for.
I tried using an if statement to find it, but it returns -1.
Thanks for the help.
I was doing research on this last night. You can use pandas for this. See here: Load data from txt with pandas. One of the answers talks about list in text files.
You can use:
data = pd.read_csv('output_list.txt', sep=" ", header=None)
data.columns = ["Name", "b", "c", "etc."]
Add sep=" " to your code, leaving a blank space between the quotes, so pandas can detect the spaces between values and split them into columns. data.columns is for naming your columns.
With a JSON or XML format, text files become more searchable. In my research I’ve decided to go with an XML approach. Here is the link to a blog that explains how to use Python with XML: http://www.austintaylor.io/lxml/python/pandas/xml/dataframe/2016/07/08/convert-xml-to-pandas-dataframe.
If you want to search the data frame try:
import pandas as pd
txt_file = r'C:\path\to\your\txtfile.txt' # raw string, so the backslashes are not treated as escape sequences
df = pd.read_table(txt_file, sep=",")
row = df.loc[df['Name'] == 'bob'] # rows whose Name column equals 'bob'
print(row)
Depending on how your text file is formatted, this will not work for every text file. A pandas DataFrame gives the data a CSV-like structure, which makes the process repeatable and the results testable. Again, I recommend converting to a JSON or XML format before using pandas DataFrames in your solution. You can then produce consistent, testable results!
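If pandas is more than you need, a plain-Python sketch might be enough. It assumes each line of the file holds one Python-style list literal such as ['bob', 3, 4]. That is a guess about the file format, and the function name and sample lines are made up.

```python
import ast

def find_list(lines, wanted):
    """Return the 1-based numbers of the lines that parse to a list equal to `wanted`."""
    matches = []
    for number, line in enumerate(lines, start=1):
        text = line.strip()
        if not text:
            continue  # skip blank lines
        try:
            value = ast.literal_eval(text)  # safely parse a Python literal
        except (ValueError, SyntaxError):
            continue  # the line is not a literal, so ignore it
        if value == wanted:
            matches.append(number)
    return matches

# Hypothetical file contents, given as a list of lines.
sample_lines = ["['bob', 3, 4]\n", "not a list\n", "['amy', 1, 2]\n"]
print(find_list(sample_lines, ['amy', 1, 2]))
```

With a real file you would pass the open file object instead of sample_lines, since iterating over a file yields its lines. Comparing parsed values avoids the -1 you get from str.find when the spacing or quoting in the file differs from the search string.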

NetworkXError: Input is not a correct Pandas DataFrame

I'm trying to create a network map in Python using NetworkX, with labels, to find potential connections between people. Currently I have a 1,316 x 1,316 matrix of 1's and 0's that represent whether there is a connection or not. I have been able to import this data into Python using a DataFrame. [Screenshot of DataFrame] This is a small screenshot of the Python DataFrame. You can see the column and row names are numbers, but at the end of the list they turn into actual names. If I remove the names and make a grid of just 1's and 0's, NetworkX will allow me to turn this into a connection graph, but it is almost useless without the labels to know who is connected to whom. When I include the labels in the DataFrame and try to run the code listed below, I receive an error.
Error
NetworkXError: Input is not a correct Pandas DataFrame.
Code
Network = pd.DataFrame.from_csv('H:\\Network.csv')
G1 = nx.to_networkx_graph(Network)
I will admit to being quite new at this so any and all help will be appreciated. If you have any advice on why I'm getting this error or a better way to go about this, I am open to suggestions.
Thanks for your help.
I had the same problem. What solved it for me was converting the DataFrame to a NumPy array. (This only works if the matrix is square, i.e. shape[0] == shape[1].)
Here is what I did:
Network = pd.read_csv('H:\\Network.csv', index_col=0) # DataFrame.from_csv was removed in pandas 1.0
G1 = nx.to_networkx_graph(Network.to_numpy())
This should work for you.
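One caveat: converting to a NumPy array drops the row and column names, so the nodes end up numbered 0, 1, 2, ... If the labels are the point, a sketch along these lines might help. It assumes NetworkX 2.0 or later and that the row labels and column labels are the same set of names. The three-person matrix is made up.

```python
import networkx as nx
import pandas as pd

# Made-up 3 x 3 adjacency matrix with the same names on the rows and columns.
names = ["alice", "bob", "carol"]
adjacency = pd.DataFrame(
    [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]],
    index=names,
    columns=names,
)

# from_pandas_adjacency keeps the labels as node names; a 0 entry means no edge.
graph = nx.from_pandas_adjacency(adjacency)

# Normalise each edge to alphabetical order so the printout is stable.
edges = sorted(tuple(sorted(edge)) for edge in graph.edges())
print(edges)
```

In the question the row names and column names apparently do not line up (numbers first, then names), so the DataFrame would likely need its index and columns set to the same labels before this works.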

Error with openpyxl after converting a script to Python 3.x that utilizes pandas and numpy

About a year ago I wrote a script that took a single column of datetime values and ran a window through the series to determine the greatest "lumping" of values based on an adjustable dimension of time. For example, given a million datetime values, what is the maximum number of entries that exist within 1 second, or 1 minute, or 1 hour of each other?
The problem is that I had a machine blow up on me and lost some of the documentation, specifically the versions of packages that I was working with. I think I've updated the code to execute within 3.x but am now getting errors that seem to suggest that pandas no longer supports the packages I'm trying to use. I've tried just installing a few random versions, updating pip, etc., but am not having much luck.
The exact error states, 'UserWarning: Installed openpyxl is not supported at this time. Use >=1.61 and <2.0.0' -- I'm not seeing a version history in their repository. Might just try installing older versions of Python and trying to bash this into place.
Here is the code:
import numpy as np
import pandas as pd
# Your original code was correct here. I assumed there will be a data column along with the timestamps.
df = pd.read_csv("ET.txt", parse_dates=["dt"])
# Construct a univariate `timeseries` instead of a single column dataframe as output by `read_csv`.
# You can think of a dataframe as a matrix with labelled columns and rows. A timeseries is more like
# an associative array, or labelled vector. Since we don't need a labelled column, we can use a simpler
# representation.
data = pd.Series(0, df.dt)
print(data)
window_size = 1
buckets_sec = data.resample("1S", how="count").fillna(0)
# We have to shift the data back by the same number of samples as the window size. This is because `rolling_apply`
# uses the timestamp of the end of the period instead of the beginning. I assume you want to know when the most
# active period started, not when it ended. Finally, `dropna` will remove any NaN entries appearing in the warmup
# period of the sliding window (ie. it will output NaN for the first window_size-1 observations).
rolling_count = pd.rolling_apply(buckets_sec, window=window_size, func=np.nansum).shift(-window_size).dropna()
print(rolling_count.describe())
# Some interesting data massaging
# E.g. See how the maximum hit count over the specified sliding window evolves on an hourly
# basis:
seconds_max_hits = rolling_count.resample("S", how="max").dropna()
# Plot the frequency of various hit counts. This gives you an idea how frequently various
# hit counts occur.
seconds_max_hits.hist()
# Same on a daily basis
daily_max_hits = rolling_count.resample("S", how="max").dropna()
Screen cap of the error: http://i.imgur.com/uSv29I5.png
I'm not sure why you're seeing an openpyxl-related error, but if you are, it seems you should update your version of pandas. There were some significant changes in openpyxl that affected exporting to Excel from pandas, but these have since been resolved.
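If you do update pandas, keep in mind that resample(..., how=...) and pd.rolling_apply from the code above are not available in current pandas releases. A rough modern equivalent of the counting step might look like the sketch below. The timestamps are made up, and the shift(-window_size) realignment from the original script is left out to keep it short.

```python
import pandas as pd

# Made-up timestamps: two hits in the first second, one in the next, one two seconds later.
timestamps = pd.to_datetime([
    "2020-01-01 00:00:00.1",
    "2020-01-01 00:00:00.5",
    "2020-01-01 00:00:01.2",
    "2020-01-01 00:00:03.0",
])
data = pd.Series(0, index=timestamps)

window_size = 2  # width of the sliding window, in one-second buckets

# Count the entries per one-second bucket; empty buckets come out as 0.
per_second = data.resample("1s").count()

# Sum the counts over a sliding window; the first window_size - 1 rows are NaN.
rolling_count = per_second.rolling(window=window_size).sum().dropna()

max_hits = int(rolling_count.max())
print(per_second)
print(max_hits)
```

The method-style calls .resample(...).count() and .rolling(...).sum() replace the old how="count" argument and pd.rolling_apply(..., func=np.nansum) respectively.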
