Strange problem when saving to excel pandas - excel

I have a problem writing to Excel. I have 15 columns in my dataframe. I want to write only 7 of them to Excel and, in the process, use different names for the headers.
Here is my code:
cols = ['SN', 'Date_x','Material_x', 'Batch_x', 'Qty_x', 'Booked_x', 'State_x']
headers = ['SN', 'Date', 'Material', 'Batch', 'Qty', 'Booked', 'State']
df.style.apply(highlight_changes_ivt2, axis=None).to_excel(writer, columns =cols, header=headers, sheet_name="temp", index = False)
But I get the following error:
File "/home/week/anaconda3/envs/SC/lib/python3.7/site-packages/pandas/io/formats/style.py", line 235, in to_excel
engine=engine,
File "/home/week/anaconda3/envs/SC/lib/python3.7/site-packages/pandas/io/formats/excel.py", line 735, in write
freeze_panes=freeze_panes,
File "/home/week/anaconda3/envs/SC/lib/python3.7/site-packages/pandas/io/excel/_xlsxwriter.py", line 214, in write_cells
for cell in cells:
File "/home/week/anaconda3/envs/SC/lib/python3.7/site-packages/pandas/io/formats/excel.py", line 684, in get_formatted_cells
for cell in itertools.chain(self._format_header(), self._format_body()):
File "/home/week/anaconda3/envs/SC/lib/python3.7/site-packages/pandas/io/formats/excel.py", line 513, in _format_header_regular
f"Writing {len(self.columns)} cols but got {len(self.header)} "
ValueError: Writing 15 cols but got 7 aliases
I tried debugging by setting pdb.set_trace() just before the call:
df.style.apply(highlight_changes_ivt2, axis=None).to_excel(writer, columns =cols, header=headers, sheet_name="temp", index = False)
(Pdb) df.columns
Index(['SN', 'Status_x', 'Material_x', 'Batch_x', 'Date_x', 'Quantity_x',
'Booked_x', 'DiffQty_x', 'Status_y', 'Material_y', 'Batch_y',
'Date_y', 'Quantity_y', 'Booked_y', 'DiffQty_y'],
dtype='object')
(Pdb)
This code runs fine on my home laptop, though. I'm just wondering what's wrong; the only difference is that this machine uses Python 3.7 and the one at home uses 3.8.
Thanks

Let me elaborate my idea in the comment by an example:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape(4,-1))
# this is the reference dataframe
np.random.seed(1)
ref_df = pd.DataFrame(np.random.randint(1,10,(4,4)))
# this is the function
def highlight(col, ref_df=None):
    return ['background-color: yellow' if c>r else ''
            for c,r in zip(col, ref_df[col.name])]
# select the columns first, then style and write; this works
df[[0,1,3]].style.apply(highlight, ref_df=ref_df).to_excel('style.xlsx', header=list('abc'))
Output:
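A minimal sketch of the same idea applied to the question's situation, assuming the fix is to subset the dataframe before styling rather than passing columns= to the Styler's to_excel. The helper name select_and_rename and the toy column names are hypothetical; the styling and Excel-writing steps are left out so the snippet runs with pandas alone.

```python
import pandas as pd


def select_and_rename(df, cols, headers):
    """Return only `cols` from `df`, renamed to `headers` (hypothetical helper)."""
    if len(cols) != len(headers):
        raise ValueError(f"{len(cols)} columns but {len(headers)} headers")
    missing = [c for c in cols if c not in df.columns]
    if missing:
        # Fail early with a readable message about the unknown names
        raise KeyError(f"columns not in dataframe: {missing}")
    return df[cols].rename(columns=dict(zip(cols, headers)))


# Toy data standing in for the merged dataframe (column names are made up)
demo = pd.DataFrame({"SN": [1, 2], "Date_x": ["a", "b"], "Date_y": ["c", "d"]})
subset = select_and_rename(demo, ["SN", "Date_x"], ["SN", "Date"])
print(subset.columns.tolist())
```

The result could then be styled and written with subset.style.apply(...).to_excel(writer, sheet_name="temp", index=False), without the columns or header arguments. A styling function used with axis=None would then have to return styles shaped like the reduced frame. Note also that the cols list in the question contains Qty_x and State_x, while the printed df.columns shows Quantity_x and Status_x, so the check above would flag those names.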

Related

How to extract the specific part of text file in python?

I have big data as shown in the uploaded pic, it has 90 BAND-INDEX and each BAND-INDEX has 300 rows.
I want to search the text file for a specific value like -24.83271 and extract the BAND-INDEX containing that value in an array form. Can you please write the code to do so? Thank you in advance
I am unable to extract the specific BAND-INDEX in array form.
Try reading the file line by line and using a generator. Here is an example:
import csv
import pandas as pd
# generate and save demo csv
pd.DataFrame({
    'Band-Index': (0.01, 0.02, 0.03, 0.04, 0.05, 0.06),
    'value': (1, 2, 3, 4, 5, 6),
}).to_csv('example.csv', index=False)
def search_values_in_file(search_values: list):
    with open('example.csv') as csvfile:
        reader = csv.reader(csvfile)
        next(reader)  # skip header
        for row in reader:
            band_index, value = row
            if value in search_values:
                yield row
# get lines from csv where value in ['4', '6']
df = pd.DataFrame(list(search_values_in_file(['4', '6'])), columns=['Band-Index', 'value'])
print(df)
#   Band-Index value
# 0       0.04     4
# 1       0.06     6

How to fill NANs with values tending to zero until the next valid value?

While resampling a dataframe (df) as:
df = pd.DataFrame.from_dict({'2021-03-02': 442,
'2021-03-04': 520,
'2021-03-09': 390,
'2021-03-11': 442,
'2021-03-16': 520,
'2021-03-23': 520,
'2021-03-25': 520,
'2021-03-26': 442,}, orient='index',)
df.index = pd.to_datetime(df.index)
df = df.resample('30Min').asfreq()
How do I fill the NaNs with values that linearly tend to zero from their predecessor? (Plotted, the result would look like a sawtooth.)
Is there a built-in method for this operation, or does a custom method need to be used in conjunction with .apply()?
Thank you for your time.
There's no built-in function for this. You can create one quickly like this:
# group of rows starting with non-nan
groups = df[0].groupby(df[0].notnull().cumsum())
# output
out = df[0].ffill().mul(1-groups.cumcount()/ groups.transform('size'))
# plot
out.plot()
And you get:
Another option is to fill the nan just before a non-nan value with 0 using notnull and shift, then interpolate.
df.loc[df[0].notnull().shift(-1, fill_value=False), 0] = 0
df[0] = df[0].interpolate()
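To see what the groupby-based expression does, here is a small worked example on a toy Series (the values are made up for illustration, not taken from the question). Each group starts at a valid value; the forward-filled value is scaled by one minus the row's position in the group divided by the group length.

```python
import numpy as np
import pandas as pd

# Toy series: a valid value followed by NaNs, twice
s = pd.Series([4.0, np.nan, np.nan, np.nan, 8.0, np.nan])

# Each group starts at a valid value and runs until the next one
groups = s.groupby(s.notnull().cumsum())

# Position within the group divided by group length gives the decay fraction
factor = 1 - groups.cumcount() / groups.transform("size")

out = s.ffill() * factor
print(out.tolist())  # [4.0, 3.0, 2.0, 1.0, 8.0, 4.0]
```

With this variant the ramp stops one step short of zero (the last NaN of the first group becomes 1.0), whereas the interpolate variant sets the last NaN before the next valid value to exactly 0.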

Combining a list of Dataframes

I have a folder with several .csv-files. Each contains data on Time, High, Low, Open, Volumefrom, Volumeto, Close of a cryptocurrency.
I managed to load the .csv files into a list of dataframes and drop the columns Open, High, Low, Volumefrom and Volumeto, which I don't need, leaving me with Time and Close for each dataframe.
Now I want to combine the list of dataframes into one dataframe, where the index starts with the timestamp of the youngest coin, which would be iota in this example.
This is the code I wrote so far:
import pandas as pd
import os
# Path to my folder
PATH_COINS = r"C:\Users\...\Coins"
# creating a path for each of the .csv-files and saving it into a list
namelist = [name for name in os.listdir(PATH_COINS)]
path_lists = [os.path.join(PATH_COINS, path) for path in namelist]
# creating the dataframes and saving them into a list
dfs = [pd.read_csv(k, index_col=0) for k in path_lists]
# dropping unwanted columns
for num, i in enumerate(dfs):
    i.drop(columns=["Open", "High", "Low", "Volumefrom", "Volumeto"], inplace=True)
# combining the list of dataframes into one dataframe
pd.concat(dfs, join="inner", axis=1)
However, I am getting an error message and can't figure out how to achieve my goal:
Traceback (most recent call last):
  File "C:/Users/Jonas/PycharmProjects/Pandas/main.py", line 16, in <module>
    pd.concat(dfs, join="inner", axis=1)
  File "C:\Users\Jonas\PycharmProjects\Pandas\venv\lib\site-packages\pandas\core\reshape\concat.py", line 226, in concat
    return op.get_result()
  File "C:\Users\Jonas\PycharmProjects\Pandas\venv\lib\site-packages\pandas\core\reshape\concat.py", line 423, in get_result
    copy=self.copy)
  File "C:\Users\Jonas\PycharmProjects\Pandas\venv\lib\site-packages\pandas\core\internals.py", line 5425, in concatenate_block_managers
    return BlockManager(blocks, axes)
  File "C:\Users\Jonas\PycharmProjects\Pandas\venv\lib\site-packages\pandas\core\internals.py", line 3282, in __init__
    self._verify_integrity()
  File "C:\Users\Jonas\PycharmProjects\Pandas\venv\lib\site-packages\pandas\core\internals.py", line 3493, in _verify_integrity
    construction_error(tot_items, block.shape[1:], self.axes)
  File "C:\Users\Jonas\PycharmProjects\Pandas\venv\lib\site-packages\pandas\core\internals.py", line 4843, in construction_error
    passed, implied))
ValueError: Shape of passed values is (5, 8514), indices imply (5, 8490)
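The join itself should work.
Check for duplicate index values (e.g. with df.index.is_unique), because concat doesn't know how to align duplicated index values across multiple DataFrames.
Removing the duplicate index values, e.g. with df = df[~df.index.duplicated(keep='first')], should resolve it. Note that df.drop_duplicates(inplace=True) drops rows whose column values are duplicated; it does not look at the index.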
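A minimal sketch of that check and cleanup, using two small made-up frames in place of the coin data (the names btc and iota and all values are illustrative assumptions). Keeping the first occurrence of each duplicated index value is one possible policy; whether it is the right one depends on the data.

```python
import pandas as pd

# Toy frames: the index value 20 appears twice in the first one
btc = pd.DataFrame({"btc": [1.0, 2.0, 3.0]}, index=[10, 20, 20])
iota = pd.DataFrame({"iota": [5.0, 6.0]}, index=[20, 30])
dfs = [btc, iota]

# Report which frames have a non-unique index
for frame in dfs:
    print(frame.columns[0], "unique index:", frame.index.is_unique)

# Keep only the first row for each duplicated index value
deduped = [frame[~frame.index.duplicated(keep="first")] for frame in dfs]

# The inner join keeps only index values present in every frame
combined = pd.concat(deduped, join="inner", axis=1)
print(combined)
```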

applying a lambda function to pandas dataframe

First time posting on stackoverflow, so bear with me if I'm making some faux pas please :)
I'm trying to calculate the distance between two points, using geopy, but I can't quite get the actual application of the calculation to work.
Here's the head of the dataframe I'm working with (there are some missing values later in the dataframe, not sure if this is the issue or how to handle it in general):
start lat start long end_lat end_long
0 38.902760 -77.038630 38.880300 -76.986200
2 38.895914 -77.026064 38.915400 -77.044600
3 38.888251 -77.049426 38.895914 -77.026064
4 38.892300 -77.043600 38.888251 -77.049426
I've set up a function:
def dist_calc(st_lat, st_long, fin_lat, fin_long):
    from geopy.distance import vincenty
    start = (st_lat, st_long)
    end = (fin_lat, fin_long)
    return vincenty(start, end).miles
This one works fine when given manual input.
However, when I try to apply() the function, I run into trouble with the below code:
distances = df.apply(lambda row: dist_calc(row[-4], row[-3], row[-2], row[-1]), axis=1)
I'm fairly new to python, any help will be much appreciated!
Edit: error message:
distances = df.apply(lambda row: dist_calc2(row[-4], row[-3], row[-2], row[-1]), axis=1)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/frame.py", line 4262, in apply
ignore_failures=ignore_failures)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/frame.py", line 4358, in _apply_standard
results[i] = func(v)
File "<stdin>", line 1, in <lambda>
File "<stdin>", line 5, in dist_calc2
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/geopy/distance.py", line 322, in __init__
super(vincenty, self).__init__(*args, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/geopy/distance.py", line 115, in __init__
kilometers += self.measure(a, b)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/geopy/distance.py", line 414, in measure
u_sq = cos_sq_alpha * (major ** 2 - minor ** 2) / minor ** 2
UnboundLocalError: ("local variable 'cos_sq_alpha' referenced before assignment", 'occurred at index 10')
The default settings for pandas functions typically used to import text data like this (pd.read_table() etc) will interpret the spaces in the first 2 column names as separators, so you'll end up with 6 columns instead of 4, and your data will be misaligned:
In [23]: df = pd.read_clipboard()
In [24]: df
Out[24]:
start lat start.1 long end_lat end_long
0 0 38.902760 -77.038630 38.880300 -76.986200 NaN
1 2 38.895914 -77.026064 38.915400 -77.044600 NaN
2 3 38.888251 -77.049426 38.895914 -77.026064 NaN
3 4 38.892300 -77.043600 38.888251 -77.049426 NaN
In [25]: df.columns
Out[25]: Index(['start', 'lat', 'start.1', 'long', 'end_lat', 'end_long'], dtype='object')
Notice column names are wrong, the last column is full of NaNs, etc. If I apply your function to the dataframe in this form, I get the same error as you did.
It's usually better to fix this before the data gets imported as a dataframe. I can think of 2 methods:
1. Clean the data before importing, for example by copying it into an editor and replacing the offending spaces with underscores. This is the easiest.
2. Use a regex to fix it during import. This may be necessary if the dataset is very large, or if it is pulled from a website and has to be refreshed regularly.
Here's an example of case (2):
In [35]: df = pd.read_clipboard(sep=r'\s{2,}|\s(?=-)', engine='python')
In [36]: df = df.rename(columns={'start lat': 'start_lat', 'start long': 'start_long'})
In [37]: df
Out[37]:
start_lat start_long end_lat end_long
0 38.902760 -77.038630 38.880300 -76.986200
2 38.895914 -77.026064 38.915400 -77.044600
3 38.888251 -77.049426 38.895914 -77.026064
4 38.892300 -77.043600 38.888251 -77.049426
The regex specifies that a separator must be either 2 or more whitespace characters, or 1 whitespace character followed by a hyphen (minus sign). Then I rename the columns to what I assume are the expected names.
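The same separator can be tried without the clipboard by reading from a string. This is only a sketch: the sample text below is an assumption about how the pasted data is spaced (two spaces between most fields, one space before each negative number), and it relies on pandas treating the extra leading field in each data row as the index.

```python
import io

import pandas as pd

# Assumed spacing: 2 spaces between fields, 1 space before a minus sign
text = (
    "start lat  start long  end_lat  end_long\n"
    "0  38.902760 -77.038630  38.880300 -76.986200\n"
    "2  38.895914 -77.026064  38.915400 -77.044600\n"
)

# Regex separators require the python parsing engine
df = pd.read_csv(io.StringIO(text), sep=r"\s{2,}|\s(?=-)", engine="python")
df = df.rename(columns={"start lat": "start_lat", "start long": "start_long"})
print(df.columns.tolist())
```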
From this point on your function / apply works fine, but I've changed it a little:
- PEP 8 recommends putting imports at the top of each file, rather than inside a function.
- Extracting the columns by name is more robust, and would have given a much more understandable error than the weird one thrown by geopy.
For example:
In [51]: def dist_calc(row):
    ...:     start = row[['start_lat','start_long']]
    ...:     end = row[['end_lat', 'end_long']]
    ...:     return vincenty(start, end).miles
    ...:
In [52]: df.apply(lambda row: dist_calc(row), axis=1)
Out[52]:
0 3.223232
2 1.674780
3 1.365851
4 0.420305
dtype: float64

Convert string from text file in order to plot using matplotlib

I am trying to plot a graph using dates and integers from a text file which looks like this:
However, I keep getting this error:
Traceback (most recent call last):
  File "C:\Users\Haeshan\Desktop\Comp Sci CC\graph.py", line 21, in <module>
    graph()
  File "C:\Users\Haeshan\Desktop\Comp Sci CC\graph.py", line 9, in graph
    converters = {1: mdates.strpdate2num("%d/%m/%Y")})
  File "C:\Users\Haeshan\AppData\Local\Programs\Python\Python35\lib\site-packages\numpy\lib\npyio.py", line 930, in loadtxt
    items = [conv(val) for (conv, val) in zip(converters, vals)]
  File "C:\Users\Haeshan\AppData\Local\Programs\Python\Python35\lib\site-packages\numpy\lib\npyio.py", line 930, in <listcomp>
    items = [conv(val) for (conv, val) in zip(converters, vals)]
  File "C:\Users\Haeshan\AppData\Local\Programs\Python\Python35\lib\site-packages\numpy\lib\npyio.py", line 659, in floatconv
    return float(x)
ValueError: could not convert string to float: b"['10'"
import matplotlib.pyplot as plt
import numpy as np
import csv
import matplotlib.dates as mdates
def graph():
    date, value = np.loadtxt("Scores.txt", delimiter = ",", unpack=True,
                             converters = {1: mdates.strpdate2num("%d/%m/%Y")})
    fig = plt.figure()
    ax1 = fig.add_subplot(1,1,1, axisbg ="white")
    plt.plot_date(x=date, y=value)
    plt.title("Performace")
    plt.ylabel("Score")
    plt.xlabel("Date")
graph()
Any ideas?
Many thanks
The problem is your first column, which has quotation marks. Since you have to define converters anyway, I would change it to
converters = {0: (lambda x: int(x)),1: mdates.strpdate2num("%d/%m/%Y")})
UPDATE
Sorry, because of the quotation marks I didn't see the other issues. To be honest, I would not use np.loadtxt in this case, since you also have square brackets in each line. In addition, you are using Python 3, where strings are Unicode rather than bytes, but loadtxt works with bytes (hence the b in front of your line).
My suggestion is to read the file line by line and parse each line, e.g.
dates,values = list(),list()
formater = mdates.strpdate2num("%d/%m/%Y")
with open("Scores.txt",'r',newline='\n') as input_file:
    for line in input_file:
        # Remove the square brackets, quotation marks and newlines (if necessary)
        # Be aware that this will also kill all square brackets and quotations marks in your line
        entries = line.replace('[','').replace(']','').replace("'",'').replace('\n','').split(',')
        values.append(int(entries[0]))
        dates.append(formater(entries[1]))
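A variant of the same line-by-line parsing that uses only the standard library, in case mdates.strpdate2num is unavailable (it was deprecated in later Matplotlib releases). The line layout "['10', '12/03/2017']" is an assumption inferred from the error message, and the function name parse_scores is hypothetical.

```python
from datetime import datetime


def parse_scores(lines, date_format="%d/%m/%Y"):
    """Parse lines like "['10', '12/03/2017']" into values and datetimes."""
    values, dates = [], []
    for line in lines:
        # Drop brackets and quotes, then split on the comma
        cleaned = line.replace("[", "").replace("]", "").replace("'", "")
        entries = [part.strip() for part in cleaned.split(",")]
        if len(entries) < 2 or not entries[0]:
            continue  # skip blank or malformed lines
        values.append(int(entries[0]))
        dates.append(datetime.strptime(entries[1], date_format))
    return values, dates


# Example with one assumed data line and one blank line
values, dates = parse_scores(["['10', '12/03/2017']\n", "\n"])
print(values, dates)
```

Matplotlib accepts datetime objects on an axis, so something like plt.plot(dates, values) should work without converting to date numbers first. To read the real file, pass the open file object: parse_scores(open("Scores.txt")).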
