Writing a dataframe without the index gives the wrong format - python-3.x

Here is my dataframe:
df
year 2022 2021
0 return on equity (roe) 160.90% 144.10%
1 average equity 62027.9677 65704.372
2 net profit margin 0.2531 0.2588
3 turnover 1.1179 1.0422
4 leverage 5.687 5.3421
I want to write it to Excel without the index:
df.to_excel('/tmp/test.xlsx',index=False)
Why is there an empty cell in the top-left corner of the test.xlsx file?
How can I get the expected format with the to_excel method?
Adding the header argument does not help:
df.to_excel('/tmp/test.xlsx', index=False, header=True)
Now read it back from the Excel file:
new_df = pd.read_excel('/tmp/test.xlsx',index_col=False)
new_df
Unnamed: 0 year 2022 2021
0 return on equity (roe) 160.90% 144.10% NaN
1 average equity 62027.9677 65704.372 NaN
2 net profit margin 0.2531 0.2588 NaN
3 turnover 1.1179 1.0422 NaN
4 leverage 5.687 5.3421 NaN
The header argument can't be a bool when reading:
new_df = pd.read_excel('/tmp/test.xlsx',index_col=False,header=True)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/debian/.local/lib/python3.9/site-packages/pandas/util/_decorators.py", line 211, in wrapper
return func(*args, **kwargs)
File "/home/debian/.local/lib/python3.9/site-packages/pandas/util/_decorators.py", line 331, in wrapper
return func(*args, **kwargs)
File "/home/debian/.local/lib/python3.9/site-packages/pandas/io/excel/_base.py", line 490, in read_excel
data = io.parse(
File "/home/debian/.local/lib/python3.9/site-packages/pandas/io/excel/_base.py", line 1734, in parse
return self._reader.parse(
File "/home/debian/.local/lib/python3.9/site-packages/pandas/io/excel/_base.py", line 732, in parse
validate_header_arg(header)
File "/home/debian/.local/lib/python3.9/site-packages/pandas/io/common.py", line 203, in validate_header_arg
raise TypeError(
TypeError: Passing a bool to header is invalid. Use header=None for no header or header=int or list-like of ints to specify the row(s) making up the column names

Include the header parameter set to True when writing:
df.to_excel('test.xlsx', index=False, header=True)
To read it back into a df, set the index_col parameter to None:
new_df = pd.read_excel('test.xlsx',index_col=None)
print(new_df)
year 2022 2021
0 return on equity (roe) 160.90% 144.10%
1 average equity 62027.9677 65704.372
2 net profit margin 0.2531 0.2588
3 turnover 1.1179 1.0422
4 leverage 5.687 5.3421

I found the reason: the dataframe in this example is special:
df.columns
MultiIndex([('year',),
('2022',),
('2021',)],
)
It's not a single-level index. After flattening the columns:
df.columns = ['year', '2022', '2021']
df.to_excel('/tmp/test.xlsx', index=False)
the strange behaviour finally disappeared.
A dataframe whose columns are the MultiIndex [('year',), ('2022',), ('2021',)] displays exactly like one with the single index ['year', '2022', '2021'], which is why it was easy to miss in my case.
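For completeness, a minimal sketch of that fix (the one-level column MultiIndex and the flattening step come from the question above; the data and file path are only illustrative):

import pandas as pd

# Columns as a one-level MultiIndex, as in the question
df = pd.DataFrame(
    [["return on equity (roe)", "160.90%", "144.10%"]],
    columns=pd.MultiIndex.from_tuples([("year",), ("2022",), ("2021",)]),
)

# Flatten the MultiIndex to a plain Index before writing
df.columns = [col[0] for col in df.columns]

# Now index=False writes the expected header row with no extra blank cell
df.to_excel("/tmp/test.xlsx", index=False)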

Related

Move a Range of cells based on Cell Content

I am trying to use openpyxl to clean up some data ripped from a series of PDFs. The data is coming from a pandas dataframe, so the first cell on each row is the index, and based on that I want to move the range of data in a specific manner. The basic idea is to take data looking like this:
1 abc def ghi
2 jkl mno pqr
1 stu vwx yza
2 bcd efg hij
...
to look like this:
1 abc def ghi jkl mno pqr
2 null null null null
1 stu vwx yza bcd efg hij
2 null null null null
...
The code I'm currently running isn't throwing any errors, but it's also just not doing anything to the sheet. I'm not sure what I'm missing here.
excel_file = load_workbook("C:\\Users\\Jake\\Documents\\Shared-VB\\upload 2.5.23\\Excel Result.xlsx")
sh = excel_file.active
for i in range(1, sh.max_row + 1):
    if sh.cell(row=i, column=1).value == "2":
        sh.move_range("B{i}:D{i}", rows=-1, cols=3)
excel_file.save("C:\\Users\\Jake\\Documents\\Shared-VB\\upload 2.5.23\\Final Result.xlsx")
UPDATE:
It looks like comparing the value as a string was "the" issue.
I changed it to:
for i in range(1, sh.max_row + 1):
    if sh.cell(row=i, column=1).value == 4:
        sh.move_range("B{i}:D{i}", rows=-1, cols=3)
which gave me the error :
Traceback (most recent call last):
File "C:\Users\Jake\Documents\Work Projects\Python\Contract Extraction\contract_extraction_xlm_1.py", line 23, in <module>
sh.move_range("B{i}:D{i}", rows=-1, cols=3)
File "C:\Users\Jake\anaconda3\lib\site-packages\openpyxl\worksheet\worksheet.py", line 772, in move_range
cell_range = CellRange(cell_range)
File "C:\Users\Jake\anaconda3\lib\site-packages\openpyxl\worksheet\cell_range.py", line 53, in __init__
min_col, min_row, max_col, max_row = range_boundaries(range_string)
File "openpyxl\utils\cell.py", line 135, in openpyxl.utils.cell.range_boundaries
ValueError: B{i}:D{i} is not a valid coordinate or range
[Finished in 923ms]
So now the issue is that I am not sure how to make the coordinates context-sensitive based on the row that was checked in the for loop.
Per Charlie Clark: ws.move_range() takes either a range string or a CellRange() object, which is easier to parametrise. Your issue is the range string: it should either be an f-string or use the .format() method. All of this is easier if you stick with the API.
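A minimal sketch of that advice (file paths and the comparison value are illustrative, not from the original code):

from openpyxl import load_workbook
from openpyxl.worksheet.cell_range import CellRange

wb = load_workbook("Excel Result.xlsx")
ws = wb.active

for i in range(1, ws.max_row + 1):
    if ws.cell(row=i, column=1).value == 2:
        # f-string so the row number is actually interpolated
        ws.move_range(f"B{i}:D{i}", rows=-1, cols=3)
        # or, equivalently, a CellRange object (easier to parametrise):
        # ws.move_range(CellRange(min_col=2, min_row=i, max_col=4, max_row=i), rows=-1, cols=3)

wb.save("Final Result.xlsx")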

Problems reading a csv file [duplicate]

I'm trying to import a .csv file using pandas.read_csv(); however, I don't want to import the 2nd row of the data file (the row with index 1, 0-indexed).
I can't see how to leave it out, because the argument for this seems ambiguous:
From the pandas website:
skiprows : list-like or integer
    Row numbers to skip (0-indexed) or number of rows to skip (int) at the
    start of the file.
If I put skiprows=1 in the arguments, how does it know whether to skip the first row or skip the row with index 1?
You can try yourself:
>>> import pandas as pd
>>> from io import StringIO
>>> s = """1, 2
... 3, 4
... 5, 6"""
>>> pd.read_csv(StringIO(s), skiprows=[1], header=None)
0 1
0 1 2
1 5 6
>>> pd.read_csv(StringIO(s), skiprows=1, header=None)
0 1
0 3 4
1 5 6
I don't have the reputation to comment yet, but I want to add to alko's answer for future reference.
From the docs:
skiprows: A collection of numbers for rows in the file to skip. Can also be an integer to skip the first n rows
I ran into the same issue with skiprows while reading a csv file.
I was using skip_rows=1, which does not work (the parameter is called skiprows).
A simple example gives an idea of how to use skiprows when reading a csv file:
import pandas as pd

# skiprows=1 will skip the first line and start reading from the second line
df = pd.read_csv('my_csv_file.csv', skiprows=1)

# print the dataframe
df
All of these answers miss one important point -- the n'th line is the n'th line in the file, and not the n'th row in the dataset. I have a situation where I download some antiquated stream gauge data from the USGS. The head of the dataset is commented with '#', the first line after that is the labels row, next comes a line that describes the data types, and last the data itself. I never know how many comment lines there are, but I know what the first couple of rows are. Example:
# ----------------------------- WARNING ----------------------------------
# Some of the data that you have obtained from this U.S. Geological Survey database
# may not have received Director's approval. ...
agency_cd  site_no   datetime          tz_cd  139719_00065  139719_00065_cd
5s         15s       20d               6s     14n           10s
USGS       08041780  2018-05-06 00:00  CDT    1.98          A
It would be nice if there was a way to automatically skip the n'th row as well as the n'th line.
As a note, I was able to fix my issue with:
import pandas as pd
ds = pd.read_csv(fname, comment='#', sep='\t', header=0, parse_dates=True)
ds.drop(0, inplace=True)
Indices in read_csv refer to line/row numbers in your csv file (the first line has the index 0). You have the following options to skip rows:
from io import StringIO
csv = \
"""col1,col2
1,a
2,b
3,c
4,d
"""
pd.read_csv(StringIO(csv))
# Output:
col1 col2 # index 0
0 1 a # index 1
1 2 b # index 2
2 3 c # index 3
3 4 d # index 4
Skip two lines at the start of the file (index 0 and 1). The column-name line (index 0) is skipped as well, so the first remaining line is used for the column names. To set column names explicitly, use the names=['col1', 'col2'] parameter:
pd.read_csv(StringIO(csv), skiprows=2)
# Output:
2 b
0 3 c
1 4 d
Skip second and fourth lines (index 1 and 3):
pd.read_csv(StringIO(csv), skiprows=[1, 3])
# Output:
col1 col2
0 2 b
1 4 d
Skip last two lines:
pd.read_csv(StringIO(csv), engine='python', skipfooter=2)
# Output:
col1 col2
0 1 a
1 2 b
Use a lambda function to skip every second line (index 1 and 3):
pd.read_csv(StringIO(csv), skiprows=lambda x: (x % 2) != 0)
# Output:
col1 col2
0 2 b
1 4 d
skiprows=[1] will skip the second line, not the first one.
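A quick check of that point, reusing the toy-data style from above:

from io import StringIO
import pandas as pd

s = "col1,col2\n1,a\n2,b\n3,c\n"

# skiprows=[1] drops the line with index 1 ("1,a"); the header line (index 0) is kept
print(pd.read_csv(StringIO(s), skiprows=[1]))
#    col1 col2
# 0     2    b
# 1     3    c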

Pivot Table Function not working - AttributeError: 'numpy.ndarray' object has no attribute 'name'

[Image: first columns of the dataset]
[Image: last columns of the dataset]
Unsure what this error means. I'm creating a pivot table of my dataset (the Canadian Labour Force Survey, which details changes in employment statistics over time), and it works for every year and province combination except one in particular.
The code I'm running is:
no_row_df = pd.pivot_table(no_row_data, values = "FINALWT", index = ["SURVDATE", "PROV"], aggfunc=np.sum)
for col in count_columns:
    table = pd.pivot_table(no_row_data, values = "FINALWT", index = ["SURVDATE", "PROV"], columns=[col], aggfunc=np.sum)
    no_row_df = df.merge(table, left_index=True, right_on=["SURVDATE", "PROV"])
no_row_df
where SURVDATE is the date of the survey (in year/month form) and PROV is the province. FINALWT is the weighting and represents the number of individuals with particular labour characteristics. There are a whole host of columns, but they essentially boil down to things like Industry, Occupation, Firm Size etc.
The full error is:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-17-ade72c6b3106> in <module>
2 for col in count_columns:
3 table = pd.pivot_table(no_row_data, values = "FINALWT", index = ["SURVDATE", "PROV"], columns=[col], aggfunc=np.sum)
----> 4 no_row_df = df.merge(table, left_index=True, right_on=["SURVDATE", "PROV"])
5 no_row_df
~/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py in merge(self, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
7347 copy=copy,
7348 indicator=indicator,
-> 7349 validate=validate,
7350 )
7351
~/anaconda3/lib/python3.7/site-packages/pandas/core/reshape/merge.py in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
81 validate=validate,
82 )
---> 83 return op.get_result()
84
85
~/anaconda3/lib/python3.7/site-packages/pandas/core/reshape/merge.py in get_result(self)
665 result = self._indicator_post_merge(result)
666
--> 667 self._maybe_add_join_keys(result, left_indexer, right_indexer)
668
669 self._maybe_restore_index_levels(result)
~/anaconda3/lib/python3.7/site-packages/pandas/core/reshape/merge.py in _maybe_add_join_keys(self, result, left_indexer, right_indexer)
819 elif result._is_level_reference(name):
820 if isinstance(result.index, MultiIndex):
--> 821 key_col.name = name
822 idx_list = [
823 result.index.get_level_values(level_name)
AttributeError: 'numpy.ndarray' object has no attribute 'name'
Thanks in advance.
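One way to sidestep the merge error, sketched on made-up data (the column values below are invented, and the join choice is an assumption about what was intended): each per-column pivot table carries the same (SURVDATE, PROV) MultiIndex as no_row_df, so the two sides can be merged on their indexes with right_index=True, accumulating into no_row_df rather than re-merging df each time.

import numpy as np
import pandas as pd

# made-up stand-in for the survey data
no_row_data = pd.DataFrame({
    "SURVDATE": ["2020-01", "2020-01", "2020-02", "2020-02"],
    "PROV": ["ON", "QC", "ON", "QC"],
    "INDUSTRY": ["A", "B", "A", "B"],
    "FINALWT": [10, 20, 30, 40],
})
count_columns = ["INDUSTRY"]

no_row_df = pd.pivot_table(no_row_data, values="FINALWT",
                           index=["SURVDATE", "PROV"], aggfunc=np.sum)
for col in count_columns:
    table = pd.pivot_table(no_row_data, values="FINALWT",
                           index=["SURVDATE", "PROV"], columns=[col], aggfunc=np.sum)
    # both frames share the (SURVDATE, PROV) index, so join on both indexes
    no_row_df = no_row_df.merge(table, left_index=True, right_index=True)

print(no_row_df)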

Pandas: new column using data from multiple other files

I would like to add a new column in a pandas dataframe df, filled with data that are in multiple other files.
Say my df is like this:
Sample Pos
A 5602
A 3069483
B 51948
C 231
And I have three files A_depth-file.txt, B_depth-file.txt, C_depth-file.txt like this (showing A_depth-file.txt):
Pos Depth
1 31
2 33
3 31
... ...
5602 52
... ...
3069483 40
The desired output df would have a new column Depth as follows:
Sample Pos Depth
A 5602 52
A 3069483 40
B 51948 32
C 231 47
I have a method that works but it takes about 20 minutes to fill a df with 712 lines, searching files of ~4 million lines (=positions). Would anyone know a better/faster way to do this?
The code I am using now is:
import pandas as pd
from io import StringIO

with open("mydf.txt") as f:
    next(f)
    List = []
    for line in f:
        df = pd.read_fwf(StringIO(line), header=None)
        df.rename(columns={df.columns[1]: "Pos"}, inplace=True)
        f2basename = df.iloc[:, 0].values[0]
        f2 = f2basename + "_depth-file.txt"
        df2 = pd.read_csv(f2, sep='\t')
        df = pd.merge(df, df2, on="Pos", how="left")
        List.append(df)

df = pd.concat(List, sort=False)
with open("mydf.txt") as f: to open the file to which I wish to add data
next(f) to pass the header
List=[] to create a new empty array called List
for line in f: to go over mydf.txt line by line and reading them with df = pd.read_fwf(StringIO(line), header=None)
df.rename(columns = {df.columns[1]: "Pos"}, inplace=True) to rename lost header name for Pos column, used later when merging line with associated file f2
f2basename = df.iloc[:, 0].values[0] getting basename of associated file f2 based on 1st column of mydf.txt
f2 = f2basename + "_depth-file.txt"to get full associated file f2 name
df2 = pd.read_csv(f2, sep='\t') to read file f2
df = pd.merge(df, df2, on="Pos", how="left")to merge the two files on column Pos, essentially adding Depth column to mydf.txt
List.append(df)adding modified line to the array List
df = pd.concat(List, sort=False) to concatenate elements of the List array into a dataframe df
Additional NOTES
In reality, I may need to search not only three files but several hundreds.
I didn't test the execution time, but it should be faster if you read your 'mydf.txt' file into a dataframe too, using read_csv, and then use groupby with apply.
If you know in advance that you have 3 samples and 3 corresponding files storing the depth, you can make a dictionary to read and store the three respective dataframes in advance and use them when needed.
df = pd.read_csv('mydf.txt', sep='\s+')
files = {basename : pd.read_csv(basename + "_depth-file.txt", sep='\s+') for basename in ['A', 'B', 'C']}
res = df.groupby('Sample').apply(lambda x : pd.merge(x, files[x.name], on="Pos", how="left"))
The final res would look like:
Sample Pos Depth
Sample
A 0 A 5602 52.0
1 A 3069483 40.0
B 0 B 51948 NaN
C 0 C 231 NaN
There are NaN values because I am using the sample provided and I don't have files for B and C (I used a copy of A), so values are missing. Provided that your files contain a 'Depth' for each 'Pos' you should not get any NaN.
To get rid of the multiindex made by groupby you can do:
res.reset_index(drop=True, inplace=True)
and res becomes:
Sample Pos Depth
0 A 5602 52.0
1 A 3069483 40.0
2 B 51948 NaN
3 C 231 NaN
EDIT after comments
Since you have a lot of files, you can use the following solution: same idea, but it does not require reading all the files in advance. Each file is read only when needed.
def merging_depth(x):
    td = pd.read_csv(x.name + "_depth-file.txt", sep='\s+')
    return pd.merge(x, td, on="Pos", how="left")

res = df.groupby('Sample').apply(merging_depth)
The result is the same.

Pandas: rare data form disrupts normal statistical analysis

I'm having an issue with analysing some bioinformatics data (in Pandas), where a rare but valid form of data messes up the statistical analysis of said data. This is what the data (in the dataframe group_PrEST) normally looks like:
PrEST ID Gene pepCN1 pepCN2 pepCN3
HPRR1 CD38 5298 10158 NaN
HPRR2 EGFR 79749 85793 117274
HPRR6 EPS8 68076 62305 66599
HPRR6 EPS8 NaN NaN 141828
Here is some of the code that works on this data (PrEST_stats is another dataframe that collects the statistics):
PrEST_stats['R1'] = group_PrESTs['pepCN1'].median()
PrEST_stats['R2'] = group_PrESTs['pepCN2'].median()
PrEST_stats['R3'] = group_PrESTs['pepCN3'].median()
PrEST_stats['CN'] = PrEST_stats[['R1', 'R2', 'R3']].median(axis=1)
PrEST_stats['CN'] = PrEST_stats['CN'].round()
PrEST_stats['STD'] = PrEST_stats[['R1', 'R2', 'R3']].std(axis=1, ddof=1)
PrEST_stats['CV'] = PrEST_stats['STD'] / \
    PrEST_stats[['R1', 'R2', 'R3']].mean(axis=1) * 100
What it does is essentially this:
1. Calculate the median of the columns pepCN1, pepCN2 and pepCN3, respectively, for each gene
2. Calculate the median of the results of (1)
3. Calculate the standard deviation and coefficient of variation of the results from (1)
And this works just fine, in most cases:
The above data for gene CD38 would give two median values (R1 and R2) identical to their origins in pepCN1 & pepCN2 (since there's only one row for gene CD38).
Gene EPS8 would give R1 and R2 in a similar manner, but assign another value for R3 based on the median of the two values in the two rows of column pepCN3.
In both of these cases, the statistics would then be calculated in the correct way. The fact that the statistics are calculated after the first round of "reducing" the data (i.e. calculating the first median(s)) is intended: the three data columns represent technical replicates, and should be handled individually before "merging" them together in a final statistical value.
The problem arises in the rare cases where the data looks like this:
PrEST ID Gene pepCN1 pepCN2 pepCN3
HPRR9 PTK2B 4972 NaN NaN
HPRR9 PTK2B 17095 NaN NaN
The script will here reduce the two pepCN1 values to a single median, disregarding the fact that there are no values in the other data columns (i.e. no data from replicates 2 and 3) with which to calculate statistics. The script will function and give the correct CN value (the median), but the standard deviation and coefficient of variation will be left out (i.e. show up as NaN).
In cases like this, I want the script to somehow see that reducing the data column to one value (the first median) is not the way to go. Essentially, I want it to skip calculating the first median (here: R1) and just calculate statistics on the two pepCN1 rows. Is there a way to do this? Thanks in advance!
[EDIT: new problems]
Ok, so now the code looks like this:
indexer = PrESTs.groupby('PrEST ID').median().count(1) == 1
one_replicate = PrESTs.loc[PrESTs['PrEST ID'].isin(indexer[indexer].index)]
multiple_replicates = PrESTs.loc[~PrESTs['PrEST ID'].isin(indexer[indexer].index)]
all_replicates = {0: one_replicate, 1: multiple_replicates}

# Calculations (PrESTs)
PrEST_stats_1 = pd.DataFrame()
PrEST_stats_2 = pd.DataFrame()
all_stats = {0: PrEST_stats_1, 1: PrEST_stats_2}
for n in range(2):
    current_replicate = all_replicates[n].groupby(['PrEST ID', 'Gene names'])
    current_stats = all_stats[n]
    if n == 1:
        current_stats['R1'] = current_replicate['pepCN1'].median()
        current_stats['R2'] = current_replicate['pepCN2'].median()
        current_stats['R3'] = current_replicate['pepCN3'].median()
    else:
        current_stats['R1'] = current_replicate['pepCN1']  # PROBLEM (not with .median())
        current_stats['R2'] = current_replicate['pepCN2']  # PROBLEM (not with .median())
        current_stats['R3'] = current_replicate['pepCN3']  # PROBLEM (not with .median())
    current_stats['CN'] = current_stats[['R1', 'R2', 'R3']].median(axis=1)
    current_stats['CN'] = current_stats['CN'].round()
    current_stats['STD'] = current_stats[['R1', 'R2', 'R3']].std(axis=1, ddof=1)
    current_stats['CV'] = current_stats['STD'] / \
        current_stats[['R1', 'R2', 'R3']].mean(axis=1) * 100
    current_stats['STD'] = current_stats['STD'].round()
    current_stats['CV'] = current_stats['CV'].round(1)
PrEST_stats = PrEST_stats_1.append(PrEST_stats_2)
... and I have a new problem. Splitting the two cases into two new DataFrames works just fine, and what I want to do now is handle them slightly differently in the for loop above. I have checked the lines marked # PROBLEM by adding .median() there as well, which gives me the same results as before - i.e. the rest of the code works, just NOT when I try to leave the data as it is! This is the error I get:
Traceback (most recent call last):
File "/Users/erikfas/Dropbox/Jobb/Data - QE/QEtest.py", line 110, in <module>
current_stats['R1'] = current_replicate['pepCN1']
File "/Users/erikfas/anaconda/envs/py33/lib/python3.3/site-packages/pandas/core/frame.py", line 1863, in __setitem__
self._set_item(key, value)
File "/Users/erikfas/anaconda/envs/py33/lib/python3.3/site-packages/pandas/core/frame.py", line 1938, in _set_item
self._ensure_valid_index(value)
File "/Users/erikfas/anaconda/envs/py33/lib/python3.3/site-packages/pandas/core/frame.py", line 1915, in _ensure_valid_index
raise ValueError('Cannot set a frame with no defined index '
ValueError: Cannot set a frame with no defined index and a value that cannot be converted to a Series
I've tried to find out what's wrong here, but I'm drawing blanks. Is it that I have to do SOMETHING to the data instead of .median(), rather than just nothing? Or something else?
[EDIT]: Changed some lines in the above code (in the else statement):
temp.append(current_replicate['pepCN1'])
temp.append(current_replicate['pepCN2'])
temp.append(current_replicate['pepCN3'])
current_stats = pd.concat(temp)
... where temp is an empty list, but then I get the following error:
File "/Users/erikfas/Dropbox/Jobb/Data - QE/QEtest.py", line 119, in <module>
temp2 = pd.concat(temp)
File "/Users/erikfas/anaconda/envs/py33/lib/python3.3/site-packages/pandas/tools/merge.py", line 926, in concat
verify_integrity=verify_integrity)
File "/Users/erikfas/anaconda/envs/py33/lib/python3.3/site-packages/pandas/tools/merge.py", line 986, in __init__
if not 0 <= axis <= sample.ndim:
File "/Users/erikfas/anaconda/envs/py33/lib/python3.3/site-packages/pandas/core/groupby.py", line 295, in __getattr__
return self._make_wrapper(attr)
File "/Users/erikfas/anaconda/envs/py33/lib/python3.3/site-packages/pandas/core/groupby.py", line 310, in _make_wrapper
raise AttributeError(msg)
AttributeError: Cannot access attribute 'ndim' of 'SeriesGroupBy' objects, try using the 'apply' method
Is this not something I can do with groupby objects?
Try this
In [28]: df
Out[28]:
id gene p1 p2 p3
0 HPRR1 CD38 5298 10158 NaN
1 HPRR2 EGFR 79749 85793 117274
2 HPRR6 EPS8 68076 62305 66599
3 HPRR6 EPS8 NaN NaN 141828
4 HPRR9 PTK2B 4972 NaN NaN
5 HPRR9 PTK2B 17095 NaN NaN
[6 rows x 5 columns]
Group by the id field (I gather that's where you want valid medians). Figure out
if there are any invalid medians in that group (e.g. they come up NaN after combining the group).
In [53]: df.groupby('id').median().count(1)
Out[53]:
id
HPRR1 2
HPRR2 3
HPRR6 3
HPRR9 1
dtype: int64
You want to remove groups that only have 1 valid value, ok!
In [54]: df.groupby('id').median().count(1) == 1
Out[54]:
id
HPRR1 False
HPRR2 False
HPRR6 False
HPRR9 True
dtype: bool
In [30]: indexers = df.groupby('id').median().count(1) == 1
Take them out from the original data (then rerun) or fill or whatever.
In [67]: df.loc[~df.id.isin(indexers[indexers].index)]
Out[67]:
id gene p1 p2 p3
0 HPRR1 CD38 5298 10158 NaN
1 HPRR2 EGFR 79749 85793 117274
2 HPRR6 EPS8 68076 62305 66599
3 HPRR6 EPS8 NaN NaN 141828
[4 rows x 5 columns]
For your overall calculations you can do something like this. This is much preferred to appending to an initially empty DataFrame.
results = []
for r in range(2):
    # do the calcs from above to generate, say, df1 and df2
    results.append(df1)
    results.append(df2)

# concatenate the rows!
final_result = pd.concat(results)
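Putting the pieces above together into a self-contained sketch on the toy data (the per-group calculation here is only a placeholder for the full R1/R2/R3 logic from the question):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id':   ['HPRR1', 'HPRR2', 'HPRR6', 'HPRR6', 'HPRR9', 'HPRR9'],
    'gene': ['CD38', 'EGFR', 'EPS8', 'EPS8', 'PTK2B', 'PTK2B'],
    'p1':   [5298, 79749, 68076, np.nan, 4972, 17095],
    'p2':   [10158, 85793, 62305, np.nan, np.nan, np.nan],
    'p3':   [np.nan, 117274, 66599, 141828, np.nan, np.nan],
})

# groups whose per-column medians leave only one valid value (here HPRR9)
indexers = df.groupby('id')[['p1', 'p2', 'p3']].median().count(axis=1) == 1

single = df.loc[df['id'].isin(indexers[indexers].index)]
multiple = df.loc[~df['id'].isin(indexers[indexers].index)]

results = []
for part in (single, multiple):
    stats = part.groupby(['id', 'gene'])[['p1', 'p2', 'p3']].median()  # placeholder calc
    stats['CN'] = stats.median(axis=1).round()
    results.append(stats)

final_result = pd.concat(results)
print(final_result)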
