Astropy Fits: How to write out a table with rows sliced out? - python-3.x

I'm currently working with some FITS tables and I'm having trouble writing output with astropy.io.fits. Essentially, I am slicing out a bunch of rows that hold data for objects I'm not interested in, but when I save the new table all of those rows have magically reappeared.
For example:
import astropy.io.fits as fits
import numpy as np
hdu = fits.open('some_fits_file.fits')[1].data
sample_slice = [True, True, True, False, False, True]
hdu_sliced = hdu[sample_slice]
Naively I expect "hdu" to have 6 rows and hdu_sliced to have 4, which is indeed what np.size() reports. So if I save hdu_sliced, the new FITS file should also have 4 rows:
new_hdu = fits.BinTableHDU.from_columns(fits.ColDefs(hdu_sliced.columns))
new_hdu.writeto('new_fits_file.fits')
np.size(new_hdu.data)
6
So the two rows that I got rid of with the slice are for some reason not actually removed from the table, and the output file is just the same as the original file.
How do I delete the rows I don't want from the table and then output that new data to a new file?
Cheers,
Ashley

Can you use astropy.table.Table instead of astropy.io.fits.BinTableHDU?
It's a much more friendly table object.
One way to make a row selection is to index into the table object with a list (or array) of rows you want:
>>> from astropy.table import Table
>>> table = Table()
>>> table['col_a'] = [1, 2, 3]
>>> table['col_b'] = ['spam', 'ham', 'jam']
>>> print(table)
col_a col_b
----- -----
    1  spam
    2   ham
    3   jam
>>> table[[0, 2]]  # Table with rows 0 and 2 only, row 1 removed (a copy)
<Table length=2>
col_a col_b
int64  str4
----- -----
    1  spam
    3   jam
You can read and write to FITS directly with Table:
table = Table.read('file.fits', hdu='mydata')
table2 = table[[2, 7, 10]]
table2.write('file2.fits')
There are potential issues, e.g. the FITS BINTABLE header isn't fully preserved when going through Table; only the key/value information is stored in table.meta. You can consult the Astropy docs on Table and FITS BINTABLE for details about the two table objects, how they represent data, and how to convert between the two, or just ask follow-up questions here or on the astropy-dev mailing list.
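For the specific case in the question, a boolean mask works just like a list of indices. A minimal sketch, assuming the table lives in the second HDU of the file (the file names and the mask are taken from the question):
import numpy as np
from astropy.table import Table

# Read the binary table from the second HDU (hdu=1).
table = Table.read('some_fits_file.fits', hdu=1)

# Keep only the rows flagged True, same idea as sample_slice above.
sample_slice = np.array([True, True, True, False, False, True])
table_sliced = table[sample_slice]   # a copy with 4 rows

table_sliced.write('new_fits_file.fits', overwrite=True)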

If you want to stick to using FITS_rec, you can try the following, which seems to be a workaround:
new_hdu = fits.BinTableHDU.from_columns(hdu_sliced._get_raw_data())
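Depending on your astropy version, you may also be able to build the new HDU directly from the sliced FITS_rec, which avoids calling a private method; a hedged sketch:
import astropy.io.fits as fits

data = fits.open('some_fits_file.fits')[1].data
sample_slice = [True, True, True, False, False, True]
data_sliced = data[sample_slice]

# Construct the HDU from the sliced data itself rather than from the
# original columns, so only the 4 selected rows are written out.
new_hdu = fits.BinTableHDU(data=data_sliced)
new_hdu.writeto('new_fits_file.fits', overwrite=True)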

Related

Python Converting a List into an Array

I have a list that is 5 rows by 5 columns.
I am trying to convert this list into a dataframe.
When I try to do so, it only grabs the first row.
This failed because I had it set to 5,5:
df2 = pd.DataFrame(np.array(pdf_read).reshape(5,5),columns=list("abcde"))
When I switched it to this:
df2 = pd.DataFrame(np.array(pdf_read).reshape(1,5),columns=list("abcde"))
It only grabbed the first row.
Why does it do this?
Any advice?
Edit: Added Context
I am using the tabula module in python to read a PDF file.
The PDF file results are stored in the variable pdf_read.
When I do len(pdf_read) it has a length of 1, but when I type
print(pdf_read) it says it is 5 rows x 5 columns, which is very strange.
Edit #2: Datatypes
I ran the following:
print(type(pdf_read))
print(type(pdf_read[0]))
I got <class 'list'> and <class 'pandas.core.frame.DataFrame'> respectively.
It seems I have a Dataframe inside of a list.
I ran this code:
df = pd.DataFrame(
    pdf_read[0],
    columns=["column_a", "column_b", "column_c", "column_d", "column_e"]
)
This just returns a 5,5 dataframe, but all of the values in each column are NaN.
Some progress made, but will need to figure out why the values are not populated now.
EDIT: After some research, it turns out pdf_read is a list of DataFrames.
So for the first DataFrame:
df = pdf_read[0]
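A small sketch of that, assuming tabula-py and a one-table PDF (the file name and column names here are placeholders): take the DataFrame out of the list and rename its columns in place, rather than wrapping it in pd.DataFrame(..., columns=...), which selects columns by name and fills non-matching ones with NaN.
import tabula

# tabula returns a list of DataFrames, one per table it finds.
pdf_read = tabula.read_pdf('some_file.pdf', pages=1)

df = pdf_read[0]            # the first (and here only) table
df.columns = list("abcde")  # rename the existing 5 columns instead of rebuilding
print(df.shape)             # expect (5, 5) if the table parsed cleanly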

How can I groupby rows by the columns in which they actually possess a data point?

I don't even know if groupby is the correct function to use for this. It's a bit hard to understand, so I'll include a screenshot of my dataframe: screenshot
Basically, this dataframe has way too many columns because each column is specific to only one or a few rows. You can see in the screenshot that the first few columns apply only to the first row and the last few columns apply only to the last row. I want to make it so that each row only has the columns that actually pertain to it. I've tried several approaches using groupby('equipment name') and several using dropna, but none work the way I need them to. I'm also open to separating it into multiple dataframes.
Any method is acceptable, this bug has been driving me crazy. It took me a while to get to this point because this started out as an unintelligible 10,000 line json. I'm pretty new to programming as well.
The linked answer is very cool and could be one option (and it does use groupby, so sorry for dismissing that!). It will group your data into DataFrames where each DataFrame has a unique group of columns, and any row which only contains values for those columns will be in that DataFrame. If your data are such that there are multiple groups of rows sharing exactly the same columns, that solution is ideal, I think.
Just to note, though: if your null values are more randomly spread throughout the dataset, or if one row in a group is missing a single entry (compared to related rows), you will end up with more combinations of unique non-null columns, and therefore more output DataFrames.
There are also (in my opinion) nice ways to search a DataFrame, even if it is very sparse. You can check the non-null values for a row:
df.loc[index_name].dropna()
Or for an index number:
df.iloc[index_number].dropna()
You could further store these values, say in a dictionary (this is a dictionary of Series, but it could be converted to a DataFrame):
row_dict = {row : df.loc[row].dropna() for row in df.index}
I could imagine some scenarios where something based off these options is more helpful for searching. But that linked answer is slick, I would try that.
EDIT: Expanding on the answer above based on comments with OP.
The dictionary created in the linked post contains the DataFrames. Basically, you can use this dictionary to do comparisons with the original source data. My only issue with that answer was that it may be hard to search the dictionary if the column names are janky (as it looks like in your data), so here's a slight modification:
d = {}
for i, (name, group) in enumerate(df.groupby(df.isnull().dot(df.columns))):
    d['df' + str(i)] = group.dropna(axis=1)
Now the dictionary keys are "df#", and the values are the DataFrames. So if you wanted to inspect the contents of one DataFrame, you can call:
d['df1'].head()
#OR
print(d['df0'])
If you wanted to look at all the DataFrames, you could call:
for df in d.values():
    print(df.head())  # you can also pass an integer to head to show more rows than 5
Or if you wanted to save each DataFrame you could call:
for name in sorted(d.keys()):
    d[name].to_csv('path/to/file/' + name + '.csv')
The point is, you've gotten to a data structure where you can look at the original data, separated into DataFrames without missing data. Joining these back into a single DataFrame would be redundant, as it would create a single DataFrame (equal to the original) or multiple with some amount of missing data.
I think it comes down to what you are looking for and how you need to search the data. You could rename the dictionary keys / output .CSV files based on the types of machinery inside, for example.
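A hedged sketch of that renaming idea, using the 'equipment name' column mentioned in the question (adjust the column name and path to your data):
for key, sub_df in d.items():
    # Label each file by the first equipment name in the group, if the column exists.
    if 'equipment name' in sub_df.columns:
        label = str(sub_df['equipment name'].iloc[0]).replace('/', '_')
    else:
        label = key
    sub_df.to_csv('path/to/file/' + label + '.csv')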
I thought your last comment might mean that objects of similar type might not share the same columns; say, for example, if not all "Exhaust Fans" have the same columns, they will end up in different DataFrames in the dictionary. This may be the type of case where it is easier to just look at individual rows, rather than grouping them into weird categories:
df_dict = {row : pd.DataFrame(df.loc[row].dropna()).transpose() for row in df.index}
You could again then save these DataFrames as CSV files or look at them one by one (or, e.g., search for Exhaust Fans by seeing if "Exhaust" is in the key). You could also print them all at once:
import pandas as pd
import numpy as np
import natsort
#making some randomly sparse data
columns = ['Column ' + str(i+1) for i in range(10)]
index = ['Row ' + str(i+1) for i in range(100)]
df = pd.DataFrame(np.random.rand(100,10), columns=columns,index=index)
df[df<.7] = np.nan
#creating the dictionary where each key is a row name
df_dict = {row : pd.DataFrame(df.loc[row].dropna()).transpose() for row in df.index}
#printing all the output
for key in natsort.natsorted(df_dict.keys())[:5]:  # using [:5] to limit output
    print(df_dict[key], '\n')
Out[1]:
Column 1 Column 4 Column 7 Column 9 Column 10
Row 1 0.790282 0.710857 0.949141 0.82537 0.998411
Column 5 Column 8 Column 10
Row 2 0.941822 0.722561 0.796324
Column 2 Column 4 Column 5 Column 6
Row 3 0.8187 0.894869 0.997043 0.987833
Column 1 Column 7
Row 4 0.832628 0.8349
Column 1 Column 4 Column 6
Row 5 0.863212 0.811487 0.924363
Instead of printing, you could write the output to a text file; maybe that's the type of document that you could look at (and search) to compare to the input tables. But note that even though the printed data are tabular, they can't be made into a single DataFrame without accepting that there will be missing data for rows which don't have entries for all columns.
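A minimal sketch of that text-file idea (the output file name is just a placeholder):
# Append each single-row DataFrame to one searchable text file, in natural row order.
with open('row_summaries.txt', 'w') as f:
    for key in natsort.natsorted(df_dict.keys()):
        f.write(df_dict[key].to_string())
        f.write('\n\n')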

Splicing a Pandas dataframe by column name

I am trying to split a copy off of a Pandas dataframe starting after a certain column by header name.
So far, I've been able to manipulate the column headers or indexes according to a set number of known columns, like below. However, the number of columns will change, and I still want to extract every column that comes after.
In the below example, say I want to grab every column from 'Tail' onward, even if the 'Body' columns go up to column X. So the below sample with X number of Body columns:
df = pd.DataFrame({'Intro1': ['blah'], 'Intro2': ['blah'], 'Intro3': ['blah'],
                   'Body1': ['blah'], 'Body2': ['blah'], 'Body3': ['blah'],
                   'Body4': ['blah'], ... 'BodyX': ['blah'],
                   'Tail': ['blah'], 'OtherTail': ['blah'], 'StillAnotherTail': ['blah']})
Should produce a copy of the dataframe as:
dftail = pd.DataFrame({'Tail': ['blah'], 'OtherTail': ['blah'], 'StillAnotherTail': ['blah']})
Ideally I'd like to find a way to combine the two techniques below so that the selection starts at 'Tail' and goes to the end of the dataframe:
dftail = [col for col in df if col.startswith('Tail')]
dftail = df.iloc[:, 164:] # column number (164) will change based on 'Tail' index number
How about this:
df_tail = df.iloc[:, list(df.columns).index("Tail"):]
df_tail then prints out:
   Tail OtherTail StillAnotherTail
0  blah      blah             blah
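If the column labels are unique, label-based slicing with .loc is another option that reads a little more directly; a sketch using the same df as above:
# Everything from 'Tail' through the last column, by label.
df_tail = df.loc[:, 'Tail':]

# Equivalent, computing the position first and then slicing by position.
df_tail = df.iloc[:, df.columns.get_loc('Tail'):]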

ValueError when filtering and adding a new column at once

I'm getting the error code:
ValueError: Wrong number of items passed 3, placement implies 1.
What I want to do is import a dataset, count the duplicated values, drop the duplicates, and add a column which says how many duplicates there were of that item.
This is to sort a dataset of 13,000 rows and 45 columns.
I've tried different solutions found online but they don't seem to help. I'm pretty new to programming and all help is really appreciated.
import pandas as pd

# Making file ready
data = pd.read_excel(r'Some file.xlsx', header=0)
data.rename(columns={'Dato': 'Last ordered', 'ArtNr': 'Item No:'}, inplace=True)

# Formatting dates
pd.to_datetime(data['Last ordered'], format='%Y-%m-%d %H:%M:%S')

# Creates new table content and order
df = data[['Item No:', 'Last ordered', 'Description']]
df['Last ordered'] = df['Last ordered'].dt.strftime('%Y-/%m-/%d')
df = df.sort_values('Last ordered', ascending=False)

# Adds total sold quantity column
df['Quantity'] = df.groupby('Item No:').transform('count')
df2 = df.drop_duplicates('Item No:').reset_index(drop=True)

# Prints to environment and creates new excel file
print(df2)
df2.to_excel(r'New Sorted File.xlsx')
I expect it to provide a new excel file with columns:
Item No | Last ordered | Description | Quantity
And I want to be able to add other columns from the original dataset as well if I need to later on.
The problem is at this line:
df['Quantity'] = df.groupby('Item No:').transform('count')
The right-hand side of the assignment is a DataFrame, and you are trying to fit it into a single column. You need to select only one of its columns. Something like
df['Quantity'] = df.groupby('Item No:').transform('count')['Description']
should work.
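A small self-contained illustration of the difference, on toy data rather than the original file (selecting the column before the transform is equivalent and slightly cheaper):
import pandas as pd

df = pd.DataFrame({'Item No:': ['A', 'A', 'B'],
                   'Description': ['x', 'x', 'y'],
                   'Last ordered': ['2019-01-01', '2019-02-01', '2019-03-01']})

# transform('count') on the whole groupby returns one counted column per
# non-grouping column, i.e. a DataFrame; assigning that to a single column
# is what raises "Wrong number of items passed N, placement implies 1".
print(df.groupby('Item No:').transform('count').shape)  # (3, 2)

# Selecting one column gives a Series, which fits into the new column.
df['Quantity'] = df.groupby('Item No:')['Description'].transform('count')
print(df['Quantity'].tolist())  # [2, 2, 1]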

Can we automate data filters for Excel using Pandas?

I want to apply filters to a spreadsheet using Python. Which module is more useful for this: Pandas or something else?
Filtering within your pandas dataframe can be done with loc (in addition to some other methods). What I THINK you're looking for is a way to export dataframes to excel and apply a filter within excel.
XLSXWRITER (by John McNamara) satisfies pretty much all xlsx/pandas use cases and has great documentation here --> https://xlsxwriter.readthedocs.io/.
Auto-filtering is an option :) https://xlsxwriter.readthedocs.io/worksheet.html?highlight=auto%20filter#worksheet-autofilter
I am not sure if I understand your question right. Maybe the combination of pandas and
qgrid might help you.
Simple filtering in pandas can be accomplished using the .loc DataFrame method.
In [4]: data = ({'name': ['Joe', 'Bob', 'Alice', 'Susan'],
...: 'dept': ['Marketing', 'IT', 'Marketing', 'Sales']})
In [5]: employees = pd.DataFrame(data)
In [6]: employees
Out[6]:
    name       dept
0    Joe  Marketing
1    Bob         IT
2  Alice  Marketing
3  Susan      Sales
In [7]: marketing = employees.loc[employees['dept'] == 'Marketing']
In [8]: marketing
Out[8]:
    name       dept
0    Joe  Marketing
2  Alice  Marketing
You can also use .loc with .isin to select multiple values in the same column
In [9]: marketing_it = employees.loc[employees['dept'].isin(['Marketing', 'IT'])]
In [10]: marketing_it
Out[10]:
    name       dept
0    Joe  Marketing
1    Bob         IT
2  Alice  Marketing
You can also pass multiple conditions to .loc using an and (&) or or (|) statement to select values from multiple columns
In [11]: joe = employees.loc[(employees['dept'] == 'Marketing') & (employees['name'] == 'Joe')]
In [12]: joe
Out[12]:
  name       dept
0  Joe  Marketing
Here is an example of adding an autofilter to a worksheet exported from Pandas using XlsxWriter:
import pandas as pd
# Create a Pandas dataframe by reading some data from a space-separated file.
df = pd.read_csv('autofilter_data.txt', sep=r'\s+')
# Create a Pandas Excel writer using XlsxWriter as the engine.
writer = pd.ExcelWriter('pandas_autofilter.xlsx', engine='xlsxwriter')
# Convert the dataframe to an XlsxWriter Excel object. We also turn off the
# index column at the left of the output dataframe.
df.to_excel(writer, sheet_name='Sheet1', index=False)
# Get the xlsxwriter workbook and worksheet objects.
workbook = writer.book
worksheet = writer.sheets['Sheet1']
# Get the dimensions of the dataframe.
(max_row, max_col) = df.shape
# Make the columns wider for clarity.
worksheet.set_column(0, max_col - 1, 12)
# Set the autofilter.
worksheet.autofilter(0, 0, max_row, max_col - 1)
# Add an optional filter criteria. The placeholder "Region" in the filter
# is ignored and can be any string that adds clarity to the expression.
worksheet.filter_column(0, 'Region == East')
# It isn't enough to just apply the criteria. The rows that don't match
# must also be hidden. We use Pandas to figure out which rows to hide.
for row_num in (df.index[(df['Region'] != 'East')].tolist()):
    worksheet.set_row(row_num + 1, options={'hidden': True})
# Close the Pandas Excel writer and output the Excel file.
writer.save()
