I have a list that is 5 rows by 5 columns.
I am trying to convert this list into a dataframe.
When I try to do so, it only grabs the first row.
This failed because I had the reshape set to (5, 5):
df2 = pd.DataFrame(np.array(pdf_read).reshape(5,5),columns=list("abcde"))
When I switched it to this:
df2 = pd.DataFrame(np.array(pdf_read).reshape(1,5),columns=list("abcde"))
It only grabbed the first row.
Why does it do this?
Any advice?
Edit: Added Context
I am using the tabula module in Python to read a PDF file.
The PDF file results are stored in the variable pdf_read.
When I do len(pdf_read) it has a length of 1, but when I type
print(pdf_read) it says it is 5 rows x 5 columns, which is very strange.
Edit #2: Datatypes
I ran the following:
print(type(pdf_read))
print(type(pdf_read[0]))
I got <class 'list'> and <class 'pandas.core.frame.DataFrame'> respectively.
It seems I have a DataFrame inside of a list.
I ran this code:
df = pd.DataFrame(
    pdf_read[0],
    columns=["column_a", "column_b", "column_c", "column_d", "column_e"]
)
This just returns a 5x5 DataFrame, but all of the values in each column are NaN.
Some progress made, but I will need to figure out why the values are not populated now.
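The all-NaN outcome is expected here: when the pd.DataFrame constructor receives an existing DataFrame together with a columns= argument, it selects those labels from the input rather than renaming them, so labels that do not exist come back as all-NaN columns. A minimal sketch with made-up data:

import pandas as pd

src = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
# columns= *selects* labels from src; it does not rename them.
out = pd.DataFrame(src, columns=["column_a", "column_b"])
print(out)  # both columns are all-NaN: those labels do not exist in src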
Edit: After some research, the output pdf_read is a list of DataFrames.
So for the first DataFrame:
df = pdf_read[0]
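For context, a sketch assuming tabula-py's read_pdf, which returns a list of DataFrames (one per detected table); the file name here is made up:

import pandas as pd
import tabula

# read_pdf returns a list of DataFrames, one per table it detects.
pdf_read = tabula.read_pdf("report.pdf", pages="all")

df = pdf_read[0]            # the first (here, the only) detected table
df.columns = list("abcde")  # rename the columns in place instead of re-wrapping
# If several tables were detected, they could be combined instead:
# df_all = pd.concat(pdf_read, ignore_index=True)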
This is how I am reading and creating the dataframe with pandas:
def get_sheet_data(sheet_name='SomeName'):
    df = pd.read_excel(f'{full_q_name}',
                       sheet_name=sheet_name,
                       header=[0, 1],
                       index_col=0)  # .fillna(method='ffill')
    df = df.swapaxes(axis1="index", axis2="columns")
    return df.set_index('Product Code')
Printing this in tabular form gives me the following (this potentially will have hundreds of columns):
I can't seem to add those first two rows into the header. I've tried:
pandas - How to combine first two rows of pandas dataframe to dataframe header? (https://stackoverflow.com/questions/59837241/combine-first-row-and-header-with-pandas)
and I'm failing at each point. I think it's because of the MultiIndex, not necessarily the axis swap? But the documentation at https://pandas.pydata.org/docs/reference/api/pandas.MultiIndex.html is kind of going over my head right now. Please help me add those two rows into the header.
The output of df.columns is massive, so I've cut it down a lot:
Index(['Product Code','Product Narrative\nHigh-level service description','Product Name','Huawei Product ID','Type','Bill Cycle Alignment',nan,'Stackable',nan,
and ends with:
nan], dtype='object')
We create new column names and set them on df.columns. The new column names are generated by joining the two MultiIndex header levels and the first row of the DataFrame:
df.columns = ['_'.join(i) for i in zip(df.columns.get_level_values(0).tolist(),
                                       df.columns.get_level_values(1).tolist(),
                                       df.iloc[0, :].replace(np.nan, '').tolist())]
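To see the idea end to end on a tiny mock frame (the data here is invented): note that the first data row is consumed into the header, so it should be dropped afterwards.

import numpy as np
import pandas as pd

# Mock frame with a two-level header, standing in for the Excel sheet.
cols = pd.MultiIndex.from_arrays([["Product Code", "Stackable"], ["", ""]])
df = pd.DataFrame([["extra", "label"], ["A1", "yes"]], columns=cols)

# Join the two header levels with the first data row...
df.columns = [
    "_".join(i)
    for i in zip(
        df.columns.get_level_values(0).tolist(),
        df.columns.get_level_values(1).tolist(),
        df.iloc[0, :].replace(np.nan, "").tolist(),
    )
]
# ...then drop the row that was merged into the header.
df = df.iloc[1:].reset_index(drop=True)
print(df.columns)  # ['Product Code__extra', 'Stackable__label']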
I pulled some stock data from a financial API and created a DataFrame with it. Columns were 'date', 'data1', 'data2', 'data3'. Then, I converted that DataFrame into a CSV with 'date' column as index:
df.to_csv('data.csv', index_label='date')
In a second script, I read that CSV and attempted to slice the resulting DataFrame between two dates:
df = pd.read_csv('data.csv', parse_dates=['date'], index_col='date')
df = df['2020-03-28':'2020-04-28']
When I attempt to do this, I get the following TypeError:
TypeError: cannot do slice indexing on <class 'pandas.core.indexes.numeric.Int64Index'> with these indexers [2020-03-28] of <class 'str'>
So clearly, the problem is that I'm trying to use a str to slice a datetime object. But here's the confusing part! If in the first step, I save the DataFrame to a csv and DO NOT set 'date' as index:
df.to_csv('data.csv')
In my second script, I no longer get the TypeError:
df = pd.read_csv('data.csv', parse_dates=['date'], index_col='date')
df = df['2020-03-28':'2020-04-28']
Now it works just fine. The only problem is I have the default Pandas index column to deal with.
Why do I get a TypeError when I set the 'date' column as index in my CSV...but I do NOT get a TypeError when I don't set any index in the CSV?
It seems that in your "first" instance of df, the date column was an
ordinary column (not the index), and the DataFrame had a default index of
consecutive integers (its name is not important).
In this situation, running df.to_csv('data.csv', index_label='date')
causes the output file to contain:
date,date,data1,data2,data3
0,2020-03-27,10.5,12.3,13.2
1,2020-03-28,10.6,12.9,14.7
i.e.:
the index column (integers) was given the name date, which you passed in the index_label parameter;
the next column, which in df was named date, was also given the name date.
Then if you read it by running
df = pd.read_csv('data.csv', parse_dates=['date'], index_col='date'):
the first date column (integers) is read as date and set as the index;
the second date column (the actual dates) is read as date.1 and is an ordinary column.
Now when you run df['2020-03-28':'2020-04-28'], you attempt to find rows
with the index in the given range. But the index column is of Int64Index
type, hence the exception mentioned above was thrown.
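You can see the duplicate-name effect for yourself. A quick sketch using the sample rows above (pandas mangles the second occurrence of a repeated header to date.1):

import pandas as pd
from io import StringIO

# The file produced by to_csv(index_label='date') has two 'date' columns:
csv_text = """date,date,data1,data2,data3
0,2020-03-27,10.5,12.3,13.2
1,2020-03-28,10.6,12.9,14.7
"""

df = pd.read_csv(StringIO(csv_text), index_col='date')
print(df.columns)  # the second occurrence was mangled to 'date.1'
print(df.index)    # an integer index, so slicing with date strings raises TypeError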
Things look different when you run df.to_csv('data.csv').
Now this file contains:
,date,data1,data2,data3
0,2020-03-27,10.5,12.3,13.2
1,2020-03-28,10.6,12.9,14.7
i.e.:
the first column (which in df was the index) has no name and holds the integer values;
the only column named date is the second one, and it contains the dates.
Now when you read it, the result is:
date (converted to a DatetimeIndex) is the index;
the original index column gets the name Unnamed: 0, no surprise, since it had
no name in the source file.
And now, when you run df['2020-03-28':'2020-04-28'], everything is OK.
The thing to learn for the future: running df.to_csv('data.csv', index_label='date')
does not set this column as the index. It only saves the current index column
under the given name, without any check of whether another column already has
the same name. The result is that two columns can end up with the same name.
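One way to avoid the trap entirely is to make date the real index (or drop the integer index) before saving. A minimal sketch, rebuilding a small frame like the one in the question:

import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['2020-03-27', '2020-03-28']),
    'data1': [10.5, 10.6],
})

# Option 1: make 'date' the real index before saving;
# to_csv then writes a single 'date' column.
df.set_index('date').to_csv('data.csv')

# Option 2: keep 'date' as a column and skip the integer index on save.
# df.to_csv('data.csv', index=False)

# Either way, the second script now slices cleanly:
df2 = pd.read_csv('data.csv', parse_dates=['date'], index_col='date')
print(df2['2020-03-28':'2020-04-28'])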
I have a dataframe df as below.
I want the final dataframe to be as follows, i.e., for each unique Name only the last 2 rows must be present in the final output.
I tried the following snippet, but it's not working:
df = df[df['Name']].tail(2)
Use GroupBy.tail:
df1 = df.groupby('Name').tail(2)
Just one more way to solve this using GroupBy.nth:
df1 = df.groupby('Name').nth([-1, -2])  # this will pick the last 2 rows of each group
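Since the original frame was only shown as an image, here is a small invented example demonstrating both approaches:

import pandas as pd

df = pd.DataFrame({
    'Name': ['A', 'A', 'A', 'B', 'B'],
    'Value': [1, 2, 3, 4, 5],
})

print(df.groupby('Name').tail(2))        # last 2 rows per Name
print(df.groupby('Name').nth([-1, -2]))  # the same rows selected by position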
I'm stuck on this error: I want to subtract two dates and get the difference in days,
but I always receive the error below.
Here is the dataframe info:
The reason this is happening is that you have pd.MultiIndex column headers. I can tell you have MultiIndex headers from the tuples in your column names in the pd.DataFrame.info() results.
See this example below:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(100, 999, (5, 5)))  # create a dataframe
# create MultiIndex column headers
df.columns = pd.MultiIndex.from_arrays([['A', 'B', 'C', 'D', 'E'],
                                        ['max', 'min', 'max', 'min', 'max']])
type(df['A'] - df['E'])
Output:
pandas.core.frame.DataFrame
Note the type of the return value, even though you are subtracting one column from another column. You expected a pd.Series, but this returns a DataFrame.
You have a couple of options for solving this.
Option 1: use squeeze:
type((df['A'] - df['E']).squeeze())
pandas.core.series.Series
Option 2: flatten your column headers first:
df.columns = df.columns.map('_'.join)
type(df['A_max'] - df['E_max'])
Output:
pandas.core.series.Series
Now you can apply the .dt datetime accessor to your Series. Knowing the type of the object you are working with is important.
Well, as @EdChum said above, .dt is a pd.Series accessor, not something a pd.DataFrame provides. If you want to get the date difference across a whole DataFrame, use the DataFrame's apply() method.
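Once you have genuine Series on both sides, the subtraction yields a timedelta Series and .dt.days extracts whole days. A short sketch with invented dates:

import pandas as pd

df = pd.DataFrame({
    'start': pd.to_datetime(['2020-01-01', '2020-02-15']),
    'end':   pd.to_datetime(['2020-01-10', '2020-03-01']),
})

# Subtracting two datetime Series gives a timedelta Series;
# .dt.days converts it to an integer number of days.
df['days'] = (df['end'] - df['start']).dt.days
print(df)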
I'm trying to load multiple csv files into a single dataframe df while:
adding column names
adding and populating a new column (Station)
excluding one of the columns (QD)
All of this works fine until I attempt to exclude a column with usecols, which throws the error Too many columns specified: expected 5 and found 4.
Is it possible to create a new column and pass usecols at the same time?
The reason I'm creating and populating a new 'Station' column during read_csv is that my dataframe will contain data from multiple stations. I can work around the error by doing read_csv in one statement and dropping the QD column in the next with df.drop('QD', axis=1, inplace=True), but I want to make sure I understand how to do this in the most pandas-idiomatic way possible.
Here's the code that throws the error:
df = pd.concat(
    pd.read_csv("http://lgdc.uml.edu/common/DIDBGetValues?ursiCode="
                + row['StationCode'] + "&charName=MUFD&DMUF=3000",
                skiprows=17,
                delim_whitespace=True,
                parse_dates=[0],
                usecols=['Time', 'CS', 'MUFD', 'Station'],
                names=['Time', 'CS', 'MUFD', 'QD', 'Station'])
      .fillna(row['StationCode'])
      .set_index(['Time', 'Station'])
    for index, row in stationdf.iterrows()
)
An example StationCode from stationdf is BC840.
Data sample: 2016-09-19T00:00:05.000Z 100 19.34 //
You can create a new column using operator chaining with assign:
df = pd.read_csv(...).assign(StationCode=row['StationCode'])
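Folding that into the loop from the question (stationdf and the URL pattern are as posted; read_station is just an illustrative helper name), a sketch under those assumptions:

import pandas as pd

def read_station(station_code):
    # 'Station' cannot appear in usecols because it is not in the file;
    # read the four real columns, keep three, then attach the station via assign.
    url = ("http://lgdc.uml.edu/common/DIDBGetValues?ursiCode="
           + station_code + "&charName=MUFD&DMUF=3000")
    return (pd.read_csv(url,
                        skiprows=17,
                        delim_whitespace=True,
                        parse_dates=['Time'],
                        names=['Time', 'CS', 'MUFD', 'QD'],
                        usecols=['Time', 'CS', 'MUFD'])
              .assign(Station=station_code)
              .set_index(['Time', 'Station']))

df = pd.concat(read_station(row['StationCode'])
               for index, row in stationdf.iterrows())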