If I try to print any value of the data frame I'm getting an Object (id value). I just want the value in the data frame.
I've tried removing the first column (df[0]), but that removes the date column...
If you want to display the values of a specific column of a DataFrame:
df['Start Time'].values.tolist()
To refresh the index, you can use:
df = df.reset_index(drop=True)
And to change the type of a specific column (to int/float/str ...):
df['column_name'] = df['column_name'].astype(int)
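Putting these together, a minimal sketch with a made-up frame (the column names and values below are illustrative, not from the question):
import pandas as pd

df = pd.DataFrame({'Start Time': ['2017-01-01 09:07:57', '2017-01-02 09:07:57'],
                   'duration': ['300', '450']})

start_times = df['Start Time'].values.tolist()  # plain Python list, no Index wrapper
df = df.reset_index(drop=True)                  # rebuild a clean 0..n-1 integer index
df['duration'] = df['duration'].astype(int)     # convert the object column to int
print(start_times)
print(df.dtypes)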
I am trying to fill out the null values of a column of a dataframe with the first value that is not null of that same column.
The dataframe that I want to fill out looks like this, and I want all the rows of the column 'id_book' to have the same number
I have tried the following but it still shows the null values
w = Window.partitionBy('id_book').orderBy('id_book', 'date').rowsBetween(0,sys.maxsize)
filled_column = first(spark_df['id_book'], ignorenulls=True).over(w)
spark_df_filled = union_dias.withColumn('id_book_filled_spark', filled_column)
The window should be:
from pyspark.sql import Window
from pyspark.sql.functions import first

w = Window.orderBy('date').rowsBetween(0, Window.unboundedFollowing)
filled_column = first(spark_df['id_book'], ignorenulls=True).over(w)
spark_df_filled = spark_df.withColumn('id_book_filled_spark', filled_column)
Because you don't want to partition by id_book. There is also no point in ordering by id_book, because only the order of the date matters.
Also, I think it is better practice to use Window.unboundedFollowing instead of sys.maxsize.
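A minimal, self-contained sketch of the fix; the SparkSession setup and the sample rows are illustrative, not from the question:
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import first

spark = SparkSession.builder.getOrCreate()

# Hypothetical data: id_book is null except on the last date
spark_df = spark.createDataFrame(
    [('2020-01-01', None), ('2020-01-02', None), ('2020-01-03', 42)],
    ['date', 'id_book'])

w = Window.orderBy('date').rowsBetween(0, Window.unboundedFollowing)
filled_column = first(spark_df['id_book'], ignorenulls=True).over(w)
spark_df_filled = spark_df.withColumn('id_book_filled_spark', filled_column)
spark_df_filled.show()  # every row gets 42, the first non-null value looking forward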
I have an array in the format [27.214 27.566]; there can be several numbers. Additionally, I have a datetime variable.
from datetime import datetime
import time
import numpy as np

now = datetime.now()
timestamp = now.strftime('%Y-%m-%d %H:%M:%S')  # formatted measurement time (renamed so it doesn't shadow the datetime module)
time.sleep(0.5)
agilent.write("MEAS:TEMP? (#101:102)")  # agilent is the instrument handle (e.g. a PyVISA resource)
values = np.fromstring(agilent.read(), dtype=float, sep=',')
The output from the array is [27.214 27.566]
Now I would like to write this into a dataframe with the following structure:
Datetime, FirstValueArray, SecondValueArray, ....
How can I do this? A new array is added to the dataframe every minute.
I will assume you want to append a row to an existing dataframe df with appropriate columns: value1, value2, ..., lastvalue, datetime
We can easily convert the array to a series:
s = pd.Series(array)
What you want to do next is append the timestamp to the series. Note that Series.append expects another series and returns a new one, so wrap the value and assign the result (Series.append is deprecated in recent pandas; pd.concat achieves the same):
s = s.append(pd.Series([timestamp]), ignore_index=True)
Now you have a series whose length matches df.columns. You want to convert that series to a dataframe to be able to use pd.concat:
df_to_append = s.to_frame().T
We need to take the transpose of that single-column dataframe, because Series.to_frame() returns a dataframe with the series as a single column, and we want a single row with multiple columns.
Before you concatenate, however, you need to make sure both dataframes' column names match, or concatenation will create additional columns:
df_to_append.columns = df.columns
Now we can concatenate our two dataframes (pd.concat returns a new dataframe, so assign the result):
df = pd.concat([df, df_to_append], ignore_index=True)
For further details, see the documentation
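Putting it all together, a minimal sketch (the target columns and the measurement values are made up for illustration):
import pandas as pd
import numpy as np
from datetime import datetime

df = pd.DataFrame(columns=['value1', 'value2', 'datetime'])  # hypothetical target frame

array = np.array([27.214, 27.566])                        # one measurement
timestamp = datetime.now().strftime('%Y-%m-%d %H:%M:%S')

s = pd.Series(array)
s = pd.concat([s, pd.Series([timestamp])], ignore_index=True)  # same effect as Series.append
df_to_append = s.to_frame().T
df_to_append.columns = df.columns
df = pd.concat([df, df_to_append], ignore_index=True)
print(df)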
I pulled some stock data from a financial API and created a DataFrame with it. Columns were 'date', 'data1', 'data2', 'data3'. Then, I converted that DataFrame into a CSV with 'date' column as index:
df.to_csv('data.csv', index_label='date')
In a second script, I read that CSV and attempted to slice the resulting DataFrame between two dates:
df = pd.read_csv('data.csv', parse_dates=['date'], index_col='date')
df = df['2020-03-28':'2020-04-28']
When I attempt to do this, I get the following TypeError:
TypeError: cannot do slice indexing on <class 'pandas.core.indexes.numeric.Int64Index'> with these indexers [2020-03-28] of <class 'str'>
So clearly, the problem is that I'm trying to use a str to slice a datetime object. But here's the confusing part! If in the first step, I save the DataFrame to a csv and DO NOT set 'date' as index:
df.to_csv('data.csv')
In my second script, I no longer get the TypeError:
df = pd.read_csv('data.csv', parse_dates=['date'], index_col='date')
df = df['2020-03-28':'2020-04-28']
Now it works just fine. The only problem is I have the default Pandas index column to deal with.
Why do I get a TypeError when I set the 'date' column as index in my CSV...but I do NOT get a TypeError when I don't set any index in the CSV?
It seems that in your "first" instance of df, the date column was an ordinary column (not the index) and the DataFrame had a default index of consecutive integers (its name is not important).
In this situation, running df.to_csv('data.csv', index_label='date') causes the output file to contain:
date,date,data1,data2,data3
0,2020-03-27,10.5,12.3,13.2
1,2020-03-28,10.6,12.9,14.7
i.e.:
the index column (integers) was given the name date, which you passed in the index_label parameter,
the next column, which in df was named date, was also given the name date.
Then if you read it by running
df = pd.read_csv('data.csv', parse_dates=['date'], index_col='date'), then:
the first date column (integers) is read as date and set as the index,
the second date column (dates) is read as date.1 and becomes an ordinary column.
Now when you run df['2020-03-28':'2020-04-28'], you attempt to find rows whose index falls in the given range. But the index is of Int64Index type (check this in your installation), hence the exception shown above is thrown.
Things look different when you run df.to_csv('data.csv').
Now this file contains:
,date,data1,data2,data3
0,2020-03-27,10.5,12.3,13.2
1,2020-03-28,10.6,12.9,14.7
i.e.:
the first column (which in df was the index) has no name and contains integer values,
the only column named date is the second one, and it contains dates.
Now when you read it back, the result is:
date (converted to a DatetimeIndex) is the index,
the original index column is named Unnamed: 0, no surprise, since in the source file it had no name.
And now, when you run df['2020-03-28':'2020-04-28'] everything is OK.
The thing to learn for the future: running df.to_csv('data.csv', index_label='date') does not set this column as the index. It only saves the current index column under the given name, without checking whether any other column already has the same name. The result is that two columns can end up with the same name.
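A minimal sketch of a round trip that avoids the duplicate name (the data is made up; the key point is to make date the real index before writing, so index_label is unnecessary):
import pandas as pd

df = pd.DataFrame({'date': pd.to_datetime(['2020-03-27', '2020-03-28']),
                   'data1': [10.5, 10.6]})

df.set_index('date').to_csv('data.csv')  # 'date' is now the actual index

df = pd.read_csv('data.csv', parse_dates=['date'], index_col='date')
print(df['2020-03-28':'2020-04-28'])     # DatetimeIndex slicing works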
I have a pandas dataframe and one of the columns is of datatype object. There is a blank element in this column, so I tried to check whether there are other empty elements by using df['colname'].isnull().sum(), but it gives me 0. How can I replace the empty value(s) with some arbitrary numeric value so that I can convert this column to float datatype for further computation?
Use pandas.to_numeric:
df['colname'] = pd.to_numeric(df['colname'], errors='coerce')
This will produce np.nan for anything it can't convert to a number. After this, you can fill in any value you'd like with fillna:
df['colname'] = df['colname'].fillna(0)
All in one go:
df['colname'] = pd.to_numeric(df['colname'], errors='coerce').fillna(0)
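A small illustration with made-up data:
import pandas as pd

df = pd.DataFrame({'colname': ['1.5', '', '2.75']})  # object dtype with a blank element
df['colname'] = pd.to_numeric(df['colname'], errors='coerce').fillna(0)
print(df['colname'].dtype)  # float64
print(df)                   # the blank became 0.0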
[DataFrame image from the question omitted]
I have created the following dataframe 'user_char' in Pandas with:
## Create a new workbook User Char with empty datetime columns to import data from the ledger
user_char = all_users[['createdAt', 'uuid', 'gasType', 'role']]
## filter on consumers in the user_char table
user_char = user_char[user_char.role == 'CONSUMER']
user_char.set_index('uuid', inplace=True)
## creates datetime columns that need to be added to the existing df
user_char_rng = pd.date_range('3/1/2016', periods=25, dtype='period[M]', freq='MS')
## converts the datetime index to a list
user_char_rng = list(user_char_rng)
## adds empty cols
user_char = user_char.reindex(columns=user_char.columns.tolist() + user_char_rng)
user_char
and I am trying to assign a value to the highlighted column using the following command:
user_char['2016-03-01 00:00:00'] = 1
but this keeps creating a new column rather than editing the existing one. How do I assign the value 1 to all the indices without adding a new column?
Also how do I rename the datetime column that excludes the timestamp and only leaves the date field in there?
Try:
user_char.loc[:, '2016-03-01'] = 1
Because your column index is a DatetimeIndex, pandas is smart enough to translate the string '2016-03-01' into datetime format. Using loc[:, c] seems to hint to pandas to first look for c among the existing column labels, rather than create a new column named c.
Side note: the DatetimeIndex of time-series data is conventionally used as the (row) index of a DataFrame, not in the columns. (There's no technical reason why you can't use time in the columns, of course!) In my experience, most of the PyData stack is built to expect "tidy data", where each variable (like time) forms a column, and each observation (timestamp value) forms a row. The way you're doing it, you'll need to transpose your DataFrame before calling plot() on it, for example.
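As for the second question, dropping the timestamp from the column labels: one hedged sketch is to format each Timestamp label as a date-only string. Note that this turns the labels into plain strings, so the DatetimeIndex translation described above no longer applies afterwards:
import pandas as pd

# Minimal illustration with mixed string and Timestamp column labels
df = pd.DataFrame(columns=['createdAt'] + list(pd.date_range('3/1/2016', periods=3, freq='MS')))
df.columns = [c.strftime('%Y-%m-%d') if isinstance(c, pd.Timestamp) else c
              for c in df.columns]
print(df.columns.tolist())  # ['createdAt', '2016-03-01', '2016-04-01', '2016-05-01']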