How can we split datetime values into year and month? I need one column per year
(Year_2017, Year_2018, and so on), and the values under each year column should be the month of the respective year.
Example data:
call time area age
2017-12-12 19:38:00 Rural 28
2018-01-12 22:05:00 Rural 50
2018-02-12 22:33:00 Rural 76
2019-01-12 22:37:00 Urban 45
2020-02-13 00:26:00 Urban 52
Required Output:
call time area age Year_2017 Year_2018
2017-12-12 19:38:00 Rural 28 jan jan
2018-01-12 22:05:00 Rural 50 Feb Feb
2018-02-12 22:33:00 Rural 76 mar mar
2019-01-12 22:37:00 Urban 45 Apr Apr
2020-02-13 00:26:00 Urban 52 may may
I think you need to generate the years and months from the call time datetimes, so the output is different from the one in the question:
Explanation: first generate a column of months with DataFrame.assign and Series.dt.strftime, then append the years to the index with append=True to build a MultiIndex, which makes it possible to reshape with Series.unstack. Finally, join the result to the original DataFrame:
df1 = (df.assign(m = df['call time'].dt.strftime('%b'))
.set_index(df['call time'].dt.year, append=True)['m']
.unstack()
.add_prefix('Year_'))
print (df1)
call time Year_2017 Year_2018 Year_2019 Year_2020
0 Dec NaN NaN NaN
1 NaN Jan NaN NaN
2 NaN Feb NaN NaN
3 NaN NaN Jan NaN
4 NaN NaN NaN Feb
df = df.join(df1)
print (df)
call time area age Year_2017 Year_2018 Year_2019 Year_2020
0 2017-12-12 19:38:00 Rural 28 Dec NaN NaN NaN
1 2018-01-12 22:05:00 Rural 50 NaN Jan NaN NaN
2 2018-02-12 22:33:00 Rural 76 NaN Feb NaN NaN
3 2019-01-12 22:37:00 Urban 45 NaN NaN Jan NaN
4 2020-02-13 00:26:00 Urban 52 NaN NaN NaN Feb
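For anyone who wants to run the steps above end to end, here is a minimal self-contained sketch. It rebuilds the sample data from the question by hand (so the DataFrame construction is an assumption about the dtypes, in particular that 'call time' is already datetime64), and the names months, years, wide and result are only illustrative.

```python
import pandas as pd

# Sample data from the question, rebuilt by hand (dtypes are an assumption)
df = pd.DataFrame({
    "call time": pd.to_datetime([
        "2017-12-12 19:38:00", "2018-01-12 22:05:00", "2018-02-12 22:33:00",
        "2019-01-12 22:37:00", "2020-02-13 00:26:00",
    ]),
    "area": ["Rural", "Rural", "Rural", "Urban", "Urban"],
    "age": [28, 50, 76, 45, 52],
})

months = df["call time"].dt.strftime("%b")  # abbreviated month name (locale dependent)
years = df["call time"].dt.year

wide = (
    df.assign(m=months)
      .set_index(years, append=True)["m"]  # MultiIndex: (row label, year)
      .unstack()                           # years become columns
      .add_prefix("Year_")
)
result = df.join(wide)
print(result)
```

Each row ends up with exactly one non-missing Year_ column, because every call time belongs to a single year.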
I am trying to do an index match between two data sets but am having trouble. Here is an example of what I am trying to do. I want to fill in the empty columns "a", "b", and "c" in df with the data from df2, matching on "Machine", "Year", and "Order Type".
The first dataframe, let's call this one "df":
Machine Year Cost a b c
0 abc 2014 5500 nan nan nan
1 abc 2015 89 nan nan nan
2 abc 2016 600 nan nan nan
3 abc 2017 250 nan nan nan
4 abc 2018 2100 nan nan nan
5 abc 2019 590 nan nan nan
6 dcb 2020 3000 nan nan nan
7 dcb 2021 100 nan nan nan
The second data set is called "df2"
Order Type Machine Year Total Count
0 a abc 2014 1
1 b abc 2014 1
2 c abc 2014 2
4 c dcb 2015 4
3 a abc 2016 3
Final Output is:
Machine Year Cost a b c
0 abc 2014 5500 1 1 2
1 abc 2015 89 nan nan nan
2 abc 2016 600 3 nan nan
3 abc 2017 250 nan nan nan
4 abc 2018 2100 nan nan nan
5 abc 2019 590 1 nan nan
6 dcb 2014 3000 nan nan 4
7 dcb 2015 100 nan nan nan
Thanks for help in advance
Consider DataFrame.pivot to reshape df2 so that it can be merged with df. A left merge keeps the rows of df that have no match in df2.
final_df = (
    df.reindex(["Machine", "Year", "Cost"], axis="columns")
    .merge(
        df2.pivot(
            index=["Machine", "Year"],
            columns="Order Type",
            values="Total Count"
        ).reset_index(),
        on=["Machine", "Year"],
        how="left"
    )
)
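As a rough, self-contained sketch of the pivot-then-merge idea, the snippet below rebuilds df and df2 from the question's printed tables (the construction is an assumption; the empty a/b/c columns are simply left out of df because the merge recreates them). Passing a list to index in pivot needs a reasonably recent pandas (1.1 or later, as far as I know).

```python
import pandas as pd

# Data rebuilt from the question's tables (an assumption about the real frames)
df = pd.DataFrame({
    "Machine": ["abc"] * 6 + ["dcb"] * 2,
    "Year": [2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021],
    "Cost": [5500, 89, 600, 250, 2100, 590, 3000, 100],
})
df2 = pd.DataFrame({
    "Order Type": ["a", "b", "c", "c", "a"],
    "Machine": ["abc", "abc", "abc", "dcb", "abc"],
    "Year": [2014, 2014, 2014, 2015, 2016],
    "Total Count": [1, 1, 2, 4, 3],
})

# One row per (Machine, Year), one column per Order Type
wide = df2.pivot(
    index=["Machine", "Year"], columns="Order Type", values="Total Count"
).reset_index()

# how="left" keeps every row of df, filling NaN where df2 has no match
final_df = df.merge(wide, on=["Machine", "Year"], how="left")
print(final_df)
```

With this data the (dcb, 2015) row of df2 finds no partner in df, so it does not appear in the result.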
I have df as shown below.
Date t_factor
2020-02-01 5
2020-02-03 -23
2020-02-06 14
2020-02-09 23
2020-02-10 -2
2020-02-11 23
2020-02-13 NaN
2020-02-20 29
From the above, I would like to replace the negative values in column t_factor with NaN.
Expected output:
Date t_factor
2020-02-01 5
2020-02-03 NaN
2020-02-06 14
2020-02-09 23
2020-02-10 NaN
2020-02-11 23
2020-02-13 NaN
2020-02-20 29
You can use pandas clip as well. It sets values outside the boundary to the boundary value, which can then be chained with replace as below. Note that this relies on the values being integers: a value strictly between -1 and 0 would be left unchanged.
df['t_factor'] = df['t_factor'].clip(lower=-1).replace(-1, np.nan)
df
Output:
Date t_factor
0 2020-02-01 5.0
1 2020-02-03 NaN
2 2020-02-06 14.0
3 2020-02-09 23.0
4 2020-02-10 NaN
5 2020-02-11 23.0
6 2020-02-13 NaN
7 2020-02-20 29.0
Use Series.mask:
df['t_factor'] = df['t_factor'].mask(df['t_factor'].lt(0))
Or use boolean indexing and assign np.nan:
df.loc[df['t_factor'].lt(0), 't_factor'] = np.nan
Result:
Date t_factor
0 2020-02-01 5.0
1 2020-02-03 NaN
2 2020-02-06 14.0
3 2020-02-09 23.0
4 2020-02-10 NaN
5 2020-02-11 23.0
6 2020-02-13 NaN
7 2020-02-20 29.0
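Use pd.Series.where - by default it replaces values where the condition is False with NaN, so the condition has to describe the values to keep (>= 0 here, otherwise zeros would be dropped too).
df["t_factor"] = df.t_factor.where(df.t_factor >= 0)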
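To compare the approaches side by side, here is a small sketch on a made-up Series (the values 0 and -0.5 are not in the question; they are added only to show the edge cases). The variable names clipped, masked and kept are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical values chosen to exercise zero, a fractional negative and NaN
s = pd.Series([5, -23, 0, -0.5, np.nan, 29], name="t_factor")

clipped = s.clip(lower=-1).replace(-1, np.nan)  # misses values between -1 and 0
masked = s.mask(s.lt(0))                        # hide where the condition is True
kept = s.where(s.ge(0))                         # keep where the condition is True

print(pd.DataFrame({"clipped": clipped, "masked": masked, "kept": kept}))
```

mask and where are mirror images of each other, so for this task they give the same result; the clip variant is only equivalent when the data holds whole numbers.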
I have a data frame like this,
Name Product Quantity
0 NaN 1010 10
1 NaN 2010 12
2 NaN 4145 18
3 NaN 5225 14
4 Total 6223 16
5 RRA 7222 18
6 MLQ 5648 45
Now I need to extract a new dataframe containing the rows up to (but not including) the row where the Name column is Total.
Output needed:
Name Product Quantity
0 NaN 1010 10
1 NaN 2010 12
2 NaN 4145 18
3 NaN 5225 14
I tried this,
df[df.Name.str.contains("Total", na=False)]
This is not helpful for now. Any suggestion would be great.
Select the index where the True value is located and slice using df.iloc (this relies on the default integer index, where labels and positions coincide):
df_new=df.iloc[:df.loc[df.Name.str.contains('Total',na=False)].index[0]]
or use Series.idxmax(), which returns the index of the maximum value (the max of True/False is True, so this is the first True):
df_new=df.iloc[:df.Name.str.contains('Total',na=False).idxmax()]
print(df_new)
Name Product Quantity
0 NaN 1010 10
1 NaN 2010 12
2 NaN 4145 18
3 NaN 5225 14
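If the index is not the default 0..n-1, the label returned by .index[0] or idxmax() is no longer a position and the iloc slice breaks. One possible way around that, sketched below with a made-up string index, is to build a boolean mask that stays True until the first "Total" row; the names is_total and df_new are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "Name": [np.nan, np.nan, np.nan, np.nan, "Total", "RRA", "MLQ"],
        "Product": [1010, 2010, 4145, 5225, 6223, 7222, 5648],
        "Quantity": [10, 12, 18, 14, 16, 18, 45],
    },
    index=list("abcdefg"),  # hypothetical non-default index
)

is_total = df["Name"].str.contains("Total", na=False)

# The running count of matches is 0 only for rows before the first "Total"
df_new = df[is_total.cumsum().eq(0)]
print(df_new)
```

A side effect of this form is that a frame without any "Total" row is returned whole, whereas the idxmax() slice would return an empty frame in that case.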
I have a dataframe that looks like this,
Date/Time Volt Current
2011-01-01 11:30:00 NaN NaN
2011-01-01 11:35:00 NaN NaN
2011-01-01 11:40:00 NaN NaN
...
2011-01-01 12:30:00 NaN NaN
2011-01-02 11:30:00 45 23
2011-01-02 11:35:00 31 34
2011-01-02 11:40:00 23 15
...
2011-01-02 12:30:00 13 1
2011-01-03 11:30:00 41 51
...
2011-01-03 12:25:00 14 5
2011-01-03 12:30:00 54 45
...
2011-01-04 11:30:00 45 -
2011-01-04 11:35:00 41 -
2011-01-04 11:40:00 - 4
...
2011-01-04 12:30:00 - 14
The dataframe has dates and times between 11:30:00 and 12:30:00 at 5-minute intervals. I am trying to figure out how to find the row with the minimum value in the "Current" column for each day and copy the entire row. My expected output should be something like this:
Date/Time Volt Current
2011-01-01 NaN NaN
2011-01-02 12:30:00 13 1
2011-01-03 12:25:00 14 5
2011-01-04 11:40:00 NaN 4
For rows with a value in current, it will copy the entire minimum value row.
For rows with "NaN" in current, it will copy the row still with NaN.
Do note that some data in the volt/current columns are sometimes empty or a dash.
Is this possible?
Thank you.
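Please try the following (it assumes 'Date/Time' is already a datetime column):
# drop the rows whose Current is a dash
df=df[df['Current'] != '-']
# group by calendar date (dt.day would mix up different months); argmin is positional, so use iloc
# note: fillna(0) treats a missing Current as 0, so a NaN row wins over positive readings on the same day
df.groupby(df['Date/Time'].dt.date).apply(lambda x:x.iloc[x['Current'].astype(float).fillna(0).argmin()])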
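A different way to get one row per day, sketched here as an alternative rather than as the method above, is to coerce the dashes to NaN, sort by Current with missing values last, and take the first row of each day. The sample rows are a shortened, hand-built version of the question's data, and the helper column "day" is hypothetical.

```python
import numpy as np
import pandas as pd

# Shortened stand-in for the question's data (values typed in by hand)
df = pd.DataFrame({
    "Date/Time": pd.to_datetime([
        "2011-01-01 11:30:00", "2011-01-01 11:35:00",
        "2011-01-02 11:30:00", "2011-01-02 12:30:00",
        "2011-01-04 11:30:00", "2011-01-04 11:40:00",
    ]),
    "Volt": [np.nan, np.nan, "45", "13", "45", "-"],
    "Current": [np.nan, np.nan, "23", "1", "-", "4"],
})

# Dashes and other non-numeric entries become NaN
for col in ["Volt", "Current"]:
    df[col] = pd.to_numeric(df[col], errors="coerce")

result = (
    df.assign(day=df["Date/Time"].dt.normalize())       # hypothetical helper column
      .sort_values("Current", na_position="last", kind="mergesort")
      .groupby("day")
      .head(1)                                          # smallest Current per day
      .sort_values("Date/Time")
      .drop(columns="day")
)
print(result)
```

Days where every Current is missing still contribute one row (their first one), which matches the NaN row in the expected output apart from keeping its timestamp.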
I am trying to load a csv file from the following URL into a dataframe using Python 3.5 and Pandas:
link = "http://api.worldbank.org/v2/en/indicator/NY.GDP.MKTP.CD?downloadformat=csv"
The csv file (API_NY.GDP.MKTP.CD_DS2_en_csv_v2.csv) is inside of a zip file. My try:
import pandas as pd
import urllib.request
urllib.request.urlretrieve(link, "GDP.zip")
import zipfile
compressed_file = zipfile.ZipFile('GDP.zip')
csv_file = compressed_file.open('API_NY.GDP.MKTP.CD_DS2_en_csv_v2.csv')
GDP = pd.read_csv(csv_file)
But when reading it, I got the error "pandas.io.common.CParserError: Error tokenizing data. C error: Expected 3 fields in line 5, saw 62".
Any idea?
I think you need the skiprows parameter, because the CSV header is on row 5, after four lines of metadata:
GDP = pd.read_csv(csv_file, skiprows=4)
print (GDP.head())
Country Name Country Code Indicator Name Indicator Code 1960 \
0 Aruba ABW GDP (current US$) NY.GDP.MKTP.CD NaN
1 Andorra AND GDP (current US$) NY.GDP.MKTP.CD NaN
2 Afghanistan AFG GDP (current US$) NY.GDP.MKTP.CD 5.377778e+08
3 Angola AGO GDP (current US$) NY.GDP.MKTP.CD NaN
4 Albania ALB GDP (current US$) NY.GDP.MKTP.CD NaN
1961 1962 1963 1964 1965 \
0 NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN
2 5.488889e+08 5.466667e+08 7.511112e+08 8.000000e+08 1.006667e+09
3 NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN
2008 2009 2010 2011 \
0 ... 2.791961e+09 2.498933e+09 2.467704e+09 2.584464e+09
1 ... 4.001201e+09 3.650083e+09 3.346517e+09 3.427023e+09
2 ... 1.019053e+10 1.248694e+10 1.593680e+10 1.793024e+10
3 ... 8.417803e+10 7.549238e+10 8.247091e+10 1.041159e+11
4 ... 1.288135e+10 1.204421e+10 1.192695e+10 1.289087e+10
2012 2013 2014 2015 2016 Unnamed: 61
0 NaN NaN NaN NaN NaN NaN
1 3.146152e+09 3.248925e+09 NaN NaN NaN NaN
2 2.053654e+10 2.004633e+10 2.005019e+10 1.933129e+10 NaN NaN
3 1.153984e+11 1.249121e+11 1.267769e+11 1.026269e+11 NaN NaN
4 1.231978e+10 1.278103e+10 1.321986e+10 1.139839e+10 NaN NaN
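To try the skiprows idea without downloading anything, the sketch below builds a tiny zip archive in memory and reads the CSV inside it. Everything about the archive is a stand-in: the file name gdp.csv, the four metadata lines and the two data rows are made up to imitate a header that starts on line 5, and are not the real World Bank file.

```python
import io
import zipfile

import pandas as pd

# Made-up stand-in: four metadata lines, then the real header on line 5
raw = (
    '"Data Source","World Development Indicators",\n'
    '"Last Updated Date","2017-01-01",\n'
    '"Note","stand-in metadata line",\n'
    '"Note","another stand-in metadata line",\n'
    '"Country Name","Country Code","1960","1961"\n'
    '"Aruba","ABW","",""\n'
    '"Afghanistan","AFG","537777811.1","548888895.6"\n'
)

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("gdp.csv", raw)  # hypothetical member name
buf.seek(0)

with zipfile.ZipFile(buf) as zf:
    # Pick the CSV member by extension instead of hard-coding its name
    csv_name = next(n for n in zf.namelist() if n.endswith(".csv"))
    with zf.open(csv_name) as fh:
        gdp = pd.read_csv(fh, skiprows=4)  # skip the metadata lines

print(gdp)
```

Without skiprows, the parser infers the column count from the first metadata line and then fails on the wider header row, which is the same kind of tokenizing error reported in the question.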