I want to see the entire row of a Dask dataframe without the fields being cut off. In pandas the command is pd.set_option('display.max_colwidth', -1). Is there an equivalent for Dask? I was not able to find anything.
You can import pandas and use pd.set_option() and Dask will respect pandas' settings.
import pandas as pd
# Don't truncate text fields in the display.
# Passing -1 is deprecated since pandas 1.0; use None instead.
pd.set_option("display.max_colwidth", None)
ddf.head()  # ddf is your Dask dataframe
And you should see the long columns. It 'just works.'
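Because head() on a Dask dataframe returns an ordinary pandas DataFrame, the effect can be checked with pandas alone. A minimal sketch, assuming a recent pandas (where None means "no limit") and a made-up one-column frame. option_context is used here only so the setting is restored afterwards; it is not part of the answer above.

```python
import pandas as pd

# Hypothetical stand-in for the pandas frame that ddf.head() would return.
long_text = "x" * 200
df = pd.DataFrame({"text": [long_text]})

# With the default column width the long value is shortened with "...".
truncated = repr(df)

# Inside the context manager the limit is lifted; it is restored on exit.
with pd.option_context("display.max_colwidth", None):
    full = repr(df)

print(long_text in truncated)
print(long_text in full)
```

If the setting should stay in place for the whole session, pd.set_option as shown in the answer is the simpler choice.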
Dask does not normally display the data in a dataframe at all, because it represents lazily-evaluated values. You may want to get a specific row by index, using the .loc accessor (same as in Pandas, but only efficient if the index is known to be sorted).
If you meant to get the whole list of columns only, you can get this by the .columns attribute.
I have 2 different tabular files in Excel format. I want to know if an ID number from one of the columns in the first Excel file (the "ID" column) exists in the proteome file in a specific column (take "IHD" for example) and, if so, to display the value associated with it. Is there a way to do this, specifically in pandas and possibly using a for loop?
After loading the excel files with read_excel(), you should merge() the dataframes on ID and protein. This is the recommended approach with pandas rather than looping.
import pandas as pd
clusters = pd.read_excel('clusters.xlsx')
proteins = pd.read_excel('proteins.xlsx')
clusters.merge(proteins, left_on='ID', right_on='protein')
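To also see which IDs were not found, a left merge with indicator=True may help. This is only a sketch: the two small frames below stand in for the Excel files, and the column names ID, protein and IHD as well as the values are assumptions taken from the wording of the question.

```python
import pandas as pd

# Hypothetical stand-ins for the two Excel files.
clusters = pd.DataFrame({"ID": ["P1", "P2", "P3"]})
proteins = pd.DataFrame({"protein": ["P1", "P3", "P4"],
                         "IHD": [0.5, 1.2, 3.3]})

# A left merge keeps every row of clusters; indicator=True adds a
# "_merge" column saying whether a match was found in proteins.
matched = clusters.merge(proteins, left_on="ID", right_on="protein",
                         how="left", indicator=True)
matched["found"] = matched["_merge"] == "both"

result = matched[["ID", "IHD", "found"]]
print(result)
```

IDs without a match keep their row, with NaN in IHD and False in found.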
Using the Kiva Loan_Data from Kaggle I aggregated the Loan Amounts by country. Pandas allows them to be easily turned into a DataFrame, but indexes on the country data. The reset_index can be used to create a numerical/sequential index, but I'm guessing I am adding an unnecessary step. Is there a way to create an automatic default index when creating a DataFrame like this?
Use as_index=False in groupby (split-apply-combine):
df.groupby('country', as_index=False)['loan_amount'].sum()
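A small comparison of the two routes, using made-up loan data rather than the Kiva file (the column names follow the question, the values are invented):

```python
import pandas as pd

# Invented sample data; only the column names come from the question.
loans = pd.DataFrame({
    "country": ["Kenya", "Peru", "Kenya", "Peru", "India"],
    "loan_amount": [100, 200, 300, 400, 500],
})

# One step: the group key stays a regular column and the index is 0..n-1.
with_flag = loans.groupby("country", as_index=False)["loan_amount"].sum()

# Two steps: aggregate with country as the index, then reset it.
with_reset = loans.groupby("country")["loan_amount"].sum().reset_index()

print(with_flag)
print(with_reset)
```

Both produce the same columns, so as_index=False just saves the extra reset_index call.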
So I was investigating how some commands from Pandas work, and I ran into this issue; when I use the reindex command, my data is replaced by NaN values. Below is my code:
>>>import pandas as pd
>>>import numpy as np
>>>frame1=pd.DataFrame(np.arange(365))
then, I give it an index of dates:
>>>frame1.index=pd.date_range(pd.Timestamp(2017, 4, 6), pd.Timestamp(2018, 4, 5))
then I reindex:
>>>broken_frame=frame1.reindex(np.arange(365))
aaaand all my values are erased. This example isn't particularly useful, but it happens any and every time I use the reindex command, seemingly regardless of context. Similarly, when I try to join two dataframes:
>>>big_frame=frame1.join(pd.DataFrame(np.arange(365)), lsuffix='_frame1')
all of the values in the frame being attached (np.arange(365)) are replaced with NaNs before the frames are joined. If I had to guess, I would say this is because the second frame is reindexed as part of the joining process, and reindexing erases my values.
What's going on here?
From the Docs
Conform DataFrame to new index with optional filling logic, placing NA/NaN in locations having no value in the previous index. A new object is produced unless the new index is equivalent to the current one and copy=False
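Emphasis my own: the part that matters is "placing NA/NaN in locations having no value in the previous index". reindex looks up the new labels in the existing index, and your date index contains none of the labels in np.arange(365), so every row comes back as NaN.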
You want either set_index
frame1.set_index(np.arange(365))
Or do what you did in the first place
frame1.index = np.arange(365)
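The difference can be seen side by side in a short sketch. The column names value and other are my own choice for readability (the question's frame uses the default column 0), and the join part is an illustration of the same label-alignment rule rather than something stated in the answer above.

```python
import numpy as np
import pandas as pd

# Same setup as in the question: 365 values labelled by consecutive dates.
frame1 = pd.DataFrame({"value": np.arange(365)},
                      index=pd.date_range("2017-04-06", "2018-04-05"))

# reindex looks up the labels 0..364 in the date index, finds none, fills NaN.
reindexed = frame1.reindex(np.arange(365))

# set_index with an array swaps the labels and leaves the data alone.
relabeled = frame1.set_index(np.arange(365))

# join aligns on labels too, so the other frame needs the same date index.
other = pd.DataFrame({"other": np.arange(365)}, index=frame1.index)
joined = frame1.join(other)

print(reindexed["value"].isna().all())
print(relabeled["value"].head(3).tolist())
print(joined["other"].isna().any())
```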
I did not find the answer helpful in relation to what I think the question is getting at so I am adding to this.
The key is that the initial dataframe must have the same index that you are reindexing on for this to work. Even the names must be the same! So if your new MultiIndex has no names, your initial dataframe must also have no names.
m = pd.MultiIndex.from_product([df['col'].unique(),
                                pd.date_range(df.date.min(),
                                              df.date.max() + pd.offsets.MonthEnd(1),
                                              freq='M')])
df = df.set_index(['col','date']).rename_axis([None,None])
df.reindex(m)
Then you will preserve your initial data values and reindex the dataframe.
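Here is a self-contained toy version of the same idea. The data is invented, and I use a daily frequency instead of the month-end frequency above to keep the example small; the index names are set to match on both sides, which sidesteps the naming question entirely.

```python
import pandas as pd

# Invented data: two groups with gaps in their dates.
df = pd.DataFrame({
    "col": ["a", "a", "b"],
    "date": pd.to_datetime(["2021-01-01", "2021-01-03", "2021-01-02"]),
    "value": [1, 2, 3],
})

# Every (col, date) combination we want to end up with.
full = pd.MultiIndex.from_product(
    [df["col"].unique(),
     pd.date_range(df["date"].min(), df["date"].max(), freq="D")],
    names=["col", "date"],
)

# Move col/date into the index first so reindex has labels to match on.
filled = df.set_index(["col", "date"]).reindex(full)
print(filled)
```

Rows that existed before keep their values; the combinations that were missing show up with NaN.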
I have a pandas DataFrame filled with strings. I would like to apply a string operation to all entries, for example capitalize(). I know that for a series we can use series.str.capitalize(). I also know that I can loop over the columns of the DataFrame and do this for each of them. But I want something more efficient and elegant, without looping. Thanks
Use stack + unstack.
stack turns a dataframe with a single-level column index into a series. You can then apply str.capitalize() and unstack to get back the original shape.
df.stack().str.capitalize().unstack()
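A quick check on a made-up frame. The apply variant is included only as a cross-check and is not part of the answer above.

```python
import pandas as pd

# Invented all-string frame.
df = pd.DataFrame({"first": ["alice", "bob"],
                   "last": ["smith", "jones"]})

# stack -> one long Series, capitalize every entry, unstack -> original shape.
capitalized = df.stack().str.capitalize().unstack()

# Alternative for comparison: apply the same string method column by column.
via_apply = df.apply(lambda column: column.str.capitalize())

print(capitalized)
```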
I am trying to get the top 5 values of a column of my dataframe.
A sample of the dataframe is given below. In fact the original dataframe has thousands of rows.
Row(item_id=u'2712821', similarity=5.0)
Row(item_id=u'1728166', similarity=6.0)
Row(item_id=u'1054467', similarity=9.0)
Row(item_id=u'2788825', similarity=5.0)
Row(item_id=u'1128169', similarity=1.0)
Row(item_id=u'1053461', similarity=3.0)
The solution I came up with is to sort all of the dataframe and then to take the first 5 values. (the code below does that)
items_of_common_users.sort(items_of_common_users.similarity.desc()).take(5)
I am wondering if there is a faster way of achieving this.
Thanks
You can use the RDD.top method with a key:
from operator import attrgetter
df.rdd.top(5, attrgetter("similarity"))
There is a significant overhead of DataFrame to RDD conversion but it should be worth it.
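For intuition on why this can beat a full sort: as far as I understand it, top only keeps the best few rows from each partition and then merges those small lists. The sketch below imitates that in plain Python with heapq; it is not Spark code, the function name top_across_partitions is made up, and the split of the sample rows into two partitions is arbitrary.

```python
import heapq
from collections import namedtuple
from functools import reduce
from operator import attrgetter

# Plain-Python stand-in for Spark's Row, using the sample rows from the question.
Row = namedtuple("Row", ["item_id", "similarity"])
partitions = [
    [Row("2712821", 5.0), Row("1728166", 6.0), Row("1054467", 9.0)],
    [Row("2788825", 5.0), Row("1128169", 1.0), Row("1053461", 3.0)],
]

def top_across_partitions(parts, num, key):
    """Keep the num largest rows per partition, then merge the short lists."""
    per_partition = (heapq.nlargest(num, part, key=key) for part in parts)
    return reduce(lambda a, b: heapq.nlargest(num, a + b, key=key), per_partition)

top_two = top_across_partitions(partitions, 2, attrgetter("similarity"))

# Reference result from sorting everything, for comparison.
all_rows = [row for part in partitions for row in part]
sorted_two = sorted(all_rows, key=attrgetter("similarity"), reverse=True)[:2]

print(top_two)
```

Only a handful of rows ever need to be compared across partitions, whereas sorting orders the whole dataset before the first five are taken.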