I'm working in a Jupyter notebook and am trying to create objects for the two different answers in a column, Yes and No, so that I can look at the similarities among all of the 'yes' responses, and likewise for the 'no' responses.
When I use the following code, I get an error that states: UndefinedVariableError: name 'No' is not defined
df_yes = df.query('No-show == "Yes"')
df_no = df.query('No-show == "No"')
Since the same error occurs even when I run only the df_yes line, I figured it has to have something to do with the column name "No-show." So I tried it with different columns and, sure enough, it works.
So can someone enlighten me as to what I'm doing wrong with this code block, so I won't do it again? Thanks!
Observe this example:
>>> import pandas as pd
>>> d = {'col1': ['Yes','No'], 'col2': ['No','No']}
>>> df = pd.DataFrame(data=d)
>>> df.query('col1 == "Yes"')
col1 col2
0 Yes No
>>> df.query('col2 == "Yes"')
Empty DataFrame
Columns: [col1, col2]
Index: []
>>>
Everything seems to work as expected. But, if I change col1 and col2 to col-1 and col-2, respectively:
>>> d = {'col-1': ['Yes','No'], 'col-2': ['No','No']}
>>> df = pd.DataFrame(data=d)
>>> df.query('col-1 == "Yes"')
...
pandas.core.computation.ops.UndefinedVariableError: name 'col' is not defined
As you can see, the problem is the minus (-) you use in your column name: query parses No-show as the subtraction No - show, so it goes looking for a variable named No. As a matter of fact, you were even more unlucky, because the No in your error message refers to No-show and not to the value No in your columns.
So, the best solution (and best practice in general) is to name your columns differently; think of them as variables: you cannot have a minus in the name of a variable, at least in Python. For example, No_show. If this data frame is not created by you (e.g. you read your data from a csv file), it's common practice to rename the columns appropriately. Alternatively, recent pandas versions let you wrap an awkward name in backticks inside query.
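A quick sketch of both options (the toy frame here is made up):
import pandas as pd

df = pd.DataFrame({'No-show': ['Yes', 'No'], 'age': [30, 40]})

# Option 1 (pandas >= 0.25): wrap the awkward name in backticks inside query
df_yes = df.query('`No-show` == "Yes"')

# Option 2: rename the column to a valid identifier, then query as usual
df = df.rename(columns={'No-show': 'No_show'})
df_yes = df.query('No_show == "Yes"')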
Related
I'd like to take an existing DataFrame with a single level of columns and modify it to use a MultiIndex, based on a reference list of tuples, with proper ordering/alignment. To illustrate by example:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(10,5), columns = ['nyc','london','canada','chile','earth'])
coltuples = [('cities','nyc'),('countries','canada'),('countries','usa'),('countries','chile'),('planets','earth'),('planets','mars'),('cities','sf'),('cities','london')]
I'd like to create a new DataFrame which has a top level consisting of 'cities', 'countries', and 'planets', with the corresponding original columns underneath. I am not concerned about order, but I definitely need proper alignment.
It can be assumed that coltuples will not be missing any of the columns from df, but it may have extraneous pairs, and the ordering of the pairs can be random.
I am trying something along the lines of:
coltuplesuse = [x for x in coltuples if x[1] in df.columns]
cols = pd.MultiIndex.from_tuples(coltuplesuse, names=['level1','level2'])
df.reindex(columns=cols)
which seems to be on the right track, but the underlying data in the resulting DataFrame is all NaN.
Thanks in advance!
Two things to notice: you want set_axis rather than reindex (reindex looks the new tuple labels up among the existing flat columns, finds no matches, and fills everything with NaN, whereas set_axis simply relabels the existing columns), and sorting by the original column order ensures the correct label is assigned to the correct column (this is done in the sorted(..., key=...) bit).
# Keep only the tuples whose second level matches an existing column
use_cols = [tup for tup in coltuples if tup[1] in df.columns]
# Sort the tuples into the original column order so each label lands on the right column
use_cols = sorted(use_cols, key=lambda x: list(df.columns).index(x[1]))
multi_index = pd.MultiIndex.from_tuples(use_cols, names=['level1', 'level2'])
df.set_axis(multi_index, axis=1)
output:
level1 cities countries planets
level2 nyc london canada chile earth
0 0.028033 0.540977 -0.056096 1.675698 -0.328630
1 1.170465 -1.003825 0.882126 0.453294 -1.127752
2 -0.187466 -0.192546 0.269802 -1.225172 -0.548491
3 2.272900 -0.085427 0.029242 -2.258696 1.034485
4 -1.243871 -1.660432 -0.051674 2.098602 -2.098941
5 -0.820820 -0.289754 0.019348 0.176778 0.395959
6 1.346459 -0.260583 0.212008 -1.071501 0.945545
7 0.673351 1.133616 1.117379 -0.531403 1.467604
8 0.332187 -3.541103 -0.222365 1.035739 -0.485742
9 -0.605965 -1.442371 -1.628210 -0.711887 -2.104755
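One caveat: in recent pandas versions set_axis returns a new DataFrame rather than modifying df in place (the old inplace option was deprecated and later removed), so assign the result back if you want to keep it:
df = df.set_axis(multi_index, axis=1)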
I want to map through the rows of df1 and compare them with the values of df2, by month and day, across every year in df2, keeping only the values in df1 which are larger than those in df2 and adding them to a new column, 'New'. df1 and df2 are the same size and are indexed by 'Month' and 'Day'. What would be the best way to do this?
df1=pd.DataFrame({'Date':['2015-01-01','2015-01-02','2015-01-03','2015-01-04','2015-01-05'],'Values':[-5.6,-5.6,0,3.9,9.4]})
df1.Date=pd.to_datetime(df1.Date)
df1['Day']=pd.DatetimeIndex(df1['Date']).day
df1['Month']=pd.DatetimeIndex(df1['Date']).month
df1.set_index(['Month','Day'],inplace=True)
df1
df2 = pd.DataFrame({'Date':['2005-01-01','2005-01-02','2005-01-03','2005-01-04','2005-01-05'],'Values':[-13.3,-12.2,6.7,8.8,15.5]})
df2.Date=pd.to_datetime(df2.Date)
df2['Day']=pd.DatetimeIndex(df2['Date']).day
df2['Month']=pd.DatetimeIndex(df2['Date']).month
df2.set_index(['Month','Day'],inplace=True)
df2
Comparing df1 and df2 with
df2['New']=df2[df2['Values']<df1['Values']]
gives
ValueError: Can only compare identically-labeled Series objects
I have also tried
df2['New']=df2[df2['Values'].apply(lambda x: x < df1['Values'].values)]
The best way to handle your problem is to use NumPy as a tool. NumPy has a function called where that helps a lot in cases like this.
This is how the statement works:
df1['new column that will contain the comparison results'] = np.where(condition, 'value if true', 'value if false')
First import the library:
import numpy as np
Using the condition provided by you:
df2['New'] = np.where(df1['Values'] > df2['Values'], df1['Values'], '')
So, I think that solves your problem... You can change the value passed for the false case to anything you want; this is only an example.
Tell us if it worked!
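One caveat: if the two frames are not identically labeled, the Series comparison inside np.where raises the very ValueError you started with. A sketch of a workaround, assuming both frames have the same length and row order, is to compare the underlying arrays instead:
import numpy as np

# Positional comparison on the raw arrays sidesteps index alignment entirely
mask = df1['Values'].to_numpy() > df2['Values'].to_numpy()
df2['New'] = np.where(mask, df1['Values'].to_numpy(), '')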
Let's try two possible solutions:
The first solution is to sort the index first.
df1.sort_index(inplace=True)
df2.sort_index(inplace=True)
Perform a simple test to see if it works!
df1 == df2
It is possible this raises some kind of error; if that happens, try this correction instead:
df1.sort_index(inplace=True, axis=1)
df2.sort_index(inplace=True, axis=1)
The second solution is to drop the indexes and reset them:
df1.reset_index(drop=True, inplace=True)
df2.reset_index(drop=True, inplace=True)
Perform a simple test to see if it works!
df1 == df2
See if it works and tell us the result.
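Putting the pieces together, a minimal sketch (assuming both frames are indexed by Month and Day as in the question):
# Align both frames on their (Month, Day) indexes
df1.sort_index(inplace=True)
df2.sort_index(inplace=True)

# With identically-labeled indexes the comparison no longer raises;
# where() keeps df1's value when it is larger and leaves NaN otherwise
df2['New'] = df1['Values'].where(df1['Values'] > df2['Values'])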
Please find the input and expected output below. For each store_id and period_id, 11 items should be present; if any item is missing, add a row for it filled with 0, without using a loop.
Any help is highly appreciated.
Input
Expected Output
You can do this:
Sample df:
df = pd.DataFrame({'store_id':[1160962,1160962,1160962,1160962,1160962,1160962,1160962,1160962,1160962,1160962, 1160962],
'period_id':[1025,1025,1025,1025,1025,1025,1026,1026,1026,1026,1026],
'item_x':[1,4,5,6,7,8,1,2,5,6,7],
'z':[1,4,5,6,7,8,1,2,5,6,7]})
Solution: group by store_id and period_id, reindex each group's items to the full 1-11 range (filling the new rows with 0), then restore the group keys:
num = range(1, 12)

def f(x):
    # Reindex the group's item index to 1..11, filling missing rows with 0,
    # then restore the group's store_id and period_id on the filled rows
    return x.reindex(num, fill_value=0)\
            .assign(store_id=x['store_id'].mode()[0], period_id=x['period_id'].mode()[0])

df.set_index('item_x').groupby(['store_id','period_id'], group_keys=False).apply(f).reset_index()
You can do:
from itertools import product

# Build every (store_id, period_id, Item) combination for items 1..11
pdindex = product(df.groupby(["store_id", "period_id"]).groups, range(1, 12))
pdindex = pd.MultiIndex.from_tuples(map(lambda x: (*x[0], x[1]), pdindex), names=["store_id", "period_id", "Item"])

df = df.set_index(["store_id", "period_id", "Item"])

# Start from an empty frame over the full index, copy in the existing rows, zero-fill the rest
res = pd.DataFrame(index=pdindex, columns=df.columns)
res.loc[df.index, df.columns] = df
res = res.fillna(0).reset_index()
Note that this will work only assuming you don't have any Item outside the range [1, 11].
This is a simplification of @GrzegorzSkibinski's correct answer.
This answer does not modify the original DataFrame. It uses fewer variables to store intermediate data structures and employs a list comprehension in place of map.
I'm also using reindex() rather than creating a new DataFrame from the generated index and populating it with the original data.
import pandas as pd
import itertools
df.set_index(
["store_id", "period_id", "Item_x"]
).reindex(
pd.MultiIndex.from_tuples([
group + (item,)
for group, item in itertools.product(
df.groupby(["store_id", "period_id"]).groups,
range(1, 12),
)],
names=["store_id", "period_id", "Item_x"]
),
fill_value=0,
).reset_index()
In testing, the output matched what you listed as expected.
So I have 3 dataframes - df1, df2, df3. I'm trying to loop through each dataframe so that I can run some preprocessing - set date time, extract hour to a separate column, etc. However, I'm running into some issues:
If I store the dfs in a dict, as in df_dict = {'df1' : df1, 'df2' : df2, 'df3' : df3}, and then loop through it as in
for k, v in df_dict.items():
if k == 'df1':
v['Col1']....
else:
v['Coln']....
I get a NameError: name 'df1' is not defined
What am I doing wrong? I initially thought I was not reading in the df1..3 data, but that seems to operate OK (as in, it doesn't fail, and it's clearly reading the data in, given the time lag - they are big files). The code preceding it (for the load) is:
DF_DATA = { 'df1': 'df1.csv','df2': 'df2.csv', 'df3': 'df3.csv' }
for k,v in DF_DATA.items():
print(k, v) #this works to print out both key and value
k = pd.read_csv(v) #this does not
I am thinking this may be the cause, but I'm not sure. I'm expecting the load loop to create the 3 dataframes and put them into memory. Then, in the loop at the top of the page, I want to reference the string key in my if-block condition so that each df can get a slightly different preprocessing treatment.
Thanks very much in advance for your assist.
You didn't create df_dict correctly: in your load loop, k = pd.read_csv(v) merely rebinds the loop variable k to the new DataFrame and discards it on the next iteration; it never creates variables named df1, df2, or df3, so building df_dict from those names fails with the NameError. Try this:
DF_DATA = { 'df1': 'df1.csv','df2': 'df2.csv', 'df3': 'df3.csv' }
df_dict= {k:pd.read_csv(v) for k,v in DF_DATA.items()}
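With the dict built, the preprocessing loop you sketched works as intended, keyed off the dict key rather than a bare variable name (a sketch; Col1 and Coln are placeholders from your question):
for k, v in df_dict.items():
    if k == 'df1':
        v['Hour'] = pd.to_datetime(v['Col1']).dt.hour  # df1-specific handling
    else:
        v['Hour'] = pd.to_datetime(v['Coln']).dt.hour  # the other frames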
Here's a statement from another answer: https://stackoverflow.com/a/45600938/4164722
Dataset.col returns a resolved column while col returns an unresolved column.
Can someone provide more details? When should I use Dataset.col() and when functions.col?
Thanks.
In the majority of contexts there is no practical difference. For example:
val df: Dataset[Row] = ???
df.select(df.col("foo"))
df.select(col("foo"))
are equivalent, same as:
df.where(df.col("foo") > 0)
df.where(col("foo") > 0)
The difference becomes important when provenance matters, for example in joins:
val df1: Dataset[Row] = ???
val df2: Dataset[Row] = ???
df1.join(df2, Seq("id")).select(df1.col("foo") =!= df2.col("foo"))
Because Dataset.col is resolved and bound to a DataFrame it allows you to unambiguously select column descending from a particular parent. It wouldn't be possible with col.
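The same distinction can be sketched in pySpark, where df['foo'] plays the role of Dataset.col (assuming a SparkSession named spark):
from pyspark.sql import functions as F

df1 = spark.createDataFrame([(1, 'a')], ['id', 'foo'])
df2 = spark.createDataFrame([(1, 'b')], ['id', 'foo'])
joined = df1.join(df2, 'id')

joined.select(df1['foo'], df2['foo'])  # bound columns: each knows its parent frame
# joined.select(F.col('foo'))          # unbound: ambiguous reference, raises AnalysisException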
EXPLANATION:
At times you may want to programmatically pre-create (i.e. ahead of time) column expressions for later use -- before the related DataFrame(s) actually exist. In that use case, col(expression) can be useful. Generically illustrated using pySpark syntax:
>>> cX = col('col0') # Define an unresolved column.
>>> cY = col('myCol') # Define another unresolved column.
>>> cX,cY # Show that these are naked column names.
(Column<b'col0'>, Column<b'myCol'>)
Now, these are called unresolved columns because they are not associated with a DataFrame, so there is no way to know whether those column names actually exist anywhere. However, you may in fact apply them in a DF context later on, after having prepared them:
>>> df = spark_sesn.createDataFrame([Row(col0=10, col1='Ten', col2=10.0),])
>>> df
DataFrame[col0: bigint, col1: string, col2: double]
>>> df.select(cX).collect()
[Row(col0=10)] # cX is successfully resolved.
>>> df.select(cY).collect()
Traceback (most recent call last): # Oh dear! cY, which represents
[ ... snip ... ] # 'myCol' is truly unresolved here.
# BUT maybe later on it won't be, say,
# after a join() or something else.
CONCLUSION:
col(expression) can help programmatically decouple the DEFINITION of a column specification from the APPLICATION of it against DataFrame(s) later on. Note that expr(aString), which also returns a column specification, provides a generalization of col('xyz'): whole expressions can be DEFINED and later APPLIED:
>>> cZ = expr('col0 + 10') # Creates a column specification / expression.
>>> cZ
Column<b'(col0 + 10)'>
>>> df.select(cZ).collect() # Applying that expression later on.
[Row((col0 + 10)=20)]
I hope this alternative use-case helps.