I have a 1650x40 dataframe that is a matrix of people who worked on projects each day. It looks like this:
import pandas as pd
df = pd.DataFrame([['bob','11/1/19','X','','',''], ['pete','11/1/19','X','','',''],
['wendy','11/1/19','','','X',''], ['sam','11/1/19','','','',''],
['cara','11/1/19','','','X','']],
columns=['person', 'date', 'project1','project2','project3','project4'])
I am trying to sanity check the data by:
listing any columns that do not have an X in them (in this case 'project2' and 'project4')
listing any rows that do not have an X in them (in this case 'sam')
Desired outcome:
Something like df.show_empty(columns) returns ['project2','project4'] and df.show_empty(rows) returns ['sam']
Obviously this method would need some way to be told that the first two columns are not expected to be empty and should be ignored.
My desired outcome above would return lists of column headings (or row indexes) so that I could go back and check my data and application to find out why there's no entry in the relevant cell (I am guessing there's a good chance that more than one row or column is affected). This seems like it should be trivial, but I'm really stuck figuring it out.
Thanks for any help offered!
For me, it is easier to use apply to accomplish this task. The working code is shown below:
import pandas as pd
df = pd.DataFrame([['bob','11/1/19','X','','',''], ['pete','11/1/19','X','','',''],
['wendy','11/1/19','','','X',''], ['sam','11/1/19','','','',''],
['cara','11/1/19','','','X','']],
columns=['person', 'date', 'project1','project2','project3','project4'])
import numpy as np
df = df.replace('', np.nan)
colmns = df.apply(lambda x: x.count()==0, axis=0)
df[colmns.index[colmns]]
df[df.apply(lambda x: x.iloc[2:].count()==0, axis=1)]
df = df.replace('', np.nan) replaces each '' with NaN so that count(), which counts only non-NaN values, can be used.
colmns = df.apply(lambda x: x.count()==0, axis=0) flags the columns that are entirely NaN, and df[colmns.index[colmns]] selects them.
df[df.apply(lambda x: x.iloc[2:].count()==0, axis=1)] selects the rows that are entirely NaN, ignoring the first two columns.
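The approach above returns DataFrames. If plain lists are wanted, as in the question, a minimal sketch could look like the following. Note that show_empty, id_cols and mark are hypothetical names made up for this illustration, not part of pandas, and the sketch assumes the first two columns are the identifier columns to ignore.

```python
import pandas as pd

df = pd.DataFrame(
    [['bob', '11/1/19', 'X', '', '', ''],
     ['pete', '11/1/19', 'X', '', '', ''],
     ['wendy', '11/1/19', '', '', 'X', ''],
     ['sam', '11/1/19', '', '', '', ''],
     ['cara', '11/1/19', '', '', 'X', '']],
    columns=['person', 'date', 'project1', 'project2', 'project3', 'project4'])


def show_empty(frame, axis, id_cols=('person', 'date'), mark='X'):
    """Hypothetical helper: list the columns or rows that contain no mark.

    id_cols are identifier columns that are never expected to hold a mark,
    so they are dropped before checking.
    """
    # Boolean frame: True wherever a project cell holds the mark
    marks = frame.drop(columns=list(id_cols)).eq(mark)
    if axis == 'columns':
        empty = ~marks.any(axis=0)          # one flag per project column
        return empty.index[empty].tolist()
    empty = ~marks.any(axis=1)              # one flag per row
    # Report rows by the first identifier column (here: 'person')
    return frame.loc[empty, id_cols[0]].tolist()


empty_columns = show_empty(df, 'columns')
empty_rows = show_empty(df, 'rows')
print(empty_columns)
print(empty_rows)
```

Comparing against the mark directly avoids the replace-with-NaN step, at the cost of treating any cell that is not exactly 'X' as empty.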
I have a dataframe in Python which is created using the following code:
import pandas as pd
df = pd.read_csv('myfile.txt', sep="\t")
df1 = df.iloc[:, 3:]
df1 now has 24 columns. I would like to transform the values to log2 and make a new dataframe whose 24 columns hold the log2 values of the original dataframe. To do so I used numpy.log, as in the following line:
df2 = (numpy.log(df1))
This code does not return what I would like to get. Do you know how to fix it?
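The question does not say what went wrong, but two things stand out: numpy.log computes the natural logarithm rather than log2, and the snippet never imports numpy. A minimal sketch of the base-2 version follows, assuming every column in df1 is numeric and positive. The small frame below is made-up stand-in data for df.iloc[:, 3:], since the real file is not shown.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for df1 = df.iloc[:, 3:] (the real data is not shown)
df1 = pd.DataFrame({'a': [1.0, 2.0, 8.0], 'b': [4.0, 16.0, 0.5]})

# np.log2 is applied element-wise and returns a DataFrame
# with the same index and column labels as df1
df2 = np.log2(df1)
print(df2)
```

If some of the 24 columns hold strings, or values that are zero or negative, the result will contain errors, -inf or NaN, so it may be worth checking df1.dtypes first.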
Here is my working code example:
import pandas as pd

def load_databases():
    # file_directory is assumed to be defined elsewhere
    df1 = pd.read_csv(file_directory+'data1.csv', encoding='utf-8', low_memory=False)
    df2 = pd.read_csv(file_directory+'data2.csv', encoding='utf-8', low_memory=False)
    df3 = pd.read_csv(file_directory+'data3.csv', encoding='utf-8', low_memory=False)
    return df1, df2, df3

df1_new, df2_new, df3_new = load_databases()
To use this function, I need to remember the order and nature of the returned values (df1, df2, df3), and I need to make sure the order of the new variables (df1_new, df2_new, df3_new) matches the function's tuple output. Is there a better way to map the tuple values? Or, better yet, a way to bypass the need for this line:
df1, df2, df3 = load_databases()
So that when I run load_databases(), df1, df2, and df3 will be created and accessible as global variables?
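One possible way to avoid depending on the order of the returned tuple, without resorting to global variables, is to return a dict keyed by name. The sketch below is only an illustration: the sources parameter is an assumption added here so that the function does not rely on file_directory, and the in-memory CSV text stands in for the real files.

```python
import io
import pandas as pd


def load_databases(sources):
    """Read each CSV source and return the frames in a dict keyed by name."""
    return {name: pd.read_csv(src, encoding='utf-8', low_memory=False)
            for name, src in sources.items()}


# In-memory stand-ins for data1.csv, data2.csv and data3.csv;
# with real files the values would be paths such as file_directory + 'data1.csv'
sources = {
    'data1': io.StringIO('a,b\n1,2\n'),
    'data2': io.StringIO('c,d\n3,4\n'),
    'data3': io.StringIO('e,f\n5,6\n'),
}

dbs = load_databases(sources)
print(dbs['data2'])   # each frame is looked up by name, not by position
```

Each frame is then accessed as dbs['data1'] and so on, so adding or reordering files does not break the calling code.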
I saved a Pandas DataFrame with "pickle". When I load it, it looks like Figure A (which is fine). But when I try to change the column names, it looks like Figure B.
What am I doing wrong? What are the other ways to change the name of columns?
Figure A
Figure B
import pandas as pd
df = pd.read_pickle('/home/myfile')
df = pd.DataFrame(df, columns=('AWA', 'REM', 'S1', 'S2', 'SWS', 'ALL'))
df
pd.read_pickle already returns a DataFrame.
You are creating a new DataFrame from an existing DataFrame just to rename its columns, which is not necessary.
As you want to rename all columns:
df.columns = ['AWA', 'REM','S1','S2','SWS','ALL']
Renaming specific columns in general could be achieved with:
df.rename(columns={'REM':'NewColumnName'},inplace=True)
Pandas docs
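As a likely explanation, assuming Figure B showed columns full of NaN: when pd.DataFrame is given an existing DataFrame together with columns=, it selects columns by label instead of renaming them, so labels that do not exist yet come back empty. The small frame and the labels old_a, old_b, A and B below are made up for this illustration.

```python
import pandas as pd

# Made-up frame standing in for the unpickled data
df = pd.DataFrame([[1, 2], [3, 4]], columns=['old_a', 'old_b'])

# columns= picks columns by label; 'A' and 'B' do not exist in df,
# so the result has those labels but only NaN values
reindexed = pd.DataFrame(df, columns=['A', 'B'])

# Assigning to .columns relabels the existing data instead
renamed = df.copy()
renamed.columns = ['A', 'B']

print(reindexed)
print(renamed)
```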
I have just solved it.
df = pd.read_pickle('/home/myfile')
df1 = pd.DataFrame(df.values*100)
df1.index='Feature' + (df1.index+1).astype(str)
df1.columns=('AWA', 'REM', 'S1', 'S2', 'SWS', 'ALL')
df1
Please find the pseudocode below:
source dataframe with 5 columns
creating a target dataframe with schema (6 columns)
for item in source_dataframe:
    # adding a column to the list by checking item.column2
    list = [item.column1, item.column2, newcolumn]
    # creating an RDD out of this list
    # now I need to add this RDD to a target dataframe?
You could definitely explain your question in more detail or give some sample code. I'm interested in how others will solve this. My proposed solution is this one:
df = (
    sc.parallelize([
        (134, "2016-07-02 12:01:40"),
        (134, "2016-07-02 12:21:23"),
        (125, "2016-07-02 13:22:56"),
        (125, "2016-07-02 13:27:07")
    ]).toDF(["itemid", "timestamp"])
)
# map over the underlying RDD and append a constant third value to each row
rdd = df.rdd.map(lambda x: (x[0], x[1], 10))
df2 = rdd.toDF(["itemid", "timestamp", "newCol"])
# combine column conditions with & (Python's "and" does not work on Columns)
df3 = df.join(df2, (df.itemid == df2.itemid) & (df.timestamp == df2.timestamp), "inner").drop(df2.itemid).drop(df2.timestamp)
I'm converting the RDD to a Dataframe. Afterwards I join both Dataframes, which duplicates some columns. So finally I drop those duplicated columns.