Create pandas dataframe from csv rows in string or list format - python-3.x

I convert some data into a CSV-like string format, row by row. For example, the rows look like:
string format
1st row: "A,B,R,K,S,E"
2nd row: "B,C,S,E,G,Q,W,R,W" # sometimes longer rows
3rd row: "A,E,R,E,S" # sometimes shorter rows
or list format
1st row: ['A','B','R','K','S','E']
2nd row: ['B','C','S','E','G','Q','W','R','W']
3rd row: ['A','E','R','E','S']
I can also add \n at the end of each row.
I want to create a pandas dataframe from these rows, but I'm not sure how.
Normally I save this data into a .csv file and then call pd.read_csv, but I want to skip that step.
Thanks for the help.

This will solve your problem:
import pandas as pd

First_row = ['A','B','R','K','S','E']
Second_row = ['B','C','S','E','G','Q','W','R','W']
Third_row = ['A','E','R','E','S']

# Series of unequal length are padded with NaN; transpose so each
# original row becomes a row of the dataframe.
df = pd.DataFrame({'1st row': pd.Series(First_row),
                   '2nd row': pd.Series(Second_row),
                   '3rd row': pd.Series(Third_row)})
answer = df.T
answer
0 1 2 3 4 5 6 7 8
1st row A B R K S E NaN NaN NaN
2nd row B C S E G Q W R W
3rd row A E R E S NaN NaN NaN NaN

Method 1: from a list
Pass a 2D list (a list of row lists) to pd.DataFrame; otherwise the values would be added as columns. Rows of unequal length are padded with NaN.
Method 2: from a string
Split each string on commas first, or join the strings with newlines and parse them with pd.read_csv, as sketched below.
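A minimal sketch of both methods, using the example rows from the question (the variable names string_rows and list_rows are just for illustration):
import io

import pandas as pd

string_rows = ["A,B,R,K,S,E",
               "B,C,S,E,G,Q,W,R,W",
               "A,E,R,E,S"]
list_rows = [['A','B','R','K','S','E'],
             ['B','C','S','E','G','Q','W','R','W'],
             ['A','E','R','E','S']]

# Method 1: a list of lists becomes one dataframe row per inner list;
# shorter rows are padded with None/NaN automatically.
df_from_lists = pd.DataFrame(list_rows)

# Method 2: split each CSV string into a list, then proceed as above.
df_from_strings = pd.DataFrame([row.split(',') for row in string_rows])

# Alternative: feed the joined text to read_csv via io.StringIO.
# names must cover at least the longest row, or ragged rows will error.
text = '\n'.join(string_rows)
df_via_read_csv = pd.read_csv(io.StringIO(text), header=None,
                              names=list(range(9)))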

Related

How to find complete empty row in pandas

I am working on a dataset in which I need to find the completely empty rows.
example:
A B C D
nan nan nan nan
1 ss nan 3.0
2 bb w2 4.0
nan nan nan nan
Currently, I am using:
import pandas as pd

nan_col = []
for col in df.columns:
    if df.loc[df[col].isnull()].empty != True:
        nan_col.append(col)
But this captures the null values in the specified columns, whereas I need to capture fully null rows.
Expected answer: rows [0, 3]
Can anyone suggest a way to identify a completely null row in the dataframe?
You can check whether all values in a row are missing with DataFrame.isna and DataFrame.all, and then get the index values by boolean indexing:
L = df.index[df.isna().all(axis=1)].tolist()
#alternative, slower if the dataframe is huge
#L = df[df.isna().all(axis=1)].index.tolist()
print (L)
[0, 3]
Or you could use dropna with set and sorted: get the index after dropping the rows with NaNs, get the index of the whole dataframe, and use ^ (symmetric difference) to find the values that aren't in both indexes; sorted then sorts the result and returns it as a list, like below:
print(sorted(set(df.index) ^ set(df.dropna(how='all').index)))
If you might have a duplicate index, you can use a list comprehension that iterates through the whole df's index and keeps a value if its position is not in the dropna index. I also use enumerate so that even if all index labels are the same (all duplicates), it would still work, like below:
idx = df.dropna(how='all').index
print([i for index, i in enumerate(df.index) if index not in idx])
Both codes output:
[0, 3]
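A minimal alternative sketch: numpy's flatnonzero returns the positional row numbers directly, which sidesteps duplicate-label concerns entirely (this assumes numpy is available as np):
import numpy as np

# positions (not labels) of the rows where every value is NaN
positions = np.flatnonzero(df.isna().all(axis=1)).tolist()
print(positions)
[0, 3]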

How to add an empty column in a dataframe using pandas (without specifying column names)?

I have a dataframe with only one column (headerless). I want to add another empty column to it having the same number of rows.
To make it clearer: currently the size of my data frame is 1050 (since it has only one column); I want the new size to be 1050*2, with the second column completely empty.
A pandas DataFrame always has columns, so to add a new default column filled with missing values, use the current number of columns as the new column label:
import numpy as np
import pandas as pd

s = pd.Series([2,3,4])
df = s.to_frame()
df[len(df.columns)] = np.nan
#the same thing for a one-column df:
#df[1] = np.nan
print (df)
0 1
0 2 NaN
1 3 NaN
2 4 NaN
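If you need to grow to a fixed number of columns in one step, reindex is a compact alternative; a small sketch, assuming integer column labels as above:
# grow the frame to 2 columns; the new column is filled with NaN
df = df.reindex(columns=range(2))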

Conditionally concatenate variable names into a new variable in Python

I have a data set with 3 columns and occasional NAs. I am trying to create a new string column called 'check' that concatenates, between underscores ('_'), the names of the variables that don't have an NA in each row. I pasted my code below, as well as the data that I have, the data that I need, and what I actually get (see the links after the code). For some reason the conditional seems to be completely ignored, and example_set['check'] = example_set['check'] + column is executed on every loop, with or without the conditional block. I assume there is a Python/Pandas quirk that I haven't fully comprehended... Can you please help?
import numpy as np
import pandas as pd

example_set = pd.DataFrame({
    'A':[3,4,np.nan]
    ,'B':[1,np.nan,np.nan]
    ,'C':[3,4,5]
    }
)
example_set
columns = list(example_set.columns)
example_set['check'] = '_'
for column in columns:
    for row in range(example_set.shape[0]):
        if example_set[column][row] != np.nan:
            example_set['check'] = example_set['check'] + column
        else:
            continue
example_set
(The original question linked screenshots of the data I have, the data I was hoping to get, and what I actually get.)
The conditional is ignored because NaN compares unequal to everything, including itself, so example_set[column][row] != np.nan is always True; use isna/notna instead of comparing against np.nan. Find the rows that have null values, iterate over the boolean mask with numpy's compress, take the difference of that result from the columns, format the strings to your taste, and create the new column:
columns = example_set.columns
example_set['check'] = [f'_{"".join(columns.difference(np.compress(boolean,columns)))}_'
for boolean in example_set.isna().to_numpy()]
A B C check
0 3.0 1.0 3 _ABC_
1 4.0 NaN 4 _AC_
2 NaN NaN 5 _C_
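To see what np.compress is doing above, a tiny usage example:
import numpy as np

cols = np.array(['A', 'B', 'C'])
# compress keeps the entries whose boolean flag is True
print(np.compress([True, False, True], cols))
# ['A' 'C']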
The simplest strategy is to copy the df and create a new column in it. Iterate over the rows of the old df, filter out the nan cells, and get the remaining indices.
Join them into a string and put those values into the new df.
It is probably not the most efficient method, but it should be easy to understand.
Here is some code to get you going:
nset = example_set.copy()
nset["checked"] = "__"
for s in range(example_set.shape[0]):
    serie = example_set.iloc[s]
    nserie = serie[serie.notnull()]
    names = "".join(nserie.index.tolist())
    nset.at[s, "checked"] = "__" + names + "__"
Please try:
import math

import numpy as np
import pandas as pd

example_set = pd.DataFrame({
    'A':[3,4,np.nan]
    ,'B':[1,np.nan,np.nan]
    ,'C':[3,4,5]
    }
)
example_set['check'] = '_ABC_'
for i in range(len(example_set)):
    list_ = example_set.iloc[i].values.tolist()
    # .at avoids the chained assignment example_set['check'][i] = ... would do
    if math.isnan(list_[0]):
        example_set.at[i, 'check'] = example_set.at[i, 'check'].replace('A','')
    if math.isnan(list_[1]):
        example_set.at[i, 'check'] = example_set.at[i, 'check'].replace('B','')
    if math.isnan(list_[2]):
        example_set.at[i, 'check'] = example_set.at[i, 'check'].replace('C','')
Output:
A B C check
0 3.0 1.0 3 _ABC_
1 4.0 NaN 4 _AC_
2 NaN NaN 5 _C_
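For completeness, a vectorized sketch that avoids the Python-level loop entirely; it assumes the value columns are exactly A, B and C:
mask = example_set[['A', 'B', 'C']].notna()
# for each row, keep the column labels where the mask is True and join them
example_set['check'] = mask.apply(
    lambda row: '_' + ''.join(row.index[row]) + '_', axis=1)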

Merging sheets of excel using python

I am trying to take the data of two sheets and compare them with each other; if a name matches, I want to append a column. Let me explain by showing what I am doing and what I am trying to get in the output, using python.
This is my sheet1 from excel.xlsx:
It contains four columns: name, class, age and group.
This is my sheet2 from excel.xlsx:
It contains a default column, and a name column with extra names in it.
So, now I am trying to match the names in sheet2 with those in sheet1; if a name in sheet1 matches one in sheet2, then I want to add the default value corresponding to that name from sheet2.
This is the output I need:
As you can see, only Ravi and Neha have a default in sheet2, and those names match names in sheet1. Suhash and Aish don't have any default value, so nothing appears for them.
This code i tried:
import pandas as pd
import xlrd
df1 = pd.read_excel('stack.xlsx', sheet_name='Sheet1')
df2 = pd.read_excel('stack.xlsx', sheet_name='Sheet2')
df1['DEFAULT'] = df1.NAME.map(df2.set_index('NAME')['DEFAULT'].to_dict())
df1.to_excel('play.xlsx',index=False)
and I am getting an output excel like this:
I am not getting a default against Ravi.
Please help me get the expected output using python.
Assuming you read each sheet into a dataframe (df = sheet1, df2 = sheet2),
it's quite easy, and there are a few options (ranked in order of speed, from fastest to slowest):
# .merge
df = df.merge(df2, how='left', on='Name')
# pd.concat
df = pd.concat([df.set_index('Name'), df2.set_index('Name').Default], axis=1, sort=False, join='inner')
# .join
df = df.set_index('Name').join(df2.set_index('Name'))
# .map
df.Default = df.Name.map(df2.set_index('Name')['Default'].to_dict())
All of them will have the following output:
Name Default Class Age Group
0 NaN NaN 4 2 tig
1 Ravi 2.0 5 5 rose
2 NaN NaN 3 3 lily
3 Suhas NaN 5 5 rose
4 NaN NaN 2 2 sun
5 Neha 3.0 5 5 rose
6 NaN NaN 5 2 sun
7 Aish NaN 5 5 rose
Then you overwrite the original sheet by using df.to_excel
EDIT
So the code you shared has 3 problems, one of which seems to be a language barrier... You only need one of the options I gave you. Secondly, there was a missing ' when reading the first sheet into df1. And lastly, you were inconsistent with the dataframe names: you defined df1 and df2 but used just df later in the code, which doesn't work.
So the correct code would be as follows:
import pandas as pd
import xlrd
df1 = pd.read_excel('stack.xlsx', sheet_name='Sheet1') #Here the ' was missing
df2 = pd.read_excel('stack.xlsx', sheet_name='Sheet2')
## Now choose one of the options; I used map here, but you can pick any of them
df1.DEFAULT = df1.NAME.map(df2.set_index('NAME')['DEFAULT'].to_dict())
df1.to_excel('play.xlsx',index=False)
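If you would rather write the result back into the same workbook instead of a new file, pd.ExcelWriter can replace a single sheet in place; a sketch, assuming the openpyxl engine and pandas >= 1.3 (where if_sheet_exists was added):
with pd.ExcelWriter('stack.xlsx', mode='a', engine='openpyxl',
                    if_sheet_exists='replace') as writer:
    df1.to_excel(writer, sheet_name='Sheet1', index=False)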

Modifying multiple columns of data using iteration, but changing increment value for each column

I'm trying to modify multiple column values in a pandas DataFrame with a different increment for each column, so that the values in the columns do not overlap with each other when drawn on a line graph.
Let's say I have this kind of Dataframe:
Col1 Col2 Col3
0 0.3 0.2
1 1.1 1.2
2 2.2 2.4
3 3 3.1
but with hundreds of columns and thousands of values.
When graphing this on a line-graph on excel or matplotlib, the values overlap with each other, so I would like to separate each column by adding the same values for each column like so:
Col1(+0) Col2(+10) Col3(+20)
0 10.3 20.2
1 11.1 21.2
2 12.2 22.4
3 13 23.1
By adding the same value to one column and increasing by an increment of 10 over each column, I am able to see each line without it overlapping in one graph.
I thought of using loops and iteration to automate this value-adding process, but I couldn't find any previous solutions on Stack Overflow that address how to change the increment value between columns (e.g. adding 0 to Col1 in one loop, then 10 to Col2 in the next loop) rather than within the values of a column. To make things worse, I'm a beginner with no clue about programming or data manipulation.
Since the data is in a CSV format, I first used Pandas to read it and store in a Dataframe, and selected the columns that I wanted to edit:
import pandas as pd

# import the CSV file and store its data in a dataframe
df = pd.read_csv('data.csv')
df1 = pd.DataFrame(data=df)

# locate the columns that I want to edit with df.loc
columns = df1.loc[:, ' C000':]
here is where I'm stuck:
# use iteration with increments to add numbers
n = 0
for values in columns:
    values = n + 0
    print(values)
But this for-loop only adds one increment value (in this case 0), and adds it to all columns, not just the first column. Not only that, but I don't know how to add the next increment value for the next column.
Any possible solutions would be greatly appreciated.
IIUC, just use df.add() over axis=1 with a list built from the length of df.columns:
df1 = df.add(list(range(0,len(df.columns)*10))[::10],axis=1)
Or, as @jezrael suggested, better:
df1=df.add(range(0,len(df.columns)*10, 10),axis=1)
print(df1)
Col1 Col2 Col3
0 0 10.3 20.2
1 1 11.1 21.2
2 2 12.2 22.4
3 3 13.0 23.1
Details :
list(range(0,len(df.columns)*10))[::10]
#[0, 10, 20]
I would recommend avoiding looping over the data frame, as it is inefficient; instead, think of it as adding two matrices.
e.g.
import numpy as np
import pandas as pd
# Create your example df
df = pd.DataFrame(data=np.random.randn(10,3))
# Create a Matrix of ones
x = np.ones(df.shape)
# Multiply each column with an incremented value * 10
x = x * 10*np.arange(1,df.shape[1]+1)
# Add the matrix to the data
df + x
Edit: in case you want to increment by 0, 10, 20 rather than 10, 20, 30, use this instead:
import numpy as np
import pandas as pd
# Create your example df
df = pd.DataFrame(data=np.random.randn(10,3))
# Create a Matrix of ones
x = np.ones(df.shape)
# THIS LINE CHANGED
# Omit the 1 so there is only an end value -> default start is 0
# Adjust the length of the vector
x = x * 10*np.arange(df.shape[1])
# Add the matrix to the data
df + x
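Note that the matrix of ones isn't strictly necessary: numpy broadcasting lets you add the increment vector to the frame directly. A minimal sketch of the same idea:
import numpy as np
import pandas as pd

df = pd.DataFrame(data=np.random.randn(10, 3))

# each column i gets 10*i added; the 1-D vector broadcasts across the rows
df_shifted = df + 10 * np.arange(df.shape[1])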
