Apply function to specific rows of dataframe column - python-3.x

I have the following column in a dataframe:
COLUMN_NAME
1
0
1
1
65280
65376
65280
I want to convert the 5-digit values in the column to their corresponding binary values. I know how to convert a value with the bin() function, but I don't know how to apply it only to the rows that have 5 digits.
Note that the column contains only values with either 1 or 5 digits; the 1-digit values are only 1 or 0.

import pandas as pd
import numpy as np

data = {'c': [1, 0, 1, 1, 65280, 65376, 65280]}
df = pd.DataFrame(data, columns=['c'])
# create another column 'clen' which holds the string length of 'c'
df['clen'] = df['c'].astype(str).map(len)
# check the condition, but this applies the bin function to the entire column
df.loc[df['clen']==5, 'c'] = df['c'].apply(bin)
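One way to restrict the conversion to the 5-digit rows, shown as a minimal sketch (it applies bin() conditionally per value, so the 0/1 rows are left untouched; the column becomes object dtype once strings are mixed in):

import pandas as pd

data = {'c': [1, 0, 1, 1, 65280, 65376, 65280]}
df = pd.DataFrame(data)

# convert only the values whose string form has 5 digits
df['c'] = df['c'].apply(lambda v: bin(v) if len(str(v)) == 5 else v)
print(df)
#                     c
# 0                   1
# 1                   0
# 2                   1
# 3                   1
# 4  0b1111111100000000
# 5  0b1111111101100000
# 6  0b1111111100000000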

Related

Find and Add Missing Column Values Based on Index Increment Python Pandas Dataframe

Good Afternoon!
I have a pandas dataframe with an index and a count.
dictionary = {1: 5, 2: 10, 4: 3, 5: 2}
df = pd.DataFrame.from_dict(dictionary, orient='index', columns=['count'])
What I want to do is check, from df.index.min() to df.index.max(), that the index increments by 1. If a value is missing, as in my case where 3 is missing, then I want to add 3 to the index with a count of 0.
The output should look like df2 below, but produced programmatically so I can use it on a much bigger dataframe.
RESULTS EXAMPLE DF:
dictionary2 = {1: 5, 2: 10, 3: 0, 4: 3, 5: 2}
df2 = pd.DataFrame.from_dict(dictionary2, orient='index', columns=['count'])
Thank you much!!!
Ensure the index is sorted:
df = df.sort_index()
Create an array that runs from the minimum index to the maximum index:
complete_array = np.arange(df.index.min(), df.index.max() + 1)
Reindex, fill the null value with 0, and optionally change the dtype to Pandas Int:
df.reindex(complete_array, fill_value=0).astype("Int16")
count
1 5
2 10
3 0
4 3
5 2
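Put together, the steps above (with the imports they rely on) give a runnable version of this answer:

import numpy as np
import pandas as pd

dictionary = {1: 5, 2: 10, 4: 3, 5: 2}
df = pd.DataFrame.from_dict(dictionary, orient='index', columns=['count'])

# sort, build the full index range, and fill the missing rows with 0
df = df.sort_index()
complete_array = np.arange(df.index.min(), df.index.max() + 1)
df2 = df.reindex(complete_array, fill_value=0).astype("Int16")
print(df2)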

How to add an empty column in a dataframe using pandas (without specifying column names)?

I have a dataframe with only one column (headerless). I want to add another empty column to it with the same number of rows.
To make it clearer: currently the size of my data frame is 1050x1 (since there is only one column), and I want the new size to be 1050x2, with the second column completely empty.
A pandas DataFrame always has columns, so to add a new default column filled with missing values, use the current number of columns as the new label:
import numpy as np
import pandas as pd

s = pd.Series([2, 3, 4])
df = s.to_frame()
df[len(df.columns)] = np.nan
# which, for a one-column df like this, is the same as
# df[1] = np.nan
print(df)
0 1
0 2 NaN
1 3 NaN
2 4 NaN
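Applied to the question's setup, a headerless single-column frame read from a file (the file name here is just a placeholder), the same idea would look roughly like this:

import numpy as np
import pandas as pd

# hypothetical headerless, single-column CSV with 1050 rows
df = pd.read_csv("data.csv", header=None)

# the new column label is the current column count (here 1), filled with NaN
df[len(df.columns)] = np.nan
print(df.shape)  # (1050, 2)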

Replace dataframe value by indices

I have the following dataframe, df:
import pandas as pd

df = pd.DataFrame(
    columns=[1, 2],
    data=[['a', 'b'], ['c', 'd']],
)
df
   1  2
0  a  b
1  c  d
and would like to replace a single value (here: d with z) by indices (row, column), resulting in this dataframe:
1 2
0 a b
1 c z
How can I replace a value by indices (here: row idx is 1, column idx is 1) most efficiently (in memory consumption and execution time)?
Use DataFrame.iloc if you want to set values by position (the first position is 0, because Python counts from 0):
df.iloc[1,1] = 'z'
Or, if you want to set values by labels (index and column values), use DataFrame.loc:
df.loc[1,2] = 'z'
If you want to set only one value, it is better to use DataFrame.iat or DataFrame.at:
#by positions
df.iat[1,1] = 'z'
#by labels
df.at[1,2] = 'z'
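For illustration, a minimal sketch putting both single-value setters on the question's data:

import pandas as pd

df = pd.DataFrame(columns=[1, 2], data=[['a', 'b'], ['c', 'd']])

# by position: row 1, column 1 (zero-based) is the 'd' cell
df.iat[1, 1] = 'z'

# by label: index label 1, column label 2 is the same cell
df.at[1, 2] = 'z'

print(df)
#    1  2
# 0  a  b
# 1  c  z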

Modifying multiple columns of data using iteration, but changing increment value for each column

I'm trying to modify multiple column values in pandas DataFrames with a different increment for each column, so that the values in each column do not overlap with each other when graphed on a line graph.
Here's the end goal of what I want to do: link
Let's say I have this kind of Dataframe:
Col1 Col2 Col3
0 0.3 0.2
1 1.1 1.2
2 2.2 2.4
3 3 3.1
but with hundreds of columns and thousands of values.
When graphing this as a line graph in Excel or matplotlib, the values overlap with each other, so I would like to separate the columns by adding a constant offset to each column, like so:
Col1(+0) Col2(+10) Col3(+20)
0 10.3 20.2
1 11.1 21.2
2 12.2 22.4
3 13 23.1
By adding the same value to every row of a column, and increasing that value by 10 from one column to the next, I am able to see each line in one graph without overlap.
I thought of using loops and iterations to automate this value-adding process, but I couldn't find any previous solutions on Stack Overflow that address how to change the increment value between columns (e.g. adding 0 to Col1 in one pass, then 10 to Col2 in the next) while keeping it constant within a column. To make things worse, I'm a beginner with no clue about programming or data manipulation.
Since the data is in a CSV format, I first used Pandas to read it and store in a Dataframe, and selected the columns that I wanted to edit:
import pandas as pd

# import CSV file
df = pd.read_csv('data.csv')
# store csv data into a dataframe
df1 = pd.DataFrame(data=df)
# locate the columns that I want to edit with df.loc
columns = df1.loc[:, ' C000':]
here is where I'm stuck:
# use iteration with increments to add numbers
n = 0
for values in columns:
    values = n + 0
    print(values)
But this for-loop only adds one increment value (in this case 0), and adds it to all columns, not just the first column. Not only that, but I don't know how to add the next increment value for the next column.
Any possible solutions would be greatly appreciated.
IIUC, just use df.add() over axis=1 with a list built from the length of df.columns:
df1 = df.add(list(range(0, len(df.columns)*10))[::10], axis=1)
Or, as @jezrael suggested, better:
df1 = df.add(range(0, len(df.columns)*10, 10), axis=1)
print(df1)
Col1 Col2 Col3
0 0 10.3 20.2
1 1 11.1 21.2
2 2 12.2 22.4
3 3 13.0 23.1
Details:
list(range(0,len(df.columns)*10))[::10]
#[0, 10, 20]
I would recommend avoiding looping over the data frame, as it is inefficient; instead, think of it as adding matrices.
e.g.
import numpy as np
import pandas as pd
# Create your example df
df = pd.DataFrame(data=np.random.randn(10,3))
# Create a Matrix of ones
x = np.ones(df.shape)
# Multiply each column with an incremented value * 10
x = x * 10*np.arange(1,df.shape[1]+1)
# Add the matrix to the data
df + x
Edit: In case you do not want to increment by 10, 20, 30 but by 0, 10, 20, use this instead:
import numpy as np
import pandas as pd
# Create your example df
df = pd.DataFrame(data=np.random.randn(10,3))
# Create a Matrix of ones
x = np.ones(df.shape)
# THIS LINE CHANGED
# Omit the start value 1 so arange begins at the default 0,
# giving offsets 0, 10, 20 instead of 10, 20, 30
x = x * 10*np.arange(df.shape[1])
# Add the matrix to the data
df + x
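As a side note, the explicit matrix of ones is not strictly required: numpy broadcasting adds a per-column offset vector directly to the DataFrame. A minimal sketch of that variant:

import numpy as np
import pandas as pd

df = pd.DataFrame(data=np.random.randn(10, 3))

# one offset per column (0, 10, 20, ...), broadcast across all rows
offsets = 10 * np.arange(df.shape[1])
shifted = df + offsets
print(shifted.head())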

Convert pandas dataframe column containing Excel general numbers into datetime object

I have a dataframe that I constructed by pulling data from SQL using pd.read_sql_query(). I have one column that has dates, but in Excel general number format. How do I convert this column into datetime objects?
I can convert one value with the xlrd library, but I'm looking for the best way to convert the entire column.
datetime_value = datetime(*xlrd.xldate_as_tuple(42369, 0))
You can use map to apply a lambda function performing that operation to every entry in a column:
import pandas as pd
import xlrd
from datetime import datetime

# Create dummy dataframe
df = pd.DataFrame({
    "date": [42369, 42370, 42371, 42372]
})
print(df.to_string())

# Convert values into a new column named "converted"
df["converted"] = df["date"].map(lambda x: datetime(*xlrd.xldate_as_tuple(x, 0)))
print(df.to_string())
Before conversion:
date
0 42369
1 42370
2 42371
3 42372
After:
date converted
0 42369 2015-12-31
1 42370 2016-01-01
2 42371 2016-01-02
3 42372 2016-01-03
Is this what you are looking for?
Update:
To make this work with string entries, you could either tell Pandas to treat the column as ints or floats:
# int
df["converted"] = df["date"].astype(int).map(lambda x: datetime(*xlrd.xldate_as_tuple(x, 0)))
# float
df["converted"] = df["date"].astype(float).map(lambda x: datetime(*xlrd.xldate_as_tuple(x, 0)))
or just cast x to int or float within the lambda function:
# int
df["converted"] = df["date"].map(lambda x: datetime(*xlrd.xldate_as_tuple(int(x), 0)))
# float
df["converted"] = df["date"].map(lambda x: datetime(*xlrd.xldate_as_tuple(float(x), 0)))
