concatenate column values in a loop - python-3.x

I have a csv file with two columns:
col1  | col2
----- | -----
link1 | unix=number1
link2 | unix=number2
link3 | unix=number3
What I need:
I need to concatenate each value in col1 with each value in col2 to have the following result:
col1  | col2         | col3
----- | ------------ | -----------------
link1 | unix=number1 | link1unix=number1
link2 | unix=number2 | link1unix=number2
link3 | unix=number3 | link1unix=number3
link4 | NaN          | link2unix=number1
link5 | NaN          | link2unix=number2
link6 | NaN          | link2unix=number3
etc ..
This is my code and it is not working:
i = 0
while True:
    df = pd.read_csv('file.csv', skiprows=lambda x: x in range(0, i))
    for i, row in df.iterrows():
        row = df['col1'] + df['col2']
    i += 1
Please help

Use:
import itertools

df['col3'] = [''.join(i) for i in itertools.product(df['col1'], df['col2'])]
EDIT: If there are more combinations than rows, extend the index first so the new column fits:
l = [''.join(i) for i in itertools.product(df.col1, df.col2)]
df = df.reindex(range(len(l)))
df['col3'] = l
print(df)
col1 col2 col3
0 a x ax
1 b y ay
2 c z az
3 NaN NaN bx
4 NaN NaN by
5 NaN NaN bz
6 NaN NaN cx
7 NaN NaN cy
8 NaN NaN cz
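Applied to the question's own data, a minimal sketch (the sample values are assumed from the tables above rather than read from a real file.csv):
import itertools
import pandas as pd

df = pd.DataFrame({
    'col1': ['link1', 'link2', 'link3'],
    'col2': ['unix=number1', 'unix=number2', 'unix=number3'],
})

# All 9 combinations of col1 x col2, joined into single strings.
pairs = [''.join(p) for p in itertools.product(df['col1'], df['col2'])]

# Pad the frame with NaN rows so every combination gets its own row.
out = df.reindex(range(len(pairs)))
out['col3'] = pairs
print(out)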

Convert one dataframe's format and check if each row exists in another dataframe in Python

Given a small dataset df1 as follows:
city year quarter
0 sh 2019 q4
1 bj 2020 q3
2 bj 2020 q2
3 sh 2020 q4
4 sh 2020 q1
5 bj 2021 q1
I would like to create a quarterly date range from 2019-q2 to 2021-q1 as column names, then check whether each city's year and quarter combination from df1 exists among those columns in df2.
If it exists, return y for that cell; otherwise, return NaN.
The final result will look like:
city 2019-q2 2019-q3 2019-q4 2020-q1 2020-q2 2020-q3 2020-q4 2021-q1
0 bj NaN NaN NaN NaN y y NaN y
1 sh NaN NaN y y NaN NaN y NaN
To create column names for df2:
pd.date_range('2019-04-01', '2021-04-01', freq = 'Q').to_period('Q')
How could I achieve this in Python? Thanks.
We can use crosstab on city and the string concatenation of the year and quarter columns:
new_df = pd.crosstab(df['city'], df['year'].astype(str) + '-' + df['quarter'])
new_df:
col_0 2019-q4 2020-q1 2020-q2 2020-q3 2020-q4 2021-q1
city
bj 0 0 1 1 0 1
sh 1 1 0 0 1 0
We can convert the counts to bool, replace False and True with the desired values, reindex to add the missing columns, and clean up the axes and index to get the exact output:
col_names = pd.date_range('2019-01-01', '2021-04-01', freq='Q').to_period('Q')
new_df = (
    pd.crosstab(df['city'], df['year'].astype(str) + '-' + df['quarter'])
      .astype(bool)                                    # Counts to boolean
      .replace({False: np.NaN, True: 'y'})             # Fill values
      .reindex(columns=col_names.strftime('%Y-q%q'))   # Add missing columns
      .rename_axis(columns=None)                       # Clean up axis name
      .reset_index()                                   # Reset index
)
new_df:
city 2019-q1 2019-q2 2019-q3 2019-q4 2020-q1 2020-q2 2020-q3 2020-q4 2021-q1
0 bj NaN NaN NaN NaN NaN y y NaN y
1 sh NaN NaN NaN y y NaN NaN y NaN
DataFrame and imports:
import numpy as np
import pandas as pd
df = pd.DataFrame({
    'city': ['sh', 'bj', 'bj', 'sh', 'sh', 'bj'],
    'year': [2019, 2020, 2020, 2020, 2020, 2021],
    'quarter': ['q4', 'q3', 'q2', 'q4', 'q1', 'q1']
})
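For comparison, a hedged alternative sketch (not from the original answer) that builds the same table with drop_duplicates and pivot instead of crosstab, reusing the df and col_names defined above:
alt = (
    df.assign(period=df['year'].astype(str) + '-' + df['quarter'], flag='y')
      .drop_duplicates(['city', 'period'])             # pivot requires unique city/period pairs
      .pivot(index='city', columns='period', values='flag')
      .reindex(columns=col_names.strftime('%Y-q%q'))   # add the missing quarters as NaN
      .rename_axis(columns=None)
      .reset_index()
)
print(alt)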

calculate difference between consecutive date records at an ID level

I have a dataframe as follows:
col 1 col 2
A 2020-07-13
A 2020-07-15
A 2020-07-18
A 2020-07-19
B 2020-07-13
B 2020-07-19
C 2020-07-13
C 2020-07-18
I want it to become the following in a new dataframe
col_3 diff_btw_1st_2nd_date diff_btw_2nd_3rd_date diff_btw_3rd_4th_date
A 2 3 1
B 6 NaN NaN
C 5 NaN NaN
I tried a groupby at the col 1 level, but I am not getting the intended result. Can anyone help?
Use GroupBy.cumcount for a counter per group of col 1 and reshape with DataFrame.set_index and Series.unstack, then use DataFrame.diff, remove the first all-NaN column with DataFrame.iloc, convert the timedeltas to days with Series.dt.days for all columns, and rename the columns with DataFrame.add_prefix:
df['col 2'] = pd.to_datetime(df['col 2'])
df = (df.set_index(['col 1', df.groupby('col 1').cumcount()])['col 2']
        .unstack()
        .diff(axis=1)
        .iloc[:, 1:]
        .apply(lambda x: x.dt.days)
        .add_prefix('diff_')
        .reset_index())
print (df)
col 1 diff_1 diff_2 diff_3
0 A 2 3.0 1.0
1 B 6 NaN NaN
2 C 5 NaN NaN
Or use DataFrameGroupBy.diff with a counter for new columns via DataFrame.assign, reshape with DataFrame.pivot, and remove the NaNs in c1 with DataFrame.dropna:
df['col 2'] = pd.to_datetime(df['col 2'])
df = (df.assign(g=df.groupby('col 1').cumcount(),
                c1=df.groupby('col 1')['col 2'].diff().dt.days)
        .dropna(subset=['c1'])
        .pivot(index='col 1', columns='g', values='c1')
        .add_prefix('diff_')
        .rename_axis(None, axis=1)
        .reset_index())
print (df)
col 1 diff_1 diff_2 diff_3
0 A 2.0 3.0 1.0
1 B 6.0 NaN NaN
2 C 5.0 NaN NaN
You can assign a cumcount number grouped by col 1, and pivot the table using that cumcount number.
Solution
df["col 2"] = pd.to_datetime(df["col 2"])
# 1. compute date difference in days using diff() and dt accessor
df["diff"] = df.groupby(["col 1"])["col 2"].diff().dt.days
# 2. assign cumcount for pivoting
df["cumcount"] = df.groupby("col 1").cumcount()
# 3. partial transpose, discarding the first difference, which is NaN
df2 = (df[["col 1", "diff", "cumcount"]]
       .pivot(index="col 1", columns="cumcount")
       .drop(columns=[("diff", 0)]))
Result
# replace column names for readability
df2.columns = [f"d{i+2}-d{i+1}" for i in range(len(df2.columns))]
print(df2)
d2-d1 d3-d2 d4-d3
col 1
A 2.0 3.0 1.0
B 6.0 NaN NaN
C 5.0 NaN NaN
df after assigning cumcount looks like this:
print(df)
col 1 col 2 diff cumcount
0 A 2020-07-13 NaN 0
1 A 2020-07-15 2.0 1
2 A 2020-07-18 3.0 2
3 A 2020-07-19 1.0 3
4 B 2020-07-13 NaN 0
5 B 2020-07-19 6.0 1
6 C 2020-07-13 NaN 0
7 C 2020-07-18 5.0 1

How does the reindex_like function work with method "ffill" & "bfill"?

I have two dataframes of shape (6,3) and (2,3). Now I want to reindex the second dataframe like the first dataframe and also fill the NaN values with either the ffill or bfill method. My code is as follows:
df1 = pd.DataFrame(np.random.randn(6,3),columns = ['Col1','Col2','Col3'])
df2 = pd.DataFrame(np.random.randn(2,3),columns = ['Col1','Col2','Col3'])
df2 = df2.reindex_like(df1,method='ffill')
But this code is not working as expected; I am getting the following result:
Col1 Col2 Col3
0 0.578282 -0.199872 0.468505
1 1.086811 -0.707933 -0.924984
2 NaN NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN
5 NaN NaN NaN
Any suggestion would be great
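One way to get the forward-filled result, as a sketch (a suggestion, not from the original thread): reindex to match df1 first, which introduces the NaN rows, then fill them explicitly:
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.randn(6, 3), columns=['Col1', 'Col2', 'Col3'])
df2 = pd.DataFrame(np.random.randn(2, 3), columns=['Col1', 'Col2', 'Col3'])

# reindex_like pads rows 2-5 with NaN; ffill() then copies the last
# available row of df2 downwards into those rows.
df2_filled = df2.reindex_like(df1).ffill()
print(df2_filled)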

how to separate rows in pandas which are purely numeric from those which are not?

I have a Pandas dataframe column which has data in rows such as below:
col1
abc
ab23
2345
fgh67#
8980
I need to create 2 more columns, col2 and col3, as shown below:
col2 col3
abc 2345
ab23 8980
fgh67#
I have used str.isnumeric(), but that's not helping me on a dataframe column. Can someone kindly help?
Use str.isnumeric, or to_numeric with a check for non-NaN values, to build a boolean mask and filter by boolean indexing:
m = df['col1'].str.isnumeric()
#alternative
#m = pd.to_numeric(df['col1'], errors='coerce').notnull()
df = pd.concat([df.loc[~m, 'col1'].reset_index(drop=True),
                df.loc[m, 'col1'].reset_index(drop=True)], axis=1, keys=('col2', 'col3'))
print (df)
col2 col3
0 abc 2345
1 ab23 8980
2 fgh67# NaN
If you want to add new columns to the existing DataFrame, aligned by index:
df['col2'] = df.loc[~m, 'col1']
df['col3'] = df.loc[m, 'col1']
print (df)
col1 col2 col3
0 abc abc NaN
1 ab23 ab23 NaN
2 2345 NaN 2345
3 fgh67# fgh67# NaN
4 8980 NaN 8980
Or without alignment:
df['col2'] = df.loc[~m, 'col1'].reset_index(drop=True)
df['col3'] = df.loc[m, 'col1'].reset_index(drop=True)
print (df)
col1 col2 col3
0 abc abc 2345
1 ab23 ab23 8980
2 2345 fgh67# NaN
3 fgh67# NaN NaN
4 8980 NaN NaN

Pandas Use Value if Not Null, Else Use Value From Next Column

Given the following dataframe:
import numpy as np
import pandas as pd

df = pd.DataFrame({'COL1': ['A', np.nan, 'A'],
                   'COL2': [np.nan, 'A', 'A']})
df
COL1 COL2
0 A NaN
1 NaN A
2 A A
I would like to create a column ('COL3') that uses the value from COL1 per row unless that value is null (or NaN). If the value is null (or NaN), I'd like for it to use the value from COL2.
The desired result is:
COL1 COL2 COL3
0 A NaN A
1 NaN A A
2 A A A
Thanks in advance!
In [8]: df
Out[8]:
COL1 COL2
0 A NaN
1 NaN B
2 A B
In [9]: df["COL3"] = df["COL1"].fillna(df["COL2"])
In [10]: df
Out[10]:
COL1 COL2 COL3
0 A NaN A
1 NaN B B
2 A B A
You can use np.where to conditionally set column values.
df = df.assign(COL3=np.where(df.COL1.isnull(), df.COL2, df.COL1))
>>> df
COL1 COL2 COL3
0 A NaN A
1 NaN A A
2 A A A
If you don't mind mutating the values in COL2, you can update them directly to get your desired result.
df = pd.DataFrame({'COL1': ['A', np.nan, 'A'],
                   'COL2': [np.nan, 'B', 'B']})
>>> df
COL1 COL2
0 A NaN
1 NaN B
2 A B
df.COL2.update(df.COL1)
>>> df
COL1 COL2
0 A A
1 NaN B
2 A A
Using .combine_first, which gives precedence to non-null values in the Series or DataFrame calling it:
import pandas as pd
import numpy as np
df = pd.DataFrame({'COL1': ['A', np.nan, 'A'],
                   'COL2': [np.nan, 'B', 'B']})
df['COL3'] = df.COL1.combine_first(df.COL2)
Output:
COL1 COL2 COL3
0 A NaN A
1 NaN B B
2 A B A
If we modify your df slightly, you will see that this works, and in fact it will work for any number of columns so long as there is at least one valid value per row:
In [5]:
df = pd.DataFrame({'COL1': ['B', np.nan, 'B'],
                   'COL2': [np.nan, 'A', 'A']})
df
Out[5]:
COL1 COL2
0 B NaN
1 NaN A
2 B A
In [6]:
df.apply(lambda x: x[x.first_valid_index()], axis=1)
Out[6]:
0 B
1 A
2 B
dtype: object
first_valid_index will return the index value (in this case column) that contains the first non-NaN value:
In [7]:
df.apply(lambda x: x.first_valid_index(), axis=1)
Out[7]:
0 COL1
1 COL2
2 COL1
dtype: object
So we can use this to index into the Series.
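As a small follow-up sketch (an addition, not part of the original answer), the same idea assigned to COL3 as in the question; it assumes every row has at least one non-NaN value:
df['COL3'] = df.apply(lambda x: x[x.first_valid_index()], axis=1)
print(df)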
You can also use mask, which replaces the values where COL1 is NaN with the values from COL2:
In [8]: df.assign(COL3=df['COL1'].mask(df['COL1'].isna(), df['COL2']))
Out[8]:
COL1 COL2 COL3
0 A NaN A
1 NaN A A
2 A A A
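A closely related sketch (an addition): Series.where is the mirror image of mask, keeping COL1 where it is not NaN and taking COL2 otherwise:
df['COL3'] = df['COL1'].where(df['COL1'].notna(), df['COL2'])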
