Why does merging unequally matched rows not work on local dataset? - python-3.x

I have a pandas dataframe with questions (type = 1) and answers (type = 2). The columns section_id and type are integers; all other columns are strings. I want to merge the "answer rows" with their corresponding "question rows" (equal values in section_id) and append some of the answer rows' values as extra columns (Ans, ans_t) to their corresponding "question rows".
import pandas as pd
import numpy as np

c = ['pos', 'Ans', 'Q_ID', 'leg', 'que_l', 'ans_l', 'par', 'ans_f', 'que_date', 'ask', 'M_ID', 'part', 'area', 'que_t', 'ans_t', 'ISO', 'con', 'id', 'section_id', 'type', 'dep', 'off']
d = [[None, None, '16-17/1/2017-11-15/1', '16-17', '14.0', None, 'aaa', 'hhh', '2016-11-20', 'Peter Muller', '41749', 'bbb', 'ccc', 'ddd', None, 'eee', 'fff', '111865.q2', 24339851, 1, None, None],
[None, None, '16-17/24/17-11-09/1', '16-17', '28.0', None, 'aaa', 'hhh', '2016-11-20', 'Peter Muller', '41749', 'bbb', 'ccc', 'ppp', None, 'eee', 'fff', '111867.q1', 24339851, 1, None, None],
[None, None, '16-17/73/17-10-09/1', '16-17', '69.0', None, 'aaa', 'hhh', '2016-11-20', 'Peter Muller', '41749', 'bbb', 'ccc', 'lll', None, 'eee', 'fff', '111863.r0', 24339851, 1, None, None],
['erg', 'wer', '16-17/42/16-10-09/1', '16-17', None, 67.0, 'aaa', 'hhh', '2016-11-20', None, '46753', 'bbb', 'ccc', None, 'ttt', 'eee', 'asd', '111863.r0', 24339851, 2, None, None],
[None, None, '16-17/12/16-12-08/1', '16-17', '37.0', None, 'aaa', 'hhh', '2016-10-10', 'Peter Muller', '41749', 'bbb', 'qqq', 'rrr', None, 'eee', 'fff', '108143.r0', 24303320, 1, None, None],
['erg', 'wer', '16-17/12/16-12-07/1', '16-17', None, 64.0, 'aaa', 'hhh', '2016-10-10', None, '46753', 'bbb', 'qqq', None, 'uuu', 'eee', 'asd', '108143.r0', 24303320, 2, None, None],
[None, None, '16-17/77/16-12-04/1', '16-17', '46.0', None, 'aaa', 'hhh', '2016-10-08', 'Markus John', '34567', 'ztr', 'yyy', 'nnn', None, 'eee', 'www', '127193.q0', 10343145, 1, None, None],
['qwe', 'wer', '16-17/37/17-11-07/1', '16-17', None, 60.0, 'aaa', 'hhh', '2016-12-12', None, '19745', 'bbb', 'gtt', None, 'ooo', 'eee', 'asd', '906213.r0', 23222978, 2, None, None]]
data = pd.DataFrame(d,columns=c)
data.loc[data['type'] == 2, 'Ans.1'] = data['Ans']
data.loc[data['type'] == 2, 'ans_t.1'] = data['ans_t']
my_cols = ['que_t', 'ans_t', 'Ans', 'Ans.1', 'ans_t.1']
data[my_cols] = data.sort_values(['section_id','type']).groupby('section_id')[my_cols].transform(lambda x: x.bfill())
data.dropna(subset=['que_t'],inplace=True)
data.reset_index(drop=True,inplace=True)
print(data)
The code works fine on the minimal reproducible example. Unfortunately the dataset is too large to account for every detail, which is why this example may not necessarily be representative.
Problem: When I run the code on the actual dataset, nothing gets merged, even though I manually checked for section_id duplicates.
Before executing the code, I remove empty cells from the dataset:
data = data.where(pd.notnull(data), None)
data.replace(r'^\s+$', np.nan, regex=True, inplace=True)
which doesn't solve the problem.
Question: How do I need to adjust my code to account for details (e.g. encoding, formats, ...) in the dataset that could cause it not to merge?
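A minimal diagnostic sketch (my own suggestion, assuming the real data is already loaded into data): before merging, check that section_id holds a single type with no hidden whitespace, since a string '24339851' never matches the integer 24339851:

# inspect what the merge key actually contains
print(data['section_id'].dtype)
print(data['section_id'].map(type).value_counts())

# normalize the key: cast everything to one type and strip whitespace
data['section_id'] = data['section_id'].astype(str).str.strip()

# question/answer pairs should now show up as duplicated section_ids
print(data['section_id'].duplicated(keep=False).sum())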
Appendix:
Someone told me to remove data from the dataset gradually, checking each time that the test case is still reproducible. If some removal results in the test case no longer reproducing, reinstate it and remove something else instead. When there's absolutely nothing left that can be removed, you have your minimal dataset. A rough sketch of that bisection idea is below.
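The sketch encodes "still reproducible" as a hypothetical reproduces() predicate (section_id duplicates exist but nothing merges):

def reproduces(df):
    # hypothetical predicate: True if the bug still shows up on this subset
    dup = df['section_id'].duplicated(keep=False)
    merged = df[df['type'] == 1].merge(df[df['type'] == 2], on='section_id')
    return dup.any() and merged.empty

subset = data
while len(subset) > 1:
    half1 = subset.iloc[:len(subset) // 2]
    half2 = subset.iloc[len(subset) // 2:]
    if reproduces(half1):
        subset = half1
    elif reproduces(half2):
        subset = half2
    else:
        break  # the bug needs rows from both halves; stop shrinking
print(subset)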
Someone else said I should apply a parsing function to the data, which didn't help:
def parse(x):
    try:
        return int(x)
    except ValueError:
        return np.nan

data['que_t'] = data['que_t'].apply(parse)
data['ans_t'] = data['ans_t'].apply(parse)
data.dtypes
Or should I search for non-number strings and replace them with NaN?
import re
replaced_with_nan = data['col_name'].replace(re.compile(r'\D+'), np.nan)
data['col_name'] = replaced_with_nan.astype(float)
Here is another approach which, like the answer from Andrej Kesely, returns an empty dataframe when used on the actual dataset:
df1 = data.loc[df.type == 1].copy()
df2 = data.loc[df.type == 2].copy()
merged_df = pd.merge(df1, df2, on='section_id', how='outer')
merged_df = merged_df.loc[:,['section_id','que_t_x','ans_t_y','Ans_x','Ans_y']]
merged_df.rename(columns={'que_t_x':'que_t','ans_t_y':'ans_t','Ans_x':'Ans','Ans_y':'Ans.1'}, inplace=True)

If I understand you correctly, you can filter the dataframe and do .merge:
x = (
    df[df["que/ans"] == 1]
    .merge(
        df[df["que/ans"] == 2],
        on="section_id",
        how="outer",
        suffixes=("_que", "_ans"),
    )
    .drop(columns=["ans_t_que", "name_que", "ans_len_que", "que_t_ans"])
)
print(x)
Prints:
   que/ans_que  section_id que_t_que  date_que part_que  que/ans_ans ans_t_ans name_ans  date_ans part_ans  ans_len_ans
0            1         444    qtext1       456      bbb          2.0    atext2   Markus     654.0      eee         64.0
1            1         444    qtext3       987      ddd          2.0    atext2   Markus     654.0      eee         64.0
2            1         123    qtext2       789      ccc          2.0    atext1     Alex     123.0      aaa         78.0
3            1         555    qtext4       321      fff          NaN       NaN      NaN       NaN      NaN          NaN

If you are reading your data from CSV or Excel, I would recommend defining the dtype while reading. This ensures that the keys you use to merge do not suffer any data loss.
Example:
section_id = 00001234
After reading from csv, it could just be 1234.
df = pd.read_csv(filename, dtype={'section_id': str})
Hope this will solve your merging issue.
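A quick sketch of that failure mode (my own illustration, using io.StringIO to stand in for a real file):

import io
import pandas as pd

csv = "section_id,type\n00001234,1\n00001234,2\n"

# without dtype, the key is parsed as an integer and the leading zeros are gone
print(pd.read_csv(io.StringIO(csv))['section_id'].tolist())
# [1234, 1234]

# with dtype, the key survives exactly as written
print(pd.read_csv(io.StringIO(csv), dtype={'section_id': str})['section_id'].tolist())
# ['00001234', '00001234']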

Your last solution following Andrej's answer seems to be working. However, some context is missing regarding the df variable. In addition, all strings representing emptiness should be replaced with np.nan. Thus I rewrote it to use data as follows:
data.replace(['', 'None', 'nan', None], np.nan, inplace=True)
df1 = data.loc[data.type == 1].copy()
df2 = data.loc[data.type == 2].copy()
merged_df = pd.merge(df1, df2, on='section_id', how='outer')
merged_df = merged_df[['section_id','que_t_x','ans_t_y','Ans_x','Ans_y']]
merged_df.rename(columns={'que_t_x':'que_t','ans_t_y':'ans_t','Ans_x':'Ans','Ans_y':'Ans.1'}, inplace=True)
print(merged_df)

Related

Issue in getting the comments from YAML using ruamel.yaml

Code:
import ruamel.yaml
yaml_str = """\
# comments start
a: 52
# comment for a
b: 50
# comment for b
c: 50
# comment for c
d:
    # comment for d
   e: 60
   # comment for e
   f: 70
   # comment for f
"""
yaml = ruamel.yaml.YAML()
data = yaml.load(yaml_str)
print(data.ca.comment)
print(data.ca.items)
Output:
[None, [CommentToken('# comments start\n', line: 0, col: 0)]]
{'a': [None, None, CommentToken('\n# comment for a\n', line: 2, col: 0), None], 'b': [None, None, CommentToken('\n# comment for b\n', line: 4, col: 0), None], 'c': [None, None, CommentToken('\n# comment for c\n', line: 6, col: 0), None], 'd': [None, None, None, [CommentToken('# comment for d\n', line: 8, col: 4)]]}
Question:
Why isn't it showing comments pertaining to the keys e and f?
What is the correct way to retrieve the comments based on a key? For example, how do I get the comment for the key e (# comment for e)?
In ruamel.yaml most comments are attached to the dict-like (or list-like) structure containing the key (or element) after which the comment occurred.
To get to the comments following keys e and f, you need to look at the dict that is the value for d:
print(data['d'].ca.items)
print('comment post comment for "e":', repr(data['d'].ca.get('e', 2).value))
which gives:
{'e': [None, None, CommentToken('\n # comment for e\n', line: 10, col: 3), None], 'f': [None, None, CommentToken('\n # comment for f\n', line: 12, col: 3), None]}
comment post comment for "e": '\n # comment for e\n'
Please note that the comment for e starts with a newline, indicating there is no end-of-line comment.
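A small sketch of the contrast (my own example, not from the question): with an end-of-line comment, the token value has no leading newline:

import ruamel.yaml

yaml = ruamel.yaml.YAML()
data = yaml.load("a: 52  # answer\n")
# slot 2 of the per-key comment list holds the comment following the value
print(repr(data.ca.items['a'][2].value))
# '# answer\n' -- no leading newline, so this is an end-of-line comment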

Python replace None with blank with tuple data set by keeping the field name linkage

I currently have the dataset below, which is a row iterator whose values are in tuples. I'm trying to formulate code that replaces None with a blank while still accessing tuple values by field name. Any thoughts on how I can do this?
current data set:
print(data)
Row(('Robert', 'Hoit', None, None, 'TX'), {'fname': 0, 'lname': 1, 'Age': 2, 'Gender': 3, 'State': 4})
Row(('James', 'Burns', 34, 'M', 'CA'), {'fname': 0, 'lname': 1, 'Age': 2, 'Gender': 3, 'State': 4})
Row(('Matt', 'Dan', 45, None, 'NY'), {'fname': 0, 'lname': 1, 'Age': 2, 'Gender': 3, 'State': 4})
The approach I took to replace None was to convert the tuple to a list, perform the replacement, and convert back to a tuple, but in doing so the field-name linkage got lost. How do I replace without dropping the field-name linkage, as I want to use row.fname and row.lname calls in a downstream process?
for row in data:
    a = list(row)
    c = ['' if x is None else x for x in a]
    d = tuple(c)
    print(d.fname)
    print(d.age)
Here I'm assuming that data is a tuple that contains tuples and dictionaries. If so, then:
def rem_none(row):
    if type(row) != dict:
        row = list(row)
        for i in range(len(row)):
            if row[i] is None:
                row[i] = ''
        row = tuple(row)
        return row
    return row

for row in data:
    print(rem_none(row))
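If the goal is to keep row.fname working after the replacement, a hedged alternative (assuming the field-to-index mapping shown in the question) is to rebuild each row as a collections.namedtuple:

from collections import namedtuple

fields = {'fname': 0, 'lname': 1, 'Age': 2, 'Gender': 3, 'State': 4}
Person = namedtuple('Person', sorted(fields, key=fields.get))

row = ('Robert', 'Hoit', None, None, 'TX')
cleaned = Person(*('' if x is None else x for x in row))
print(cleaned.fname)  # 'Robert'
print(cleaned.Age)    # ''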

How to iterate through 2d list and alternate items with each inner list?

Given a 2-dimensional list
twoDList = [['a1','a2','a3','a4','a5','a6','a7'],['b1','b2','b3','b4'],['c1','c2','c3','c4','c5','c6','c7','c8','c9'],['d1','d2','d3','d4','d5','d6'],['e1','e2','e3']]
How can I iterate through this 2d array as such?
answer = alternate(twoDList)
print(answer)
'[a1,b1,c1,d1,e1,a2,b2,c2,d2,e2,a3,b3,c3,d3,e3,a4,b4,c4,d4,a5,c5,d5,a6,c6,d6,a7,c7,c8,c9]'
I tried with this code:
import itertools

def alternateShoes(twodShoes):
    numOfBrands = len(twodShoes)
    brandCount = 0
    shoecount = 0
    masterList = []
    if numOfBrands != 0:
        for shoes in itertools.cycle(twodShoes):
            if (brandCount == numOfBrands):
                masterList.append(shoes[shoecount])
                brandCount = 0
                shoecount = shoecount + 1
            else:
                masterList.append(shoes[shoecount])
                brandCount = brandCount + 1
    return masterList
But I am stuck because each inner list can have a different length. Note that there can be any number of inner lists (0 or more).
There's also a useful function in itertools:
from itertools import zip_longest
zipped = list(zip_longest(*twoDList))
That gives you a layout of:
[('a1', 'b1', 'c1', 'd1', 'e1'),
('a2', 'b2', 'c2', 'd2', 'e2'),
('a3', 'b3', 'c3', 'd3', 'e3'),
('a4', 'b4', 'c4', 'd4', None),
('a5', None, 'c5', 'd5', None),
('a6', None, 'c6', 'd6', None),
('a7', None, 'c7', None, None),
(None, None, 'c8', None, None),
(None, None, 'c9', None, None)]
So then just stick them together, ignoring the Nones:
result = [x for y in zipped for x in y if x is not None]
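One caveat worth noting (my addition, not part of the original answer): if None can be a legitimate element of the inner lists, a unique sentinel as fillvalue avoids dropping real values:

from itertools import zip_longest

SENTINEL = object()  # unique placeholder that cannot collide with real data
zipped = zip_longest(*twoDList, fillvalue=SENTINEL)
result = [x for y in zipped for x in y if x is not SENTINEL]
print(result)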
This is how I would do it:
def mergeLists(inlst):
    rslt = []
    lstlens = []
    for l in inlst:
        lstlens.append(len(l))
    # default=0 handles the edge case of zero inner lists
    mxlen = max(lstlens, default=0)
    for i in range(mxlen):
        for k in range(len(inlst)):
            if i < lstlens[k]:
                rslt.append(inlst[k][i])
    return rslt
So, given the input twoDList as defined in your question, running:
print(mergeLists(twoDList))
yields:
['a1', 'b1', 'c1', 'd1', 'e1', 'a2', 'b2', 'c2', 'd2', 'e2', 'a3', 'b3', 'c3', 'd3', 'e3', 'a4', 'b4', 'c4', 'd4', 'a5', 'c5', 'd5', 'a6', 'c6', 'd6', 'a7', 'c7', 'c8', 'c9']

Ffill and interpolate koalas dataframe

Is it possible to interpolate and ffill different columns in a Koalas dataframe, something like this?
%%spark -s sparkenv2
import databricks.koalas as ks

kdf = ks.DataFrame({
    'id': [1, 2, 3, 4],
    'A': [None, 3, None, None],
    'B': [2, 4, None, 3],
    'C': [99, None, None, 1],
    'D': [0, 1, 5, 4]
    },
    columns=['id', 'A', 'B', 'C', 'D'])
kdf['A'] = kdf['A'].ffill()
kdf['B'] = kdf['B'].interpolate()
For ffill, this is taken from John Paton's blog:
import sys

from pyspark.sql import Window
from pyspark.sql.functions import last

spark_df = kdf.to_spark()

# define the window
window = Window.orderBy('id').rowsBetween(-sys.maxsize, 0)

# define the forward-filled column
filled_column = last(spark_df['A'], ignorenulls=True).over(window)

# do the fill
spark_df_filled = spark_df.withColumn('A_filled', filled_column)
I have no answer for interpolate - still trying to find it myself.
PS: You can switch to backfill by changing to rowsBetween(0, sys.maxsize) and using first() rather than last().
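A minimal sketch of that backfill variant, under the same assumptions as the forward fill above:

from pyspark.sql import Window
from pyspark.sql.functions import first

# mirror image of the forward fill: look from the current row to the end
window = Window.orderBy('id').rowsBetween(0, sys.maxsize)
backfilled_column = first(spark_df['A'], ignorenulls=True).over(window)
spark_df_backfilled = spark_df.withColumn('A_backfilled', backfilled_column)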

Initializing matrix python3

I don't know whether this is a bug, or whether I got the wrong semantic meaning of the * token on lists:
>>> arr = [None] * 5 # Initialize array of 5 'None' items
>>> arr
[None, None, None, None, None]
>>> arr[2] = "banana"
>>> arr
[None, None, 'banana', None, None]
>>> # right?
...
>>> mx = [ [None] * 3 ] * 2 # initialize a 3x2 matrix with 'None' items
>>> mx
[[None, None, None], [None, None, None]]
>>> # so far, so good, but then:
...
>>> mx[0][0] = "banana"
>>> mx
[['banana', None, None], ['banana', None, None]]
>>> # Huh?
Is this a bug, or did I get the wrong semantic meaning of the __mul__ method?
You're copying the same reference to the list multiple times. Do it like this:
matrix = [[None]*3 for i in range(2)]
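A quick sketch showing the difference (my illustration): the comprehension builds two distinct inner lists, so mutating one row no longer touches the other:

mx_bad = [[None] * 3] * 2
mx_good = [[None] * 3 for i in range(2)]

print(mx_bad[0] is mx_bad[1])    # True  -- both rows are the same list object
print(mx_good[0] is mx_good[1])  # False -- independent rows

mx_good[0][0] = "banana"
print(mx_good)                   # [['banana', None, None], [None, None, None]]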
