Why does merging unequally matched rows not work on local dataset? - python-3.x
I have a pandas dataframe with questions (type = 1) and answers (type = 2). The columns section_id and type are integers; all other columns are strings. I want to match each answer row with its corresponding question row (equal values in section_id) and append some of the answer row's values as extra columns (Ans, ans_t) to the question row.
import pandas as pd
import numpy as np

c = ['pos', 'Ans', 'Q_ID', 'leg', 'que_l', 'ans_l', 'par', 'ans_f', 'que_date', 'ask', 'M_ID', 'part', 'area', 'que_t', 'ans_t', 'ISO', 'con', 'id', 'section_id', 'type', 'dep', 'off']
d = [[None, None, '16-17/1/2017-11-15/1', '16-17', '14.0', None, 'aaa', 'hhh', '2016-11-20', 'Peter Muller', '41749', 'bbb', 'ccc', 'ddd', None, 'eee', 'fff', '111865.q2', 24339851, 1, None, None],
[None, None, '16-17/24/17-11-09/1', '16-17', '28.0', None, 'aaa', 'hhh', '2016-11-20', 'Peter Muller', '41749', 'bbb', 'ccc', 'ppp', None, 'eee', 'fff', '111867.q1', 24339851, 1, None, None],
[None, None, '16-17/73/17-10-09/1', '16-17', '69.0', None, 'aaa', 'hhh', '2016-11-20', 'Peter Muller', '41749', 'bbb', 'ccc', 'lll', None, 'eee', 'fff', '111863.r0', 24339851, 1, None, None],
['erg', 'wer', '16-17/42/16-10-09/1', '16-17', None, 67.0, 'aaa', 'hhh', '2016-11-20', None, '46753', 'bbb', 'ccc', None, 'ttt', 'eee', 'asd', '111863.r0', 24339851, 2, None, None],
[None, None, '16-17/12/16-12-08/1', '16-17', '37.0', None, 'aaa', 'hhh', '2016-10-10', 'Peter Muller', '41749', 'bbb', 'qqq', 'rrr', None, 'eee', 'fff', '108143.r0', 24303320, 1, None, None],
['erg', 'wer', '16-17/12/16-12-07/1', '16-17', None, 64.0, 'aaa', 'hhh', '2016-10-10', None, '46753', 'bbb', 'qqq', None, 'uuu', 'eee', 'asd', '108143.r0', 24303320, 2, None, None],
[None, None, '16-17/77/16-12-04/1', '16-17', '46.0', None, 'aaa', 'hhh', '2016-10-08', 'Markus John', '34567', 'ztr', 'yyy', 'nnn', None, 'eee', 'www', '127193.q0', 10343145, 1, None, None],
['qwe', 'wer', '16-17/37/17-11-07/1', '16-17', None, 60.0, 'aaa', 'hhh', '2016-12-12', None, '19745', 'bbb', 'gtt', None, 'ooo', 'eee', 'asd', '906213.r0', 23222978, 2, None, None]]
data = pd.DataFrame(d, columns=c)

# Copy the answer values into extra columns on the answer rows.
data.loc[data['type'] == 2, 'Ans.1'] = data['Ans']
data.loc[data['type'] == 2, 'ans_t.1'] = data['ans_t']

# Within each section_id, backfill the answer values onto the question row.
my_cols = ['que_t', 'ans_t', 'Ans', 'Ans.1', 'ans_t.1']
data[my_cols] = data.sort_values(['section_id', 'type']).groupby('section_id')[my_cols].transform(lambda x: x.bfill())

# Answer rows still have que_t == NaN, so this keeps only the question rows.
data.dropna(subset=['que_t'], inplace=True)
data.reset_index(drop=True, inplace=True)
print(data)
The code works fine on this minimal reproducible example. Unfortunately, the actual dataset is too large to account for every detail, so the example may not be fully representative.
Problem: When I run the code on the actual dataset, nothing gets merged, even though I manually checked that duplicate section_id values exist.
Before executing the code, I remove empty cells from the dataset:

data = data.where(pd.notnull(data), None)
data.replace(r'^\s+$', np.nan, regex=True, inplace=True)

which doesn't solve the problem.
Question: How do I need to adjust my code to account for details in the dataset (e.g. encoding, formats, ...) that could cause it not to merge?
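A minimal diagnostic sketch for narrowing this down, assuming data is already loaded. It checks the merge key for mixed types, hidden whitespace, and whether any section_id actually occurs with both type values:

import pandas as pd

# A mixed object column (e.g. ints and strings side by side) groups each
# representation separately and silently prevents the backfill from matching.
print(data['section_id'].map(type).value_counts())

# Hidden whitespace or invisible characters around string keys.
if data['section_id'].dtype == object:
    stripped = data['section_id'].astype(str).str.strip()
    print((stripped != data['section_id'].astype(str)).sum(), 'values carry whitespace')

# How many section_ids actually have both a question and an answer row.
both = data.groupby('section_id')['type'].nunique()
print((both > 1).sum(), 'section_ids occur with both type values')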
Appendix:
Someone told me to remove data from the dataset gradually, checking each time that the test case still reproduces. If a removal results in the test case no longer failing, reinstate the data and remove something else instead. When there is absolutely nothing left that can be removed, you have your minimal dataset.
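That advice can be partially automated; a rough bisecting sketch, where reproduces_bug is a hypothetical predicate you would write to re-run the pipeline on a subset and report whether the merge still fails:

def minimize(df, reproduces_bug):
    # Repeatedly keep only one half of the rows, as long as the
    # failure still reproduces on that half.
    while len(df) > 1:
        half = len(df) // 2
        for part in (df.iloc[:half], df.iloc[half:]):
            if reproduces_bug(part):
                df = part
                break
        else:
            break  # neither half reproduces the failure on its own
    return df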
Someone else said I should apply a parsing function to the data, which didn't help:
def parse(x):
    try:
        return int(x)
    except ValueError:
        return np.nan

data['que_t'] = data['que_t'].apply(parse)
data['ans_t'] = data['ans_t'].apply(parse)
data.dtypes
Or should I search for non-numeric strings and replace them with NaN?
import re

replaced_with_nan = data['col_name'].replace(re.compile(r'\D+'), np.nan)
data['col_name'] = replaced_with_nan.astype(float)
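An alternative that avoids a hand-written regex is pd.to_numeric with errors='coerce', which converts anything non-numeric to NaN in one pass:

import pandas as pd

# Anything that cannot be parsed as a number becomes NaN.
data['col_name'] = pd.to_numeric(data['col_name'], errors='coerce')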
Here is another approach which, like the answer from Andrej Kesely below, returns an empty dataframe when used on the actual dataset:
df1 = data.loc[df.type == 1].copy()
df2 = data.loc[df.type == 2].copy()
merged_df = pd.merge(df1, df2, on='section_id', how='outer')
merged_df = merged_df.loc[:,['section_id','que_t_x','ans_t_y','Ans_x','Ans_y']]
merged_df.rename(columns={'que_t_x':'que_t','ans_t_y':'ans_t','Ans_x':'Ans','Ans_y':'Ans.1'}, inplace=True)
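As a sanity check before the merge, one can verify that the two halves share any keys at all; a sketch assuming df1 and df2 from the snippet above (with df corrected to data):

q_ids = set(df1['section_id'])
a_ids = set(df2['section_id'])
print(len(q_ids & a_ids), 'section_ids occur in both question and answer rows')
print(list(q_ids - a_ids)[:5], 'sample of question-only section_ids')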
If I've understood you correctly, you can filter the dataframe and do .merge:
x = (
df[df["que/ans"] == 1]
.merge(
df[df["que/ans"] == 2],
on="section_id",
how="outer",
suffixes=("_que", "_ans"),
)
.drop(columns=["ans_t_que", "name_que", "ans_len_que", "que_t_ans"])
)
print(x)
Prints:
que/ans_que section_id que_t_que date_que part_que que/ans_ans ans_t_ans name_ans date_ans part_ans ans_len_ans
0 1 444 qtext1 456 bbb 2.0 atext2 Markus 654.0 eee 64.0
1 1 444 qtext3 987 ddd 2.0 atext2 Markus 654.0 eee 64.0
2 1 123 qtext2 789 ccc 2.0 atext1 Alex 123.0 aaa 78.0
3 1 555 qtext4 321 fff NaN NaN NaN NaN NaN NaN
If you are reading your data from CSV or Excel, I would recommend defining the dtype during reading. This ensures that the keys you merge on don't suffer any data loss.
Example:

section_id = 00001234

After reading from CSV, it could end up as just 1234.

df = pd.read_csv(filename, dtype={'section_id': str})
Hope this will solve your merging issue.
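A self-contained illustration of that effect, using a hypothetical in-memory CSV (the column names here are made up to mirror the question):

import io
import pandas as pd

csv_text = 'section_id,que_t\n00001234,ddd\n'

# Without an explicit dtype the key is parsed as an integer: leading zeros are lost.
print(pd.read_csv(io.StringIO(csv_text))['section_id'][0])  # -> 1234
# With dtype=str the key survives exactly as written.
print(pd.read_csv(io.StringIO(csv_text), dtype={'section_id': str})['section_id'][0])  # -> '00001234'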
Your last solution following Andrej's answer seems to be working. However, there is some missing context regarding the df variable. In addition, all strings representing emptiness should be replaced with np.nan. Thus I rewrote it to use data as follows:
data.replace(['', 'None', 'nan', None], np.nan, inplace=True)
df1 = data.loc[data.type == 1].copy()
df2 = data.loc[data.type == 2].copy()
merged_df = pd.merge(df1, df2, on='section_id', how='outer')
merged_df = merged_df[['section_id','que_t_x','ans_t_y','Ans_x','Ans_y']]
merged_df.rename(columns={'que_t_x':'que_t','ans_t_y':'ans_t','Ans_x':'Ans','Ans_y':'Ans.1'}, inplace=True)
print(merged_df)
Related
Issue in getting the comments from YAML using ruamel.yaml
Code:

import ruamel.yaml

yaml_str = """\
# comments start
a: 52
# comment for a
b: 50
# comment for b
c: 50
# comment for c
d:
    # comment for d
   e: 60
   # comment for e
   f: 70
   # comment for f
"""

yaml = ruamel.yaml.YAML()
data = yaml.load(yaml_str)
print(data.ca.comment)
print(data.ca.items)

Output:

[None, [CommentToken('# comments start\n', line: 0, col: 0)]]
{'a': [None, None, CommentToken('\n# comment for a\n', line: 2, col: 0), None], 'b': [None, None, CommentToken('\n# comment for b\n', line: 4, col: 0), None], 'c': [None, None, CommentToken('\n# comment for c\n', line: 6, col: 0), None], 'd': [None, None, None, [CommentToken('# comment for d\n', line: 8, col: 4)]]}

Question: Why isn't it showing the comments pertaining to the keys e and f? What is the correct way to retrieve a comment by key? For example, how do I get the comment for the key e (# comment for e)?
In ruamel.yaml most comments are attached to the dict (or list) like structure containing the key (or element) after which the comment occurred. To get to the comments following keys e and f you need to look at the dict that is the value for d:

print(data['d'].ca.items)
print('comment post comment for "e":', repr(data['d'].ca.get('e', 2).value))

which gives:

{'e': [None, None, CommentToken('\n   # comment for e\n', line: 10, col: 3), None], 'f': [None, None, CommentToken('\n   # comment for f\n', line: 12, col: 3), None]}
comment post comment for "e": '\n   # comment for e\n'

Please note that the comment for e starts with a newline, indicating there is no end-of-line comment.
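Building on that, here is a small sketch (not an official ruamel.yaml recipe) that recursively walks every nested mapping and prints whatever comment tokens are attached at each level, using the same .ca.items attribute shown above:

from ruamel.yaml.comments import CommentedMap

def walk_comments(node, path=()):
    # Print the comment tokens attached at this mapping level, then recurse
    # into nested mappings so comments under d, etc. are reached as well.
    if isinstance(node, CommentedMap):
        for key, tokens in node.ca.items.items():
            print(path + (key,), tokens)
        for key, value in node.items():
            walk_comments(value, path + (key,))

walk_comments(data)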
Python replace None with blank with tuple data set by keeping the field name linkage
I currently have the below data set, which is a row iterator whose values are in tuples. I'm trying to formulate code to replace None with blank and access the tuple values based on field name. Any thoughts on how I can do this?

Current data set:

print(data)
Row(('Robert', 'Hoit', None, None, 'TX'), {'fname': 0, 'lname': 1, 'Age': 2, 'Gender': 3, 'State': 4})
Row(('James', 'Burns', 34, 'M', 'CA'), {'fname': 0, 'lname': 1, 'Age': 2, 'Gender': 3, 'State': 4})
Row(('Matt', 'Dan', 45, None, 'NY'), {'fname': 0, 'lname': 1, 'Age': 2, 'Gender': 3, 'State': 4})

The approach I took was to convert each tuple to a list, perform the replacement, and convert back to a tuple, but in doing so the field name linkage got lost. How do I replace without dropping the field name linkage? I want to use row.fname and row.lname in downstream processing.

for row in data:
    a = list(row)
    c = ['' if x is None else x for x in a]
    d = tuple(c)
    print(d.fname)  # AttributeError: a plain tuple has no field names
    print(d.age)
Here I'm assuming that data is a tuple that contains tuples and dictionaries. If so, then:

def rem_none(row):
    if type(row) != dict:
        row = list(row)
        for i in range(len(row)):
            if row[i] is None:
                row[i] = ''
        row = tuple(row)
        return row
    return row

for row in data:
    print(rem_none(row))
How to iterate through 2d list and alternate items with each inner list?
Given the 2-dimensional list

twoDList = [[a1,a2,a3,a4,a5,a6,a7],[b1,b2,b3,b4],[c1,c2,c3,c4,c5,c6,c7,c8,c9],[d1,d2,d3,d4,d5,d6],[e1,e2,e3]]

how can I iterate through this 2d array as such?

answer = alternate(twoDList)
print(answer)
'[a1,b1,c1,d1,e1,a2,b2,c2,d2,e2,a3,b3,c3,d3,e3,a4,b4,c4,d4,a5,c5,d5,a6,c6,d6,a7,c7,c8,c9]'

I tried with this code:

import itertools

def alternateShoes(twodShoes):
    numOfBrands = len(twodShoes)
    brandCount = 0
    shoecount = 0
    masterList = []
    if numOfBrands != 0:
        for shoes in itertools.cycle(twodShoes):
            if (brandCount == numOfBrands):
                masterList.append(shoes[shoecount])
                brandCount = 0
                shoecount = shoecount + 1
            else:
                masterList.append(shoes[shoecount])
                brandCount = brandCount + 1
    return masterList

But I am stuck because each inner list can have a different length. Note, there can be any number of inner lists (0 or more).
There's also a useful function in itertools:

from itertools import zip_longest

zipped = list(zip_longest(*twoDList))

That gives you a layout of:

[('a1', 'b1', 'c1', 'd1', 'e1'),
 ('a2', 'b2', 'c2', 'd2', 'e2'),
 ('a3', 'b3', 'c3', 'd3', 'e3'),
 ('a4', 'b4', 'c4', 'd4', None),
 ('a5', None, 'c5', 'd5', None),
 ('a6', None, 'c6', 'd6', None),
 ('a7', None, 'c7', None, None),
 (None, None, 'c8', None, None),
 (None, None, 'c9', None, None)]

So then just stick them together, ignoring the Nones:

result = [x for y in zipped for x in y if x is not None]
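One caveat: filtering on x is not None also drops any legitimate None elements from the inner lists. A sentinel object avoids that; a small variation on the same idea:

from itertools import chain, zip_longest

sentinel = object()  # unique fill value that can never collide with real data
result = [x for x in chain.from_iterable(zip_longest(*twoDList, fillvalue=sentinel))
          if x is not sentinel]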
This is how I would do it:

def mergeLists(inlst):
    rslt = []
    lstlens = []
    for l in inlst:
        lstlens.append(len(l))
    mxlen = max(lstlens, default=0)  # default=0 handles the zero-inner-lists case
    for i in range(mxlen):
        for k in range(len(inlst)):
            if i < lstlens[k]:
                rslt.append(inlst[k][i])
    return rslt

So given the input twoDList as defined in your question, running:

print(mergeLists(twoDList))

yields:

['a1', 'b1', 'c1', 'd1', 'e1', 'a2', 'b2', 'c2', 'd2', 'e2', 'a3', 'b3', 'c3', 'd3', 'e3', 'a4', 'b4', 'c4', 'd4', 'a5', 'c5', 'd5', 'a6', 'c6', 'd6', 'a7', 'c7', 'c8', 'c9']
Ffill and interpolate koalas dataframe
Is it possible to interpolate and ffill different columns in a Koalas dataframe, something like this?

%%spark -s sparkenv2
import databricks.koalas as ks

kdf = ks.DataFrame({
    'id': [1, 2, 3, 4],
    'A': [None, 3, None, None],
    'B': [2, 4, None, 3],
    'C': [99, None, None, 1],
    'D': [0, 1, 5, 4]
}, columns=['id', 'A', 'B', 'C', 'D'])

kdf['A'] = kdf['A'].ffill()
kdf['B'] = kdf['B'].interpolate()
For ffill, this is taken from John Paton's blog:

import sys

from pyspark.sql import Window
from pyspark.sql.functions import last

spark_df = kdf.to_spark()

# define the window
window = Window.orderBy('id').rowsBetween(-sys.maxsize, 0)

# define the forward-filled column
filled_column = last(spark_df['A'], ignorenulls=True).over(window)

# do the fill
spark_df_filled = spark_df.withColumn('A_filled', filled_column)

I have no answer for interpolate - still trying to find it myself.

PS - You can switch to backfill by changing to rowsBetween(0, sys.maxsize) and using first() rather than last().
Initializing matrix python3
I don't know whether this is a bug, or I got the wrong semantic meaning of the * token for lists:

>>> arr = [None] * 5  # Initialize array of 5 'None' items
>>> arr
[None, None, None, None, None]
>>> arr[2] = "banana"
>>> arr
[None, None, 'banana', None, None]
>>> # right?
...
>>> mx = [[None] * 3] * 2  # initialize a 3x2 matrix with 'None' items
>>> mx
[[None, None, None], [None, None, None]]
>>> # so far, so good, but then:
...
>>> mx[0][0] = "banana"
>>> mx
[['banana', None, None], ['banana', None, None]]
>>> # Huh?

Is this a bug, or did I get the wrong semantic meaning of the __mul__ token?
You're copying the same reference to the inner list multiple times. Do it like this:

matrix = [[None] * 3 for i in range(2)]
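A quick check that the comprehension version really creates independent rows:

matrix = [[None] * 3 for i in range(2)]
matrix[0][0] = "banana"
print(matrix)  # [['banana', None, None], [None, None, None]] - only one row changed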