Pandas: Problem in Concatenation of three Data Frames - python-3.x

I am loading three .csv files into three different pandas DataFrames. The number of rows in each file should be the same, but it is not: sometimes one file contains four or five more rows than another, and sometimes two files have two or three extra rows compared to the third. I'm concatenating the files row by row, so I drop these additional rows to make the frames the same length. I wrote the following code to do this.
df_ch = pd.read_csv("./file1.csv")
df_wr = pd.read_csv("./file2.csv")
df_an = pd.read_csv("./file3.csv")
# here df_wr has fewer rows than df_ch and df_an, so drop the extra rows from the other two frames (df_ch and df_an)
df_ch.drop(df_ch.index[274299:274301], inplace=True)
df_an.drop(df_an.index[274299], inplace=True)
I did this manually in the code above, and now I want to do it automatically. One option is to use if-else checks on the lengths of all the frames and trim them to the shortest one, but I'm curious whether pandas has a quicker way to compare these frames and delete the additional rows.

A couple of ways you can do this:
In general,
# take the intersection of all the indexes:
common = df1.index.intersection(df2.index).intersection(df3.index)
pd.concat([df1, df2, df3], axis=1).reindex(common)
Or in your case, maybe just
min_rows = min(map(len, [df1, df2, df3]))
pd.concat([df1, df2, df3], axis=1).iloc[:min_rows]
Update: the best option is probably join:
df1.join(df2, how='inner').join(df3, how='inner')
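For completeness, here is a minimal sketch of how the inner-join version could look with the frame names from the question; the rsuffix arguments are an assumption, only needed if the three files share column names:
import pandas as pd

df_ch = pd.read_csv("./file1.csv")
df_wr = pd.read_csv("./file2.csv")
df_an = pd.read_csv("./file3.csv")

# inner joins keep only the row labels present in every frame,
# which drops the surplus trailing rows automatically
combined = (df_ch.join(df_wr, how="inner", rsuffix="_wr")
                 .join(df_an, how="inner", rsuffix="_an"))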

Related

Pandas Dataframe Integer Column Data Cleanup

I have a CSV file, which I read through pandas read_csv module.
There is one column, which is supposed to have numbers only, but the data has some bad values.
Some rows (very few) have "alphanumeric" strings, a few rows are empty, and a few others have floating point numbers. Also, for some reason, some numbers are being read as strings.
I want to convert it in the following way:
Alphanumeric, None, or empty (numpy.nan) values should be converted to 0
Floating point values should be cast to int
Integers should remain as they are
And obviously, numbers should be read as numbers only.
How should I proceed? I have no better idea than to read each row one by one and cast to int in a try-except block, assigning 0 if an exception is raised,
like:
def typecast_int(n):
    try:
        return int(n)
    except:
        return 0

for idx, row in df.iterrows():
    df.at[idx, "number_column"] = typecast_int(row["number_column"])
But there are some issues with this approach. Firstly, iterrows is bad performance-wise, my dataframe may have up to 700k to 1M records, and I have to process ~500 such CSV files. Secondly, it just doesn't feel right to do it this way.
I could do a tad better by using df.apply instead of iterrows, but that is not too different either.
Given your 4 conditions, there's
df.number_column = (pd.to_numeric(df.number_column, errors="coerce")
.fillna(0)
.astype(int))
This first converts the column to numeric values only. If errors arise (e.g., due to alphanumerics), those values get "coerce"d to NaN. Then we fill those NaNs with 0 and finally cast everything to integers.
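As a quick illustration with hypothetical values covering all four cases (alphanumeric, empty/None, float, and a number stored as a string):
import pandas as pd

df = pd.DataFrame({"number_column": ["12", "abc123", None, 3.7, "8", ""]})

df.number_column = (pd.to_numeric(df.number_column, errors="coerce")
                    .fillna(0)
                    .astype(int))

print(df.number_column.tolist())  # [12, 0, 0, 3, 8, 0]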

How to create a list and filter out rows from another dataframe?

I know this question has been asked before, but none of the solutions appear to work, and they all give me the same result. I am looking for insight into what I am doing wrong.
T18_x2 and Tryp18_50 are large dataframes with different data (except for 2 columns). Specifically, each dataframe contains a column named 'Gene' that holds the same style of string (e.g. HSP90A_HUMAN). I would like to make a list from the Gene column in T18_x2 to filter the rows in Tryp18_50 that have the same string in the "Gene" column.
My issue is that the output is simply an empty dataframe. I think the problem is the list of strings (y2), because the output of this list contains duplicates of the strings in the column. I am not sure why this is happening either.
[screenshot of the y2 list output]
Any help would be greatly appreciated.
input:
y2 = T18_x2['Gene'].astype(str).values.tolist()
T18 = Tryp18_50[Tryp18_50['Gene'].isin(y2)]
T18
output:
[screenshot: empty dataframe]
I have also tried:
T18=Tryp18_50[pd.notna(Tryp18_50['Gene']) & Tryp18_50['Gene'].astype(str).str.contains('|'.join(y2))]
with the output:
[screenshot: second output]
My mistake, I had two "Gene" columns in the first dataframe.
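For anyone who hits the same symptom, a minimal sketch (reusing the frame names from the question) of how to spot and drop a duplicated column label before building the list:
# which column labels appear more than once?
print(T18_x2.columns[T18_x2.columns.duplicated()])

# keep only the first occurrence of each label
T18_x2 = T18_x2.loc[:, ~T18_x2.columns.duplicated()]

# then the original filtering works as intended
y2 = T18_x2['Gene'].astype(str).tolist()
T18 = Tryp18_50[Tryp18_50['Gene'].isin(y2)]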

Pandas: get first datetime-in and last datetime-out in one row

First of all, thanks in advance; there are always answers here, so we learn a lot from the experts. I'm a noob using pandas (it's super handy for what I've tried and achieved so far).
I have this data, handed to me like this (I don't have access to the origin), 20k rows or more sometimes. The 'in' and 'out' columns may have one or more entries per date, so after an 'in' the next entry could be an 'out' or another 'in', which leaves me a blank cell; that's the problem (see first image).
I want to filter the first datetime-in into one column and the last datetime-out into another, but both in one row (see second image); the data comes in a CSV file. I am currently doing this particular work manually with LibreOffice Calc (yep).
So far, I have tried locating and relocating, merging, grouping... nothing works for me, so I feel frustrated. Would you please lend me a hand? Here is a minimal sample of the file.
By the way, English is not my language. Thanks so much!
First:
out_column = df["out"].tolist()
This gives you all the out dates as a list; we will need that later.
in_column = df["in"].tolist()  # 'in' is a Python keyword, so I suggest renaming that column
I treat NaT as NaN (null) in this case.
Now we have to find which rows to keep, which we do by going through the in column and only keeping the rows that come right after a NaN (and the first one):
filtered_df = []
tracker = False
for index, element in enumerate(in_column):
    if index == 0 or tracker is True:
        filtered_df.append(True)
        tracker = False
        continue
    if pd.isna(element):
        tracker = True
    filtered_df.append(False)
Then you filter your df by this Boolean List:
df = df[filtered_df]
Now you fix up your out column by removing the null values:
out_column = [x for x in out_column if pd.notna(x)]
Last but not least you overwrite your old out column with the new one:
df["out"] = out_column

Transforming multiple data frame columns into one series

I have a dataset df (250, 3): 250 rows and three columns. I want to write a loop that merges the content of each column in my dataframe into one single series (250, 1) of 250 rows and 1 column, 'df_single'. The manual operation is the following:
df_single = df['colour']+" "+df['model']+" "+df['size']
How can I create df_single with a for loop, or non-manually?
I tried to write this code, but it raises a TypeError:
df_conc = []
for var in cols:
    cat_list = df_code_part[var]
    df_conc = df_conc + " " + cat_list
TypeError: can only concatenate list (not "str") to list
I think if you need to join 3 columns, then your solution is really good:
df_single = df['colour']+" "+df['model']+" "+df['size']
If you need a general solution for many columns, use DataFrame.astype to convert to strings if necessary, DataFrame.add to append a whitespace, sum to concatenate, and finally Series.str.rstrip to remove the trailing whitespace:
cols = ['colour','model','size']
df_single = df[cols].astype(str).add(' ').sum(axis=1).str.rstrip()
Or:
df_single = df[cols].astype(str).apply(' '.join, axis=1)
If you want to have spaces between columns, run:
df.apply(' '.join, axis=1)
"Ordinary" df.sum(axis=1) concatenates all columns, but without
spaces between them.
If you want the sum, you need to use:
df_single=df.astype(str).add(' ').sum(axis=1).str.rstrip()
If you don't want to add all the columns, then you need to select them first:
columns=['colour','model','size']
df_single=df[columns].astype(str).add(' ').sum(axis=1).str.rstrip()
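A toy example (hypothetical values) of what either variant produces:
import pandas as pd

df = pd.DataFrame({"colour": ["red", "blue"],
                   "model": ["A1", "B2"],
                   "size": ["S", "L"]})

cols = ['colour', 'model', 'size']
df_single = df[cols].astype(str).apply(' '.join, axis=1)
print(df_single.tolist())  # ['red A1 S', 'blue B2 L']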

Adding zero columns to a dataframe

I have a strange problem which I am not able to figure out. I have a dataframe subset that looks like this.
In the dataframe, I add "zero" columns using the following code:
subset['IRNotional'] = pd.DataFrame(numpy.zeros(shape=(len(subset), 1)))
subset['IPNotional'] = pd.DataFrame(numpy.zeros(shape=(len(subset), 1)))
and I get a result similar to this.
Now, when I do similar things to another dataframe, I get zero columns with a mix of NaN and zero rows, as shown below. This is really strange.
subset['IRNotional'] = pd.DataFrame(numpy.zeros(shape=(len(subset), 1)))
subset['IPNotional'] = pd.DataFrame(numpy.zeros(shape=(len(subset), 1)))
I don't understand why sometimes I get zeros and other times I get either NaNs or a mix of NaNs and zeros. Please help if you can.
Thanks
I believe you need assign with a dictionary to set the new column names:
subset = subset.assign(**dict.fromkeys(['IRNotional','IPNotional'], 0))
#you can define each column separately
#subset = subset.assign(**{'IRNotional': 0, 'IPNotional': 1})
Or simpler:
subset['IRNotional'] = 0
subset['IPNotional'] = 0
Now, when I do similar things to another dataframe, I get zero columns with a mix of NaN and zero rows, as shown below. This is really strange.
I think the problem is different index values, so it is necessary to create the same indices; otherwise, non-matching indices get NaNs:
subset['IPNotional']=pd.DataFrame(numpy.zeros(shape=(len(subset),1)), index=subset.index)
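A small illustration of the alignment issue, with hypothetical index labels: when subset keeps non-default labels (for example after filtering), the freshly created zeros frame is indexed 0..n-1, nothing matches, and you get NaNs; passing index=subset.index (or just assigning the scalar 0) fixes it:
import numpy
import pandas as pd

subset = pd.DataFrame({"A": [10, 20, 30]}, index=[5, 7, 9])

# misaligned: the zeros frame is indexed 0..2, so no labels match -> all NaN
subset['IRNotional'] = pd.DataFrame(numpy.zeros(shape=(len(subset), 1)))

# aligned: reuse subset's own index
subset['IPNotional'] = pd.DataFrame(numpy.zeros(shape=(len(subset), 1)),
                                    index=subset.index)
print(subset)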
