How to extract max length row with pandas - python-3.x

I would like to extract the row containing the longest string in a DataFrame.
In the following case I would like to get the id 2 row, because column B holds the longest value, bbbbbb (length 6).
|id|A   |B     |
|--|----|------|
|1 |abc |aaa   |
|2 |abb |bbbbbb|
|3 |aadd|cccc  |
|4 |aadc|ddddd |

Desired output:

|id|A   |B     |
|--|----|------|
|2 |abb |bbbbbb|
Please give me some advice. Thanks.

Let's first create the DataFrame with your example:
import pandas as pd

data = {
    "id": {0: 1, 1: 2, 2: 3, 3: 4},
    "A": {0: "abc", 1: "abb", 2: "aadd", 3: "aadc"},
    "B": {0: "aaa", 1: "bbbbbb", 2: "cccc", 3: "ddddd"}
}
df = pd.DataFrame(data)
Then you can find the index where B is longest and retrieve that row with:
# Index label where B is longest
idx = df["B"].apply(len).idxmax()
# Get that row (with the default RangeIndex, loc and iloc agree here)
df.loc[idx]
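A slightly more robust variant, in case B could contain missing values: Series.str.len returns NaN for missing entries, which idxmax skips, whereas apply(len) would raise on them. A minimal sketch:
# .str.len() tolerates NaN; double brackets keep a one-row DataFrame
idx = df["B"].str.len().idxmax()
df.loc[[idx]]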

Use DataFrame.select_dtypes to get all object dtype (i.e. string) columns, compute each row's maximum string length, and finally keep every row whose length equals the overall maximum via boolean indexing:
s = df.select_dtypes(object).apply(lambda x: x.str.len()).max(axis=1)
#if no missing values
#s = df.select_dtypes(object).applymap(len).max(axis=1)
df1 = df[s.eq(s.max())]
print (df1)
id A B
1 2 abb bbbbbb
Another idea, for only the first match, is Series.idxmax with DataFrame.loc; the extra [] returns a one-row DataFrame:
df1 = df.loc[[df.select_dtypes(object).apply(lambda x: x.str.len()).max(axis=1).idxmax()]]
#if no missing values
#df1 = df.loc[[df.select_dtypes(object).applymap(len).max(axis=1).idxmax()]]
print (df1)
id A B
1 2 abb bbbbbb

First find the maximal length within each row, then take the index of the row with the maximal value:
df.loc[df[['A', 'B']].apply(lambda x: x.str.len().max(), axis=1).idxmax()]
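If you prefer a one-row DataFrame over a Series here, wrap the idxmax label in a list, as the earlier answers did:
df.loc[[df[['A', 'B']].apply(lambda x: x.str.len().max(), axis=1).idxmax()]]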

Related

How to change values in a df by index, with the indexes contained in multiple lists, one list per column

I have lists holding the indexes of the values to be replaced. I have to change them in 8 different columns using 8 different lists. The replacement can be a simple string.
How can I do it?
I have more than 20 different columns in this df.
Eg:
list1 = [0,1,2]
list2 =[2,4]
list8 = ...
sustitution = 'no data'
| Column A | Column B |
| -------- | -------- |
| marcos   | peter    |
| Julila   | mike     |
| Fran     | Ramon    |
| Pedri    | Gavi     |
| Olmo     | Torres   |
OUTPUT:
| Column A | Column B |
| -------- | -------- |
| no data | peter |
| no data | mike |
| no data | no data |
| Pedri | Gavi |
| Olmo     | no data  |
Use DataFrame.loc with zipped lists and column names:
list1 = [0,1,2]
list2 =[2,4]
L = [list1,list2]
cols = ['Column A','Column B']
sustitution = 'no data'
for c, i in zip(cols, L):
    df.loc[i, c] = sustitution
print (df)
Column A Column B
0 no data peter
1 no data mike
2 no data no data
3 Pedri Gavi
4 Olmo no data
You can use the underlying numpy array:
import numpy as np

list1 = [0,1,2]
list2 = [2,4]
lists = [list1, list2]
col = np.repeat(np.arange(len(lists)), list(map(len, lists)))
# array([0, 0, 0, 1, 1])
row = np.concatenate(lists)
# array([0, 1, 2, 2, 4])
# write through the underlying array; this works here because every column
# is object dtype, so .values exposes a single block
df.values[row, col] = 'no data'
Output:
Column A Column B
0 no data peter
1 no data mike
2 no data no data
3 Pedri Gavi
4 Olmo no data
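With many columns, a dict keyed by column name may read better than parallel lists; the loop body is the same DataFrame.loc assignment as above (the column names and index lists below are just placeholders):
# map each column to the row indices that should be overwritten
to_replace = {
    'Column A': [0, 1, 2],
    'Column B': [2, 4],
    # ... one entry per column to patch
}
for col, rows in to_replace.items():
    df.loc[rows, col] = 'no data'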

Extract specific string from a column and place them in a sequence

I have a dataframe like this:
df = [{'id': 1, 'id1': '859A;'},
{'id': 2, 'id1': '209A/229A/509A;'},
{'id': 3, 'id1': '(105A/111A/121A/131A/201A/205A/211A/221A/231A/509A/801A/805A/811A/821A)+TZ+-494;'},
{'id': 4, 'id1': '111A/114A/121A/131A/201A/211A/221A/231A/651A+-(Y05/U17)/801A/804A/821A;'},
{'id': 5, 'id1': '(651A/851A)+U17/861A;'},
]
df = spark.createDataFrame(df)
I want to split the "id1" column into two columns.
One column needs to only extract strings which end with "A" and put them in a sequence with "/" between strings.
The other column needs to extract the remaining strings and place them in a separate column as shown below.
Taking "id3", "id5" and "id2" as examples, the desired output should be:
newcolumn1
(105A,111A,121A,131A/201A,205A,211A,221A,231A/509A/801A,805A,811A,821A)
(651A/851A,861A)
(209A,229A/509A)
newcolumn2
+TZ+-494;
+U17;
blank
All values with the same leading digit (e.g. those starting with "1") and ending with "A" should be in one group, separated by commas. The groups themselves are separated by "/".
Your best bet is to use regex. regexp_extract_all is not yet directly available in the Python API, but you can use expr to reach it. You will also need a couple of consecutive aggregations.
from pyspark.sql import functions as F
cols = df.columns
df = df.withColumn('_vals', F.explode(F.expr(r"regexp_extract_all(id1, '\\d+A', 0)")))
df = (df
    .groupBy(*cols, F.substring('_vals', 1, 1)).agg(
        F.array_join(F.array_sort(F.collect_list('_vals')), ',').alias('_vals')
    ).groupBy(cols).agg(
        F.array_join(F.array_sort(F.collect_list('_vals')), '/').alias('newcolumn1')
    ).withColumn('newcolumn1', F.format_string('(%s)', 'newcolumn1')
    ).withColumn('newcolumn2', F.regexp_replace('id1', r'\d+A|/|\(|\)', ''))
)
df.show(truncate=0)
# +---+--------------------------------------------------------------------------------+-----------------------------------------------------------------------+----------+
# |id |id1 |newcolumn1 |newcolumn2|
# +---+--------------------------------------------------------------------------------+-----------------------------------------------------------------------+----------+
# |3 |(105A/111A/121A/131A/201A/205A/211A/221A/231A/509A/801A/805A/811A/821A)+TZ+-494;|(105A,111A,121A,131A/201A,205A,211A,221A,231A/509A/801A,805A,811A,821A)|+TZ+-494; |
# |5 |(651A/851A)+U17/861A; |(651A/851A,861A) |+U17; |
# |2 |209A/229A/509A; |(209A,229A/509A) |; |
# |4 |111A/114A/121A/131A/201A/211A/221A/231A/651A+-(Y05/U17)/801A/804A/821A; |(111A,114A,121A,131A/201A,211A,221A,231A/651A/801A,804A,821A) |+-Y05U17; |
# |1 |859A; |(859A) |; |
# +---+--------------------------------------------------------------------------------+-----------------------------------------------------------------------+----------+
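Note that newer Spark releases expose regexp_extract_all in the Python API directly, taking Column arguments for the pattern and group index, so the expr detour can be dropped there. A sketch, assuming Spark 3.5+:
# assumes pyspark >= 3.5, where regexp_extract_all is in pyspark.sql.functions
df = df.withColumn('_vals', F.explode(F.regexp_extract_all('id1', F.lit(r'\d+A'), F.lit(0))))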

Pandas: With array of col names in a desired column order, select those that exist, NULL those that don't

I have an array of column names I want in my output table, in that order, e.g. ["A", "B", "C"].
I have an input table that USUALLY contains all of the values in the array, but NOT ALWAYS (the raw data is a JSON API response).
I want to select all available columns from the input table, and if a column does not exist, I want it filled with NULLs or NA or whatever; it doesn't really matter.
Let's say my input DataFrame (call it input_table) looks like this:
+-----+--------------+
| A | C |
+-----+--------------+
| 123 | test |
| 456 | another_test |
+-----+--------------+
I want an output dataframe that has columns A, B, C in that order to produce
+-----+------+--------------+
| A | B | C |
+-----+------+--------------+
| 123 | NULL | test |
| 456 | NULL | another_test |
+-----+------+--------------+
I get a KeyError when I do input_table[["A","B","C"]].
I get None returned when I do input_table.get(["A","B","C"]).
I was able to achieve what I want via:
for i in desired_columns_array:
    if i not in input_dataframe:
        output_dataframe[i] = ""
    else:
        output_dataframe[i] = input_dataframe[i]
But I'm wondering if there's something less verbose?
How do I get a desired output schema to match an input array when one or more columns in the input dataframe may not be present?
Transpose and reindex
df = pd.DataFrame([[123,'test'], [456, 'another test']], columns=list('AC'))
l = list('ACB')
df1 = df.T.reindex(l).T[sorted(l)]
A B C
0 123 NaN test
1 456 NaN another test
DataFrame.reindex over the column axis:
cols = ['A', 'B', 'C']
df.reindex(cols, axis='columns')
A B C
0 123 NaN test
1 456 NaN another_test
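Both reindex answers fill the missing columns with NaN; if you would rather have empty strings, as in the original loop, reindex also accepts a fill_value. A small sketch:
cols = ['A', 'B', 'C']
output_table = input_table.reindex(columns=cols, fill_value='')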

How to label encode a DataFrame column which contains both numbers and strings?

I have this DataFrame column
df:

| Index | Ticket |
| ----- | ------ |
| 0     | 254326 |
| 1     | CA345  |
| 3     | SA12   |
| 4     | 267891 |
| ...   | ...    |
| 700   | CA356  |
It contains two kinds of values. Some are pure numbers and others are strings having letters and numbers.
Many rows have the same letters (CA345, CA675, etc.). I would like to group the rows with the same letters and label each group with the same number.
E.g. all rows having "CA" are labelled 0, all rows having "SA" are labelled 1.
The remaining rows all have six-digit numbers (no letters in them). I would like to label all such rows with the same number (say 2, for example).
1st Approach
Define a custom function that checks whether the value is a string (isinstance(val, str)) and contains "CA" or "SA":
def label_ticket(row):
    if isinstance(row['Ticket'], str) and 'CA' in row['Ticket']:
        return 0
    if isinstance(row['Ticket'], str) and 'SA' in row['Ticket']:
        return 1
    return 2
Apply the custom function to a new column df['Label'].
df['Label'] = df.apply(label_ticket, axis=1)
print(df)
Ticket Label
0 254326 2
1 CA345 0
2 SA12 1
3 267891 2
700 CA356 0
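The same labelling can also be vectorized with numpy.select instead of a row-wise apply; a sketch, assuming the column is named 'Ticket' as above:
import numpy as np

s = df['Ticket'].astype(str)
# conditions are checked in order; rows matching neither get the default label 2
df['Label'] = np.select([s.str.contains('CA'), s.str.contains('SA')], [0, 1], default=2)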
2nd Approach
Further understanding the situation, it seems you don't know in advance which patterns will come up in df['Ticket']. In this case you can use re.split() to pull out the letter pattern of each value and classify accordingly.
import pandas as pd
import re
df = pd.DataFrame(columns=['Ticket'],
                  data=[[254326],
                        ['CA345'],
                        ['SA12'],
                        [267891],
                        ['CA356']])
df['Pattern'] = df['Ticket'].apply(lambda x: ''.join(re.split("[^a-zA-Z]*", str(x))))
df_label = pd.DataFrame(df['Pattern'].unique(), columns=['Pattern']).reset_index(level=0).rename(columns={'index': 'Label'})
df = df.merge(df_label, how='left')
print(df)
   Ticket Pattern  Label
0  254326              0
1   CA345      CA      1
2    SA12      SA      2
3  267891              0
4   CA356      CA      1
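The split/merge steps can be collapsed into pandas built-ins: strip the non-letters with Series.str.replace and number the unique patterns with pandas.factorize. A sketch on the same frame:
df['Pattern'] = df['Ticket'].astype(str).str.replace(r'[^a-zA-Z]', '', regex=True)
df['Label'] = pd.factorize(df['Pattern'])[0]  # 0, 1, 2, ... in order of first appearance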
I don't have much knowledge of Python, but you could try pandas.Series.str.extract with a regular expression, like:
import pandas as pd

ptrn = r'(?P<CA>(CA[\d]+))|(?P<SA>(SA[\d]+))|(?P<DIGIT>[\d]{6})'
ls = {'tk': ['254326', 'CA345', 'SA12', '267891', 'CA356']}
df = pd.DataFrame(ls)
s = df['tk'].str.extract(ptrn, expand=False)
newDf = {0: [x for x in s['CA'] if pd.isnull(x) == False],
         1: [x for x in s['SA'] if pd.isnull(x) == False],
         2: [x for x in s['DIGIT'] if pd.isnull(x) == False]}
print(newDf)
Output:
{0: ['CA345', 'CA356'], 1: ['SA12'], 2: ['254326', '267891']}

Python: How to replace "datetime" in a Dataframe with only the Day represented as integer value?

I am dealing with a dataframe called "Data" containing two columns: one is the userid, the other a datetime object:
userid | eventTime
1 | 2018-11-01 07:36:58
2 | 2018-11-07 08:04:12
.. | ....
My goal is now to replace each entry in the eventTime column with just the day of the month as an integer value, e.g. the first day of November (=1) and the seventh day of November (=7).
So the result should be:
userid | eventTime
1 | 1
2 | 7
.. | ....
How can I get this done?
I already extracted the indices from the data frame and tried to modify it in a loop, but I don't know how to make it work:
temp = Data.index.get_values()  # get indices from data frame
for temp, row in Data.iterrows():
    print(row['eventTime'])
df['eventTime'] = df['eventTime'].dt.day
Should work - assuming df is your dataframe.
Performing a vectorized operation like this is the most efficient way to work on a dataframe.
As an example:
import pandas as pd
from datetime import datetime

df = pd.DataFrame(data={'user': ['a', 'b', 'a', 'b'],
                        'eventTime': [datetime(2000, 1, 1),
                                      datetime(2000, 2, 2),
                                      datetime(2000, 3, 3),
                                      datetime(2000, 4, 4)]})
print(df)
# eventTime user
# 0 2000-01-01 a
# 1 2000-02-02 b
# 2 2000-03-03 a
# 3 2000-04-04 b
You can operate on a particular column with the apply() method, and datetime objects provide a datetime.day property that gives you the day of the month as an integer:
df['day'] = df.eventTime.apply(lambda x: x.day)
print(df)
# eventTime user day
# 0 2000-01-01 a 1
# 1 2000-02-02 b 2
# 2 2000-03-03 a 3
# 3 2000-04-04 b 4
If you want to replace the eventTime column instead of creating a new column, just use:
df['eventTime'] = df.eventTime.apply(lambda x: x.day)
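One caveat for both approaches: they assume eventTime already holds datetime values. If the column arrived as strings (e.g. read from CSV), convert it first; a minimal sketch:
# parse strings like '2018-11-01 07:36:58' before taking the day of month
df['eventTime'] = pd.to_datetime(df['eventTime']).dt.day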
