I wrote this code to extract data from PDFs and build a list in Excel with 'PO number' / 'Item' / 'Date' / file name. But when a PDF contains the PO number and item more than once, the data gets appended as a list within a list. That is fine, except that when I put the lists into a pandas DataFrame, a single cell ends up holding more than one value, and I need to split those values and append each one as a new row below, in order.
import os
import re
import pandas as pd
from pypdf import PdfReader  # or: from PyPDF2 import PdfReader

lista_Pedido = []
lista_Data = []
lista_Item = []
nome_arquivo = []

for f in os.listdir():
    col_3 = [f]
    nome_arquivo.append(col_3)
    reader = PdfReader(f)
    page = reader.pages[0]
    pdf_atual = page.extract_text()  # extract_text() takes no file argument
    # PO number
    col_1 = re.findall(r'\w+(?<=PO: 45)\d+', pdf_atual)
    lista_Pedido.append(col_1)
    # Item number
    col_12 = re.findall(r'(?<=Item )\d+', pdf_atual)
    lista_Item.append(col_12)
    # Delivery date, English or Portuguese label
    col_2 = re.findall(r'(?:(?<=Date of delivery: )|(?<=Data de fornecimento: ))\s*\d+/\d+/\d+', pdf_atual)
    lista_Data.append(col_2)

df = pd.DataFrame(data=(), columns=['Pedido', 'Item', 'Data'])
df['Item'] = lista_Item
df['Data'] = lista_Data
df['arquivo'] = nome_arquivo
Wrong result: cells holding a list with more than one value. I need to split those values and append them as new rows, following the order of the list.
The reason you are getting a list of lists is that re.findall returns a list. If you would like to add the results individually, you can do the following:
col_1 = re.findall(r'\w+(?<=PO: 45)\d+',pdf_atual)
lista_Pedido.extend(col_1)
Or:
col_1 = re.findall(r'\w+(?<=PO: 45)\d+', pdf_atual)
for result in col_1:
    lista_Pedido.append(result)
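For completeness, here is a minimal sketch of the whole loop using extend for all three lists, so each match becomes its own row. It assumes every PDF yields the same number of PO, item, and date matches, and it repeats the file name once per matched PO; the regexes are the ones from the question, with the date pattern cleaned up.

import os
import re
import pandas as pd
from pypdf import PdfReader

lista_Pedido, lista_Item, lista_Data, nome_arquivo = [], [], [], []
for f in os.listdir():
    texto = PdfReader(f).pages[0].extract_text()
    pedidos = re.findall(r'\w+(?<=PO: 45)\d+', texto)
    itens = re.findall(r'(?<=Item )\d+', texto)
    datas = re.findall(r'(?:(?<=Date of delivery: )|(?<=Data de fornecimento: ))\s*\d+/\d+/\d+', texto)
    lista_Pedido.extend(pedidos)
    lista_Item.extend(itens)
    lista_Data.extend(datas)
    nome_arquivo.extend([f] * len(pedidos))  # one file-name entry per matched PO
df = pd.DataFrame({'Pedido': lista_Pedido, 'Item': lista_Item,
                   'Data': lista_Data, 'arquivo': nome_arquivo})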
Using Pandas, I'm attempting to 'slice' (Sorry if that's not the correct term) segments of a dataframe out of one DF and into a new one, where every segment is stacked one on top of the other.
Code:
import pandas as pd
df = pd.DataFrame(
    {
        'TYPE': ['System','VERIFY','CMD','SECTION','SECTION','VERIFY','CMD','CMD','VERIFY','CMD','System'],
        'DATE': [100,200,300,400,500,600,700,800,900,1000,1100],
        'OTHER': [10,20,30,40,50,60,70,80,90,100,110],
        'STEP': ['Power On','Start: 2','Start: 1-1','Start: 10-7','End: 10-7','Start: 3-1','Start: 10-8','End: 1-1','End: 3-1','End: 10-8','Power Off']
    })
print(df)
column_headers = df.columns.values.tolist()
col_name_type = 'TYPE'
col_name_other = 'OTHER'
col_name_step = 'STEP'
segments = []
df_blank = pd.DataFrame({'TYPE': ['BLANK ROW']}, columns = column_headers)
types_to_check = ['CMD', 'VERIFY']
type_df = df[df[col_name_type].isin(types_to_check)]
for row in type_df:
    if 'CMD' in row:
        if 'START:' in row[col_name_step].value:
            idx_start = row.iloc[::-1].str.match('VERIFY').first_valid_index()  # go backwards and find first VERIFY
            step_match = row[col_name_step].value[6:]  # get the unique ID after Start:
            idx_end = df[df[col_name_step].str.endswith(step_match, na=False)].last_valid_index()  # find last instance of matching unique id
            segments.append(df.loc[idx_start:idx_end, :])
            segments.append(df_blank)
df_segments = pd.concat(segments)
print(df)
print(df_segments)
Nothing gets populated in my segments list, so the concat call fails.
From my research I'm confident this can be done using either .loc or .iloc, but I can't seem to get a working implementation.
My DF:
What I am trying to make:
Any help and/or guidance would be welcome.
Edit: To clarify, I'm trying to create a new DF composed of every group of rows where the start is the "VERIFY" row that comes before a "CMD" row containing "Start:", and the end is the matching "CMD" row containing the corresponding "End:".
EDIT2: I think the following is something close to what I need, but I'm unsure how to get it to reliably work:
segments = []
df_blank = pd.DataFrame({'TYPE': ['BLANK ROW']}, columns = column_headers)
types_to_check = ['CMD', 'VERIFY']
cmd_check = ['CMD']
verify_check = ['VERIFY']
cmd_df = df[(df[col_name_type].isin(cmd_check))]
cmd_start_df = cmd_df[(cmd_df[col_name_step].str.contains('START:'))]
for cmd_idx in cmd_start_df.index:
    step_name = df.loc[cmd_idx, col_name_step][6:]
    temp_df = df.loc[:cmd_idx,]
    idx_start = temp_df[col_name_type].isin(verify_check).last_valid_index()
    idx_end = cmd_df[cmd_df[col_name_type].str.endswith(step_name, na=False)].last_valid_index()
    segments.append(df.loc[idx_start:idx_end, :])
    segments.append(df_blank)
df_segments = pd.concat(segments)
You can use str.contains:
segmented_df = df.loc[df['STEP'].str.contains('Start|End')]
print(segmented_df)
I created some code to accomplish the 'slicing' I wanted:
slides = []  # collects each slice plus a blank spacer row
for cmd_idx in cmd_start_df.index:
    step_name = df.loc[cmd_idx, col_name_step][6:]
    temp_df = df.loc[:cmd_idx, :]
    temp_list = temp_df[col_name_type].values.tolist()
    if 'VERIFY' in temp_list:
        idx_start = temp_df[temp_df[col_name_type].str.match('VERIFY')].last_valid_index()
    else:
        idx_start = cmd_idx
    idx_end = cmd_df[cmd_df[col_name_step].str.endswith(step_name, na=False)].last_valid_index()
    slides.append(df.loc[idx_start:idx_end, :])
    slides.append(df_blank)
I essentially create a new DF that is a subset of the old DF up to my first Start index, find the last_valid_index with VERIFY, use that index to build a slice from idx_start to idx_end, and eventually concat all those slices into one DF.
Maybe there's an easier way, but I couldn't find it.
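For reference, here is a hedged sketch of the same idea packaged as a function. It is one possible reading of the requirement (the last VERIFY row before each CMD "Start:" row, through the matching CMD "End:" row), using the column names from the example DataFrame above.

import pandas as pd

def build_segments(df):
    segments = []
    blank = pd.DataFrame({'TYPE': ['BLANK ROW']}, columns=df.columns)
    # CMD rows whose STEP opens a block
    cmd_starts = df[(df['TYPE'] == 'CMD') & df['STEP'].str.startswith('Start:', na=False)]
    for cmd_idx in cmd_starts.index:
        step_id = df.loc[cmd_idx, 'STEP'][7:]  # the ID after 'Start: '
        before = df.loc[:cmd_idx]
        verifies = before[before['TYPE'] == 'VERIFY']
        idx_start = verifies.index[-1] if len(verifies) else cmd_idx
        ends = df[(df['TYPE'] == 'CMD') & (df['STEP'] == 'End: ' + step_id)]
        idx_end = ends.index[-1] if len(ends) else cmd_idx
        segments.append(df.loc[idx_start:idx_end])
        segments.append(blank)
    return pd.concat(segments, ignore_index=True)

print(build_segments(df))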
Hello, I would like to convert the empty strings in my RDD to 0.
I have read 20 files, and they are in this format.
YEAR,MONTH,DAY,DAY_OF_WEEK,AIRLINE,FLIGHT_NUMBER,TAIL_NUMBER,ORIGIN_AIRPORT,DESTINATION_AIRPORT,SCHEDULED_DEPARTURE,DEPARTURE_TIME,DEPARTURE_DELAY,TAXI_OUT,WHEELS_OFF,SCHEDULED_TIME,ELAPSED_TIME,AIR_TIME,DISTANCE,WHEELS_ON,TAXI_IN,SCHEDULED_ARRIVAL,ARRIVAL_TIME,ARRIVAL_DELAY,DIVERTED,CANCELLED,CANCELLATION_REASON,AIR_SYSTEM_DELAY,SECURITY_DELAY,AIRLINE_DELAY,LATE_AIRCRAFT_DELAY,WEATHER_DELAY
2015,2,6,5,OO,6271,N937SW,FAR,DEN,1712,1701,-11,15,1716,123,117,95,627,1751,7,1815,1758,-17,0,0,,,,,,
2015,1,19,1,AA,1605,N496AA,DFW,ONT,1740,1744,4,15,1759,193,198,175,1188,1854,8,1853,1902,9,0,0,,,,,,
2015,3,8,7,NK,1068,N519NK,LAS,CLE,2220,2210,-10,12,2222,238,229,208,1824,450,9,518,459,-19,0,0,,,,,,
2015,9,21,1,AA,1094,N3EDAA,DFW,BOS,1155,1155,0,12,1207,223,206,190,1562,1617,4,1638,1621,-17,0,0,,,,,,
I would like to fill these empty strings with the number 0.
def import_parse_rdd(data):
    # create rdd
    rdd = sc.textFile(data)
    # remove the header
    header = rdd.first()
    rdd = rdd.filter(lambda row: row != header)  # filter out header
    # split by comma
    split_rdd = rdd.map(lambda line: line.split(','))
    row_rdd = split_rdd.map(lambda line: Row(
        YEAR = int(line[0]), MONTH = int(line[1]), DAY = int(line[2]), DAY_OF_WEEK = int(line[3]),
        AIRLINE = line[4], FLIGHT_NUMBER = int(line[5]),
        TAIL_NUMBER = line[6], ORIGIN_AIRPORT = line[7], DESTINATION_AIRPORT = line[8],
        SCHEDULED_DEPARTURE = line[9], DEPARTURE_TIME = line[10], DEPARTURE_DELAY = line[11], TAXI_OUT = line[12],
        WHEELS_OFF = line[13], SCHEDULED_TIME = line[14], ELAPSED_TIME = line[15], AIR_TIME = line[16], DISTANCE = line[17], WHEELS_ON = line[18], TAXI_IN = line[19],
        SCHEDULED_ARRIVAL = line[20], ARRIVAL_TIME = line[21], ARRIVAL_DELAY = line[22], DIVERTED = line[23], CANCELLED = line[24], CANCELLATION_REASON = line[25], AIR_SYSTEM_DELAY = line[26],
        SECURITY_DELAY = line[27], AIRLINE_DELAY = line[28], LATE_AIRCRAFT_DELAY = line[29], WEATHER_DELAY = line[30])
    )
    return row_rdd
The above is the code I am running.
I am working with RDD Row objects, not a DataFrame.
You can use na.fill("0") to replace all nulls with "0" strings.
spark.read.csv("path/to/file").na.fill(value="0").show()
In case you need integers, you can change the schema to convert string columns to integers.
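For example, a hedged sketch of that idea (the column name DEPARTURE_DELAY comes from the sample header above, and an existing SparkSession named spark is assumed):

from pyspark.sql import functions as F

df = (spark.read.csv("path/to/file", header=True)
          .na.fill(value="0")
          .withColumn("DEPARTURE_DELAY", F.col("DEPARTURE_DELAY").cast("int")))
df.show()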
You could add this to your dataframe to apply the change to a column named 'col_name':
from pyspark.sql import functions as F
(...)
.withColumn('col_name', F.regexp_replace('col_name', ' ', '0'))
You could use this syntax directly in your code.
You can add an if-else condition while creating the Row.
Let's consider WEATHER_DELAY.
row_rdd = split_rdd.map(lambda line: Row(  # all other cols as in the question,
    WEATHER_DELAY = 0 if "".__eq__(line[30]) else line[30]))
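If several delay columns need the same treatment, a small helper keeps the Row construction readable. This is a hedged sketch: the helper name to_int is mine, not from the question, and split_rdd is the RDD built in the question's code.

from pyspark.sql import Row

def to_int(value):
    # fall back to 0 when the CSV field is empty
    return int(value) if value.strip() != "" else 0

row_rdd = split_rdd.map(lambda line: Row(
    # ... all other columns as in the question ...
    AIR_SYSTEM_DELAY = to_int(line[26]),
    LATE_AIRCRAFT_DELAY = to_int(line[29]),
    WEATHER_DELAY = to_int(line[30])))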
Please allow me another try at your problem, this time mapping a small replacement function over the RDD (foreach() only runs side effects and returns nothing, so map() is used to get a transformed RDD back).
def f(x):
    return x.replace(' ', '0')
(...)
row_rdd = row_rdd.map(f)  # to be added at the end of your script
I have two DataFrames, df1 and df2. df2 consists of "tagname" and "value" columns. The dictionary "bucket_dict" holds the data from df2.
bucket_dict = dict(zip(df2.tagname,df2.value))
df1 has millions of rows and 3 columns: "apptag", "comments" and "Type". I want to match between these two DataFrames: if a dictionary key from bucket_dict is contained in df1["apptag"], then set df1["comments"] to that dictionary key and df1["Type"] to the corresponding bucket_dict value. I used the code below:
for each_tag in bucket_dict:
    df1.loc[(df1["apptag"].str.match(each_tag, case=False, na=False)), "comments"] = each_tag
    df1.loc[(df1["apptag"].str.match(each_tag, case=False, na=False)), "Type"] = bucket_dict[each_tag]
Is there a more efficient way to do this? It is currently taking a long time.
Bucketing DataFrame from which the dictionary has been created:
bucketing_df = pd.DataFrame([["pen", "study"], ["pencil", "study"], ["ersr","study"],["rice","grocery"],["wht","grocery"]], columns=['tagname', 'value'])
other dataframe:
output_df = pd.DataFrame([["test123-pen", "pen", " "], ["test234-pencil", "pencil", " "], ["test234-rice", "rice", " "]], columns=['apptag', 'comments', 'type'])
Required output:
You can do this by calling an apply on your apptag column along with a loc on your bucketing_df, in this manner:
def find_type(a):
    try:
        return (bucketing_df.loc[[x in a for x in bucketing_df['tagname']]])['value'].values[0]
    except:
        return ""

def find_comments(a):
    try:
        return (bucketing_df.loc[[x in a for x in bucketing_df['tagname']]])['tagname'].values[0]
    except:
        return ""

output_df['type'] = output_df['apptag'].apply(lambda a: find_type(a))
output_df['comments'] = output_df['apptag'].apply(lambda a: find_comments(a))
Here I had to make them separate functions so that cases where no tagname exists in apptag are handled.
It gives you this as the output_df -
apptag comments type
0 test123-pen pen study
1 test234-pencil pencil study
2 test234-rice rice grocery
All this code uses is the existing bucketing_df and output_df you provided at the end of your question.
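If the apply approach is still too slow on millions of rows, a vectorized sketch along these lines might be worth trying. This is an assumption on my part (not part of the answer above): it builds one alternation pattern from all tagnames (longest first, re.escape'd, so "pencil" wins over "pen"), extracts the first match, and maps it to its value.

import re
import pandas as pd

# one pattern with all tagnames, longest first, e.g. (pencil|pen|ersr|rice|wht)
pattern = '(' + '|'.join(re.escape(t) for t in sorted(bucket_dict, key=len, reverse=True)) + ')'
output_df['comments'] = output_df['apptag'].str.extract(pattern, expand=False)
output_df['type'] = output_df['comments'].map(bucket_dict).fillna("")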
Let's say I have these functions:
def query():
    dict = (
        { "NO" : 1, "PART" : "ALPHA" },
        { "NO" : 2, "PART" : "BETA" }
    )
    finalqueryresult = pandas.DataFrame()
    for info in dict:  # I use this loop to send one query per record in dict, in this example twice (2 records)
        finalqueryresult.append( sendquery(info["NO"], info["PART"]) )

def sendquery(no, part):
    # some code to request the query from the server and save it
    # under the reqresult variable
    # ...
    return reqresult
For the example above, sending the first query (the record with "NO" = 1) will return this (let's say this is df1):
NAME COUNTRY
1 RYO JPN
2 JON NZ
and the last query (the record with "NO" = 2) returns this (let's say this is df2):
NAME COUNTRY
1 TING CN
2 ASHYU INA
and what I want is for finalqueryresult to look like this (df1 combined with df2):
NAME COUNTRY
1 RYO JPN
2 JON NZ
3 TING CN
4 ASHYU INA
But I failed; finalqueryresult is always empty. I suppose something is wrong with this:
for info in dict:
    finalqueryresult.append( sendquery(info["NO"], info["PART"]) )
I think you need to first append all DataFrames to a list dfs and then use concat. (DataFrame.append returns a new DataFrame rather than modifying the original in place, which is why your loop leaves finalqueryresult empty; the method has since been deprecated in pandas.)
dfs = []
for info in dict:
    # sendquery(info["NO"], info["PART"]) returns a DataFrame
    dfs.append( sendquery(info["NO"], info["PART"]) )
finalqueryresult = pd.concat(dfs, ignore_index=True)
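Put together, a minimal sketch of query() with that fix applied could look like this; sendquery is the function from your question (assumed to return a DataFrame), and the tuple is renamed records so it does not shadow the built-in dict:

import pandas as pd

def query():
    records = (
        { "NO": 1, "PART": "ALPHA" },
        { "NO": 2, "PART": "BETA" }
    )
    # collect one DataFrame per record, then stack them with a fresh index
    dfs = [sendquery(info["NO"], info["PART"]) for info in records]
    return pd.concat(dfs, ignore_index=True)

finalqueryresult = query()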