Extracting Data From Pandas DataFrame - python-3.x

I have two pandas dataframes named df1 and df2. I want to extract the files with matching names from both dataframes and put the extracted paths in two columns of a new dataframe. I want to take the file names from df1 and match them against df2 (df2 has more files than df1). Each dataframe (df1 and df2) has only one column. The bold part starting with the letter s**** is the common matching alphanumeric identifier; we have to match the two dataframes on that.
df1["Text_File_Location"] =
0 /home/mzkhan/2.0.0/files/p15/p15546261/s537061.txt
1 /home/mzkhan/2.0.0/files/p15/p15098455/s586955.txt
df2["Image_File_Location"]=
0 /media/mzkhan/external_dr/radiology_image/2.0.0/files/p10/p10000032/s537061/02aa804e- bde0afdd-112c0b34-7bc16630-4e384014.jpg
1 /media/mzkhan/external_dr/radiology_image/2.0.0/files/p10/p10000032/s586955/174413ec-4ec4c1f7-34ea26b7-c5f994f8-79ef1962.jpg

In Python 3.4+, you can use pathlib to work with file paths conveniently. You can extract the filename without the extension (the "stem") from df1, and the parent folder name from df2. Then you can do an inner merge on those names.
import pandas as pd
from pathlib import Path
df1 = pd.DataFrame(
    {
        "Text_File_Location": [
            "/home/mzkhan/2.0.0/files/p15/p15546261/s537061.txt",
            "/home/mzkhan/2.0.0/files/p15/p15098455/s586955.txt",
        ]
    }
)
df2 = pd.DataFrame(
    {
        "Image_File_Location": [
            "/media/mzkhan/external_dr/radiology_image/2.0.0/files/p10/p10000032/s537061/02aa804e- bde0afdd-112c0b34-7bc16630-4e384014.jpg",
            "/media/mzkhan/external_dr/radiology_image/2.0.0/files/p10/p10000032/s586955/174413ec-4ec4c1f7-34ea26b7-c5f994f8-79ef1962.jpg",
            "/media/mzkhan/external_dr/radiology_image/2.0.0/files/p10/p10000032/foo/bar.jpg",
        ]
    }
)
df1["name"] = df1["Text_File_Location"].apply(lambda x: Path(str(x)).stem)
df2["name"] = df2["Image_File_Location"].apply(lambda x: Path(str(x)).parent.name)
df3 = pd.merge(df1, df2, on="name", how="inner")
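If you only want the two original path columns in the merged result, you can drop the helper column afterwards. A small follow-up sketch:
df3 = df3.drop(columns=["name"])  # remove the join key added above, keeping only the two path columns
print(df3)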

Related

Iteratively append new data into pandas dataframe column and join with another dataframe

I have been extracting data from many APIs. I would like to add a common column across all of them, and I have tried the code below:
And I have tried below
df = pd.DataFrame()
for i in range(1, 200):
    url = '{id}/values'.format(id=i)
    res = requests.get(url, headers=headers)
    if res.status_code == 200:
        data = json.loads(res.content.decode('utf-8'))
        if data['success']:
            df['id'] = i
            test = pd.json_normalize(data[parent][child])
            df = df.append(test, index=False)
But in the dataframe's id column I'm getting only the last iterated id. And for APIs that return many rows I'm getting invalid data.
For performance reasons it would be better to first store the data in a dictionary and then create the dataframe from that dictionary:
import pandas as pd
from collections import defaultdict

d = defaultdict(list)
for i in range(1, 200):
    # simulate the dataframe retrieved from the pd.json_normalize() call
    row = pd.DataFrame({'id': [i], 'field1': [f'f1-{i}'], 'field2': [f'f2-{i}'], 'field3': [f'f3-{i}']})
    for k, v in row.to_dict().items():
        d[k].append(v[0])
df = pd.DataFrame(d)
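Applied to the original API loop, the same idea looks roughly like the sketch below. It keeps the URL pattern, headers, and the data[parent][child] placeholders from the question, collects one frame per successful response in a list, tags each frame with its id, and concatenates once at the end:
import json
import requests
import pandas as pd

frames = []
for i in range(1, 200):
    url = '{id}/values'.format(id=i)                   # URL pattern taken from the question
    res = requests.get(url, headers=headers)           # headers assumed to be defined elsewhere, as in the question
    if res.status_code == 200:
        data = json.loads(res.content.decode('utf-8'))
        if data['success']:
            test = pd.json_normalize(data[parent][child])  # parent/child are the question's placeholders
            test['id'] = i                             # tag every row of this response with its id
            frames.append(test)

df = pd.concat(frames, ignore_index=True)              # one concat at the end instead of repeated appends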

Merge based on multiple columns of all excel files from a directory in Python

Say I have a dataframe df, and a directory ./ which has the following excel files inside:
import os

path = './'
for root, dirs, files in os.walk(path):
    for file in files:
        if file.endswith(('.xls', '.xlsx')):
            print(os.path.join(root, file))
            # dfs.append(read_dfs(os.path.join(root, file)))
            # df = reduce(lambda left, right: pd.concat([left, right], axis=0), dfs)
Out:
df1.xlsx,
df2.xlsx,
df3.xls
...
I want to merge df with all the files from path based on the common columns date and city. It works with the following code, but it's not concise enough, so I'm asking how the code can be improved. Thank you.
df = pd.merge(df, df1, on = ['date', 'city'], how='left')
df = pd.merge(df, df2, on = ['date', 'city'], how='left')
df = pd.merge(df, df3, on = ['date', 'city'], how='left')
...
Reference:
pandas three-way joining multiple dataframes on columns
The following code may work:
from functools import reduce
dfs = [df0, df1, df2, dfN]
df_final = reduce(lambda left, right: pd.merge(left, right, on=['date', 'city']), dfs)
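Combined with the directory walk from the question, a sketch might look like the following. It assumes every Excel file shares the date and city columns and uses pd.read_excel in place of the commented-out read_dfs helper:
import os
from functools import reduce
import pandas as pd

path = './'
dfs = [df]  # start from the existing dataframe df
for root, dirs, files in os.walk(path):
    for file in files:
        if file.endswith(('.xls', '.xlsx')):
            dfs.append(pd.read_excel(os.path.join(root, file)))

# left-merge everything onto df on the shared columns, as in the question
df_final = reduce(lambda left, right: pd.merge(left, right, on=['date', 'city'], how='left'), dfs)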

looping through list of pandas dataframes and make it empty dataframe

I have multiple pandas dataframes. I want to empty each dataframe, like below:
df1 = pd.DataFrame()
df2 = pd.DataFrame()
Instead of doing it individually, is there any way to do it in one line of code?
If I understood correctly, this will work:
df_list = []
for i in range(0, 10):
    df = pd.DataFrame()
    df_list.append(df)
print(df_list[0].head())
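If the goal is simply a fixed number of fresh empty dataframes, a list comprehension does the same thing in one line (a minimal sketch, assuming ten dataframes as above):
df_list = [pd.DataFrame() for _ in range(10)]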

converting the values in a text file and making new text file in python

I have a text file like this example:
example:
"class" "Name" "Access" "CF33456_12.RCC" "CF33457_05.RCC" "CF33458_04.RCC"
"ff" "edi" "ff" "kju" 2444.91910958478 1669.55827263364 699.627215729572
"gg" "edi" "gg" "uhy" 2002.95278984564 369.565070720533 351.056685823175
In this file there are 6 columns (based on the headers), so the first field of each data row is the row name. I would like to change the numbers (the last 3 columns) to their log2 values and make a new file with exactly the same structure. Here is the expected output:
expected output:
"class" "Name" "Access" "CF33456_12.RCC" "CF33457_05.RCC" "CF33458_04.RCC"
"ff" "edi" "ff" "kju" 11.2555710189065 10.7052507333626 9.45044260143907
"gg" "edi" "gg" "uhy" 10.9679127014901 8.52968459728736 8.45556019395986
I am trying to do that in Python using this code:
import pandas as pd
import numpy as np

df = pd.read_table("myfile.txt", index_col=0)
df2 = df.iloc[:, [3, 4, 5]]
df3 = np.array(df2)
df4 = np.log2(df3)
final = pd.DataFrame(df4)
It converts to log2 values, but it does not return a file with the same structure. Do you know how to fix it?
In your example, the original dataframe (which has the structure of the input table) can be changed using this code:
import pandas as pd
import numpy as np

df = pd.read_table("myfile.txt", index_col=0)
df2 = df.iloc[:, [3, 4, 5]]
df3 = np.array(df2)
df4 = np.log2(df3)
df.iloc[:, [3, 4, 5]] = df4
final = df
(It is obvious that df4 has a different format: it is a slice of the table, and the indexes are removed while converting to a numpy array.)
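To actually write the new file, a short follow-up sketch; the output filename is made up, and the tab separator and non-numeric quoting are assumptions based on how the example file looks:
import csv
# write the modified table back out with quoted string columns, preserving the index
final.to_csv("myfile_log2.txt", sep="\t", quoting=csv.QUOTE_NONNUMERIC)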

PySpark: Search For substrings in text and subset dataframe

I am brand new to PySpark and want to translate my existing pandas / Python code to PySpark.
I want to subset my dataframe so that only rows containing the specific keywords I'm looking for in the 'original_problem' field are returned.
Below is the Python code I tried in PySpark:
def pilot_discrep(input_file):
    df = input_file
    searchfor = ['cat', 'dog', 'frog', 'fleece']
    df = df[df['original_problem'].str.contains('|'.join(searchfor))]
    return df
When I try to run the above, I get the following error:
AnalysisException: u"Can't extract value from original_problem#207:
need struct type but got string;"
In PySpark, try this:
df = df[df['original_problem'].rlike('|'.join(searchfor))]
Or equivalently:
import pyspark.sql.functions as F
df.where(F.col('original_problem').rlike('|'.join(searchfor)))
Alternatively, you could go for udf:
import pyspark.sql.functions as F
searchfor = ['cat', 'dog', 'frog', 'fleece']
check_udf = F.udf(lambda x: x if x in searchfor else 'Not_present')
df = df.withColumn('check_presence', check_udf(F.col('original_problem')))
df = df.filter(df.check_presence != 'Not_present').drop('check_presence')
But the built-in DataFrame methods are preferred because they avoid the Python UDF overhead and will be faster.
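A minimal, self-contained check of the rlike approach; the sample values are made up for illustration:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
searchfor = ['cat', 'dog', 'frog', 'fleece']

# tiny dataframe with one string column, mirroring the 'original_problem' field
sample = spark.createDataFrame(
    [("the cat sat",), ("nothing here",), ("green frog",)],
    ["original_problem"],
)
sample.where(F.col("original_problem").rlike("|".join(searchfor))).show()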
