Pivot issue in Databricks - Azure

I have a dataframe with values like this:
id  Country  Interest
00  Russian  Digestion;Destillation
I want to pivot the Interest column into a new column, in Azure Databricks using Python, like this:
id    Country  Int           Interest
00Q7  Russ     Digestion     Digestion;Destillation
00Q7  Russ     Destillation  Digestion;Destillation
Please advise how this can be done.
Regards,
RK

I have created a sample dataframe similar to yours using the following code:
data = [['00Q7','Russian Federation','Digestion;Destillation'],['00Q6','United States','Oils;Automobiles']]
df = spark.createDataFrame(data=data,schema = ['id','country','interests'])
display(df)
To get the desired output, I first split the data in the interests column using pyspark.sql.functions.split.
from pyspark.sql.functions import split,col
df1 = df.withColumn("interest", split(col("interests"), ";"))
display(df1)
Then I exploded the new interest column using pyspark.sql.functions.explode to get the required output.
from pyspark.sql.functions import explode
op = df1.withColumn('interest',explode(col('interest')))
display(op)
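For reference, showing op should give one row per interest value, roughly like this (output reconstructed from the sample data above):
op.show(truncate=False)
# +----+------------------+----------------------+------------+
# |id  |country           |interests             |interest    |
# +----+------------------+----------------------+------------+
# |00Q7|Russian Federation|Digestion;Destillation|Digestion   |
# |00Q7|Russian Federation|Digestion;Destillation|Destillation|
# |00Q6|United States     |Oils;Automobiles      |Oils        |
# |00Q6|United States     |Oils;Automobiles      |Automobiles |
# +----+------------------+----------------------+------------+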
UPDATE:
data = [['00Q7','Russian Federation','01_Digestion;02_Destillation']]
df = spark.createDataFrame(data=data,schema = ['id','country','interests'])
#display(df)
from pyspark.sql.functions import split,col
df1 = df.withColumn("interest", split(col("interests"), ";"))
from pyspark.sql.functions import explode
op = df1.withColumn('interest',explode(col('interest')))
#UPDATE
from pyspark.sql.functions import concat,lit
op.withColumn("set",concat(lit('Set'),split(col('interest'),'_').getItem(0))).show(truncate=False)
UPDATE-2:
pdf['set'] = pdf['interest'].str.split('_').str[0]
import numpy as np
pdf["set"] = np.where(pdf["set"].astype(int) < 10, 'Set' + pdf['set'].str[1], 'Set' + pdf['set'])

Related

Python function to iterate each unique column and transform using pyspark

I'm building the following global function in PySpark to go through each column in my CSV, whose names are in different formats, and convert them all to one consistent format separated by "_".
I am new to the Python world, and I am getting:
TypeError: Column is not iterable
employeesDF is read from a CSV file on the local filesystem.
I tried the below code:
def colrename(df):
    for col in employeesDF.columns:
        F.col(col).alias(col.replace('/s,#', '_'))
    return employeesDF

ndf = colrename(employeesDF.columns)
This will work
import re

def colrename(column):
    reg = re.sub(r'\s|#', '_', column)
    return reg

df2 = df2.toDF(*(colrename(c) for c in df2.columns))
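For illustration, a minimal self-contained sketch of what this renaming does (the sample column names are made up):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df2 = spark.createDataFrame([(1, 'Smith', 50000)], ['emp id', 'last name', 'salary#2022'])
df2 = df2.toDF(*(colrename(c) for c in df2.columns))
print(df2.columns)   # ['emp_id', 'last_name', 'salary_2022']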
In case anyone is interested, I used the code below to do it. I hope this information is useful. Thanks.
from pyspark.sql import *
import re

spark = SparkSession.builder.master("local").appName("test").getOrCreate()
df = spark.read.format('csv')\
    .option('header', True)\
    .option('inferschema', True)\
    .load('C:\\bigdata\\datasets\\employee10000_records.csv')

def colrename(df):
    for names in df.schema.names:
        df = df.withColumnRenamed(names, re.sub(r'([^A-Za-z0-9])', '_', names))
    return df

colrename(df).show()

finding latest trip information from a large data frame

I have one requirement:
I have a dataframe "df_input" with 20M rows of trip details. The columns are "vehicle-no", "geolocation", "start", "end".
For each vehicle number there are multiple rows, each with a different geolocation for different trips.
Now I want to create a new dataframe df_final that keeps only the first record for each vehicle-no. How can I do that efficiently?
I used something like the code below, which takes more than 5 hours to complete:
import pandas as pd
import dfply as dp
from dfply import X

output_df_columns = ["vehicle-no", "start", "end", "geolocations"]
df_final = pd.DataFrame(columns=output_df_columns)  # create empty dataframe
unique_vehicle_no = list(df_input["vehicle-no"].unique())
df_input.sort_values(["start"], inplace=True)
for each_vehicle in unique_vehicle_no:
    df_temp = (df_input >> dp.mask(X.vehicle-no == each_vehicle))
    df_final = df_final.append(df_temp.head(1), ignore_index=True, sort=False)
I think this will work out
import pandas as pd
import numpy as np
df_input=pd.DataFrame(np.random.randint(10,size=(1000,3)),columns=['Geolocation','start','end'])
df_input['vehicle_number']=np.random.randint(100,size=(1000))
print(df_input.shape)
print(df_input['vehicle_number'].nunique())
df_final=df_input.groupby('vehicle_number').apply(lambda x : x.head(1)).reset_index(drop=True)
print(df_final['vehicle_number'].nunique())
print(df_final.shape)
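On 20M rows the groupby().apply(lambda x: x.head(1)) pattern can still be slow; a sketch of a usually faster equivalent (assuming the same sample columns, and that the wanted record is the one with the earliest start) is:
df_final = (df_input.sort_values("start")
                    .drop_duplicates(subset="vehicle_number", keep="first")
                    .reset_index(drop=True))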

How can I find out which operands are supported before looping through pandas dataframe?

I am attempting to iterate over rows in a Series within a Pandas DataFrame. I would like to take the value in each row of the column csv_df['Strike'] and plug it into variable K, which gets called in function a.
Then, I want the output a1 and a2 to be put into their own columns within the DataFrame.
I am receiving the error TypeError: unsupported operand type(s) for *: 'int' and 'zip', and I figure that if I can find out which operands are supported, I could convert a1 and a2 to that.
Am I thinking about this correctly?
Note: S is just a static number as the df is just one row, while K has many rows.
Code is below:
from scipy.stats import norm
from math import sqrt, exp, log, pi
import pandas as pd
pd.core.common.is_list_like = pd.api.types.is_list_like
import fix_yahoo_finance as yf
yf.pdr_override()
import numpy as np
import datetime
from pandas_datareader import data, wb
import matplotlib.pyplot as plt
#To get data:
start = datetime.datetime.today()
end = datetime.datetime.today()
df = data.get_data_yahoo('AAPL', start, end) #puts data into a pandas dataframe
csv_df = pd.read_csv('./AAPL_TEST.csv')
for row in csv_df.itertuples():
    def a(S, K):
        a1 = 100 * K
        a2 = S
        return a1
    S = df['Adj Close'].items()
    K = csv_df['strike'].items()
    a1, a2 = a(S, K)
    df['new'] = a1
    df['new2'] = a2
It seems an alternate way of doing what you want would be to apply your method to each data frame separately, as in:
df = data.get_data_yahoo('AAPL', start, end)
csv_df = pd.read_csv('./AAPL_TEST.csv')
df['new'] = csv_df['strike'].apply(lambda x: 100 * x)
df['new2'] = df['Adj Close']
Perhaps, applying the calculation directly to the Pandas Series (a column of your data frame) is a way to avoid defining a method that is used only once.
Plus, I wouldn't define a method within a loop as you have.
Cheers
P.S. I believe you have forgotten to return both values in your method:
def a(S, K):
    a1 = 100 * K
    a2 = S
    return (a1, a2)

Is there a loop function for automating data pull request from Google Trends?

I want to create a loop that pulls data from Google Trends via PyTrends. I need to iterate through a lot of keywords, but Google Trends only allows comparing five keywords at a time, so I iterate through the keywords manually and build a pandas dataframe. However, something seems off.
I get data, but the resulting dataframe has values shifted into different rows, with duplicate "NaN" values.
Instead of 62 rows I get 372 rows (with duplicate "NaN" values).
from pytrends.request import TrendReq
import pandas as pd
pytrend = TrendReq()
kw_list = ['cool', 'fun', 'big','house', 'phone', 'garden']
df1 = pd.DataFrame()
for i in kw_list:
    kw_list = i
    pytrend.build_payload([kw_list], timeframe='2015-10-14 2015-12-14', geo='FR')
    df1 = df1.append(pytrend.interest_over_time())
print(df1.head)
I want to have one coherent dataframe, with the columns 'cool', 'fun', 'big','house', 'phone', 'garden' and their respective values in each column on the same row. Like e.g. a dataframe with 62 rows and 6 columns.
I'm probably going to lose some rep points because this is an old question (oldish, anyway), but I was struggling with the same problem and I solved it like this:
import pytrends
import pandas as pd
from pytrends.request import TrendReq

pytrend = TrendReq()
kw_list = ['cool', 'fun', 'big', 'house', 'phone', 'garden']
df_gtrends_kw = {}
df_gtrends = pd.DataFrame()
for kw in kw_list:
    pytrend.build_payload(kw_list=[kw], timeframe='today 12-m')
    df_gtrends_kw[kw] = pytrend.interest_by_region(resolution='COUNTRY')

df_gtrends = pd.concat([df_gtrends_kw[key] for key in kw_list], join='inner', axis=1)
According to the official docs, one has to specify the axis along which to glue the dataframes together; in this case the column axis (axis=1), since each per-keyword dataframe shares the same index.
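The same pattern should also work for the original goal of one time-indexed dataframe with a column per keyword (a sketch, assuming interest_over_time() returns a column named after the keyword plus an isPartial column):
df_by_kw = {}
for kw in kw_list:
    pytrend.build_payload(kw_list=[kw], timeframe='2015-10-14 2015-12-14', geo='FR')
    df_by_kw[kw] = pytrend.interest_over_time()[kw]   # keep only the keyword column, dropping isPartial
df_over_time = pd.concat([df_by_kw[kw] for kw in kw_list], join='inner', axis=1)
print(df_over_time.shape)   # should be roughly 62 rows x 6 columns for this timeframe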

How to get Salesforce data into Python pandas dataframes

Currently we are exporting Salesforce data to a CSV file and reading that CSV into pandas with the read_csv and to_csv methods. Is there any other way to get data from Salesforce into a pandas dataframe?
With Python - you can download a package called Simple Salesforce and write SOQL queries to return data
https://github.com/simple-salesforce/simple-salesforce
Here's an example of how to do this:
import pandas as pd
from simple_salesforce import Salesforce

sf = Salesforce(username='<enter username>', password='<enter password>',
                security_token='<enter your access token from your profile>')
a_query = pd.DataFrame(sf.query(
    "SELECT Name, CreatedDate FROM User")['records'])
In my case, to display the information as a dataframe I had to use the following code:
# Import libraries
import simple_salesforce as ssf, pandas
# Create the connection
session_id, instance = ssf.SalesforceLogin(username='<username>', password='<password>', security_token='<token>', sandbox=False)
sf_ = ssf.Salesforce(instance=instance, session_id=session_id)
# Query to execute
sql_code = "SELECT id, name FROM main_table"
# Store query result as dataframe
information = sf_.query(query= sql_code)
table = pandas.DataFrame(information['records']).drop(columns='attributes')
Adding to the original answer, the function below is also suitable for simple joins.
def sf_results_to_dataframe(results, drop_index=True) -> pd.DataFrame:
    df = pd.DataFrame(results['records'])
    df.drop('attributes', axis=1, inplace=True)  # clean up from technical info
    df.set_index('Id', drop=drop_index, inplace=True)
    for table in ['Account', 'Contact', 'Lead', 'Opportunity']:
        if table in results['records'][0].keys():  # detect JOIN
            local_keys = list(results['records'][0][table].keys())  # keys from the joined table
            if 'attributes' in local_keys:
                local_keys.remove('attributes')
            global_keys = [table + key for key in local_keys]  # names for the fields in the output table
            # fields of the joined table and the record index
            table_records = [{'Id': record['Id'],
                              **{global_key: record[table][local_key]
                                 for global_key, local_key in zip(global_keys, local_keys)}}
                             for record in results['records']]
            df_extra = pd.DataFrame(table_records)
            df_extra.set_index('Id', drop=True, inplace=True)  # match index
            df.drop(table, axis=1, inplace=True)  # drop duplicated info
            df = df.merge(df_extra, left_index=True, right_index=True)  # merge on index
    return df
Example:
import pandas as pd
from simple_salesforce import Salesforce
SALESFORCE_EMAIL = '...'
SALESFORCE_TOKEN = '...'
SALESFORCE_PASSWORD = '...'
sf = Salesforce(username=SALESFORCE_EMAIL, password=SALESFORCE_PASSWORD, security_token=SALESFORCE_TOKEN)
query = """SELECT Id, Name, Account.Name
FROM Contact
LIMIT 1
"""
results = sf.query(query)
df = sf_results_to_dataframe(results)
