Importing langdetect output into a new column in my dataframe - python-3.x

Being rather new to programming with Python, I tried to detect the language of text segments in a pandas data frame.
First I wrote a function around the 'langdetect' package:
import pandas as pd
from langdetect import detect

def language_detect(x):
    lang = detect(x)
    print(lang)
My second step would be to feed in the data frame for processing. All the segments that need detecting are in separate rows in the dataframe under the same column header.
result = [language_detect(x) for x in df['column_name']]
df['l_detect'] = pd.append(result)
In the output I see the texts being recognized properly, but when I try to print result, it returns only 'None' for every entry.
So my questions are:
Why do I get 'None' when the print output from the function shows the right values?
How can I attach this to my current data frame? When I try to append it, I get 'None' in every field as well.
Thanks in advance.

The problem is that result contains only None values because your function language_detect() doesn't return anything (it only prints the results).
import pandas as pd
from langdetect import detect

lst = [('this is a test', 1), ('what language is this?', 4), ('stackoverflow is a website', 23)]
df = pd.DataFrame(lst, columns=['text', 'something'])

def language_detect(x):
    lang = detect(x)
    print(lang)

result = [language_detect(x) for x in df['text']]
result
#Output: [None, None, None]
Just give it a return value:
def language_detect(x):
    lang = detect(x)
    return lang
df['l_detect'] = df['text'].apply(language_detect)
df.head()
#Output:
#                         text  something l_detect
#0              this is a test          1       en
#1      what language is this?          4       en
#2  stackoverflow is a website         23       en
and it will work as expected.
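One caveat worth adding: langdetect's detection is non-deterministic, so short or ambiguous texts can come back with different labels across runs. A minimal sketch, using the package's documented DetectorFactory seed, to make results reproducible:
from langdetect import DetectorFactory, detect

# langdetect samples internally at random; fixing the seed makes
# repeated calls on the same text return the same language
DetectorFactory.seed = 0

print(detect('this is a test'))  # 'en', reproducibly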

Related

Python function to iterate each unique column and transform using pyspark

I'm building the following global function in PySpark to go through each column in my CSV, which come in different formats, and convert them all to one uniform format with the special characters replaced by "_".
I am new to the Python world, and I am getting:
TypeError: Column is not iterable
employeesDF is read from a CSV file on the local system.
I tried the code below:
def colrename(df):
    for col in employeesDF.columns:
        F.col(col).alias(col.replace('/s,#', '_'))
    return employeesDF

ndf = colrename(employeesDF.columns)
This will work:
import re

def colrename(column):
    reg = re.sub(r'\s|#', '_', column)
    return reg

df2 = df2.toDF(*(colrename(c) for c in df2.columns))
In case anyone is interested, I used the code below to do it. I hope this information is useful. Thanks.
from pyspark.sql import *
import re

spark = SparkSession.builder.master("local").appName("test").getOrCreate()

df = spark.read.format('csv')\
    .option('header', True)\
    .option('inferschema', True)\
    .load('C:\\bigdata\\datasets\\employee10000_records.csv')

def colrename(df):
    for names in df.schema.names:
        df = df.withColumnRenamed(names, re.sub(r'([^A-Za-z0-9])', '_', names))
    return df

colrename(df).show()
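Both answers amount to the same thing: compute a cleaned name for every column and apply the whole list at once. An equivalent sketch, assuming a DataFrame df already loaded as above, uses select with alias so the original frame is left untouched:
import re
from pyspark.sql import functions as F

def clean(name):
    # replace every non-alphanumeric character with an underscore
    return re.sub(r'[^A-Za-z0-9]', '_', name)

# select each column under its cleaned alias; df itself is not modified
renamed = df.select(*[F.col(c).alias(clean(c)) for c in df.columns])
renamed.printSchema()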

How to check which row is producing the LangDetectException error in LangDetect?

I have a dataset of tweets that contains tweets mainly in English, but also several tweets in Indian languages (such as Punjabi, Hindi, Tamil, etc.). I want to keep only the English-language tweets and remove rows with tweets in other languages.
I tried this [https://stackoverflow.com/questions/67786493/pandas-dataframe-filter-out-rows-with-non-english-text] and it worked on the sample dataset. However, when I tried it on my dataset it showed the error:
LangDetectException: No features in text.
Also, I have already checked another question [https://stackoverflow.com/questions/69804094/drop-non-english-rows-pandas] where the accepted answer mentions that empty rows might be the reason for this error, so I already cleaned my dataset to remove all the empty rows.
Simple code which worked on sample data but not on original data:
from langdetect import detect
import pandas as pd
df = pd.read_csv('Sample.csv')
df_new = df[df.text.apply(detect).eq('en')]
print('New df is: ', df_new)
How can I check which row is producing the error?
Thanks in advance!
Use a custom function that returns True if detect fails:
df = pd.read_csv('Sample.csv')

def f(x):
    try:
        detect(x)
        return False
    except:
        return True

s = df.loc[df.text.apply(f), 'text']
Another idea is to create a new column filled by detect, returning NaN on failure; then filter the rows with missing values into df1, and build df_new from the rows where the new column (the output of detect) equals 'en':
import numpy as np

df = pd.read_csv('Sample.csv')

def f1(x):
    try:
        return detect(x)
    except:
        return np.nan

df['new'] = df.text.apply(f1)
df1 = df[df.new.isna()]
df_new = df[df.new.eq('en')]
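As a small refinement of the bare except above, and assuming the exception class shipped with the langdetect package, catching LangDetectException specifically avoids silently swallowing unrelated errors:
import numpy as np
from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException

def f1(x):
    try:
        return detect(x)
    except LangDetectException:
        # raised e.g. for empty strings or text with no detectable features
        return np.nan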

Why can't the substring be found in the target string?

To understand the values of each variable, I adapted a replacement script from a Udacity class: I converted the code inside a function into plain top-level code. However, my code does not work while the code in the function does. I would appreciate it if anyone could explain this; please pay particular attention to the function "tokenize".
The code below is from the Udacity class (copyright belongs to Udacity).
# download necessary NLTK data
import nltk
nltk.download(['punkt', 'wordnet'])

# import statements
import re
import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

url_regex = 'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

def load_data():
    df = pd.read_csv('corporate_messaging.csv', encoding='latin-1')
    df = df[(df["category:confidence"] == 1) & (df['category'] != 'Exclude')]
    X = df.text.values
    y = df.category.values
    return X, y

def tokenize(text):
    detected_urls = re.findall(url_regex, text)  # here, "detected_urls" is a list for sure
    for url in detected_urls:
        text = text.replace(url, "urlplaceholder")  # I do not understand why this works here but not in my code unless I convert url to a string
    tokens = word_tokenize(text)
    lemmatizer = WordNetLemmatizer()
    clean_tokens = []
    for tok in tokens:
        clean_tok = lemmatizer.lemmatize(tok).lower().strip()
        clean_tokens.append(clean_tok)
    return clean_tokens

X, y = load_data()
for message in X[:5]:
    tokens = tokenize(message)
    print(message)
    print(tokens, '\n')
I want to understand the variables' values in the function "tokenize()". Here is my code:
X, y = load_data()

detected_urls = []
for message in X[:5]:
    detected_url = re.findall(url_regex, message)
    detected_urls.append(detected_url)
print("detected_urls: ", detected_urls)  # outputs the list without problems

# replace each url in the text string with a placeholder
i = 0
for url in detected_urls:
    text = X[i].strip()
    i += 1
    print("LN1.url= ", url, "\ttext= ", text, "\n type(text)=", type(text))
    url = str(url).strip()  # if I do not convert it to a string it is a list; it does not work in text.replace() below, but works in the function above
    if url in text:
        print("yes")
    else:
        print("no")  # always shows no
    text = text.replace(url, "urlplaceholder")
    print("\nLN2.url=", url, "\ttext= ", text, "\n type(text)=", type(text), "\n===============\n\n")
The printed values for "LN1" and "LN2" are the same, and the "if" condition always prints "no". I do not understand why this happens.
Any further help and advice would be highly appreciated.
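No answer is recorded for this question here, but the code above suggests a likely cause, offered as a sketch rather than a confirmed fix: re.findall returns a list per message, and the outer loop appends whole lists, so each url in the second loop is a list like ['http://...']; str(url) then yields "['http://...']" including brackets and quotes, which is never a substring of the text. Iterating the strings inside each findall result restores the function's behaviour:
# hypothetical rework of the debugging loop: url is a plain string here
for message in X[:5]:
    text = message.strip()
    for url in re.findall(url_regex, text):
        print(url in text)  # now prints True for every detected url
        text = text.replace(url, "urlplaceholder")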

Finding specific words in a pandas column, assigning them to a new column, and replicating the row

I am trying to find specific words in a pandas column and assign them to a new column; a cell may contain two or more of these words. Once I find a match, I wish to replicate the row, creating one row per matched word.
import pandas as pd
import numpy as np
import re

wizard = pd.read_excel(r'C:\Python\L\Book1.xlsx',
                       sheet_name='Sheet1',
                       header=0)
test_set = {'941', '942'}
test_set2 = {'MN', 'OK', '33/3305'}
wizard['ZTYPE'] = wizard['Comment'].apply(lambda x: any(i in test_set for i in x.split()))
wizard['ZJURIS'] = wizard['Comment'].apply(lambda x: any(i in test_set2 for i in x.split()))
wizard_new = pd.DataFrame(np.repeat(wizard.values, 3, axis=0))
wizard_new.columns = wizard.columns
wizard_new.head()
I am getting True and False, but I am unable to split out the matched words. The sample data below shows how the comments look. I need to find anything like '33/3305'; the year could be entered as '19' or '2019', the quarter as 'Q1', '1Q', 'Q 1' or '1 Q', plus the values in my test sets.
I also tried this approach:
ZJURIS = dict(list(itertools.chain(*[[(y_, x) for y_ in y] for x, y in wizard.comment()])))

def to_category(x):
    for w in x.lower().split(" "):
        if w in ZJURIS:
            return ZJURIS[w]
    return None
Finally, apply the method on the column and save the result to a new one:
wizard["ZJURIS"] = wizard["comment"].apply(to_category)
I tried the above solution, but it did not work.
Any suggestions on how to get the code to work?
Sample data.
data={ 'ID':['351362278576','351539320880','351582465214','351609744560','351708198604'],
'BU':['SBS','MAS','NAS','ET','SBS'],
'Comment':['940/941/w2-W3NYSIT/SUI33/3305/2019/1q','OK SUI 2Q19','941 - 3Q2019NJ SIT - 3Q2019NJ SUI/SDI - 3Q2019','IL,SUI,2016Q4,2017Q1,2017Q2','1Q2019 PA 39/5659 39/2476','UT SIT 1Q19-3Q19']
}
df = pd.DataFrame(data)
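No answer survives in this thread, but one plausible approach, sketched under the assumption that matches can be pulled straight from the comment text, is to collect every occurrence with str.findall and then create one row per match with DataFrame.explode:
import re
import pandas as pd

data = {'ID': ['351362278576', '351539320880'],
        'Comment': ['940/941/w2-W3NYSIT/SUI33/3305/2019/1q', 'OK SUI 2Q19']}
df = pd.DataFrame(data)

test_set2 = {'MN', 'OK', '33/3305'}
# build an alternation pattern from the test set and find every occurrence
pattern = '|'.join(re.escape(t) for t in test_set2)
df['ZJURIS'] = df['Comment'].str.findall(pattern)

# one row per matched token; comments with no match become NaN
df_exploded = df.explode('ZJURIS')
print(df_exploded)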

How to import files using a for loop with path names in dictionary in Python?

I want to create a dictionary which holds all the information needed to import the files, parse dates, etc. Then I want to use a for loop to import all these files. But after the for loop finishes I'm left with only the last dataset, as if the loop overwrites the others.
I execute the script in the folder containing the files, so the paths are not a problem.
I tried creating a new dictionary to which I add each import, but that makes it much harder later when I need to reference them. I want them as separate dataframes in the variable explorer.
Here's the code:
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.ticker import AutoMinorLocator # for time series visualisation
# Import data
#PATH = r"C:\Users\sherv\OneDrive\Documents\GitHub\Python-Projects\Research Project\Data"
data = {"google":["multiTimeline.csv", "Month"],
"RDPI": ["RealDisposableIncome-2004-1_Present-Mon-US(Grab-30-11-18).csv", "DATE"],
"CPI": ["CPI.csv", "DATE"],
"GDP": ["GDP.csv", "DATE"],
"UE": ["Unemployment_2004_Present_US(Grab-5-12-18).csv", "DATE"],
"SP500": ["S&P500.csv", "Date"],
"IR": ["InterestRate_2004-1-1_Present_US(Grab-5-12-18).csv", "DATE"],
"PPI": ["PPIACO.csv", "DATE"],
"PMI": ["ISM-MAN_PMI.csv", "Date"]}
for dataset in data.keys():
    dataset = pd.read_csv("%s" %(data[dataset][0]), index_col="%s" %(data[dataset][1]), parse_dates=["%s" %(data[dataset][1])])
    dataset = dataset.loc["2004-01-01":"2018-09-01"]
# Visualise
minor_locator = AutoMinorLocator(12)
# Investigating overall trends
def google_v_X(Data_col, yName, title):
    fig, ax1 = plt.subplots()
    google["Top5"].plot(ax=ax1, color='b').xaxis.set_minor_locator(minor_locator)
    ax1.set_xlabel('Date')
    ax1.set_ylabel('google (%)', color='b')
    ax1.tick_params('y', colors='b')
    plt.grid()
    ax2 = ax1.twinx()
    Data_col.plot(ax=ax2, color='r')
    ax2.set_ylabel('%s' %(yName), color='r')
    ax2.tick_params('%s' %(yName), colors='r')
    plt.title("Google vs %s trends" %(title))
# Google-CPI
google_v_X(CPI["CPI"], "CPI 1982-1985=100 (%)", "CPI")
# Google-RDPI
google_v_X(RDPI["DSPIC96"], "RDPI ($)", "RDPI")
# Google-GDP
google_v_X(GDP["GDP"], "GDP (B$)", "GDP")
# Google-UE
google_v_X(UE["Value"], "Unemployed persons", "Unemployment")
# Google-SP500
google_v_X(SP500["Close"], "SP500", "SP500")
# Google-PPI
google_v_X(PPI["PPI"], "PPI", "PPI")
# Google-PMI
google_v_X(PMI["PMI"], "PMI", "PMI")
# Google-IR
google_v_X(IR["FEDFUNDS"], "Fed Funds Rate (%)", "Interest Rate")
I also tried creating a function to read and parse and then use that in a loop like:
def importdata(key, path, parseCol):
    key = pd.read_csv("%s" %(path), index_col="%s" %(parseCol), parse_dates=["%s" %(parseCol)])
    key = key.loc["2004-01-01":"2018-09-01"]

for dataset in data.keys():
    importdata(dataset, data[dataset][0], data[dataset][0])
But I get an error because it doesn't recognise the path as a string; it says it's not defined.
How can I stop the datasets overwriting each other, or how can I get Python to recognise the input to the function as a string? Any help is appreciated, thanks.
The for loop is referencing the same dataset variable, so each time the loop executes, the variable is replaced with the newly imported dataset. You need to store the result somewhere, whether that's as a new variable each time or in a dictionary. Try something like this:
googleObj = None
RDPIObj = None
CPIObj = None

data = {"google": [googleObj, "multiTimeline.csv", "Month"],
        "RDPI": [RDPIObj, "RealDisposableIncome-2004-1_Present-Mon-US(Grab-30-11-18).csv", "DATE"],
        "CPI": [CPIObj, "CPI.csv", "DATE"]}

for dataset in data.keys():
    obj = pd.read_csv("%s" %(data[dataset][1]), index_col="%s" %(data[dataset][2]), parse_dates=["%s" %(data[dataset][2])])
    obj = obj.loc["2004-01-01":"2018-09-01"]
    data[dataset][0] = obj  # store the frame back so it is not overwritten on the next pass
This way each of your datasets ends up with its own dataframe object. The downside is that you have to define each variable up front.
Another option is making a second dictionary like you mentioned, something like this:
data = {"google":["multiTimeline.csv", "Month"],
"RDPI": ["RealDisposableIncome-2004-1_Present-Mon-US(Grab-30-11-18).csv", "DATE"],
"CPI": ["CPI.csv", "DATE"]}
output_data = {}
for dataset_key in data.keys():
dataset = pd.read_csv("%s" %(data[dataset_key][0]), index_col="%s" %(data[dataset_key][1]), parse_dates=["%s" %(data[dataset_key][1])])
dataset = dataset.loc["2004-01-01":"2018-09-01"]
output_data[dataset_key] = dataset
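Each frame can then be retrieved by its key, which also keeps later plotting code free of exec or per-dataset variables; a short usage sketch, assuming the loop above has run:
# access an individual dataframe by its key
google = output_data["google"]
print(google.head())

# or iterate over all of them uniformly
for name, frame in output_data.items():
    print(name, frame.shape)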
A reproducible example (though you should be very careful with using "exec"):
# Generating data
import os
import pandas as pd
os.chdir(r'C:\Windows\Temp')
df1 = pd.DataFrame([['a',1],['b',2]], index=[0,1], columns=['col1','col2'])
df2 = pd.DataFrame([['c',3],['d',4]], index=[2,3], columns=['col1','col2'])
# Exporting data
df1.to_csv('df1.csv', index_label='Month')
df2.to_csv('df2.csv', index_label='DATE')
# Definition of Loading metadata
loading_metadata = {
    'df1_loaded': ['df1.csv', 'Month'],
    'df2_loaded': ['df2.csv', 'DATE'],
}
# Importing according to loading_metadata (caution: the string passed to exec must start at column 0)
for dataset in loading_metadata.keys():
    print(dataset, loading_metadata[dataset][0], loading_metadata[dataset][1])
    exec(
"""
{0} = pd.read_csv('{1}', index_col='{2}').rename_axis('')
""".format(dataset, loading_metadata[dataset][0], loading_metadata[dataset][1])
    )
Exported data (df1.csv):
Month,col1,col2
0,a,1
1,b,2
Exported data (df2.csv):
DATE,col1,col2
2,c,3
3,d,4
Loaded data:
df1_loaded
  col1  col2
0    a     1
1    b     2

df2_loaded
  col1  col2
2    c     3
3    d     4
