Remove Stopwords in a RDD, Pyspark - apache-spark

I have an RDD containing text read from a text file. I would like to remove all the stop words from that text. There is pyspark.ml.feature.StopWordsRemover, which provides this functionality on a DataFrame, but I would like to do it on an RDD. Is there a way to do it?
Steps:
txt = sc.textFile('/Path')
txt.collect()
which outputs :
["23890098\tShlykov, a hard-working taxi driver and Lyosha"]
I want to remove all the stop words present in the txt RDD.
Desired Output :
["23890098\tShlykov, hard-working taxi driver Lyosha"]

You can list out the stop words yourself and then use lambda functions to map and filter the output:
stop_words = ['a','and','the','is']
txt = sc.textFile('/Path')
filtered_txt = txt.flatMap(lambda x: x.split()).filter(lambda x: x not in stop_words)
filtered_txt.first()
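Note that flatMap returns a flat RDD of individual words. If you want to keep one record per line, as in the desired output above, a map-based variant is a reasonable sketch (whitespace, including the tab, gets normalized to single spaces here):
stop_words = ['a', 'and', 'the', 'is']
txt = sc.textFile('/Path')
# rebuild each line with the stop words removed, keeping one record per input line
filtered_txt = txt.map(lambda line: ' '.join(w for w in line.split() if w not in stop_words))
filtered_txt.first()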

Related

Apache pyspark remove stopwords and calculate

I have the following .csv file (ID, title, book title, author etc):
I want to compute all the 4-word combinations (n=4) from the article titles (column 2), after I remove the stopwords.
I have created the dataframe:
df_hdfs = spark.read.option('delimiter', ',').option('header', 'true').csv("/user/articles.csv")
I have created an rdd with the titles column:
rdd = df_hdfs.rdd.map(lambda x: (x[1]))
and it looks like this:
Now, I realize that I have to tokenize each string of the RDD into words and then remove the stopwords. I would need a little help on how to do this and how to compute the combinations.
Thanks.
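Not a full answer, but a minimal sketch of the tokenize-and-combine step, assuming a hand-made stop-word list and itertools.combinations for the 4-word combinations (the stop-word list and names here are illustrative, not from the question):
from itertools import combinations

stop_words = {'a', 'an', 'and', 'the', 'of', 'in', 'on'}  # illustrative list

def title_combinations(title, n=4):
    # tokenize the title, drop stop words, then build all n-word combinations
    words = [w for w in title.lower().split() if w not in stop_words]
    return list(combinations(words, n))

# one list of 4-word combinations per title
combos = rdd.map(title_combinations)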

Live updating graph from increasing amount of csv files

I need to analyse some spectral data in real-time and plot it as a self-updating graph.
The program I use outputs a text file every two seconds.
Usually I do the analysis after gathering the data, and the code works just fine: I create a dataframe where each csv file represents a column. The problem is that with several thousands of csv files the import becomes very slow, and creating a dataframe out of all the csv files usually takes more than half an hour.
Below is the code for creating the dataframe from multiple csv files.
''' import, append and concat files into one dataframe '''
import glob
import os
from datetime import datetime

import pandas as pd

def build_full_spectra(path, filter):  # placeholder name/signature; the original snippet sits inside a function
    all_files = glob.glob(os.path.join(path, filter + "*.txt"))  # path to the files by joining path and file name
    all_files.sort(key=os.path.getmtime)

    data_frame = []
    name = []
    for file in all_files:
        creation_time = os.path.getmtime(file)
        readible_date = datetime.fromtimestamp(creation_time)
        df = pd.read_csv(file, index_col=0, header=None, sep='\t',
                         engine='python', decimal=",", skiprows=15)
        df.rename(columns={1: readible_date}, inplace=True)
        data_frame.append(df)
    full_spectra = pd.concat(data_frame, axis=1)

    # convert the timestamp columns to minutes elapsed since the first file
    for column in full_spectra.columns:
        time_step = column - full_spectra.columns[0]
        minutes = time_step.total_seconds() / 60
        name.append(minutes)
    full_spectra.columns = name
    return full_spectra
The solution I thought of was using the watchdog module: every time a new text file is created, it gets appended as a new column to the existing dataframe and the updated dataframe is plotted. That way I would not need to loop over all csv files every time.
I found a very nice example on how to use watchdog here
My problem is that I could not find a way to read the new file after watchdog detects it and append it to the existing dataframe.
A minimalistic example code should look something like this:
def latest_filename():
    """a function that checks within a directory for new text files"""
    return filename

df = pd.DataFrame()                          # create a dataframe
newdata = pd.read_csv(latest_filename())     # the new file is found by watchdog
df["newcolumn"] = newdata["desiredcolumn"]   # append the new data as a column
df.plot()                                    # plot the data
The plotting part should be easy and my thoughts were to adapt the code presented here. I am more concerned with the self-updating dataframe.
I appreciate any help or other solutions that would solve my issue!
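Not a complete answer, but a minimal sketch of the watchdog idea, assuming each new .txt file is read with the same read_csv parameters as above and appended as one column (the folder path and column naming are placeholders):
import pandas as pd
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

df = pd.DataFrame()  # grows by one column per newly created file

class NewFileHandler(FileSystemEventHandler):
    def on_created(self, event):
        global df
        if event.is_directory or not event.src_path.endswith('.txt'):
            return
        # read only the file that watchdog just detected
        newdata = pd.read_csv(event.src_path, index_col=0, header=None, sep='\t',
                              engine='python', decimal=",", skiprows=15)
        # append it as a new column, named after the file path for now
        df[event.src_path] = newdata[1]
        df.plot()  # or update an existing matplotlib figure instead

observer = Observer()
observer.schedule(NewFileHandler(), path='/path/to/watched/folder')
observer.start()
observer.join()  # keep the script alive while watching the folder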

Convert files to DataFrame and then applying a function for multiple DataFrames

I have three files well1.las, well2.las, well3.las (a ".las" file is similar to a ".txt") and want to transform them into different DataFrames (8x4), as I will apply some functions on them later.
Inside each file there is a string "~A" that I want to exclude, keeping just the log names ('DEPTH', 'GR', 'NPHI', 'RHOB') as column names.
I already managed to parse one las file into a DataFrame, but couldn't do it for all of them. How can I do that?
I think it would be better if I could put the dataframes in a list or dict, as I will need to do some calculations with their values.
Each las file looks like this:
~A DEPTH GR NPHI RHOB
2869.6250 143.5306 0.1205 2.4523
2869.7500 143.9227 0.1221 2.4497
2869.8750 144.5697 0.1180 2.4564
2870.0000 145.3994 0.1128 2.4650
2870.1250 146.3611 0.1378 2.4239
2870.2500 147.3796 0.1535 2.3981
2870.3750 148.4387 0.1288 2.4387
2870.5000 149.5223 0.1195 2.4539
import os
import pandas as pd

folder = r"C:\Users\laguiar\Desktop\LASfiles"
dl = os.listdir(folder)

for filename in dl:
    if filename.endswith('.las'):
        with open(os.path.join(folder, filename)) as f:
            # find the "~A" header line and keep only the log names
            for l in f:
                if l.startswith('~A'):
                    logs = l.split()[1:]
                    break
            # the handle now points at the data rows; '~A' never occurs there,
            # so each row is read as one string and split afterwards
            data = pd.read_csv(f, names=logs, sep='~A', engine='python')
        df = data['DEPTH'].str.split(expand=True)
        df = df.astype(float)
        df.columns = logs
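To run this over all three files and keep the results for later calculations, a sketch that stores one DataFrame per file in a dict keyed by filename could look like this (it parses the data rows with a whitespace separator instead of the two-step split, which is an assumption on my part):
import os
import pandas as pd

folder = r"C:\Users\laguiar\Desktop\LASfiles"

def read_las(path):
    # parse one .las file: grab the column names from the "~A" line,
    # then read the remaining rows as whitespace-separated numbers
    with open(path) as f:
        for line in f:
            if line.startswith('~A'):
                logs = line.split()[1:]
                break
        return pd.read_csv(f, names=logs, sep=r'\s+', engine='python').astype(float)

# one DataFrame per well, keyed by filename, ready for later calculations
dataframes = {name: read_las(os.path.join(folder, name))
              for name in os.listdir(folder) if name.endswith('.las')}

# e.g. dataframes['well1.las']['GR'].mean()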

How to split pandas dataframe into multiple dataframes based on unique string value without aggregating

I have a df with multiple country codes in a column (US, CA, MX, AU...) and want to split this one df into multiple ones based on these country code values, but without aggregating it.
I've tried a for loop but was only able to get one df and it was aggregated with groupby().
I gave up trying to figure it out, so I split them based on str.match and wrote one line for each country code. Is there a nice for loop that could achieve the same as the code below? If it could also write a csv file for each new df, that would be fantastic.
us = df[df['country_code'].str.match("US")]
mx = df[df['country_code'].str.match("MX")]
ca = df[df['country_code'].str.match("CA")]
au = df[df['country_code'].str.match("AU")]
.
.
.
We can write a for loop which takes each code and uses query to get the correct part of the data. Then we write it to csv with to_csv, also using an f-string:
codes = ['US', 'MX', 'CA', 'AU']

for code in codes:
    temp = df.query(f'country_code.str.match("{code}")')
    temp.to_csv(f'df_{code}.csv')
Note: f-strings only work with Python >= 3.6.
To keep the dataframes:
codes = ['US', 'MX', 'CA', 'AU']
dfs = []

for code in codes:
    temp = df.query(f'country_code.str.match("{code}")')
    dfs.append(temp)
    temp.to_csv(f'df_{code}.csv')
Then you can access them with an index, for example: print(dfs[0]) or print(dfs[1]).
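As a side note, groupby also splits without aggregating if you iterate over the groups instead of calling an aggregation method; a small sketch of that alternative:
# one DataFrame per country code, no aggregation, written straight to csv
dfs = {}
for code, group in df.groupby('country_code'):
    dfs[code] = group
    group.to_csv(f'df_{code}.csv')

# dfs['US'], dfs['MX'], ... hold the per-country frames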

How to convert specific rows in a column into a separate column using pyspark and enumerate each row with an increasing numerical index? [duplicate]

I'm trying to read a retrosheet event file into Spark. The event file is structured as follows.
id,TEX201403310
version,2
info,visteam,PHI
info,hometeam,TEX
info,site,ARL02
info,date,2014/03/31
info,number,0
info,starttime,1:07PM
info,daynight,day
info,usedh,true
info,umphome,joycj901
info,attendance,49031
start,reveb001,"Ben Revere",0,1,8
start,rollj001,"Jimmy Rollins",0,2,6
start,utlec001,"Chase Utley",0,3,4
start,howar001,"Ryan Howard",0,4,3
start,byrdm001,"Marlon Byrd",0,5,9
id,TEX201404010
version,2
info,visteam,PHI
info,hometeam,TEX
As you can see, the file loops back to a new id record for each game.
I've read the file into an RDD and then, via a for loop, added a key for each game, which appears to work. But I was hoping to get some feedback on whether there is a cleaner way to do this using Spark methods.
logFile = '2014TEX.EVA'
event_data = (sc
              .textFile(logFile)
              .collect())

idKey = 0
newevent_list = []
for line in event_data:
    if line.startswith('id'):
        idKey += 1
    # tag every line with the key of the game it belongs to
    newevent_list.append((idKey, line))

event_data = sc.parallelize(newevent_list)
PySpark since version 1.1 supports Hadoop input formats. You can use the textinputformat.record.delimiter option to set a custom record delimiter, as below:
from operator import itemgetter

retrosheet = sc.newAPIHadoopFile(
    '/path/to/retrosheet/file',
    'org.apache.hadoop.mapreduce.lib.input.TextInputFormat',
    'org.apache.hadoop.io.LongWritable',
    'org.apache.hadoop.io.Text',
    conf={'textinputformat.record.delimiter': '\nid,'}
)

(retrosheet
    .filter(itemgetter(1))
    .values()
    .filter(lambda x: x)
    .map(lambda v: (v if v.startswith('id') else 'id,{0}'.format(v)).splitlines()))
Since Spark 2.4 you can also read the data into a DataFrame using the text reader:
spark.read.option("lineSep", '\nid,').text('/path/to/retrosheet/file')
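One hedged follow-up, not from the original answer: with that line separator, every record after the first loses its leading id, marker, so you may want to add it back, for example:
from pyspark.sql import functions as F

records = spark.read.option("lineSep", "\nid,").text("/path/to/retrosheet/file")

# restore the "id," prefix that the custom line separator strips from every
# record except the first (mirrors the RDD version above)
records = records.withColumn(
    "value",
    F.when(F.col("value").startswith("id,"), F.col("value"))
     .otherwise(F.concat(F.lit("id,"), F.col("value")))
)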
