Select only first line from files under a directory in pyspark - apache-spark

I want to collect the first line from each of the files under a directory using PySpark. I tried
file = sc.wholeTextFiles("Location").map(lambda x: x[0]).collect()
but this gives me the list of file names under the directory. I want something like the following. Say I have two files:
file1.csv:
x,y,z
1,2,3
a,b,c
file2.csv:
q,r,s
4,5,6
d,e,f
I want to collect the first lines of the files, {x,y,z} and {q,r,s}. Please help me: how can I get only the first line from multiple files under a directory?

You can do something like the following:
def read_firstline(filename):
    with open(filename, 'rb') as f:
        return f.readline()

# files is a list of filenames
# use map, not flatMap: flatMap would flatten each returned line
# into its individual characters
rdd_of_firstlines = sc.parallelize(files).map(read_firstline)
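As a quick sanity check outside Spark, the same helper can be exercised locally; the throwaway directory and sample rows below are invented to mimic the question's two files:

```python
import os
import tempfile

def read_firstline(filename):
    # open in binary mode and return only the first line
    with open(filename, 'rb') as f:
        return f.readline()

# build two throwaway CSV files to stand in for the directory
tmpdir = tempfile.mkdtemp()
for name, rows in [('file1.csv', 'x,y,z\n1,2,3\na,b,c\n'),
                   ('file2.csv', 'q,r,s\n4,5,6\nd,e,f\n')]:
    with open(os.path.join(tmpdir, name), 'w') as f:
        f.write(rows)

files = sorted(os.path.join(tmpdir, n) for n in os.listdir(tmpdir))
firstlines = [read_firstline(p).decode().strip() for p in files]
# firstlines now holds the header row of each file, in filename order
```

In the Spark version, `sc.parallelize(files).map(read_firstline)` does the same thing, just distributed over the cluster's workers.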

Related

Can you read a CSV file as one column?

I know this sounds silly, but is it possible to read a CSV file containing multiple columns and combine all the data into one column? Let's say I have a CSV file with 6 columns and they have different delimiters. Is it possible to read these files, but spit out the first 100 rows into one column, without specifying a delimiter? My understanding is that this isn't possible if using pandas.
I don't know if this helps, but to add context to my question, I'm trying to use Treeview from Tkinter to display the first 100 rows of a CSV file. The Treeview window should display this data as 1 column if a delimiter isn't specified. Otherwise, it will automatically split the data based on a delimiter from the user input.
This is the data I have:
This should be the result:
Pandas isn't the only way to read a CSV file. There is also the built-in csv module in the Python standard library, as well as the basic built-in function open, which will work just as well. Both of these methods can yield single rows of data, as your question asks.
Using open function
filepath = "/path/to/file.csv"
with open(filepath, "rt", encoding="utf-8") as fd:
    header = next(fd)
    for row in fd:
        # ... do something with row data
        # row will be a string of all the data for a single row, e.g.
        # "Information,44775.4541667,MicrosoftWindowsSecurity,16384..."
        # you can break at any time to stop reading.
        ...
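A runnable version of the same idea, with io.StringIO standing in for the open file (the sample rows are made up):

```python
import io

# stand-in for: open(filepath, "rt", encoding="utf-8")
fd = io.StringIO(
    "Level,Time,Source\n"
    "Information,44775.4541667,MicrosoftWindowsSecurity\n"
    "Warning,44775.4551667,Kernel\n"
)

header = next(fd)            # first line, kept as one string
rows = []
for i, row in enumerate(fd):
    if i >= 100:             # stop after the first 100 data rows
        break
    rows.append(row.rstrip("\n"))  # each row stays a single string
```

Because the lines are never split on a delimiter, each entry in `rows` is exactly the one-column-per-row view the question asks for.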
or using the csv module:
import csv

with open("/path/to/file.csv", "rt", encoding="utf8") as fd:
    reader = csv.reader(fd, delimiter=',')
    header = next(reader)
    for row in reader:
        # this time row is a list, split on the delimiter, which
        # defaults to a comma but can be changed via the delimiter
        # argument to csv.reader
        ...
You can use
with open('file.csv') as f:
    data = f.readlines()
to read the file line by line.
As other answers have explained, there are various ways to read the first n lines of text from a file. But if you insist on using pandas, there is a trick you can use.
Find a character that will never appear in your text and use it as a dummy delimiter for read_csv(), so that each line is read as one column. Use the nrows parameter to control the number of lines read, and header=None so the first line is treated as data rather than as column names:
pd.read_csv("myfile.csv", sep="~", header=None, nrows=100)

Linux Sort words alphabetically and make a file for each letter

I want to write a shell script which creates automatically 26 dictionary files, where the first file should contain all the words starting with a or A, the second all the words starting with b or B, ... etc. Where each dictionary file is sorted. For example, if I had a file that had the words Lime, Apple, Orange, Avacado, Apricot, Lemon. Then I want a new file that contains in order Apple, Apricot, Avacado, a file that contains just Orange, and a file that contains Lemon, Lime.
I thought about doing this using sort, so it could be:
sort sample.txt
but that would not put each section of words into a new file. So I thought of doing:
sort sample.txt > [a-z].txt
but that just makes one new file titled [a-z].txt
How do I make different alphabetically sorted files from the list of words in the file? I want it to be like a.txt, b.txt, etc with each containing all the words that start with that letter.
You can do this with awk:
awk '{ print $0 >> (toupper(substr($0,1,1)) "_wordsfile") }' <(sort wordsfilemaster)
Here wordsfilemaster contains the original dictionary file: sort is run on it and the output is redirected back into awk. Each line is appended to a file whose name is built by taking the line's first character, converting it to upper case, and appending "_wordsfile", e.g.
files get created as A_wordsfile or O_wordsfile.
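If you want lower-case names like a.txt instead, as in the question, a variation on the same idea (using the sample words from the question) might look like:

```shell
cd "$(mktemp -d)"
printf '%s\n' Lime Apple Orange Avacado Apricot Lemon > sample.txt
# sort case-insensitively, then route each word to a file named
# after its lower-cased first letter
sort -f sample.txt | awk '{ print > (tolower(substr($0,1,1)) ".txt") }'
```

After this runs, a.txt holds Apple, Apricot, Avacado; l.txt holds Lemon, Lime; and o.txt holds Orange, each already sorted because awk sees the words in sorted order.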

match string in columns with other lists, add match results in new column/row

I'm new to pandas :D
Recently I got a task in which I need to process and analyze the data in a CSV file.
Now, in the last step, I need to match the data from one of the columns in the CSV file against two existing lists, and, if the matching is successful, write the corresponding element of the list to the corresponding column in the CSV file.
For example:
data source
list1 = ['I like cats.', 'Jim hates frog.']
list2 = ['Cats are cute.', 'Cats eat fish.', 'Dogs are cute.', 'My cat has nice fur.', 'Sandy sent me a nice cat toy.']
and the ideal output is:
ideal output
I've tried a for-loop inside a for-loop, but the real file is very large, so it runs pretty slowly...
Thank you very much!
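The screenshots with the actual data are missing here, so any sketch has to assume a shape: below, a hypothetical 'keyword' column whose values are looked up as substrings of the list entries. The column name, the helper, and the sample keywords are all invented; the point is to iterate per column with map rather than nesting Python loops over every row/list pair:

```python
import pandas as pd

list1 = ['I like cats.', 'Jim hates frog.']
list2 = ['Cats are cute.', 'Cats eat fish.', 'Dogs are cute.',
         'My cat has nice fur.', 'Sandy sent me a nice cat toy.']

# hypothetical input: a column of keywords to look up in the lists
df = pd.DataFrame({'keyword': ['cats', 'frog', 'bird']})

def first_match(keyword, sentences):
    # return the first sentence containing the keyword, else ''
    kw = keyword.lower()
    for s in sentences:
        if kw in s.lower():
            return s
    return ''

# one map pass per list instead of a Python loop over every row
df['match1'] = df['keyword'].map(lambda k: first_match(k, list1))
df['match2'] = df['keyword'].map(lambda k: first_match(k, list2))
```

If the matching rule is different (exact equality, regex, etc.), only first_match needs to change; the per-column map stays the same.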

comparing multiple tab delimited csv files in python

As a start, I want to compare the first two columns of two .csv files then write what is common in these files to an output file, say common.csv, then also write the differences in each file to different output files, say f1.csv and f4.csv.
So far I have tried using set(), difflib, and also taking the two files, creating lists from them, and then comparing the first two columns of each. This gave me the output for what is common, but not for the differences in each file when compared to the other. I have tried most of the posted solutions that seemed similar to my problem, but I am still stuck. Can someone please assist?
These are the headers in my files; I only want to compare the first two columns, but write out the entire line to the output file.
fieldnames = (["Chromosome" ,"GenomicPosition", "ReferenceBase",
"AlternateBase", "GeneName", "GeneID",
"TrancriptID", "Varianteffect-Variantimpact",
"Biotype", "TranscriptBiotype" , "Referencebase",
"Alternatebase", "Depth coverage"])
One solution is to use pandas, which is very powerful.
To convert csv <-> pandas dataframes:
import pandas as pd
df = pd.read_csv('csv_file.csv') # csv -> pandas
df.to_csv('csv_file.csv', index=False) # pandas -> csv
To compare pandas dataframes on columns, this post should point you in the right direction: https://stackoverflow.com/a/47107164/2667536
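Concretely, the common/difference split on the first two columns can be sketched with merge and its indicator flag; the two small frames below are invented stand-ins for the real files:

```python
import pandas as pd

keys = ["Chromosome", "GenomicPosition"]

df1 = pd.DataFrame({"Chromosome": ["chr1", "chr1", "chr2"],
                    "GenomicPosition": [100, 200, 300],
                    "GeneName": ["g1", "g2", "g3"]})
df4 = pd.DataFrame({"Chromosome": ["chr1", "chr2"],
                    "GenomicPosition": [100, 400],
                    "GeneName": ["g1", "g9"]})

# whole rows of df1 whose first two columns also appear in df4
common = df1.merge(df4[keys].drop_duplicates(), on=keys, how="inner")

# rows unique to df1, found via the _merge indicator column
tagged = df1.merge(df4[keys].drop_duplicates(), on=keys,
                   how="left", indicator=True)
f1_only = tagged[tagged["_merge"] == "left_only"].drop(columns="_merge")
```

Each frame can then be written out with, e.g., common.to_csv("common.csv", index=False); swapping df1 and df4 in the second merge gives the rows unique to the other file (f4.csv).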

Writing pandas data frame to csv but no space between columns- sep'\t' argument is ignored in Python 3

I have a problem and found many related questions asked here and read them all, but still can't solve it. So far I haven't gotten any answer.
I have two files, one .csv and the other .xlsx. They have different numbers of rows and columns. I would like to merge the two according to filenames. Very simplified, the two files look as follows:
The csv file;
the excel file;
First I converted them to pandas data frames:
import pandas as pd

df1 = pd.read_csv('mycsv.csv')
df2 = pd.read_excel('myexcel.xlsx', sheet_name=0)
To merge the two files on the same column, I remove the whitespace in the column names of df2 using the first line below; then I merge them and write the merged data frame to a csv file.
df2.columns=df2.columns.str.replace(' ', '')
df=pd.merge(df1, df2, on="filename")
df.to_csv('myfolder \\merged_file.csv', sep="\t")
When I check my folder, I see merged_file.csv exists but when I opened it there is no space between columns and values. I want to see nice normal csv or excel look, like my example files above. Just to make sure I tried everything, I also converted the Excel file to a csv file and then merged two csv but still merged data is written without spaces. Again, the above files are very simplified, but my real merged data look like this;
Finally figured it out. I am putting the answer here in case anyone else makes the same mistake as me. Just remove the sep="\t" and use the line below instead:
df.to_csv('myfolder \\merged_file.csv')
I just realized the two csv files were comma separated, so using a tab delimiter for the merged output didn't work.
