Transforming one row into many rows using Spark - apache-spark

I'm trying to use Spark to turn one row into many rows. My goal is something like a SQL UNPIVOT.
I have a pipe-delimited text file that is 360 GB, compressed (gzip). It has over 1,620 columns. Here's the basic layout:
primary_key|property1_name|property1_value|property800_name|property800_value
12345|is_male|1|is_college_educated|1
There are over 800 of these property name/value fields. There are roughly 280 million rows. The file is in an S3 bucket.
The users want me to unpivot the data. For example:
primary_key|key|value
12345|is_male|1
12345|is_college_educated|1
This is my first time using Spark. I'm struggling to figure out a good way to do this.
What is a good way to do this in Spark?
Thanks.

The idea is to generate a list of output lines from each input line, in the format you have shown. Mapping that over the RDD would give an RDD of lists of lines, so use flatMap instead to flatten it into an RDD of individual lines.
If your file is loaded in as rdd1, then the following should give you what you want:
rdd1.flatMap(break_out)
where the function for processing lines is defined as:
def break_out(line):
    # split the line into individual fields
    line_split = line.split("|")
    # even-indexed fields: the primary key followed by the property values
    vals = line_split[::2]
    # odd-indexed fields: the property names
    keys = line_split[1::2]
    # the first field is the primary key
    primary_key = vals[0]
    # return one pipe-delimited "primary_key|key|value" line per property
    return ["|".join((primary_key, keys[i], vals[i + 1])) for i in range(len(keys))]
You may need some additional code to deal with header lines etc, but this should work.
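A minimal sketch of how the pieces might fit together, assuming an existing SparkContext named sc and that the first line of the file is a header (the S3 paths are placeholders, not from the question):
rdd1 = sc.textFile("s3://your-bucket/path/to/input.txt.gz")   # gzip input is decompressed automatically
header = rdd1.first()
rdd1 = rdd1.filter(lambda line: line != header)               # drop the header line
unpivoted = rdd1.flatMap(break_out)                           # one primary_key|key|value line per property
unpivoted.saveAsTextFile("s3://your-bucket/path/to/output")
One thing to keep in mind: a single gzip file is not splittable, so one huge .gz file will be read by a single task; splitting the input into many smaller compressed files lets Spark parallelize the read.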

Related

Can you read a CSV file as one column?

I know this sounds silly, but is it possible to read a CSV file containing multiple columns and combine all the data into one column? Let's say I have a CSV file with 6 columns and they have different delimiters. Is it possible to read these files, but spit out the first 100 rows into one column, without specifying a delimiter? My understanding is that this isn't possible if using pandas.
I don't know if this helps, but to add context to my question, I'm trying to use Treeview from Tkinter to display the first 100 rows of a CSV file. The Treeview window should display this data as 1 column if a delimiter isn't specified. Otherwise, it will automatically split the data based on a delimiter from the user input.
Pandas isn't the only way to read a CSV file. There is also the built-in csv module in the Python standard library, as well as the basic built-in open function, which will work just as well. Both of these methods can give you single, unsplit rows of data, as your question requires.
Using the open function:
filepath = "/path/to/file.csv"
with open(filepath, "rt", encoding="utf-8") as fd:
    header = next(fd)
    for row in fd:
        # ... do something with the row data.
        # row will be a string of all the data for a single row, e.g.
        # "Information,44775.4541667,MicrosoftWindowsSecurity,16384..."
        # you can break at any time you want to stop reading.
        pass
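For the "first 100 rows as one column" part of the question, a small sketch on top of the same approach using itertools.islice (the file path is a placeholder):
from itertools import islice

filepath = "/path/to/file.csv"
with open(filepath, "rt", encoding="utf-8") as fd:
    header = next(fd)                                          # skip the header line
    first_rows = [row.rstrip("\n") for row in islice(fd, 100)]
# first_rows now holds up to 100 raw, unsplit lines, ready to insert into a one-column Treeview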
or using the csv module:
import csv

reader = csv.reader(open("/path/to/file.csv", "rt", encoding="utf-8"), delimiter=',')
header = next(reader)
for row in reader:
    # this time row will be a list, split by the delimiter, which
    # by default is a comma but you can change it in the call to csv.reader
    pass
You can also use
with open('file.csv') as f:
    data = f.readlines()
to read the file line by line.
As other answers have explained, there are various ways to read the first n lines of text from a file. But if you insist on using pandas, there is a trick you can use.
Find a character which will never appear in your text and use it as a dummy delimiter for read_csv(), so that all the text is read as one column. Use the nrows parameter to control the number of lines to read:
import pandas as pd

df = pd.read_csv("myfile.csv", sep="~", nrows=100)

How to ingest comma separated csv with unquoted commas in some columns in ADF

Hi everyone, I have not seen this particular issue pop up; I've seen a few related ones, but none address this.
I have very big CSVs (up to 8 GB) with a comma as the delimiter, free text in some columns, and commas inside some of that free text.
As requirements, I cannot generate or ask for the CSVs to be generated again with another delimiter, and I have to use Data Flow to achieve this.
I would like to learn how to deal with text such as:
A, some text 2132,ALL, free text 00001,2020-11-29 - 2020-12-05
A, some text 2132,ALL, free text\,more text 0002,2018-12-09 - 2018-12-15
A, some text 2132,ALL, free text\,more text 00003,2018-12-09 - 2018-12-15
Things I have tried:
I tried making both simple Data Flows and Copy Activities to see whether the parser handled the rows properly; it did not, no matter what combination of CSV dataset settings I tried.
I tried reading the whole CSV as one column and writing it back to a file with the "," regexed out. The problem is that this "loses" the commas from the CSV, leaving spaces as the effective delimiter, so I am back to square one: there is no proper delimiter, since the free text contains spaces and would break the parsing.
Actually, Data Factory can't deal with a CSV file whose column data contains the same character as the column delimiter. It causes columns in the schema to go missing.
Even with Data Flow, Data Factory will always infer the schema from the first row according to the number of delimiters it contains.
Given what you said about the source CSV file and Data Flow, I'm afraid we can't achieve this in Data Factory.
What I did to make this work (I did it twice with different results, so I am still missing something, but one of the attempts worked):
Create a dataset with no delimiter, so the whole CSV row is read as a single column. Use the Data Flow replace function there to make the problematic string disappear, and write the result to disk as CSV.
Then read that output as CSV with the proper delimiter, do whatever processing is needed, and write it to disk as Parquet. That appears to work.
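Outside of ADF, the same two-pass idea can be sketched in plain Python just to illustrate what the replace step does; the file names and the assumption that "\," is the only escaped-delimiter pattern are placeholders, not part of the original pipeline:
import csv

# pass 1: treat each line as a single column and strip the escaped commas from the free text
with open("raw.csv", "rt", encoding="utf-8") as src, open("clean.csv", "wt", encoding="utf-8") as dst:
    for line in src:
        dst.write(line.replace("\\,", ""))

# pass 2: the remaining commas are now unambiguous delimiters
with open("clean.csv", "rt", encoding="utf-8", newline="") as fd:
    rows = list(csv.reader(fd))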

Problem when importing table from pdf to python using tabula

When importing data from a PDF using tabula with Python, in some cases I obtain two or more columns merged into one. It does not happen with all the files obtained from the same PDF.
In this case, this is the code used to read the pdf:
from tabula import wrapper

tables = wrapper.read_pdf("933884 cco Saupa 1.pdf", multiple_tables=True, pages='all')
i = 1
for table in tables:
    table.to_excel('output' + str(i) + '.xlsx', index=False)
    i = i + 1
For example, when I print the first item of the dataframe obtained from one of these excel files, named "output_pd":
print (output_pd[0][1])
I obtain:
76) 858000015903708 77) 858000013641969 78)
The five numbers are in a single column, so I cannot treat them individually.
Is it possible to improve the data handling in these cases?
You could try manually editing the data in Excel. If you use Text to Columns, under the Data tab in Excel, it allows you to split one column into multiple columns without too much work, but you would need to do it for every Excel file, which could be a pain.
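If doing that by hand for every file is too painful, here is a hedged pandas sketch of the same "text to columns" step, assuming the merged values are separated by single spaces as in the sample output above, and that output1.xlsx is one of the files written by the loop:
import pandas as pd

df = pd.read_excel("output1.xlsx")
merged = df.iloc[:, 0].astype(str)                # the column holding the merged values
split_cols = merged.str.split(" ", expand=True)   # one new column per space-separated value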
By iterating over each item of each column of each dataframe in the list returned by wrapper.read_pdf(file) (in this case, tables), it is possible to obtain clean data.
In this case:
prueba = []
for table in tables:
    for columna in table.columns:
        for item in str(table[columna]).split(" "):
            if "858" in str(item):
                prueba.append(item[0:15])
print(prueba[0:5])
results in:
['858000019596025', '858000015903707', '858000013641975', '858000000610864', '858000013428853']
However, tabula's wrapper.read_pdf does not read the whole initial PDF: two values on the last page are missed, so a small manual edit is still necessary.

Spark - Have I read from csv correctly?

I read a csv file into Spark using:
df = spark.read.format(file_type).options(header='true', quote='\"',
ignoreLeadingWhiteSpace='true',inferSchema='true').load(file_location)
When I tried it with sample csv data from another source and did display(df), it showed a neatly displayed header row followed by data.
When I try it on my main data, which has 40 columns and millions of rows, it simply displays the first 20 column headers and no data rows.
Is this normal behavior or is it reading it wrong?
Update:
I shall mark the question as answered as the tips below are useful. However my results from doing:
df.show(5, truncate=False)
currently shows:
+-------------------------------------------------------------------------------------------------+
|��"periodID","DAXDate","Country Name","Year","TransactionDate","QTR","Customer Number","Customer Name","Customer City","Document Type Code","Order Number","Product Code","Product Description","Selling UOM","Sub Franchise Code","Sub Franchise Description","Product Major Code","Product Major Description","Product Minor Code","Product Minor Description","Invoice Number","Invoice DateTime","Class Of Trade ID","Class Of Trade","Region","AmountCurrencyType","Extended Cost","Gross Trade Sales","Net Trade Sales","Total(Ext Std Cost)","AdjustmentType","ExcludeComment","CurrencyCode","fxRate","Quantity","FileName","RecordCount","Product Category","Direct","ProfitCenter","ProfitCenterRegion","ProfitCenterCountry"|
+-------------------------------------------------------------------------------------------------+
I shall have to go back to basics and preview the csv in a text editor to find out the correct format for this file and figure out what's going wrong. Note, I had to update my code to the following to deal with the pipe delimiter:
df = spark.read.format(file_type).options(header='true', quote='\"', delimiter='|',ignoreLeadingWhiteSpace='true',inferSchema='true').load(file_location)
Yes, this is normal behaviour. The dataframe function show() displays only 20 rows by default. You can set a different value for that (but keep in mind that it doesn't make sense to print all rows of your file) and also stop it from truncating. For example:
df.show(100, truncate=False)
It is normal behaviour. You can view the content of your data in different ways, as sketched after this list:
show(): shows you the first 20 rows in a formatted way. You can specify as an argument the number of rows you want to display (providing a value much higher than your row count is fine). Columns are truncated too by default; you can specify truncate=False to show them in full (as @cronoik correctly said in his answer).
head(): the same idea as show(), but it prints the data in a "row" format. It does not produce a nicely formatted table, but it is useful for a quick, complete look at your data, for example head(1) to show only the first row.
describe().show(): shows a summary that gives you an insight into the data, for example the count of elements and the min/max/avg value of each column.
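A quick sketch of all three, assuming the dataframe from the question is loaded as df:
df.show(5, truncate=False)   # first 5 rows, all columns shown in full
df.head(1)                   # first row, returned as a list of Row objects
df.describe().show()         # count, mean, stddev, min and max for each column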
It is normal for Spark dataframes to display limited rows and columns. Your reading of the data should not be a problem. However, to confirm that you have read the csv correctly you can try to see the number of rows and columns in the df, using
len(df.columns)
or
df.columns
For number of rows
df.count()
In case you need to see the content in detail you can use the option stated by cronoik.

Removing duplicates between multiple large CSV files

I am trying to find the best way to remove duplicates from large CSV files.
I receive CSV files of around 5/6 million rows every month.
I need to adjust these (I only need some of the columns, and I need to add some others).
The files also contain a lot of duplicate and incomplete rows.
I've come up with a solution in Python where I use a set and check, for each row, whether it's already in the set, changing what needs changing along the way.
Now, I get the second file, and it contains a lot of duplicates that are in the previous file.
I'm trying to find an efficient solution to remove duplicates within the file, and between the different files. In the end I want to have a list (table or csv file) that contains only the new entries for that month.
I would like to use Python, and I was thinking about using an SQLite database for storing the data, but I'm unsure which way would be most efficient.
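A minimal sketch of the set-based approach described in the question, assuming the unique key is in the first column, that last month's keys fit in memory, and that the file names are placeholders:
import csv

# collect the keys already seen in previous months
# (these could equally be loaded from the sqlite table mentioned above)
seen = set()
with open("previous_months.csv", "rt", encoding="utf-8", newline="") as fd:
    for row in csv.reader(fd):
        if row:
            seen.add(row[0])

# keep only the rows whose key has not been seen before
new_rows = []
with open("this_month.csv", "rt", encoding="utf-8", newline="") as fd:
    reader = csv.reader(fd)
    header = next(reader)
    for row in reader:
        if row and row[0] not in seen:
            seen.add(row[0])
            new_rows.append(row)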
I would use numpy.unique():
import numpy as np

# load both files (comma-delimited, read as strings) and stack them on top of each other,
# creating one giant array
data = np.vstack((np.loadtxt("path/to/file1.csv", delimiter=",", dtype=str),
                  np.loadtxt("path/to/file2.csv", delimiter=",", dtype=str)))
# keep only the unique rows
data = np.unique(data, axis=0)
np.unique takes the entire array and returns only the unique elements. Make sure you set axis=0 so that it goes row by row and not cell by cell.
One caveat: This should work, but if there are several million rows, it may take a while. Still better than doing it by hand though! Good luck!
