I wrote a DataFrame containing 1,000,000 rows to a Parquet file.
When I read the parquet file back, the result is 1,000,000 rows.
df = spark.read.parquet(parquet_path)
df.count()
>>> 1000000
When I write that DataFrame out to CSV, read it back, and count the rows, the output is 1,000,365.
df.write.csv(csv_path, sep='\t', header=False, mode='overwrite')
df_csv = spark.read.csv(csv_path, sep='\t', header=False)
df_csv.count()
>>> 1000365
Why is Spark adding the extra rows to the dataset?
I tried a variety of separators and both False and True for the header.
I also tried coalesce and repartition, but the same number keeps appearing.
Does anyone know why Spark would add the extra rows?
This occurs when reading CSV files that contain newlines inside field values. By default, Spark treats every newline as a record separator, so such rows are split into multiple records.
To read a multiline CSV correctly, first make sure the field is properly quoted, for example:
1,short text,"long text with newline\n quote within this field should be \"escaped\"."
Although double quote is the default, the quote character can be set to something other than a double quote.
Check the default CSV options here: https://spark.apache.org/docs/latest/sql-data-sources-csv.html
Then, when reading a CSV that contains embedded newlines (\n), add the multiLine=True option:
spark.read.csv(csv_path, multiLine=True)
# or
spark.read.option('multiLine', True).csv(csv_path)
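For example, a rough round-trip sketch (df and csv_path are the ones from the question; escape='"' is a deliberate choice rather than Spark's default, so embedded quotes are written doubled):
# Write with explicit quoting so fields containing newlines or quotes stay
# inside a single quoted value, then read back with multiLine=True.
df.write.csv(csv_path, sep='\t', header=False,
             quote='"', escape='"', mode='overwrite')

df_csv = spark.read.csv(csv_path, sep='\t', header=False,
                        multiLine=True, quote='"', escape='"')
assert df_csv.count() == df.count()  # row counts should now match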
I know this sounds silly, but is it possible to read a CSV file containing multiple columns and combine all the data into one column? Let's say I have a CSV file with 6 columns and they have different delimiters. Is it possible to read these files, but spit out the first 100 rows into one column, without specifying a delimiter? My understanding is that this isn't possible if using pandas.
I don't know if this helps, but to add context to my question, I'm trying to use Treeview from Tkinter to display the first 100 rows of a CSV file. The Treeview window should display this data as 1 column if a delimiter isn't specified. Otherwise, it will automatically split the data based on a delimiter from the user input.
This is the data I have:
This should be the result:
Pandas isn't the only way to read a CSV file. There is also the built-in csv module in the Python standard library, as well as the basic built-in function open, and either will work just as well. Both of these approaches can yield single rows of data as your question indicates.
Using the open function:
filepath = "/path/to/file.csv"
with open(filepath, "rt", encoding="utf-8") as fd:
    header = next(fd)
    for i, row in enumerate(fd):
        # row is a string with all the data for a single row, for example:
        # "Information,44775.4541667,MicrosoftWindowsSecurity,16384..."
        # ... do something with the row data here
        print(row.rstrip("\n"))
        if i >= 99:  # you can break whenever you want to stop reading
            break
or using the csv module:
import csv

with open("/path/to/file.csv", "rt", encoding="utf8") as fd:
    reader = csv.reader(fd, delimiter=',')
    header = next(reader)
    for row in reader:
        # this time row is a list split by the delimiter, which by default
        # is a comma but can be changed in the call to csv.reader
        print(row)
You can also use
with open('file.csv') as f:
    data = f.readlines()
to read the file line by line.
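If you only need the first 100 lines, a small variation (itertools.islice is one convenient option, not part of the original suggestion):
from itertools import islice

with open('file.csv') as f:
    first_100 = list(islice(f, 100))  # first 100 lines as raw strings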
As other answers have explained, there are various ways to read the first n lines of text from a file. But if you insist on using pandas, there is a trick you can use.
Find a character that will never appear in your text and use it as a dummy delimiter for read_csv(), so that all the text is read as one column. Use the nrows parameter to control the number of lines to read:
pd.read_csv("myfile.csv", sep="~", nrows=100)
I have a Pyspark dataframe that has commas in one of the fields.
Sample data:
+--------+------------------------------------------------------------------------------------+
|id      |reason                                                                              |
+--------+------------------------------------------------------------------------------------+
|123-8aab|Request for the added "Hello Ruth, How are, you, doing and Other" abc. Thanks!      |
|125-5afs|Hi Prachi, I added an "XYZA", like josa.coam, row. "Uid to be eligible" for clarity.|
+--------+------------------------------------------------------------------------------------+
When I write this out as CSV, the data spills over into the next column and is not represented correctly. Code I am using to write the data:
df_csv.repartition(1).write.format('csv').option("header", "true").save(
"s3://{}/report-csv".format(bucket_name), mode='overwrite')
How data appears in csv:
Any help would really be appreciated. TIA.
NOTE: I think that if the field has just commas it exports properly; it's the combination of quotes and commas that causes the issue.
What worked for me was:
df_csv.repartition(1).write.format('csv') \
    .option("header", "true") \
    .option("quote", "\"") \
    .option("escape", "\"") \
    .save("s3://{}/report-csv".format(bucket_name), mode='overwrite')
More detailed explanation in this post:
Reading csv files with quoted fields containing embedded commas
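For completeness, a sketch of reading the exported file back with the matching quote and escape options (bucket_name and the path are the placeholders from the question), so that embedded commas and doubled quotes stay inside a single field:
df_back = spark.read.format('csv') \
    .option("header", "true") \
    .option("quote", "\"") \
    .option("escape", "\"") \
    .load("s3://{}/report-csv".format(bucket_name))
df_back.show(truncate=False)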
I have a problem. I found many related questions asked here and read them all, but I still can't solve it, and so far I haven't received an answer.
I have two files, one .csv and the other .xlsx. They have different numbers of rows and columns. I would like to merge the two according to filenames. Very simplified, the two files look as follows:
The csv file:
The excel file:
First I converted them to pandas data frames:
import pandas as pd
import csv,xlrd
df1 = pd.read_csv('mycsv.csv')
df2 = pd.read_excel('myexcel.xlsx', sheetname=0)
To merge the two files on the same column, I remove the whitespace from the column names in df2 using the first line below, then merge them and write the merged data frame to a csv file.
df2.columns=df2.columns.str.replace(' ', '')
df=pd.merge(df1, df2, on="filename")
df.to_csv('myfolder \\merged_file.csv', sep="\t")
When I check my folder, I see that merged_file.csv exists, but when I open it there is no space between the columns and values. I want a nice, normal csv or excel look, like my example files above. Just to make sure I tried everything, I also converted the Excel file to a csv file and merged the two csvs, but the merged data is still written without spaces. Again, the above files are very simplified, but my real merged data look like this:
Finally figured it out. I am putting the answer here in case anyone else makes the same mistake as me. Just remove the sep="\t" and use the line below instead:
df.to_csv('myfolder \\merged_file.csv')
Just realized the two csv files were comma separated, so using a tab delimiter for the merged output didn't work.
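Putting it together, a sketch of the corrected flow (index=False is my addition to keep the pandas index out of the output; newer pandas spells the keyword sheet_name):
import pandas as pd

df1 = pd.read_csv('mycsv.csv')
df2 = pd.read_excel('myexcel.xlsx', sheetname=0)  # sheet_name in newer pandas

# strip whitespace from the excel column names so 'filename' matches
df2.columns = df2.columns.str.replace(' ', '')

# merge on the shared column and write with the default comma separator
df = pd.merge(df1, df2, on="filename")
df.to_csv('merged_file.csv', index=False)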
I have a csv file containing commas within a column value. For example,
Column1,Column2,Column3
123,"45,6",789
The values are wrapped in double quotes when they contain commas. In the above example, the values are Column1=123, Column2=45,6 and Column3=789. But when trying to read the data, it gives me four values because of the extra comma in the Column2 field.
How to get the right values when reading this data in PySpark? I am using Spark 1.6.3
I am currently doing the below to create a rdd and then a data frame from rdd.
rdd = sc.textFile(input_file).map(lambda line: line.split(','))
df = sqlContext.createDataFrame(rdd)
You can read it directly into a DataFrame using an SQLContext:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.read.format('com.databricks.spark.csv') \
    .options(header='true', inferschema='true', quote='"', delimiter=',') \
    .load(input_file)
As delimiter ',' and quote '"' are the defaults, you can also omit them. Commas inside quoted fields are not treated as delimiters by default. A description of the parameters can be found here: https://github.com/databricks/spark-csv
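For example, relying on those documented defaults, the same call could be shortened to:
df = sqlContext.read.format('com.databricks.spark.csv') \
    .options(header='true', inferschema='true') \
    .load(input_file)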
Edit:
Without relying on Databricks, I can only think of a trickier solution, and it might not be the best approach:
Replace commas in numbers with points
Split using remaining commas
So, you could keep your original code, and add the REGEX replace
import re
rdd = sc.textFile(input_file).map(lambda line: (re.sub(r'\"(\d+),(\d+)\"',r'\1.\2', line)).split(','))
df = sqlContext.createDataFrame(rdd)
The supplied REGEX also gets rid of the double-quotes.
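Another option, not from the original answer: parse each line with Python's built-in csv module inside the map, which respects quoted commas without a regex (a sketch, assuming every record fits on a single line):
import csv

def parse_line(line):
    # csv.reader accepts any iterable of strings, so a one-element list
    # parses a single line while honouring quoted fields
    return next(csv.reader([line]))

rdd = sc.textFile(input_file).map(parse_line)
df = sqlContext.createDataFrame(rdd)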
I'm trying to use Spark to turn one row into many rows. My goal is something like a SQL UNPIVOT.
I have a pipe delimited text file that is 360GB, compressed (gzip). It has over 1,620 columns. Here's the basic layout:
primary_key|property1_name|property1_value|property800_name|property800_value
12345|is_male|1|is_college_educated|1
There are over 800 of these property name/value fields. There are roughly 280 million rows. The file is in an S3 bucket.
The users want me to unpivot the data. For example:
primary_key|key|value
12345|is_male|1
12345|is_college_educated|1
This is my first time using Spark. I'm struggling to figure out a good way to do this.
What is a good way to do this in Spark?
Thanks.
The idea is to generate a list of output lines from each input line, as you have shown. A plain map would give an RDD of lists of lines, so use flatMap instead to get an RDD of individual lines:
If your file is loaded in as rdd1, then the following should give you what you want:
rdd1.flatMap(break_out)
where the function for processing lines is defined as:
def break_out(line):
    # split the line into individual fields/values
    line_split = line.split("|")
    # the values for the line (primary key plus every property value)
    vals = line_split[::2]
    # the field names for the line
    keys = line_split[1::2]
    # the first value is the primary key
    primary_key = vals[0]
    # return one pipe-delimited output line per name/value pair
    return ["|".join((primary_key, keys[i], vals[i + 1])) for i in range(len(keys))]
You may need some additional code to deal with header lines etc, but this should work.
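An end-to-end sketch of how this might be wired up (the S3 paths and the header-dropping step are assumptions, not part of the original answer):
# Load the pipe-delimited file (Spark reads gzip transparently), drop a
# header line if one exists, unpivot with break_out, and write the result.
rdd1 = sc.textFile("s3://my-bucket/wide_properties.txt.gz")

header = rdd1.first()
rdd1 = rdd1.filter(lambda line: line != header)

unpivoted = rdd1.flatMap(break_out)
unpivoted.saveAsTextFile("s3://my-bucket/unpivoted")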