This post does a great job of showing how to parse a fixed-width text file into a Spark dataframe with pyspark (pyspark parse text file).
I have several text files I want to parse, but they each have slightly different schemas. Rather than writing out the same procedure for each one, as the previous post suggests, I'd like to write a generic function that can parse a fixed-width text file given the widths and column names.
I'm pretty new to pyspark, so I'm not sure how to write a select statement where the number of columns and their types are variable.
Any help would be appreciated!
Say we have a text file like the one in the example thread:
00101292017you1234
00201302017 me5678
in "/tmp/sample.txt". And a dictionary containing for each file name, a list of columns and a list of width:
schema_dict = {
    "sample": {
        "columns": ["id", "date", "string", "integer"],
        "width": [3, 8, 3, 4]
    }
}
We can load the dataframes and split them into columns iteratively, using:
import numpy as np

input_path = "/tmp/"
df_dict = dict()

for file in schema_dict.keys():
    df = spark.read.text(input_path + file + ".txt")
    # 1-based start positions for substr, from the cumulative widths
    start_list = np.cumsum([1] + schema_dict[file]["width"]).tolist()[:-1]
    df_dict[file] = df.select(
        [
            df.value.substr(
                start_list[i],
                schema_dict[file]["width"][i]
            ).alias(schema_dict[file]["columns"][i])
            for i in range(len(start_list))
        ]
    )
For "sample", this gives:
+---+--------+------+-------+
| id| date|string|integer|
+---+--------+------+-------+
|001|01292017| you| 1234|
|002|01302017| me| 5678|
+---+--------+------+-------+
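The columns above all come out as strings. Since the question also mentions variable types, here is a sketch of one way to handle them, assuming a hypothetical "types" list is added to each schema_dict entry:
# Hypothetical extension: schema_dict[file]["types"],
# e.g. ["string", "string", "string", "int"]
df_dict[file] = df.select(
    [
        df.value.substr(
            start_list[i],
            schema_dict[file]["width"][i]
        ).cast(schema_dict[file]["types"][i])  # cast accepts type names as strings
         .alias(schema_dict[file]["columns"][i])
        for i in range(len(start_list))
    ]
)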
Related
My csv file has the below columns:
AFM_reversal_indicator,Alert_Message,axiom_key,_timediff,player,__mv_Splunk_Alert_Id,__mv_nbr_plastic,__mv_code_acct_stat_demo.
I want to remove the columns starting with "__mv".
I saw some posts where pandas is used to filter out columns.
Is it possible to do it using the csv module in Python? If yes, how?
Also, with pandas, what regex should I give?
df.filter(regex='')
df.to_csv(output_file_path)
P.S. I am using Python 3.8.
You mean with standard Python? You can use a list comprehension, e.g.
import csv

with open('data.csv', 'r') as f:
    DataGenerator = csv.reader(f)
    Header = next(DataGenerator)
    Header = [Col.strip() for Col in Header]
    Data = list(DataGenerator)

if Data[-1] == []:
    del Data[-1]

Data = [[Row[i] for i in range(len(Header)) if not Header[i].startswith("__mv")] for Row in Data]
Header = [Col for Col in Header if not Col.startswith("__mv")]
However, this is just a simple example. You'll probably have further things to consider, e.g. what type your csv columns have, whether you want to read all the data at once like I do here, or one-by-one from the generator to save on memory, etc.
You could also use the built-in filter function instead of the inner list comprehension.
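For instance, a minimal sketch of the same inner selection using filter, keeping the Header/Data names from above:
# Indices of columns whose header does not start with "__mv"
keep = list(filter(lambda i: not Header[i].startswith("__mv"), range(len(Header))))
Data = [[Row[i] for i in keep] for Row in Data]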
Also, if you have numpy installed and you want something more 'numerical', you can always use "structured numpy arrays" (https://numpy.org/doc/stable/user/basics.rec.html). They're quite nice (personally, I prefer them to pandas anyway). Numpy also has its own csv-reading functions (see: https://www.geeksforgeeks.org/how-to-read-csv-files-with-numpy/).
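A minimal sketch of that route, assuming a small comma-separated file with a header row:
import numpy as np

# genfromtxt with names=True builds a structured array from the header row
# (note: genfromtxt may sanitize header names; check arr.dtype.names)
arr = np.genfromtxt("data.csv", delimiter=",", names=True, dtype=None, encoding=None)

# Multi-field indexing keeps only the fields we want
keep = [name for name in arr.dtype.names if not name.startswith("__mv")]
arr = arr[keep]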
You don't need to use .filter for that. You can just find out which those columns are and then drop them from the DataFrame:
import pandas as pd

# Load the dataframe (in our case, create a dummy one with your columns)
df = pd.DataFrame(columns=["AFM_reversal_indicator", "Alert_Message", "axiom_key",
                           "_timediff", "player", "__mv_Splunk_Alert_Id",
                           "__mv_nbr_plastic", "__mv_code_acct_stat_demo"])

# Get a list of all column names starting with "__mv"
mv_columns = [col for col in df.columns if col.startswith("__mv")]

# Drop the columns
df = df.drop(columns=mv_columns)

# Save the updated dataframe to a CSV file
df.to_csv("cleaned_data.csv", index=False)
The mv_columns comprehension iterates through the columns in your DataFrame and picks those that start with "__mv". Then .drop just removes those from it.
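If you do want the .filter(regex=...) form from the question, a negative lookahead does it (a sketch, reusing the question's output_file_path):
# Keep only columns whose names do not start with "__mv"
df_clean = df.filter(regex=r'^(?!__mv)')
df_clean.to_csv(output_file_path, index=False)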
If for some reason you want to use the csv package only, then the solution might not be as elegant as with pandas. But here is a suggestion:
import csv

with open("original_data.csv", "r") as input_file, open("cleaned_data.csv", "w", newline="") as output_file:
    reader = csv.reader(input_file)
    writer = csv.writer(output_file)

    header_row = next(reader)
    mv_columns = [col for col in header_row if col.startswith("__mv")]
    mv_column_indices = [header_row.index(col) for col in mv_columns]

    new_header_row = [col for col in header_row if col not in mv_columns]
    writer.writerow(new_header_row)

    for row in reader:
        new_row = [row[i] for i in range(len(row)) if i not in mv_column_indices]
        writer.writerow(new_row)
So, first, you read the first row, which is supposed to be your headers. With a similar logic as before, you find the columns that start with "__mv" and get their indices. You write the new header to your output file, keeping only the columns that are not "__mv" columns. Then you iterate through the rest of the CSV and remove those columns as you go.
I need to add a column to 40 Excel files. The new column in each file will be filled with a name.
This is what I have:
files = ['16686_Survey.xlsx', '16687_Survey.xlsx', '16772_Survey.xlsx', ...] (40 files with more than 200 rows each)
filenames = ['name1', 'name2', 'name3', ...] (40 names)
I need to add a column to each Excel file and write its corresponding name down the new column.
With the following code I got what I need for one file.
import pandas as pd
df = pd.read_excel('16686_Survey.xlsx')
df.insert(0, "WellName", "Name1")
writer = pd.ExcelWriter('16686_Survey.xlsx')
df.to_excel(writer, index = False)
writer.save()
But it would be inefficient if I do it 40 times, and I would like to learn how to use a loop to address this type of problem because I have been in the same situation many times.
The image is what I got with the code above. The first table in the image is what I have; the second table is what I want.
Thank you for your help!
I'm not 100% sure I understand your question, but I think you're looking for this:
import pandas as pd

files = ['16686_Survey.xlsx', '16687_Survey.xlsx', '16772_Survey.xlsx', ...]
filenames = ['name1', 'name2', 'name3', ...]

for excel_file, other_name in zip(files, filenames):
    df = pd.read_excel(excel_file)
    df.insert(0, "WellName", other_name)
    writer = pd.ExcelWriter(excel_file)
    df.to_excel(writer, index=False)
    writer.save()
I combined both lists (I assumed they were the same length) using the zip function. The zip function takes items from the lists one by one and combines them, so that all the first items are paired together, all the second items, and so forth.
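For example, a quick sketch of what zip produces with the first two entries:
>>> list(zip(['16686_Survey.xlsx', '16687_Survey.xlsx'], ['name1', 'name2']))
[('16686_Survey.xlsx', 'name1'), ('16687_Survey.xlsx', 'name2')]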
I am trying to concat several csv files by customer group using the below code:
files = glob.glob(file_from + "/*.csv")             # path where the csv files reside
df_v0 = pd.concat([pd.read_csv(f) for f in files])  # dataframe that concats all the csv files found above
The problem is that the number of columns in the csv varies by customer, and the files do not have a header row.
I am trying to see if I could add a dummy header with labels such as col_1, col_2, ... depending on the number of columns in that csv.
Could anyone guide me as to how I could get this done? Thanks.
Update on trying to search for a specific string in the DataFrame:
Sample DataFrame:
col_1,col_2,col_3
fruit,grape,green
fruit,watermelon,red
fruit,orange,orange
fruit,apple,red
I am trying to filter rows containing the word red and expect it to return rows 2 and 4.
I tried the below code:
df[~df.apply(lambda x: x.astype(str).str.contains('red')).any(axis=1)]
Use the parameter header=None for default range columns 0, 1, 2, ..., and skiprows=1 if you need to remove the original column names:
df_v0 = pd.concat([pd.read_csv(f, header=None, skiprows=1) for f in files])
If you also want to change the column names, add rename:
dfs = [pd.read_csv(f, header=None, skiprows=1).rename(columns=lambda x: f'col_{x + 1}')
       for f in files]
df_v0 = pd.concat(dfs)
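For the update about searching for a specific string: the attempted code is close, but the ~ negates the match, so it returns the rows without 'red'. Dropping it gives rows 2 and 4 (a sketch on the sample dataframe df):
# Keep rows where any column contains 'red'
mask = df.apply(lambda x: x.astype(str).str.contains('red')).any(axis=1)
df_red = df[mask]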
I have a dataframe with 1000+ columns. I need to save this dataframe as a .txt file (not as .csv), with no header, and mode should be "append".
I used the below command, which is not working:
df.coalesce(1).write.format("text").option("header", "false").mode("append").save("<path>")
The error I got:
pyspark.sql.utils.AnalysisException: 'Text data source supports only a single column,
Note: I should not use RDD to save, because I need to save files multiple times to the same path.
If you want to write out a text file for a multi-column dataframe, you will have to concatenate the columns yourself. In the example below, I am separating the different column values with a space and replacing null values with a *:
import pyspark.sql.functions as F

df = sqlContext.createDataFrame([("foo", "bar"), ("baz", None)], ('a', 'b'))

def myConcat(*cols):
    # Interleave the columns with single-space separators,
    # substituting "*" for null values via coalesce
    concat_columns = []
    for c in cols[:-1]:
        concat_columns.append(F.coalesce(c, F.lit("*")))
        concat_columns.append(F.lit(" "))
    concat_columns.append(F.coalesce(cols[-1], F.lit("*")))
    return F.concat(*concat_columns)

df_text = df.withColumn("combined", myConcat(*df.columns)).select("combined")
df_text.show()
df_text.coalesce(1).write.format("text").option("header", "false").mode("append").save("output.txt")
This gives as output:
+--------+
|combined|
+--------+
| foo bar|
| baz *|
+--------+
And your output file should look like this:
foo bar
baz *
You can concatenate the columns easily using the following line (assuming you want a positional file and not a delimited one; using this method for a delimited file would require delimiter columns between each data column):
from pyspark.sql.functions import concat

dataFrameWithOnlyOneColumn = dataFrame.select(concat(*dataFrame.columns).alias('data'))
After concatenating the columns, your previous line should work just fine:
dataFrameWithOnlyOneColumn.coalesce(1).write.format("text").option("header", "false").mode("append").save("<path>")
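If you ever do want a delimited file instead, concat_ws is worth a look (a sketch; note that concat_ws skips nulls entirely rather than writing a placeholder, unlike the coalesce approach above):
from pyspark.sql.functions import concat_ws

# Join all columns with a pipe delimiter into a single text column
delimited = dataFrame.select(concat_ws('|', *dataFrame.columns).alias('data'))
delimited.coalesce(1).write.format("text").mode("append").save("<path>")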
I have a stream of JSONs with the following structure that gets converted to a dataframe:
{
    "a": 3936,
    "b": 123,
    "c": "34",
    "attributes": {
        "d": "146",
        "e": "12",
        "f": "23"
    }
}
The dataframe's show function results in the following output:
sqlContext.read.json(jsonRDD).show
+----+-----------+---+---+
| a| attributes| b| c|
+----+-----------+---+---+
|3936|[146,12,23]|123| 34|
+----+-----------+---+---+
How can I split the attributes column (nested JSON structure) into attributes.d, attributes.e, and attributes.f as separate columns in a new dataframe, so I can have the columns a, b, c, attributes.d, attributes.e, and attributes.f in the new dataframe?
If you want columns named from a to f:
df.select("a", "b", "c", "attributes.d", "attributes.e", "attributes.f")
If you want columns named with the attributes. prefix:
df.select($"a", $"b", $"c", $"attributes.d" as "attributes.d", $"attributes.e" as "attributes.e", $"attributes.f" as "attributes.f")
If names of your columns are supplied from an external source (e.g. configuration):
val colNames = Seq("a", "b", "c", "attributes.d", "attributes.e", "attributes.f")
df.select(colNames.head, colNames.tail: _*).toDF(colNames:_*)
Using the attributes.d notation, you can create new columns and have them in your DataFrame. Look at the withColumn() method in the Java API.
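A pyspark sketch of that idea, using the column names from the question:
# Promote the nested struct fields to top-level columns, then drop the struct
df2 = (df.withColumn("d", df["attributes.d"])
         .withColumn("e", df["attributes.e"])
         .withColumn("f", df["attributes.f"])
         .drop("attributes"))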
Use Python:
Extract the DataFrame by using the pandas lib of Python.
Change the data type from 'str' to 'dict'.
Get the values of each feature.
Save the results to a new file.
import pandas as pd

data = pd.read_csv("data.csv")    # load the csv file from your disk
json_data = data['Desc']          # get the Desc column
data = data.drop(columns='Desc')  # delete the Desc column

Total, Defective = [], []  # set up the output lists

for i in json_data:
    i = eval(i)  # change the data type from 'str' to 'dict'
    Total.append(i['Total'])          # append the 'Total' feature
    Defective.append(i['Defective'])  # append the 'Defective' feature

# finally, complete the DataFrame
data['Total'] = Total
data['Defective'] = Defective

data.to_csv("result.csv")  # save to result.csv and check it
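One note on the design choice: eval executes arbitrary code from the file, so for untrusted input a safer variant is ast.literal_eval (a sketch of the same steps):
import ast

import pandas as pd

data = pd.read_csv("data.csv")
# literal_eval parses Python-literal strings (like a dict repr) without executing code
parsed = data['Desc'].apply(ast.literal_eval)
data['Total'] = parsed.apply(lambda d: d['Total'])
data['Defective'] = parsed.apply(lambda d: d['Defective'])
data.drop(columns='Desc').to_csv("result.csv")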