Problem converting CSV to JSON in Python - python-3.x

I am trying to convert a CSV file to JSON in Python, and I have an issue where the data in one column contains a comma, although it is enclosed in double quotes. When reading it as a CSV file, the data loads properly without any issues, but while converting to JSON it fails with the parser error shown below.
sample Data:
col1,col2,col3
apple,Fruit,good for health
banana,Fruit,"good for weight gain , good for calcium"
Brinjal,Vegetable,good for skin
While converting the above file to JSON, it fails because the row containing the quoted comma is treated as having 4 columns.
Error statement: pandas.errors.ParserError: Too many columns specified: expected 3 and found 4
import json
import pandas as pd

data = pd.read_csv("sampledata.csv", header=None)
data_json = json.loads(data.to_json(orient='records'))
with open("filename.json", 'w', encoding='utf-8') as jsonf:
    jsonf.write(json.dumps(data_json, indent=4))

This works:
df = pd.read_csv("test.csv")
df_json = df.to_json()
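For reference, a minimal end-to-end sketch (assuming the sample above is saved as sampledata.csv; the file names are illustrative). Pandas honors the double quotes by default, so the embedded comma survives the round trip:

import json
import pandas as pd

# The quoted field "good for weight gain , good for calcium" is parsed
# as one value because read_csv respects double quotes by default.
df = pd.read_csv("sampledata.csv")

# Convert to a list of records and write pretty-printed JSON.
records = json.loads(df.to_json(orient="records"))
with open("sampledata.json", "w", encoding="utf-8") as jsonf:
    json.dump(records, jsonf, indent=4)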

Related

How do I convert my response with byte characters to readable CSV - PYTHON

I am building an API to save CSVs from the SharePoint REST API using Python 3. I am using a public dataset as an example. The original CSV has 3 columns, Group, Team, FIFA Ranking, with corresponding data in the rows.
After data = response.content, the output of data is:
b'Group,Team,FIFA Ranking\r\nA,Qatar,50\r\nA,Ecuador,44\r\nA,Senegal,18\r\nA,Netherlands,8\r\nB,England,5\r\nB,Iran,20\r\nB,United States,16\r\nB,Wales,19\r\nC,Argentina,3\r\nC,Saudi Arabia,51\r\nC,Mexico,13\r\nC,Poland,26\r\nD,France,4\r\nD,Australia,38\r\nD,Denmark,10\r\nD,Tunisia,30\r\nE,Spain,7\r\nE,Costa Rica,31\r\nE,Germany,11\r\nE,Japan,24\r\nF,Belgium,2\r\nF,Canada,41\r\nF,Morocco,22\r\nF,Croatia,12\r\nG,Brazil,1\r\nG,Serbia,21\r\nG,Switzerland,15\r\nG,Cameroon,43\r\nH,Portugal,9\r\nH,Ghana,61\r\nH,Uruguay,14\r\nH,South Korea,28\r\n'
How do I convert the above to a CSV that pandas can manipulate, with the columns being Group, Team, FIFA Ranking and the corresponding data filled in dynamically, so this method works for any CSV?
I tried:
data = response.content.decode('utf-8', 'ignore').split(',')
however, when I convert the data variable to a DataFrame and then export the CSV, the CSV just returns all the values in one column.
I tried:
data = response.content.decode('utf-8') or data = response.content.decode('utf-8', 'ignore') without the split;
however, pandas does not take this in as a valid DataFrame and raises an error about invalid use of the DataFrame constructor.
I tried:
data = json.loads(response.content)
however, the content itself is not valid JSON, so you get the error json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0).
Given:
data = b'Group,Team,FIFA Ranking\r\nA,Qatar,50\r\nA,Ecuador,44\r\nA,Senegal,18\r\n' #...
If you just want a CSV version of your data you can simply do:
with open("foo.csv", "wt", encoding="utf-8", newline="") as file_out:
file_out.writelines(data.decode())
If your objective is to load this data into a pandas dataframe and the CSV is not actually important, you can:
import io
import pandas
foo = pandas.read_csv(io.StringIO(data.decode()))
print(foo)
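Alternatively (a sketch, assuming data holds the raw response bytes as above), the bytes can be handed to pandas directly without decoding first:

import io
import pandas as pd

# BytesIO wraps the raw bytes as a file-like object, so read_csv can
# parse them directly and handle the decoding itself.
df = pd.read_csv(io.BytesIO(data))
print(df.head())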

pyspark read json file as one column of stringType

I'd like to convert a JSON file to a list of rows, each holding the full JSON string. I would use the text format to read the JSON files, but these files are not newline-separated and I can't change that.
Input JSON:
{"key":"value1"},{"key":"value2"}
Expected Output:
record
{"key":"value1"}
{"key":"value2"}
You can try setting the record separator:
df = spark.read.options(lineSep=",").json(filePathOfJson)
By default, lineSep is "\n".
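If the goal is to keep each record as a raw JSON string in a single column, as in the expected output, a variation (a sketch, assuming a Spark version whose text source supports the lineSep option and that commas only appear between records, not inside them) is to read with the text format instead:

# Each comma-delimited chunk becomes one row in a single string
# column named "value", matching the expected output above.
df = spark.read.option("lineSep", ",").text(filePathOfJson)
df.show(truncate=False)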

Join different DataFrames using loop in Pyspark

I have 5 CSV files in a folder and want to join them into one data frame in PySpark. I use the code below:
name_file =['A', 'B', 'C', 'D', 'V']
for n in name_file:
n= spark.read.csv(fullpath+n+'.csv'
,header=False,
inferSchema= True)
full_data=full_data.join(n,["id"])
Error: I got an unexpected result; the last DataFrame was joined just with itself.
Expected result: there should be 6 columns. Each CSV has 2 columns, one of which is common to all of them, and the join should be on that column. As a result, the final data frame should have the common column plus 5 distinct columns, one from each CSV file.
There seem to be several things wrong with the code, or perhaps you have not provided the complete code.
- Have you defined fullpath?
- You have set header=False, so how will Spark know that there is an "id" column?
- Your indentation looks wrong under the for loop.
- full_data has not been defined yet, so how are you using it on the right side of the assignment within the for loop? I suspect you have initialized this to the first CSV file and are then attempting to join it with the first CSV again.
I ran a small test of the code below, which worked for me and addresses the questions raised above. You can adjust it to your needs.
fullpath = '/content/sample_data/'
full_data = spark.read.csv(fullpath + 'Book1.csv',
                           header=True,
                           inferSchema=True)
name_file = ['Book2', 'Book3']
for n in name_file:
    n = spark.read.csv(fullpath + n + '.csv',
                       header=True,
                       inferSchema=True)
    full_data = full_data.join(n, ["id"])
full_data.show(5)
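As a variation (a sketch, assuming every file has an "id" column and the same fullpath layout as above), the loop can also be expressed with functools.reduce, which avoids reusing the loop variable as a DataFrame name:

from functools import reduce

names = ['Book1', 'Book2', 'Book3']  # illustrative file names
dfs = [spark.read.csv(fullpath + n + '.csv', header=True, inferSchema=True)
       for n in names]

# Fold the list of DataFrames into one by repeatedly joining on "id".
full_data = reduce(lambda left, right: left.join(right, ["id"]), dfs)
full_data.show(5)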

Loading data from csv to pandas dataframe gives NaNs

I am using the following code to read a CSV file into pandas, but when I move the data into a DataFrame I get only NaNs. I need it in a DataFrame so I can load it into SQL Server.
I am using the following code to load the data from the CSV file:
for file in z.namelist():
    df1 = pd.read_csv(z.open(file), sep='\t', skiprows=[1, 2])
    print(df1)
This gives me the intended results.
But when I try to put the data into a new DataFrame, I get only NaNs. This is the code I am using to load the data into the data frame after the step above:
df1 = pd.DataFrame(df1, columns=['ResponseID', 'ResponseSet', 'IPAddress', 'StartDate', 'EndDate',
                                 'RecipientLastName', 'RecipientFirstName', 'RecipientEmail', 'ExternalDataReference', 'Finished',
                                 'Status', 'EmbeddedData', 'License Type', 'Organization', 'Reference ID', 'Q16', 'Q3#1_1_1_TEXT',
                                 'Q3#1_1_2_TEXT', 'Q3#1_1_3_TEXT', 'Q3#1_2_1_TEXT', 'Q3#1_2_3_TEXT', 'Q3#1_3_1_TEXT', 'Q3#1_3_2_TEXT',
                                 'Q3#1_3_3_TEXT', 'Q3#1_4_1_TEXT', 'Q3#1_4_2_TEXT', 'Q3#1_4_3_TEXT', 'Q3#1_5_1_TEXT', 'Q3#1_5_2_TEXT',
                                 'Q3#1_5_3_TEXT', 'Q3#1_6_1_TEXT', 'Q3#1_6_2_TEXT', 'Q3#1_6_3_TEXT', 'Q4#1_5_1_TEXT', 'Q18', 'Q19#1_1_1_TEXT',
                                 'Q19#1_2_1_TEXT', 'Q19#1_3_1_TEXT', 'Q19#1_4_1_TEXT', 'Q19#1_6_1_TEXT', 'Q14#1_4_1_TEXT', 'Q14#1_5_1_TEXT',
                                 'Q14#1_8_1_TEXT', 'Q20', 'Q29', 'Q21', 'Q22', 'Q23', 'Q24', 'LocationLatitude', 'LocationLongitude', 'LocationAccuracy'])
print(df1)
I am getting only NaNs for this.
What should I do to get the data from the CSV into my data frame, and what is wrong with my code?
I was able to resolve this by using "," as the separator for my read_csv:
df1 = pd.read_csv(z.open(file), sep=',', skiprows=[1, 2])
I got rid of the remaining NaNs by using the following:
df1 = df1.replace({np.nan: None})
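For context (a minimal sketch with made-up column names), pd.DataFrame(existing_df, columns=[...]) reindexes by column label, so any listed column that does not exactly match a column in the source DataFrame comes back as all NaN:

import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# 'b' matches and is kept; 'c' does not exist in df, so it is all NaN.
out = pd.DataFrame(df, columns=['b', 'c'])
print(out)
#    b   c
# 0  3 NaN
# 1  4 NaN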

How to convert dataframe to a text file in spark?

I unloaded a Snowflake table and created a data frame.
The table has columns of various data types.
I tried to save it as a text file but got an error:
Text data source does not support Decimal(10,0).
So to resolve the error, I cast all columns in my select query to string.
Then I got the error below:
Text data source supports only single column, and you have 5 columns.
My requirement is to create a text file as follows:
"column1value column2value column3value and so on"
You can use a CSV output with a space delimiter:
import pyspark.sql.functions as F
df.select([F.col(c).cast('string') for c in df.columns]).write.csv('output', sep=' ')
If you want only 1 output file, you can add .coalesce(1) before .write.
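For example (a sketch reusing the same illustrative 'output' path):

import pyspark.sql.functions as F

# coalesce(1) merges all partitions so Spark writes a single part file.
df.select([F.col(c).cast('string') for c in df.columns]) \
  .coalesce(1) \
  .write.csv('output', sep=' ')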
You need to have a single column if you want to write using spark.write.text. You can use csv instead, as suggested in #mck's answer, or you can concatenate all columns into one before you write (this snippet is Scala):
df.select(
  concat_ws(" ", df.columns.map(c => col(c).cast("string")): _*).as("value")
).write
 .text("output")
