I am building an API to save CSVs from the SharePoint REST API using Python 3, and I am using a public dataset as an example. The original CSV has three columns, Group, Team, FIFA Ranking, with the corresponding data in the rows, as it appears in the SharePoint UI.
After using data = response.content, the output of data is:
b'Group,Team,FIFA Ranking\r\nA,Qatar,50\r\nA,Ecuador,44\r\nA,Senegal,18\r\nA,Netherlands,8\r\nB,England,5\r\nB,Iran,20\r\nB,United States,16\r\nB,Wales,19\r\nC,Argentina,3\r\nC,Saudi Arabia,51\r\nC,Mexico,13\r\nC,Poland,26\r\nD,France,4\r\nD,Australia,38\r\nD,Denmark,10\r\nD,Tunisia,30\r\nE,Spain,7\r\nE,Costa Rica,31\r\nE,Germany,11\r\nE,Japan,24\r\nF,Belgium,2\r\nF,Canada,41\r\nF,Morocco,22\r\nF,Croatia,12\r\nG,Brazil,1\r\nG,Serbia,21\r\nG,Switzerland,15\r\nG,Cameroon,43\r\nH,Portugal,9\r\nH,Ghana,61\r\nH,Uruguay,14\r\nH,South Korea,28\r\n'
How do I convert the above into a CSV that pandas can manipulate, with the columns being Group, Team, FIFA Ranking and the corresponding data filled in dynamically, so that this method works for any CSV?
I tried:
data=response.content.decode('utf-8', 'ignore').split(',')
However, when I convert the data variable to a DataFrame and then export it to CSV, all the values end up in a single column.
I tried:
data=response.content.decode('utf-8') or data=response.content.decode('utf-8', 'ignore') without the split
However, pandas does not accept this as valid DataFrame input and raises an error about invalid use of the DataFrame constructor.
I tried:
data=json.loads(response.content)
However, the data itself is not valid JSON, so you get the error json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0).
Given:
data = b'Group,Team,FIFA Ranking\r\nA,Qatar,50\r\nA,Ecuador,44\r\nA,Senegal,18\r\n' #...
If you just want a CSV version of your data, you can simply do:
with open("foo.csv", "wt", encoding="utf-8", newline="") as file_out:
    # the decoded bytes are already CSV text, so write them straight to disk
    file_out.write(data.decode())
If your objective is to load this data into a pandas dataframe and the CSV is not actually important, you can:
import io
import pandas
foo = pandas.read_csv(io.StringIO(data.decode()))
print(foo)
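Putting the two together, here is a minimal sketch of the full round trip, assuming response is the object returned by your SharePoint request and output.csv is just an example path; because pandas infers the columns from the header row, the same code works for any CSV the API returns:

import io
import pandas

# Decode the raw bytes into CSV text and let pandas infer the columns
# (Group, Team, FIFA Ranking in this example) from the header row.
data = response.content.decode("utf-8")
df = pandas.read_csv(io.StringIO(data))

# Manipulate df as needed, then write it back out as a regular CSV file.
df.to_csv("output.csv", index=False)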
Following is my sample CSV file.
id,name,gender
1,isuru,male
2,perera,male
3,kasun,male
4,ann,female
I converted the above CSV file into Apache Parquet using the pandas library. Following is my code.
import pandas as pd
df = pd.read_csv('./data/students.csv')
df.to_parquet('students.parquet')
After that I uploaded the Parquet file to S3 and created an external table like below.
create external table imp.s1 (
id integer,
name varchar(255),
gender varchar(255)
)
stored as PARQUET
location 's3://sample/students/';
After that I just ran a select query, but I got the following error.
select * from imp.s1
Spectrum Scan Error. File 'https://s3.ap-southeast-2.amazonaws.com/sample/students/students.parquet'
has an incompatible Parquet schema for column 's3://sample/students.id'.
Column type: INT, Parquet schema:
optional int64 id [i:0 d:1 r:0]
(s3://sample/students.parquet)
Could you please help me figure out what the problem is here?
For nullable integer values, pandas uses the dtype Int64, which corresponds to Bigint in Parquet on Amazon S3.
Parquet Amazon S3 File Data Type | Transformation | Description
Int32 | Integer | -2,147,483,648 to 2,147,483,647 (precision of 10, scale of 0)
Int64 | Bigint | -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807 (precision of 19, scale of 0)
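If you want to confirm what pandas actually wrote, you can inspect the Parquet schema directly; a quick sketch, assuming pyarrow is installed (pandas uses it as the default Parquet engine when available):

import pyarrow.parquet as pq

# Print the schema of the file pandas produced; id shows up as int64,
# which is why Spectrum maps it to BIGINT rather than INTEGER.
print(pq.read_schema('students.parquet'))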
You need to explicitly set the column type of id when calling pandas.read_csv.
df = pd.read_csv('./data/students.csv', dtype={'id': 'int32'})
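If the DataFrame is already loaded, a sketch of the same fix applied just before writing the Parquet file (assuming the id column contains no nulls, since a plain int32 column cannot hold them):

# Downcast the id column so the Parquet file stores int32,
# matching the external table's integer column.
df['id'] = df['id'].astype('int32')
df.to_parquet('students.parquet')

Alternatively, declaring id as bigint in the external table definition would match the int64 schema that pandas writes by default.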
I am using Spark 2.4 and the code below to cast a string datetime column (rec_dt) in a DataFrame (df1) to a timestamp (rec_date) and create another DataFrame (df2).
All the datetime values are parsed correctly except for the values that fall on daylight saving transitions.
The time zone of my session is 'Europe/London'. I do not want to store the data in the UTC time zone; in the end I have to write the data in the 'Europe/London' time zone only.
spark_session.conf.get("spark.sql.session.timeZone")
# Europe/London
Code:
from pyspark.sql.functions import to_timestamp
df2 = df1.withColumn("rec_date", to_timestamp("rec_dt", "yyyy-MM-dd-HH.mm.ss"))
Output: (not shown)
Please help.
I'm trying to load a CSV from Google Cloud Storage into BigQuery using schema autodetect.
However, I'm getting stumped by a parsing error on one of my columns, and I'm perplexed as to why BigQuery can't parse the field. According to the documentation, it should be able to parse fields that look like YYYY-MM-DD HH:MM:SS.SSSSSS (which is exactly what my BQInsertTimeUTC column contains).
Here's my code:
from google.cloud import bigquery
from google.oauth2 import service_account
project_id = "<my_project_id>"
table_name = "<my_table_name>"
gs_link = "gs://<my_bucket_id>/my_file.csv"
creds = service_account.Credentials.from_service_account_info(gcs_creds)
bq_client = bigquery.Client(project=project_id, credentials=creds)
dataset_ref = bq_client.dataset(<my_dataset_id>)
# create job_config object
job_config = bigquery.LoadJobConfig(
autodetect=True,
skip_leading_rows=1,
source_format="CSV",
write_disposition="WRITE_TRUNCATE",
)
# prepare the load_job
load_job = bq_client.load_table_from_uri(
gs_link,
dataset_ref.table(table_name),
job_config=job_config,
)
# execute the load_job
result = load_job.result()
Error Message:
Could not parse '2021-07-07 23:10:47.989155' as TIMESTAMP for field BQInsertTimeUTC (position 4) starting at location 64 with message 'Failed to parse input string "2021-07-07 23:10:47.989155"'
And here's the csv file that is living in GCS:
first_name, last_name, date, number_col, BQInsertTimeUTC, ModifiedBy
lisa, simpson, 1/2/2020T12:00:00, 2, 2021-07-07 23:10:47.989155, tim
bart, simpson, 1/2/2020T12:00:00, 3, 2021-07-07 23:10:47.989155, tim
maggie, simpson, 1/2/2020T12:00:00, 4, 2021-07-07 23:10:47.989155, tim
marge, simpson, 1/2/2020T12:00:00, 5, 2021-07-07 23:10:47.989155, tim
homer, simpson, 1/3/2020T12:00:00, 6, 2021-07-07 23:10:47.989155, tim
Loading CSV files into BigQuery assumes that all the values in a timestamp column follow the same format. In your CSV file, since the first timestamp value is "1/2/2020T12:00:00", autodetect decides that the timestamp format the file uses is [M]M/[D]D/YYYYT[H]H:[M]M:[S]S[.F][time zone].
Therefore, it complains that the value "2021-07-07 23:10:47.989155" could not be parsed. If you change "2021-07-07 23:10:47.989155" to "7/7/2021T23:10:47.989155", it will work.
To fix this, you can either:
Create a table with the date column's type and the BQInsertTimeUTC column's type as STRING, load the CSV into it, and then expose a view with the expected TIMESTAMP column types for date and BQInsertTimeUTC, using SQL to transform the data from the base table (a sketch of such an explicit-schema load follows this list).
Open the CSV file and transform either the "date" values or the "BQInsertTimeUTC" values so that their formats are consistent.
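As a sketch of the first option, the load can be given an explicit schema so the two problematic columns stay STRING; this reuses the bigquery, bq_client, gs_link, dataset_ref and table_name names from the question, and the exact field list is an assumption based on the sample CSV:

# Explicit schema: keep date and BQInsertTimeUTC as STRING for now.
job_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField("first_name", "STRING"),
        bigquery.SchemaField("last_name", "STRING"),
        bigquery.SchemaField("date", "STRING"),
        bigquery.SchemaField("number_col", "INT64"),
        bigquery.SchemaField("BQInsertTimeUTC", "STRING"),
        bigquery.SchemaField("ModifiedBy", "STRING"),
    ],
    skip_leading_rows=1,
    source_format="CSV",
    write_disposition="WRITE_TRUNCATE",
)
load_job = bq_client.load_table_from_uri(
    gs_link, dataset_ref.table(table_name), job_config=job_config
)
load_job.result()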
By the way, the CSV sample you pasted here has an extra space after the "," delimiter.
Working version:
first_name,last_name,date,number_col,BQInsertTimeUTC,ModifiedBy
lisa,simpson,1/2/2020T12:00:00,2,7/7/2021T23:10:47.989155,tim
bart,simpson,1/2/2020T12:00:00,3,7/7/2021T23:10:47.989155,tim
maggie,simpson,1/2/2020T12:00:00,4,7/7/2021T23:10:47.989155,tim
marge,simpson,1/2/2020T12:00:00,5,7/7/2021T23:10:47.989155,tim
homer,simpson,1/3/2020T12:00:00,6,7/7/2021T23:10:47.989155,tim
As per the limitations mentioned here,
When you load JSON or CSV data, values in TIMESTAMP columns must use a dash - separator for the date portion of the timestamp, and the date must be in the following format: YYYY-MM-DD (year-month-day). The hh:mm:ss (hour-minute-second) portion of the timestamp must use a colon : separator.
So can you try passing BQInsertTimeUTC as 2021-07-07 23:10:47, without the milliseconds, instead of 2021-07-07 23:10:47.989155?
If you still want to use different date formats, you can do the following:
Load the CSV file as-is to BigQuery (i.e. your schema should be modified so that BQInsertTimeUTC is STRING).
Create a BigQuery view that transforms the shipped field from a string to a recognized date format.
Do a PARSE_DATE on BQInsertTimeUTC and use that view for your analysis (see the sketch after these steps).
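A sketch of steps 2 and 3, reusing bq_client from the question; the project, dataset, table and view names are placeholders, and PARSE_TIMESTAMP is used rather than PARSE_DATE because BQInsertTimeUTC carries a time component:

# Create a view that exposes the raw STRING column as a proper TIMESTAMP.
view_sql = """
CREATE OR REPLACE VIEW `my_project.my_dataset.my_view` AS
SELECT
  first_name,
  last_name,
  date,
  number_col,
  PARSE_TIMESTAMP('%Y-%m-%d %H:%M:%E*S', BQInsertTimeUTC) AS BQInsertTimeUTC,
  ModifiedBy
FROM `my_project.my_dataset.my_raw_table`
"""
bq_client.query(view_sql).result()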
I am reading an xlsx file using pandas' pd.read_excel(myfile.xlsx, sheet_name="my_sheet", header=2) and writing the df to a CSV file using df.to_csv.
The Excel file contains several columns with percentage values in them (e.g. 27.44 %). In the DataFrame these values get converted to 0.2744, and I don't want any modification of the data. How can I achieve this?
I already tried:
Using a lambda function to convert the values back from 0.2744 to 27.44 %, but I don't want this because the column names/indexes are not fixed; any column can contain the % values.
df = pd.read_excel(myexcel.xlsx,sheet_name="my_sheet",header=5,dtype={'column_name':str}) - Didn't work
df = pd.read_excel(myexcel.xlsx,sheet_name="my_sheet",header=5,dtype={'column_name':object}) - Didn't work
I also tried the xlrd module, but it too converted the % values to floats.
df = pd.read_excel(myexcel.xlsx,sheet_name="my_sheet")
df.to_csv(mycsv.csv,sep=",",index=False)
From your xlsx, save the file directly in CSV format.
To import your CSV file, use the pandas library as follows:
import pandas as pd
df = pd.read_csv('my_sheet.csv')  # in case your file is located in the same directory
More information: see the pandas.read_csv documentation.
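If the CSV exported from Excel contains the literal text (e.g. 27.44 %) and you want pandas to leave it untouched, a small sketch that forces every column to be read as a plain string, so no numeric conversion happens (my_sheet.csv and my_output.csv are example paths):

import pandas as pd

# dtype=str keeps every value exactly as written in the CSV,
# including percent signs, instead of converting them to floats.
df = pd.read_csv('my_sheet.csv', dtype=str)
df.to_csv('my_output.csv', sep=',', index=False)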