Dataframe saving in Python - python-3.x

I have split a string into date and time, and I am satisfied with how it was done.
The output is data split into 4 columns.
Now I would like to save the dataframe into a CSV file, but every time I do it, the old / original data format is stored (with 2 columns).
Splitting a string into date and time (4 columns)
Unable to save the changes I made to the data (2 columns)
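A minimal sketch of the usual fix, assuming a hypothetical single timestamp-like column (the original code is not shown): assign the split result back to the dataframe, then save that same object with to_csv.
import pandas as pd

# hypothetical stand-in for the original two-column data
df = pd.DataFrame({"timestamp": ["2021-03-01 12:34:56", "2021-03-02 07:08:09"]})

# assign the split result back to df so the change is actually kept
df[["date", "time"]] = df["timestamp"].str.split(" ", expand=True)

# save the modified dataframe object, not a re-read copy of the original file
df.to_csv("split_output.csv", index=False)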

Related

pyspark csv format - mergeschema

I have a large dump of data that spans several TB. The files contain activity data on a daily basis.
Day 1 can have 2 columns and Day 2 can have 3 columns. The file dump is in CSV format. Now I need to read all these files and load them into a table. The problem is that the format is CSV and I am not sure how to merge the schema so as not to lose any columns. I know this can be achieved in Parquet through mergeSchema, but I can't convert these files one by one into Parquet as the data is huge. Is there any way to merge the schema with CSV as the format?
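One possible approach, sketched in PySpark under the assumption that Spark 3.1+ is available (unionByName only accepts allowMissingColumns from that version) and with placeholder paths and table name: read each daily CSV with its own header-derived schema, then union the frames by column name so missing columns are padded with nulls, which is effectively a schema merge for CSV.
from functools import reduce
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# placeholder paths, one per daily dump; headers are assumed to be present in each file
paths = ["dump/day1.csv", "dump/day2.csv"]

# read each day separately so every file keeps its own column set
daily = [spark.read.option("header", True).option("inferSchema", True).csv(p) for p in paths]

# Spark 3.1+: unionByName(..., allowMissingColumns=True) pads absent columns with null
merged = reduce(lambda left, right: left.unionByName(right, allowMissingColumns=True), daily)

# load the merged result into a table (placeholder name)
merged.write.mode("overwrite").saveAsTable("activity_table")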

Azure Data Factory DYNAMICALLY partition a csv/txt file based on rowcount

I am using an Azure data flow to transform delimited files (CSV/TXT) to JSON, but I want to split the output files dynamically based on a maximum row count of 5,000, because I will not know the row count in advance. So if I have a CSV file with 10,000 rows, the pipeline should output two equal JSON files, file1.json and file2.json. What is the best way to actually get the row count of my sources and the correct number n of partitions based on that row count within Azure Data Factory?
One way to achieve this is to use the mod (%) operator.
To start with, set a surrogate key on the CSV file, or use any sequential key in the data.
Add an aggregate step with a group-by clause that is your key % row count.
Set the aggregate function to collect().
Your output should now be an array of rows with the expected count in each.
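Roughly the same idea in PySpark, for illustration only (df and its sequential surrogate key column key are placeholders; the dataflow's collect() corresponds to collect_list here):
from pyspark.sql import functions as F

max_rows = 5000
num_parts = -(-df.count() // max_rows)  # ceiling division: how many groups of ~5,000 rows are needed

# key % number-of-groups round-robins the rows into groups of roughly max_rows each,
# mirroring the "key % ..." group-by in the aggregate step
grouped = (df.withColumn("part", F.col("key") % num_parts)
             .groupBy("part")
             .agg(F.collect_list(F.struct(*df.columns)).alias("rows")))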
We can't specify a row number at which to split the CSV file. The closest workaround is to specify the partitioning of the sink.
For example, I have a CSV file that contains 700 rows of data, and I successfully copy it to two equal JSON files.
My source CSV data in Blob storage:
Sink settings: each partition outputs a new file, json1.json and json2.json:
Optimize:
Partition operation: Set partition
Partition type: Dynamic partition
Number of partitions: 2 (meaning the CSV data is split into 2 partitions)
Stored ranges in columns: id (split based on the id column)
Run the data flow and the CSV file will be split into two JSON files, each containing 350 rows of data.
For your situation, with a CSV file of 10,000 rows, the pipeline will output two equal JSON files (each containing 5,000 rows of data).
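The same two-way split can also be expressed in PySpark (Spark 2.4+), again with a placeholder dataframe and output path:
# two range-based partitions on "id", each written out as its own JSON part file
df.repartitionByRange(2, "id").write.mode("overwrite").json("output/json_parts")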

Insert list of multiple values in SQL table using pyodbc

I have a list with more than 5 million values. I need to insert this list into a table in my database using pyodbc. I tried 2 methods, but both take a lot of time.
My list looks like this:
a=[3000200,3000201,3000202, ...]
Code I have tried so far (assume table TEMP has already been created and mylist.csv holds all the values of list a):
Code 1: create a CSV file and insert from that (it takes 1.5 hours to create the CSV file):
cursor.execute("INSERT INTO TEMP SELECT * FROM EXTERNAL 'mylist.csv' USING (DELIMITER ',' REMOTESOURCE 'ODBC')")
Code 2: insert from the list using a loop (the ETA is over 120 hours!):
b = 'INSERT INTO TEMP VALUES (?)'
for i in tqdm(a):
    cursor.execute(b, i)
Is there an easier way to perform this insert without waiting so long?
EDIT
For Code 1, I was using a for loop to write the values into the CSV file, and this was taking far longer than necessary.
As mentioned by Gord in the comments below, I converted the list into a dataframe and wrote it out with to_csv, then loaded it using Code 1, and it now takes less than a minute!
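For reference, a rough sketch of the fixed Code 1 path described in the edit (a and TEMP are as above; the external-table syntax is unchanged):
import pandas as pd

# build the csv with pandas instead of a hand-written loop -- this is the part that is now fast
pd.DataFrame({"value": a}).to_csv("mylist.csv", index=False, header=False)

# then bulk-load it exactly as in Code 1
cursor.execute("INSERT INTO TEMP SELECT * FROM EXTERNAL 'mylist.csv' USING (DELIMITER ',' REMOTESOURCE 'ODBC')")
cursor.commit()
If the external-table route is not available, pyodbc's cursor.fast_executemany = True together with executemany is usually the next thing to try for large parameter lists.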

Converting data from .dat to parquet using Pyspark

Why is the number of rows different after converting from .dat to the Parquet data format using PySpark? Even when I repeat the conversion on the same file multiple times, I get a different result (slightly more than, slightly less than, or equal to the original row count)!
I am using my MacBook Pro with 16 GB of RAM.
The .dat file size is 16.5 GB.
Spark version: spark-2.3.2-bin-hadoop2.7.
I already have the rows count from my data provider (45 million rows).
First, I read the .dat file:
df_2011 = spark.read.text(filepath)
Second, I convert it to Parquet, a process that takes about two hours:
df_2011.write.option("compression", "snappy").mode("overwrite").save("2011.parquet")
Afterwards, I read the converted Parquet file:
de_parq = spark.read.parquet(filepath)
Finally, I use count to get the number of rows:
de_parq.count()
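One way to narrow this down, using only the calls already shown above, is to count the raw text lines and the Parquet rows in the same session and compare both numbers with the provider's 45 million:
raw_count = spark.read.text(filepath).count()               # rows as read from the .dat file
parquet_count = spark.read.parquet("2011.parquet").count()  # rows after the conversion
print(raw_count, parquet_count)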

How can I write the content of dataframes with different numbers of columns into a single final output text file using Spark 2?

I have 6 Datasets, each with a different number of columns, and I need to put the content of all 6 Datasets into one single text file. How can I do this in Spark 2 with Java?
I tried union, but that does not work because the column counts do not match across the Datasets.
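One way to sidestep the column-count mismatch, sketched here in PySpark for brevity (the same functions exist in the Java Dataset API; df1 ... df6 and the output path are placeholders): collapse each Dataset into a single delimited string column first, then union the resulting one-column frames and write them as text.
from pyspark.sql import functions as F

def to_text(df, sep="|"):
    # collapse all columns into one string column so every frame has the same shape;
    # note: concat_ws skips nulls, so empty fields disappear from the line
    return df.select(F.concat_ws(sep, *[F.col(c).cast("string") for c in df.columns]).alias("value"))

combined = to_text(df1)
for d in [df2, df3, df4, df5, df6]:
    combined = combined.union(to_text(d))

# coalesce(1) gives a single output part file
combined.coalesce(1).write.mode("overwrite").text("final_output")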
