Create string from all row elements pandas - python-3.x

I have a CSV file and I would like to create a string from all the elements of each row. Let's say that I have the following CSV...
trump,clinton
google,microsoft,linkedin
linux,windows,osx
data science,operating systems
I would like to create strings like so: trump&clinton | google&microsoft&linkedin and so forth. I imported the file and created a df with pandas. The solution doesn't have to use pandas; if it can be done with import csv, that is acceptable as well.
I need one string per row... each row will become its own string.

Try
df.apply('&'.join, axis=1)
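Note that '&'.join only accepts strings, so the one-liner above assumes every cell holds a string. With rows of unequal length, as in the sample, a DataFrame may end up with missing values in the shorter rows, and the join would then raise a TypeError. Since the question also allows the csv module, here is a minimal sketch of that route. The sample text is inlined through io.StringIO only to keep the example self-contained; the names raw and row_strings are illustrative.

```python
import csv
import io

# Sample data from the question; in practice, open the real file instead.
raw = (
    "trump,clinton\n"
    "google,microsoft,linkedin\n"
    "linux,windows,osx\n"
    "data science,operating systems\n"
)

# csv.reader yields one list of strings per row and does not pad short rows.
rows = list(csv.reader(io.StringIO(raw)))

# One '&'-joined string per row.
row_strings = ["&".join(row) for row in rows]

for s in row_strings:
    print(s)
```

With a real file, replace io.StringIO(raw) with a file object opened with newline="", as the csv module documentation recommends.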

Related

How do I convert my response with byte characters to readable CSV - PYTHON

I am building an API to save CSVs from the SharePoint REST API using Python 3. I am using a public dataset as an example. The original CSV has 3 columns, Group,Team,FIFA Ranking, with corresponding data in the rows. For reference, the original CSV in the SharePoint UI looks like this:
After using data=response.content, the output of data is:
b'Group,Team,FIFA Ranking\r\nA,Qatar,50\r\nA,Ecuador,44\r\nA,Senegal,18\r\nA,Netherlands,8\r\nB,England,5\r\nB,Iran,20\r\nB,United States,16\r\nB,Wales,19\r\nC,Argentina,3\r\nC,Saudi Arabia,51\r\nC,Mexico,13\r\nC,Poland,26\r\nD,France,4\r\nD,Australia,38\r\nD,Denmark,10\r\nD,Tunisia,30\r\nE,Spain,7\r\nE,Costa Rica,31\r\nE,Germany,11\r\nE,Japan,24\r\nF,Belgium,2\r\nF,Canada,41\r\nF,Morocco,22\r\nF,Croatia,12\r\nG,Brazil,1\r\nG,Serbia,21\r\nG,Switzerland,15\r\nG,Cameroon,43\r\nH,Portugal,9\r\nH,Ghana,61\r\nH,Uruguay,14\r\nH,South Korea,28\r\n'
How do I convert the above to a CSV that pandas can manipulate, with the columns being Group,Team,FIFA Ranking followed by the corresponding data, in a dynamic way so that this method works for any CSV?
I tried:
data=response.content.decode('utf-8', 'ignore').split(',')
However, when I convert the data variable to a dataframe and then export it, the CSV just contains all the values in one column.
I tried:
data=response.content.decode('utf-8') or data=response.content.decode('utf-8', 'ignore') without the split
However, pandas does not accept this as valid input for a df and reports an invalid use of the DataFrame constructor.
I tried:
data=json.loads(response.content)
However, the content is not valid JSON, so you get the error json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Given:
data = b'Group,Team,FIFA Ranking\r\nA,Qatar,50\r\nA,Ecuador,44\r\nA,Senegal,18\r\n' #...
If you just want a CSV version of your data you can simply do:
with open("foo.csv", "wt", encoding="utf-8", newline="") as file_out:
    # newline="" keeps the \r\n line endings already present in the data
    file_out.write(data.decode())
If your objective is to load this data into a pandas dataframe and the CSV is not actually important, you can:
import io
import pandas
foo = pandas.read_csv(io.StringIO(data.decode()))
print(foo)

Issue when exporting dataframe to csv

I'm working on a mechanical engineering project. For the following code, the user enters the number of cylinders that their compressor has. A dataframe is then created with the correct number of columns and is exported to Excel as a CSV file.
The outputted dataframe looks exactly like I want it to as shown in the first link, but when opened in Excel it looks like the image in the second link:
1.my dataframe
2.Excel Table
Why is my dataframe not exporting properly to Excel and what can I do to get the same dataframe in Excel?
import pandas as pd
CylinderNo=int(input('Enter CylinderNo: '))
new_number=CylinderNo*3
list1=[]
for i in range(1,CylinderNo+1):
    for j in range(0,3):
        Cylinder_name=str('CylinderNo ')+str(i)
        list1.append(Cylinder_name)
df = pd.DataFrame(list1,columns =['Kurbel/Zylinder'])
list2=['Triebwerk', 'Packung','Ventile']*CylinderNo
Bauteil = {'Bauteil': list2}
df2 = pd.DataFrame (Bauteil, columns = ['Bauteil'])
new=pd.concat([df, df2], axis=1)
list3=['Nan','Nan','Nan']*CylinderNo
Bewertung={'Bewertung': list3}
df3 = pd.DataFrame (Bewertung, columns = ['Bewertung'])
new2=pd.concat([new, df3], axis=1)
Empfehlung={'Empfehlung': list3}
df4 = pd.DataFrame (Empfehlung, columns = ['Empfehlung'])
new3=pd.concat([new2, df4], axis=1)
new3.set_index('Kurbel/Zylinder')
new3 = new3.set_index('Kurbel/Zylinder', append=True).swaplevel(0,1)
#export dataframe to csv
new3.to_csv('new3.csv')
To be clear, a comma-separated values (CSV) file is not an Excel format or table. It is a delimited text file that Excel, like other applications, can open.
What you are comparing is simply presentation. Both data frames are exactly the same. For MultiIndex data frames, the pandas print output does not repeat index values, for readability on the console or in an IDE like Jupyter. Such values are not removed from the underlying data frame, only from its presentation. If you re-order the index levels, you will see this presentation change. The complete data frame is what is exported to CSV. Ideally, for data integrity, you want the full data set exported with to_csv to be importable back into pandas with read_csv (which can set indexes) or into other languages and applications.
Essentially, CSV is an industry format for storing and transferring data. Consider using Excel spreadsheets, HTML, Markdown, or other reporting formats for your presentation needs; to_csv may not be the best method for that. You can try to build the text file manually with Python I/O write methods (with open('new.csv', 'w') as f), but that will be an extensive workaround. See also @Jeff's answer, but do note that the latter part of that solution does remove data.
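The point about presentation versus data can be checked with a small round trip. The sketch below is an illustration, not the asker's exact frame: it builds a two-level index from the Kurbel/Zylinder and Bauteil columns, uses made-up integer scores in Bewertung, and writes to an in-memory buffer instead of a file.

```python
import io

import pandas as pd

# Simplified stand-in for the asker's frame (values are hypothetical).
df = pd.DataFrame(
    {
        "Kurbel/Zylinder": ["CylinderNo 1"] * 3 + ["CylinderNo 2"] * 3,
        "Bauteil": ["Triebwerk", "Packung", "Ventile"] * 2,
        "Bewertung": [1, 2, 3, 4, 5, 6],
    }
).set_index(["Kurbel/Zylinder", "Bauteil"])

# print(df) shows each outer index value only once per group ...
print(df)

# ... but the CSV text repeats it on every row, because every row carries it.
buffer = io.StringIO()
df.to_csv(buffer)
csv_text = buffer.getvalue()
print(csv_text)

# Reading back with index_col restores the same two-level index.
restored = pd.read_csv(io.StringIO(csv_text), index_col=[0, 1])
print(restored)
```

If the goal is a spreadsheet that visually groups the outer level, DataFrame.to_excel is the more natural export, since it writes MultiIndex levels as merged cells by default. It needs an Excel writer engine such as openpyxl installed.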

Pyspark Obtain Substring from Filename and Store as New Column

I am processing CSV files from S3 using pyspark, however I wish to incorporate filename as a new column for which I am using the below code:
spark.udf.register("filenamefunc", lambda x: x.rsplit('/', 1)[-2])
df=spark.read.csv("s3a://exportcsv-battery/S5/243/101*",sep=',',header=True,inferSchema=True)
df=df.withColumn("filename", 'filenamefunc(input_file_name())')
But instead of filename, I want a substring of it, for example, if this is the input_file_name:-
s3a://exportcsv-battery/S5/243/101_002932_243_AAA_A_T01_AAA_AAA_0_0_0_0_2_10Hz.csv
I only want 243 to be extracted and stored in a new column for which I defined a UDF as:
spark.udf.register("filenamefunc", lambda x: x.rsplit('/', 1)[-2])
But it doesn't seem to work. Is there something I can do to fix it or a different approach? Thanks!
You can use the split() function and take the path component you need by position:
import pyspark.sql.functions as f
[...]
df = df.withColumn('filename', f.split(f.input_file_name(), '/')[4])
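To see why index 4 picks out 243, and why the original rsplit('/', 1)[-2] does not, the same splitting can be tried on the example path in plain Python, with no Spark session needed. The variable names below are illustrative only.

```python
path = "s3a://exportcsv-battery/S5/243/101_002932_243_AAA_A_T01_AAA_AAA_0_0_0_0_2_10Hz.csv"

# rsplit with maxsplit=1 produces only two pieces, so [-2] is the whole
# prefix before the file name rather than a single path component.
prefix = path.rsplit("/", 1)[-2]
print(prefix)

# A full split produces one piece per component. The empty string at
# position 1 comes from the double slash after "s3a:".
parts = path.split("/")
print(parts)

folder_by_position = parts[4]   # counted from the start of the path
folder_from_end = parts[-2]     # the component just before the file name
print(folder_by_position, folder_from_end)
```

Counting from the start, as the Spark snippet does with [4], depends on how deep the files sit below the bucket, so the index needs adjusting if the prefix changes.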

Read excel into pandas dataframe without modifying the values of excel?

I am reading an xlsx file using Python's Pandas pd.read_excel(myfile.xlsx,sheet_name="my_sheet",header=2) and writing the df to a csv file using df.to_csv.
The Excel file contains several columns with percentage values (e.g. 27.44 %). In the dataframe the values are converted to 0.2744; I don't want any modification of the data. How can I achieve this?
I already tried:
Using a lambda function to convert the value back from 0.2744 to 27.44 %, but I don't want this because the column names/indexes are not fixed. Any column can contain % values.
df = pd.read_excel(myexcel.xlsx,sheet_name="my_sheet",header=5,dtype={'column_name':str}) - Didn't work
df = pd.read_excel(myexcel.xlsx,sheet_name="my_sheet",header=5,dtype={'column_name':object}) - Didn't work
Tried xlrd module also, but that too converted % values to float.
df = pd.read_excel(myexcel.xlsx,sheet_name="my_sheet")
df.to_csv(mycsv.csv,sep=",",index=False)
From Excel, save the xlsx file directly in CSV format.
To import the CSV file, use the pandas library as follows:
import pandas as pd
df=pd.read_csv('my_sheet.csv') # assumes the file is in the current working directory
See the pandas.read_csv documentation for more information.
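As a rough sketch of the import step: if the CSV saved from Excel contains the percentages as displayed text (an assumption, since this depends on how the cells are formatted and exported), passing dtype=str to read_csv keeps every column as text so nothing is reinterpreted as a number. The CSV content and column names below are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for the file saved from Excel.
csv_text = "name,share\nalpha,27.44 %\nbeta,3.10 %\n"

# dtype=str applies to all columns, so no column names need to be known.
df = pd.read_csv(io.StringIO(csv_text), dtype=str)
print(df)

# Writing the frame back out leaves the values untouched.
print(df.to_csv(index=False))
```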

Access third value of first key in dictionary python

I have created a dictionary where one key has multiple values - start_time_C, duration_pre_val, value_T. All are input from an excel sheet.
Then I have sorted the dictionary.
pre_dict = {}
pre_dict.setdefault(rows,[]).append(start_time_C)
pre_dict.setdefault(rows,[]).append(duration_pre_val)
pre_dict.setdefault(rows,[]).append(value_T)
pre_dict_sorted = sorted(pre_dict.items(), key = operator.itemgetter(1))
Now, I want to compare a value (Column T of the excel sheet) with value_T.
How do I access value_T from the dictionary?
Many thanks!
Let's break this into two parts:
Reading in the spreadsheet
I/O stuff like this is best handled with pandas; if you'll be working with spreadsheets and other tabular data in Python, get acquainted with this package. You can do something like
import pandas as pd
#read the excel file into a pandas dataframe
my_data = pd.read_excel('/your/path/filename.xlsx', sheet_name='Sheet1')
Accessing elements of the data, creating a dict
Your spreadsheet's content is now in the pandas DataFrame "my_data". From here you can reference DataFrame elements like this
#pandas: entire column
my_data['value_T']
#pandas: row at position 2 (the third row), column at position 0 (the first column)
my_data.iloc[2, 0]
and create Python data structures
#create a dict from the dataframe
my_dict = my_data.set_index(my_data.index).to_dict()
#access the values associated with the 'value_T' key of the dict
my_dict['value_T']
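For the structure in the question itself, note that sorted(pre_dict.items(), ...) returns a list of (key, values) tuples rather than a dict, so value_T is reached by position. A minimal sketch follows, using made-up numbers in place of the spreadsheet values; extend is used here as shorthand for the three append calls.

```python
import operator

# Hypothetical rows: (row key, start_time_C, duration_pre_val, value_T).
sample = [
    (1, 30, 5, 0.5),
    (2, 10, 7, 0.9),
]

pre_dict = {}
for rows, start_time_C, duration_pre_val, value_T in sample:
    pre_dict.setdefault(rows, []).extend([start_time_C, duration_pre_val, value_T])

# Sorting the items by their value lists gives a list of (key, list) tuples.
pre_dict_sorted = sorted(pre_dict.items(), key=operator.itemgetter(1))

# First entry after sorting: unpack the key and its list of three values.
first_key, first_values = pre_dict_sorted[0]
first_value_T = first_values[2]   # same as pre_dict_sorted[0][1][2]
print(first_key, first_value_T)

# In the unsorted dict, the third value is reached through the key.
print(pre_dict[1][2])
```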
