I'm working on sentiment analysis. After collecting Twitter data with Twython and saving it to a txt file in JSON format, I need to write it out as CSV. I can do that, but special characters are mangled: for example, "Inclusão" is written as "Inclus\xc3\xa3o".
here is the code:
import json
from csv import writer

with open('data.txt') as data_file:
    data = json.load(data_file)

tweets = data['statuses']

# variables
times = [tweet['created_at'] for tweet in tweets]
users = [tweet['user']['name'] for tweet in tweets]
texts = [tweet['text'] for tweet in tweets]

# output file
out = open('tweets_file.csv', 'w')
print(out, 'created,user,text')
rows = zip(times, users, texts)
csv = writer(out)
for row in rows:
    values = [value.encode('utf8') for value in row]
    csv.writerow(values)
out.close()
I already solved the problem, thank you! The issue was that my text was already encoded, and I was trying to encode it again.
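For anyone landing here later: in Python 3 the csv module handles Unicode itself, so the fix is to open the output file in text mode with an explicit encoding and drop the manual .encode() calls. A minimal sketch, assuming the same data.txt structure as above:

```python
import json
from csv import writer

def tweets_to_csv(json_path, csv_path):
    """Read the raw Twitter JSON and write selected fields as UTF-8 CSV."""
    with open(json_path, encoding='utf-8') as data_file:
        data = json.load(data_file)
    rows = [(t['created_at'], t['user']['name'], t['text'])
            for t in data['statuses']]
    # Text mode + explicit encoding: the csv module writes Unicode itself,
    # so there is no need to call .encode() on each value.
    with open(csv_path, 'w', encoding='utf-8', newline='') as out:
        csv_out = writer(out)
        csv_out.writerow(['created', 'user', 'text'])
        csv_out.writerows(rows)
```

"Inclusão" then comes out intact in the CSV instead of as escaped bytes.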
I have written code to extract numbers and company names from a PDF file.
Sample PDF content:
#88876 - Sample1, GTRHEUSKYTH, -99WED,-0098B
#99945 - SAMPLE2, DJWHVDFWHEF, -8876D,-3445G
The above is what my PDF file contains. I want to extract the App number that follows the # (i.e. the five digits, 88876) and the App name that follows the dash (i.e. Sample1), and write them to a CSV file (openable in Excel) as two separate columns, App_number and App_name.
Please refer to the code below, which I have tried:
import PyPDF2, re
import csv

for k in range(1, 100):
    pdfObj = open(r"C:\\Users\merge.pdf", 'rb')
    object = PyPDF2.PdfFileReader(r"C:\\Users\merge.pdf")
    pdfReader = PyPDF2.PdfFileReader(pdfObj)
    NumPages = object.getNumPages()
    pdfReader.numPages
    for i in range(0, NumPages):
        pdfPageObj = pdfReader.getPage(i)
        text = pdfPageObj.extractText()
        x = re.findall('(?<=#).[0-9]+', text)
        y = re.findall("(?<=\- )(.*?)(?=,)", text)
        print(x)
        print(y)
        with open("out.csv", "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerows(x)
Any suggestions would be appreciated.
Try this:
text = '#88876 - Sample1, GTRHEUSKYTH'
App_number = re.search('(?<=#).[0-9]+', text).group()
App_name = re.search("(?<=\- )(.*?)(?=,)", text).group()
The first regex captures the consecutive digits after the #; the second captures everything between "- " and the next comma.
Hope that helps.
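Putting the regexes together with the CSV step, here is a minimal sketch (function names are just placeholders). It assumes each line of the extracted text has the "#NNNNN - Name," shape shown in the question; note that writerows needs one (number, name) pair per row, not a bare string, or csv will split the string into characters:

```python
import csv
import re

def parse_apps(text):
    """Extract (app_number, app_name) pairs from lines like
    '#88876 - Sample1, GTRHEUSKYTH, -99WED,-0098B'."""
    numbers = re.findall(r'#(\d+)', text)      # digits right after '#'
    names = re.findall(r'- (.*?),', text)      # text between '- ' and the next ','
    return list(zip(numbers, names))

def write_apps(pairs, csv_path):
    """Write the pairs as two columns with a header row."""
    with open(csv_path, 'w', newline='') as f:
        w = csv.writer(f)
        w.writerow(['App_number', 'App_name'])
        w.writerows(pairs)  # each (number, name) tuple becomes one row
```

You would call parse_apps on the text of each page and pass the accumulated pairs to write_apps once, after the page loop, so the file is not overwritten on every page.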
I am trying to write a simple program that reads a CSV file containing several email ids and produces the following output:
email_id = ['emailid1#xyz.com','emailid2#xyz.com','emailid3#xyz.com'] #required format
but the problem is that the output I get looks like this:
[['emailid1#xyz.com']]
[['emailid1#xyz.com'], ['emailid2#xyz.com']]
[['emailid1#xyz.com'], ['emailid2#xyz.com'], ['emailid3#xyz.com']] #getting this wrong format
Here is the piece of code I have written. Kindly suggest the correction that would give me the required format. Thanks in advance.
import csv
email_id = []
with open('contacts1.csv', 'r') as file:
    reader = csv.reader(file, delimiter=',')
    for row in reader:
        email_id.append(row)
        print(email_id)
NB: my CSV contains only one column of email ids and has no header. I also tried email_id.extend(row), but that did not work either.
You need to move your print outside the loop:
with open('contacts1.csv', 'r') as file:
    reader = csv.reader(file, delimiter=',')
    for row in reader:
        email_id.append(row)
print(sum(email_id, []))
The loop can also be like this (if you only need one column from the csv):
for row in reader:
    email_id.append(row[0])
print(email_id)
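Put together, a minimal sketch that builds the flat list in one pass, assuming a one-column, headerless CSV as described:

```python
import csv

def read_emails(path):
    """Return a flat list of email ids from a one-column, headerless CSV."""
    with open(path, 'r', newline='') as f:
        # each row is a list like ['emailid1#xyz.com']; take the first cell
        return [row[0] for row in csv.reader(f) if row]
```

The `if row` guard skips any blank lines at the end of the file.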
I am trying to scrape data from a webpage and save the scraped text in JSON format.
I have reached the step where I can gather the text I want, but I can't save it in the expected format (CSV or txt would also be sufficient).
Please help me save the scraped text as JSON. Here is the code I have so far:
for k in range(0, len(op3)):
    selectweek.select_by_index(k)
    table = driver.find_element_by_xpath("//table[@class='list-table']")
    for row in table.find_elements_by_xpath('//*[@id="dvFixtureInner"]/table/tbody/tr[2]/td[6]/a'):
        row.click()
        mainpage = driver.window_handles[0]
        print(mainpage)
        popup = driver.window_handles[1]
        driver.switch_to.window(popup)
        time.sleep(3)
        # Meta details of match
        team1 = driver.find_element_by_xpath('//*[@id="match-details"]/div/div[1]/div/div[2]/div[1]/div[1]/a')  # Data to save
        team2 = driver.find_element_by_xpath('//*[@id="match-details"]/div/div[1]/div/div[2]/div[3]/div[1]/a')  # Data to save
        ht = driver.find_element_by_xpath('//*[@id="dvHTScoreText"]')  # Data to save
        ft = driver.find_element_by_xpath('//*[@id="dvScoreText"]')  # Data to save
Create a dictionary and convert it into JSON format using the json module. Note that find_element_by_xpath returns WebElement objects, which are not JSON-serializable, so use each element's .text attribute:
import json

dictionary = {"team1": team1.text, "team2": team2.text, "ht": ht.text, "ft": ft.text}
json_dump = json.dumps(dictionary)
with open("YourFilePath", "w") as f:
    f.write(json_dump)
You can create a dictionary and add key-value to it. I don't know the structure of the json but this can give an idea:
json_data = dict()
ht = 1
ft = 2
json_data["team1"] = {"ht": ht, "ft": ft}
print(json_data)
>>> {'team1': {'ht': 1, 'ft': 2}}
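A small sketch of the save step pulled out into a helper (the function name and file path are placeholders, not from the question):

```python
import json

def save_match(team1, team2, ht, ft, path):
    """Write one match's details to a JSON file.

    Pass plain strings: for Selenium elements use element.text,
    because WebElement objects are not JSON-serializable.
    """
    record = {"team1": team1, "team2": team2, "ht": ht, "ft": ft}
    with open(path, "w", encoding="utf-8") as f:
        json.dump(record, f, ensure_ascii=False)
```

Inside the scraping loop you would call save_match(team1.text, team2.text, ht.text, ft.text, 'match.json'), varying the file name per match so earlier results are not overwritten.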
I'm able to import text as a string, and I also understand read_csv.
with open('text.txt', 'r') as file:
    text = file.read().replace('\n', '')
My question is: if I have a data frame with many records, each holding the location of a text file, how can I bulk-import each file's text as a string into a new column?
Example data frame:
Filename,Text Path
File1,C:\Text\File1.txt
File2,C:\Text\File2.txt
File3,C:\Text\File3.txt
Example Result:
Filename,Text Path,Text
File1,C:\Text\File1.txt,This is some text.
File2,C:\Text\File2.txt,Other kinds of text.
File3,C:\Text\File3.txt,Even more text.
I'm not aware of any library that can do this directly. I think you need to step through each row of the dataframe and add the text to a new column. Assuming you are using pandas and your example dataframe is "df":
for i in range(len(df['Text Path'])):
    with open(df.loc[i, 'Text Path'], 'r') as file:
        df.loc[i, 'Text'] = file.read()
EDIT:
this could be a bit faster (apply a function to generate the new column):
def readtxt(f):
    with open(f, 'r') as file:
        return file.read()

df['Text'] = df['Text Path'].apply(readtxt)
I am trying to extract data from a CSV file with Python 3.6.
The data are both numbers and text (URL addresses); a row looks like:
-0.47, 39.63, http://example.com
On multiple forums I found this kind of code:
data = numpy.genfromtxt(file_name, delimiter=',', skip_header=skiplines,)
But this works for numbers only; the URL addresses are read as NaN.
If I add dtype:
data = numpy.genfromtxt(file_name, delimiter=',', skip_header=skiplines, dtype=None)
The URL addresses are now read, but they get a "b" prefix at the beginning of the address, such as:
b'http://example.com'
How can I remove that? How can I just have the simple string of text?
I also found this option:
file = open(file_path, "r")
csvReader = csv.reader(file)
for row in csvReader:
    variable = row[i]
    coordList.append(variable)
but it seems to have some issues with Python 3.
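Two ways around the b'...' prefix: on NumPy 1.14+ you can pass encoding='utf-8' to genfromtxt (together with dtype=None) so text comes back as str rather than bytes, or you can use the stdlib csv module, which works fine on Python 3 as long as the file is opened in text mode. A sketch of the latter, converting numeric cells to float and leaving URLs as plain strings:

```python
import csv

def read_mixed_csv(path):
    """Read a CSV containing both numbers and text.

    Numeric cells are converted to float; anything that does not
    parse as a number (e.g. a URL) is kept as a plain str.
    """
    rows = []
    with open(path, newline='', encoding='utf-8') as f:  # text mode: no bytes, no b'' prefix
        for row in csv.reader(f):
            parsed = []
            for cell in row:
                try:
                    parsed.append(float(cell))
                except ValueError:
                    parsed.append(cell.strip())
            rows.append(parsed)
    return rows
```

Because the file is opened in text mode with an explicit encoding, every cell is already a str, so nothing needs decoding.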