Cannot write a subsetted Twitter hashtag data frame to a CSV file: the written CSV still shows all variables

I extracted Twitter data for a specific hashtag into an R script using the rtweet and tidyverse packages. After successfully getting the data, I need to subset the variables I want to analyze. I wrote the subset code and the console shows the subsetted variables. Despite this, when I try to write the result to a CSV file, the written CSV contains all of the variables instead of only the subsetted ones. The code I entered is as follows.
twitter_data_armenian_issue_iki <- search_tweets("ermeniler", n = 1000, include_rts = FALSE)
view(twitter_data_armenian_issue_iki)
twitter_data_armenian_issue_iki_clean <- cbind(
  twitter_data_armenian_issue_iki,
  users_data(twitter_data_armenian_issue_iki)[, c("id", "id_str", "name", "screen_name")]
)
twitter_data_armenian_issue_iki_clean <- twitter_data_armenian_issue_iki_clean[, !duplicated(colnames(twitter_data_armenian_issue_iki_clean))]
view(twitter_data_armenian_issue_iki_clean)
data_bir <- data.frame(twitter_data_armenian_issue_iki_clean)
data_bir[, c("created_at", "id", "id_str", "full_text", "name", "screen_name", "in_reply_to_screen_name")]
write.csv(data_bir, "newdata.csv", row.names = FALSE)
If anyone is willing to help me, I would be very pleased. Thank you.
I tried to get the Twitter data with only the specific columns I want to analyze. To do this, I entered the subset code and ran it, but when I try to write the result to a CSV, the written CSV file shows all of the variables. I also checked the Environment panel and could not see the subsetted data.
My question is: how can I add the subsetted data to the environment and write it to a CSV without any error?
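The likely cause, judging from the code above, is that the subsetting line only prints the selected columns to the console; its result is never assigned to an object, so data_bir still contains every column when write.csv() runs. A minimal sketch of the fix, reusing the column names from the question:

# Assign the subset to a new object instead of only printing it
data_bir_subset <- data_bir[, c("created_at", "id", "id_str", "full_text",
                                "name", "screen_name", "in_reply_to_screen_name")]
# The subsetted data frame now appears in the environment and only its columns are written out
write.csv(data_bir_subset, "newdata.csv", row.names = FALSE)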

Related

Script for Automation to pull from CSV

Hope all is well.
I'm working on a script to automate pulling data from a CSV and posting it to an API.
However, the way the CSV is dumped, the value is stored with brackets ([]), and I would like to remove them before submitting.
df = pd.read_csv('test1.csv')
for index, data in df.iterrows():
    payload = json.dumps({"adapters": data["MAC Addresses"], "typeLabel": "Windows", "iconType": "windows", "operatingSystem": "Windows", "hostName": data["Endpoint Name"], "role": "Staff"})
    print(payload)
This is my script; "adapters" is the field whose data I want to manipulate before posting.
In the CSV file, the data is saved like this:
['08:92:04:0a:ec:00']
Whenever the code runs, it comes out like this:
"['08:92:04:0a:ec:00']"
But I want it to be like this: ["08:92:04:0a:ec:00"], which is the only form the API accepts.
Is there any way this can be accomplished? Much appreciated.
I have tried everything; the only solution left is to learn from the experts.
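A minimal sketch of one way to handle this, assuming the "MAC Addresses" column really contains the string representation of a Python list as shown above: parse it with ast.literal_eval so that json.dumps serializes a real list instead of a quoted string. The column names and payload fields are taken from the question; everything else is an assumption.

import ast
import json
import pandas as pd

df = pd.read_csv('test1.csv')
for index, data in df.iterrows():
    # "['08:92:04:0a:ec:00']" (a string) -> ['08:92:04:0a:ec:00'] (a real list)
    adapters = ast.literal_eval(data["MAC Addresses"])
    payload = json.dumps({
        "adapters": adapters,  # serialized as ["08:92:04:0a:ec:00"]
        "typeLabel": "Windows",
        "iconType": "windows",
        "operatingSystem": "Windows",
        "hostName": data["Endpoint Name"],
        "role": "Staff",
    })
    print(payload)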

Problem with .xls file validation on e-commerce platform

You may have noticed that this is a long question; that is because I really put effort into explaining the many WTFs I am facing. It may still not be great, but anyway, I appreciate your help!
Context
I'm doing an integration project for a client that handles a bunch of data to generate Excel files in .xls format (notice that extension!).
While developing the project I was using the xlrd and xlwt Python packages because, again, I need to create a .xls file. But at some point I had to download and extract a file that was in .csv format (although, in reality, the file contains an HTML table :c).
So I decided to use pandas to read the HTML and create a data frame that I can manipulate and write out as a .xls Excel file.
The Problem
After coding the logic and checking that the data was correct, I tried to upload the file to the e-commerce platform.
What happens is that the platform does not validate my file.
First I will briefly explain how the site works: it accepts .xls files and only .xls files, probably manipulates them and uses them to update the database; I have no access to the source code.
When I upload the file, the site takes me to a configuration page where, if I want to or if the site did not match them correctly, I can map the Excel columns to the IDs or values that will be updated in the database.
The 'generico4' field expects the type 'smallint(5) unsigned'.
An important fact is that I sent the file to my client so he could validate the data, and after many conversations we discovered that if he simply downloads my file, opens it, and saves it, the upload works fine. It is worth noting that he uses a MacBook and I use Ubuntu; I tried to do the same thing, but it did not work.
He sent me that re-saved file and I tried to find the difference between the two, but I found nothing: the numbers have the same type, 'float', and the Excel formula =TYPE(cell) returns 1 for both.
I have already tried many other things, but nothing works :c
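One way to dig further into the "no visible difference" problem, sketched here purely as a diagnostic (the file names are placeholders), is to compare the two workbooks cell by cell with xlrd, which reports each cell's internal type rather than what Excel displays:

import xlrd

# Placeholders: the file produced by the script and the copy re-saved on the client's MacBook
mine = xlrd.open_workbook("stock.xls").sheet_by_index(0)
theirs = xlrd.open_workbook("stock_resaved.xls").sheet_by_index(0)

for row in range(min(mine.nrows, theirs.nrows)):
    for col in range(min(mine.ncols, theirs.ncols)):
        a, b = mine.cell(row, col), theirs.cell(row, col)
        if a.ctype != b.ctype or a.value != b.value:
            # ctype 1 = text, 2 = number (see the xlrd.XL_CELL_* constants)
            print(row, col, (a.ctype, a.value), (b.ctype, b.value))

If the types differ (for example 'generico4' coming out as text in one file and as a number in the other), that would point at how xlwt is writing the cells rather than at the data itself.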
The code
Here is the code, so you can get an idea of the logic:
def stock_xls(data_file_path):
    # This is my logic to manipulate the data
    df = pd.read_html(data_file_path)[0]
    df = df[[1, 2]]
    df.rename(columns={1: 'sku', 2: 'stock'}, inplace=True)
    df = df.groupby(['sku']).sum()
    df.reset_index(inplace=True)
    df.loc[df['stock'] > 0, 'stock'] = 1
    df.loc[df['stock'] == 0, 'stock'] = 2
    # I create a new Workbook (via pandas it was not working either)
    wb_out = xlwt.Workbook()
    ws_out = wb_out.add_sheet(sheetname='stock')
    # Set the column names
    ws_out.write(0, 0, 'sku')
    ws_out.write(0, 1, 'generico4')
    # Copy the DataFrame data to the Workbook
    for index, value in df.iterrows():
        ws_out.write(index + 1, 0, str(value['sku']))
        ws_out.write(index + 1, 1, int(value['stock']))
    path = os.path.join(BASE_DIR, 'src/xls/temp/')
    Path(path).mkdir(parents=True, exist_ok=True)
    file_path = os.path.join(path, "stock.xls")
    wb_out.save(file_path)
    return file_path

How to include SQLAlchemy Data types in Python dictionary

I've written an application using Python 3.6, pandas and sqlalchemy to automate the bulk loading of data into various back-end databases and the script works well.
As a brief summary, the script reads data from various excel and csv source files, one at a time, into a pandas dataframe and then uses the df.to_sql() method to write the data to a database. For maximum flexibility, I use a JSON file to provide all the configuration information including the names and types of source files, the database engine connection strings, the column titles for the source file and the column titles in the destination table.
When my script runs, it reads the JSON configuration, imports the specified source data into a dataframe, renames source columns to match the destination columns, drops any columns from the dataframe that are not required and then writes the dataframe contents to the database table using a call similar to:
df.to_sql(strTablename, con=engine, if_exists="append", index=False, chunksize=5000, schema="dbo")
The problem I have is that I would also like to specify the column data types in the df.to_sql method and provide them as inputs from the JSON configuration file. However, this doesn't appear to be possible, because all the strings in the JSON file need to be enclosed in quotes and they don't then translate into types when read by my code. This is how the df.to_sql call should look:
df.to_sql(strTablename, con=engine, if_exists="append", dtype=dictDatatypes, index=False, chunksize=5000, schema="dbo")
The entries that form the dtype dictionary from my JSON file look like this:
"Data Types": {
"EmployeeNumber": "sqlalchemy.types.NVARCHAR(length=255)",
"Services": "sqlalchemy.types.INT()",
"UploadActivities": "sqlalchemy.types.INT()",
......
and there are many more, one for each column.
However, when the above is read in as a dictionary and passed to the df.to_sql method, it doesn't work, because the SQLAlchemy data types should not be enclosed in quotes, but I can't get around that in my JSON file. The dictionary values therefore aren't recognised by pandas. They look like this:
{'EmployeeNumber': 'sqlalchemy.types.INT()', ....}
And they really need to look like this:
{'EmployeeNumber': sqlalchemy.types.INT(), ....}
Does anyone have experience of this to suggest how I might be able to have the sqlalchemy datatypes in my configuration file?
You could use eval() to convert the string names to objects of that type:
import sqlalchemy as sa
from pprint import pprint

dict_datatypes = {"EmployeeNumber": "sa.INT", "EmployeeName": "sa.String(50)"}
pprint(dict_datatypes)
"""console output:
{'EmployeeName': 'sa.String(50)', 'EmployeeNumber': 'sa.INT'}
"""

for key in dict_datatypes:
    dict_datatypes[key] = eval(dict_datatypes[key])
pprint(dict_datatypes)
"""console output:
{'EmployeeName': String(length=50),
 'EmployeeNumber': <class 'sqlalchemy.sql.sqltypes.INTEGER'>}
"""
Just be sure that you do not pass untrusted input values to functions like eval() and exec().
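If eval() feels too risky, a possible alternative (a sketch, assuming you are free to change the JSON format to a type name plus optional arguments, which is not the format shown in the question) is an explicit lookup table, so only whitelisted types can ever be constructed:

import sqlalchemy as sa

# Whitelist of type names allowed in the JSON config (hypothetical format)
TYPE_MAP = {
    "NVARCHAR": sa.types.NVARCHAR,
    "INT": sa.types.INTEGER,
    "FLOAT": sa.types.FLOAT,
}

def build_dtype(config):
    """Turn e.g. {"EmployeeNumber": ["NVARCHAR", 255], "Services": ["INT"]} into SQLAlchemy types."""
    dtype = {}
    for column, spec in config.items():
        name, *args = spec
        dtype[column] = TYPE_MAP[name](*args)  # e.g. NVARCHAR(255) or INTEGER()
    return dtype

dict_datatypes = build_dtype({"EmployeeNumber": ["NVARCHAR", 255], "Services": ["INT"]})
# dict_datatypes can then be passed as dtype=dict_datatypes to df.to_sql(...)

This keeps the configuration file purely declarative and removes the possibility of arbitrary code being executed from it.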

Import Excel file into ngx-datatable - Angular 8

I have seen multiple posts on exporting ngx-datatable to csv/xlsx. However, I did not come across any post about importing an Excel file into ngx-datatable, which is basically what I need. I need to read an Excel file that the user uploads and display it in an ngx-datatable (so the Excel file basically acts as the source for the ngx-datatable).
Any guidelines or helpful links on how to proceed would be a great help.
If you can transform it into a CSV file, there is a library called ngx-csv-parser (https://www.npmjs.com/package/ngx-csv-parser) that helps format the data as an array of objects, which is the way you need to send it to an ngx-datatable. It says it is designed for Angular 13 but it is compatible with previous versions; I've tested it in Angular 10 and it does work.
It has a setting to use a header or not. If you do, you can shape the columns prop of your ngx-datatable with the same names as the headers.
Example:
Let's say you have a csv file like this:
ColumnA,ColumnB,ColumnC
a,b,c
The output of using this lib (the way it is said in its readme) with header= true will be:
csvRecords = [{ ColumnA: 'a', ColumnB: 'b', ColumnC: 'c' }]
Let's say you also have an array of columns:
columns = [
  { name: 'A', prop: 'ColumnA' },
  { name: 'B', prop: 'ColumnB' },
  { name: 'C', prop: 'ColumnC' }
]
Then use columns and csvRecords in your html.
<ngx-datatable
  class="material"
  [rows]="csvRecords"
  [columns]="columns">
</ngx-datatable>
Your table will be filled with data from your csv.

How to load different files into different tables, based on file pattern?

I'm running a simple PySpark script, like this.
base_path = '/mnt/rawdata/'
file_names = ['2018/01/01/ABC1_20180101.gz',
              '2018/01/02/ABC2_20180102.gz',
              '2018/01/03/ABC3_20180103.gz',
              '2018/01/01/XYZ1_20180101.gz',
              '2018/01/02/XYZ1_20180102.gz']

for f in file_names:
    print(f)
So, just testing this, I can find the files and print the strings just fine. Now, I'm trying to figure out how to load the contents of each file into a specific table in SQL Server. The thing is, I want to do a wildcard search for files that match a pattern, and load specific files into specific tables. So, I would like to do the following:
load all files with 'ABC' in the name, into my 'ABC_Table' and all files with 'XYZ' in the name, into my 'XYZ_Table' (all data starts on row 2, not row 1)
load the file name into a field named 'file_name' in each respective table (I'm totally fine with the entire string from 'file_names' or the part of the string after the last '/' character; doesn't matter)
I tried to use Azure Data Factory for this, and it can recursively loop through all files just fine, but it doesn't get the file names loaded, and I really need the file names in the table to distinguish which records are coming from which files & dates. Is it possible to do this using Azure Databricks? I feel like this is an achievable ETL process, but I don't know enough about ADB to make this work.
Update based on Daniel's recommendation
dfCW = sc.sequenceFile('/mnt/rawdata/2018/01/01/ABC%.gz/').toDF()
dfCW.withColumn('input', input_file_name())
print(dfCW)
Gives me:
com.databricks.backend.daemon.data.common.InvalidMountException:
What can I try next?
You can use input_file_name from pyspark.sql.functions
e.g.
withFiles = df.withColumn("file", input_file_name())
Afterwards you can create multiple dataframes by filtering on the new column
abc = withFiles.filter(col("file").like("%ABC%"))
xyz = withFiles.filter(col("file").like("%XYZ%"))
and then use the regular writer for both of them.
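Putting that together, here is a minimal sketch under a few assumptions: the .gz files are gzipped CSVs with a header row (so the data starts on row 2), the mount can be read directly by Spark, and the JDBC URL, credentials, and table names are placeholders you would replace. spark is the SparkSession that Databricks notebooks provide.

from pyspark.sql.functions import col, input_file_name

# Read every gzipped CSV under the mount; header=True skips the first row of each file
df = (spark.read
      .option("header", True)
      .csv("/mnt/rawdata/2018/*/*/*.gz")
      .withColumn("file_name", input_file_name()))

# Split by file-name pattern
abc = df.filter(col("file_name").like("%ABC%"))
xyz = df.filter(col("file_name").like("%XYZ%"))

# Placeholder JDBC settings for SQL Server
jdbc_url = "jdbc:sqlserver://<server>:1433;databaseName=<db>"
props = {"user": "<user>", "password": "<password>",
         "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"}

abc.write.jdbc(url=jdbc_url, table="ABC_Table", mode="append", properties=props)
xyz.write.jdbc(url=jdbc_url, table="XYZ_Table", mode="append", properties=props)

The file_name column keeps the full path from input_file_name(), which satisfies the requirement of knowing which records came from which file and date.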
