I have to import around 1000 records and I can't do it one by one.
Is there a way to make the CSV files import as integers instead of strings? The values always change to strings when I use mongoimport.
mongoimport --host localhost --db database --collection collections --type csv --file 1000data.csv --headerline
It is possible to do this from version 3.4 with the --columnsHaveTypes option, see: https://docs.mongodb.com/manual/reference/program/mongoimport/#cmdoption--columnsHaveTypes
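For example, with 3.4+ you can declare the types in the header line itself and pass --columnsHaveTypes (a rough sketch; the column names are made up, and it assumes the header row of 1000data.csv is rewritten as something like count.int32(),name.string()):

mongoimport --host localhost --db database --collection collections --type csv --file 1000data.csv --headerline --columnsHaveTypes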
I have been working on a project to import the Danish 2.5 million ATM transactions data set to derive some visualizations.
The data is hosted on a MySQL server provided by the university. The objective is to import the data using Sqoop and then apply a few transformations to it using PySpark.
Link to the dataset here : https://www.kaggle.com/sparnord/danish-atm-transactions
The SQL server that hosts this information has a few rows which are intentionally or unintentionally mangled.
So I have a very basic Sqoop command which gets the details from the source database. However, I run into an issue where some values contain a double quote ("), especially in the column message_text.
Sqoop command:
sqoop import --connect jdbc:mysql:{source-connection-string} --table SRC_ATM_TRANS --username {username} --password {password} --target-dir /user/root/etl_project --fields-terminated-by '|' --lines-terminated-by "\n" -m 1
Here is a sample row as it is imported:
2017|January|1|Sunday|21|Active|85|Diebold Nixdorf|København|Regnbuepladsen|5|1550|55.676|12.571|DKK|MasterCard|4531|Withdrawal|4017|"Suspected malfunction|0.000|55.676|13|2618425|0.000|277|1010|93|3|280.000|0|75|803|Clouds
However, the expected output should be:
2017|January|1|Sunday|21|Active|85|Diebold Nixdorf|København|Regnbuepladsen|5|1550|55.676|12.571|DKK|MasterCard|4531|Withdrawal|4017|"Suspected malfunction,0.000|55.676|13|2618425|0.000|277|1010|93|3|280.000|0|75|803|Clouds|Cloudy
At first I was okay with this, hoping that PySpark would handle the mangled data since the delimiters are specified.
But now I run into issues when populating my dataframe.
transactions = spark.read.option("sep","|").csv("/user/root/etl_project/part-m-00000", header = False,schema = transaction_schema)
However, when I inspect the rows, I see that the mangled data has caused the dataframe to put the affected values into a single column!
transactions.filter(transactions.message_code == "4017").collect()
Row(year=2017, month=u'January', day=1, weekday=u'Sunday', hour=17, atm_status=u'Active', atm_id=u'35', atm_manufacturer=u'NCR', atm_location=u'Aabybro', atm_streetname=u'\xc3\u0192\xcb\u0153stergade', atm_street_number=6, atm_zipcode=9440, atm_lat=57.162, atm_lon=9.73, currency=u'DKK', card_type=u'MasterCard', transaction_amount=7387, service=u'Withdrawal', message_code=u'4017', message_text=u'Suspected malfunction|0.000|57.158|10|2625037|0.000|276|1021|83|4|319.000|0|0|800|Clear', weather_lat=None, weather_lon=None, weather_city_id=None, weather_city_name=None, temp=None, pressure=None, humidity=None, wind_speed=None, wind_deg=None, rain_3h=None, clouds_all=None, weather_id=None, weather_main=None, weather_description=None)
At this point I am not sure what to do.
Do I go ahead and create temporary columns to manage this and use a regex replacement to fill in these values?
Or is there a better way to import the data and handle these mangled values, either in Sqoop or in PySpark?
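One thing worth trying on the PySpark side (a sketch only, not verified against this dataset) is to turn off quote parsing entirely, so that a lone " inside message_text is treated as ordinary data rather than the start of a quoted field; per the Spark CSV options, setting quote to an empty string disables quoting:

transactions = spark.read.option("sep", "|").option("quote", "").csv("/user/root/etl_project/part-m-00000", header=False, schema=transaction_schema)

On the Sqoop side, the --escaped-by and --optionally-enclosed-by options may also be worth experimenting with.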
I'm trying to get the same result as pandas to_csv called without a path argument. Currently I'm saving the dataframe as a CSV only to read it back, and I'd like to avoid this step.
path_or_buf : str or file handle, default None
File path or object, if None is provided the result is returned as a string. If a non-binary file object is passed, it should be opened with newline='', disabling universal newlines. If a binary file object is passed, mode might need to contain a 'b'.
Since the dataset is big, the toPandas function doesn't work.
Does anyone know if this is possible in PySpark, or know a workaround?
You can use to_csv:
from pyspark.sql import functions as F

csv_string = df.agg(F.concat_ws('\n', F.collect_list(F.to_csv(F.struct(df.columns))))).head()[0]
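If you also need a header row, one small addition (assuming no column name contains a comma) is to prepend the column names yourself:

header = ','.join(df.columns)
csv_string = header + '\n' + csv_string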
You can just use to_csv to convert the list of columns to CSV, as below:
from pyspark.sql import functions as f
df.select(f.to_csv(f.struct(df.columns))).show(truncate=False)
I've written an application using Python 3.6, pandas and SQLAlchemy to automate the bulk loading of data into various back-end databases, and the script works well.
As a brief summary, the script reads data from various Excel and CSV source files, one at a time, into a pandas dataframe and then uses the df.to_sql() method to write the data to a database. For maximum flexibility, I use a JSON file to provide all the configuration information, including the names and types of source files, the database engine connection strings, the column titles for the source file and the column titles in the destination table.
When my script runs, it reads the JSON configuration, imports the specified source data into a dataframe, renames source columns to match the destination columns, drops any columns from the dataframe that are not required and then writes the dataframe contents to the database table using a call similar to:
df.to_sql(strTablename, con=engine, if_exists="append", index=False, chunksize=5000, schema="dbo")
The problem is that I would also like to specify the column data types in the df.to_sql method and provide them as inputs from the JSON configuration file. However, this doesn't appear to be possible, because all the strings in the JSON file have to be enclosed in quotes and they aren't then translated into type objects when read by my code. This is how the df.to_sql call should look:
df.to_sql(strTablename, con=engine, if_exists="append", dtype=dictDatatypes, index=False, chunksize=5000, schema="dbo")
The entries that form the dtype dictionary from my JSON file look like this:
"Data Types": {
"EmployeeNumber": "sqlalchemy.types.NVARCHAR(length=255)",
"Services": "sqlalchemy.types.INT()",
"UploadActivities": "sqlalchemy.types.INT()",
......
and there are many more, one for each column.
However, when the above is read in as a dictionary and passed to the df.to_sql method, it doesn't work, because the SQLAlchemy data types shouldn't be enclosed in quotes, but I can't get around this in my JSON file. The dictionary values therefore aren't recognised by pandas. They look like this:
{'EmployeeNumber': 'sqlalchemy.types.INT()', ....}
And they really need to look like this:
{'EmployeeNumber': sqlalchemy.types.INT(), ....}
Does anyone have experience of this who can suggest how I might keep the SQLAlchemy data types in my configuration file?
You could use eval() to convert the string names to objects of that type:
import sqlalchemy as sa
from pprint import pprint
dict_datatypes = {"EmployeeNumber": "sa.INT", "EmployeeName": "sa.String(50)"}
pprint(dict_datatypes)
"""console output:
{'EmployeeName': 'sa.String(50)', 'EmployeeNumber': 'sa.INT'}
"""
for key in dict_datatypes:
    dict_datatypes[key] = eval(dict_datatypes[key])
pprint(dict_datatypes)
"""console output:
{'EmployeeName': String(length=50),
'EmployeeNumber': <class 'sqlalchemy.sql.sqltypes.INTEGER'>}
"""
Just be sure that you do not pass untrusted input values to functions like eval() and exec().
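If you would rather avoid eval() altogether, one alternative (a sketch using a hypothetical whitelist) is to keep short type names in the JSON and map them to SQLAlchemy types in code:

import sqlalchemy as sa

# Hypothetical whitelist: JSON strings -> SQLAlchemy type instances
TYPE_MAP = {
    "int": sa.INT(),
    "nvarchar255": sa.NVARCHAR(length=255),
}

json_types = {"EmployeeNumber": "int", "EmployeeName": "nvarchar255"}
dict_datatypes = {col: TYPE_MAP[name] for col, name in json_types.items()}

Unknown type names then fail loudly with a KeyError instead of executing arbitrary code.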
So I am facing the following problem:
I have a ;-separated CSV which has ; characters enclosed in quotes, and this is corrupting the data.
For example: abide;acdet;"adds;dsss";acde
The ; in "adds;dsss" is moving "dsss" to the next line and corrupting the results of the ETL module I am writing. My ETL takes such a CSV from the internet, transforms it (by first loading it into a pandas dataframe, doing pre-processing and then saving it), and then loads it into SQL Server. But the corrupted files are breaking the SQL Server schema.
Is there any solution I can use in conjunction with a pandas dataframe that allows me to fix this issue, either during the read (pd.read_csv) or the write (pd.to_csv), or both?
You might need to tell the reader some fields may be quoted:
pd.read_csv(your_data, sep=';', quotechar='"')
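On the write side, pandas already quotes fields that contain the separator by default (QUOTE_MINIMAL), so a round trip like this sketch should keep "adds;dsss" as a single field:

import csv
df.to_csv('cleaned.csv', sep=';', quotechar='"', quoting=csv.QUOTE_MINIMAL, index=False)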
Let's try:
from io import StringIO
import pandas as pd
txt = StringIO("""abide;acdet;"adds;dsss";acde""")
df = pd.read_csv(txt,sep=';',header=None)
print(df)
Output dataframe:
       0      1          2     3
0  abide  acdet  adds;dsss  acde
The sep parameter of pd.read_csv allows you to specify which character is used as a separator in your CSV file. Its default value is ','. Does changing it to ';' solve your problem?
I am new to Python and have a simple problem. First, I want to load some sample data I created in Stata. Second, I would like to describe the data in Python, that is, I'd like a list of the imported variable names. So far I've done this:
from pandas.io.stata import StataReader
reader = StataReader('sample_data.dta')
data = reader.data()
dir()
I get the following error:
anaconda/lib/python3.5/site-packages/pandas/io/stata.py:1375: UserWarning: 'data' is deprecated, use 'read' instead
warnings.warn("'data' is deprecated, use 'read' instead")
What does it mean and how can I resolve the issue? And, is dir() the right way to get an understanding of what variables I have in the data?
Using pandas.io.stata.StataReader.data to read from a Stata file has been deprecated since pandas version 0.18.1, which is why you are getting that warning.
Instead, you must use pandas.read_stata to read the file as shown:
df = pd.read_stata('sample_data.dta')
df.dtypes ## Return the dtypes in this object
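If you just want the list of imported variable names (what dir() was being used for), the dataframe's columns hold them:

list(df.columns)  # names of the variables imported from the .dta file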
Sometimes this did not work for me, especially when the dataset is large, so what I propose here is a two-step approach (Stata and Python).
In Stata write the following commands:
export excel Cevdet.xlsx, firstrow(variables)
and to copy the variable labels, write the following:
preserve
describe, replace
list
export excel using myfile.xlsx, replace first(var)
restore
This will generate two files for you, Cevdet.xlsx and myfile.xlsx.
Now go to your Jupyter notebook:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_excel('Cevdet.xlsx')
This will allow you to read the files into Jupyter (Python 3).
My advice is to save this dataframe to a pickle file (especially if it is big):
df.to_pickle('Cevdet')
The next time you open Jupyter you can simply run:
df=pd.read_pickle("Cevdet")