While bulk document import is described in the ArangoDB documentation here, I was not able to find the equivalent documentation for bulk graph import. I suppose since vertices are documents in ArangoDB's data model that the former should be able to be used for loading vertices, but how are edges to be loaded?
Thanks for any help!
Edges in ArangoDB are also just documents, so you can load both vertices and edges with the same bulk document import. Here are two examples:
– CSV documents/vertices:
arangoimp --file <path/filename> --collection <collectionName> --create-collection true --type csv --server.database <databaseName> --server.username <username>
– CSV edges:
arangoimp --file <path/filename> --collection <collectionName> --create-collection true --type csv --create-collection-type edge --server.database <databaseName> --server.username <username>
Notice that the only major difference is the --create-collection-type argument set to edge when loading edges. Additionally, the file containing the edge data must include values for the _from and _to attributes.
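For illustration, an edge CSV could look like the following (the vertex collection name and keys are made up for this example):
_from,_to,label
persons/alice,persons/bob,knows
persons/bob,persons/charlie,knows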
Here are a few more options you may find useful:
Translating column names:
arangoimport --file "data.csv" --type csv --translate "from=_from" --translate "to=_to"
To ignore missing values (instead of getting warnings and having the data not loaded), use the flag:
--ignore-missing
To ignore a column in the import file:
arangoimport --file "data.csv" --type csv --remove-attribute "attributeName"
Related
I need to list these properties (DBSnapshotIdentifier, DBInstanceIdentifier, AvailabilityZone, SnapshotType, Encrypted, SnapshotCreateTime) of both the automated and manual snapshots of all available clusters and import this data into a spreadsheet.
I tried the aws rds describe-db-snapshots command, but I need to extract only those properties and import them into a spreadsheet.
Use the regular CLI command and save its output to a JSON file: aws rds describe-db-snapshots > filename.json. Then convert the JSON to CSV on the command line using jq.
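For example, a possible jq conversion (the output file names are assumptions; the fields are the ones listed in the question):
aws rds describe-db-snapshots > snapshots.json
jq -r '.DBSnapshots[] | [.DBSnapshotIdentifier, .DBInstanceIdentifier, .AvailabilityZone, .SnapshotType, (.Encrypted | tostring), .SnapshotCreateTime] | @csv' snapshots.json > snapshots.csv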
I have been working on a project to import the Danish 2.5 million ATM transaction data set and derive some visualizations.
The data is hosted on a MySQL server provided by the university. The objective is to import the data using Sqoop and then apply a few transformations to it using PySpark.
Link to the dataset here: https://www.kaggle.com/sparnord/danish-atm-transactions
The SQL server that hosts this information has a few rows which are intentionally or unintentionally mangled.
So I have a very basic Sqoop command which gets the details from the source database. However, I run into an issue with values that contain a double quote ("), especially in the message_text column.
Sqoop Command :
sqoop import --connect jdbc:mysql:{source-connection-string} --table SRC_ATM_TRANS --username {username} --password {password} --target-dir /user/root/etl_project --fields-terminated-by '|' --lines-terminated-by "\n" -m 1
Here is a sample row as it was imported:
2017|January|1|Sunday|21|Active|85|Diebold Nixdorf|København|Regnbuepladsen|5|1550|55.676|12.571|DKK|MasterCard|4531|Withdrawal|4017|"Suspected malfunction|0.000|55.676|13|2618425|0.000|277|1010|93|3|280.000|0|75|803|Clouds
However, the expected output should be:
2017|January|1|Sunday|21|Active|85|Diebold Nixdorf|København|Regnbuepladsen|5|1550|55.676|12.571|DKK|MasterCard|4531|Withdrawal|4017|"Suspected malfunction,0.000|55.676|13|2618425|0.000|277|1010|93|3|280.000|0|75|803|Clouds|Cloudy
At first I was okay with this, hoping that PySpark would handle the mangled data since the delimiters are specified.
But now I run into issues when populating my dataframe.
transactions = spark.read.option("sep", "|").csv("/user/root/etl_project/part-m-00000", header=False, schema=transaction_schema)
However, when I inspect the rows, I see that the mangled data has caused the DataFrame to pack the affected values into a single column!
transactions.filter(transactions.message_code == "4017").collect()
Row(year=2017, month=u'January', day=1, weekday=u'Sunday', hour=17, atm_status=u'Active', atm_id=u'35', atm_manufacturer=u'NCR', atm_location=u'Aabybro', atm_streetname=u'\xc3\u0192\xcb\u0153stergade', atm_street_number=6, atm_zipcode=9440, atm_lat=57.162, atm_lon=9.73, currency=u'DKK', card_type=u'MasterCard', transaction_amount=7387, service=u'Withdrawal', message_code=u'4017', message_text=u'Suspected malfunction|0.000|57.158|10|2625037|0.000|276|1021|83|4|319.000|0|0|800|Clear', weather_lat=None, weather_lon=None, weather_city_id=None, weather_city_name=None, temp=None, pressure=None, humidity=None, wind_speed=None, wind_deg=None, rain_3h=None, clouds_all=None, weather_id=None, weather_main=None, weather_description=None)
At this point I am not sure what to do.
Do I go ahead and create temporary columns and use a regex replacement to fill in these values?
Or is there a better way to import the data and handle these mangled values, either in Sqoop or in PySpark?
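One option worth trying (a sketch, not from the original thread) is to disable quote handling in Spark's CSV reader, so that a stray double quote is treated as an ordinary character instead of swallowing the following | delimiters:
# Assumes the same path and transaction_schema as above
transactions = (
    spark.read
    .option("sep", "|")
    .option("quote", "")  # turn off quoting; '"' becomes a normal character
    .csv("/user/root/etl_project/part-m-00000", header=False, schema=transaction_schema)
)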
I am working on this notebook: https://databricks.com/notebooks/simple-aws/petastorm-spark-converter-pytorch.html
I tried running the first line
df = spark.read.parquet("/databricks-datasets/flowers/parquet") \
.select(col("content"), col("label_index")) \
.limit(1000)
However, I got this error:
Path does not exist: dbfs:/databricks-datasets/flowers/parquet;
I am wondering where I can find the parquet version of the flowers dataset on databricks. FYI I am working on the community edition.
This dataset was converted to Delta format, so the path is now /databricks-datasets/flowers/delta instead of /databricks-datasets/flowers/parquet, and you need to read it with the corresponding code:
df = spark.read.format('delta').load('/databricks-datasets/flowers/delta')
P.S. You can always use the %fs ls <path> command to see what files are at a given path.
P.P.S. I'll ask to fix that notebook if it's possible
label_index has been removed from the dataset. You can recreate it as follows:
from pyspark.ml.feature import StringIndexer
indexer = StringIndexer(inputCol="label", outputCol="label_index")
indexed = indexer.fit(df).transform(df)
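Putting the two answers together with the select from the notebook's first cell (a sketch; the col import is assumed):
from pyspark.sql.functions import col
from pyspark.ml.feature import StringIndexer

# Read the Delta version of the flowers dataset
df = spark.read.format('delta').load('/databricks-datasets/flowers/delta')

# Recreate label_index from the label column
indexer = StringIndexer(inputCol="label", outputCol="label_index")
indexed = indexer.fit(df).transform(df)

# Same selection as in the original notebook
df_small = indexed.select(col("content"), col("label_index")).limit(1000)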
I run arangoimp on my JSON-formatted data like this, and the output of arangoimp states that everything ran correctly.
arangoimp --file data.json --collection newCollection --create-collection true --server.password "" --progress false
with output
Starting JSON import...
created: 28538
warnings/errors: 0
updated/replaced: 0
ignored: 0
When I view the database in the web client, there is no data there. I am able to upload data to the database with pyArango; however, this is much slower and I would prefer not to use it. Any ideas as to why arangoimp is not working correctly?
Any insight would be appreciated, thanks!
I found the problem: the data was being inserted into the _system database. It is easily fixed by adding the flag --server.database myDB to my command.
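For reference, the corrected command (with myDB standing in for the target database name):
arangoimp --file data.json --collection newCollection --create-collection true --server.database myDB --server.password "" --progress false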
I have to import around 1000 records, so I can't do it one by one.
Is there a way to import the CSV file values as integers instead of strings? They always end up as strings when I use mongoimport.
mongoimport --host localhost --db database --collection collections --type csv --file 1000data.csv --headerline
This is possible from version 3.4 onwards with the --columnsHaveTypes option; see: https://docs.mongodb.com/manual/reference/program/mongoimport/#cmdoption--columnsHaveTypes
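A sketch of what that could look like here, assuming the header line of 1000data.csv itself declares the types in the form colName.type() (for example name.string(),age.int32(),score.double() — these column names are made up):
mongoimport --host localhost --db database --collection collections --type csv --file 1000data.csv --headerline --columnsHaveTypes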