I have recently been asked to take a .csv file that looks like this:
Into something like this:
Keep in mind that there will be hundreds, if not thousands, of rows, since a new row is created every time a user logs in or out, and there will be more than just two users. My first thought was to load the .csv file into a MySQL database and then run a query on it. However, I really don't want to install MySQL on the machine that will be used for this.
I could do it manually for each agent in Excel/OpenOffice, but with so many lines and so little room for error, I want to automate the process. What's the best way to go about that?
This one-liner relies only on awk and date (used to convert the durations to timestamps and back):
awk 'BEGIN{FS=OFS=","}NR>1{au=$1 "," $2;t=$4; \
"date -u -d \""t"\" +%s"|getline ts; sum[au]+=ts;}END \
{for (a in sum){"date -u -d \"@"sum[a]"\" +%T"|getline h; print a,h}}' test.csv
having test.csv like this:
Agent,Username,Project,Duration
AAA,aaa,NBM,02:09:06
AAA,aaa,NBM,00:15:01
BBB,bbb,NBM,04:14:24
AAA,aaa,NBM,00:00:16
BBB,bbb,NBM,00:45:19
CCC,ccc,NDB,00:00:01
results in:
CCC,ccc,00:00:01
BBB,bbb,04:59:43
AAA,aaa,02:24:23
You can use this with small adjustments if the duration sits in a different column or if there are extra columns.
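If you would rather stay in Python than shell out to date, here is a minimal sketch of the same aggregation, assuming the same four-column layout as test.csv above:

import csv
from collections import defaultdict
from datetime import timedelta

totals = defaultdict(timedelta)

with open('test.csv') as f:
    reader = csv.reader(f)
    next(reader)                      # skip the header row
    for agent, username, project, duration in reader:
        h, m, s = map(int, duration.split(':'))
        totals[(agent, username)] += timedelta(hours=h, minutes=m, seconds=s)

# Print Agent,Username,total duration as HH:MM:SS
for (agent, username), total in totals.items():
    secs = int(total.total_seconds())
    print('%s,%s,%02d:%02d:%02d' % (agent, username,
                                    secs // 3600, (secs % 3600) // 60, secs % 60))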
Let me give you an example in case you decide to use SQLite. You didn't specify a language but I will use Python because it can be read as pseudocode. This part creates your sqlite file:
import csv
import sqlite3
con = sqlite3.Connection('my_sqlite_file.sqlite')
con.text_factory = str
cur = con.cursor()
cur.execute('CREATE TABLE "mytable" ("field1" varchar, \
"field2" varchar, "field3" varchar);')
and you use the command:
cur.executemany('INSERT INTO mytable VALUES (?, ?, ?)', list_of_values)
to insert rows into your database once you have read them from the csv file. Notice that we only created three fields in the table, so we only insert three values per row of list_of_values; that's why we use (?, ?, ?).
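For illustration, here is a rough end-to-end sketch of that read-and-insert step; the CSV file name is a placeholder, and it assumes the three-column table created above already exists:

import csv
import sqlite3

con = sqlite3.Connection('my_sqlite_file.sqlite')
con.text_factory = str
cur = con.cursor()

# Read the CSV, skip its header row, and keep the first three fields of each row.
with open('input.csv') as f:
    reader = csv.reader(f)
    next(reader)
    list_of_values = [row[:3] for row in reader]

cur.executemany('INSERT INTO mytable VALUES (?, ?, ?)', list_of_values)
con.commit()
con.close()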
I have been working on a project to import the Danish 2.5 million ATM transactions data set and derive some visualizations from it.
The data is hosted on a MySQL server provided by the university. The objective is to import the data using Sqoop and then apply a few transformations to it using PySpark.
Link to the dataset here : https://www.kaggle.com/sparnord/danish-atm-transactions
The SQL server that hosts this information has a few rows which are intentionally or unintentionally mangled.
Example:
So I have a very basic Sqoop command which gets the details from the source database. However, I run into an issue where some values contain a double quote ("), especially in the message_text column.
Sqoop command:
sqoop import --connect jdbc:mysql:{source-connection-string} --table SRC_ATM_TRANS --username {username} --password {password} --target-dir /user/root/etl_project --fields-terminated-by '|' --lines-terminated-by "\n" -m 1
Here is a sample row as it is imported into the transaction data:
2017|January|1|Sunday|21|Active|85|Diebold Nixdorf|København|Regnbuepladsen|5|1550|55.676|12.571|DKK|MasterCard|4531|Withdrawal|4017|"Suspected malfunction|0.000|55.676|13|2618425|0.000|277|1010|93|3|280.000|0|75|803|Clouds
However, the expected output should be:
2017|January|1|Sunday|21|Active|85|Diebold Nixdorf|København|Regnbuepladsen|5|1550|55.676|12.571|DKK|MasterCard|4531|Withdrawal|4017|"Suspected malfunction,0.000|55.676|13|2618425|0.000|277|1010|93|3|280.000|0|75|803|Clouds|Cloudy
At first I was okay with this, hoping that PySpark would handle the mangled data since the delimiters are specified.
But now I run into issues when populating my dataframe.
transactions = spark.read.option("sep","|").csv("/user/root/etl_project/part-m-00000", header = False,schema = transaction_schema)
However when I inspect my rows I see that the mangled data has caused the dataframe to put these affected values into a single column!
transactions.filter(transactions.message_code == "4017").collect()
Row(year=2017, month=u'January', day=1, weekday=u'Sunday', hour=17, atm_status=u'Active', atm_id=u'35', atm_manufacturer=u'NCR', atm_location=u'Aabybro', atm_streetname=u'\xc3\u0192\xcb\u0153stergade', atm_street_number=6, atm_zipcode=9440, atm_lat=57.162, atm_lon=9.73, currency=u'DKK', card_type=u'MasterCard', transaction_amount=7387, service=u'Withdrawal', message_code=u'4017', message_text=u'Suspected malfunction|0.000|57.158|10|2625037|0.000|276|1021|83|4|319.000|0|0|800|Clear', weather_lat=None, weather_lon=None, weather_city_id=None, weather_city_name=None, temp=None, pressure=None, humidity=None, wind_speed=None, wind_deg=None, rain_3h=None, clouds_all=None, weather_id=None, weather_main=None, weather_description=None)
At this point I am not sure what to do.
Do I go ahead and create temporary columns to manage this and use a regex replacement to fill in these values?
Or is there a better way to import the data and handle these mangled values, either in Sqoop or in PySpark?
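One idea I am considering is to turn off Spark's quote handling entirely, so that the stray double quote is treated as ordinary data instead of an opening quote. A rough, untested sketch of what that would look like, using the same path and schema as above:

# Rough sketch (untested): disable the CSV quote character so a stray "
# in message_text no longer swallows the following | delimiters.
transactions = (
    spark.read
    .option("sep", "|")
    .option("quote", "")   # an empty string turns off quote handling
    .csv("/user/root/etl_project/part-m-00000", header=False, schema=transaction_schema)
)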
I'm running a simple PySpark script, like this.
base_path = '/mnt/rawdata/'
file_names = ['2018/01/01/ABC1_20180101.gz',
              '2018/01/02/ABC2_20180102.gz',
              '2018/01/03/ABC3_20180103.gz',
              '2018/01/01/XYZ1_20180101.gz',
              '2018/01/02/XYZ1_20180102.gz']
for f in file_names:
    print(f)
So, just testing this, I can find the files and print the strings just fine. Now, I'm trying to figure out how to load the contents of each file into a specific table in SQL Server. The thing is, I want to do a wildcard search for files that match a pattern, and load specific files into specific tables. So, I would like to do the following:
load all files with 'ABC' in the name, into my 'ABC_Table' and all files with 'XYZ' in the name, into my 'XYZ_Table' (all data starts on row 2, not row 1)
load the file name into a field named 'file_name' in each respective table (I'm totally fine with the entire string from 'file_names' or the part of the string after the last '/' character; doesn't matter)
I tried to use Azure Data Factory for this, and it can recursively loop through all files just fine, but it doesn't get the file names loaded, and I really need the file names in the table to distinguish which records are coming from which files & dates. Is it possible to do this using Azure Databricks? I feel like this is an achievable ETL process, but I don't know enough about ADB to make this work.
Update based on Daniel's recommendation
dfCW = sc.sequenceFile('/mnt/rawdata/2018/01/01/ABC%.gz/').toDF()
dfCW.withColumn('input', input_file_name())
print(dfCW)
Gives me:
com.databricks.backend.daemon.data.common.InvalidMountException:
What can I try next?
You can use input_file_name from pyspark.sql.functions
e.g.
withFiles = df.withColumn("file", input_file_name())
Afterwards you can create multiple dataframes by filtering on the new column
abc = withFiles.filter(col("file").like("%ABC%"))
xyz = withFiles.filter(col("file").like("%XYZ%"))
and then use the regular writer for both of them.
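If the eventual target is SQL Server, the write step could go through Spark's generic JDBC writer. Here is a rough sketch continuing from abc and xyz above; the URL, table names, and credentials are placeholders, not values from the question:

# Placeholder connection details -- adjust to your own server and database.
jdbc_url = "jdbc:sqlserver://<server>:1433;databaseName=<database>"
props = {
    "user": "<user>",
    "password": "<password>",
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
}

# Append each filtered dataframe (including the new 'file' column) to its table.
abc.write.jdbc(url=jdbc_url, table="ABC_Table", mode="append", properties=props)
xyz.write.jdbc(url=jdbc_url, table="XYZ_Table", mode="append", properties=props)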
So I've got a large database contained in CSV files; there are about 1000+ of them, with about 24 million rows per CSV, and I want to clean it up.
This is an example of the data in the CSV:
As you can see, there are rows with the same 'cik', so I want to clean them up so that each 'cik' is unique and there are no duplicates.
I've tried to do it with Python, but couldn't manage to do it.
Any suggestions would be helpful.
The tsv-uniq tool from eBay's TSV Utilities can do this type of duplicate removal (disclaimer: I'm the author). tsv-uniq is similar to the Unix uniq program, with two advantages: Data does not need to be sorted and individual fields can be used as the key. The following commands would be used to remove duplicates on the cik and cik plus ip fields:
$ # Dedup on cik field (field 5)
$ tsv-uniq -H -f 5 file.tsv > newfile.tsv
$ # Dedup on both cik and ip fields (fields 1 and 5)
$ tsv-uniq -H -f 1,5 file.tsv > newfile.tsv
The -H option preserves the header. The above forms use TAB as the field delimiter. To use a comma or another character, use the -d|--delimiter option as follows:
$ tsv-uniq -H -d , -f 5 file.csv > newfile.csv
tsv-uniq does not support CSV-escape syntax, but it doesn't look like your dataset needs escapes. If your dataset does use escapes, it can likely be converted to TSV format (without escapes) using the csv2tsv tool in the same package. The tools run on Unix and macOS, and there are prebuilt binaries on the Releases page.
This is what I used to filter out all the duplicates with the same 'cik' and 'ip'
import pandas as pd
chunksize = 10 ** 5
for chunk in pd.read_csv('log20170628.csv', chunksize=chunksize):
    df = pd.DataFrame(chunk)
    df = df.drop_duplicates(subset=["cik", "ip"])
    df[['ip','date','cik']].to_csv('cleanedlog20170628.csv', mode='a')
But when running the program I got this warning:
sys:1: DtypeWarning: Columns (14) have mixed types. Specify dtype option on import or set low_memory=False.
So I am not sure whether my code has a bug or it is something to do with the data in the CSV.
I opened the CSV to check the data and it seems alright.
I have cut the number of rows from 24 million to about 5 million, which was the goal from the start, but this warning is bugging me...
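For what it's worth, here is a rough sketch of one way to both silence the mixed-type warning (by forcing every column to string on import) and deduplicate across chunks rather than only within each chunk; the file and column names are the ones from the snippet above:

import pandas as pd

chunksize = 10 ** 5
seen = set()          # (cik, ip) pairs already written out
first_chunk = True

# dtype=str avoids the per-column type guessing that triggers the DtypeWarning.
for chunk in pd.read_csv('log20170628.csv', chunksize=chunksize, dtype=str):
    chunk = chunk.drop_duplicates(subset=["cik", "ip"])
    # Also drop rows whose (cik, ip) key appeared in an earlier chunk.
    keys = list(zip(chunk["cik"], chunk["ip"]))
    keep = [k not in seen for k in keys]
    seen.update(keys)
    chunk = chunk[keep]
    chunk[['ip', 'date', 'cik']].to_csv('cleanedlog20170628.csv', mode='a',
                                        header=first_chunk, index=False)
    first_chunk = False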
I need to record temperatures in an SQLite database on a Linux system (using bash).
My problem is that I get each temperature reading in a separate file.
How can I get that reading into the SQLite command below?
sqlite3 mydb "INSERT INTO readings (TStamp, reading) VALUES (datetime(), 'xxx');"
The file contains just one line with the value "45.7", which should replace the xxx.
With fixed data, the SQL command works fine.
You can simply echo commands to sqlite3, like this:
temp=`cat file_with_temperature_value`
echo "INSERT INTO readings (TStamp, reading) VALUES (datetime(), '$temp');" | sqlite3 mydb
or do it like in your example:
temp=`cat file_with_temperature_value`
sqlite3 mydb "INSERT INTO readings (TStamp, reading) VALUES (datetime(), '$temp');"
My client uses SAS 9.3 running on an AIX (IBM Unix) server. The client interface is SAS Enterprise Guide 5.1.
I ran into this really puzzling problem: when using PROC IMPORT in combination with dbms=xlsx, it seems impossible to filter rows based on the value of a character variable (at least, when we look for an exact match).
With an .xls file, the following import works perfectly well; the expected subset of rows is written to myTable:
proc import out = myTable(where=(myString EQ "ABC"))
datafile ="myfile.xls"
dbms = xls replace;
run;
However, using the same data but this time in an .xlsx file, an empty dataset is created (having the right number of variables and adequate column types).
proc import out = myTable(where=(myString EQ "ABC"))
datafile ="myfile.xlsx"
dbms = xlsx replace;
run;
Moreover, if we exclude the where from the PROC IMPORT, the data is seemingly imported correctly. However, filtering is still not possible. For instance, this will create an empty dataset:
data myFilteredTable;
set myTable;
where myString EQ "ABC";
run;
The following will work, but is obviously not satisfactory:
data myFilteredTable;
set myTable;
where myString LIKE "ABC%";
run;
Also note that:
Using compress or other string functions does not help
Filtering using numerical columns works fine for both xls and xlsx files.
My preferred method to read spreadsheets is to use excel libnames, but this is technically not possible at this time.
I wonder if this is a known issue; I couldn't find anything about it so far. Any help is appreciated.
It sounds like your strings have extra characters on the end that are not being picked up by compress. Try using the countc function on myString to see whether any extra characters exist at the end. You can then figure out which characters to remove with compress once they are identified.