Efficient algorithm for cleaning large CSV files - python-3.x

So I've got a large database contained inside CSV files; there are about 1000+ of them, with about 24 million rows per CSV, and I want to clean it up.
This is an example of the data in the CSV:
As you can see, there are rows that have the same 'cik', so I want to clean them up so that we get unique 'cik' values and no duplicates.
I've tried to do it with Python, but couldn't manage it.
Any suggestions would be helpful.

The tsv-uniq tool from eBay's TSV Utilities can do this type of duplicate removal (disclaimer: I'm the author). tsv-uniq is similar to the Unix uniq program, with two advantages: Data does not need to be sorted and individual fields can be used as the key. The following commands would be used to remove duplicates on the cik and cik plus ip fields:
$ # Dedup on cik field (field 5)
$ tsv-uniq -H -f 5 file.tsv > newfile.tsv
$ # Dedup on both cik and ip fields (fields 1 and 5)
$ tsv-uniq -H -f 1,5 file.tsv > newfile.tsv
The -H option preserves the header. The above forms use TAB as the field delimiter. To use comma or another character use the -d|--delimiter option as follows:
$ tsv-uniq -H -d , -f 5 file.csv > newfile.csv
tsv-uniq does not support CSV-escape syntax, but it doesn't look like your dataset needs escapes. If your dataset does use escapes, it can likely be converted to TSV format (without escapes) using the csv2tsv tool in the same package. The tools run on Unix and macOS; prebuilt binaries are available on the Releases page.

This is what I used to filter out all the duplicates with the same 'cik' and 'ip'
import pandas as pd
chunksize = 10 ** 5
for chunk in pd.read_csv('log20170628.csv', chunksize=chunksize):
    df = pd.DataFrame(chunk)
    df = df.drop_duplicates(subset=["cik", "ip"])
    df[['ip', 'date', 'cik']].to_csv('cleanedlog20170628.csv', mode='a')
But when running the program I got this warning:
sys:1: DtypeWarning: Columns (14) have mixed types. Specify dtype option on import or set low_memory=False.
So I am not sure whether my code has a bug or whether it is something to do with the data in the CSV.
I opened the CSV to check the data and it seems alright.
I have cut the number of rows from 24 million to about 5 million, which was the goal from the start. But this warning is bugging me...
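For reference, here is a minimal sketch (reusing the file and column names above) that silences the warning and also tightens the deduplication: reading everything as strings sidesteps the mixed-type inference that triggers the DtypeWarning (low_memory=False is the alternative the message suggests), a set of already-seen keys extends drop_duplicates across chunk boundaries, and the header flag keeps the header from being written once per chunk.
import pandas as pd

chunksize = 10 ** 5
seen = set()          # (cik, ip) pairs already written out
first_chunk = True
for chunk in pd.read_csv('log20170628.csv', chunksize=chunksize, dtype=str):
    chunk = chunk.drop_duplicates(subset=["cik", "ip"])
    # Drop rows whose key already appeared in an earlier chunk.
    keys = [tuple(t) for t in chunk[["cik", "ip"]].itertuples(index=False)]
    mask = [k not in seen for k in keys]
    chunk = chunk[mask]
    seen.update(k for k, keep in zip(keys, mask) if keep)
    chunk[['ip', 'date', 'cik']].to_csv('cleanedlog20170628.csv', mode='a',
                                        header=first_chunk, index=False)
    first_chunk = False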

Related

How to manage mangled data when importing from your source in sqoop or pyspark

I have been working on a project to import the Danish 2.5 million ATM transaction dataset to derive some visualizations.
The data is hosted on a MySQL server provided by the university. The objective is to import the data using Sqoop and then apply a few transformations to it using pyspark.
Link to the dataset here: https://www.kaggle.com/sparnord/danish-atm-transactions
The SQL server that hosts this information has a few rows which are intentionally or unintentionally mangled.
Example:
So I have a very basic Sqoop command which gets the details from the source database. However, I run into an issue where some values contain a double quote ("), especially in the column message_text.
Sqoop command:
sqoop import --connect jdbc:mysql:{source-connection-string} --table SRC_ATM_TRANS --username {username} --password {password} --target-dir /user/root/etl_project --fields-terminated-by '|' --lines-terminated-by "\n" -m 1
Here is a sample row as it is imported into the transaction data:
2017|January|1|Sunday|21|Active|85|Diebold Nixdorf|København|Regnbuepladsen|5|1550|55.676|12.571|DKK|MasterCard|4531|Withdrawal|4017|"Suspected malfunction|0.000|55.676|13|2618425|0.000|277|1010|93|3|280.000|0|75|803|Clouds
However the expected output should be
2017|January|1|Sunday|21|Active|85|Diebold Nixdorf|København|Regnbuepladsen|5|1550|55.676|12.571|DKK|MasterCard|4531|Withdrawal|4017|"Suspected malfunction,0.000|55.676|13|2618425|0.000|277|1010|93|3|280.000|0|75|803|Clouds|Cloudy
At first I was okay with this, hoping that pyspark would handle the mangled data since the delimiters are specified.
But now I run into issues when populating my dataframe.
transactions = spark.read.option("sep","|").csv("/user/root/etl_project/part-m-00000", header = False,schema = transaction_schema)
However, when I inspect my rows, I see that the mangled data has caused the dataframe to put the affected values into a single column!
transactions.filter(transactions.message_code == "4017").collect()
Row(year=2017, month=u'January', day=1, weekday=u'Sunday', hour=17, atm_status=u'Active', atm_id=u'35', atm_manufacturer=u'NCR', atm_location=u'Aabybro', atm_streetname=u'\xc3\u0192\xcb\u0153stergade', atm_street_number=6, atm_zipcode=9440, atm_lat=57.162, atm_lon=9.73, currency=u'DKK', card_type=u'MasterCard', transaction_amount=7387, service=u'Withdrawal', message_code=u'4017', message_text=u'Suspected malfunction|0.000|57.158|10|2625037|0.000|276|1021|83|4|319.000|0|0|800|Clear', weather_lat=None, weather_lon=None, weather_city_id=None, weather_city_name=None, temp=None, pressure=None, humidity=None, wind_speed=None, wind_deg=None, rain_3h=None, clouds_all=None, weather_id=None, weather_main=None, weather_description=None)
At this point I am not sure what to do.
Do I go ahead and create temporary columns to manage this and use a regex replacement to fill in these values (a rough sketch of that idea is below)?
Or is there a better way I can import the data and manage these mangled values, either in Sqoop or in pyspark?
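For reference, a rough, hypothetical sketch of that regex-replacement idea: pre-clean the raw Sqoop export as plain text, stripping the stray double quotes so Spark's CSV parser cannot treat a lone quote as the start of a quoted field spanning many delimiters, then parse the cleaned lines with the same schema. The path and transaction_schema are the ones used above; this is an illustration, not a verified fix.
import re

# Read the Sqoop export as plain text and drop stray double quotes before
# Spark's CSV parser ever sees the data.
raw = spark.sparkContext.textFile("/user/root/etl_project/part-m-00000")
cleaned = raw.map(lambda line: re.sub(r'"', '', line))

# spark.read.csv also accepts an RDD of strings, so the cleaned lines can be
# parsed with the original delimiter and schema.
transactions = spark.read.option("sep", "|") \
    .csv(cleaned, header=False, schema=transaction_schema)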

Different delimiters on different lines in the same file for Databricks Spark

I have a file that has a mix of comma-delimited lines and pipe-delimited lines that I need to import into Databricks.
Is it possible to indicate the use of two or more different separators when creating a sql table in Databricks/Spark?
I see lots of posts for multiple character separators, but nothing on different separators.
https://forums.databricks.com/questions/10743/how-to-read-file-in-pyspark-with-delimiter.html
Possible to handle multi character delimiter in spark
http://blog.madhukaraphatak.com/spark-3-introduction-part-1
etc.
I'm currently using something like this.
create table myschema.mytable (
foo string,
bar string
)
using csv
options (
header = "true",
delimiter = ","
);
One method you could try is to create a Spark dataframe first and then make a table out of it. Below is an example for a hypothetical case using pyspark, where the delimiters are | and -.
BEWARE: we are using split, which means it will split everything; e.g. 2000-12-31 is a single value, yet it will be split. Therefore we should be very sure that no such case would ever occur in the data. As general advice, one should never accept these types of files, as they are accidents waiting to happen.
How the sample data looks: in this case we have 2 files in our directory with | and - occurring randomly as delimiters.
# Create RDD. Basically read as simple text file.
# sc is spark context
rddRead = sc.textFile("/mnt/adls/RI_Validation/ReadMulktipleDelimerFile/Sample1/")
rddRead.collect() # For debugging
import re # Import for usual python regex
# Create another rdd using simple string operations. This will be similar to a list of lists.
# Give a regex expression to split your string based on the anticipated delimiters (this could be dangerous
# if those delimiters occur as part of a value, e.g. 2021-12-31 is a single value in reality.
# But this is a price we have to pay for not having good data).
# For each iteration, k represents 1 element which would eventually become 1 row (e.g. A|33-Mech)
rddSplit = rddRead.map(lambda k: re.split("[|-]+", k)) # Anticipated delimiters are | OR - in this case.
rddSplit.collect() # For debugging
# This block is applicable only if you have headers
lsHeader = rddSplit.first() # Get First element from rdd as header.
print(lsHeader) # For debugging
print()
# Remove rows representing the header. (Note: this assumes the column names are the same in
# all files. If not, we will have to filter by manually specifying
# all of them, which would be a nightmare from the pov of good code as well as maintenance.)
rddData = rddSplit.filter(lambda x: x != lsHeader)
rddData.collect() # For debugging
# Convert rdd to spark dataframe
# Utilise the header we got in the earlier step. Otherwise we can give our own headers.
dfSpark = rddData.toDF(lsHeader)
dfSpark.display() # For debugging
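To finish the "make a table out of it" step mentioned at the start, the dataframe can then be persisted as a table (reusing the table name from the question; adjust as needed):
# Persist the dataframe as a managed table so it can be queried with SQL.
dfSpark.write.mode("overwrite").saveAsTable("myschema.mytable")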

Cassandra: COPY data with CQL function on column

I am trying to export and import data from a Cassandra table in order to change a timestamp column to a unix-epoch column (i.e. type timestamp to bigint).
I tried exporting the data to CSV using the command below:
COPY raw_data(raw_data_field_id, toUnixTimestamp(dt_timestamp), value) TO 'raw_data_3_feb_19.csv' WITH PAGETIMEOUT=40 AND PAGESIZE=20;
but I get the error: Improper COPY command.
How can I fix this issue or is there a better way to achieve this?
from
raw_data_field_id | dt_timestamp | value
-------------------+---------------------------------+-------
23 | 2018-06-12 07:15:00.656000+0000 | 131.3
to
raw_data_field_id | dt_unix_timestamp_epoch | value
-------------------+---------------------------------+-------
23 | 1528787700656 | 131.3
The COPY command does not support adding extra functions to process the output.
I would say you have several solutions:
export the data in CSV using COPY, convert the timestamp value (using shell commands or a high-level language; a sketch of this option follows below) and import it into a new table
export using echo "select raw_data_field_id, toUnixTimestamp(dt_timestamp), value from raw.raw_data;" | ccm node1 cqlsh > output.csv, change the csv so it has a proper format and import it to a new table (this solution is from here)
write your own conversion tool using one of the Cassandra drivers (Python, Java, etc.).
maybe you could try something with a UDF, but I haven't tested this.
You should be aware that COPY FROM supports datasets that have fewer than 2 million rows.
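A minimal sketch of the first option in plain Python, assuming the export contains the raw dt_timestamp column and that the timestamps look like the sample above (the strptime format may need adjusting to whatever datetime format cqlsh actually writes):
import csv
from datetime import datetime

# Rewrite the exported CSV, replacing the timestamp column with epoch milliseconds.
with open('raw_data_3_feb_19.csv', newline='') as src, \
     open('raw_data_epoch.csv', 'w', newline='') as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    for raw_data_field_id, dt_timestamp, value in reader:
        dt = datetime.strptime(dt_timestamp, '%Y-%m-%d %H:%M:%S.%f%z')
        writer.writerow([raw_data_field_id, int(dt.timestamp() * 1000), value])
The converted file can then be loaded into the new table with a plain COPY ... FROM 'raw_data_epoch.csv'.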

Export from Pig to CSV

I'm having a lot of trouble getting data out of Pig and into a CSV that I can use in Excel or SQL (or R or SPSS, etc.) without a lot of manipulation...
I've tried using the following function:
STORE pig_object INTO '/Users/Name/Folder/pig_object.csv'
USING CSVExcelStorage(',','NO_MULTILINE','WINDOWS');
It creates a folder with that name containing lots of part-m-0000# files. I can later join them all up using cat part* > filename.csv, but there's no header, which means I have to add it manually.
I've read that PigStorageSchema is supposed to create a separate file with a header, but it doesn't seem to work at all; e.g., I get the same result as if it's just stored, with no header file:
STORE pig_object INTO '/Users/Name/Folder/pig_object'
USING org.apache.pig.piggybank.storage.PigStorageSchema();
(I've tried this in both local and mapreduce mode).
Is there any way of getting the data out of Pig into a simple CSV file without these multiple steps?
Any help would be much appreciated!
I'm afraid there isn't a one-liner which does the job, but you can come up with the following (Pig v0.10.0):
A = load '/user/hadoop/csvinput/somedata.txt' using PigStorage(',')
as (firstname:chararray, lastname:chararray, age:int, location:chararray);
store A into '/user/hadoop/csvoutput' using PigStorage('\t','-schema');
When PigStorage takes '-schema' it will create a '.pig_schema' and a '.pig_header' in the output directory. Then you have to merge '.pig_header' with 'part-x-xxxxx' :
1. If the result needs to be copied to the local disk:
hadoop fs -rm /user/hadoop/csvoutput/.pig_schema
hadoop fs -getmerge /user/hadoop/csvoutput ./output.csv
(Since -getmerge takes an input directory you need to get rid of .pig_schema first)
2. Storing the result on HDFS:
hadoop fs -cat /user/hadoop/csvoutput/.pig_header /user/hadoop/csvoutput/part-x-xxxxx | \
hadoop fs -put - /user/hadoop/csvoutput/result/output.csv
For further reference you might also have a look at these posts:
STORE output to a single CSV?
How can I concatenate two files in hadoop into one using Hadoop FS shell?
If you store your data using PigStorage on HDFS and then merge it using -getmerge -nl:
STORE pig_object INTO '/user/hadoop/csvoutput/pig_object'
using PigStorage('\t','-schema');
fs -getmerge -nl /user/hadoop/csvoutput/pig_object /Users/Name/Folder/pig_object.csv;
Docs:
Optionally -nl can be set to enable adding a newline character (LF) at
the end of each file.
You will have a single TSV/CSV file with the following structure:
1 - header
2 - empty line
3 - pig schema
4 - empty line
5 - 1st line of DATA
6 - 2nd line of DATA
...
So we can simply remove lines [2,3,4] using AWK:
awk 'NR==1 || NR>4 {print}' /Users/Name/Folder/pig_object.csv > /Users/Name/Folder/pig_object_clean.csv

Processing CSV data

I have recently been asked to take a .csv file that looks like this:
Into something like this:
Keeping in mind that there will be hundreds, if not thousands, of rows, since a new row is created every time a user logs in/out, and there will be more than just two users. My first thought was to load the .csv file into MySQL and then run a query on it. However, I really don't want to install MySQL on the machine that will be used for this.
I could do it manually for each agent in Excel/OpenOffice, but because there is little room for error and there are so many lines, I want to automate the process. What's the best way to go about that?
This one-liner relies only on awk, and date for converting back and forth to timestamps:
awk 'BEGIN{FS=OFS=","}NR>1{au=$1 "," $2;t=$4; \
"date -u -d \""t"\" +%s"|getline ts; sum[au]+=ts;}END \
{for (a in sum){"date -u -d \"#"sum[a]"\" +%T"|getline h; print a,h}}' test.csv
having test.csv like this:
Agent,Username,Project,Duration
AAA,aaa,NBM,02:09:06
AAA,aaa,NBM,00:15:01
BBB,bbb,NBM,04:14:24
AAA,aaa,NBM,00:00:16
BBB,bbb,NBM,00:45:19
CCC,ccc,NDB,00:00:01
results in:
CCC,ccc,00:00:01
BBB,bbb,04:59:43
AAA,aaa,02:24:23
You can use this, with small adjustments, to extract the date from extra columns.
Let me give you an example in case you decide to use SQLite. You didn't specify a language but I will use Python because it can be read as pseudocode. This part creates your sqlite file:
import csv
import sqlite3
con = sqlite3.Connection('my_sqlite_file.sqlite')
con.text_factory = str
cur = con.cursor()
cur.execute('CREATE TABLE "mytable" ("field1" varchar, \
"field2" varchar, "field3" varchar);')
and you use the command:
cur.executemany('INSERT INTO mytable VALUES (?, ?, ?)', list_of_values)
to insert rows in your database once you have read them from the csv file. Notice that we only created three fields in the database so we are only inserting 3 values from your list_of_values. That's why we are using (?, ?, ?).
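Continuing the snippet above, here is a hedged sketch of the remaining steps for the sample data from the awk answer (test.csv): build list_of_values from the CSV, insert and commit, then total the HH:MM:SS durations per agent using SQLite's strftime, which is one possible way to sum time values.
# Read Agent, Username and Duration from the CSV, skipping the header row
# and dropping the Project column.
with open('test.csv', newline='') as f:
    reader = csv.reader(f)
    next(reader)
    list_of_values = [(row[0], row[1], row[3]) for row in reader]

cur.executemany('INSERT INTO mytable VALUES (?, ?, ?)', list_of_values)
con.commit()

# Total duration per agent/username, in seconds. strftime('%s', '02:09:06')
# interprets the value as a time of day on 2000-01-01, so subtracting
# midnight of that day yields the duration in seconds.
cur.execute('SELECT field1, field2, '
            "SUM(strftime('%s', field3) - strftime('%s', '00:00:00')) "
            'FROM mytable GROUP BY field1, field2')
print(cur.fetchall())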
