Automated appending of CSV files in Linux with timestamp and id

I am trying to append data from File1 to File2 (both CSVs) in Ubuntu Linux.
The data is to be appended every minute, along with a timestamp and an auto-incrementing id number (to be used as the primary key in a MySQL database).
I have set up a crontab job:
* * * * * cat File1>>File2
It works perfectly; however, I am a bit stuck on adding the two new columns to File2 that are to be auto-populated.
I am a bit of a novice at Linux and would appreciate some help.
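One way to do this (a sketch only; the file paths and the id-in-the-first-column layout are assumptions, not from the question) is to replace the cat command with a small Python script that prepends an id and a timestamp to every row it copies:
# append_with_id.py - append every row of File1 to File2, prefixed with an
# auto-incrementing id and a timestamp; the id continues from whatever is
# already in File2.
import csv
import os
from datetime import datetime

SRC = "/path/to/File1"   # adjust to the real locations
DST = "/path/to/File2"

# Find the highest id already written to File2 (0 if the file is empty or missing).
last_id = 0
if os.path.exists(DST):
    with open(DST, newline="") as f:
        for row in csv.reader(f):
            if row and row[0].isdigit():
                last_id = int(row[0])

stamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")

with open(SRC, newline="") as src, open(DST, "a", newline="") as dst:
    writer = csv.writer(dst)
    for row in csv.reader(src):
        last_id += 1
        writer.writerow([last_id, stamp] + row)
The crontab entry would then call the script instead of cat:
* * * * * /usr/bin/python3 /path/to/append_with_id.py
(If the id is going to be the primary key in MySQL, an AUTO_INCREMENT column on the table itself is the more usual approach, but the script above keeps the ids in the CSV.)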

Related

Python script to read a log file from a Linux server with a matching string condition in real time

I need to create a Python script which can read the data from the log file for the last hour (from one hour before the current time up to now). After that, I have to search for a matching string and send a mail based on the results.
No idea how to start; help needed.
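A rough sketch of one possible approach (the log path, the timestamp format at the start of each line, the search pattern, and the local SMTP server are all assumptions, since the question does not specify them):
import smtplib
from datetime import datetime, timedelta
from email.message import EmailMessage

LOG_FILE = "/var/log/myapp.log"        # hypothetical log file
PATTERN = "ERROR"                      # hypothetical string to match
TS_FORMAT = "%Y-%m-%d %H:%M:%S"        # assumed format of the timestamp prefix

cutoff = datetime.now() - timedelta(hours=1)
matches = []

with open(LOG_FILE) as fh:
    for line in fh:
        try:
            ts = datetime.strptime(line[:19], TS_FORMAT)   # first 19 chars = timestamp
        except ValueError:
            continue                   # skip lines without a parsable timestamp
        if ts >= cutoff and PATTERN in line:
            matches.append(line.rstrip())

if matches:
    msg = EmailMessage()
    msg["Subject"] = "%d matching log lines in the last hour" % len(matches)
    msg["From"] = "monitor@example.com"
    msg["To"] = "admin@example.com"
    msg.set_content("\n".join(matches))
    with smtplib.SMTP("localhost") as smtp:   # assumes a local mail server is running
        smtp.send_message(msg)
Run from cron once an hour (0 * * * * python3 /path/to/check_log.py), this would approximate the real-time requirement.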

Crontab and PgAgent to run python script

I'm a newbie with Postgres and I have this Python script which converts my Excel file into a pandas DataFrame. Afterwards, the data is sent into my PostgreSQL database.
.....
engine = create_engine('postgresql+psycopg2://username:password@host:port/database')
df.head(0).to_sql('table_name', engine, if_exists='replace',index=False) #truncates the table
conn = engine.raw_connection()
cur = conn.cursor()
output = io.StringIO()
df.to_csv(output, sep='\t', header=False, index=False)
output.seek(0)
contents = output.getvalue()
cur.copy_from(output, 'table_name', null="") # null values become ''
conn.commit()
...
I would like the script to be run daily with a crontab or a PgAgent job. My database is currently on my local machine and will later be transferred to a server. What's the best way to schedule tasks that I will later use on an online server? Also, can I schedule a PgAgent job to run a Python script?
Crontab is a very good tool for scheduling tasks that you want to run repeatedly at specific times or on a restart. crontab -e allows the current user to edit their crontab file. For example,
30 18 * * * python ~/Documents/example.py
will run "example.py" at 18:30 every day, with the privileges of whichever user owns the crontab file. Crontab is very easy to use and edit, completely reliable, and it is what I use personally for scheduling tasks on my own server.
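For the script in the question, a cron-friendly wrapper might look something like this (a sketch only: the file paths, the log location, and reading the connection string from an environment variable are assumptions, and it uses a plain to_sql load rather than the COPY trick from the question, just to keep the example short):
# run_etl.py - hypothetical wrapper around the Excel-to-Postgres load,
# written so it can be scheduled unattended by cron or pgAgent.
import logging
import os

import pandas as pd
from sqlalchemy import create_engine

logging.basicConfig(
    filename="/home/user/excel_to_pg.log",   # hypothetical log location
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)

def main():
    # Connection string comes from the environment, so nothing in the script
    # changes when the database moves to the online server (PG_URL is a made-up name).
    engine = create_engine(os.environ["PG_URL"])
    df = pd.read_excel("/home/user/data/input.xlsx")   # hypothetical absolute path
    df.to_sql("table_name", engine, if_exists="replace", index=False)
    logging.info("Loaded %d rows into table_name", len(df))

if __name__ == "__main__":
    try:
        main()
    except Exception:
        logging.exception("ETL run failed")
        raise
A crontab entry such as 0 2 * * * /usr/bin/python3 /home/user/run_etl.py would then run it daily at 02:00, and the same entry works unchanged once the database moves online, as long as PG_URL points at it.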

Efficient algorithm for cleaning large CSV files

So I've got a large database contained inside CSV files; there are about 1000+ of them, with about 24 million rows per CSV, and I want to clean it up.
This is an example of the data in the CSV (sample not reproduced here):
As you can see, there are rows that have the same 'cik', so I want to clean them all so that we get unique 'cik' values and do not have any duplicates.
I've tried to do it with Python, but couldn't manage to do it.
Any suggestions would be helpful.
The tsv-uniq tool from eBay's TSV Utilities can do this type of duplicate removal (disclaimer: I'm the author). tsv-uniq is similar to the Unix uniq program, with two advantages: Data does not need to be sorted and individual fields can be used as the key. The following commands would be used to remove duplicates on the cik and cik plus ip fields:
$ # Dedup on cik field (field 5)
$ tsv-uniq -H -f 5 file.tsv > newfile.tsv
$ # Dedup on both cik and ip fields (fields 1 and 5)
$ tsv-uniq -H -f 1,5 file.tsv > newfile.tsv
The -H option preserves the header. The above forms use TAB as the field delimiter. To use comma or another character use the -d|--delimiter option as follows:
$ tsv-uniq -H -d , -f 5 file.csv > newfile.csv
tsv-uniq does not support CSV escape syntax, but it doesn't look like your dataset needs escapes. If your dataset does use escapes, it can likely be converted to TSV format (without escapes) using the csv2tsv tool in the same package. The tools run on Unix and macOS, and there are prebuilt binaries on the Releases page.
This is what I used to filter out all the duplicates with the same 'cik' and 'ip':
import pandas as pd
chunksize = 10 ** 5
for chunk in pd.read_csv('log20170628.csv', chunksize=chunksize):
    df = pd.DataFrame(chunk)
    df = df.drop_duplicates(subset=["cik", "ip"])
    df[['ip','date','cik']].to_csv('cleanedlog20170628.csv', mode='a')
But when running the program I got this warning:
sys:1: DtypeWarning: Columns (14) have mixed types. Specify dtype option on import or set low_memory=False.
So I am not sure whether my code has a bug or whether it is something to do with the data from the CSV.
I opened the CSV to check the data and it seems alright.
I have cut the number of rows from 24 million to about 5 million, which was the goal from the start, but this warning is bugging me...
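For reference, here is a sketch of one way to address both points (the file and column names are taken from the code above; everything else is an assumption): reading every column as text silences the mixed-type warning, and remembering the (cik, ip) keys across chunks catches duplicates that drop_duplicates cannot see, because it only looks at one chunk at a time.
import pandas as pd

chunksize = 10 ** 5
seen = set()          # (cik, ip) pairs already written; must fit in memory
first = True

for chunk in pd.read_csv("log20170628.csv", chunksize=chunksize, dtype=str):
    # Remove duplicates within this chunk first.
    chunk = chunk.drop_duplicates(subset=["cik", "ip"])
    # Then drop rows whose key already appeared in an earlier chunk.
    keys = list(zip(chunk["cik"], chunk["ip"]))
    mask = [k not in seen for k in keys]
    seen.update(keys)
    chunk = chunk[mask]
    chunk[["ip", "date", "cik"]].to_csv(
        "cleanedlog20170628.csv",
        mode="w" if first else "a",   # overwrite on the first chunk, append afterwards
        header=first,                 # write the header only once
        index=False,
    )
    first = False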

Export data from SqlQuery to Excel sheet [duplicate]

I have a table with more than 3,000,000 rows. I have tried to export the data from it to Excel manually and with the SQL Server Management Studio Export Data functionality, but I have run into several problems:
when creating a .txt file manually by copying and pasting the data (in several passes, because if you copy all rows from SQL Server Management Studio at once it throws an out-of-memory error), I am not able to open it with any text editor or to copy the rows;
the Export Data to Excel option does not work, because Excel does not support that many rows.
Finally, with the Export Data functionality I have created a .sql file, but it is 1.5 GB and I am not able to open it in SQL Server Management Studio again.
Is there a way to import it with the Import Data functionality, or some cleverer way to make a backup of the data in my table and then import it again if I need it?
Thanks in advance.
I am not quite sure I understand your requirements (I don't know whether you need to export your data to Excel or whether you want to make some kind of backup).
In order to export data from single tables, you could use the Bulk Copy Program (bcp), which allows you to export data from single tables and export/import it to and from files. You can also use a custom query to export the data.
Note that this does not generate an Excel file, but another format. You could use it to move data from one database to another (it must be MS SQL in both cases).
Examples:
Create a format file:
bcp [TABLE_TO_EXPORT] format "[EXPORT_FILE]" -n -f "[FORMAT_FILE]" -S [SERVER] -E -T -a 65535
Export all Data from a table:
bcp [TABLE_TO_EXPORT] out "[EXPORT_FILE]" -f "[FORMAT_FILE]" -S [SERVER] -E -T -a 65535
Import the previously exported data:
bcp [TABLE_TO_EXPORT] in "[EXPORT_FILE]" -f "[FORMAT_FILE]" -S [SERVER] -E -T -a 65535
I redirect the output from the export/import operations to a logfile (by appending "> mylogfile.log" at the end of the commands); this helps if you are exporting a lot of data.
Here is a way of doing it without bcp:
EXPORT THE SCHEMA AND DATA IN A FILE
Use the SSMS wizard:
Database >> Tasks >> Generate Scripts… >> choose the table >> choose the DB model and schema
Save the SQL file (it can be huge)
Transfer the SQL file to the other server
SPLIT THE DATA IN SEVERAL FILES
Use a program like a text file splitter to split the file into smaller files of 10,000 lines each (so no single file is too big)
Put all the files in the same folder, with nothing else
IMPORT THE DATA IN THE SECOND SERVER
Create a .bat file in the same folder, named execFiles.bat
You may need to check the table schema to disable the identity in the first file; you can add it back after the import is finished.
This will execute all the files in the folder against the server and the database; the -f 65001 option specifies the Unicode (UTF-8) text encoding so that accents are handled correctly:
for %%G in (*.sql) do sqlcmd /S ServerName /d DatabaseName -E -i"%%G" -f 65001
pause

Export from pig to CSV

I'm having a lot of trouble getting data out of Pig and into a CSV that I can use in Excel or SQL (or R or SPSS, etc.) without a lot of manipulation...
I've tried using the following function:
STORE pig_object INTO '/Users/Name/Folder/pig_object.csv'
USING CSVExcelStorage(',','NO_MULTILINE','WINDOWS');
It creates a folder with that name containing lots of part-m-0000# files. I can later join them all up using cat part* > filename.csv, but there's no header, which means I have to put it in manually.
I've read that PigStorageSchema is supposed to create another file with a header, but it doesn't seem to work at all; e.g. I get the same result as if it's just stored, with no header file:
STORE pig_object INTO '/Users/Name/Folder/pig_object'
USING org.apache.pig.piggybank.storage.PigStorageSchema();
(I've tried this in both local and mapreduce mode).
Is there any way of getting the data out of Pig into a simple CSV file without these multiple steps?
Any help would be much appreciated!
I'm afraid there isn't a one-liner that does the job, but you can come up with the following (Pig v0.10.0):
A = load '/user/hadoop/csvinput/somedata.txt' using PigStorage(',')
as (firstname:chararray, lastname:chararray, age:int, location:chararray);
store A into '/user/hadoop/csvoutput' using PigStorage('\t','-schema');
When PigStorage is given '-schema' it will create a '.pig_schema' and a '.pig_header' in the output directory. Then you have to merge '.pig_header' with the 'part-x-xxxxx' files:
1. If the result needs to be copied to the local disk:
hadoop fs -rm /user/hadoop/csvoutput/.pig_schema
hadoop fs -getmerge /user/hadoop/csvoutput ./output.csv
(Since -getmerge takes an input directory you need to get rid of .pig_schema first)
2. Storing the result on HDFS:
hadoop fs -cat /user/hadoop/csvoutput/.pig_header /user/hadoop/csvoutput/part-x-xxxxx \
  | hadoop fs -put - /user/hadoop/csvoutput/result/output.csv
For further reference you might also have a look at these posts:
STORE output to a single CSV?
How can I concatenate two files in hadoop into one using Hadoop FS shell?
If you store your data using PigStorage on HDFS and then merge it using -getmerge -nl:
STORE pig_object INTO '/user/hadoop/csvoutput/pig_object'
using PigStorage('\t','-schema');
fs -getmerge -nl /user/hadoop/csvoutput/pig_object /Users/Name/Folder/pig_object.csv;
Docs:
Optionally -nl can be set to enable adding a newline character (LF) at
the end of each file.
You will have a single TSV/CSV file with the following structure:
1 - header
2 - empty line
3 - pig schema
4 - empty line
5 - 1st line of DATA
6 - 2nd line of DATA
...
So we can simply remove lines 2, 3, and 4 using AWK:
awk 'NR==1 || NR>4 {print}' /Users/Name/Folder/pig_object.csv > /Users/Name/Folder/pig_object_clean.csv