Python: Combine multiple zipped files and output to multiple csv files - python-3.x

Edit 2:
Adding a few sample lines for reference. The first line gives the column names.
field 1|field 2|field3|id
123|xxx|aaa|118
123|xxx|aaa|56
124|xxx|aaa|184
124|yyy|aaa|156
Edit:
Open to non-Python solutions (grep/awk etc. are OK)
The csv files are pipe-delimited "|"
I need to retain the headers
I have 20 .gz files (each ~100MB, zipped). Within each .gz file is a csv file, with many columns, including an index column 'id'. There are around 250 unique ids across all the files.
I need to output all the rows for each unique id to each csv (i.e. there should be 250 csv files generated).
How should I best do this?
I am currently using Python, but it takes around 1 minute to generate each csv; I would like to know if there is a faster solution.
import os
import pandas as pd

output_folder = 'indiv_ids/'

# get list of files
list_of_files = [filename for filename in os.listdir() if filename.endswith(".gz")]

# get list of unique ids
for i in range(len(list_of_files)):
    df = pd.read_csv(list_of_files[i], sep='|', usecols=['id'], dtype=str, engine='c')
    id_list = df['id'].unique()
    if len(id_list) == 250:
        break

# load into a list for each id
list_df = {id: [] for id in id_list}
for filename in list_of_files:
    df = pd.read_csv(filename, sep='|', dtype=str, engine='c')
    for id in id_list:
        df_id = df[df['id'] == id]
        list_df[id].append(df_id)

for id in id_list:
    # join into one big df
    df_full = pd.concat(list_df[id], axis=0)
    df_full.to_csv(f'{output_folder}{id}.csv', sep="|", index=False)

Updated Answer
Now that I have seen how your data looks, I think you want this:
gunzip -c *gz | awk -F'|' '$4=="id"{hdr=$0; next} {f=$4 ".csv"; if (!seen[f]++) print hdr > f; print > f}'
That splits on "|", remembers each header line (the one whose 4th field is literally "id"), sends every data row to the csv named after its own id, and writes the remembered header the first time each output file is opened.
Original Answer
I presume you're asking for "any faster solution" that permits non-Python solutions, so I would suggest awk.
I generated 4 files of 1000 lines of dummy data like this:
for ((i=0;i<4;i++)) ; do
    perl -E 'for($i=0;$i<1000;$i++){say "Line $i,field2,field3,",int rand 250}' | gzip > $i.gz
done
Here are the first few lines of one of the files. The fourth field ranges over 0..249 and is supposed to be like your id field.
Line 0,field2,field3,81
Line 1,field2,field3,118
Line 2,field2,field3,56
Line 3,field2,field3,184
Line 4,field2,field3,156
Line 5,field2,field3,87
Line 6,field2,field3,118
Line 7,field2,field3,59
Line 8,field2,field3,119
Line 9,field2,field3,183
Line 10,field2,field3,90
Then you can process like this:
gunzip -c *gz | awk -F, '{ id=$4; print > id ".csv" }'
That says... "Unzip all the .gz files without deleting them and pass the results to awk. Within awk the field separator is the comma. The id should be picked up from the 4th field of each line. Each line should be printed to an output file whose name is id followed by .csv".
You should get 250 CSV files... pretty quickly.
Note: If you run out of open file descriptors, you may need to raise the limit. Try running the following commands:
help ulimit
ulimit -n 500
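Not part of the answer above, but for comparison, here is a minimal single-pass Python sketch of the same idea applied to the pipe-delimited files from the question. The output folder name is taken from the question's code and is assumed to exist already, and one file handle stays open per id, so the note about open file descriptors applies here too:
import csv
import glob
import gzip

output_folder = 'indiv_ids/'   # assumed to exist already
writers = {}                   # id value -> (file handle, csv.writer)

for name in glob.glob('*.gz'):
    with gzip.open(name, 'rt', newline='') as fh:
        reader = csv.reader(fh, delimiter='|')
        header = next(reader)             # first line holds the column names
        id_col = header.index('id')       # position of the 'id' column
        for row in reader:
            key = row[id_col]
            if key not in writers:
                out = open(f'{output_folder}{key}.csv', 'w', newline='')
                w = csv.writer(out, delimiter='|')
                w.writerow(header)        # retain the header in every output file
                writers[key] = (out, w)
            writers[key][1].writerow(row)

# close all output files
for out, _ in writers.values():
    out.close()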

Related

arrange the text files side by side to make a matrix file [duplicate]

I have 3000 text files in a directory, and each .txt file contains a single column of data. I want to arrange them side by side to make an m x n matrix file.
For this I tried
printf "%s\n" *.txt | sort -n | xargs -d '\n' paste
However it gives the error paste: filename.txt: Too many open files
Please suggest a better solution for this using Python.
For relatively short files that include a unique column name in the first line of each text file, you can do the following:
import pandas as pd
from pathlib import Path

def listFiles(ddir):
    # return a list of txt files in the file directory specified by ddir
    p = Path(ddir)
    return list(p.glob('*.txt'))

def readfile_toDataframe(file):
    # create a dataframe from the contents of file
    return pd.read_csv(file)

def joinDfs(list_of_files):
    # Read a list of files and join their contents on the index
    dfo = readfile_toDataframe(list_of_files[0])
    for f in list_of_files[1:]:
        dfo = dfo.join(readfile_toDataframe(f))
    return dfo
by running:
joinDfs(listFiles(ddir))
where ddir is a string variable pointing to the file directory, you will read the files and create a DataFrame of their contents with each file being a column.
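Repeated .join calls copy the growing frame at every step, so for a few thousand files a single pd.concat along the column axis may be preferable. A minimal sketch reusing listFiles from above (the './data' path and the output file name are assumptions):
import pandas as pd

frames = [pd.read_csv(f) for f in listFiles('./data')]  # './data' is an assumed directory
matrix = pd.concat(frames, axis=1)                      # place the single columns side by side
matrix.to_csv('matrix.txt', sep=' ', index=False)       # hypothetical output file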

How to separate lines of data read from a textfile? Customers with their orders

I have this data in a text file (it doesn't have the spacing I added below for clarity).
I am using Python3:
orders = open('orders.txt', 'r')
lines = orders.readlines()
I need to loop through the lines variable that contains all the lines of the data and separate out the CO lines as I've spaced them.
CO lines are customers, and the lines below each CO are the orders that customer placed.
Each CO line tells us how many lines of orders follow it: characters [7:10] of the CO string hold the count.
I illustrate this below.
CO77812002D10212020 <---(002)
125^LO917^11212020. <----line 1
235^IL993^11252020 <----line 2
CO77812002S10212020
125^LO917^11212020
235^IL993^11252020
CO95307005D06092019 <---(005)
194^AF977^06292019 <---line 1
72^L223^07142019 <---line 2
370^IL993^08022019 <---line 3
258^Y337^07072019 <---line 4
253^O261^06182019 <---line 5
CO30950003D06012019
139^LM485^06272019
113^N669^06192019
249^P530^07112019
CO37501001D05252020
479^IL993^06162020
I have thought of a brute force way of doing this but it won't work against much larger datasets.
Any help would be greatly appreciated!
You can use fileinput (source) to "simultaneously" read and modify your file. In fact, the in-place functionality that it offers to modify a file while parsing it is implemented through a second backup file. Specifically, as stated here:
Optional in-place filtering: if the keyword argument inplace=True is passed to fileinput.input() or to the FileInput constructor, the file is moved to a backup file and standard output is directed to the input file (...) by default, the extension is '.bak' and it is deleted when the output file is closed.
Therefore, you can format your file as specified this way:
import fileinput

with fileinput.input(files=['orders.txt'], inplace=True) as orders_file:
    for line in orders_file:
        if line[:2] == 'CO':  # Detect customer line
            orders_counter = 0
            num_of_orders = int(line[7:10])  # Extract number of orders
        else:
            orders_counter += 1
            # If last order for specific customer has been reached
            # append a '\n' character to format it as desired
            if orders_counter == num_of_orders:
                line += '\n'
        # Since standard output is redirected to the file, print writes in the file
        print(line, end='')
Note: it's supposed that the file with the orders is formatted exactly in the way you specified:
CO...
(order_1)
(order_2)
...
(order_i)
CO...
(order_1)
...
This did what I was hoping to get done!
tot_customers = []

with open("orders.txt", "r") as a_file:
    customer = []
    for line in a_file:
        stripped_line = line.strip()
        if stripped_line[:2] == "CO":
            customer.append(stripped_line)
            print("customers: ", customer)
            orders_counter = 0
            num_of_orders = int(stripped_line[7:10])
        else:
            customer.append(stripped_line)
            orders_counter += 1
            if orders_counter == num_of_orders:
                tot_customers.append(customer)
                customer = []
                orders_counter = 0
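If the embedded order count is not needed, a simpler variant of the same grouping idea is to key the orders off each CO line directly. This is only a sketch under the assumption that the file starts with a CO line (the dict name is made up):
orders_by_customer = {}  # hypothetical name: CO line -> list of its order lines

with open("orders.txt") as a_file:
    current = None
    for line in a_file:
        stripped = line.strip()
        if stripped[:2] == "CO":
            current = stripped
            orders_by_customer[current] = []
        elif stripped:
            # every non-CO, non-blank line belongs to the most recent customer
            orders_by_customer[current].append(stripped)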

How to compare two files containing many long strings then extract lines with at least n consecutive identical chars?

I have 2 large files, each containing long strings separated by newlines, but in different formats. I need to find similarities and differences between them. The problem is that the formats of the two files differ.
File a:
9217:NjA5MDAxNdaeag0NjE5NTIx.XUwXRQ.gat8MzuGfkj2pWs7z8z-LBFXQaE:dasda97sda9sdadfghgg789hfg87ghf8fgh87
File b:
NjA5MDAxNdaeag0NjE5NTIx.XUwXRQ.gat8MzuGfkj2pWs7z8z-LBFXQaE
So now I want to extract the whole line containing NjA5MDAxNdaeag0NjE5NTIx.XUwXRQ.gat8MzuGfkj2pWs7z8z-LBFXQaE from File a to a new file and also delete this line in File a.
I have tried achieving this with meld and got to the point where it will at least show me only the similarities. Say File a has 3000 lines and File b has 120 lines; now I want to find the lines with at least n consecutive identical chars and remove these from File a.
I found this and accordingly tried to use diff like this:
diff --unchanged-line-format='%L' --old-line-format='' \
--new-line-format='' a.txt b.txt
This didn't do anything; I got no output whatsoever, so I guess it exited with 0 and didn't find anything.
How can I make this work? I have Linux and Windows available.
Given the format of the files, the most efficient implementation would be something like this:
1. Load all b strings into a [hashtable] or [HashSet[string]]
2. Filter the contents of a by:
   - extracting the substring from each line with String.Split(':') or similar
   - checking whether it exists in the set from step 1
$FilterStrings = [System.Collections.Generic.HashSet[string]]::new(
    [string[]]@(
        Get-Content .\path\to\b
    )
)

Get-Content .\path\to\a | Where-Object {
    # Split the line into the prefix, middle, and suffix;
    # discard the prefix and suffix
    $null,$searchString,$null = $_.Split(":", 3)

    if($FilterStrings.Contains($searchString)){
        # we found a match, write it to the new file
        $searchString | Add-Content .\path\to\matchedStrings.txt
        # make sure it isn't passed through
        $false
    }
    else {
        # substring wasn't found to be in `b`, let's pass it through
        $true
    }
} | Set-Content .\path\to\filteredStrings.txt
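If a Python version is preferred, a rough sketch of the same idea (load b into a set, stream a, split on ':') could look like this; the file names are assumptions, and it writes the whole matching line rather than just the middle token:
# load every line of file b into a set for fast membership checks
with open('b.txt') as fb:
    filter_strings = {line.strip() for line in fb}

kept, matched = [], []
with open('a.txt') as fa:
    for line in fa:
        parts = line.rstrip('\n').split(':', 2)   # prefix, middle token, rest
        middle = parts[1] if len(parts) > 1 else ''
        (matched if middle in filter_strings else kept).append(line)

# lines from a whose middle token appears in b
with open('matchedStrings.txt', 'w') as out:
    out.writelines(matched)

# file a with those lines removed
with open('filteredStrings.txt', 'w') as out:
    out.writelines(kept)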

Extract n characters for the first match of a word in a file

I am a beginner in Python. I have a file containing a single line of data. My requirement is to extract "n" characters after certain words, for their first occurrence only. Also, those words are not sequential.
Data file: {"id":"1234566jnejnwfw","displayId":"1234566jne","author":{"name":"abcd#xyz.com","datetime":15636378484,"displayId":"23423426jne","datetime":4353453453}
I want to fetch the value after the first match of "displayId" and before "author", i.e. 1234566jne. Similarly for "datetime".
I tried breaking the line at the index of the word and writing the remainder to another file for further clean-up to get the exact value.
tmpFile = "tmpFile.txt"
tmpFileOpen = open(tmpFile, "w+")

with open("data file") as openfile:
    for line in openfile:
        tmpFileOpen.write(line[line.index("displayId") + len("displayId"):])
However, I am sure this is not a good solution to work further.
Can anyone please help me on this?
This answer should work for any displayId with a format similar to the one in your question. I decided not to load the JSON file for this answer, because it wasn't needed to accomplish the task.
import re

tmpFile = "tmpFile.txt"
tmpFileOpen = open(tmpFile, "w+")

with open('data_file.txt', 'r') as input:
    lines = input.read()

# Use regex to find the displayId element
# example: "displayId":"1234566jne
# \W matches non-word characters, such as " and :
# \d matches digits
# {6,8} matches runs of digits between 6 and 8 long
# [a-z] matches lowercase ASCII characters
# {3} matches 3 lowercase ASCII characters
id_patterns = re.compile(r'\WdisplayId\W{3}\d{6,8}[a-z]{3}')
id_results = re.findall(id_patterns, lines)

# Use a list comprehension to clean the results
clean_results = [s.strip('"displayId":"') for s in id_results]

# Loop through the clean_results list
for id in clean_results:
    # Write each id to the temp file on a separate line
    tmpFileOpen.write('{} \n'.format(id))

# output in tmpFileOpen
# 1234566jne
# 23423426jne
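A slightly more general variant of the same regex idea (my own tweak, not part of the original answer) is to capture whatever sits between "displayId":" and the next double quote:
import re

with open('data_file.txt') as fh:
    text = fh.read()

# capture everything between "displayId":" and the following double quote
ids = re.findall(r'"displayId":"([^"]+)"', text)
print(ids)  # expected: ['1234566jne', '23423426jne']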
This next approach does load the JSON file, but it will fail if the JSON file format changes.
import json
tmpFile = 'tmpFile.txt'
tmpFileOpen = open(tmpFile, "w+")
# Load the JSON file
jdata = json.loads(open('data_file.txt').read())
# Find the first ID
first_id = (jdata['displayId'])
# Write the first ID to the temp file
tmpFileOpen.write('{} \n'.format(first_id))
# Find the second ID
second_id = (jdata['author']['displayId'])
# Write the second ID to the temp file
tmpFileOpen.write('{} \n'.format(second_id))
# output in tmpFileOpen
# 1234566jne
# 23423426jne
If I understand your question correctly, you can achieve this by doing the following:
import json

tmpFile = "tmpFile.txt"
tmpFileOpen = open(tmpFile, "w+")

with open("data.txt") as openfile:
    for line in openfile:
        # Loads the json into a dict in order to manipulate it easily
        data = json.loads(str(line))
        # Here I specify that I want to write to my tmp file only the first 3
        # characters of the field `displayId`
        tmpFileOpen.write(data['displayId'][:3])
This can be done because the data in your file is JSON; however, if the format changes it won't work.

Write common elements of 2 csv files(having different no of columns) in a single file

I have 2 csv files of the following format-
File1
David
Lennon
File2
David 0.3
Lennon 1.3
Wright 2.5
Desired Output-
David 0.3
Lennon 1.3
I am reading both csv files and then checking whether the first column of each row in file 2 is also present in file 1; if it is, I want to keep that row and drop the rest, but I don't know how to get at the first element.
with open('file1.csv') as h:
    an = h.readlines()

with open('file2.csv') as n:
    non = n.readlines()

anno = []
for i in an:
    anno.append(i.decode('utf-8').strip())

diff = {}
for i in non:
    if i.decode('utf-8')[0].strip() in anno:
        diff[i[0]] = i[1]
I am getting an error on the last line; I presume this is not the right way to access the first and second columns of the csv file.
How to do it?
Okay so first of all, if you make use of the csv format, make sure you separate the values with commas (csv = comma separated values). So change file1 and file2 to this:
David
Lennon
and
David,0.3
Lennon,1.3
Wright,2.5
Okay, so if I understand correctly, you want to get out of file2 only the entries whose names are present in file1. I changed the variable names to less cryptic ones because I did not understand what you meant by them, but I kept the last dictionary as diff (the desired output) for clarity.
Now read the names from file1 and put them in a list with readlines. However, there is still some unwanted stuff in there: the "\n". I replace the newline character with nothing in a list comprehension, so only the names are left.
with open ("file1.csv") as file1:
data_file1 = [name.replace("\n", "") for name in file1.readlines()]
For file2, do the same thing but also split each line on the comma, so "David,0.3" becomes ["David", "0.3"]. Note that the values are still strings.
with open ("file2.csv") as file1:
data_file2 = [name.replace("\n", "").split(",") for name in file1.readlines()]
Now comparing the data from file1 and file2:
diff = {}
for line in data_file2:
    if line[0] in data_file1:
        diff[line[0]] = line[1]
Here line[0] is the name and line[1] the corresponding value for that name.
Now diff should return
>>> diff
{'David': '0.3', 'Lennon': '1.3'}
Cheers,
Jelle
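For reference, the same comparison can also be written with the csv module instead of manual splitting; this is only a sketch along the lines of the answer above:
import csv

# names listed in file1
with open("file1.csv", newline="") as f1:
    names = {row[0] for row in csv.reader(f1) if row}

# keep only the rows of file2 whose first column appears in file1
diff = {}
with open("file2.csv", newline="") as f2:
    for row in csv.reader(f2):
        if row and row[0] in names:
            diff[row[0]] = row[1]

print(diff)  # {'David': '0.3', 'Lennon': '1.3'}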
