I have to get the list of Azure storage containers with their sizes and then upload it to a BigQuery table using "bq load".
1/ First of all, I get my list of containers and put it in a CSV file with an az cli command.
I got a file like this
X X X
X X X
X X X
(119 lines)
Actually there is no "," between the columns, unlike a normal CSV file.
2/ Then I calculate their sizes and put them in another CSV file.
I got this output
22344
234565456
234565432
(119 lines)
3/ I want to combine them into one CSV file; I got this output:
X X X
, 22344
X X X
, 233445656
4/ When I run bq load I don't get 4 columns.
Can anyone help me?
paste -d, file1.csv file2.csv > output.csv
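Since file1.csv is space-separated, paste -d, only adds one comma per line, so bq load ends up seeing two columns instead of four. If it helps, here is a minimal Python sketch of the combining step; it assumes file1.csv holds the space-separated container fields and file2.csv holds one size per line (the same file names as in the paste command above):
import csv

# Assumed input names, matching the paste command above.
with open("file1.csv") as containers, open("file2.csv") as sizes, \
        open("output.csv", "w", newline="") as out:
    writer = csv.writer(out)
    for container_line, size_line in zip(containers, sizes):
        fields = container_line.split()      # split the space-separated container fields
        fields.append(size_line.strip())     # add the size from the second file
        writer.writerow(fields)              # write a properly comma-separated row
Loading an output.csv produced this way with bq load (and an explicit four-column schema) should then give you all four columns.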
With this input
x 1
x 2
x 3
y 1
y 2
y 3
I'd like to have this output
x 1;2;3
y 1;2;3
Thank you in advance,
Simone
If by terminal you mean something natively built in, you might not be in much luck; however, you could run a Python file from the terminal which could do what you want and more. If having a standalone file isn't possible, then you can always run Python in REPL mode for purely terminal usage.
If you have Python installed, all you need to do to access the REPL is run "py", and you could manually set up a processor. If you can use a file, then something like the code below should be able to take any input text and print the formatted text to the terminal.
file = open("data.txt", "r")
lines = file.readlines()
file.close()

same_starts = {}

# parse each line in the file and get the starting and trailing data for sorting
for line in lines:
    # remove trailing/leading whitespace and newlines
    line_norm = line.strip()
    # split the data at the first space in the line;
    # lines that don't match the format are skipped
    try:
        data_split = line_norm.split(' ')
        start = data_split[0]
        end = data_split[1]
    except IndexError:
        continue
    # check if the dictionary same_starts already has this start
    if same_starts.get(start):
        same_starts[start].append(end)
    else:
        # add a new list with the first element being this ending
        same_starts[start] = [end]

# format the final data into the needed output
final_output = ""
for key in same_starts:
    final_output += key + ' ' + ";".join(same_starts[key]) + '\n'

print(final_output)
NOTE: final_output is the text in the final formatting
Assuming you have Python installed, this file only needs to be run with the current directory set to the folder where it is stored, along with a text file called "data.txt" in the same folder which contains the values you want processed. Then run "py FILE_NAME.ex", replacing FILE_NAME.ex with the exact name of the Python file, extension included.
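For what it's worth, the same grouping can be written a little more compactly with collections.defaultdict; this is just a sketch assuming the same data.txt layout and the ";"-separated output asked for above:
from collections import defaultdict

groups = defaultdict(list)
with open("data.txt") as f:
    for line in f:
        parts = line.split()
        if len(parts) < 2:
            continue                      # skip blank or malformed lines
        groups[parts[0]].append(parts[1])

for key, values in groups.items():
    print(key, ";".join(values))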
I am trying to create an automatic pull in R, using the GET function from the httr package, for a CSV file located on GitHub.
Here is the table I am trying to download.
https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Confirmed.csv
I can make the connection to the file using the following GET request:
library(httr)
x <- httr::GET("https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Confirmed.csv")
However, I am unsure how to then convert that into a data frame similar to the table on GitHub.
Any assistance would be much appreciated.
I am new to R, but here is my solution.
You need to use the raw version of the CSV file from GitHub (raw.githubusercontent.com)!
library(httr)
x <- httr::GET("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Confirmed.csv")
# Save to file
bin <- content(x, "raw")
writeBin(bin, "data.csv")
# Read as csv
dat = read.csv("data.csv", header = TRUE, dec = ",")
colnames(dat) = gsub("X", "", colnames(dat))
# Group by country name (to sum regions)
# Skip the four first columns containing metadata
countries = aggregate(dat[, 5:ncol(dat)], by=list(Country.Region=dat$Country.Region), FUN=sum)
# Here is the table of the most recent total confirmed cases
countries_total = countries[, c(1, ncol(countries))]
The output graph
How I got this to work: How to sum a variable by group
This is as simple as:
res <- httr::GET("https://.../file.csv")
data <- httr::content(res, "parsed")
This requires the readr package.
See https://httr.r-lib.org/reference/content.html
I have a monitored directory containing a number of .csv files. I need to count the number of entries in each incoming .csv file. I want to do this in a PySpark streaming context.
This is what I did:
my_DStream = ssc.textFileStream(monitor_Dir)
test = my_DStream.flatMap(process_file)  # process_file simply processes each line, e.g. line.split(";")
print(len(test.collect()))
This does not give me the result that I want. For example, file1.csv contains 10 entries, file2.csv contains 18 entries, etc. So I need to see the output
10
18
..
..
etc
I have no problem doing the same task with one single static file, using RDD operations.
If someone is interested, this is what I did.
my_DStream = ssc.textFileStream(monitor_Dir)
DStream1 = my_DStream.flatMap(process_file)
DStream2 = DStream1.filter(lambda x: x[0])
lines_num = DStream2.count()
lines_num.pprint()
This gave the desired output.
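For anyone trying to reproduce this, here is a minimal driver sketch of the streaming setup these snippets assume; the app name, batch interval, monitored path and process_file body are all placeholders based on the question:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

def process_file(line):
    # as described in the question: simply split each line on ';'
    return line.split(";")

monitor_Dir = "monitor_dir/"                 # placeholder path to the watched folder

sc = SparkContext(appName="csv-line-count")  # placeholder app name
ssc = StreamingContext(sc, 10)               # 10-second batch interval (assumed)

my_DStream = ssc.textFileStream(monitor_Dir)
DStream1 = my_DStream.flatMap(process_file)
DStream2 = DStream1.filter(lambda x: x[0])   # filter as in the answer above
lines_num = DStream2.count()
lines_num.pprint()

ssc.start()
ssc.awaitTermination()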
Edit 2:
Adding in a few sample lines for reference. The first line is the column names.
field 1|field 2|field3|id
123|xxx|aaa|118
123|xxx|aaa|56
124|xxx|aaa|184
124|yyy|aaa|156
Edit:
Open to non-Python solutions (grep/awk etc are ok)
The csv files are pipe-delimited "|"
I need to retain the headers
I have 20 .gz files (each ~100MB, zipped). Within each .gz file is a csv file, with many columns, including an index column 'id'. There are around 250 unique ids across all the files.
I need to output all the rows for each unique id to each csv (i.e. there should be 250 csv files generated).
How should I best do this?
I am currently using Python, but it takes around 1 minute to generate each csv. I would like to know if there is any faster solution, please.
import os
import pandas as pd

output_folder = 'indiv_ids/'

# get list of files
list_of_files = [filename for filename in os.listdir() if filename.endswith(".gz")]

# get list of unique ids
for i in range(len(list_of_files)):
    df = pd.read_csv(list_of_files[i], sep='|', usecols=['id'], dtype=str, engine='c')
    id_list = df['id'].unique()
    if len(id_list) == 250:
        break

# load into a list for each id
list_df = {id: [] for id in id_list}
for filename in list_of_files:
    df = pd.read_csv(filename, sep='|', dtype=str, engine='c')
    for id in id_list:
        df_id = df[df['id'] == id]
        list_df[id].append(df_id)

for id in id_list:
    # join into one big df
    df_full = pd.concat(list_df[id], axis=0)
    df_full.to_csv(f'{output_folder}{id}.csv', sep="|", index=False)
Updated Answer
Now that I have seen how your data looks, I think you want this:
gunzip -c *gz | awk -F'|' '$4=="id"{hdr=$0; next} !seen[$4]++{print hdr > ($4 ".csv")} {print > ($4 ".csv")}'
That saves the header line of each file (recognised because its fourth field is literally "id"), writes it once to each new id's output file, and then appends every data line to the file named after its id.
Original Answer
I presume that asking for "any faster solution" permits non-Python solutions, so I would suggest awk.
I generated 4 files of 1000 lines of dummy data like this:
for ((i=0;i<4;i++)) ; do
perl -E 'for($i=0;$i<1000;$i++){say "Line $i,field2,field3,",int rand 250}' | gzip > $i.gz
done
Here are the first few lines of one of the files. The fourth field varies between 0 and 249 and is supposed to be like your id field.
Line 0,field2,field3,81
Line 1,field2,field3,118
Line 2,field2,field3,56
Line 3,field2,field3,184
Line 4,field2,field3,156
Line 5,field2,field3,87
Line 6,field2,field3,118
Line 7,field2,field3,59
Line 8,field2,field3,119
Line 9,field2,field3,183
Line 10,field2,field3,90
Then you can process like this:
gunzip -c *gz | awk -F, '{ id=$4; print > id ".csv" }'
That says... "Unzip all the .gz files without deleting them and pass the results to awk. Within awk the field separator is the comma. The id should be picked up from the 4th field of each line. Each line should be printed to an output file whose name is id followed by .csv".
You should get 250 CSV files... pretty quickly.
Note: If you run out of open file descriptors, you may need to raise the limit. Try running the following commands:
help ulimit
ulimit -n 500
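If you would rather stay in Python, here is a stream-oriented sketch along the same lines as the awk approach; it reads each .gz once with the gzip module and appends rows straight to the per-id files instead of building DataFrames. The output folder and the '|' delimiter come from the question; treating the fourth column as the id (like $4 in the awk command) is an assumption:
import gzip
import os

output_folder = 'indiv_ids/'
os.makedirs(output_folder, exist_ok=True)

handles = {}                                   # one open output file per id

for filename in sorted(os.listdir()):
    if not filename.endswith('.gz'):
        continue
    with gzip.open(filename, 'rt') as fin:
        header = next(fin)                     # first line of each file is the header
        for line in fin:
            row_id = line.rstrip('\n').split('|')[3]   # fourth column holds the id (assumed)
            out = handles.get(row_id)
            if out is None:
                out = open(f'{output_folder}{row_id}.csv', 'w')
                out.write(header)              # keep the header in every output file
                handles[row_id] = out
            out.write(line)

for out in handles.values():
    out.close()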
I'm trying to plot a csv file which is like this:
a 531049
b 122198
c 3411487
d 72420
e 1641
f 2181578
. .
. .
. .
but these values should be scaled using another csv file which is in the same format.
i.e. the other file:
a 45
b 12...
I want to plot 531049/45 and so on. The first column will be the x-axis and the second the y-axis.
How can I do this without merging the 2 files?
Gnuplot's using directive is meant to read data from a single file/stream, so you need to merge the two files somehow. I would use Python for this since it is my go-to tool for just about everything. I would write a script which reads from the 2 files and writes the merged data to standard output. Something like:
# merge.py
import sys

file1, scale_factor_file = sys.argv[1:]

# Read the scale factors into a dictionary
d = {}
with open(scale_factor_file) as sf:
    for line in sf:
        key, scale_factor = line.split()
        d[key] = float(scale_factor)

# Now open the other file, scaling as we go:
with open(file1) as fin:
    for line in fin:
        key, value = line.split()
        print(key, float(value) / d.get(key, 1.0))
Now you can use gnuplot's ability to read from pipes to plot your data:
plot '< python merge.py datafile file_with_scale_factors' using 2
If you want the letters from the first column as x-axis tick labels, using 2:xtic(1) will do that.