Is there a way to split a DF using column name comparison? - python-3.x

I am extremely new to Python. I've created a DataFrame using a csv file. My file is a complex nested json file having header values at the lowest granular level.
[Example] df.columns = [ID1, fullID2, total.count, total.value, seedValue.id, seedValue.value1, seedValue.value2, seedValue.largeFile.id, seedValue.largeFile.value1, seedValue.largeFile.value2......]
Requirement: I have to create multiple smaller csvs using each of the columns that are granular and ID1, fullID2.
My approach that I figured is: save the smaller slices by splitting on the header value.
Problem 1: Not able to split the value correctly or traverse to the first location for comparison.
[Example]
I'm using df.columns.str.split('.').tolist(). Suppose I get the values listed below: I want to compare the seedValue of id with the seedValue of value1 and pull out this entire part as a new df.
[['seedValue', 'id'], ['seedValue', 'value1'], ['seedValue', 'value2']]
Problem 2: Adding ID1 and fullID2 to this new df.
Any help or direction to achieve this would be super helpful !
[Final output]
df.columns = [ID1, fullID2, total.count, total.value, seedValue.id, seedValue.value1, seedValue.value2, seedValue.largeFile.id, seedValue.largeFile.value1, seedValue.largeFile.value2......]
post-processing the file -
seedValue.columns = ID1,fullID2,id,value1,value2
total.columns = ID1,fullID2,count,value
seedValue.largeFile.columns = ID1,fullID2,id,value1,value2

I do not have your data, so I cannot provide a more specific solution, but I was able to reproduce a similar case with sample .csv data that shows how to achieve what you are aiming for.
To save each ID to a different file, we need to loop through the IDs. Since there may be duplicate IDs, the script saves each group of IDs into its own .csv file. Below is the script, with sample data included:
import pandas as pd

my_dict = {'ids': [11, 11, 33, 55, 55],
           'info': ["comment_1", "comment_2", "comment_3", "comment_4", "comment_5"],
           'other_column': ["something", "something", "something", "", "something"]}
# Create a dataframe from the sample data (this stands in for the .csv file)
df = pd.DataFrame(my_dict)
# Sort the rows by id
df = df.sort_values('ids')
# Loop through each group of ids and save it to its own file
for group_id, g in df.groupby('ids'):
    g.to_csv('id_{}.csv'.format(group_id), index=False)
And the output,
id_11.csv
id_33.csv
id_55.csv
For instance, within id_11.csv:
ids info other_column
11 comment_1 something
11 comment_2 something
Notice that we use the value of the ids field in the name of each file. Moreover, index=False means that no extra column holding the row index is written to the file.
ADDITIONAL INFO: I have used the Notebook in AI Platform within GCP to execute and test the code.

Compared to the more widely known pd.read_csv, pandas offers more granular json support through pd.json_normalize, which allows you to specify how to unnest the data, which meta-data to use, etc.
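As a rough illustration of pd.json_normalize, the sketch below flattens one nested record into dotted column names like the ones in the question. The record and its values are made up for the example; only the key names come from the question.

```python
import pandas as pd

# One made-up nested record shaped like the columns in the question.
records = [
    {
        "ID1": 1,
        "fullID2": "a",
        "total": {"count": 10, "value": 1.5},
        "seedValue": {"id": 7, "value1": "x", "largeFile": {"id": 70}},
    }
]

# Nested keys are joined with "." by default (the sep argument).
flat = pd.json_normalize(records)
print(sorted(flat.columns))
```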
Apart from this, reading nested fields from a csv into a two-dimensional dataframe might not be the ideal solution here, and having nested objects inside a dataframe can often be tricky to work with.
Try to read the file as a pure dictionary or list of dictionaries. You can then loop through the keys and create a custom logic to check how many more levels you want to go down, how to return the values and so on. Once you are on a lower level and prefer to have this inside of a dataframe, create a new temporary dataframe, then append these parts together inside the loop.
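If the data is already in a flat dataframe with dotted column names, one possible way to get the smaller frames from the question is to group the columns by everything before the last dot. This is only a sketch: split_by_prefix is a hypothetical helper name, and the sample values are invented.

```python
import pandas as pd

def split_by_prefix(df, id_cols):
    """Return {prefix: sub-DataFrame}, where prefix is the part before the last dot."""
    groups = {}
    for col in df.columns:
        if col in id_cols or "." not in col:
            continue
        prefix, leaf = col.rsplit(".", 1)
        # Remember how to rename the full column name to its last component.
        groups.setdefault(prefix, {})[col] = leaf
    return {
        prefix: df[id_cols + list(mapping)].rename(columns=mapping)
        for prefix, mapping in groups.items()
    }

# Made-up sample data using the column names from the question.
df = pd.DataFrame({
    "ID1": [1, 2],
    "fullID2": ["a", "b"],
    "total.count": [10, 20],
    "total.value": [1.5, 2.5],
    "seedValue.id": [7, 8],
    "seedValue.value1": ["x", "y"],
    "seedValue.value2": ["p", "q"],
    "seedValue.largeFile.id": [70, 80],
    "seedValue.largeFile.value1": ["m", "n"],
    "seedValue.largeFile.value2": ["u", "v"],
})

parts = split_by_prefix(df, ["ID1", "fullID2"])
for prefix, part in parts.items():
    print(prefix, list(part.columns))
```

Each entry of parts can then be written out with to_csv, for example using the prefix as the file name.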

Related

Parametrize and loop KQL queries in JupyterLab

My question is how to assign variables within a loop in KQL magic command in Jupyter lab. I refer to Microsoft's document on this subject and will base my question on the code given here:
https://learn.microsoft.com/en-us/azure/data-explorer/kqlmagic
1. First query below
%%kql
StormEvents
| summarize max(DamageProperty) by State
| order by max_DamageProperty desc
| limit 10
2. Second: Convert the resultant query to a dataframe and assign a variable to 'statefilter'
df = _kql_raw_result_.to_dataframe()
statefilter =df.loc[0].State
statefilter
3. This is where I would like to modify the above query and let statefilter hold multiple values (i.e. several different states):
df = _kql_raw_result_.to_dataframe()
statefilter =df.loc[0:3].State
statefilter
4. Finally, I would like to run my KQL query within a for loop for each of the values in statefilter. The syntax below may not be correct, but it gives an example of what I am looking for:
dfs = []  # an empty list to store dataframes
for state in statefilter:
    %%kql
    let _state = state;
    StormEvents
    | where State in (_state)
    | do some operations here for that specific state
    df = _kql_raw_result_.to_dataframe()
    dfs.append(df)  # store the df specific to state in the list
The reason I am not querying all the desired states within one KQL query is to avoid assigning really large query results to dataframes. This is not a concern for the sample StormEvents table, which has a reasonable size, but for my research data, which covers many sites and is really big. Therefore I would like to be able to run a KQL query/analysis for each site within a for loop and assign each site's query results to a dataframe. Please let me know if this is possible, or whether there may be other logical ways to do this within KQL...
There are a few ways to do it.
The simplest is to refactor your %%kql cell magic into a %kql line magic.
A line magic can be embedded in a Python cell.
Another option is: from Kqlmagic import kql
The Kqlmagic kql method accepts a kql cell or line as a string.
You can call kql from Python.
A third way is to call the kql magic via the IPython method (with ip = get_ipython(); the second argument is the magic's line, the third is the cell body):
ip.run_cell_magic('kql', '', {your kql magic cell text})
You can call it from Python.
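One way to picture the per-state loop from the question is to build each query as a plain string in Python and hand it to whatever runs the query. This is only a sketch: build_state_query and run_per_state are hypothetical helper names, run_query stands in for a callable you would write around one of the kql calls described above (for example one that returns _kql_raw_result_.to_dataframe()), and the state values are assumed to be trusted, plain names with no quotes in them.

```python
def build_state_query(state):
    """Build the KQL text for one state (assumes a trusted, quote-free name)."""
    return (
        f'let _state = "{state}";\n'
        "StormEvents\n"
        "| where State == _state\n"
        "| summarize max(DamageProperty) by State"
    )

def run_per_state(states, run_query):
    """Run one query per state and keep each result separately, keyed by state."""
    return {state: run_query(build_state_query(state)) for state in states}

# Demo with a stand-in runner that just echoes the query text.
results = run_per_state(["TEXAS", "OHIO"], lambda query: query)
print(results["TEXAS"])
```

Keeping one result per state in a dictionary means no single dataframe has to hold the data for all sites at once.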
Example of using the single line magic mentioned by Michael and a return statement that converted the result to JSON. Without the conversion to JSON I wasn't getting anything back.
def testKQL():
    %kql DatabaseName | take 10000
    return _kql_raw_result_.to_dataframe().to_json(orient='records')

How to include SQLAlchemy Data types in Python dictionary

I've written an application using Python 3.6, pandas and sqlalchemy to automate the bulk loading of data into various back-end databases and the script works well.
As a brief summary, the script reads data from various excel and csv source files, one at a time, into a pandas dataframe and then uses the df.to_sql() method to write the data to a database. For maximum flexibility, I use a JSON file to provide all the configuration information including the names and types of source files, the database engine connection strings, the column titles for the source file and the column titles in the destination table.
When my script runs, it reads the JSON configuration, imports the specified source data into a dataframe, renames source columns to match the destination columns, drops any columns from the dataframe that are not required and then writes the dataframe contents to the database table using a call similar to:
df.to_sql(strTablename, con=engine, if_exists="append", index=False, chunksize=5000, schema="dbo")
The problem I have is that I would also like to specify the column data types in the df.to_sql method and provide them as inputs from the JSON configuration file. However, this doesn't appear to be possible, because all the strings in the JSON file need to be enclosed in quotes and they are then not translated into types when read by my code. This is how the df.to_sql call should look:
df.to_sql(strTablename, con=engine, if_exists="append", dtype=dictDatatypes, index=False, chunksize=5000, schema="dbo")
The entries that form the dtype dictionary from my JSON file look like this:
"Data Types": {
"EmployeeNumber": "sqlalchemy.types.NVARCHAR(length=255)",
"Services": "sqlalchemy.types.INT()",
"UploadActivities": "sqlalchemy.types.INT()",
......
and there are many more, one for each column.
However, when the above is read in as a dictionary and passed to the df.to_sql method, it doesn't work, because the SQLAlchemy data types shouldn't be enclosed in quotes, and I can't get around this in my JSON file. The dictionary values therefore aren't recognised by pandas. They look like this:
{'EmployeeNumber': 'sqlalchemy.types.INT()', ....}
And they really need to look like this:
{'EmployeeNumber': sqlalchemy.types.INT(), ....}
Does anyone have experience of this to suggest how I might be able to have the sqlalchemy datatypes in my configuration file?
You could use eval() to convert the string names to objects of that type:
from pprint import pprint
import sqlalchemy as sa
dict_datatypes = {"EmployeeNumber": "sa.INT", "EmployeeName": "sa.String(50)"}
pprint(dict_datatypes)
"""console output:
{'EmployeeName': 'sa.String(50)', 'EmployeeNumber': 'sa.INT'}
"""
# Replace each string with the object it evaluates to
for key in dict_datatypes:
    dict_datatypes[key] = eval(dict_datatypes[key])
pprint(dict_datatypes)
"""console output:
{'EmployeeName': String(length=50),
'EmployeeNumber': <class 'sqlalchemy.sql.sqltypes.INTEGER'>}
"""
Just be sure that you do not pass untrusted input values to functions like eval() and exec().
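If the configuration file cannot be fully trusted, one alternative to eval() is to parse the type string and only look up names you explicitly allow. The following is a minimal sketch using only the standard library; parse_type_spec and resolve_dtype are hypothetical helper names, and the allow-list is an assumption you would adapt to the types your config actually uses.

```python
import ast

def parse_type_spec(spec):
    """Split e.g. 'sqlalchemy.types.NVARCHAR(length=255)' into (name, args, kwargs)."""
    node = ast.parse(spec, mode="eval").body
    args, kwargs = [], {}
    if isinstance(node, ast.Call):
        # literal_eval only accepts literals, so arbitrary expressions are rejected.
        args = [ast.literal_eval(a) for a in node.args]
        kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
        node = node.func
    if isinstance(node, ast.Attribute):
        name = node.attr
    elif isinstance(node, ast.Name):
        name = node.id
    else:
        raise ValueError(f"unsupported type spec: {spec!r}")
    return name, args, kwargs

def resolve_dtype(spec, namespace, allowed):
    """Look up an allowed type name on namespace and instantiate it."""
    name, args, kwargs = parse_type_spec(spec)
    if name not in allowed:
        raise ValueError(f"type {name!r} is not allowed")
    return getattr(namespace, name)(*args, **kwargs)

print(parse_type_spec("sqlalchemy.types.NVARCHAR(length=255)"))
print(parse_type_spec("sa.INT"))
```

With SQLAlchemy you would call something like resolve_dtype(spec, sqlalchemy.types, {"INT", "NVARCHAR"}) for each entry of the "Data Types" dictionary. Unlike the eval() version, a bare name such as "sa.INT" is instantiated here rather than returned as a class.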

Convert CSV files from multiple directory into parquet in PySpark

I have CSV files from multiple paths that are not parent directories in s3 bucket. All the tables have the same partition keys.
the directory of the s3:
table_name_1/partition_key_1 = <pk_1>/partition_key_2 = <pk_2>/file.csv
table_name_2/partition_key_1 = <pk_1>/partition_key_2 = <pk_2>/file.csv
...
I need to convert these csv files into parquet files and store them in another s3 bucket that has the same directory structure.
the directory of another s3:
table_name_1/partition_key_1 = <pk_1>/partition_key_2 = <pk_2>/file.parquet
table_name_2/partition_key_1 = <pk_1>/partition_key_2 = <pk_2>/file.parquet
...
My current solution iterates through the S3 bucket, finds each CSV file, converts it to parquet and saves it to the other S3 path. I find this inefficient, because it loops and converts one file at a time.
I want to utilize the spark library to improve the efficiency.
Then, I tried:
spark.read.csv('s3n://bucket_name/table_name_1/').write.partitionBy('partition_key_1', 'partition_key_2').parquet('s3n://another_bucket/table_name_1')
This works well for each table, but to optimize it further, I want to take the table_name as a parameter, something like:
TABLE_NAMES = [table_name_1, table_name_2, ...]
spark.read.csv('s3n://bucket_name/{*TABLE_NAMES}/').write.partitionBy('partition_key_1', 'partition_key_2').parquet('s3n://another_bucket/{*TABLE_NAMES}')
Thanks
The linked question (see the related links below) provides solutions for reading multiple files at once. The method spark.read.csv(...) accepts one path or a list of paths, so you can apply the same logic for reading. When it comes to writing, however, Spark merges all the given paths into one DataFrame, and a single DataFrame cannot be split into several without applying custom logic first. In short, there is no method that writes the initial DataFrame directly into multiple directories, i.e. nothing like df.write.csv(*TABLE_NAMES).
The good news is that Spark provides a dedicated function, input_file_name(), which returns the file path of the current record. You can use it in combination with TABLE_NAMES to filter on the table name.
Here is one possible (untested) PySpark solution:
from pyspark.sql.functions import input_file_name

TABLE_NAMES = ["table_name_1", "table_name_2"]  # ...
source_path = "s3n://bucket_name"
input_paths = [f"{source_path}/{t}" for t in TABLE_NAMES]
# spark.read.csv takes a single path or a list of paths
all_df = (spark.read.csv(input_paths)
          .withColumn("file_name", input_file_name())
          .cache())
dest_path = "s3n://another_bucket"

def write_table(table_name: str) -> None:
    # Keep only the rows that were read from this table's files
    (all_df.where(all_df["file_name"].contains(table_name))
     .write
     .partitionBy('partition_key_1', 'partition_key_2')
     .parquet(f"{dest_path}/{table_name}"))

for t in TABLE_NAMES:
    write_table(t)
Explanation:
We generate the input paths and store them in input_paths. This creates paths such as: s3n://bucket_name/table1, s3n://bucket_name/table2 ... s3n://bucket_name/tableN.
Then we load all the paths into one dataframe and add a new column called file_name, which holds the source path of each row. Notice that we also use cache here. This is important because the following code triggers len(TABLE_NAMES) separate actions, and caching prevents Spark from loading the data source again for each of them.
Next we create write_table, which is responsible for saving the data for the given table. It first filters on the table name using all_df["file_name"].contains(table_name), which returns only the records whose file_name column contains the value of table_name, and then saves the filtered data as you already did.
In the last step we call write_table for every item of TABLE_NAMES.
Related links
How to import multiple csv files in a single load?
Get HDFS file path in PySpark for files in sequence file format

How to load different files into different tables, based on file pattern?

I'm running a simple PySpark script, like this.
base_path = '/mnt/rawdata/'
file_names = ['2018/01/01/ABC1_20180101.gz',
              '2018/01/02/ABC2_20180102.gz',
              '2018/01/03/ABC3_20180103.gz',
              '2018/01/01/XYZ1_20180101.gz',
              '2018/01/02/XYZ1_20180102.gz']
for f in file_names:
    print(f)
So, just testing this, I can find the files and print the strings just fine. Now, I'm trying to figure out how to load the contents of each file into a specific table in SQL Server. The thing is, I want to do a wildcard search for files that match a pattern, and load specific files into specific tables. So, I would like to do the following:
load all files with 'ABC' in the name, into my 'ABC_Table' and all files with 'XYZ' in the name, into my 'XYZ_Table' (all data starts on row 2, not row 1)
load the file name into a field named 'file_name' in each respective table (I'm totally fine with the entire string from 'file_names' or the part of the string after the last '/' character; doesn't matter)
I tried to use Azure Data Factory for this, and it can recursively loop through all files just fine, but it doesn't get the file names loaded, and I really need the file names in the table to distinguish which records are coming from which files & dates. Is it possible to do this using Azure Databricks? I feel like this is an achievable ETL process, but I don't know enough about ADB to make this work.
Update based on Daniel's recommendation
dfCW = sc.sequenceFile('/mnt/rawdata/2018/01/01/ABC%.gz/').toDF()
dfCW.withColumn('input', input_file_name())
print(dfCW)
Gives me:
com.databricks.backend.daemon.data.common.InvalidMountException:
What can I try next?
You can use input_file_name from pyspark.sql.functions, e.g.
from pyspark.sql.functions import col, input_file_name
withFiles = df.withColumn("file", input_file_name())
Afterwards you can create multiple dataframes by filtering on the new column
abc = withFiles.filter(col("file").like("%ABC%"))
xyz = withFiles.filter(col("file").like("%XYZ%"))
and then use regular writer for both of them.
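The same routing idea can be sketched in plain Python on the list of file names from the question, which may be handy for deciding up front which paths belong to which table. route_files is a hypothetical helper name; the tag-to-table mapping follows the 'ABC_Table' and 'XYZ_Table' names from the question.

```python
from collections import defaultdict

def route_files(file_names, table_by_tag):
    """Group file paths by the first tag found in the file's base name."""
    routed = defaultdict(list)
    for path in file_names:
        base = path.rsplit("/", 1)[-1]  # part after the last '/'
        for tag, table in table_by_tag.items():
            if tag in base:
                routed[table].append(path)
                break
    return dict(routed)

file_names = ['2018/01/01/ABC1_20180101.gz',
              '2018/01/02/ABC2_20180102.gz',
              '2018/01/03/ABC3_20180103.gz',
              '2018/01/01/XYZ1_20180101.gz',
              '2018/01/02/XYZ1_20180102.gz']

routed = route_files(file_names, {"ABC": "ABC_Table", "XYZ": "XYZ_Table"})
for table, paths in routed.items():
    print(table, paths)
```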

How to filter a CSV file without Pandas? (Best Substitute for Pandas in Pythonista)

I am trying to do some data analysis on Pythonista 3 (iOS app for python), however because of the C libraries of pandas it does not compile in the iOS device.
Is there any substitute for Pandas?
Would numpy be an option for data of type string?
The data set I have at the moment is the history of messages between my friends and me.
The whole history is in one csv file. Each row has the columns 'day_of_the_week', 'date', 'time_of_message', 'author_of_message', 'message_body'
The goal of the analysis is to produce a report of our chat for the past year.
I want to be able to count the number of messages each friend sent. I also want to be able to plot a histogram of the hours in which the messages were sent by each friend.
Then, I want to do some word counting individually and as a group.
In Pandas I know how to do that. For example:
import pandas as pd
df = pd.read_csv("messages.csv")
number_of_messages_friend1 = len(df[df.author_of_message == 'friend1'])
How can I filter a csv file without Pandas?
Since Pythonista does have numpy, you will want to look at recarrays, which are numpy's approach to this type of problem. The following worked out of the box in Pythonista for me:
import numpy as np
df=np.recfromcsv('messages.csv')
len(df[df.author_of_message==b'friend1'])
Depending on your data format, you may find that recfromcsv "just works", since it tries to guess data types, or you might need to customize things a bit. See genfromtxt for a number of options, such as explicitly specifying data types or using converters to turn string dates into datetime objects. recfromcsv is just a convenience wrapper around genfromtxt.
https://docs.scipy.org/doc/numpy/user/basics.io.genfromtxt.html#
Once the data is in a recarray, many of the simple indexing operations work the same as in pandas. Note that, as in the example above, you may need to do string comparisons against b-prefixed strings (bytes objects) unless you convert the fields to unicode strings.
Use the csv module from the standard library to read the messages.
You could store it into a list of collections.namedtuple for easy access.
import csv
messages = []
with open('messages.csv') as csvfile:
    reader = csv.DictReader(csvfile, fieldnames=('day_of_the_week', 'date', 'time_of_message', 'author_of_message', 'message_body'))
    for row in reader:
        messages.append(row)
That gives you all the messages as a list of dictionaries.
Alternatively you could use a normal csv reader combined with a collections.namedtuple to make a list of named tuples, which are slightly easier to access.
import csv
from collections import namedtuple
Msg = namedtuple('Msg', ('day_of_the_week', 'date', 'time_of_message', 'author_of_message', 'message_body'))
messages = []
with open('messages.csv') as csvfile:
    msgreader = csv.reader(csvfile)
    for row in msgreader:
        messages.append(Msg(*row))
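Once the messages are in a list, the counts asked for in the question can be sketched with collections.Counter. The sample rows below are invented, the file is assumed to have no header row (as in the DictReader example above), and time_of_message is assumed to look like "HH:MM"; messages_per_author and hours_for_author are hypothetical helper names.

```python
import csv
import io
from collections import Counter

FIELDS = ('day_of_the_week', 'date', 'time_of_message',
          'author_of_message', 'message_body')

def messages_per_author(messages):
    """Count how many messages each author sent."""
    return Counter(m['author_of_message'] for m in messages)

def hours_for_author(messages, author):
    """Count one author's messages per hour of day (assumes 'HH:MM' times)."""
    return Counter(
        int(m['time_of_message'].split(':')[0])
        for m in messages
        if m['author_of_message'] == author
    )

# Made-up sample standing in for messages.csv.
sample = io.StringIO(
    "Mon,2020-01-06,09:15,friend1,hello\n"
    "Mon,2020-01-06,09:40,friend2,hi there\n"
    "Tue,2020-01-07,21:05,friend1,good night\n"
)
messages = list(csv.DictReader(sample, fieldnames=FIELDS))

counts = messages_per_author(messages)
hours = hours_for_author(messages, 'friend1')
print(counts)
print(hours)
```

The per-hour counter can be fed directly to a bar plot to get the histogram, and the same Counter approach works for word counts on message_body.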
Pythonista now has competition on iOS. The pyto app provides python 3.8 with pandas. https://apps.apple.com/us/app/pyto-python-3-8
