I am trying to pass data from one component to the next in an Azure ML pipeline. I am able to do it with simple code.
I have 2 components and I am defining them as below:
from azure.ai.ml import load_component

components_dir = "."
prep = load_component(source=f"{components_dir}/preprocessing_config.yml")
middle = load_component(source=f"{components_dir}/middle_config.yml")
Then I am defining a pipeline as below:
from azure.ai.ml import Output
from azure.ai.ml.dsl import pipeline

@pipeline(
    display_name="test_pipeline3",
    tags={"authoring": "sdk"},
    description="test pipeline to test things just like all other test pipelines.",
)
def data_pipeline(
    # raw_data: Input,
    compute_train_node: str,
):
    prep_node = prep()
    prep_node.outputs.Y_df = Output(type="uri_folder", mode="rw_mount", path="path/testing/")
    prep_node.outputs.S_df = Output(type="uri_folder", mode="rw_mount", path="path/testing/")
    transform_node = middle(Y_df=prep_node.outputs.Y_df,
                            S_df=prep_node.outputs.S_df)
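For reference, I then submit the pipeline roughly like this (the subscription, workspace, and compute names below are placeholders, not my real values):

from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

# Placeholder workspace details
ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace>",
)

# Build the pipeline job and run it on a placeholder compute cluster
pipeline_job = data_pipeline(compute_train_node="cpu-cluster")
pipeline_job.settings.default_compute = "cpu-cluster"
ml_client.jobs.create_or_update(pipeline_job, experiment_name="test_pipeline3")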
The prep node runs a script that uses Hydra to pull its parameters from a config file. The component config kicks off the script on the command line as below:
python data_processing.py
--Y_df ${{outputs.Y_df}}
--S_df ${{outputs.S_df}}
I try to get the values of Y_df.path and S_df.path in the main function of the prep script as below:
import argparse
from pathlib import Path

import hydra
from omegaconf import DictConfig

@hydra.main(version_base=None, config_path=".", config_name="config_file")
def main(cfg: DictConfig):
    parser = argparse.ArgumentParser("prep")
    parser.add_argument("--Y_df", type=str, help="Path of prepped data")
    parser.add_argument("--S_df", type=str, help="Path of prepped data")
    args = parser.parse_args()

    # Call the preprocessing function with Hydra configurations
    df1, df2 = processing_func(cfg.data_name, cfg.prod_filter)
    df1.to_csv(Path(cfg.Y_df) / "Y_df.csv")
    df2.to_csv(Path(cfg.S_df) / "S_df.csv")
When I run all of this, I get an error in the prep component itself saying:
Execution failed. User process 'python' exited with status code 2. Please check log file 'user_logs/std_log.txt' for error details. Error: /bin/bash: /azureml-envs/azureml_bbh34278yrnrfuehn78340/lib/libtinfo.so.6: no version information available (required by /bin/bash)
usage: data_processing.py [--help] [--hydra-help] [--version]
[--cfg {job,hydra,all}] [--resolve]
[--package PACKAGE] [--run] [--multirun]
[--shell-completion] [--config-path CONFIG_PATH]
[--config-name CONFIG_NAME]
[--config-dir CONFIG_DIR]
[--experimental-rerun EXPERIMENTAL_RERUN]
[--info [{all,config,defaults,defaults-tree,plugins,searchpath}]]
[overrides ...]
data_processing.py: error: unrecognized arguments: --Y_df --S_df /mnt/azureml/cr/j/ffyh7fs984ryn8f733ff3/cap/data-capability/wd/S_df
The code runs fine and data is transferred between the components when Hydra is not involved, but as soon as Hydra is involved I get this error. Why is that so?
Edit: Below is the data component config file for prep:
$schema: https://azuremlschemas.azureedge.net/latest/commandComponent.schema.json
type: command
name: preprocessing24
display_name: preprocessing24
outputs:
  Y_df:
    type: uri_folder
  S_df:
    type: uri_folder
code: ./preprocessing_final
environment: azureml:datapipeline-environment:4
command: >-
  python data_processing.py
  --Y_df ${{outputs.Y_df}}
  --S_df ${{outputs.S_df}}
The data preprocessing config file just contains a bunch of variables, but I have added 2 more, which are:
Y_df: random_txt
S_df: random_txt
The main function of the data processing script is shown above.
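For completeness, the whole config file presumably looks something like this (the data_name and prod_filter values here are placeholders; only these four keys are referenced by the script):

# config_file.yaml (sketch, placeholder values)
data_name: my_dataset
prod_filter: my_filter
Y_df: random_txt
S_df: random_txt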
Hydra and argparse are natively not compatible, as Hydra handles the parsing itself. If you want to combine both, it is easiest not to use @hydra.main but Hydra's Compose API, which takes care of some but not all setup features (IIRC the custom logger output was not included the last time I used it). The arguments for the Compose API align with hydra.main; for the argparse side, use ArgumentParser.parse_known_args:
import sys
import argparse

from hydra import compose, initialize
from omegaconf import OmegaConf  # optional, for printing


def main():
    # Parse only our own arguments and keep everything else for Hydra
    parser = argparse.ArgumentParser("prep")
    parser.add_argument("--Y_df", type=str, help="Path of prepped data")
    parser.add_argument("--S_df", type=str, help="Path of prepped data")
    args, unparsed_args = parser.parse_known_args()  # <- ignore unknown args

    # Before running Hydra, remove the already parsed arguments
    sys.argv[1:] = unparsed_args

    # global initialization
    initialize(version_base=None, config_path="conf", job_name="test_app")
    cfg = compose(config_name="config", overrides=["db=mysql", "db.user=me"])
    print(OmegaConf.to_yaml(cfg))


if __name__ == "__main__":
    main()
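One caveat: initialize/compose only build the config object; they skip Hydra's usual run setup such as working-directory management and logging configuration, so paths and logs behave as in a plain Python script.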
Alternatively, you can parse the args before @hydra.main takes over, in a similar way:
import sys
import argparse

import hydra
from omegaconf import DictConfig

# guard with if __name__ == "__main__": if needed
parser = argparse.ArgumentParser("prep")
parser.add_argument("--Y_df", type=str, help="Path of prepped data")
parser.add_argument("--S_df", type=str, help="Path of prepped data")
args, unparsed_args = parser.parse_known_args()
sys.argv[1:] = unparsed_args  # strip the parsed args so Hydra does not see them


@hydra.main(version_base=None, config_path=".", config_name="config_file")
def main(cfg: DictConfig):
    # work with cfg and args, or merge them
    ...


if __name__ == "__main__":
    main()
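With either variant the component command from the question (python data_processing.py --Y_df ${{outputs.Y_df}} --S_df ${{outputs.S_df}}) can stay as it is, because argparse consumes those two flags before Hydra ever looks at sys.argv.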
OK, here is what was happening. This notation in the component command did not work:
python data_processing.py
--Y_df ${{outputs.Y_df}}
--S_df ${{outputs.S_df}}
That's because Hydra does not accept that flag notation (I think). Instead, this notation worked:
python data_processing.py '+Y_df=${{outputs.Y_df}}' '+S_df=${{outputs.S_df}}'
What this does is add those 2 new variables, Y_df and S_df, to the config; the leading + is Hydra's override syntax for appending a new key. These variables can then be accessed in the program just like all the other variables in the config file, via cfg.Y_df or cfg.S_df.
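In other words, the command section of the prep component file shown above ends up as:

command: >-
  python data_processing.py
  '+Y_df=${{outputs.Y_df}}'
  '+S_df=${{outputs.S_df}}'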
Also, only defining the main function doesn't work; you need to call it inside the Python file. – JayashankarGS, Jan 16 at 14:55
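That is, the decorated function still needs an explicit call at the bottom of the script, as in the snippets above:

if __name__ == "__main__":
    main()  # without this call the @hydra.main-decorated function never runs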