Commit 15fc2133 authored by Andrea Giannetti
Updated docs; Fixed check that train and test are independent

parent 91ff683a
# swiss_army_knife
The Swiss Army Knife project aims to assess whether CH3OH can be used as an effective volume density probe, and in which regime.
## Getting started
To make it easy for you to get started with GitLab, here's a list of recommended next steps.
Already a pro? Just edit this README.md and make it your own. Want to make it easy? [Use the template at the bottom](#editing-this-readme)!
## Add your files
- [ ] [Create](https://docs.gitlab.com/ee/user/project/repository/web_editor.html#create-a-file) or [upload](https://docs.gitlab.com/ee/user/project/repository/web_editor.html#upload-a-file) files
- [ ] [Add files using the command line](https://docs.gitlab.com/ee/gitlab-basics/add-file.html#add-a-file-using-the-command-line) or push an existing Git repository with the following command:
```
cd existing_repo
git remote add origin https://www.ict.inaf.it/gitlab/andrea.giannetti/swiss_army_knife.git
git branch -M main
git push -uf origin main
```
## Integrate with your tools
- [ ] [Set up project integrations](https://www.ict.inaf.it/gitlab/andrea.giannetti/swiss_army_knife/-/settings/integrations)
## Collaborate with your team
- [ ] [Invite team members and collaborators](https://docs.gitlab.com/ee/user/project/members/)
- [ ] [Create a new merge request](https://docs.gitlab.com/ee/user/project/merge_requests/creating_merge_requests.html)
- [ ] [Automatically close issues from merge requests](https://docs.gitlab.com/ee/user/project/issues/managing_issues.html#closing-issues-automatically)
- [ ] [Enable merge request approvals](https://docs.gitlab.com/ee/user/project/merge_requests/approvals/)
- [ ] [Automatically merge when pipeline succeeds](https://docs.gitlab.com/ee/user/project/merge_requests/merge_when_pipeline_succeeds.html)
## Test and Deploy
Use the built-in continuous integration in GitLab.
- [ ] [Get started with GitLab CI/CD](https://docs.gitlab.com/ee/ci/quick_start/index.html)
- [ ] [Analyze your code for known vulnerabilities with Static Application Security Testing (SAST)](https://docs.gitlab.com/ee/user/application_security/sast/)
- [ ] [Deploy to Kubernetes, Amazon EC2, or Amazon ECS using Auto Deploy](https://docs.gitlab.com/ee/topics/autodevops/requirements.html)
- [ ] [Use pull-based deployments for improved Kubernetes management](https://docs.gitlab.com/ee/user/clusters/agent/)
- [ ] [Set up protected environments](https://docs.gitlab.com/ee/ci/environments/protected_environments.html)
***
# Editing this README
When you're ready to make this README your own, just edit this file and use the handy template below (or feel free to structure it however you want - this is just a starting point!). Thank you to [makeareadme.com](https://www.makeareadme.com/) for this template.
## Suggestions for a good README
Every project is different, so consider which of these sections apply to yours. The sections used in the template are suggestions for most open source projects. Also keep in mind that while a README can be too long and detailed, too long is better than too short. If you think your README is too long, consider utilizing another form of documentation rather than cutting out information.
## Name
Choose a self-explaining name for your project.
## Description
Let people know what your project can do specifically. Provide context and add a link to any reference visitors might be unfamiliar with. A list of Features or a Background subsection can also be added here. If there are alternatives to your project, this is a good place to list differentiating factors.
## Badges
On some READMEs, you may see small images that convey metadata, such as whether or not all the tests are passing for the project. You can use Shields to add some to your README. Many services also have instructions for adding a badge.
## Visuals
Depending on what you are making, it can be a good idea to include screenshots or even a video (you'll frequently see GIFs rather than actual videos). Tools like ttygif can help, but check out Asciinema for a more sophisticated method.
## Installation
Within a particular ecosystem, there may be a common way of installing things, such as using Yarn, NuGet, or Homebrew. However, consider the possibility that whoever is reading your README is a novice and would like more guidance. Listing specific steps helps remove ambiguity and gets people to use your project as quickly as possible. If it only runs in a specific context, like a particular programming language version or operating system, or has dependencies that have to be installed manually, also add a Requirements subsection.
## Usage
Use examples liberally, and show the expected output if you can. It's helpful to inline the smallest usage example you can demonstrate, while providing links to more sophisticated examples if they are too long to reasonably include in the README.
## Support
Tell people where they can go for help. It can be any combination of an issue tracker, a chat room, an email address, etc.
## Roadmap
If you have ideas for releases in the future, it is a good idea to list them in the README.
## Contributing
State if you are open to contributions and what your requirements are for accepting them.
For people who want to make changes to your project, it's helpful to have some documentation on how to get started. Perhaps there is a script that they should run or some environment variables that they need to set. Make these steps explicit. These instructions could also be useful to your future self.
You can also document commands to lint the code or run tests. These steps help to ensure high code quality and reduce the likelihood that the changes inadvertently break something. Having instructions for running tests is especially helpful if it requires external setup, such as starting a Selenium server for testing in a browser.
## Authors and acknowledgment
Show your appreciation to those who have contributed to the project.
## License
For open source projects, say how it is licensed.
## Project status
If you have run out of energy or time for your project, put a note at the top of the README saying that development has slowed down or stopped completely. Someone may choose to fork your project or volunteer to step in as a maintainer or owner, allowing your project to keep going. You can also make an explicit request for maintainers.
Important resources are listed in the [documentation.md file](documentation/documentation.md).
@@ -201,6 +201,16 @@ def get_postprocessed_data(limit_rows: Union[None, int] = None,
def prepare_matrix(filename: str,
columns: list,
use_model_for_inference: Union[None, str] = None) -> pd.DataFrame:
"""
Retrieve and prepare the data matrix from a specified file and columns.
:param filename: The name of the file to read the data from.
:param columns: The list of columns to extract from the dataframe.
:param use_model_for_inference: The folder within prs/output/run_type to get the data for inference;
defaults to the fiducial model ('constant_abundance_p15_q05') if None is provided.
:return: A pandas DataFrame containing the specified columns from the file with 'nh2' and 'tdust' columns
rounded to one decimal place and converted to string type.
"""
_use_model_for_inference = validate_parameter(
use_model_for_inference,
default='constant_abundance_p15_q05'
@@ -214,6 +224,18 @@ def prepare_matrix(filename: str,
def get_data(limit_rows: Union[int, None] = None,
use_model_for_inference: Union[None, str] = None,
log_columns: Union[None, List] = None):
"""
Retrieve and preprocess dataset.
:param limit_rows: The number of rows to use from the original dataset; useful for running tests and limiting
computation time. Defaults to None, which uses all rows.
:param use_model_for_inference: The folder within prs/output/run_type to get the data for inference;
defaults to the fiducial model ('constant_abundance_p15_q05') if None is provided.
:param log_columns: The list of columns to apply a logarithmic transformation to. Defaults to
['log_nh2', 'log_tdust', 'avg_nh2', 'avg_tdust', 'molecule_column_density', 'std_nh2'] if None is provided.
:return: A pandas DataFrame containing the merged and processed data from multiple sources, with specified
columns logarithmically transformed.
"""
_use_model_for_inference = validate_parameter(
use_model_for_inference,
default='constant_abundance_p15_q05'
......
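For orientation, a usage sketch of the two helpers documented above; the module path, file name, and column choices are illustrative assumptions, not taken from the repository:

```
# Hypothetical usage of get_data and prepare_matrix; the import path and the
# file name are assumptions for illustration.
from assets.commons.training_utils import get_data, prepare_matrix

# Full preprocessed dataset, capped at 1000 rows to keep a test run fast.
data = get_data(limit_rows=1000)

# Data matrix for a single file and column subset, using the fiducial model.
matrix = prepare_matrix(filename='ratios.csv',
                        columns=['nh2', 'tdust'],
                        use_model_for_inference='constant_abundance_p15_q05')
```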
@@ -48,6 +48,17 @@ def compute_and_add_similarity_cols(average_features_per_target_bin: pd.DataFram
def plot_results(inferred_data: pd.DataFrame,
use_model_for_inference: str = None,
ratios_to_process: Union[List[List[str]], None] = None):
"""
Plot the results of inferred data against postprocessed data for specified line ratios, in order to quickly check the results.
:param inferred_data: The DataFrame containing the inferred data to plot.
:param use_model_for_inference: The folder within prs/output/run_type to get the data for inference;
defaults to the fiducial model ('constant_abundance_p15_q05') if None is provided.
:param ratios_to_process: The list of line ratios to plot, each specified as a list of two strings.
Defaults to [['87', '86'], ['88', '87'], ['88', '86'], ['257', '256'], ['381', '380']] if None is provided.
:return: None. Saves the plot as a PNG file named according to the model used for inference.
"""
_use_model_for_inference = validate_parameter(
use_model_for_inference,
default='constant_abundance_p15_q05'
@@ -234,9 +245,11 @@ def split_data(merged: pd.DataFrame,
6.561e+06
]
subsample = merged[
-        merged['nh2'].isin(nh2_list) & ~merged['nh2'].isin(_test_models) & ~merged['nh2'].isin(_validation_models)]
-    assert _test_models not in subsample.nh2.unique()
-    assert _validation_models not in subsample.nh2.unique()
+        merged['nh2'].isin(nh2_list) & (~merged['nh2'].isin(_test_models)) & (~merged['nh2'].isin(_validation_models))]
+    for nh2_test in _test_models:
+        assert nh2_test not in list(subsample['nh2'].round(1).unique())
+    for nh2_validation in _validation_models:
+        assert nh2_validation not in list(subsample['nh2'].round(1).unique())
y_sub = np.log10(subsample[target_column].copy())
x_sub = subsample[_predictor_columns].copy()
x_train = x_sub[(~condition_test) & (~condition_validation)].reset_index(drop=True)
......
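This hunk is the commit's headline fix: the old asserts compared the whole `_test_models` list against the array of unique densities, which never checks each held-out value individually; the new loops make the independence check explicit, one density at a time. A minimal sketch of the corrected pattern, with hypothetical values:

```
# Each held-out density must be absent from the training subsample.
test_models = [2.7e4]
validation_models = [2.187e6]
subsample_nh2 = [1.0e3, 1.0e4, 1.0e5]  # unique 'nh2' values in the subsample

for nh2_test in test_models:
    assert nh2_test not in subsample_nh2
for nh2_validation in validation_models:
    assert nh2_validation not in subsample_nh2
```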
@@ -19,7 +19,22 @@ from prs.prs_compute_integrated_fluxes_and_ratios import main as prs_main
from prs.prs_inspect_results import main as prs_inspection_main
-def compute_full_grid(tdust, nh2, line, density_keyword, dust_temperature_keyword) -> Tuple[float, float, int, str]:
+def compute_full_grid(tdust: float,
+                      nh2: float,
+                      line: int,
+                      density_keyword: str,
+                      dust_temperature_keyword: str) -> Tuple[float, float, int, str]:
"""
Compute the full grid for a given dust temperature, hydrogen density, and line identifier, and return the results.
:param tdust: The dust temperature.
:param nh2: The H2 number density.
:param line: The line identifier for the RADMC-3D observation.
:param density_keyword: The keyword for the density in the grid configuration.
:param dust_temperature_keyword: The keyword for the dust temperature in the grid configuration.
:return: A tuple containing the dust temperature, H2 number density, line identifier, and the name of the
resulting FITS file.
"""
scratch_dir = os.path.join('mdl', 'scratches', str(uuid.uuid4()))
stg_overrides = {
'grid': {
@@ -48,6 +63,15 @@ def compute_full_grid(tdust, nh2, line, density_keyword, dust_temperature_keywor
def initialize_queue(engine: sqlalchemy.engine,
run_id: str,
run_arguments: Iterator):
"""
Initialize the execution queue for a specific run ID with given run arguments if not already initialized.
:param engine: The SQLAlchemy engine used to interact with the database.
:param run_id: The unique identifier for the run.
:param run_arguments: An iterator of run arguments, each containing dust temperature, density, line,
density keyword, and dust temperature keyword.
:return: None. Inserts entries into the execution queue if the queue is not already initialized.
"""
is_initialized = engine.execute(f"select count(*) from tmp_execution_queue where run_id='{run_id}'").first()[0] != 0
if is_initialized is False:
for arguments in run_arguments:
@@ -75,6 +99,14 @@ def initialize_queue(engine: sqlalchemy.engine,
def get_run_pars(engine: sqlalchemy.engine,
run_id: str):
"""
Get the next pending row from the execution queue for the given run ID and mark it as done.
:param engine: The SQLAlchemy engine used to interact with the database.
:param run_id: The unique identifier for the run.
:return: The row corresponding to the next pending task in the execution queue not marked as done, or None if there
are no more pending tasks.
"""
sql_query = sqlalchemy.text(f"""UPDATE tmp_execution_queue
SET done = true
WHERE row_id = (SELECT row_id
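The statement is cut off by the hunk boundary, but the docstring describes a claim-next-pending pattern: atomically mark one pending row as done and hand it to the worker. A sketch of how such a claim can look with SQLAlchemy 1.x on PostgreSQL; everything past the visible WHERE clause is an assumption, since the diff truncates the query:

```
import sqlalchemy

# Hypothetical reconstruction of the queue-claim statement; the LIMIT and
# RETURNING clauses are assumptions, not taken from the truncated diff.
claim_sql = sqlalchemy.text("""UPDATE tmp_execution_queue
                               SET done = true
                               WHERE row_id = (SELECT row_id
                                               FROM tmp_execution_queue
                                               WHERE run_id = :run_id
                                                 AND done = false
                                               LIMIT 1)
                               RETURNING *""")

def claim_next(engine, run_id):
    # Returns the claimed row, or None when no pending task remains.
    return engine.execute(claim_sql, run_id=run_id).first()
```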
@@ -89,6 +121,14 @@
def verify_run(engine: sqlalchemy.engine,
run_id: str):
"""
Verify the completion status of a run, resetting tasks that were marked as done but lack an associated FITS cube back to pending.
:param engine: The SQLAlchemy engine used to interact with the database.
:param run_id: The unique identifier for the run.
:return: True if all completed tasks for the run have associated FITS cube names, indicating completion,
otherwise False.
"""
sql_query = sqlalchemy.text(f"""UPDATE tmp_execution_queue
SET done = false
WHERE row_id in (SELECT row_id
@@ -104,6 +144,14 @@
def insert_fits_name(engine: sqlalchemy.engine,
row_id: int,
fits_cube_name: str):
"""
Insert the FITS cube name into the row of the execution queue with the specified row ID.
:param engine: The SQLAlchemy engine used to interact with the database.
:param row_id: The unique identifier for the row in the execution queue.
:param fits_cube_name: The name of the FITS cube associated with the row.
:return: None. Updates the row in the execution queue with the FITS cube name.
"""
sql_query = sqlalchemy.text(f"""UPDATE tmp_execution_queue
SET fits_cube_name = '{fits_cube_name}'
WHERE row_id = {row_id}""")
@@ -111,6 +159,12 @@ def insert_fits_name(engine: sqlalchemy.engine,
def compute_grid_elements(run_id: str):
"""
Compute grid elements for a given run ID by initializing the execution queue with parallel arguments.
:param run_id: The unique identifier for the run.
:return: None. Initializes the execution queue for the specified run ID.
"""
init_db()
parallel_args, _ = get_parallel_args_and_nprocesses()
engine = get_pg_engine(logger=logger)
@@ -121,6 +175,11 @@
def get_parallel_args_and_nprocesses() -> Tuple[Iterator, int]:
"""
Get parallel computation arguments and the number of processes for computation.
:return: A tuple containing an iterator of parallel arguments and the number of processes to use.
"""
_tdust_model_type, _model_type, dust_temperatures, densities, line_pairs, n_processes, _ = parse_input_main()
line_set = set(chain.from_iterable(line_pairs))
density_keyword = 'central_density' if _model_type == 'homogeneous' else 'density_at_reference'
@@ -130,6 +189,13 @@
def compute_model(run_id: str):
"""
Compute a model associated with a given run ID from parameters retrieved from the execution queue.
:param run_id: The unique identifier for the run.
:return: None. Computes a model and updates the database with the associated FITS cube name if parameters are
available in the execution queue.
"""
engine = get_pg_engine(logger=logger)
parameters_set = get_run_pars(engine=engine,
run_id=run_id)
@@ -150,6 +216,12 @@
def initialize_run():
"""
Initialize a new run by generating a run ID if not provided, computing grid elements for the run,
and saving the run ID to a file for future reference.
:return: The generated or provided run ID.
"""
if args.run_id is not None:
run_id = args.run_id
else:
@@ -163,6 +235,13 @@
def compute_remaining_models(run_id: Union[None, str] = None) -> int:
"""
Compute the number of pending models for a given run ID.
:param run_id: Optional. The unique identifier for the run. If None, it defaults to the value of the
'run_id' environment variable.
:return: The number of remaining models.
"""
_run_id = validate_parameter(run_id, default=os.getenv('run_id'))
logger.info(_run_id)
sql_query = sqlalchemy.text(f"""SELECT count(*)
@@ -179,6 +258,14 @@
def get_results(engine: sqlalchemy.engine,
run_id: str):
"""
Retrieve the results of a given run ID from the execution queue.
:param engine: The SQLAlchemy engine used to interact with the database.
:param run_id: The unique identifier for the run.
:return: A list of tuples containing the dust temperature, density, line, and FITS cube name for each model
associated with the run ID.
"""
sql_query = sqlalchemy.text(f"""SELECT dust_temperature
, density
, line
@@ -190,6 +277,13 @@
def cleanup_tmp_table(run_id: str,
engine: sqlalchemy.engine):
"""
Cleanup the temporary execution queue for a given run ID.
:param run_id: The unique identifier for the run.
:param engine: The SQLAlchemy engine used to interact with the database.
:return: None. Deletes all rows from the execution queue associated with the specified run ID.
"""
sql_query = sqlalchemy.text(f"""DELETE
FROM tmp_execution_queue
WHERE run_id = '{run_id}'""")
@@ -246,6 +340,12 @@ def main_presentation_step(run_id: str,
def process_models(distributed: bool = False) -> Tuple[Union[None, dict], int]:
"""
Process models either in a distributed environment or on a single machine (with multiprocessing).
:param distributed: A boolean flag indicating whether to process models in parallel. Defaults to False.
:return: A tuple containing results (if processed in parallel) and the number of remaining models.
"""
if distributed is True:
compute_model(run_id=run_id)
results = None
......
@@ -91,13 +91,27 @@ def train_kde_model(ratio: List[str],
return ratio_string, model
-def get_kde(points_per_axis,
-            ratio_string,
-            training_data,
-            x=None,
-            y=None,
+def get_kde(points_per_axis: int,
+            ratio_string: str,
+            training_data: pd.DataFrame,
+            x: np.ndarray = None,
+            y: np.ndarray = None,
             best_bandwidth: float = None,
-            bw_adjustment_factor: Union[float, int] = 1):
+            bw_adjustment_factor: Union[float, int] = 1) -> tuple:
"""
Compute the Kernel Density Estimate (KDE) for a given ratio and training data.
:param points_per_axis: Number of points to use along each axis for the KDE grid.
:param ratio_string: The ratio string indicating which ratio of the training data to use.
:param training_data: The DataFrame containing the training data.
:param x: Optional. The x-axis values for the KDE grid. Defaults to a computed range if None.
:param y: Optional. The y-axis values for the KDE grid. Defaults to a computed range if None.
:param best_bandwidth: The best bandwidth to use for KDE. Defaults to 0.2 if None.
:param bw_adjustment_factor: The adjustment factor to apply to the bandwidth. Defaults to 1.
:return: A tuple containing the grid, the KDE model, the positions, x-axis values, y-axis values, and the
computed KDE values.
"""
_best_bandwidth = bw_adjustment_factor * validate_parameter(best_bandwidth, 0.2)
_x_bandwidth = bw_adjustment_factor * 0.2
log_nh2 = np.log10(training_data['avg_nh2'])
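For context, a minimal sketch of evaluating a kernel density estimate on a regular grid, in the spirit of what the docstring describes; scikit-learn's KernelDensity is assumed here for illustration, and the project's actual estimator, grid construction, and bandwidth handling may differ:

```
import numpy as np
from sklearn.neighbors import KernelDensity

# Stand-in 2D training data (e.g. log line ratio vs. log H2 density).
rng = np.random.default_rng(42)
training = rng.normal(size=(500, 2))

kde = KernelDensity(bandwidth=0.2).fit(training)

# Evaluate the density on a points_per_axis x points_per_axis grid.
points_per_axis = 50
x = np.linspace(training[:, 0].min(), training[:, 0].max(), points_per_axis)
y = np.linspace(training[:, 1].min(), training[:, 1].max(), points_per_axis)
xx, yy = np.meshgrid(x, y)
positions = np.column_stack([xx.ravel(), yy.ravel()])
values_on_grid = np.exp(kde.score_samples(positions)).reshape(xx.shape)
```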
@@ -128,6 +142,19 @@ def plot_kde_ratio_nh2(grid: np.array,
training_data: pd.DataFrame,
suffix_outfile: str = None,
ratio_limits: Union[None, list] = None):
"""
Plot the Kernel Density Estimate (KDE) of a ratio against average H2 density along the line-of-sight and save
the plot as a PNG file.
:param grid: The grid of x and y values used for the KDE.
:param values_on_grid: The computed KDE values on the grid.
:param ratio_string: The ratio string indicating which ratio of the training data to plot.
:param model_root_folder: The root folder where the model and figures are stored.
:param training_data: The DataFrame containing the training data.
:param suffix_outfile: Optional. The suffix to append to the output file name. Defaults to an empty string if None.
:param ratio_limits: Optional. The limits for the ratio axis. Defaults to None, which auto-scales the axis.
:return: None. Saves the plot as a PNG file in the specified folder.
"""
plt.rcParams.update({'font.size': 20})
_suffix_outfile = validate_parameter(suffix_outfile, default='')
plt.clf()
......
@@ -37,6 +37,17 @@ def write_radmc_input(filename: str,
path: Union[None, str] = None,
override_defaults: Union[None, dict] = None,
flatten_style: Union[None, str] = None):
"""
Write RADMC-3D input files with the specified grid metadata and quantities.
:param filename: The name of the file to write.
:param quantity: The array of quantities to be written to the file.
:param grid_metadata: A dictionary containing metadata about the grid.
:param path: Optional. The directory path where the file will be saved. Defaults to the current directory if None.
:param override_defaults: Optional. A dictionary to override default header values. Defaults to None.
:param flatten_style: Optional. The style to flatten the array (Fortran 'F' or C 'C' order). Defaults to 'F'.
:return: None. Writes the formatted data to the specified file.
"""
rt_metadata_default = {
'iformat': 1,
'grid_type': grid_metadata['grid_type'],
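The `flatten_style` parameter matters because RADMC-3D input files expect a specific memory layout; a quick illustration of the difference between C- and Fortran-order flattening:

```
import numpy as np

a = np.arange(6).reshape(2, 3)
print(a.flatten(order='C'))  # [0 1 2 3 4 5]  row-major (C order)
print(a.flatten(order='F'))  # [0 3 1 4 2 5]  column-major (Fortran order, the documented default)
```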
@@ -65,6 +76,14 @@
def write_radmc_lines_input(line_config: dict,
logger: logging.Logger,
path: Union[None, str] = None):
"""
Write the 'lines.inp' input file for RADMC-3D based on the provided line configuration.
:param line_config: A dictionary containing the line configuration, including mode, species, and collision partners.
:param logger: A logging.Logger instance for logging warnings and information.
:param path: Optional. The directory path where the file will be saved. Defaults to the current directory if None.
:return: None. Writes the 'lines.inp' file based on the configuration.
"""
_path = validate_parameter(path, default='.')
if line_config['lines_mode'] != 'lte':
......
@@ -18,7 +18,8 @@ from assets.commons.grid_utils import (get_grid_edges,
compute_los_average_weighted_profile)
from assets.commons.parsing import (get_grid_properties,
parse_grid_overrides)
from assets.commons.training_utils import compute_and_add_similarity_cols
from assets.commons.training_utils import (compute_and_add_similarity_cols,
split_data)
def create_test_config(config_dict: dict,
@@ -429,3 +430,23 @@ class TestTraining(TestCase):
np.nan_to_num(expected_result.round(5), nan=0)
)
)
def test_split_data(self):
merged = pd.DataFrame(
data=[
[2.7e4, 1, 1, 2],
[2.187e6, 2, 2, 4],
[1e3, 3, 3, 6],
[1e4, 4, 4, 8],
[1e5, 5, 5, 10],
[1e6, 6, 6, 12],
[1e7, 7, 7, 14]
],
columns=['nh2', 'tdust', 'predictor', 'target'])
x_test, x_train, x_validation, y_test, y_train, y_validation = split_data(
merged=merged,
target_column='target',
predictor_columns=['nh2', 'tdust', 'predictor']
)
self.assertListEqual(list((10**x_test['nh2'].unique()).round(1)), [2.7e4])
self.assertListEqual(list((10**x_validation['nh2'].unique()).round(1)), [2.187e6])
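A note on the assertions: the `10**` inversion indicates that `split_data` returns the predictors in log10 space, so the test recovers the original densities before comparing them with the held-out values; rounding to one decimal guards against floating-point noise.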