.. _batch_upload:

Uploading GSM data
------------------

The GSM provides both an API and a browser interface for uploading multiple sky survey catalogue files
into the GSM database in a single atomic batch operation.

The API is the recommended and primary method for uploading data. The browser-based interface is
**deprecated** and will be removed in a future release.

Base API URL
^^^^^^^^^^^^

To access the GSM API you need the base URL of the GSM installation. When the SDP is installed on the
DP cluster, use the internal service DNS name: ``http://ska-sdp-gsm.`` followed by the name of the SDP
control namespace.

Examples in this section that show only an endpoint path assume the path is prefixed with this base URL.

.. _upload_api:

API Endpoints
^^^^^^^^^^^^^

=============================================== ==================================================== ==============
Endpoint                                        Parameters                                           Description
=============================================== ==================================================== ==============
``POST /upload-sky-survey-batch``               - ``metadata_file``: JSON file with catalogue        :ref:`upload_ingest_ep`
                                                  metadata
                                                - data type: ``File (JSON)``
                                                - required: True
                                                - ``csv_files``: One or more CSV files containing
                                                  standardized sky survey data
                                                - data type: ``list[File]``
                                                - required: True
----------------------------------------------- ---------------------------------------------------- --------------
``GET /upload-sky-survey-status/{upload_id}``   - ``upload_id``: Unique identifier returned when     :ref:`upload_status_ep`
                                                  the upload was initiated
                                                - data type: ``string (UUID)``
                                                - required: True
----------------------------------------------- ---------------------------------------------------- --------------
``GET /review-upload/{upload_id}``              - ``upload_id``: Unique identifier returned when     :ref:`review_upload_ep`
                                                  the upload was initiated
                                                - data type: ``string (UUID)``
                                                - required: True
----------------------------------------------- ---------------------------------------------------- --------------
``POST /commit-upload/{upload_id}``             - ``upload_id``: Unique identifier of the staged     :ref:`commit_upload_ep`
                                                  upload to commit
                                                - data type: ``string (UUID)``
                                                - required: True
----------------------------------------------- ---------------------------------------------------- --------------
``DELETE /reject-upload/{upload_id}``           - ``upload_id``: Unique identifier of the staged     :ref:`reject_upload_ep`
                                                  upload to reject
                                                - data type: ``string (UUID)``
                                                - required: True
----------------------------------------------- ---------------------------------------------------- --------------
=============================================== ==================================================== ==============

.. _upload_ingest_ep:

Upload and ingest CSV files
...........................

**Endpoint**: ``POST /upload-sky-survey-batch``

Upload and ingest one or more sky survey CSV files to the **staging table**.

.. important::
   Data uploaded via this endpoint is **NOT automatically added to the main database**.
   After a successful upload, you MUST:

   1. Review the staged data using ``GET /review-upload/{upload_id}``
   2. Commit the data using ``POST /commit-upload/{upload_id}``

   If you do not commit the upload, the data will remain in staging and will not be visible in the GSM.

All files in the batch are combined into a single sky model. If any file fails validation or ingestion,
the entire batch is rolled back.

.. note::
   This has been tested inside the cluster with file sizes up to 200 MB (~1,000,000 rows) with no
   known issues. Using port forwarding and curl, a catalogue of 2 GB (10,000,000 rows) was
   successfully uploaded.

The catalogue version is **not** supplied by the user; it is assigned automatically when the upload is
committed (see :ref:`commit_upload_ep`).

.. list-table::
   :widths: 20, 50, 20, 10
   :header-rows: 1

   * - Parameter
     - Description
     - Data Type
     - Required
   * - ``metadata_file``
     - JSON file with catalogue metadata (``catalogue_name``, ``description``, ``epoch``, ``author``, ``reference``, ``notes``).
     - File (JSON)
     - Yes
   * - ``csv_files``
     - One or more CSV files containing standardized sky survey data
     - list[File]
     - Yes

**Response**:

.. code-block:: json

   {
     "upload_id": "550e8400-e29b-41d4-a716-446655440000",
     "status": "uploading",
     "catalogue_name": "GLEAM"
   }

The endpoint returns immediately with status ``"uploading"``. Ingestion into the staging table proceeds
asynchronously in the background. Use the status endpoint to monitor completion, then review and commit.

**Example Usage**:

.. code-block:: bash

   # Upload metadata and one or more CSV files
   curl -X POST "/upload-sky-survey-batch" \
     -F "metadata_file=@metadata.json;type=application/json" \
     -F "csv_files=@test_catalogue_1.csv;type=text/csv" \
     -F "csv_files=@test_catalogue_2.csv;type=text/csv"

**Python Example**:

.. code-block:: python

   import time

   import requests

   base_url = ""  # set to the GSM base URL (see "Base API URL")
   url = f"{base_url}/upload-sky-survey-batch"

   # Upload metadata and multiple CSV files
   files = [
       ("metadata_file", ("metadata.json", open("metadata.json", "rb"), "application/json")),
       ("csv_files", ("test_catalogue_1.csv", open("test_catalogue_1.csv", "rb"), "text/csv")),
       ("csv_files", ("test_catalogue_2.csv", open("test_catalogue_2.csv", "rb"), "text/csv")),
   ]

   response = requests.post(url, files=files)
   result = response.json()

   print(f"Upload ID: {result['upload_id']}")
   print(f"Catalogue: {result['catalogue_name']}")
   print(f"Status: {result['status']}")  # Will be "uploading"

   # Poll the status endpoint until ingestion finishes
   status_url = f"{base_url}/upload-sky-survey-status/{result['upload_id']}"
   while True:
       status_response = requests.get(status_url)
       status_data = status_response.json()
       if status_data['state'] in ['completed', 'failed']:
           break
       time.sleep(2)

   print(f"Final status: {status_data['state']}")

.. _upload_status_ep:

Get upload status
.................

**Endpoint**: ``GET /upload-sky-survey-status/{upload_id}``

Retrieve the current status of a sky survey batch upload.

.. list-table::
   :widths: 20, 50, 20, 10
   :header-rows: 1

   * - Parameter
     - Description
     - Data Type
     - Required
   * - ``upload_id``
     - Unique identifier returned when the upload was initiated
     - string (UUID)
     - Yes

**Response**:

.. code-block:: json

   {
     "upload_id": "550e8400-e29b-41d4-a716-446655440000",
     "state": "completed",
     "total_files": 3,
     "uploaded_files": 3,
     "remaining_files": 0,
     "errors": []
   }

**Upload States**:

- ``pending``: Upload created but not started
- ``uploading``: Files are being uploaded and validated
- ``completed``: All files uploaded and ingested successfully
- ``failed``: Upload failed (see ``errors`` field for details)

**Example Usage**:

.. code-block:: bash

   curl "/upload-sky-survey-status/550e8400-e29b-41d4-a716-446655440000"

**Python Example**:

.. code-block:: python

   import time

   import requests

   base_url = ""  # set to the GSM base URL (see "Base API URL")
   upload_id = "550e8400-e29b-41d4-a716-446655440000"
   url = f"{base_url}/upload-sky-survey-status/{upload_id}"

   while True:
       response = requests.get(url)
       status = response.json()

       print(f"State: {status['state']}")
       # Field names match the response shown above
       print(f"Progress: {status['uploaded_files']}/{status['total_files']}")

       if status['state'] in ['completed', 'failed']:
           break

       time.sleep(2)

   if status['state'] == 'failed':
       print(f"Errors: {status['errors']}")

.. _review_upload_ep:

Review staged upload
....................

**Endpoint**: ``GET /review-upload/{upload_id}``

Review the staged upload before committing it to the main database. Returns the total record count and
the last 10 staged records so that you can confirm all data loaded successfully.

.. list-table::
   :widths: 20, 50, 20, 10
   :header-rows: 1

   * - Parameter
     - Description
     - Data Type
     - Required
   * - ``upload_id``
     - Unique identifier returned when the upload was initiated
     - string (UUID)
     - Yes

**Response**:

.. code-block:: json

   {
     "upload_id": "550e8400-e29b-41d4-a716-446655440000",
     "total_records": 200,
     "sample_range": "91-100",
     "sample": [
       {
         "component_id": "J025837+035057",
         "ra": 0.7793,
         "dec": 0.0672,
         "i_pol": 0.8354,
         "version": null
       }
     ],
     "metadata": {
       "version": null,
       "catalogue_name": "TEST_CATALOGUE_1",
       "description": "Test catalogue 1 for development and testing purposes",
       "upload_id": "550e8400-e29b-41d4-a716-446655440000",
       "epoch": "J2000",
       "author": "SKA SDP Team",
       "reference": null,
       "notes": "Sample test data for ska-sdp-global-sky-model",
       "staging": true
     }
   }

**Example Usage**:

.. code-block:: bash

   curl "/review-upload/550e8400-e29b-41d4-a716-446655440000"

**Python Example**:

.. code-block:: python

   import requests

   base_url = ""  # set to the GSM base URL (see "Base API URL")
   upload_id = "550e8400-e29b-41d4-a716-446655440000"
   url = f"{base_url}/review-upload/{upload_id}"

   response = requests.get(url)
   review = response.json()

   print(f"Total records: {review['total_records']}")
   print(f"Sample data: {review['sample'][:3]}")  # First 3 records of the sample

.. _commit_upload_ep:

Commit staged upload
....................

**Endpoint**: ``POST /commit-upload/{upload_id}``

Commit staged data to the main database.

The catalogue version is **automatically assigned** at commit time by incrementing the minor version of
the previous latest version for that catalogue (e.g. ``0.1.0`` → ``0.2.0``). If no prior version exists
for the catalogue, ``0.1.0`` is used. Versioning is independent per catalogue name.

All components in the upload receive the same new version. A record is created in the
``global_sky_model_metadata`` table with the version and upload information.

.. list-table::
   :widths: 20, 50, 20, 10
   :header-rows: 1

   * - Parameter
     - Description
     - Data Type
     - Required
   * - ``upload_id``
     - Unique identifier of the staged upload to commit
     - string (UUID)
     - Yes

**Response**:

.. code-block:: json

   {
     "message": "success",
     "records_committed": 200,
     "upload_id": "550e8400-e29b-41d4-a716-446655440000",
     "version": "0.2.0",
     "catalogue_name": "Test catalogue"
   }

**Example Usage**:

.. code-block:: bash

   curl -X POST "/commit-upload/550e8400-e29b-41d4-a716-446655440000"

**Python Example**:

.. code-block:: python

   import requests

   base_url = ""  # set to the GSM base URL (see "Base API URL")
   upload_id = "550e8400-e29b-41d4-a716-446655440000"
   url = f"{base_url}/commit-upload/{upload_id}"

   response = requests.post(url)
   result = response.json()

   print(f"Committed {result['records_committed']} records")
   print(f"Message: {result['message']}")

.. _reject_upload_ep:

Reject staged upload
....................

**Endpoint**: ``DELETE /reject-upload/{upload_id}``

Reject and discard staged data. All records associated with this ``upload_id`` are permanently deleted
from the staging table. The catalogue metadata associated with the ``upload_id`` is also removed from the
metadata table.

.. list-table::
   :widths: 20, 50, 20, 10
   :header-rows: 1

   * - Parameter
     - Description
     - Data Type
     - Required
   * - ``upload_id``
     - Unique identifier of the staged upload to reject
     - string (UUID)
     - Yes

**Response**:

.. code-block:: json

   {
     "message": "Upload rejected successfully",
     "records_deleted": 200,
     "upload_id": "550e8400-e29b-41d4-a716-446655440000"
   }

**Example Usage**:

.. code-block:: bash

   curl -X DELETE "/reject-upload/550e8400-e29b-41d4-a716-446655440000"

**Python Example**:

.. code-block:: python

   import requests

   base_url = ""  # set to the GSM base URL (see "Base API URL")
   upload_id = "550e8400-e29b-41d4-a716-446655440000"
   url = f"{base_url}/reject-upload/{upload_id}"

   response = requests.delete(url)
   result = response.json()

   print(f"Rejected and deleted {result['records_deleted']} records")
   print(f"Message: {result['message']}")

End-to-End Upload Workflow
^^^^^^^^^^^^^^^^^^^^^^^^^^

A complete example of the intended workflow is provided here. See the sections above for details of
the individual steps.

1. Upload files to staging: ``POST /upload-sky-survey-batch``
2. Poll for completion: ``GET /upload-sky-survey-status/{upload_id}``
3. Review staged data: ``GET /review-upload/{upload_id}``
4. Commit to main database: ``POST /commit-upload/{upload_id}``

.. code-block:: python

   import time

   import requests

   # ------------------------------------------------------------------
   # Configuration
   # ------------------------------------------------------------------
   base_url = ""  # e.g. http://ska-sdp-gsm.
   metadata_path = "metadata.json"
   csv_paths = ["test_catalogue_1.csv", "test_catalogue_2.csv"]

   # ------------------------------------------------------------------
   # 1. Upload files to staging
   # ------------------------------------------------------------------
   upload_url = f"{base_url}/upload-sky-survey-batch"

   files = [
       ("metadata_file", ("metadata.json", open(metadata_path, "rb"), "application/json")),
   ]
   for path in csv_paths:
       files.append(("csv_files", (path, open(path, "rb"), "text/csv")))

   response = requests.post(upload_url, files=files)
   result = response.json()
   upload_id = result["upload_id"]

   print(f"Upload ID: {upload_id}")
   print(f"Catalogue: {result['catalogue_name']}")
   print(f"Initial status: {result['status']}")
   print("Data uploaded to staging. Waiting for ingestion to complete...")

   # ------------------------------------------------------------------
   # 2. Poll for upload completion
   # ------------------------------------------------------------------
   status_url = f"{base_url}/upload-sky-survey-status/{upload_id}"

   while True:
       status_response = requests.get(status_url)
       status = status_response.json()

       print(
           f"State: {status['state']} | "
           f"Progress: {status['uploaded_files']}/{status['total_files']}"
       )

       if status["state"] in ["completed", "failed"]:
           break

       time.sleep(2)

   if status["state"] == "failed":
       print("Upload failed!")
       print(f"Errors: {status['errors']}")
       raise SystemExit(1)  # stop here; nothing to review or commit

   print("Upload completed successfully.")

   # ------------------------------------------------------------------
   # 3. Review staged data
   # ------------------------------------------------------------------
   review_url = f"{base_url}/review-upload/{upload_id}"
   review = requests.get(review_url).json()

   print(f"\nRecords staged: {review['total_records']}")
   print(f"Sample records: {review['sample'][:3]}")

   # ------------------------------------------------------------------
   # 4. Commit upload to main database
   # ------------------------------------------------------------------
   commit_url = f"{base_url}/commit-upload/{upload_id}"
   commit_response = requests.post(commit_url).json()

   print("\nCommit successful.")
   print(f"Committed {commit_response['records_committed']} records")
   print(f"Catalogue: {commit_response['catalogue_name']}")
   print(f"Version: {commit_response['version']}")

Browser upload interface
^^^^^^^^^^^^^^^^^^^^^^^^

.. warning::
   The browser upload interface is **deprecated** and will be removed in a future release.
   Users should migrate to the API-based workflow described above.

A browser interface is available at the ``/upload`` endpoint of the GSM base URL.

1. Navigate to ``/upload`` under your deployment's base URL in your web browser

   .. figure:: ../images/upload_init_screen.png
      :alt: Upload Interface
      :width: 600px
      :align: center

      Initial upload interface screen

2. Drag and drop CSV files, containing data for a single catalogue version, onto the upload zone
   (or click to browse)

   .. warning::
      The selected files must not exceed 10 MB in total.

   .. figure:: ../images/upload_files_added.png
      :alt: Files added to upload
      :width: 400px
      :align: center

      Interface showing files selected for upload

3. Click "Upload Files" to begin the upload
4. Monitor the upload progress; the status updates automatically
5. Confirm the upload completed successfully and review the count of staged records

   .. figure:: ../images/upload_completed.png
      :alt: Upload completed
      :width: 350px
      :align: center

      Interface showing files have been uploaded and staged

6. Click "Commit to Database" to approve or "Reject and Discard" to cancel

   .. figure:: ../images/upload_commit.png
      :alt: Upload committed
      :width: 400px
      :align: center

   .. figure:: ../images/upload_reject.png
      :alt: Upload rejected
      :width: 400px
      :align: center

      Confirm or reject uploaded data

The browser interface also provides:

- Real-time status monitoring
- Display of the auto-assigned version of the committed data
- Display of errors if the upload fails

The expected CSV format is described at :ref:`upload_csv_format` and examples are shown at
:ref:`example_upload_csv`.

.. _example_upload_csv:

CSV Format Examples
^^^^^^^^^^^^^^^^^^^

**Standardized Format**:

The ``test_catalogue_1.csv`` and ``test_catalogue_2.csv`` files in the test data directory demonstrate
the required standardized format:

.. code-block:: text

   component_id,ra_deg,dec_deg,i_pol_jy,a_arcsec,b_arcsec,pa_deg,spec_idx,log_spec_idx
   J025837+035057,44.656883,3.849425,0.835419,142.417,132.7302,3.451346,"[-0.419238,,,,]",False
   J030420+022029,46.084633,2.341634,0.29086,137.107,134.2583,-0.666618,"[-1.074094,,,,]",False

These test catalogues contain 100 components each and are used throughout the test suite as reference
examples.

**Minimal Format**:

At minimum, you need the four required columns:

.. code-block:: text

   component_id,ra_deg,dec_deg,i_pol_jy
   J000001-350001,0.004,-35.0,0.25
   J000002-350002,0.008,-35.1,0.23
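
Because a batch is rolled back if any file fails validation, it can help to check a CSV header locally before uploading. The following is a minimal sketch of such a check, assuming only the four required column names listed above. The helper ``missing_required_columns`` is hypothetical, is not part of the GSM, and does not reproduce the server-side validation.

```python
import csv
import io

# Required columns, taken from the minimal format shown above.
REQUIRED_COLUMNS = ("component_id", "ra_deg", "dec_deg", "i_pol_jy")


def missing_required_columns(csv_text):
    """Return the required columns that are absent from the CSV header.

    Hypothetical local pre-check; the GSM performs its own validation.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # empty input yields an empty header
    present = {name.strip() for name in header}
    return [name for name in REQUIRED_COLUMNS if name not in present]


minimal = (
    "component_id,ra_deg,dec_deg,i_pol_jy\n"
    "J000001-350001,0.004,-35.0,0.25\n"
)
incomplete = (
    "component_id,ra_deg,dec_deg\n"
    "J000001-350001,0.004,-35.0\n"
)

print(missing_required_columns(minimal))     # []
print(missing_required_columns(incomplete))  # ['i_pol_jy']
```

To check a file on disk, read it first, for example ``missing_required_columns(open("test_catalogue_1.csv").read())``, and upload only when the returned list is empty.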