1) Machine Learning Approaches to Study Star Formation and Black Hole Accretion in the MeerKAT/MIGHTEE survey
As a result of recent advances in astronomical and digital technologies, astronomy is rapidly becoming a data-rich science. The much higher data rates from radio surveys with the MeerKAT telescope, the Australian Square Kilometre Array Pathfinder (ASKAP), and eventually the Square Kilometre Array (SKA) require machine-learning techniques to automate most tasks previously carried out manually by astronomers. One such task is classifying radio sources as star-formation- or accretion-dominated. Both of these processes can be traced via synchrotron emission at radio wavelengths.
However, reliably automating the classification of radio sources as star-formation- or accretion-dominated is non-trivial and often requires extensive use of multi-wavelength data. This classification is a necessary first step towards understanding the nature of the sources detected in radio continuum surveys.
In this study, we implement and optimise five supervised machine learning techniques (Logistic Regression, Support Vector Machine, K-Nearest Neighbour, Random Forest and XGBoost) to classify radio sources detected in the MeerKAT International GHz Tiered Extragalactic Exploration (MIGHTEE)-COSMOS survey as star-formation- or accretion-dominated.
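As a rough illustration of how several supervised classifiers can be compared on a common train/test split, here is a minimal sketch using scikit-learn. It does not reproduce the project's pipeline: the feature table is synthetic (generated with make_classification, not MIGHTEE data), the label encoding is an assumption, and scikit-learn's GradientBoostingClassifier is used as a stand-in for XGBoost so that the example needs no extra package.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for a multi-wavelength feature table (NOT MIGHTEE data).
# Assumed label encoding: 0 = star-formation-dominated, 1 = accretion-dominated.
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Distance- and margin-based models are wrapped in a scaling pipeline.
models = {
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "Support Vector Machine": make_pipeline(StandardScaler(), SVC()),
    "K-Nearest Neighbour": make_pipeline(
        StandardScaler(), KNeighborsClassifier(n_neighbors=5)
    ),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    # Stand-in for XGBoost (a different gradient-boosting implementation).
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}

# Fit each model on the training split and record its test accuracy.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
    print(f"{name}: test accuracy = {scores[name]:.3f}")
```

Accuracy is used here only for brevity; for imbalanced classes, precision, recall or F1 would usually be more informative.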
2) Machine Learning In Multi-Wavelength Galaxy/Quasar Evolution: Photometric Redshift Estimation
Photometric redshift estimation is currently the most powerful and efficient way to estimate the distances to extragalactic sources. As data volumes continue to grow exponentially, analysing the data and making predictions from it will require low-cost, fast and efficient data-driven methods. In this study, we present the supervised machine learning algorithms used to estimate the photometric redshifts of galaxies and quasars found in a cross-match of the Sloan Digital Sky Survey Data Release 16 (SDSS DR16) and WISE datasets. We adopt the K-Nearest Neighbour (KNN) and Random Forest (RF) regressors to estimate the photometric redshifts of 285,685 galaxies and 124,688 quasars from their photometric measurements.
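The regression setup described above can be sketched as follows with scikit-learn's KNeighborsRegressor and RandomForestRegressor. This is a toy example under stated assumptions: the catalogue is synthetic (four made-up colours that vary linearly with redshift plus noise, not SDSS/WISE photometry), and the hyperparameter values are arbitrary rather than those used in the project.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(42)

# Synthetic toy catalogue (NOT SDSS/WISE data): four made-up colours that
# depend linearly on redshift, with Gaussian noise added.
n_sources = 2000
z_spec = rng.uniform(0.0, 1.0, n_sources)
slopes = np.array([1.5, 0.8, -0.6, 1.1])  # arbitrary illustrative values
colours = z_spec[:, None] * slopes + rng.normal(0.0, 0.05, (n_sources, 4))

X_train, X_test, z_train, z_test = train_test_split(
    colours, z_spec, test_size=0.2, random_state=42
)

regressors = {
    "KNN": KNeighborsRegressor(n_neighbors=10),
    "RF": RandomForestRegressor(n_estimators=100, random_state=42),
}

# Train on spectroscopic redshifts, then predict z_phot for held-out sources.
z_phot = {}
rms = {}
for name, reg in regressors.items():
    reg.fit(X_train, z_train)
    z_phot[name] = reg.predict(X_test)
    rms[name] = float(np.sqrt(np.mean((z_phot[name] - z_test) ** 2)))
    print(f"{name}: RMS(z_phot - z_spec) = {rms[name]:.4f}")
```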
The first figure is a colour-colour diagram showing the distribution of the observed galaxies and quasars, with the colour bar giving the true spectroscopic redshifts. Its left panel shows the observed galaxy counts on the 2D grid of r-i vs u-g, and its right panel shows the observed quasars on the same r-i vs u-g grid. The figure below it shows the normalised redshift estimation error, ∆z_norm, as a function of the spectroscopic redshift of the galaxies. The last figure shows the redshifts predicted by the K-Nearest Neighbour algorithm as a function of spectroscopic redshift: the left panel shows z_phot vs z_spec for galaxies and the right panel shows z_phot vs z_spec for quasars.
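The project text does not spell out how the normalised error is defined. A common convention in the photometric-redshift literature, assumed here, is ∆z_norm = (z_phot - z_spec) / (1 + z_spec), with the scatter summarised by the normalised median absolute deviation, NMAD = 1.4826 × median(|∆z_norm - median(∆z_norm)|). A minimal sketch using only the Python standard library, with made-up redshift values:

```python
from statistics import median


def normalised_error(z_phot, z_spec):
    """Return (z_phot - z_spec) / (1 + z_spec) for each source."""
    return [(zp - zs) / (1.0 + zs) for zp, zs in zip(z_phot, z_spec)]


def nmad(dz_norm):
    """Normalised median absolute deviation of the normalised errors."""
    centre = median(dz_norm)
    return 1.4826 * median([abs(d - centre) for d in dz_norm])


# Made-up example values, for illustration only.
z_spec = [0.0, 0.5, 1.0, 3.0]
z_phot = [0.1, 0.5, 1.2, 2.6]

dz_norm = normalised_error(z_phot, z_spec)
print("normalised errors:", dz_norm)
print("NMAD:", nmad(dz_norm))
```

The factor 1.4826 makes the NMAD comparable to a standard deviation for normally distributed errors while remaining insensitive to outliers.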
Large-scale structure in the northern equatorial slice of the SDSS main galaxy redshift sample. The slice is 2.5 degrees thick, and galaxies are color-coded by g-r color. [Photo credit: sdss-legacy]
I would like to thank my supervisor, Prof. Mattia Vaccari, for his support throughout the project. I am also grateful to Chaka Mofokeng for the effort he put into helping with the code write-up. I also appreciate the support from my colleague Yaaseen Jones, friends and family. This work uses SDSS DR16, the fourth data release of the fourth phase of the Sloan Digital Sky Survey, and AllWISE data products from the Wide-field Infrared Survey Explorer. Without these two surveys, this work would have been impossible.
3) Hyperparameter optimization for XGBoost
This project builds on the regression repository. It mainly focuses on optimizing the hyperparameters of the XGBoost regressor to best estimate the photometric redshifts of the quasars and star-forming galaxies under study. We used 80% (about one million spectroscopically confirmed SDSS sources) of the dataset for training the algorithm and 20% for testing. We used scikit-learn's RandomizedSearchCV with r2_score, the median absolute deviation, or both as the scoring metrics in different experiments (see the github f to find the best parameters for our testing data. Scoring with the median absolute deviation gave the best RMS and NMAD for this project.
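A randomised hyperparameter search with a custom scoring metric can be sketched as below. Several things here are assumptions rather than the project's actual setup: the data are synthetic, the "median absolute deviation" score is taken to be the NMAD of (z_phot - z_spec) / (1 + z_spec), the search ranges are arbitrary, and scikit-learn's GradientBoostingRegressor stands in for the XGBoost regressor. XGBoost's XGBRegressor follows the scikit-learn estimator interface and accepts the same four parameter names, so it could be swapped in.

```python
import numpy as np
from scipy.stats import randint, uniform
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import make_scorer
from sklearn.model_selection import RandomizedSearchCV, train_test_split


def nmad_score(z_true, z_pred):
    """NMAD of the normalised redshift error (assumed metric definition)."""
    dz = (z_pred - z_true) / (1.0 + z_true)
    return 1.4826 * np.median(np.abs(dz - np.median(dz)))


# Synthetic toy catalogue (NOT SDSS data): four made-up colours vs redshift.
rng = np.random.default_rng(0)
n_sources = 1000
z_spec = rng.uniform(0.0, 1.0, n_sources)
slopes = np.array([1.5, 0.8, -0.6, 1.1])  # arbitrary illustrative values
features = z_spec[:, None] * slopes + rng.normal(0.0, 0.05, (n_sources, 4))

# 80% of the sources for training, 20% for testing.
X_train, X_test, z_train, z_test = train_test_split(
    features, z_spec, test_size=0.2, random_state=0
)

# Illustrative search ranges; scipy's uniform(loc, scale) samples
# from the interval [loc, loc + scale].
param_distributions = {
    "n_estimators": randint(50, 200),
    "max_depth": randint(2, 6),
    "learning_rate": uniform(0.05, 0.25),
    "subsample": uniform(0.6, 0.4),
}

search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=0),  # stand-in for XGBoost
    param_distributions=param_distributions,
    n_iter=5,
    # Lower NMAD is better, so the scorer negates it internally.
    scoring=make_scorer(nmad_score, greater_is_better=False),
    cv=3,
    random_state=0,
)
search.fit(X_train, z_train)

test_nmad = nmad_score(z_test, search.predict(X_test))
print("best parameters:", search.best_params_)
print("cross-validated NMAD:", -search.best_score_)
print("test NMAD:", test_nmad)
```

Because the scorer is built with greater_is_better=False, best_score_ is the negative of the NMAD, hence the sign flip when printing. To score with more than one metric in the same search, RandomizedSearchCV also accepts a dictionary of scorers together with a refit argument naming the metric used to pick the final model.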