The burden of depression and anxiety in the world is rising. Identification of individuals at increased risk of developing these conditions would help to target them for prevention and ultimately reduce the healthcare burden. We developed a 10-year predictive algorithm for depression and anxiety, based on digitally obtainable information, using the full cohort of over 400,000 UK Biobank (UKB) participants without pre-existing depression or anxiety. From an initial 204 variables selected from the UKB and processed into more than 520 features, iterative backward elimination with a Cox proportional hazards model was performed to select the predictors that account for the majority of the model's predictive capability. Baseline and reduced models were then trained for depression and anxiety using both Cox and DeepSurv, a deep neural network approach to survival analysis. The baseline Cox model achieved a concordance of 0.813 for depression and 0.778 for anxiety on the validation dataset. For the DeepSurv model, the respective concordance indices were 0.805 and 0.774. After feature selection, the depression model contained 43 predictors and the concordance index was 0.801 for both Cox and DeepSurv. The reduced anxiety model, with 27 predictors, achieved a concordance of 0.770 in both models. The final models showed good discrimination and calibration in the test datasets. We developed predictive risk scores with high discrimination for depression and anxiety using the UKB cohort, incorporating predictors that are easily obtainable via smartphone. If deployed in a digital solution, they would allow individuals to track their risk, as well as provide pointers on how to decrease it through lifestyle changes.
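As an illustration of the Cox-based workflow described above, the following minimal Python sketch fits a proportional hazards model and evaluates concordance on held-out data with the open-source lifelines library. The file names, column names, and penaliser value are assumptions for illustration, not the study's actual UK Biobank fields or settings.

```python
# Minimal sketch of a Cox survival workflow with validation concordance.
import pandas as pd
from lifelines import CoxPHFitter
from lifelines.utils import concordance_index

# Hypothetical, already-preprocessed numeric frames: one row per participant,
# follow-up time in years, event flag for incident depression, plus features.
train = pd.read_csv("ukb_train_depression.csv")   # assumed file layout
valid = pd.read_csv("ukb_valid_depression.csv")

cph = CoxPHFitter(penalizer=0.01)                 # light regularisation (assumption)
cph.fit(train, duration_col="followup_years", event_col="incident_depression")

# Concordance on held-out data: a higher partial hazard should mean an earlier event,
# so the score is negated before passing it to concordance_index.
risk = cph.predict_partial_hazard(valid)
c_index = concordance_index(valid["followup_years"], -risk, valid["incident_depression"])
print(f"validation c-index: {c_index:.3f}")
```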
Background: Cardiovascular diseases (CVDs) are among the leading causes of death worldwide. Predictive scores providing a personalised risk of developing CVD are increasingly used in clinical practice. Most scores, however, utilise a homogeneous set of features and require the presence of a physician. Objective: The aim was to develop a new risk model (DiCAVA) using statistical and machine learning techniques that could be applied in a remote setting. A secondary goal was to identify new patient-centric variables that could be incorporated into CVD risk assessments. Methods: Across 466,052 participants, Cox proportional hazards (CPH) and DeepSurv models were trained using 608 variables derived from the UK Biobank to investigate the 10-year risk of developing a CVD. Data-driven feature selection reduced the number of features to 47, after which reduced models were trained. Both models were compared to the Framingham score. Results: The reduced CPH model achieved a c-index of 0.7443, whereas DeepSurv achieved a c-index of 0.7446. Both CPH and DeepSurv were superior to the Framingham score in determining CVD risk. Minimal difference was observed when cholesterol and blood pressure were excluded from the models (CPH: 0.741, DeepSurv: 0.739). The models show very good calibration and discrimination on the test data. Conclusion: We developed a cardiovascular risk model that has very good predictive capacity and encompasses new variables. The score could be incorporated into clinical practice and utilised in a remote setting, without the need to include cholesterol. Future studies will focus on external validation across heterogeneous samples.
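The data-driven feature selection step could, for example, be implemented as an iterative backward elimination that repeatedly drops the feature whose removal costs the least validation concordance until the target size is reached. The sketch below, again using lifelines, is one plausible reading of that step under assumed column names, not the authors' exact procedure.

```python
# Backward elimination sketch: drop features greedily by smallest c-index loss.
from lifelines import CoxPHFitter
from lifelines.utils import concordance_index

def cindex(features, train, valid, duration="followup_years", event="cvd_event"):
    """Fit a Cox model on `features` and score concordance on the validation set."""
    cph = CoxPHFitter(penalizer=0.01)                         # assumed regularisation
    cph.fit(train[features + [duration, event]], duration_col=duration, event_col=event)
    return concordance_index(valid[duration],
                             -cph.predict_partial_hazard(valid),  # higher hazard = earlier event
                             valid[event])

def backward_eliminate(features, train, valid, target_size=47):
    feats = list(features)
    while len(feats) > target_size:
        # Evaluate each candidate removal and drop the feature that hurts least.
        scores = {f: cindex([g for g in feats if g != f], train, valid) for f in feats}
        feats.remove(max(scores, key=scores.get))
    return feats
```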
Recent advances in machine learning are consistently enabled by increasing amounts of computation. Reinforcement learning (RL) and population-based methods in particular pose unique challenges for efficiency and flexibility to the underlying distributed computing frameworks. These challenges include frequent interaction with simulations, the need for dynamic scaling, and the need for a user interface with low adoption cost and consistency across different backends. In this paper we address these challenges while still retaining development efficiency and flexibility for both research and practical applications by introducing Fiber, a scalable distributed computing framework for RL and population-based methods. Fiber aims to significantly expand the accessibility of large-scale parallel computation to users of otherwise complicated RL and population-based approaches without the need for specialized computational expertise.
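Fiber is documented as mirroring Python's standard multiprocessing interface, so a population-based workload might fan out simulation rollouts as in the sketch below. The sketch uses the standard-library Pool; swapping the import for Fiber's pool is assumed to be a drop-in change, and the function name and rollout body are illustrative placeholders.

```python
# Fan out candidate evaluations with a multiprocessing-style worker pool.
from multiprocessing import Pool   # assumed drop-in replacement: `from fiber import Pool`

def evaluate_candidate(seed):
    """Run one simulated episode for a perturbed policy and return its score."""
    # Placeholder for an environment rollout; real code would build the simulator,
    # apply the perturbation identified by `seed`, and return the episode return.
    return float(seed % 7)

if __name__ == "__main__":
    with Pool(processes=8) as pool:
        scores = pool.map(evaluate_candidate, range(64))
    print(max(scores))
```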
Cardiovascular diseases (CVDs) are the number one cause of death globally. The WHO estimated that CVDs caused 17.9 million deaths (31% of all global deaths) in 2016. Although it may seem surprising, CVDs can largely be prevented by altering one's lifestyle to avoid risk factors; the only requirement is to know one's risk in advance. The Thai CV Risk score is a trustworthy tool for forecasting the risk of a future cardiovascular event in Thais. This study is an external validation of the Thai CV Risk score. We aim to answer two key questions. First, is the Thai CV Risk score, developed using a dataset of people from the central and northwestern parts of Thailand, applicable to people from other parts of the country? Second, does the Thai CV Risk score, developed for the general public, work for hospital patients, who tend to have higher risk? We answer these two questions using a dataset of 1,025 patients (319 males, 35-70 years old) from Lansaka Hospital in southern Thailand. In brief, we find that the Thai CV Risk score works for the southern Thai population, including patients in the hospital. It generally works well for the low-risk group, but it tends to overestimate moderate and high risks. Fortunately, this poses no serious concern for the general public, as it only makes people more careful about their lifestyle. Doctors should, however, be careful when using the score alongside other factors to make treatment decisions.
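A check of the kind reported here, comparing predicted risk against observed event rates within low, moderate, and high risk strata, could look like the short sketch below; the file name, column names, and risk cut-offs are assumptions for illustration only, not the study's definitions.

```python
# Stratified calibration check: predicted vs. observed event rate per risk group.
import pandas as pd

df = pd.read_csv("lansaka_cohort.csv")            # hypothetical validation cohort
df["stratum"] = pd.cut(df["predicted_risk"],      # predicted event probability in [0, 1]
                       bins=[0, 0.10, 0.20, 1.0], # assumed low / moderate / high cut-offs
                       labels=["low", "moderate", "high"])

calib = df.groupby("stratum").agg(
    mean_predicted=("predicted_risk", "mean"),
    observed_rate=("cv_event", "mean"),           # cv_event: 1 if an event occurred
    n=("cv_event", "size"),
)
print(calib)   # overestimation shows up as mean_predicted > observed_rate
```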
We study adversarial perturbations when the instances are uniformly distributed over $\{0,1\}^n$. We study both inherent bounds that apply to any problem and any classifier for such a problem as well as bounds that apply to specific problems and specific hypothesis classes. As the current literature contains multiple definitions of adversarial risk and robustness, we start by giving a taxonomy for these definitions based on their goals, and we identify one of them as the definition guaranteeing misclassification by pushing the instances to the error region. We then study some classic algorithms for learning monotone conjunctions and compare their adversarial risk and robustness under different definitions by attacking the hypotheses using instances drawn from the uniform distribution. We observe that sometimes these definitions lead to significantly different bounds. Thus, this study advocates for the use of the error-region definition, even though other definitions, in other contexts, may coincide with it. Using the error-region definition of adversarial perturbations, we then study inherent bounds on the risk and robustness of any classifier for any classification problem whose instances are uniformly distributed over $\{0,1\}^n$. Using the isoperimetric inequality for the Boolean hypercube, we show that for initial error $0.01$, there always exists an adversarial perturbation that changes $O(\sqrt{n})$ bits of the instances to increase the risk to $0.5$, making classifiers' decisions meaningless. Furthermore, by also using the central limit theorem we show that when $n \to \infty$, at most $c \cdot \sqrt{n}$ bits of perturbations, for a universal constant $c < 1.17$, suffice for increasing the risk to $0.5$, and the same $c \cdot \sqrt{n}$ bits of perturbations on average suffice to increase the risk to $1$, hence bounding the robustness by $c \cdot \sqrt{n}$.
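For concreteness, the central hypercube claim can be stated formally as below; the notation (error region $\mathcal{E}$, Hamming distance $\mathrm{HD}$) is ours and this is one plausible formalisation of the statement in the text, not a quotation of the paper's theorem.

```latex
% Hedged formal restatement of the c*sqrt(n) claim; notation assumed.
\[
  \Pr_{x \sim \{0,1\}^n}\!\bigl[x \in \mathcal{E}\bigr] \ge 0.01
  \;\Longrightarrow\;
  \Pr_{x \sim \{0,1\}^n}\!\bigl[\exists\, x' \in \mathcal{E} :
     \mathrm{HD}(x, x') \le c\sqrt{n}\bigr] \ge 0.5
  \qquad (n \to \infty,\; c < 1.17),
\]
where $\mathcal{E}$ is the classifier's error region and $\mathrm{HD}$ denotes Hamming distance.
```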
We analyze the practices of reservoir computing in the framework of statistical learning theory. In particular, we derive finite sample upper bounds for the generalization error committed by specific families of reservoir computing systems when processing discrete-time inputs under various hypotheses on their dependence structure. Non-asymptotic bounds are explicitly written down in terms of the multivariate Rademacher complexities of the reservoir systems and the weak dependence structure of the signals that are being handled. This makes it possible, in particular, to determine the minimal number of observations needed in order to guarantee a prescribed estimation accuracy with high probability for a given reservoir family. At the same time, the asymptotic behavior of the devised bounds guarantees the consistency of the empirical risk minimization procedure for various hypothesis classes of reservoir functionals.
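As a point of reference, the familiar i.i.d. template of a Rademacher-complexity generalization bound reads as follows; the bounds derived in the paper replace the i.i.d. assumption with weak-dependence conditions and use multivariate Rademacher complexities of the reservoir family, so this template is only illustrative, not the paper's result.

```latex
% Standard i.i.d. template: with probability at least 1 - delta over m samples,
\[
  \sup_{H \in \mathcal{H}} \Bigl( R(H) - \widehat{R}_m(H) \Bigr)
  \;\le\; 2\,\mathfrak{R}_m(\mathcal{H}) \;+\; B\sqrt{\frac{\log(1/\delta)}{2m}},
\]
where $R$ is the true risk, $\widehat{R}_m$ the empirical risk,
$\mathfrak{R}_m(\mathcal{H})$ the Rademacher complexity of the hypothesis class
$\mathcal{H}$, and $B$ a uniform bound on the loss.
```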