Threatened Species News Dataset


This data is part of the research article: Automated retrieval of information on threatened species from online sources using machine learning, Ritwik Kulkarni and Enrico Di Minin, 2021, Methods in Ecology and Evolution. Kindly cite this article for the dataset.

1. Considering limited conservation resources, gathering and analyzing information from digital data sources can help investigate the global biodiversity crisis in a cost-efficient manner. Development and application of methods for automated content analysis of digital data sources are especially important in the context of investigating human-nature interactions.

2. In this study, we introduce methods to automatically collect information on species threatened by wildlife trade from online news. An end-to-end pipeline is constructed that begins with searching and downloading news articles about species listed in Appendix I of the Convention on International Trade in Endangered Species of Wild Fauna and Flora (CITES) and proceeds with implementing natural language processing and machine learning methods to filter and retain only relevant articles. Additional relevant information is then extracted for each article using a Named Entity Recognition model.

3. The data collected over a one-month period included 15,088 articles and focused on 585 species listed in Appendix I of CITES. The accuracy of the neural network in detecting relevant articles was 95.91%, while the Named Entity Recognition model helped extract information on prices, locations, and quantities of traded animals. A regularly updated database is generated by the system, which can be queried and analysed for various research purposes and to inform conservation decision-making.

4. The results demonstrate that natural language processing can be used in an efficient manner to extract information from digital text content. The proposed methods can be applied to multiple digital data platforms at the same time and used to investigate human-nature interactions in conservation science and practice.
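
The filter-then-extract flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: a keyword check stands in for the trained neural network, and a regular expression stands in for the Named Entity Recognition model. All names here (Article, TRADE_TERMS, is_relevant, extract_prices, run_pipeline) and the sample articles are hypothetical and chosen only for illustration.

```python
import re
from dataclasses import dataclass

# Hypothetical keyword list; a stand-in for the trained relevance classifier.
TRADE_TERMS = ("seized", "smuggled", "trafficking", "poaching", "illegal trade")

# Simple pattern for dollar amounts; a stand-in for NER-based price extraction.
PRICE_RE = re.compile(r"(?:USD|US\$|\$)\s?(\d[\d,]*(?:\.\d+)?)")


@dataclass
class Article:
    title: str
    text: str


def is_relevant(article, species_terms):
    """Keep an article only if it mentions both a species and a trade term."""
    text = f"{article.title} {article.text}".lower()
    mentions_species = any(term in text for term in species_terms)
    mentions_trade = any(term in text for term in TRADE_TERMS)
    return mentions_species and mentions_trade


def extract_prices(text):
    """Return every dollar amount found in the text as a float."""
    return [float(m.replace(",", "")) for m in PRICE_RE.findall(text)]


def run_pipeline(articles, species_terms):
    """Filter articles for relevance, then extract prices from the kept ones."""
    records = []
    for article in articles:
        if not is_relevant(article, species_terms):
            continue
        records.append(
            {"title": article.title, "prices_usd": extract_prices(article.text)}
        )
    return records


# Invented sample input, for illustration only.
articles = [
    Article(
        "Customs seize pangolin scales",
        "Officers seized 200 kg of pangolin scales worth $15,000 at the port.",
    ),
    Article("Zoo welcomes tiger cub", "A tiger cub was born at the city zoo."),
]
records = run_pipeline(articles, species_terms=["pangolin", "tiger"])
```

In this sketch the second article is dropped because it names a species but contains no trade-related term, which mirrors the role of the relevance filter in the described pipeline.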

Year of publication


Type of data


Enrico Di Minin - Contributor, Rights holder, Curator, Creator

Ritwik Kulkarni - Contributor, Rights holder, Curator, Creator, Publisher


Other information

Fields of science

Environmental sciences



Open access



Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)


conservation, machine learning, natural language processing, CITES, online news, threatened species

Subject headings

nature conservation

Temporal coverage


Related to this research data