What Is DarkBERT? Can the AI Help Combat Cyber Threats?
The popularity of large language models (LLMs) is soaring, with new ones continuously entering the scene. These models, like the ones behind ChatGPT, are typically trained on various internet sources, including articles, websites, books, and social media.
In an unprecedented move, a team of South Korean researchers developed DarkBERT, an LLM trained on datasets taken exclusively from the dark web. Their aim was to create an AI tool that outperforms existing language models and assists threat researchers, law enforcement, and cybersecurity professionals in fighting cyber threats.

What Is DarkBERT?
DarkBERT is a transformer-based encoder model based on the RoBERTa architecture. The LLM was trained on millions of dark web pages, including data from hacking forums, scamming websites, and other online sources associated with illegal activities.
The term "dark web" refers to a hidden section of the internet that is inaccessible via standard web browsers. It is known for harboring anonymous websites and marketplaces infamous for illegal activities, such as the trade of stolen data, drugs, and weapons.

To train DarkBERT, the researchers gained access to the dark web through the Tor network and collected raw data. They carefully filtered this data using techniques like deduplication, category balancing, and pre-processing to create a refined dark web database, which was then fed to RoBERTa over the course of approximately 15 days to create DarkBERT.
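To make the filtering step more concrete, here is a minimal Python sketch of deduplication and category balancing. It is not the researchers' actual pipeline, which the article does not describe in detail. The page records, the `category` labels, the exact-match hashing, and the `max_per_category` cap are all assumptions made for illustration.

```python
import hashlib
import random
from collections import defaultdict


def normalize(text):
    """Lowercase and collapse whitespace so trivially different copies match."""
    return " ".join(text.lower().split())


def deduplicate(pages):
    """Keep the first page seen for each distinct normalized text."""
    seen = set()
    unique = []
    for page in pages:
        digest = hashlib.sha256(normalize(page["text"]).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(page)
    return unique


def balance_categories(pages, max_per_category, seed=0):
    """Cap every category at max_per_category pages, sampled at random."""
    by_category = defaultdict(list)
    for page in pages:
        by_category[page["category"]].append(page)
    rng = random.Random(seed)
    balanced = []
    for category in sorted(by_category):
        group = by_category[category]
        if len(group) > max_per_category:
            group = rng.sample(group, max_per_category)
        balanced.extend(group)
    return balanced


# Hypothetical crawl output: the first two pages differ only in case and spacing.
pages = [
    {"category": "forum", "text": "Selling fresh  database dump"},
    {"category": "forum", "text": "selling fresh database dump"},
    {"category": "forum", "text": "New exploit kit released"},
    {"category": "forum", "text": "Looking for a partner"},
    {"category": "market", "text": "Item listing with price"},
]

unique_pages = deduplicate(pages)
corpus = balance_categories(unique_pages, max_per_category=2)
print(len(pages), len(unique_pages), len(corpus))
```

Capping the largest categories keeps one type of page from dominating the training text. A real pipeline would probably also need near-duplicate detection, since exact hashing only catches copies that match after normalization.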
Possible Uses of DarkBERT in Cybersecurity
Because it was trained on dark web text, DarkBERT has a strong grasp of the language cybercriminals use and excels at spotting specific potential threats. It can process dark web content and flag cybersecurity threats like data leaks and ransomware, making it a potentially useful tool to fight cyber threats.
To evaluate DarkBERT, the researchers compared it with two widely used NLP models, BERT and RoBERTa, across three cybersecurity-related use cases, according to their paper posted on arxiv.org.

1. Monitor Dark Web Forums for Potentially Harmful Threads
Monitoring dark web forums, which are commonly used for exchanging illicit information, is crucial to identify potentially dangerous threads. However, manually reviewing these can be time-consuming, making automation of the process beneficial to security experts.
The researchers focused on potentially damaging activities in hacking forums, devising annotation guidelines for noteworthy threads, including sharing confidential data and distributing critical malware or vulnerabilities.

DarkBERT outperformed other language models in terms of precision, recall, and F1 score, emerging as the superior choice for identifying noteworthy threads on the dark web.
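For readers unfamiliar with these metrics, the short Python sketch below shows how precision, recall, and F1 score are computed for a binary "noteworthy thread" classifier. The labels and predictions are made up for illustration and are not results from the study.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the given positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Hypothetical labels: 1 = noteworthy thread, 0 = ordinary thread.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)
```

Precision measures how many flagged threads were truly noteworthy, recall measures how many noteworthy threads were caught, and F1 combines the two into a single number.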
2. Detect Sites That Host Confidential Information
Hackers and ransomware groups use the dark web to create leak sites, where they publish confidential data stolen from organizations that refuse to comply with ransom demands. Other cybercriminals just upload leaked sensitive data, like passwords and financial information, to the dark web with the intention of selling it.
In their study, the researchers collected data from notorious ransomware groups and analyzed ransomware leak sites that publish organizations' private data. DarkBERT outperformed other language models in identifying and classifying such sites, showcasing its understanding of the language used in underground hacking forums on the dark web.

3. Identify Keywords Related to Threats on the Dark Web
DarkBERT leverages the fill-mask function, an inherent feature of BERT-family language models, to accurately identify keywords associated with illegal activities, including drug sales on the dark web.
When the word “MDMA” was masked in a drug sales page, DarkBERT generated drug-related words, whereas other models suggested general words and terms unrelated to drugs, like various professions.
DarkBERT’s ability to identify keywords related to illicit activities can be valuable in tracking and addressing emerging cyber threats.
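The fill-mask idea can be sketched in a few lines of Python. Since DarkBERT is not publicly available, this sketch assumes the public `roberta-base` model from the Hugging Face `transformers` library as a stand-in, so its suggestions will resemble the general-purpose baseline rather than DarkBERT. The sample sentence and the helper names `mask_keyword`, `top_candidates`, and `load_fill_mask` are hypothetical.

```python
def mask_keyword(text, keyword, mask_token="<mask>"):
    """Replace the first occurrence of keyword with the model's mask token."""
    if keyword not in text:
        raise ValueError(f"{keyword!r} not found in text")
    return text.replace(keyword, mask_token, 1)


def top_candidates(fill_mask, masked_text, k=5):
    """Return the k highest-scoring words the model proposes for the mask.

    fill_mask is any callable with the same interface as a Hugging Face
    fill-mask pipeline: it returns a list of dicts with a "token_str" key.
    """
    predictions = fill_mask(masked_text, top_k=k)
    return [p["token_str"].strip() for p in predictions]


def load_fill_mask(model_name="roberta-base"):
    """Build a fill-mask pipeline (assumption: roberta-base as a stand-in)."""
    from transformers import pipeline  # imported lazily; downloads the model

    return pipeline("fill-mask", model=model_name)


# Hypothetical sentence; "<mask>" is the mask token used by RoBERTa models.
masked = mask_keyword("Top quality MDMA, discreet worldwide shipping", "MDMA")
print(masked)
```

Calling `top_candidates(load_fill_mask(), masked)` would download the model and list its suggestions for the masked position. Comparing those suggestions across models is the kind of test described above.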
Is DarkBERT Accessible to the General Public?
DarkBERT is currently unavailable to the public, but the researchers are open to requests to use it for academic purposes.
Harness the Power of AI for Threat Detection and Prevention
DarkBERT has been pre-trained on dark web data and outperforms existing language models across multiple cybersecurity use cases, positioning itself as a crucial tool for advancing dark web research.
The dark web-trained AI has the potential to be used for various cybersecurity tasks, including identifying websites selling leaked confidential data, monitoring dark web forums to detect illicit information sharing, and identifying keywords related to cyber threats.
But you should always remember that, like other LLMs, DarkBERT is a work in progress, and its performance can be improved through continuous training and fine-tuning.