

Let's face it. Three years ago, moderation was treated as a 'technical triviality' that could be farmed out to interns at low cost. Today it is among the biggest business risks a company can take, right up there with defaulting on its debts. Companies that can't properly moderate a massive audience stand to lose up to 30% of their users in a single week, and that's no joke. And as the users leave, the millions of dollars lost to lawsuits follow close behind.
By 2022, simple dictionary-driven stop-word lookups had become obsolete. Users are now so adept at circumventing filters that moderators can no longer keep pace by adding words to a database: people type "k1ll" instead of "kill" and use memes to mask threats. The old way of doing things has ended.
An AI agent for comment moderation is not merely a filter. It replaces the outdated list of banned words and operates in real time at the point of posting. The agent itself decides whether to allow a post, block it, or send it for manual review.
Unlike the early forum filters we all loathed, the agent comprehends context. It interprets sarcasm, detects underlying tension, and deciphers culturally specific forms of communication.

The numbers tell the story: an agent processes approximately 10k messages per minute, while a human can physically process only 150-200 per minute. An agent responds in approximately 100 milliseconds; a human can take many minutes.
If you want to get started without writing your own code, ASCN.AI offers workflow automation templates that let you implement moderation in 2-3 weeks without hiring an outside developer.
The primary aim of automating content moderation is to prevent damage to your platform's image and to protect against lawsuits. At 50,000 comments a day, hiring enough people to manually review every post is financially unfeasible, yet the cost of a single missed violation is high. If you leave a threat of violence or a child exploitation image unchecked, the government may shut down your site.
The agent handles approximately 90-95% of routine reviews in real time; only the difficult cases are left for human reviewers.
| Message Volume/Day | Manual Moderation | AI Agent | Savings/Month |
|---|---|---|---|
| 1,000 | $3,000 | $400 | $2,600 |
| 10,000 | $25,000 | $1,200 | $23,800 |
| 100,000 | $200,000 | $8,000 | $192,000 |
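
The savings column can be re-derived from the other two columns. The sketch below only recomputes the figures already in the table; the function names are illustrative, and the cost ratio is an extra derived number, not a figure from the article.

```python
# Monthly costs in USD copied from the table: volume -> (manual, AI agent)
MONTHLY_COSTS = {
    1_000: (3_000, 400),
    10_000: (25_000, 1_200),
    100_000: (200_000, 8_000),
}


def monthly_savings(manual_cost: int, agent_cost: int) -> int:
    """Savings per month when the agent replaces manual moderation."""
    return manual_cost - agent_cost


def cost_ratio(manual_cost: int, agent_cost: int) -> float:
    """How many times more expensive manual moderation is."""
    return manual_cost / agent_cost


for volume, (manual, agent) in MONTHLY_COSTS.items():
    print(
        f"{volume:>7,} msg/day: saves ${monthly_savings(manual, agent):,}/month, "
        f"manual costs {cost_ratio(manual, agent):.1f}x the agent"
    )
```

The ratio grows with volume, which is the point the table is making: the agent's cost rises far more slowly than a moderation team's payroll.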

An AI agent's understanding of each piece of content is exposed through an API. If you integrate that API without understanding the architecture behind it, you will either overpay (by as much as 3x) or get a share of false positives large enough to drive users away from your product. There is no "magic" moderation button; there is a pipeline of models, each designed for a specific type of data. The methods used to analyze text are very different from those used for photos or video, and video typically requires significantly more processing power than audio.
When moderating text, an AI agent relies on NLP, specifically transformer models (BERT, RoBERTa, XLM-R), which understand context significantly better than previous generations of models (roughly 40% better).
How does this work? Transformer models read each word in relation to the words around it, so they can tell that "great job, keep it up" is either constructive feedback or heavy sarcasm, depending on the preceding messages in the same chat. The models are trained on large datasets of Reddit, Twitter, and forum messages to distinguish bullying and hate speech from neutral debate.
The key change from past models is semantic analysis. An old filter caught the word "kill" in every instance, even in a common expression such as "killing time." Today's moderation determines whether "kill" is being used as a threat or as part of a harmless idiom. Beyond semantics, the AI agent also weighs the user's past behavior, sentiment, and grammar when deciding on a specific post.
A challenge for everyone who uses or provides AI moderation is that slang and code words appear faster than the available models can keep up. Many users also obfuscate their messages by replacing parts of the text with other letters, digits, or emojis. To counter this, layers that detect adversarial perturbations are added on top of classic models and updated at least once per quarter.
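One simple defense against character substitution is to normalize text before it reaches the classifier. The sketch below is a minimal illustration, not the detection layer the article refers to: the substitution table, the one-word banned list, and the function names are all assumptions made for the example.

```python
import re

# Hypothetical substitution table; a production list would be larger and
# refreshed as new obfuscation tricks appear.
LEET_MAP = str.maketrans({
    "0": "o", "1": "i", "3": "e", "4": "a",
    "5": "s", "7": "t", "@": "a", "$": "s", "!": "i",
})

BANNED_WORDS = frozenset({"kill"})  # illustrative stop-word list


def normalize(text: str) -> str:
    """Lowercase the text and undo common character substitutions."""
    return text.lower().translate(LEET_MAP)


def contains_banned_word(text: str) -> bool:
    """Naive dictionary lookup on alphabetic tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return any(token in BANNED_WORDS for token in tokens)


message = "I will k1ll you"
print(contains_banned_word(message))             # raw text slips past the lookup
print(contains_banned_word(normalize(message)))  # normalized copy is caught
```

The mapping is lossy (a genuine "10" becomes "io"), so the normalized copy should be used only as classifier input, never shown to users or stored in place of the original.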
Computer vision uses Convolutional Neural Networks (CNNs) and/or transformer-based models such as Vision Transformers (ViT) or CLIP to interpret images and video from their pixel content. The model can identify people, objects, characters, weapons, or violence in 2D images.
Detecting violence or suicide in video requires temporal analysis: the model looks at how actions unfold over time, not just at still frames. Video models such as SlowFast and I3D are used for this, but the output only tells you whether violence occurred once the whole video has been processed. The downside is processing time: one hour of Full HD (FHD) video can take 10-50 minutes to process on a Tesla V100 GPU. As a result, many platforms apply AI filtering only to their highest-traffic videos, or process video after the fact.
Voicemail and voice messages are first transcribed to text and passed through the text filter, but there is also a second, acoustic layer. The acoustic filter analyzes sound, tone, and pauses to detect screaming or aggression from frequency characteristics, regardless of the spoken language.
AI moderation for live voice chat works in near real time (a 2-3 second delay). Without it, a speaker in a streaming or gaming application can keep escalating a conflict that needs immediate resolution. The major challenges with voice are background noise and regional slang; Russian-language slang, for example, changes constantly and differs from region to region.
Every platform has its own standards: what is acceptable on Reddit would normally be unacceptable on a children's forum. The problem with off-the-shelf moderation APIs is that they are built on averaged data, so they make no exceptions for your context, such as allowing the word "scammer" when users are discussing a politician.
RAG (Retrieval-Augmented Generation) systems solve this kind of problem by loading your rules into the model's context dynamically on each request. Rule changes take effect the same day and do not require a lengthy model retrain.
To illustrate how RAG works: you upload a list of banned content along with the appropriate exceptions. The system then searches that data in a vector database (such as Pinecone) for the rules closest to the user's message and returns a block / allow / human-review decision together with the rules it matched.
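The retrieval step can be sketched without any external service. In the example below, a bag-of-words vector and cosine similarity stand in for the learned embeddings and vector database a real deployment would use; the `Rule` type, the sample rules, and the function names are assumptions made for illustration only.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    text: str
    action: str  # "BLOCK", "ALLOW", or "REVIEW"


def embed(text: str) -> Counter:
    """Toy embedding: word counts. A real system would use a learned model."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(message: str, rules: list[Rule], top_k: int = 2) -> list[tuple[float, Rule]]:
    """Return up to top_k rules most similar to the message, best first."""
    query = embed(message)
    scored = [(cosine(query, embed(rule.text)), rule) for rule in rules]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(score, rule) for score, rule in scored[:top_k] if score > 0]


# Hypothetical rule set with one exception, as described in the text.
RULES = [
    Rule("threats of physical violence against a person", "BLOCK"),
    Rule("calling a politician a scammer in political discussion", "ALLOW"),
    Rule("links to counterfeit goods", "BLOCK"),
]

for score, rule in retrieve("this politician is a scammer", RULES):
    print(f"{score:.2f}  {rule.action:<6} {rule.text}")
```

In a full RAG setup, the retrieved rules would be placed into the model's prompt and the model would make the final call. Because the rules live in data rather than in model weights, editing `RULES` changes behavior immediately.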
Service providers' clients often say that spam and profanity are their #1 challenge. Incident statistics, however, show that threats of physical violence and coordinated illegal acts are far more common than profanity. Spam is an inconvenience, and profanity at worst leads to court-ordered penalties in the tens of thousands of dollars. Threats and child pornography, by contrast, can get a service shut down within days.
An AI agent that filters toxic content does not work on the "yes or no" basis of traditional services; it works on a scale. An AI moderation platform can differentiate between a mild insult (e.g., "stupid") and an actual physical threat (e.g., "We will find you, asshole."). The response can therefore range from a warning to a permanent ban, and relevant data may be submitted to the authorities.
Hate speech is incitement to racially or religiously motivated hatred. It can be masked through dehumanizing statements (e.g., calling others "not human") and through code words. The line also depends on jurisdiction: speech that is protected under the First Amendment in the United States may be a crime in Germany.
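The graduated response to toxic content can be expressed as a lookup from a severity score to an action. This is a sketch under assumptions: the cut-off values and the intermediate "temporary_ban" step are invented for the example, and a real platform would tune them to its own policy.

```python
from bisect import bisect_right

# Hypothetical severity cut-offs on a 0.0-1.0 scale.
SEVERITY_STEPS = [0.3, 0.6, 0.9]
ACTIONS = ["allow", "warn", "temporary_ban", "permanent_ban_and_escalate"]


def action_for(severity: float) -> str:
    """Map a severity score to the matching step on the response ladder."""
    if not 0.0 <= severity <= 1.0:
        raise ValueError("severity must be between 0.0 and 1.0")
    return ACTIONS[bisect_right(SEVERITY_STEPS, severity)]


print(action_for(0.35))  # mild insult -> "warn"
print(action_for(0.95))  # credible threat -> "permanent_ban_and_escalate"
```

Keeping the ladder in two plain lists makes it easy to add or remove steps without touching the lookup logic.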
According to ASCN.AI's data, automation catches at least 98% of spam. Traditional spam is characterized by repeated links and mass posting. More modern spammers, however, use Generative Pre-trained Transformer (GPT) models to create comments that look human-written and contain embedded advertising. Catching them requires behavioral analysis to identify bots and automated accounts. A typical bot pattern is a new account that leaves 20 comments within 10 minutes, each linking to the same domain.
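The bot pattern described above (new account, 20 comments in 10 minutes, one domain) translates into a small behavioral check. The sketch below is illustrative: the `Comment` type, the one-day definition of a "new" account, and the function name are assumptions, while the 20-comment and 10-minute figures come from the text.

```python
import re
from dataclasses import dataclass
from datetime import datetime, timedelta
from urllib.parse import urlparse

URL_RE = re.compile(r"https?://\S+")


@dataclass
class Comment:
    posted_at: datetime
    text: str


def looks_like_link_bot(
    account_created: datetime,
    comments: list[Comment],
    window: timedelta = timedelta(minutes=10),
    min_comments: int = 20,
    max_account_age: timedelta = timedelta(days=1),  # assumed meaning of "new"
) -> bool:
    """True if a new account posts a burst of comments all linking to one domain."""
    ordered = sorted(comments, key=lambda c: c.posted_at)
    start = 0
    for end, comment in enumerate(ordered):
        # Shrink the window from the left until it spans at most `window`.
        while comment.posted_at - ordered[start].posted_at > window:
            start += 1
        burst = ordered[start:end + 1]
        if len(burst) < min_comments:
            continue
        if burst[0].posted_at - account_created > max_account_age:
            continue
        if not all(URL_RE.search(c.text) for c in burst):
            continue
        domains = {
            urlparse(url).netloc.lower()
            for c in burst
            for url in URL_RE.findall(c.text)
        }
        if len(domains) == 1:
            return True
    return False


created = datetime(2024, 1, 1, 12, 0)
spam = [
    Comment(created + timedelta(seconds=20 * i), "Nice post! https://scam.example/airdrop")
    for i in range(20)
]
print(looks_like_link_bot(created, spam))
```

A heuristic like this complements the text classifier rather than replacing it: GPT-written comments may read as human, but the posting rhythm still gives the account away.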
In the crypto space, ASCN.AI is specifically focused on moderating signals in Telegram/Discord channels by identifying scam tokens and false airdrops through contract link pattern analysis.
Fake news is treated differently. Agents can pull in data from fact-checking sources (e.g., Snopes, FactCheck.org) and indicate whether a claim has been verified. The content is not deleted (that would be considered censorship) but is labeled "This claim has been investigated and found to be invalid." Currently only about 70% of claims get verified because of the high volume of fake content being created.
Computer vision can identify logos and watermarks. For example, if a user uploads a picture of counterfeit branded goods (e.g., counterfeit sneakers), the agent will block the image and notify the trademark or copyright owner.
For video, YouTube uses a technology called "Content ID," which generates a digital fingerprint of the video content. The fingerprint of each new video is compared against the fingerprints of existing videos. If the similarity is 80% or greater, YouTube blocks the content or restricts monetization. The downside of algorithmic moderation is that the algorithms usually can't distinguish "fair use," such as quoting excerpts for educational purposes, from copyright infringement. That leaves the final call to people.
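The fingerprint comparison can be illustrated with a toy model. Content ID itself is proprietary, so the sketch below does not reproduce it: it assumes each video has already been reduced to a 64-bit hash (a hypothetical format) and measures similarity as the share of matching bits, using the 80% threshold mentioned above.

```python
def similarity(fp_a: int, fp_b: int, bits: int = 64) -> float:
    """Share of bits that two fingerprints have in common (1.0 = identical)."""
    mask = (1 << bits) - 1
    differing = bin((fp_a ^ fp_b) & mask).count("1")
    return 1.0 - differing / bits


def matching_references(
    candidate: int, references: dict[str, int], threshold: float = 0.80
) -> list[str]:
    """Names of reference works whose fingerprint is at least `threshold` similar."""
    return [
        name
        for name, fingerprint in references.items()
        if similarity(candidate, fingerprint) >= threshold
    ]


# Hypothetical reference catalogue with a single entry.
REFERENCES = {"original_clip": 0xFFFF_FFFF_FFFF_FFFF}

re_upload = REFERENCES["original_clip"] ^ 0xFF        # 8 of 64 bits differ
unrelated = REFERENCES["original_clip"] ^ 0xFFFFFF    # 24 of 64 bits differ

print(similarity(re_upload, REFERENCES["original_clip"]))   # 0.875
print(matching_references(re_upload, REFERENCES))
print(matching_references(unrelated, REFERENCES))
```

A bit-level match says nothing about why the footage was reused, which is exactly why fair-use cases still end up in front of a human.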
| Comparison Parameter | Human Moderation | AI Moderation (Automatic) | Hybrid Model (Human + AI) |
|---|---|---|---|
| Reaction Speed | Minutes / Hours | Milliseconds (Real-time) | Seconds (for disputed items) |
| Scalability | Linear (requires people) | Exponential (CPU/GPU) | High |
| Accuracy (Context) | High (understands culture) | Medium (depends on model) | Maximum (AI sorts) |
| Cost (OPEX) | High ($$$) | Low (with volume growth) | Optimal |
| 24/7 Availability | Difficult / Expensive | Yes | Yes (AI) |
| Emotional Health | Burnout / trauma risk | Not applicable | Protected (humans see complex items) |

The hybrid model is the gold standard. An AI agent for UGC (user-generated content) platforms filters out the 90-95% of content that is obvious trash (spam, rudeness). Humans are still needed for borderline cases, such as political satire that may look like hate speech, or medical photos posted for educational purposes.
| Criterion | ASCN.AI | Google Perspective API | Self-hosted Solution |
|---|---|---|---|
| Implementation Time | 2-3 weeks | 2-5 days | 3-6 weeks |
| Team Required | No developers (no-code) | 1-2 developers | 3-5 ML engineers |
| Rule Customization | Full (RAG + fine-tuning) | Limited | Full |
| Cost at 100K msg/day | $8,000/month | $15,000/month | $10,000/month + infrastructure |
| Data Storage | Client's choice | Google servers | Your servers |
| Crypto-slang Support | Yes (specialized models) | No | Requires retraining |

Rollout should be slow and steady: start with testing on historical data, then run a "shadow mode" test, and finally increase automation gradually.
To begin, audit the data you already have. If your application is already live, pull a six-month dump of both allowed and blocked messages. Label each message, for example as spam, toxic, or not safe for work. This will be the training material for fine-tuning your model.
At the same time, formalize the guidelines. It is not enough to say "you cannot use insults." You must specify what counts as an insult and what the exceptions are. For example, the word "death" is normal on a medical forum but would be flagged as a violation on any other type of forum. This document becomes the basis of your RAG (Policy-as-Code) setup.
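Written as data, a guideline with exceptions might look like the sketch below. The `TermRule` structure, the forum type names, and the "strictest action wins" rule are assumptions made for illustration; the "death" example comes from the text.

```python
import re
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TermRule:
    term: str
    default_action: str                              # "ALLOW", "REVIEW", or "BLOCK"
    exceptions: dict = field(default_factory=dict)   # forum type -> action


# Hypothetical policy with the "death" example from the guidelines.
POLICY = [
    TermRule("death", "REVIEW", {"medical": "ALLOW"}),
]


def evaluate(message: str, forum_type: str, policy: list = POLICY) -> str:
    """Apply every matching rule; the strictest resulting action wins."""
    words = set(re.findall(r"[a-z']+", message.lower()))
    actions = [
        rule.exceptions.get(forum_type, rule.default_action)
        for rule in policy
        if rule.term in words
    ]
    for strict_action in ("BLOCK", "REVIEW"):
        if strict_action in actions:
            return strict_action
    return "ALLOW"


print(evaluate("risk of death after surgery", "medical"))  # ALLOW
print(evaluate("risk of death after surgery", "gaming"))   # REVIEW
```

Because the policy is plain data, it can be versioned, reviewed like code, and loaded into a retrieval system without touching the model.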
A pre-configured SaaS solution (e.g., Google Perspective API, Azure, OpenAI) is a good way to get started quickly: it can typically be integrated within a couple of days and is billed per use. The pros are speed of delivery and no infrastructure cost. The cons are limited flexibility, the fact that you hand potentially sensitive data to a third party (a risk under GDPR), and a bill that climbs steeply as transaction volume grows.
A custom solution typically means fine-tuning an open-source machine learning model (e.g., RoBERTa, BERT) on your proprietary data on your own servers. The main pro is complete control of your data. The con is that you need an ML engineering team and serious computing resources (a GPU costs about $500/month).
A hybrid solution sits in the middle. The SaaS layer filters only basic vulgarity, while the custom layer handles the nuances specific to your business. For example, simple filtering can run in the cloud while the more nuanced filtering runs locally.
Fine-tuning is simply the process of adjusting the model's weights to the specific business goal you have defined. Load the labeled data from the audit step into your chosen model and train for 3-5 epochs to reduce false positives (normal content that gets flagged) and false negatives (toxic content that gets through).
One of the most critical settings is the confidence threshold. Suppose the model gives a message a 0.85 probability of being toxic. With the threshold set to 0.90, the message passes through but is logged for later evaluation. With the threshold set to 0.70, the same message is a hard block. Finding the optimal value is an experimental process: start high, between 0.90 and 0.95, so that your users do not complain about wrongly blocked messages, and then lower the threshold as needed.
```python
# Setting the model confidence thresholds
CONFIDENCE_THRESHOLD = 0.85  # Messages with probability >= 85% are blocked
REVIEW_THRESHOLD = 0.60      # Messages from 60% up to 85% are sent for manual review

toxicity_score = 0.72  # Example value; in practice this comes from the model

if toxicity_score >= CONFIDENCE_THRESHOLD:
    action = "BLOCK"
elif toxicity_score >= REVIEW_THRESHOLD:
    action = "REVIEW"
else:
    action = "ALLOW"
```
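Since finding the right threshold is experimental, one way to run the experiment is to sweep candidate values over labeled data and count both kinds of error. The sketch below assumes a small hand-made sample of (score, is_toxic) pairs; the numbers are invented for illustration and are not real measurements.

```python
# Invented sample: (model toxicity score, human label "is toxic")
LABELED_SAMPLES = [
    (0.97, True), (0.91, True), (0.88, False), (0.82, True),
    (0.75, False), (0.66, True), (0.40, False), (0.10, False),
]


def error_counts(samples, threshold: float) -> tuple[int, int]:
    """Return (false_positives, false_negatives) for a blocking threshold."""
    false_positives = sum(
        1 for score, toxic in samples if score >= threshold and not toxic
    )
    false_negatives = sum(
        1 for score, toxic in samples if score < threshold and toxic
    )
    return false_positives, false_negatives


for threshold in (0.95, 0.85, 0.70):
    fp, fn = error_counts(LABELED_SAMPLES, threshold)
    print(f"threshold {threshold:.2f}: {fp} false positives, {fn} false negatives")
```

On this sample, lowering the threshold trades false negatives for false positives, which is why starting high and stepping down is the safer direction for user trust.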
You also have to consider your specific domain. In gaming, the word "kill" is part of gameplay. In medicine, the word "cut" refers to a surgical procedure. The model needs to classify the context around each word correctly, or the system might wrongly flag 30-40% of legitimate uses.
At implementation time, you exchange data with the agent through a REST API or an SDK. The flow looks like this: END USER -> BACK-END SERVER -> AGENT -> DECISION. The decision needs to complete in under 200 ms; otherwise, your end users will notice the lag.
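The 200 ms budget can be enforced on the back-end with a timeout around the agent call. In the sketch below, `classify` is a placeholder for whatever client function calls the agent, and falling back to "REVIEW" when the budget is exceeded is an assumed policy, not something the article prescribes.

```python
import concurrent.futures
import time

DECISION_BUDGET_SECONDS = 0.2  # the 200 ms target from the text

_executor = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def moderate_with_budget(
    message: str,
    classify,                              # callable: str -> "ALLOW" | "REVIEW" | "BLOCK"
    budget: float = DECISION_BUDGET_SECONDS,
    fallback: str = "REVIEW",              # assumed policy when the agent is too slow
) -> str:
    """Return the agent's decision, or the fallback if it misses the budget."""
    future = _executor.submit(classify, message)
    try:
        return future.result(timeout=budget)
    except concurrent.futures.TimeoutError:
        future.cancel()  # has no effect if the call is already running
        return fallback


def fast_agent(message: str) -> str:
    return "ALLOW"


def slow_agent(message: str) -> str:
    time.sleep(0.5)  # simulates an overloaded agent
    return "ALLOW"


print(moderate_with_budget("hello", fast_agent))  # ALLOW
print(moderate_with_budget("hello", slow_agent))  # REVIEW
```

Whether a timed-out message should be held for review or published and checked afterwards is a product decision; the point of the wrapper is that the user never waits longer than the budget.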
The initial launch runs in shadow mode. The agent makes a decision on every message, but you still publish each message as before. You simply log the agent's decisions and compare them with what actually happened to each message under your existing moderation. If more than 90% of decisions match, you can begin rolling out to a small share (10-20%) of your total traffic. Then run an A/B test of humans (Group A) versus the agent (Group B) that tracks complaints, speed of issue resolution, and the number of accounts lost (churn). If Group B performs as well as or better than Group A on complaints, speed, and churn, the agent's share of the work can be increased to 80%; still, keep 5-10% for humans to handle complex cases.
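The shadow-mode comparison boils down to an agreement rate between two decision logs. The sketch below assumes both logs are dictionaries keyed by message ID, which is an invented format; the 90% bar is the one given in the text.

```python
def agreement_rate(agent_decisions: dict, human_decisions: dict) -> float:
    """Share of messages (present in both logs) where the decisions match."""
    shared_ids = agent_decisions.keys() & human_decisions.keys()
    if not shared_ids:
        raise ValueError("the two logs have no messages in common")
    matches = sum(
        agent_decisions[msg_id] == human_decisions[msg_id] for msg_id in shared_ids
    )
    return matches / len(shared_ids)


def ready_for_rollout(rate: float, required: float = 0.90) -> bool:
    """The agent must agree with existing moderation more than `required` of the time."""
    return rate > required


# Invented logs: 20 messages, the agent disagrees on one of them.
human_log = {i: "ALLOW" for i in range(20)}
agent_log = dict(human_log)
agent_log[7] = "BLOCK"

rate = agreement_rate(agent_log, human_log)
print(f"agreement: {rate:.0%}, ready: {ready_for_rollout(rate)}")
```

It is worth reading the disagreements by hand as well: some of them will be agent errors, but some will turn out to be mistakes by the human moderators.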
"The most common mistake is attempting to implement AI at 100% right out of the gate. It is better to implement AI at 90% and then phase in manual review gradually. At ASCN.AI we started with a hybrid process for the first three months; agents flagged 95% of violations, and a human reviewed 5%. As a result, we experienced a tenfold reduction in the manual work needed to be done without impacting quality."
The reality is that anyone claiming to provide "error-free" or "perfect" moderation is lying. Every model is built from training data, so every model will make mistakes. The objective is to reduce the number of mistakes, not to eliminate them.
An agent inherits the biases of the data it was trained on. A model trained on data from the U.S. doesn't understand context from Russia or Asia. One example is Google's toxicity model rating African American Vernacular English (AAVE) as more toxic because it is more often found in negative contexts.
A false positive means banning something that is normal. For example, an agent treats "that movie killed me" as a threat, or blocks a conversation about suicide prevention because the filter reads it as pro-suicide.
To fix this, run a fairness audit. Each quarter, check the model against multiple user groups, and if you see a skew, retrain on a balanced sample.
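One concrete form of a fairness audit is to compare false positive rates across groups. The sketch below assumes audit records of the form (group, predicted_toxic, actually_toxic) and a 5-point tolerance for skew; both the record format and the tolerance are assumptions chosen for the example.

```python
from collections import defaultdict


def false_positive_rates(records) -> dict:
    """Per-group share of benign messages that the model flagged as toxic."""
    benign = defaultdict(int)
    flagged = defaultdict(int)
    for group, predicted_toxic, actually_toxic in records:
        if actually_toxic:
            continue  # only benign messages can be false positives
        benign[group] += 1
        flagged[group] += int(predicted_toxic)
    return {group: flagged[group] / benign[group] for group in benign}


def is_skewed(rates: dict, tolerance: float = 0.05) -> bool:
    """True if the gap between the best and worst group exceeds the tolerance."""
    return max(rates.values()) - min(rates.values()) > tolerance


# Invented audit sample: 10 benign messages per group.
records = (
    [("group_a", True, False)] * 1 + [("group_a", False, False)] * 9
    + [("group_b", True, False)] * 4 + [("group_b", False, False)] * 6
)

rates = false_positive_rates(records)
print(rates)             # {'group_a': 0.1, 'group_b': 0.4}
print(is_skewed(rates))  # True
```

A gap like the one in this sample is the signal to rebalance the training data for the over-flagged group before the next retrain.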
Sarcasm is a nightmare for NLP. Take the phrase “that’s just great”: is it a compliment or sarcasm? Without the surrounding conversation, the model has little to go on. Memes compound the problem, because a meme can pair a neutral-looking image with toxic text.
Slang changes faster than the model does. In 2019, the “OK” hand gesture (👌) was co-opted by far-right groups in the U.S. as a hate symbol; a model trained before 2019 would not have flagged it.
Constant updates prevent staleness. Either retrain every 3 months or use a RAG system with a "living" rules document. At ASCN.AI, we update the database for critical changes (new memes, fads) within 24 hours using RAG.
Under ideal conditions with full context (i.e., in a lab), today's models achieve 80-90% accuracy. In real situations where the full dialog history is not available, AI achieves 70-75% accuracy. For sensitive areas, we use a hybrid approach: the agent flags content it considers offending (80-90% probability), and a human makes the final determination.
Custom moderation rules are a normal requirement. Using Policy-as-Code, you can upload your custom rulesets to RAG and have them applied without retraining from scratch. Deeper customization for specialized professional language requires fine-tuning the model (e.g., RoBERTa), which takes 1-3 weeks.
It takes a developer 2-5 days to integrate a pre-built SaaS API, including setting up endpoints and testing in shadow mode. A custom-built solution that involves training takes 3-6 weeks.
Text and photo moderation alone will not catch deepfakes. You need specialized tools (e.g., detectors trained on FaceForensics++, or Reality Defender). Currently these detectors are 90-95% accurate, but they will not catch output from new diffusion models. Suspicious videos need a combination of automated and manual review.
If you process only 100 messages/day, it is usually cheaper to hire a human moderator at a cost of $1,000. The math changes significantly at 5,000-10,000 messages daily, where AI is 5-10x less expensive than a human moderator.
If you process 10,000 messages/day or more, you can expect a return on investment in 2-4 months, depending on your business model. In one ASCN.AI case study, a fintech client saw a return on investment in 3 weeks and saved $23,800/month on salaries at a volume of 15,000 messages/day.