Extract Clean Web Content with Defuddle.js

Defuddle is a web content extraction library that pulls the main content out of web pages by removing clutter and standardizing the HTML. It works in both the browser and Node.js.

The library parses a given HTML document (or string), identifies and preserves the core article content, and strips away sidebars, headers, footers, and other non-essential elements.

It serves as an enhanced alternative to Mozilla Readability with more forgiving extraction algorithms and consistent output formatting.

Features:

Content Extraction: Removes sidebars, headers, footers, comments, and other non-essential elements
Mobile-Aware Detection: Uses page mobile styles to identify unnecessary elements
Metadata Extraction: Pulls author, title, description, publication date, and schema.org data
HTML Standardization: Normalizes headings, code blocks, footnotes, and math elements
Multiple Output Formats: Supports HTML and Markdown conversion
Performance Tracking: Includes parse time metrics and word count statistics
Bundle Options: Core, full (with math parsing), and Node.js-optimized versions
Debug Mode: Preserves attributes and structure for development analysis

Use Cases:

Building “Read It Later” Apps: If you’re creating a service where users can save articles to read later, Defuddle can provide that clean, distraction-free reading view.
Content Ingestion for AI/RAG Applications: When you need to feed webpage content into a Retrieval Augmented Generation (RAG) system or any LLM, you want the core text, not all the surrounding noise. Defuddle helps get you that cleaner text.
Web Clipping Browser Extensions: This is its origin story with Obsidian Web Clipper. If you’re building an extension that needs to grab the main content of a page, Defuddle is a solid choice.
Automated Content Processing: For tasks like scraping articles for analysis or converting web content into different formats (like Markdown for a knowledge base), Defuddle handles the initial cleanup.

How to use it:

1. Install Defuddle with NPM. Available bundles:

Core (defuddle): For browser use, no extra dependencies. Handles math content but without fallbacks for MathML/LaTeX conversion.
Full (defuddle/full): Includes more robust math parsing with mathml-to-latex and temml.
Node.js (defuddle/node): Optimized for Node.js with JSDOM, includes full math and Markdown conversion capabilities.

# NPM
$ npm install defuddle

2. If you’re planning to use it in a Node.js environment, you’ll also need JSDOM:

# NPM
$ npm install jsdom

3. In a browser environment, you can import Defuddle and pass it a DOM document object. The parse method returns an object with these properties:

content: Cleaned HTML or Markdown string
title: Article title
author: Author name
description: Article summary/description
domain: Source website domain
favicon: Website favicon URL
image: Main article image URL
metaTags: Raw meta tag data
parseTime: Processing time in milliseconds
published: Publication date string
site: Website name
schemaOrgData: Structured data from schema.org markup
wordCount: Total word count of extracted content

import Defuddle from 'defuddle';

// Assuming 'document' is the current page's document
const defuddleInstance = new Defuddle(document);
const article = defuddleInstance.parse();

console.log(article.content);
console.log(article.title);
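
As a rough illustration of how the returned fields might be used, the sketch below formats a one-line summary from a result object. The summarize helper, the sample object, and the 200 words-per-minute reading speed are assumptions made for this example; they are not part of Defuddle.

```javascript
// Build a one-line summary from a Defuddle-style parse result.
// `article` is any object with the properties listed above;
// missing fields fall back to placeholders.
function summarize(article) {
  const title = article.title || '(untitled)';
  const byline = article.author ? ` by ${article.author}` : '';
  const source = article.site || article.domain || 'unknown source';
  const words = Number.isFinite(article.wordCount) ? article.wordCount : 0;
  // Assumed reading speed of 200 words per minute (not a Defuddle value).
  const minutes = Math.max(1, Math.round(words / 200));
  return `${title}${byline} (${source}, ${words} words, ~${minutes} min read)`;
}

// Stand-in for the object returned by `new Defuddle(document).parse()`.
const sample = {
  title: 'My Title',
  author: 'Ada',
  site: 'Example',
  wordCount: 450,
  parseTime: 12
};

console.log(summarize(sample));
```

In a real page, the sample object would be replaced by the article returned from parse().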

4. For server-side processing, the setup is slightly different. You’ll use defuddle/node.

import { JSDOM } from 'jsdom';
import { Defuddle } from 'defuddle/node';

// To parse HTML from a string
const htmlString = '<html><body><article><h1>My Title</h1><p>Some content.</p></article></body></html>';
const articleFromString = await Defuddle(htmlString);
console.log(articleFromString.title);

// To parse HTML from a URL
const dom = await JSDOM.fromURL('https://example.com/some-article');
const articleFromUrl = await Defuddle(dom); // You can also pass the URL string directly
console.log(articleFromUrl.content);

// With options
const articleWithOptions = await Defuddle(dom, 'https://example.com/some-article', {
  debug: true,
  markdown: true
});

// This will be Markdown
console.log(articleWithOptions.content);
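
For batch jobs, one possible pattern is to run the extractor over several pages and keep going when one of them fails. In this sketch the extract parameter stands in for the Defuddle function from defuddle/node and is called with the same (html, url, options) arguments as above. The names extractAll and pages are hypothetical, and the extractor is injected so the sketch runs without the package installed.

```javascript
// Run an extractor over many pages and report per-page success or failure.
// `extract` is assumed to behave like Defuddle from 'defuddle/node':
// (html, url, options) => Promise resolving to a result with a `content` field.
async function extractAll(pages, extract) {
  const settled = await Promise.allSettled(
    // The async wrapper turns synchronous throws into rejections as well.
    pages.map(async (page) => extract(page.html, page.url, { markdown: true }))
  );
  return settled.map((outcome, i) =>
    outcome.status === 'fulfilled'
      ? { url: pages[i].url, ok: true, content: outcome.value.content }
      : { url: pages[i].url, ok: false, error: String(outcome.reason) }
  );
}
```

With the real package, the call might look like extractAll(pages, Defuddle), where each entry in pages carries an html string and its original url.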

5. Available configuration options. You can pass an options object during instantiation or when calling Defuddle in Node.js:

debug (boolean): Enables more verbose logging and preserves more attributes in the HTML (useful for, well, debugging).
url (string): The original URL of the page, which can help with resolving relative links and metadata.
markdown (boolean): If true, the content property in the result will be Markdown.
separateMarkdown (boolean): If true, content remains HTML, and an additional contentMarkdown property is returned.
removeExactSelectors (boolean, default: true): Controls removal of elements matching precise selectors (e.g., specific ad classes).
removePartialSelectors (boolean, default: true): Controls removal of elements matching broader, partial selectors.
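
To make the relationship between markdown and separateMarkdown concrete, here is a minimal sketch of a helper that maps a desired output mode to an options object. The buildOptions function and its output parameter are hypothetical; only the option names listed above are taken from the library.

```javascript
// Hypothetical helper: map a desired output mode to Defuddle options.
// output: 'html' (default), 'markdown', or 'both'.
function buildOptions({ url, output = 'html', debug = false } = {}) {
  const options = { debug };
  if (url) options.url = url;
  // 'markdown': the content property is returned as Markdown.
  if (output === 'markdown') options.markdown = true;
  // 'both': content stays HTML and contentMarkdown is returned alongside it.
  if (output === 'both') options.separateMarkdown = true;
  return options;
}

console.log(buildOptions({ url: 'https://example.com/some-article', output: 'both' }));
```

In Node.js the resulting object could be passed as the third argument, as in the options example in step 4. In the browser it would go to the constructor, assuming the options object is accepted as the second argument.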

FAQs

Q: Is Defuddle reliable for all websites?
A: No extraction library is 100% reliable for all websites due to the vast differences in web page structures.

Q: How does Defuddle handle paywalled content?
A: Defuddle can only parse the HTML content it’s given. If the main content is hidden behind a paywall and not present in the initial HTML (or the DOM passed to it), Defuddle won’t be able to extract it. It doesn’t bypass paywalls.

Q: How does the Markdown conversion work?
A: Defuddle uses a library like Turndown for HTML-to-Markdown conversion. The quality of the Markdown will depend on the cleanliness and structure of the HTML that Defuddle extracts. The HTML standardization step helps a lot here.

Q: What if the extracted metadata (author, date) is incorrect?
A: Metadata extraction relies on common HTML patterns, <meta> tags, and schema.org data. If a site doesn’t follow these conventions or has poorly structured metadata, Defuddle might struggle to get it right, similar to other libraries. The schemaOrgData field can be useful for debugging this, as it shows you the raw structured data it found.
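
When the author field comes back empty, one way to dig further is to look through schemaOrgData yourself. The sketch below assumes that field holds JSON-LD style data (an object, an array, or an object with an "@graph" array); the exact shape Defuddle returns may differ, and the helper names are hypothetical.

```javascript
// Collect JSON-LD nodes from an object, an array, or an object with "@graph".
function flattenNodes(data) {
  if (Array.isArray(data)) return data.flatMap(flattenNodes);
  if (data && typeof data === 'object') {
    const nested = data['@graph'] ? flattenNodes(data['@graph']) : [];
    return [data, ...nested];
  }
  return [];
}

// Return the first non-empty author name found in the structured data, or null.
// An author may be a plain string, an object with a `name`, or an array of either.
function findSchemaAuthor(schemaOrgData) {
  for (const node of flattenNodes(schemaOrgData)) {
    const authors = Array.isArray(node.author) ? node.author : [node.author];
    for (const author of authors) {
      if (typeof author === 'string' && author.trim()) return author.trim();
      if (author && typeof author.name === 'string' && author.name.trim()) {
        return author.name.trim();
      }
    }
  }
  return null;
}

// Prefer the extracted author, then fall back to the structured data.
function resolveAuthor(article) {
  return article.author || findSchemaAuthor(article.schemaOrgData);
}
```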

Q: Can Defuddle handle single-page applications with dynamically loaded content?
A: Defuddle processes the DOM state at parse time, so it works with SPAs after content has loaded. For dynamically loaded content, you’ll need to trigger parsing after the relevant content appears in the DOM.
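
One simple way to delay parsing until the content exists is to poll for it. The waitUntil helper below is a hypothetical sketch, not a Defuddle API, and the selector in the usage comment is an assumption about the page being parsed. A MutationObserver would also work in the browser; polling is used here because it runs unchanged in Node.js.

```javascript
// Resolve once `isReady()` returns true; reject if `timeoutMs` elapses first.
function waitUntil(isReady, { timeoutMs = 5000, intervalMs = 100 } = {}) {
  return new Promise((resolve, reject) => {
    const startedAt = Date.now();
    const check = () => {
      if (isReady()) {
        resolve();
      } else if (Date.now() - startedAt >= timeoutMs) {
        reject(new Error('Timed out waiting for content'));
      } else {
        setTimeout(check, intervalMs);
      }
    };
    check();
  });
}

// Hypothetical browser usage, assuming the article renders into an <article> element:
//   await waitUntil(() => document.querySelector('article p') !== null);
//   const article = new Defuddle(document).parse();
```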

Q: How does the mobile-aware detection actually work?
A: The library analyzes CSS media queries and mobile-specific styling to identify elements that are hidden or repositioned on smaller screens. Elements that disappear on mobile are often navigational or promotional content rather than primary article content. This heuristic significantly improves extraction accuracy on modern responsive sites.

Q: What happens to embedded media like videos and images?
A: Images within the main content area are preserved with their attributes intact. Embedded videos, iframes, and other rich media elements are retained if they appear to be part of the article content. Social media embeds and advertisement iframes are typically removed as part of the clutter detection.

The post Extract Clean Web Content with Defuddle.js appeared first on CSS Script.
