Defuddle parses a given HTML document (or string), identifies and preserves the core article content, and strips away sidebars, headers, footers, and other non-essential elements.
It serves as an enhanced alternative to Mozilla Readability with more forgiving extraction algorithms and consistent output formatting.
1. Install Defuddle with NPM. Available bundles:
- defuddle: For browser use, no extra dependencies. Handles math content but without fallbacks for MathML/LaTeX conversion.
- defuddle/full: Includes more robust math parsing with mathml-to-latex and temml.
- defuddle/node: Optimized for Node.js with JSDOM, includes full math and Markdown conversion capabilities.

# NPM
$ npm install defuddle
2. If you’re planning to use it in a Node.js environment, you’ll also need JSDOM:
# NPM
$ npm install jsdom
3. In a browser environment, you can import Defuddle and pass it a DOM document object. The parse method returns an object with these properties:
- content: Cleaned HTML or Markdown string
- title: Article title
- author: Author name
- description: Article summary/description
- domain: Source website domain
- favicon: Website favicon URL
- image: Main article image URL
- metaTags: Raw meta tag data
- parseTime: Processing time in milliseconds
- published: Publication date string
- site: Website name
- schemaOrgData: Structured data from schema.org markup
- wordCount: Total word count of extracted content

import Defuddle from 'defuddle';

// Assuming 'document' is the current page's document
const defuddleInstance = new Defuddle(document);
const article = defuddleInstance.parse();
console.log(article.content);
console.log(article.title);
4. For server-side processing, the setup is slightly different. You’ll use defuddle/node.
import { JSDOM } from 'jsdom';
import { Defuddle } from 'defuddle/node';
// To parse HTML from a string
const htmlString = '<html><body><article><h1>My Title</h1><p>Some content.</p></article></body></html>';
const articleFromString = await Defuddle(htmlString);
console.log(articleFromString.title);
// To parse HTML from a URL
const dom = await JSDOM.fromURL('https://example.com/some-article');
const articleFromUrl = await Defuddle(dom); // You can also pass the URL string directly
console.log(articleFromUrl.content);
// With options
const articleWithOptions = await Defuddle(dom, 'https://example.com/some-article', {
  debug: true,
  markdown: true
});
// This will be Markdown
console.log(articleWithOptions.content);

5. Available configuration options. You can pass an options object during instantiation or when calling Defuddle in Node.js:
- debug (boolean): Enables more verbose logging and preserves more attributes in the HTML (useful for, well, debugging).
- url (string): The original URL of the page, which can help with resolving relative links and metadata.
- markdown (boolean): If true, the content property in the result will be Markdown.
- separateMarkdown (boolean): If true, content remains HTML, and an additional contentMarkdown property is returned.
- removeExactSelectors (boolean, default: true): Controls removal of elements matching precise selectors (e.g., specific ad classes).
- removePartialSelectors (boolean, default: true): Controls removal of elements matching broader, partial selectors.

Q: Is Defuddle reliable for all websites?
A: No extraction library is 100% reliable for all websites due to the vast differences in web page structures.
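One way to cope with pages that extract poorly is to retry with less aggressive clutter removal. The sketch below is only an illustration: `extractWithFallback` is a hypothetical helper, the 50-word threshold is an arbitrary assumption, and `parse` is a function you supply that runs Defuddle with the given options. Only the `removePartialSelectors` option and the `wordCount` property come from the lists above.

```javascript
// Hypothetical helper, not part of Defuddle.
// `parse(options)` is supplied by the caller and is assumed to run
// Defuddle with those options and return its result object.
function extractWithFallback(parse, minWords = 50) {
  const first = parse({});
  if ((first.wordCount ?? 0) >= minWords) {
    return first;
  }
  // Second pass: keep elements that only match the broader partial selectors.
  const second = parse({ removePartialSelectors: false });
  // Prefer whichever pass recovered more words.
  return (second.wordCount ?? 0) > (first.wordCount ?? 0) ? second : first;
}
```

In the browser, `parse` could be something like `(options) => new Defuddle(document, options).parse()`, assuming the options object goes in as the second constructor argument. Whether the relaxed pass actually helps depends on the page, so treat the result as a best effort.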
Q: How does Defuddle handle paywalled content?
A: Defuddle can only parse the HTML content it’s given. If the main content is hidden behind a paywall and not present in the initial HTML (or the DOM passed to it), Defuddle won’t be able to extract it. It doesn’t bypass paywalls.
Q: How does the Markdown conversion work?
A: Defuddle uses a library like Turndown for HTML-to-Markdown conversion. The quality of the Markdown will depend on the cleanliness and structure of the HTML that Defuddle extracts. The HTML standardization step helps a lot here.
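Because `markdown` and `separateMarkdown` put the Markdown in different places, a small accessor can keep calling code tidy. This is a minimal sketch based only on the option descriptions above; `getMarkdown` is a hypothetical helper name, and the behavior when both options are set at once is an assumption.

```javascript
// Hypothetical helper, not part of Defuddle.
// Returns the Markdown string from a parse result, or null if the
// options used for parsing did not request Markdown at all.
function getMarkdown(result, options = {}) {
  // separateMarkdown: content stays HTML, Markdown lives in contentMarkdown.
  if (options.separateMarkdown && typeof result.contentMarkdown === "string") {
    return result.contentMarkdown;
  }
  // markdown: the content property itself is Markdown.
  if (options.markdown) {
    return result.content;
  }
  return null;
}
```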
Q: What if the extracted metadata (author, date) is incorrect?
A: Metadata extraction relies on common HTML patterns, <meta> tags, and schema.org data. If a site doesn’t follow these conventions or has poorly structured metadata, Defuddle might struggle to get it right, similar to other libraries. The schemaOrgData field can be useful for debugging this, as it shows you the raw structured data it found.
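If the `author` field comes back empty, you can sometimes recover it from `schemaOrgData` yourself. The sketch below assumes that `schemaOrgData` is a JSON-LD object (or an array of them) whose `author` is a string, an object with a `name`, or an array of either. That shape is an assumption for illustration, and `authorFromSchema` and `resolveAuthor` are hypothetical helpers, so check the raw data on your own pages first.

```javascript
// Hypothetical helpers, not part of Defuddle.
// Assumption: schemaOrgData is a JSON-LD object or an array of such objects.
function authorFromSchema(schemaOrgData) {
  const items = Array.isArray(schemaOrgData) ? schemaOrgData : [schemaOrgData];
  for (const item of items) {
    if (!item || typeof item !== "object") continue;
    const authors = Array.isArray(item.author) ? item.author : [item.author];
    for (const a of authors) {
      // JSON-LD allows a plain string or an object such as { name: "..." }.
      if (typeof a === "string" && a.trim()) return a.trim();
      if (a && typeof a.name === "string" && a.name.trim()) return a.name.trim();
    }
  }
  return null;
}

// Prefer the author Defuddle extracted; fall back to the structured data.
function resolveAuthor(result) {
  const direct = typeof result.author === "string" ? result.author.trim() : "";
  return direct || authorFromSchema(result.schemaOrgData);
}
```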
Q: Can Defuddle handle single-page applications with dynamically loaded content?
A: Defuddle processes the DOM state at parse time, so it works with SPAs after content has loaded. For dynamically loaded content, you’ll need to trigger parsing after the relevant content appears in the DOM.
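A simple way to delay parsing is to poll until the content you care about exists. The `waitFor` helper below is a hypothetical, generic sketch with arbitrary default timings, not something Defuddle ships; the `article` selector in the usage comment is likewise just an assumption about how your SPA renders.

```javascript
// Hypothetical helper, not part of Defuddle.
// Polls `check()` until it returns a truthy value, or rejects on timeout.
function waitFor(check, { timeoutMs = 5000, intervalMs = 100 } = {}) {
  return new Promise((resolve, reject) => {
    const deadline = Date.now() + timeoutMs;
    const tick = () => {
      const value = check();
      if (value) {
        resolve(value);
      } else if (Date.now() >= deadline) {
        reject(new Error("Timed out waiting for content"));
      } else {
        setTimeout(tick, intervalMs);
      }
    };
    tick();
  });
}

// Browser usage (assumes the app renders into an <article> element):
// waitFor(() => document.querySelector("article"))
//   .then(() => new Defuddle(document).parse())
//   .then((article) => console.log(article.title));
```

Polling is easy to reason about. If you need to react the instant content appears, the browser's MutationObserver API is an alternative worth considering.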
Q: How does the mobile-aware detection actually work?
A: The library analyzes CSS media queries and mobile-specific styling to identify elements that are hidden or repositioned on smaller screens. Elements that disappear on mobile are often navigational or promotional content rather than primary article content. This heuristic significantly improves extraction accuracy on modern responsive sites.
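To make the idea concrete, here is a deliberately simplified sketch of the heuristic, not Defuddle's actual implementation. It scans a CSS string for `max-width` media queries at or below an assumed mobile breakpoint (768px here) and collects the selectors set to `display: none` inside them. The function name, the breakpoint, and the regex-based parsing are all assumptions made for illustration.

```javascript
// Illustrative only: a naive scan for selectors hidden on small screens.
// Real stylesheets need a proper CSS parser; this handles simple cases.
function selectorsHiddenOnMobile(css, maxMobileWidth = 768) {
  const hidden = [];
  const mediaRe = /@media[^{]*\(\s*max-width\s*:\s*(\d+)px\s*\)[^{]*\{/g;
  let match;
  while ((match = mediaRe.exec(css)) !== null) {
    if (Number(match[1]) > maxMobileWidth) continue;
    // Walk forward to the brace that closes this @media block.
    const start = mediaRe.lastIndex;
    let depth = 1;
    let i = start;
    while (i < css.length && depth > 0) {
      if (css[i] === "{") depth++;
      else if (css[i] === "}") depth--;
      i++;
    }
    const body = css.slice(start, depth === 0 ? i - 1 : i);
    // Inspect each "selector { declarations }" rule inside the block.
    const ruleRe = /([^{}]+)\{([^{}]*)\}/g;
    let rule;
    while ((rule = ruleRe.exec(body)) !== null) {
      if (/display\s*:\s*none/.test(rule[2])) {
        hidden.push(...rule[1].split(",").map((s) => s.trim()).filter(Boolean));
      }
    }
  }
  return hidden;
}
```

Elements matching the returned selectors would then be candidates for removal, on the theory that content hidden on phones is rarely the article itself.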
Q: What happens to embedded media like videos and images?
A: Images within the main content area are preserved with their attributes intact. Embedded videos, iframes, and other rich media elements are retained if they appear to be part of the article content. Social media embeds and advertisement iframes are typically removed as part of the clutter detection.
The post Extract Clean Web Content with Defuddle.js appeared first on CSS Script.