Extract Clean Text from Any Webpage for RAG Pipelines

Source: DEV Community
Building RAG (Retrieval-Augmented Generation) systems? You need clean text, not raw HTML. Here's a simple approach using CheerioCrawler:

    // Remove noise
    $("script, style, nav, footer, header, aside, .ad, noscript").remove();

    // Get main content
    let text = $("article, [role=main], main, .content").first().text();
    if (!text || text.length < 100) text = $("body").text();

    // Clean whitespace
    text = text.replace(/\s+/g, " ").trim();

Why Not Just Use body.text()?

Raw body text includes navigation menus, footer links, cookie banners, and ad text. For RAG, you want ONLY the main content.

The Priority Order

1. <article> tag: most semantic, usually contains the main content
2. [role="main"]: ARIA landmark
3. <main>: HTML5 semantic element
4. .content, .post-content: common CSS classes
5. <body>: fallback

Output

    {
      "url": "https://example.com/blog/post",
      "title": "The Blog Post Title",
      "text": "Clean extracted text...",
      "wordCount": 1450,
      "characterCount": 8700
    }

I built a Text Extracto
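The priority order and the output record described above can be sketched as a few small helpers. This is only an illustration under stated assumptions, not the author's implementation. The names `extractMainText`, `buildRecord`, and `cleanWhitespace` are hypothetical. `$` is assumed to be a Cheerio-style function, such as the one CheerioCrawler hands to a request handler. The way `wordCount` and `characterCount` are computed is also a guess, since the post shows only the output shape. One reason to try the selectors one at a time: as far as I know, a comma-separated selector returns matches in document order, so `.first()` on the combined selector may pick a `<main>` that wraps an `<article>` rather than the `<article>` itself.

```javascript
// Selectors in the priority order listed in the post (highest priority first).
const PRIORITY_SELECTORS = ["article", "[role=main]", "main", ".content, .post-content"];

// Same 100-character threshold as the snippet in the post.
const MIN_LENGTH = 100;

// Collapse every run of whitespace into a single space and trim the ends.
function cleanWhitespace(text) {
  return text.replace(/\s+/g, " ").trim();
}

// Assumption: `$` behaves like a loaded Cheerio instance.
function extractMainText($) {
  // Remove noise before reading any text.
  $("script, style, nav, footer, header, aside, .ad, noscript").remove();

  // Try each selector in turn so the stated priority is actually honored.
  for (const selector of PRIORITY_SELECTORS) {
    const text = cleanWhitespace($(selector).first().text());
    if (text.length >= MIN_LENGTH) return text;
  }

  // Fallback: the whole body.
  return cleanWhitespace($("body").text());
}

// Assumption: words are space-separated tokens of the cleaned text,
// and characters are counted on the cleaned text as well.
function buildRecord(url, title, text) {
  return {
    url,
    title,
    text,
    wordCount: text ? text.split(" ").length : 0,
    characterCount: text.length,
  };
}
```

Inside a crawler's request handler, this would be called roughly as `buildRecord(request.url, $("title").text(), extractMainText($))`. Note that this sketch applies the length check after whitespace cleanup, which differs slightly from the original snippet.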