# How to Use rs-trafilatura with Scrapy

Source: DEV Community
Scrapy is the standard Python framework for web scraping: it handles crawling, scheduling, and data pipelines. rs-trafilatura plugs into Scrapy as an item pipeline — your spider yields items containing raw HTML, and the pipeline adds structured extraction results to each item automatically.

## Install

```shell
pip install rs-trafilatura scrapy
```

## Setup

Add the pipeline to your Scrapy project's `settings.py`:

```python
ITEM_PIPELINES = {
    "rs_trafilatura.scrapy.RsTrafilaturaPipeline": 300,
}
```

That's it. Every item that passes through the pipeline with a `body` (bytes) or `html` (string) field will get an extraction dict added to it.

## Writing the Spider

Your spider yields items with the response body and URL:

```python
import scrapy


class ContentSpider(scrapy.Spider):
    name = "content"
    start_urls = ["https://example.com"]

    def parse(self, response):
        yield {
            "url": response.url,
            "body": response.body,  # raw bytes — rs-trafilatura auto-detects encoding
        }
        # Follow links
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, self.parse)
```
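Once extraction results are attached, a later pipeline stage can act on them — for example, dropping thin pages. As a minimal sketch of that filtering logic (the `extraction` field name and its `text` key are assumptions for illustration, not rs-trafilatura's documented layout — check its README for the actual keys):

```python
# Hypothetical post-processing: keep only items whose extracted text
# clears a word-count threshold. The "extraction"/"text" field names
# are assumptions for illustration.

MIN_WORDS = 50

def keep_item(item: dict) -> bool:
    """Return True if the item's extracted main text has at least MIN_WORDS words."""
    extraction = item.get("extraction") or {}
    text = extraction.get("text") or ""
    return len(text.split()) >= MIN_WORDS
```

In a real project this check would live in a second item pipeline's `process_item`, raising `scrapy.exceptions.DropItem` when `keep_item` returns `False`.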
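A note on the `300` in `ITEM_PIPELINES`: it is the pipeline's priority, and lower numbers run first. Any pipeline that consumes the extraction results needs a higher number so it runs after rs-trafilatura. A sketch of that ordering (the `myproject.pipelines.WordCountFilter` path is a hypothetical name, not something the library ships):

```python
# settings.py — hypothetical downstream pipeline registered after extraction.
ITEM_PIPELINES = {
    "rs_trafilatura.scrapy.RsTrafilaturaPipeline": 300,  # runs first: adds the extraction dict
    "myproject.pipelines.WordCountFilter": 400,          # runs second: consumes it
}
```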