Client work
226K+ articles / week
News Scraping
Data Engineer
A unified crawler across ~2,900 news portals — automatic extraction for most via newspaper4k, with a custom-parser fallback for ~100 sources it couldn't handle out of the box.
One consistent, clean feed across ~2,900 portals, with a maintainable fallback layer for the long tail.
How it works
- Sources: News portals
- Crawl: BeautifulSoup / newspaper4k
- Extract: Content parser
- Store: Database
- Analyze: NER + Sentiment AI
Problem
newspaper4k covers most portals with just a URL, but ~100 sources have markup it can't parse correctly.
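The routing this implies can be sketched roughly as follows. This is an illustration under assumptions, not the project's actual code: `CUSTOM_PARSERS`, `extract_article`, and `parse_with_newspaper` are hypothetical names, and keying custom parsers by hostname is an assumed design.

```python
from typing import Callable, Dict
from urllib.parse import urlparse

# An extractor takes a URL and returns a dict with "title" and "text".
Extractor = Callable[[str], Dict[str, str]]

# Hypothetical registry: hostname -> custom parser for the portals
# that newspaper4k cannot handle out of the box.
CUSTOM_PARSERS: Dict[str, Extractor] = {}


def parse_with_newspaper(url: str) -> Dict[str, str]:
    """Default path: let newspaper4k download and parse the article."""
    from newspaper import Article  # lazy import; requires newspaper4k

    article = Article(url)
    article.download()
    article.parse()
    return {"url": url, "title": article.title, "text": article.text}


def extract_article(
    url: str, default: Extractor = parse_with_newspaper
) -> Dict[str, str]:
    """Use a registered custom parser for the host, else the default."""
    host = urlparse(url).netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    parser = CUSTOM_PARSERS.get(host, default)
    return parser(url)
```

Passing the default extractor as a parameter keeps the routing testable without network access: a stub can stand in for newspaper4k.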
Solution
Built a reusable custom-parser template, adapted per portal with small tweaks instead of one-off scripts for each.
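One way such a template could look is a small class that holds the shared parsing logic, with each portal supplying only its own CSS selectors. This is a minimal sketch assuming BeautifulSoup with selector-based extraction; `PortalParser`, its fields, and the example selectors are hypothetical and not taken from the project.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

from bs4 import BeautifulSoup


def _clean(text: str) -> str:
    """Collapse runs of whitespace into single spaces."""
    return " ".join(text.split())


@dataclass(frozen=True)
class PortalParser:
    """Shared parsing logic; each portal only overrides selectors."""

    title_selector: str = "h1"
    body_selector: str = "article p"
    # Elements to drop before reading the body (scripts, related links, ...).
    strip_selectors: Tuple[str, ...] = ("script", "style")

    def parse(self, html: str) -> Dict[str, str]:
        soup = BeautifulSoup(html, "html.parser")
        for selector in self.strip_selectors:
            for node in soup.select(selector):
                node.decompose()
        title_node = soup.select_one(self.title_selector)
        title = _clean(title_node.get_text()) if title_node else ""
        paragraphs = [_clean(p.get_text()) for p in soup.select(self.body_selector)]
        text = "\n\n".join(p for p in paragraphs if p)
        return {"title": title, "text": text}


# Per-portal tweak: override only what differs from the defaults.
# These selectors are made up for illustration.
EXAMPLE_PORTAL = PortalParser(
    title_selector="h1.headline",
    body_selector="div.story-body p",
    strip_selectors=("script", "style", "div.related"),
)
```

With this shape, onboarding another hard-to-parse portal means adding one `PortalParser(...)` instance rather than a new script.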
Tech stack
- Python
- BeautifulSoup
- Selenium
- newspaper4k