Haikal Hilmi
Client work · 226K+ articles / week

News Scraping

Data Engineer

A unified crawler across ~2,900 news portals — automatic extraction for most via newspaper4k, with a custom-parser fallback for ~100 sources it couldn't handle out of the box.

One consistent, clean feed across ~2,900 portals, with a maintainable fallback layer for the long tail.

How it works

  • Sources: news portals
  • Crawl: BeautifulSoup / newspaper4k
  • Extract: content parser
  • Store: database
  • Analyze: NER + sentiment AI

Problem

newspaper4k covers most portals with just a URL, but ~100 sources have markup it can't parse correctly.
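
A minimal sketch of how that split could be routed, assuming a registry keyed by hostname. The names `ArticleRecord`, `looks_complete`, `extract`, and the 200-character threshold are hypothetical and not taken from the project; only the `Article` download/parse calls come from newspaper4k itself.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional
from urllib.parse import urlparse


@dataclass
class ArticleRecord:
    url: str
    title: str
    text: str
    parser: str  # which path produced the record: "auto" or "custom"


def auto_extract(url: str) -> ArticleRecord:
    """Automatic path: newspaper4k needs nothing but the URL."""
    # Imported lazily; newspaper4k installs the `newspaper` package.
    from newspaper import Article

    article = Article(url)
    article.download()
    article.parse()
    return ArticleRecord(url, article.title or "", article.text or "", "auto")


def looks_complete(record: ArticleRecord, min_chars: int = 200) -> bool:
    """Hypothetical quality check: a title plus a reasonably long body."""
    return bool(record.title.strip()) and len(record.text.strip()) >= min_chars


Parser = Callable[[str], ArticleRecord]


def extract(
    url: str,
    custom_parsers: Dict[str, Parser],
    auto: Parser = auto_extract,
) -> Optional[ArticleRecord]:
    """Use a registered custom parser for known-difficult hosts, else the automatic path."""
    host = urlparse(url).netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    parser = custom_parsers.get(host, auto)
    try:
        record = parser(url)
    except Exception:
        return None  # assumed policy: failures are skipped and reviewed later
    return record if looks_complete(record) else None
```

Passing the automatic extractor in as a parameter keeps the routing logic testable without network access. A record that fails the check comes back as `None`, which is one way to spot portals that may need a custom parser.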

Solution

Built a reusable custom-parser template, adapted per portal with small tweaks instead of one-off scripts for each.
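
One way such a template could look, as a sketch rather than the project's actual code: the parsing logic is written once with BeautifulSoup, and each portal supplies only a small set of CSS selectors. `PortalConfig`, `parse_article`, and the selectors in the example are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

from bs4 import BeautifulSoup


@dataclass(frozen=True)
class PortalConfig:
    """Per-portal tweaks: the only part that changes between sources."""
    title_selector: str
    body_selector: str
    drop_selectors: Tuple[str, ...] = ()  # ads, related links, share widgets


def parse_article(html: str, config: PortalConfig) -> Optional[dict]:
    """Shared template: strip noise, then pull the title and body paragraphs."""
    soup = BeautifulSoup(html, "html.parser")

    # Remove unwanted blocks before reading any text.
    for selector in config.drop_selectors:
        for node in soup.select(selector):
            node.decompose()

    title_node = soup.select_one(config.title_selector)
    paragraphs = [
        p.get_text(" ", strip=True) for p in soup.select(config.body_selector)
    ]
    paragraphs = [p for p in paragraphs if p]  # skip empty paragraphs
    if title_node is None or not paragraphs:
        return None  # selectors no longer match; the config needs updating

    return {
        "title": title_node.get_text(" ", strip=True),
        "text": "\n\n".join(paragraphs),
    }


# Example with made-up markup and selectors for a single portal.
sample_html = (
    '<html><body><h1 class="headline">Flood hits city</h1>'
    '<div class="content"><p>First paragraph.</p>'
    '<div class="ad"><p>Buy now</p></div>'
    "<p>Second paragraph.</p></div></body></html>"
)
sample_config = PortalConfig("h1.headline", "div.content p", (".ad",))
result = parse_article(sample_html, sample_config)
```

Under this design, adding a portal means writing a new `PortalConfig` rather than a new script, and a `None` result signals that a portal's markup has changed.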

Tech stack

  • Python
  • BeautifulSoup
  • Selenium
  • newspaper4k