How it works

NewsHub is a web service for news aggregation and daily PDF digest generation that works with major news sources all over the world. NewsHub is based on the TagsReaper scraping tool, a data mining service that lets you crawl and extract web content of different kinds (static HTML or dynamically generated).

An NH Digest is a set of articles collected from many source sites (currently more than 1500) and ranked by popularity among the thematic set of source websites. Similar articles are marked with five filled stars. The rate number in brackets shows how many very similar articles were omitted from the digest. The rate is a floating-point number. Its integer part is the number of similar articles from unique source publisher domains. Its fractional part is the sum of the number of similar articles from non-unique source publisher domains and the number of articles similar to those similar articles (second-level, or child, similarity). Articles in a digest are sorted by similarity rate and then by publication date or, if no publication date was detected, by the date the article was processed. Digests are published periodically on different schedules and reflect the sequence of article publications in time, by rate, and by subject.

## Types of digests

Digests contain articles collected according to so-called projects, which are entities of the TR (c) service. Each project is configured to crawl from one to hundreds of source sites and to collect articles into thematic groups according to the main thematic subject. The thematic subjects are:
  • Mix – the mix of world news articles, mostly political and social.
  • IT – the IT news articles.
  • Software – the software development and usage news articles.
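
The similarity rate and the digest ordering described at the start of this page can be illustrated with a short sketch. This is not the NewsHub implementation: the `Article` class, its field names, the `sort_digest` function, the thousandths scale used for the fractional part, and the descending sort direction are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Article:
    title: str
    date: datetime              # publication date, or processing date if none was detected
    unique_domain_similar: int  # similar articles from unique publisher domains
    other_similar: int          # similar articles from non-unique domains plus child similarity

    @property
    def rate(self) -> float:
        # Integer part: similar articles from unique domains.
        # Fractional part: all remaining similar articles.
        # ASSUMPTION: the page does not state the scale, so thousandths are used.
        return self.unique_domain_similar + min(self.other_similar, 999) / 1000


def sort_digest(articles):
    # ASSUMPTION: highest rate first, newest first among equal rates.
    return sorted(articles, key=lambda a: (a.rate, a.date), reverse=True)


articles = [
    Article("A", datetime(2024, 1, 1, 9), 2, 1),
    Article("B", datetime(2024, 1, 1, 8), 2, 5),
    Article("C", datetime(2024, 1, 1, 10), 0, 0),
    Article("D", datetime(2024, 1, 1, 11), 0, 0),
]
ordered = [a.title for a in sort_digest(articles)]
print(ordered)  # ['B', 'A', 'D', 'C']
```

Articles A and B share the integer part 2, so the fractional part decides their order; C and D have no similar articles, so the date decides.
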
Digests are published in geographically optimized editions for:
  • Japan – article source sites optimized for Japan.
  • United States – article source sites optimized for the US.
  • Ukraine – article source sites optimized for Ukraine.
  • Deutschland – article source sites optimized for Germany.
Geographically optimized sources are oriented mostly toward a country or region, and their content is close to it in events, position, source, mentality, and so on.

Digests are published in different forms:
  • desktop – optimized for desktop viewing: A4 page size, matching font sizes, and portrait orientation; includes the TOC, announcement, and article body sections.
  • mobile – optimized for viewing on mobile devices: A7 page size, matching font sizes, and portrait orientation; includes only the article body section.
Digests are published in different formats:
  • pdf – a single document or multiple documents that can be downloaded via links provided in the notification message and can include the TOC, announcement, and article sections.
  • top 3 posts – three article announcements from one thematic subject that can be presented in the notification message sent by mail or another notification transport. An announcement can include the title, description, publication date, author, size, and rank, but not the full article body. If a subscription exists for the same thematic subject in both “pdf” and “top 3 posts”, the notification message will contain both the link to download the PDF document and the announcements.
Digests are published in different languages:
  • in english – articles in English.
  • in ukraine – articles in Ukrainian.
  • in russian – articles in Russian.
  • in japan – articles in Japanese.
  • in german – articles in German.
Digests are published from different sets of articles collected over periods of time. By default, if the subscription title has no additional suffix, a digest includes all articles for the last six hours, ranked by popularity rank and publication date (or collection date if no publication date was detected). The suffixes below select other sets; each entry notes whether the digest is unique during a week:
  • top 100 – the top 100 articles for the last six hours, ranked by popularity rank and publication date (or collection date if no publication date was detected); unique during a week.
  • full – all articles for the last ten hours, ranked by popularity rank and publication date (or collection date if no publication date was detected); not unique during a week.
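
The two variants above can be sketched as a selection step over collected articles. This is only an illustration under stated assumptions: the `VARIANTS` table, the `select_articles` function, and the dictionary keys `url`, `rank`, and `date` are hypothetical, and "unique during a week" is read here as "articles already published in a digest this week are not repeated", which the page does not spell out.

```python
from datetime import datetime, timedelta

# Period and size come from the list above; "unique" mirrors the
# "unique during a week" remark for each variant.
VARIANTS = {
    "top 100": {"hours": 6, "limit": 100, "unique": True},
    "full": {"hours": 10, "limit": None, "unique": False},
}


def select_articles(articles, variant, now, seen_this_week=frozenset()):
    """Pick the articles for one digest.

    `articles` is a list of dicts with the hypothetical keys "url", "rank"
    and "date" (publication date, or collection date if none was detected).
    `seen_this_week` holds URLs already published in this week's digests.
    """
    spec = VARIANTS[variant]
    cutoff = now - timedelta(hours=spec["hours"])
    picked = [a for a in articles if a["date"] >= cutoff]
    if spec["unique"]:
        # ASSUMPTION: uniqueness means no repeats within the week.
        picked = [a for a in picked if a["url"] not in seen_this_week]
    # ASSUMPTION: highest rank first, newest first among equal ranks.
    picked.sort(key=lambda a: (a["rank"], a["date"]), reverse=True)
    return picked[: spec["limit"]]  # a limit of None keeps everything
```

Keeping the variants in one table makes it easy to see that "top 100" and "full" differ in period, size, and uniqueness at once.
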
A subscription can also be made to publications from different installations of the DC service:
  • dc3 – Distributed Crawler installation #3, full-featured publications.
  • TEST dc2 – Distributed Crawler installation #2, test publications with a minimal number of articles and the same configuration as dc3.
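
The options listed so far combine into a subscription name such as “dc3 United States IT in english pdf”. The sketch below shows one way to build and validate such a name. The function `subscription_name` and the constant names are hypothetical, and the keyword order (installation, country, subject, language, format) is an assumption inferred from that single example name.

```python
# Keyword values copied from the lists on this page.
INSTALLATIONS = ("dc3", "TEST dc2")
COUNTRIES = ("Japan", "United States", "Ukraine", "Deutschland")
SUBJECTS = ("Mix", "IT", "Software")
LANGUAGES = ("english", "ukraine", "russian", "japan", "german")
FORMATS = ("pdf", "top 3 posts")


def subscription_name(installation, country, subject, language, fmt):
    """Join option keywords into a subscription name (assumed keyword order)."""
    choices = (
        (installation, INSTALLATIONS),
        (country, COUNTRIES),
        (subject, SUBJECTS),
        (language, LANGUAGES),
        (fmt, FORMATS),
    )
    for value, allowed in choices:
        if value not in allowed:
            raise ValueError(f"unknown option: {value!r}")
    return f"{installation} {country} {subject} in {language} {fmt}"


print(subscription_name("dc3", "United States", "IT", "english", "pdf"))
# dc3 United States IT in english pdf
```
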
At present, subscriptions are available for all listed thematic subjects, countries, languages, forms, formats, and installations. The subscription name includes keywords that identify the article set as a unique collection, the document format, and the other properties listed above. For example, “dc3 United States IT in english pdf” means that a subscription can be made to be notified when a new digest is published for “desktop”, “desktop top 100”, “desktop full”, “desktop top 100 full”, “mobile”, “mobile top 100”, “mobile top 100 full”, or “mobile full” in PDF document format. A subscription can also be made to “top 3 posts” and “top 3 posts full”, whose article announcements are included in the notification message or posted to the corresponding WordPress category using the XML-RPC API, if it is configured for the user’s account.

## Digest Navigation

Digests support internal cross-reference links inside the PDF document. Most PDF viewers and readers follow them.
  • An article’s title in the announcement section links to the article’s body section.
  • The article title in the body section links to the next article.
  • The article’s order number in the body section links to the announcement section.
  • The author at the bottom of the article body links to the source website page.
  • The list of domain names, shown when an article has similar articles that were omitted from the digest, links to the source website article pages.