How it works

NewsHub is a news aggregation and daily PDF digest generation web service with an analytical focus. It works with major news sources all over the world.

NewsHub is based on the TagsReaper data mining service, which lets you crawl and extract web content of different kinds (static HTML or dynamically generated) and process it with a range of analytical algorithms. It also offers several tools for visual representation of the results.

Analysis

Similarity rate

This rate reflects the popularity of an article’s subject. During computation, the similarity rate algorithm compares articles with each other; the estimated similarity of each pair is compared with a threshold value to classify the articles as similar or not. Similar articles are marked with a filled five-star rating. The rating number in brackets shows how many articles are very similar and were omitted from this digest.

The rate value is a floating-point number. The integer part is the number of similar articles from unique source publisher domains. The fractional part is the sum of the number of similar articles from non-unique source publisher domains and the number of articles similar to those similar articles (second-level, or child, similarity).
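
As a rough illustration of the threshold comparison and the domain counting described above, the sketch below compares articles pairwise and, for each article, counts the similar articles that come from distinct domains and those that repeat a domain. The Jaccard word-overlap measure, the threshold of 0.5, the sample data, and all function and field names are assumptions made for illustration; this document does not specify the actual NewsHub measure or how the counts are encoded into the fractional part.

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity in [0, 1]; a stand-in for the real measure."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def domain_counts(articles: list[dict], threshold: float = 0.5) -> dict[str, tuple[int, int]]:
    """For each article id, return (distinct_domains, repeated_domain_articles)
    among the articles judged similar to it."""
    similar_domains: dict[str, list[str]] = {a["id"]: [] for a in articles}
    for x, y in combinations(articles, 2):
        # A pair counts as similar when its score reaches the threshold.
        if jaccard(x["text"], y["text"]) >= threshold:
            similar_domains[x["id"]].append(y["domain"])
            similar_domains[y["id"]].append(x["domain"])
    return {
        article_id: (len(set(domains)), len(domains) - len(set(domains)))
        for article_id, domains in similar_domains.items()
    }


# Hypothetical sample data.
ARTICLES = [
    {"id": "a1", "domain": "a.com", "text": "central bank raises interest rates again"},
    {"id": "a2", "domain": "b.com", "text": "central bank raises interest rates sharply"},
    {"id": "a3", "domain": "b.com", "text": "central bank raises interest rates today"},
    {"id": "a4", "domain": "c.com", "text": "local team wins football cup"},
]

counts = domain_counts(ARTICLES)
```

Here article `a1` has two similar articles, both from `b.com`, so it gets one distinct domain and one repeated-domain article, while `a4` has no similar articles at all.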

Sentiment rate

It reflects the relative mood of an article based on analysis of its title, description, and part of its body. An improved Bayesian sentiment rate algorithm with vocabularies and NLP-based extensions estimates the positive and negative values and calculates the resulting weighted rate. The sign of the value indicates the direction of the sentiment, positive or negative. Zero means that the positive and negative parts are equal. The sentiment rate is shown as a smiley icon with a floating-point number, or as a signed floating-point number alone.
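
To make the idea of a signed, weighted rate concrete, here is a minimal sketch. It is a plain vocabulary count, not the Bayesian algorithm mentioned above; the tiny word lists, the per-field weights, and the normalization to the range from -1 to 1 are all assumptions chosen for illustration.

```python
# Hypothetical vocabularies and field weights, for illustration only.
POSITIVE = {"growth", "win", "success", "improve"}
NEGATIVE = {"crisis", "loss", "fail", "decline"}
WEIGHTS = {"title": 3.0, "description": 2.0, "body": 1.0}


def sentiment_rate(article: dict[str, str]) -> float:
    """Return a signed rate in [-1, 1]; 0.0 when both parts are equal."""
    positive = negative = 0.0
    for field, weight in WEIGHTS.items():
        for word in article.get(field, "").lower().split():
            if word in POSITIVE:
                positive += weight
            elif word in NEGATIVE:
                negative += weight
    total = positive + negative
    if total == 0:
        return 0.0  # no sentiment-bearing words found
    return (positive - negative) / total
```

A positive word in the title outweighs a negative word in the body because of the assumed weights, so `sentiment_rate({"title": "growth", "body": "crisis"})` is positive.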

Social sentiment rate – a rate calculated from social network data such as posts, messages, or comments.

Social rate – a heterogeneous synthetic indicator that summarizes activity and reflection across several social networks. It combines into one floating-point value indicators such as the number of posts, likes, shares, and reposts in social networks such as Twitter, Facebook, G+, and Instagram. It can be compared in relative terms and used as a metric in further estimations and media environment analysis. (This rate is coming soon.)

POP-words

A pop-word is an entity detected by the pop-words algorithm. It represents a term, acronym, proper name, phrase with a similar sense, or a single word. Such an entity is discussed very often and has become a point of intersection between articles in pairwise comparison. The frequency of a pop-word reflects its popularity: it is the number of cases in which the pop-word was an intersection point in a pairwise comparison of articles. POP-words have their own similarity rate and sentiment rate, calculated respectively as the maximum of the similarity rates of the articles that contain the pop-word and as a weighted ratio of the positive and negative rates. POP-words are tracked as dedicated, self-sufficient objects and are visualized in the POP-words Timeline Tool.
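
The frequency definition above can be sketched as counting how often a term falls in the intersection of two articles' term sets. How terms are extracted from articles is not described here, so the sketch assumes each article already carries a set of candidate terms; the field names and sample data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def pop_word_frequencies(articles: list[dict]) -> Counter:
    """Count, per term, the article pairs whose term sets both contain it."""
    frequencies: Counter = Counter()
    for first, second in combinations(articles, 2):
        frequencies.update(first["terms"] & second["terms"])
    return frequencies


def pop_word_similarity(term: str, articles: list[dict]) -> float:
    """Maximum similarity rate among the articles that contain the term."""
    rates = [a["similarity_rate"] for a in articles if term in a["terms"]]
    return max(rates, default=0.0)


# Hypothetical sample data.
SAMPLE = [
    {"terms": {"nato", "summit"}, "similarity_rate": 2.0},
    {"terms": {"nato", "budget"}, "similarity_rate": 4.1},
    {"terms": {"nato", "summit", "budget"}, "similarity_rate": 1.0},
]

frequencies = pop_word_frequencies(SAMPLE)
```

With three articles there are three pairs; "nato" is shared by all of them, while "summit" and "budget" are each shared by one pair.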

POP-words Timeline Tool

Chart

The chart can be configured to show a timeline of several indicators in different ways, such as Line, Column, or Area, to visualize the dynamics of a pop-word’s popularity and of its social reflection.

Toolbar buttons:

Month – sets the date range to 30 days and recalculates the minimal frequency;
Week – sets the date range to 7 days and recalculates the minimal frequency;
Today – sets the date range to 1 day (today) and recalculates the minimal frequency;
Yesterday – sets the date range to 1 day (yesterday) and recalculates the minimal frequency;
Bar – sets the chart type to Bar;
Line – sets the chart type to Line and the indicator value to natural;
Area – sets the chart type to Area and the indicator value to percentage;
Share – lets you copy the URL of this chart to share it.

POP-words list

POP-words are presented in an accordion control. Each title contains the following columns:
Rates – rates calculated for this pop-word term, for example:
– Similarity rate, Sentiment rate, social rates (for Twitter: posts, retweets, likes, and sentiment rates).
Twitter posts is the sum of the Twitter posts of all articles for the POP-word.
Twitter reposts (retweets) is the sum of the retweets of all articles for the POP-word.
Twitter likes is the sum of the likes of all articles for the POP-word.
Sentiment is the sum of the sentiments of all articles for the POP-word.
Please note that social data indicators such as “posts”, “reposts”, “likes”, and “shares” are collected once, when the article is scraped and saved to the database. They depend on the data provided by the particular social network and may vary over time. The numbers should therefore be used only as a reference, not as current data.

The POP-words timeline tool uses this data as values for charts and selections. Social networks use different ways to find posts related to articles, so social data is currently the total sum of the indicator counters for directly related messages (“posts” and “reposts” that use the article’s URL) and indirectly related ones (“posted” or “reposted” in the same branch). Indicators attached to “posts” and “reposts”, such as “likes” and “shares”, are calculated the same way.

The sentiment rate of social data is calculated from the content of comments posted by those who share or repost the article. It is then summed in the same way as the POP-word’s Sentiment rate in the timeline pop-words list.

Term – the pop-word term text.
Chains – so-called second-level chains, detected as highly frequent fragments of text used together with the pop-word term (before or after it) in articles.
Days – the number of days on which this pop-word term appears at least once.
Articles – the number of articles in which this pop-word term appears.
Value – the value of the measured indicator; depending on the chart and options configuration, it can be the pop-word frequency, similarity rate, number of articles, or social indicators (TW-posts, TW-re-posts, TW-shares) in natural or percentage format.
Search – a global web search for this pop-word term.
Pop-word colors have the following meaning:
Red – the term is new and appears only today within the selected date range;
Green – the term appears today and also on other days of the selected date range;
Blue – the term appears on every day of the selected date range;
Black – the term matches none of the three cases above.
An opened POP-words item shows a list of the article fragments in which the pop-word term occurs, with the article’s rates for each day’s measurement. The fragment text links to the original article source.
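
The color rules above can be sketched as a small classification function. The function name, its parameters, and the order in which overlapping rules are checked (red, then blue, then green) are assumptions; the rules as stated overlap when the selected range is a single day, and this document does not say which one wins.

```python
from datetime import date, timedelta


def term_color(days_seen: set[date], start: date, end: date, today: date) -> str:
    """Classify a term by the days on which it appears within [start, end]."""
    in_range = {d for d in days_seen if start <= d <= end}
    all_days = {start + timedelta(days=i) for i in range((end - start).days + 1)}
    if in_range == {today}:
        return "red"    # new: appears only today within the range
    if in_range == all_days:
        return "blue"   # appears on every day of the range
    if today in in_range:
        return "green"  # appears today and on some other days
    return "black"      # none of the above
```

For a three-day range ending today, a term seen only today is red, a term seen today and on the first day is green, and a term seen on all three days is blue.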

Options

The Options dialog defines a set of parameters that tune the chart and the pop-words list view, data selection, filtering, and presentation format.
Date from – defines the start of the date range for selecting data;
Date to – defines the end of the date range for selecting data;
Data period – defines the source of data collected and merged as unique for the corresponding time period;
Unify terms – defines the term expansion mode; if Yes, shorter terms are expanded to wider terms that contain the shorter ones as parts;
Max. Terms – defines the maximum number of terms to display; if Min and Max are empty, the top terms are selected by maximizing the Indicator value, with an auto-calculated minimum threshold;
Order by – defines the data ordering type;
Order direction – defines the data ordering direction;
Chart Type – defines the type of chart;
Indicator – defines the Indicator value as ‘Frequency (Count)’, ‘Max similarity rate (Similarity Rate)’, or one of the social rates such as Twitter posts, re-posts, or shares;
Min – defines the lower bound of the Indicator range for selecting data. If empty, it is auto-calculated and the top terms up to the Max. Terms limit are selected;
Max – defines the upper bound of the Indicator range for selecting data. If empty, there is no maximum limit;
Filling schema – defines how areas on the chart are filled;
Data Values – defines the form of the values: original or percentage;
Line width – defines the line width for charts that use lines;
Data Averaging – defines the type of averaging;
Off all terms – defines whether all terms are in the OFF state in the legend after the chart is shown, so that they can be turned ON manually one by one;
No top – applies additional filtering with an algorithm that uses the auto-calculated Min. threshold value as a maximum, cutting off the top terms and bringing up less prominent ones;
Occurring – defines an additional selection method based on how regularly a pop-word term occurs per day. Regular – include only terms that appeared on every day of the specified period; Irregular – exclude terms that appeared on every day of the specified period; Single – terms that appeared once in the current period; Single daily – include only terms that appeared on each day of the current period;
Search – defines a set of keywords for searching pop-word terms, in two formats: simple CSV and advanced JSON;
Words List – a control for choosing, from the selected pop-words, the list to use in the chart and legend, and for assigning each a color and filling type.
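
As an illustration of the Occurring option, the sketch below filters terms by the number of days on which each one appeared. It covers only the Regular, Irregular, and Single modes and reads "appeared once" as "appeared on exactly one day"; that reading, the function name, and the input shape are assumptions, and the Single daily mode is left out because its description does not distinguish it clearly enough to implement.

```python
def filter_by_occurring(days_per_term: dict[str, int], period_days: int, mode: str) -> list[str]:
    """Select terms by how many days of the period they appeared on."""
    rules = {
        "regular": lambda days: days == period_days,    # appeared every day
        "irregular": lambda days: days != period_days,  # drop every-day terms
        "single": lambda days: days == 1,               # appeared on one day only
    }
    if mode not in rules:
        raise ValueError(f"unsupported mode: {mode!r}")
    keep = rules[mode]
    return [term for term, days in days_per_term.items() if keep(days)]
```

For a 7-day period, a term seen on all 7 days passes the Regular filter, and every other term passes the Irregular filter.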

Digests

An NH Digest is a set of articles collected from many source sites (currently more than 1500) and ranked by popularity within the thematic set of source websites.

Articles in a digest are sorted first by similarity rate and then by publication date or, if that is unavailable, by the date the article was processed.

Digests are published periodically on different schedules and reflect the sequence of article publications over time, by rate and by subject theme.
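
The ordering described above can be sketched as a two-part sort key. The sort direction (highest rate first, then newest first) and the field names `similarity_rate`, `published`, and `processed` are assumptions, since the text does not state them.

```python
from datetime import datetime


def sort_digest(articles: list[dict]) -> list[dict]:
    """Order by similarity rate, then by publication date, falling back
    to the processing date when no publication date was detected."""
    def sort_key(article: dict) -> tuple[float, datetime]:
        when = article.get("published") or article["processed"]
        return (article["similarity_rate"], when)

    return sorted(articles, key=sort_key, reverse=True)
```

Two articles with the same rate are ordered by date, so the tie is broken in favor of the more recent one under these assumptions.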

Types of digests

Digests contain articles collected according to so-called projects, which are entities of the TR service. Each project is configured to crawl from one to hundreds of source sites and to collect articles into thematic groups according to a main thematic subject. The thematic subjects are:

  • Mix – the mix of world news articles, mostly political and social.
  • IT – the IT news articles.
  • Software – the software development and usage news articles.

Digests are published in geographically optimized editions for:

  • Japan – article source sites optimized for Japan.
  • United States – article source sites optimized for the US.
  • Ukraine – article source sites optimized for Ukraine.
  • Deutschland – article source sites optimized for Germany.

Geographically optimized sources are oriented mostly toward a country or region, and their content is close in terms of events, position, source, mentality, and so on.

Digests are published in different forms:

  • desktop – optimized for desktop viewing: A4 page size, matching font sizes, portrait orientation; includes the TOC, announcements, and article bodies sections.
  • mobile – optimized for viewing on mobile devices: A7 page size, matching font sizes, portrait orientation; includes only the article bodies section.

Digests are published in different formats:

  • pdf – a single document or multiple documents that can be downloaded via the links provided in the notification message and can include the TOC, announcements, and articles sections.
  • top 3 posts – three article announcements from one thematic subject that can be included in the notification message sent by mail or another notification transport. An announcement can include the title, description, publication date, author, size, and rank, but not the full article body. If a subscription covers both “pdf” and “top 3 posts” for the same thematic subject, the notification message contains both the link to download the pdf document and the announcements.

Digests are published in different languages:

  • in english – articles in English.
  • in ukraine – articles in Ukrainian.
  • in russian – articles in Russian.
  • in japan – articles in Japanese.
  • in german – articles in German.

Digests are published from different sets of articles collected over periods of time. By default, if the subscription title has no additional suffix, a digest includes all articles from the last six hours, ranked by popularity rank and publication date (or by collection date if the publication date was not detected), and its articles are unique during a week. The available suffixes are:

  • top 100 – the top 100 articles from the last six hours, ranked by popularity rank and publication date (or by collection date if the publication date was not detected); unique during a week.
  • full – all articles from the last ten hours, ranked by popularity rank and publication date (or by collection date if the publication date was not detected); not unique during a week.

A subscription can also be made to publications produced on different installations of the DC service:

  • dc3 – Distributed Crawler installation #3, full-featured publications.
  • TEST dc2 – Distributed Crawler installation #2, test publications with a minimal number of articles and the same configuration as dc3.

At present, subscriptions are available for all listed thematic subjects, countries, languages, forms, formats, and installations. The subscription name includes keywords that identify the article set as a unique collection, the document format, and the other properties listed above, for example:

“dc3 United States IT in english pdf” – means that a subscription can be made to receive notifications about newly published digests for “desktop”, “desktop top 100”, “desktop full”, “desktop top 100 full”, “mobile”, “mobile top 100”, “mobile top 100 full”, and “mobile full” in pdf document format. A subscription can also be made to “top 3 posts” and “top 3 posts full”, as article announcements included in the notification message or posted to the corresponding WordPress category through the XML-RPC API, if that is configured for the user’s account.
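
The naming scheme in this example can be sketched as simple string composition. The function names and the keyword order (installation, country, subject, language, format) are inferred from the single example above and are assumptions, not a documented API.

```python
from itertools import product

FORMS = ("desktop", "mobile")
SUFFIXES = ("", "top 100", "full", "top 100 full")


def subscription_name(installation: str, country: str, subject: str,
                      language: str, fmt: str) -> str:
    """Compose a subscription name in the keyword order of the example."""
    return f"{installation} {country} {subject} in {language} {fmt}"


def pdf_variants() -> list[str]:
    """List every form and suffix combination available for pdf digests."""
    return [
        " ".join(part for part in (form, suffix) if part)
        for form, suffix in product(FORMS, SUFFIXES)
    ]


name = subscription_name("dc3", "United States", "IT", "english", "pdf")
variants = pdf_variants()
```

The two forms combined with the four suffix options give the eight pdf variants listed in the example.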

Digest Navigation

Digests support internal cross-reference links inside the pdf document. Most pdf viewers and readers follow them.

  • The article title in the announcement links to the article’s body section.
  • The article title in the article body links to the next article.
  • The article’s order number in the article body links to the announcements section.
  • The author at the bottom of the article body links to the source website page.
  • The list of domain names, shown when an article has similar articles that were omitted from the digest, links to the source website article pages.