NewsHub is a web service for news aggregation and daily PDF digest generation, with an analytical focus, that works with major news sources from all over the world.
NewsHub is based on TagsReaper, a data mining service that lets you crawl and extract web content of different kinds (static HTML or dynamically generated) and process it with a number of analytical algorithms. It also offers several tools for visual representation of the results.
The similarity rate reflects the popularity of the terms and theme discussed in an article, measured over the title and part of the article’s body. During computation, the similarity rate algorithm (c) compares articles with each other; the estimated similarity of each pair is compared with a threshold value to classify the articles as similar or not. Similar articles are marked with a filled five-star rating. The rate number in brackets shows how many very similar articles were omitted from this digest.
The rate value is a floating point number. The integer part is the number of similar articles from unique source publisher domains. The fractional part is the sum of the number of similar articles from non-unique source publisher domains and the number of articles similar to those similar articles (second-level, or child, similarity).
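The pairwise comparison against a threshold described above can be sketched as follows. This is only an illustration: the actual similarity rate algorithm (c) is not documented here, so Jaccard overlap of title words is used as a stand-in measure, and the threshold of 0.5, the field names `title` and `url`, and the choice to exclude the article's own domain are all assumptions.

```python
from itertools import combinations
from urllib.parse import urlparse


def tokens(text):
    """Lowercased word set of a text fragment, punctuation stripped."""
    return {w.strip(".,:;!?\"'()").lower() for w in text.split()} - {""}


def jaccard(a, b):
    """Jaccard overlap of two token sets (stand-in similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similar_pairs(articles, threshold=0.5):
    """Compare every pair of articles; keep pairs at or above the threshold."""
    sets = [tokens(a["title"]) for a in articles]
    return [
        (i, j)
        for i, j in combinations(range(len(articles)), 2)
        if jaccard(sets[i], sets[j]) >= threshold
    ]


def unique_domain_count(articles, index, pairs):
    """Distinct source domains among the articles similar to `index`."""
    similar = {j if i == index else i for i, j in pairs if index in (i, j)}
    own = urlparse(articles[index]["url"]).netloc
    domains = {urlparse(articles[k]["url"]).netloc for k in similar}
    return len(domains - {own})  # assumption: own domain is not counted


# Hypothetical sample data
articles = [
    {"title": "Central bank raises interest rates again", "url": "https://a.example/1"},
    {"title": "Central bank raises interest rates", "url": "https://b.example/2"},
    {"title": "Bank raises interest rates again", "url": "https://b.example/3"},
    {"title": "Local team wins championship final", "url": "https://c.example/4"},
]
pairs = similar_pairs(articles)
```

With this sample, the first three articles are pairwise similar, and the first article has similar articles from one distinct foreign domain, which would correspond to the integer part of its rate under these assumptions.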
The sentiment rate reflects the relative mood of an article, based on analysis of its title, description and a small part of the body. The improved Bayesian sentiment rate algorithm (c), with vocabularies and NLP-based extensions, estimates positive and negative values and calculates a resulting weighted rate. The sign of the value indicates the direction of the sentiment: positive (good) or negative (bad). Zero means that the positive and negative parts are equal. The sentiment rate is shown as a smile icon with a floating point number, or as a signed floating point number alone.
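To illustrate how positive and negative parts can combine into one signed value, here is a deliberately simplified sketch. It is not the Bayesian algorithm (c) itself: it only counts vocabulary hits, and the two word lists and the normalization to the range -1..1 are assumptions made for the example.

```python
# Hypothetical vocabularies; the real service uses its own.
POSITIVE = {"growth", "win", "success", "improve"}
NEGATIVE = {"crisis", "loss", "fail", "decline"}


def sentiment_rate(text, positive=POSITIVE, negative=NEGATIVE):
    """Signed rate: > 0 positive, < 0 negative, 0 when both parts are equal."""
    words = [w.strip(".,:;!?\"'()").lower() for w in text.split()]
    pos = sum(w in positive for w in words)
    neg = sum(w in negative for w in words)
    total = pos + neg
    # No vocabulary hits at all is treated as neutral.
    return (pos - neg) / total if total else 0.0
```

A title with two positive hits and one negative hit yields a small positive value, while a title with equal numbers of positive and negative hits yields zero, matching the sign convention described above.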
Social sentiment rate (c) – a rate calculated from social network data such as post messages or comments.
Social rate (c) – a rate calculated as a heterogeneous synthetic indicator that summarizes activity and reflection across several social networks. It combines in a single floating point value indicators such as the number of post messages, likes, shares, re-posts and so on in social networks like Twitter, Facebook, G+, Instagram and others. It can be compared in relative terms and used as a metric in further estimations and media environment analysis. (This rate is coming soon.)
A POP-word (c) is an entity detected by the pop-words algorithm (c); it represents a term, acronym, proper name, phrase of similar sense, or single word. Being a POP-word means that the entity is discussed very often and has become an intersection of articles in pairwise comparison. The frequency of a pop-word reflects its popularity: it is the number of cases in which the pop-word was an intersection point in a pairwise comparison of articles. POP-words have their own similarity rate and sentiment rate, calculated respectively as the maximum of the similarity rates of the articles that contain the pop-word and as a weighted ratio of positive and negative similarity rates. POP-words are tracked as dedicated self-sufficient objects and visually represented in the POP-words Timeline Tool.
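The frequency definition above (how often a term is the intersection point of an article pair) and the maximum-based similarity rate can be sketched as follows. Term extraction is outside the scope of this sketch; each article is assumed to be already reduced to a set of candidate terms, and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def pop_word_frequencies(article_terms):
    """For each term, count the article pairs in which it is shared."""
    freq = Counter()
    for a, b in combinations(article_terms, 2):
        freq.update(a & b)  # every shared term is one intersection case
    return freq


def pop_word_similarity(term, article_terms, similarity_rates):
    """Maximum similarity rate among the articles that contain the term."""
    rates = [r for terms, r in zip(article_terms, similarity_rates) if term in terms]
    return max(rates) if rates else 0.0


# Hypothetical sample: three articles and their similarity rates
article_terms = [
    {"nato", "summit", "budget"},
    {"nato", "summit"},
    {"nato", "election"},
]
similarity_rates = [2.1, 5.0, 1.3]
freq = pop_word_frequencies(article_terms)
```

Here "nato" is shared by all three pairs and "summit" by one, so "nato" gets the higher frequency; terms that occur in only one article never become an intersection point and get no frequency at all.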
POP-words Timeline Tool
The tool can be configured to plot a timeline of several indicators in different chart types, such as Line, Column and Area, to visualize the dynamics of pop-word popularity and social reflection. Chart controls:
Month - sets the date range to 30 days and recalculates the minimal frequency;
Week - sets the date range to 7 days and recalculates the minimal frequency;
Today - sets the date range to 1 day (today) and recalculates the minimal frequency;
Yesterday - sets the date range to 1 day (yesterday) and recalculates the minimal frequency;
Bar - sets the chart type to Bar;
Line - sets the chart type to Line and the indicator value to natural;
Area - sets the chart type to Area and the indicator value to percentage;
Share - lets you copy the URL of this chart to share it with somebody.
POP-words are presented in an accordion control. Each item title contains the following columns:
Rates - rates calculated for this pop-word term, for example Similarity rate, Sentiment rate and social rates (for Twitter: posts, retweets, likes and sentiment rates).
Twitter posts is the sum of all Twitter posts of all articles for the POP-word.
Twitter reposts (retweets) is the sum of all retweets of all articles for the POP-word.
Twitter likes is the sum of all likes of all articles for the POP-word.
Sentiment is the sum of all sentiments of all articles for the POP-word.
Term - the pop-word term text.
Chains - so-called second-level chains, detected as highly frequent fragments of text used together with the pop-word term in articles.
Days - the number of days on which this pop-word term appears at least once.
Articles - the number of articles in which this pop-word term appears.
Value - the value of the measured indicator; depending on the chart and options configuration it can be the pop-word frequency, similarity rate, number of articles, or social indicators (TW-posts, TW-re-posts, TW-shares), in natural or percentage format.
Search - a global web search for this pop-word term.
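The summed columns above amount to a simple aggregation over the articles of one POP-word. A minimal sketch, assuming each article is a record with the hypothetical fields `date`, `tw_posts`, `tw_retweets`, `tw_likes` and `sentiment`:

```python
def aggregate_pop_word(articles):
    """Build the summed columns for one POP-word from its articles."""
    summed = ("tw_posts", "tw_retweets", "tw_likes", "sentiment")
    row = {key: sum(a[key] for a in articles) for key in summed}
    row["days"] = len({a["date"] for a in articles})  # days with an appearance
    row["articles"] = len(articles)
    return row


# Hypothetical sample data
sample = [
    {"date": "2017-03-01", "tw_posts": 10, "tw_retweets": 4, "tw_likes": 7, "sentiment": 0.5},
    {"date": "2017-03-01", "tw_posts": 3, "tw_retweets": 1, "tw_likes": 0, "sentiment": -0.25},
    {"date": "2017-03-02", "tw_posts": 5, "tw_retweets": 0, "tw_likes": 2, "sentiment": 0.25},
]
row = aggregate_pop_word(sample)
```

For the three sample articles this gives 18 posts, 5 retweets and 9 likes across 2 days.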
Pop-words colors meaning
Red - the term is new and appears only today within the selected date range;
Green - the term appears today and also on other days of the selected date range;
Blue - the term appears on every day of the selected date range;
Black - the term matches none of the three cases above.
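The color rules can be read as a small classification over the days on which a term appears. The sketch below is one possible reading: the function name and arguments are hypothetical, and because "every day" also covers "today", the order in which the rules are checked (red, then blue, then green) is an assumption.

```python
from datetime import date


def term_color(appearance_days, date_from, date_to, today):
    """Classify a term by the days it appears on within [date_from, date_to]."""
    n_days = (date_to - date_from).days + 1
    in_range = {d for d in appearance_days if date_from <= d <= date_to}
    if in_range == {today}:
        return "red"    # new: appears only today
    if len(in_range) == n_days:
        return "blue"   # appears on every day of the range
    if today in in_range:
        return "green"  # appears today and on some other days
    return "black"      # none of the above


start, end = date(2017, 3, 1), date(2017, 3, 3)
```

With a three-day range ending today, a term seen only on the last day is red, one seen on all three days is blue, one seen on the last two days is green, and one seen only on the first day is black.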
An opened POP-words item shows a list of article fragments in which the pop-word term occurs, with the article’s rates for each day’s measurement. The fragment text links to the original article source.
The Options dialog can be used to define a set of parameters that tune the chart and the pop-words list view, data selection, filtering and presentation format.
Date from - Defines the start of the date range for data selection;
Date to - Defines the end of the date range for data selection;
Data period - Defines the source of data collected and merged as unique for the corresponding time period;
Unify terms - Defines the term expansion mode; if Yes, shorter terms are expanded to wider terms that contain the short ones as parts;
Max. Terms - Defines the maximum number of terms to display; if Min/Max is empty, the top terms are selected by maximizing the Indicator value, with an auto-calculated Min threshold;
Order by - Defines the data ordering type;
Order direction - Defines the data ordering direction;
Chart Type - Defines the type of chart;
Indicator - Defines the Indicator value as 'Frequency (Count)', 'Max similarity rate (Similarity Rate)', or one of the social rates such as Twitter posts, re-posts or shares;
Min - Defines the lower bound of the Indicator range for data selection. If empty, it is auto-calculated and the top terms up to the Max. Terms limit are selected;
Max - Defines the upper bound of the Indicator range for data selection. If empty, there is no maximum;
Filling schema - Defines how areas on the chart are filled;
Data Values - Defines the form of the values: original or percentage;
Line width - Defines the line width for charts that use lines;
Data Averaging - Defines the type of averaging;
Off all terms - Defines whether all terms are switched OFF in the legend after the chart is shown, so that they can be turned ON manually one by one;
No top - Defines an additional filter that uses the auto-calculated Min. threshold value as a maximum, to cut off the top terms and bring up some of the lower-ranked ones;
Occurring - Defines an additional selection method based on how regularly a pop-word term occurs per day. Regular - include only terms that appeared on every day of the specified period; Irregular - remove terms that appeared on every day of the specified period; Single - terms that appeared once in the current period; Single daily - include only terms that appeared on each day for
Search - Defines a set of keywords used to search pop-word terms, in one of two formats: simple CSV or advanced JSON.
Words List - A control that lets you pick a list of pop-words from the selected set to use in the chart and legend, and assign each a color and filling type.
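As an illustration of the Occurring option, the Regular, Irregular and Single modes can be sketched as a filter over the days on which each term appears. The function and mode names are hypothetical, "appeared once" is interpreted here as "appeared on exactly one day", and the Single daily mode is left out because its description is incomplete.

```python
def filter_by_occurring(term_days, period_days, mode):
    """Select terms by regularity; term_days maps a term to its set of days."""
    period = set(period_days)
    selected = []
    for term, days in term_days.items():
        seen = days & period
        every_day = seen == period
        if mode == "regular" and every_day:
            selected.append(term)
        elif mode == "irregular" and seen and not every_day:
            selected.append(term)
        elif mode == "single" and len(seen) == 1:
            selected.append(term)
    return sorted(selected)


# Hypothetical sample: a three-day period with day labels
period = ["d1", "d2", "d3"]
term_days = {
    "nato": {"d1", "d2", "d3"},
    "summit": {"d1", "d3"},
    "budget": {"d2"},
}
```

With this sample, Regular keeps only "nato", Irregular keeps "budget" and "summit", and Single keeps only "budget".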
A digest is a set of articles collected from many source sites (currently more than 1500) and ranked by popularity among the set of thematic source web sites.
Articles in a digest are sorted by similarity rate and then by publication date or the date the article was processed.
Digests are published periodically on different schedules and reflect the sequence of article publications in time, by rate and by subject theme.
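The article ordering inside a digest can be sketched as a two-key sort. The field names `similarity_rate`, `published` and `processed` are hypothetical, and the directions (highest rate first, then most recent first) are assumptions, since the sort direction is not stated.

```python
from datetime import datetime


def effective_date(article):
    """Publication date if it was detected, otherwise the processing date."""
    return article.get("published") or article["processed"]


def digest_order(articles):
    """Sort by similarity rate, then by date, both descending."""
    # Python's sort is stable, so sorting by the secondary key first
    # and by the primary key second yields the combined order.
    by_date = sorted(articles, key=effective_date, reverse=True)
    return sorted(by_date, key=lambda a: a["similarity_rate"], reverse=True)


# Hypothetical sample data
sample = [
    {"id": "a", "similarity_rate": 3.0, "published": datetime(2017, 3, 1), "processed": datetime(2017, 3, 1, 6)},
    {"id": "b", "similarity_rate": 5.2, "published": None, "processed": datetime(2017, 3, 2)},
    {"id": "c", "similarity_rate": 3.0, "published": datetime(2017, 3, 3), "processed": datetime(2017, 3, 3, 6)},
]
ordered_ids = [a["id"] for a in digest_order(sample)]
```

Article "b" comes first because of its higher rate; "c" precedes "a" because the two share a rate and "c" is more recent.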
Types of digests
Digests contain articles collected according to so-called projects, which are entities of the TR (c) service. Each project is configured to crawl from one to hundreds of source sites and to collect articles into thematic groups according to the main thematic subject. The thematic subjects are:
- Mix – the mix of world news articles, mostly political and social.
- IT – the IT news articles.
- Software – the software development and usage news articles.
Digests are published in geographically optimized editions:
- Japan – article source sites optimized for Japan.
- United States – article source sites optimized for the US.
- Ukraine – article source sites optimized for Ukraine.
- Deutschland – article source sites optimized for Germany.
Geographically optimized sources are oriented mostly toward a country or region, with content that is close to it in events, position, source, mentality and so on.
Digests are published in different forms:
- desktop – optimized for desktop viewing: A4 page size with matching font sizes, portrait orientation; includes the TOC, announcements and article bodies sections.
- mobile – optimized for viewing on mobile devices: A7 page size with matching font sizes, portrait orientation; includes only the article bodies section.
Digests are published in different formats:
- pdf – a single document or multiple documents that can be downloaded via links provided in the notification message and can include the TOC, announcements and articles sections.
- top 3 posts – three article announcements from one thematic subject that can be included in the notification message sent by mail or another notification transport. An announcement can include the title, description, publication date, author, size and rank, but not the full article body. If a subscription for the same thematic subject covers both “pdf” and “top 3 posts”, the notification message contains both the link to download the pdf document and the announcements.
Digests are published in different languages
- in english – articles in English.
- in ukraine – articles in Ukrainian.
- in russian – articles in Russian.
- in japan – articles in Japanese.
- in german – articles in German.
Digests are published from different sets of articles collected over periods of time. By default, if the subscription title has no additional suffix, a digest includes all articles for the last six hours, ranked by popularity rank and publication date (or collection date if the publication date was not detected). Such digests are unique during a week.
- top 100 – the top 100 articles for the last six hours, ranked by popularity rank and publication date (or collection date if the publication date was not detected); unique during a week.
- full – all articles for the last ten hours, ranked by popularity rank and publication date (or collection date if the publication date was not detected); not unique during a week.
Subscriptions can also be made to publications from different installations of the DC service:
- dc3 – Distributed Crawler installation #3, full-featured publications.
- TEST dc2 – Distributed Crawler installation #2, test publications with a minimal number of articles and the same configuration as dc3.
At present, subscriptions are available for all listed thematic subjects, countries, languages, forms, formats and installations. The subscription name includes keywords that identify the article set as a unique collection, the document format and the other properties listed above, for example:
“dc3 United States IT in english pdf”
– means that a subscription can be made to be notified about newly published digests for: “desktop”, “desktop top 100”, “desktop full”, “desktop top 100 full”, “mobile”, “mobile top 100”, “mobile top 100 full”, “mobile full” in pdf document format. A subscription can also be made to “top 3 posts” and “top 3 posts full”, as article announcements included in the notification message or posted to the corresponding WordPress category using the XMLRPC API, if it is configured for the user’s account.
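The way a subscription name is assembled from keywords can be sketched from the example above. The keyword order (installation, country, subject, language, format) is inferred from that single example, and the function names are hypothetical.

```python
def subscription_name(installation, country, subject, language, fmt):
    """Compose a subscription name in the shape of the documented example."""
    return f"{installation} {country} {subject} in {language} {fmt}"


FORMS = ("desktop", "mobile")
SUFFIXES = ("", "top 100", "full", "top 100 full")


def pdf_variants():
    """All form and period-suffix combinations listed for the pdf format."""
    return [
        " ".join(part for part in (form, suffix) if part)
        for form in FORMS
        for suffix in SUFFIXES
    ]


name = subscription_name("dc3", "United States", "IT", "english", "pdf")
```

The two forms combined with the four suffix options give the eight pdf variants listed above.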
## Digest Navigation
Digests support internal cross-reference links inside the pdf document. Most pdf viewers and readers follow them.
- The article title in the announcement links to the article’s body section.
- The article title in the article’s body links to the next article.
- The article’s order number in the article’s body links to the announcement section.
- The author at the bottom of the article body links to the source web-site page.
- When an article has similar articles that were skipped from the digest, each entry in the list of domain names links to the article page on the source web site.