Templates

May 17, 2018 | Author: Anonymous | Category: Каталог , Без категории
Share Embed


Short Description

Download Templates ...

Description

Page-level Template Detection via Isotonic Smoothing

Deepayan Chakrabarti Ravi Kumar Kunal Punera

Yahoo! Research Yahoo! Research Univ. of Texas at Austin

Template Example: www.findbestcasinos.com

Template Example: www.findbestcasinos.com

Advertisements

Look and feel

Links for navigation

Copyright message

Applications • Web Ranking – Do not match query to text in templates

• Duplicate Detection – Do not shingle text inside templates

• Summarization – Do not use text within templates for summary

Site-level Template Detection • Templates = “page-fragments” that recur across several pages of a website. – Eg: copyright, navigation links

• Page-fragment can be – HTML code (tags) – Visible Text – DOM nodes (structure + text)

• Simple two pass algorithms – Hash page-fragments and count occurrences – Mark templates in second pass

Site-level Template Detection Advantages: • No labeled training data needed • Very high precision

Issues: • Inefficient when pages are not processed in site order – Eg: in a web crawler pipeline – Need to maintain hashes and counts for all sites – Marking site-level templates for new websites

• Not all templates are site-level in nature – Low recall

Site-level Template Detection Advantages: • No labeled training data needed • Very high precision

Issues: • Inefficient when pages are not processed in site order – Eg: in a web crawl pipeline Only use page-level – Need to maintain hashes and counts for all sites information – Marking site-level templates for new websites

• Not all templates are site-level in nature – Low recallLearn

a general model for templates

Page-level Model-based Template Detection • Problem: Template detection – Using only information local to a webpage – Detect all templates: not just site-level – No manually labeled training data

• Our Approach: – Obtain training data via site-level approach – Learn a classification model for “templateness” • For each internal DOM node

– Enforce a global monotonicity property of “templateness”

Automatically Labeling Data • Use site-level approach – 3,000 website (200 webpages per site) – Obtained ~1M labeled DOM nodes

• Labeled data has a bias – Some template DOM nodes labeled as non-templates – False negatives are noise

• Extract general structural and content cues from the DOM nodes – Generalize over the site-level training data

Cues used Implicitly by Humans

Placement on screen

BGColor

Average Sentence Size Fraction of text outside anchors

Aspect Ratio

Link Density

Learning the “Templateness” Classifier • Extract features of DOM nodes from cues • Learn weights for these features – – – –

2-class problem Logistic regression classifier Simple classifier, avoids fitting noise in the data “Templateness” = probability of belonging to template class – Separate classifiers learned for nodes of different sizes

• Each node in the DOM tree is classified – Past work classify segments of web pages – Segmentation might mix template and non-template content

“Templateness” is a monotonic property

A DOM node is a template if and only if all its children are templates.

“Templateness” Monotonicity A node in the DOM tree is a template if and only if all its children are templates. 0

• Each node is classified in isolation – Classifier scores needn’t be monotonic – Classifier might misclassify nodes

• Post-processing “templateness” scores – Enforces monotonicity – Corrects misclassifications by smoothing

• “Templateness” scores are real numbers xi

1

0 1 0

1 0

0

0

1 1

01

Smoothing via Regularized Isotonic Regression Given raw classification scores x1,…,xn for n nodes in the DOM tree, find smoothed scores y1,…,yn (such that i → j yi ≤ yj) to minimize:

∑|xi-yi| + c . (# distinct yi’s) Low L1 distance from the raw scores

“Tradeoff” parameter

Monotonicity constraint

Create as few “sections” as possible

This results in: 1. Smoothed monotonic scores yi 2. “Sections” of the webpage • Section = adjacent nodes with same yi values

Smoothing via Regularized Isotonic Regression • Lemma: For L1 distance, each yi must equal some xj

• The optimal solution found using a dynamic program – Complexity: O(n2 log n), n = # of DOM nodes in page – Equals complexity of algorithms for non-regularized isotonic regression (c=0) – On a Pentium 4, 3GHz, 512MB running FreeBSD: around 60ms for cnn.com (n=292)

Page-level Model-based Template Detection • Problem: Template detection • Our Approach: – Obtain training data via site-level approach

– Classification model for “templateness”  Designed features for DOM nodes  Each DOM node labeled by a logistic regression classifier

– Enforce a global monotonicity property of “templateness”  Formulate it as regularized isotonic regression over trees  Optimal solution via dynamic program

Template Detection Accuracy Classifier Only

Classifier + Smoothing

f-measure

• Data: manually classified DOM nodes in webpages

• Results: – PageLevel system works very well – Smoothing improves classification accuracy

Duplicate Detection

• Data: 2359 pages from 3 lyrics websites – 1711 duplicate pairs (same song, different websites) – 2058 non-duplicate pairs (different songs, same website) • Errors occur when shingles hit template content

• PageLevel detects more templates than SiteLevel

Conclusions • Page-level model-based template detection • Used no manually labeled training data • “Templateness” monotonicity property • Regularized isotonic regression – might be of independent interest

• Showed empirically that PageLevel generalizes over the SiteLevel data

View more...

Comments

Copyright � 2017 UPDOC Inc.