Automated Data Collection with R

A Practical Guide to Web Scraping and Text Mining

Simon Munzert

Wiley

Special-order title, will be reserved | Delivery time: special-order title, available within 10 working days

72,35 €*

All prices include VAT | Free shipping

ISBN-13: 9781118834817

Published: 2015

Publication date: 20 January 2015

Pages: 480

Author: Simon Munzert

Weight: 865 g

Dimensions: 251x174x30 mm

Language: English

Description:

A hands-on guide to web scraping and text mining for both beginners and experienced users of R.

* Introduces fundamental concepts of the main architecture of the web and databases and covers HTTP, HTML, XML, JSON, and SQL.
* Provides basic techniques to query web documents and data sets (XPath and regular expressions).
* Presents an extensive set of exercises to guide the reader through each technique.
* Explores both supervised and unsupervised techniques as well as advanced techniques such as data scraping and text management.
* Features case studies throughout, along with examples for each technique presented.
* Provides R code and solutions to the exercises featured in the book on a supporting website.

Preface

1 Introduction
1.1 Case study: World Heritage Sites in Danger
1.2 Some remarks on web data quality
1.3 Technologies for disseminating, extracting, and storing web data
1.3.1 Technologies for disseminating content on the Web
1.3.2 Technologies for information extraction from web documents
1.3.3 Technologies for data storage
1.4 Structure of the book

Part One: A Primer on Web and Data Technologies

2 HTML
2.1 Browser presentation and source code
2.2 Syntax rules
2.2.1 Tags, elements, and attributes
2.2.2 Tree structure
2.2.3 Comments
2.2.4 Reserved and special characters
2.2.5 Document type definition
2.2.6 Spaces and line breaks
2.3 Tags and attributes
2.3.1 The anchor tag <a>
2.3.2 The metadata tag <meta>
2.3.3 The external reference tag <link>
2.3.4 Emphasizing tags <b>, <i>, <strong>
2.3.5 The paragraphs tag <p>
2.3.6 Heading tags <h1>, <h2>, <h3>, ...
2.3.7 Listing content with <ul>, <ol>, and <dl>
2.3.8 The organizational tags <div> and <span>
2.3.9 The <form> tag and its companions
2.3.10 The foreign script tag <script>
2.3.11 Table tags <table>, <tr>, <td>, and <th>
2.4 Parsing
2.4.1 What is parsing?
2.4.2 Discarding nodes
2.4.3 Extracting information in the building process
Summary
Further reading
Problems

3 XML and JSON
3.1 A short example XML document
3.2 XML syntax rules
3.2.1 Elements and attributes
3.2.2 XML structure
3.2.3 Naming and special characters
3.2.4 Comments and character data
3.2.5 XML syntax summary
3.3 When is an XML document well formed or valid?
3.4 XML extensions and technologies
3.4.1 Namespaces
3.4.2 Extensions of XML
3.4.3 Example: Really Simple Syndication
3.4.4 Example: scalable vector graphics
3.5 XML and R in practice
3.5.1 Parsing XML
3.5.2 Basic operations on XML documents
3.5.3 From XML to data frames or lists
3.5.4 Event-driven parsing
3.6 A short example JSON document
3.7 JSON syntax rules
3.8 JSON and R in practice
Summary
Further reading
Problems

4 XPath
4.1 XPath--a query language for web documents
4.2 Identifying node sets with XPath
4.2.1 Basic structure of an XPath query
4.2.2 Node relations
4.2.3 XPath predicates
4.3 Extracting node elements
4.3.1 Extending the fun argument
4.3.2 XML namespaces
4.3.3 Little XPath helper tools
Summary
Further reading
Problems

5 HTTP
5.1 HTTP fundamentals
5.1.1 A short conversation with a web server
5.1.2 URL syntax
5.1.3 HTTP messages
5.1.4 Request methods
5.1.5 Status codes
5.1.6 Header fields
5.2 Advanced features of HTTP
5.2.1 Identification
5.2.2 Authentication
5.2.3 Proxies
5.3 Protocols beyond HTTP
5.3.1 HTTP Secure
5.3.2 FTP
5.4 HTTP in action
5.4.1 The libcurl library
5.4.2 Basic request methods
5.4.3 A low-level function of RCurl
5.4.4 Maintaining connections across multiple requests
5.4.5 Options
5.4.6 Debugging
5.4.7 Error handling
5.4.8 RCurl or httr--what to use?
Summary
Further reading
Problems

6 AJAX
6.1 JavaScript
6.1.1 How JavaScript is used
6.1.2 DOM manipulation
6.2 XHR
6.2.1 Loading external HTML/XML documents
6.2.2 Loading JSON
6.3 Exploring AJAX with Web Developer Tools
6.3.1 Getting started with Chrome's Web Developer Tools
6.3.2 The Elements panel
6.3.3 The Network panel
Summary
Further reading
Problems

7 SQL and relational databases
7.1 Overview and terminology
7.2 Relational Databases
7.2.1 Storing data in tables
7.2.2 Normalization
7.2.3 Advanced features of relational databases and DBMS
7.3 SQL: a language to communicate with Databases
7.3.1 General remarks on SQL, syntax, and our running example
7.3.2 Data control language--DCL
7.3.3 Data definition language--DDL
7.3.4 Data manipulation language--DML
7.3.5 Clauses
7.3.6 Transaction control language--TCL
7.4 Databases in action
7.4.1 R packages to manage databases
7.4.2 Speaking R-SQL via DBI-based packages
7.4.3 Speaking R-SQL via RODBC
Summary
Further reading
Problems

8 Regular expressions and essential string functions
8.1 Regular expressions
8.1.1 Exact character matching
8.1.2 Generalizing regular expressions
8.1.3 The introductory example reconsidered
8.2 String processing
8.2.1 The stringr package
8.2.2 A couple more handy functions
8.3 A word on character encodings
Summary
Further reading
Problems

Part Two: A Practical Toolbox for Web Scraping and Text Mining

9 Scraping the Web
9.1 Retrieval scenarios
9.1.1 Downloading ready-made files
9.1.2 Downloading multiple files from an FTP index
9.1.3 Manipulating URLs to access multiple pages
9.1.4 Convenient functions to gather links, lists, and tables from HTML documents
9.1.5 Dealing with HTML forms
9.1.6 HTTP authentication
9.1.7 Connections via HTTPS
9.1.8 Using cookies
9.1.9 Scraping data from AJAX-enriched webpages with Selenium/Rwebdriver
9.1.10 Retrieving data from APIs
9.1.11 Authentication with OAuth
9.2 Extraction strategies
9.2.1 Regular expressions
9.2.2 XPath
9.2.3 Application Programming Interfaces
9.3 Web scraping: Good practice
9.3.1 Is web scraping legal?
9.3.2 What is robots.txt?
9.3.3 Be friendly!
9.4 Valuable sources of inspiration
Summary
Further reading
Problems

10 Statistical text processing
10.1 The running example: Classifying press releases of the British government
10.2 Processing textual data
10.2.1 Large-scale text operations--The tm package
10.2.2 Building a term-document matrix
10.2.3 Data cleansing
10.2.4 Sparsity and n-grams
10.3 Supervised learning techniques
10.3.1 Support vector machines
10.3.2 Random Forest
10.3.3 Maximum entropy
10.3.4 The RTextTools package
10.3.5 Application: Government press releases
10.4 Unsupervised learning techniques
10.4.1 Latent Dirichlet Allocation and correlated topic models
10.4.2 Application: Government press releases
Summary
Further reading

11 Managing data projects
11.1 Interacting with the file system
11.2 Processing multiple documents/links
11.2.1 Using for-loops
11.2.2 Using while-loops and control structures
11.2.3 Using the plyr package
11.3 Organizing scraping procedures
11.3.1 Implementation of progress feedback: Messages and progress bars
11.3.2 Error and exception handling
11.4 Executing R scripts on a regular basis
11.4.1 Scheduling tasks on Mac OS and Linux
11.4.2 Scheduling tasks on Windows platforms

Part Three: A Bag of Case Studies

12 Collaboration networks in the US Senate
12.1 Information on the bills
12.2 Information on the senators
12.3 Analyzing the network structure
12.3.1 Descriptive statistics
12.3.2 Network analysis
12.4 Conclusion

13 Parsing information from semistructured documents
13.1 Downloading data from the FTP server
13.2 Parsing semistructured text data
13.3 Visualizing station and temperature data

14 Predicting the 2014 Academy Awards using Twitter
14.1 Twitter APIs: Overview
14.1.1 The REST API
14.1.2 The Streaming APIs
14.1.3 Collecting and preparing the data
14.2 Twitter-based forecast of the 2014 Academy Awards
14.2.1 Visualizing the data
14.2.2 Mining tweets for predictions
14.3 Conclusion

15 Mapping the geographic distribution of names
15.1 Developing a data collection strategy
15.2 Website inspection
15.3 Data retrieval and information extraction
15.4 Mapping names
15.5 Automating the process
Summary

16 Gathering data on mobile phones
16.1 Page exploration
16.1.1 Searching mobile phones of a specific brand
16.1.2 Extracting product information
16.2 Scraping procedure
16.2.1 Retrieving data on several producers
16.2.2 Data cleansing
16.3 Graphical analysis
16.4 Data storage
16.4.1 General considerations
16.4.2 Table definitions for storage
16.4.3 Table definitions for future storage
16.4.4 View definitions for convenient data access
16.4.5 Functions for storing data
16.4.6 Data storage and inspection

17 Analyzing sentiments of product reviews
17.1 Introduction
17.2 Collecting the data
17.2.1 Downloading the files
17.2.2 Information extraction
17.2.3 Database storage
17.3 Analyzing the data
17.3.1 Data preparation
17.3.2 Dictionary-based sentiment analysis
17.3.3 Mining the content of reviews
17.4 Conclusion

References
General index
Package index
Function index
