
Writing a MapReduce program in Java to analyze web log data

In this recipe, we are going to take a look at how to write a MapReduce program to analyze web logs. Web logs are records generated by web servers for the requests they receive. There are various web servers, such as Apache, Nginx, Tomcat, and so on. Each web server logs data in a specific format. In this recipe, we are going to use data from the Apache web server, which is in the combined access log format.

Note

To read more on combined access logs, refer to

http://httpd.apache.org/docs/1.3/logs.html#combined.

Getting ready

To perform this recipe, you should already have a running Hadoop cluster as well as an IDE such as Eclipse.

How to do it...

We can write MapReduce programs to analyze various aspects of web log data. In this recipe, we are going to write a MapReduce program that reads a web log file and outputs each page URL together with its view count. Here is some sample web log data we'll consider as input for our program:

106.208.17.105 - - [12/Nov/2015:21:20:32 -0800] "GET /tutorials/mapreduce/advanced-map-reduce-examples-1.html HTTP/1.1" 200 0 "https://www.google.co.in/" "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
60.250.32.153 - - [12/Nov/2015:21:42:14 -0800] "GET /tutorials/elasticsearch/install-elasticsearch-kibana-logstash-on-windows.html HTTP/1.1" 304 0 - "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36" 
49.49.250.23 - - [12/Nov/2015:21:40:56 -0800] "GET /tutorials/hadoop/images/internals-of-hdfs-file-read-operations/HDFS_Read_Write.png HTTP/1.1" 200 0 "http://hadooptutorials.co.in/tutorials/spark/install-apache-spark-on-ubuntu.html" "Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; Touch; LCTE; rv:11.0) like Gecko"
60.250.32.153 - - [12/Nov/2015:21:36:01 -0800] "GET /tutorials/elasticsearch/install-elasticsearch-kibana-logstash-on-windows.html HTTP/1.1" 200 0 - "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
91.200.12.136 - - [12/Nov/2015:21:30:14 -0800] "GET /tutorials/hadoop/hadoop-fundamentals.html HTTP/1.1" 200 0 "http://hadooptutorials.co.in/tutorials/hadoop/hadoop-fundamentals.html" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.99 Safari/537.36"

These combined Apache access logs follow a specific format. Here is the sequence and meaning of each component of an access log entry (a short parsing sketch follows the list):

  • %h: This is the remote host (that is, the client's IP address)
  • %l: This is the identity of the client as determined by identd (this is not usually used since it's not reliable)
  • %u: This is the username determined by the HTTP authentication
  • %t: This is the time at which the server received the request
  • %r: This is the request line from the client ("GET / HTTP/1.0")
  • %>s: This is the status code sent from a server to a client (200, 404, and so on)
  • %b: This is the size of the response given to a client (in bytes)
  • Referrer: This is the page from which the request was referred (the linking page)
  • User agent: This is the browser identification string
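Before plugging this format into a mapper, it can help to see how the regex capture groups line up with these fields. The following standalone sketch (not part of the recipe's final code; the class name LogLineParseCheck is just for illustration) parses one of the sample lines above using the same pattern the mapper will use:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LogLineParseCheck {
    // Same regular expression as the one used in the mapper below
    private static final Pattern PATTERN = Pattern.compile(
            "^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] \"(\\S+) (\\S+) (\\S+)\" (\\d{3}) (\\d+) (.+?) \"([^\"]+|(.+?))\"");

    public static void main(String[] args) {
        // One of the sample log lines shown earlier
        String line = "91.200.12.136 - - [12/Nov/2015:21:30:14 -0800] "
                + "\"GET /tutorials/hadoop/hadoop-fundamentals.html HTTP/1.1\" 200 0 "
                + "\"http://hadooptutorials.co.in/tutorials/hadoop/hadoop-fundamentals.html\" "
                + "\"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.99 Safari/537.36\"";

        Matcher matcher = PATTERN.matcher(line);
        if (matcher.matches()) {
            System.out.println("%h  (remote host)  : " + matcher.group(1));
            System.out.println("%t  (request time) : " + matcher.group(4));
            System.out.println("%r  (method)       : " + matcher.group(5));
            System.out.println("%r  (page URL)     : " + matcher.group(6)); // the value the mapper emits as its key
            System.out.println("%>s (status code)  : " + matcher.group(8));
            System.out.println("%b  (response size): " + matcher.group(9));
        }
    }
}

Group 6 holds the requested page URL, which is why the mapper that follows reads matcher.group(6).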

Now, let's start writing a program to get the page view count of each unique URL that we have in our web logs.

First, we will write a mapper class that reads each line and parses it to extract the page URL. Here, we will use Java's Pattern and Matcher utilities to extract this information:

public static class PageViewMapper extends Mapper<Object, Text, Text, IntWritable> {
        public static String APACHE_ACCESS_LOGS_PATTERN = "^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] \"(\\S+) (\\S+) (\\S+)\" (\\d{3}) (\\d+) (.+?) \"([^\"]+|(.+?))\"";

        public static Pattern pattern = Pattern.compile(APACHE_ACCESS_LOGS_PATTERN);

        private static final IntWritable one = new IntWritable(1);
        private Text url = new Text();

        public void map(Object key, Text value, Mapper<Object, Text, Text, IntWritable>.Context context)
                throws IOException, InterruptedException {
            Matcher matcher = pattern.matcher(value.toString());
            if (matcher.matches()) {
                // Group 6 as we want only Page URL
                url.set(matcher.group(6));
                System.out.println(url.toString());
                context.write(this.url, one);
            }

        }
    }

In the preceding mapper class, we read key-value pairs from the text file. By default, the key is the byte offset at which the line starts in the file, and the value is the line itself. Next, we match the line against the Apache access log regex pattern so that we can extract exactly the information we need. For a page view counter, we only need the URL. The mapper outputs the URL as the key and 1 as the value so that we can count these URLs in the reducer.
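To make this concrete, here are the intermediate key-value pairs the mapper would emit for the five sample lines shown earlier (assuming every line matches the regex):

(/tutorials/mapreduce/advanced-map-reduce-examples-1.html, 1)
(/tutorials/elasticsearch/install-elasticsearch-kibana-logstash-on-windows.html, 1)
(/tutorials/hadoop/images/internals-of-hdfs-file-read-operations/HDFS_Read_Write.png, 1)
(/tutorials/elasticsearch/install-elasticsearch-kibana-logstash-on-windows.html, 1)
(/tutorials/hadoop/hadoop-fundamentals.html, 1)

The Elasticsearch tutorial page appears twice, so the reducer will report a count of 2 for it.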

Here is the reducer class that sums up the output values of the mapper class:

public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values,
                Reducer<Text, IntWritable, Text, IntWritable>.Context context)
                        throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            this.result.set(sum);
            context.write(key, this.result);
        }
    }

Now, we just need a driver class to call these mappers and reducers:

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// These imports cover the whole class, including the nested mapper and reducer shown earlier
public class PageViewCounter {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        if (args.length != 2) {
            System.err.println("Usage: PageViewCounter <in><out>");
            System.exit(2);
        }
        Job job = Job.getInstance(conf, "Page View Counter");
        job.setJarByClass(PageViewCounter.class);
        job.setMapperClass(PageViewMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));

        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

As the operation we are performing is an aggregation, we can also use a combiner to optimize the intermediate results. Here, the same reducer class is reused as the combiner.

To compile your program, you need to add two external JARs to your build path: hadoop-common-2.7.jar, which can be found in the /usr/local/hadoop/share/hadoop/common folder, and hadoop-mapreduce-client-core-2.7.jar, which can be found in the /usr/local/hadoop/share/hadoop/mapreduce folder.

Make sure both JARs are on your build path so that the program compiles without errors.

How it works...

The page view counter program helps us find the most popular pages, the least accessed pages, and so on. Such information helps us make decisions about the ranking of pages, the frequency of visits, and the relevance of a page. When the program is executed, each line of the HDFS block is read individually and then sent to the mapper. The mapper matches the input line against the log format and extracts the page URL. The mapper then emits (URL, 1) key-value pairs. These pairs are partitioned and shuffled across nodes so that all the pairs for the same URL end up at a single reducer. The reducer adds up all the values for each key and emits the total. This way, we get results in the form of a URL and the number of times it was accessed.
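For example, with the five sample log lines given at the start of this recipe as input (and assuming every line matches the regex), the job's final output would look like this, with each URL and its count separated by a tab:

/tutorials/elasticsearch/install-elasticsearch-kibana-logstash-on-windows.html    2
/tutorials/hadoop/hadoop-fundamentals.html    1
/tutorials/hadoop/images/internals-of-hdfs-file-read-operations/HDFS_Read_Write.png    1
/tutorials/mapreduce/advanced-map-reduce-examples-1.html    1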
