Data Collection in Big Data Technology
2020-07-10

(1) System log collection method

System logs record information about the hardware, software, and problems occurring in a system, and can also be used to monitor system events. Users can consult them to diagnose the cause of an error or, after an attack, to find traces left by the attacker. System logs include operating-system logs, application logs, and security logs. (Baidu Encyclopedia) Big data platforms, such as those built on the open-source Hadoop platform, generate large volumes of high-value system log information, and how to collect it has become a research hotspot. Currently, Chukwa (developed on top of the Hadoop platform), Cloudera's Flume, and Facebook's Scribe (Li Lianning, 2016) are all representative system log collection tools. This class of collection technology can transfer hundreds of megabytes of log data per second, which meets current throughput demands. Generally speaking, though, what is most relevant to us is not this kind of collection but network data collection methods.

(2) Network data collection method

Anyone who works in natural language processing will appreciate this point. Beyond the public datasets used in day-to-day algorithm research, real projects sometimes require collecting data from live web pages, preprocessing it, and saving it. Currently there are two methods of network data collection: one is the API, and the other is the web crawler.

1. API

An API, or application programming interface, is a programming interface that a website's operators provide for users. Such an interface hides the complex logic underlying the website: a simple call is enough to request data. Mainstream social media platforms such as Sina Weibo, Baidu Tieba, and Facebook all provide API services, and related demos can be found on their official open platforms. But API access is ultimately controlled by the platform's developers: to reduce load on the site, platforms generally cap the number of calls an interface may receive per day, which is a significant inconvenience. For this reason we usually turn to the second method: web crawlers.
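As a rough illustration of how such a capped API might be consumed, here is a minimal sketch. The endpoint URL, the `access_token` parameter, and the daily limit of 150 calls are all hypothetical; a real platform documents its own endpoints and quotas on its open-platform site.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical endpoint and quota -- real platforms publish their own.
API_URL = "https://api.example.com/statuses/user_timeline"
DAILY_LIMIT = 150


class ApiClient:
    """Minimal client that refuses to exceed the platform's daily quota."""

    def __init__(self, token, limit=DAILY_LIMIT):
        self.token = token
        self.limit = limit
        self.calls = 0  # calls made so far today

    def can_call(self):
        return self.calls < self.limit

    def fetch(self, **params):
        """Call the endpoint once, counting it against the daily quota."""
        if not self.can_call():
            raise RuntimeError("daily API quota exhausted; wait for reset")
        self.calls += 1
        query = urllib.parse.urlencode({**params, "access_token": self.token})
        with urllib.request.urlopen(f"{API_URL}?{query}") as resp:
            return json.load(resp)
```

Tracking the call count client-side, as above, lets a collection job stop cleanly before the platform starts rejecting requests.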

2. Web crawler

Web crawlers (also known as web spiders or web robots, and in the FOAF community more often called web chasers) are programs or scripts that automatically collect information from the World Wide Web according to certain rules. Other, less common names are ants, automatic indexers, simulators, or worms. (Baidu Encyclopedia) The most familiar crawlers are those behind the search engines we use every day, such as Baidu and 360 Search. Crawlers of this kind are collectively called general-purpose crawlers: they collect all web pages unconditionally. The working principle of a general-purpose crawler is shown in Figure 1.

Given an initial URL, the crawler extracts and saves the required resources from the page while also extracting the other links present on it. It then sends requests to those links, receives the responses, parses the pages, extracts and saves the required resources again, and so on. The implementation is not complicated, but pay special attention to forging the IP address and request headers during collection, so that the site administrator does not notice and ban your IP (I have been banned myself); an IP ban means the failure of the entire collection task. To meet further needs, multi-threaded crawlers and topic crawlers have also emerged. A multi-threaded crawler runs several collection threads at once; broadly speaking, a few threads can speed up collection several-fold. A topic crawler is the opposite of a general-purpose crawler: through some strategy it filters out page content irrelevant to the topic (the collection task), leaving only the required data. This greatly reduces the data-sparsity problems caused by irrelevant data.
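The loop described above (fetch a page, save it, extract its links, enqueue the new ones) can be sketched as a small breadth-first crawler. This is a standard-library sketch, not a production crawler: the `fetch` function is injected by the caller (for example, `urllib.request` with the forged `User-Agent` header below), which also keeps the example testable offline. The header string is just an illustrative browser signature.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Forged headers so requests look like an ordinary browser (illustrative value).
HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}


class LinkExtractor(HTMLParser):
    """Collects href targets of <a> tags, resolved against the page's URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))


def crawl(seed_url, fetch, max_pages=10):
    """Breadth-first crawl from seed_url; returns {url: html}.

    `fetch(url) -> html` is supplied by the caller, e.g. a urllib
    wrapper that sends HEADERS with every request.
    """
    queue, seen, pages = deque([seed_url]), {seed_url}, {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        pages[url] = html              # save the resource
        extractor = LinkExtractor(url)
        extractor.feed(html)           # parse out the page's links
        for link in extractor.links:
            if link not in seen:       # avoid re-visiting pages
                seen.add(link)
                queue.append(link)
    return pages
```

A topic crawler would differ only in the inner loop: before enqueueing a link, it would score the link (or the page around it) for relevance to the topic and discard those below a threshold.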

(3) Other collection methods

Other collection methods concern organizations that hold confidential information, such as research institutes, enterprises, and government agencies: how can their data be transmitted safely? A dedicated system port can be used for data transmission, reducing the risk of data leakage.
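As a bare-bones illustration of transferring data over a dedicated port, here is a one-shot sender and receiver using TCP sockets. The port number 50007 is an arbitrary example; in practice the port would be chosen, firewalled, and the channel encrypted (e.g. with TLS) according to the organization's security policy, none of which this sketch includes.

```python
import socket

DATA_PORT = 50007  # example dedicated port (illustrative, not prescriptive)


def serve_once(host="127.0.0.1", port=DATA_PORT):
    """Accept one connection on the dedicated port; return the bytes received."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind((host, port))
        srv.listen(1)
        conn, _addr = srv.accept()
        with conn:
            return conn.recv(4096)


def send(payload, host="127.0.0.1", port=DATA_PORT):
    """Send a payload of bytes to the receiver on the dedicated port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as cli:
        cli.connect((host, port))
        cli.sendall(payload)
```

Restricting transfers to one known port makes it straightforward to firewall everything else and to audit exactly what traffic enters and leaves the collection host.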

[Conclusion] Big data collection technology is where big data technology begins, and a good start is half the battle. Choose your collection method carefully, especially where crawler technology is concerned. For most data collection tasks, topic crawlers are the better method and are well worth studying in depth.