Big Data Visualization by MapReduce for Discovering the Relationship Between Pollutant Gases

ABSTRACT
Big data mining and pollution are extremely important issues today. This study used an innovative method, based on MapReduce, for visually discovering the relationships between pollutant gases. One-dimensional, two-dimensional, and three-dimensional visualizations were used to display the data, which were processed as hourly readings for one year from an air quality monitoring station, in order to study the behaviour of the pollutant gas distributions and to show graphically the distribution of one, two, or three gases. The data comprise 8760 hourly readings for each of the five pollutant gases under study. Pearson correlation was used to explore numerically the correlation between the pollutant gases, and the eta factor was used to evaluate the effect of one gas on the others. Both approaches, visual and numerical, revealed the same relationships between the pollutant gases. Ozone has a moderate negative correlation of -0.622 with nitrogen dioxide, and weak negative correlations of -0.248 with carbon monoxide and -0.155 with carbon dioxide. Ozone has approximately no correlation (0.060) with sulphur dioxide. Carbon monoxide has a moderate positive correlation of 0.364 with carbon dioxide. The eta factor is very weak between ozone and nitrogen dioxide (0.292) and between ozone and sulphur dioxide (0.009), which indicates that the sources of ozone, nitrogen dioxide, and sulphur dioxide are different. The study recommends that each country analyse, visually and numerically, the big data collected yearly from its monitoring stations in order to control pollutant gases, especially near large industrial facilities.

INTRODUCTION
Big data mining is the processing of large and complex data sets that accumulate over time and grow rapidly. Big data is a term used to describe some current directions in technology, and a concept that must be taken into consideration in data analysis. It is important to note that most big data is unstructured and unorganized, and therefore difficult to fit into conventional databases [1,2]. Across many scientific fields, state-of-the-art sensors are used to gather big data. For instance, sensors are commonly used to obtain valuable physical, chemical, and biological data, which in turn supports improvements in socio-economic well-being. Such data are equally important in applications such as weather forecasting, monitoring, and timely response to natural disasters and climate change [3,5]. Big data are information assets characterized by high volume, velocity, and variety that require specific technology and analytical methods for their transformation into useful information [6]. Protecting nature and reducing environmental pollution through data processing is one of the major achievements of big data analysis. Big data analysis can be an efficient tool that provides information for a sustainable economic and social future [7]. China, for example, has started using big data analysis for ecological and environmental protection [8]. As the demand for oil has increased, air quality monitoring stations have measured pollutant gases such as ozone (O3), nitrogen dioxide (NO2), sulphur dioxide (SO2), carbon dioxide (CO2), carbon monoxide (CO), and others. Air quality stations are usually located near potentially polluted areas such as oil refineries and factories. These stations also measure meteorological parameters such as temperature, relative humidity, wind speed, wind direction, and many others that are important for monitoring air quality.
These stations operate 24 hours a day and record one reading every 5 minutes. Readings from many stations in the Arab Gulf region, covering more than two decades, were available for this paper. The data used in this study were pre-processed and analysed for only one air quality monitoring station in 2020.

LITERATURE REVIEW
Data mining is usually known as the technique for extracting useful knowledge from databases. It is the procedure that explores data to discover unknown relations among the data that are of interest to the user [9], [10].
Data mining describes the process of selecting, exploring, and modelling very large quantities of different types of data. The aim is to find out whether the data contain previously unknown relationships or regularities, and so to obtain clear, relevant, and useful results for the data miners [11,13].
Data mining includes techniques such as classification, clustering, and prediction. Classification is the process of finding a set of models that describe and distinguish data classes or concepts. Clustering is concerned with decomposing or partitioning a data set into groups so that the elements in one group are similar to each other and as different as possible from the elements in other groups [14,15]. There are many algorithms and frameworks for data mining, such as C4.5, k-Means, MapReduce, Hadoop, SVM, Apriori, EM, kNN, Naïve Bayes, CART, Random Forest, and Artificial Neural Networks [16,19].
Data mining algorithms can be used for big data analysis to improve air quality analysis [20]. MapReduce simplifies data processing on large clusters by allowing a single computation task to be split across multiple nodes for distributed processing [21]. MapReduce consists of two processes, map and reduce: the map process performs filtering and sorting, and the reduce process performs a summary operation. In the map process, the master node takes the problem, divides it into smaller sub-problems, and distributes them to the worker nodes. Each worker node processes its smaller problem and passes the answer back to its master node [21]. The individual nodes perform the computation and return their results to the reduce function, which collects the individual results to generate the final output [22]. Shirwadkar [23] used MapReduce in 2017 to process Texas air quality data.
Data visualization is one of the significant components of data analytics in the age of big data. Results can be presented in three forms: one-dimensional (1D), two-dimensional (2D), and three-dimensional (3D) plots. 3D trajectory reconstruction offers important opportunities to visualize data [24]. Visual analytics enables organizations to take raw data and present it in a visual form that represents the data. Visualizing big data inevitably raises some challenges, but the opportunities for success with a data visualization strategy are much greater [25]. Data visualization describes any effort to help people understand the significance of data by placing it in a visual context. It helps data engineers and scientists keep track of data sources and carry out basic exploratory analysis of data [26,27]. In large-scale data visualization, many researchers have used feature extraction and geometric modelling to greatly reduce data size before the actual data rendering [28].
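The map and reduce phases described in this section can be illustrated with a minimal single-machine sketch in Python. It is not the implementation used in this study: the sample records, the choice of (gas, hour) as the key, and the function names `map_phase`, `shuffle`, `reduce_phase`, and `run_job` are all assumptions made for illustration. A real MapReduce framework would distribute the same three steps across worker nodes.

```python
from collections import defaultdict

# Hypothetical sample records: (gas, hour_of_day, concentration).
READINGS = [
    ("O3", 13, 40.0),
    ("O3", 13, 44.0),
    ("NO2", 13, 10.0),
    ("NO2", 21, 30.0),
    ("O3", 21, 12.0),
]


def map_phase(record):
    """Map: emit one ((gas, hour), concentration) pair per reading."""
    gas, hour, value = record
    yield (gas, hour), value


def shuffle(pairs):
    """Group emitted values by key, as a framework does between the phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: summarise all values of one key as their mean."""
    return key, sum(values) / len(values)


def run_job(records):
    """Run map, shuffle, and reduce in sequence on one machine."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


result = run_job(READINGS)
```

Here `result` maps each (gas, hour) key to a mean concentration, which is the kind of summary that a per-hour chart of one gas would plot.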
Visualization is a very important tool and can be used in mining big data, for example by showing the concentrations of gases [29]. One of the most important uses of data visualization is in a central control centre, whose responsibility is to collect and monitor all information in real time in order to take appropriate decisions [30].

MATERIALS AND METHODS
Data pre-processing helps to improve the accuracy of the analysis: data may be deleted or edited to eliminate redundancy and improve data quality [31]. Pre-processing was therefore applied to the collected data. The data in this paper were collected from one of the air quality monitoring stations in the Arabian Gulf region and cover the five pollutant gases O3, NO2, SO2, CO2, and CO. These data were pre-processed to deal with missing readings. There is one reading every five minutes, so the total number of readings for the five gases is 12 × 24 × 365 × 5 = 525,600. The average of the readings in each hour was calculated and used as the hourly reading, and these hourly readings were used in the data analysis in this paper. The number of readings used in the analysis is therefore 24 × 365 = 8760 for each of the five gases in one year. Given this volume of data, MapReduce was chosen to show graphically the relationships between the five gases.
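The reading counts and the hourly averaging step can be checked with a short Python sketch. With 12 readings per hour, 24 hours per day, and 365 days, each gas has 105,120 five-minute readings, the five gases together have 525,600, and averaging per hour leaves 8760 readings per gas. The function `hourly_means`, the sample data, and the policy of skipping missing values (given as `None`) are illustrative assumptions; the paper does not state how missing readings were handled.

```python
from statistics import mean

READINGS_PER_HOUR = 12   # one reading every 5 minutes
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365      # the value used in the paper
GASES = ["O3", "NO2", "SO2", "CO2", "CO"]

raw_per_gas = READINGS_PER_HOUR * HOURS_PER_DAY * DAYS_PER_YEAR
raw_total = raw_per_gas * len(GASES)
hourly_per_gas = HOURS_PER_DAY * DAYS_PER_YEAR


def hourly_means(readings, per_hour=READINGS_PER_HOUR):
    """Average consecutive blocks of `per_hour` readings.

    Missing readings are passed as None and skipped. An hour with no
    valid reading yields None (an assumed policy, not from the paper).
    """
    result = []
    for start in range(0, len(readings), per_hour):
        block = readings[start:start + per_hour]
        valid = [v for v in block if v is not None]
        result.append(mean(valid) if valid else None)
    return result


# Hypothetical two hours of 5-minute data with one missing value.
sample = [10.0] * 12 + [20.0] * 5 + [None] + [20.0] * 6
hourly = hourly_means(sample)
```

Applied to a full year of five-minute readings for one gas, the same function would return the 8760 hourly values used in the analysis.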
The descriptive method was used to visualize and understand what happened in 2020 by using MapReduce. Visualizing the data is very important for exploring its nature; without visualization, it would not be easy to identify the relations between the pollutant gases over the time series. Visualization is one of the important techniques that help decision makers to identify, quickly and in an understandable way, any increase in the concentrations of pollutant gases that could have a harmful effect on the climate over time. The second approach is the quantitative method, which uses statistical analysis such as Pearson correlation to measure the correlation between the pollutant gases, and the eta factor to determine the effect of one pollutant gas on the others.

RESULTS AND DISCUSSION
Data visualization is an important tool for representing data in graphical format. It enables decision makers to understand graphically the distributions, patterns, and trends of the readings, and to make their decisions faster and more accurately. Fig. 1 shows the five 1D charts, which represent the 8760 readings for each gas during the 12 months of 2020. The x-axis represents the hour of the day (00:00-23:00) and the y-axis represents the gas concentration. The O3 chart shows that its highest concentrations occur during the daytime, from 12:00 to 20:00. NO2 has its lowest concentrations during the daytime, especially from 10:00 to 16:00. The SO2 chart shows that its highest concentrations occur from 07:00 to 14:00 and from 18:00 to 22:00. The CO2 chart shows that its concentration distribution is approximately the same throughout the 24 hours. The CO chart shows that its highest concentrations occur from 06:00 to 10:00 and from 19:00 to 00:00.
Journal port Science Research Available online www.jport.co Volume 4, No:2 2021

Fig. 1: the daily 1D charts for the five pollutant gases in year 2020.
Fig. 2 shows the ten 2D scatter charts of the concentration distribution between the five gases. These charts represent the 8760 readings for each gas during the 12 months of 2020; the colour scale encodes the month, from blue for January to red for December. In these charts, the x-axis represents the independent variable and the y-axis represents the dependent variable that it affects. The first row of charts in this figure shows the concentration distributions of NO2, SO2, CO2, and CO against O3. O3 has an inverse relationship with NO2, because the highest concentrations of the two gases are distributed in opposite directions along the x-axis and y-axis. The other pollutant gases, SO2, CO2, and CO, have a weak correlation with O3, because they show no regular distribution. The second row of charts shows that NO2 has no relation with SO2 but has a positive relation with CO2 and CO, which indicates that the sources of NO2, CO2, and CO are the same. The third row of charts shows that SO2 has no relation with CO2 or CO. The single chart in the fourth row shows a positive correlation between CO and CO2.
Fig. 3 shows the ten 3D scatter charts. The three charts in the first row show an inverse relation between O3 and NO2, while O3 has no relation with SO2. However, O3 has a positive relation with CO2 and CO, because its distribution with them has a regular shape. The first two charts in the second row show that O3 has no relation with SO2 but again has a positive relation with CO2 and CO, which is also supported by the third chart in the second row. The three charts in the third row show that NO2 has no relation with SO2 but has a positive relation with CO2 and CO, because its distribution with them has a regular shape. The last chart, in the fourth row, shows that SO2 has a positive relation with CO2 and CO, because its distribution with them has a regular shape.
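The count of ten charts in each figure follows from the number of ways to choose two or three gases out of five. The short Python sketch below enumerates these combinations. Grouping the pairs by their first gas mirrors the row layout described for the 2D charts (four charts with O3, three with NO2, two with SO2, one with CO2); the variable names are illustrative assumptions.

```python
from itertools import combinations

GASES = ["O3", "NO2", "SO2", "CO2", "CO"]

# One 2D scatter chart per unordered pair of gases.
pairs = list(combinations(GASES, 2))

# One 3D scatter chart per unordered triple of gases.
triples = list(combinations(GASES, 3))

# Rows of the 2D figure: all pairs that share the same first gas.
rows_2d = {gas: [p for p in pairs if p[0] == gas] for gas in GASES[:-1]}
```

Both `pairs` and `triples` contain ten entries, and `rows_2d` holds rows of four, three, two, and one chart.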
Before the correlation between the five pollutant gases was tested, the normality of the data distributions was checked with the one-sample Kolmogorov-Smirnov test. Table 1 shows that the five pollutant gases have non-normal distributions, because the Sig. (2-tailed) values are 0.000 for all five gases. Therefore, the Spearman test was used to find the correlation.
The correlation coefficient is denoted by r in this paper. According to Vaske, Beaman, and Sponarski [32], a coefficient with |r| ≥ 0.7 indicates a strong association between variables, 0.3 ≤ |r| < 0.7 indicates a moderate correlation, and |r| < 0.3 indicates a weak correlation. Table 2 shows the correlation between the five pollutant gases. O3 clearly has a moderate negative correlation of -0.622 with NO2, while it has weak correlations with SO2, CO2, and CO. NO2 has a moderate positive correlation of 0.422 with CO2 and a moderate positive correlation of 0.333 with CO. SO2 has a moderate positive correlation of 0.324 with CO2. CO2 has a moderate positive correlation of 0.364 with CO. These results are consistent with the 2D and 3D charts.
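The Spearman coefficient and the strength thresholds quoted above can be sketched in plain Python as follows. This is an illustrative implementation, not the statistical package used in the study: Spearman's coefficient is computed as the Pearson correlation of the ranks, with tied values sharing their average rank, and the function names are assumptions.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def strength(r):
    """Classify |r| with the thresholds cited from [32]."""
    magnitude = abs(r)
    if magnitude >= 0.7:
        return "strong"
    if magnitude >= 0.3:
        return "moderate"
    return "weak"
```

With these thresholds, the reported value of -0.622 between O3 and NO2 is classified as moderate, and the reported value of 0.060 between O3 and SO2 as weak.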
To test the effect of one gas on another, the paired-samples t-test was applied to measure the effect factor eta (η). Table 3 shows that the effect of O3 on NO2 is very weak (0.292), which indicates that the sources of these gases are different; O3 also has no effect on SO2. In contrast, O3 has a high effect on CO2 and CO, which indicates that the pollution source that generates O3 also generates CO2 and CO. NO2 has a weak effect on SO2 and a high effect on CO2 and CO. SO2 has a high effect on CO2 and CO, which indicates that the source of pollution for SO2, CO2, and CO is the same. CO2 has a high effect on CO, which indicates that their source of pollution is the same.
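The paper does not state the formula used to derive the eta factor from the paired-samples t-test. One commonly used effect size for this test is eta squared, computed as t² / (t² + N - 1), where N is the number of pairs. The Python sketch below assumes that formula; it may differ from the calculation behind Table 3, and the function names and sample values are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(x, y):
    """Return (t, df) of the paired-samples t-test for sequences x and y."""
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    # t statistic: mean difference divided by its standard error.
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1


def eta_squared(t, df):
    """Assumed effect size: eta squared = t^2 / (t^2 + df)."""
    return t * t / (t * t + df)


# Hypothetical paired hourly readings of two gases.
gas_a = [1.0, 2.0, 3.0, 4.0]
gas_b = [0.0, 0.0, 0.0, 0.0]
t_value, dof = paired_t(gas_a, gas_b)
effect = eta_squared(t_value, dof)
```

Under this definition the effect size lies between 0 and 1, with values near 0 indicating almost no effect.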

CONCLUSION
In this paper, three methods were used to represent the concentrations of the five pollutant gases, the relationships between them, and the effect of one gas on the others. The innovative aspect of this paper is the use of MapReduce to show graphically the distributions of the hourly pollutant gas readings for one year. The 1D charts showed the distribution of each gas over the 24 hours; this type of chart helps to group the big data readings into sets with the same or similar values. The most important result from these charts is that the highest O3 concentrations occur mostly during the daytime, from 12:00 to 20:00, while NO2 shows the reverse behaviour, with its highest concentrations from 20:00 until 08:00 the next day. The distribution of CO is approximately similar to that of NO2, which means that when the concentration of NO2 increases, CO also increases. These two gases are very harmful to human life, and their main source is the burning of car fuel, because most people in this country use their cars at night owing to the high daytime temperatures, especially in summer, which lasts from March to October. The CO2 distribution is approximately the same throughout the 24 hours, because the main sources of this gas are the oil refineries and electric power stations, which operate 24 hours a day.
The 2D charts showed the combined distribution of each pair of gases during the 12 months of the year. The highest pollutant gas concentrations occurred during the summer months. The charts also showed a divergence between O3 and NO2, which is another indicator that the sources of O3 and NO2 are different. NO2 and CO, in contrast, vary in the same direction, which is further evidence that their sources are the same. SO2 does not show any regular behaviour with the other four pollutant gases.
The 3D charts showed the combined distribution of three different pollutant gases during the 12 months of the year. The important result from this type of chart is that O3 and NO2 are both related to CO2 and CO, which indicates that the emission sources of O3 and NO2 also generate CO2 and CO over the same periods, even though the sources of O3 and NO2 are different.
The Spearman correlation showed a moderate negative correlation (-0.622) between O3 and NO2, but the eta effect factor between them is low, which also indicates that the emission sources of these two gases are not the same. O3 has a high effect on CO2 and CO while having a weak correlation with them, because the sources of CO2 and CO are the refineries and electric power stations, together with the burning of car fuel, which also increases the concentration of these two gases. The correlation between CO2 and CO is moderate, but the eta effect factor between them is very high, which indicates that the emission sources of these two gases are the same.
In summary, the methods of analysis used in this paper are effective for monitoring pollutant gases, measuring the correlation between them, and determining the effect of one gas on the others. MapReduce is one of the important methods for showing graphically the distribution of big data, especially air pollution data, because the volume of data collected by air quality monitoring stations keeps growing, owing to the importance of monitoring air quality every minute in order to reduce the effect of pollutant gases on climate change.