[go: up one dir, main page]

Academia.eduAcademia.edu
View metadata, citation and similar papers at core.ac.uk brought to you by CORE provided by Binus University Repository Comparison of Similarity Coefficients on Morphological Rodent Tuber 1,2 Iwan Binanto 1Computer Science Department, BINUS Graduate Program – Doctor of Computer Science Bina Nusantara University Jakarta, Indonesia 11480 2Informatics Department, Sanata Dharma University, Yogyakarta, Indonesia 55002 iwan@usd.ac.id Lukas Cognitive Engineering Research Group (CERG), Faculty of Engineering, Universitas Katolik Indonesia Atma Jaya, Jakarta, Indonesia 12930 lukas@atmajaya.ac.id 1 Harco Leslie Hendric Spits Warnars, 2Bahtiar Saleh Abbas, 3 Yaya Heryadi Computer Science Department, BINUS Graduate Program, Doctor of Computer Science Bina Nusantara University Jakarta, Indonesia 11480 1spits.hendric@binus.ac.id, 2bahtiars@binus.edu, 3yayaheryadi@binus.edu 1,2 Nesti Fronika Sianipar 1Food Technology Department, Faculty of Engineering, Bina Nusantara University, Jakarta, Indonesia 11480 2Research Interest Group Biotechnology, Bina Nusantara University, Jakarta, Indonesia 11480 nsianipar@binus.edu Horacio Emilio Perez Sanchez Bioinformatics and High Performance Computing Research Group (BIOHPC), Universidad Católica de Murcia (UCAM), Guadalupe, Spain 30107 hperez@ucam.edu Abstract— Many comparisons of similarity coefficient done by researchers, especially in the field of biology. This comparison aims to find the most appropriate similarity coefficient for some cases. Many results found that Sorensen-dice coefficient and Jaccard coefficient is close or even identical. But Jaccard coefficient can not handle properly for sets with real-value or weighted sets or any pair of vectors. So, Jaccard coefficient redefined as Generalized Jaccard Coefficient. This paper shows the correlation between Sorensen-dice coefficient with Generalized Jaccard Coefficient using Spearman's correlation as predecessors research did and using ANOVA to ensure the results. This research find that the comparison between them is less similar from predecessors research. Keywords— Generalized Jaccard Similarity, Sorensen-Dice Similarity, similarity coefficient, comparison, rodent tuber I. INTRODUCTION The similarity is necessary to examine the objects of investigation; in this case, the mutant of Rodent Tuber (Typhonium flagelliforme Lodd.) derived from breeding with its parent, called control plant. The research of Rodent Tuber was performed by Sianipar, et al. in [1]–[5] utilizing NTSys, which is proprietary software. One of their research objectives is to find similarity. By the discovery of similarity, it will be easier to find its dissimilarity, because the real purpose of the breeding is to find the diversity of produced mutants [6] [7]. Rodent clones which are compared to the control plants [4]. This paper using the data from [4] as in Table I. Sianipar et al. measure the similarity between the mutants of Rodent Tuber and the control plant using Sorensen-Dice coefficient [1]–[5]. The formula of SorensenDice coefficient is: (1) Beside of Sorensen-Dice coefficient, there are many coefficient similarities, one of them is Jaccard coefficient which had approximately identical results in [8], [9] or has close result in [10] or a very close result in [11] to SorensenDice coefficient. The Jaccard coefficient created for analyses in phytology [12] and works well with binary data as well as Sorensen-Dice coefficient. Many research using Jaccard coefficient for measuring similarities in a various field [8]–[17]. The formula of Jaccard coefficient is: (2) Jaccard coefficient is simple and effective in many applications [13], [18] but it can not handle properly for sets with real-value or weighted sets [18] or any pair of vectors [19], therefore it redefine and explained well as the Generalized Jaccard Coefficient in [19], for short we call it GJS, and also introduced and used in [18]–[22] as: One of Sianipar's investigations is the morphological observation of Rodent Tuber, which has been given gamma irradiation. According to this investigations, gamma irradiation at 6 Gy’s dose was able to increase the number of shoots and leaves, and also the height of the plant of the 978-1-5386-9422-0/18/$31.00 ©2018 IEEE The 1st 2018 INAPR International Conference, 7 Sept 2018, Jakarta, Indonesia (3) 104 This paper discusses Generalized Jaccard Coefficient compared to Sorensen-Dice Coefficient (result from proprietary software namely NTSys) using Spearman’s correlation as [8]–[11] did. images), gene sequences, etc. Those are weighted sets or pair of vectors. Weighted sets or any pair of vectors are more commonly found than binary sets. If A and B are binary or sets, then the similarity measure is called Jaccard coefficient as mentioned in [8]–[17]. According to [18], [19], Jaccard coefficient cannot handle properly for sets with real-value called weighted sets or any pair of vectors. III. METHOD This paper uses raw data and Sorensen-Dice similarity table from [4] as in Table I and Table IV respectively. Generalized Jaccard coefficient calculated with formula (3) and have a result as in Table V. It was done using Microsoft Excel. To calculate the correlation, each similarity table converted to be 1 column, so we have two columns which are Generalized Jaccard column and Sorensen-Dice column. From here, we can plot the data as in Fig. 1. Then Spearman’s correlation calculated to find the value of correlation between Table IV and Table V. It done using MATLAB with the script as below: a = xlsread('Book2.xlsx','A:A') b = xlsread('Book2.xlsx','B:B') [RHO] = corr(a,b,'Type','Spearman'); The script generates RHO value 0.5052, which is the value of Spearman’s correlation. Fig. 1. Plot Sorensen-Dice and Generalized Jaccard coefficient II. LITERATURE REVIEW Rodent Tuber is a plant native to Indonesia that has been used as traditional medicine for many years. This plant contains detoxification and anti-cancer compounds. These anticancer compounds exist in all parts of the plant, including roots, tubers, stems, and leaves. Unfortunately, this plant does not have much genetic diversity, so it becomes an obstacle regarding obtaining plants that have higher anticancer compounds. Sianipar et al. began to develop mutants using gamma radiation [23]. To test the genetic diversity of the mutant plants produced, Sianipar et al. did a similarity test using the NTSys software with Sorensen-Dice coefficient [1]–[5]. Duarte et al. in [8] compared eight similarity coefficients using the Spearman’s correlation and dendrogram to test similarity in common beans based on the RAPD marker. One of the result is Sorensen-Dice, and the Jaccard coefficient has an identical result. Murguia et al. in [9] compared nine similarity coefficients to estimate the effect of biogeographic classification; the result is Sorensen-Dice and Jaccard coefficient had identical results. Silva et al. in [10] compared eight similarity coefficient using Spearman's correlation, and the result is Sorensen-Dice and Jaccard coefficient had a close result. Dalirsefat et al. in [11] compared three similarity coefficient one of the comparison tools is the Spearman’s correlation and of the result of a correlation value between Sorensen-Dice and Jaccard coefficient is one which means exactly same. Shrivastava (2016) in [19] said that GJS (A, B) is often used to compare web documents, histograms (especially To ensure the correlation between Generalized Jaccard coefficient and Sorensen-Dice coefficient, we construct a hypothesis which are: H 0: No correlation between Generalized Jaccard coefficient and Sorensen-Dice coefficient Ha: There is a correlation between Generalized Jaccard coefficient and Sorensen-Dice coefficient This hypothesis evaluated with ANOVA using Microsoft Excel and the result provided as in Table II. TABLE I. RAW DATA FROM [4] Clone Shoot Leaf control 6-3-3-6 6-9-3 6-9-4 6-2-5-3 6-3-2-5 6-1-1-2 6-9-1 6-2-4-1 6-6-3-7 6-6-3-6 6-2-7 6-2-6-3 6-1-2 6-1-1-6 6-2-8-2 6-9-5 6-3-3-10 0 1 2.5 0.4 0.5 1.5 3.5 2.5 0 0.5 1 0 0 4.5 1 2.5 0 0 1 6 3.5 4 7 8 2 11 2 6 6 5.5 5 15 2 11.5 12.5 1.5 Plant Height (cm) 3.5 4 4 12.5 12 13.5 6 4.5 3 7.5 12.5 12 5.5 8.3 5 6.5 10.3 7.5 978-1-5386-9422-0/18/$31.00 ©2018 IEEE The 1st 2018 INAPR International Conference, 7 Sept 2018, Jakarta, Indonesia 105 TABLE II. IV. RESULTS AND DISCUSSIONS Duarte et al. in [8] concluded that the result is SorensenDice and the Jaccard coefficient has an identical result. Murguia et al. in [9] had a result that Sorensen-Dice and Jaccard coefficient had identical results. Silva et al. in [10] concluded that Sorensen-Dice and Jaccard coefficient had a close result. Dalirsefat et al. in [11] had the result that correlation value between Sorensen-Dice and Jaccard coefficient is one which means exactly same. They made a comparison between Sorensen-dice coefficient and Jaccard coefficient where both are used binary data. This paper uses Generalized Jaccard coefficient for real-value data. According to [19], Jaccard coefficient similar to Generalized Jaccard coefficient. But in this research, the result of Spearman’s correlation is 0.5052 as above, which means there is a moderate positive correlation, as in Table III [24]. It is not close, very close, nor even identical. TABLE III. ANOVA SINGLE FACTOR INTERPRETING CORRELATION COEFFICIENT [24] Correlation Value 0.90 to 1.00 (-0.90 to -1.00) 0.70 to 0.90 (-0.70 to -0.90) 0.50 to 0.70 (-0.50 to -0.70) 0.30 to 0.50 (-0.30 to -0.50) 0.00 to 0.30 (0.00 to -0.30) TABLE IV. TABLE V. Interpretation Very High Positive/Negative Correlation High Positive/Negative Correlation Moderate Positive/Negative Correlation Low Positive/Negative Correlation Negligible Correlation RESULT OF SORENSEN-DICE COEFFICIENT RESULT OF GENERALIZED JACCARD COEFFICIENT 978-1-5386-9422-0/18/$31.00 ©2018 IEEE The 1st 2018 INAPR International Conference, 7 Sept 2018, Jakarta, Indonesia 106 V. CONCLUSIONS In previous research on the comparison between Jaccard coefficient and Sorensen-Dice coefficient [8]–[11], showing the results that both have close correlations up to identical. But Jaccard coefficient can not handle properly for sets with real-value or weighted sets [18] or any pair of vectors [19], so the Generalized Jaccard coefficient is used. In this study, Sorensen-Dice coefficient compared with Generalized Jaccard coefficient and the result is there is a moderate correlation with the Spearman's correlation value is 0.5052. This result less similar than the previous research in [8]– [11]. Because of this, we are not recommending to use Generalized Jaccard coefficient if already use SorensenDice coefficient to avoid confusion. REFERENCES [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] N. F. Sianipar, Ariandana, and W. Maarisit, “Detection of GammaIrradiated Mutant of Rodent Tuber ( Typhonium flagelliforme Lodd) In Vitro Culture by RAPD Molecular Marker,” vol. 14, pp. 285–294, 2015. D. Laurent, N. F. Sianipar, Chelen, Listiarini, and A. Wantho, “Analysis of Genetic Diversity of Indonesia Rodent Tuber (Typhonium flagelliforme Lodd.) Cultivars Based on RAPD Marker),” in The 3rd International Conference on Biological Science 2013 (The 3rd ICBS-2013), 2015, vol. 2, pp. 139–145. N. F. Sianipar, D. Laurent, R. Purnamaningsih, and I. Darwati, “SHORT COMMUNICATION Genetic Variation of the First Generation of Rodent Tuber ( Typhonium flagelliforme Lodd .) Mutants Based on RAPD Molecular Markers,” vol. 22, no. 2, pp. 98–104, 2015. N. F. Sianipar, R. Purnamaningsih, D. L. Gumanti, Rosaria, and M. Vidianti, “Analysis of Gamma Irradiated-Third Generation Mutants of Rodent Tuber ( Typhonium flagelliforme Lodd .) Based on Morphology , RAPD , and GC-MS Markers,” Pertanika J. Trop. Agric. Sci., vol. 40, no. 1, pp. 185–202, 2017. N. F. Sianipar, R. Purnamaningsih, D. L. Gumanti, Rosaria, and M. Vidianti, “Analysis Of Gamma Irradiated Fourth Generation Mutant Of Rodent Tuber (Typhonium Flagelliforme Lodd.) Based On Morphology And RAPD Markers,” J. Teknol., vol. 78, no. 5–6, pp. 41–49, 2016. R. Hesananda et al., “Supervised Classification Karakter Morfologi Tanaman Keladi Tikus ( Typhonium Flagelliforme ) Menggunakan Database,” J. Sist. Komput., vol. 7, no. 2, pp. 50–58, 2017. T. Siswanto et al., “The Genomic Plant Warehouse Framework: A Systematic Literature Review,” Proc. 2017 Int. Conf. Inf. Manag. Technol., no. November, pp. 244–248, 2017. J. M. Duarte, J. B. Dos Santos, and L. C. Melo, “Comparison of similarity coefficients based on RAPD markers in the common bean,” Genet. Mol. Biol., vol. 22, no. 3, pp. 427–432, 1999. M. Murguia and J. L. Villasenor, “Estimating the effect of the similarity coefficient and the cluster algorithm on biogeographic classifications,” Ann. Bot. Fenn., vol. 40, no. December, pp. 415– 421, 2003. A. da Silva Meyer, A. A. F. Garcia, A. Pereira de Souza, and C. Lopes de Souza, “Comparison of similarity coefficients used for cluster analysis with dominant markers in maize (Zea mays L),” Genet. Mol. Biol., vol. 27, no. 1, pp. 83–91, 2004. S. B. Dalirsefat, A. da S. Meyer, and S. Z. Mirhoseini, “Comparison of Similarity Coefficients used for Cluster Analysis with Amplified Fragment Length Polymorphism Markers in the Silkworm , Bombyx mori,” J. Insect Sci., vol. 9, no. 71, pp. 1–8, 2009. P. Jaccard, “The distribution of the flora in the alphine zone,” New Phytol., vol. XI, no. 2, pp. 37–50, 1912. S. Pal, F. Yu, T. J. Moore, R. Ramanathan, A. Bar-Noy, and A. Swami, “An efficient alternative to Ollivier-Ricci curvature based on the Jaccard metric,” pp. 1–22, 2017. V. Thada and V. Jaglan, “Comparison of Jaccard, Dice, Cosine Similarity Coefficient To Find Best Fitness Value for Web Retrieved Documents Using Genetic Algorithm,” Int. J. Innov. Eng. Technol., vol. 2, no. 4, pp. 202–205, 2013. [15] S. Kosub, “A note on the triangle inequality for the Jaccard distance,” arXiv1612.02696v1 [cs.DM] 8 Dec 2016 A, no. 1, pp. 1– 5, 2016. [16] D. Fogaras and B. Rácz, “Scaling link-based similarity search,” in Proceedings of the 14th international conference on World Wide Web - WWW ’05, 2005, p. 641. [17] C. S. Loh, I. H. Li, and Y. Sheng, “Comparison of similarity measures to differentiate players’ actions and decision-making profiles in serious games analytics,” Comput. Human Behav., vol. 64, pp. 562–574, 2016. [18] W. Wu, B. Li, L. Chen, and C. Zhang, “Consistent Weighted Sampling Made More Practical.,” in 2017 International World Wide Web Conference Committee (IW3C2), 2017, pp. 1035–1043. [19] A. Shrivastava, “Exact Weighted Minwise Hashing in Constant Time,” arXiv Prepr. arXiv1602.08393, no. 2, 2016. [20] M. S. Charikar, “Similarity estimation techniques from rounding algorithms,” Proc. thiry-fourth Annu. ACM Symp. Theory Comput. STOC ’02, p. 380, 2002. [21] V. Kashyap, D. B. Brown, B. Liblit, D. Melski, and T. Reps, “Source Forager: A Search Engine for Similar Source Code,” 2017. [22] Z. Shirzadi et al., “Enhancement of automated blood flow estimates (ENABLE) from arterial spin-labeled MRI,” J. Magn. Reson. Imaging, vol. 47, no. 3, pp. 647–655, 2017. [23] N. F. Sianipar, A. Wantho, Rustikawati, and W. Maarisit, “The Effects of Gamma Irradiation on Growth Response of Rodent Tuber ( Typhonium flagelliforme Lodd .) Mutant in In Vitro Culture,” HAYATI J. Biosci., vol. 20, no. 2, pp. 51–56, 2013. [24] M. M. Mukaka, “Statistics corner: A guide to appropriate use of correlation coefficient in medical research,” Malawi Med. J., vol. 24, no. 3, pp. 69–71, 2012. 978-1-5386-9422-0/18/$31.00 ©2018 IEEE The 1st 2018 INAPR International Conference, 7 Sept 2018, Jakarta, Indonesia 107