Optical Hydrocarbons’ chemical formula recognition
Tamara Stanković
tjstankovic@gmail.com
Regional Centre for Talented Youth Nis
1 Introduction
Chemical graph theory is a branch of mathematics that models molecules as graphs in order to gain better insight into the physical, chemical and biological properties of compounds and to approximate them better. Digitalization is the translation of an analog signal into digital data. To make digitalization faster and more precise, many optical character recognition algorithms have been developed, and one of them is used in this paper. The purpose of this paper is to find and describe an algorithm that recognizes the chemical formula of a hydrocarbon in a given photo. This is therefore an optical graph recognition problem.

2 Methods
The algorithm for optical recognition of a hydrocarbon's chemical formula in a given photo has several stages.

Editing the photo.
To be represented on a computer, the photo has to be digitalized. First, the photo, which is in the RGB model, is translated into a grayscale photo. For each pixel, the values of the red, green and blue channels (R, G and B) are determined, and the simple formula Gray = 0.21∙R + 0.71∙G + 0.07∙B is used to calculate the grayscale value of that pixel. Then the grayscale photo is converted into a binary image using the Otsu method.

Extracting and editing characters and lines.
The goal is to find all connected components in the pixel matrix of the photo, each of which represents a character or a line. The DFS (Depth-First Search) graph search algorithm is used to find them. The components are then reduced to a size of 20x20.

Character and line recognition.
A neural network, a machine learning algorithm, is used for character recognition. For each component, it is determined whether it is a character (C, H or a digit 0-9) or a line. The neural network was implemented using the Neuroph software. For each component, the neural network determines which character best matches it, whereas lines are the components that are neither letters nor digits.

Determining character connection.
Next, it has to be determined which letters are connected, and how. Two letters are connected by the line that is closest to them. So, for each line, the two letters it connects can be found, and the adjacency matrix of the molecule graph can then be constructed.

Determining the molecule.
Using the adjacency matrix of the molecule graph, it is concluded which graph is in the photo. The properties of chemical compounds and the value of the Wiener index give the needed information. The value of the Wiener index can be calculated using the Floyd-Warshall algorithm and compared with the constants for each molecule. In this way, the molecule in the photo is recognized.

3 Results
For the purpose of this paper, the application HemijskeFormule was written in the Java programming language, and all results were obtained with it. The application was tested on 150 photos. Precision was analyzed with respect to parameters such as the type of chemical formula, the number of C-atoms in the molecule, the type of chemical bond, etc. Recognition has a precision of 93% for molecular formulas and 71% for structural formulas. Precision also depends on the number of C-atoms in the molecule and on the type of chemical bond. The total precision of the application is 80%.

4 Conclusion
The obtained results are satisfactory and represent progress in optical graph recognition.

5 References
Dejan Živković. Osnove dizajna i analize algoritama. Računarski fakultet Beograd i CET, Beograd 2007.
Andrew Ng. Machine Learning. Stanford University, Coursera 2013.
Nobuyuki Otsu. A Threshold Selection Method from Gray-Level Histograms. IEEE Transactions on Systems, Man, and Cybernetics, 1979.
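
As an illustration of the "Determining the molecule" stage described in Section 2, the Wiener index computation could be sketched in Java roughly as follows. This is a minimal sketch under stated assumptions, not the code of HemijskeFormule: the class name WienerIndexSketch and the method wienerIndex are hypothetical, the graph is assumed to contain only the carbon atoms (hydrogen-suppressed skeleton), and every bond is assumed to count as distance 1 regardless of its type. Shortest-path distances are obtained with the Floyd-Warshall algorithm, and the Wiener index is the sum of those distances over all unordered pairs of vertices.

```java
/** Illustrative sketch only; names are hypothetical and not taken from HemijskeFormule. */
final class WienerIndexSketch {

    // Marker for "no path found yet"; halved so that INF + INF cannot overflow an int.
    private static final int INF = Integer.MAX_VALUE / 2;

    /**
     * Computes the Wiener index of a connected, undirected graph.
     * adjacency[i][j] > 0 means atoms i and j are bonded (assumption: each bond has length 1).
     */
    public static int wienerIndex(int[][] adjacency) {
        int n = adjacency.length;
        int[][] dist = new int[n][n];

        // Initialise distances: 0 on the diagonal, 1 for bonded atoms, INF otherwise.
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                if (i == j) {
                    dist[i][j] = 0;
                } else {
                    dist[i][j] = adjacency[i][j] > 0 ? 1 : INF;
                }
            }
        }

        // Floyd-Warshall: allow each vertex k in turn as an intermediate stop.
        for (int k = 0; k < n; k++) {
            for (int i = 0; i < n; i++) {
                for (int j = 0; j < n; j++) {
                    if (dist[i][k] + dist[k][j] < dist[i][j]) {
                        dist[i][j] = dist[i][k] + dist[k][j];
                    }
                }
            }
        }

        // Sum the distances over all unordered pairs (i < j).
        int sum = 0;
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                if (dist[i][j] >= INF) {
                    throw new IllegalArgumentException("Graph is not connected");
                }
                sum += dist[i][j];
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        // Carbon skeleton of n-butane: a path C1-C2-C3-C4.
        int[][] butane = {
            {0, 1, 0, 0},
            {1, 0, 1, 0},
            {0, 1, 0, 1},
            {0, 0, 1, 0}
        };
        // Carbon skeleton of isobutane: one central carbon bonded to three others.
        int[][] isobutane = {
            {0, 1, 1, 1},
            {1, 0, 0, 0},
            {1, 0, 0, 0},
            {1, 0, 0, 0}
        };
        System.out.println(wienerIndex(butane));    // prints 10
        System.out.println(wienerIndex(isobutane)); // prints 9
    }
}
```

The two example skeletons have the same number of carbon atoms but different Wiener indices (10 and 9), which suggests how comparing the computed value with stored constants can help tell isomers apart.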