Research on Digitization of Mathematical Expressions
|School||University of Science and Technology of China|
|Course||Precision instruments and machinery|
|Keywords||Digital processing of mathematical expressions; Mathematical expression recognition; Document image skew detection; Mathematical expression localization; PostScript; Mathematical expression retrieval|
Digitization of mathematical expressions refers to the automatic processing of mathematical expressions by computer, including their acquisition, input, display, output, notation, transmission and search. With the rapid development of information technology, computer science and the Internet, research on the digitization of mathematical expressions is of great significance for building digital libraries, making the interfaces of computer algebra systems more natural, online teaching, and communication among distributed computing systems. This thesis presents research on several critical problems in the digitization of mathematical expressions.

The acquisition and input of mathematical expressions, namely their recognition, is the core of digitization. There are two main types of recognition: recognition of mathematical expressions in printed documents and recognition of online handwritten expressions. To date, recognition systems remain at the laboratory research stage, and it will take time before they are used in practice. Several issues in this area are investigated; the main work and progress of this thesis are as follows:

(1) Skew detection of document images

Skew detection, as a step of document image preprocessing, plays an important role in the recognition of printed mathematics. Existing skew detection methods fall short in precision, accuracy or speed. A method based on mathematical morphology and the Hough transform is presented. Morphological operations are used to smooth the document image, eliminate pixel noise and detect the edges of text rows, and the Hough transform is then applied to detect the skew angle. Experiments show that this method is precise, accurate, fast and robust.

(2) Extraction of mathematical expressions in printed Chinese technical documents

Extraction of mathematical expressions is a precondition for recognizing them.
A new approach for separating both isolated and embedded expressions in printed Chinese technical documents is presented. After features of the text lines are extracted, ANFIS (adaptive neuro-fuzzy inference system) is used to classify the lines into two classes: lines of text and lines of isolated expressions. For embedded expressions, fuzzy clustering and a dynamic programming algorithm are applied to extract Chinese characters, Chinese punctuation and English letters in sequence. Finally, the mathematical symbols are merged into expressions. Experiments show that the proposed methods achieve high accuracy.

(3) Extracting mathematical expressions from PostScript documents

Extracting mathematics from PostScript documents is a new area in research on mathematical expression recognition. A content-based approach for extracting mathematical expressions from PostScript documents is presented. The documents currently studied are PostScript files converted from Microsoft Word or from LaTeX. By redefining in advance some standard routines that render text or paint graphics, character information such as character name, font type, font name and character bounding box is extracted from the PostScript document, along with line information. Based on the character information, mathematical characters are recognized, and connected lines are then recognized as mathematical characters. Finally, heuristic rules are used to merge the mathematical characters into expressions. Experiments show that the proposed methods achieve high accuracy.

Search of mathematical expressions is another important topic in the digitization of mathematical expressions. Searching for expressions depends not only on their literal content but also on their semantic content. Little research has been done in this area. This thesis explores it and introduces an ontology to address the search problem. A mathematical expression ontology model is established, and OpenMath is used to describe the model.
During search, expressions are represented as OpenMath trees, so searching for an expression becomes a tree matching problem. According to the required search accuracy, matching is divided into precise matching, inclusive matching, semantic matching and fuzzy matching, and an algorithm is presented for each. Fuzzy matching is the main focus. The edit distance of the classical tree matching algorithm is modified to suit the characteristics of expressions, and a fuzzy matching factor is used to evaluate the degree of fuzzy matching.
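
To make the tree matching idea concrete, the following is a minimal Python sketch, not the thesis's algorithm. It assumes expressions are stored as ordered labeled trees (a stand-in for OpenMath trees) and uses a simple top-down (Selkow-style) ordered tree edit distance in place of the modified edit distance, whose details the abstract does not give. The helper names (`node`, `size`, `exact_match`, `inclusive_match`, `edit_distance`, `fuzzy_factor`) and the normalization used for the matching factor are hypothetical choices made for illustration. Semantic matching is omitted because it depends on the ontology.

```python
# Illustrative sketch only: trees are (label, children) tuples, and the
# distance is a Selkow-style top-down edit distance, an assumption that
# stands in for the thesis's modified edit distance.

def node(label, *children):
    """Build an ordered labeled tree as a hashable tuple."""
    return (label, tuple(children))

def size(t):
    """Number of nodes in the tree."""
    return 1 + sum(size(c) for c in t[1])

def subtrees(t):
    """Yield the tree itself and every subtree below it."""
    yield t
    for c in t[1]:
        yield from subtrees(c)

def exact_match(query, target):
    """Precise matching: the two trees are identical."""
    return query == target

def inclusive_match(query, target):
    """Inclusive matching: the query occurs as a complete subtree of the target."""
    return any(s == query for s in subtrees(target))

def edit_distance(a, b):
    """Top-down ordered tree edit distance.

    Relabeling a node costs 1; inserting or deleting a child costs the
    size of the whole subtree rooted at that child.
    """
    relabel = 0 if a[0] == b[0] else 1
    xs, ys = a[1], b[1]
    d = [[0] * (len(ys) + 1) for _ in range(len(xs) + 1)]
    for i in range(1, len(xs) + 1):
        d[i][0] = d[i - 1][0] + size(xs[i - 1])
    for j in range(1, len(ys) + 1):
        d[0][j] = d[0][j - 1] + size(ys[j - 1])
    for i in range(1, len(xs) + 1):
        for j in range(1, len(ys) + 1):
            d[i][j] = min(
                d[i - 1][j] + size(xs[i - 1]),                         # delete subtree
                d[i][j - 1] + size(ys[j - 1]),                         # insert subtree
                d[i - 1][j - 1] + edit_distance(xs[i - 1], ys[j - 1]),  # edit in place
            )
    return relabel + d[-1][-1]

def fuzzy_factor(a, b):
    """Hypothetical matching factor in (0, 1]; 1.0 means identical trees."""
    return 1.0 - edit_distance(a, b) / (size(a) + size(b))

if __name__ == "__main__":
    # x^2 + 1 versus x^3 + 1
    e1 = node("plus", node("power", node("x"), node("2")), node("1"))
    e2 = node("plus", node("power", node("x"), node("3")), node("1"))
    x_squared = node("power", node("x"), node("2"))

    print(exact_match(e1, e2))            # False
    print(inclusive_match(x_squared, e1)) # True
    print(edit_distance(e1, e2))          # 1
    print(fuzzy_factor(e1, e2))           # 0.9
```

In this sketch a search engine could try precise matching first, fall back to inclusive matching, and finally rank the remaining candidates by `fuzzy_factor`. How the thesis orders or weights these stages is not stated in the abstract.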