TL;DR: This paper applies four classifiers, Support Vector Machine (SVM), AdaBoost (AB), Linear Discriminant Analysis (LDA), and Gradient Boosting (GB), to obtain highly accurate predictions of chronic kidney disease.
Abstract: Chronic kidney disease (CKD), a slowly progressing and often late-diagnosed disease, is one of the leading causes of mortality in the medical sector today. A significant number of men and women suffer each year because of the lack of early screening systems and appropriate care. However, patients’ lives can be saved if the disease is detected quickly at its earliest stage. In addition, machine learning algorithms can detect the stage of this deadly disease much more quickly given a reliable dataset. In this paper, the study is implemented with four approaches, Support Vector Machine (henceforth SVM), AdaBoost (henceforth AB), Linear Discriminant Analysis (henceforth LDA), and Gradient Boosting (henceforth GB), to obtain highly accurate predictions. These algorithms are applied to an online dataset from the UCI Machine Learning Repository. The highest prediction accuracy, about 99.80%, is obtained with the Gradient Boosting (GB) classifier. Different performance evaluation metrics are also reported to show the outcomes. Finally, the most efficient and optimized algorithms for the proposed task can be selected based on these benchmarks.
TL;DR: Based on the recent success of Long Short-Term Memory (LSTM) networks in HAR domains, the authors propose a generic LSTM-based framework for accelerometer data in real-life HAR.
Abstract: Human activity recognition (HAR) has become an active research field in time-series classification because of its successful applications in various domains. The availability of affordable wearable devices has opened up many challenging and interesting HAR research problems. Current research suggests that deep learning approaches are well suited to automated feature extraction from raw sensor data, in contrast to conventional machine learning approaches that rely on handcrafted features. Based on the recent success of Long Short-Term Memory (LSTM) networks in HAR domains, this work proposes a generic LSTM-based framework for accelerometer data in real-life HAR. Four hybrid LSTM networks have been comparatively studied on a publicly available real-life HAR dataset. Moreover, we use Bayesian optimization techniques to tune the hyperparameters of each LSTM network. The experimental results indicate that the CNN-LSTM network surpasses the other hybrid LSTM networks.
TL;DR: In this paper, a credit card fraud detection system using LINE Notify is presented. The measured efficiency, accuracy, and completeness of the data were at a very good level, equal to 86.67%.
Abstract: As fraud prevention is now an important issue, the researchers proposed sending notifications of suspected credit card fraud to the LINE application. The objectives of this research are: 1) to develop a notification system for suspected credit card fraud via the LINE Notify API; 2) to measure the accuracy of the developed system in notifying users to prevent suspected credit card fraud. The method comprises five steps: 1) analysis of work systems, a study and analysis of problems to determine needs; 2) system design, the process of designing the research tools; 3) system development, the process of developing the research tools; 4) testing of the tools; 5) summary of results, discussion, and suggestions. The measured efficiency, accuracy, and completeness of the data were at a very good level, equal to 86.67%. The measured efficiency against the set conditions was very good, equal to 80.00%. The measured timeliness was very good, equal to 86.67%. In conclusion, the developed system accomplishes all research goals.
TL;DR: In this paper, the authors test standard neural network architectures (CNN, LSTM, BiLSTM) and BERT architectures on Russian sentiment datasets, compare two variants of Russian BERT, and show that the conversational variant performs better for all sentiment tasks in the study.
Abstract: In this study, we test standard neural network architectures (CNN, LSTM, BiLSTM) and recent BERT architectures on previous Russian sentiment evaluation datasets. We compare two variants of Russian BERT and show that for all sentiment tasks in this study the conversational variant of Russian BERT performs better. The best results were achieved by the BERT-NLI model, which treats sentiment classification as a natural language inference task. On one of the datasets, this model practically achieves the human level.
TL;DR: In this article, a novel method is proposed for the image classification of forage plants in the Fabaceae family using the Scale Invariant Feature Transform (SIFT) method.
Abstract: This paper proposes a novel method for the image classification of forage plants in the Fabaceae family using the Scale Invariant Feature Transform (SIFT) method. Color JPEG images in RGB mode were resized to 1000x1000 pixels to obtain a single template image per file. Of all the sample images, four prototype images were scaled and rotated in a standard way. Features were extracted with SIFT and matched against the dataset of forage plant leaves, and the matching points were used to evaluate the accuracy of leaf identification. Senna siamea, Clitoria ternatea and Pithecellobium dulce leaves were identified with 100% accuracy, but Sesbania grandiflora Desv. was identified with 0% accuracy. The total accuracy over all 4 plants was 75%, which indicates that SIFT-based leaf matching is suitable for Senna siamea, Clitoria ternatea and Pithecellobium dulce, but not for Sesbania grandiflora Desv. leaves. The accuracy for the latter is 0% because the leaves are dark green, not clear, and slender with even spacing, so distinctive features are very rare. In contrast, Senna siamea, Clitoria ternatea and Pithecellobium dulce leaves are clear and have unique leaf edges. Appropriate techniques for recognition and classification are included.
TL;DR: In this paper, the operation of multiple autonomous mobile robots (AMRs) delivering food and medical supplies to individual patients, in order to keep physical distance between patients and health workers, is simulated.
Abstract: Logistics management is crucial for effective and efficient transportation of various items in hospitals. During pandemic situations, especially COVID-19, a special in-patient cohort ward is established to treat patients who require special treatment under the quarantine protocol. An Autonomous Mobile Robot (AMR) is used for delivering food and medical supplies to individual patients in order to keep physical distance between patients and health workers. In this research, delivery by multiple AMRs working in the in-patient ward is simulated. The simulation software is developed on the Unity platform to study the operations of AMRs in various scenarios.
TL;DR: The authors fine-tuned two pretrained Transformer-based models (mBART and BertSumAbs) for headline generation and achieved state-of-the-art results on the RIA and Lenta datasets of Russian news.
Abstract: Pretrained language models based on the Transformer architecture are the reason for recent breakthroughs in many areas of NLP, including sentiment analysis, question answering, and named entity recognition. Headline generation is a special kind of text summarization task. To succeed in it, models need strong natural language understanding that goes beyond the meaning of individual words and sentences, as well as an ability to distinguish essential information. In this paper, we fine-tune two pretrained Transformer-based models (mBART and BertSumAbs) for that task and achieve new state-of-the-art results on the RIA and Lenta datasets of Russian news. BertSumAbs increases ROUGE on average by 2.9 and 2.0 points respectively over the previous best scores achieved by Phrase-Based Attentional Transformer and CopyNet.
TL;DR: In this paper, the authors analyze COVID-19 data for Bangladesh day by day to understand the situation, and apply models, algorithms, and machine learning to predict the future situation and look for a solution.
Abstract: Most countries are now affected by COVID-19, which has become the biggest problem in the world. Bangladesh is also affected, and the whole country faces this virus as its biggest problem. We therefore analyze the data day by day to understand the situation. We also use models, algorithms, logic, and analysis to look for a solution to the current situation, and we use machine learning algorithms to predict the future situation. The supervised machine learning methods used are the Linear Regression model and the k-nearest neighbors (KNN) algorithm. There are different types of data sets and algorithms, and we have tried to explain them well.
TL;DR: This article analyzed over a million tweets in an attempt to predict the results of the Eurovision Song Contest televoting. Different methods of sentiment analysis (English, multilingual polarity lexicons and deep learning), together with translation of the focus-language tweets into English, were compared to determine the method that produced the best prediction for the contest.
Abstract: Over a million tweets were analyzed using various methods in an attempt to predict the results of the Eurovision Song Contest televoting. Different methods of sentiment analysis (English, multilingual polarity lexicons and deep learning), together with translation of the focus-language tweets into English, were used to determine the method that produced the best prediction for the contest. Furthermore, we analyzed the effect of sampling tweets during different periods, namely during the performances and/or during the televoting phase of the competition. The quality of the predictions was assessed through correlations between the actual ranks of the televoting and the predicted ranks. The prediction was based on the application of an adjusted Eurovision televoting scoring system to the results of the sentiment analysis of tweets. The predicted ranks of the performances yielded Spearman \(\rho\) correlation coefficients of 0.62 and 0.74 during the televoting period for the lexicon sentiment-based and deep learning approaches, respectively.
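The evaluation step described above, correlating predicted ranks with actual televoting ranks, can be sketched as follows. This is a minimal illustration of the Spearman rank correlation only, not the authors' pipeline; the function names and the toy rank lists are invented for the example.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next sorted value is tied with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    norm_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (norm_x * norm_y)


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: actual televoting ranks vs. ranks predicted from tweets.
actual_ranks = [1, 2, 3, 4, 5]
predicted_ranks = [2, 1, 3, 5, 4]
rho = spearman_rho(actual_ranks, predicted_ranks)
```

Computing the coefficient as a Pearson correlation over average ranks keeps the sketch valid when ranks are tied, which the shortcut formula based on squared rank differences does not handle.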
TL;DR: In this article, a remote teaching support system for group discussion was developed to help reduce the burden of teachers by analyzing the video images and using the features obtained from the videos to evaluate student performances in group discussion.
Abstract: In recent years, group discussions have become an important part of corporate recruitment examinations in Japan. Developing a remote teaching support system for group discussion will help reduce the burden on teachers. As part of our project, this study aims to support teachers who need effective teaching methods for remote group discussions by analyzing video images. In this study, we used the features obtained from the videos. Student performance in group discussion was assessed automatically by classification, and important features for teaching were selected using SHapley Additive exPlanations (SHAP) values.
TL;DR: The authors presented a Russian version of DeepMind Mathematics Dataset, which is synthetically generated using inference rules and a set of linguistic templates, and translated the linguistic templates to Russian leaving the inference part without changes.
Abstract: We present a Russian version of DeepMind Mathematics Dataset. The original dataset is synthetically generated using inference rules and a set of linguistic templates. We translate the linguistic templates to Russian leaving the inference part without changes. So as a result we get a mathematically parallel dataset where the same mathematical problems are explored but in another language. We reproduce the experiment from the original paper to check whether the performance of a Transformer model is impacted by the differences of the languages in which math problems are expressed. Though our contribution is small compared to the original work, we think it is valuable given the fact that languages other than English (and Russian in particular) are underrepresented.
TL;DR: It is shown that a linear combination of similarities derived from the individual models provides a robust automatic similarity assessment for ranking the case law documents for retrieval.
Abstract: This paper presents an effective method for case law retrieval based on semantic document similarity and a web application for querying Finnish case law. The novelty of the work comes from the idea of using legal documents for automatic formulation of the query, including case law judgments, legal case descriptions, or other texts. The query documents may be in various formats, including image files with text content. This approach allows efficient search for similar documents without the need to specify a query string or keywords, which can be difficult in this use case. The application leverages two traditional word frequency based methods, TF-IDF and LDA, alongside two modern neural network methods, Doc2Vec and Doc2VecC. Effectiveness of the approach for document relevance ranking has been evaluated using a gold standard set of inter-document similarities. We show that a linear combination of similarities derived from the individual models provides a robust automatic similarity assessment for ranking the case law documents for retrieval.
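The final idea above, ranking documents by a linear combination of per-model similarities, can be sketched as follows. This is an illustrative sketch under stated assumptions: the weights, the toy vectors, and the names `combined_ranking` and `cosine` are invented here, cosine similarity is assumed as the per-model measure, and the abstract does not say how the actual weights were chosen.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def combined_ranking(query_vecs, doc_vecs, weights):
    """Rank documents by a weighted sum of per-model cosine similarities.

    query_vecs: {model_name: vector} for the query document
    doc_vecs:   {doc_id: {model_name: vector}} for the candidate documents
    weights:    {model_name: weight} (hypothetical values, not from the paper)
    """
    scores = {
        doc_id: sum(w * cosine(query_vecs[m], vecs[m]) for m, w in weights.items())
        for doc_id, vecs in doc_vecs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Toy example with two of the four model families named in the abstract.
query = {"tfidf": [1.0, 0.0], "doc2vec": [1.0, 1.0]}
docs = {
    "case_A": {"tfidf": [1.0, 0.0], "doc2vec": [1.0, 1.0]},
    "case_B": {"tfidf": [0.0, 1.0], "doc2vec": [1.0, 0.0]},
    "case_C": {"tfidf": [1.0, 1.0], "doc2vec": [1.0, 1.0]},
}
ranking = combined_ranking(query, docs, {"tfidf": 0.5, "doc2vec": 0.5})
```

Because each model contributes an independent similarity score, a document that one model misjudges can still rank well if the other models agree on it, which is one plausible reading of why the combination is described as robust.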
TL;DR: This article used a linguistically motivated thesaurus to analyze psychologically meaningful word categories in two Author Profiling tasks based on Russian texts, and found that linguistically motivated thesauri not only provide objective and linguistically motivated content, but also yield significant correlates of certain psychological states, replicating evidence obtained with hand-crafted lexical resources.
Abstract: In Author Profiling research, there is a growing interest in lexical resources providing various psychologically meaningful word categories. One such instrument is Linguistic Inquiry and Word Count, which was compiled manually in English and translated into many other languages. We argue that the resource contains a lot of subjectivity, which is further increased in the translation process. As a result, the translated lexical resource is not linguistically transparent. In order to address this issue, we translate the resource from English to Russian semi-automatically, analyze the translation in terms of agreement and match the resulting translation with two Russian linguistic thesauri. One of the thesauri is chosen as a better match for the psychologically meaningful categories in question. We further apply the linguistic thesaurus to analyze the psychologically meaningful word categories in two Author Profiling tasks based on Russian texts. Our results indicate that linguistically motivated thesauri not only provide objective and linguistically motivated content, but also yield significant correlates of certain psychological states, replicating evidence obtained with hand-crafted lexical resources.
TL;DR: A direction of arrival (DOA) identification capability is provided to determine which estimated DOA belongs to the desired signal and which belong to undesired signals.
Abstract: This paper provides a direction of arrival (DOA) identification capability to determine which estimated DOA belongs to the desired signal and which belong to undesired signals. One of the well-known subspace-based methods for finding directions is MUSIC (MUltiple Signal Classification). The separation of the signal and noise subspaces is the crucial step for a precise estimate. The skewness coefficient is proposed to reinforce the conventional MUSIC method by dividing the subspaces without knowing the number of source signals. Normalized least mean square (NLMS) beamforming is used to compute the weight vector so that it directs the main beam towards the desired user. The angle of the main beam is identified as the DOA of the desired signal, so the remaining estimated DOAs are attributed to interference signals. The application of the DOA identification is shown to be advantageous for null-broadening beamforming. The simulation results confirm the effectiveness of the proposed method in the case of limited snapshots.
TL;DR: In this article, a novel unsupervised neural network with a convolutional multi-attention mechanism is presented, which extracts (aspect, term) pairs simultaneously, and its effectiveness is demonstrated on a real-world dataset.
Abstract: The tasks of aspect identification and term extraction remain challenging in natural language processing. While supervised methods achieve high scores, it is hard to use them in real-world applications due to the lack of labelled datasets. Unsupervised approaches outperform these methods on several tasks, but it is still a challenge to extract both an aspect and a corresponding term, particularly in the multi-aspect setting. In this work, we present a novel unsupervised neural network with a convolutional multi-attention mechanism that extracts (aspect, term) pairs simultaneously, and demonstrate its effectiveness on a real-world dataset. We apply a special loss aimed at improving the quality of multi-aspect extraction. The experimental study demonstrates that with this loss we increase the precision not only in this joint setting but also in aspect prediction alone.
TL;DR: This article proposed a linguistically-rich approach to hidden community detection, which was tested in experiments with a Russian corpus of VKontakte posts; specific linguistic parameters of Russian posts were identified for correct language processing.
Abstract: This paper proposes a linguistically-rich approach to hidden community detection, which was tested in experiments with a Russian corpus of VKontakte posts. Modern algorithms for hidden community detection are based on graph theory and leave the linguistic features of the analyzed texts out of account. The authors have developed a new hybrid approach to the detection of hidden communities, combining author-topic modeling and automatic topic labeling. Specific linguistic parameters of Russian posts were identified for correct language processing. The results justify the use of the algorithm, which can be further integrated with already developed graph methods.
TL;DR: In this paper, the authors manually extended the original myPOS corpus into myPOS version 2.0, approximately tripling its size. To evaluate the effect of the extended corpus versus the original, the accuracies of four supervised tagging algorithms, namely Conditional Random Fields (CRFs), Hidden Markov Model (HMM), Ripple Down Rules based (RDR), and the neural sequence labeling approach of Conditional Random Fields $(\mathrm{NCRF}^{++})$, are compared.
Abstract: Part-of-speech (POS) tagging is the process of assigning a part-of-speech tag or other lexical class marker to each word in a sentence. It is also one of the most important steps in the Natural Language Processing (NLP) task pipeline. There are several research works on Myanmar POS tagging implemented with different approaches. However, there is only one publicly available tagged corpus, named the myPOS corpus. The size of this corpus is only 11 thousand sentences. It is not enough to train downstream NLP tasks, such as machine learning. For this reason, we manually extended the original myPOS corpus into myPOS version 2.0, and the extended corpus is approximately triple the size of the original myPOS corpus. To evaluate the effect of the extended corpus versus the original corpus, the accuracies of four supervised tagging algorithms, namely Conditional Random Fields (CRFs), Hidden Markov Model (HMM), Ripple Down Rules based (RDR), and the neural sequence labeling approach of Conditional Random Fields $(\mathrm{NCRF}^{++})$, are compared. The results showed that the extended myPOS version 2.0 improved the accuracies of automatic POS tagging methods compared with the original myPOS.
TL;DR: The authors developed response retrieval (ranking) models using TF-IDF, StarSpace, ESIM and BERT methods on two Finnish conversational corpora: public library question-answering (QA) data and private medical chat data.
Abstract: We analyzed two conversational corpora in Finnish: public library question-answering (QA) data and private medical chat data. We developed response retrieval (ranking) models using TF-IDF, StarSpace, ESIM and BERT methods. These four represent techniques ranging from simple and classical ones to recent pretrained transformer neural networks. We evaluated the effect of different preprocessing strategies, including raw, casing, lemmatization and spell-checking, for the different methods. Using our medical chat data, we also developed a novel three-stage preprocessing pipeline with speaker role classification. We found the BERT model pretrained with Finnish (FinBERT) an unambiguous winner in ranking accuracy, reaching 92.2% for the medical chat and 98.7% for the library QA in the 1-out-of-10 response ranking task where the chance level was 10%. The best accuracies were reached using uncased text with spell-checking (BERT models) or lemmatization (non-BERT models). Preprocessing had less impact for BERT models than for the classical and other neural network models. Furthermore, we found the TF-IDF method still a strong baseline for the vocabulary-rich library QA task, even surpassing the more advanced StarSpace method. Our results highlight the complex interplay between preprocessing strategies and model type when choosing the optimal approach in chat-data modelling. Our study is the first work on dialogue modelling using neural networks for the Finnish language. It is also the first of its kind to use real medical chat data. Our work contributes towards the development of automated chatbots in the professional domain.
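The simplest of the four methods, TF-IDF response ranking in a 1-out-of-N setup, can be sketched as follows. This is a minimal illustration and not the authors' implementation: whitespace tokenization, lowercasing, the smoothed IDF formula, and the English toy sentences are all assumptions made for the example (the study itself uses Finnish data and compares several preprocessing strategies).

```python
import math
from collections import Counter


def tfidf_vectors(texts):
    """Build sparse TF-IDF vectors (dicts) for a small list of texts.

    Assumes lowercased whitespace tokens and a smoothed IDF,
    log((1 + N) / (1 + df)) + 1; both are choices made for this sketch.
    """
    tokenised = [text.lower().split() for text in texts]
    doc_freq = Counter(tok for toks in tokenised for tok in set(toks))
    n_docs = len(texts)
    vectors = []
    for toks in tokenised:
        counts = Counter(toks)
        vectors.append({
            tok: cnt * (math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1.0)
            for tok, cnt in counts.items()
        })
    return vectors


def sparse_cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_candidates(context, candidates):
    """Return candidate indices sorted from most to least similar to the context."""
    vectors = tfidf_vectors([context] + candidates)
    scores = [sparse_cosine(vectors[0], vec) for vec in vectors[1:]]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)


# Hypothetical library-QA style example (invented, in English for readability).
context = "what are the library opening hours"
candidates = [
    "please see a doctor about that",
    "the library opening hours are nine to five",
    "your loan is overdue",
]
best = rank_candidates(context, candidates)[0]
```

In a 1-out-of-10 evaluation, the ranking counts as correct when the true response is placed first among ten candidates, which is why the chance level is 10%.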
TL;DR: In this paper, the authors study the task of adding new multiword expressions (MWEs) into an existing thesaurus, focusing on nominal bigrams (Adj-Noun and Noun-Noun) in Russian.
Abstract: In this paper we study the task of adding new multiword expressions (MWEs) into an existing thesaurus. Standard methods of MWE discovery (statistical, context, distributional measures) can efficiently detect the most prominent MWEs. However, given a large number of MWEs already present in a lexical resource those methods fail to provide sufficient results in extracting unseen expressions. We show that the information deduced from the thesaurus itself is more useful than observed frequency and other corpus statistics in detecting less prominent expressions. Focusing on nominal bigrams (Adj-Noun and Noun-Noun) in Russian, we propose a number of measures making use of thesaurus statistics (e.g. the number of expressions with a given word present in the thesaurus), which significantly outperform standard methods based on corpus statistics or word embeddings.
TL;DR: In this paper, the authors propose the Utilization Weighted (UW) algorithm, a spreading factor management algorithm designed based on M/D/1 queue theory, which helps form groups of nodes assigned different spreading factors (SFs).
Abstract: Long Range Wide Area Network (LoRaWAN) is one of the leading low-power wireless networks that can support thousands of Internet of Things (IoT) devices. To enhance the scalability of LoRaWAN, this paper proposes the Utilization Weighted (UW) algorithm, a spreading factor management algorithm designed based on M/D/1 queue theory. The main concept of this algorithm is channel utilization balancing, which helps form groups of nodes assigned different spreading factors (SFs). The simulations are performed under two scenarios: similar and varied uplink time intervals among SFs. The results show that our UW algorithm outperforms the traditional Min-airtime method in both scenarios. The packet received rate (PRR) of the UW algorithm is clearly higher than that of the Min-airtime method for all numbers of nodes and time intervals. In particular, in the varied time interval simulation of networks of 120, 600, and 1,200 nodes, the maximum PRR improvements occur at 1, 3, and 5 times the minimum time interval between uplinks, respectively, and are around 34%, 36%, and 35%, respectively.
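For readers unfamiliar with the queueing model named above, the standard steady-state quantities of an M/D/1 queue (Poisson arrivals, deterministic service time, one server) can be computed as follows. This shows only the textbook formulas; how the UW algorithm uses them to balance channel utilization is not described in the abstract, and the function name and example numbers are invented.

```python
def md1_metrics(arrival_rate, service_time):
    """Steady-state metrics of an M/D/1 queue.

    arrival_rate: mean arrivals per unit time (Poisson process)
    service_time: fixed service duration per packet (e.g. airtime)
    """
    utilization = arrival_rate * service_time
    if not 0 <= utilization < 1:
        raise ValueError("the queue is stable only for utilization in [0, 1)")
    # Pollaczek-Khinchine formula specialised to deterministic service.
    wait_in_queue = utilization * service_time / (2 * (1 - utilization))
    return {
        "utilization": utilization,
        "wait_in_queue": wait_in_queue,
        "time_in_system": wait_in_queue + service_time,
        # Little's law: mean queue length = arrival rate * mean wait.
        "queue_length": arrival_rate * wait_in_queue,
    }


# Hypothetical numbers: 0.5 packets per second, 1 second of airtime each.
metrics = md1_metrics(arrival_rate=0.5, service_time=1.0)
```

The waiting time grows without bound as utilization approaches 1, which gives some intuition for why spreading load evenly across groups can help.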
TL;DR: In this paper, a Thai conversational agent was developed on top of TPMAP to support self-service data analytics on complex queries, where users can simply use natural language to fetch information from a chatbot and the query results are presented to users in easy-to-use formats such as statistics and charts.
Abstract: Since 2018, the Thai People Map and Analytics Platform (TPMAP) has been developed with the aim of supporting government officials and policy makers with integrated household and community data to analyze strategic plans and implement policies and decisions to alleviate poverty. However, to acquire complex information from the platform, non-technical users with no database background have to ask a programmer or a data scientist to query data for them. Such a process is time-consuming and might result in inaccurate information being retrieved due to miscommunication between non-technical and technical users. In this paper, we have developed a Thai conversational agent on top of TPMAP to support self-service data analytics on complex queries. Users can simply use natural language to fetch information from our chatbot, and the query results are presented to users in easy-to-use formats such as statistics and charts. The proposed conversational agent transforms natural language queries into query representations with relevant entities, query intentions, and output formats. We employ Rasa, an open-source conversational AI engine, for agent development. The results show that our system yields an F1-score of 0.9747 for intent classification and 0.7163 for entity extraction. The obtained intents and entities are then used to query target information from a graph database. Finally, our system achieves end-to-end accuracies ranging from 57.5% to 80.0%, depending on query message complexity. The generated answers are then returned to users through a messaging channel.
TL;DR: In this paper, artificial intelligence (AI) was applied to estimate the oil content in a fresh fruit bunch (FFB) using two popular types of oil palms in Thailand, Nigrescene and Virescene.
Abstract: Oil palm is one of the potential tree crops in Thailand. However, oil palm production has faced problems in many respects, and price is one of them. The price of oil palm depends on the oil content of the fruit, which is estimated by an expert. The main consideration is the ripeness of the oil palm fresh fruit bunches, which an expert judges from the surface color. Differences in experts' experience lead to different estimates. The problem may be solved with chemical analysis methods, which are more accurate but time-consuming and inconvenient. In this research, artificial intelligence (AI) is applied to estimate the oil content in a fresh fruit bunch (FFB). Two popular types of oil palm in Thailand are used in this work. The color of the Nigrescene fruit varies from dark purple to red-orange depending on its genes and ripeness. The color of the Virescene fruit changes from green to orange. The surface color of an oil palm fruit and the structure of the bunch were considered as the feature set. An oil palm FFB image from a smartphone camera was fed to the model to predict the oil content in the FFB. Several models, such as multiple linear regression, an artificial neural network and a convolutional neural network, are examined. Model quality is measured with the root mean square error (RMSE). The convolutional neural network produces an average RMSE of 727 for Nigrescene and 4.83 for Virescene.
TL;DR: In this paper, a memetic algorithm, which combines a genetic algorithm with a local search algorithm, is created to solve the tour trip design problem using real data gathered from trusted tourist communities in Thailand such as TripAdvisor.
Abstract: Designing a tour plan that provides maximum satisfaction before having any experience of the destination can be a hard and time-consuming process. The goal of this study is to create an algorithm that efficiently generates a tour plan with high or maximum satisfaction within a reasonable processing time. A memetic algorithm, which combines a genetic algorithm with a local search algorithm, is created to solve this problem. This study used real data gathered from trusted tourist communities in Thailand such as TripAdvisor.com, Wongnai.com, etc. The results show that the Memetic Algorithm (MA) approach can solve the tour trip design problem efficiently, since both the savings in computation time and the % gap are good and well balanced.
TL;DR: This paper presents the first shared task on Machine Translation from Chinese into Russian, which is the only MT competition for this pair of languages to date and the task for participants was to train a general-purpose MT system which performs reasonably well on very diverse text domains and styles without additional fine-tuning.
Abstract: We present the results of the first shared task on Machine Translation (MT) from Chinese into Russian, which is the only MT competition for this pair of languages to date. The task for participants was to train a general-purpose MT system which performs reasonably well on very diverse text domains and styles without additional fine-tuning. Eleven teams participated in the competition, and some of the submitted models showed reasonably good performance, topping out at 19.7 BLEU.
TL;DR: In this paper, the authors propose a framework focused on embedding PageRank SSL in a generative model, which allows one to do joint training of nodes latent space representation and label spreading through the reweighted adjacency matrix by node similarities in the latent space.
Abstract: Nowadays, Semi-Supervised Learning (SSL) on citation graph data sets is a rapidly growing area of research. However, the recently proposed graph-based SSL algorithms use a default adjacency matrix with binary weights on edges (citations), which causes a loss of similarity information between nodes (papers). In this work, therefore, we propose a framework focused on embedding PageRank SSL in a generative model. This framework allows joint training of the nodes' latent space representation and label spreading through an adjacency matrix reweighted by node similarities in the latent space. We explain that a generative model can improve accuracy and reduce the number of iteration steps for PageRank SSL. Moreover, we show that our framework outperforms the best graph-based SSL algorithms on four public citation graph data sets and improves the interpretability of classification results.
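The label-spreading component mentioned above can be illustrated with a generic PageRank-style iteration over a weighted adjacency matrix. This sketch covers only that generic step, F <- alpha * D^{-1} W F + (1 - alpha) * Y; it does not include the generative model or the learned reweighting from the paper, and the function name, the damping value, and the toy graph are assumptions made for the example.

```python
def pagerank_label_spreading(weights, seed_labels, n_classes, alpha=0.85, n_iter=100):
    """Spread seed labels over a graph with a PageRank-style fixed-point iteration.

    weights:     square list of lists, weights[i][j] >= 0 is the edge weight i-j
                 (binary here; a reweighted matrix could be passed instead)
    seed_labels: {node_index: class_index} for the labelled nodes
    Returns the predicted class index for every node.
    """
    n = len(weights)
    # Row-normalise the adjacency matrix into transition probabilities.
    transition = []
    for row in weights:
        total = sum(row)
        transition.append([w / total if total else 0.0 for w in row])
    # One-hot seed matrix Y; unlabelled nodes get all zeros.
    seeds = [
        [1.0 if seed_labels.get(i) == c else 0.0 for c in range(n_classes)]
        for i in range(n)
    ]
    scores = [row[:] for row in seeds]
    for _ in range(n_iter):
        scores = [
            [
                alpha * sum(transition[i][j] * scores[j][c] for j in range(n))
                + (1 - alpha) * seeds[i][c]
                for c in range(n_classes)
            ]
            for i in range(n)
        ]
    return [max(range(n_classes), key=lambda c: row[c]) for row in scores]


# Toy citation chain 0 - 1 - 2 - 3 with one labelled paper at each end.
chain = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
predicted = pagerank_label_spreading(chain, {0: 0, 3: 1}, n_classes=2)
```

Replacing the binary entries of `chain` with similarity-based weights is where a reweighted adjacency matrix, as described in the abstract, would enter this iteration.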
TL;DR: In this article, the authors proposed an evaluation method for auscultation pressure using a pressure sensor, with the purpose of supporting auscultation training, which is one kind of clinical training.
Abstract: Japanese medical education has lately focused on improving clinical skills. Clinical training includes many kinds of training, such as medical interviews, palpation, and auscultation. However, the assessment points of this training are not quantified. Therefore, it is difficult for a trainer to check the clinical skills and attitudes of student doctors objectively. Auscultation is a fundamental skill, but it is difficult to assess objectively and, therefore, difficult to give appropriate feedback on. In this paper, we proposed an evaluation method for auscultation pressure using a pressure sensor, with the purpose of supporting auscultation training, which is one kind of clinical training. In addition, we implemented a prototype system and collected pressure values during an actual doctor’s examination. Moreover, we discussed a feature extraction method for supporting auscultation training based on the collected data. Furthermore, we described how the proposed method is useful as one way of supporting auscultation training.
TL;DR: In this article, the authors evaluated the relationship between cryptocurrency price variations and exogenous classical market prices using daily data on some of the most important asset prices and indexes in Thailand, and found a strong direct relationship between cryptocurrencies in the digital market and the SET50 index and oil price.
Abstract: Can cryptocurrency price variations be explained by exogenous classical market prices? We evaluate this question using daily data on some of the most important asset prices and indexes in Thailand, i.e., Gold, Oil, the SET50 index, the Tourism index, Mutual fund, and the THB/USD exchange rate, in comparison with digital asset prices, i.e., Bitcoin, Ethereum, Litecoin, Ripple, DASH, and Stellar. We examine both direct and inverse relationships, using a correlation matrix to derive distances between assets and a minimum spanning tree to find the closest path between them. We find a strong direct relationship between cryptocurrencies in the digital market and the SET50 index and oil price in classical markets. We also find that the THB/USD exchange rate has an inverse relationship with the Bitcoin price, the SET50 index, and the oil price. There is thus a link between cryptocurrency asset prices and the market prices of some classical assets.
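The correlation-to-distance and minimum-spanning-tree steps can be sketched as below. This is a minimal illustration, not the authors' code: the abstract does not state the distance formula, so the common choice d = sqrt(2(1 - rho)) is an assumption, the asset names `A`, `B`, `C` and their values are made-up placeholders, and `correlation_mst` is a hypothetical helper name.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equally long series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def correlation_mst(series):
    """Build a minimum spanning tree over assets with Kruskal's algorithm.

    Assumption: distance = sqrt(2 * (1 - rho)), so strongly correlated
    assets are close and anti-correlated assets are far apart.
    """
    names = list(series)
    edges = sorted(
        (math.sqrt(max(0.0, 2.0 * (1.0 - pearson(series[a], series[b])))), a, b)
        for a, b in combinations(names, 2)
    )
    parent = {name: name for name in names}

    def find(node):
        # Union-find lookup with path halving.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    tree = []
    for dist, a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:  # the edge joins two components, so keep it
            parent[root_a] = root_b
            tree.append((a, b, dist))
    return tree


# Made-up placeholder series, not data from the study.
toy = {"A": [1, 2, 3, 4, 5], "B": [2, 4, 6, 8, 11], "C": [5, 3, 4, 1, 2]}
for a, b, dist in correlation_mst(toy):
    print(f"{a} - {b}: distance {dist:.3f}")
```

To study inverse relationships in the same way, one option is to build a second tree from the negated correlations, so that strongly anti-correlated assets become neighbours.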
TL;DR: The authors present a freely available Russian-language sentiment lexicon, PolSentiLex, designed to detect sentiment in user-generated content related to social and political issues; it was generated from a database of posts and comments of the top 2,000 LiveJournal bloggers posted during one year (\(\sim\)1.5 million posts and 20 million comments).
Abstract: We present a freely available Russian-language sentiment lexicon, PolSentiLex, designed to detect sentiment in user-generated content related to social and political issues. The lexicon was generated from a database of posts and comments of the top 2,000 LiveJournal bloggers posted during one year (\(\sim\)1.5 million posts and 20 million comments). Following a topic modeling approach, we extracted 85,898 documents that were used to retrieve domain-specific terms. This term list was then merged with several external sources. Together, they formed a lexicon (16,399 units) that was marked up using a crowdsourcing strategy. A sample of native Russian speakers (n = 105) was asked to assess words’ sentiment given the context of their use (randomly paired) as well as the prevailing sentiment of the respective texts. In total, we received 59,208 complete annotations for both texts and words. Several versions of the marked-up lexicon were experimented with, and the final version was tested for quality against the only other freely available Russian-language lexicon and against three machine learning algorithms. All experiments were run on two different collections. They show that, in terms of \(\text{F}_{\text{macro}}\), lexicon-based approaches outperform machine learning by 11%, and our lexicon outperforms the alternative one by 11% on the first collection and by 7% on the negative scale of the second collection, while showing similar quality on the positive scale and being three times smaller. Our lexicon also outperforms or matches the best existing sentiment analysis results for other types of Russian-language texts.
TL;DR: This article presents a behavioral analysis of Transformer models in translating complex grammatical structures, i.e. multiple-word expressions and long-distance dependencies, and shows that the more complex the structure, the lower the translation accuracy the models yield.
Abstract: State-of-the-art neural MT models, e.g. Transformer, yield quite promising translation accuracy. However, these models are easily disturbed by noise, which causes over- and undertranslation issues. This paper presents a behavioral analysis of Transformer models in translating complex grammatical structures, i.e. multiple-word expressions and long-distance dependencies. Results consistently show that the more complex the structure, the lower the translation accuracy the models yield. We infer that, as phrase structures become more complex, the focus patterns learned by the attention mechanism may become erratic and sporadic due to data sparseness. We suggest using a locality penalty and increasing the number of attention heads to mitigate the issue, but their trade-offs should also be taken into account.
TL;DR: SurvLIME-Inf approximates the black-box survival model with the Cox proportional hazards model in the local area around a test example and measures the approximation error with the \(L_{\infty }\)-norm, which leads to a simple linear programming problem for determining important features and for explaining the prediction.
Abstract: A new modification of the explanation method SurvLIME, called SurvLIME-Inf, is proposed for explaining machine learning survival models. The basic idea behind SurvLIME as well as SurvLIME-Inf is to apply the Cox proportional hazards model to approximate the black-box survival model in the local area around a test example. The Cox model is used because it is based on a linear combination of the covariates. In contrast to SurvLIME, the proposed modification uses the \(L_{\infty }\)-norm for defining distances between approximating and approximated cumulative hazard functions. This leads to a simple linear programming problem for determining important features and for explaining the black-box model prediction. Moreover, SurvLIME-Inf outperforms SurvLIME when the training set is very small. Numerical experiments with synthetic and real datasets demonstrate the efficiency of SurvLIME-Inf.
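The step from the \(L_{\infty }\)-norm to a linear program can be sketched as follows. This is an illustrative derivation under stated assumptions, not the paper's exact formulation: it assumes that distances are taken between logarithms of the cumulative hazard functions at time points \(t_1, \dots, t_m\) for a single point \(\mathbf{x}\), and the symbols \(H\), \(H_0\), \(\mathbf{b}\), and \(z\) are introduced here for illustration only.

```latex
% Cox model: the log cumulative hazard is linear in the coefficients b
\ln H_{\text{Cox}}(t \mid \mathbf{x}, \mathbf{b}) = \ln H_0(t) + \mathbf{b}^{\top}\mathbf{x}

% L_infinity distance to the black-box cumulative hazard H at the point x
\min_{\mathbf{b}} \; \max_{j = 1, \dots, m}
  \left| \ln H(t_j \mid \mathbf{x}) - \ln H_0(t_j) - \mathbf{b}^{\top}\mathbf{x} \right|

% Equivalent linear program with one auxiliary variable z
\min_{\mathbf{b},\, z} \; z
\quad \text{s.t.} \quad
-z \le \ln H(t_j \mid \mathbf{x}) - \ln H_0(t_j) - \mathbf{b}^{\top}\mathbf{x} \le z,
\qquad j = 1, \dots, m
```

Because the objective and all constraints are linear in \(\mathbf{b}\) and \(z\), a standard linear programming solver applies. With several perturbed points around the test example, one auxiliary variable per point could be used in the same way.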