mosesdecoder

mirror of https://github.com/moses-smt/mosesdecoder.git synced 2024-07-14 14:50:41 +03:00

Author	SHA1	Message	Date
Iikka Hauhio	1752993414	disable giza when eflomal is in use	2024-06-06 15:44:50 +03:00
Iikka Hauhio	d25a7c44d3	add support for eflomal aligner	2024-06-06 15:32:46 +03:00
Raphaël Merx	8cee20eaca	nonbreaking_prefix.tdt: add "Nu" for "Numeru" E.g. "Dekretu-Lei Nu. 18/2022" -> "Decree Law No. 18/2022"	2022-05-08 10:33:58 +08:00
swk0627	b2a3b96154	Modify a comment on usage in the script	2022-01-21 21:11:02 +09:00
Raphael Merx	75d4c672e8	Add tokenisation support for the Tetun language	2021-03-13 18:39:56 +08:00
Kenneth Heafield	78ca5f3cc5	Allow Arabic letters to begin a fa sentence	2020-08-03 21:51:09 +01:00
Cristina España i Bonet	8d78dae634	adding rules for Catalan special characters within words and contractions closer to French than to English	2020-07-31 15:22:47 +02:00
Barry Haddow	47915b561f	escape ampersands	2020-06-30 08:10:56 +01:00
Hieu Hoang	d90a8df862	Merge pull request #221 from HjalmarrSv/master Added some for sv	2020-06-01 17:19:36 -07:00
HjalmarrSv	da3768a296	Update nonbreaking_prefix.sv Added Å Ä Ö, which are not unusual initials in names, e.g. Åke, Ärling, Östen. Added some new, but mostly variations on the existing ones. Both a dot after each letter (or pair) and a dot only after last letter are accepted forms. A couple of decades ago, there had to be a space after the dot, which explains the third form. The file for sv is much more useful with these few additions. Although, It is still far from complete. Removed: G (occured twice). In this list there is one item that is also a word, even when case is kept: tom. If all words are in small case, then tex, mao, tom (again), may be confused with names, and iaf, etc with named entities.	2020-05-23 17:43:33 +02:00
Kenneth Heafield	89b9b4fba2	sentence splitter -k option to keep line boundaries	2020-03-19 15:44:41 +00:00
Kenneth Heafield	0a892749bc	Add Pashto ؟ as a sentence splitting character	2020-03-19 12:06:50 +00:00
William Waites	696a5d9833	flag to turn off sentence splitter from emitting <P>	2020-02-26 14:08:26 +00:00
Kenneth Heafield	22923ddcf0	Revert "line buffering for tokeniser and truecaser" This reverts commit `691717c425`.	2020-02-20 09:52:08 +00:00
William Waites	691717c425	line buffering for tokeniser and truecaser	2020-02-17 14:29:24 +00:00
alvations	d03df21e88	Proper spacing	2020-01-06 11:43:31 +08:00
HjalmarrSv	fa747062dc	Modernized I wanted to properly parse links on https://dumps.wikimedia.org/mirrors.html when page copied as text My proposed changes does the job. Basically I had to change by replacing the + at end of line 5 with *(\/)? The pipe symbol could lead to crashes why I broke up line 5 to three lines. I suggest not using the pipe (\|) after reading various posts.	2019-12-17 20:40:51 +01:00
Barry Haddow	a89691fee3	attempt to handle Korean better; only consider horizontal space in final split	2019-12-16 15:52:45 +00:00
Barry Haddow	2cff8ff6dd	split word on any type of space	2019-12-09 17:04:09 +00:00
alvations	f6d7adde15	Single quotes should be escaped as single quotes.	2019-11-25 10:10:40 +08:00
Barry Haddow	74d54b54c3	2 letter codes	2019-11-08 15:36:22 +00:00
Barry Haddow	1037070026	support for several Indic languages	2019-11-08 14:56:58 +00:00
Barry Haddow	b1163966b1	initial hi non-breaking prefixes	2019-11-05 16:59:40 +00:00
Barry Haddow	61b1d06570	list items	2019-11-05 16:52:50 +00:00
Barry Haddow	4da86c360f	rupees	2019-11-05 16:02:19 +00:00
Barry Haddow	56b2bad907	fix abbrev rule	2019-11-05 15:58:07 +00:00
Barry Haddow	3910cd6c46	devanagari fix	2019-10-31 21:28:43 +00:00
Barry Haddow	2affb9b624	reorganise indic support	2019-10-31 16:50:17 +00:00
Barry Haddow	d708e26b60	use block notation for indic scripts	2019-10-31 16:12:59 +00:00
Barry Haddow	0fef8ebf4c	fix nbp	2019-10-31 16:08:56 +00:00
Barry Haddow	b1d9fb6d75	full cjk test	2019-10-28 09:53:45 +00:00
Barry Haddow	8ebebbc680	Merge branch 'master' of github.com:moses-smt/mosesdecoder	2019-10-28 09:48:40 +00:00
Kevin Canwen Xu	5d3331b922	Update replace-unicode-punctuation.perl	2019-10-14 16:33:58 +08:00
alvations	555829a771	Undoing `0578892581` Causes abbreviations to not split when ending with a fullstop. E.g. > The restructuring of IBM was essential to enable it organisationally to take up the responsibilities entrusted in the role with the recent changes in the policy and legislations, revised charter of function of IBM and the new activities and initiatives undertaken by IBM. IBM is also engaged in handholding the States for auction of mineral blocks for greater transparency in allocation of mineral concessions.	2019-10-01 05:27:06 +08:00
Barry Haddow	486dce270f	debug	2019-09-30 16:58:21 +01:00
Barry Haddow	9bffde57ba	revert `05788925`	2019-09-30 16:53:06 +01:00
Barry Haddow	257d7e5e66	enable custom non breaking prefixes	2019-09-30 16:52:24 +01:00
Barry Haddow	01a8ec41e8	Merge branch 'master' of github.com:moses-smt/mosesdecoder	2019-09-30 15:33:33 +01:00
Barry Haddow	768944d851	do not add spaces in cjk	2019-09-30 15:33:26 +01:00
titsuki	490dc3996a	Enable use strict pragma	2019-09-23 15:40:13 +09:00
alvations	0578892581	The dot before an acronym should be optional.	2019-09-04 14:16:41 +08:00
Achim Ruopp	7ad5ffa0c0	Support for Urdu in sentence splitter	2019-07-10 10:48:32 -04:00
Matt Post	63c450b401	escape angle brackets The script doesn't escape angle brackets which can result in bad SGML / XML output. This fixes that, although ideally, this should be implemented with a proper parser and dumper.	2019-04-26 14:24:07 -04:00
Joel Barry	fdb7384d3d	Fix non-ASCII lowercasing	2019-02-27 10:17:29 -05:00
Hieu Hoang	26940e714a	Revert "use ucfirst instead of defined uppercase function" This reverts commit `dfbb17e549`.	2019-01-04 14:55:55 +00:00
Hieu Hoang	7bc56b66f2	Merge pull request #207 from alvations/patch-truecaser Reverting split_xml()	2019-01-03 16:56:56 +00:00
alvations	8fdbc74bbf	Reverting split_xml()	2019-01-03 20:51:27 +08:00
Hieu Hoang	db1894ad24	consistent output	2018-12-30 12:05:57 +00:00
alvations	dfbb17e549	use ucfirst instead of defined uppercase function	2018-12-20 11:57:48 +08:00
alvations	40748e528d	split_xml should be consistent for training and using	2018-12-20 11:53:02 +08:00

2352 Commits