Enrich translated dataset #3
Comments
For me this sounds a bit like we need a real database with a GUI and dataset versioning.
Like this: https://github.com/bigscience-workshop/promptsource 🤔
Ok, here's my favorite one: https://twitter.com/dvilasuero/status/1641164559888142336 (powered by @dvsrepo) 🤗
We use Argilla now: #6. It also allows us to add metadata (translation model and original id) as well as sentence embeddings. Argilla itself allows us to label/flag a certain example into several categories, which can be seen as more sophisticated than just a
Have you seen https://github.com/thisserand/alpaca-lora-finetune-language?
Hi,
it would be a great improvement if the translated dataset could be enriched with more data or fields:
- The original texts (`instruction`, `input` and `output`) can be included to allow a better comparison of original and translated data.
- A `review_needed` field should be added. Problematic or wrong examples can be detected (automatically or manually) and then flagged.

On Slack we discussed markdown tables. One could easily write a markdown table detection script and flag the examples it finds with the `review_needed` option, so that they can be reviewed later.

Another issue to be discussed: do we want to "override" the existing translated_german_alpaca.json, or should we introduce a new file for that? But would more than one "dataset" be confusing?
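
The markdown table detection mentioned above could be sketched roughly as follows. This is only a heuristic under stated assumptions: the regular expression, the function names `contains_markdown_table` and `flag_markdown_tables`, and the choice to scan all three text fields are illustrative and not part of the existing code base.

```python
import re

# Heuristic (assumption): a markdown table has a separator row such as
# "| --- | :---: |" or "---|---" directly below a header row containing "|".
# This is not a full markdown parser and misses single-column tables.
SEPARATOR_ROW = re.compile(r"^\s*\|?\s*:?-{3,}:?\s*(\|\s*:?-{3,}:?\s*)+\|?\s*$")


def contains_markdown_table(text: str) -> bool:
    """Return True if a header row is directly followed by a separator row."""
    lines = text.splitlines()
    for header, separator in zip(lines, lines[1:]):
        if "|" in header and SEPARATOR_ROW.match(separator):
            return True
    return False


def flag_markdown_tables(examples: list[dict]) -> int:
    """Set review_needed=True on examples whose text fields contain a table.

    Examples without a table get review_needed=False unless already set.
    Returns the number of newly flagged examples.
    """
    flagged = 0
    for example in examples:
        fields = (str(example.get(key) or "") for key in ("instruction", "input", "output"))
        if any(contains_markdown_table(field) for field in fields):
            if not example.get("review_needed", False):
                flagged += 1
            example["review_needed"] = True
        else:
            example.setdefault("review_needed", False)
    return flagged
```

Flagging in place keeps the script usable regardless of whether the result is written back to the existing file or to a new one.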
Concrete implementation
Concrete implementation steps would be to introduce the following new keys for each example in the dataset:
- `instruction`
- `input`
- `output`
- `review_needed` (Boolean, default: `false`)
- `translations` with `instruction`, `input` and `output` as keys

Proof of concept
One example entry of that enriched dataset could look like:
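
A minimal sketch of building such an entry, assuming the top-level `instruction`, `input` and `output` keys hold the original texts and `translations` holds the translated ones. That reading, the helper name `enrich_example`, and the placeholder strings are assumptions for illustration; none of the texts are taken from the dataset.

```python
import json

TEXT_KEYS = ("instruction", "input", "output")


def enrich_example(original: dict, translated: dict, review_needed: bool = False) -> dict:
    """Combine an original example and its translation into one enriched entry."""
    return {
        # Assumption: top-level keys carry the original (untranslated) texts.
        **{key: original.get(key, "") for key in TEXT_KEYS},
        "review_needed": review_needed,
        "translations": {key: translated.get(key, "") for key in TEXT_KEYS},
    }


# Placeholder texts for illustration only.
original = {"instruction": "<original instruction>", "input": "", "output": "<original output>"}
translated = {"instruction": "<translated instruction>", "input": "", "output": "<translated output>"}

entry = enrich_example(original, translated)
print(json.dumps(entry, indent=2, ensure_ascii=False))
```

Passing `ensure_ascii=False` keeps umlauts and other non-ASCII characters readable in the serialized file.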