MTF dataset and packing #293
Still need to cast them to torch tensors, or use torch tensors directly instead of numpy arrays.
Doing it in the next one.
I haven't thought this through very clearly, but essentially this means that the entire dataset can be randomly accessed, which isn't possible if you pack things greedily (the problem: given an index, can you tell which original samples need to go into that batch?). So I'm refactoring this to put it in another dataset. Sorry about this @Muennighoff, you were right.
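To illustrate the indexing problem, here is a minimal sketch and not the code in this PR. With greedy packing, the samples that land in packed row `k` depend on the lengths of every earlier sample, so the mapping has to be computed in one pass up front. `build_pack_index`, `lengths`, and `seq_len` are hypothetical names chosen for the example.

```python
from typing import List, Tuple


def build_pack_index(lengths: List[int], seq_len: int) -> List[Tuple[int, int]]:
    """Greedily group consecutive samples into rows of at most seq_len tokens.

    Returns one (start, end) pair per packed row, meaning samples[start:end].
    """
    rows: List[Tuple[int, int]] = []
    start, used = 0, 0
    for i, n in enumerate(lengths):
        if n > seq_len:
            raise ValueError(f"sample {i} has {n} tokens, more than seq_len={seq_len}")
        if used + n > seq_len:
            # The current row is full: close it and start a new one at sample i.
            rows.append((start, i))
            start, used = i, 0
        used += n
    if used > 0:
        # Close the last, possibly partially filled, row.
        rows.append((start, len(lengths)))
    return rows


# Hypothetical token counts for six samples.
lengths = [5, 3, 4, 8, 2, 2]
rows = build_pack_index(lengths, seq_len=8)
print(rows)
```

Once `rows` exists, looking up packed row `k` is a plain list access, which is why keeping the packing logic in its own dataset with a precomputed index is one way to recover random access.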
I think this is a good opportunity to add the EOS token that we didn't have during pretraining.
Not sure whether we should also have a token between input and target.
WDYT @TevenLeScao @Muennighoff @lintangsutawika
Good idea!
I think we should add an EOS token only after each target
Yeah agree on only after target.
Agreed
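A minimal sketch of what the thread converges on, assuming samples arrive as already-tokenized `(input, target)` pairs: append one EOS token after each target and nothing between input and target. `EOS_ID`, `pack_with_eos`, and the choice to mark the EOS position as part of the target are assumptions made for illustration, not taken from the PR.

```python
from typing import List, Tuple

EOS_ID = 2  # hypothetical id; the real value comes from the tokenizer


def pack_with_eos(
    pairs: List[Tuple[List[int], List[int]]], eos_id: int = EOS_ID
) -> Tuple[List[int], List[bool]]:
    """Concatenate (input, target) pairs, adding EOS only after each target.

    Returns the token stream and a mask that is True on target positions.
    The EOS is marked as target here, which is an assumption.
    """
    tokens: List[int] = []
    is_target: List[bool] = []
    for inp, tgt in pairs:
        tokens.extend(inp)
        is_target.extend([False] * len(inp))
        tokens.extend(tgt)
        tokens.append(eos_id)
        is_target.extend([True] * (len(tgt) + 1))
    return tokens, is_target


# Two hypothetical samples with made-up token ids.
pairs = [([10, 11], [20]), ([12], [21, 22])]
tokens, is_target = pack_with_eos(pairs)
print(tokens)
print(is_target)
```

Marking the EOS as a target position would let the model learn when to stop generating; whether it should count toward the loss is a separate decision that this sketch does not settle.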