utilities
count_occurence(data, word=None)
Count the occurrences of one, multiple, or all words in a list of strings (data).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data | List[str] | The list of strings. | required |
word | Union[List[str], str] | The word or list of words to count. If a list of words is provided, the function counts the co-occurrence of the words in each data string. If word is None, the function returns the count of all words in the data. | None |
Returns:
Name | Type | Description |
---|---|---|
count | int | The count of occurrences. |
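The three modes described above (single word, list of words, all words) can be sketched as follows. This is not the code in semantics/utils/utils.py: the name `count_occurrence_sketch`, the whitespace tokenization, the reading of co-occurrence as "number of strings that contain every listed word", and the reading of "count of all words" as the total token count are all assumptions made for illustration.

```python
from typing import List, Optional, Union


def count_occurrence_sketch(
    data: List[str], word: Optional[Union[List[str], str]] = None
) -> int:
    """Hypothetical re-implementation, for illustration only."""
    # Assumption: words are separated by whitespace.
    tokenized = [line.split() for line in data]
    if word is None:
        # Assumption: "count of all words" means the total number of tokens.
        return sum(len(tokens) for tokens in tokenized)
    if isinstance(word, str):
        # Single word: count every occurrence across all strings.
        return sum(tokens.count(word) for tokens in tokenized)
    # List of words (assumption): count the strings containing every word.
    return sum(1 for tokens in tokenized if all(w in tokens for w in word))


data = ["the cat sat", "the dog sat", "a cat and a dog"]
print(count_occurrence_sketch(data, "cat"))           # 2
print(count_occurrence_sketch(data, ["cat", "dog"]))  # 1
print(count_occurrence_sketch(data))                  # 11
```

The actual function may tokenize or define co-occurrence differently, so check the source before relying on exact counts.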
Source code in semantics/utils/utils.py
intersection_align_gensim(m1, m2, words=None)
Intersect two gensim word2vec models, m1 and m2.
Only the shared vocabulary between them is kept.
If 'words' is set (as list or set), then the vocabulary is intersected with this list as well.
Indices are re-organized from 0..N in order of descending frequency (=sum of counts from both m1 and m2).
These indices correspond to the new syn0 and syn0norm objects in both gensim models:
-- so that Row 0 of m1.syn0 will be for the same word as Row 0 of m2.syn0
-- you can find the index of any word on the .index2word list: model.index2word.index(word) => 2
The .vocab dictionary is also updated for each model, preserving the count but updating the index.
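The intersect-and-reindex idea can be illustrated without gensim. The sketch below works on plain word lists, count lists, and NumPy matrices instead of model objects. The helper name `intersect_and_reorder` and its arguments are hypothetical, and the alphabetical tie-break is an assumption added to make the ordering deterministic.

```python
import numpy as np


def intersect_and_reorder(words1, counts1, vecs1, words2, counts2, vecs2, keep=None):
    """Hypothetical helper: align two (words, counts, matrix) triples."""
    idx1 = {w: i for i, w in enumerate(words1)}
    idx2 = {w: i for i, w in enumerate(words2)}
    # Keep only the vocabulary shared by both sides.
    shared = set(idx1) & set(idx2)
    if keep is not None:
        # Optionally restrict further to a caller-supplied word list.
        shared &= set(keep)
    # Descending combined frequency; ties broken alphabetically (assumption).
    ordered = sorted(
        shared, key=lambda w: (-(counts1[idx1[w]] + counts2[idx2[w]]), w)
    )
    # Row i of both output matrices now belongs to ordered[i].
    new1 = vecs1[[idx1[w] for w in ordered]]
    new2 = vecs2[[idx2[w] for w in ordered]]
    return ordered, new1, new2


words1, counts1 = ["a", "b", "c"], [5, 3, 1]
words2, counts2 = ["c", "a", "d"], [10, 2, 7]
vecs1 = np.arange(6, dtype=float).reshape(3, 2)
vecs2 = np.arange(6, dtype=float).reshape(3, 2) * 10

ordered, new1, new2 = intersect_and_reorder(
    words1, counts1, vecs1, words2, counts2, vecs2
)
print(ordered)  # ['c', 'a']
```

Here "c" comes first because its combined count (1 + 10) exceeds that of "a" (5 + 2).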
most_frequent(my_list, n=1)
Return the n most frequent words in a list.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
my_list | List[str] | The list of words. | required |
n | int | The number of most frequent words to return. | 1 |
Returns:
Type | Description |
---|---|
List[str] | The n most frequent words. |
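A minimal sketch of the same behaviour using `collections.Counter` from the standard library. The name `most_frequent_sketch` is hypothetical, and the library's own implementation may order ties differently.

```python
from collections import Counter
from typing import List


def most_frequent_sketch(my_list: List[str], n: int = 1) -> List[str]:
    """Hypothetical stand-in: return the n most common items, most frequent first."""
    return [word for word, _ in Counter(my_list).most_common(n)]


words = ["b", "a", "b", "c", "a", "b"]
print(most_frequent_sketch(words))     # ['b']
print(most_frequent_sketch(words, 2))  # ['b', 'a']
```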
read_toml(config_path)
Read in a config file and return a dictionary.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
config_path | str | The path to the config file. | required |
Returns:
Name | Type | Description |
---|---|---|
dict | dict | The dictionary. |
read_txt(file_path)
Read in a txt file and return a list of lines.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
file_path | str | The path to the txt file. | required |
Returns:
Name | Type | Description |
---|---|---|
list | list | A list of lines. |
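One way to read a text file into a list of lines is shown below. The name `read_txt_sketch`, the UTF-8 encoding, and the stripping of trailing newline characters are assumptions. The documented function may keep the newlines.

```python
import tempfile
from pathlib import Path


def read_txt_sketch(file_path: str) -> list:
    """Hypothetical stand-in: return the lines of a text file."""
    # Assumption: UTF-8 input, line endings removed from each entry.
    with open(file_path, encoding="utf-8") as f:
        return f.read().splitlines()


# Write a small temporary file and read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "corpus.txt"
    path.write_text("first line\nsecond line\n", encoding="utf-8")
    lines = read_txt_sketch(str(path))

print(lines)  # ['first line', 'second line']
```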
read_yaml(file_path)
Read in a yaml file and return a dictionary.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
file_path | str | The path to the yaml file. | required |
Returns:
Name | Type | Description |
---|---|---|
dict | dict | The dictionary. |
sample_data(data, sample_size, random_seed=None)
Draw a random sample of sample_size items from data.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data | list | The data to sample. | required |
sample_size | int | The size of the sample. | required |
random_seed | int | The random seed. | None |
Returns:
Name | Type | Description |
---|---|---|
sample_data | list | The sampled data. |
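Seeded sampling can be sketched with the standard-library `random` module. The name `sample_data_sketch` is hypothetical, and sampling without replacement is an assumption, since the documentation does not say which behaviour the function uses.

```python
import random
from typing import Optional


def sample_data_sketch(
    data: list, sample_size: int, random_seed: Optional[int] = None
) -> list:
    """Hypothetical stand-in: draw sample_size items from data."""
    # A local generator keeps the global random state untouched.
    rng = random.Random(random_seed)
    # Assumption: items are drawn without replacement.
    return rng.sample(data, sample_size)


data = [f"doc {i}" for i in range(20)]
first = sample_data_sketch(data, 5, random_seed=7)
second = sample_data_sketch(data, 5, random_seed=7)
print(first == second)  # True: the same seed gives the same sample
```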
smart_procrustes_align_gensim(base_embed, other_embed, words=None)
Original script: https://gist.github.com/quadrismegistus/09a93e219a6ffc4f216fb85235535faf
Procrustes align two gensim word2vec models (to allow for comparison of the same word across models).
Code ported from HistWords (https://github.com/williamleif/histwords) by William Hamilton (wleif@stanford.edu).
First, intersect the vocabularies (see intersection_align_gensim documentation).
Then do the alignment on the other_embed model.
Replace the other_embed model's syn0 and syn0norm numpy matrices with the aligned version.
Return other_embed.
If words is set, intersect the two models' vocabulary with the vocabulary in words (see intersection_align_gensim documentation).
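The alignment step itself is an orthogonal Procrustes problem, which has a closed-form solution via the singular value decomposition. The sketch below shows that step on plain NumPy arrays rather than gensim models. The function name `procrustes_rotation` and the synthetic data are assumptions. It presumes the rows of the two matrices already refer to the same words, which is what the vocabulary intersection provides.

```python
import numpy as np


def procrustes_rotation(base_vecs: np.ndarray, other_vecs: np.ndarray) -> np.ndarray:
    """Hypothetical helper: orthogonal matrix R minimising ||other_vecs @ R - base_vecs||."""
    # Cross-covariance between the two embedding spaces.
    m = other_vecs.T @ base_vecs
    u, _, vt = np.linalg.svd(m)
    # The product of the singular-vector matrices is the optimal rotation.
    return u @ vt


rng = np.random.default_rng(0)
base = rng.normal(size=(50, 4))

# Build "other" as an exactly rotated copy of "base" (synthetic test data).
q, _ = np.linalg.qr(rng.normal(size=(4, 4)))
other = base @ q

rotation = procrustes_rotation(base, other)
aligned = other @ rotation
print(np.allclose(aligned, base))  # True: the rotation is recovered
```

With real embeddings the two spaces are not exact rotations of each other, so the aligned matrix only approximates the base matrix.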
train_test_split(data, test_ratio=0.2, random_seed=None)
Split the data into train and test sets.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data | List[str] | The data to split. | required |
test_ratio | float | The ratio of the test set. | 0.2 |
random_seed | int | The random seed. | None |
Returns:
Type | Description |
---|---|
|
|
|
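Since the return values are not documented here, the following is only a sketch of a typical shuffle-and-slice split. The name `train_test_split_sketch`, the `(train, test)` return order, and rounding the test size down are all assumptions.

```python
import random
from typing import List, Optional, Tuple


def train_test_split_sketch(
    data: List[str], test_ratio: float = 0.2, random_seed: Optional[int] = None
) -> Tuple[List[str], List[str]]:
    """Hypothetical stand-in: shuffle a copy of data and slice it in two."""
    rng = random.Random(random_seed)
    shuffled = list(data)  # copy so the caller's list is not reordered
    rng.shuffle(shuffled)
    # Assumption: the test size is rounded down.
    n_test = int(len(shuffled) * test_ratio)
    # Assumption: the train set is returned first, then the test set.
    return shuffled[n_test:], shuffled[:n_test]


data = [f"sentence {i}" for i in range(10)]
train, test = train_test_split_sketch(data, test_ratio=0.2, random_seed=42)
print(len(train), len(test))  # 8 2
```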