Dask: df.col.unique() vs df.col.drop_duplicates()


In Dask, what is the difference between

df.col.unique() 

and

df.col.drop_duplicates() 

Both return a series containing the unique elements of df.col. The difference is in the index: the unique result is indexed 1..n, while the drop_duplicates result is indexed by an arbitrary-looking sequence of numbers.

What is the significance of the index returned by drop_duplicates?

Is there any reason to use one over the other if the index is not important?

dask.dataframe has both because pandas has both, and dask.dataframe copies the pandas API. unique is a holdover from pandas' history with NumPy.

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({'x': [1, 2, 1], 'y': [1., 2., 3.]}, index=pd.Index(['a', 'b', 'a'], name='i'))

In [3]: df.x.drop_duplicates()
Out[3]:
i
a    1
b    2
Name: x, dtype: int64

In [4]: df.x.unique()
Out[4]: array([1, 2])
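On the question of what the drop_duplicates index means: in pandas it is the original index label of each value that was kept. The sketch below uses made-up data and assumes the default keep='first' behaviour, so each value keeps the label of its first occurrence. That would explain why the labels look arbitrary.

```python
import pandas as pd

# Illustrative data: the index labels are deliberately non-contiguous.
s = pd.Series([10, 20, 10, 30, 20], index=[5, 8, 13, 21, 34], name="x")

# drop_duplicates keeps the first occurrence of each value (keep='first' by
# default), together with the index label that row had in the original data.
deduped = s.drop_duplicates()

# unique returns a plain NumPy array in pandas, so there is no index at all.
uniques = s.unique()

print(deduped.index.tolist())  # [5, 8, 21]
print(uniques.tolist())        # [10, 20, 30]
```

If the index is not needed, calling reset_index(drop=True) on the drop_duplicates result gives a plain 0..n-1 index.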

In dask.dataframe we deviate and choose to return a dask.dataframe.Series rather than a dask.array.Array, because one can't precompute the length of the array and so can't act lazily.

In practice there is little reason to use unique over drop_duplicates.

